DIY Workshop¶
These materials were originally created for in-person workshops, and have been modified and updated to create a “Do It Yourself” workshop that you should be able to work through on your own. If you run into problems please send email to feedback@isb-cgc.org.
Step #1: Setting up Your Local Environment¶
Your Google Identity¶
You may already have a Google identity – your institutional email may be a Google identity (if your institution uses Google Apps), or you may have a personal Gmail address. One way to check whether your email address is a Google-managed identity is to go to the password assistance page, select “I don’t know my password” and enter your email address. If you get a response like “Please contact your domain IT administrator” then your email address is not a Google identity.
If you don’t have a Google identity, it only takes a minute to create one.
Installing the Google Cloud SDK¶
The Google Cloud SDK is an essential toolbox for anyone working with the Google Cloud Platform. The Cloud SDK is easy to install and runs on Linux, Mac OS X, and Windows. It includes all of the command line tools, local emulators, and libraries that you will need. There are three key command line interfaces (CLIs) that you’ll want to become comfortable using:
- gcloud enables seamless local authentication and powerful command line access to many cloud resources
- gsutil lets you access Google Cloud Storage (GCS) from the command line
- bq provides access to BigQuery from the command line
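After installing the SDK, you may want to confirm that all three tools are reachable from your shell. The following is a minimal sketch using only the Python standard library; the function name find_cloud_sdk_tools is our own and is not part of the Cloud SDK.

```python
import shutil

# The three command line tools named above, as installed by the Cloud SDK.
CLOUD_SDK_TOOLS = ("gcloud", "gsutil", "bq")


def find_cloud_sdk_tools(tools=CLOUD_SDK_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_cloud_sdk_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If any tool is reported as not found, the most likely cause is that the SDK's bin directory has not been added to your PATH (opening a new terminal after installation often fixes this).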
Once you have the Cloud SDK installed, you can find out what your current/default Project ID is by running gcloud config list from the command line. To initialize your default configuration, run gcloud init (https://cloud.google.com/sdk/gcloud/reference/init) and follow the instructions.
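If you would like to read your default Project ID from a script rather than by eye, one possible approach is sketched below. It assumes that gcloud config list prints its properties to standard output in an INI-style layout (a [core] section with account and project entries); the helper names parse_gcloud_config and current_project are our own.

```python
import configparser
import shutil
import subprocess


def parse_gcloud_config(text):
    """Parse INI-style text (as assumed from `gcloud config list`) into nested dicts."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def current_project():
    """Return the default Project ID, or None if gcloud is missing or no project is set."""
    gcloud = shutil.which("gcloud")
    if gcloud is None:
        return None
    # Only stdout is parsed; informational messages on stderr are ignored.
    result = subprocess.run(
        [gcloud, "config", "list"],
        capture_output=True,
        text=True,
        check=False,
    )
    config = parse_gcloud_config(result.stdout)
    return config.get("core", {}).get("project")


if __name__ == "__main__":
    print("Current project:", current_project())
```

If current_project() returns None even though gcloud is installed, running gcloud init is the usual way to set a default project.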
Updates to the SDK are published every week or two, so you will frequently see a message that says:
Updates are available for some Cloud SDK components. To install them, please run: $ gcloud components update.
When you see this message, simply run gcloud components update at your convenience, and follow the
instructions.
Installing Chrome¶
If you do not already use the Chrome browser, we strongly suggest that you install Google Chrome on your laptop or desktop. Although the ISB-CGC web-app should work on any modern browser, it is optimized for the Chrome browser.
Installing R and RStudio¶
If you want to be able to run R scripts locally, you will want to install R as well as the interactive environment RStudio. You can follow these tips to get started.
Step #2: Setting up Your Google Cloud Platform (GCP) Project¶
Creating / Obtaining your GCP Project¶
In order to make use of all of the data, tools, and functionality described in this workshop, you will also need your own GCP project.
We’d like to encourage you to take advantage of the free trial offered by Google. If you have already used this one-time offer (or there is some other reason you cannot use it) please see the information here about requesting an ISB-CGC provided (and funded) project. (We’ll also be happy to do that for you after you use the $300 Google credit / free trial.)
Google Cloud Platform Console¶
The Google Cloud Platform Console (which we will refer to from now on simply as the Console) is your web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, etc. No setup or installation is required.
- sign into your Chrome (or other) browser using your Google identity (the one associated with the GCP project that you created yourself or that we set up for you)
- go to the Google Cloud Platform Console
- you should automatically be signed in to your own GCP project;
- in the top blue bar, towards the right, you may be able to select between two or more projects;
- in the GCP Console, if you click on Home you will see your current Project ID on the Dashboard
- this Quick Tour of the Google Cloud Console will help you learn the basics that you are most likely to need
NOTE: If you’re just getting started working in the Google Cloud, you will probably only have one project. Over time, however, you may find that it is useful to create additional projects for any of a variety of reasons. You may have different grants or contracts that need to be charged for specific research activities, or you may have different groups of collaborators that you are working with, or you may be working with different sets of controlled-access data. All of these are good reasons to set up multiple, separate GCP projects. When you do so, however, you will need to learn to pay attention to which project is your “current” project. Any costs that you incur will always be charged to your current project. The types of actions that incur costs include uploading data to a storage bucket, spinning up a VM, running a BigQuery query, etc.
- If you are using the Console, you will see the Project Name in the blue bar at the top of the page, and the browser url should look like: https://console.cloud.google.com/home/dashboard?project=<project-id>.
- At the command-line, you can use the gcloud tool to verify your current configuration (as described above).
- Finally, if you are using the BigQuery Web UI, the url should look like this: https://bigquery.cloud.google.com/project/<project-id> or https://bigquery.cloud.google.com/queries/<project-id>.
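Because the Console url carries the Project ID as a query parameter, you can check which project a url points to without opening it. The sketch below follows the Console url pattern shown above; the function names are hypothetical, and my-project-id is a placeholder rather than a real project.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Dashboard address as given in the list above.
CONSOLE_DASHBOARD = "https://console.cloud.google.com/home/dashboard"


def console_dashboard_url(project_id):
    """Build the Console dashboard url for a given Project ID."""
    return f"{CONSOLE_DASHBOARD}?{urlencode({'project': project_id})}"


def project_from_console_url(url):
    """Return the Project ID from a Console url, or None if none is present."""
    values = parse_qs(urlparse(url).query).get("project")
    return values[0] if values else None


if __name__ == "__main__":
    url = console_dashboard_url("my-project-id")  # placeholder Project ID
    print(url)
    print(project_from_console_url(url))
```

A check like this can be handy when you keep bookmarks for several projects and want to be sure which one will be billed before you start work.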
Enabling Required Google APIs¶
To make use of all of the functionality described in these tutorials (including running the example code available on github), you will need to have certain APIs enabled for your GCP project. Specifically, you will need the following to be enabled (some may already be enabled by default):
- Google Compute Engine
- Google Genomics
- Google BigQuery
- Google Cloud Logging
- Google Cloud Pub/Sub
This tutorial will walk you through the steps involved in enabling new APIs for your project.
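As an alternative to clicking through the Console, APIs can also be enabled from the command line with gcloud services enable. The sketch below only builds and prints that command so that you can review it before running it yourself. The service identifiers in the table are our assumptions for the APIs listed above; some may have been renamed or retired since these materials were written, so please check them against gcloud services list --available for your project.

```python
import shlex

# Assumed mapping from the API names listed above to gcloud service identifiers.
# Verify each one for your project before relying on it.
REQUIRED_SERVICES = {
    "Google Compute Engine": "compute.googleapis.com",
    "Google Genomics": "genomics.googleapis.com",
    "Google BigQuery": "bigquery.googleapis.com",
    "Google Cloud Logging": "logging.googleapis.com",
    "Google Cloud Pub/Sub": "pubsub.googleapis.com",
}


def enable_services_command(project_id, services=None):
    """Return the argument list for `gcloud services enable` (nothing is executed)."""
    names = list(REQUIRED_SERVICES.values()) if services is None else list(services)
    return ["gcloud", "services", "enable", *names, "--project", project_id]


if __name__ == "__main__":
    # "my-project-id" is a placeholder; substitute your own Project ID.
    print(shlex.join(enable_services_command("my-project-id")))
```

Printing the command instead of executing it keeps the example safe to run anywhere, and passing --project explicitly avoids enabling APIs on whichever project happens to be your current default.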
Additional Quickstart Tutorials¶
ISB Cancer Genomics Cloud (ISB-CGC)¶
- Introductions, Overview etc
  - Introduction to the ISB-CGC Platform
  - A Quick Tour of the Google Cloud Console
  - Copy/Paste Cheat Sheet (you might find this useful later on in the day)
- ISB-CGC Web App & API Endpoints
  - Web-App Tutorial (walkthrough) (doc)
  - API Endpoints demo (doc)
- ISB-CGC Open-Access BigQuery Tables
  - Overview of TCGA data (doc)
  - BigQuery SQL Tutorial
  - Analysis using R (github)
- Computing in the Cloud
  - Useful References: Cloud SDK cheat sheet
  - Introduction to GCE (Google Compute Engine) (slides)
  - Google Genomics “Pipelines” Service (slides)
  - ISB-CGC Pipelines Framework (slides, github)
Other Topics¶
DREAM Challenge: Somatic Mutation Challenge – RNA¶
- DREAM challenges are powered by Sage Bionetworks
- Presentation
- Somatic Mutation Calling Challenge: RNA – Registration is now open!
Google Genomics¶
- Overview
- github repositories
- Google Genomics Cookbook with sections on:
  - finding published data sources
  - data-processing on the Google Cloud
  - data-analysis on the Google Cloud
  - accessing data using IGV, BioConductor, R, Python and more!