********************** BigQuery SQL Tutorial ********************** Exploring the TCGA data in BigQuery ----------------------------------- The ISB-CGC team has aggregated and curated the TCGA open-access clinical, biospecimen, and Level-3 molecular data and uploaded it into BigQuery tables that are open to the public. Additional tables have been added to open up new analysis options. In this tutorial, we will show you how you can begin to work with these tables from the Google BigQuery Web UI. **Note** that in order to use BigQuery, you *must* have access to (*ie* be a member of) a GCP project. Helpful BigQuery links ---------------------- For this example, we'll be working in the `Google BigQuery Web UI `_. We've tried to simplify what you need to know to get started using the ISB-CGC BigQuery tables in this quick `visual walkthrough `_. It's often helpful to have a `link to the docs `_ handy, and especially the `query reference `_. (You'll probably want to open those into new tabs of your browser, for easy access.) Let's query! ------------ In your browser, go to the `BigQuery Web UI `_ On the left side, from top to bottom we have: 1. **Compose Query** This button opens the **New Query** text box, where we can write queries. 2. **Query History** A list of your past queries... *very useful*. 3. **Job History** A list of past jobs (*eg* copying or creating tables). 4. **Your Project Datasets** Click the little blue triangle to create a new dataset or change projects. 5. **isb-cgc** Publicly accessible ISB-CGC curated datasets (including TCGA and reference data sources). 6. **More data!** Other added datasets will appear here (for example, the **genomics-public-data**, *etc*). **Note**: if you do not see the **isb-cgc** datasets, you need to add them to your "view" by clicking on the blue arrow next to your project name at the top of the left side-bar, select "Switch to Project", then "Display Project...", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. You can repeat this process for other public datasets. Querying: Lists, Joins, and Subqueries -------------------------------------- BigQueries are very similar to regular SQL, but with some differences. (Note: you can now `enable standard SQL `_ in BigQuery.) Typically, we select some variables (aka "fields") from one or more tables, filter on some criteria, and occasionally aggregate the results (such as taking an average). In this first simple example, we are asking for the barcodes for all casess in the CESC and HNSC diseases, with an associated "primary solid tumor" sample. Note the use of the **IN** keyword. .. code-block:: sql SELECT project_short_name, case_barcode, Sample_Type_name FROM [isb-cgc:TCGA_bioclin_v0.Biospecimen] WHERE project_short_name IN ('TCGA-CESC', 'TCGA-HNSC') AND Sample_Type_name = 'Primary solid Tumor' Go ahead and cut and paste the above query directly into the New Query box, and then click the red **Run Query** button. Next, let's suppose we want to bring in some information that is available in the Clinical_data table. To do this we need to JOIN the clinical and biospecimen tables using the SQL **... JOIN ... ON ...** construct. .. code-block:: sql SELECT b.case_barcode, a.Sample_Barcode, a.project_short_name, a.Sample_Type_name, a.avg_percent_tumor_cells, b.hpv_status FROM ( SELECT case_barcode, Sample_Barcode, project_short_name, Sample_Type_name, avg_percent_tumor_cells FROM [isb-cgc:TCGA_bioclin_v0.Biospecimen] WHERE project_short_name IN ('TCGA-CESC', 'TCGA-HNSC') AND Sample_Type_name='Primary solid Tumor' ) AS a JOIN ( SELECT case_barcode, hpv_status FROM [isb-cgc:TCGA_bioclin_v0.Clinical] ) AS b ON a.case_barcode = b.case_barcode GROUP BY b.case_barcode, a.Sample_Barcode, a.project_short_name, a.Sample_Type_name, a.avg_percent_tumor_cells, b.hpv_status If you're really paying attention, you might notice that the first query returned 836 case barcodes from the Biospecimen_data table, but the second one returned only 835 participant and sample barcodes. In a few cases, the Biospecimen_data table contains information about samples that have no associated information in the Clinical_data table, and the "JOIN" operation is by default an *INNER* JOIN which returns only the *intersection* of the two tables being joined. Another way to work with multiple tables is by using subqueries. In the example below, we have an *inner* query (the middle seven lines set off by blank space) which creates a "cohort" on the fly, filtering by study and HPV status from the Clinical_data table. We then use that sub-table to filter the Biospecimen_data table, where we compute the average of the percent tumor cells, also counting how many rows went into each average, grouped according to SampleType, and then finally we sort by n. .. code-block:: sql SELECT project_short_name, Sample_Type_name, AVG(avg_percent_tumor_cells) AS avgPctTumor, COUNT(*) AS n FROM [isb-cgc:TCGA_bioclin_v0.Biospecimen] WHERE case_barcode IN ( SELECT case_barcode FROM [isb-cgc:TCGA_bioclin_v0.Clinical] WHERE hpv_status = 'Positive' AND project_short_name IN ('TCGA-CESC', 'TCGA-HNSC') ) GROUP BY project_short_name, Sample_Type_name ORDER BY n DESC Computing Statistics --------------------------- A beneficial goal is to keep as much computation on the BigQuery side as possible. That means we want to aggregate and compute functions that return summary data. In this query, we're going to look at some summary statistics in the clinical table. .. code-block:: sql SELECT case_barcode, project_short_name, gender, country, number_pack_years_smoked, (number_pack_years_smoked - mu) / sd AS z FROM [isb-cgc:TCGA_bioclin_v0.Clinical] AS a JOIN ( SELECT vital_status, AVG(number_pack_years_smoked) AS mu, STDDEV(number_pack_years_smoked) AS sd FROM [isb-cgc:TCGA_bioclin_v0.Clinical] WHERE vital_status = 'Alive' GROUP BY vital_status ) AS b ON a.vital_status = b.vital_status ORDER BY z DESC The results from this query are ordered by Z score After running a query, there are several options in the bottom **Results** panel. You can get an "Explanation" showing how the query was broken into multiple Stages, the number of input and outputs from each stage, and the amount of time spent reading, computing, *etc*. In addition, you can Download or Save the Results in various ways, including as a new BigQuery Table. If your query will return a large number of results, you may need to click the **Show Options** button to the right of the **Run Query** button and specific a "Destination Table" and then turn on the "Allow Large Results" option. Making Summary Tables --------------------- Another way to create summary information is by creating tables of counts as shown below. With summary tables, we can even compute statistics like a ChiSq. .. code-block:: sql SELECT table_cell, COUNT(*) AS n FROM ( SELECT ( CASE WHEN gender = 'MALE' AND hpv_status = 'Positive' THEN 'Male_and_HPV_Pos' WHEN gender = 'MALE' AND hpv_status = 'Negative' THEN 'Male_and_HPV_Neg' WHEN gender = 'FEMALE' AND hpv_status = 'Positive' THEN 'Female_and_HPV_Pos' WHEN gender = 'FEMALE' AND hpv_status = 'Negative' THEN 'Female_and_HPV_Neg' ELSE 'None' END ) AS table_cell, FROM [isb-cgc:TCGA_bioclin_v0.Clinical] WHERE project_short_name IN ('TCGA-CESC', 'HNSC') HAVING table_cell <> 'None' ) GROUP BY table_cell ORDER BY n DESC LiftOver from hg19 to hg38 ========================== Suppose you want to work with the newer hg38 reference. We can use BigQuery to perform the liftOver operation on the methylation probe coordinates using a simple JOIN query. (This query takes approx 25s and produces an output table with one row for each of the input rows in the input annotation table.) .. code-block:: sql SELECT a.probeID AS Illumina_probeID, a.hg19_chr AS hg19_chr, a.hg19_pos AS hg19_pos, b.hg38_chr AS hg38_chr, b.hg38_pos AS hg38_pos FROM ( SELECT IlmnID AS probeID, CHR AS hg19_chr, MAPINFO AS hg19_pos FROM [isb-cgc:platform_reference.methylation_annotation] ) a LEFT OUTER JOIN EACH ( SELECT LTRIM(hg19_ref,"chr") AS hg19_chr, hg19_pos, LTRIM(hg38_ref,"chr") AS hg38_chr, hg38_pos FROM [isb-cgc:genome_reference.liftOver_hg19_to_hg38] ) b ON a.hg19_chr=b.hg19_chr AND a.hg19_pos=b.hg19_pos