****************************
Analysis with R and BigQuery
****************************

Computing on the BigQuery side; making correlation matrices
###########################################################

In this example, we are going to compute a correlation matrix (or co-expression)
entirely on the BigQuery side. Since we're in R-land, we can easily visualize
the matrix as a heatmap. We are also going to query for gene lists using
Gene Ontology annotation, and access the End Points API to get a list of samples.


Loading libraries
=================

.. code-block:: r

    require(bigrquery, quietly = TRUE) || install.packages('bigrquery',verbose = FALSE)
    require(httpuv, quietly = TRUE) || install.packages('httpuv',verbose=FALSE)
    require(ggplot2, quietly = TRUE) || install.packages('ggplot2',verbose = FALSE)

The ISB-CGC R Examples Package
==============================

The collection of examples can be found at:
https://github.com/isb-cgc/examples-R

To install the package, we use the devtools package.

.. code-block:: r

    require(devtools, quietly = TRUE) || install.packages('devtools',verbose = FALSE)
    require(ISBCGCExamples, quietly = TRUE) || install_github("isb-cgc/examples-R", build_vignettes=F)
    

If you needed to install any of the packages, remember, you still need to import it!

Your project ID
===============

You will be using your own project ID. At certain points in the code, it will
be necessary to complete the code.

.. code-block:: r

    my_cloud_project   = "your_project_id"
    tcga_data_set      = "TCGA_hg38_data_v0"

First query
===========

Now let's see if things are working.

.. code-block:: r

    bigrquery::list_tables("isb-cgc", tcga_data_set)


Using BigQuery to compute correlation matrices
==============================================

A correlation matrix using gene expression is going to take the correlation
between each pair of genes. Usually, this matrix can then be clustered to
discover gene modules.

In this query, we're going to use the BigQuery CORR function (one of the
many built-in mathematical functions) to compute a Pearson's correlation.

You can find all the function information at:
https://cloud.google.com/bigquery/query-reference

The general structure of the query is going to be:

.. code-block:: sql

    select
      genes and correlation
    from
      subtable with genes and samples of interest
    joined to
      same subtable of genes and samples


.. code-block:: r

    q <- "
	SELECT
	  a.gene_name AS gene1,
	  b.gene_name AS gene2,
	  CORR(a.HTSeq__FPKM,
	    b.HTSeq__FPKM) AS corr
	FROM (
	  SELECT
	    *
	  FROM
	    `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
	  WHERE
	    gene_name IN ('APLN',
	      'CCL26',
	      'IL19',
	      'IL37')
	    AND project_short_name = 'TCGA-COAD' ) AS a
	JOIN (
	  SELECT
	    *
	  FROM
	    `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
	  WHERE
	    gene_name IN ('APLN',
	      'CCL26',
	      'IL19',
	      'IL37')
	    AND project_short_name = 'TCGA-COAD' ) AS b
	ON
	  a.aliquot_barcode = b.aliquot_barcode
	GROUP BY
	  gene1,
	  gene2"

    corrs <- query_exec(q,my_cloud_project, use_legacy_sql=F)

    # transform to a matrix, and give it rownames
    library(tidyr)
    corrmat <- spread(corrs, gene1, corr)
    rownames(corrmat) <- corrmat$gene2

    # visualize the matrix
    library(pheatmap)
    pheatmap(corrmat[,-1])


It's easy to make a couple changes to this query, enabling a correlation
matrix *per* study. Try it!


Getting a list of high variance genes
=====================================

When we make queries from R, the results come back as a data.frame.
Let's use the GO annotation, and get a list of genes that are
related to the immune system. The GO Annotation table is found
in the genome_reference data set, and GO:0006955 references the
immune response.

.. code-block:: r

    q <- "
    SELECT 
      DB_Object_Symbol
    FROM
      `isb-cgc.genome_reference.GO_Annotations`
    WHERE
      GO_ID = 'GO:0006955'"

    query_exec(q, my_cloud_project, use_legacy_sql=F)


That query returns 472 genes. But let's suppose we want the top 50 by
coefficient of variance.

.. code-block:: r

    q <- "
	SELECT
	  gene_name,
	  STDDEV(HTSeq__FPKM+1) / AVG(HTSeq__FPKM+1) AS cv
	FROM
	  `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
	WHERE
	  gene_name IN (
	  SELECT
	    DB_Object_Symbol
	  FROM
	    `isb-cgc.genome_reference.GO_Annotations`
	  WHERE
	    GO_ID = 'GO:0006955')
	  AND project_short_name = 'TCGA-BRCA'
	GROUP BY
	  gene_name
	ORDER BY
	  cv DESC
	LIMIT
	  50"

    result <- query_exec(q, my_cloud_project, use_legacy_sql=F)
    genes <- result$gene_name

Now we have a list of genes that we can carry to further analysis.


Getting a list of samples from the endpoints
============================================

From R we can access the cohorts we created using the web app. To do that we
use the End Points API. The API is essentially a set of html requests. A
small wrapper is included as part of the isb-cgc examples-R package.

https://github.com/isb-cgc/examples-R/blob/master/inst/doc/Working_With_Barcode_Lists.md

To get started, import the ISBCGCExamples library.

.. code-block:: r

    library(ISBCGCExamples)

The first step is creating a token. This token contains your authentication status,
and lets the service know about what information is available to you.

.. code-block:: r

    my_token <- isb_init()


To get a listing of the previously created cohorts, we can use the list_cohorts
function that takes a token, and returns a list with items including
'count', 'items', 'kind', and 'etag'. The count shows the number of saved
cohorts and the items contains information about the cohorts.

.. code-block:: r

    # first get a list of my saved cohorts.
    my_cohorts <- list_cohorts(my_token)
    names(my_cohorts)

    # to get the names of my saved cohorts
    lapply(my_cohorts$items, function(x) x$name)

Now that we have the cohort IDs, we can collect the various barcodes contained
in the cohort. These include patient barcodes, sample barcodes, and 
aliquot barcodes. To do this, we can use the barcodes_from_cohort function.

HERE I'm using my cohort #4, but change this to whatever you have saved.

.. code-block:: r

    # get the cohort IDs
    my_cohort_id <- lapply(my_cohorts$items, function(x) x$id)[[4]]

    # then ping the endpoints with the cohort ID
    my_barcodes <- cohort_barcodes(my_cohort_id, my_token)
    names(my_barcodes)

The object returned from barcodes_from_cohort is again a list, this time with
elements 'cohort_id', 'sample_count', 'case_count', 'cases', and 'samples'.
The cases and samples elements are also lists, but lists of cases or sample barcodes.

.. code-block:: r

    samples <- unlist(my_barcodes$samples)
    # 836 samples


Programmatically constructing Queries
=====================================

One of the great things about working in a scripting environment, is that our
analysis -- the queries -- we write, can be constructed programmatically.
That makes it easy to apply the same structured queries to many questions.
But also we can incorporate long lists of samples or genes into a query.

.. code-block:: r

    #function for formatting lists..
    sqf <- function(x) {
        paste("('",paste(x, collapse="','"),"')", sep="")
    }

    q <- paste("
    SELECT
      a.gene_name as gene1,
      b.gene_name as gene2,
      CORR(a.HTSeq__FPKM, b.HTSeq__FPKM) as corr
    FROM (
      SELECT
        *
      FROM
        `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
      WHERE
        gene_name IN ", sqf(genes), "
        AND sample_barcode IN ", sqf(samples), "
      ) AS a
    JOIN (
      SELECT
        *
      FROM
        `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
      WHERE
        gene_name IN ", sqf(genes), "
        AND sample_barcode IN ", sqf(samples), "
      ) AS b
    ON
      a.aliquot_barcode = b.aliquot_barcode
    GROUP BY
      gene1,
      gene2", sep=" ")

    corrs <- query_exec(q,my_cloud_project, use_legacy_sql=F)

    # transform to a matrix, and give it rownames
    library(tidyr)
    corrmat <- spread(corrs, gene1, corr)
    rownames(corrmat) <- corrmat$gene2

    # visualize the matrix
    library(pheatmap)
    pheatmap(corrmat[,-1])

.. image:: correlation_matrix.jpg
    :width: 300px
    :align: center
    :height: 300px
    :alt: alternate text

From Lists to Matrices
======================

Transform gexp_affected_genes_df into a gexp-by-samples feature matrix

.. code-block:: r

	gexp_fm = tidyr::spread(gexp_affected_genes,HGNC_gene_symbol,normalized_count)

	gexp_fm[1:5,1:5]