microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression\_matrix\_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. The output matrix contains a consistent number of mature microRNAs, each identified by a combination of the microRNA gene name and the unique accession number, e.g., “hsa-mir-21.MIMAT0000076”. During ETL, this identifier is split into its two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
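The split-and-melt step can be illustrated with a minimal Python sketch. It is not the pipeline's actual ETL code: the function names, the record keys (`mirna_id`, `mirna_accession`, `sample`, `expression`), the sample labels, the values, and the second identifier are all assumptions made for illustration, and the real BigQuery column names may differ.

```python
def split_mirna_id(identifier):
    """Split 'hsa-mir-21.MIMAT0000076' into (gene name, accession)."""
    # rpartition splits at the last dot, so only the trailing accession is cut off.
    name, _, accession = identifier.rpartition(".")
    return name, accession


def melt_matrix(header, rows):
    """Turn a wide matrix (one row per microRNA, one column per sample)
    into long-format records, one per (microRNA, sample) pair."""
    samples = header[1:]  # the first header cell labels the identifier column
    records = []
    for row in rows:
        name, accession = split_mirna_id(row[0])
        for sample, value in zip(samples, row[1:]):
            records.append({
                "mirna_id": name,               # hypothetical column name
                "mirna_accession": accession,   # hypothetical column name
                "sample": sample,
                "expression": float(value),
            })
    return records


# Toy matrix: sample labels, values, and the second identifier are made up.
header = ["mirna", "sample_A", "sample_B"]
rows = [
    ["hsa-mir-21.MIMAT0000076", "10.5", "20.0"],
    ["hsa-mir-example.MIMAT0000000", "3.0", "4.5"],
]
records = melt_matrix(header, rows)
```

Each record holds one expression value for one microRNA in one sample, which is the flat shape that is loaded into the table.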

Only the isoform files matching the pattern %.hg19.mirbase20.isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
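As a rough illustration of the file selection, and assuming the leading `%` acts as a SQL LIKE-style wildcard for "any prefix", the filter amounts to a suffix check on the file name. The helper name and the example file names below are hypothetical, and the sketch checks names only; it does not inspect file contents.

```python
# Suffix taken from the pattern above, without the leading '%' wildcard.
SUFFIX = ".hg19.mirbase20.isoform.quantification.txt"


def is_hg19_isoform_file(filename):
    """Return True when the file name ends with the hg19 isoform suffix."""
    return filename.endswith(SUFFIX)


# Hypothetical file names, for illustration only.
candidates = [
    "sample1.hg19.mirbase20.isoform.quantification.txt",
    "sample1.hg19.mirbase20.mirna.quantification.txt",
    "sample2.isoform.quantification.txt",
]
selected = [name for name in candidates if is_hg19_isoform_file(name)]
```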

In addition to the table with the expression data, corresponding isoform and expression tables are available for data reprocessed by the GDC against the HG38 genome build.

