Protein Expression (RPPA)
The raw protein data file contains just two columns: the “Composite Element REF”, which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The “Composite Element REF” is parsed to generate additional information (see the Formatting section for details). The BigQuery tables HG19 Protein_Expression and HG38 Protein_Expression were populated with all TCGA Level-3 RPPA data from files matching the pattern “%_RPPA_Core.protein_expression%.txt”.
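As a rough illustration of the file selection, the pattern above can be approximated in Python by translating the SQL-style “%” wildcard into a glob “*”. This is only a sketch: the function name and the sample file names are hypothetical, and the underscores are treated as literal characters (in a strict SQL LIKE they would match any single character).

```python
import fnmatch

# Pattern quoted in the text; '%' is the SQL LIKE wildcard for any run of characters.
LIKE_PATTERN = "%_RPPA_Core.protein_expression%.txt"


def matches_rppa_pattern(filename: str) -> bool:
    """Return True if filename matches the pattern, read as a glob.

    Assumption: '_' is meant literally, so only '%' is translated to '*'.
    """
    glob_pattern = LIKE_PATTERN.replace("%", "*")
    return fnmatch.fnmatchcase(filename, glob_pattern)


# Hypothetical file names, for illustration only.
candidates = [
    "site_XYZ.MDA_RPPA_Core.protein_expression.Level_3.sample1.txt",
    "site_XYZ.MDA_RPPA_Core.SuperCurve.Level_2.sample1.txt",
]
selected = [name for name in candidates if matches_rppa_pattern(name)]
print(selected)
```

Only the first hypothetical name is selected, because the second one does not contain “protein_expression”.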
The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene, and protein map is explained in detail below.
Generation of Composite_element_ref, gene, and protein name map
(Manual Curation of the gene and protein names)
- Check the antibody annotation files for missing columns.
- If ‘protein_name’ is missing, generate one from ‘composite_element_ref’
- Make a map of ‘composite_element_ref’, ‘gene_name’, ‘protein_name’ values.
- Check for any other variants of the gene and protein symbols in the table.
- HGNC Validation
- If the gene symbol is among the HGNC approved symbols, it is ‘Approved’: Gene_symbol = Gene_symbol.
- If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.
- If not, check the Previous symbols. If found, Gene_symbol = ‘Approved’ Gene_symbol.
- If not, Gene_symbol = Gene_symbol.
- The file generated is manually curated and fed back into the algorithm.
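The HGNC validation cascade above can be sketched as a small lookup function. This is a minimal illustration, not the actual pipeline code: the function name, the status labels, and the toy lookup tables are all hypothetical, and the alias step follows a literal reading of the list (a symbol found among the aliases is kept as it is).

```python
from typing import Dict, Set, Tuple

# Toy lookup tables with made-up symbols; real tables would come from HGNC.
APPROVED: Set[str] = {"GENEA"}
ALIASES: Set[str] = {"GENEA_ALT"}
PREVIOUS_TO_APPROVED: Dict[str, str] = {"GENEA_OLD": "GENEA"}


def validate_gene_symbol(
    symbol: str,
    approved: Set[str] = APPROVED,
    aliases: Set[str] = ALIASES,
    previous_to_approved: Dict[str, str] = PREVIOUS_TO_APPROVED,
) -> Tuple[str, str]:
    """Return (gene_symbol, status) following the cascade described above."""
    if symbol in approved:
        # Approved symbol: keep it unchanged.
        return symbol, "approved"
    if symbol in aliases:
        # Assumption: the alias symbol itself is kept (literal reading).
        return symbol, "alias"
    if symbol in previous_to_approved:
        # Previous symbol: replace it with the current approved symbol.
        return previous_to_approved[symbol], "previous"
    # Not found anywhere: keep the symbol for manual curation.
    return symbol, "not_found"


for candidate in ["GENEA", "GENEA_ALT", "GENEA_OLD", "UNKNOWN1"]:
    print(candidate, "->", validate_gene_symbol(candidate))
```

Returning a status alongside the symbol is a design choice of this sketch; it would make it easier to pick out the symbols that still need manual curation.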
Formatting
- Duplicate the rows if there are multiple genes concatenated in the “gene_name” value. For example, a ‘gene_name’ value such as ‘AKT1 AKT2 AKT3’ is stored as three separate rows, one gene per row.
- ‘Protein_Name’ is split into ‘Protein_Basename’ and ‘Phospho’, which are stored as separate columns.
- ‘Composite element ref’ is parsed to get ‘validationStatus’ and ‘antibodySource’ - both are stored as separate columns in the BigQuery table.
- Data from both Illumina GA and HiSeq platforms are stored in the same table.
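The formatting steps above might look roughly like the following sketch. The row duplication follows the ‘AKT1 AKT2 AKT3’ example directly. The two parsing helpers rest on assumptions that the text does not spell out: that a composite element ref has the layout “<protein>-<source>-<status>” (for example “Akt_pS473-R-V”), and that a phospho site appears as a “_p...” suffix on the protein name. All function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def expand_genes(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one copy of the row per whitespace-separated gene in 'gene_name'."""
    return [{**row, "gene_name": gene} for gene in row["gene_name"].split()]


def split_protein_name(protein_name: str) -> Tuple[str, Optional[str]]:
    """Split a name into (basename, phospho), e.g. 'Akt_pS473' -> ('Akt', 'pS473').

    Assumption: the phospho part starts with '_p' followed by S, T or Y and digits.
    """
    match = re.match(r"^(.*?)_(p[STY]\d+.*)$", protein_name)
    if match:
        return match.group(1), match.group(2)
    return protein_name, None


def parse_composite_element_ref(ref: str) -> Dict[str, Optional[str]]:
    """Parse an ASSUMED '<protein>-<source>-<status>' layout from the right."""
    parts = ref.rsplit("-", 2)
    if len(parts) != 3:
        return {"protein_name": ref, "antibodySource": None, "validationStatus": None}
    protein, source, status = parts
    return {"protein_name": protein, "antibodySource": source, "validationStatus": status}


row = {"composite_element_ref": "Akt_pS473-R-V", "gene_name": "AKT1 AKT2 AKT3"}
parsed = parse_composite_element_ref(row["composite_element_ref"])
basename, phospho = split_protein_name(parsed["protein_name"])
for expanded in expand_genes(row):
    print(expanded["gene_name"], basename, phospho,
          parsed["antibodySource"], parsed["validationStatus"])
```

Splitting the ref from the right keeps hyphens inside a protein name intact, which is why `rsplit` is used here rather than `split`.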