Biospecimen

The Biospecimen table contains one row per TCGA sample. Each TCGA sample is uniquely identified by a 16-character TCGA barcode, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, see the TCGA barcode documentation.)
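
As a rough illustration of how a 16-character sample barcode breaks down, the sketch below splits one into its dash-separated parts. The function name `parse_sample_barcode` and the dictionary keys are illustrative choices, not part of any ISB-CGC tooling. The field layout (project, tissue source site, participant, sample type plus vial) is assumed from the general TCGA barcode convention.

```python
import re

# Assumed layout of a 16-character sample barcode such as TCGA-2G-AAM4-10A:
# project - tissue source site - participant - sample type (2 digits) + vial (1 letter).
SAMPLE_BARCODE_RE = re.compile(
    r"(?P<project>TCGA)-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample_type>[0-9]{2})(?P<vial>[A-Z])"
)


def parse_sample_barcode(barcode):
    """Split a sample barcode into named parts; raise ValueError if malformed."""
    match = SAMPLE_BARCODE_RE.fullmatch(barcode)
    if match is None:
        raise ValueError(f"not a 16-character TCGA sample barcode: {barcode!r}")
    parts = match.groupdict()
    # The first 12 characters form the participant-level barcode.
    parts["participant_barcode"] = barcode[:12]
    return parts


parts = parse_sample_barcode("TCGA-2G-AAM4-10A")
print(parts["tss"], parts["participant"], parts["sample_type"], parts["vial"])
```

A strict pattern is used here so that shorter (participant-level) or longer (portion- or aliquot-level) barcodes are rejected rather than silently truncated.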

XML Parsing

The TCGA data at the DCC is stored in XML files, which have been uploaded to Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis; these have been aggregated and averaged for each sample. The number of slides and the number of portions per sample are also included in the table.
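
The per-sample averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the input rows, the second barcode, the slide identifiers, and the output keys `num_slides` and `avg_value` are all hypothetical, and the actual pipeline may treat missing values differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-slide records: (sample barcode, slide id, measured value).
slide_rows = [
    ("TCGA-2G-AAM4-10A", "slide-1", 80.0),
    ("TCGA-2G-AAM4-10A", "slide-2", 90.0),
    ("TCGA-XX-0000-01A", "slide-1", None),  # made-up sample with a missing value
]


def aggregate_by_sample(rows):
    """Count slides per sample and average the values that are present."""
    values_by_sample = defaultdict(list)
    for sample_barcode, _slide_id, value in rows:
        values_by_sample[sample_barcode].append(value)

    result = {}
    for sample_barcode, values in values_by_sample.items():
        present = [v for v in values if v is not None]
        result[sample_barcode] = {
            "num_slides": len(values),
            # Assumption: a sample with no measured values gets None.
            "avg_value": mean(present) if present else None,
        }
    return result


per_sample = aggregate_by_sample(slide_rows)
print(per_sample["TCGA-2G-AAM4-10A"])
```

The same pattern would apply to per-portion values, with portions taking the place of slides.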

Filters

  • Samples for which is_ffpe=True were removed.
  • Patients or samples for which the Project value was not TCGA were removed.
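
A minimal sketch of the two filters above, applied to in-memory records. The record layout, the `barcode` key, and every record except the first are hypothetical; only the `is_ffpe` and Project conditions come from the list above.

```python
# Hypothetical sample records used only to demonstrate the two filters.
samples = [
    {"barcode": "TCGA-2G-AAM4-10A", "is_ffpe": False, "Project": "TCGA"},
    {"barcode": "TCGA-XX-0000-01A", "is_ffpe": True, "Project": "TCGA"},
    {"barcode": "XXXX-YY-1111-01A", "is_ffpe": False, "Project": "OTHER"},
]


def keep_sample(record):
    """Keep a record only if it is not FFPE and belongs to the TCGA project."""
    return not record["is_ffpe"] and record["Project"] == "TCGA"


kept = [record for record in samples if keep_sample(record)]
print([record["barcode"] for record in kept])
```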

The following fields were extracted from the ssf XML file:

  • days_to_sample_procurement
  • tissue_anatomic_site
  • tissue_anatomic_site_description
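
The extraction of these fields can be pictured with the sketch below, which uses Python's standard `xml.etree.ElementTree` module. The XML snippet is a simplified, made-up stand-in: real ssf files are assumed to be more deeply nested and to use XML namespaces, which this example does not handle. The function name `extract_fields` and the sample values are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Field names taken from the list above.
SSF_FIELDS = [
    "days_to_sample_procurement",
    "tissue_anatomic_site",
    "tissue_anatomic_site_description",
]

# Simplified, hypothetical XML; not the real ssf schema.
EXAMPLE_XML = """
<ssf>
  <days_to_sample_procurement>12</days_to_sample_procurement>
  <tissue_anatomic_site>Example site</tissue_anatomic_site>
  <tissue_anatomic_site_description></tissue_anatomic_site_description>
</ssf>
"""


def extract_fields(xml_text, fields):
    """Return {field: text or None} for the first matching element of each field."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for name in fields:
        element = root.find(f".//{name}")
        has_text = element is not None and element.text is not None
        text = element.text.strip() if has_text else ""
        # Missing elements and empty elements are both reported as None.
        record[name] = text or None
    return record


print(extract_fields(EXAMPLE_XML, SSF_FIELDS))
```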
