Clinical
The Clinical table contains one row per TCGA participant (aka patient or donor).
Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)
Clinical Feature Selection
In the first pass, any XML feature with the tag procurement_status=Completed that was found in at least 20% of the participants in any one Disease (aka tumor-type) was considered for selection.
A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.
Selected fields from the clinical, auxiliary, ssf, and omf XML files were then extracted and loaded into the BigQuery table.
Additionally, only the most recent follow-up information was included (for cases where multiple follow-up sections existed in the clinical XML file).
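The first-pass rule can be illustrated with a minimal sketch. It assumes the XML has already been reduced to one `(disease, feature_names)` pair per participant, where `feature_names` holds only the features tagged procurement_status=Completed; the function name `select_features` and this input layout are hypothetical and are not taken from the actual pipeline.

```python
from collections import defaultdict


def select_features(participants, min_fraction=0.20):
    """Return features present in at least min_fraction of the
    participants of any one disease.

    participants: iterable of (disease, set_of_feature_names) pairs
    (a hypothetical, pre-parsed representation of the clinical XML).
    """
    totals = defaultdict(int)                        # participants per disease
    counts = defaultdict(lambda: defaultdict(int))   # disease -> feature -> count

    for disease, features in participants:
        totals[disease] += 1
        for name in features:
            counts[disease][name] += 1

    selected = set()
    for disease, per_feature in counts.items():
        for name, n in per_feature.items():
            # A feature qualifies as soon as one disease reaches the threshold.
            if n / totals[disease] >= min_fraction:
                selected.add(name)
    return selected
```

Features added during the manual-curation pass would simply be unioned with the returned set.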
XML Parsing
Each clinical XML file is divided into admin and case blocks, and each of these was processed separately.
While iterating through the case block of information, all elements
(XML tags) and their values were collected. For follow-up blocks, only the
most recent (based on sequence number) sub-block elements were kept.
In the final pass, case elements and follow-up elements were merged, with preference given to follow-up elements.
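As a rough illustration of this merge, the sketch below parses a simplified case block. The XML layout in `SAMPLE` (a `follow_ups` container whose `follow_up` children carry a `sequence` attribute) and the helper names are assumptions made for the example; the real TCGA clinical XML uses namespaces and a richer structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified case block (not the real TCGA schema).
SAMPLE = """<case>
  <vital_status>Alive</vital_status>
  <days_to_last_followup>100</days_to_last_followup>
  <follow_ups>
    <follow_up sequence="1">
      <days_to_last_followup>250</days_to_last_followup>
      <tumor_status>WITH TUMOR</tumor_status>
    </follow_up>
    <follow_up sequence="2">
      <days_to_last_followup>400</days_to_last_followup>
    </follow_up>
  </follow_ups>
</case>"""


def leaf_values(element):
    """Map tag -> text for direct children that have no child elements."""
    return {
        child.tag: (child.text or "").strip()
        for child in element
        if len(child) == 0
    }


def parse_case(xml_text):
    case = ET.fromstring(xml_text)
    merged = leaf_values(case)  # case-level elements first

    follow_ups = case.findall("./follow_ups/follow_up")
    if follow_ups:
        # Keep only the follow-up with the highest sequence number.
        latest = max(follow_ups, key=lambda fu: int(fu.get("sequence", "0")))
        # Follow-up values take precedence over case-level values.
        merged.update(leaf_values(latest))
    return merged
```

With `SAMPLE`, the case-level `days_to_last_followup` is overridden by the value from follow-up 2, and `tumor_status` from the older follow-up is not carried over.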
Transforms
Different survival-related fields are completed based on the value of the vital_status field:
- for all patients with vital_status=Alive:
  - days_to_last_known_alive should not be NULL
  - days_to_last_known_alive is set to days_to_last_followup
  - days_to_death is set to NULL
- for all patients with vital_status=Dead:
  - days_to_death should not be NULL (if it is NULL, and days_to_last_followup is not NULL, then vital_status is set to “Alive”)
  - days_to_last_known_alive is set to days_to_death
  - days_to_last_followup is set to NULL
- pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0,1,2,3,4+])
- number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field
- country and country_of_procurement were merged into a single country field
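The survival-field rules above can be sketched as a small function over a single record. This is only an illustration: the record is assumed to be a plain dict with `None` standing in for NULL, the function name is hypothetical, and the sketch assumes that a participant reclassified from Dead to Alive is then handled by the Alive rules, which the text does not state explicitly.

```python
def apply_survival_rules(record):
    """Return a copy of record with survival fields completed
    according to vital_status (None stands in for NULL)."""
    row = dict(record)
    status = row.get("vital_status")

    # Dead without days_to_death but with follow-up data is
    # reclassified as Alive.
    if (
        status == "Dead"
        and row.get("days_to_death") is None
        and row.get("days_to_last_followup") is not None
    ):
        status = row["vital_status"] = "Alive"

    if status == "Alive":
        # Assumption: reclassified participants also follow this branch.
        row["days_to_last_known_alive"] = row.get("days_to_last_followup")
        row["days_to_death"] = None
    elif status == "Dead":
        row["days_to_last_known_alive"] = row.get("days_to_death")
        row["days_to_last_followup"] = None
    return row
```

Working on a copy keeps the input untouched, which makes it easy to compare a record before and after the transform.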
The following fields were extracted from the ssf XML file:
- histological_type
- country
- other_dx
- tobacco_smoking_history
- gleason_score_combined
- history_of_neoadjuvant_treatment
The following fields were extracted from the omf XML file:
- other_malignancy_malignancy_type
- other_malignancy_anatomic_site
- other_malignancy_histological_type
When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:
hpv_calls, hpv_status, mononucleotide_and_dinucleotide_marker_panel_analysis_status
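The batch-number check described here might look like the following sketch. Both records are assumed to be dicts, and the `batch_number` key and the function name are hypothetical; only the three field names come from the text.

```python
AUX_FIELDS = (
    "hpv_calls",
    "hpv_status",
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status",
)


def add_auxiliary_fields(clinical, auxiliary):
    """Copy the auxiliary fields into the clinical record only when an
    auxiliary record exists and its batch number matches."""
    row = dict(clinical)
    if auxiliary is None:
        return row  # no auxiliary XML file for this participant

    batch = clinical.get("batch_number")  # hypothetical key name
    if batch is not None and batch == auxiliary.get("batch_number"):
        for name in AUX_FIELDS:
            row[name] = auxiliary.get(name)
    return row
```

Requiring a non-missing batch number avoids treating two records that both lack one as a match.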
Finally, the patient BMI was calculated based on the height and weight values
(when both were present) and was added to the Clinical table.
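A minimal sketch of the BMI step follows. It uses the standard definition (weight in kilograms divided by the square of height in metres) and assumes height is recorded in centimetres and weight in kilograms; the units and the function name are assumptions, since the text does not state them.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI, or None when either value is missing or unusable.

    Assumes height in centimetres and weight in kilograms.
    """
    if height_cm is None or weight_kg is None:
        return None
    height_m = float(height_cm) / 100.0
    if height_m <= 0:
        return None  # guard against zero or negative heights
    return float(weight_kg) / (height_m ** 2)
```

Returning `None` when either input is missing mirrors the rule that BMI is only added when both height and weight are present.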