@agrueneberg
Last active March 2, 2024 17:04

How to match RPPA samples to their patients in The Cancer Genome Atlas (TCGA)

When I wrote the TCGA Toolbox and the TCGA.rppa module, one of the biggest problems was matching samples to their patients.

Both sample and patient files can be accessed through the Open Access HTTP Directory. For glioblastoma multiforme (gbm), the samples can be found in the gbm/cgcc/mdanderson.org/mda_rppa_core/protein_exp/mdanderson.org_GBM.MDA_RPPA_Core.Level_3.1.0.0/ directory, and the patients can be found in the nationwidechildrens.org_clinical_patient_gbm.txt file in the gbm/bcr/biotab/clin/ directory.

Looking at the data, we will find identifiers that look like UUIDs, and identifiers that start with TCGA-. The latter are called TCGA barcodes, and they are quite meaningful as they follow a particular pattern. Using the sample barcode TCGA-14-0871-01A-21-1898-20 as an example, we can learn that

  • the data belongs to The Cancer Genome Atlas (TCGA-),
  • the tissue was collected at Emory University for a glioblastoma multiforme study (14-),
  • the data belongs to patient number 871 (0871-),
  • the sample is a solid tumor (01A-),
  • and a lot of other information.

One nice trait of those barcodes is that once we have the sample barcode, we can easily get the patient barcode by removing the information about the sample: TCGA-14-0871-01A-21-1898-20 becomes TCGA-14-0871, a format that matches the bcr_patient_barcode column in the patient file.
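As a minimal sketch of that trimming step in Python: the function names `split_barcode` and `patient_barcode` are my own for illustration and are not part of the TCGA Toolbox. The only assumption is that the fields of a barcode are separated by hyphens, as in the example above.

```python
def split_barcode(barcode):
    """Split a TCGA barcode into its hyphen-separated fields."""
    fields = barcode.split("-")
    # A usable barcode needs at least project, tissue source site, and patient.
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return fields


def patient_barcode(sample_barcode):
    """Keep the first three fields and drop the sample-specific rest."""
    return "-".join(split_barcode(sample_barcode)[:3])


sample = "TCGA-14-0871-01A-21-1898-20"
print(split_barcode(sample)[3])  # 01A
print(patient_barcode(sample))   # TCGA-14-0871
```

Splitting on hyphens rather than counting characters keeps the individual fields (such as 01A) available if they are needed later.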

Unfortunately for us, there is a trend in the field of knowledge organization that promotes identifiers without semantics (for reasons I don't want to go into, but https://wiki.nci.nih.gov/display/TCGA/UUID+Migration+Plan#UUIDMigrationPlan-3.FAQ is a start). Therefore, the maintainers of the TCGA have been working on replacing the TCGA barcodes with random UUIDs since 2012. For some institutions and platforms this transition has been completed, and RPPA is one of them.

With UUIDs we are no longer able to extract the patient identifier from the sample identifier. So what can we do? First, we need to obtain the sample UUID from

  • either from the file name of the RPPA file, where the UUID sits between the level and the file extension,
  • or from the Sample REF field in the content of an RPPA file.
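Both options can be sketched in Python as follows. This is only an illustration under a few assumptions: the file name and the UUID in the example are made-up placeholders, the helper names are hypothetical, and the content parser assumes a tab-delimited line that starts with Sample REF and carries the UUID in the next cell.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def uuid_from_file_name(file_name):
    """Return the first UUID found in the file name, or None."""
    match = UUID_PATTERN.search(file_name)
    return match.group(0).lower() if match else None


def uuid_from_content(text):
    """Return the UUID next to "Sample REF" (assumed tab-delimited), or None."""
    for line in text.splitlines():
        cells = line.split("\t")
        if cells[0].strip() == "Sample REF" and len(cells) > 1:
            match = UUID_PATTERN.fullmatch(cells[1].strip())
            return match.group(0).lower() if match else None
    return None


# Placeholder file name and UUID, for illustration only.
name = "example.Level_3.01234567-89ab-cdef-0123-456789abcdef.txt"
print(uuid_from_file_name(name))
```

Searching for the UUID pattern itself, instead of splitting the file name on dots, avoids depending on how many dot-separated parts come before the level.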

Once we have obtained the UUID, we can convert it back to a TCGA barcode. There are several options to do this:

Now that we have the TCGA sample barcode, we can remove everything but the first 12 characters. The result will be our TCGA patient barcode. All we have to do now is to find the row that matches the barcode in the bcr_patient_barcode column.
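A rough Python sketch of this last step, assuming the patient file is tab-delimited with a single header row that contains bcr_patient_barcode. The function name `find_patient`, the extra column, and the rows in the example are invented for illustration; a real patient file may carry additional header rows that would need to be skipped.

```python
import csv
import io


def find_patient(patient_file, sample_barcode):
    """Return the patient row matching the sample barcode, or None."""
    # The patient barcode is the first 12 characters, e.g. TCGA-14-0871.
    patient_barcode = sample_barcode[:12]
    reader = csv.DictReader(patient_file, delimiter="\t")
    for row in reader:
        if row["bcr_patient_barcode"] == patient_barcode:
            return row
    return None


# Invented stand-in for the patient file; only the column name
# bcr_patient_barcode comes from the text above.
example = io.StringIO(
    "bcr_patient_barcode\texample_column\n"
    "TCGA-00-0000\tplaceholder A\n"
    "TCGA-14-0871\tplaceholder B\n"
)

row = find_patient(example, "TCGA-14-0871-01A-21-1898-20")
print(row["example_column"])  # placeholder B
```

For many samples it would be cheaper to read the patient file once into a dictionary keyed by bcr_patient_barcode instead of scanning it for every sample.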
