Data and datasets

What are the source repositories Xena pulls from?

GDC data portal (https://portal.gdc.cancer.gov/repository) for the GDC Hub data
GDC legacy archive (https://portal.gdc.cancer.gov/legacy-archive) for the TCGA Hub data
ICGC data portal (https://dcc.icgc.org/) for the ICGC Hub data
Pan-Cancer Atlas publications’ data site (https://gdc.cancer.gov/node/905/) for the Pan-Cancer Atlas Hub data
TCGA ATAC-seq publication’s data site (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG) for the ATAC-seq Hub data
Nature Biotechnology publication (https://doi.org/10.1038/nbt.3772) for the UCSC Toil RNAseq Recompute Hub data
Various journal publications for UCSC Public Hub data

For TCGA, which gene expression RNAseq dataset should I use for my analysis?

TCGA Pan-Cancer Atlas gene expression

Use this dataset for comparisons across multiple or all TCGA cohorts. It was generated by the TCGA PanCan Atlas project and has been normalized for batch effects. Please see the PanCan Atlas paper in Cell for more information. https://xenabrowser.net/datapages/?dataset=EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net

GDC STAR gene expression

Generated by the GDC, this data can also be used to compare across TCGA cohorts. It may not have as many batch effects removed as the PanCan Atlas data.

Toil RSEM gene expression

The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. This is best used to compare TCGA cohorts to TARGET or GTEx cohorts.

TCGA Gene expression RNAseq (IlluminaHiSeq)

For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.

TCGA Gene expression RNAseq (IlluminaHiSeq pancan normalized)

For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by first combining the "gene expression RNAseq" values (above) of all TCGA cohorts and then mean normalizing all values per gene. After normalization, the combined data is split back into the 30-40 cancer types so that a normalized dataset is available for each one. Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
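The combine, normalize, and split steps described above can be sketched in Python. This is a minimal illustration, not the UCSC pipeline: it assumes "mean normalizing" means subtracting each gene's mean across all cohorts, and the function name, cohort names, and gene names are hypothetical.

```python
from statistics import mean

def pancan_mean_normalize(cohorts):
    """Subtract each gene's pan-cohort mean from every sample.

    cohorts: {cohort_name: {gene: [log2(x+1) values, one per sample]}}
    Returns the same structure with mean-centered values.
    Assumption: "mean normalizing" is plain mean-centering per gene.
    """
    # Only genes present in every cohort can be pooled.
    shared_genes = sorted(set.intersection(*(set(c) for c in cohorts.values())))
    normalized = {name: {} for name in cohorts}
    for gene in shared_genes:
        # Step 1: combine the values of all cohorts for this gene.
        pooled = [v for c in cohorts.values() for v in c[gene]]
        # Step 2: compute the per-gene mean over the combined data.
        gene_mean = mean(pooled)
        # Step 3: write the centered values back per cohort (the "split").
        for name, c in cohorts.items():
            normalized[name][gene] = [v - gene_mean for v in c[gene]]
    return normalized

# Toy example with made-up values.
toy = {
    "COHORT_A": {"GENE1": [1.0, 3.0], "GENE2": [2.0, 2.0]},
    "COHORT_B": {"GENE1": [5.0], "GENE2": [8.0]},
}
result = pancan_mean_normalize(toy)
print(result["COHORT_A"]["GENE1"])  # [-2.0, 0.0]
print(result["COHORT_B"]["GENE2"])  # [4.0]
```

After centering, a positive value means the sample expresses the gene above the average across all pooled cohorts, which is what makes the pooled data usable as a background distribution.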

TCGA Gene expression RNAseq (IlluminaHiSeq percentile)

For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking the RSEM values within each sample. The values are percentile ranks ranging from 0 to 100, where lower values represent lower expression. Alternatively, you can combine the TCGA RNAseq data with your own RNAseq data, normalize across the combined dataset using whatever method you choose, and then analyze the combined dataset further.
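If your own data is not yet percentile ranked, per-sample ranking can be sketched as below. This is only an illustration: the scaling to 0-100 and the handling of ties (tied values share the lowest rank) are assumptions and may differ from what was used to build the Xena dataset. The function and gene names are hypothetical.

```python
from bisect import bisect_left

def percentile_ranks(sample):
    """Rank the expression values of ONE sample onto a 0-100 scale.

    sample: {gene: expression value}
    Assumptions: percentile = 100 * (count of strictly lower values) / (n - 1),
    so tied genes share the same (lowest) rank.
    """
    ordered = sorted(sample.values())
    denom = max(len(ordered) - 1, 1)  # avoid dividing by zero for one gene
    return {
        gene: 100.0 * bisect_left(ordered, value) / denom
        for gene, value in sample.items()
    }

# Toy example with made-up values.
ranks = percentile_ranks({"GENE1": 0.0, "GENE2": 5.0, "GENE3": 10.0, "GENE4": 5.0})
print(ranks["GENE1"])  # 0.0, the lowest expressed gene
print(ranks["GENE3"])  # 100.0, the highest expressed gene
```

Because ranking is done within each sample, the result depends only on the ordering of genes in that sample, which is why percentile data from different sources can be placed on the same scale.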

What is the difference between RPPA data and RPPA_RBN data?

The TCGA RPPA data are generated at MD Anderson. The RPPA values are generated using the method described at http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RPPA values from the TCGA DCC.

The RPPA_RBN data are normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. For more information, see http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RBN values from Synapse at https://www.synapse.org/#!Synapse:syn1750330.

Can I combine data from the methylation 450k and 27k datasets?

The methylation 450k dataset has 90% of the probes from the 27k dataset. However, we have found that the range of values differs slightly between the two datasets. We therefore recommend normalizing the data before combining the datasets, and looking in the literature to see what methods people have used.

What is the difference between GISTIC 2 and GISTIC 2 thresholded datasets?

Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of a chromosome or a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, 2 represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.
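When working with a downloaded slice of the thresholded data, it can help to translate the five values into their meanings. The lookup below is a small sketch using only the definitions given above; the names GISTIC2_THRESHOLDED_LABELS and label_calls are hypothetical, and the thresholds GISTIC 2 itself uses to produce these values are not reproduced here.

```python
# Meaning of each GISTIC 2 thresholded value, as described in the text above.
GISTIC2_THRESHOLDED_LABELS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}

def label_calls(calls):
    """Map thresholded values (ints, floats, or numeric strings) to labels."""
    return [GISTIC2_THRESHOLDED_LABELS[int(float(c))] for c in calls]

# Toy example with made-up calls for one gene across four samples.
print(label_calls([-2, 0, 2.0, "1"]))
```

A value outside the five listed ones raises a KeyError, which is a cheap way to notice that a continuous (non-thresholded) GISTIC 2 dataset was loaded by mistake.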

Where is the transcript-level expression data?

As of March 2019, our transcript-level data is in the TCGA Pan-Cancer cohort. From there, choose 'Advanced' and select any of the transcript-level expression datasets. Enter your transcript of interest as an Ensembl transcript identifier (not a gene name).

How do I calculate fold change (FC)?

The following instructions assume that your data has been log transformed. All the RNAseq data in Xena public data hubs have already been log transformed, either by us or by the data providers. You can always confirm this by viewing the dataset details page (start at our Explore Data pages and drill down until you get to the details page for the dataset).

Log transformed means that the output values from the gene expression caller/program have been put through the following transformation:

log2(x+theta) = y

Where x is the TPM, RSEM, etc. value, "theta" is a small value (1, 0.01, etc.) added to x since you cannot take the log of zero, "log2" is log base 2, and y is the transformed value.
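The transformation and its inverse can be written as a short Python sketch. The function names are hypothetical, and the default theta of 1 is only an example; check the dataset details page for the value actually used for a given dataset.

```python
import math

def log2_transform(x, theta=1.0):
    """Return y = log2(x + theta) for a raw expression value x."""
    return math.log2(x + theta)

def inverse_log2_transform(y, theta=1.0):
    """Recover the raw value x = 2**y - theta from a transformed value y."""
    return 2.0 ** y - theta

print(log2_transform(0.0))          # 0.0, a raw value of zero stays defined
print(log2_transform(7.0))          # 3.0, since log2(7 + 1) = 3
print(inverse_log2_transform(3.0))  # 7.0
```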

When comparing these log transformed values, we use the quotient rule of logarithms:

log(A/B) = log(A) - log(B)

For example, say you have downloaded values for a gene in 2 samples (either from our bulk downloads or as a slice of data that has not been mean normalized), and one sample has a value of 4 and the other a value of 1. Because our values are log transformed, this means:

log(A) = 4
log(B) = 1

Therefore:

log(A/B) = 4 - 1
log(A/B) = 3

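This gives you a log2 fold change of 3. Since the values are log base 2, this corresponds to an 8-fold change (2^3 = 8) in the untransformed values (strictly, in the values after theta has been added).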

Please note that in this case we are reporting the log(fold change). Biologists often use the log(fold change) because without taking the log, downregulated genes would have values between 0 and 1, whereas upregulated genes would have any value between 1 and infinity. This distribution makes graphing and further statistical analysis difficult. Taking the log typically makes the resulting values more normally distributed, which is better for further analysis.
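The worked example above can be checked with a short Python sketch. It assumes both inputs are already log2 transformed with the same theta; the function names are hypothetical. Subtracting the two values gives the log2 fold change, and raising 2 to that power converts it to a fold change on the linear scale.

```python
def log2_fold_change(log2_a, log2_b):
    """Quotient rule: log2(A/B) = log2(A) - log2(B)."""
    return log2_a - log2_b

def linear_fold_change(log2_a, log2_b):
    """Undo the log: A/B = 2 ** (log2(A) - log2(B))."""
    return 2.0 ** log2_fold_change(log2_a, log2_b)

print(log2_fold_change(4, 1))    # 3
print(linear_fold_change(4, 1))  # 8.0
print(linear_fold_change(1, 4))  # 0.125, the same change in the other direction
```

Note how the linear values 8.0 and 0.125 are asymmetric around 1, while the corresponding log2 values 3 and -3 are symmetric around 0.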

Can I get access to the raw Toil data?

Yes! We host it on AWS. Note that because the files are so large, you will need to pay the egress fees to download them. To get started, first look through the manifests for TCGA: s3://cgl-rnaseq-recompute-toil/tcga-manifest , TARGET: s3://cgl-rnaseq-recompute-toil/target-manifest , and GTEx: s3://cgl-rnaseq-recompute-toil/gtex-manifest and decide which files you want. Then, using your AWS account, download the files. Contact us if you run into any issues.

Example command to get the manifest

aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga-manifest . --request-payer requester

Now you can take a look at the manifest to see the TCGA files

Example command to download a single TCGA file

aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga/0106d51d-d581-4be7-91f3-b2f0c84468d1.tar.gz . --request-payer requester
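If you want several files, the same command can be scripted. The Python sketch below only wraps the AWS CLI shown above and is not an official Xena tool: the function names are hypothetical, it spells the requester-pays option in its full form (--request-payer requester), and it defaults to a dry run so that nothing is downloaded (and no egress fees are incurred) until you opt in. It assumes the AWS CLI is installed and configured with your account.

```python
import subprocess

BUCKET = "s3://cgl-rnaseq-recompute-toil"

def build_cp_command(key, dest="."):
    """Build the 'aws s3 cp' argument list for one object in the bucket."""
    return ["aws", "s3", "cp", f"{BUCKET}/{key}", dest,
            "--request-payer", "requester"]

def download(keys, dest=".", dry_run=True):
    """Return the commands for all keys; run them only if dry_run is False."""
    commands = [build_cp_command(key, dest) for key in keys]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the AWS CLI fails.
            subprocess.run(command, check=True)
    return commands

# Dry run: print the commands without contacting AWS.
for command in download(["tcga-manifest",
                         "tcga/0106d51d-d581-4be7-91f3-b2f0c84468d1.tar.gz"]):
    print(" ".join(command))
```

Call download(keys, dry_run=False) once you have picked the files you want from the manifest.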

