Advanced data and datasets

What are the Source Repositories Xena pulls from?

For TCGA, which gene expression RNAseq dataset should I use for my analysis?

TCGA Pan-Cancer Atlas gene expression

Use this dataset for comparison across multiple or all TCGA cohorts. It was generated by the TCGA PanCan Atlas project and has been normalized for batch effects. Please see the PanCan Atlas paper in Cell for more information.

GDC HTSeq or STAR gene expression

Generated by the GDC, this data can also be used to compare across TCGA cohorts. It may not have as many batch effects removed as the PanCan Atlas data.

Toil RSEM gene expression

The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. It is best used to compare TCGA cohorts to TARGET or GTEx cohorts.

TCGA Gene expression RNAseq (IlluminaHiSeq)

For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.
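As a minimal sketch of the log2(x+1) transform described above, the following shows how to move between RSEM values and the stored values. The function names are hypothetical and are not part of Xena.

```python
import math


def rsem_to_log2(x):
    """Transform an RSEM value x into log2(x + 1)."""
    return math.log2(x + 1)


def log2_to_rsem(value):
    """Invert the transform: recover x from log2(x + 1)."""
    return 2 ** value - 1


print(rsem_to_log2(0))     # 0.0 (an RSEM value of 0 stays at 0)
print(rsem_to_log2(1023))  # 10.0
print(log2_to_rsem(10.0))  # 1023.0
```

The +1 keeps genes with an RSEM value of 0 defined, since log2(0) is undefined.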

TCGA Gene expression RNAseq (IlluminaHiSeq pancan normalized)

For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by first combining the "gene expression RNAseq" values (above) of all TCGA cohorts and then mean-normalizing all values per gene. After normalization, the data was split back into the 30-40 cancer types so that it is available for each cancer type. Because the combined data spans 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
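The combine, normalize, and split steps can be sketched roughly as follows. This is an illustration only, not UCSC's actual pipeline: it assumes that "mean normalizing" means subtracting each gene's mean across all combined samples, and the cohort, sample, and gene names are made up.

```python
from statistics import mean

# Hypothetical toy data: cohort -> sample -> {gene: log2(x+1) value}.
cohorts = {
    "COHORT_A": {"s1": {"GENE1": 10.0, "GENE2": 2.0},
                 "s2": {"GENE1": 12.0, "GENE2": 4.0}},
    "COHORT_B": {"s3": {"GENE1": 6.0, "GENE2": 6.0},
                 "s4": {"GENE1": 8.0, "GENE2": 8.0}},
}


def pancan_mean_normalize(cohorts):
    """Mean-center each gene across all cohorts, then return per-cohort data."""
    # Step 1: combine the samples of every cohort into one pool.
    combined = {s: vals for samples in cohorts.values() for s, vals in samples.items()}
    genes = sorted({g for vals in combined.values() for g in vals})
    # Step 2: compute each gene's mean over the pooled samples
    # (assumes every sample has a value for every gene).
    gene_mean = {g: mean(vals[g] for vals in combined.values()) for g in genes}
    # Step 3: subtract the pooled mean and split back into the original cohorts.
    return {
        cohort: {s: {g: v - gene_mean[g] for g, v in vals.items()}
                 for s, vals in samples.items()}
        for cohort, samples in cohorts.items()
    }


normalized = pancan_mean_normalize(cohorts)
print(normalized["COHORT_A"]["s1"]["GENE1"])  # 1.0 (above the pooled mean of 9.0)
print(normalized["COHORT_B"]["s3"]["GENE1"])  # -3.0 (below the pooled mean)
```

Under this assumption, a positive value means the gene is expressed above its average across all pooled tumor types, and a negative value means below.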

TCGA Gene expression RNAseq (IlluminaHiSeq percentile)

For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking the RSEM values within each sample. The values are percentile ranks ranging from 0 to 100, where lower values represent lower expression. Alternatively, you can combine the TCGA RNAseq data with your own RNAseq data, normalize the combined dataset using whatever method you choose, and then analyze the combined dataset further.
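To make the idea of per-sample percentile ranking concrete, here is a small sketch. The exact ranking and tie-handling convention used at UCSC is not stated here, so the formula below (the count of strictly lower values, scaled to 0-100) is an assumption, and the gene names are hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(sample):
    """Return {gene: percentile rank in 0..100} for one sample's {gene: RSEM value}."""
    ordered = sorted(sample.values())
    n = len(ordered)
    if n < 2:
        # A single value has no meaningful rank; report 0 by convention here.
        return {gene: 0.0 for gene in sample}
    # bisect_left counts the values strictly below v, so tied genes share a rank.
    return {gene: 100.0 * bisect_left(ordered, v) / (n - 1)
            for gene, v in sample.items()}


sample = {"GENE_A": 0.0, "GENE_B": 5.0, "GENE_C": 50.0, "GENE_D": 500.0, "GENE_E": 5000.0}
print(percentile_ranks(sample))
# {'GENE_A': 0.0, 'GENE_B': 25.0, 'GENE_C': 50.0, 'GENE_D': 75.0, 'GENE_E': 100.0}
```

Because ranks depend only on the ordering within a sample, applying the same ranking to your own samples puts them on the same 0 to 100 scale regardless of the original units.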

What is the difference between RPPA data and RPPA_RBN data?

The TCGA RPPA data are generated at MD Anderson using their RPPA method. We downloaded the RPPA values from the TCGA DCC.

The RPPA_RBN data are normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. We downloaded the RBN values from Synapse (syn1750330).

Can I combine data from the methylation 450k and 27k datasets?

The methylation 450k dataset contains 90% of the probes from the 27k dataset. However, we have found that the range of values differs slightly between the two datasets. We therefore recommend normalizing the data before combining them, and suggest looking in the literature to see which methods others have used.
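Before choosing a normalization method, it can help to restrict both datasets to their shared probes and compare the value ranges. The sketch below does only that check and does not perform any normalization; the probe names and values are invented for illustration.

```python
def shared_probe_ranges(data_27k, data_450k):
    """Return the shared probes plus the (min, max) of each dataset over them.

    Each argument maps a probe ID to a list of values across samples.
    """
    shared = sorted(set(data_27k) & set(data_450k))

    def value_range(data):
        values = [v for probe in shared for v in data[probe]]
        return min(values), max(values)

    return shared, value_range(data_27k), value_range(data_450k)


# Hypothetical toy data, not real probe IDs or measurements.
data_27k = {"probe_1": [0.10, 0.80], "probe_2": [0.05, 0.95], "probe_only_27k": [0.5, 0.5]}
data_450k = {"probe_1": [0.02, 0.90], "probe_2": [0.01, 0.99], "probe_only_450k": [0.3, 0.7]}

shared, range_27k, range_450k = shared_probe_ranges(data_27k, data_450k)
print(shared)      # ['probe_1', 'probe_2']
print(range_27k)   # (0.05, 0.95)
print(range_450k)  # (0.01, 0.99)
```

A visible gap between the two ranges, as in this toy example, is the kind of difference that a normalization step would need to address.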

What is the difference between GISTIC 2 and GISTIC 2 thresholded datasets?

Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of a chromosome or a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, and 2 represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.
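When working with the thresholded values in your own scripts, a simple lookup keeps the meaning of each value explicit. This is a small sketch using the five values and descriptions listed above; the function names are hypothetical.

```python
from collections import Counter

# Meaning of each GISTIC 2 thresholded value, as described above.
GISTIC2_THRESHOLDED_LABELS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}


def describe_call(value):
    """Translate one thresholded value (int or whole-number float) into its label."""
    return GISTIC2_THRESHOLDED_LABELS[int(value)]


def summarize_calls(values):
    """Count how many samples fall into each copy number category for one gene."""
    return Counter(describe_call(v) for v in values)


print(describe_call(-2))  # homozygous deletion
print(summarize_calls([0, 1, 0, -2])["diploid normal copy"])  # 2
```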

Where is the transcript-level expression data?

As of March 2019, our transcript-level data is in the TCGA Pan-Cancer cohort. From there, choose 'Advanced' and select any of the transcript-level expression datasets. Enter your transcript of interest as an Ensembl transcript identifier (not a gene).
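If you are preparing a list of identifiers to enter, a quick check can catch gene identifiers or gene names that were included by mistake. The sketch below assumes human Ensembl stable IDs, where transcripts start with ENST and genes with ENSG, followed by 11 digits and an optional version suffix. Whether a given dataset expects the version suffix is not covered here, and the identifiers shown only illustrate the format.

```python
import re

# Human Ensembl stable ID patterns: prefix, 11 digits, optional ".version".
TRANSCRIPT_ID = re.compile(r"^ENST\d{11}(\.\d+)?$")
GENE_ID = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def classify_identifier(identifier):
    """Return 'transcript', 'gene', or 'other' for a candidate identifier."""
    if TRANSCRIPT_ID.match(identifier):
        return "transcript"
    if GENE_ID.match(identifier):
        return "gene"
    return "other"


for candidate in ["ENST00000269305", "ENSG00000141510", "TP53"]:
    print(candidate, "->", classify_identifier(candidate))
# ENST00000269305 -> transcript
# ENSG00000141510 -> gene
# TP53 -> other
```

Only identifiers classified as "transcript" are suitable for the transcript-level datasets.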