GDC data portal (https://portal.gdc.cancer.gov/repository) for the GDC Hub data
GDC legacy archive (https://portal.gdc.cancer.gov/legacy-archive) for the TCGA Hub data
ICGC data portal (https://dcc.icgc.org/) for the ICGC Hub data
Pan Cancer Atlas publications’ data site (https://gdc.cancer.gov/node/905/) for the Pan-Cancer Atlas Hub data
TCGA ATAC-seq publication’s data site (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG) for the ATAC-seq Hub data
Nature Biotechnology publication (https://doi.org/10.1038/nbt.3772) for the UCSC Toil RNAseq Recompute Hub data
Various journal publications for the UCSC Public Hub data
Use this dataset to compare across multiple or all TCGA cohorts. It was generated by the TCGA PanCan Atlas project and has been normalized for batch effects. Please see the PanCan Atlas paper in Cell for more information. https://xenabrowser.net/datapages/?dataset=EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net
This data was generated by the GDC and can also be used to compare across TCGA cohorts. It may not have as many batch effects removed as the PanCan Atlas data.
The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. This data is best used to compare TCGA cohorts to TARGET or GTEx cohorts.
For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.
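As a minimal sketch of the log2(x+1) transform described above, the following converts between RSEM values and the stored values. The function names are hypothetical and are not part of Xena.

```python
import math


def rsem_to_log2(rsem):
    """Return log2(x + 1) for a non-negative RSEM value x."""
    if rsem < 0:
        raise ValueError("RSEM values are expected to be non-negative")
    return math.log2(rsem + 1)


def log2_to_rsem(value):
    """Invert the transform: recover x from log2(x + 1)."""
    return 2 ** value - 1


if __name__ == "__main__":
    for rsem in (0, 1, 1023):
        print(rsem, rsem_to_log2(rsem))
```

The +1 offset keeps an RSEM value of 0 defined, since it maps to 0 rather than to negative infinity.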
For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by first combining the "gene expression RNAseq" values (above) of all TCGA cohorts and then mean-normalizing all values per gene. The combined data was then divided back into the 30-40 cancer types after normalization, so that it is available for each cancer type. Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
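The combine, normalize, and split steps above can be sketched as follows. This is an illustration only, not the UCSC pipeline: it assumes that "mean normalizing" means subtracting each gene's mean over all pooled samples, and the function name and nested-dictionary layout are hypothetical.

```python
from statistics import mean


def pancan_mean_normalize(cohorts):
    """Mean-center each gene across all cohorts, keeping the cohort split.

    cohorts: {cohort_name: {gene: {sample_id: log2(x+1) value}}}
    """
    # Step 1: pool every gene's values across all cohorts.
    pooled = {}
    for genes in cohorts.values():
        for gene, samples in genes.items():
            pooled.setdefault(gene, []).extend(samples.values())

    # Step 2: compute one mean per gene over the pooled samples.
    # ASSUMPTION: normalization = subtracting this per-gene mean.
    gene_means = {gene: mean(values) for gene, values in pooled.items()}

    # Step 3: subtract the mean and return the data split by cohort again.
    return {
        cohort: {
            gene: {s: v - gene_means[gene] for s, v in samples.items()}
            for gene, samples in genes.items()
        }
        for cohort, genes in cohorts.items()
    }


if __name__ == "__main__":
    toy = {
        "BRCA": {"TP53": {"s1": 10.0, "s2": 12.0}},
        "LUAD": {"TP53": {"s3": 8.0}},
    }
    print(pancan_mean_normalize(toy))
```

Under this assumption, a positive value means the gene is expressed above its average across all pooled TCGA samples, and a negative value means it is below that average.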
For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking RSEM values per sample. The values are percentile ranks ranging from 0 to 100; lower values represent lower expression. You can also combine the TCGA RNAseq data with your own RNAseq data, normalize across the combined dataset using whatever method you choose, and then analyze the combined dataset further.
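To apply a comparable per-sample percentile ranking to your own data, a sketch like the one below may help. The exact ranking rule used at UCSC is not stated here, so the tie handling (average rank) and the scaling of ranks to 0-100 are assumptions, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values):
    """Rank one sample's expression values on a 0 to 100 scale.

    ASSUMPTIONS: ties share their average rank, and ranks are scaled so
    the lowest value maps to 0 and the highest to 100.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    ordered = sorted(values)
    ranks = []
    for v in values:
        lower = bisect_left(ordered, v)   # count of values strictly below v
        upper = bisect_right(ordered, v)  # count of values at or below v
        average_rank = (lower + upper - 1) / 2  # zero-based, averaged over ties
        ranks.append(100.0 * average_rank / (n - 1))
    return ranks


if __name__ == "__main__":
    # One sample with three genes, in the same gene order as the input.
    print(percentile_ranks([5.0, 0.0, 10.0]))
```

Because the ranking is done within each sample, the resulting values depend only on the ordering of genes in that sample, not on the scale of the original measurements.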
The TCGA RPPA data are generated at MD Anderson. The RPPA values are generated using the method described at http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RPPA values from the TCGA DCC.
The RPPA_RBN data are normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. For more information: http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RBN values from Synapse at https://www.synapse.org/#!Synapse:syn1750330.
The methylation 450k dataset has 90% of the probes from the 27k dataset. However, we have found that the range of values differs slightly between the two datasets. As such, we recommend applying some form of normalization, and looking in the literature to see what methods others have used.
Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of a chromosome or of a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, and 2 represent homozygous deletion, single-copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.
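When working with the thresholded values in your own scripts, a small lookup table such as the sketch below can make the five calls easier to read. The labels come from the description above; the names GISTIC_CALLS, describe_call, and summarize_calls are hypothetical and are not part of GISTIC 2 or Xena.

```python
from collections import Counter

# Meaning of each GISTIC 2 thresholded value, as described in the text above.
GISTIC_CALLS = {
    -2: "homozygous deletion",
    -1: "single-copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}


def describe_call(value):
    """Translate one thresholded value into its label."""
    key = int(value)
    if key != value or key not in GISTIC_CALLS:
        raise ValueError(f"not a GISTIC 2 thresholded value: {value!r}")
    return GISTIC_CALLS[key]


def summarize_calls(values):
    """Count how often each call occurs, e.g. for one gene across samples."""
    return Counter(describe_call(v) for v in values)


if __name__ == "__main__":
    print(summarize_calls([0, 0, -1, 2, 0, -2]))
```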
As of March 2019, our transcript-level data is in the TCGA Pan-Cancer cohort. From there, choose 'Advanced' and select any of the transcript-level expression datasets. Enter your transcript of interest as an Ensembl identifier (not a gene).