The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, describing tumor tissue and matched normal tissues from more than 11,000 patients, is publicly available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
TCGA is our most used data resource. We host several versions of the TCGA data.
TCGA Pan-Cancer Atlas: As its concluding project, The Cancer Genome Atlas (TCGA) Research Network completed the most comprehensive cross-cancer analysis to date: The Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.
TCGA data from Genomic Data Commons: TCGA data uniformly re-analyzed at the GDC using the latest Human Genome Assembly, hg38. We download all open-access tier data from the GDC and compile individual files into datasets organized by cohort (33 individual tumor cohorts as well as a Pancan cohort). Xena displays the compiled datasets.
TCGA data in the UCSC RNA-seq Recompute Compendium: TCGA data have been co-analyzed with GTEx data using the UCSC bioinformatic pipeline (TOIL RNA-seq) and can be used to compare tumor vs normal gene and transcript expression from the matching tissue of origin. Xena hosts gene and transcript expression results of the UCSC RNA-seq recompute compendium.
Legacy TCGA data: Data generated and published by the TCGA Research Network before the Pan-Cancer Atlas publications. Xena displays the level-3 data.
This paper helps clarify the differences between the Legacy TCGA data and the TCGA data on the GDC:
Information on Xena data from GDC release v41.0
This help page is for the Genomic Data Commons (GDC) data we host from . We display all GDC open access genomic data and its accompanying phenotype/clinical data. Explore the .
In addition to the data from the GDC, we added two new phenotype/clinical fields to all GDC cohorts: age_at_earliest_diagnosis.diagnoses.xena_derived and age_at_earliest_diagnosis_in_years.diagnoses.xena_derived. We did this because some GDC cohorts had multiple diagnoses, each with its own age_at_diagnosis.diagnoses value. When there were multiple ages, the Xena Visual Spreadsheet would display these fields as a category. To have a field that can always be displayed as a continuous feature, we created the age_at_earliest_diagnosis.diagnoses.xena_derived field, which holds the smallest value when there are multiple entries. age_at_earliest_diagnosis_in_years.diagnoses.xena_derived was created similarly, but the number of days is also divided by 365.
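The derivation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code Xena actually runs: the function names, the sample case identifiers, and the handling of missing ages (skipped, with None returned when no age is known) are all assumptions made for the example. Only the "take the smallest value" and "divide days by 365" rules come from the text.

```python
from typing import Dict, List, Optional, Sequence, Tuple

DAYS_PER_YEAR = 365  # the text states that days are divided by 365


def age_at_earliest_diagnosis(ages_in_days: Sequence[Optional[float]]) -> Optional[float]:
    """Return the smallest known age in days, or None if no age is known.

    Skipping missing values is an assumption of this sketch.
    """
    known = [age for age in ages_in_days if age is not None]
    return min(known) if known else None


def age_at_earliest_diagnosis_in_years(ages_in_days: Sequence[Optional[float]]) -> Optional[float]:
    """Same as above, converted from days to years."""
    earliest = age_at_earliest_diagnosis(ages_in_days)
    return None if earliest is None else earliest / DAYS_PER_YEAR


# Hypothetical cases: each maps to the ages (in days) of its diagnoses.
diagnoses: Dict[str, List[Optional[float]]] = {
    "case_A": [23725, 21900],  # two diagnoses, so two ages
    "case_B": [18250],         # a single diagnosis
    "case_C": [None],          # age not recorded
}

derived: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    case: (age_at_earliest_diagnosis(ages), age_at_earliest_diagnosis_in_years(ages))
    for case, ages in diagnoses.items()
}

for case, (days, years) in derived.items():
    print(case, days, years)
```

Because each case ends up with exactly one number, the derived field can be treated as continuous even when the underlying diagnoses list has several entries.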
For this release, we worked to exclude samples that have only phenotype/clinical data and no genomic data. This should make visualizing data in our Visual Spreadsheet easier.
You can still view data from the older . This data will be available until October 2025. After October 2025 the data from this release will only be available for download.
For the cohort, we noted that samples were occasionally pooled into the same aliquot before sequencing was performed. Xena's visualizations work at the sample level, so for these pooled aliquots several samples carry duplicate data. One example is case , where samples C3N-03011-04, C3N-03011-02, and C3N-03011-01 were all pooled into the aliquot CPT0226250007 before sequencing was performed.
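If you work with downloaded data and want to know which samples share an aliquot (and therefore carry identical genomic values), a grouping step like the following may help. It is a minimal sketch: the sample-to-aliquot mapping is built by hand here, the function name is made up for this example, and the fourth entry is a hypothetical unpooled sample added for contrast. How you obtain the mapping from your own files is not covered.

```python
from collections import defaultdict
from typing import Dict, List

# Sample -> aliquot mapping. The first three entries are the example from
# the text; the last one is a hypothetical, unpooled sample.
sample_to_aliquot: Dict[str, str] = {
    "C3N-03011-04": "CPT0226250007",
    "C3N-03011-02": "CPT0226250007",
    "C3N-03011-01": "CPT0226250007",
    "hypothetical-sample-01": "hypothetical-aliquot-01",
}


def find_pooled_aliquots(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Return aliquots shared by more than one sample, with their samples sorted."""
    samples_by_aliquot: Dict[str, List[str]] = defaultdict(list)
    for sample, aliquot in mapping.items():
        samples_by_aliquot[aliquot].append(sample)
    return {
        aliquot: sorted(samples)
        for aliquot, samples in samples_by_aliquot.items()
        if len(samples) > 1
    }


pooled = find_pooled_aliquots(sample_to_aliquot)
for aliquot, samples in pooled.items():
    print(f"{aliquot}: {', '.join(samples)}")
```

Keeping one sample per pooled aliquot before running statistics avoids counting the same measurement several times.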
TCGA, TARGET, and GTEx RNA-seq data are uniformly re-aligned to the hg38 genome and re-processed using the RSEM and Kallisto methods with GENCODE v23 annotations to generate expression estimates for ~60,000 genes and ~200,000 transcripts, including many lncRNAs. Xena hosts and displays gene and transcript expression results of this analysis.
The goal of the International Cancer Genome Consortium (ICGC) is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. It includes TCGA data (U.S.A.) plus data contributed by groups from other countries in the International Cancer Genome Consortium. The resource has publicly accessible non-coding somatic mutation data from non-TCGA samples.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,600 cancer whole genomes from the International Cancer Genome Consortium. Building upon previous work which examined cancer coding regions, this project explored the nature and consequences of somatic and germline variations in both coding and non-coding regions, with specific emphasis on cis-regulatory sites, non-coding RNAs, and large-scale structural alterations.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and many more.
Xena displays gene expression data from the metastatic cancer study published in
Cancer Cell Line Encyclopedia. Detailed genetic and pharmacologic characterization of a large panel (~1100) of human cancer cell lines.
We have a number of sources of pediatric data:
The goal of the Treehouse Childhood Cancer Initiative (Treehouse) is to evaluate the utility of comparative gene expression analysis for difficult-to-treat pediatric cancer patients. With close to 2,000 pediatric tumor samples, Treehouse has assembled a large collection of pediatric cancer RNA-Seq data, which, added to adult data, results in a compendium of over 11,000 adult and pediatric tumor-derived gene expression profiles. Pediatric cancer expression data are from public repository samples and from clinical samples at partner institutions, including UC San Francisco, Stanford, Children’s Hospital of Orange County and British Columbia Cancer Agency. In line with the UC Santa Cruz Genomics Institute’s commitment to sharing data and to furthering research everywhere, we have made these data available for all to download and use.
The goal of the Gabriella Miller Kids First Pediatric Research Program (Kids First) is to develop to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders. Over 2015-2018, the program selected 26 patient cohorts for whole genome sequencing through a peer-review process.
TARGET data is intended exclusively for biomedical research using pediatric data (i.e., the research objectives cannot be accomplished using data from adults) that focus on the development of more effective treatments, diagnostic tests, or prognostic markers for childhood cancers. Moreover, TARGET data can be used for research relevant to the biology, causes, treatment and late complications of treatment of pediatric cancers, but is not intended for the sole purposes of methods and/or tool development (please see section of the OCG website). If you are interested in using TARGET data for publication or other research purposes, you must follow the .
Don't see a study or dataset that you are interested in? for yourself or your group with the data you need.
We support a wide variety of data types including:
SNPs and small INDELs
Large structural variants
Segmented copy number, gene-level copy number
Gene-, Transcript-, Exon-, Protein-, LncRNA-, and miRNA-expression
DNA methylation (genes and probes)
Phenotype, clinical data
Signature scores, classifications, derived parameters
The types of data in each study vary considerably and depend on the analyses that particular study performed.
If you need a particular type of data, please see choosing a study/cohort to help you find the study with that type of data.
We recommend the TCGA Pan-Cancer (PANCAN) study for most analyses, unless you need a specific type of data or need to run a type of analysis listed below.
Why do we recommend this study?
We recommend it because it has the data from the Cancer Genome Atlas (TCGA) Research Network, which generated the most comprehensive cross-cancer analysis to date: The Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.
Note that if you use the TCGA Pan-Cancer (PANCAN) study to examine a specific cancer type, you will need to filter down to just that cancer type.
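If you are working with a downloaded pan-cancer phenotype table rather than the browser, the same filtering can be done in a few lines of Python. This is only a sketch under stated assumptions: the column name cancer_type, the sample identifiers, and the in-memory TSV are all hypothetical, so substitute the column that holds the cancer type in the file you actually downloaded.

```python
import csv
import io
from typing import Dict, List

# Hypothetical tab-separated phenotype table; real files will have
# different sample IDs and possibly a different cancer-type column name.
phenotype_tsv = (
    "sample\tcancer_type\n"
    "hypothetical-01\tBRCA\n"
    "hypothetical-02\tLUAD\n"
    "hypothetical-03\tBRCA\n"
)


def filter_by_cancer_type(
    tsv_text: str, cancer_type: str, column: str = "cancer_type"
) -> List[Dict[str, str]]:
    """Return only the rows whose `column` equals `cancer_type`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[column] == cancer_type]


brca_rows = filter_by_cancer_type(phenotype_tsv, "BRCA")
brca_samples = [row["sample"] for row in brca_rows]
print(brca_samples)
```

The resulting list of sample IDs can then be used to subset any genomic matrix from the same study.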
If you don't want to filter ...
Our second most recommended datasets are the cancer-specific GDC TCGA studies. These avoid the need to filter down to a single cancer type and contain harmonized data from the Genomic Data Commons.
More information comparing the data in the GDC to the legacy TCGA data can be found here:
The table below assumes that you are interested in TCGA data. These data types may also appear in other studies, but these are the recommended studies.
| Data type | Study | Dataset name | Menu |
| --- | --- | --- | --- |
| Transcript expression | TCGA Pan-Cancer (PANCAN) | TOIL Transcript expression | Advanced |
| lncRNA expression | TCGA Pan-Cancer (PANCAN) | TOIL Gene expression | Advanced |
| Exon expression | legacy TCGA datasets (per cancer type) | Exon expression | Advanced |
| miRNA expression | TCGA Pan-Cancer (PANCAN) | Batch Effects normalized miRNA data | Advanced |
| DNA methylation | Any | DNA methylation | Advanced |
| ATAC-seq | GDC Pan-Cancer (PANCAN) | ATAC-seq | Advanced |
| Varied Survival endpoints | TCGA Pan-Cancer (PANCAN) | NA (run KM plot) | -- |
| Analysis | Study |
| --- | --- |
| Compare Tumor vs Normal | TCGA, TARGET, GTEx |
| GRCh38 coordinates | Any GDC study |
| Cell Line | CCLE |
| Disease specific survival, disease free survival, progression free survival | TCGA Pan-Cancer (PANCAN) |