Types of data we have

We support a wide variety of data types including:

  • SNPs and small INDELs

  • Large structural variants

  • Segmented copy number, gene-level copy number

  • Gene-, Transcript-, Exon-, Protein-, LncRNA-, and miRNA-expression

  • DNA methylation (genes and probes)

  • Phenotype, clinical data

  • Signature scores, classifications, derived parameters

The types of data in each study vary considerably and depend on the analyses that particular study performed.

If you need a particular type of data, please see Choosing a study/cohort to help you find a study with that type of data.

TCGA

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, describing tumor tissue and matched normal tissues from more than 11,000 patients, is publicly available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.

TCGA is our most used data resource. We host several versions of the TCGA data.

  • TCGA Pan-Cancer Atlas: As its concluding project, The Cancer Genome Atlas (TCGA) Research Network completed the most comprehensive cross-cancer analysis to date, the Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.

Choosing a study/cohort

General recommendations

We recommend the TCGA Pan-Cancer (PANCAN) study for most analyses, unless you need a specific type of data or need to run one of the types of analysis listed below.

More studies

TCGA, TARGET, and GTEx RNA-seq data are uniformly realigned to the hg38 genome and reprocessed using the RSEM and Kallisto methods with GENCODE v23 annotations to generate expression estimates for ~60,000 genes and ~200,000 transcripts, including many lncRNAs. Xena hosts and displays the gene and transcript expression results of this analysis.

ICGC

The goal of the International Cancer Genome Consortium (ICGC) is to obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes that are of clinical and societal importance across the globe. It includes TCGA data (U.S.A.) plus data contributed by groups from other countries in the International Cancer Genome Consortium. The resource has publicly accessible non-coding somatic mutation data from non-TCGA samples.

PCAWG

The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,600 cancer whole genomes from the International Cancer Genome Consortium. Building upon previous work which examined cancer coding regions, this project explored the nature and consequences of somatic and germline variations in both coding and non-coding regions, with specific emphasis on cis-regulatory sites, non-coding RNAs, and large-scale structural alterations.

GDC

The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and many more.

MET500

Xena displays gene expression data from the metastatic cancer study published in Robinson et al. 2017, "Integrative clinical genomics of metastatic cancer."

CCLE

The Cancer Cell Line Encyclopedia (CCLE) provides detailed genetic and pharmacologic characterization of a large panel (~1,100) of human cancer cell lines.

Pediatric data

We have a number of sources of pediatric data:

KidsFirst

The goal of the Gabriella Miller Kids First Pediatric Research Program (Kids First) is to develop a large-scale data resource to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders. From 2015 to 2018, the program selected 26 patient cohorts for whole genome sequencing through a peer-review process.

TARGET

TARGET data is intended exclusively for biomedical research using pediatric data (i.e., the research objectives cannot be accomplished using data from adults) that focuses on the development of more effective treatments, diagnostic tests, or prognostic markers for childhood cancers. Moreover, TARGET data can be used for research relevant to the biology, causes, treatment, and late complications of treatment of pediatric cancers, but is not intended for the sole purpose of methods and/or tool development (please see the Using TARGET Data section of the OCG website). If you are interested in using TARGET data for publication or other research purposes, you must follow the TARGET Publication Guidelines.

Treehouse Consortium

The goal of the Treehouse Childhood Cancer Initiative (Treehouse) is to evaluate the utility of comparative gene expression analysis for difficult-to-treat pediatric cancer patients. With close to 2,000 pediatric tumor samples, Treehouse has assembled a large collection of pediatric cancer RNA-Seq data which, combined with adult data, forms a compendium of over 11,000 adult and pediatric tumor-derived gene expression profiles. Pediatric cancer expression data come from public repository samples and from clinical samples at partner institutions, including UC San Francisco, Stanford, Children’s Hospital of Orange County, and British Columbia Cancer Agency. In line with the UC Santa Cruz Genomics Institute’s commitment to sharing data and to furthering research everywhere, we have made this data available for all to download and use.

Interested in a dataset we don't have?

Don't see a study or dataset that you are interested in? Set up a hub for yourself or your group with the data you need.

UCSC RNA-seq recompute compendium

  • TCGA data from Genomic Data Commons: TCGA data uniformly reanalyzed at the GDC using the latest human genome assembly, hg38. We download all open-access tier data from the GDC and compile individual files into datasets organized by cohort (33 individual tumor cohorts as well as a Pan-Cancer cohort). Xena displays the compiled datasets.

  • TCGA data in the UCSC RNA-seq Recompute Compendium: TCGA data has been co-analyzed with GTEx data using the UCSC bioinformatic pipeline (TOIL RNA-seq) and can be used to compare tumor vs normal gene and transcript expression from the matching tissue of origin. Xena hosts gene and transcript expression results of the UCSC RNA-seq recompute compendium.

  • Legacy TCGA data: Data generated and published by the TCGA Research Network before the Pan-Cancer Atlas publications. Xena displays the level-3 data.

    Please see our help page on how to choose between these different versions of the TCGA data.

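    This paper helps clarify the differences between the legacy TCGA data and the TCGA data on the GDC: "Before and After: A Comparison of Legacy and Harmonized TCGA Data at the Genomic Data Commons" (NCI Genomic Data Commons, gdc.cancer.gov).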

    TCGA Pan-Cancer (PANCAN) study

    Why do we recommend this study?

    We recommend it because it has the data from the Cancer Genome Atlas (TCGA) Research Network, which generated the most comprehensive cross-cancer analysis to date: The Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.

    Note that if you use the TCGA Pan-Cancer (PANCAN) study to examine a specific cancer type, you will need to filter the samples down to just that cancer type.
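If you work with downloaded data rather than in the browser, the same filtering can be done in a few lines. The following is a minimal sketch, not Xena's own implementation: the field name `cancer_type`, the sample IDs, and the helper `filter_to_cancer_type` are all hypothetical stand-ins for whatever your downloaded phenotype table actually contains.

```python
# Toy stand-in for a pan-cancer phenotype table: sample ID -> phenotype fields.
# The field name "cancer_type" and the sample IDs are hypothetical.
phenotype = {
    "S1": {"cancer_type": "BRCA"},
    "S2": {"cancer_type": "LUAD"},
    "S3": {"cancer_type": "BRCA"},
    "S4": {"cancer_type": "GBM"},
}

# Toy expression data: sample ID -> {gene: value}.
expression = {
    "S1": {"GENE_A": 1.0, "GENE_B": 2.0},
    "S2": {"GENE_A": 3.0, "GENE_B": 4.0},
    "S3": {"GENE_A": 5.0, "GENE_B": 6.0},
    "S4": {"GENE_A": 7.0, "GENE_B": 8.0},
}


def filter_to_cancer_type(expression, phenotype, cancer_type, field="cancer_type"):
    """Keep only the samples whose phenotype record matches the cancer type.

    Samples with no phenotype record are dropped rather than raising an error.
    """
    return {
        sample: values
        for sample, values in expression.items()
        if phenotype.get(sample, {}).get(field) == cancer_type
    }


brca_only = filter_to_cancer_type(expression, phenotype, "BRCA")
print(sorted(brca_only))
```

Filtering on the phenotype table first and then subsetting the genomic data keeps the two in sync, which matters once you join several data types for the same samples.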

    If you don't want to filter ...

    Our second most recommended datasets are the cancer-specific GDC TCGA studies. These avoid the need to filter down to a single cancer type and contain harmonized data from the Genomic Data Commons.

    GDC Data Hub

    Differences between the GDC and the legacy TCGA data

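    More information comparing the data in the GDC to the legacy TCGA data can be found in "Before and After: A Comparison of Legacy and Harmonized TCGA Data at the Genomic Data Commons" (NCI Genomic Data Commons, gdc.cancer.gov).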

    Choosing a study by type of data

    The table below assumes that you are interested in TCGA data. These data types may also appear in other studies, but these are the recommended studies.

    | Data type | Study | Dataset name | Menu |
    | --- | --- | --- | --- |
    | Transcript expression | TCGA Pan-Cancer (PANCAN) | TOIL Transcript expression | Advanced |
    | lncRNA expression | TCGA Pan-Cancer (PANCAN) | TOIL Gene expression | Advanced |
    | Exon expression | legacy TCGA datasets (per cancer type) | Exon expression | Advanced |
    | miRNA expression | TCGA Pan-Cancer (PANCAN) | Batch Effects normalized miRNA data | Advanced |
    | DNA methylation | Any | DNA methylation | Advanced |
    | ATAC-seq | GDC Pan-Cancer (PANCAN) | ATAC-seq | Advanced |
    | Varied Survival endpoints | TCGA Pan-Cancer (PANCAN) | NA (run KM plot) | -- |

    Choosing a study based on a specific analysis or sample type

    | Analysis | Study |
    | --- | --- |
    | Compare Tumor vs Normal | TCGA, TARGET, GTEx |
    | GRCh38 coordinates | Any GDC study |
    | Cell Line | CCLE |
    | Disease specific survival, disease free survival, progression free survival | TCGA Pan-Cancer (PANCAN) |
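For the "Compare Tumor vs Normal" analysis, the TCGA, TARGET, GTEx study puts tumor and normal samples through one pipeline so that their expression values are comparable. As an illustration only, the sketch below computes a difference in mean expression for one gene between the two groups. The sample IDs, the values, the labels `"tumor"` and `"normal"`, and the helper `tumor_minus_normal` are hypothetical, and the values are assumed to already be on a log scale.

```python
from statistics import mean

# Hypothetical expression values for one gene, assumed to be log-scaled.
expression = {"T1": 8.0, "T2": 9.0, "N1": 5.0, "N2": 6.0}

# Hypothetical sample-type labels; real phenotype fields may use other terms.
sample_type = {"T1": "tumor", "T2": "tumor", "N1": "normal", "N2": "normal"}


def tumor_minus_normal(expression, sample_type):
    """Return mean(tumor) - mean(normal) for one gene's expression values."""
    tumor = [v for s, v in expression.items() if sample_type.get(s) == "tumor"]
    normal = [v for s, v in expression.items() if sample_type.get(s) == "normal"]
    if not tumor or not normal:
        raise ValueError("need at least one tumor and one normal sample")
    return mean(tumor) - mean(normal)


difference = tumor_minus_normal(expression, sample_type)
print(difference)
```

A difference of means is only a first look; a real comparison would restrict both groups to the matching tissue of origin and apply a statistical test.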

    GDC

    Information on Xena data from GDC release v41.0

    This help page covers the Genomic Data Commons (GDC) data we host from GDC Data Release 41.0 (August 28, 2024). We display all GDC open-access genomic data and the accompanying phenotype/clinical data. Explore the GDC data on Xena.

    In addition to the data from the GDC, we added two new phenotype/clinical fields to all GDC cohorts: age_at_earliest_diagnosis.diagnoses.xena_derived and age_at_earliest_diagnosis_in_years.diagnoses.xena_derived. We did this because some GDC cohorts have multiple diagnoses, each with its own age_at_diagnosis.diagnoses value. When there are multiple ages, the Xena Visual Spreadsheet displays the field as a category. To provide a field that can always be displayed as a continuous feature, we created age_at_earliest_diagnosis.diagnoses.xena_derived, which holds the smallest value when there are multiple entries. age_at_earliest_diagnosis_in_years.diagnoses.xena_derived is created the same way, with the number of days then divided by 365.
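The derivation described above can be sketched as follows. This is an illustration of the stated rule (take the smallest age, then divide days by 365), not the code Xena actually runs. The function name, the list-of-days input layout, and the handling of missing values are assumptions.

```python
def age_at_earliest_diagnosis(ages_in_days):
    """Return (days, years) for the earliest diagnosis.

    ages_in_days: one age_at_diagnosis value (in days) per diagnosis.
    Missing values are given as None and skipped (an assumption).
    Returns (None, None) when no age is available.
    """
    known = [age for age in ages_in_days if age is not None]
    if not known:
        return None, None
    earliest = min(known)
    # The years field divides the number of days by 365, as described above.
    return earliest, earliest / 365


# A case with two diagnoses: the earlier one (3650 days) is kept.
days, years = age_at_earliest_diagnosis([7300, 3650])
print(days, years)
```

Because the result is always a single number per record, the field can be treated as continuous even when the underlying record lists several diagnoses.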

    For this release, we worked to exclude samples that have only phenotype/clinical data and no genomic data. This should make visualizing data in our Visual Spreadsheet easier.

    You can still view data from the older GDC Data Release v18.0 (August 28, 2019). This data will be available for viewing until October 2025; after that, the data from that release will only be available for download.

    CPTAC-3

    For the CPTAC-3 cohort, we noted that samples were occasionally pooled into the same aliquot before sequencing. Xena's visualizations are sample-level, so for these pooled aliquots several samples have duplicate data. For example, for case C3N-03011, samples C3N-03011-04, C3N-03011-02, and C3N-03011-01 were all pooled into aliquot CPT0226250007 before sequencing.
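If duplicate data from pooled aliquots would skew your analysis, you can look for samples that share an aliquot before counting them as independent. The sketch below assumes you have built a sample-to-aliquot mapping yourself; the mapping layout, the helper `pooled_aliquots`, and the `SAMPLE-X` / `ALIQUOT-Y` entry are hypothetical, while the C3N-03011 entries come from the example above.

```python
from collections import defaultdict

# Sample ID -> aliquot ID. The first three entries are the pooled samples
# from the example above; the last entry is a hypothetical unpooled sample.
sample_to_aliquot = {
    "C3N-03011-01": "CPT0226250007",
    "C3N-03011-02": "CPT0226250007",
    "C3N-03011-04": "CPT0226250007",
    "SAMPLE-X": "ALIQUOT-Y",
}


def pooled_aliquots(sample_to_aliquot):
    """Return {aliquot: sorted sample IDs} for aliquots shared by 2+ samples."""
    by_aliquot = defaultdict(list)
    for sample, aliquot in sample_to_aliquot.items():
        by_aliquot[aliquot].append(sample)
    return {
        aliquot: sorted(samples)
        for aliquot, samples in by_aliquot.items()
        if len(samples) > 1
    }


pooled = pooled_aliquots(sample_to_aliquot)
print(pooled)
```

Keeping one sample per pooled aliquot is one way to avoid counting the same sequencing data several times.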

    GDC Data Release v18.0 - August 28, 2019
    Before and After: A Comparison of Legacy and Harmonized TCGA Data at the Genomic Data Commons | NCI Genomic Data Commons (gdc.cancer.gov)

    Overview of public data