Data and datasets

Last updated 7 months ago

What are the Source Repositories Xena pulls from?

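  • GDC data portal (https://portal.gdc.cancer.gov/repository) for the GDC Hub data

  • GDC legacy archive (https://portal.gdc.cancer.gov/legacy-archive) for the TCGA Hub data

  • ICGC data portal (https://dcc.icgc.org/) for the ICGC Hub data

  • Pan Cancer Atlas publications’ data site (https://gdc.cancer.gov/node/905/) for the Pan-Cancer Atlas Hub data

  • TCGA ATAC-seq publication’s data site (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG) for the ATAC-seq Hub data

  • Nature Biotechnology publication (https://doi.org/10.1038/nbt.3772) for the UCSC Toil RNAseq Recompute Hub data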

  • Various journal publications for UCSC Public Hub data

For TCGA, which gene expression RNAseq dataset should I use for my analysis?

TCGA Pan-Cancer Atlas gene expression

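Use this dataset for comparisons across multiple or all TCGA cohorts. It was generated by the TCGA PanCan Atlas project and has been normalized for batch effects. Please see the PanCan Atlas paper in Cell for more information. The dataset page is at https://xenabrowser.net/datapages/?dataset=EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net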

GDC STAR gene expression

Generated by the GDC, this data can also be used to compare across TCGA cohorts. It may not have had as many batch effects removed as the PanCan Atlas data.

Toil RSEM gene expression

The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. This dataset is best used to compare TCGA cohorts to TARGET or GTEx cohorts.

TCGA Gene expression RNAseq (IlluminaHiSeq)

For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.
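As a small illustration of the log2(x+1) scale mentioned above, the following Python sketch converts an RSEM value to that scale and back. The function names are hypothetical and are not part of any Xena tool; this only shows the arithmetic.

```python
import math

def to_log2_plus_one(rsem):
    """Transform a non-negative RSEM value x to log2(x + 1)."""
    if rsem < 0:
        raise ValueError("RSEM values are expected to be non-negative")
    return math.log2(rsem + 1)

def from_log2_plus_one(value):
    """Invert the transform: recover x from y = log2(x + 1)."""
    return 2 ** value - 1

y = to_log2_plus_one(1023)   # log2(1024)
x = from_log2_plus_one(y)    # back to the original RSEM value
print(y, x)
```

Note that an RSEM value of 0 maps to 0 on this scale, which is why 1 is added before taking the log.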

TCGA Gene expression RNAseq (IlluminaHiSeq pancan normalized)

For questions regarding the gene expression of a particular cohort in relation to other tumor types, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by first combining the "gene expression RNAseq" values (above) of all TCGA cohorts and then mean normalizing all values per gene. The normalized data was then split back into the 30-40 individual cancer types, so that it is available for each cancer type. Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
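The combine, normalize, and split steps described above might look roughly like the following Python sketch. It assumes that "mean normalizing per gene" means subtracting each gene's mean across all combined samples; the text above does not spell out the exact procedure, and the function names and data layout here are hypothetical.

```python
def mean_normalize_per_gene(expression):
    """expression: dict of gene -> dict of sample -> log2 value,
    with samples from all cohorts already combined.
    Assumption: normalization subtracts the per-gene mean."""
    normalized = {}
    for gene, by_sample in expression.items():
        if not by_sample:
            normalized[gene] = {}
            continue
        gene_mean = sum(by_sample.values()) / len(by_sample)
        normalized[gene] = {s: v - gene_mean for s, v in by_sample.items()}
    return normalized

def split_by_cohort(normalized, sample_to_cohort):
    """Split the normalized matrix back into one matrix per cohort."""
    cohorts = {}
    for gene, by_sample in normalized.items():
        for sample, value in by_sample.items():
            cohort = sample_to_cohort[sample]
            cohorts.setdefault(cohort, {}).setdefault(gene, {})[sample] = value
    return cohorts

combined = {"GENE_A": {"s1": 2.0, "s2": 4.0, "s3": 6.0}}
cohort_of = {"s1": "BRCA", "s2": "BRCA", "s3": "LUAD"}
per_cohort = split_by_cohort(mean_normalize_per_gene(combined), cohort_of)
print(per_cohort)
```

Under this assumption, a positive value means the gene is expressed above its average across all combined samples, and a negative value means below.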

TCGA Gene expression RNAseq (IlluminaHiSeq percentile)

For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking the RSEM values within each sample. The values are percentile ranks ranging from 0 to 100, where lower values represent lower expression. You can also combine the TCGA RNAseq data with your own RNAseq data, normalize the combined dataset using whatever method you choose, and then analyze the combined dataset further.
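If you want to put your own data on a comparable 0 to 100 scale, a per-sample percentile ranking could be sketched in Python as follows. The exact ranking and tie-handling method used at UCSC is not stated above, so the choices here (ties share their average rank, lowest value maps to 0, highest to 100) are assumptions, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right

def percentile_ranks(values):
    """Return a 0-100 percentile rank for each value in one sample.
    Assumption: ties share their average rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    ordered = sorted(values)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)         # values strictly smaller
        at_or_below = bisect_right(ordered, v)  # values smaller or equal
        average_rank = (below + at_or_below - 1) / 2
        ranks.append(100.0 * average_rank / (n - 1))
    return ranks

print(percentile_ranks([5, 1, 3]))  # [100.0, 0.0, 50.0]
```

Apply the function to each sample separately, since the ranking is done per sample rather than per gene.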

What is the difference between RPPA data and RPPA_RBN data?

The TCGA RPPA data were generated at MD Anderson. The RPPA dataset contains values generated using the method described at http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RPPA values from the TCGA DCC.

The RPPA_RBN dataset contains normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. For more information, see http://bioinformatics.mdanderson.org/main/TCPA:Overview. We downloaded the RBN values from Synapse at https://www.synapse.org/#!Synapse:syn1750330.

Can I combine data from the methylation 450k and 27k datasets?

The methylation 450k dataset has 90% of the probes from the 27k dataset. However, we have found that the range of values differs slightly between the two datasets. We therefore recommend applying some form of normalization before combining them, and looking in the literature to see what methods people have used.

What is the difference between GISTIC 2 and GISTIC 2 thresholded datasets?

Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e., the number of copies of a piece of a chromosome or of a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, and 2 represent homozygous deletion, single-copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the GISTIC 2 paper and at the Broad Institute, which is the group that processed this data.

Where is the transcript-level expression data?

As of March 2019, our transcript-level data is in the TCGA Pan-Cancer cohort. From there, choose 'Advanced' and select any of the transcript-level expression datasets. Enter your transcript of interest as an Ensembl identifier (not a gene).

How do I calculate fold change (FC)?

The following instructions assume that your data has been log transformed. All the RNAseq data in Xena public data hubs have already been log transformed, either by us or by the data providers. You can always confirm this by viewing the dataset details page (start at our Explore Data pages and drill down until you get to the details page for the dataset).

Log transformed means that the output values from the gene expression caller/program have been put through the following transformation:

log2(x + theta) = y

where x is the TPM, RSEM, etc. value, "theta" is a small value (1, 0.01, etc.) added to x because you cannot take the log of zero, "log2" is log base 2, and y is the transformed value.

When comparing these log transformed values, we use the quotient rule of logarithms:

log2(A/B) = log2(A) - log2(B)

Suppose that, in our downloads (either the bulk downloads or a slice of data that has not been mean normalized), you have two samples with expression values for a gene: one sample is 4 and the other is 1. Because our values are log transformed, this means

log2(A) = 4

log2(B) = 1

where A and B are the expression values with theta already added. Therefore:

log2(A/B) = 4 - 1

log2(A/B) = 3

This gives you a log2 fold change of 3, which corresponds to an 8-fold change (2^3 = 8) on the linear scale.

Please note that in this case we are reporting the log(fold change). Biologists often use the log(fold change) because, without taking the log, down-regulated genes would have values between 0 and 1, whereas up-regulated genes would have any value between 1 and infinity. This distribution makes graphing and further statistical analysis difficult. Taking the log typically makes the resulting values more normally distributed, which is better for further analysis.

Can I get access to the raw TOIL data?

Yes! We host it on AWS. Note that because the files are large, you will need to pay the egress fees to download them. To get started, first look through the manifests for TCGA (s3://cgl-rnaseq-recompute-toil/tcga-manifest), TARGET (s3://cgl-rnaseq-recompute-toil/target-manifest), and GTEx (s3://cgl-rnaseq-recompute-toil/gtex-manifest) and decide which files you want. Then, using your AWS account, download the files. Contact us if you run into any issues.

Example command to get the manifest:

aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga-manifest . --request-payer requester

Now you can take a look at the manifest to see the TCGA files.

Example command to download a single TCGA file:

aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga/0106d51d-d581-4be7-91f3-b2f0c84468d1.tar.gz . --request-payer requester
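Returning to the fold change question on this page, the arithmetic can be sketched in Python as follows. This is a minimal illustration that assumes both values are log2 transformed and have not been mean normalized; the function names are hypothetical and not part of any Xena tool.

```python
def log2_fold_change(log2_a, log2_b):
    """Quotient rule: log2(A/B) = log2(A) - log2(B)."""
    return log2_a - log2_b

def linear_fold_change(log2_fc):
    """Convert a log2 fold change back to a ratio on the linear scale."""
    return 2 ** log2_fc

log2_fc = log2_fold_change(4, 1)
print(log2_fc)                      # 3
print(linear_fold_change(log2_fc))  # 8
```

A negative log2 fold change indicates lower expression in the first sample; for example, a value of -1 corresponds to a ratio of 0.5.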
