Step-by-step instructions for our most common use cases
More details about all the features we have on Xena
Learn how to remove samples with no data, subgroup samples, and make Kaplan Meier plots
This tutorial is made for those who have never used Xena but who have completed Section 1 of the Basic Tutorial. We will cover how to filter to just the samples you are interested in, how to create subgroups, and how to run a Kaplan Meier survival analysis.
This tutorial assumes completion of the Basic Tutorial: Section 1. This tutorial begins where the Basic Tutorial: Section 1 ends.
Part A: 7 min
Part B: 15 min
Part C: 5 min
Part A
Search for samples of interest
Remove samples with no data
Part B
Make subgroups
Rename subgroups
Part C
Run a Kaplan Meier survival analysis
Use a custom time endpoint
In the Basic Tutorial: Section 1 we found that samples from patients that have aberrations in EGFR have relatively higher expression. These aberrations could be mutations or copy number amplifications.
Now we are going to look at whether those patients with aberrations in their samples also have a worse survival prognosis.
To ensure your columns are sorted the same as those in this tutorial, please start at this link
Our goal is to remove patient's samples with no data (i.e. null) from the view. This will make the view look cleaner and remove irrelevant samples from our Kaplan Meier survival analysis.
Type 'null' into the samples search bar. This will highlight samples that have 'null' values in any column on the screen. Null means that there is no data for that sample for that column.
Click the filter menu and select 'Remove samples'.
Delete the search term.
More information
Instead of typing 'null' and removing those samples from the view, you can also use the 'Remove samples with nulls' shortcut in the filter menu.
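As a rough illustration of what removing samples with nulls does, here is a minimal Python sketch. The sample names, column names, and values are hypothetical, and this is not Xena's own code; it only shows the idea of dropping any sample that is missing data in at least one column.

```python
# Hypothetical rows: one dict per sample. None stands for "no data" (null).
samples = [
    {"sample": "S1", "expression": 2.1, "copy_number": 0.7},
    {"sample": "S2", "expression": None, "copy_number": 0.1},
    {"sample": "S3", "expression": 1.4, "copy_number": -0.2},
]


def remove_samples_with_nulls(rows):
    """Keep only the rows where every column has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]


kept = remove_samples_with_nulls(samples)
print([row["sample"] for row in kept])
```

Only samples with data in every column remain, which is why the later Kaplan Meier analysis is not affected by samples that lack data.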
Our goal is to create two subgroups: patient samples with aberrations in EGFR and patient samples without aberrations in EGFR. We will then name the subgroups.
Type '(mis OR inframe) OR B:>0.5' into the samples search bar. This will select samples that either have a missense mutation or inframe deletion '(mis OR inframe)', or where copy number variation (column B) is greater than 0.5. Note that the cutoff of 0.5 was chosen arbitrarily.
You must have the copy number variation column as column B for the search term '(mis OR inframe) OR B:>0.5' to work. The 'B' in 'B:>0.5' is instructing Xena to search in column B for values that are greater than 0.5.
Click the filter menu and select 'New subgroup column'. This will create a new column that has samples that met our search term marked as 'true' (i.e. those that have an EGFR aberration) and those that did not meet our search term as 'false' (i.e. those that do not have an EGFR aberration).
Click the column menu for the column we just created (column B) and choose 'Display'.
Rename the display so that samples that are 'true' are instead labeled as 'EGFR Aberrations' and the samples that are 'false' are instead labeled as 'No EGFR Aberrations'. Click 'Done'.
Delete the search term. This will remove the black tick marks for matching samples.
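The logic behind the search term and the renamed subgroups can be sketched in Python. This is only an approximation under stated assumptions: the mutation effect strings and sample values are hypothetical, and we assume the search matches text fragments such as 'mis' and 'inframe' anywhere in the mutation annotation.

```python
def matches_search(mutation_effect, copy_number, cutoff=0.5):
    """Mimic '(mis OR inframe) OR B:>0.5' for a single sample."""
    effect = (mutation_effect or "").lower()
    has_mutation = "mis" in effect or "inframe" in effect
    has_amplification = copy_number is not None and copy_number > cutoff
    return has_mutation or has_amplification


# Display labels that replace 'true' and 'false' in the subgroup column.
LABELS = {True: "EGFR Aberrations", False: "No EGFR Aberrations"}

# Hypothetical samples: (sample ID, mutation effect, copy number value).
samples = [
    ("S1", "missense_variant", 0.1),
    ("S2", None, 0.9),
    ("S3", None, 0.2),
    ("S4", "inframe_deletion", -0.3),
]

subgroups = {sid: LABELS[matches_search(effect, cn)] for sid, effect, cn in samples}
print(subgroups)
```

A sample lands in the 'EGFR Aberrations' subgroup if either condition holds, so changing the cutoff only moves samples that have no matching mutation.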
Now that we have our subgroups we will run a Kaplan Meier survival analysis. Note that TCGA survival data is in days, hence the x-axis will be in days.
We can now see that there is no difference in survival between patients with EGFR aberrations and those without.
Click the column menu at the top of column B.
Choose 'Kaplan Meier Plot'.
Click 'Custom survival time cutoff' at the bottom of the Kaplan Meier plot.
Enter 3650, as this is 10 years.
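To make the effect of a custom time cutoff concrete, here is a minimal Python sketch of a Kaplan Meier estimate with follow-up truncated at 3650 days. The survival times and events are hypothetical, and the way the cutoff is applied (events after the cutoff are treated as censored at the cutoff) is our assumption about the general technique, not a description of Xena's implementation.

```python
def apply_time_cutoff(times, events, cutoff):
    """Truncate follow-up at `cutoff`; events after the cutoff become censored."""
    cut_times = [min(t, cutoff) for t in times]
    cut_events = [e if t <= cutoff else 0 for t, e in zip(times, events)]
    return cut_times, cut_events


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event."""
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical data: time in days, event 1 = death, 0 = censored.
times = [200, 1500, 3000, 4000, 5000]
events = [1, 0, 1, 1, 0]

cutoff_days = 10 * 365  # 3650 days, i.e. 10 years
cut_times, cut_events = apply_time_cutoff(times, events, cutoff_days)
curve = kaplan_meier(cut_times, cut_events)
print(curve)
```

With the cutoff applied, the death at day 4000 no longer contributes a step to the curve, so the plot ends at 3650 days.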
Starting at the end of Part A, filter down to only those patient's samples that have a missense mutation.
Starting at the end of Part A, create two subgroups: those patient's samples with EGFR expression greater than 4 and those with EGFR expression less than 4.
Starting at the end of Part A, run a Kaplan Meier analysis on the EGFR expression column.
Learn how to use Chart View and add new columns of data to a view
This tutorial is made for those who have never used Xena but who have completed Section 1 of the Basic Tutorial. We will cover how to make box plots and bar charts using our Charts and Statistics View and how to add another column of data, in particular phenotype data, to the view.
This tutorial assumes you have done Basic Tutorial: Section 1. Basic Tutorial: Section 2 is recommended but not required. This tutorial begins where the Basic Tutorial: Section 2 ends. A live link to the end of Basic Tutorial: Section 2 is at the beginning of this tutorial.
Part A: 5 min
Part B: 15 min
Part A
Create a box plot using the Charts and Statistics View
Part B
Add another column of data to the view
Add phenotype data to the view
Create a bar chart using the Charts and Statistics View
In the Basic Tutorial: Section 1 we found that patient's samples that have aberrations in EGFR have higher expression. These aberrations could be mutations or copy number amplifications.
In the Basic Tutorial: Section 2 we created two subgroups: patient's samples that have aberrations in EGFR and those without. We ran a Kaplan Meier survival analysis and found that there was no difference in survival between these two groups.
Now we are going to use the subgroups created in the Basic Tutorial: Section 2 to see if there is a statistical difference in gene expression between the two subgroups. We will also look at whether samples from male or female patients have more aberrations.
To ensure your columns are sorted the same as those in this tutorial, please start at this link.
We found that patient samples that have aberrations in EGFR have higher gene expression. Now we are going to investigate whether this difference in gene expression is statistically significant.
We can now see that patient's samples with EGFR aberrations have statistically higher gene expression.
Click the 3-dot column menu at the top of the gene expression column (don't worry if you start with another column - you will be selecting the correct columns in the steps ahead).
Click 'Compare subgroups', since we want to compare the group of samples who have aberrations in EGFR to the group of samples that do not.
Click the dropdown for 'Show data from' and choose 'column C: EGFR - gene expression RNAseq - HTSeq - FPKM-UQ'.
Click the dropdown for 'Subgroup samples by' and choose 'column B: (mis OR infra) OR C:>0.5 - Subgroup'.
Click 'Done'.
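For readers who want to see what a two-group comparison of expression values involves, here is a minimal Python sketch of Welch's t-test statistic. The expression values are hypothetical, and the choice of Welch's t-test is an assumption made for illustration; the test reported in Chart View may differ.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    var_a = variance(group_a) / len(group_a)
    var_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical expression values for the two subgroups.
with_aberrations = [5.0, 6.0, 7.0]
without_aberrations = [1.0, 2.0, 3.0]

t_stat, dof = welch_t(with_aberrations, without_aberrations)
print(t_stat, dof)
```

A large positive statistic means the first subgroup has higher mean expression relative to the spread within the groups; converting it to a p-value requires the t distribution, which is left out of this sketch.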
We will now investigate how EGFR aberrations compare between samples from men and women.
We can now see that EGFR aberrations are more common in samples from females.
Steps
Click the 'x' in the upper right corner to exit Chart View.
Hover between columns B and C until 'Click to insert a column' becomes visible. Click on it.
Choose 'Phenotypic', click in the search bar, and choose 'Advanced'.
Type 'gender' into the search bar, select 'gender.demographic' from the dropdown menu, and click 'Done'.
Click the column menu at the top of column C and choose 'Chart & Statistics'. Note that this is just another way to enter Chart View.
Click 'Compare subgroups', since we want to compare the group of samples who have aberrations in EGFR to the group of samples that do not.
'column C: gender.demographic' should already be selected for 'Show data from'. If not, select it.
'column B: (mis OR infra) OR C:>0.5 - Subgroup' should already be selected for 'Subgroup samples by'. If not, select it.
Click 'Done'.
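The bar chart summarizes counts of samples in each combination of gender and subgroup. A minimal Python sketch of that tally, using hypothetical samples, looks like this:

```python
from collections import Counter

# Hypothetical samples: (gender, has EGFR aberration).
samples = [
    ("female", True),
    ("female", True),
    ("female", False),
    ("male", True),
    ("male", False),
    ("male", False),
]

counts = Counter(samples)


def aberration_fraction(gender):
    """Fraction of samples of one gender that fall in the aberration subgroup."""
    with_aberration = counts[(gender, True)]
    total = with_aberration + counts[(gender, False)]
    return with_aberration / total


print(aberration_fraction("female"), aberration_fraction("male"))
```

Comparing fractions rather than raw counts matters when the two genders have different numbers of samples.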
Starting at the end of Part A, create a violin plot that compares copy number variation between patient's samples that have EGFR aberrations and those that do not.
Starting at the end of Part B, add the phenotype data 'age_at_earliest_diagnosis_in_years.diagnoses.xena_derived' to the plot.
Learn how to view whole chromosomes and view advanced datasets such as exon expression
This tutorial is made for those who have basic knowledge of how to use Xena. We will cover how to view a whole chromosome and how to use the advanced dataset menu to access datasets such as exon expression.
This tutorial assumes basic knowledge of how to build and read a Visual Spreadsheet. To get this, go through Basic Tutorial: Section 1.
10 min
Create a visual spreadsheet with a chromosome-wide column and data from the advanced dataset menu.
We will look at the ERG-TMPRSS2 gene fusion in patients from the TCGA Prostate Cancer study.
ERG is an oncogene that is expressed at low levels in normal prostate tissue. Some patients' prostate cancer samples have higher expression of ERG. These samples tend to have an intra-chromosomal deletion that fuses ERG to TMPRSS2. TMPRSS2 is expressed at high levels in normal prostate tissue. This allows ERG to use the TMPRSS2 promoter to increase ERG expression.
Note that column D may look slightly different, depending on how you resize and zoom the column.
We can now see that there are many patient's samples with relatively high expression of ERG (column B). This relatively high expression is not uniform across the exons of ERG, but instead is in the exons closer to the 3' end of the gene (column C). Looking at column D, we can see that these samples also have an intra-chromosomal deletion of part of chromosome 21. If we hover over the genes at either end of the deletion, we can see that the end points fall within ERG and TMPRSS2.
Start at https://xenabrowser.net/
Type 'TCGA Prostate Cancer (PRAD)', select this study from the drop down menu, and click 'To first variable'.
Type 'ERG', select the checkbox for Gene Expression and click 'To second variable'.
Type 'ERG', click 'Show Advanced', select the checkbox for 'IlluminaHiSeq' under 'exon expression RNAseq', and click 'Done'.
Click the text 'Click to insert a column' after column C. Type 'chr21', select the checkbox for Copy Number and click 'Done'.
Click on the filter menu and select 'Remove samples with nulls'.
Click on the handle in the lower right corner of column D, copy number for chromosome 21. Move it to the right to make the column bigger.
Click and drag within column D, copy number for chromosome 21, to zoom into the intra-chromosomal deletion.
Add copy number data for chromosome 1.
Add DNA Methylation data for ERG.
Learn to create your first views in Xena
This tutorial is made for those who have never used Xena. We will cover how to create a Visual Spreadsheet with gene expression, mutation, and copy number variation data.
This tutorial assumes basic knowledge of
gene expression, copy number variation, and mutational genomic sequencing data
how a change in copy number variation or mutations can lead to a change in gene expression
The Cancer Genome Atlas (TCGA)
These resources can help you gain basic knowledge of these concepts:
Part A: 5 min
Part B: 10 min
Part A
Create a Visual Spreadsheet
Compare data across columns
Part B
Move columns
Resize columns
Zoom in and out
We are going to look at EGFR aberrations in patients with lung adenocarcinomas using TCGA data. We will be looking at mutations and copy number aberrations and how they change gene expression.
Our goal is to build a Visual Spreadsheet and understand the relationship between the columns of data.
Start at our home page http://xena.ucsc.edu/ and click on 'Launch Xena'. You are now in our Visual Spreadsheet Wizard.
Type 'GDC TCGA Lung Adenocarcinoma (LUAD)', select this study from the drop down menu, and click 'To first variable'.
Type 'EGFR', select the checkboxes for Gene Expression, Copy Number, and Somatic Mutation, and click 'To second variable'.
How to read a Visual Spreadsheet
Samples are on the y-axis and your columns of data are on the x-axis. We line up columns so that each row is the same sample, allowing you to easily see trends in the data. Data is always sorted left to right and sub-sorted on columns thereafter.
We can see that samples from TCGA patients that have high expression of EGFR (red, column B) tend to either have amplifications of EGFR (red, column C) or mutations in EGFR (blue tick marks, column D).
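The left-to-right sorting described above can be sketched in Python as a sort on several keys. The rows are hypothetical, and sorting from high to low is an assumption made for this illustration.

```python
# Hypothetical rows: (sample, value in column B, value in column C).
rows = [
    ("S1", 3.2, 0.1),
    ("S2", 5.0, 0.8),
    ("S3", 5.0, 0.2),
    ("S4", 1.1, 0.9),
]

# Sort by column B first, then break ties using column C.
ordered = sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)
print([row[0] for row in ordered])
```

S2 and S3 tie on column B, so column C decides their order. Moving a column to the left changes which values are used first, which is why reordering columns changes the sample order.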
More information
Making your own Visual Spreadsheet: Which TCGA study to choose
There are 4 versions of the TCGA data in Xena. In this example we selected the TCGA data from the GDC. This page can help you decide which version of TCGA data to use for your own analysis.
We will now move the columns to change the sort order and resize columns. We will zoom in to the whole Visual Spreadsheet and also within a column.
Move columns. Click column C, copy number variation, and drag it to the left so that it becomes the first column after the samples column (i.e. column B). Note that the samples are now sorted by the values in this column.
Resize columns. Click the handle in the lower right corner of column D, mutation. Move it to the right to make the column bigger.
Zoom in on a column. Click and drag within column D. Release to zoom.
Zoom out on a column. Click the red zoom out text at the top of column D.
Zoom in on samples. Click and drag vertically in any column in the Visual Spreadsheet to zoom in on these samples.
Zoom out on samples. To zoom out click either 'Zoom out' or 'Clear zoom' at the top of the Visual Spreadsheet.
Create a Visual Spreadsheet looking at TP53 gene expression and mutation in samples from patients in the GDC TCGA Lower Grade Glioma study.
Change the Visual Spreadsheet from Question 1 so that the patient's samples are sorted by mutations rather than gene expression.
Tutorials, Live Examples, and How to pages for UCSC Xena
To change the color threshold, click on the column menu at the top and choose 'Display'. From there click 'custom', enter your new thresholds, and click 'Done'.
To change the color, click on the column menu at the top and choose 'Display'. From there choose a new set of colors from the drop down.
Learn how to use the pick samples feature, how to view multiple genes in a single column, how to view a signature, and how to run a differential expression analysis
Description
This tutorial is made for those who have basic knowledge of how to use Xena. We will cover how to use the pick samples feature, how to view multiple genes in a single column, how to enter and view a signature, and how to run a differential expression analysis.
This tutorial assumes basic knowledge of how to build and read a Visual Spreadsheet. To get this, go through Basic Tutorial: Section 1. It also assumes basic knowledge of filtering. To get this, go through Basic Tutorial: Section 2.
Part A: 10 min
Part B: 5 min
Part C: 15 min
Part A
Create a visual spreadsheet with a single column containing multiple genes.
Filter to only Primary Tumor samples using the Pick Samples mode.
Remove nulls using the option in the filter menu.
Part B
Enter and view a gene expression signature
Part C
Run a differential expression analysis.
We will investigate the PAM50 molecular subtypes in breast cancer. PAM50 is a 50-gene signature that classifies breast cancer into five molecular intrinsic subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like, and Normal-like.
We will make a visual spreadsheet where we can explore the relationship between the PAM50 subtype call and the 50 genes that make up the PAM50 subtype call.
Start at https://xenabrowser.net/
Type 'TCGA Breast Cancer (BRCA)', select this study from the drop down menu, and click 'To first variable'.
Choose 'Phenotypic', select 'sample_type' from the dropdown menu, and click 'To second variable'.
Choose 'Phenotypic', click on 'advanced', type 'pam' into the search bar, select 'PAM50Call_RNAseq' from the dropdown menu, and click 'Done'. This will exit the wizard.
Click on 'Click to insert a column' after column C. Copy and paste the 50 genes, choose 'Gene Expression', and click 'Done'.
Click the handle in the lower right corner of column D, gene expression. Move it to the right to make the column bigger.
List of 50 genes used to calculate the PAM50 subtype call:
UBE2T BIRC5 NUF2 CDC6 CCNB1 TYMS MYBL2 CEP55 MELK NDC80 RRM2 UBE2C CENPF PTTG1 EXO1 ORC6L ANLN CCNE1 CDC20 MKI67 KIF2C ACTR3B MYC EGFR KRT5 PHGDH CDH3 MIA KRT17 FOXC1 SFRP1 KRT14 ESR1 SLC39A6 BAG1 MAPT PGR CXXC5 MLPH BCL2 MDM2 NAT1 FOXA1 BLVRA MMP11 GPR160 FGFR4 GRB7 TMEM45B ERBB2
Click on the picker icon next to the filter menu to enter pick samples mode.
Click on the Primary Tumor samples.
Click the filter menu and select 'Keep samples'.
Exit pick samples mode by clicking on the picker icon again.
Click the filter menu and select 'Remove samples with nulls'.
We will now look at the TFAC30 gene signature and see how it relates to the PAM50 subtype calls. This gene expression signature over 30 genes predicts pathologic complete response (pCR) to preoperative weekly paclitaxel and fluorouracil-doxorubicin-cyclophosphamide (T/FAC) chemotherapy.
Click on 'Click to insert a column' after column D. Copy and paste the signature below, choose 'Gene Expression', and click 'Done'. Note you need to include the '=' as this tells Xena that you want the signature rather than to see all the genes individually.
TFAC30 gene expression signature:
=E2F3 + MELK + RRM2 + BTG3 - CTNND2 - GAMT - METRN - ERBB4 - ZNF552 - CA12 - KDM4B - NKAIN1 - SCUBE2 - KIAA1467 - MAPT - FLJ10916 - BECN1 - RAMP1 - GFRA1 - IGFBP4 - FGFR1OP - MDM2 - KIF3A - AMFR - MED13L - BBS4
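To show how a signature of this form can be read, here is a minimal Python sketch that parses the text and computes a signed sum of expression values for one sample. The expression values are hypothetical, and treating the score as a plain signed sum (with no weighting or normalization) is an assumption for illustration, not a description of how Xena computes it.

```python
import re


def parse_signature(text):
    """Turn '=A + B - C' into [('A', 1), ('B', 1), ('C', -1)]."""
    body = text.strip().lstrip("=")
    terms = re.findall(r"([+-]?)\s*([A-Za-z0-9_.]+)", body)
    return [(gene, -1 if sign == "-" else 1) for sign, gene in terms]


def signature_score(signature, expression):
    """Signed sum of expression values for the genes in the signature."""
    return sum(sign * expression[gene] for gene, sign in signature)


# A shortened, three-gene example in the same format as the signature above.
signature = parse_signature("=E2F3 + MELK - MAPT")

# Hypothetical expression values for one sample.
expression = {"E2F3": 2.0, "MELK": 3.0, "MAPT": 1.5}

score = signature_score(signature, expression)
print(signature, score)
```

Genes preceded by '+' raise the score and genes preceded by '-' lower it, so a sample with a high score has high expression of the first group and low expression of the second.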
We can now see that patient's samples that are labeled as 'Her2' and 'Basal' are predicted to be more likely to achieve pCR on TFAC chemotherapy.
We will run a differential expression analysis comparing Basal samples to Luminal A and Luminal B samples.
Click the column menu for the PAM50 subtype call (column C) and choose 'Differential Expression'. This will open a new tab where we will run the analysis.
Choose the first subgroup to be 'Basal' and the second subgroup to be 'LumA' and 'LumB'. Hold the shift key while clicking to select multiple groups.
Click 'Submit'.
Note it can take a while for the analysis to run. Wait until it says 'Success' at the top.
Learn how to view your own data using data from the Chinese Glioma Genome Atlas (CGGA)
This tutorial is made for those who have basic knowledge of how to use Xena. We will cover how to load your own data into a Xena hub on your computer. We will then view the data in the Xena Browser.
We will be viewing RNAseq and clinical data from the Chinese Glioma Genome Atlas (CGGA).
To format the datasets you will need access to a spreadsheet application, such as Microsoft Excel.
To load the data into a Local Xena Hub you will need a computer where you have installation privileges.
To visualize the data, you will need basic knowledge of how to build and read a Visual Spreadsheet, how to filter samples, how to create a box plot in Chart View, and how to run a Kaplan Meier Analysis. To get this go through the Basic Tutorials, starting with Basic Tutorial: Section 1.
Part A: 10 min
Part B: 15 min
Part C: 10 min
Part A
Download data from CGGA
Use Microsoft Excel or another spreadsheet application to make small formatting adjustments. These adjustments are only to enable Kaplan Meier analyses. Data can be visualized as is.
Part B
Download and install a Local Xena Hub
Load data into the Xena Hub on your computer
Part C
Make a visual spreadsheet from the data in the Xena Hub on your computer
Create a box plot
Run a Kaplan Meier Analysis
We will start with downloading the files from the CGGA. These files already conform to our data file requirements. This is because they are matrices that have sample IDs along one axis and probe, gene, or clinical data names along the other. Additionally, the files are tab-delimited.
For more information see:
While we can load the files exactly as is, we will perform a small format adjustment so that we can create a Kaplan Meier plot. Our Kaplan Meier analyses need two columns of clinical data to create a plot: the event/censor column and the time to that event/censor. These columns need to be specially named so that our Kaplan Meier analysis recognizes them. For Overall Survival, the column names need to be 'OS' and 'OS.time'.
For more information on other supported columns for our Kaplan Meier analysis see:
Genomic and Clinical data to load into Xena
Go to http://www.cgga.org.cn/download.jsp and scroll to the DataSet ID mRNAseq_693.
Click to download the 'Clinical Data' and 'Expression Data from STAR+RSEM'. Unzip the files. The resulting files should be named 'CGGA.mRNAseq_693.RSEM-genes.20200506.txt' and 'CGGA.mRNAseq_693_clinical.20200506.txt'.
Open CGGA.mRNAseq_693_clinical.20200506.txt in a spreadsheet application like Microsoft Excel. If the spreadsheet application asks, these files are tab-delimited.
Rename the column header 'OS' to be 'OS.time'.
Rename the column header 'Censor (alive=0; dead=1)' to be 'OS'.
Save and close the file.
There is no need to open CGGA.mRNAseq_693.RSEM-genes.20200506.txt since it is ready to be loaded into the Local Xena Hub on your computer as is.
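If you prefer a script to a spreadsheet application, the two header renames can be sketched in Python with the standard csv module. This is a minimal sketch, not part of the Xena tooling: the function name and the output filename in the usage comment are hypothetical. Both renames are applied in one pass over the header, so the new 'OS' column is not renamed a second time.

```python
import csv

# Old header name -> new header name, as described in the steps above.
RENAMES = {"OS": "OS.time", "Censor (alive=0; dead=1)": "OS"}


def rename_headers(in_path, out_path, renames=RENAMES):
    """Copy a tab-delimited file, renaming matching column headers."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        writer.writerow([renames.get(name, name) for name in header])
        writer.writerows(reader)  # data rows are copied unchanged


# Example usage (the output filename is hypothetical):
# rename_headers("CGGA.mRNAseq_693_clinical.20200506.txt",
#                "CGGA.mRNAseq_693_clinical.renamed.txt")
```

Note that the loading steps in this tutorial refer to the original filename, so if you write to a new file you would select that file instead when loading the clinical data.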
1. Click 'VIEW MY DATA' at the top of the screen. You should see a screen similar to this:
2. Click 'Open UCSC Xena' to set your computer up to automatically open the Xena Hub when you come to this page in the future.
3. Click on 'download & run a Local Xena Hub' to download the correct installer for your computer.
4. Double-click the installer to install the Xena Hub on your computer. Follow onscreen instructions, which vary by operating system.
Please see our FAQ/Troubleshooting Guide or contact us if you encounter any problems.
1. Click 'VIEW MY DATA' at the top of the screen. You should see a screen similar to this:
2. Wait for 30 seconds. If you allowed your browser to open the Xena Hub every time you come to this screen, then it will open the Xena Hub and this dialog box will close. If you did not, you will need to go to your Applications Folder and open UCSC Xena yourself.
Whether you have viewed your own data before or not, you should arrive at a screen like this:
If you have already loaded data previously, you may see datasets and cohorts listed at the bottom of the screen.
Click the 'Load Data' button.
Click 'Select Data File', choose 'CGGA.mRNAseq_693_clinical.20200506.txt', and click 'Next'.
Choose 'Phenotypic Data' and click 'Next'.
Choose 'The first column is sample IDs' and click 'Next'.
Choose 'These are the first data on these samples.', change the study name to 'CGGA', and click 'Import'.
Choose 'Load more data'
Click 'Select Data File', choose 'CGGA.mRNAseq_693.RSEM-genes.20200506.txt', and click 'Next'.
Choose 'Genomic Data' and click 'Next'.
Confirm selection of 'The first row is sample IDs' and click 'Next'.
Choose 'I have loaded other data on these samples and want to connect to it.', select 'CGGA' from the drop down, and click 'Import'.
Note that it can take several minutes for the RNAseq data to load since it is larger.
We will look at the chromosome 1p-19q co-deletion in Chinese glioma patients and compare this to IDH1 expression.
Note that we are unable to provide links to these ending screenshots because we do not allow users to create bookmarks when viewing data from their own Local Xena Hubs. This is to protect the privacy of your data. For more information see our Bookmarks help section.
Click on 'Visualization' in the top menu bar.
Type 'CGGA', choose 'CGGA' as the study and click 'To first variable'.
Enter the gene 'IDH1', choose 'CGGA.mRNAseq_693.RSEM-genes.20200506.txt', and click 'To second variable'.
Choose 'Phenotypic', click '1p19q_codeletion_status', and click 'Done'.
The dataset authors annotated samples without a 1p/19q co-deletion status with 'NA'. To remove these samples, type 'NA' in the samples search bar and choose 'Remove Samples' from the filter actions menu drop down.
Compare IDH1 expression between samples with a 1p/19q co-deletion and those without. To do this, click on the column menu for column B (IDH1 expression) and choose 'Charts & Stats'.
Choose 'Compare Subgroups'.
Click the dropdown for 'Show data from' and choose 'column B: IDH1 - CGGA.mRNAseq_693.RSEM-genes.20200506.txt'.
Click the dropdown for 'Subgroup samples by' and choose 'column C: 1p19q_codeletion_status - CGGA.mRNAseq_693_clinical.20200506.txt'.
Click 'Done'.
Close the chart using the 'x' in the upper left corner.
Run a Kaplan Meier analysis comparing patients with high IDH1 expression to those with low IDH1 expression. To do this, click on the column menu for column B (IDH1 expression) and choose 'KM plot'.
Learn how to compare tumor samples to normal samples using our TCGA TARGET GTEx study
This tutorial is made for those who have basic knowledge of how to use Xena. We will cover how to view tumor and normal samples from healthy and diseased individuals together, and how to compare gene expression for one or more genes between tumor and normal samples.
We will use both GTEx samples and TCGA matched normal samples as our normal samples. More information on GTEx normal samples can be found here:
This tutorial assumes basic knowledge of how to build and read a Visual Spreadsheet. To get this, go through Basic Tutorial: Section 1.
Part A: 10 min
Part B: 5 min
Part A
Build a visual spreadsheet with the columns primary site, sample type, study, and gene expression for the TCGA TARGET GTEx study.
Filter to just colon samples.
Part B
Create a box plot using the Charts and Statistics View
We will compare MYC gene expression between patients' colon adenocarcinoma tumor samples in TCGA and individuals' normal colon tissue in GTEx.
Our goal is to build a visual spreadsheet with the columns 'primary site', 'sample type', 'study', and gene expression for MYC for the TCGA TARGET GTEx study. We will then filter to samples in the colon.
We can now see that normal samples tend to have lower MYC gene expression.
Start at our home page http://xena.ucsc.edu/ and click on 'Launch Xena'. You are now in our Visual Spreadsheet Wizard.
Type 'TCGA TARGET GTEx', select this study from the drop down menu, and click 'To first variable'.
Type 'MYC', select the checkbox for Gene Expression and click 'To second variable'.
Choose 'Phenotypic' and select the checkboxes for 'sample type', 'study' and 'Primary site', and click 'Done'.
Type 'colon' in the samples search bar and choose 'Keep samples'.
Our goal is to see if the difference in gene expression, where normal samples tend to have lower MYC gene expression, is statistically significant.
We can now see that patients' tumor samples, whether recurrent, primary, or metastatic, have higher expression compared to normal tissue, both patients' matched normal tissue from TCGA and unmatched individuals' normal tissue from GTEx.
Click the column menu for column B (MYC gene expression) and choose 'Charts & Stats'.
Click 'Compare subgroups', click the dropdown for 'Show data from' and choose 'column B: MYC - gene expression RNAseq - RSEM norm_count' if it is not already selected.
Click the dropdown for 'Subgroup samples by' and choose 'column C: Sample Type'.
Leave the chart type as 'box plot', and click 'Done'.
Compare EGFR gene expression between patient's tumor samples and individual's normal lung tissue.
Use the find samples feature (highlighted below) to make subgroups:
First, search for all the patient's samples you want in one of your subgroups. Next, click the Filter + Subgroup menu and choose 'New subgroup column'.
This will create a new subgroup column. All the patient samples that matched your search term will be in one subgroup labeled as 'true' and all the samples that did not match your search term will be in the other subgroup labeled as 'false'.
Your new column can be used for a KM analysis or to compare gene expression.
More information:
In this example we are creating two subgroups in the TCGA Lung Adenocarcinoma study: patient's samples with aberrations in EGFR and those without. These aberrations could be mutations or copy number amplifications.
Type '(mis OR infra) OR C:>0.5' into the samples search bar. This will select samples that either have a missense mutation or inframe deletion '(mis OR infra)', or where copy number variation (column C) is greater than 0.5. Note that the cutoff of 0.5 was chosen arbitrarily.
Click the filter menu and select 'New subgroup column'. This will create a new column that has samples that met our search term marked as 'true' (i.e. those that have an EGFR aberration) and those that did not meet our search term as 'false' (i.e. those that do not have an EGFR aberration).
For more information see our Basic Tutorial: Section 2.
See our help on renaming the subgroup labels from 'true' and 'false' to something more biologically meaningful.
Step-by-step tutorials, videos, and other materials to get you started.
Live Examples of what types of visualizations and analyses you can perform using UCSC Xena
Xena mutation views support examination of both coding and non-coding mutations from whole genome analysis. We support viewing mutations from either a gene-centric or a coordinate-centric perspective. In the gene-centric view, you can dynamically toggle to show or hide introns from the view. This figure shows the frequent intron mutations in 321 samples from the ICGC lymphoma cohorts. These 'pile-ups' would not be visible if viewing mutations only in the exome. These intron mutations overlap with known enhancer regions (Mathelier 2015).
This page assumes you are familiar with making 2 subgroups. If you are not, please see the help page.
This page details how to create subgroups based on the expression of 2 genes so that you create the following 4 subgroups:
geneA expression is high AND geneB expression is high
geneA expression is high AND geneB expression is low
geneA expression is low AND geneB expression is low
geneA expression is low AND geneB expression is high
To do this, enter a search term for each gene, such as 'C:>15' or 'D:<0.6', into the search box and separate each search term with a ';'.
You can see in the search bar the expression used to make column A using the example genes of CD44 and CD24.
Also note that you can use this feature on columns besides gene expression, such as copy number variation, etc. You can also use it on categorical features, for instance to compare expression of a gene and the patient's gender (male or female). Simply add the gender column to the Visual Spreadsheet and enter 'female' for one of the search terms above.
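The way two search terms combine into four subgroups can be sketched in Python. The cutoffs and expression values below are hypothetical, and treating values exactly at the cutoff as 'low' is an assumption made for this illustration.

```python
def two_gene_subgroup(gene_a, gene_b, cutoff_a, cutoff_b):
    """Label a sample by whether each gene is above its cutoff."""
    a_level = "high" if gene_a > cutoff_a else "low"
    b_level = "high" if gene_b > cutoff_b else "low"
    return f"geneA {a_level}, geneB {b_level}"


# Hypothetical samples: (sample, geneA expression, geneB expression).
samples = [("S1", 16.0, 18.0), ("S2", 16.0, 2.0), ("S3", 3.0, 2.0), ("S4", 3.0, 18.0)]

groups = {sid: two_gene_subgroup(a, b, cutoff_a=15, cutoff_b=15) for sid, a, b in samples}
print(groups)
```

Each sample falls into exactly one of the four combinations, which is what makes the resulting column usable as a subgroup column.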
For users who wish to compare data across different types of cancer
To view multiple types of cancer patients side-by-side you will need to start with a Pan-Cancer dataset and then filter down to just the cancer types you want to see.
The TCGA PanCan (PANCAN) study contains the latest data from the PanCan Atlas project, including many hand-curated datasets. It also contains some legacy TCGA data across all cancer types, including GISTIC 2 CNV estimates and miRNAseq estimates.
1. Add the phenotype column 'cancer type abbreviation' that details the cancer type. Here is a link that will take you to the TCGA PanCan (PANCAN) Study with that phenotype column already selected.
2. Search for the cancer type you are interested in, making sure that it is listed in the phenotype column. Separate each cancer type by 'OR'. Example: 'lgg OR gbm'. Click the Filter + subgroup menu next to the search bar and select 'Keep Samples'.
Below is an example for viewing breast and ovarian cancer together for the TCGA PanCan Atlas
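The 'Keep Samples' step with an 'OR' search can be sketched in Python. The rows are hypothetical, and matching the whole abbreviation without regard to case is an assumption made for this illustration.

```python
def keep_samples(rows, search, column="cancer type abbreviation"):
    """Keep rows whose value in `column` matches any term in 'a OR b'."""
    wanted = {term.strip().lower() for term in search.split(" OR ")}
    return [row for row in rows if row[column].lower() in wanted]


# Hypothetical Pan-Cancer rows.
rows = [
    {"sample": "S1", "cancer type abbreviation": "LGG"},
    {"sample": "S2", "cancer type abbreviation": "BRCA"},
    {"sample": "S3", "cancer type abbreviation": "GBM"},
]

kept = keep_samples(rows, "lgg OR gbm")
print([row["sample"] for row in kept])
```

Adding another cancer type is a matter of appending one more term separated by 'OR'.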
Sometimes not all samples in a dataset have data. This can happen for a variety of reasons, such as when a particular patient's sample did not undergo one or more analyses. In this case, we use gray, or 'null', to show that there is no data.
To remove null data use the 'Remove samples with nulls' shortcut in the filter menu.
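As a rough illustration of what removing nulls means, the sketch below drops every sample (row) that is missing a value in at least one column. The row layout and the function name `remove_samples_with_nulls` are hypothetical and are not part of Xena.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one dict per sample, keyed by column letter.
Row = Dict[str, Optional[float]]


def remove_samples_with_nulls(rows: List[Row]) -> List[Row]:
    """Keep only the samples that have data in every column."""
    return [row for row in rows if all(v is not None for v in row.values())]


rows = [
    {"A": 1.0, "B": 2.0},
    {"A": None, "B": 3.0},  # no data in column A
    {"A": 4.0, "B": None},  # no data in column B
]
kept = remove_samples_with_nulls(rows)
```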
This page assumes you have a column on screen that has the groups you would like to compare (such as 'sample type' for comparing tumor vs normal) or have already made subgroups (such as 'has mutations in EGFR' vs 'does not have mutations in EGFR'). If you need help making subgroups, please see the help page.
First, make sure that the gene or genes that you want to compare across your groups are on screen.
Click on the charts icon in the top right and choose 'Compare subgroups'.
Click the dropdown for 'Show data from' and choose your gene expression column.
Click the dropdown for 'Subgroup samples by' and choose your subgroup column.
Choose if you would like a box plot or violin plot and click 'Done'.
Below we look at patients' samples that have aberrations in EGFR in the TCGA Lung Adenocarcinoma study. We will investigate whether samples that have aberrations in EGFR (mutations or copy number amplifications) have higher expression.
Click the graph icon in the upper right corner to enter Chart View.
Click 'Compare subgroups', since we want to compare the group of samples that have aberrations in EGFR to the group of samples that do not.
Click the dropdown for 'Show data from' and choose 'column C: EGFR - gene expression RNAseq - HTSeq - FPKM-UQ'.
Click the dropdown for 'Subgroup samples by' and choose 'column B: (mis OR infra) OR C:>0.5 - Subgroup'.
Click 'Done'.
If your plot has an '!' icon next to the p-value this means that some patients are in your plot twice. This can happen when A) a patient has both a tumor and normal sample or when a patient has a metastasis that is part of the dataset and/or B) a tumor sample was split into multiple aliquots and then run through the same analysis twice.
This page will guide you on how to remove duplicates due to A. If there are duplicates due to B you will need to , decide how to resolve any inconsistencies between the multiple aliquots and .
1. Add the data column 'sample type' from the Phenotype data.
We are adding a column of data that indicates the sample type, such as 'Primary Tumor', 'Normal', etc. Note that different datasets may have a different name for this data.
2. Filter to only samples that are 'Primary tumor' by typing 'primary' into the filter search box. Next, click the filter icon next to the filter search box and choose 'Filter'. This will filter out all samples that are not primary tumor.
Note that if you are viewing a mostly metastatic cancer like melanoma, you may need to filter on 'metastatic' instead of 'primary'.
3. Run your KM analysis by clicking the caret menu at the top of the column and choosing 'Kaplan-Meier plot'. It will now only have primary tumor samples in it.
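If you have downloaded a list of sample IDs and want to check for duplicate patients outside the browser, a minimal sketch follows. It assumes TCGA-style barcodes, where the first 12 characters identify the patient and characters 14 and 15 hold the sample type code ('01' for primary solid tumor, '11' for solid tissue normal). Other datasets name samples differently, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def patients_with_multiple_samples(sample_ids: List[str]) -> Dict[str, List[str]]:
    """Group TCGA-style sample IDs by patient and keep patients seen more than once."""
    by_patient = defaultdict(list)
    for sample_id in sample_ids:
        by_patient[sample_id[:12]].append(sample_id)  # e.g. 'TCGA-DB-A4XH'
    return {p: ids for p, ids in by_patient.items() if len(ids) > 1}


sample_ids = ["TCGA-DB-A4XH-01", "TCGA-DB-A4XH-11", "TCGA-2F-A9KO-01"]
duplicates = patients_with_multiple_samples(sample_ids)

# Keeping only primary tumor samples (code '01') resolves duplicates of type A.
primary_only = [s for s in sample_ids if s[13:15] == "01"]
```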
Removing duplicate samples from TCGA Lower Grade Glioma KM analysis
There are two main sources of normal expression data in Xena. The first is matched normal tissue samples from TCGA patients. These samples are called "solid tissue normals" and are taken from tissue near the tumor. Normal samples from TCGA patients are typically limited in number, but some cancer types may have enough for a robust statistical comparison. It is important to note that their proximity to the tumor means they may carry tumor microenvironment signal. The second source of normal expression is GTEx. GTEx has expression data from normal tissue of individuals who do not have cancer. There are typically many more samples in GTEx than in TCGA solid tissue normals. However, the experimental sample processing differs from TCGA, which may lead to batch effects.
You can use the TCGA TARGET GTEx study for both types of 'normal' samples. Data from the study is from the UCSC RNA-seq Compendium, where TCGA, TARGET, and GTEx samples are re-analyzed by the same RNA-seq pipeline. This pipeline involved re-aligning the reads to the hg38 genome and calling gene expression using the RSEM and Kallisto methods. Because all samples are processed using a uniform bioinformatic pipeline, batch effects due to different computational processing are eliminated. Note that the samples from this study have only undergone per-sample normalization.
To compare tumor vs normal, you will need to filter down to just the samples you want to compare and then compare gene expression between your groups of samples.
More information:
There are four gene expression datasets in this study. Two are normalized using within-sample methods: the 'RSEM norm_count' dataset is normalized by the upper quartile method, and the 'RSEM expected_count (DESeq2 standardized)' dataset is normalized by DESeq2. Therefore, these two gene expression datasets should be used.
If you are looking to compare just a few genes, you can use our to run your analysis. If you are looking to run a genome-wide differential gene expression analysis, you can use our . Note that we only allow users to run our Differential Gene Expression Analysis on less than 2,000 samples total. Thus, you will need to filter to run this analysis on this dataset.
For users who wish to use the datasets in a Pan-Can cohort but need to view just one cancer type.
1. Add the phenotype column that details the cancer type
The phenotype column will vary depending on which study you choose. See below for specific column names
2. Search for the cancer type you are interested in, making sure that it is listed in the phenotype column. Click the Filter + subgroup menu next to the search bar and select 'Keep Samples'.
For the TCGA PanCan (PANCAN), you will want to add the phenotype column:
cancer type abbreviation
For the TCGA TARGET GTEx, you will want to add the phenotype columns:
main category
study
primary_site
This page assumes that you are already viewing more than one cancer type in view. Please see the help page '' to get started with this.
If there are cancer types in view that you do not want to investigate, you will need to filter them out. Please see the help page '' to get started with this.
Steps:
Add a column of data. Enter your gene or list of genes, select 'Gene Expression', and click 'Done'.
From the column menu at the top of the new column you created, select 'Chart & Statistics'
Choose 'Compare Subgroups'
Click the dropdown for 'Show data from' and choose your gene expression column.
Click the dropdown for 'Subgroup samples by' and choose the cancer type column.
Choose if you would like a box plot or violin plot and click 'Done'.
This page assumes you are familiar with making 2 subgroups. If you are not, please see .
To make more than 2 sample subgroups, enter multiple search terms, such as 'C:>15' into the search box. Separate each search term with a ';'.
This can be used for a number of situations:
To divide a single numerical column into more than 2 subgroups (e.g. geneA high, geneA mid, and geneA low)
To make subgroups over the expression of two genes such that you get 4 subgroups (e.g. geneA high + geneB high, geneA low + geneB high, geneA high + geneB low, geneA low + geneB low)
To make subgroups over the expression of a gene and a categorical column (e.g. geneA high + Estrogen Receptor positive, geneA low + Estrogen Receptor positive, geneA high + Estrogen Receptor negative, geneA low + Estrogen Receptor negative)
To make subgroups over two categorical columns (e.g. Estrogen Receptor positive + HER2 positive, Estrogen Receptor negative + HER2 positive, Estrogen Receptor positive + HER2 negative, Estrogen Receptor negative + HER2 negative)
See below for an example of each.
In the screenshot below you can see that column D ranges from 7.3 to 12. If you wanted to have 3 groups: 7.3 - 9, 9 - 10, and 10 - 12, you would enter:
C:>9 ; C:>10
into the search bar and then choose 'New subgroup column' from the filter/subgroup drop down menu.
In the screenshot below you can see that column E (ERBB2 gene expression) ranges from 10 to 16. If you wanted to have 4 groups (ERBB2 > 13 + Estrogen Receptor positive, ERBB2 <= 13 + Estrogen Receptor positive, ERBB2 > 13 + Estrogen Receptor negative, ERBB2 <= 13 + Estrogen Receptor negative), you would enter:
E:>13 ; C:Negative
into the search bar and then choose 'New subgroup column' from the filter/subgroup drop down menu.
In the screenshot below, if you wanted to have 4 groups: Estrogen Receptor positive + HER2 positive, Estrogen Receptor negative + HER2 positive, Estrogen Receptor positive + HER2 negative, Estrogen Receptor negative + HER2 negative you would enter:
C:Negative ; D:Negative
into the search bar and then choose 'New subgroup column' from the filter/subgroup drop down menu.
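One way to reason about the examples above is that each sample is labeled by the combination of search terms it matches. The sketch below works under that assumption, which reproduces the group counts in the examples but is not confirmed as Xena's implementation. The function name and the sample layout are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Term = Callable[[Sample], bool]


def subgroup_labels(samples: List[Sample], terms: List[Term]) -> List[Tuple[bool, ...]]:
    """Label each sample with one True/False flag per search term."""
    return [tuple(term(sample) for term in terms) for sample in samples]


# Equivalent of the search 'C:>9 ; C:>10' on a single numerical column C.
samples = [{"C": 8.1}, {"C": 9.5}, {"C": 11.2}]
terms = [lambda s: s["C"] > 9, lambda s: s["C"] > 10]
labels = subgroup_labels(samples, terms)
# The first sample matches neither term, the second only the first term,
# and the third matches both, which gives three distinct subgroups.
```

With two terms on different columns, such as 'E:>13 ; C:Negative', all four True/False combinations can occur, which gives the 4 subgroups described above.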
are useful if you have a specific question.
Workshops are a great way to teach a group of people how to use Xena. They can be 1-hour, 1/2-day, or 1-day in length. Currently we are only giving workshops remotely via Zoom or a similar technology. We give workshops both within the USA and internationally. Please contact us for more information:
from 'true' and 'false' to something more biologically meaningful.
For more information see our .
that will take you to the TCGA TARGET GTEx Study with those phenotype columns already selected.
Our search is a 'contains' search, meaning the term you enter can be at the beginning, end, or middle of a matched term. Our search is also case-insensitive. For example,
IIA
will match 'Stage IIIA' and 'Stage IIA'. To match an exact string, use quotes:
"Stage IIA"
You can specify a certain column and mathematical expression such as
A:>2
which will find all values greater than 2 in the first column. We support the following operators
= (equal)
<= (less than or equal)
>= (greater than or equal)
< (less than)
> (greater than)
!= (not equal)
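The matching rules described above can be sketched in a few lines. This is only an illustration of the semantics (a case-insensitive 'contains' match, and a column-specific numerical comparison such as 'A:>2'), not Xena's actual parser. The names `contains_match` and `column_match` are hypothetical, and rows are assumed to be dicts keyed by column letter.

```python
import operator
import re
from typing import Dict, Optional

OPERATORS = {
    "=": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "!=": operator.ne,
}

# Column letter, ':', operator, then a number. Two-character operators are tried first.
TERM = re.compile(r"^([A-Z]):(>=|<=|!=|=|>|<)(-?\d+(?:\.\d+)?)$")


def contains_match(term: str, value: str) -> bool:
    """Case-insensitive 'contains' match of a search term against a cell value."""
    return term.lower() in value.lower()


def column_match(term: str, row: Dict[str, Optional[float]]) -> bool:
    """Evaluate a term such as 'A:>2' against one sample's row."""
    parsed = TERM.match(term.strip())
    if parsed is None:
        raise ValueError(f"unsupported search term: {term!r}")
    column, op, number = parsed.groups()
    value = row.get(column)
    # Samples with no data in the column never match a numerical comparison here.
    return value is not None and OPERATORS[op](value, float(number))


stages = ["Stage IIIA", "Stage IIA", "Stage IB"]
stage_hits = [s for s in stages if contains_match("IIA", s)]
```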
You can search any annotation on a mutation, such as the functional impact, protein position, or gene name itself
To find all samples with a mutation that causes a specific protein change, enter the protein change:
V600E
To find all samples where the functional impact has the text 'frame' or 'nonsense' in it:
frame OR nonsense
To find all samples that have a mutation, search the gene annotation:
TP53
To find all samples that do not have a mutation, use the negation of the gene annotation:
!=TP53
To find all samples that do not have data in one or more columns, use:
null
and choose 'Remove samples'. To find all samples that do not have data for just one column, use:
B:null
Enter a sample ID to find a sample of interest. An example:
TCGA-DB-A4XH
If you are searching for multiple sample IDs, you will need to separate each by an 'OR'. You can copy and paste a list of sample IDs into the search bar as long as they are separated by a space, tab, or return (new line).
TCGA-DB-A4XH OR TCGA-2F-A9KO-01 OR TCGA-02-0001
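If you are preparing a long list of sample IDs outside the browser, a small helper like the one below builds the 'OR'-separated form. The function name is hypothetical. It splits on any whitespace (spaces, tabs, or new lines) and drops repeated IDs.

```python
def ids_to_query(raw: str) -> str:
    """Turn a whitespace-separated list of sample IDs into an 'OR' search string."""
    unique_ids = list(dict.fromkeys(raw.split()))  # preserves the original order
    return " OR ".join(unique_ids)


query = ids_to_query("TCGA-DB-A4XH\tTCGA-2F-A9KO-01\nTCGA-02-0001 TCGA-DB-A4XH")
```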
To make it easy to search a specific column, we use shorthand to annotate the first column as 'A:', the second as 'B:', etc. An example is
A:YES
This will search ONLY the first column for the word 'YES'. Note that we will retain your original search if you move the columns around.
You can enter multiple search terms and we will match all of them with an implicit 'AND'. We also support 'OR'.
Use parentheses to group search terms. For example:
"Stage II" (B:Negative OR C:Negative)
will search for samples that match 'Stage II' in any column and are 'Negative' for either the second or third column.
You can also use '!=' to negate a term such as:
!=null
which will match all samples that have data across all columns.
This dynamic, powerful, and flexible view is our default view into the data.
The Visual Spreadsheet allows you to add an arbitrary number of columns of any data type (mutation, copy number, expression, protein, phenotype, methylation, etc.) on any number of patients' samples into a spreadsheet-like view. We line up all columns so that each row is the same sample, allowing you to easily see trends in the data. Data is always sorted left to right and sub-sorted on columns thereafter.
Get started by going to the Xena Browser and following the wizard to enter your data of interest.
The wizard on the screen will guide you to choose a study to view and TWO columns of data to view on those samples. Note that if you do not choose at least two columns, the wizard will not exit and let you interact with the data.
You can select a cohort either by choosing 'Help me select a cohort' and searching our cohorts for your cancer type, etc., or by choosing 'I know the study I want to use' and searching for the partial or full name of the cohort you are interested in.
Enter a HUGO gene name or a dataset-specific probe name (e.g. a CpG island). You can enter one gene or multiple genes. Separate multiple genes with a space, comma, tab, or new line.
To display a genomic region, enter the genomic region, choose your dataset, and click 'Done'. We recognize chromosomes (e.g. chr1), arms of chromosomes (e.g. chr19q), and chromosome coordinates (e.g. chr1:100-4,000).
After entering a gene or probe name, you will need to select one or more datasets.
We have pre-selected default datasets for most cohorts. These datasets are selected because they are the most used datasets. Typically there is a default mutation, copy number, and expression dataset.
Xena also has more datasets than those listed in the Basic Menu. Depending on the cohort, these can include DNA methylation, exon expression, thresholded CNV data and more. To access them, click on 'Show Advanced' below:
More information on basic datasets
We annotate datasets used in the basic Visual Spreadsheet wizard with a red asterisk in our datasets pages. For an example see: https://xenabrowser.net/datapages/?cohort=TCGA%20Acute%20Myeloid%20Leukemia%20(LAML)
Patient samples are on the y-axis and your columns of data are on the x-axis. We line up all columns so that each row is the same sample, allowing you to easily see trends in the data. Data is always sorted left to right and sub-sorted on columns thereafter.
If you entered a single gene, that gene will be listed at the top of the column. If there are multiple probes mapped to that gene in the dataset you selected they will be displayed as subcolumns ordered left to right in the direction of transcription.
If you selected a positional dataset, such as segmented copy number variation or mutation, the gene model will be displayed at the top of the column. The gene model is a composite of all transcripts of the gene. Boxes show different exons, with UTR regions being short and CDS regions being tall. We display 2Kb upstream to show the promoter region. Use the column menu to toggle to show intronic regions.
If you entered multiple genes, each gene will be listed as a subcolumn for that dataset. If there are multiple probes mapped to that gene in the dataset (i.e. if you entered a single gene then you would see the probes as subcolumns), then the probes are averaged for a single value per gene.
Note that if you entered more than one gene and selected a mutation dataset, we will only show the first gene. If you wish to see multiple mutation columns, please enter each gene individually and click 'done'
When displaying a chromosome range, genes will be shown at the top of the column, with dark blue genes being on the forward strand and red genes being on the reverse strand. Hovering over a gene will display the gene name in the tooltip. Note that introns are always shown in this mode.
Individual values vary by dataset. The legend at the bottom of the dataset will tell you the units for your particular dataset, including any normalization that was performed. If a sample does not have data for a column, it will show as gray and be labeled as 'null'.
If the entire column is gray this means we did not recognize the gene, probe, or position. If you believe this to be in error, please try an alternate name.
More information about a dataset can be found in the dataset details page. To get there, click on the column menu and choose 'About'.
The Xena Browser uses the y-axis for samples and the x-axis/columns for genomic/phenotypic features. Data from a single sample is always on the same horizontal line across all columns, allowing you to see screen-wide trends. The Xena Browser orders samples by column from left to right: first by the first column, then the second, etc. If there are multiple genes, identifiers, or probes within a column, samples are ordered from left to right by the first sub-column, then the second sub-column, and so on.
Numerical data are ordered in descending order (e.g. 3.5, 1.2, ...). Categorical data (e.g. stage, tumor type, etc.) are ordered by categories. CNV data is sorted by the average of the entire column. Positional mutation data is ordered by genomic coordinates (from 5'->3') and then by the predicted impact of the mutation. Both CNV and positional mutation data have the option to instead sort by the zoomed region. Click the column menu at the top of the column and choose 'Sort by zoom region avg'.
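For numerical columns, the left-to-right ordering described above behaves like a multi-key sort. The sketch below assumes two numerical columns with no missing values. It does not model categorical, CNV, or mutation columns, and the row layout is hypothetical.

```python
rows = [
    {"sample": "s1", "A": 1.2, "B": 7.0},
    {"sample": "s2", "A": 3.5, "B": 2.0},
    {"sample": "s3", "A": 3.5, "B": 9.0},
]

# Sort by column A in descending order, then break ties using column B (also descending).
ordered = sorted(rows, key=lambda row: (-row["A"], -row["B"]))
order = [row["sample"] for row in ordered]
```

Here s2 and s3 tie on column A, so column B decides their relative order. This is why moving a different column to the far left changes the whole view.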
To reverse the ordering, click the column menu at the top of the column and choose 'Reverse sort'.
As the sample sort order is controlled by the left most columns, it can be useful to explore the data by moving a different column to the left.
To move a column click on the column header and drag a column to the right or left.
Click and drag anywhere in any column to zoom in, in either direction. Zoom out to all samples by clicking 'Clear Zoom' at the top. Zoom out to the whole column by clicking the red 'x' at the top of a column.
The Tooltip at the top of the Visual Spreadsheet shows more information about the data under the mouse. Links in the tooltip go to the UCSC Genome Browser, where you can learn more about that gene or genomic position. Alt-click to freeze and unfreeze the tooltip to be able to click on the links. Click here for more information about interacting with the tooltip.
You can change the size of a column by clicking on the bottom right corner of a column and dragging to a new size.
You can add another column of data by clicking on 'Click to add column' on the right edge of the Visual Spreadsheet, or by hovering between columns until 'Click to insert column' displays.
When you are in the Xena Visual Spreadsheet, hovering the mouse over any data on the screen will trigger a tooltip to show up at the top of the view.
To freeze the tooltip, you need to "Alt-click", i.e. hold down the Alt key on your keyboard and at the same time click the left mouse button.
To unfreeze the tooltip, click on the close (X) icon.
This can be helpful if you want to click on the link to take you to the UCSC Genome Browser, where you can view more information about those genomic coordinates.
Kaplan Meier Survival Analyses are a way of comparing the survival of groups of patients. More information on what a Kaplan Meier analysis is can be found in this article
To generate a KM plot, click on the column menu at the top of a column and choose 'Kaplan Meier Plot'.
For numerical or continuous features, you will have the option of having 2 groups of samples, 3 groups of samples, or viewing the upper vs lower quartile. For 2 groups, we divide the samples on the median. For 3 groups, we divide samples into the upper third, middle third, and lower third.
When viewing the upper vs lower quartile, note that we only include samples that are greater than (not greater than or equal to) the upper quartile, and likewise only samples that are less than (not less than or equal to) the lower quartile.
Note that all samples are used to calculate the median and other dividing values, whether or not they have survival data. To see which samples have survival data, add the column 'OS' from the phenotype data.
If more than one sample has the same value, we put the samples in a group together, even if this means the groups end up being unequal in size.
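The grouping rules above can be sketched as follows. Two details are assumptions rather than documented behavior: which group a sample exactly equal to the median falls into (here, 'high'), and the quartile interpolation method (here, Python's 'inclusive' method). The function names are hypothetical.

```python
from statistics import median, quantiles
from typing import List, Optional


def split_on_median(values: List[float]) -> List[str]:
    """Two groups: samples at or above the median vs samples below it (assumed tie rule)."""
    cut = median(values)
    return ["high" if v >= cut else "low" for v in values]


def upper_vs_lower_quartile(values: List[float]) -> List[Optional[str]]:
    """Strictly above the upper quartile or strictly below the lower quartile."""
    lower_cut, _, upper_cut = quantiles(values, n=4, method="inclusive")
    labels: List[Optional[str]] = []
    for v in values:
        if v > upper_cut:
            labels.append("upper")
        elif v < lower_cut:
            labels.append("lower")
        else:
            labels.append(None)  # middle samples are left out of the plot
    return labels


expression = [1, 2, 3, 4, 5, 6, 7, 8]
two_groups = split_on_median(expression)
quartile_groups = upper_vs_lower_quartile(expression)
```

Because every sample with the same value receives the same label, tied values always stay together, which is why the groups can end up unequal in size.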
For categorical features, we only show the first 10 categories.
For mutation features, we divide samples into those with any mutation and those without. To make different groups (e.g. samples with nonsense mutations vs those without), create your own subgroups and run a KM plot on the new column
We remove samples with 'null' data for all plots.
We default to Overall Survival. Users can select different end points if they are available. An example of this is in the TCGA PanCancer Study.
We default to the last time any individual in the plot was known to be alive. You can change this to be 1-year or 5-year survival by changing the time cutoff at the bottom of the screen. The statistics will automatically recalculate. TCGA data uses days as their measurement of time.
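A common way to apply a custom time cutoff in survival analysis is to censor every sample at the cutoff. The sketch below assumes that approach; the source does not state how Xena applies the cutoff internally. Times are in days, as in TCGA, and the function name is hypothetical.

```python
from typing import List, Tuple


def apply_time_cutoff(
    times: List[float], events: List[int], cutoff: float
) -> Tuple[List[float], List[int]]:
    """Censor samples at the cutoff: later follow-up is truncated and counted as no event."""
    new_times, new_events = [], []
    for time, event in zip(times, events):
        if time > cutoff:
            new_times.append(cutoff)
            new_events.append(0)  # still alive at the cutoff, so censored
        else:
            new_times.append(time)
            new_events.append(event)
    return new_times, new_events


# 1-year survival with time measured in days.
cut_times, cut_events = apply_time_cutoff([100, 400, 2000], [1, 1, 0], cutoff=365)
```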
You can generate a high quality PDF by clicking the PDF icon.
You can download the data used to generate the KM plot using the download icon. It will download the Event and Time to Event columns, in addition to the sample ID, patient ID, groups, and underlying data.
When there are multiple curves or lines in a KM plot, Xena Browser compares the different Kaplan–Meier curves using the log-rank test. The Browser reports the test statistic (χ²) and p-value (χ² distribution). Data is retrieved in real-time from Xena Hub(s) to a user's web browser and the test is performed in the browser to maintain your data privacy.
The statistics the Xena Browser reports are equivalent to R's survival package, survdiff, with rho=0 (default in R).
If all patients in a particular group (i.e. line) are censored before any event happens for the whole population (including all the groups), we exclude this group from the statistical analysis and perform the log-rank test on the remaining groups. We do this because we have no way to know the number of people at risk for this particular group at any of the event times, and therefore cannot compute any statistics for this group. R handles this exception in the same way. Although this group is removed from the statistical analysis, we still display the group in the KM plot.
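For readers who want to check a result by hand, below is a minimal sketch of the standard two-group log-rank test, which is the calculation that R's survdiff performs with rho=0. It is not Xena's source code, it handles only two groups, and it does not implement the exclusion rule for fully censored groups described above. The function name is hypothetical.

```python
from math import erfc, sqrt
from typing import List, Tuple


def logrank_two_groups(
    times_a: List[float], events_a: List[int],
    times_b: List[float], events_b: List[int],
) -> Tuple[float, float]:
    """Return (chi-squared statistic, p-value) for a two-group log-rank test."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        at_risk = at_risk_a + at_risk_b
        deaths_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        deaths_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        deaths = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += deaths * at_risk_a / at_risk
        if at_risk > 1:
            # Hypergeometric variance of the number of deaths in group A at time t.
            variance += (
                deaths * (at_risk_a / at_risk) * (at_risk_b / at_risk)
                * (at_risk - deaths) / (at_risk - 1)
            )
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # chi-squared survival function with 1 degree of freedom
    return chi2, p_value


chi2, p_value = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
```

In this toy example every patient in the first group dies before anyone in the second group, so the statistic is large relative to the sample size.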
Note that we do not automatically remove duplicate patients (for instance if there is a tumor and a normal sample from the same patient). You can determine if there are duplicate patients by looking for the "!" icon next to the p value. Learn how to remove duplicate samples.
Chart View will generate bar plots, box plots, violin plots, scatter plots, and distribution graphs using any of the columns in a Visual Spreadsheet. Statistics, such as Welch's t-test, Pearson's and Spearman's rank correlation, and ANOVA will be calculated automatically.
To get to the chart view click on the icon indicated below by the red box or use the column menu and select 'Chart & Statistics'.
Once you enter Chart View, it will ask you a series of questions about what type of graph you are trying to make.
Compare subgroups will allow you to compare groups of patients' samples, either those that you have made or via a categorical feature, such as sample type. It will build the appropriate graph depending on whether you have selected a continuous numerical or categorical column. This option will let you make box plots, violin plots, and bar charts.
See a distribution will let you see a histogram distribution of the data in a single column. The column can have sub-columns, either multiple probes or multiple genes, which will instead create a plot with multiple box plots.
Make a scatterplot will make a scatterplot from two continuous numerical columns. The second column can have multiple sub-columns, either multiple probes or multiple genes, which will create overlapping scatterplots
If an option is grayed out, this means that you do not have enough or the right type of data on the screen. Return to the Visual Spreadsheet and add more data.
We show statistics in the upper right corner of the screen for most graphs. If we detect it will take some time run the statistics we may instead show a button with 'run stats', so that you can decide if you would like to run the statistical test.
Advanced options available under the graph will allow you to change the scales of the axes. If you are viewing a scatterplot it will also allow you to color the points by a column of data.
Note that for violin plots, the width of each plot does not relate to the number of samples in the plot.
To return to the Visual Spreadsheet, click either the icon in the upper left, or the 'x' close button.
How to find samples that you want to remove or keep in the view. How to make subgroups.
Use the search box at the top of the screen to first pick/find your samples of interest. Then filter to keep or remove these samples, create a new subgroup column, or zoom.
The bar highlighted above allows you to search all data on the screen for your search term. Note that it will not search data that is not on the screen. Samples that match your criteria are marked with a black bar in the Visual Spreadsheet.
You can search for samples by either typing in the search bar or by clicking on the dropper icon to enter the pick samples mode. The pick samples mode will allow you to click on a column to select samples. The search term for your picked samples will appear in the search bar. To exit the pick samples mode, click on the dropper icon again.
Note the pick samples mode tends to work best if the column you are selecting from is the first column.
Once you have your sample(s) of interest, click on the filter + subgroup menu and choose to:
Keep samples: Keep only the samples which match your criteria.
Remove samples: Remove the samples which match your criteria.
Clear sample filter: Remove ALL filters currently applied.
Remove Samples with nulls: Removes samples that have no data for one or more columns. Equivalent to typing 'null' in the search bar and choosing 'Remove samples'.
Zoom: Zoom to the samples that meet your criteria. Shift-click to zoom out.
Once you have either filtered, created a subgroup column, or zoomed to samples, your search term will be added to the search history. Access the search history by clicking the downward facing arrow at the upper right of the search bar.
Once the subgroup column is created, users can change the labels from "true" or "false" to, for example, "wild type" or "EGFR mutant" by adjusting the column display settings. To access these select the three dot menu at the top of the column and choose 'Display'
for copying a sample ID from the tooltip.
More information on
New subgroup column: Create a new column where samples that meet your criteria are annotated as 'true' and samples that don't meet your criteria are annotated as 'false'. This new column can then be used for or in the .
To create more than 2 subgroups, please see our guide.
Note this search history will be preserved in .
Run a genome-wide differential GSEA analysis to compare groups of samples
To run a GSEA analysis, click on the 3 dot column menu at the top of a categorical column (not a numerical column) and choose 'GSEA'.
This will take you to a new page where you will define the sample subgroups you would like to compare (note that you can select multiple categories for a single subgroup).
After you have your subgroups, choose a gene set library, scroll to the bottom and click 'submit'.
Due to compute limitations you can only run a total of 2000 samples through the analysis pipeline.
This will start the analysis, which may take a while to run depending on the size of the dataset. As the results are completed, the web page will update. Scroll to see more results. Once the analysis is finished it will say 'Done' at the top of the page.
The gene expression dataset chosen for a specific study/cohort is the same gene expression dataset as the one in the Basic Datasets menu.
The Advanced Visualization parameters apply to the PCA or t-SNE plot, as well as the blitzGSEA analysis itself.
Note that the GSEA analysis runs blitzGSEA, a faster implementation of a traditional GSEA analysis.
We disable running our GSEA analysis on your own data since we send the data in the analysis to various websites, which may not be secure. Currently we only offer a docker image as a method for running this pipeline on your own data. Please contact us if you need help setting this up.
Enter a genomic signature over a set of genes for a particular dataset
Genomic signatures, sometimes expressed as a weighted sum of genes, are an algebra over genes, such as "ESR1 + 0.5*ERBB2 - GRB7". Once a signature is entered, the value of each gene for each sample is substituted and the algebraic expression is evaluated.
Open the Add column menu
Enter '=' and then your signature into the gene entry box
Select 'gene expression' as the dataset
Click 'Done'
There must be a space on both sides of the "+" and "-".
Alternatively enter a list of genes and we will automatically add a '+' in between each gene when evaluating the signature
If we cannot find a gene that is part of the signature, the missing gene will be included as a zero in the expression calculation and the label will list the gene as missing.
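To make the evaluation rules concrete, here is a minimal sketch of how a signature such as "=ESR1 + 0.5*ERBB2 - GRB7" could be evaluated for one sample. It relies on the spaces around "+" and "-" noted above and treats missing genes as zero. The parser and the function names are hypothetical and are not Xena's implementation.

```python
from typing import Dict, List, Tuple


def parse_signature(signature: str) -> List[Tuple[float, str]]:
    """Split a signature into (weight, gene) pairs using whitespace-separated tokens."""
    terms = []
    sign = 1.0
    for token in signature.lstrip("=").split():
        if token == "+":
            sign = 1.0
        elif token == "-":
            sign = -1.0
        else:
            weight, _, gene = token.rpartition("*")  # '0.5*ERBB2' -> ('0.5', '*', 'ERBB2')
            terms.append((sign * (float(weight) if weight else 1.0), gene))
            sign = 1.0
    return terms


def evaluate_signature(
    signature: str, expression: Dict[str, float]
) -> Tuple[float, List[str]]:
    """Return the signature score for one sample and the genes that were not found."""
    terms = parse_signature(signature)
    missing = [gene for _, gene in terms if gene not in expression]
    score = sum(weight * expression.get(gene, 0.0) for weight, gene in terms)
    return score, missing


score, missing = evaluate_signature(
    "=ESR1 + 0.5*ERBB2 - GRB7", {"ESR1": 10.0, "ERBB2": 4.0}
)
```

In this example GRB7 has no value for the sample, so it contributes zero to the score and is reported as missing.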
Hess et al. identified 30 genes whose gene expression profile is predictive of complete pathologic response to chemotherapy treatment in breast cancer.
=E2F3 + MELK + RRM2 + BTG3 - CTNND2 - GAMT - METRN - ERBB4 - ZNF552 - CA12 - KDM4B - NKAIN1 - SCUBE2 - KIAA1467 - MAPT - FLJ10916 - BECN1 - RAMP1 - GFRA1 - IGFBP4 - FGFR1OP - MDM2 - KIF3A - AMFR - MED13L - BBS4
Here we can see that the predicted chemo response signature is high in the basal subtype and low in luminal subtype. Additionally, the signature is high for ER negative samples and low for ER positive samples.
Bookmark: https://xenabrowser.net/?bookmark=2401ccb792e256d7397008b24af20565
We also have a number of signature datasets under the TCGA Pan-Cancer study from the PanCan Atlas project:
To use these signatures, go to the dataset pages (links above) to see what the names of the specific signatures are (under Identifiers). Then in the visualization enter the name of the specific signature as a gene, click 'Advanced', choose the appropriate dataset, and click 'Done'
Run a genome-wide differential gene expression analysis to compare groups of samples
To run a differential gene expression analysis, click on the 3 dot column menu at the top of a categorical column (not a numerical column) and choose 'Differential Expression'.
This will take you to a new page where you will define the sample subgroups you would like to compare (note that you can select multiple categories for a single subgroup).
After you have your subgroups, scroll to the bottom and click 'submit'.
Due to compute limitations you can only run a total of 2000 samples through the analysis pipeline.
This will start the analysis, which may take a while to run depending on the size of the dataset. As the results are completed, the web page will update. Scroll to see more results. Once the analysis is finished it will say 'Done' at the top of the page.
The gene expression dataset chosen for a specific study/cohort is the same gene expression dataset as the one in the Basic Datasets menu.
The Advanced Visualization parameters only apply to the PCA or t-SNE plot. They do not apply to any other analyses.
We disable running our differential gene expression analysis on your own data since we send the data in the analysis to various websites, which may not be secure. There are 3 options to run our analysis on your own data:
Upload your data to BioJupies to run a somewhat similar analysis. BioJupies, by the Ma'ayan lab, has a very user-friendly interface.
Upload your data to the Bulk RNA-seq analysis pipeline Appyter to run a very similar analysis. This pipeline is what our analysis is based off of and will require a bit more familiarity with running differential gene expression analyses. Our modifications to this analysis are just to automatically pick the best normalization, etc options based on our public data. You will need to know which options are best given your own data.
Run our pipeline on your own computer. This will give you identical results to our pipeline but requires the most engineering to set up and run. You will need to set up a docker with all the dependencies pre-installed and then download and run the notebook on this docker.
A 3D protein viewer developed by Rachel Karchin's lab
We use the MuPIT 3D protein viewer from Rachel Karchin's lab at Johns Hopkins to provide this visualization to our users. From their Help Page:
MuPIT interactive is an online tool that allows you to map sequence variants from their genomic position onto protein structures. Viewing a variant on protein structure can be useful in interpreting its potential biological consequences. After mapping, the variants are displayed on an interactive 3d structure. The user may turn variants on and off, and display annotations on the protein structure.
Access this tool by going to our Visualization tab and following the wizard to select samples. Next, enter your gene of interest, click 'somatic mutation' and then click 'Done'. You may need to choose another variable such as 'gene expression'.
Once you have the mutation data you're interested in, click the menu at the top of the column and choose 'MuPIT View'. This will send your mutation data to MuPIT and open their viewer in a new tab.
MuPIT Help: http://mupit.icm.jhu.edu/MuPIT_Interactive/Help.html
On the left of the figure is the Xena mutation column view of ERBB2 somatic mutations from the TCGA breast cancer cohort. Users click on the MuPIT link from the caret menu at the top of the column, which sends all the mutations' genomic positions as well as their recurrence p-values to the MuPIT display. On the right side of the figure, MuPIT displays the mutations as bright green spheres of various sizes. Sphere size is determined by the recurrence p-value, with larger spheres for recurrent mutations. The MuPIT display shows that these ERBB2 somatic mutations cluster around the ERBB2 active site (ATP binding site in blue and proton acceptor site in teal).
There are 4 ways to download data
1. Download data in a single column of a Visual Spreadsheet: in a Visual Spreadsheet, click on the column Hamburger menu, then "Download" to download just the data from the column.
2. Download data in an entire Visual Spreadsheet: in a Visual Spreadsheet, click on the download icon in the upper right corner of the spreadsheet.
3. Bulk download a whole dataset file: click "Data Sets" in the top banner to navigate to the dataset of interest, where a download URL is shown on the page. You can also reach the dataset page by clicking on the column Hamburger menu, then "About". Click on the download URL to download the entire dataset, or use "wget" or "curl" to download from the command line.
4. Via our APIs:
Our files are tab-delimited or '.tsv'. We recommend opening them in your favorite spreadsheet program, such as Microsoft Excel, which will automatically convert the tabs into new columns. Please note that if you have many thousands of samples, Microsoft Excel will likely have difficulty opening the file. In this case, the command line may work better for you.
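For files that are too large for a spreadsheet program, a short script can read the tab-delimited data directly. The sketch below uses Python's standard csv module. The function name read_tsv and the inline sample text are illustrative assumptions that stand in for a real downloaded file.

```python
import csv
import io


def read_tsv(handle):
    """Return (header, rows) from a tab-delimited file-like object."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)           # first line holds the column names
    rows = [row for row in reader]  # remaining lines are data rows
    return header, rows


# Stand-in for a downloaded file; for a real file use
# open("your_download.tsv", newline="") instead of io.StringIO.
sample_text = "sample\tEGFR\nTCGA-BA-4074-01\t0.137\nTCGA-BA-4075-01\t\n"
header, rows = read_tsv(io.StringIO(sample_text))
print(header)
print(len(rows))
```

Missing values arrive as empty strings, so convert fields to numbers only after checking that they are not blank.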
More information about how we color mutation columns
Samples that have mutation data are white, with a dot or line marking where each mutation falls relative to the gene model at the top of the column. Mutation data is colored by functional impact:
Red - Deleterious
Blue - Missense
Orange - Splice site mutation
Green - Silent
Gray - Unknown
Samples for which there is no mutation data are gray with no dot or line, and are marked as 'null'.
Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins
Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins
Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant
Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region
Note that we are case insensitive when we color for these terms.
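As an illustration of case-insensitive matching, the sketch below looks up a color for an effect term. It includes only a subset of the terms listed above. The names EFFECT_COLORS and effect_color are hypothetical, and falling back to gray for unrecognized terms is an assumption made for this example, not a description of Xena's code.

```python
# Subset of the effect terms listed above; the full lists are longer.
EFFECT_COLORS = {
    "red": ["Nonsense_Mutation", "frameshift_variant", "stop_gained",
            "Frame_Shift_Del", "Frame_Shift_Ins"],
    "blue": ["missense_variant", "Missense_Mutation", "inframe_deletion",
             "In_Frame_Ins"],
    "green": ["synonymous_variant", "Silent", "3'UTR", "5'UTR"],
    "orange": ["intron_variant", "intergenic_region",
               "downstream_gene_variant"],
}

# Lower-case every term once so lookups ignore case.
LOOKUP = {
    term.lower(): color
    for color, terms in EFFECT_COLORS.items()
    for term in terms
}


def effect_color(effect):
    """Return the color for an effect term, ignoring case.

    Unrecognized terms fall back to gray (an assumption for this sketch).
    """
    return LOOKUP.get(effect.lower(), "gray")


print(effect_color("MISSENSE_VARIANT"))
print(effect_color("silent"))
```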
For the gene-level mutation datasets (Somatic gene-level non-silent mutation):
Red (=1) --> indicates that a non-silent somatic mutation (nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, inframe indels) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene
White (=0) --> indicates that none of the above mutation calls were made in this gene for the specific sample
Pink (=0.5) --> some samples have two aliquots. In the event that in one aliquot a mutation was called and in the other no mutation was called, we assign a value of 0.5.
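One way to reproduce these three values is to average the per-aliquot calls for a sample. This is only a sketch. The function name gene_level_value is hypothetical, and averaging is an assumption that happens to yield 0, 0.5, and 1 for the cases described above.

```python
def gene_level_value(aliquot_calls):
    """Average per-aliquot calls (1 = non-silent mutation called, 0 = none)."""
    return sum(aliquot_calls) / len(aliquot_calls)


print(gene_level_value([1, 0]))  # one aliquot with a call, one without
print(gene_level_value([1, 1]))  # mutation called in both aliquots
print(gene_level_value([0]))     # single aliquot, no mutation called
```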
Bookmarks are a great way to save a particular view in Xena, either for yourself or to share with others.
To bookmark a view, click on 'Bookmark' in the top navigation bar. From here you can either click 'Bookmark' to create a bookmark URL or click 'Export' to export a file that can then be imported back to the browser.
When you click 'Bookmark' you will then need to click 'Copy Bookmark' to copy the bookmark URL to your copy buffer. Large views may take a second or two to generate a URL.
Note that your filter and subgroup history, as well as the last Chart View you created, if any, will be saved as part of the bookmark.
Bookmarks are only guaranteed for 3 months
The 'Bookmark' option will store all the data in view on our servers and provide you a link. This is the easiest way to share a view. Note that if you have any private data in view, this option will be disabled to preserve your privacy. Please also note that if you lose the link there is no way to get it back.
If you choose Export, it will give you a file with everything Xena needs to recreate your view. You can then save this file and import it back into Xena. While this option can be a bit cumbersome, it will allow you to share private data. Note that these files are still only guaranteed for 3 months, though they may last for longer.
The 'Recent Bookmarks' option will temporarily show the 15 most recent bookmarks you have created. This can be useful if you're constructing many bookmarks. Note that this menu is frequently reset so do not use this as permanent storage for a bookmark.
When you create a bookmark link, we save the data in view on our servers. To protect user data privacy, we have disabled this option when private data is in view. Please use the Export/Import option instead.
Information on Xena data from GDC release v41.0
This help page is for the Genomic Data Commons (GDC) data we host from GDC Data Release 41.0 - August 28, 2024. We display all GDC open access genomic data and its accompanying phenotype/clinical data. Explore the GDC data on Xena.
In addition to the data from the GDC, we added two new phenotype/clinical fields to all GDC cohorts: age_at_earliest_diagnosis.diagnoses.xena_derived and age_at_earliest_diagnosis_in_years.diagnoses.xena_derived. This was done because some GDC cohorts had multiple diagnoses, each with their own age_at_diagnosis.diagnoses. When there were multiple ages, the Xena Visual Spreadsheet would display these fields as a category. In order to have a field that could always be displayed as a continuous feature, we created the age_at_earliest_diagnosis.diagnoses.xena_derived field, which has the smallest value when there were multiple entries. age_at_earliest_diagnosis_in_years.diagnoses.xena_derived was created similarly, but also dividing the number of days by 365.
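The derivation of these two fields can be sketched as follows. The function names are hypothetical and the handling of missing ages (skipping None) is an assumption. Only the "smallest value" rule and the division by 365 come from the description above.

```python
def age_at_earliest_diagnosis(ages_in_days):
    """Return the smallest recorded age in days, or None if none is recorded."""
    known = [age for age in ages_in_days if age is not None]
    return min(known) if known else None


def age_in_years(age_in_days):
    """Convert an age in days to years by dividing by 365."""
    return None if age_in_days is None else age_in_days / 365


# A case with two diagnoses, each with its own age at diagnosis (in days).
earliest = age_at_earliest_diagnosis([21900, 18250])
print(earliest)
print(age_in_years(earliest))
```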
For this release, we worked to exclude samples that have only phenotype/clinical data and no genomic data. This should make visualizing data in our Visual Spreadsheet easier.
You can still view data from the older GDC Data Release v18.0 - August 28, 2019. This data will be available until October 2025. After October 2025 the data from this release will only be available for download.
For the CPTAC-3 cohort, we noted that occasionally samples were pooled into the same aliquot before sequencing was performed. Xena's visualizations are sample-level, so for these pooled aliquots there are several samples with duplicate data. An example of this is case C3N-03011, where samples C3N-03011-04, C3N-03011-02, and C3N-03011-01 were all pooled into the aliquot CPT0226250007 before sequencing was performed.
A tool developed by the Stuart Lab to view samples in a 2D layout
UCSC TumorMap is a separate project developed by the Stuart Lab at UCSC. We link to them to help users gain another perspective on the data they are seeing in Xena. From their Overview page:
TumorMap is a tool that enables grouping samples based on their omic signatures in a visually accessible way. Similar to dimensionality reduction methods, Tumor Map method takes a high-dimensional omics space and produces a two dimensional visualization. Unlike most dimensionality reduction methods, the TumorMap method is able to combine multiple types of omics data (e.g. mRNA expression and methylation data types in a single map). Furthermore, TumorMap is an interactive tool that allows navigating through a tumor landscape that represents a heterogeneous multi-dimensional and multi-platform omic space of oncogenic signatures.
In the TumorMap, each node is a sample and clusters of samples indicate groups with similar oncogenic signatures and genomic alteration events. The samples in a map may be colored by various molecular, clinical, diagnostic, prognostic, and phenotypic annotations (e.g. tumor type, molecular subtype, etc.) to visualize associations with the data type used in clustering.
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, describing tumor tissue and matched normal tissues from more than 11,000 patients, is publicly available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
TCGA is our most used data resource. We host several versions of the TCGA data.
TCGA Pan-Cancer Atlas: As its concluding project, The Cancer Genome Atlas (TCGA) Research Network completes the most comprehensive cross-cancer analysis to date: The Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.
TCGA data from Genomic Data Commons: TCGA data uniformly re-analyzed at GDC using the latest Human Genome Assembly hg38. We download all open-access tier data from GDC and compile individual files into datasets organized by cohorts (33 individual tumor cohorts as well as a Pancan cohort). Xena displays the compiled datasets.
TCGA data in the UCSC RNA-seq Recompute Compendium: TCGA data has been co-analyzed with GTEx data using the UCSC bioinformatic pipeline (TOIL RNA-seq) and can be used to compare tumor vs normal gene and transcript expression from the matching tissue of origin. Xena hosts gene and transcript expression results of the UCSC RNA-seq recompute compendium.
Legacy TCGA data: Data generated and published by TCGA Research Network before the Pan-Cancer Atlas publications. Xena displays the level-3 data.
This paper helps clarify the differences between the Legacy TCGA data and the TCGA data on the GDC:
We recommend the TCGA Pan-Cancer (PANCAN) study for most analyses, unless you need a specific type of data or need to run a type of analysis listed below.
Why do we recommend this study?
We recommend it because it has the data from the Cancer Genome Atlas (TCGA) Research Network, which generated the most comprehensive cross-cancer analysis to date: The Pan-Cancer Atlas. Xena displays the curated genomics and clinical data generated by the Pan-Cancer Atlas consortium working groups.
Note that if you use the TCGA Pan-Cancer (PANCAN) to study a specific cancer type, you will need to filter down to just that cancer type.
If you don't want to filter ...
Our second most recommended datasets are the cancer-specific GDC TCGA studies. These avoid the need to filter down to a single cancer type and contain harmonized data from the Genomic Data Commons.
More information comparing the data in the GDC to the legacy TCGA data can be found here:
The table below assumes that you are interested in TCGA data. These data types may also appear in other studies, but these are the recommended studies.
The Xena Gene Sets Viewer https://xenagoweb.xenahubs.net/xena compares gene expression, somatic mutation, and copy number variation profiles of cancer-related gene sets across cancer cohorts. It queries genomics data hosted on public Xena Hubs, in a similar way to other tools in the Xena Visualization suite, and then generates gene set visualizations of those data.
Source code:
The Gene Set Viewer allows comparison of individual gene sets or pathways and their genes across two cancer tumor sample cohorts as well as comparison within the same sub cohorts.
As an overview, Figure 1 shows two cohorts, the left (olive background, TCGA Ovarian Cancer) and the right (tan background, TCGA Prostate Cancer). Figure 1A shows the selection for the analysis, Gene Set, view limit, and filter (differential versus similar). Figure 1B shows the view comparing the Mean Gene Set Score in the center and individual samples on the right. 1C shows the individual samples, with the hover result showing the sample and score in 1E. 1D provides a link directly into Xena for the given gene set. 1F provides a sharable URL link. 1G provides a login for use in uploading.
Figure 8 shows analysis of a GMT file using the BPA method [citation: thanks to Verena Friedl]. This is only available to logged in users and they may only see their own analysis and are limited to 100 pathways. Logins are any valid google login. Several public pathway sets are available including those curated from the Gene Ontology Consortium (thanks to Laurent-Philippe Albou) as well as those from the Hallmark [cite] and Pancan [cite] analyses.
BPA GENE EXPRESSION
PARADIGM IPL
REGULON ACTIVITY (only available for the LUAD Cohort)
CNV ∩ MUTATION
COPY NUMBER
MUTATION
| Data type | Study | Dataset name | Menu |
| --- | --- | --- | --- |
| Transcript expression | TCGA Pan-Cancer (PANCAN) | TOIL Transcript expression | Advanced |
| lncRNA expression | TCGA Pan-Cancer (PANCAN) | TOIL Gene expression | Advanced |
| Exon expression | legacy TCGA datasets (per cancer type) | Exon expression | Advanced |
| miRNA expression | TCGA Pan-Cancer (PANCAN) | Batch Effects normalized miRNA data | Advanced |
| DNA methylation | Any | DNA methylation | Advanced |
| ATAC-seq | GDC Pan-Cancer (PANCAN) | ATAC-seq | Advanced |
| Varied Survival endpoints | TCGA Pan-Cancer (PANCAN) | NA (run KM plot) | -- |
| Analysis | Study |
| --- | --- |
| Compare Tumor vs Normal | TCGA, TARGET, GTEx |
| GRCh38 coordinates | Any GDC study |
| Cell Line | CCLE |
| Disease specific survival, disease free survival, progression free survival | TCGA Pan-Cancer (PANCAN) |
We support a wide variety of data types including:
SNPs and small INDELs
Large structural variants
Segmented copy number, gene-level copy number
Gene-, Transcript-, Exon-, Protein-, LncRNA-, and miRNA-expression
DNA methylation (genes and probes)
Phenotype, clinical data
Signature scores, classifications, derived parameters
The types of data in each study vary considerably and depend on the analyses that particular study performed
If you need a particular type of data, please see choosing a study/cohort to help you find the study with that type of data
Xena's Transcript View shows transcript-specific expression or isoform percentage for 'tumor' TCGA data and 'normal' GTEx data. It allows you to compare the distribution of these values for two groups of patient samples.
This tool was created by Akhil Kamath as part of Google Summer of Code 2017. Akhil was advised by Angela Brooks and Brian Craft. Thank you Akhil for all your work!
Enter the HUGO name of your gene of interest and click 'OK'. Choose your two studies of interest from the two drop down menus. Each row in the visualization shows the transcript, transcript structure and density plots showing range of expression of that transcript.
Change the units from TPM (Transcripts Per Million) to isoform percentage using the drop-down near the top. To zoom in on a row, click on it. To zoom out, click on the row again.
All RNAseq data was generated by the Toil pipeline recompute done by the UCSC Computational Core using the RSEM package. All transcripts are from Gencode V23 comprehensive annotation.
For this visualization, we numbered the exons using an in-house automated method which may not line up with exon numbering in the literature. This method is subject to change and should not be relied on to denote any exon going forward.
Regions that are intronic in all transcripts are removed. The remaining exonic regions are numbered 1..N. Different exons within a given region are labeled starting with ‘a’ for the left-most exon (in transcript direction).
For example, exon 3 is the unique exon in the third exonic region. Exons 4a and 4b are two different exons in the fourth exonic region.
Another way to say this is: different exons across all transcripts which overlap transitively will be assigned the same integer. So if one transcript has exons 4a and 4c, there must be exons in other transcripts that overlap them, and each other.
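The numbering rule described above can be sketched as follows. This is an illustration, not the in-house method itself: the function name label_exons and the coordinates are hypothetical, exons are given as closed (start, end) intervals, and the transcripts are assumed to be on the plus strand so that "left-most" means the smallest start coordinate.

```python
from string import ascii_lowercase


def label_exons(transcripts):
    """Map each distinct exon (start, end) to a label such as '3' or '4a'."""
    # Distinct exons across all transcripts, ordered left to right.
    exons = sorted({exon for tx_exons in transcripts.values() for exon in tx_exons})

    # Group exons that overlap transitively into exonic regions.
    regions = []
    region_end = None
    for start, end in exons:
        if regions and start <= region_end:
            regions[-1].append((start, end))
            region_end = max(region_end, end)
        else:
            regions.append([(start, end)])
            region_end = end

    # Number regions 1..N; add a letter only when a region has several exons.
    # zip stops at 'z', so this sketch handles at most 26 exons per region.
    labels = {}
    for number, region in enumerate(regions, start=1):
        for letter, exon in zip(ascii_lowercase, region):
            labels[exon] = f"{number}{letter}" if len(region) > 1 else str(number)
    return labels


transcripts = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (350, 450), (500, 600)],
}
labels = label_exons(transcripts)
for exon in sorted(labels):
    print(exon, labels[exon])
```

In this example the second exonic region contains two different overlapping exons, so they are labeled 2a and 2b, while the first and third regions each contain a single exon and get a bare number.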
GDC data portal for the GDC Hub data
GDC legacy archive for the TCGA Hub data
ICGC data portal for the ICGC Hub data
Pan Cancer Atlas publications’ data site for the Pan-Cancer Atlas Hub data
TCGA ATAC-seq publication’s data site for the ATAC-seq Hub data
Nature biotechnology publication for the UCSC Toil RNAseq Recompute Hub data
Various journal publications for UCSC Public Hub data
For comparison across multiple or all TCGA cohorts. Dataset was generated by the TCGA PanCan Atlas project and has been normalized for batch effects. Please see the for more information.
Generated by the , this data can be used to compare across TCGA cohorts as well. May not have as many batch effects removed as the PanCan Atlas work.
The goal of the Toil recompute was to process ~20,000 RNA-seq samples to create a consistent meta-analysis of four datasets free of computational batch effects. This is best used to compare TCGA cohorts to TARGET or GTEx cohorts
For comparison within a single TCGA cohort, you can use the "gene expression RNAseq" data. Values in this dataset are log2(x+1), where x is the RSEM value.
For questions regarding the gene expression of a particular cohort in relation to other types of tumors, you can use the pancan normalized version of the "gene expression RNAseq" data. Values in this dataset are generated at UCSC by first combining the "gene expression RNAseq" values (above) of all TCGA cohorts and then mean normalizing all values per gene. This data was then divided into the 30-40 cancer types after normalization so that it is available for each cancer type. Since there are 30-40 cancer types with RNAseq data, the TCGA pancan data can serve as a proxy for the background distribution of gene expression.
For comparing with data outside TCGA, you can use the percentile version if your non-TCGA RNAseq data is normalized by percentile ranking. Values in this dataset are generated at UCSC by ranking RSEM values per sample. The values are percentile ranks ranging from 0 to 100, where lower values represent lower expression. You can also combine the TCGA RNAseq data with your RNAseq data, perform normalization across the combined dataset using whatever method you choose, then analyze the combined dataset further.
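If you want to put your own RNAseq values on a 0 to 100 percentile scale, a per-sample ranking could look like the sketch below. This is not the code used at UCSC. The function name percentile_ranks and the choices about ties and scaling are assumptions made for illustration.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Rank one sample's expression values onto a 0-100 scale.

    Each value's rank is the number of values strictly below it, scaled so
    the lowest value maps to 0 and the highest to 100. Tied values share
    the same percentile.
    """
    ordered = sorted(values)
    denominator = max(len(ordered) - 1, 1)  # avoid dividing by zero
    return [100 * bisect_left(ordered, value) / denominator for value in values]


# Expression values for three genes in a single sample.
print(percentile_ranks([5.0, 1.0, 3.0]))
```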
Log transformed means that the output values from the gene expression caller/program have been put through the following transformation:
log2(x+theta) = y
Where x is the TPM, RSEM, etc value, "theta" is a very small value (1, 0.01, etc) added to x since you can not take the log of zero, "log2" is log base 2, and y is the transformed value.
log(A/B) = log(A) - log(B)
So, within our downloads (either from our bulk downloads or just a slice of the data that has not been mean normalized), say you have 2 samples with expression for a gene. In our downloads, one sample is 4 and one sample is 1. This means, because our values are log transformed,
log(A) = 4
log(B) = 1
Therefore:
log(A/B) = 4 - 1
log(A/B) = 3
This gives you a log2(fold change) of 3, which corresponds to a 2^3 = 8-fold change in the untransformed values.
Please note that in this case we are reporting the log(fold change). Biologists often use the log(fold change) because without taking the log, down regulated genes would have values between 0 and 1, whereas up regulated genes would have any value between 1 and infinity. This distribution makes graphing and further statistical analysis difficult. Taking the log typically makes the resulting values more normally distributed, which is better for further analysis.
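The arithmetic above can be checked with a few lines of Python. The function names are hypothetical. Because the values are log base 2, a difference of 3 between two log transformed values corresponds to a linear fold change of 2^3 = 8, or more precisely to the ratio (A + theta) / (B + theta), which is close to A/B when theta is small relative to the values.

```python
import math


def log2_transform(x, theta=1.0):
    """Apply the log2(x + theta) transformation described above."""
    return math.log2(x + theta)


def fold_change_from_logs(log_a, log_b):
    """Convert a difference of log2 values into a linear fold change."""
    return 2 ** (log_a - log_b)


log_fold_change = 4 - 1  # log(A) - log(B) from the example above
print(log_fold_change)
print(fold_change_from_logs(4, 1))
```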
Example command to get the manifest
aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga-manifest . --request-payer requester
Now you can take a look at the manifest to see the TCGA files
Example command to download a single TCGA file
aws s3 cp s3://cgl-rnaseq-recompute-toil/tcga/0106d51d-d581-4be7-91f3-b2f0c84468d1.tar.gz . --request-payer requester
The TCGA RPPA data are generated at MD Anderson. RPPA data are values generated using the method described at . We download the RPPA values from TCGA DCC.
The RPPA_RBN data are normalized values generated using the RBN (replicate-based normalization) method developed by MDACC. For more information: . We downloaded the RBN values from synapse at .
The methylation 450k dataset has . However, we have discovered the range of data for each dataset to be slightly different. As such, we recommend applying some sort of normalization. We recommend looking in the literature to see what methods people have used.
Many copy number estimation algorithms estimate copy number variation on a continuous scale even though they are measuring something discrete (i.e. the number of copies of a piece of a chromosome or a gene in the cell). The GISTIC 2 thresholded data attempts to assign discrete numbers to these fragments by thresholding the data. The estimated values -2, -1, 0, 1, and 2 represent homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively. More information can be found in the and at the , which is the group that processed this data.
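When working with a downloaded thresholded dataset, it can help to translate the five values into their meanings. The lookup below is a small sketch. The names GISTIC_CALLS and describe_gistic are hypothetical, and the labels are taken directly from the description above.

```python
GISTIC_CALLS = {
    -2: "homozygous deletion",
    -1: "single copy deletion",
    0: "diploid normal copy",
    1: "low-level copy number amplification",
    2: "high-level copy number amplification",
}


def describe_gistic(value):
    """Return the meaning of a thresholded copy number value (-2 to 2)."""
    return GISTIC_CALLS[int(value)]


for value in (-2, 0, 2):
    print(value, describe_gistic(value))
```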
As of March 2019, our transcript-level data is in the . From here choose 'Advanced' and select any of the transcript-level expression datasets. Enter your transcript of interest as an Ensembl identifier (not a gene).
The following instructions assume that your data has been log transformed. All the RNAseq data in Xena public data hubs have already been log transformed, either by us or by the data providers. You can always confirm this by viewing the dataset details page (start at our and drill down until you get to the details page for the dataset).
When comparing these log transformed values, we use the :
Yes! We host it on AWS. Note that due to how large the files are, you will need to pay the egress fees to download the files. To get started, first look through the manifests for TCGA: , TARGET: , and GTEx and decide which files you want. Then using your AWS account, download the files. if you run into any issues.
To make a KM plot, click on the column menu at the top of a column and choose 'Kaplan Meier Plot'.
More information about KM plots can be found in our Overview of Kaplan Meier Plots.
Please see our Viewing your own Data documentation:
There are 2 basic data formats and 2 advanced data formats. Each of these formats has one or more biological data types that it supports.
We support most types of genomic and phenotype/clinical/sample annotations. For genomic data we support calls made on the raw data including but not limited to expression calls, mutation calls, etc. This is what TCGA calls ‘Level 3’ data and is typically a value on gene, transcript, probe, etc. We do not support FASTQ, BAMs, or other ‘raw’ files. Please contact us if you have any questions.
We support tab-delimited and Microsoft Excel files (.xlsx and .xls). Tab-delimited files generally have a file name ending in .tsv or .txt, though we do not require this. Note that we load tab-delimited files much faster than Excel files. You can export a Microsoft Excel file as a tab-delimited file using the 'Save as ...' function.
Please do not have any duplicate genes/probes/identifiers or samples. We will allow you to load with duplicates but will only display the first one encountered in the file.
We assume you use a '.' to indicate a decimal place as opposed to a ',' .
Here is a folder with example data in addition to the examples below.
These are numeric data called on genomic regions (e.g. exon expression or gene-level copy number). This data is in a rectangle where samples are columns and rows are the genomic regions (e.g. HUGO gene symbol, transcript ID, probe ID, etc). We also support samples as rows and genomic regions as the columns (i.e. the opposite orientation). For supported genomic regions, please see supported gene and probe names.
RNA-seq expression (exon, transcript, gene, etc)
Array-based expression (probe, gene, etc)
Gene-level mutation
Gene-level copy number
DNA methylation
RPPA
and more ...
Contact us if you're unsure if we will support your data
For samples that do not have expression for a particular gene, either have a blank field or use "NA".
An example of a genomic matrix file (in this case, expression):
These are data on a sample or patient that are categorical in nature (e.g. Tumor Stage or 'wild type' or 'mutant' for a gene) or are numerical but non-genomic (e.g. age or a genomic signature). Samples can be columns and phenotype/clinical/sample fields can be rows, or vice versa. We support both orientations.
phenotype/patient/clinical data (age, weight, if there was blood drawn, etc)
sample/aliquot data (where it was sequenced, tumor weight, etc)
derived data (regulon activity for a gene, etc)
genomic signatures (EMT signature score, stemness score, etc)
other (whether a sample has an ERG-TMPRSS2 fusion, whether a sample has WGS data available, etc)
This is our most flexible data type. If you are wondering if your data is considered to be 'phenotypic' please contact us.
We support both numerical and categorical data. For numerical data please use a blank field for any samples which may be missing data. For categorical data you can use a blank field or "NA" for any samples which may be missing data.
Note that if you use "NA" for a missing numerical field then the Xena software will automatically treat that column as a category.
To have it be treated as a numerical field please use a blank field.
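The sketch below writes a small phenotype file that follows these rules: missing numerical values become blank fields, while a missing categorical value may be written as "NA". The function name write_phenotype_tsv is hypothetical, and the rows are illustrative values in the style of the examples in this documentation.

```python
import csv
import io


def write_phenotype_tsv(rows, fieldnames, handle):
    """Write phenotype rows as tab-delimited text, leaving None as a blank field."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # None becomes an empty field so a numerical column is not
        # turned into a category by an "NA" string.
        writer.writerow(
            {key: "" if row.get(key) is None else row[key] for key in fieldnames}
        )


rows = [
    {"sample": "TCGA-BA-4074-01", "ER_status": "positive", "age": 63},
    {"sample": "TCGA-BA-4082-01", "ER_status": "NA", "age": None},
]
buffer = io.StringIO()  # stand-in for open("phenotype.tsv", "w", newline="")
write_phenotype_tsv(rows, ["sample", "ER_status", "age"], buffer)
phenotype_tsv = buffer.getvalue()
print(phenotype_tsv)
```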
For more information about configuring your phenotype fields, such as controlling the order for categorical features, please see our Metadata Specifications.
An example of a phenotype matrix file:
For segmented data, we require the following 5 columns: sample, chr, start, end, and value. Note that your column headers must be these names exactly!
Please use 'NA' to indicate no data.
copy number
We currently accept hg38, hg19, hg18 coordinates.
Example segmented copy number data with required columns:
For positional data, we require 6 columns: sample, chr, start, end, reference, alt. Note that your column headers must be these names exactly!
Other columns that may follow are: gene, effect, DNA_VAF, RNA_VAF, and Amino_Acid_Change. These columns are not required but will enhance the visualization of this data. For example, the "gene" column enables displaying mutations when querying by gene name in addition to querying by genomic coordinates. The "effect" column will color the mutations by effect (the default color is gray). The effect terms include "Nonsense" (red), "Frameshift" (red), "Splice" (orange), "missense" (blue), and "Silent" (green). The full list of accepted terms can be found here in our code.
Note that Xena will not call the gene, variant effect, etc for you. All gene annotation information must be included in the file
mutation data
We currently accept hg38, hg19, hg18 coordinates.
Example mutation data with the six required columns, plus the gene column:
To specify that a sample was assayed but no mutation was detected, you need a line in the file with three columns filled: sample, start, end. "start" and "end" must be integers (if left empty, the data loader will reject the file), so use -1 to indicate that these are bogus coordinates. The rest of the columns are empty strings.
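A mutation file containing such a placeholder line could be generated as in the sketch below. The helper name no_mutation_row and the sample ID SAMPLE-NO-MUTATION are hypothetical. The column names and the -1 convention come from the description above.

```python
import csv
import io

COLUMNS = ["sample", "chr", "start", "end", "reference", "alt", "gene"]


def no_mutation_row(sample):
    """Row for a sample that was assayed but has no detected mutation."""
    row = dict.fromkeys(COLUMNS, "")             # every other column stays empty
    row.update(sample=sample, start=-1, end=-1)  # -1 marks bogus coordinates
    return row


rows = [
    dict(zip(COLUMNS, ["TCGA-AB-2802-03", "chr2", 29917721, 29917721, "G", "A", "ALK"])),
    no_mutation_row("SAMPLE-NO-MUTATION"),  # hypothetical sample ID
]

buffer = io.StringIO()  # stand-in for open("mutations.tsv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
mutation_tsv = buffer.getvalue()
print(mutation_tsv)
```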
We support a number of other specialty data types such as structural variants. Please contact us if you have this data so we can help you load it.
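Example genomic matrix file (expression):
| Sample | TCGA-BA-4074-01 | TCGA-BA-4075-01 | TCGA-BA-4076-01 |
| --- | --- | --- | --- |
| ACAP3 | 0.137 | NA | 0.022 |
| CTRT2 | 0.024 | 0.805 | 0.256 |
| ALK | 0.098 | 0.805 | 1.87 |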
Example phenotype matrix file:
| sample | ER_status | disease_status | age |
| --- | --- | --- | --- |
| TCGA-BA-4074-01 | positive | complete remission | 63 |
| TCGA-BA-4082-01 | positive | complete remission | 54 |
| TCGA-BA-4078-01 | negative | undergoing treatment | 65 |
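Example segmented copy number data:
| sample | chr | start | end | value |
| --- | --- | --- | --- | --- |
| TCGA-V4-A9EL-01 | chr1 | 61735 | 16815530 | 0.041 |
| TCGA-V4-A9EL-01 | chr1 | 16816090 | 17190862 | -0.4227 |
| TCGA-V4-A9EF-01 | chr4 | 86979944 | 115173700 | 0.0414 |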
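Example mutation data:
| sample | chr | start | end | reference | alt | gene |
| --- | --- | --- | --- | --- | --- | --- |
| TCGA-AB-2802-03 | chr2 | 29917721 | 29917721 | G | A | ALK |
| TCGA-AB-2802-03 | chr1 | 119270684 | 119270687 | TTAAA | T | MYC |
| TCGA-AB-2867-03 | chr1 | 150324146 | 150324146 | T | G | PRPF3 |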
You've run your analysis and are ready to publish your paper - congratulations! Cite the paper below to thank Xena and keep our project funded.
Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8
You can also read our paper for free at bioRxiv: https://www.biorxiv.org/content/10.1101/326470v6
If you are adding in new samples, this will require you to combine outside of Xena and then load. If you are adding new data on samples we already have, then simply load the data into a Xena Hub.
We apologize but we don't provide a simple way to do this because of the batch effects that would be present when combining most data across studies. You will need to download the data you wish to combine from TCGA, combine it yourself outside of Xena, and then load it into your own Xena hub.
Download TCGA data through our data pages
Load your data into your own Xena hub, making sure to select the cohort that you want to view your data side-by-side with when loading it.
Sample names and format are study specific. You will need to match what we already have in Xena.
Our data pages have more information about the sample names for a study
Note that if you want to view a genomic signature on our gene expression data, you can do so using our genomic signature feature.
Institutional Xena Hubs allow you to share data, visualizations, and analyses with a specific group of people. Xena Hubs can be set up on any server or in the cloud. You control who has access to the Xena Hub by controlling who has access to the server on which it is hosted.
To make your data publicly available, simply make the server open to the web.
First, download the ucsc_xena_xxx.tar.gz file to your server, here:
https://genome-cancer.ucsc.edu/download/public/get-xena/index.html
The file to download is the one called "Tar archive, no updater or JRE - recommended for linux server developments". Uncompress and extract the .jar file (cavm-xxx-standalone.jar). The current version is 0.25.0.
The hub can be started with "java -jar cavm-xxx-standalone.jar". Passing option --help will display usage information.
Note that you need to use Java 8 to run the hub.
There are several options you will want to set.
To bind an external interface (instead of loopback), use "--host 0.0.0.0".
The connection between your hub and the Xena Browser is over HTTPS. Use the "--certfile" and "--keyfile" options to set the certificate and private key.
There are three paths that can be configured: the database file, the log file, and the root directory for data files to be served. These are set by --database, --logfile, and --root. If you don't set these, they will default to paths under ${HOME}/xena.
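As a rough illustration of how the options above fit together, the sketch below assembles a start command in Python. It only uses the flags named in this section; the jar name and all file paths are placeholders, not values we ship, so substitute your own.

```python
import shlex
from typing import List, Optional


def hub_start_command(
    jar: str,
    host: str = "0.0.0.0",
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
    database: Optional[str] = None,
    logfile: Optional[str] = None,
    root: Optional[str] = None,
) -> List[str]:
    """Build the argument list for starting a hub with the options described above."""
    args = ["java", "-jar", jar, "--host", host]
    optional = [
        ("--certfile", certfile),
        ("--keyfile", keyfile),
        ("--database", database),
        ("--logfile", logfile),
        ("--root", root),
    ]
    # Only pass a flag when a value was given, so the hub falls back to its
    # defaults under ${HOME}/xena for anything left unset.
    for flag, value in optional:
        if value is not None:
            args += [flag, value]
    return args


# Placeholder jar name and paths, for illustration only.
command = hub_start_command(
    "cavm-xxx-standalone.jar",
    certfile="/path/to/cert.pem",
    keyfile="/path/to/key.pem",
    root="/data/xena/files",
)
print(shlex.join(command))
```

Building the command as a list keeps each path as a single argument even if it contains spaces; `shlex.join` is only used to print a copy you could paste into a shell.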
Copy the content below to a file "start_script"
Link server.jar to cavm-x.xx.x-standalone.jar
Make "start_script" executable
Run "./start_script"
Your hub is now running on "https://computer-external-ip:7223".
When a Xena Hub starts, it opens two consecutive ports, for http and https connections, e.g. 7222 and 7223. HTTP is always the lower number, and HTTPS is always the higher number. This means your hub has two URLs:
http://ip:7222 or https://ip:7223
Connecting via HTTP to the hub is no longer supported by modern web browsers, thus you will need to connect via HTTPS. To do this you will need an HTTPS certificate and private key. Paths to the cert and key are set with --certfile and --keyfile. This might seem redundant for a hub behind a firewall, but the web app has no influence over the security policies of the web browser. HTTPS certificates can be acquired from free public Certificate Authorities, or via NIH InCommon.
Note that the section below details a way to use ssh tunneling to get around this, which should be used for testing purposes only.
You will need to make your data file ready just like for local Xena hub on your laptop. Please see instructions on data format specifications.
You will also need to make your data's meta-data file (xxx.json) ready. Please see loading data from the command line for instructions.
Once the hub is running, and input files have been placed in the --root directory, a file can be loaded by running the jar a second time, with the -l option, like
If your hub is run on the default 7222 port, you can load data with
If your hub is running on a different port, you load data with
Please contact us at genome-cancer@soe.ucsc.edu for more assistance.
If your hub is run on the default 7222 port, you can delete data with
If your hub is running on a different port, you delete data with
Go to the Data Hub page here and add "https://computer-external-ip:7223".
You can now go to the visualization and add a cohort or study listed in your hub.
If you don't have a security certificate yet but you would like to verify that the hub is working you can use ssh tunneling. An example of how to do this for AWS is below, where it is assumed that the xena hub is running on port 7222 for http and 7223 for https. In this scenario, you start the hub without using --certfile and --keyfile options.
Assuming that you typically ssh into EC2 on AWS like this,
you will now set up an ssh tunnel to port 8000 on your computer. To do this we add the -L option:
Now on your computer, http://localhost:8000 is the same as the http://aws-ip:7222. Chrome Browser does not allow a connection to http://aws-ip:7222, but it will allow a connection to http://localhost:8000.
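The port mapping described above can be sketched as follows. This assumes a plain `ssh user@host` login; `user@aws-ip` is a placeholder, and any extra options you normally pass to ssh (such as an identity file) would need to be added.

```python
import shlex
from typing import List


def tunnel_command(remote: str, local_port: int = 8000, hub_http_port: int = 7222) -> List[str]:
    """Forward local_port on this computer to the hub's http port on the remote server."""
    # "localhost" here is resolved on the remote server, where the hub is running.
    forward = f"{local_port}:localhost:{hub_http_port}"
    return ["ssh", "-L", forward, remote]


print(shlex.join(tunnel_command("user@aws-ip")))
```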
After setting up the ssh tunnel, go to the Data Hub page here and add "http://localhost:8000".
Alternatively, you can run the hub behind a reverse proxy, and attach the certificate and keyfile to Apache, Nginx or AWS load balancer configurations. In this scenario, you start the hub without using --certfile and --keyfile options. This is useful if you want your hub to have a url like "https://tcga.xenahubs.net". You set up your DNS to point the hostname (tcga.xenahubs.net) to ip address of the server on which the hub is running.
An example apache configuration on AWS VM
in /etc/httpd/conf/httpd.conf
If you have a markdown file called $DOCROOT/meta/info.mdown in your hub's document root directory, the markdown file will serve as a splash page for your hub. An example is the UCSC Toil RNA-seq Recompute hub: https://toil.xenahubs.net. The corresponding markdown file is this.
<button class="hubButton" data-cohort="TCGA TARGET GTEx">Launch Xena</button>
To add a clickable button in the hub landing page, make sure the button has classname 'hubButton'. You also need to specify the cohort to view, defined by the data parameter 'data-cohort'. Once users click the button, the visualization wizard will be launched to the specified cohort. You can change the button label.
You can also have a landing page for a study cohort. An example is the TCGA TARGET GTEx cohort: https://xenabrowser.net/datapages/?cohort=TCGA%20TARGET%20GTEx. The corresponding markdown file is this. The study cohort landing page is also a markdown file, which must be hosted in the https://github.com/ucscXena/cohortMetaData repository on GitHub. The markdown file is called https://github.com/ucscXena/cohortMetaData/cohort_$cohortName/info.mdown.
<button class="cohortButton" data-bookmark="bc7f3f46b042bcf5c099439c2816ff01">Example: compare FOXM1 expression</button>
The button must have the classname 'cohortButton'. If you have the data parameter 'data-bookmark', clicking the button will take the user to the bookmark view. If you don't have the 'data-bookmark' parameter, clicking the button will take the user to the visualization wizard with an empty spreadsheet. You can change the button label. You can add as many buttons as you want.
As Xena does not generate any of the data it displays, there is no Xena Data Use Agreement. If you use data from Xena, please cite us.
Please check with the original data providers (e.g. the GDC) for any data use restrictions. You can see more about our data providers by clicking on the Hub page.
When there are no overlapping segments, Xena displays the value and color of the copy number segment as indicated in the column legend at the bottom of the column.
When there are overlapping segments, Xena follows these steps:
Compute overlaps by slicing segments that overlap with other segments. For example if there was one segment from chr1:10000-20000 and a second segment from chr1:10050-10100, then resulting segments from this step would be chr1:10000-10050, chr1:10050-10100, and chr1:10100-20000.
For each segment defined in step 1, determine which segments in the original data overlap with this segment.
Divide data segments into those that are greater than copy number neutral (i.e. are amplifications) and those that are less than copy number neutral (i.e. are deletions). Average the segments for each of these two groups.
Find the colors corresponding to the two averages from step 3. Then pick a color that is in between those two colors on the color wheel. An example would be that if the amplifications are red and deletions are blue, the resulting color from a strong amplification and a strong deletion would be purple. Note that copy number neutral in this example would be white.
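Steps 1 through 3 above can be sketched in Python as shown below. This is an illustration of the described procedure, not Xena's actual code. It assumes segments on a single chromosome given as (start, end, value) tuples and a copy number neutral value of 0; the color blending in step 4 is left out.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int, float]  # (start, end, copy number value)


def slice_segments(segments: List[Segment]) -> List[Tuple[int, int]]:
    """Step 1: cut at every segment boundary so that no two slices partially overlap."""
    cuts = sorted({pos for start, end, _ in segments for pos in (start, end)})
    slices = zip(cuts, cuts[1:])
    # Keep only slices that are covered by at least one original segment.
    return [(s, e) for s, e in slices if any(a <= s and e <= b for a, b, _ in segments)]


def summarize(
    segments: List[Segment], neutral: float = 0.0
) -> List[Tuple[int, int, Optional[float], Optional[float]]]:
    """Steps 2 and 3: per slice, average the amplifications and the deletions separately."""
    result = []
    for s, e in slice_segments(segments):
        values = [v for a, b, v in segments if a <= s and e <= b]  # step 2
        amps = [v for v in values if v > neutral]
        dels = [v for v in values if v < neutral]
        amp_avg = sum(amps) / len(amps) if amps else None
        del_avg = sum(dels) / len(dels) if dels else None
        result.append((s, e, amp_avg, del_avg))
    return result


# The example from step 1, with made-up values: one amplified segment and
# one deleted segment nested inside it.
example = [(10000, 20000, 1.0), (10050, 10100, -1.0)]
for row in summarize(example):
    print(row)
```

Only the middle slice (10050 to 10100) has both an amplification average and a deletion average, so it is the only slice whose color would be a blend of the two.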
The Visual Spreadsheet wizard asks that you add at least TWO columns of data before interacting with the browser. This is because Xena was designed to allow you to find correlations within the data and you need more than one type of data on the screen to find a trend.
Add another column of data and click Done. You can always delete this column after you have completed the wizard if it is not needed.
In general, we recognize genes from the HUGO gene name space. If your gene name isn't recognized, try looking at Gene Card to see if other names listed there are recognized.
We will automatically detect and map your probes/transcripts/identifiers to HUGO gene names. For instance, we will map Affy probe IDs to HUGO gene names so that you can enter a HUGO gene name when creating a column in the Visual Spreadsheet and we will pull up the corresponding Affy probes.
You can still load your data if you do not see your identifiers listed. We will just not map them to HUGO genes for you. This means that in the visualization you will need to enter your identifiers as they appear in your file.
Affy U133 array (hg19) e.g. 1007_s_at
Affy HumanExon1.0ST (hg18) e.g. 2315101
Affy Human Gene 1.0 ST array (hg19) e.g. 7896736
Affy Human SNP6 array (hg18) e.g. CN_473963
Agilent Human gene expression 4X44K array (hg18) e.g. A_23_P100001
Agilent SurePrint G3 Human CGH array 2x400K (hg18) e.g. A_16_P01651995
Agilent Human 1A array (hg18) e.g. A_23_P149050
Exon: GENCODE 19 e.g. ENSE00000327880.1
Infinium HumanMethylation27 array GDC version (hg38) e.g. cg00000292
Infinium HumanMethylation27 array TCGA legacy version (hg18) e.g. cg26211698
Infinium HumanMethylation450 array TCGA legacy version (hg19) e.g. cg13332474
Infinium HumanMethylation450 array GDC version (hg38) e.g. cg00000029
HUGO: human gene symbol (hg18) e.g. TP53
HUGO: human gene symbol (hg19) e.g. TP53
HUGO: human gene symbol (hg38) e.g. TP53
Gene: Ensembl human genes (hg19) e.g. ENSG00000223972
Gene: Ensembl human genes (hg38) e.g. ENSG00000223972
Gene: GENCODE 19 e.g. ENSG00000223972.4
Gene: GENCODE 22 comprehensive e.g. ENSG00000223972.5
Gene: GENCODE 23 comprehensive e.g. ENSG00000223972.5
Gene: GENCODE 23 basic e.g. ENSG00000223972.5
Gene: UCSC Known genes (hg18) e.g. uc001aaa.1
Gene: UCSC Known Genes (hg19) e.g. uc001aaa.1
Transcript: GENCODE 19 comprehensive e.g. ENST00000456328.2
Transcript: GENCODE 23 comprehensive e.g. ENST00000456328.2
Transcript: GENCODE 23 basic e.g. ENST00000456328.2
Transcript: RefSeq (hg19) e.g. NM_000014
miRNA miRBase v13 stem-loop (hg18) e.g. hsa-mir-1977
miRNA miRBase v20 stem-loop (hg19) e.g. hsa-mir-1302-2
Contact us if you don't see your gene or probe names in this list and we may be able to add it for you.
If it looks like we picked the wrong set of probes, please click 'Advanced' next to the 'Import' button on the last screen of the wizard to load data. You can then pick the appropriate probes.
metadata (.json file) specification
Maps are t-SNE, UMAP, or PCA embeddings in 2D or 3D, or spatial maps for spatial data.
map is a list in the .json metadata file
For each map
"label" free text. Display label of the map, should be easily readable by users
"dataSubType" a string. Describes the nature of the map; must be either embedding or spatial. Note this is the dataSubType attribute for the map, not the dataSubType attribute for the file
"dimension" a list of strings. They are the column headers of the dimension columns in the data file. They are used to retrieve data from db.
If it is a spatial map, there might be microscopy image(s) associated with each map.
"unit" (optional, only relevant to spatial map) a string. The unit of map values, e.g. pixel, micrometer
"micrometer_per_unit" (optional, only relevant to spatial map) a floating point number. The physical size in micrometers (µm) of a value of 1 in the spatial map. This parameter is used to render the scale bar in the spatial map. If not specified, the scale bar will not be shown.
"spot_diameter" (optional, relevant to spatial map) a floating point number of the size of spot in map unit (not image unit). The parameter will be used to determine sphere size shown in spatial map. If not specified, the size of the sphere will be determined by the browser.
For each image
"label" free text. Display label of the image, should be easily readable
"path" file path to the image file
"offset" an array of integers. Image offset in pixel in x and y dimension. See below for conversion from spatial coordinate values in map to pixel position in this image.
"image_scalef": floating point number. A scaling factor that converts spatial coordinate values in the spatial map (e.g. pixel or micrometer) to the pixel unit in this image. It works together with the "offset" parameter to convert spatial coordinate values in the spatial map to the actual pixel positions in this image.
pixel_in_image_x = image_scalef * spatial_coordinate_x + offset_x
pixel_in_image_y = image_scalef * spatial_coordinate_y + offset_y
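The two formulas above translate directly into a small helper. The function name and the sample numbers below are only for illustration.

```python
from typing import Sequence, Tuple


def map_to_image_pixel(
    x: float, y: float, image_scalef: float, offset: Sequence[int]
) -> Tuple[float, float]:
    """Convert a spatial map coordinate to a pixel position in the image."""
    offset_x, offset_y = offset
    return (image_scalef * x + offset_x, image_scalef * y + offset_y)


# A map coordinate of (100, 50) with image_scalef 0.5 and offset [10, 20].
print(map_to_image_pixel(100.0, 50.0, 0.5, (10, 20)))  # (60.0, 45.0)
```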
Note: the transcript coordinates must be in the same unit and scale as the map. Therefore, no scaling or offset is needed to convert between transcript coordinates and map coordinates.
"label" free text. Display label of the transcript data, should be easily readable
"path" file path to the transcript datafile
"dimension" a list of strings. Must have the same number of dimensions as the map. They are the column headers of the dimension columns in the transcript file. They are used to retrieve data from db.
Example, map without image
Example, spatial map with matching microscopy image
Example, spatial map with matching microscopy image and transcript data
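If you generate your metadata with a script, the map attributes listed above could be assembled along these lines. This is a minimal sketch covering only the first case (a map without an image). The column headers UMAP_1 and UMAP_2 are made up, and how images and transcript data nest inside a map is not shown because it depends on details not covered in the attribute list.

```python
import json
from typing import Dict, List, Optional

MAP_SUBTYPES = {"embedding", "spatial"}


def make_map_entry(
    label: str,
    data_sub_type: str,
    dimension: List[str],
    unit: Optional[str] = None,
    micrometer_per_unit: Optional[float] = None,
    spot_diameter: Optional[float] = None,
) -> Dict[str, object]:
    """Build one entry of the "map" list using the attributes described above."""
    if data_sub_type not in MAP_SUBTYPES:
        raise ValueError(f"dataSubType must be one of {sorted(MAP_SUBTYPES)}")
    if len(dimension) not in (2, 3):
        raise ValueError("a map needs 2 or 3 dimension columns")
    entry: Dict[str, object] = {
        "label": label,
        "dataSubType": data_sub_type,
        "dimension": list(dimension),
    }
    # Optional attributes are only written when a value is provided.
    optional = {
        "unit": unit,
        "micrometer_per_unit": micrometer_per_unit,
        "spot_diameter": spot_diameter,
    }
    for key, value in optional.items():
        if value is not None:
            entry[key] = value
    return entry


# Hypothetical column headers; use the headers from your own data file.
metadata_fragment = {"map": [make_map_entry("UMAP", "embedding", ["UMAP_1", "UMAP_2"])]}
print(json.dumps(metadata_fragment, indent=2))
```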
The survival time unit is displayed on the x-axis of the KM plot. You specify it in the metadata file under the "units" attribute.
"units" free text , KM plot x-axis unit, e.g. years, months, days
example
You customize the display of features in a phenotype file ("type": "clinicalMatrix") by adding a "clinicalFeature" file and accompanying .json file. To do this there are two steps
Add a "clinicalFeature" reference to the .json file that accompanies the phenotype file. Note the colon notation in example below.
Compose the clinicalFeature file (tab-delimited) and its .json file.
Below is an example for adding "clinicalFeature" reference in the phenotype file .json metadata
Below is an example of a clinicalFeature file. This file is tab-delimited, with headers "feature", "attribute", "value". The attributes are "valueType", "state", and "stateOrder". The values for the attribute "valueType" can be "category" or "float".
Below is an example clinicalFeature file .json file (clinicalFeature.txt.json)
Below is an example of how you would do this for a feature called "your_featureName".
For more information about how to specify missing data, as well as how Xena decides if a column is categorical or numerical, see our Data Format Specifications.
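If you would rather generate the clinicalFeature file with a script, a sketch like the one below produces the rows for one categorical feature. The function names are ours, and the quoting of the stateOrder value follows the "your_featureName" example; check the output against your own file before loading.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]
HEADER: Row = ("feature", "attribute", "value")


def clinical_feature_rows(feature: str, states: List[str]) -> List[Row]:
    """Rows declaring a categorical feature and the order of its states."""
    rows: List[Row] = [(feature, "valueType", "category")]
    rows += [(feature, "state", state) for state in states]
    # stateOrder is written as a comma-separated list of quoted states.
    order = ",".join(f'"{state}"' for state in states)
    rows.append((feature, "stateOrder", order))
    return rows


def to_tsv(rows: List[Row]) -> str:
    """Serialize the header and rows as tab-delimited text."""
    return "\n".join("\t".join(row) for row in [HEADER] + rows) + "\n"


rows = clinical_feature_rows("your_featureName", ["0", "1"])
print(to_tsv(rows), end="")
```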
For genomic data matrix, the optional metadata parameter colNormalization sets the default display scale. If not specified, the browser automatically determines the scale.
colNormalization: ‘true’ | ‘log2(x)’ | ‘normal2’
true: display centered by column mean (x - column average); an example usage is a gene expression matrix that is already log transformed.
log2(x): display in log2(x+1) scale, example usage is gene expression count matrix
normal2: display value of 2 in the background color (i.e. white), typically used for copy number data where the normal = 2
for segmented copy number data, if you don't specify colNormalization, display defaults to normal=0, display value of 0 in background color (i.e. white)
example
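To make the arithmetic behind the three settings concrete, here is a small sketch of what each one does to a column of values. It illustrates the transformations as described above and is not the browser's implementation; for normal2 it uses x - 2, so that a value of 2 lands on the background color.

```python
import math
from typing import List


def normalize_column(values: List[float], col_normalization: str) -> List[float]:
    """Apply the display transformation named by colNormalization to one column."""
    if col_normalization == "true":
        mean = sum(values) / len(values)
        return [v - mean for v in values]  # centered on the column mean
    if col_normalization == "log2(x)":
        return [math.log2(v + 1) for v in values]  # log2(x+1) scale
    if col_normalization == "normal2":
        return [v - 2 for v in values]  # 2 becomes the neutral value
    raise ValueError(f"unknown colNormalization: {col_normalization!r}")


print(normalize_column([1.0, 2.0, 3.0], "true"))
print(normalize_column([0.0, 1.0, 3.0], "log2(x)"))
print(normalize_column([2.0, 3.0], "normal2"))
```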
Please cite us! Citations are an important metric to our funders. Citing us helps us continue to support Xena.
You've run your analysis and are ready to publish your paper - congratulations! Cite the paper below to thank Xena and keep our project funded.
Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8
You can also read our paper for free at bioRxiv: https://www.biorxiv.org/content/10.1101/326470v6
Do you use Xena to further your research? Tag us on Twitter when you publish and we'll promote you on our Publication Page.
How to programmatically specify Xena Browser views
Xena has the ability to draw visualizations based on parameters passed through URL. You will need to URL encode the parameters.
The list of supported parameters below is not exhaustive. If you do not see your functionality supported below please contact us.
Display Column Setup Examples
Example 1 One data column (with two subcolumns) display
Example 2 Display in chart mode
Example 3 Three data column display, one clinical data column, two genomic data columns
Example 4 Specify column width, top label, and bottom label on Column C
Example 5 Reverse sort on Column B
Example 6 Display gene average in Column C
Example 7 Display introns in Column C; Hide welcome banner
Highlight Samples Examples
Example 8: highlight TCGA-C8-A131-01 or TCGA-BH-A0DL-01 samples
Example 9: highlight samples matching arbitrary criteria, such as samples in Column B with values > 10
Sample Filtering (specify what samples to display in the view) Examples
The columns parameter is a JSON-encoded array of objects, specifying the columns to display.
To specify a single column you need to, at a minimum, specify the dataset ID, the hub where the data resides, and the fields that you want to display.
Fields can be a gene, probe, or chromosome position. All fields need to be of the same type (i.e. all genes or all probes). You can only enter one chromosome position per column.
Field should be the field ID as it appears exactly in the dataset
width: <number>
Width in pixels
columnLabel: <string>
Text for top column label
fieldLabel: <string>
Text for bottom column label
geneAverage: <boolean>
Display the gene average instead of the individual probes for a gene. You can use this only when a single gene is specified for a dataset that has probes on a gene
normalize: ‘none’ | ‘mean’ | ‘log2’ | ‘normal2’
How the data should be dynamically normalized on the fly. 'mean' is x-mean (subtract mean), applied per (sub)column. 'log2' is log2(x+1). 'normal2' is (x-2).
showIntrons: <boolean>
Show introns for mutation and segmented copy number columns
sortDirection: 'reverse'
Reverse sort the samples
sortVisible: <boolean>
Sort column on the zoomed region
Same as the columns parameter, but these columns will not be displayed. They are available for sample filtering. See the filter property of the heatmap parameter, below.
The heatmap parameter is a JSON-encoded object specifying global display options
mode: 'chart'
Display in chart mode rather than visual spreadsheet mode.
showWelcome: <boolean>
Show the welcome banner.
searchSampleList: [<string>, ...]
Highlight the specified samples in the view.
search: <string>
Equivalent to typing this text into the 'Find' feature in Xena. In this example it is highlighting the samples that for column B have a value of 'TARGET'. More examples of possible search terms.
filter: <string>
Like the search parameter, but filter the view to the matching samples. Equivalent to selecting 'Filter' from the 'Find' UI. Columns that are only needed for filtering (not visualization) can be added to the filterColumns parameter, and appear semantically after columns. For example, if columns has length two they are labeled 'B' and 'C', and the first column in filterColumns will be 'D'.
Both search and filter can be specified in the same url, in which case the samples will be filtered, and any remaining samples matching search will be highlighted. Note that the search expression should only reference columns, not filterColumns, since the latter are not available for visualization.
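Putting the parameters together, the sketch below JSON-encodes and URL-encodes a view. Several things here are assumptions rather than facts from this page: the base URL, the column keys "name" (dataset ID), "host" (hub), and "fields", and the dataset name "example_dataset". Replace them with the values for your own dataset.

```python
import json
from typing import Dict, List, Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://xenabrowser.net/heatmap/"  # assumed entry point


def build_xena_url(
    columns: List[Dict[str, object]],
    heatmap: Optional[Dict[str, object]] = None,
    filter_columns: Optional[List[Dict[str, object]]] = None,
) -> str:
    """JSON-encode each parameter, then URL-encode the query string."""
    params = {"columns": json.dumps(columns)}
    if filter_columns:
        params["filterColumns"] = json.dumps(filter_columns)
    if heatmap:
        params["heatmap"] = json.dumps(heatmap)
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


columns = [
    {
        "name": "example_dataset",            # hypothetical dataset ID
        "host": "https://toil.xenahubs.net",  # hub where the dataset resides
        "fields": "TP53",
        "width": 200,
        "columnLabel": "Expression",
        "geneAverage": True,
    }
]
heatmap = {
    "showWelcome": False,
    "searchSampleList": ["TCGA-C8-A131-01", "TCGA-BH-A0DL-01"],
}
url = build_xena_url(columns, heatmap)
print(url)
```

Encoding is done in two layers on purpose: `json.dumps` handles booleans and nested lists, and `urlencode` then makes the resulting text safe to place in a URL.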
TCGA, TARGET, and GTEx RNA-seq data are uniformly re-aligned to hg38 genome, and re-processed using RSEM and Kallisto methods with gencode v23 annotations to generate expression estimates for ~60,000 genes and ~200,000 transcripts, including many LncRNAs. Xena hosts and displays gene and transcript expression results of this analysis.
The goal of the International Cancer Genome Consortium (ICGC) is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe. It includes TCGA data (U.S.A.) plus data contributed by groups from other countries in the International Cancer Genome Consortium. The resource has publicly accessible non-coding somatic mutation data from non-TCGA samples.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in more than 2,600 cancer whole genomes from the International Cancer Genome Consortium. Building upon previous work which examined cancer coding regions, this project explored the nature and consequences of somatic and germline variations in both coding and non-coding regions, with specific emphasis on cis-regulatory sites, non-coding RNAs, and large-scale structural alterations.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and many more.
Xena displays gene expression data from the metastatic cancer study published in
Cancer Cell Line Encyclopedia. Detailed genetic and pharmacologic characterization of a large panel (~1100) of human cancer cell lines.
We have a number of sources of pediatric data
The goal of the Treehouse Childhood Cancer Initiative (Treehouse) is to evaluate the utility of comparative gene expression analysis for difficult-to-treat pediatric cancer patients. With data from nearly 2,000 pediatric tumors, Treehouse has assembled a large collection of pediatric cancer RNA-Seq data which, added to adult data, results in a compendium of over 11,000 adult and pediatric tumor-derived gene expression profiles. Pediatric cancer expression data are from public repository samples and from clinical samples at partner institutions, including UC San Francisco, Stanford, Children’s Hospital of Orange County and British Columbia Cancer Agency. In line with UC Santa Cruz Genomics Institute’s commitment to sharing data and to furthering research everywhere, we have made this data available for all to download and use.
In addition to the data itself, we require some metadata about your file. When you use our website to load your data we fill in this metadata for you. When you use the command line, you will need to provide this data in an additional file.
The metadata file is a .json file and follows. The metadata .json file needs to be in the same directory as the data file. The metadata file and the data file need to have the same base name, including any file extensions (e.g. my_first_dataset and my_first_dataset.json OR my_second_dataset.txt and my_second_dataset.txt.json).
There are two required fields: type and cohort.
Type can be:
'genomicMatrix' -> where samples are columns and genomic regions are rows. Note that for loading on the command line we do not support the other orientation
'clinicalMatrix' -> where samples are rows and phenotypic features are columns. Note that for loading on the command line we do not support the other orientation
'mutationVector' ->
'genomicSegment' ->
Cohort is used to know if there are other data on the samples that you are loading. You can either specify a pre-existing cohort or create your own. Cohort names are displayed on the dataset pages and the cohort drop down menu on the Heatmaps page.
For existing cohorts, you need to enter the cohort name EXACTLY as it appears as the existing cohort name. Note that our cohort names are case sensitive.
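The naming rule and the two required fields can be sketched as follows. The helper names are ours, and the cohort name "My Study" is only an example; for an existing cohort you would use its exact, case-sensitive name.

```python
import json
from pathlib import Path
from typing import Dict

DATA_TYPES = {"genomicMatrix", "clinicalMatrix", "mutationVector", "genomicSegment"}


def metadata_path(data_file: Path) -> Path:
    """Same directory and same base name as the data file, with .json appended."""
    return data_file.with_name(data_file.name + ".json")


def build_metadata(data_type: str, cohort: str) -> Dict[str, str]:
    """The two required fields: type and cohort."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"type must be one of {sorted(DATA_TYPES)}")
    return {"type": data_type, "cohort": cohort}


def write_metadata(data_file: Path, data_type: str, cohort: str) -> Path:
    """Write the metadata file next to the data file and return its path."""
    target = metadata_path(data_file)
    target.write_text(json.dumps(build_metadata(data_type, cohort), indent=2) + "\n")
    return target


print(metadata_path(Path("my_second_dataset.txt")))
print(build_metadata("genomicMatrix", "My Study"))
```

Appending ".json" to the full file name, rather than replacing the extension, is what gives my_second_dataset.txt.json instead of my_second_dataset.json.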
If you are loading a mutation or segmented copy number file you will also need to specify the reference genome. You do not need to specify this for other file types
If you are loading a file that has probes, transcripts, or exons and you would like to query your data by gene, you will need to provide a mapping file. You do not need to specify this for other file types.
If you do not see a probemap that will work for you, please let us know.
To reference a probemap you need three files:
Include the probemap reference in your data file .json
Have the probemap file in the same directory as your data file and data file .json
Also have a .json file for the probemap so that we know how to load it
Note that to reference a probemap you need to load the probemap first, then load the data file.
Put both your .tsv and .json files in your_home_directory/xena/files. Then run the jar, passing in the file name, like so:
→ loads all files
OR
→ loads just file1.tsv
Note that you will need to substitute the name of the .jar file. As of the time of writing (September 20, 2018), the name of the .jar file was cavm-0.22.0-standalone.jar. On linux this will be in the directory where you opened the archive. On Windows or MacOS, use your operating system’s file search capability to search for cavm*jar. On Windows you will need to use the full path to your home directory, instead of “~”.
Note you do not need to load the .json files. Xena will automatically look for these and load them.
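If you are loading many files from a script, the commands could be assembled along the lines below. This sketch assumes the -l load option mentioned in the hub instructions and builds one command per data file, since we do not show here whether a single call accepts several files. The jar name is the one from the time of writing and will differ for newer versions.

```python
from pathlib import Path
from typing import Iterable, List


def load_commands(jar: str, data_files: Iterable[Path]) -> List[List[str]]:
    """One load command per data file; .json metadata files are skipped
    because they are picked up automatically."""
    return [
        ["java", "-jar", jar, "-l", str(path)]
        for path in data_files
        if path.suffix != ".json"
    ]


files_dir = Path.home() / "xena" / "files"
commands = load_commands(
    "cavm-0.22.0-standalone.jar",
    [files_dir / "file1.tsv", files_dir / "file1.tsv.json"],
)
for command in commands:
    print(" ".join(command))
```

Each inner list can be passed to `subprocess.run` once you have confirmed the options with the jar's help output.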
→ delete just file1.tsv
→ delete file1.tsv and file2.tsv
You can always type:
for help.
Step-by-step instructions to viewing your own data
Get started viewing your own data:
We support most types of . Genomic data needs to be values called on genes, transcripts, exons, probes or some other identifier. Phenotypic/clinical/annotation data can be almost anything, including patient data (e.g. age, set, etc), clinical data (), and other data such as gene fusion calls, regulon activity, immune scores, and more. Samples can be bulk tissue, cell lines, cells, and more. We do not visualize raw data such as FASTQs or BAMs.
Data can be your own or from another source, like or a publication.
We support tab-delimited (.tsv and .txt) and Microsoft Excel files (.xlsx and .xls). Data on a Local Xena Hub can only be viewed or accessed by the same computer on which it is running, keeping private data secure.
The Local Xena Hub must be installed and running in order to load data, as well as any time you want to view data. The Local Xena Hub will remember previously loaded data.
Please use Chrome to view your own data.
Click on . You will be prompted to download and install a local Xena Hub.
Double click on the download to begin the installation of the Xena Hub. Follow the wizard to finish the install.
Mac: OSX 10.7 and above
Windows: 64-bit
Linux: ability to run a .jar file
When you loaded your genomic data we asked what type of genes, transcripts or probes you used. If you selected one of the options from the drop down menu then you can enter HUGO gene names or the identifiers in your file. If you did not select one of the options then you will need to enter the identifiers as they appear in your file.
Xena does not utilize a central rendering service, or require hubs to be publicly accessible on the internet like, for example, the UCSC Genome Browser does. Data flows in one direction, from hubs to the user agent. If the user installs a Xena Hub on their laptop, the hub is as secure as the laptop. If the user installs a Xena Hub on a local network, behind a firewall, the hub is as secure as the local network.
The Xena Browser accesses data from a local Xena Hub on the same computer by requesting data from http://127.0.0.1. The local Xena Hub will make the data within it available at this address. The local Xena Hub will only answer requests made from the user's own computer.
Users will need to use a web browser that supports this if they wish to use a Xena Hub on the loopback interface. At the time of writing, this includes Chrome, and Firefox, but not Safari.
A very limited set of metadata is considered to be not secure in the Xena architecture model. This includes cohort names and sample names. This metadata is visible to other hubs in the following scenarios. When the user selects a cohort, all hubs are queried for samples on that cohort. When the user selects a data field, the hub holding that field is queried with the field ID (e.g. gene, probe, transcript, phenotype) and all cohort sample IDs. This means, for example, that two hubs holding data on the same cohort will see the union of sample IDs from that cohort. While data queries are not made available publicly, a malicious person could gain entry to a Xena Hub and comb through logs for these queries. For these reasons, these metadata fields should not contain private information.
To visualize and perform a KM analysis, we use two columns/rows of data, time to event and event. These data must be loaded in a phenotype file. The phenotype file can contain other data as well.
Note that you will need to name the headers in your phenotype file EXACTLY what we recognize. See the list of recognized headers for each type of survival/interval below.
This data can be in days, months, years, etc.
Time to Event is a duration variable for each subject having a beginning and an end anywhere along the timeline of the complete study. It begins when the subject is enrolled into a study or when treatment begins, and ends when the end-point (event of interest, for example, death or metastasis) is reached or the subject is censored from the study.
Censoring means the total survival time for that subject cannot be accurately determined. This can happen when something negative for the study occurs, such as the subject drops out, is lost to follow-up, or the required data is not available or, conversely, something good happens, such as the study ends before the subject had the event of interest occur, i.e., they survived at least until the end of the study, but there is no knowledge of what happened thereafter.
Event indicates what the 'event' was for a patient, 1 for the event, for example, death or metastasis, and 0 for censored.
Help text was partially taken from .
Below is a table of the column/row header names we recognize for each type. Note that these header names are case sensitive.
The goal of the Gabriella Miller Kids First Pediatric Research Program (Kids First) is to develop to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders. Over 2015-2018, the program selected 26 patient cohorts for whole genome sequencing through a peer-review process.
TARGET data is intended exclusively for biomedical research using pediatric data (i.e., the research objectives cannot be accomplished using data from adults) that focus on the development of more effective treatments, diagnostic tests, or prognostic markers for childhood cancers. Moreover, TARGET data can be used for research relevant to the biology, causes, treatment and late complications of treatment of pediatric cancers, but is not intended for the sole purposes of methods and/or tool development (please see section of the OCG website). If you are interested in using TARGET data for publication or other research purposes, you must follow the .
Don't see a study or dataset that you are interested in? for yourself or your group with the data you need.
Here is an example probemap file (a delimited file):
We have many probemap files that you can see via our .
After installing a local Xena Hub, go back to to auto-start the Hub. If it does not automatically start, refresh the page or double click on the Xena Hub application on your computer. The Xena Hub application should be in your Applications folder for Mac and Windows. Note that it will take up to one minute to start up.
Most people load data into their Local Xena Hub through our , which leads you through the loading process step by step. Note that you will want to make sure your data is ahead of time.
You can also load data .
Click on . If your study is not already selected as step 1 of the wizard, then select it from the drop down and click 'Done'. Note that if you did not enter a study name your data will be under 'My Study'.
Your Local Xena Hub must be running to view any data that you have loaded into it. Please ensure it is running on your computer. You can also check which studies are on your hub and what data is in them by going to the .
feature | attribute | value |
your_featureName | valueType | category |
your_featureName | state | 0 |
your_featureName | state | 1 |
your_featureName | stateOrder | "0","1" |
sample | OS | OS.time |
TCGA-AB-1234-01 | 0 | 100 |
TCGA-AB-6789-01 | 1 | 200 |
TCGA-CD-1234-01 | 0 | 300 |
TCGA-CD-5678-01 | 1 | 400 |
Survival Type | 'Time to Event' Header name | 'Event' Header name |
Overall Survival | OS.time | OS |
Disease free interval | DFI.time | DFI |
Disease specific survival | DSS.time | DSS |
Progression free interval | PFI.time | PFI |
Local recurrence interval | LRI.time | LRI |
Distant metastasis interval | DMI.time | DMI |
Distant disease free survival | DDFS.time | DDFS |
Invasive disease free survival | IDFS.time | IDFS |
Regional recurrence | RR.time | RR |
Relapse | Relapse.time | Relapse |
Metastasis | Metastasis.time | Metastasis |
Distant recurrence interval | DRI.time | DRI |
Distant metastasis free survival | DMFS.time | DMFS |
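To illustrate the naming convention in the table above, here is a minimal Python sketch that reads the header row of a tab-delimited phenotype file and reports which survival types have both their 'Event' and 'Time to Event' columns present. The function name `find_survival_pairs` and the example data are hypothetical and are not part of Xena itself.

```python
import csv
import io

# 'Event' header names from the table above. In every row of that table,
# the 'Time to Event' header is the event name followed by ".time".
EVENT_HEADERS = [
    "OS", "DFI", "DSS", "PFI", "LRI", "DMI", "DDFS",
    "IDFS", "RR", "Relapse", "Metastasis", "DRI", "DMFS",
]


def find_survival_pairs(tsv_text):
    """Return event names whose event and time columns both appear in the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = {name.strip() for name in next(reader)}
    return [e for e in EVENT_HEADERS if e in header and f"{e}.time" in header]


# Hypothetical file contents: OS has both columns, DFI is missing DFI.time.
example = "sample\tOS\tOS.time\tDFI\nTCGA-AB-1234-01\t0\t100\t1\n"
print(find_survival_pairs(example))
```

A survival type that is missing either of its two columns is left out of the result, which makes a misspelled header easy to spot before loading the file.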
There are a couple of options. You can right-click the .dmg and choose 'Open'. You can also press the Control key, then click the app icon, then choose Open from the shortcut menu. These pages might help: http://www.iclarified.com:8081/28180/how-to-open-applications-from-unidentified-developers-in-mac-os-x-mountain-lion and from Apple: https://support.apple.com/kb/PH25088?locale=en_US .
The assembly only matters if you decide to visualize part or all of a chromosome, rather than a gene, probe, or transcript. If you want to visualize only genes, probes, or transcripts, then it does not matter which assembly you choose.
Your Local Xena Hub must be running to view any data that you have loaded into it. Please ensure it is started up. You can also check which studies are on your hub and what data is in them by going to the My Computer Hub page: xenabrowser.net/datapages/?host=https%3A%2F%2Flocal.xena.ucsc.edu%3A7223.
You also may not see your study if the hub is still loading the data. Wait a few minutes and refresh the page.
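The `host` parameter in the My Computer Hub link above is the hub address in percent-encoded form. The following is a small sketch using only the Python standard library that shows how that link corresponds to the plain hub address; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Local hub address as it appears, percent-encoded, in the link above.
hub_address = "https://local.xena.ucsc.edu:7223"

# safe="" makes quote() encode ':' and '/' as well.
datapages_url = "xenabrowser.net/datapages/?host=" + quote(hub_address, safe="")
print(datapages_url)

# Decoding the parameter recovers the plain hub address.
print(unquote(datapages_url.split("host=")[1]))
```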
When you loaded your genomic data, we asked what type of genes, transcripts, or probes you used. If you selected one of the options from the drop-down menu, then you can enter HUGO gene names or the identifiers in your file. If you did not select one of the options, then you will need to enter the identifiers as they appear in your file.
Yes, we will allow you to select phenotypes from both files in the visualization.
You might be able to load your file anyway, depending on the format. Give it a try, and if you are unable to load it, write us an email and we may be able to fix your file for you.
You can export a Microsoft Excel file as a tab-delimited file using the 'Save as ...' function.
We require that your data files have Unix line endings. To ensure that your files have these line endings on DOS, please follow the help here:
Note that this requirement is only for data files, not for the associated .json files.
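If you prefer to convert a file yourself, one possible approach is sketched below in Python. It rewrites Windows/DOS (CRLF) and old Mac (CR) line endings as Unix (LF) line endings. The function names and file paths are hypothetical, and this is not a tool provided by Xena.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that each pair becomes a single LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src_path, dst_path):
    """Read src_path as bytes and write a copy with Unix line endings."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_unix_line_endings(data))


# Example on in-memory data: two tab-delimited rows with DOS line endings.
print(to_unix_line_endings(b"sample\tOS\r\nTCGA-AB-1234-01\t0\r\n"))
```

Working on bytes rather than text keeps the rest of the file unchanged, including tabs and any non-ASCII characters.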
We'd love to hear from you!
https://groups.google.com/g/ucsc-cancer-genomics-browser
genome-cancer@soe.ucsc.edu
http://xena.ucsc.edu/#whatsnew
https://twitter.com/ucscxena?lang=en
Do you use Xena to further your research? Tag us when you publish and we'll help promote you.
You've run your analysis and are ready to publish your paper - congratulations! Cite the paper below to thank Xena and help keep our project funded.