There are 2 basic data formats and 2 advanced data formats. Each of these formats has one or more biological data types that it supports.
We support most types of genomic data and phenotype/clinical/sample annotations. For genomic data we support calls made on the raw data, including but not limited to expression calls, mutation calls, etc. This is what TCGA calls ‘Level 3’ data and is typically a value per gene, transcript, probe, etc. We do not support FASTQ, BAM, or other ‘raw’ files. Please contact us if you have any questions.
We support tab-delimited and Microsoft Excel files (.xlsx and .xls). Tab-delimited files generally have a file name ending in .tsv or .txt, though we do not require this. Note that we load tab-delimited files much faster than Excel files. You can export a Microsoft Excel file as a tab-delimited file using the 'Save as ...' function.
Please make sure your file has no duplicate genes/probes/identifiers or samples. A file with duplicates will still load, but we will only display the first occurrence encountered in the file.
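If you want to check a tab-delimited matrix for duplicates before loading, a minimal sketch like the one below may help. It assumes the first row holds the column identifiers and the first column holds the row identifiers; the function name `find_duplicates` and the example values are hypothetical and are not part of our loader.

```python
import csv
import io
from collections import Counter


def find_duplicates(tsv_text):
    """Return (duplicate_row_ids, duplicate_column_ids) for a tab-delimited matrix.

    Assumption: the first row holds column identifiers and the first column
    holds row identifiers. The top-left cell is a label and is ignored.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return [], []
    column_ids = rows[0][1:]
    row_ids = [row[0] for row in rows[1:] if row]
    dup_rows = [name for name, n in Counter(row_ids).items() if n > 1]
    dup_cols = [name for name, n in Counter(column_ids).items() if n > 1]
    return dup_rows, dup_cols


# Made-up illustrative matrix: TP53 and S1 each appear twice.
example = (
    "sample\tS1\tS2\tS1\n"
    "TP53\t1.2\t0.4\t2.2\n"
    "ALK\t0.1\t0.3\t0.9\n"
    "TP53\t1.5\t0.2\t2.0\n"
)
print(find_duplicates(example))
```

Any identifier reported here appears more than once, so only its first occurrence would be displayed.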
We assume you use a '.' rather than a ',' as the decimal separator.
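If your file was exported with ',' as the decimal separator, one possible way to convert it is sketched below. It assumes that any field consisting only of digits, one comma, and more digits is a decimal number; a value such as 1,234 meant as a thousands-separated integer would be converted incorrectly, so please check your data first. The helper names are hypothetical.

```python
import re

# Assumption: a field like "0,5" or "-12,75" is a number written with a decimal comma.
_DECIMAL_COMMA = re.compile(r"^[+-]?\d+,\d+$")


def normalize_decimal(field):
    """Replace a decimal comma with a period; leave any other field unchanged."""
    if _DECIMAL_COMMA.match(field):
        return field.replace(",", ".")
    return field


def normalize_line(line):
    """Apply normalize_decimal to every tab-separated field of one line."""
    return "\t".join(normalize_decimal(f) for f in line.split("\t"))


print(normalize_line("TP53\t1,25\t-0,5\tNA"))
```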
RNA-seq expression (exon, transcript, gene, etc)
Array-based expression (probe, gene, etc)
Gene-level mutation
Gene-level copy number
DNA methylation
RPPA
and more ...
An example of a genomic matrix file (in this case, expression):
These are data on a sample or patient that are categorical in nature (e.g. Tumor Stage, or 'wild type' or 'mutant' for a gene) or are numerical but non-genomic (e.g. age or a genomic signature). Samples can be columns and phenotype/clinical/sample fields can be rows, or vice versa. We support both orientations.
phenotype/patient/clinical data (age, weight, if there was blood drawn, etc)
sample/aliquot data (where it was sequenced, tumor weight, etc)
derived data (regulon activity for a gene, etc)
genomic signatures (EMT signature score, stemness score, etc)
other (whether a sample has an ERG-TMPRSS2 fusion, whether a sample has WGS data available, etc)
We support both numerical and categorical data. For numerical data, please use a blank field for any samples that are missing data. For categorical data, you can use a blank field or "NA" for any samples that are missing data.
To have a field treated as numerical, please use a blank field, not "NA", for its missing values.
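The missing-value rule above can be applied before loading with a small clean-up step. The sketch below assumes samples are rows, the first row is a header, and every row has the same number of fields. A column counts as numerical when all of its non-missing values parse as numbers. The function names are hypothetical and this is not our loader's own logic.

```python
MISSING = {"", "NA"}


def is_number(text):
    """True if text parses as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def blank_missing_in_numeric_columns(rows):
    """Return a copy of rows where "NA" in numerical columns becomes a blank field.

    Assumption: rows[0] is the header and column 0 holds the sample identifiers.
    Categorical columns are left untouched.
    """
    header, body = rows[0], rows[1:]
    cleaned = [list(row) for row in body]
    for col in range(1, len(header)):
        present = [row[col] for row in body if row[col] not in MISSING]
        if present and all(is_number(value) for value in present):
            for row in cleaned:
                if row[col] in MISSING:
                    row[col] = ""
    return [list(header)] + cleaned


# Made-up illustrative rows: sample B is missing both fields.
rows = [
    ["sample", "ER_status", "age"],
    ["A", "positive", "63"],
    ["B", "NA", "NA"],
    ["C", "negative", "65"],
]
print(blank_missing_in_numeric_columns(rows)[2])
```

In this example the "NA" under age becomes a blank field, while the "NA" under ER_status stays, since that column is categorical.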
An example of a phenotype matrix file:
sample             ER_status    disease_status          age
TCGA-BA-4074-01    positive     complete remission      63
TCGA-BA-4082-01    positive     complete remission      54
TCGA-BA-4078-01    negative     undergoing treatment    65
For segmented data, we require the following 5 columns: sample, chr, start, end, and value. Note that your column headers must be these names exactly!
Please use 'NA' to indicate no data.
copy number
We currently accept hg38, hg19, and hg18 coordinates.
Example segmented copy number data with required columns:
sample             chr     start       end          value
TCGA-V4-A9EL-01    chr1    61735       16815530     0.041
TCGA-V4-A9EL-01    chr1    16816090    17190862     -0.4227
TCGA-V4-A9EF-01    chr4    86979944    115173700    0.0414
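Because the column headers for segmented data must match exactly, it can help to check a file before loading it. The sketch below is one way to do so, assuming a tab-delimited file with a header row, whole-number start and end positions, and a value that is either a number or 'NA'. The function name `check_segmented` and its messages are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "chr", "start", "end", "value"]


def check_segmented(tsv_text):
    """Return a list of problems found in tab-delimited segmented data."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    # Header names are compared exactly, including case.
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):
        for key in ("start", "end"):
            if not (row[key] or "").isdigit():
                problems.append(f"line {line_no}: {key} is not a whole number")
        value = row["value"] or ""
        if value != "NA":
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: value {value!r} is not a number or NA")
    return problems


segment_example = (
    "sample\tchr\tstart\tend\tvalue\n"
    "TCGA-V4-A9EL-01\tchr1\t61735\t16815530\t0.041\n"
    "TCGA-V4-A9EL-01\tchr1\t16816090\t17190862\t-0.4227\n"
    "TCGA-V4-A9EF-01\tchr4\t86979944\t115173700\t0.0414\n"
)
print(check_segmented(segment_example))
```

An empty list means the checks in this sketch found nothing to fix; it does not guarantee the file will load.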
For positional data, we require 6 columns: sample, chr, start, end, reference, alt. Note that your column headers must be these names exactly!
Other columns that may follow are: gene, effect, DNA_VAF, RNA_VAF, and Amino_Acid_Change. These columns are not required but will enhance the visualization of the data. For example, the "gene" column enables displaying mutations when querying by gene name in addition to querying by genomic coordinates. The "effect" column colors the mutations by effect (the default color is gray). The effect terms include "Nonsense" (red), "Frameshift" (red), "Splice" (orange), "missense" (blue), and "Silent" (green). The full list of accepted terms can be found in our code.
mutation data
We currently accept hg38, hg19, and hg18 coordinates.
Example mutation data with the six required columns, plus the gene column:
sample             chr     start        end          reference    alt    gene
TCGA-AB-2802-03    chr2    29917721     29917721     G            A      ALK
TCGA-AB-2802-03    chr1    119270684    119270687    TTAAA        T      MYC
TCGA-AB-2867-03    chr1    150324146    150324146    T            G      PRPF3
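To see at a glance which of the six required columns a mutation file is missing and which optional columns it provides, a header summary like the sketch below may be useful. It assumes a tab-delimited file whose first line is the header; the function name `summarize_mutation_header` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "chr", "start", "end", "reference", "alt"]
OPTIONAL_COLUMNS = ["gene", "effect", "DNA_VAF", "RNA_VAF", "Amino_Acid_Change"]


def summarize_mutation_header(tsv_text):
    """Return (missing_required, optional_present) for the header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    # Names are compared exactly, including case.
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    optional = [c for c in OPTIONAL_COLUMNS if c in header]
    return missing, optional


mutation_example = (
    "sample\tchr\tstart\tend\treference\talt\tgene\n"
    "TCGA-AB-2802-03\tchr2\t29917721\t29917721\tG\tA\tALK\n"
)
print(summarize_mutation_header(mutation_example))
```

For the example above, nothing is missing and "gene" is the only optional column present, so mutations could be queried by gene name but would not be colored by effect.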
We support a number of other specialty data types such as structural variants. Please contact us if you have this data so we can help you load it.