GADB (Genomic Annotation Database)
This document describes, for each data source, how the data were obtained, which version was used, and which pre-processing steps were applied.
GADB metadata:
Refer to the “metadata_info.xlsx” file for how and from where we obtained the meta information for GADB.
Pre-processing:
All data provided in GADB are in BED format, except the 1000 Genomes data (VCF) and the sequence data (FASTA). Refer to “GADB_file_formats.tsv” for the different BED formats.
BED files are 0-based, sorted and indexed (GIGGLE and Tabix).
Pre-processing scripts:
The following scripts were used for every data source.
1) Bed sort and bgzip (BED_sort-bgzip.sh)
2) Giggle indexing (GIGGLE_index.sh)
3) Tabix indexing (tabix_index.sh)
Other pre-processing steps that are specific to a data source are described in the corresponding section below, along with the scripts used.
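As a rough illustration of the ordering that step 1 has to produce before bgzip compression and Tabix indexing (chromosome first, then numeric start and end), here is a minimal Python sketch. It is not the content of “BED_sort-bgzip.sh”; the function name sort_bed_lines is hypothetical, and compression and indexing are left to the bgzip, tabix and GIGGLE tools.

```python
def sort_bed_lines(lines):
    """Return BED data lines sorted by chromosome, then start, then end.

    Comment, track and browser lines and blank lines are dropped.
    Chromosome names are compared as plain strings (assumption: this
    matches the ordering wanted downstream).
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        # BED start is 0-based, end is exclusive; both are integers.
        records.append((fields[0], int(fields[1]), int(fields[2]), line))
    records.sort(key=lambda r: (r[0], r[1], r[2]))
    return [r[3] for r in records]
```

Sorting start and end as integers matters: a plain text sort would place position 100 before position 20.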
ENCODE data download Documentation
Version: Downloaded datasets released until 05/2018
Data download:
Access data from data matrix: https://www.encodeproject.org/matrix/?type=Experiment&status=released
Use the following filter terms:
Assay title: ChIP-seq, DNase-seq, eCLIP, CAGE, FAIRE-seq, RNA-PET, ChIA-PET, RIP-seq, iCLIP, 5C, ATAC-seq, DNA-PET, Hi-C, MNase-seq
Genome assembly: hg19, hg38
Available file types: bigBed narrowPeak, bigBed broadPeak, bigBed bedRnaElements, bigBed tss_peak, bigBed idr_peak, bigBed bed12, bigBed bed3, gtf, gff gff3
After making the filter selection, use batch download (the instructions below are shown on the matrix page):
Click the “Download” button to download a “files.txt” file that contains a list of URLs to a file containing all the experimental metadata and links to download the files. The first line of the file has the URL or command line to download the metadata file.
The “files.txt” file can be copied to any server.
The following command using curl can be used to download all the files in the list:
xargs -L 1 curl -O -L < files.txt
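Because the first entry of “files.txt” points to the metadata file rather than to a data file, it can be convenient to separate the two before any bookkeeping. The sketch below is only an illustration: the function name split_encode_file_list is hypothetical, and the handling of surrounding double quotes is an assumption about how entries may be written in the file.

```python
def split_encode_file_list(text):
    """Split the contents of files.txt into (metadata_url, data_file_urls).

    Assumption: one URL per line, optionally wrapped in double quotes,
    with the metadata URL on the first non-empty line.
    """
    urls = []
    for raw in text.splitlines():
        url = raw.strip().strip('"')
        if url:
            urls.append(url)
    if not urls:
        return None, []
    return urls[0], urls[1:]
```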
Data pre-processing:
1) Convert bigBed to bed (bigBedToBed)
2) Convert gff3 to bed (GFF3toBed.sh)
3) Convert gtf to bed (GTFtobBed.sh)
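The key point in steps 2 and 3 is the coordinate change: GFF3 and GTF use 1-based, inclusive coordinates, whereas BED uses 0-based starts with exclusive ends. The following sketch shows that conversion for a single feature line. It is not the content of the conversion scripts named above; the function name and the choice of output columns (feature type as the name column, six BED columns) are assumptions made for illustration.

```python
def gff_line_to_bed(line):
    """Convert one GFF3/GTF feature line to a 6-column BED line.

    Returns None for comment or blank lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-separated GFF/GTF columns")
    seqid, _source, feature, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1  # 1-based inclusive start -> 0-based start
    bed_end = int(end)          # inclusive end equals the exclusive BED end
    return "\t".join([seqid, str(bed_start), str(bed_end), feature, score, strand])
```

The feature length is unchanged by the conversion: end minus start in BED equals end minus start plus one in GFF.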
GTEx Data download Documentation
Version: v6, v6p, v7, v8
Data download:
GTEx v6: significant snp-gene association
https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/GTEx_Analysis_V6_eQTLs.tar.gz
GTEx v6p: significant snp-gene association
https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar
GTEx v6p: all snp-gene association
https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_all-associations.tar
GTEx v7: significant snp-gene association
https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz
GTEx v7: all snp-gene association
https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz
GTEx v8 eQTL: significant snp-gene association
https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL.tar
GTEx v8 eQTL: all snp-gene association
Follow the instructions to download using Google Cloud
https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_eQTL_all_associations/
GTEx v8 sQTL: significant snp-gene association
https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL.tar
GTEx v8 sQTL: all snp-gene association
Follow the instructions to download using Google Cloud
https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_sQTL_all_associations/
Data pre-processing:
All snp-gene association: because of the file size, we retained only variants with p-value < 0.05. Script used for filtering: “GTEx_filter_all_association.sh”
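A minimal sketch of such a p-value filter is shown below. It streams the table row by row, so it does not need to hold a large all-associations file in memory. It is not the content of “GTEx_filter_all_association.sh”: the function name is hypothetical, and the column name pval_nominal is an assumption that should be checked against the header of each GTEx release.

```python
import csv
import io


def filter_by_pvalue(in_handle, out_handle, pval_column="pval_nominal", threshold=0.05):
    """Copy rows with p-value below `threshold` from a tab-delimited table.

    The header line is preserved. Returns the number of rows kept.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row[pval_column]) < threshold:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    demo = io.StringIO("gene_id\tvariant_id\tpval_nominal\nG1\tv1\t0.01\nG1\tv2\t0.5\n")
    out = io.StringIO()
    print(filter_by_pvalue(demo, out), "row(s) kept")
```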
Each GTEx version uses a different file layout; we used the GTEx v6p all snp-gene association layout as the default format. The scripts below were used to create BED files from the GTEx data (for every all snp-gene association dataset, conversion was done after filtering variants).
1) GTEx_v6_eQTL_signif_bed_conversion.sh
2) GTEx_v6p_eQTL_signif_bed_conversion.sh
3) GTEx_v6p_eQTL_all_association_bed_conversion.sh
4) GTEx_v7_eQTL_signif_bed_conversion.sh
5) GTEx_v7_eQTL_all_association_bed_conversion.sh
6) GTEx_v8_eQTL_signif_bed_conversion.sh
7) GTEx_v8_eQTL_all_association_bed_conversion.sh
8) GTEx_v8_sQTL_signif_bed_conversion.sh
9) GTEx_v8_sQTL_all_association_bed_conversion.sh
After conversion to BED format, we rearranged the columns for consistency (to match v6p all snp-gene association). Script used for re-formatting: “GTEx_BED_column_reformatting.sh”
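To illustrate what a BED conversion of GTEx records involves, the sketch below derives a 0-based BED interval from a GTEx variant identifier of the form chromosome_position_ref_alt_build (for example 1_13417_C_CGAGA_b37 or chr1_13550_G_A_b38). It is not taken from the conversion scripts listed above. The function name, the added “chr” prefix and the choice to span the reference allele are assumptions made for illustration.

```python
def variant_id_to_bed(variant_id):
    """Turn a GTEx variant ID into (chrom, start, end, ref, alt, build).

    The position in the ID is 1-based; the returned start is 0-based and
    the interval covers the reference allele.
    """
    chrom, pos, ref, alt, build = variant_id.split("_")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: UCSC-style chromosome names
    start = int(pos) - 1
    end = start + len(ref)
    return chrom, start, end, ref, alt, build
```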
ROADMAP Data download Documentation
Version: Release 9
Data download:
BroadPeak:
https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/broadPeak/
GappedPeak:
https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/gappedPeak/
NarrowPeak:
https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/narrowPeak/
i) all_hg38lift.mnemonics.bedFiles.tgz
ii) all.mnemonics.bedFiles.tgz
Data pre-processing:
1) ROADMAP enhancers: extracted enhancers from ChromHMM data. Script used: “ROADMAP_enhancer_extraction.sh”
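The sketch below shows one way such an extraction could work on the mnemonics BED files, where the fourth column holds the chromatin state label (for example 7_Enh or 15_Quies). It is not the content of “ROADMAP_enhancer_extraction.sh”. The function name is hypothetical, and selecting every state whose label contains “Enh” is an assumption; the actual script may use a different set of states.

```python
def extract_enhancers(lines, marker="Enh"):
    """Keep BED lines whose state label (column 4) contains `marker`."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and marker in fields[3]:
            kept.append("\t".join(fields))
    return kept
```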
FANTOM5 Documentation
Version: FANTOM5 Phase2.0
Data download:
Download the CAGE TSS BED files
http://fantom.gsc.riken.jp/5/datafiles/latest/basic/
README files are located inside every sub-folder
Enhancer download:
http://slidebase.binf.ku.dk/human_enhancers/presets/serve/facet_expressed_enhancers.tgz
FactorBook Documentation
Version: Release date: 16-March-2014
Data download:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/factorbookMotifPos.txt.gz
Homer Documentation
Version: 17/09/2017
Data download:
http://homer.ucsd.edu/homer/data/motifs/
Human hg38 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg38.170917.bigBed.track.txt
Human hg19 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg19.170917.bigBed.track.txt
Data pre-processing:
1) HOMER data contain duplicate motif names; use the script “HOMER_modification_update_duplicate_motif_names.py” to make them unique
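One simple way to disambiguate repeated names is to append a running counter to every name that occurs more than once. The sketch below shows that idea only; it is not the content of the script named above, and both the function name and the “_1”, “_2” suffix scheme are assumptions.

```python
from collections import Counter


def make_names_unique(names):
    """Append _1, _2, ... to names that occur more than once.

    Names that occur once are returned unchanged; input order is kept.
    """
    totals = Counter(names)
    seen = Counter()
    unique = []
    for name in names:
        if totals[name] == 1:
            unique.append(name)
        else:
            seen[name] += 1
            unique.append(f"{name}_{seen[name]}")
    return unique
```

This scheme would still collide if an input name already ends in such a suffix, so the result should be checked for uniqueness on real data.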
TargetScan Documentation
Version: 7p2 (March 2018)
Data download:
http://www.targetscan.org/cgi-bin/targetscan/data_download.vert72.cgi
All predictions for representative transcripts: Genome (hg19) locations of all targets, partitioned into files by conservation of miRNA family and site
Default predictions (conserved sites of conserved miRNA families): Genome (hg19) locations of human predicted (conserved) targets of conserved miRNA families
Reference Genome
hg19: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/
Feb. 2009 assembly of the human genome
hg38: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
Dec. 2013 assembly of the human genome
Gene model
Ensembl
hg19: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens
release 75
hg38: ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens
release 89
RefSeq
Used the UCSC Table Browser to download NCBI RefSeq data
hg19: http://genome.ucsc.edu/cgi-bin/hgTables
genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Genes and Gene Predictions, Track: NCBI RefSeq
hg38: http://genome.ucsc.edu/cgi-bin/hgTables
genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Genes and Gene Predictions, Track: NCBI RefSeq
Repeats
hg19: http://genome.ucsc.edu/cgi-bin/hgTables
genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Repeats, Track: RepeatMasker
hg38: http://genome.ucsc.edu/cgi-bin/hgTables
genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Repeats, Track: RepeatMasker
DASHR2 Documentation
Version: DASHR2 (August-2018)
Data Download:
Small RNA called peaks data:
http://dashr2.lisanwanglab.org/download.php
Select DASHR data collection: Download sncRNA tables (annotated and unannotated, from the Small RNA loci table row)
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR2_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR2_GEO_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_GEO_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_dataportal_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_dataportal_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR1_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR1_GEO_hg38&table=peaks
DASHR Annotation:
http://dashr2.lisanwanglab.org/downloads/dashr.v2.sncRNA.annotation.hg19.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.sncRNA.annotation.hg38.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.annotation.hg19.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.annotation.hg38.gff
Data pre-processing:
Used script “DASHR2_split_by_tissue.sh” to split the peak files by tissues.
1000 Genomes Documentation
Version: Phase3
Data Download:
Download the VCF files from the following FTP links
hg19: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
hg38: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/
Data pre-processing:
VCF files for each chromosome were split by variant type (SNP, INDEL, biallelic, multiallelic) and by super-population and sub-population.
Scripts used for splitting the VCF files are in the package “1kgenome_pre-processing_scripts.tar.gz”
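The variant-type split can be pictured as a classification of each record's REF and ALT columns. The sketch below is an illustration under stated assumptions and is not taken from the scripts in the package above: the function name is hypothetical, sites with symbolic alleles (such as <CN0>) or equal-length multi-base substitutions are labelled OTHER, and a site mixing single-base and length-changing alternates is labelled INDEL.

```python
def classify_variant(ref, alt):
    """Classify a VCF record from its REF and ALT fields.

    Returns (kind, allelic) where kind is "SNP", "INDEL" or "OTHER" and
    allelic is "biallelic" or "multiallelic".
    """
    alts = alt.split(",")
    allelic = "biallelic" if len(alts) == 1 else "multiallelic"
    if any(a.startswith("<") for a in alts):
        kind = "OTHER"  # symbolic allele, e.g. a structural variant
    elif len(ref) == 1 and all(len(a) == 1 for a in alts):
        kind = "SNP"
    elif any(len(a) != len(ref) for a in alts):
        kind = "INDEL"
    else:
        kind = "OTHER"  # same-length multi-base substitution
    return kind, allelic
```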
Hg38 liftover Documentation
Liftover was performed on all data sources for which no hg38 data were available.
hg19 genome coordinates were lifted to hg38 using the UCSC liftOver utility with the chain file hg19ToHg38.over.chain.gz
hg19 coordinates are retained in the lifted-over files (columns 4, 5 and 6 are the hg19 coordinates in the lifted-over files)
Download link: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
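One way to end up with the original coordinates in columns 4 to 6 is to copy them there before running liftOver, so that only the first three columns are rewritten (for instance by treating the remaining columns as pass-through fields, as liftOver's -bedPlus=3 option is meant to do). The sketch below shows only the column-copying step. It is an assumption about the approach, not a description of the procedure actually used, and the function name is hypothetical.

```python
def append_original_coords(bed_line):
    """Insert a copy of chrom/start/end as columns 4-6 of a BED line.

    Any existing columns after the first three are shifted to the right.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[:3]
    return "\t".join([chrom, start, end, chrom, start, end] + fields[3:])
```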