About
FILER tutorial
- For examples of how to use FILER, please see the FILER webserver tutorial.
Download
- Downloading FILER track metadata
FILER track metadata are available in templated metadata format (TSV) for both GRCh37/hg19 and GRCh38/hg38 genome builds.
Please refer to the FILER_v1_metadata_schema.xlsx [XLS spreadsheet; April 2021; 14KB] for the description of the metadata fields.
- Downloading FILER tracks
Download URLs and wget commands are provided in the Processed File Download URL and wget command columns.
- Deploying FILER instance on your system
For installing and using a stand-alone FILER instance, please refer to the FILER Bitbucket repository http://bitbucket.org/wanglab-upenn/FILER and the install_filer.sh installation script. This will download tracks/datasets, index individual tracks (tabix) and Giggle-index individual datasets.
Frequently Asked Questions (FAQ)
Q1. How can I download individual FILER tracks?
A1. Individual tracks can be downloaded in several ways:
- Using the FILER website: on the Browse page, use the Download links in the Download file column of the FILER track table.
- Using the download link provided in the FILER metadata table (see the Processed File Download URL column, e.g., in the GRCh37/hg19 or GRCh38/hg38 FILER metadata tables).
Q2. Running a FILER script fails with an error message, e.g., declare: -A: invalid option or ERROR: Bash version 4+ is required
A2. FILER scripts require Bash v4.3+. Please check bash --version and update if necessary (e.g., using brew install bash on macOS, yum update bash on CentOS, or apt-get install --only-upgrade bash on Ubuntu).
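The version requirement in A2 can also be checked programmatically before running a FILER script. The sketch below is only an illustration: the helper name version_at_least is hypothetical and not part of FILER, and it compares just the major and minor fields of a version string such as the value of $BASH_VERSION.

```shell
# Hypothetical helper (not part of FILER): succeed if the version string in $1
# (e.g. "5.1.16(1)-release") is at least major $2, minor $3.
version_at_least() {
    ver_major=${1%%.*}          # text before the first dot
    ver_rest=${1#*.}            # text after the first dot
    ver_minor=${ver_rest%%.*}   # text up to the next dot
    [ "$ver_major" -gt "$2" ] ||
        { [ "$ver_major" -eq "$2" ] && [ "$ver_minor" -ge "$3" ]; }
}

# Example (run inside bash, where BASH_VERSION is set):
#   version_at_least "$BASH_VERSION" 4 3 || echo "Bash 4.3+ is required" >&2
```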
Q3. How can I download a custom subset of FILER tracks?
Website: Use the data selectors/filters on the Browse page to filter tracks down to a desired set. Then click the Download button above the FILER track table to download FILER track metadata for the selected tracks. The Processed File Download URL column will contain download URLs for individual tracks. Alternatively, each track can be downloaded using the Download link under the Download file column.
Command-line: use bash install_filer.sh <filtered_filer_metadata_template_file> <target_filer_directory_for_install>. Template files are available for all GRCh37/hg19 and GRCh38/hg38 FILER tracks. These template files can be filtered to obtain a desired set of tracks before running bash install_filer.sh (see also the section on installing a custom subset of FILER tracks for examples).
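As an illustration of filtering a template file from the command line, the sketch below keeps the header row plus the rows whose value in a named column matches exactly. It assumes the template is a tab-separated file with a header row. The helper name filter_tsv_by_column, the column name "Data Source" and the file names are assumptions made for this example, so check them against the metadata schema before use.

```shell
# Hypothetical helper: print the header and the rows where the named column
# equals the given value.
# $1 = TSV file, $2 = column name, $3 = value to keep
filter_tsv_by_column() {
    awk -F'\t' -v col="$2" -v val="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        $idx == val
    ' "$1"
}

# Example (file and column names are placeholders):
#   filter_tsv_by_column filer_template.tsv "Data Source" "ENCODE" > subset.tsv
#   bash install_filer.sh subset.tsv /path/to/filer_install_dir
```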
Q4. Explanations for data terms
- For descriptions of the main output data types and data terms, please see the data glossary, data terms, and references therein. For additional information, please also see the data Summary page (e.g., the ‘Data description’ tab).
Method
This section provides detailed information on obtaining the data and data pre-processing steps for each FILER data source.
FILER metadata:
Please refer to the FILER_v1_metadata_schema.xlsx [XLS spreadsheet; April 2021; 14KB] for the description of FILER v1.0 metadata information fields.
Refer to metadata_info.xlsx file for how and from where we obtained meta information for FILER.
Pre-processing:
All data provided in FILER are in BED format, except 1000 Genomes data (VCF) and sequence data (FASTA). Refer to FILER_file_formats.xlsx for a description of the different BED formats.
BED files use 0-based coordinates and are sorted and indexed (GIGGLE and tabix).
Pre-processing scripts:
The following scripts were used for every data source.
- Bed sort and bgzip (BED_sort-bgzip.sh)
- Giggle indexing (GIGGLE_index.sh)
- Tabix indexing (tabix_index.sh)
Pre-processing steps that are specific to a data source are described in the corresponding documentation section below, along with the scripts used.
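For orientation, the sort step that precedes bgzip compression and tabix indexing can be sketched as follows. This is not the content of BED_sort-bgzip.sh; it is a minimal example that assumes a tab-separated BED file without header or track lines, and the function name sort_bed is made up for illustration. GIGGLE indexing is not shown.

```shell
# Sort a BED file by chromosome, then numerically by start and end position.
# LC_ALL=C gives a byte-wise chromosome order that is stable across systems.
sort_bed() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n "$1"
}

# Compress and index (requires htslib's bgzip and tabix on PATH):
#   sort_bed input.bed | bgzip -c > input.sorted.bed.gz
#   tabix -p bed input.sorted.bed.gz
```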
ENCODE data download documentation.
Version:
Datasets released up to 05/2018 were downloaded.
Data download:
Access data from data matrix: https://www.encodeproject.org/matrix/?type=Experiment&status=released.
Use the following filter terms:
Assay title: ChIP-seq, DNase-seq, eCLIP, CAGE, FAIRE-seq, RNA-PET, ChIA-PET, RIP-seq, iCLIP, 5C, ATAC-seq, DNA-PET, Hi-C, MNase-seq.
Genome assembly: hg19, hg38
Available file types: bigBed narrowPeak, bigBed broadPeak, bigBed bedRnaElements, bigBed tss_peak, bigBed idr_peak, bigBed bed12, bigBed bed3, gtf, gff gff3.
After making the filter selection, use batch download (the instructions below are shown on the matrix page).
Click the “Download” button to download a “files.txt” file that contains a list of URLs: a link to a file containing all the experimental metadata, and links to download the data files. The first line of the file has the URL or command line to download the metadata file.
The “files.txt” file can be copied to any server. The following command using curl can be used to download all the files in the list:
xargs -L 1 curl -O -L < files.txt
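Because the first line of “files.txt” points to the metadata file, it can be convenient to handle it separately from the data file URLs. The sketch below assumes that the first line is a plain URL, possibly wrapped in double quotes; the function names and the output file name metadata.tsv are made up for illustration.

```shell
# Print the metadata URL (first line of files.txt), without surrounding quotes.
metadata_url() {
    head -n 1 "$1" | sed -e 's/^"//' -e 's/"$//'
}

# Print the data file URLs (every line after the first).
data_urls() {
    tail -n +2 "$1"
}

# Example (needs network access and curl):
#   curl -L -o metadata.tsv "$(metadata_url files.txt)"
#   data_urls files.txt | xargs -L 1 curl -O -L
```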
Data pre-processing:
- Convert bigBed to bed (bigBedToBed)
- Convert gff3 to bed (GFF3toBed.sh)
- Convert gtf to bed (GTFtobBed.sh)
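The key point in converting GFF3 to BED is the coordinate shift: GFF3 intervals are 1-based and inclusive, while BED intervals are 0-based and half-open, so the start is decreased by one and the end is kept. The sketch below is not GFF3toBed.sh; the function name and the choice of output columns (chrom, start, end, attributes, score, strand) are assumptions for illustration.

```shell
# Convert GFF3 (1-based, inclusive) to 6-column BED (0-based, half-open).
# Output columns: chrom, start, end, attributes, score, strand.
gff3_to_bed6() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 9 { next }          # skip comments, directives, short lines
         { print $1, $4 - 1, $5, $9, $6, $7 }' "$1"
}

# Example:
#   gff3_to_bed6 annotation.gff3 > annotation.bed
```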
GTEx data download documentation
Version:
v6, v6p, v7, v8
Data download:
GTEx v6: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/GTEx_Analysis_V6_eQTLs.tar.gz
GTEx v6P: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar
GTEx v6P: all snp-gene association https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_all-associations.tar
GTEx v7: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz
GTEx v7: all snp-gene association https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz
GTEx v8 eQTL: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL.tar
GTEx v8 eQTL: all snp-gene association Follow the instructions to download using google cloud https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_eQTL_all_associations/
GTEx v8 sQTL: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL.tar
GTEx v8 sQTL: all snp-gene association Follow the instructions to download using google cloud https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_sQTL_all_associations/
Data pre-processing:
All snp-gene association: because of the file size, we retained only variants with p-value < 0.05. Script used for filtering: “GTEx_filter_all_association.sh”.
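A p-value filter of this kind can be sketched with awk as shown below. This is not GTEx_filter_all_association.sh: the function name is hypothetical, the input is assumed to be a tab-separated file with a header row, and the p-value column name (pval_nominal in the usage example) is an assumption that differs between GTEx versions.

```shell
# Keep the header plus rows whose value in the named column is below a cutoff.
# $1 = TSV file (or - for stdin), $2 = p-value column name (assumed),
# $3 = cutoff (default 0.05)
filter_by_pvalue() {
    awk -v col="$2" -v cutoff="${3:-0.05}" '
        BEGIN { FS = OFS = "\t" }
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        ($idx + 0) < (cutoff + 0)     # numeric compare, handles 3.2e-08 style values
    ' "$1"
}

# Example (file name is a placeholder):
#   gzip -dc tissue.allpairs.txt.gz | filter_by_pvalue - pval_nominal > tissue.p05.txt
```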
Each version of GTEx has a different formatting structure; we used GTEx v6p all snp-gene association as the default format. The scripts below were used for creating BED files from GTEx data (for every all snp-gene association dataset, conversion was done after filtering variants).
- GTEx_v6_eQTL_signif_bed_conversion.sh
- GTEx_v6p_eQTL_signif_bed_conversion.sh
- GTEx_v6p_eQTL_all_association_bed_conversion.sh
- GTEx_v7_eQTL_signif_bed_conversion.sh
- GTEx_v7_eQTL_all_association_bed_conversion.sh
- GTEx_v8_eQTL_signif_bed_conversion.sh
- GTEx_v8_eQTL_all_association_bed_conversion.sh
- GTEx_v8_sQTL_signif_bed_conversion.sh
- GTEx_v8_sQTL_all_association_bed_conversion.sh
After conversion to BED format, we rearranged the columns for consistency (to match v6p all snp-gene association). Script used for re-formatting: “GTEx_BED_column_reformatting.sh”.
ROADMAP data download documentation
Version:
Release 9
Data download:
BroadPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/broadPeak/
GappedPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/gappedPeak/
NarrowPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/narrowPeak/
i) all_hg38lift.mnemonics.bedFiles.tgz
ii) all.mnemonics.bedFiles.tgz
Data pre-processing:
- ROADMAP enhancers – extracted enhancers from ChromHMM data – script used “ROADMAP_enhancer_extraction.sh”.
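Extracting enhancer segments from a ChromHMM mnemonics BED file amounts to selecting rows by the state label in column 4. The sketch below is not ROADMAP_enhancer_extraction.sh. The function name is hypothetical, and the default pattern (states 6_EnhG and 7_Enh of the 15-state model) is an assumption about which states count as enhancers; pass a different regular expression to change it.

```shell
# Print BED records whose 4th column (ChromHMM state mnemonic) matches a regex.
# $1 = mnemonics BED file, $2 = state regex (the default is an assumption)
extract_states() {
    state_pattern=${2:-'^(6_EnhG|7_Enh)$'}
    awk -v pat="$state_pattern" 'BEGIN { FS = "\t" } $4 ~ pat' "$1"
}

# Example (file name is a placeholder):
#   gzip -dc E001_15_coreMarks_mnemonics.bed.gz | extract_states - > E001_enhancers.bed
```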
FANTOM5 documentation
Version:
FANTOM5 Phase2.0
Data download:
Download the CAGE TSS BED files from http://fantom.gsc.riken.jp/5/datafiles/latest/basic/.
README files are located inside every sub-folder.
Enhancer download: http://slidebase.binf.ku.dk/human_enhancers/presets/serve/facet_expressed_enhancers.tgz
FactorBook documentation
Version:
Release date: 16-March-2014
Data download:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/factorbookMotifPos.txt.gz
Homer documentation
Version:
17/09/2017
Data download:
http://homer.ucsd.edu/homer/data/motifs/
Human hg38 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg38.170917.bigBed.track.txt
Human hg19 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg19.170917.bigBed.track.txt
Data pre-processing:
- HOMER data contain identical motif names; use the script “HOMER_modification_update_duplicate_motif_names.py” to fix them.
TargetScan documentation
Version:
7p2 (March 2018)
Data download:
http://www.targetscan.org/cgi-bin/targetscan/data_download.vert72.cgi
All predictions for representative transcripts: Genome (hg19) locations of all targets, partitioned into files by conservation of miRNA family and site.
Default predictions (conserved sites of conserved miRNA families): Genome (hg19) locations of human predicted (conserved) targets of conserved miRNA families.
Reference genome
hg19: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/ Feb. 2009 assembly of the human genome.
hg38: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ Dec. 2013 assembly of the human genome.
Gene model
Ensembl
hg19: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens release 75
hg38: ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens release 89
RefSeq: the UCSC Table Browser was used to download NCBI RefSeq data
hg19: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Genes and Gene Predictions, Track: NCBI RefSeq
hg38: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Genes and Gene Predictions, Track: NCBI RefSeq
Repeats
hg19: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Repeats, Track: RepeatMasker
hg38: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Repeats, Track: RepeatMasker
DASHR2 documentation
Version:
DASHR2 (August-2018)
Data Download:
Small RNA called peaks data:
http://dashr2.lisanwanglab.org/download.php
Select DASHR data collection: Download sncRNA tables (annotated and unannotated, from the Small RNA loci table row)
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR2_GEO_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_GEO_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_dataportal_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=ENCODE_dataportal_hg38&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR1_GEO_hg19&table=peaks
http://dashr2.lisanwanglab.org/table2csv.php?dataSourceDownload=DASHR1_GEO_hg38&table=peaks
DASHR annotation:
http://dashr2.lisanwanglab.org/downloads/dashr.v2.sncRNA.annotation.hg19.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.sncRNA.annotation.hg38.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.annotation.hg19.gff
http://dashr2.lisanwanglab.org/downloads/dashr.v2.annotation.hg38.gff
Data pre-processing:
The script "DASHR2_split_by_tissue.sh" was used to split the peak files by tissue.
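Splitting a peak file by tissue can be sketched as writing each row to a file named after its tissue value. The example below is not DASHR2_split_by_tissue.sh. It assumes a tab-separated input without a header row, and the function name, the column index holding the tissue, and the .bed output suffix are assumptions for illustration.

```shell
# Split a tab-separated file into one file per distinct value of a column.
# $1 = input file, $2 = 1-based index of the tissue column (assumed),
# $3 = existing output directory. Rows are appended, so start with an
# empty output directory to avoid mixing runs.
split_by_column() {
    awk -v c="$2" -v dir="$3" '
        BEGIN { FS = "\t" }
        {
            name = $c
            gsub(/[^A-Za-z0-9._-]/, "_", name)   # make the value filename-safe
            out = dir "/" name ".bed"
            print >> out
            close(out)                            # avoid too many open files
        }' "$1"
}

# Example (file name and column index are placeholders):
#   mkdir -p by_tissue && split_by_column peaks.hg38.bed 7 by_tissue
```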
1000 Genomes documentation
Version:
Phase3
Data Download:
Download the vcf files from the following FTP links:
Hg19: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.
Hg38: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/.
Data pre-processing:
VCF files for each chromosome were split by variant type (SNP, INDEL, biallelic, multiallelic), super-population, and sub-population.
Scripts used for splitting the VCF files are in the package “1kgenome_pre-processing_scripts.tar.gz”.
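To make the variant-type categories concrete, the sketch below labels each VCF record from its REF and ALT fields. It is not taken from the scripts in the package and is a simplification: any record with a multi-base allele is labelled INDEL (MNPs are not distinguished), symbolic or spanning-deletion alleles are labelled OTHER, and splitting by population is not covered. The function name is hypothetical.

```shell
# Label each VCF record with its allele-count class and variant type.
# Output: CHROM, POS, "biallelic"|"multiallelic", "SNP"|"INDEL"|"OTHER"
classify_vcf_records() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                                  # skip header lines
         {
             n = split($5, alt, ",")                    # ALT alleles
             arity = (n == 1) ? "biallelic" : "multiallelic"
             kind = (length($4) == 1) ? "SNP" : "INDEL"
             for (i = 1; i <= n; i++) {
                 if (alt[i] ~ /^</ || alt[i] == "*") { kind = "OTHER"; break }
                 if (length(alt[i]) != 1) kind = "INDEL"
             }
             print $1, $2, arity, kind
         }' "$1"
}

# Example (file name is a placeholder):
#   gzip -dc chr22.vcf.gz | classify_vcf_records - | cut -f 3,4 | sort | uniq -c
```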
Hg38 liftover documentation
Liftover was performed on all of the data sources for which no hg38 data were available.
Hg19 genome coordinates were lifted over using the UCSC liftOver utility with the hg19ToHg38.over.chain.gz chain file.
Hg19 coordinates are retained in the lifted-over files (columns 4, 5 and 6 hold the hg19 coordinates).
Download link: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz.
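One way to end up with the hg19 coordinates in columns 4 to 6 of the lifted-over file is to copy them into extra columns before running liftOver, which then rewrites only columns 1 to 3. The sketch below shows that idea under stated assumptions; it is not necessarily how the FILER files were produced, and the function and file names are hypothetical.

```shell
# Insert copies of the original chrom, start and end as columns 4-6 of a
# tab-separated BED file, so they survive a later liftOver of columns 1-3.
# Any existing columns 4+ are shifted to the right.
keep_original_coords() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             rest = ""
             for (i = 4; i <= NF; i++) rest = rest OFS $i
             print $1, $2, $3, $1, $2, $3 rest
         }' "$1"
}

# Example (requires the UCSC liftOver binary and the chain file):
#   keep_original_coords in.hg19.bed > in.hg19.tagged.bed
#   liftOver -bedPlus=3 -tab in.hg19.tagged.bed hg19ToHg38.over.chain.gz \
#       out.hg38.bed unmapped.bed
```

Here -bedPlus=3 tells liftOver to treat only the first three columns as coordinates and carry the remaining columns through unchanged.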