About

FILER tutorial

Download

  1. Downloading FILER track metadata

    FILER track metadata are available in templated metadata format (TSV) for both GRCh37/hg19 and GRCh38/hg38 genome builds:

    Latest GRCh38/hg38 data

    Latest GRCh37/hg19 data

    Latest hg38-lifted data

    Please refer to the FILER_v1_metadata_schema.xlsx [XLS spreadsheet; April 2021; 14KB] for the description of the metadata fields.

  2. Downloading FILER tracks

    Download URLs and wget commands are provided in the Processed File Download URL and wget command columns.
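As an illustration, the download URLs can be pulled out of a metadata table with a few lines of Python. This is only a minimal sketch: the column name Processed File Download URL is taken from the text above, while the Identifier column and the URLs in the example are made up.

```python
import csv
import io

URL_COLUMN = "Processed File Download URL"  # column name given in the text above


def download_urls(metadata_tsv):
    """Return the non-empty download URLs listed in a FILER metadata table.

    `metadata_tsv` is any iterable of lines, e.g. an open file or io.StringIO.
    """
    reader = csv.DictReader(metadata_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[URL_COLUMN] for row in reader if row.get(URL_COLUMN)]


# Tiny in-memory example; the Identifier column and the URLs are hypothetical.
example = io.StringIO(
    "Identifier\tProcessed File Download URL\n"
    "TRACK_A\thttps://example.org/a.bed.gz\n"
    "TRACK_B\thttps://example.org/b.bed.gz\n"
)
print(download_urls(example))
```

The resulting list can then be passed to any downloader, or the wget command column can be used directly instead.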

  3. Deploying FILER instance on your system

    To install and use a stand-alone FILER instance, please refer to the FILER Bitbucket repository (http://bitbucket.org/wanglab-upenn/FILER) and the install_filer.sh installation script. The script downloads tracks/datasets, indexes individual tracks with tabix, and Giggle-indexes individual datasets.

Frequently Asked Questions (FAQ)

Q1. How can I download individual FILER tracks?

A1. Individual tracks can be downloaded in several ways:

  1. Using the FILER website: on the Browse page, use the Download links in the Download file column of the FILER track table.
  2. Using the download link provided in the FILER metadata table (see the Processed File Download URL column, e.g., in the GRCh37/hg19 or GRCh38/hg38 FILER metadata tables).

Q2. Running a FILER script fails with an error message:

e.g., declare: -A: invalid option or ERROR: Bash version 4+ is required

A2. FILER scripts require Bash v4.3+. Please check the installed version with bash --version and update if necessary (e.g., using brew install bash on macOS, yum update bash on CentOS, or apt-get install --only-upgrade bash on Ubuntu).
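The version check can also be automated. The sketch below assumes that the first line of the bash --version output has the usual "GNU bash, version X.Y.Z..." form; the function name is ours and is not part of FILER.

```python
import re


def bash_version_ok(version_output, minimum=(4, 3)):
    """Return True if `bash --version` output reports at least `minimum`."""
    match = re.search(r"version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a version number in the output")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# To check the local installation, the output can be captured like this:
#   import subprocess
#   out = subprocess.run(["bash", "--version"], capture_output=True, text=True).stdout
#   print(bash_version_ok(out))
print(bash_version_ok("GNU bash, version 3.2.57(1)-release (arm64-apple-darwin22)"))
```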

Q3. How can I download a custom subset of FILER tracks?

  1. Website: Use the data selectors/filters on the Browse page to filter tracks down to the desired set. Then click the Download button above the FILER track table to download FILER track metadata for the selected tracks. The Processed File Download URL column will contain download URLs for the individual tracks. Alternatively, each track can be downloaded using the Download link in the Download file column.

  2. Command-line: use bash install_filer.sh <filtered_filer_metadata_template_file> <target_filer_directory_for_install>. Template files are available for all GRCh37/hg19 and GRCh38/hg38 FILER tracks. These template files can be filtered to obtain the desired set of tracks before running bash install_filer.sh (see also the section on installing a custom subset of FILER tracks for examples).
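One possible way to filter a template file before installation is sketched below. It keeps the header row plus every row whose value in a chosen column is in a wanted set. The column name "Data Source" and its values are hypothetical placeholders; please check the metadata schema for the actual field names.

```python
def filter_template(lines, column, wanted):
    """Yield the header line, then every line whose `column` value is in `wanted`."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    idx = header.split("\t").index(column)  # raises ValueError if the column is absent
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > idx and fields[idx] in wanted:
            yield line.rstrip("\n")


# In-memory example with a hypothetical "Data Source" column.
template = [
    "Identifier\tData Source\n",
    "TRACK_A\tENCODE\n",
    "TRACK_B\tGTEx\n",
]
for row in filter_template(template, "Data Source", {"ENCODE"}):
    print(row)
```

The filtered lines can be written to a new file, which is then passed as the first argument to bash install_filer.sh.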

Q4. Explanations for data terms

  1. For descriptions of the main output data types and data terms, please see the data glossary, data terms, and references therein. For additional information, please also see the data Summary page (e.g., the ‘Data description’ tab).

Method

This section provides detailed information on obtaining the data and data pre-processing steps for each FILER data source.

FILER metadata:

Please refer to the FILER_v1_metadata_schema.xlsx [XLS spreadsheet; April 2021; 14KB] for the description of FILER v1.0 metadata information fields.

Refer to the metadata_info.xlsx file for how and where we obtained the metadata for FILER.

Pre-processing:

All of the data provided in FILER are in BED format, except for 1000 Genomes data (VCF) and sequence data (FASTA). Refer to FILER_file_formats.xlsx for a description of the different BED formats.

BED files use 0-based start coordinates and are sorted and indexed (with GIGGLE and tabix).

Pre-processing scripts:

The following scripts were used for every data source.

  1. Bed sort and bgzip (BED_sort-bgzip.sh)
  2. Giggle indexing (GIGGLE_index.sh)
  3. Tabix indexing (tabix_index.sh)

Pre-processing steps that are specific to individual data sources are described in the sections below, along with the scripts used.
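For orientation, the sorting step can be sketched in Python as below. tabix expects position-sorted, bgzip-compressed input, so records are ordered by chromosome name and then by numeric start and end. This is an illustration only and not the content of BED_sort-bgzip.sh; skipping comment, track, and browser lines is our assumption.

```python
def sort_bed(lines):
    """Return BED lines sorted by chromosome name, then numeric start and end."""

    def key(record):
        chrom, start, end = record.split("\t")[:3]
        return chrom, int(start), int(end)

    records = [
        ln.rstrip("\n")
        for ln in lines
        if ln.strip() and not ln.startswith(("#", "track", "browser"))
    ]
    return sorted(records, key=key)


bed = ["chr2\t5\t10\n", "chr1\t100\t200\n", "chr1\t20\t30\n"]
for record in sort_bed(bed):
    print(record)
```

Chromosome names are compared as strings here, so chr10 sorts before chr2; this is sufficient for indexing, which only needs records of the same chromosome to be contiguous and ordered by position.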

ENCODE data download documentation

Version:

Downloaded datasets released until 05/2018.

Data download:

Access data from data matrix: https://www.encodeproject.org/matrix/?type=Experiment&status=released.

Use the following filter terms:

Assay title: ChIP-seq, DNase-seq, eCLIP, CAGE, FAIRE-seq, RNA-PET, ChIA-PET, RIP-seq, iCLIP, 5C, ATAC-seq, DNA-PET, Hi-C, MNase-seq.

Genome assembly: hg19, hg38

Available file types: bigBed narrowPeak, bigBed broadPeak, bigBed bedRnaElements, bigBed tss_peak, bigBed idr_peak, bigBed bed12, bigBed bed3, gtf, gff gff3.

After making the filter selection, use batch download (the instructions below are shown on the matrix page).

Click the “Download” button to download a “files.txt” file that contains a list of URLs to a file containing all the experimental metadata and links to download the files. The first line of the file has the URL or command line to download the metadata file.

The “files.txt” file can be copied to any server. The following command using curl can be used to download all the files in the list:

xargs -L 1 curl -O -L < files.txt
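If the metadata file and the data files need to be handled separately, the list can be split first. The following is a sketch under the assumption that each line of files.txt holds one URL and that the first line (the metadata URL) may be wrapped in double quotes; the function names and the fallback file name metadata.tsv are ours.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def split_file_list(lines):
    """Split a batch-download list into (metadata_url, data_file_urls)."""
    urls = [ln.strip().strip('"') for ln in lines if ln.strip()]
    if not urls:
        raise ValueError("the file list is empty")
    return urls[0], urls[1:]


def download_all(urls, target_dir="."):
    """Download each URL into target_dir, named after its last path component."""
    for url in urls:
        name = os.path.basename(urlsplit(url).path) or "metadata.tsv"
        urllib.request.urlretrieve(url, os.path.join(target_dir, name))


# Example usage (performs network access, so it is left commented out):
#   with open("files.txt") as handle:
#       metadata_url, file_urls = split_file_list(handle)
#   download_all([metadata_url] + file_urls)
```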

Data pre-processing:

  1. Convert bigBed to bed (bigBedToBed)
  2. Convert gff3 to bed (GFF3toBed.sh)
  3. Convert gtf to bed (GTFtobBed.sh)
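The key point in the GTF/GFF3 conversions is the coordinate system: GTF and GFF3 use 1-based, inclusive coordinates, whereas BED uses a 0-based start and an exclusive end, so the start is decremented by one and the end is kept. The sketch below illustrates this; it is not the content of GFF3toBed.sh or GTFtobBed.sh, and using the feature type as the BED name column is our assumption.

```python
def gtf_to_bed6(line):
    """Convert one GTF/GFF3 feature line to a BED6 line (0-based start)."""
    f = line.rstrip("\n").split("\t")
    chrom, feature, score, strand = f[0], f[2], f[5], f[6]
    start, end = int(f[3]), int(f[4])
    # 1-based inclusive [start, end] becomes 0-based half-open [start - 1, end)
    return "\t".join([chrom, str(start - 1), str(end), feature, score, strand])


gtf = 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";'
print(gtf_to_bed6(gtf))
```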

GTEx data download documentation

Version:

v6, v6p, v7, v8

Data download:

GTEx v6: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v6/single_tissue_eqtl_data/GTEx_Analysis_V6_eQTLs.tar.gz

GTEx v6P: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_eQTL.tar

GTEx v6P: all snp-gene association https://storage.googleapis.com/gtex_analysis_v6p/single_tissue_eqtl_data/GTEx_Analysis_v6p_all-associations.tar

GTEx v7: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz

GTEx v7: all snp-gene association https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz

GTEx v8 eQTL: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL.tar

GTEx v8 eQTL: all snp-gene association. Follow the instructions to download using Google Cloud: https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_eQTL_all_associations/

GTEx v8 sQTL: significant snp-gene association https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_sQTL.tar

GTEx v8 sQTL: all snp-gene association. Follow the instructions to download using Google Cloud: https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_sQTL_all_associations/

Data pre-processing:

All snp-gene association: because of the file size, we filtered these files to variants with p-value < 0.05. Script used for filtering: “GTEx_filter_all_association.sh”.
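The filtering step can be sketched as follows. This is not GTEx_filter_all_association.sh itself: we assume that rows with p-value strictly below the threshold are kept, that the file is tab-separated with a header line, and that the p-value column is named pval_nominal (the name may differ between GTEx versions).

```python
def filter_by_pvalue(lines, pval_column="pval_nominal", threshold=0.05):
    """Yield the header, then rows whose p-value is strictly below `threshold`."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    idx = header.split("\t").index(pval_column)  # assumed column name
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if float(fields[idx]) < threshold:
            yield line.rstrip("\n")


# Made-up rows for illustration only.
table = [
    "gene_id\tvariant_id\tpval_nominal\n",
    "GENE_A\t1_100_A_G_b37\t0.01\n",
    "GENE_A\t1_200_C_T_b37\t0.50\n",
]
for row in filter_by_pvalue(table):
    print(row)
```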

Each version of GTEx has a different file format; we used the GTEx v6p all snp-gene association format as the default. The following scripts were used for creating BED files from GTEx data (for every all snp-gene association dataset, conversion was done after filtering variants).

  1. GTEx_v6_eQTL_signif_bed_conversion.sh
  2. GTEx_v6p_eQTL_signif_bed_conversion.sh
  3. GTEx_v6p_eQTL_all_association_bed_conversion.sh
  4. GTEx_v7_eQTL_signif_bed_conversion.sh
  5. GTEx_v7_eQTL_all_association_bed_conversion.sh
  6. GTEx_v8_eQTL_signif_bed_conversion.sh
  7. GTEx_v8_eQTL_all_association_bed_conversion.sh
  8. GTEx_v8_sQTL_signif_bed_conversion.sh
  9. GTEx_v8_sQTL_all_association_bed_conversion.sh

After conversion to BED format, we rearranged the columns for consistency (to match v6p all snp-gene association). Script used for re-formatting: “GTEx_BED_column_reformatting.sh”.
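To give an idea of what such a conversion involves, the sketch below turns a GTEx variant identifier into a BED interval. It assumes identifiers of the form chromosome_position_ref_alt_build with a 1-based position (e.g. 1_13417_C_CGAGA_b37 or chr1_13550_G_A_b38). Adding a chr prefix and letting the interval span the reference allele are our choices and are not necessarily what the FILER conversion scripts do.

```python
def variant_id_to_bed(variant_id):
    """Return (chrom, start, end) for a chrom_pos_ref_alt_build identifier."""
    chrom, pos, ref, _alt, _build = variant_id.split("_")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: UCSC-style chromosome names
    start = int(pos) - 1  # 1-based position to 0-based BED start
    end = start + len(ref)  # assumption: interval spans the reference allele
    return chrom, start, end


print(variant_id_to_bed("1_13417_C_CGAGA_b37"))
print(variant_id_to_bed("chr1_13550_G_A_b38"))
```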

ROADMAP data download documentation

Version:

Release 9

Data download:

BroadPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/broadPeak/

GappedPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/gappedPeak/

NarrowPeak: https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/narrowPeak/

ChromHMM: https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/

i) all_hg38lift.mnemonics.bedFiles.tgz

ii) all.mnemonics.bedFiles.tgz

Data pre-processing:

  1. ROADMAP enhancers: enhancers were extracted from the ChromHMM data using the script “ROADMAP_enhancer_extraction.sh”.
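A minimal sketch of such an extraction is given below. It assumes the mnemonics BED files have four tab-separated columns (chromosome, start, end, state) and that the states 6_EnhG and 7_Enh of the core 15-state model are treated as enhancers. Which states ROADMAP_enhancer_extraction.sh actually selects is not stated here, so the state set is a configurable assumption.

```python
ENHANCER_STATES = frozenset({"6_EnhG", "7_Enh"})  # assumption, adjust as needed


def extract_enhancers(lines, states=ENHANCER_STATES):
    """Yield ChromHMM segments whose state label (column 4) is in `states`."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] in states:
            yield "\t".join(fields)


segments = [
    "chr10\t0\t119600\t15_Quies\n",
    "chr10\t119600\t120400\t7_Enh\n",
]
for segment in extract_enhancers(segments):
    print(segment)
```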

FANTOM5 documentation

Version:

FANTOM5 Phase2.0

Data download:

Download the CAGE TSS BED files from http://fantom.gsc.riken.jp/5/datafiles/latest/basic/

README files are located inside every sub-folder.

Enhancer download: http://slidebase.binf.ku.dk/human_enhancers/presets/serve/facet_expressed_enhancers.tgz

FactorBook documentation

Version:

Release date: 16-March-2014

Data download:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/factorbookMotifPos.txt.gz

Homer documentation

Version:

17/09/2017

Data download:

http://homer.ucsd.edu/homer/data/motifs/

Human hg38 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg38.170917.bigBed.track.txt

Human hg19 UCSC BigBed Track 170917: http://homer.ucsd.edu/homer/data/motifs/homer.KnownMotifs.hg19.170917.bigBed.track.txt

Data pre-processing:

  1. HOMER data contain duplicate motif names; use the script “HOMER_modification_update_duplicate_motif_names.py” to make them unique.
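One simple way to make repeated names unique is to append a running index to every name that occurs more than once, as sketched below. The suffix scheme is our illustration and may differ from what HOMER_modification_update_duplicate_motif_names.py does.

```python
from collections import Counter


def make_unique(names):
    """Append _1, _2, ... to every name that occurs more than once."""
    names = list(names)
    totals = Counter(names)
    seen = Counter()
    unique = []
    for name in names:
        if totals[name] == 1:
            unique.append(name)
        else:
            seen[name] += 1
            unique.append(f"{name}_{seen[name]}")
    return unique


print(make_unique(["CTCF", "GATA3", "CTCF"]))
```

Note that this sketch does not guard against a suffixed name colliding with a name that already exists in the input.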

TargetScan documentation

Version:

7p2 (March 2018)

Data download:

http://www.targetscan.org/cgi-bin/targetscan/data_download.vert72.cgi

All predictions for representative transcripts: Genome (hg19) locations of all targets, partitioned into files by conservation of miRNA family and site.

Default predictions (conserved sites of conserved miRNA families): Genome (hg19) locations of human predicted (conserved) targets of conserved miRNA families.

Reference genome

hg19: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/ Feb. 2009 assembly of the human genome.

hg38: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ Dec. 2013 assembly of the human genome.

Gene model

Ensembl

hg19: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens release 75

hg38: ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens release 89

RefSeq

The UCSC Table Browser was used to download NCBI RefSeq data.

hg19: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Genes and Gene Predictions, Track: NCBI RefSeq

hg38: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Genes and Gene Predictions, Track: NCBI RefSeq

Repeats

hg19: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Feb. 2009 (GRCh37/hg19), group: Repeats, Track: RepeatMasker

hg38: http://genome.ucsc.edu/cgi-bin/hgTables genome: Human, assembly: Dec. 2013 (GRCh38/hg38), group: Repeats, Track: RepeatMasker

1000 Genomes documentation

Version:

Phase3

Data Download:

Download the vcf files from the following FTP links:

hg19: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

hg38: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/

Data pre-processing:

VCF files for each chromosome were split by variant type (SNP, INDEL, biallelic, multiallelic) and by super-population and sub-population.

Scripts used for splitting the VCF files are in the package “1kgenome_pre-processing_scripts.tar.gz”.
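For illustration, the variant-type part of the split can be expressed as a small classifier on the REF and ALT fields of a VCF record. The definitions below (single-base alleles for SNP, a length difference for INDEL, symbolic alleles and everything else as OTHER) are our assumptions and are not taken from the pre-processing package.

```python
def classify_variant(ref, alt):
    """Return (kind, allelic) for the REF and ALT fields of one VCF record."""
    alleles = alt.split(",")
    allelic = "biallelic" if len(alleles) == 1 else "multiallelic"
    if any(a.startswith("<") for a in alleles):
        kind = "OTHER"  # symbolic alleles such as <CN0>
    elif len(ref) == 1 and all(len(a) == 1 for a in alleles):
        kind = "SNP"
    elif any(len(a) != len(ref) for a in alleles):
        kind = "INDEL"
    else:
        kind = "OTHER"  # e.g. multi-nucleotide substitutions
    return kind, allelic


print(classify_variant("A", "G"))
print(classify_variant("AT", "A"))
```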

Hg38 liftover documentation

Liftover was performed on all of the data sources for which no hg38 data was available.

Hg19 genome coordinates were lifted over to hg38 using the UCSC liftOver utility with the hg19ToHg38.over.chain.gz chain file.

Hg19 coordinates are retained in the lifted-over files (columns 4, 5, and 6 hold the hg19 coordinates).

Download link: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
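One way to obtain this column layout is to copy the hg19 coordinates into columns 4 to 6 before the coordinate conversion, so that a step which rewrites only columns 1 to 3 leaves them untouched. The sketch below shows this preparation step only. It is our illustration, not necessarily how the FILER files were produced.

```python
def add_source_coords(bed_line):
    """Copy columns 1-3 into columns 4-6, shifting the remaining columns right."""
    fields = bed_line.rstrip("\n").split("\t")
    return "\t".join(fields[:3] + fields[:3] + fields[3:])


print(add_source_coords("chr1\t100\t200\tpeak1"))
```

After the lift-over, columns 1 to 3 hold hg38 coordinates and columns 4 to 6 still hold the original hg19 coordinates.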