Genome build: hg19
Data download link: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRegTfbsClustered/
Documentation: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeRegTfbsClusteredV3
Motif data: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/factorbookMotifPos.txt.gz
pfm: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/factorbookMotifPwm.txt.gz
_________________________________________________________________________________________________________

Motifs processing - motif_processing.sh
* Motifs with a -log10(qvalue) score below 0.69 are filtered out (0.69 is -log10(0.2), approximately 0.699, truncated to two decimals)
* Motifs lacking pfm files are filtered out
* qvalue is calculated from the score column as 10^(-score)
* Binding sequences are obtained using samtools

Pfm processing - pwm_processing.sh
* Generated individual pfm files from the pfm link above for each motif in the motif bed file
* Added > to motifID

Peaks processing - tf_peak_processing.sh
* wgEncodeRegTfbsClusteredInputsV3.tab.gz: includes information such as the experiment's underlying Uniform TFBS table name, factor targeted, antibody used, cell type, treatment (if any), and laboratory source.
* wgEncodeRegTfbsClusteredV3.bed.gz: this format consists of the standard BED5 fields, followed by an experiment count field (expCount) and finally two fields containing comma-separated lists.
  * The first list field (expNums) contains numeric identifiers for experiments, keyed to the wgEncodeRegTfbsClusteredInputsV3 table.
  * The second list field (expScores) contains the scores for the corresponding experiments.
* Normalized the expNums and expScores in wgEncodeRegTfbsClusteredV3.bed
* Numbered the experiments in wgEncodeRegTfbsClusteredInputsV3.tab.gz
* Mapped the expNums between the wgEncodeRegTfbsClusteredInputsV3 and wgEncodeRegTfbsClusteredV3 files
* Split by cell type
* Split by transcription factor
* Added canonical TF terms from factorbookMotifCanonical, which connects the different terms used for the same factor

Motif overlap with peak data - overlap.py
* Ran bedtools intersect between motifs and peaks
  * bedtools intersect -a motif_data -b peak_data -wo
* Added header
* Rearranged columns: concatenated the peak file columns into 1 column
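
Two of the steps above can be sketched in Python: the score-to-qvalue rule from the motif filtering, and the column rearrangement applied to the bedtools intersect -wo output. This is only an illustration under stated assumptions, not the contents of motif_processing.sh or overlap.py. The function names, the "|" separator, and the number of motif columns are hypothetical; the only facts relied on are that qvalue = 10^(-score), that scores below 0.69 are dropped, and that -wo appends the overlap length in base pairs as the last column.

```python
def qvalue_from_score(score):
    """Convert a -log10(qvalue) score back to a qvalue: 10^(-score)."""
    return 10 ** (-score)


def keep_motif(score, min_score=0.69):
    """Return True if the motif passes the filter (score of at least 0.69)."""
    return score >= min_score


def collapse_peak_columns(line, n_motif_cols, sep="|"):
    """Rearrange one line of `bedtools intersect -wo` output.

    The first n_motif_cols fields (from -a) are kept as they are, the
    peak fields (from -b) are joined into a single column, and the
    overlap length that -wo appends stays as the last column.
    n_motif_cols and sep are assumptions made for this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    motif = fields[:n_motif_cols]
    peak = fields[n_motif_cols:-1]
    overlap = fields[-1]
    return "\t".join(motif + [sep.join(peak), overlap])


if __name__ == "__main__":
    # Made-up example row: 5 motif columns, 4 peak columns, overlap length.
    row = "chr1\t100\t110\tmotifA\t1.5\tchr1\t90\t200\tCTCF\t10"
    print(collapse_peak_columns(row, n_motif_cols=5))
```

With the made-up row in the example, the four peak fields end up in one column as chr1|90|200|CTCF, between the motif columns and the overlap length.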