Summary
Transcription factor (TF) binding is detectable in assay for transposase-accessible chromatin using sequencing (ATAC-seq) experiments, where bound TFs block transposase insertions, leaving a depletion of insertions known as a “footprint.” Here, we present a computational protocol for detecting genetic variants associated with footprint-inferred TF binding. We describe steps to run the PRINT footprinting software to quantify TF binding likelihood at variants across multiple genotyped ATAC-seq samples and then run regressions to measure genetic associations. This protocol can implicate causal variants in disease-associated loci.
For complete details on the use and execution of this protocol, please refer to Dudek et al.1
Subject areas: Bioinformatics, Genetics, Sequencing
Highlights
• Computational protocol for mapping variants associated with transcription factor binding
• Steps to predict transcription factor binding using PRINT footprinting software
• Instructions to run regressions to measure associations between genotype and binding
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
Most disease-associated variants revealed by genome-wide association studies (GWAS) are non-coding,2 and concentrated near transcription factor (TF) binding motifs.2,3 This suggests that disruption of TF motif sequences is a common mechanism of disease risk. The assay for transposase-accessible chromatin using sequencing (ATAC-seq)4 method can be used to detect TF binding events. In this method, the transposase Tn5 inserts sequencing adapters into DNA, preferentially at genomic locations where chromatin is open.4 However, bound TFs also partially block Tn5, leaving a pattern of relatively depleted Tn5 insertion sites known as a “footprint”.5 Compared to the TF-detection method chromatin immunoprecipitation sequencing (ChIP-seq),6 ATAC-seq is cheaper, easier to process, and technically more uniform, and it requires less sample. In addition, footprints can detect binding sites without prior knowledge of the identity of the bound TF. This makes genome-wide mapping of binding sites feasible across a large sample size from a diverse set of donors. In recent years, multiple algorithms have been developed to quantify binding strength using footprint patterns in ATAC-seq or DNase-seq data.7,8,9,10
To understand the regulatory mechanisms of GWAS variants, several studies test their association with molecular traits such as expression, splicing, or chromatin accessibility, finding quantitative trait loci (QTLs).11 Recently, we used liver ATAC-seq samples from 170 individuals, and discovered variants associated with footprint-inferred TF binding, known as footprint QTLs (fpQTLs).1 Here, we present a protocol to discover fpQTLs in a similar ATAC-seq dataset, using the footprinting algorithm PRINT12 to compute TF-binding likelihood scores. Unlike other QTL discovery, fpQTL discovery is not limited in resolution by linkage disequilibrium (LD), because TF binding at one variant will not impact the binding score at nearby variants. This makes fpQTL discovery a valuable resource for fine-mapping causal variants at GWAS loci, as well as implicating specific TFs based on sequence motifs directly altered by the variant.
This protocol describes how to discover fpQTLs in a set of ATAC-seq samples from a particular cell or tissue type across multiple individuals, where several common single-nucleotide polymorphisms (SNPs) have been genotyped. We outline steps to (1) prepare the ATAC-seq and genotype data, (2) run PRINT to compute TF binding scores at each SNP in each sample, and (3) run a regression on each SNP to test its association with TF binding score across samples.
Innovation
Compared to eQTL or GWAS variant discovery, fpQTL discovery is not limited in resolution due to linkage disequilibrium (LD). This is because the phenotype tested against each variant (the TF binding score) differs at every SNP based on local insertion patterns, and so fpQTL regressions can precisely resolve associations with binding at base-pair resolution. Our fpQTL protocol improves on an existing approach13 in a few ways. First, our approach is motif-agnostic, and so SNPs are not required to overlap a known sequence motif in order to be tested. Second, our protocol uses the newer footprinting method PRINT, which was shown to have more robust Tn5 bias correction and more accurate binding detection than previous approaches.12 Third, rather than testing every SNP against the binding score of every nearby motif, our approach tests each SNP against only one score: the score of the footprint centered at that SNP. This uses the assumption that sequence motifs are bound directly by the TF to reduce the multiple-testing burden of fpQTL discovery.
Institutional permissions
Collection of human tissue samples for ATAC-seq and genotyping must be approved by the relevant Institutional Review Board, and conducted in accordance with the Declarations of Helsinki and Istanbul.
Installation
Timing: 30–60 min
-
1. Install UNIX software.
Note: The PRINT software package was developed in R. This protocol has been tested on Red Hat Enterprise Linux 9.1 using R v.4.4.0, but all provided code will assume that any Linux operating system is being used, with R >=4.3.
-
a. Check if bzip2 is available.
$ bzip2 --version
Note: bzip2 is included by default on most Linux distributions. If it is not available on your HPC system, you may need to load it as a module. Otherwise, you will need to download and build the source code: https://www.sourceware.org/bzip2/
-
b. Check if BEDtools is installed.
$ bedtools --version
If it is not installed, follow the installation guide at https://bedtools.readthedocs.io/en/latest/content/installation.html.
-
c. Check if SAMtools is installed.
$ samtools --version
If it is not installed, follow the installation guide at https://www.htslib.org/download/.
-
2. Install R software.
-
a. Check if R is available.
$ R --version
If it is not available, install the latest version by following the installation guide: https://www.r-project.org/.
-
b. Check if Python is available.
$ python --version
If it is not available, install the latest version by following the installation guide: https://www.python.org/downloads/.
Note: Python is not used directly in this protocol, but the R packages keras and tensorflow require the corresponding Python modules installed in a location accessible to reticulate (see troubleshooting).
-
c. Install required R packages.
> install.packages("tidyverse")
> install.packages("vcfR")
> install.packages("reticulate")
> install.packages("gtools")
> install.packages("pbmcapply")
> install.packages("doSNOW")
> install.packages("keras")
> install.packages("cladoRcpp")
> install.packages("caTools")
> install.packages("collapse")
> install.packages("pbapply")
> install.packages("betareg")
> install.packages("hdf5r")
> if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
> BiocManager::install("GenomicRanges")
> BiocManager::install("SummarizedExperiment")
> BiocManager::install("preprocessCore")
> BiocManager::install("qvalue")
Note: Tidyverse installation requires the C++ libraries openssl, harfbuzz and fribidi, which must be installed in their “dev” version (i.e. libharfbuzz-dev under Ubuntu systems). hdf5r additionally requires the libhdf5-dev library.
-
d. Check if git is installed.
$ git --version
If it is not available, install the appropriate version for your operating system by following the installation guide at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.
-
e. Clone the PRINT R repository in a directory of your choice.
$ cd /path/to/src
$ git clone https://github.com/buenrostrolab/PRINT.git
-
3. Download PRINT input data.
Note: Relevant PRINT input data (TF binding models and precomputed Tn5 bias profiles) can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.7121026. Our use of this protocol in Dudek et al. used Version v.2 of this repository, but newer versions of the models are available.
-
a. Download precomputed Tn5 bias for the relevant genome.
CRITICAL: The download link provided in the code below directs to precomputed bias for the genome hg38, in v.16 of the Zenodo repository. You must change the link accordingly if you are using a different genome build, or wish to use a newer version of the bias model. See the link to the Zenodo repository above for the full list of available bias profiles.
CRITICAL: The precomputed bias file must be placed in the directory shown below in order to be found properly by PRINT.
$ mkdir -p /path/to/src/PRINT/data/shared/precomputedTn5Bias
$ cd /path/to/src/PRINT/data/shared/precomputedTn5Bias
$ wget https://zenodo.org/records/15224770/files/hg38Tn5Bias.tar.gz
$ tar -xzvf hg38Tn5Bias.tar.gz
-
b. Download the TF binding score model.
CRITICAL: The download link provided in the code below directs to the TFBS model in v.16 of the Zenodo repository. You must change the link accordingly if you wish to use a newer version of the TFBS model.
$ cd /path/to/src/PRINT/data/shared/
$ wget https://zenodo.org/records/15224770/files/TFBS_model.h5
Optional: This protocol uses the PRINT R package to calculate TF binding scores. However, the developers of PRINT have recently published a faster Python package scPrinter, which implements the exact same method. The package can be downloaded and installed following the instructions here: https://github.com/buenrostrolab/scPrinter.
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| PRINT TFBS model and precomputed Tn5 bias | Hu et al.12 | https://doi.org/10.5281/zenodo.7121026 |
| Software and algorithms | ||
| BEDtools v.2.31.0 | Quinlan et al.14 | bedtools.readthedocs.io/en/latest/index.html |
| SAMtools v.1.16.1 | Li et al.15 | htslib.org |
| R v.4.4.0 | The R Foundation | r-project.org/ |
| Python 3 | Python Software Foundation | https://www.python.org/ |
| PRINT R | Hu et al.12 | github.com/buenrostrolab/PRINT |
| tidyverse | Wickham et al.16 | https://www.tidyverse.org/ |
| vcfR | Knaus and Grünwald17 | https://cran.r-project.org/package=vcfR |
| GenomicRanges | Lawrence et al.18 | https://bioconductor.org/packages/GenomicRanges/ |
| betareg | Cribari-Neto and Zeileis19 | https://cran.r-project.org/package=betareg |
| preprocessCore | Bolstad20 | https://bioconductor.org/packages/preprocessCore |
| qvalue | Storey et al.21 | https://bioconductor.org/packages/qvalue |
| Other | ||
| CPU | Intel | Intel Core i9-10900X CPU @ 3.70GHz |
| RAM | CRUCIAL | 128 GB |
| Operating system | Red Hat, Inc. | GNU/Linux Red Hat Enterprise 9.1 |
Step-by-step method details
Extraction of ATAC-seq fragment coordinates for each sample
Timing: 10 min per sample (∼17 h for 100 samples)
To read in ATAC-seq data, PRINT requires a “fragment file”, which lists the coordinates of each DNA fragment in the sequencing library. Each fragment is generated by a Tn5 insertion on either end. Here, we demonstrate how to make this file for each ATAC-seq sample.
Note: For simplicity, this step describes the protocol for processing a single ATAC-seq sample. You will need to loop over all of your samples in sequence, or process them in parallel to save time.
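The per-sample loop mentioned in the note above can be sketched as follows. This is a minimal illustration only: the sample list file (`samples.txt`, one name per line) and the `process_sample` function are hypothetical placeholders that stand in for the commands of steps 2 and 3, and are not part of the protocol.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample list for illustration: one sample name per line.
workdir=$(mktemp -d)
printf 'sample1\nsample2\nsample3\n' > "$workdir/samples.txt"

# Placeholder for the per-sample work (steps 2 and 3 of this section).
# Here it only reports the paths it would read and write.
process_sample() {
  local sample_name="$1"
  echo "$sample_name: /path/to/align/$sample_name.bam -> /path/to/frags/$sample_name.tsv.gz"
}

# Process the samples one after another.
loop_output=$(
  while read -r sample_name; do
    process_sample "$sample_name"
  done < "$workdir/samples.txt"
)
echo "$loop_output"
rm -r "$workdir"
```

On an HPC cluster, the body of the loop would typically be submitted as one job per sample instead of being run in sequence.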
-
1.
Process reads from the ATAC-seq library using a standardized pipeline.
ATAC-seq reads can be processed using the standard ENCODE pipeline: encodeproject.org/atac-seq/. Specifically, two types of files should be generated.
-
a. Align ATAC-seq reads to generate .bam files, one for each sample.
-
b. Call open chromatin peaks across all samples to generate a single peak file (e.g. in .narrowPeak format).
Note: This peak file will be used to filter for SNPs located within open chromatin.
CRITICAL: Before continuing this protocol, the user should ensure that all genomic files (ATAC-seq .bam, peak .narrowPeak, genotype .vcf) use the same chromosome naming convention (e.g. “chr12” or “12”).
-
c. Ensure that minimum quality control thresholds are met according to ENCODE standards, for example:
-
i. Each sample should have at least 50 million non-mitochondrial, paired reads (i.e. 25 million fragments).
Note: The number of fragments can be checked easily after the fragment file is created by counting the number of lines (see step 3).
-
ii. The percentage of mapped reads (alignment rate) should be at least 95%.
-
iii. The number of called peaks across all samples should be at least 150,000.
For a full list of QC threshold recommendations, see encodeproject.org/atac-seq/#standards.
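The chromosome-naming check described in the CRITICAL note of step 1b can be sketched as below. This sketch assumes uncompressed, tab-delimited toy files created on the spot, and the helper name `chrom_style` is hypothetical. For real data, the chromosome names of a .bam file would be read from its header (e.g. with `samtools view -H`), which is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
# Toy peak file (BED-like) and toy VCF, for illustration only.
printf 'chr1\t100\t200\nchr2\t300\t400\n' > "$workdir/peaks.narrowPeak"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t150\trs1\tA\tG\n' > "$workdir/genotypes.vcf"

# Report "chr" if every non-header line starts with "chr",
# "no_chr" if none does, and "mixed" otherwise.
chrom_style() {
  awk -F'\t' '
    /^#/ { next }                      # skip VCF header lines
    $1 ~ /^chr/ { n_chr++; next }      # names such as "chr12"
    { n_plain++ }                      # names such as "12"
    END {
      if (n_chr > 0 && n_plain > 0) print "mixed"
      else if (n_chr > 0) print "chr"
      else print "no_chr"
    }' "$1"
}

peak_style=$(chrom_style "$workdir/peaks.narrowPeak")
vcf_style=$(chrom_style "$workdir/genotypes.vcf")
echo "peaks: $peak_style, vcf: $vcf_style"
if [ "$peak_style" != "$vcf_style" ]; then
  echo "Chromosome naming differs; convert one file before continuing."
fi
rm -r "$workdir"
```

In this toy case the peak file uses “chr1” while the VCF uses “1”, so the sketch reports a mismatch.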
-
2.
Sort the alignment file by read name.
Note: In order to match up read pairs in the next step, BEDtools requires that aligned reads are sorted by name. To sort the file, SAMtools requires that it is indexed, which is run as a preliminary step below.
$ SAMPLE_NAME="sample1"
$ BAM="/path/to/align/$SAMPLE_NAME.bam"
$ SORTED_BAM="/path/to/align_sorted/$SAMPLE_NAME.namesorted.bam"
$ CORES=32 # adjust based on available CPUs
$ MEM="4G" # memory PER core
$ samtools index -@ $CORES $BAM
$ samtools sort -n -@ $CORES -m $MEM -o $SORTED_BAM $BAM
-
3. Extract and filter fragments from alignment file.
Note: Because most computational steps in the following script are single-threaded, we recommend running it on one CPU per sample, in parallel across samples if possible. However, sorting and compression can be sped up if multiple CPUs are available, so we’ve provided that option in the code.
$ FRAG_FILE="/path/to/frags/$SAMPLE_NAME.tsv.gz"
$ CORES=1
$ MEM="16G" # Total memory
$ awkcommand='{
  prefix = ($1 ~ /^chr/) ? "" : "chr";
  if($1 ~ /^chr[0-9]+$|^[0-9]+$/ && ($1==$4) && ($6-5>$2+4) && ($9=="+" || ($2==$5 && $3==$6))) {
    print prefix $1,$2+4,$6-5,BARCODE
  }
}'
$ bedtools bamtobed -i $SORTED_BAM -bedpe 2> /dev/null | \
  awk -v OFS="\t" -v BARCODE="$SAMPLE_NAME" "$awkcommand" | \
  sort --parallel=$CORES -S $MEM -k1,1 -k2,2n -k3,3n | \
  uniq -c | \
  awk -v OFS="\t" '{print $2, $3, $4, $5, $1}' | \
  pigz -p $CORES > \
  $FRAG_FILE
Below is an explanation of the steps performed in the above script:
-
a. bedtools bamtobed -bedpe.
This command writes BAM alignments in BEDPE (“paired-end”) format. The relevant columns of this format are:
$1 – chromosome (read1)
$2 – start (read1)
$3 – end (read1)
$4 – chromosome (read2)
$5 – start (read2)
$6 – end (read2)
…
$9 – strand (read1)
The start and end coordinates of each read are joined with those of its pair in the same row.
Note: This command warns about all unpaired reads, so we prevent excessive output by redirecting stderr to /dev/null. To debug this step, you should remove this redirection. See documentation for more details: https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html.
-
b. awk command.
Several filtering and data cleaning steps are performed by the awk command.
-
i. prefix.
This ensures that chromosomes are notated in the format “chr12” rather than “12”, for consistency across samples.
-
ii. if-statement filtering.
The if statement contains four filtering conditions, separated by && (and). They are, in order: (1) keep only autosomal chromosomes 1–22, (2) keep only read pairs matched to the same chromosome, (3) keep only read pairs where the start insertion occurs before the end insertion (discard incorrect alignments), and (4) keep only read pairs where read 1 maps to the + strand, or where both reads have identical start and end coordinates.
Note: The final filtering step (4) is usually not necessary, but can remove incorrect alignments – see https://github.com/HYsxe/PRINT/issues/6 for details.
-
iii. print.
The final step of awk returns the chromosome, start, and end coordinates of each fragment.
Note: The BARCODE variable is a dummy column, since PRINT is configured to work with single-cell data.
Note: In order to convert read coordinates into fragment coordinates, they must be shifted to account for where Tn5 inserted relative to the resulting reads. Most standard pipelines shift +4 bp for reads mapping to the + strand, and -5 bp for reads mapping to the - strand. However, the shift +4/-4 is now believed to align more closely with models of Tn5 sequence bias.22 As such, PRINT assumes that input fragments were shifted by +4/-5, and internally converts them to +4/-4.
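To make the filter and the coordinate shift concrete, the awk command from step 3 can be applied to a few hand-written BEDPE rows. The rows, read names and sample name below are invented for illustration; only the first row satisfies all four conditions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same awk program as in step 3.
awkcommand='{
  prefix = ($1 ~ /^chr/) ? "" : "chr";
  if ($1 ~ /^chr[0-9]+$|^[0-9]+$/ && ($1==$4) && ($6-5>$2+4) &&
      ($9=="+" || ($2==$5 && $3==$6))) {
    print prefix $1, $2+4, $6-5, BARCODE
  }
}'

# Toy BEDPE rows: chrom1 start1 end1 chrom2 start2 end2 name score strand1 strand2
#   readA: passes all filters ("12" gains the "chr" prefix)
#   readB: chrX is not an autosome
#   readC: mates map to different chromosomes
#   readD: shifted end (245) is not after shifted start (304)
fragments=$(
  printf '%s\n' \
    '12 1000 1050 12 1150 1200 readA 60 + -' \
    'chrX 500 550 chrX 600 650 readB 60 + -' \
    'chr3 2000 2050 chr5 2100 2150 readC 60 + -' \
    'chr7 300 350 chr7 200 250 readD 60 - +' |
  tr ' ' '\t' |
  awk -v OFS="\t" -v BARCODE="sample1" "$awkcommand"
)
echo "$fragments"
```

The single surviving fragment is reported as chr12, start 1004 (1000 + 4), end 1195 (1200 - 5), with the sample name in the barcode column.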
-
c. sort | uniq -c | awk.
This step counts duplicates, and moves the count column to the right.
Note: Two independent fragments are very unlikely to share the exact same coordinates, so identical fragments more likely correspond to PCR duplicates. This step collapses those duplicates into a single row. The count column is ignored by PRINT, but may be saved for QC.
-
d. pigz -p $CORES.
A multi-threaded gz compression tool.
Optional: If you only have one CPU, you can use gzip instead.
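Once a fragment file exists, the fragment-count threshold from step 1c can be checked by counting its lines. The sketch below builds a three-line toy fragment file with invented coordinates and uses plain gzip; the variable names and the toy threshold are assumptions for illustration. Because duplicates were collapsed in step 3, the line count gives unique fragments, while the sum of the fifth (count) column gives the total including duplicates.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
frag_file="$workdir/sample1.tsv.gz"

# Toy fragment file: chrom, start, end, barcode, duplicate count.
printf 'chr1\t1004\t1195\tsample1\t1\nchr1\t2004\t2150\tsample1\t3\nchr2\t504\t730\tsample1\t1\n' |
  gzip > "$frag_file"

# Unique fragments = number of lines.
n_unique=$(gzip -dc "$frag_file" | wc -l | tr -d ' ')
# Fragments including duplicates = sum of the count column.
n_total=$(gzip -dc "$frag_file" | awk -F'\t' '{ total += $5 } END { print total }')

echo "unique fragments: $n_unique"
echo "fragments including duplicates: $n_total"

# Threshold check. The protocol's threshold is 25 million fragments;
# a toy value of 2 is used here so that the toy data passes.
min_fragments=2
if [ "$n_unique" -ge "$min_fragments" ]; then
  echo "QC: pass"
else
  echo "QC: fail"
fi
rm -r "$workdir"
```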
Preparation and filtering of genotype data
Timing: 4 h
Most genotype data is in standardized formats such as vcf, which are provided directly by genotyping services (e.g. Gencove, CD Genomics, GENEWIZ). This step will convert a vcf file into a simpler additive matrix which can easily be parsed by R.
-
4. Format genotypes into additive genotype matrix.
-
a. Load the vcf variant data into R.
> library(tidyverse)
> library(vcfR)
> library(GenomicRanges)
> vcf <- read.vcfR("/path/to/snps/genotypes.vcf.gz",
  verbose = FALSE)
> snp_info <- getFIX(vcf) %>% as.data.frame()
> colnames(snp_info) <- c("chrom", "pos", "snp_id", "ref_allele",
  "alt_allele", "qual", "filter")
> snp_info <- snp_info %>%
  mutate(snp_id_uniq = make.unique(snp_id),
    pos = as.integer(pos))
-
b. Reformat the genotype values to be additive (equal to the number of non-reference alleles).
> gt <- extract.gt(vcf)
> additive_code <- c("0|0"=0L, "0|1"=1L, "1|0"=1L, "1|1"=2L,
  "0/0"=0L, "0/1"=1L, "1/0"=1L, "1/1"=2L)
> genotype_matrix <- additive_code[gt] %>% matrix(nrow = nrow(gt))
> rownames(genotype_matrix) <- snp_info$snp_id_uniq
> colnames(genotype_matrix) <- colnames(gt)
-
5. Filter SNPs by minor allele frequency (MAF) and open chromatin.
-
a. Calculate the MAF.
> snp_info$genotype0_counts <- rowSums(genotype_matrix == 0)
> snp_info$genotype1_counts <- rowSums(genotype_matrix == 1)
> snp_info$genotype2_counts <- rowSums(genotype_matrix == 2)
> snp_info$maf <- (snp_info$genotype1_counts +
  2*snp_info$genotype2_counts) / (2*ncol(genotype_matrix))
> snp_info$maf <- pmin(snp_info$maf, 1-snp_info$maf)
-
b. Find overlaps with open chromatin regions (OCR).
CRITICAL: You must ensure that chromosome names are formatted consistently between the vcf (snp_info) and the peak file (ocr). For example, if the peak file labels chromosomes as “chr12” but the vcf labels it as “12”, you must convert one of the tables to match.
> ocr <- read.table("/path/to/peaks/ATAC.narrowPeak")
> ocr_range <- GRanges(seqnames = ocr$V1,
  ranges = IRanges(start = ocr$V2,
    end = ocr$V3))
> snp_range <- GRanges(seqnames = snp_info$chrom,
  ranges = IRanges(start = snp_info$pos,
    end = snp_info$pos))
> overlaps <- GenomicRanges::findOverlaps(snp_range,
  ocr_range, select = "arbitrary")
> overlaps <- ifelse(is.na(overlaps), 0, 1)
> snp_info$ocr <- overlaps
-
c. Filter the table of SNPs.
> snp_info <- snp_info %>%
  filter(filter == "PASS",
    ocr == 1,
    maf > 0.05,
    nchar(ref_allele) == 1,
    nchar(alt_allele) == 1 # SNPs only
  ) %>%
  select(-qual, -filter, -ocr)
-
d. Filter the genotype matrix.
> genotype_matrix <- genotype_matrix[snp_info$snp_id_uniq, ]
-
e. Write output files.
> snp_info %>% write.table("/path/to/snps/snp_info_filtered.txt",
  quote = FALSE, row.names = FALSE, sep = "\t")
> genotype_matrix %>%
  saveRDS("/path/to/snps/genotype_matrix_filtered.Rds")
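As a worked example of the MAF arithmetic in step 5a, the sketch below applies the same formula to made-up genotype counts. The function name `maf_from_counts` and the counts are assumptions for illustration; the formula itself (alternate allele frequency = (n1 + 2·n2) / (2·N), then the smaller of that value and its complement) is the one used in the R code of step 5a.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Arguments: number of samples with 0, 1 and 2 alternate alleles.
maf_from_counts() {
  awk -v n0="$1" -v n1="$2" -v n2="$3" 'BEGIN {
    n_samples = n0 + n1 + n2
    alt_freq = (n1 + 2 * n2) / (2 * n_samples)    # alternate allele frequency
    maf = (alt_freq < (1 - alt_freq)) ? alt_freq : (1 - alt_freq)
    printf "%.2f\n", maf
  }'
}

# 100 samples: 60 with genotype 0, 30 with genotype 1, 10 with genotype 2.
# Alternate allele frequency = (30 + 20) / 200 = 0.25, which is already the minor allele.
maf_common=$(maf_from_counts 60 30 10)

# 100 samples where the alternate allele is the major allele:
# alternate allele frequency = (10 + 170) / 200 = 0.90, so MAF = 1 - 0.90.
maf_flipped=$(maf_from_counts 5 10 85)

echo "MAF (alternate allele is minor): $maf_common"
echo "MAF (alternate allele is major): $maf_flipped"
```

Both toy SNPs would pass the MAF > 0.05 filter applied in step 5c.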
Calculation of TF binding scores at SNPs using PRINT
Timing: 3 days per sample
In this step, we will use PRINT to calculate a TF binding score (TFBS) at every SNP in every sample.
All R code in this step will first require loading PRINT.
> PRINTdir <- "/path/to/src/PRINT/"
> source(paste0(PRINTdir, "code/utils.R"))
> source(paste0(PRINTdir, "code/getCounts.R"))
> source(paste0(PRINTdir, "code/getBias.R"))
> source(paste0(PRINTdir, "code/getFootprints.R"))
> source(paste0(PRINTdir, "code/getSubstructures.R"))
> source(paste0(PRINTdir, "code/visualization.R"))
> source(paste0(PRINTdir, "code/getGroupData.R"))
> source(paste0(PRINTdir, "code/footprintTracking.R"))
> source(paste0(PRINTdir, "code/getTFBS.R"))
-
6.
Get precomputed Tn5 bias profile in SNP regions.
CRITICAL: Remember to change refGenome if you are using a genome other than hg38.
> CORES <- 32
> project <- footprintingProject(projectName = "dummy",
refGenome = "hg38")
> mainDir(project) <- PRINTdir
> snp_info <- read.delim("/path/to/snps/snp_info_filtered.txt")
> w <- 100
> regions <- GRanges(seqnames = snp_info$chrom,
ranges = IRanges(start = snp_info$pos-w,
end = snp_info$pos+w))
> regionRanges(project) <- regions
> project <- getPrecomputedBias(project, nCores = CORES)
> saveRDS(regionBias(project), "/path/to/snps/snp_region_bias.Rds")
Note: By default, PRINT uses the flanking regions 100 bp on either side of the footprint region to calculate the background insertion distribution. As such, the regions created here are 201 bp wide, centered on the SNP.
-
7. Run PRINT to calculate TFBS.
Note: For simplicity, this step describes the protocol for processing a single ATAC-seq sample. You will need to loop over all of your samples in sequence, or process them in parallel to save time. The timing of this step assumes that one sample is run on a single CPU. As this step is extremely computationally intensive, we recommend running multiple samples in parallel on multiple CPUs, for example by submitting parallel jobs on an HPC cluster. We recommend allocating 32 GB of memory per sample.
-
a. Prepare and configure the project.
> CORES <- 2
> CHUNK_SIZE <- 1e5
> SAMPLE_NAME <- "sample1"
> project <- footprintingProject(projectName = SAMPLE_NAME,
  refGenome = "hg38")
> mainDir(project) <- PRINTdir
> dataDir(project) <- paste0("/path/to/PRINT_output/",
  SAMPLE_NAME, "/")
> dir.create(dataDir(project), showWarnings = FALSE)
> barcodeGroups <- data.frame(barcode = SAMPLE_NAME,
  group = 1L)
> barcodeGrouping(project) <- barcodeGroups
> groups(project) <- mixedsort(unique(barcodeGroups$group))
> groupCellType(project) <- "your_cell_type"
Note: You should change the value of CORES to reflect the number of available CPUs.
-
b. Load SNP regions and bias.
> snp_info <- read.delim("/path/to/snps/snp_info_filtered.txt")
> w <- 100
> regions <- GRanges(seqnames = snp_info$chrom,
  ranges = IRanges(start = snp_info$pos-w,
    end = snp_info$pos+w))
> regionRanges(project) <- regions
> regionBias(project) <- readRDS("/path/to/snps/snp_region_bias.Rds")
-
c. Load fragments.
> pathToFrags <- paste0("/path/to/frags/", SAMPLE_NAME, ".tsv.gz")
> project <- getCountTensor(project,
  pathToFrags,
  barcodeGroups,
  returnCombined = FALSE,
  chunkSize = CHUNK_SIZE,
  nCores = CORES)
-
d. Load models.
> for(kernelSize in 2:100) {
  dispModel(project, as.character(kernelSize)) <-
    readRDS(sprintf(
      "%s/data/shared/dispModel/dispersionModel%dbp.rds",
      PRINTdir, kernelSize))
}
> TFBS_model_path <- paste0(PRINTdir, "data/shared/TFBS_model.h5")
> TFBindingModel(project) <- loadTFBSModel(TFBS_model_path)
-
e. Calculate TFBS.
> project <- getTFBS(project,
  tileSize = 1,
  innerChunkSize = 100,
  chunkSize = CHUNK_SIZE,
  nCores = CORES)
-
f. Combine results across chunks.
> TFBSDir <- paste0(dataDir(project), "chunkedTFBSResults/")
> nChunks <- length(list.files(TFBSDir))
> TFBS <- c()
> for (i in 1:nChunks) {
  cat("\tChunk ", i, "\n")
  TFBSChunkData <- readRDS(sprintf("%s/chunk_%d.rds",
    TFBSDir, i))
  TFBS <- c(TFBS, sapply(TFBSChunkData,
    function(x) {x$TFBSScores}))
}
> saveRDS(
  data.frame(snp_id_uniq = snp_info$snp_id_uniq, TFBS = TFBS),
  paste0(dataDir(project), "TFBS.Rds")
)
Optional: Instead of using the original PRINT R package, you may also calculate TFBS using the new scPrinter Python library (github.com/buenrostrolab/scPrinter). We have uploaded a script which implements this same step using scPrinter to the following repository: github.com/maxdudek/fpQTL_protocol. scPrinter will download all required data files (TFBS model, precomputed bias) automatically, and will also handle getting the precomputed bias. Note that the Python version will write the TFBS results in a different format than the R version.
CRITICAL: If you are using the Python scPrinter package to calculate TFBS, your snp_info table must not contain any SNPs which have the same genomic coordinates (chrom:pos) as another SNP. Such SNPs must first be filtered out, or you will get a “name already exists” ValueError from h5py. This is not necessary if you are using the PRINT R package.
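The duplicate-coordinate filter required for scPrinter can be sketched with awk on the tab-delimited SNP table. This sketch assumes the column order written in step 5e (chrom in column 1, pos in column 2) and removes every SNP whose chrom:pos occurs more than once; the toy rows and the output file name `snp_info_unique.txt` are invented for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
snp_table="$workdir/snp_info_filtered.txt"

# Toy SNP table: rs1 and rs2 share the coordinates chr1:100.
printf 'chrom\tpos\tsnp_id\n'   >  "$snp_table"
printf 'chr1\t100\trs1\n'       >> "$snp_table"
printf 'chr1\t100\trs2\n'       >> "$snp_table"
printf 'chr1\t250\trs3\n'       >> "$snp_table"
printf 'chr2\t100\trs4\n'       >> "$snp_table"

# The file is read twice: the first pass counts each chrom:pos,
# the second pass keeps the header and rows whose coordinates occur once.
awk -F'\t' '
  NR == FNR { if (FNR > 1) seen[$1 ":" $2]++; next }
  FNR == 1 || seen[$1 ":" $2] == 1
' "$snp_table" "$snp_table" > "$workdir/snp_info_unique.txt"

kept_ids=$(tail -n +2 "$workdir/snp_info_unique.txt" | cut -f3 | paste -sd, -)
echo "kept SNPs: $kept_ids"
rm -r "$workdir"
```

If SNPs are removed this way, the genotype matrix would need to be subset to the same SNPs so that its rows still match the SNP table in the regression step.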
Running of per-SNP regressions to discover fpQTLs
Timing: 2 h
For every SNP, we will run a beta regression23,24 to test if the SNP’s genotype is associated with its TFBS across samples.
-
8.
Consolidate PRINT results across samples into TFBS matrix.
The previous step output one TFBS file per sample; now we need to merge TFBS across samples to run regressions.
> library(tidyverse)
> library(pbapply)
> snp_info <- read.delim("/path/to/snps/snp_info_filtered.txt")
> all_sample_names <- paste0("sample", 1:100) # ex. 100 samples
> TFBS_filenames <- paste0("/path/to/PRINT_output/",
all_sample_names, "/TFBS.Rds")
> TFBS_list <- pblapply(TFBS_filenames, readRDS)
> TFBS_matrix <- TFBS_list %>%
pblapply(function(x) {x$TFBS}) %>%
do.call(cbind, .)
> colnames(TFBS_matrix) <- all_sample_names
> rownames(TFBS_matrix) <- snp_info$snp_id_uniq
> TFBS_matrix %>% saveRDS("/path/to/PRINT_output/TFBS_matrix.Rds")
Optional: If you used the Python scPrinter package to calculate TFBS, then the code to consolidate TFBS output across samples is different. The script to implement this step for scPrinter can be found in our repository (github.com/maxdudek/fpQTL_protocol).
-
9.
Quantile normalize TFBS matrix.
Note: Quantile normalization transforms the distribution of TFBS in each sample to be equal to the average observed empirical distribution across all samples. This transformation is commonly performed in QTL discovery, such as expression QTLs.25 We also add small offsets for 0 or 1 values so that the TFBS is strictly between 0 and 1.
> library(preprocessCore)
> TFBS_matrix <- readRDS("/path/to/PRINT_output/TFBS_matrix.Rds")
> TFBS_matrix_normalized <- normalize.quantiles(TFBS_matrix)
> rownames(TFBS_matrix_normalized) <- rownames(TFBS_matrix)
> colnames(TFBS_matrix_normalized) <- colnames(TFBS_matrix)
> TFBS_matrix_normalized[TFBS_matrix_normalized == 0] <-
.Machine$double.xmin
> TFBS_matrix_normalized[TFBS_matrix_normalized == 1] <-
1 - .Machine$double.eps/2
> saveRDS(TFBS_matrix_normalized,
"/path/to/PRINT_output/TFBS_matrix_normalized.Rds")
-
10. Run a beta regression for each SNP to test association with TFBS.
-
a. First, load libraries and data.
> library(tidyverse)
> library(parallel)
> library(qvalue)
> library(betareg)
> TFBS_matrix <-
  readRDS("/path/to/PRINT_output/TFBS_matrix_normalized.Rds")
> genotype_matrix <-
  readRDS("/path/to/snps/genotype_matrix_filtered.Rds")
> snp_info <- read.delim("/path/to/snps/snp_info_filtered.txt")
> regression_covariates <-
  read.delim("/path/to/sample_data/covariates.txt")
# Confirm that row/column order is consistent
> all(colnames(TFBS_matrix) == regression_covariates$sample_name)
> all(colnames(TFBS_matrix) == colnames(genotype_matrix))
> all(rownames(TFBS_matrix) == rownames(genotype_matrix))
> all(rownames(TFBS_matrix) == snp_info$snp_id_uniq)
Note: You may want to include sample covariates (e.g. sex, batch, genotype PCs) in your regression. The code in this step assumes that the table regression_covariates has one row for each sample in the TFBS and genotype matrices, in the same order, and one column for each covariate.
-
b. Define a helper function to run a single regression.
> run_regression_on_variant <- function(i) {
  x <- genotype_matrix[i,]
  y <- TFBS_matrix[i,]
  model.formula <- "y ~ x"
  covariates.formula <- " + regression_covariates$sex +
    regression_covariates$batch +
    regression_covariates$PC1 +
    regression_covariates$PC2 +
    regression_covariates$PC3"
  model.formula <- paste0(model.formula, covariates.formula)
  model <- betareg(as.formula(model.formula), link="cloglog")
  coef <- summary(model)$coefficients$mean
  beta <- coef[2,1]
  pval <- coef[2,4]
  r_squared <- summary(model)$pseudo.r.squared
  return(c(beta = beta, r_squared = r_squared, pval = pval))
}
-
c. Run all regressions across SNPs in parallel.
> CORES <- 16
> cl <- makeCluster(CORES,
  outfile="cluster_out_%.0f.txt" %>%
    sprintf(as.numeric(Sys.time())))
> f = file(); sink(file=f) # Silence output
> clusterEvalQ(cl, { suppressPackageStartupMessages({
  library(tidyverse); library(betareg)}) })
> sink(); close(f)
> clusterExport(cl=cl, varlist=c("genotype_matrix", "TFBS_matrix",
  "regression_covariates"))
> regression_result <- parSapply(cl, 1:nrow(TFBS_matrix),
  run_regression_on_variant)
> stopCluster(cl)
Optional: If you only have one core available, you can alternatively run the regressions in sequence:
> regression_result <- sapply(1:nrow(TFBS_matrix),
  run_regression_on_variant)
-
d. Consolidate results into a table.
> TFBS_regression <- data.frame(
  beta = regression_result["beta", ],
  r_squared = regression_result["r_squared", ],
  pval = regression_result["pval", ]
) %>%
  cbind(snp_info)
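-
e. Calculate a Q-value for each SNP from the regression P-values (the qvalue package is loaded in step 10a), store it in a qval column of TFBS_regression, and write the table to /path/to/regression/regression_results.txt, the file read in step 11.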
-
11.
Output final list of fpQTL SNPs (at a false discovery rate of 5%).
> TFBS_regression <-
read.delim("/path/to/regression/regression_results.txt")
> fpQTLs <- TFBS_regression[TFBS_regression$qval < 0.05,]
> write.table(fpQTLs, "/path/to/regression/fpQTLs_FDR5.txt",
quote = FALSE, row.names = FALSE, sep = "\t")
Expected outcomes
In the first major step, every sequenced ATAC-seq sample is processed into a text file containing one row per ATAC-seq fragment. Each fragment is listed with the following fields: chromosome, the positions of the two Tn5 insertions which generated it, the name of the sample (which PRINT interprets as the cell “barcode” and uses it to group fragments), and the count of fragment duplicates (Figure 1). Additionally, a peak file should be generated for open chromatin regions, which contains fields for chromosome, start, and end.
Figure 1.
Extraction of fragments from ATAC-seq reads
After the ATAC-seq libraries are sequenced, paired reads are aligned to the reference genome and filtered, to identify fragments resulting from pairs of Tn5 insertions. Elements of this figure adapted from Dudek et al.1
In step two, genotype data in a VCF format is transformed into a P × N matrix, where P is the number of SNPs, and N is the number of samples (Figure 2). This step also produces a table of SNP information, whose fields include position, a unique identifier, the reference and alternate alleles, and the minor allele frequency.
Figure 2.
Preparation and filtering of genotype data
Genotype data is transformed from the variant call format into an additive matrix, and SNPs are filtered so that they are common and lie within an open chromatin region.
In step three, the fragment files are fed into PRINT, which calculates a single TFBS at each SNP in each sample. PRINT writes its output data for each sample in chunks, which are stored as .rds (serialized R data) files. The code in step 7f merges these chunks into a single TFBS.Rds file for each sample. Then, at the start of the final major step (step 8), these files are consolidated further into a single P × N matrix of TFBS (Figure 3).
Figure 3.
Calculation of TFBS
PRINT is run on all samples to calculate a TFBS at every SNP in every sample. PRINT splits its output into multiple chunks for each sample, but these results are consolidated into a single TFBS matrix.
In the final step, a beta regression is run for each SNP. The output of this step is a table listing the parameters of the regressions, one row for each SNP. This table includes the same fields as the SNP information table, as well as the regression beta (the slope/coefficient relating genotype to TFBS), the pseudo-R-squared value from beta regression, the P-value of association (under the null hypothesis that beta = 0), and the Q-value which is used to control false discovery rate (Figure 4).
Figure 4.
Testing for association between TFBS and genotype
For each SNP, a regression is run across samples to test for association between TFBS and genotype. Sample covariates are also included in the regression. The final output of fpQTL discovery is a P-value of association for every SNP. Elements of this figure adapted from Dudek et al.1
Limitations
The accuracy of TF binding prediction is limited by the number of Tn5 insertions available to infer the presence of a footprint. Thus, TFBS calculations are unreliable for low-coverage ATAC-seq samples (<10 million fragments). Furthermore, using ATAC-seq data from bulk tissue samples may mask footprint associations that only occur in a specific cell type. However, this protocol can easily be modified to accept single-cell ATAC-seq data by generating cell-type-specific fragment files; PRINT can also be configured to use barcodes to group fragments by cell type. In addition, this protocol can only predict the effects of SNPs, not other variants such as indels, because indels complicate the positioning of insertions across samples when the TFBS is calculated.
This protocol also does not consider the effect that each SNP could have on the Tn5 sequence bias, which weakly influences the positions where Tn5 is inserted.4 The R version of PRINT corrects for this Tn5 sequence bias, but relies on the reference genome, and so the bias of the alternative allele is not considered. The new Python library scPrinter supports the creation of custom bias profiles using donor-specific genomes, so future expansions of this protocol may be able to account for SNP bias effects. Finally, while fpQTL discovery can infer that a SNP changes the strength of TF binding, it does not actually identify the affected TF. However, the identity of the TF may be inferred by considering the sequence motifs which underlie the fpQTL, and the direction in which they are disrupted by the alternate allele.
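The coverage guideline above (fewer than 10 million fragments) can be checked before running PRINT. The sketch below counts the rows of each gzipped fragment file and flags files under a chosen threshold. It assumes that each row is one fragment, so the duplicate count column is not summed. The function name coverage_report is hypothetical.

```shell
# coverage_report MIN_FRAGMENTS FILE...
# Print "file <TAB> fragment rows <TAB> OK|LOW" for each gzipped fragment file.
# Assumption: one row per fragment; the duplicate count column is ignored.
coverage_report() {
  local min="$1"; shift
  local f n status
  for f in "$@"; do
    n=$(gzip -dc "$f" | wc -l | tr -d ' ')         # number of fragment rows
    if [ "$n" -lt "$min" ]; then status="LOW"; else status="OK"; fi
    printf '%s\t%s\t%s\n' "$f" "$n" "$status"
  done
}
```

For example, `coverage_report 10000000 /path/to/frags/*.tsv.gz` lists every sample and marks those below 10 million fragments as LOW.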
Troubleshooting
Problem 1
Related to extraction of ATAC-seq fragment coordinates for each sample: Error loading libraries when running samtools or bedtools:
> error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory
Potential solution
Ensure that bzip2 is available in your environment.
$ which bzip2
If you work in an HPC environment, you may need to load bzip2 as a module; if modules for ncurses and zlib are available, load those as well. Otherwise, you may need to download and build bzip2 from source: https://www.sourceware.org/bzip2/.
Problem 2
Related to calculation of TF binding scores at SNPs using PRINT: Issues installing and running keras, related to a local Python environment. The R packages keras and tensorflow require the corresponding Python modules to be installed in a location that is accessible to reticulate. Sometimes, conflicts can emerge with local installations of Python.
Potential solution
Using R, install a virtual Python environment, inside which the packages can be installed.
> library(reticulate)
> virtualenv_create("r-reticulate",
python = install_python(version = "3.12"))
> library(keras)
> install_keras(version = "2.16",
envname = "r-reticulate")
> library(tensorflow)
> tf$constant("Hello TensorFlow!")
Problem 3
Related to calculation of TF binding scores at SNPs using PRINT: Multiple possible errors relating to the loading of R-Python libraries by PRINT, for example:
> Importing the numpy C-extensions failed
Potential solution
Add the following line to your R code before running PRINT functions.
> Sys.setenv(PYTHONPATH="")
Sometimes, R will try to access the wrong Python installation because this environment variable gets populated with other paths, such as when loading modules on an HPC. Resetting the variable should get PRINT to look for the right Python installation.
Problem 4
Related to multiple steps, but especially calculation of TF binding scores at SNPs using PRINT: “Out of memory” errors.
Running PRINT can be very memory intensive, and we recommend allocating 32 GB of RAM while running TFBS calculation. For the other steps, 16 GB should be sufficient. However, memory usage can differ based on sample size, ATAC-seq read depth, and the number of called SNPs, and so it is possible that PRINT or other steps of the protocol will run out of memory and terminate.
Potential solution
First, if you have more memory available to allocate, such as on an HPC, you can simply increase the allocated memory for a given step and run it again. We recommend doubling the available memory after an “out of memory” error.
If memory is more constrained, or the memory use by PRINT is too demanding, another option is to separate your fragment files by chromosome and run them through PRINT separately. For example, you can extract just the fragments from “chr21” by running:
$ SAMPLE_NAME="sample1"
$ FRAG_FILE="/path/to/frags/$SAMPLE_NAME.tsv.gz"
$ OUTFILE="/path/to/frags/$SAMPLE_NAME.chr21.tsv.gz"
$ zcat $FRAG_FILE | grep "^chr21" | gzip > $OUTFILE
Note that this will require slightly reconfiguring the scripts to run PRINT and consolidate PRINT files, as they will now require the chromosome to be specified.
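The per-chromosome extraction can be extended to several chromosomes with a loop. The sketch below is one possible approach, not part of the protocol's scripts. It compares the first column to the chromosome name exactly, because a prefix pattern such as "^chr21" would also match any sequence name that merely begins with chr21, and "^chr2" would also match chr20 to chr22. The function name split_by_chrom and the output naming scheme are assumptions.

```shell
# split_by_chrom FRAG_FILE OUT_PREFIX CHROM...
# Write one gzipped fragment file per requested chromosome,
# named OUT_PREFIX.CHROM.tsv.gz (hypothetical naming scheme).
split_by_chrom() {
  local frag_file="$1" out_prefix="$2"; shift 2
  local chrom
  for chrom in "$@"; do
    # Exact string match on column 1 (the appended "" forces string comparison)
    gzip -dc "$frag_file" \
      | awk -F'\t' -v c="$chrom" '($1 "") == (c "")' \
      | gzip > "${out_prefix}.${chrom}.tsv.gz"
  done
}
```

For example, `split_by_chrom /path/to/frags/sample1.tsv.gz /path/to/frags/sample1 chr20 chr21 chr22` writes three files. This rereads the input once per chromosome, which is simple but slow for many chromosomes; a single-pass split is possible if speed matters.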
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Struan F.A. Grant (grants@chop.edu).
Technical contact
Technical questions on executing this protocol should be directed to and will be answered by the technical contact, Max F. Dudek (maxdudek@upenn.edu).
Materials availability
This study did not generate new unique materials.
Data and code availability
All original code in this study has been deposited at: github.com/maxdudek/fpQTL_protocol. Demo data files to test this code have been deposited to Zenodo: https://doi.org/10.5281/zenodo.13686851 (STAR_protocol_example_data.tar.gz).
Acknowledgments
We thank all members of the Grant and Almasy labs for their feedback on this project. M.F.D. is supported by the National Science Foundation Graduate Research Fellowship Program (NSF GRFP). L.A. is funded by NIAAA U10 AA008401. B.F.V. gratefully acknowledges support from the NIH/NIDDK (UM1 DK126194 and U24 DK138512). S.F.A.G. is funded by UM1 DK126194, R01 HD056465, and the Daniel B. Burke Endowed Chair for Diabetes Research.
Author contributions
M.F.D. developed the method and wrote the manuscript. B.M.W. assisted in developing the method. B.F.V., L.A., and S.F.A.G. supervised and reviewed the manuscript.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Max F. Dudek, Email: maxdudek@gmail.com.
Struan F.A. Grant, Email: grants@chop.edu.
References
1. Dudek M.F., Wenz B.M., Brown C.D., Voight B.F., Almasy L., Grant S.F.A. Characterization of non-coding variants associated with transcription-factor binding through ATAC-seq-defined footprint QTLs in liver. Am. J. Hum. Genet. 2025;112:1302–1315. doi: 10.1016/j.ajhg.2025.03.019.
2. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J., et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
3. Sakaue S., Weinand K., Isaac S., Dey K.K., Jagadeesh K., Kanai M., Watts G.F.M., Zhu Z., Accelerating Medicines Partnership® RA/SLE Program and Network, Brenner M.B. Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles. Nat. Genet. 2024;56:615–626. doi: 10.1038/s41588-024-01682-1.
4. Buenrostro J.D., Giresi P.G., Zaba L.C., Chang H.Y., Greenleaf W.J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods. 2013;10:1213–1218. doi: 10.1038/nmeth.2688.
5. Yan F., Powell D.R., Curtis D.J., Wong N.C. From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biol. 2020;21:22. doi: 10.1186/s13059-020-1929-3.
6. Furey T.S. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat. Rev. Genet. 2012;13:840–852. doi: 10.1038/nrg3306.
7. Li Z., Schulz M.H., Look T., Begemann M., Zenke M., Costa I.G. Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 2019;20:45. doi: 10.1186/s13059-019-1642-2.
8. Ouyang N., Boyle A.P. TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence. Genome Res. 2020;30:1040–1046. doi: 10.1101/gr.258228.119.
9. Bentsen M., Goymann P., Schultheis H., Klee K., Petrova A., Wiegandt R., Fust A., Preussner J., Kuenne C., Braun T., et al. ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat. Commun. 2020;11:4267. doi: 10.1038/s41467-020-18035-1.
10. Yang T., Henao R. TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile. PLoS Comput. Biol. 2022;18. doi: 10.1371/journal.pcbi.1009921.
11. Aguet F., Alasoo K., Li Y.I., Battle A., Im H.K., Montgomery S.B., Lappalainen T. Molecular quantitative trait loci. Nat. Rev. Methods Primers. 2023;3:4. doi: 10.1038/s43586-022-00188-6.
12. Hu Y., Horlbeck M.A., Zhang R., Ma S., Shrestha R., Kartha V.K., Duarte F.M., Hock C., Savage R.E., Labade A., et al. Multiscale footprints reveal the organization of cis-regulatory elements. Nature. 2025;638:779–786. doi: 10.1038/s41586-024-08443-4.
13. Ouyang N., Boyle A.P. Quantitative assessment of association between noncoding variants and transcription factor binding. Preprint at bioRxiv. 2022. doi: 10.1101/2022.11.22.517559.
14. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
15. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
16. Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the Tidyverse. J. Open Source Softw. 2019;4:1686. doi: 10.21105/joss.01686.
17. Knaus B.J., Grünwald N.J. vcfr: a package to manipulate and visualize variant call format data in R. Mol. Ecol. Resour. 2017;17:44–53. doi: 10.1111/1755-0998.12549.
18. Lawrence M., Huber W., Pagès H., Aboyoun P., Carlson M., Gentleman R., Morgan M.T., Carey V.J. Software for Computing and Annotating Genomic Ranges. PLoS Comput. Biol. 2013;9. doi: 10.1371/journal.pcbi.1003118.
19. Cribari-Neto F., Zeileis A. Beta Regression in R. J. Stat. Softw. 2010;34:1–24. https://www.jstatsoft.org/article/view/v034i02
20. Bolstad B. preprocessCore: A Collection of Pre-processing Functions. Bioconductor. http://bioconductor.org/packages/preprocessCore/
21. Storey J.D., Bass A.J., Dabney A., Robinson D. Qvalue: Q-Value Estimation for False Discovery Rate Control. Bioconductor; 2023. http://bioconductor.org/packages/qvalue/
22. Wolpe J.B., Martins A.L., Guertin M.J. Correction of transposase sequence bias in ATAC-seq data with rule ensemble modeling. NAR Genom. Bioinform. 2023;5. doi: 10.1093/nargab/lqad054.
23. Zeileis A., Cribari-Neto F., Grün B., Kosmidis I. betareg: Beta Regression. Version 3.2-1. 2024.
24. Ferrari S., Cribari-Neto F. Beta Regression for Modelling Rates and Proportions. J. Appl. Stat. 2004;31:7. doi: 10.1080/0266476042000214501.
25. Laboratory DA, Fund NC, Site—NDRI BCS, Site—RPCI BCS, Resource—VARI BC, of Miami BBRU, Bank BE, Management LBP, Study ELSI, Battle A., Brown C.D. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277.
26. Storey J.D., Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.

Timing: 30–60 min


