lentiMPRA and MPRAflow for high-throughput functional characterization of gene regulatory elements

M Grace Gordon; Fumitaka Inoue; Beth Martin; Max Schubach; Vikram Agarwal; Sean Whalen; Shiyun Feng; Jingjing Zhao; Tal Ashuach; Ryan Ziffra; Anat Kreimer; Ilias Georgakopoulous-Soares; Nir Yosef; Chun Jimmie Ye; Katherine S Pollard; Jay Shendure; Martin Kircher; Nadav Ahituv

doi:10.1038/s41596-020-0333-5

Author manuscript; available in PMC: 2021 Feb 1.

Published in final edited form as: Nat Protoc. 2020 Jul 8;15(8):2387–2412. doi: 10.1038/s41596-020-0333-5

PMCID: PMC7550205 NIHMSID: NIHMS1632591 PMID: 32641802

Abstract

Massively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode. This technology has been used to test thousands of sequences and their variants for regulatory activity, to decipher the regulatory code and its evolution, and to develop genetic switches. Lentivirus-based MPRA (lentiMPRA) produces ‘in-genome’ readouts and enables the use of this technique in hard-to-transfect cells. Here, we provide a detailed protocol for lentiMPRA, along with a user-friendly Nextflow-based computational pipeline—MPRAflow—for quantifying CRS activity from different MPRA designs. The lentiMPRA protocol takes ~2 months, which includes sequencing turnaround time and data processing with MPRAflow.

Introduction

Gene regulatory elements control a gene’s transcription. These include sequences that activate transcription, such as promoters and enhancers; silencers that repress a gene; and insulators that restrict genes from interacting with certain regulatory elements. Nucleotide variation in these elements can have a major effect on phenotype. Mutations within them have been shown to be a major cause of human disease¹. For example, >90% of all human disease genome-wide association studies (GWASs) have shown associations with noncoding variants², which colocalize with potential gene regulatory elements³. In addition, gene regulatory elements can be major drivers of evolutionary speciation, driving differences between species such as morphology, diet, and behavior⁴. These sequences can also be used as genetic switches to tune transgenes to specific levels in certain cell types or tissues.

In this protocol, we focus on gene activation associated regulatory elements, promoters and enhancers. These sequences can be identified in a genome-wide manner by biochemical methods such as chromatin immunoprecipitation followed by sequencing (ChIP-seq⁵), DNase I hypersensitive sites sequencing (DNase-seq^6,7), assay for transposase-accessible chromatin using sequencing (ATAC-seq⁸), cleavage under targets and release using nuclease (CUT&RUN⁹), Hi-C¹⁰ and others. However, these methods only help annotate CRSs, and additional experimental assays must be performed in order to validate their predicted activity. Reporter assays are commonly used to characterize CRSs. In this assay, the CRS is placed either upstream of a reporter gene (i.e., in the case of testing promoters) or upstream of a minimal promoter followed by a reporter gene (i.e., in the case of testing enhancers). If the sequence is an activating regulatory element, it will turn on the reporter gene, providing a measurable output. However, these assays are primarily done on an individual basis and as such cannot assess the thousands of CRSs and their variants that have been identified via the aforementioned biochemical assays. Massively parallel reporter assays overcome this hurdle, providing the ability to test hundreds of thousands of sequences and their variants in parallel for their regulatory function¹¹. This is done either by measuring RNA expression driven by the CRS by pairing it to a transcribed barcode, or by using the CRS itself as a barcode, as is done in the self-transcribing active regulatory region sequencing (STARR-seq) assay¹².

Here, we describe both a lentivirus-based MPRA (lentiMPRA) and MPRAflow, a computational tool for MPRA analysis that is based on the Nextflow framework¹³ (Fig. 1a). lentiMPRA can be used in any cell type that can be efficiently infected via lentivirus, providing the ability to carry out MPRA in a broad range of cell types and tissues. In addition, owing to the viruses’ inherent genomic integration, it provides an ‘in-genome’ readout, which we have shown provides more robust results that can be better predicted by both biochemical and sequence-based features as compared with episomal-based MPRA¹⁴. MPRAflow is a user-friendly computational pipeline that is compatible with a broad range of MPRA experiments.

Fig. 1 | a, Summary of lentiMPRA and MPRAflow. The lentiMPRA library is sequenced to associate CRSs with barcodes and is used to infect cells in three replicates. DNA and RNA from the cells are sequenced to determine barcode transcription and CRS activity. b, CRS oligonucleotide. A 200-base CRS (gray) is flanked by PCR adaptor sequences (light green). c, First-round PCR. PCR primers add sequences that are complementary to the vector (black) to the upstream side, as well as minimal promoter (mP, blue) and spacer sequences (yellow) downstream of the CRS oligonucleotide. d, Second-round PCR. Reverse primer adds the barcodes (red-striped section) and GFP complementary sequences (green). e, Plasmid construct. f, Amplification for CRS-barcode association. Primers add P5 (purple) and sample index (gray-striped section) upstream and P7 (pink) downstream. g, Sequencing library structure. h, Sequencing reaction. Paired-end reads specify the CRS sequence, with index read 1 providing the barcode and index read 2 reading the sample index for multiplexing. i, Integrated DNA and expressed RNA in infected cells. j, Amplification for barcode counting. Primers add P5 and sample index upstream and P7 and UMI (brown stripe) downstream. k, Sequencing library structure. l, Sequencing reaction. Paired-end reads give barcode, index read 1 gives UMI, and index read 2 provides sample index for multiplexing. ARE, anti-repressor element; LTR, long terminal repeat; WPRE, Woodchuck hepatitis virus posttranscriptional regulatory element.

Development of the protocol

We developed lentiMPRA to overcome the following limitations: (i) descriptive assays that detect potential regulatory elements (such as ChIP-seq, DNase-seq, ATAC-seq, CUT&RUN and Hi-C) identify candidate sequences within chromatin, yet most MPRAs analyze sequences in an episomal context; (ii) episomal-based MPRA is limited to cells that can be easily transfected. Lentivirus-based assays overcome both these limitations. Lentiviruses integrate into the genome, providing an in-genome readout. In addition, they can infect a large number of cells and tissue types, providing a more diverse range of cellular environments for MPRA. In this protocol, we further develop lentiMPRA by placing a barcode in the 5′ UTR of the reporter gene. This 5′ UTR barcoding method uses a shorter distance between the CRS and the barcode (102 bp) than previous 3′ UTR barcoding methods (801 bp), reducing the risk of CRS–barcode swapping¹⁵. In addition, unlike previous lentiMPRA, in which each CRS is synthesized together with multiple barcodes in a custom array, the 5′ UTR barcoding strategy adds barcodes via the PCR primer. This makes it possible to clone and test hundreds of thousands of CRSs using lentiMPRA.

To subsequently analyze MPRA results, several home-brewed MPRA computational analysis pipelines exist that are tailored to a specific lab and MPRA technique. However, these tools are not transferable between labs because of the large variability in MPRA designs, lack of documentation, complicated input files and the lack of parameterization of these tools. We thus developed MPRAflow, which provides a user-friendly, flexible, parallelized tool for quantifying CRS activity from a variety of MPRA experimental designs, including lentiMPRA, episomal-based MPRA and saturation mutagenesis designs, with easily interpretable visualizations that can be readily adopted by users regardless of their computational level. In addition to providing normalized fold change per CRS, MPRAflow can generate input files for MPRAnalyze¹⁶, a tool that calculates a transcription rate for each tested CRS by fitting a generalized linear model with DNA and RNA counts. This pipeline enables the entire analysis to be completed with two commands on a terminal, greatly simplifying the computational tasks associated with MPRAs and therefore increasing the usability of this protocol.

Applications of the method

lentiMPRA can be used for numerous research purposes, such as analyzing hundreds of thousands of different candidate enhancers and their variants (e.g., rare and common GWAS-associated single-nucleotide polymorphisms (SNPs), evolutionary variants) in the genome, decoding the regulatory code, determining how it evolved in other species and generating specific genetic switches. It provides the ability to carry out these experiments in hard-to-transfect cells (e.g., primary cells, neurons, and many others) and integrates into the genome, providing an in-genome readout that we have shown is more reproducible and better predicted by biochemical and sequence-based features than episomal-based MPRA¹⁴.

MPRAflow uses the pipelining tool Nextflow¹³, which automatically runs MPRA processing code (written in Python, Bash, and R), manages all necessary packages and environments with Anaconda¹⁷, and is compatible with a multitude of computational architectures, including a variety of high-performance computing (HPC) clusters and cloud computing systems. In addition, technical replicates and experimental conditions are parallelized through these HPC systems. Because MPRAflow is a package that allows non-bioinformatic researchers to easily analyze MPRA data, it can greatly increase the usability of this method in labs that do not have in-house bioinformaticians. In addition, MPRAflow provides easily interpretable graphics and produces files correctly formatted for readily available tools for further in-depth bioinformatic analysis such as MPRAnalyze¹⁶.

Comparisons with other methods

Several different varieties of MPRA are available, such as episomal barcode-based MPRAs, STARR-seq, and others¹¹. lentiMPRA differs from these methods because it provides an in-genome readout in a wider range of cell types. In STARR-seq, the CRS itself acts as the barcode. This attribute can potentially impact results because of the binding of RNA-associated factors and the RNA stability of the assayed sequence¹⁵. Using on average more than fifty 15-bp barcodes per CRS in lentiMPRA reduces this impediment. CRSs are usually generated via oligonucleotide synthesis but can also be produced by other processes, such as PCR or DNA capture-based methods. Barcodes can be added either as part of the synthesis or via PCR, providing flexibility in cloning design. Because lentiviruses integrate throughout the genome, we introduced anti-repressors on either side of the virus that, together with having >50 barcodes per assayed sequence, assist in overcoming differences due to varying genomic integration sites.

Previous MPRA processing tools have mainly focused on CRS library design or determination of CRS activity from count matrices, overlooking the computationally expensive task of processing sequencing data. MPRAflow is based on computational methods used in our previous MPRA work^{14,15,18–20} and contains three utilities: association, count, and saturation mutagenesis. The association utility processes demultiplexed .fastq files and assigns barcodes to the CRSs that they were cloned with in the random pairing design. Sensitive alignment of merged paired-end reads provides robustness against sequencing and synthesis errors without strict read filters, even when CRS libraries contain sequences that differ by only one nucleotide. The count utility processes demultiplexed .fastq files to perform quality control (QC) across replicates, normalizes barcode count tables per CRS, and quantifies log₂(RNA/DNA) ratios per CRS. MPRAnalyze inputs can also be produced using the count utility. The saturation mutagenesis utility resolves multiple variants per CRS into single-variant effects by applying a multivariate linear model, and it can be combined with the count utility. Each utility is executed with a single command on a terminal, and all utilities provide easily interpretable visualizations of all analyses performed.
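The per-CRS log₂(RNA/DNA) quantification performed by the count utility can be illustrated with a minimal sketch. This is not MPRAflow's code: the function name `log2_rna_dna`, the counts-per-million scaling, and the decision to skip barcodes absent from either library are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def log2_rna_dna(dna_counts, rna_counts, barcode_to_crs, scale=1e6):
    """Return {crs: log2(RNA/DNA)} from per-barcode counts.

    Counts are scaled to counts per million within each library (an assumed
    normalization), summed over the barcodes of each CRS, and then compared.
    Barcodes missing from either library are skipped.
    """
    dna_total = sum(dna_counts.values())
    rna_total = sum(rna_counts.values())
    dna_per_crs = defaultdict(float)
    rna_per_crs = defaultdict(float)
    for barcode, crs in barcode_to_crs.items():
        dna = dna_counts.get(barcode, 0)
        rna = rna_counts.get(barcode, 0)
        if dna == 0 or rna == 0:
            continue
        dna_per_crs[crs] += dna / dna_total * scale
        rna_per_crs[crs] += rna / rna_total * scale
    return {crs: math.log2(rna_per_crs[crs] / dna_per_crs[crs])
            for crs in dna_per_crs}


# Toy data: two barcodes for crs1, one barcode for crs2.
dna = {"AAA": 10, "CCC": 10, "GGG": 20}
rna = {"AAA": 20, "CCC": 20, "GGG": 20}
pairs = {"AAA": "crs1", "CCC": "crs1", "GGG": "crs2"}
activity = log2_rna_dna(dna, rna, pairs)
```

In this toy example crs1 has a positive ratio and crs2 a negative one, because RNA counts are normalized to the RNA library depth before the comparison.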

Experimental design

Library design

CRSs can be identified using many of the aforementioned biochemical assays (ChIP-seq, DNase-seq, ATAC-seq, CUT&RUN, GWAS, Hi-C and others). Variants of interest within these CRSs can be identified via GWAS, GTEx (https://www.gtexportal.org/home/), various genomic websites such as Genome Aggregation Database (gnomAD²¹), comparative genomics and many other databases. The CRSs and variants tested ultimately depend on the goal of the study. Negative and positive controls should be included in the lentiMPRA library. For negative controls, sequences that could be used are those that are known not to be active in the assayed cell type, having silencing marks such as H3K27me3 within this tissue, or scrambled CRSs that are randomly selected from the library. For positive controls, sequences that are known to function as promoters/enhancers in this cell type or tissue could be used. If such data do not exist, one can characterize CRSs from the cell where the lentiMPRA will be done via the aforementioned biochemical assays. These controls should be present within every technical and biological condition that will be tested. Tools such as MPRAnator²² or MPRA Design Tools²³ can assist in choosing regions to test via MPRA and assembling the .fasta files required to order the libraries. Libraries can contain up to hundreds of thousands of sequences, depending on the infection efficiency of the cells (see Supplementary Table 1). The length of these sequences can also vary (as long as the combined length is not >10 kb, the optimal packaging capacity of lentivirus), depending on how the CRSs are generated (i.e., oligonucleotide synthesis, PCR, or capture). For more information on library design, see Box 1.

Box 1 |. Library design criteria.

Oligonucleotide synthesis length via Agilent is currently limited to 230 nucleotides (Fig. 1b, Extended Data Fig. 1a). Because the sequence should include 15-base common sequences (adaptors) on both sides of the CRS to amplify the library via PCR, the maximum length of the CRS is 200 bp. Twist Bioscience currently provides oligonucleotide synthesis service for oligonucleotides up to 300 bp and could increase CRS length. Three pairs of adaptor sequences are shown below. If two or three independent libraries are synthesized on the same array, use the second and third options. Asterisks represent the 200-base CRS sequence.
- 5′-AGGACCGGATCAACT**200base_CRS**CATTGCGTGAACCGA-3′
- 5′-AATGCTAGCGCATGG**200base_CRS**CTGCAACCTACGGAA-3′
- 5′-TTACGAGCCGTAGTC**200base_CRS**GCATCTCAACGTGGT-3′
The number of oligonucleotides that can be synthesized by Agilent is currently limited to 244,000 (Twist Bioscience’s current limit is 696,000). The number of CRSs that can be tested with lentiMPRA is also limited by the infectability of the cells that are used (i.e., MOI is a limitation) and the number of cells available for the experiment (e.g., primary cells, hypo-proliferative cells), as explained in Box 2. For example, we have analyzed 164,000 CRSs with 50 barcodes per CRS using 15 million HepG2 cells at an MOI of 50. This can be simulated using Supplementary Table 1.
We recommend avoiding the use of homopolymers (>8 bases) in the design because they may cause a higher synthesis error rate.
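
The design criteria in Box 1 can be checked programmatically before ordering an array. The sketch below is illustrative only: the function names are hypothetical, it uses the first adaptor pair listed in Box 1, and it assumes the Agilent 230-nucleotide limit and the >8-base homopolymer guideline stated there.

```python
import re

MAX_OLIGO_LEN = 230   # Agilent synthesis limit quoted in Box 1
MAX_HOMOPOLYMER = 8   # Box 1 advises avoiding runs longer than 8 bases

# First adaptor pair listed in Box 1 (upstream, downstream).
ADAPTOR_5P = "AGGACCGGATCAACT"
ADAPTOR_3P = "CATTGCGTGAACCGA"


def build_oligo(crs, adaptor_5p=ADAPTOR_5P, adaptor_3p=ADAPTOR_3P):
    """Return the CRS flanked by the two 15-base PCR adaptors."""
    return adaptor_5p + crs.upper() + adaptor_3p


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in seq."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def check_oligo(crs):
    """Return a list of design problems (empty if none were found)."""
    problems = []
    oligo = build_oligo(crs)
    if len(oligo) > MAX_OLIGO_LEN:
        problems.append(f"oligo is {len(oligo)} nt, limit is {MAX_OLIGO_LEN}")
    if longest_homopolymer(crs) > MAX_HOMOPOLYMER:
        problems.append(f"homopolymer longer than {MAX_HOMOPOLYMER} bases")
    return problems
```

A 200-base CRS yields exactly a 230-nucleotide oligonucleotide with these adaptors, which is why 200 bp is the maximum CRS length for this synthesis platform.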

Library generation

For this protocol, we will focus on oligonucleotide synthesis because it is currently the most cost-effective way to generate fixed-length CRSs. Here, the synthesized oligonucleotide pool of the CRSs is amplified via two rounds of PCR, first to add the minimal promoter, and then to add the barcode. The amplified fragments are cloned via Gibson assembly into the SbfI/AgeI site of the pLS-SceI vector to construct the library. The resulting library is digested with I-SceI to remove any vector that did not receive an insert. The recombination products are then electroporated into competent cells and plated onto ampicillin plates. Sanger sequencing of 16 colonies is then used to confirm the proper assembly of the library. The number of plates will dictate the number of barcodes each CRS will have on average. The number of colonies required for plasmid extraction will depend on the number of CRSs tested and the desired number of barcodes per CRS. Generally, it is ideal to have at least 50 barcodes per CRS, and the total number of colonies should roughly equal the desired library complexity. We recommend limiting the complexity of the library because of the finite nature of the multiplicity of infection (MOI) and the associated increase in sequencing costs. The complexity recommended in this protocol is 0.5–12 million total barcodes. The library should then be midi-prepped to extract the final plasmid library.
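
As a worked example of the complexity guideline above, the following sketch computes the colony target for a library and checks it against the recommended 0.5–12 million range. The function names are hypothetical, and `integrations_per_barcode` is an assumed rough coverage metric (cells × MOI divided by barcodes), not the calculation in Supplementary Table 1.

```python
def library_complexity(n_crs, barcodes_per_crs=50):
    """Total CRS-barcode pairs, which is also the approximate colony target."""
    return n_crs * barcodes_per_crs


def within_recommended_range(total_barcodes, low=500_000, high=12_000_000):
    """True if the complexity falls in the 0.5-12 million range given above."""
    return low <= total_barcodes <= high


def integrations_per_barcode(n_cells, moi, total_barcodes):
    """Rough mean number of integration events per barcode (assumed metric)."""
    return n_cells * moi / total_barcodes


# HepG2 example from Box 1: 164,000 CRSs, 50 barcodes each,
# 15 million cells infected at an MOI of 50.
total = library_complexity(164_000, 50)
print(total, within_recommended_range(total))
print(round(integrations_per_barcode(15_000_000, 50, total), 1))
```

For the Box 1 example this gives 8.2 million barcodes, which lies inside the recommended range.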

Association sequencing

To associate the barcode to the CRS, PCR is performed on the plasmid library to add flowcell sequences and sample indexes to the CRS–barcode pairs. The PCR product is then gel-extracted at the appropriate insert size (~471 bp for a 200-bp CRS) and sent for paired-end sequencing with an index read for barcode sequence, using custom primers provided in this protocol.

Lentiviral prep

The next step is to generate a lentivirus library. This is done by transfecting 293T cells with the plasmid library. Following 2 days in culture with titer boost reagent, the virus is collected and concentrated. To titrate the lentivirus, the cell type of interest is plated into 8 wells of a 24-well plate and infected with varying volumes of the virus (0, 1, 2, 4, 8, 16, 32, 64 μL) in each well. Cells are monitored for viability throughout this time in order to determine whether certain concentrations are toxic to them. Following a 3-d incubation (to reduce non-integrating lentivirus), genomic DNA is extracted from each well. qPCR is then carried out for each condition using primers against genomic DNA, integrated viral DNA, and plasmid backbone DNA. The MOI is calculated for each viral concentration (Supplementary Table 2). These values are then plotted against the viral volume to calculate the viral titer. Conditions need to be adjusted if cells are not viable.
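
The final titration step, relating MOI to viral volume, can be sketched as an ordinary least-squares line fit. The MOI values below are invented for illustration, the number of cells per titration well is a hypothetical value, and the per-well MOI calculation itself (Supplementary Table 2) is not reproduced here. Real titration curves may saturate at high volumes, in which case restricting the fit to the linear range would be more appropriate.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def volume_for_moi(target_moi, n_cells, slope, cells_titrated):
    """Virus volume (uL) expected to give target_moi in n_cells cells.

    Assumes MOI scales linearly with the volume of virus per cell.
    """
    return target_moi * n_cells / (slope * cells_titrated)


volumes_ul = [0, 1, 2, 4, 8, 16, 32, 64]             # volumes from the protocol
moi = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0]     # hypothetical qPCR-derived MOI
cells_per_well = 100_000                              # hypothetical cell number

slope, intercept = fit_line(volumes_ul, moi)          # slope is MOI per uL
titer_per_ul = slope * cells_per_well                 # integrating units per uL
```

The slope, multiplied by the number of cells titrated, gives an estimate of integrating units per microliter that can be scaled to the cell number used in the actual infection.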

Infection and sequencing

The lentiMPRA library is then infected into the cells of interest and incubated for 3 d. The number of cells required is determined on the basis of library complexity and the highest MOI that the cells can be infected with that is not toxic to the cells. We strongly recommend carrying out three technical replicates for each biological condition tested to assess reproducibility. The cells are then washed to reduce non-integrating lentivirus, and DNA and RNA are simultaneously extracted. RNA is treated with DNase and reverse transcription is done using construct-specific primers that contain P7 flowcell sequences and unique molecular identifiers (UMIs), to preserve the true counts of molecules through the amplification process. PCR is carried out on the DNA and RNA samples to amplify barcodes, adding P5 flowcell sequence and sample index upstream, and P7 flowcell sequence and UMI to the barcode. The sequencing libraries are then pooled and sent for paired-end sequencing with a UMI and sample index read.
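
The role of UMIs in preserving molecule counts can be illustrated with a short sketch that collapses PCR duplicates: reads sharing the same barcode and UMI are counted once. This is a simplified illustration with a hypothetical function name and no error correction of UMIs or barcodes; it is not presented as the way MPRAflow handles UMIs.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per barcode.

    reads: iterable of (barcode, umi) pairs, one per sequenced read.
    Returns {barcode: number of distinct UMIs}, so that PCR duplicates
    (same barcode and same UMI) contribute a single count.
    """
    umis_per_barcode = defaultdict(set)
    for barcode, umi in reads:
        umis_per_barcode[barcode].add(umi)
    return {bc: len(umis) for bc, umis in umis_per_barcode.items()}


# Toy reads: the first barcode was amplified from two molecules,
# one of which was sequenced three times.
reads = [
    ("ACGTACGTACGTACG", "AAAACCCCGG"),
    ("ACGTACGTACGTACG", "AAAACCCCGG"),
    ("ACGTACGTACGTACG", "AAAACCCCGG"),
    ("ACGTACGTACGTACG", "TTTTGGGGAA"),
    ("TTTTTGGGGGCCCCC", "AAAACCCCGG"),
]
molecule_counts = count_unique_molecules(reads)
```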

Data processing

We built a computational tool, MPRAflow, to easily process demultiplexed .fastq data resulting from lentiMPRA and other MPRA experiments. If the barcodes are randomly paired with CRSs, the association utility can be run to assign barcodes to the appropriate CRS. We provide a workflow tailored to testing distinct CRSs, using Burrows–Wheeler Aligner (BWA²⁴) to align sequences to the ordered oligonucleotide pool, and a workflow for libraries containing single-nucleotide variants of the same CRS, using Bowtie2²⁵ and a list of the expected positions of the variants. The resulting pairing is then used in the count utility, which processes the barcode sequencing of the DNA and RNA to create normalized log₂(RNA/DNA) ratios for the transcriptional activity of each CRS tested, along with easy-to-interpret visualizations. If more robust statistical analyses are desired, we provide the option to generate input files for MPRAnalyze¹⁶, a generalized linear model approach. In addition, we provide an alternative workflow for quantifying expression of CRS libraries produced with saturation mutagenesis. It processes data into a matrix of RNA count, DNA count, and N binary columns indicating whether a specific sequence variant was associated with the barcode (T), which are used to fit a multiple linear regression model of log₂(RNA_j) ~ log₂(DNA_j) + N + offset (j ∈ T) and report the coefficients of N as effects for each variant. The utility processes multiple replicates and conditions in parallel if an HPC cluster is available but can also be run locally. This code is freely available on GitHub (https://github.com/shendurelab/MPRAflow).
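
The saturation mutagenesis regression described above can be sketched with a small least-squares fit. This is an illustration under stated assumptions rather than MPRAflow's implementation: the "offset" term is treated as an ordinary intercept, the function names are hypothetical, and the normal equations are solved with plain Gaussian elimination so that the example needs only the standard library.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_variant_effects(rna, dna, variants):
    """Least-squares fit of log2(RNA) ~ offset + log2(DNA) + variant columns.

    rna, dna: per-barcode counts (all > 0); variants: per-barcode lists of
    0/1 flags, one flag per variant. Returns (offset, dna_coef, effects).
    """
    y = [math.log2(r) for r in rna]
    design = [[1.0, math.log2(d)] + [float(v) for v in flags]
              for d, flags in zip(dna, variants)]
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return beta[0], beta[1], beta[2:]
```

Each entry of the returned `effects` list is the fitted coefficient of one binary variant column, which corresponds to the per-variant effect reported by the model.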

Necessary expertise

Basic molecular biology and cell culture skills are required to perform lentiMPRA. For MPRAflow, a basic familiarity with command-line tools is needed.

Limitations

lentiMPRA has several limitations. These include a limitation in the number of CRSs that can be tested in cells that are not amenable to high lentivirus concentrations, although this can be ameliorated by using a larger number of cells. The use of oligonucleotide synthesis to generate the CRS library can also limit the number of sequences that can be tested, as well as their length. Improvements in DNA synthesis, as well as PCR or DNA capture–based methods, may ultimately overcome this limitation. Techniques that enable multiplex pairwise assembly of oligonucleotides²⁶ could also be used to increase CRS size by patching together specific oligonucleotides.

As for MPRAflow, although this tool is applicable to many types of MPRA, it does not support STARR-seq workflows because it does not include functionality for peak calling.

Materials

Biological materials

! CAUTION Cell lines should be regularly checked to ensure that they are authentic and that they are not infected with mycoplasma.

293T cells (ATCC, cat. no. CRL-3216, RRID: CVCL_0063)
Cell lines of interest. All data shown in this protocol were generated from HepG2 cells (ATCC, cat. no. HB-8065, RRID: CVCL_0027)

Reagents

DMEM (Life Technologies, cat. no. 11995-065)
FBS (VWR International, cat. no. 89510-194)
Penicillin–streptomycin (Life Technologies, cat. no. 15140-122)
Trypsin-EDTA (0.05%; Life Technologies, cat. no. 25300-062)
Polybrene (Sigma-Aldrich, cat. no. TR-1003-G)
DPBS (Sigma-Aldrich, cat. no. D8537)
Wizard SV Genomic DNA Purification System (Promega, cat. no. A2361)
SsoFast EvaGreen Supermix (Bio-Rad, cat. no. 1725204)
Primers and adaptors (custom-made by IDT with standard desalting; Supplementary Table 3)
UltraPure DNase/RNase-free distilled water (Life Technologies, cat. no. 10977-023)
SurePrint 244K Oligonucleotide Libraries (Agilent, cat. no. G7223A)
TE buffer (Tris–EDTA; Teknova, cat. no. T0225)
NEBNext High-Fidelity 2× PCR Master Mix (New England BioLabs, cat. no. M0541L)
Ethyl alcohol (Sigma-Aldrich, cat. no. E7023-500ML)
Buffer EB (Qiagen, cat. no. 19086)
HighPrep PCR reagent (MagBio Genomics, cat. no. AC60050)
Gel loading dye (6×; New England BioLabs, cat. no. B7025S)
SeaKem LE agarose (Lonza, cat. no. 50004)
TAE (Thermo Fisher Scientific, cat. no. BP13324)
SYBR Safe DNA gel stain (Invitrogen, cat. no. S33102)
QIAquick Gel Extraction Kit (Qiagen, cat. no. 28704)
CutSmart buffer (10×; New England BioLabs, cat. no. B7204S)
AgeI-HF (New England BioLabs, cat. no. R3552L)
SbfI-HF (New England BioLabs, cat. no. R3642L)
I-SceI (New England BioLabs, cat. no. R0694S)
NEBuilder HiFi DNA Assembly Master Mix (New England BioLabs, cat. no. E2621L)
NEB 10-beta electrocompetent cells (New England BioLabs, cat. no. C3020K)
LB base (Life Technologies, cat. no. 12780029)
LB agar plates (15 cm; Teknova, cat. no. L5002)
Carbenicillin (Teknova, cat. no. C2130)
QIAprep Spin Miniprep Kit (Qiagen, cat. no. 27106)
Qiagen Plasmid Plus Midi Kit (Qiagen, cat. no. 12945)
DNA ladder (1 kb; New England BioLabs, cat. no. N3232S)
DNA ladder (100 bp; New England BioLabs, cat. no. N3231S)
Qubit dsDNA HS Assay Kit (Life Technologies, cat. no. Q32851)
Qubit RNA HS Assay Kit (Life Technologies, cat. no. Q32852)
OPTI-MEM (Life Technologies, cat. no. 31985070)
EndoFectin (Genecopoeia, cat. no. EFL1001-01)
psPAX2 (Addgene, cat. no. 12260; RRID: Addgene_12260)
pMD2.G (Addgene, cat. no. 12259; RRID: Addgene_12259)
pLS-SV40-mP-EGFP (Addgene, cat. no. 137724; RRID: Addgene_137724)
pLS-SceI (Addgene, cat. no. 137725; RRID: Addgene_137725)
ViralBoost reagent (Alstem, cat. no. VB100)
Lenti-X Concentrator (Takara, cat. no. 631232)
AllPrep DNA/RNA Mini Kit (Qiagen, cat. no. 80204)
RNase-free DNase Set (Qiagen, cat. no. 79256)
2-Mercaptoethanol (Bio-Rad, cat. no. 1610710) ! CAUTION 2-Mercaptoethanol is toxic, so it should be handled in a hood while wearing disposable gloves.
TURBO DNA-free Kit (Life Technologies, cat. no. AM1907)
SuperScript II Reverse Transcriptase (Life Technologies, cat. no. 18064-071)
SYBR Green I nucleic acid gel stain (10,000×; Invitrogen, cat. no. S7563)

Equipment

Pipettes (20 μL, 200 μL and 1,000 μL; Rainin, cat. nos. 17014392, 17014391 and 17014382)
Filter tips (20 μL, 200 μL and 1,000 μL; Rainin, cat. nos. 17005860, 17005859 and 17007081)
Serological pipettes (5 mL, 10 mL and 25 mL; Genesee Scientific, cat. nos. 12-102, 12-104 and 12-106)
Pipet-Aid XP (Drummond, cat. no. 4-000-101)
Cell culture plates (24 well, 10 cm, and 15 cm; Genesee Scientific, cat. nos. 25-107, 25-202, and 25-203)
Inverted fluorescence microscope (Leica, DMIL LED)
CO₂ incubator (Thermo Fisher Scientific, Thermo Forma Series II water jacketed)
Hemocytometer (Hausser Scientific, cat. no. 3200)
DNA LoBind tubes (Eppendorf, cat. no. 022431021)
PCR tubes (8-strip; Axygen, cat. no. PCR-0208-FCP-C)
Vortex mixer (Thermo Fisher Scientific, cat. no. 88880017)
Spectrophotometer (NanoDrop 8000; Thermo Fisher Scientific, cat. no. ND-8000-GL)
Qubit fluorometer (Life Technologies, cat. no. Q32857)
qPCR instrument (QuantStudio v. 6 Flex Real-Time PCR System; Applied Biosystems, cat. no. 4485699)
Thermal cycler (ProFlex PCR system; Applied Biosystems, cat. no. 4484073)
DynaMag-2 magnet (Thermo Fisher Scientific, cat. no. 12321D)
Tabletop centrifuge (Myspin 6; Thermo Fisher Scientific, cat. no. 75004061)
Gel electrophoresis system (Mupid-2plus; Takara, cat. no. AD110)
Gel casting set (Takara, cat. no. AD216)
Gel combs (Takara, cat. no. AD214)
Safe Imager 2.0 Blue-Light Transilluminator (Life Technologies, cat. no. G6600)
Heating dry bath (Thermo Fisher Scientific, cat. no. 88880027) ! CAUTION The temperature displayed by the digital thermometer in the heat bath may not be accurate. To calibrate the instrument, we recommend using an alcohol thermometer to measure the temperature of water in a tube placed on the instrument.
Alcohol thermometer
Cuvettes (1-mm gap; BTX Harvard Apparatus, cat. no. 450124)
Gemini X2 Electroporation System (BTX Harvard Apparatus, cat. no. 452007)
37 °C shaker (New Brunswick Scientific, Excella E24)
Round-bottom tubes (14 mL; Corning, cat. no. 352059)
Tubes (50 mL; Corning, cat. no. 352070)
37 °C incubator (Boekel Scientific, cat. no. 133001)
T225 flasks (Corning, cat. no. 431082)
Centrifuge (Eppendorf, cat. no. 022625501)
Polyethersulfone (PES) filter units (0.45 μm; Thermo Fisher Scientific, cat. no. 165-0045)
Cell lifters (Corning, cat. no. 3008)
Luer-Lok syringes (3 mL; BD, cat. no. 309657)
Needles (20-gauge; BD, cat. no. 305179)
Parafilm (Heathrow Scientific, cat. no. HS234526B)

Software

Conda (https://docs.conda.io/en/latest/miniconda.html)
Linux (https://www.linux.org/pages/download/)

Reagent setup

DMEM (with 10% (vol/vol) heat-inactivated FBS)

Incubate FBS at 55 °C for 40 min. Supplement DMEM with 10% (vol/vol) heat-inactivated FBS and 1% (vol/vol) penicillin–streptomycin. Store at 4 °C for up to 3 months.

DMEM (with 5% (vol/vol) heat-inactivated FBS)

Supplement DMEM with 5% (vol/vol) heat-inactivated FBS and 1% (vol/vol) penicillin–streptomycin. Store at 4 °C for up to 3 months.

80% Ethanol

Dilute 8 mL of ethyl alcohol with 2 mL of UltraPure distilled H₂O. Store at room temperature (RT; 22–25 °C) for up to 2 weeks.

LB medium

Suspend 20 g of LB base in 1 L of distilled water and sterilize by autoclaving. Store at RT for up to 4 months.

TAE-agarose gels

Dissolve SeaKem LE agarose in TAE by boiling. Either 0.3 g, 0.45 g or 0.54 g of agarose per 30 mL TAE should be used to obtain 1%, 1.5% or 1.8% (wt/vol) gel, respectively. Add 3 μL of SYBR Safe DNA gel stain and cast the gel using an appropriate gel casting set and comb, as described below.

Procedure

Library amplification ● Timing 3 h

1. Dissolve the Agilent oligonucleotide (10 pmol) (Fig. 1b and Box 1) in 100 μL TE buffer to obtain a 100 nM solution.

Set up the first-round PCR reaction. This reaction adds a vector overhang sequence upstream and minimal promoter and adaptor sequences downstream of the CRSs (Fig. 1c, Extended Data Fig. 1b).

Reagent	Volume (μL)	Final conc.
Agilent oligonucleotide (100 nM)	2	1 nM
NEBNext High-Fidelity 2× PCR Master Mix	100	1×
5BC-AG-f01 (100 μM)	1	0.5 μM
5BC-AG-r01 (100 μM)	1	0.5 μM
Ultrapure distilled H₂O	96
Total volume	200

Cycle no.	Denature	Anneal	Extend
1	98 °C, 2 min
2-6 (5 cycles)	98 °C, 15 s	60 °C, 20 s	72 °C, 30 s
7			72 °C, 5 min

Reagent	Volume (μL)	Final conc.
First-round PCR product	Variable (100 ng)
NEBNext High-Fidelity 2× PCR Master Mix	200	1×
5BC-AG-f02 (100 μM)	2	0.5 μM
5BC-AG-r02 (100 μM)	2	0.5 μM
Ultrapure distilled H₂O	Make up to 400 μL
Total volume	400

Cycle no.	Denature	Anneal	Extend
1	98 °C, 2 min
2-13 (12 cycles)	98 °C, 15 s	60 °C, 20 s	72 °C, 30 s
14			72 °C, 5 min

Reagent	Volume (μL)	Final conc.
pLS-SceI	Variable (10 μg)
CutSmart buffer (10×)	20	1×
AgeI-HF (20 U/μL)	5 (100 U)	0.5 U/μL
SbfI-HF (20 U/μL)	5 (100 U)	0.5 U/μL
Ultrapure distilled H₂O	Make up to 200 μL
Total volume	200

Reagent	Volume (μL)	Final conc.
Linearized pLS-SceI (from Step 36)	Variable (1 μg)
Purified insert DNA (from Step 29)	Variable (250 ng)
NEBuilder HiFi DNA Assembly Master Mix	100	1×
Ultrapure distilled H₂O	Make up to 200 μL
Total volume	200

Reagent	Volume (μL)	Final conc.
Recombination product	44
CutSmart buffer (10×)	5	1×
I-SceI (20 U/μL)	1 (20 U)	0.4 U/μL
Total volume	50

Reagent	Volume (μL)	Final conc.
Plasmid library	Variable (40 ng)
NEBNext High-Fidelity 2× PCR Master Mix	100	1×
pLSmP-ass-i# (100 μM)	1	0.5 μM
pLSmP-ass-gfp (100 μM)	1	0.5 μM
Ultrapure distilled H₂O	Make up to 200 μL
Total volume	200

Cycle no.	Denature	Anneal	Extend
1	98 °C, 1 min
2-16 (15 cycles)	98 °C, 15 s	60 °C, 20 s	72 °C, 3 min
17			72 °C, 5 min

Read	Cycles	Primer	Output
Read 1	146	pLSmP-ass-seq-R1	CRS (upstream, forward)
Read 2	146	pLSmP-ass-seq-R2	CRS (downstream, reverse)
Index read 1	15	pLSmP-ass-seq-ind1	Barcode (forward)
Index read 2	10	pLSmP-rand-ind2	Sample index

Reagent	Volume (μL)	Final conc.
Template DNA (10 ng/μL)	2.5 (25 ng)
SsoFast EvaGreen Supermix	5	1×
Forward primer (100 μM)	0.1	1 μM
Reverse primer (100 μM)	0.1	1 μM
UltraPure distilled H₂O	2.3
Total volume	10

reference_name	variant_positions	ref_bases	alt_bases
ref_1	130	A	T
ref_2	108	G	A
ref_3	67, 99	A, C	C, T
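
As an illustration of how a variants file in the layout above could be read, the following Python sketch expands each row into one record per variant. It is not MPRAflow code: the function name is hypothetical, and we assume that multiple positions and bases within a cell are comma-separated, with optional spaces, as in the ref_3 row.

```python
import csv
import io


def parse_variants(handle):
    """Yield (reference_name, position, ref_base, alt_base), one tuple per variant."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # int() tolerates surrounding whitespace, so "67, 99" and "67,99" both work.
        positions = [int(p) for p in row["variant_positions"].split(",")]
        refs = [b.strip() for b in row["ref_bases"].split(",")]
        alts = [b.strip() for b in row["alt_bases"].split(",")]
        if not (len(positions) == len(refs) == len(alts)):
            raise ValueError(
                "Unequal number of positions and bases for " + row["reference_name"]
            )
        for pos, ref, alt in zip(positions, refs, alts):
            yield row["reference_name"], pos, ref, alt


# The three example rows from the table above, written as tab-separated text.
example = (
    "reference_name\tvariant_positions\tref_bases\talt_bases\n"
    "ref_1\t130\tA\tT\n"
    "ref_2\t108\tG\tA\n"
    "ref_3\t67, 99\tA, C\tC, T\n"
)

variants = list(parse_variants(io.StringIO(example)))
for record in variants:
    print(record)
```

Raising an error on rows with mismatched list lengths catches malformed files before they reach the variant analysis workflow.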

Option	Description
--fastq-insert	Full path to library association .fastq file for insert (must be surrounded with quotes)
--variants	Tab-separated values (.tsv) file with reference name, variant positions, ref bases, alt bases; only input for variant analysis workflow
--fastq-bc	Full path to library association .fastq file for barcode (must be surrounded with quotes)
--design	Full path to .fasta file of ordered oligonucleotide sequences (must be surrounded with quotes)
--name	Name of the association. Files will be named after this
--fastq-insertPE	Full path to library association .fastq file for read 2 if the library is a paired-end library (must be surrounded with quotes)
--min-cov	Minimum coverage of barcode to count it (default = 3)
--min-frac	Minimum fraction of barcodes to map to single insert (default = 0.5)
--mapq	Map quality (default = 30)
--baseq	Base quality (default = 30)
--cigar	Require exact match, for example, 200M (default = none)
--outdir	The output directory where the results will be saved and what will be used as a prefix (default = outs)
--w	Specific name for work directory (default = work)
--with-timeline	Creates HTML file showing processing times
--split	Number of read entries per .fastq chunk for faster processing (default = 2000000)
--labels	.tsv file with the oligonucleotide pool .fasta file and a group label (for example, positive_control); if no labels are desired, a file will be automatically generated
--h, --help	Help message
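
To show how the options above fit together, the Python sketch below assembles an argument list for an association run. The option names and defaults come from the table; the workflow script name (association.nf), the helper function and all file paths are assumptions made for illustration and should be checked against the MPRAflow repository before use.

```python
import shlex


def build_association_command(name, fastq_insert, fastq_bc, design,
                              fastq_insert_pe=None, labels=None,
                              min_cov=3, min_frac=0.5, mapq=30, baseq=30,
                              outdir="outs", script="association.nf"):
    """Return the command as a list of arguments (script name is an assumption)."""
    cmd = [
        "nextflow", "run", script,
        "--name", name,
        "--fastq-insert", fastq_insert,
        "--fastq-bc", fastq_bc,
        "--design", design,
        "--min-cov", str(min_cov),
        "--min-frac", str(min_frac),
        "--mapq", str(mapq),
        "--baseq", str(baseq),
        "--outdir", outdir,
    ]
    # Optional inputs are appended only when provided.
    if fastq_insert_pe is not None:
        cmd += ["--fastq-insertPE", fastq_insert_pe]
    if labels is not None:
        cmd += ["--labels", labels]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_association_command(
    name="assoc_example",
    fastq_insert="/path/to/assoc_R1.fastq.gz",
    fastq_insert_pe="/path/to/assoc_R2.fastq.gz",
    fastq_bc="/path/to/assoc_barcode.fastq.gz",
    design="/path/to/design.fa",
)

# Quote each argument so the printed line can be pasted into a shell.
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```

Building the command as a list keeps each path as a single argument; when typing the command directly in a shell, surround the paths with quotes as noted in the table.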

Option	Description
--dir	.fasta directory (must be surrounded with quotes)
--association	pickle dictionary from library association process
--design	.fasta file of ordered insert sequences
--e, --experiment	Experiment .csv file
--labels	.tsv file with the oligonucleotide pool .fasta file and a group label (e.g., positive_control); a single label will be applied if a file is not specified
--outdir	The output directory where the results will be saved (default = outs)
--bc-length	Length of barcode (default = 15)
--umi-length	Length of UMI when given (default = 10)
--no-umi	Flag if no UMI was used in the experiment
--merge-intersect	Only retain barcodes present in both the RNA and DNA fractions (TRUE/FALSE, default = FALSE)
--mpranalyze	Flag to generate only MPRAnalyze outputs
--thresh	Minimum number of observed barcodes to retain insert (default = 10)
--w	Specific name for work directory (default = work)
--with-timeline	Create HTML file showing processing times
--h, --help	Help message
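
The effect of barcode intersection and the barcode threshold can be pictured with a simplified Python sketch. This is not the MPRAflow implementation: it only sums raw barcode counts per insert and reports log2(RNA/DNA), without any library-size normalization, and the function, argument names and example data are hypothetical.

```python
import math
from collections import defaultdict


def insert_activity(dna_counts, rna_counts, bc_to_insert, thresh=10,
                    merge_intersect=False):
    """Return {insert: log2(RNA/DNA)} from per-barcode counts.

    thresh: minimum number of observed barcodes required to retain an insert.
    merge_intersect: if True, keep only barcodes seen in both RNA and DNA.
    """
    if merge_intersect:
        barcodes = dna_counts.keys() & rna_counts.keys()
    else:
        barcodes = dna_counts.keys() | rna_counts.keys()

    dna_sum = defaultdict(int)
    rna_sum = defaultdict(int)
    n_barcodes = defaultdict(int)
    for bc in barcodes:
        insert = bc_to_insert.get(bc)
        if insert is None:
            continue  # barcode was not assigned during association
        dna_sum[insert] += dna_counts.get(bc, 0)
        rna_sum[insert] += rna_counts.get(bc, 0)
        n_barcodes[insert] += 1

    activity = {}
    for insert, n in n_barcodes.items():
        if n < thresh or dna_sum[insert] == 0 or rna_sum[insert] == 0:
            continue  # too few barcodes, or ratio undefined
        activity[insert] = math.log2(rna_sum[insert] / dna_sum[insert])
    return activity


# Toy data: crs1 has two shared barcodes, crs2 has one shared and one DNA-only.
dna = {"AAA": 10, "AAC": 10, "CCC": 5, "GGG": 7}
rna = {"AAA": 20, "AAC": 20, "CCC": 5}
assignment = {"AAA": "crs1", "AAC": "crs1", "CCC": "crs2", "GGG": "crs2"}

strict = insert_activity(dna, rna, assignment, thresh=2, merge_intersect=True)
relaxed = insert_activity(dna, rna, assignment, thresh=2, merge_intersect=False)
print(strict, relaxed)
```

In this toy example, requiring barcodes in both fractions leaves crs2 with a single barcode, so it falls below the threshold and is dropped; without the intersection it is retained with a lower ratio because the DNA-only barcode still contributes to the denominator.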

Option	Description
--dir	Directory of count files (must be surrounded with quotes)
--assignment	Variant assignment file
--e, --experiment	Experiment .csv file
--outdir	The output directory where the results will be saved (default = outs)
--thresh	Minimum number of observed barcodes to retain insert (default = 10)
--pvalue	P value threshold for calling significant variant effects (default = 1e-5)
--w	Specific name for work directory (default = work)
--with-timeline	Create HTML file showing processing times
--h, --help	Help message

Cycle no.	Denature	Anneal and extend	Gradient increase
1	95 °C, 1 min
2-36 (35 cycles)	95 °C, 10 s	60 °C, 30 s
37			60-95 °C in 15 min

Reagent	Volume (μL)	Final conc.
RNA	Variable (60 μg total RNA)
P7-pLSmP-ass16UMI-gfp (100 μM)	0.25	0.25 μM
dNTP mix (10 mM, from SuperScript II Reverse Transcriptase)	5	0.5 mM
UltraPure distilled H₂O	Make up to 65 μL
Total volume	65

Reagent	Volume (μL)	Final conc.
DNA or cDNA	100 (12 μg DNA or entire RT product)
NEBNext High-Fidelity 2× PCR Master Mix	200	1×
P7-pLSmP-ass16UMI-gfp (100 μM)	2	0.5 μM
P5-pLSmP-5bc-i# (100 μM)	2	0.5 μM
UltraPure distilled H₂O	96
Total volume	400

Cycle no.	Denature	Anneal	Extend
1	98 °C, 1 min
2-4 (3 cycles)	98 °C, 10 s	60 °C, 30 s	72 °C, 1 min
5			72 °C, 5 min

Reagent	Volume (μL)	Final conc.
First-round PCR product	5
NEBNext High-Fidelity 2× PCR Master Mix	10	1×
P7 (100 μM)	0.1	0.5 μM
P5 (100 μM)	0.1	0.5 μM
SYBR Green I nucleic acid gel stain (100×)	0.1	1×
Ultrapure distilled H₂O	4.7
Total volume	20

Cycle no.	Denature	Anneal	Extend
1	98 °C, 1 min
2-31 (30 cycles)	98 °C, 10 s	60 °C, 30 s	72 °C, 1 min

Cycle no.	Denature	Anneal	Extend
1	98 °C, 1 min
2-X (X cycles)	98 °C, 10 s	60 °C, 30 s	72 °C, 1 min

Step	Problem	Possible reason	Solution
Step 29	Low DNA yield. At least 250 ng of insert DNA is required for the recombination reaction	DNA amplification was not enough. DNA loss during gel extraction	Multiply the PCR reaction or increase the number of cycles in the second-round PCR to 15. More cycles (>15 cycles) can decrease the library complexity
Step 37	Uncut vector DNA appears on the gel	Insufficient restriction enzyme reaction	Perform restriction digestion twice or three times (Steps 30-36)
Step 61	Contamination with empty vectors	Vector linearization and/or I-SceI digestion were not sufficient	Lower rates of empty vector contamination (<10%, one or two out of 16 colonies) are acceptable. In this case, proceed with the protocol. If the rate is >10%, redo vector linearization with a longer incubation time and make sure you have achieved complete linearization using an agarose gel. Perform I-SceI digestion with a longer incubation time
	Mutations and indels observed in the plasmids	These can be derived from synthesis, PCR or sequencing errors	As these errors are unavoidable, we usually observe that >50% of sequences contain mutations and/or deletions. Proceed with the protocol; these erroneous sequences should be ruled out during the analysis step. Synthesis error rates might be improved by ordering oligonucleotides from a manufacturer that offers high-fidelity synthesis
Box 2, step 5	Low infection efficiency	Polybrene concentration may not be appropriate	Optimization of Polybrene concentration may be required. Seed cells in a 24-well plate, infect with control virus, along with different amounts of Polybrene (e.g., 0, 2, 4, 8, 16, 32 μg/mL at final concentration), and observe cell death and GFP expression. In our experience, 8 μg/mL works well for most cell types, including HepG2 cells, K562 cells, H1 hESCs (human embryonic stem cells), and WTC11 iPSCs (induced pluripotent stem cells). Polybrene kills neural cell types, including neural progenitors, and should be avoided when using those types of cells
