Author manuscript; available in PMC: 2016 Oct 19.
Published in final edited form as: Nat Protoc. 2012 May 3;7(6):1024–1041. doi: 10.1038/nprot.2012.039

Genome wide copy number analysis of single cells

Timour Baslan 1,2, Jude Kendall 1, Linda Rodgers 1, Hilary Cox 1, Mike Riggs 1, Asya Stepansky 1, Jennifer Troge 1, Kandasamy Ravi 1, Diane Esposito 1, B Lakshmi 3, Michael Wigler 1, Nicholas Navin 4, James Hicks 1
PMCID: PMC5069701  NIHMSID: NIHMS816808  PMID: 22555242

Summary

Copy number variation (CNV) is increasingly recognized as an important contributor to phenotypic variation in health and disease. Most methods for determining CNV rely on admixtures of cells, where information regarding genetic heterogeneity is lost. Here, we present a protocol that allows for the genome wide copy number analysis of single nuclei isolated from mixed populations of cells. Single nucleus sequencing (SNS) combines flow sorting of single nuclei based on DNA content and whole genome amplification (WGA) with next generation sequencing to quantize genomic intervals in a genome wide manner. Multiplexing of single cells is discussed. Additionally, we outline informatic approaches that correct for biases inherent in the WGA procedure and allow for accurate determination of copy number profiles. Altogether, the protocol takes ~3 days from flow cytometry to sequence-ready DNA libraries.

Introduction

Copy number variation is an important source of genetic variation in humans and other organisms and is known to influence phenotypic traits 1. Many studies have associated copy number variants with normal phenotypes, such as human olfactory receptors and smell 2 and the amylase gene and diet 3. Importantly, copy number variation has been linked with a wide range of deleterious phenotypes and disorders such as obesity 4, psychiatric disorders 5, and cancer 6. In cancer, the commonality of copy number alterations has led to intense investigations of the copy number landscapes of tumors 7,8. Thousands of tumors, across many cancer types, have been profiled using a variety of copy number detection techniques. These investigations have allowed for the identification of disease associated alterations that have subsequently been used to guide therapeutic decisions, for example using amplification of the ERBB2 locus to qualify patients for Herceptin 9.

The most commonly utilized method in interrogating the copy number landscape of genomes has been Array Comparative Genomic Hybridization (aCGH) 10. aCGH technology is based on differential labeling of sample and reference DNA (e.g. tumor and matched normal) with fluorophores, hybridization to arrays containing oligonucleotide probes, and subsequent analysis of fluorometric signal ratios, and it allows for calling of the copy number profile of discrete genomic intervals. The study of copy number alterations using aCGH has led to the identification of recurrent amplifications and deletions across many human cancers 11,12. aCGH, however, is not without limitations 13. An important drawback of this technique, specifically in the study of cancer genomes, stems from the utilization of whole genomic DNA purified from tumor tissue where genomically normal cells are almost always present. The presence of such “non-tumor” components dilutes CGH signals and can result in inaccurate copy number calling of certain genomic segments (for example, single copy deletions or duplications in polyploid tumors). Furthermore, even if tumor cells are enriched by means such as Laser-Capture-Microdissection (LCM), aCGH does not allow for the characterization of tumor heterogeneity where multiple clones with distinct genomic profiles might be present in the tissue samples. Thus, methods that can obviate such shortcomings are of pivotal importance.

Major advances in genomic research have paralleled the emergence of next generation sequencing technology 14-16. The highly quantitative nature of the sequencing data, along with the ever increasing output of next generation sequencing machines, has led to the adoption of sequencing technologies in all facets of genomic research. In the area of copy number analysis, for example, many laboratories have successfully leveraged the power of high throughput sequencing in profiling genome copy number landscapes with notable advantages over aCGH 17,18. The depth of data generated by cancer sequencing projects has allowed investigators to discern the phylogeny of tumorigenesis, and rekindled the cancer community’s interest in a long known facet of tumor biology: intra-tumor heterogeneity. Somatic mutations identified in whole genome sequencing efforts were found not to be present in all of the cells constituting the tumor mass, but rather present at varying percentages in the tumor cell populations 19,20. With the notion that the study of tumor heterogeneity is necessary to understand tumor biology, numerous reports started to emerge, describing the heterogeneous nature of cancer in greater detail 21-23.

In order to better understand and characterize tumor heterogeneity, we developed an approach, Single Nucleus Sequencing (SNS), that allows for the genome wide characterization of a single cell copy number profile 24. SNS combines flow sorting of single nuclei, whole genome amplification, and next generation sequencing to characterize copy number alterations. To study the evolutionary dynamics and population structure of tumors, SNS was used in sequencing 100 single cells of two breast tumors, one with a matching liver metastasis. The data allowed, for the very first time, a comprehensive view of the evolutionary processes occurring in tumor cells. One tumor was shown to contain 3 distinct tumor subpopulations that likely originated from a common precursor and later diverged phylogenetically. In the second tumor, the one with the matching liver metastasis, the data indicated that the primary tumor mass was formed by a single clonal expansion of a highly aneuploid cell, which later migrated, seeded the metastasis, and underwent very limited further genomic evolution. Furthermore, in both tumors, a subpopulation was identified comprising abnormal cells lacking evidence of a clear common precursor. Although not described in detail here, it is clear that the SNS methodology is not limited to nuclei but can be applied to whole cells isolated by flow cytometry using fluorophore detection of surface markers and/or endogenously expressed fluorescent proteins. To that end, since our initial report, we have successfully applied single cell analysis to human circulating cells sorted by EpCAM fluorescence and to mouse cells expressing various fluorescent proteins (data unpublished K.R & J.H).

SNS offers a unique approach in characterizing cellular heterogeneity based on genome wide copy number variation. However, given that SNS relies upon the quantification of copy number variation using sparse sequencing data, heterogeneity that might arise due to other genomic aberrations such as single nucleotide variants (SNVs) and short insertions and deletions (INDELS) will be missed. Such is the case with certain hematological cancers such as Acute Myeloid Leukemia (AML), in which a subset of tumors is characterized by a cytogenetically normal genome. In that case, alternative approaches, such as deep exome sequencing could offer a view of the underlying heterogeneity. Nevertheless, given that the vast majority of epithelial tumors display markedly rearranged genomes, SNS offers a valuable tool in dissecting clonal populations in tumors. Here, we present a detailed explanation of the working protocol of SNS.

Overview of the procedure: Benchwork

The experimental protocol for SNS involves three discrete steps: flow sorting of single nuclei, whole genome amplification of the DNA, and library construction for sequencing on the Illumina platform. In SNS, flow sorted nuclei are deposited into wells in a 96 well plate format. Whole genome amplification is performed using the Sigma-Aldrich GenomePlex WGA4 kit. Single cell amplification is based upon a proprietary amplification method that randomly fragments the genome and uses a unique combination of primer extension pre-amplification and degenerate oligonucleotide primers/adaptors to generate DNA fragments, 200-1000 bps in length distributed across the genome and flanked by a universal adaptor sequence. The resulting library is then amplified using universal oligonucleotide primers with defined cycling parameters. While multiple whole genome amplification kits are currently available from different vendors, in our experience, the GenomePlex kit offers the most robust and consistent results.

After amplification, WGA amplified DNA is processed for sequencing library preparation just like normal genomic DNA. The WGA protocol attaches unique 30 nucleotide termini at the ends of the DNA molecules. As such, prior to library construction, DNA is sheared by sonication to allow removal of the adaptor sequences. Sonicated DNA is then processed using the standard Illumina library preparation protocol with end repair, 3′ A-overhang addition and adaptor ligation. Adaptor ligated libraries are purified using agarose gel electrophoresis, which is robust and generates high quality libraries. Alternatively, when processing many libraries, it is more suitable to purify sequencing libraries using the AMPure beads purification system offered by Agencourt, which is amenable to scaling. Following purification, sequencing libraries are enriched using PCR. Initially, we sequenced each single cell on a lane of an Illumina GAIIx instrument. Since then, with the increasing capacity of the HiSeq platform and newly developed in-house informatic tools, we have adopted multiplexing using DNA barcodes to sequence many single cells on a single lane (see Box 1). Pooling of individually barcoded samples is accomplished during the last steps of library preparation, after amplification and quantification using the Agilent Bioanalyzer instrument. After quantification using the Bioanalyzer, each sample is diluted to 10 nM and samples are pooled, yielding a final library at 10 nM concentration. Figure 1 shows the schematic for the experimental workflow.
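The dilution and pooling arithmetic described above can be sketched in Python. This is only an illustration of C1 × V1 = C2 × V2; the function names and the 20 μl final volume are assumptions, not values taken from the protocol.

```python
def dilution_volumes(stock_nM, target_nM=10.0, final_ul=20.0):
    """Return (library_ul, diluent_ul) that bring a library to target_nM.

    Uses C1 * V1 = C2 * V2. The 20 ul final volume is an arbitrary
    illustrative default, not a value from the protocol.
    """
    if stock_nM < target_nM:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nM * final_ul / stock_nM
    return library_ul, final_ul - library_ul


def pool_equal_volumes(concentrations_nM):
    """Concentration of a pool made from equal volumes of each sample."""
    return sum(concentrations_nM) / len(concentrations_nM)


# A library measured at 40 nM on the Bioanalyzer:
library_ul, diluent_ul = dilution_volumes(40.0)
print(library_ul, diluent_ul)

# Pooling equal volumes of samples that are each at 10 nM keeps the pool at 10 nM.
print(pool_equal_volumes([10.0, 10.0, 10.0]))
```

Because every sample is brought to the same molarity before pooling, equal volumes contribute equal numbers of molecules, which is what keeps barcode representation balanced.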

Box 1. Multiplexing of single cell libraries.

With the increase in throughput afforded by the HiSeq machine, multiplexing of single cells is warranted. Furthermore, using simulations to estimate the number of reads required to reproduce an accurate copy number profile, we have empirically determined that approximately 2 million uniquely mapped reads are sufficient to quantify the copy number profile using the Varbin algorithm with 50 thousand bins (data not shown). To multiplex samples, we use a collection of barcodes of 7 nucleotides in length designed by our laboratory (8 barcode sequences that we have tested and verified are provided as Supplementary Data). Barcode distributions are generally uniform, with barcode ratio values (expected/observed) consistently between 0.8 and 1.2. Alternatively, the TruSeq indexing system can be used. For more details regarding the TruSeq indexing system, please refer to the Illumina, Inc. website.
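As an illustration of the barcode ratio (expected/observed) mentioned above, the following Python sketch counts reads per barcode and compares each count with the count expected under a perfectly even split. The function name and the two 7-nucleotide barcodes are hypothetical placeholders, not the sequences provided in the Supplementary Data.

```python
from collections import Counter


def barcode_ratios(read_barcodes, known_barcodes):
    """Return {barcode: expected/observed} for the known barcodes.

    'expected' assumes reads are split evenly across all known barcodes.
    Reads whose barcode is not in known_barcodes are ignored.
    """
    known = set(known_barcodes)
    counts = Counter(b for b in read_barcodes if b in known)
    total = sum(counts.values())
    expected = total / len(known_barcodes)
    return {
        b: (expected / counts[b] if counts[b] else float("inf"))
        for b in known_barcodes
    }


# Hypothetical barcodes and reads, for illustration only.
barcodes = ["ACGTACG", "TGCATGC"]
reads = ["ACGTACG"] * 50 + ["TGCATGC"] * 50 + ["NNNNNNN"] * 3
ratios = barcode_ratios(reads, barcodes)
balanced = all(0.8 <= r <= 1.2 for r in ratios.values())
print(ratios, balanced)
```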

Figure 1.

Schematic of the experimental workflow of SNS. Step numbering corresponds to the Steps of the Procedure. The FACS Aria image is courtesy of © Becton, Dickinson and Company. Reprinted with permission. HiSeq2000 image is courtesy of Illumina, Inc.

Overview of the procedure: Informatic analysis

To obtain copy number profiles of single cells, sequencing data is processed using a variety of computational and algorithmic tools that include Bowtie, SAMtools, Python, and the SPlus/R software package. Sequence data is first mapped using the Bowtie algorithm with defined parameters. Once mapped, sequencing reads are processed through a series of tools using the SAMtools package to remove PCR duplicates and arrange the sequencing reads in proper format suitable for downstream analysis. Only uniquely mapped reads are used in determining copy number profiles.
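The read filtering described above (keep uniquely mapped reads, drop PCR duplicates) can be pictured with a minimal Python sketch. It is not the SAMtools implementation and does not parse SAM/BAM files; the record layout (read name, chromosome, position, strand, number of reported alignments) and the function name are assumptions made for illustration.

```python
def unique_nonduplicate_reads(alignments):
    """Keep uniquely mapped reads and collapse PCR duplicates.

    alignments: iterable of (read_id, chrom, pos, strand, n_hits) tuples,
    a simplified, hypothetical record layout.
    A read is treated as a duplicate when an earlier kept read has the
    same chromosome, start position and strand.
    """
    seen = set()
    kept = []
    for read_id, chrom, pos, strand, n_hits in alignments:
        if n_hits != 1:  # multimapper or unmapped: discard
            continue
        key = (chrom, pos, strand)
        if key in seen:  # same start and strand: likely PCR duplicate
            continue
        seen.add(key)
        kept.append((read_id, chrom, pos, strand))
    return kept


example = [
    ("r1", "chr1", 100, "+", 1),
    ("r2", "chr1", 100, "+", 1),  # duplicate of r1
    ("r3", "chr1", 100, "-", 1),  # other strand, kept
    ("r4", "chr2", 500, "+", 3),  # multimapper, dropped
]
print(unique_nonduplicate_reads(example))
```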

In determining the copy number profiles, uniquely mapped reads are processed using an in-house developed Python algorithm (Varbin) that counts sequence read density in genomic intervals (bins). The Varbin algorithm, provided in the Supplementary Methods, differs from previous sequence based copy number detection algorithms that divide the genome into fixed bins 17,18: Varbin divides the genome into bins of variable length, adjusted such that the number of potential uniquely mapping reads in each bin is normalized across the genome. To determine the bin sizes used in the published work 24, we simulated 200 million sequences from the human genome (HG18/NCBI36), 48 nucleotides in length, while introducing single nucleotide errors at a frequency similar to that encountered during Illumina sequencing. These simulated sequences were mapped back to the human genome (HG18) with defined parameters. Chromosomal bins were assigned based on the proportion of mapped simulated sequence reads, with each bin containing an equal number. This resulted in approximately 50,000 distinct, non-overlapping bins. We have since revised the method using the hg19 reference genome and a non-random algorithm to specify bin boundaries, with concordant results. Bin boundaries determined from hg19 simulations are provided here. Furthermore, due to inherent mapping errors, some bins accumulate high read counts and appear as high focal amplifications. These bins are discussed in more detail in the informatics section of the Procedure.
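The idea of variable-length bins can be illustrated with a short Python sketch. This is not the Varbin program from the Supplementary Methods; it is a simplified, single-chromosome illustration in which the function names and the toy coordinates are assumptions.

```python
from bisect import bisect_right


def variable_bin_starts(mappable_positions, n_bins):
    """Choose bin start coordinates so each bin holds about the same
    number of uniquely mappable positions.

    mappable_positions: sorted start coordinates (one chromosome) at which
    a simulated read maps uniquely.
    """
    n = len(mappable_positions)
    return [mappable_positions[(i * n) // n_bins] for i in range(n_bins)]


def count_reads_in_bins(read_positions, bin_starts):
    """Count reads per bin; a bin runs from its start to the next start."""
    counts = [0] * len(bin_starts)
    for pos in read_positions:
        idx = bisect_right(bin_starts, pos) - 1
        if idx >= 0:  # reads before the first bin start are ignored
            counts[idx] += 1
    return counts


# Toy data: a densely mappable region followed by a sparsely mappable one.
mappable = [0, 1, 2, 3, 100, 200, 300, 400]
starts = variable_bin_starts(mappable, n_bins=2)
print(starts)
print(count_reads_in_bins([1, 2, 150, 350, 500], starts))
```

In the toy data the second bin is physically wider than the first, yet both contain four mappable positions, which is the property that makes read counts comparable across bins.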

Critically, because we simulated single end reads from hg19 and mapped the sequencing reads using the Bowtie algorithm to define the variable bin boundaries, only Illumina single end sequencing data that are mapped with Bowtie are useful for determining the copy number profile with the boundaries that are supplemented in this protocol. If BWA is preferred to Bowtie, then the simulations will have to be repeated to define a new set of bin boundaries. The same applies to paired-end sequencing data or sequence data obtained using a different platform (ABI SOLiD for example).

The output file of the Varbin algorithm contains sequence counts in the assigned genomic bins. This data is processed to yield integer copy number values via a variety of algorithmic tools such as the Kolmogorov-Smirnov (KS) or circular binary (CBS) segmentation, and Gaussian kernel smoothed density plots, usually done with software package SPlus or its non-proprietary version, R.
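To make the step from segmented ratios to integer copy number concrete, the sketch below uses a simple least-squares grid search for the scaling factor. This is a swapped-in heuristic for illustration only, not the Gaussian kernel smoothed density approach named above, and the function names, search range and step size are assumptions.

```python
def best_multiplier(seg_ratios, lo=1.5, hi=6.0, step=0.05):
    """Grid-search the multiplier that moves the segment ratios closest
    to integers (sum of squared distances to the nearest integer).

    Ties go to the smallest multiplier, because integer multiples of a
    good solution fit equally well.
    """
    best_m, best_err = None, float("inf")
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        m = lo + i * step
        err = sum((m * r - round(m * r)) ** 2 for r in seg_ratios)
        if err < best_err:
            best_m, best_err = m, err
    return best_m


def integer_copy_number(seg_ratios, multiplier):
    """Scale segment ratios and round to the nearest integer state."""
    return [int(round(multiplier * r)) for r in seg_ratios]


# Segment means relative to the genome-wide average (toy values).
ratios = [1.0, 1.0, 0.5, 1.5]
m = best_multiplier(ratios, lo=1.5, hi=3.0)
print(m, integer_copy_number(ratios, m))
```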

Finally, on occasions, we observe single cell copy number profiles that contain unusually large homozygous chromosomal deletions or what appears to be “shredding of chromosomes”. The nature of these profiles, i.e. biological or technical artifacts, is currently unknown and under investigation. A discussion of these profiles, which we term “Genome Sector Loss” is offered in the sections that follow.

In the Procedure, the informatics section is illustrated by taking the reader through an analysis example. The programs to use and example output files are provided in the Supplementary Methods and Supplementary Data, respectively. Supplementary Note provides a brief overview while Supplementary Figure 1 provides a concise outline of all the steps with input-program-output labels.

Experimental Design

Sample preparation and flow cytometry

In our initial report, we described sequencing of single cells isolated from frozen tissue as well as single cells isolated from cell lines grown in tissue culture 24. While many techniques are available for the isolation of single cells (such as micromanipulation), we have empirically determined that flow cytometry offers a sensitive and reproducible approach. For sample preparation from frozen tissue, it is important to keep the tissue on dry ice in order to maintain the tissue’s integrity for subsequent analysis. Generally, we remove a small sample of the tissue (1 mm × 1 mm) using no. 11 scalpels and transfer the piece to a Petri dish while maintaining the original tissue on dry ice. For first time users, we generally recommend starting with cell line material to test the protocol. It is also recommended that, prior to setting the gates for flow cytometry, a control sample (we use a diploid lymphoblastic cell line) is run and at least 5,000 events of the examined sample (we usually collect 10,000) are recorded in the DAPI channel to provide a clear picture of where the gates should be set (Figure 2).

Figure 2.

Flow sorting of single nuclei based on DNA content. (a) Dot plot view of DAPI stained nuclei. The gate, drawn on the diagonal, excludes cellular debris and doublets and captures single nuclei. The black dots represent cellular debris and doublets, while the green and red dots represent diploid and non-diploid fractions from the single nuclei gate, respectively. The dot plots are drawn with DAPI-H and DAPI-W to allow for enhanced precision in distinguishing subpopulations. (b,c) Examples of histograms drawn from single nuclei gates illustrating a diploid profile (b) and an aneuploid profile (c).

Flow cytometric determination of nuclear DNA content through DAPI staining also provides a means for identifying and isolating tumor subpopulations based on ploidy. It is important to note that when handling different tumor samples, ploidy profiles (specifically aneuploid peaks) can differ depending on the tumor sample. As such, when sorting different samples, care must be taken in setting up appropriate gates. In addition, given that certain tumors have a high proliferative index, overlap between the G2/S phase flow cytometry peak of diploid cells with that of the aneuploid peak might occur. Nonetheless, given the capacity of SNS to resolve genome profiles at the single cell level, these cells (G2/S phase of diploids) are easily identified once the informatics analysis of sequencing data is performed. Given the high precision required for single cell analysis, we also describe here important steps to consider when performing single cell flow cytometry, such as droplet delay and break-off (Box 2).

Box 2. Critical steps to take into consideration for FACS setup for sorting single cells.

There are three critical elements to FACS setup for single cell sorting: The sample lines, flow cell, and nozzle must be clean, droplet break-off must be stable, and the automatic cell deposition unit must be perfectly positioned. To achieve this, the following steps should be followed:

  1. Perform “flow cell clean” using FACSRinse and allow it to soak for 10 minutes, then “flow cell clean” with ddH2O and soak for an additional 10 minutes. Run FACSClean through the sample line for 10 minutes at the highest flow rate setting and follow with ddH2O for another 10 minutes. Insert the nozzle, turn on the stream and allow it to stabilize for 30 minutes. Immediately before sorting, run ddH2O and record events for 5 minutes to verify that there are zero events.

  2. Turn on Sort Test. Open cover and sort block door. Visually inspect the side streams. The streams should be tight and steady. Because only the far left stream is used for sorting into plates, do not forget to check the left stream with the other turned off. Position the far left stream to the center opening of the splash shield. Close sort block door and cover.

  3. Determine break-off point and drop delay using Accudrop. Drop formation and break-off must be stable. Sort precision should be set to “single cell”, which selects high purity over yield.

  4. Check the ACDU alignment. The ACDU is not designed to hold a PCR size plate. A 96-well Falcon tissue culture plate is used as a holder for the PCR plate. Apply an adhesive plate seal to the surface of a 96-well PCR plate and smooth out the bubbles and wrinkles. Make sure the PCR plate is relatively flat and does not bow in the middle. Insert the PCR plate firmly into the Falcon plate and place into ACDU holder. Because the diameters of the wells of the PCR plate are much smaller than the TC plate and the volume of lysis buffer is small, the sorted drops must be centered precisely in the middle of the wells. Set up sort layout to sort 100 Accudrop beads per well onto the surface of the film. Deposit beads, remove the plate from the ACDU and visually examine the position of the drops. Continue to adjust ACDU until all wells are positioned correctly.

Whole genome amplification (WGA) of single cells

As controls for the whole genome amplification of single cells, the flow cytometry settings can be adjusted so as to leave certain wells empty (i.e. no deposition of a single cell). The products of these wells, when quantified for DNA, do not yield any measurable quantity of WGA DNA and, when run on an agarose gel or Agilent’s Bioanalyzer, do not display the WGA product smear indicative of a successful WGA reaction.

Sequence library construction

Samples selected for sequence library construction are analyzed by gel electrophoresis or using the Bioanalyzer instrument to observe the WGA product spread between 100 and 1000 bps (Figure 3). In selecting samples for sequence library construction, we also take into account the DNA concentrations of the amplification products. Generally, diploid cells yield approximately 30 μl of material at approximately 200 ng/μl (range from 175-275 ng/μl). We avoid using WGA products of diploid sorted cells that have concentrations exceeding 300 ng/μl or below 175 ng/μl. Similarly, for aneuploid fractions, we observe DNA concentrations ranging from 250 to 400 ng/μl and proceed with only the samples displaying this concentration range. Products outside these ranges are excluded because of concerns regarding non-uniform or incomplete amplification, or over-amplification, of the single cell genome. We generally use 2 μg of WGA DNA to start the library construction process; however, we have routinely been able to generate good libraries from as little as 0.5 μg of DNA using the method described in this protocol. The protocol and “timing” given here are intended for the construction of a single Illumina single cell WGA library. Processing of multiple samples, for example for multiplexing purposes, is likely to increase the “timing”.
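The selection criteria above can be written down as a small Python check. The concentration limits are the ones stated in the text; the function and variable names are hypothetical, and treating the limits as inclusive is an assumption.

```python
# Accepted WGA product concentrations in ng/ul, taken from the text above.
ACCEPTED_NG_PER_UL = {
    "diploid": (175.0, 300.0),
    "aneuploid": (250.0, 400.0),
}


def passes_concentration_check(fraction, conc_ng_per_ul):
    """True if the WGA product concentration is inside the accepted range
    for the sorted fraction ('diploid' or 'aneuploid'), limits inclusive."""
    low, high = ACCEPTED_NG_PER_UL[fraction]
    return low <= conc_ng_per_ul <= high


def input_volume_ul(conc_ng_per_ul, mass_ug=2.0):
    """Volume of WGA product needed to reach the starting DNA mass."""
    return mass_ug * 1000.0 / conc_ng_per_ul


print(passes_concentration_check("diploid", 200.0))   # inside 175-300
print(passes_concentration_check("diploid", 320.0))   # above 300
print(input_volume_ul(200.0))                         # volume for 2 ug
```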

Figure 3.

WGA amplification profiles of single cell DNA from 4 different single cells. (a) WGA DNA spreads (100-1000bps) of single cell genomes as measured on the Bioanalyzer. S1-S2-S3-S4 refers to 4 different single cell amplified products. (b) An example histogram of DNA spread from cell S1 as measured by the Bioanalyzer.

Previously, we reported sonication of WGA DNA using the ultrasonic disruptor, the Bioruptor. However, we have since switched to the focused acoustics system of Covaris, since it allows for higher throughput. Selection of the sonication program depends on the desired insert length of the libraries. Figure 4 illustrates the sonication profiles using multiple programs on the Covaris E210.

Figure 4.

Sonication profiles of WGA DNA using different sonication programs on the Covaris E210 instrument. (a) Sonication profiles as measured on the Bioanalyzer. S1: non-sonicated WGA DNA. S1 400, S1 300 and S1 200: WGA DNA sonicated using the 400+/−, 300+/− and 200+/− Covaris E210 programs, respectively. Profiles represent size distributions of DNA molecules (in base pairs) of the samples. The choice of which sonication program to use depends on the desired sequencing library length and the type of sequencing that will be implemented. We generally use the 300+/− program when sequencing 76 base pair reads on the Illumina platform. (b) Histograms illustrating sonication profiles as measured by the Bioanalyzer.

Informatic analysis

The central tenet of copy number analysis is based upon the idea that sequenced molecules are a random sample of the genome, and that by computing local read density relative to the average read density, it is possible to infer copy number. The method described here splits the genome into non-overlapping regions (bins) expected to have the same average number of reads based on a reference genome. This is accomplished by taking 50 base pair sequences starting at each position in the reference genome, mapping them back to the reference, eliminating reads that map to multiple places in the genome (multimappers), and then setting bin boundaries such that each bin contains roughly the same number of uniquely mappable positions. In practice, the protocol does not yield a perfectly uniform distribution of reads across bins. The main source of non-uniformity is variation in GC content across the genome. To adjust for this, bin counts are normalized based on GC content.
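A minimal sketch of GC-based normalization follows. It divides each bin's read-density ratio by the median ratio of bins with similar GC content. This stratified-median approach is an assumption chosen for simplicity and may differ from the normalization implemented in the Supplementary Methods; all names are hypothetical.

```python
from statistics import median


def gc_normalize(bin_counts, gc_fractions, n_strata=10):
    """Normalize bin counts for GC content.

    1. Convert counts to ratios relative to the mean count per bin.
    2. Group bins into n_strata equal-width GC strata.
    3. Divide each ratio by the median ratio of its stratum.
    """
    mean_count = sum(bin_counts) / len(bin_counts)
    ratios = [c / mean_count for c in bin_counts]

    def stratum(gc):
        return min(int(gc * n_strata), n_strata - 1)

    by_stratum = {}
    for r, gc in zip(ratios, gc_fractions):
        by_stratum.setdefault(stratum(gc), []).append(r)
    stratum_median = {k: median(v) for k, v in by_stratum.items()}

    normalized = []
    for r, gc in zip(ratios, gc_fractions):
        m = stratum_median[stratum(gc)]
        normalized.append(r / m if m > 0 else 0.0)
    return normalized


# Toy example: GC-rich bins collect twice as many reads for technical
# reasons only; after normalization every bin sits at a ratio of 1.
counts = [100, 100, 100, 200, 200, 200]
gc = [0.35, 0.35, 0.35, 0.55, 0.55, 0.55]
print(gc_normalize(counts, gc))
```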

MATERIALS

REAGENTS

. Cells of interest: Human tumor tissue, cell cultures grown in a cell culture dish of any kind, mouse tissue. CRITICAL All experiments that use human tissue and animals should comply with institutional and national guidelines

. Cell culture medium appropriate for cell type

. Trypsin (Invitrogen, cat. no: 25200-056)

. 1X PBS (Gibco, cat. no: 14249-95)

. DAPI (Invitrogen, cat. no: D1306)

. Whole Genome Amplification Kit-WGA4 (Sigma-Aldrich, cat. no: WGA4-50RXN) CRITICAL Sigma-Aldrich WGA4 kits yield relatively uniform distributions of amplification products across the genome, which facilitates copy number analysis using the SNS method. It is imperative to use this kit to obtain reliable results

. QIAquick 96 well purification kit (Qiagen, cat. no: 28181)

. Ethanol 100% 200 proof (Ultra-Pure, cat.no: 200-CSPTP) ! CAUTION Flammable – Keep away from open flame

. NP-40 (USB, cat. no: 19628) ! CAUTION Contains materials which may cause respiratory tract, eye, and skin irritation. May be harmful if swallowed. Handle with appropriate care

. MgCl2 (VWR, cat. no: JT24440-1)

. NaCl (Fisher, cat. no: S271-10)

. 10mM Tris Base pH7.8 (Fisher, cat. no: BP152-5)

. CaCl2 (VWR, cat. no: JT1332-1)

. BSA (Sigma-Aldrich, cat. no: A7906-50G)

. QIAquick PCR purification kit (Qiagen, cat. no: 28106) ! CAUTION Buffer PB contains irritant chaotropic salts. Take appropriate care when handling

. MinElute PCR purification kit (Qiagen, cat. no: 28006) ! CAUTION Buffer PB contains irritant chaotropic salts. Take appropriate care when handling

. QIAquick gel extraction kit (Qiagen, cat. no: 28704) ! CAUTION Buffer QG contains irritant chaotropic salts. Take appropriate care when handling

. Agarose (Lonza, cat. no: 50004)

. GeneRuler 50bp DNA Ladder (Fermentas Life Sciences, cat. no: SM0373)

. FACSRinse (BD Biosciences cat. no. 340346)

. FACSClean (BD Biosciences cat. no. 340345)

. Accudrop Beads (BD Biosciences cat. no. 345249)

. Single cell Lysis Buffer (see REAGENT SETUP)

. Sucrose (USB, cat. no: 57-50-1) (see REAGENT SETUP)

. Elution buffer EB (supplied with Qiagen PCR/Gel purification kits)

. T4 DNA polymerase (NEB, cat. no: M0203L)

. dNTPs 10mM each (supplied as 100mM, see REAGENT SETUP) (Roche, cat. no: 1 969 064)

. T4 DNA ligase buffer with 10 mM ATP (NEB, cat. no: B0202S)

. Klenow DNA polymerase (NEB, cat. no: M0210L)

. T4 PNK (NEB, cat. no: M0201L)

. NEB Buffer 2 (supplied with Klenow DNA polymerase)

. Agencourt AMPure 50 ml (Beckman Coulter, cat. no: A63880)

. dATP 1mM (supplied as 100mM, see REAGENT SETUP)

. Klenow fragment (3′-5′ exo-) (NEB, cat. no: M0212L)

. Quick Ligation kit (NEB, cat. no: M2200L)

. Sequencing Oligo Adaptors (IDT)

. Ethidium Bromide 10 mg/μl (Sigma, cat. no: 057K8609) ! CAUTION Mutagen and potential carcinogen

. TAE 50X (Invitrogen, cat. no: 24710)

. Phusion HF PCR master mix (NEB, cat. no: M0531L)

. Sequencing oligo primers (IDT)

. Agilent DNA high sensitivity kit (Agilent, Cat. No: 5067-4626)

EQUIPMENT

. Scalpels #11 blade (VWR, cat. no: 89176-382)

. Tissue culture plates

. Cell culture medium appropriate for cell line of interest

. Polystyrene round-bottom tube 5ml (Falcon, cat. no: 352058)

. Polystyrene round-bottom tube with cell-strainer cap, 5 ml (Falcon cat. no: 352235)

. Flow Cytometer FACSAriaII (BD Biosciences)

. 96 well PCR tubes (Thermo Scientific, cat. no: AB-0731)

. 0.2 ml 8-well PCR strip tubes (AB Applied Biosystems, cat. no: N8010580)

. Agarose Gel Electrophoresis Unit (Thermo Scientific)

. Heating block (50 °C)

. Thermocycler (MJ Research, cat no: PTC-225)

. Vacuum manifold (Qiagen, cat. no: 19504)

. 96 well elution plates (Thermo Scientific, cat. no: AB-0796)

. Adhesive tape (Marsh Bioproducts, cat. no: AB-0626)

. 1.5 ml centrifuge tubes (Eppendorf)

. Centrifuge (VWR, cat. no: 80076-424)

. Minicentrifuge (VWR, cat. no: 80094-172)

. UV transilluminator ! CAUTION UV radiation is harmful for the unprotected eye and skin

. Sonicator Covaris E210 (Covaris, cat. no: 500008)

. Sonication tubes (Covaris, cat. no: S20045)

. DynaMag-2 magnet (Invitrogen, cat. no: 123-21D)

. DynaMag-96 Side (Invitrogen, cat. no: 123-31D)

. Nanodrop ND-1000 Spectrophotometer (Thermo Scientific, cat. no. ND-1000)

. Agilent 2100 Bioanalyzer (Agilent Technologies, cat. no. G2938C)

. Illumina Genome Analyzer and associated Equipment

. Bowtie software package. http://bowtie-bio.sourceforge.net/index.shtml (ref. 25)

. Samtools software package. http://samtools.sourceforge.net/ (ref. 26)

. Python software package. http://www.python.org/

. R software package http://www.r-project.org/

. CBS segmentor http://www.bioconductor.org/packages/2.8/bioc/html/DNAcopy.html. This is an R package used to segment the bincount data into non-overlapping regions of differing copy number 27

REAGENT SETUP

NST buffer Mix the following components in ddH2O for a final volume of 800 mL: 146 mM NaCl, 10 mM Tris Base pH 7.8, 1 mM CaCl2, 21 mM MgCl2, 0.05% BSA, and 0.2% NP-40. NST buffer can be prepared and stored at 4 °C for up to 5 months.

NST-DAPI buffer To the 800 mL of NST buffer, add 200 mL of MgCl2 at a concentration of 106mM. Afterwards, dissolve 10mg of DAPI and store at 4 °C protected from light. Solution is stable for up to 5 months.
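As a worked check of the final concentrations in NST-DAPI buffer, the short Python sketch below applies the usual mixing formula (sum of volume × concentration divided by total volume). It assumes that volumes are additive and that dissolving the DAPI does not change the volume; the function name is hypothetical.

```python
def mixed_concentration(parts):
    """Concentration after mixing.

    parts: list of (volume_ml, concentration) pairs in the same units.
    Assumes volumes are additive.
    """
    total_volume = sum(volume for volume, _ in parts)
    return sum(volume * conc for volume, conc in parts) / total_volume


# 800 mL NST buffer (21 mM MgCl2) plus 200 mL of 106 mM MgCl2.
final_mgcl2_mM = mixed_concentration([(800.0, 21.0), (200.0, 106.0)])

# 10 mg DAPI dissolved in the resulting 1000 mL, expressed in ug/mL.
final_dapi_ug_per_ml = 10.0 * 1000.0 / 1000.0

print(final_mgcl2_mM)        # 38.0
print(final_dapi_ug_per_ml)  # 10.0
```

The other NST components are diluted by a factor of 0.8 in the same step, since the 200 mL addition contains only MgCl2.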

1mM dATP Make 1mM dilutions of the original 100mM stock in EB buffer and store at −20 °C for up to six months.

10 mM dNTP Mix each dNTP for a final concentration of 10mM each in EB buffer and store at −20 °C for up to six months.

Sucrose loading dye Prepare a 40% sucrose solution by adding 40 grams of sucrose to 100 mL of H2O. Solution can be stored at room temperature for up to 3 months.

Single Cell Lysis Buffer First prepare Mixture #1 by mixing 6 μL Proteinase K with 96 μL 10X Single Cell Lysis and Fragmentation Buffer. Then prepare Single Cell Lysis Buffer by mixing 800 μL H2O with 100 μL of Mixture #1.

EQUIPMENT SETUP

Covaris E210 Sonicator Set the sonicator parameters as follows to obtain DNA size distributions centered around 300 bp (200-400 bp range): duty cycle, 10%; intensity, 4; cycles/burst, 200; time, 80 s. Make sure that the water bath temperature is at 4 °C. CRITICAL It is imperative that the Covaris water bath temperature is at 4 °C to ensure proper sonication and reproducible results

FACS AriaII The FACSAriaII is configured with a high powered air launched 350nm UV laser and 450/50 bandpass filter. It is equipped with an ACDU (automated cell deposition unit). A 70 micron integrated nozzle is used and the fluidics pressure is set to 70 psi.

For DNA content analysis, set parameters for DAPI area, width and height. Set the threshold parameter to DAPI with a value of 5,000. We adjust PMT voltages using a normal diploid human lymphoblastoid cell line stained with DAPI as the control.

PROCEDURE

Sample preparation and flow cytometry *TIMING 4 h

DAPI staining of unfixed nuclei for FACS
  • 1)

    To perform DAPI staining of nuclei from tissue follow Option A. To perform DAPI staining nuclei from cell cultures follow Option B.

  • A.

    DAPI staining of nuclei from tissue

  • I.

    Place piece of frozen tissue in a 60mm TC plate. Add 0.2-1.0ml of NST-DAPI buffer, depending on the size of tissue. For fine needle aspirates or core biopsies use 0.2-0.5 ml buffer. For larger pieces of tissue use 1-2mm3 of tissue in 1ml buffer.

  • II.

    Using two fine-point disposable scalpels, cut and tease apart the tissue in the buffer until the pieces are very fine. Gently mix with a 1 ml pipette tip.

  • III.

    Transfer the sample to a 5 ml Falcon round-bottom tube, leaving behind as much of the solids as possible. Hold on wet ice and protect from light for at least 10 minutes and no longer than 3 hours.

  • IV.

    Do not vortex nuclei. Vortexing will result in substantial damage to the nuclei.

  • V.

    Prior to running on the flow cytometer, filter the sample through a 5 ml Falcon round-bottom tube with cell-strainer cap.

  • B.

    DAPI staining of nuclei from cultured cells

  • I.

    Harvest cells either by trypsinization of monolayer cultures (resuspend in complete medium) or collection of suspension cultures in medium.

  • II.

    Using a hemacytometer count the number of cells.

  • III.

    Transfer 0.5-1.0 × 10⁶ cells to a 15 ml conical centrifuge tube.

  • IV.

    Gently centrifuge at 105 xg for 4 minutes.

  • V.

    Aspirate the medium, being careful not to disturb the cell pellet.

  • VI.

    With index finger flick the tube until pellet seems to be dispersed and not solid.

  • VII.

    Add 1 ml NST-DAPI Buffer per 0.5-1.0 × 10⁶ cells.

  • VIII.

    Transfer to a 5 ml Falcon round-bottom tube (polystyrene) and hold on wet ice and protect from light for at least 10 minutes and no longer than 3 hours.

  • IX.

    Do not vortex nuclei. Vortexing will result in substantial damage to the nuclei.

  • X.

    Prior to running on the flow cytometer, filter the sample through a 5 ml Falcon round-bottom tube with cell-strainer cap.

  • 2)

    Run the sample on FACSAria II cell sorter (or any comparable cell sorter). For assistance in running the FACSAria II cell sorter, refer to the “Users Guide” manual provided by BD Biosciences, Part No. 640760 Rev.A, usually supplied with the instrument.

  • 3)

    Create a dot plot that plots DAPI area on the y and DAPI pulse height on the x axis. (Figure 2a) For assistance in setting dot plots and gates for DNA content analysis, refer to Westro et al. 2001 28.

  • 4)

    Set gate (#1) on population of single nuclei (which appears on the diagonal) and exclude doublets or debris.

  • 5)

    Create a histogram derived from gate #1 (single nuclei), that plots count on y axis and DAPI area (DNA content) on the x axis on a linear scale.

  • 6)

    Record data on 10,000 counts of single nuclei.

  • 7)

    Set gates on populations of interest. (Figure 2b)

  • 8)

    Sort fraction(s) into 96-well PCR plate prepared with 9 μL Single Cell Lysis Buffer and kept on ice.

(Plates with lysis buffer are prepared by aliquoting 9 μL into each well)

WGA amplification *TIMING 6 h

  • 9)

    Incubate plates for 1 hour at 50 °C followed by 4 minutes at 99 °C using a thermocycler.

  • 10)

    Quick spin and cool on ice * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing, however, we generally carry the WGA reaction all the way to Step 17, before the 96 well plate purification

  • 11)

    Prepare Mixture #2 (3 μL per sample) by mixing the following components

    Component | Each Sample | 96 Well Plate (100 samples)
    1X Single Cell Library Preparation Buffer | 2 μL | 200 μL
    Library Stabilization | 1 μL | 100 μL
    Total | 3 μL | 300 μL
  • 12)

    Add 3 μL of Mixture #2 to each sample, quick spin, and incubate in thermocycler at 95 °C for 2 minutes

  • 13)

    Quick spin and replace on ice

  • 14)

    Add 1 μL of Library Preparation Enzyme, quick spin, and incubate in a thermal cycler as follows:

    16 °C for 20 minutes

    24 °C for 20 minutes

    37 °C for 20 minutes

    75 °C for 5 minutes

    4 °C hold

  • 15)

    Quick spin and incubate on ice.

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing, however, we generally carry the WGA reaction all the way to step 17, before the 96 well plate purification

  • 16)

    Prepare Mixture #3 by mixing the following components

    Component | Each Sample | 96 Well Plate (100 samples)
    10X Amplification Master Mix | 7.5 μL | 750 μL
    H2O | 48.5 μL | 4.85 mL
    WGA DNA Pol | 5 μL | 500 μL
    Total | 61 μL | 6.1 mL
  • 17)

    Add 60 μL of Mixture #3, mix well, quick spin, and incubate in thermal cycler as follows:

    Initial denaturation 95 °C for 3 minutes
    Denature 94 °C for 30 seconds
    Anneal/Extend 65 °C for 5 minutes
    Repeat for a total of 25 cycles
    Hold 4 °C

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

  • 18)

    Quick spin and proceed to QIAquick 96 well plate purification
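The Mixture #2 and Mixture #3 tables scale the per-sample volumes to 100 reactions for a 96 well plate. For other sample counts the same arithmetic can be sketched as below. The function name and the 4% overage default are illustrative assumptions, chosen only so that 96 samples reproduce the 100-reaction column of the tables.

```python
import math

# Per-sample volumes in microliters, copied from the Mixture #2 and #3 tables.
MIXTURE_2 = {"1X Single Cell Library Preparation Buffer": 2.0,
             "Library Stabilization": 1.0}
MIXTURE_3 = {"10X Amplification Master Mix": 7.5,
             "H2O": 48.5,
             "WGA DNA Pol": 5.0}

def scale_master_mix(per_sample_ul, n_samples, overage=0.04):
    """Return total volumes (uL) for n_samples plus a pipetting overage.

    The overage value is an assumption, not a figure stated in the protocol.
    """
    # Round the number of reactions up so that there is never a shortfall.
    n_reactions = math.ceil(n_samples * (1.0 + overage))
    volumes = {name: round(vol * n_reactions, 2)
               for name, vol in per_sample_ul.items()}
    volumes["Total"] = round(sum(per_sample_ul.values()) * n_reactions, 2)
    return volumes
```

For example, `scale_master_mix(MIXTURE_3, 96)` gives the same totals as the 96 well plate column of the Mixture #3 table.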

QIAquick 96 well plate PCR Purification *TIMING 1 h

  • 19)

    Place a QIAquick 96 well plate into a vacuum manifold with vacuum turned off

  • 20)

    Aliquot 300 μL of Buffer PB into all wells.

  • 21)

    Transfer PCR amplification mixture (from Step 18) into wells and mix well by pipetting several times

  • 22)

    Turn on vacuum, and allow PCR mixture to flow through until membranes are dry

  • 23)

    Wash 2X with 900 μL buffer PE per well

  • 24)

    Vacuum until dry

  • 25)

    Remove the QIAquick 96 well plate from the vacuum manifold, shake to remove excess fluid and mount on a waste collection plate

  • 26)

    Centrifuge at 1470 xg for 5 minutes

  • 27)

    Mount the QIAquick 96 well plate onto a fresh collection plate

  • 28)

    Add 50 μL of buffer EB per well and incubate for 1 minute

  • 29)

    Centrifuge at 1470 xg for 5 minutes to elute DNA

  • 30)

    After elution (generally yielding approximately 30 μL of DNA), use Nanodrop to determine WGA DNA concentrations. Samples can be run on agarose gel or Bioanalyzer to determine amplification profile (for further details, refer to Agilent Bioanalyzer manual)

    ? TROUBLESHOOTING

    * CRITICAL STEP Generally, we achieve 90% success rate in amplifying single cell genomes from 96 well plates. The nanodrop readings and bioanalyzer profiles of successfully amplified single cell DNA should exhibit readings and profiles similar to those mentioned in the Experimental Design section. Only WGA DNA products exhibiting the aforementioned parameters are selected for library construction

* PAUSE POINT Samples can be stored at − 20 °C until further processing

Sonication *TIMING 30 min

  • 31)

    Prepare 2 μg of WGA DNA in a total volume of 75 μL (bring up to volume with EB)

  • 32)

    Transfer mixtures to Covaris Micro-tubes

  • 33)

    Sonicate DNA as follows: Duty cycle – 10%, Intensity – 4, cycles/burst – 200, and time 80 s.

    * CRITICAL STEP Make sure that the water bath temperature is at 4 °C.

  • 34)

    Transfer Covaris Micro-tubes to tube holders

  • 35)

    Quick spin to collect material and transfer to fresh PCR tubes to proceed with library preparation

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing
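The volumes needed in Step 31 follow directly from the Nanodrop concentration measured in Step 30. A minimal sketch of that calculation is shown here; the function name is hypothetical and the defaults simply restate the 2 μg and 75 μL values from Step 31.

```python
def sonication_input(conc_ng_per_ul, mass_ug=2.0, final_volume_ul=75.0):
    """Return (WGA DNA volume, EB volume) in microliters for one sonication sample."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    dna_ul = mass_ug * 1000.0 / conc_ng_per_ul  # ug converted to ng, divided by ng/uL
    if dna_ul > final_volume_ul:
        raise ValueError("sample too dilute to fit the requested mass in the final volume")
    return round(dna_ul, 1), round(final_volume_ul - dna_ul, 1)
```

A sample measured at 100 ng/μL would therefore need 20 μL of DNA and 55 μL of EB.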

End repair of sonicated WGA DNA to generate blunt ends *TIMING 45 min

  • 36)

    Prepare the following master mix in a 1.5-ml centrifuge tube for each sample. Mix carefully by pipetting up and down

    Reagent | Volume
    T4 DNA ligase buffer with 10 mM ATP | 10 μL
    T4 DNA Polymerase | 5 μL
    T4 Polynucleotide Kinase (PNK) | 5 μL
    dNTP mix (10 mM each) | 4 μL
    Klenow DNA Polymerase | 1 μL
  • 37)

    Transfer 25 μL of End repair mix from Step 36 to each sample from Step 35 and mix well by pipetting. Incubate in a thermal cycler for 30 minutes at 20 °C

  • 38)

    Purify each sample using the QIAquick PCR purification kit following the manufacturer’s protocol

  • 39)

    Elute each sample in 30 μL of EB

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

3′ A-overhang addition to blunted DNA ends *TIMING 45 min

  • 40)

    Prepare the following master mix in a 1.5 mL centrifuge tube. Mix well by pipetting.

    Reagent | Volume
    10X Klenow buffer (NEB buffer #2) | 5 μL
    1 mM dATP | 10 μL
    Klenow fragment (3′-5′ exo-) | 10 μL
  • 41)

    Transfer 25 μL of master mix from Step 40 to each sample from Step 39, mix well by pipetting, and incubate at 37 °C for 30 minutes

  • 42)

    Purify each sample using the MinElute PCR Purification kit following the manufacturer’s instructions.

  • 43)

    Elute each sample in 17 μL of EB buffer

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

Illumina adaptor Ligation to DNA fragments *TIMING 25 min

  • 44)

    Prepare the following master mix in a 1.5 mL-centrifuge tube. Mix well by pipetting

    Reagent | Volume
    Quick ligation buffer | 25 μL
    Quick Ligase | 2 μL
  • 45)

    Add 6 μL of 10 μM PE adaptor mix (10 μM each of PE5/7) to each 17 μL DNA sample from Step 43 *CRITICAL STEP 6 μL of 10 μM PE adaptor mix is the amount that has been determined to work effectively with 500 ng to 2 μg of input DNA. If less DNA is used, the quantity of adaptor mix has to be adjusted. However, given that WGA from single cells yields DNA on the order of 4-5 μg, there should be plenty of DNA from which to construct sequencing libraries.

CRITICAL STEP Samples can be barcoded and resulting libraries can be multiplexed and run together on an Illumina lane, refer to Box 1 for multiplexing

  • 46)

    Add 27 μL of ligation master mix from Step 44, mix well, and incubate at 20 °C for 15 minutes.

  • 47)

    Purify each sample using the MinElute PCR purification kit following the manufacturer’s instructions.

  • 48)

    Elute each sample in 16 μL of EB buffer

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

Size selection and gel purification of DNA-Adaptor ligation products *TIMING 2.5 h

  • 49)

    Prepare a 2% agarose gel using 1X TAE prepared with Ethidium Bromide.

    CRITICAL STEP Alternatively, samples can be purified using Agencourt AMPure beads, refer to Box 3 and Figure 5 for purification using AMPure beads

  • 50)

    Add 4 μL of sucrose loading dye to eluted DNA samples from Step 48 (total volume 20 μL)

  • 51)

    Load 20 μL of the Adaptor Ligated DNA samples into the gel wells with a DNA ladder. When loading multiple samples on the same DNA gel, make sure to leave at least 1 empty well between samples as well as 1 empty well between DNA ladders and samples to avoid cross-contamination

  • 52)

    Run agarose gel at 100-120 volts for approximately an hour to obtain clear separation of ladder 200-250-300 bands

  • 53)

    Place gel on UV transilluminator and using a clean scalpel for each sample, mark the positions of the 200-300 bp products. Turn off transilluminator and cut the gel slices at the marked positions

  • 54)

    Use the Qiagen gel extraction kit to purify the DNA from the agarose gel slices following the manufacturer’s instructions.

  • 55)

    Elute the DNA in 30 μL of EB buffer.

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

Box 3. Bead purification of sequencing libraries.

Purification of DNA sequencing libraries using agarose gel electrophoresis has traditionally been the standard practice. However, when adapted to purify a large number of samples (for multiplexing purposes, for example), agarose gel purification becomes a limiting step because of the time and labor required to scale it. An alternative method that is rapidly being adopted in sequencing laboratories is magnetic bead purification. Many bead purification products are offered by a variety of vendors, of which we have adopted the Agencourt AMPure XP purification system offered by Beckman Coulter Genomics. The Agencourt AMPure XP system uses solid-phase paramagnetic bead technology to selectively enrich DNA fragments 100 bp and larger while at the same time efficiently removing excess nucleotides, salts and enzymes. Furthermore, depending on the volumetric ratio of beads to purification reaction, the Agencourt system allows for selective enrichment of DNA fragments of particular lengths (Figure 5). We routinely purify libraries using 30 μL of beads for paired-end 76 Illumina sequencing runs. Below is a description of a working protocol for the purification of a single library using the DynaMag -2 magnet (the DynaMag -2 magnet can accommodate up to 16 samples). If more samples are to be processed, for example 96 samples, the DynaMag - 96 Side magnet can be used.

Adaptor ligation reactions for purification using the AMPure system

Adaptor ligation reactions for purification using the AMPure system are set up differently from ligation reactions intended for gel electrophoresis purification (as described in Steps 44-46 of the Procedure). For libraries to be purified using AMPure beads, we perform the ligation reaction in a total volume of 75 μL (35 μL of Quick Ligation buffer, 6 μL of 10 μM adaptors (10 μM of each PE5/PE7), 2 μL of Quick Ligase, and 32 μL of DNA library). After completion of the ligation reaction (15 min at 20 °C), the mixture is heated at 65 °C for 15 minutes to deactivate the DNA ligase. Libraries are then purified using AMPure beads without the prior reaction clean-up on QIAquick columns (Steps 47-48).

Purification of sequencing libraries using Agencourt AMPure XP system
  1. Aliquot 30 μL of AMPure beads into a clean Eppendorf tube(s)

    * CRITICAL STEP Make sure the beads are at room temperature for at least 30 minutes prior to purification

    * CRITICAL STEP Make sure the magnetic beads are in suspension by gently shaking the bottle prior to aliquoting

  2. Transfer the Adaptor Ligation mixture(s) to the Eppendorf tube(s) containing the AMPure beads and mix well by pipetting up and down 10 times

  3. Allow the beads/ligation reaction mixture(s) to incubate at room temperature for 5 minutes

  4. Place the Eppendorf tube(s) onto DynaMag -2 magnet and allow the mixture(s) to stand for 5 minutes for efficient collection of the beads

  5. Maintain the reaction mixture on the magnet and aspirate or pipette the cleared solution carefully and discard (solution should be clear after magnetic separation, compared to brown when beads are in suspension)

  6. Add 200 μL of 70% Ethanol to the beads in the Eppendorf tube(s) on the magnet and gently mix by inverting the magnet a couple of times

  7. Aspirate or pipette the 70% ethanol and discard while maintaining the tube(s) on the magnet

  8. Repeat for a total of 2 washes

  9. Allow the beads to dry for ~3 minutes

    * CRITICAL STEP Make sure not to overdry the magnetic beads as that will result in lower yields of DNA

  10. Off the magnet, add 30 μL of buffer EB and mix thoroughly by pipetting

  11. Allow the mixture to stand at room temperature for 5 minutes

  12. Transfer back to the magnet and incubate for ~3 minutes to separate the beads from the solution

  13. Transfer the eluant to a fresh PCR tube and proceed with library amplification

Figure 5.


Sequencing library size profiles following AMPure bead purification using different volumes of beads. The amount of beads indicated was added to a ligation reaction of 75 μL volume (50 – 40 – 30 – 20 μL of beads). Replicas are shown. Profiles illustrate the differing sequence library size profiles obtained when using different volumes of AMPure beads. We generally perform library purification using 30 μL of beads and sequence 76 base pairs on the Illumina platform.

PCR enrichment of Adaptor-ligated DNA products *TIMING 2.5 h

  • 56)

    Add 5 μL of PE5/PE7 mixture at 10μM (each primer) to each 30 μL DNA sample from Step 55.

  • 57)

    Add 30 μL of Phusion DNA polymerase and mix by pipetting thoroughly

  • 58)

    Incubate mixture in thermal cycler as follows:

    Stage 1: 98 °C for 30 s
    Stage 2: 98 °C for 10 s
    Stage 3: 65 °C for 30 s
    Stage 4: 72 °C for 30 s
    Stage 5: Repeat stages 2-4 10 times for a total of 11 cycles
    Stage 6: 72 °C for 5 min
    Stage 7: Hold at 4°C
  • 59)

    Purify the PCR amplification products using QIAquick PCR purification kit following manufacturer’s instructions.

  • 60)

    Elute in 30 μL of EB buffer

    * PAUSE POINT Samples can be spun down and kept at − 20 °C until further processing

  • 61)

    Measure DNA concentrations using the Nano-drop spectrophotometer.

  • 62)

    After obtaining DNA concentrations, make a 30 μL total volume dilution of the library at 10 ng/μL

  • 63)

    Run 1 μL of the 10 ng/μL library dilution on the Agilent Bioanalyzer

  • 64)

    Using the “Peak Range” option on the Bioanalyzer, gate on the DNA library peaks between 150-350 bps.

    ? TROUBLESHOOTING

  • 65)

    If necessary, dilute the sample to 10nM, pool different libraries if applying multiplex sequencing, and send for sequencing
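Steps 62 and 65 involve two small calculations: diluting the library to 10 ng/μL in 30 μL, and expressing a mass concentration in nM. The sketch below assumes double-stranded DNA with an average mass of about 660 g/mol per base pair, which is a standard approximation rather than a value given in this protocol. The function names are hypothetical, and the mean fragment length would come from the Bioanalyzer trace in Step 64.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA (assumption)

def dilution_volumes(stock_ng_per_ul, target_ng_per_ul=10.0, final_volume_ul=30.0):
    """Return (library volume, EB volume) in microliters for the Step 62 dilution."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already below the target concentration")
    lib_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return round(lib_ul, 2), round(final_volume_ul - lib_ul, 2)

def ng_per_ul_to_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert ng/uL to nM for double-stranded DNA of the given mean length."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)
```

For a library at 60 ng/μL, `dilution_volumes(60.0)` returns 5 μL of library and 25 μL of EB. A 10 ng/μL library with a mean fragment length of 250 bp is roughly 60 nM, so a further dilution would be needed to reach 10 nM.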

Informatic Analysis

  • 66)

    Prepare a reference genome for use with Bowtie. Download the hg19 reference genome file chromFa.tar.gz from the UCSC Genome Browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/. Detailed instructions are on the web page.

  • 67)

    Change the sequence of the pseudo-autosomal regions on chrY to N’s; a sample python program to do so is included in Supplementary Methods, labeled hg19.chrY.psr.py.

    CRITICAL STEP The pseudo-autosomal regions on chrY are exact copies of the corresponding regions on chrX. Since only reads that map to exactly one place in the reference genome are used, it is necessary to eliminate one of the two copies of the pseudo-autosomal regions so that these regions are normalized consistently with the rest of the X chromosome.

  • 68)

    In order to use Bowtie it is necessary to prepare an index file for the reference genome. The command to create the Bowtie index is named hg19.bowtie.build.bash in the Supplementary Methods. The files with “hap” annotation in their names are haplotype variants. These are not used in our copy number analysis. Also, we use our modified version of chrY rather than the reference version.
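The masking described in Step 67 can be sketched as follows. This is not the hg19.chrY.psr.py program from the Supplementary Methods. The function name is hypothetical, and the coordinates are an assumption taken from commonly cited hg19 annotations of the chrY pseudo-autosomal regions, so they should be checked against the supplementary program before use.

```python
# hg19 chrY pseudo-autosomal regions, 1-based inclusive coordinates.
# These values are an assumption, not taken from the protocol text.
HG19_CHRY_PAR = [(10001, 2649520), (59034050, 59363566)]

def mask_regions(sequence, regions):
    """Return sequence with each 1-based inclusive (start, end) region replaced by N's."""
    seq = bytearray(sequence, "ascii")  # mutable and compact for chromosome-sized input
    for start, end in regions:
        lo = max(start - 1, 0)          # convert to a 0-based slice start
        hi = min(end, len(seq))         # clip regions that run past the sequence end
        if hi > lo:
            seq[lo:hi] = b"N" * (hi - lo)
    return seq.decode("ascii")
```

The sequence length is unchanged by the masking, so downstream coordinates on chrY remain valid.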

Computing bin boundaries

  • 69)

    A ‘bin boundaries’ file for 50,000 bins in hg19 is provided in the Supplementary Data with the title hg19.bin.boundaries.50k.bowtie.k50.sorted.txt. If this file is used it is not necessary to complete Steps 69-76. Otherwise, to compute the bin boundaries, make ‘reads’ files from the reference genome. For the chromosomes to be used for copy number analysis start at position one in the chromosome sequence and take the first 50 bases. Also create read ID strings and quality score strings in a format readable by Bowtie. These can be Illumina format or fastq format. Output these to a file and continue likewise at position 2 and 3 and 4 etc. until the end of the chromosome is reached. A sample program (hg19.generate.reads.k50.py) is provided in the Supplementary Methods. The program utilizes the input file: [chromlist.txt], which is also provided in Supplementary Methods. This creates separate files, each with 150 million reads. For hg19 chromosomes 1 through 22 and X and Y there will be 21 files of this size.

    CRITICAL STEP Mapping three billion reads can take up to 500 hours of computer time. If multiple computers are available this mapping step can be split up and the parts distributed and run concurrently.
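The read generation in Step 69 amounts to sliding a 50-base window along each chromosome one position at a time. A minimal sketch is given below; it is not the hg19.generate.reads.k50.py program. The read ID layout and the constant quality character are assumptions, with "h" chosen on the assumption that the quality string should suit the Phred+64 encoding implied by Bowtie's --solexa1.3-quals option.

```python
def kmer_fastq_records(chrom, sequence, k=50):
    """Yield one FASTQ record for every k-base window of sequence."""
    quality = "h" * k  # constant placeholder quality string (assumed Phred+64 encoding)
    for start in range(len(sequence) - k + 1):
        read = sequence[start:start + k]
        # The read ID stores the chromosome and 1-based start so that the
        # mapped position can later be compared with the position of origin.
        yield "@%s.%d\n%s\n+\n%s\n" % (chrom, start + 1, read, quality)
```

Because the function is a generator, records can be written to the output files in fixed-size parts without holding a whole chromosome's worth of reads in memory.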

  • 70)

    Using Bowtie, map the reads created using the same mapping parameters expected to be used when mapping real data. An example command is:

    /filepath/bowtie-0.12.7/bowtie -S -t -n 2 -e 70 -m 1 --best --strata --solexa1.3-quals hg19 /filepath/sequence.part.0.k50.txt /filepath/sequence.part.0.k50.sam

    A sample python program to create and submit Sun Grid Engine jobs (bowtie.qsub.py) is provided in Supplementary Methods.

  • 71)

    Create a file listing the sizes of the chromosomes to be used for the copy number analysis. A sample program is provided in the Supplementary Methods (hg19.chrom.sizes.py). The necessary file for further processing is also provided in Supplementary Data (hg19.chrom.sizes.txt).

  • 72)

    Genome positions with reads that map back to where they came from and nowhere else in the genome are called ‘mappable positions’. The goal is to create a set of bins, each having the same number of mappable positions. Summarize the list of mappable positions in a file with one row for each contiguous block of mappable positions. These blocks are called ‘goodzones’. A sample program to create the list of goodzones from the mapped read files is included in the Supplementary Methods (hg19.bowtie.goodzones.k50.py). The file listing the goodzones is also provided in the Supplementary Data (hg19.goodzones.bowtie.k50.bed). This file is used to compute the bin boundaries.

    CRITICAL STEP If it is desired to create a file with more or fewer bin boundaries (e.g. 5000 or 100,000), such a file can be computed from this goodzones file without having to recreate and remap three billion reads from the reference genome. Just start at the next step in the protocol.
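Collapsing mappable positions into goodzones, as described in Step 72, can be sketched as below. This is not the hg19.bowtie.goodzones.k50.py program. The sketch assumes the mappable start positions for one chromosome are already sorted and 0-based, and it reports half-open intervals in the style of a BED file.

```python
def goodzones(positions):
    """Collapse sorted 0-based mappable positions into half-open (start, end) blocks."""
    zones = []
    for pos in positions:
        if zones and pos == zones[-1][1]:
            zones[-1][1] = pos + 1        # position extends the current contiguous block
        else:
            zones.append([pos, pos + 1])  # gap found, so start a new block
    return [tuple(zone) for zone in zones]

def mappable_count(zones):
    """Total number of mappable positions covered by a list of goodzones."""
    return sum(end - start for start, end in zones)
```

The second function gives the per-chromosome total used in Step 73.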

  • 73)

    From the goodzones file, compute the number of mappable positions on each chromosome. A sample program is provided in the Supplementary Methods (hg19.chrom.mappable.bowtie.k50.py). The output file from this program is also provided in Supplementary Data (hg19.chrom.mappable.bowtie.k50.txt).

  • 74)

    After deciding on how many bins are desired, compute the bin boundaries from the goodzones file and the number of mappable positions in each chromosome. Each chromosome is allocated a number of bins proportional to its number of mappable positions relative to all the chromosomes being used in the copy number analysis. The number of mappable positions for each bin is then computed as the chromosome's mappable positions divided by its number of bins, rounding up when the fractional bin accumulated passes 1 and adding one mappable position to the last bin on the chromosome if necessary. A sample program is provided in the Supplementary Methods (hg19.bin.boundaries.50k.py).

    CRITICAL STEP The choice of the number of genomic bins to be used in the analysis depends on a number of factors. Segmentation algorithms generally perform better with more data points. However, the variance due to sampling is very high if the median bin count is low – below 20 reads per bin, for example. Another consideration is variation due to small scale differences in the genome. We normalize the bin counts for each sample based on GC content. This is sufficient at the scales we have been using (for example using 50 or 240 thousand bins), however, at a much smaller scale, for example using 2.5 million bins, GC normalization alone might not be sufficient to correct for WGA biases that might be independent of GC content. For the 50K bins supplied in this paper, we generally achieve a median read count of 35 reads that is sufficient to allow genome wide copy number determination.
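The allocation described in Step 74 can be sketched in two parts: sharing the bins among chromosomes, then sizing the bins within one chromosome. This is not the hg19.bin.boundaries.50k.py program. The largest-remainder rule used to make the per-chromosome counts add up exactly is an assumption of this sketch, and the function names are hypothetical.

```python
def allocate_bins(chrom_mappable, total_bins):
    """Share total_bins among chromosomes in proportion to their mappable positions."""
    total = sum(chrom_mappable.values())
    exact = {c: total_bins * m / total for c, m in chrom_mappable.items()}
    bins = {c: int(x) for c, x in exact.items()}
    leftover = total_bins - sum(bins.values())
    # Hand the remaining bins to the chromosomes with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda c: exact[c] - bins[c], reverse=True)
    for c in by_fraction[:leftover]:
        bins[c] += 1
    return bins

def bin_sizes(mappable, n_bins):
    """Split mappable positions into n_bins sizes that differ by at most one."""
    per_bin = mappable / n_bins
    sizes, assigned, target = [], 0, 0.0
    for _ in range(n_bins):
        target += per_bin
        size = int(target) - assigned  # one extra position once the fraction accumulates
        sizes.append(size)
        assigned += size
    sizes[-1] += mappable - assigned   # any remainder goes to the last bin
    return sizes
```

Walking along the goodzones and closing a bin each time its size is reached then yields the bin boundaries.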

  • 75)

    Sort the bin boundaries file:

    sort -k 3,3n hg19.bin.boundaries.50k.bowtie.k50.txt > hg19.bin.boundaries.50k.bowtie.k50.sorted.txt

    The input data for the sorting is provided in Supplementary Data (hg19.bin.boundaries.50k.bowtie.k50.txt). Use this sorted bin boundaries file for subsequent processing.

  • 76)

    Compute the GC content in each bin. This will be used in Step 78 for GC normalization. This consists of computing the percentage of G and C bases in each bin from the reference genome. A sample program is provided in the Supplementary Methods (hg19.varbin.gc.content.50k.bowtie.k50.py). The output file is also provided in the Supplementary Data (hg19.varbin.gc.content.50k.bowtie.k50.txt). Figure 6 illustrates the schematic for genome configuration and bin boundary definition as well as the steps downstream necessary to infer genome copy number.
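The GC content calculation in Step 76 can be sketched as follows. This is not the hg19.varbin.gc.content.50k.bowtie.k50.py program. Ignoring N's and other ambiguous bases in the denominator is an assumption of this sketch, as are the function names and the 0-based half-open bin coordinates.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the A, C, G and T bases; 0.0 if there are none."""
    seq = sequence.upper()  # reference FASTA files use lower case for soft-masked repeats
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0

def bin_gc(chrom_sequence, bin_start, bin_end):
    """GC fraction of one bin, given 0-based half-open coordinates on the chromosome."""
    return gc_fraction(chrom_sequence[bin_start:bin_end])
```

Multiplying the result by 100 gives the percentage referred to in the step.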

Figure 6.


Schematic of the informatics workflow of SNS. Blue number inserts refer to the Steps of the Procedure.

Sequence mapping and data analysis

  • 77)

    If there are multiple barcoded samples in a lane of sequence data these must first be allocated to separate files. In our system, the barcodes are the first 8 bases of each read. The eighth base position is always a T, so only the first 7 positions are needed to identify the samples. A file listing the barcode sequences and barcode IDs is used to determine which are valid barcode sequences and to which output file they are allocated. A sample program to do this is provided in the Supplementary Methods (barcode.split.sr01.py). A sample barcode file is also provided in Supplementary Data (barcode.8.txt). Once the sequence data is split into separate files for each sample, carry out processing as described below.
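The barcode allocation in Step 77 can be sketched as a dictionary lookup on the first 7 bases of each read. This is not the barcode.split.sr01.py program. The function name and the barcode sequences in the usage note are made up for illustration and are not taken from barcode.8.txt.

```python
def split_by_barcode(reads, barcodes, key_len=7):
    """Group (read_id, sequence) pairs by sample; unmatched reads go under None.

    barcodes maps the first key_len bases of each barcode to a sample ID.
    The barcode is left on the read so it can be trimmed at the mapping step.
    """
    groups = {}
    for read_id, seq in reads:
        sample = barcodes.get(seq[:key_len])  # None when the prefix is not a valid barcode
        groups.setdefault(sample, []).append((read_id, seq))
    return groups
```

With `barcodes = {"ACGTACG": "cell01", "TTGCAAC": "cell02"}`, a read starting with ACGTACGT is assigned to cell01, and a read whose first 7 bases match neither key is collected under `None` for inspection.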

  • 78)

    Map the reads to the reference genome:

    /filepath/bowtie-0.12.7/bowtie -S -t -n 2 -e 70 -3 0 -5 0 -m 1 --best --strata hg19 /filepath/SRR054616.fastq /filepath/SRR054616.sam

    An example data set can be downloaded from the NCBI Short Read Archive (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=SRR054616). This data set is from a single cell from Navin et al. 24. The accession ID is SRR054616. The -3 and -5 parameters indicate how many bases to trim from the 3′ and 5′ ends of each read. The example data set was not barcoded so it is not necessary to trim bases from the 5′ end. These reads are only 36 bases in length. Since the bin boundaries were computed using 50 base reads it is desirable to map a number of bases as close to 50 as possible. If all the samples in a project were sequenced at 36 base pair length, then it would be desirable to re-compute bin boundaries with 36 base reads from the reference genome. Bases from the 3′ end of the reads can be trimmed if more than 50 bases are available for mapping to match the computation of the bin boundaries. On more recent sequencing runs it is typical to have 100 base reads. If many reads have the WGA primer sequence at the 5′ end of the read immediately following the barcode sequence an additional 30 bases can be trimmed. The 5′ parameter would then be 38 and the 3′ parameter would be 12 leaving 50 bases to be mapped. A Sun Grid Engine script to map reads for the sample cell is provided in Supplementary Methods (SRR054616.bowtie.qsub).
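The trimming arithmetic described above can be written out as a short helper. The function name is hypothetical; the values returned correspond to Bowtie's -5 and -3 options.

```python
def trim_params(read_len, barcode_len=0, primer_len=0, target_len=50):
    """Return (five_prime_trim, three_prime_trim) for Bowtie's -5 and -3 options."""
    five = barcode_len + primer_len                  # bases removed from the 5' end
    three = max(read_len - five - target_len, 0)     # trim the 3' end down to target_len
    return five, three
```

For a 100 base read with an 8 base barcode and a 30 base WGA primer, `trim_params(100, 8, 30)` returns `(38, 12)`, matching the values given in the text. For the 36 base example data set, `trim_params(36)` returns `(0, 0)`.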

  • 79)

    Convert the output to .bam file format:

    /filepath/samtools-0.1.16/samtools view -Sb -o /filepath/SRR054616.bam /filepath/SRR054616.sam

  • 80)

    Sort the .bam file:

    /filepath/samtools-0.1.16/samtools sort /filepath/SRR054616.bam /filepath/SRR054616.sorted

  • 81)

    Remove reads likely to be PCR duplicates:

    /filepath/samtools-0.1.16/samtools rmdup -s /filepath/SRR054616.sorted.bam /filepath/SRR054616.rmdup.bam

  • 82)

    Create a .bam file index:

    /filepath/samtools-0.1.16/samtools index /filepath/SRR054616.rmdup.bam

  • 83)

    Create a .sam file from the sorted .bam file with duplicates removed:

    /filepath/samtools-0.1.16/samtools view -o /filepath/SRR054616.rmdup.sam /filepath/SRR054616.rmdup.bam

  • 84)

    Count the number of reads in each bin:

    /filepath/Python-2.7.1/python /filepath/varbin.50k.sam.py /filepath/SRR054616.rmdup.sam /filepath/SRR054616.varbin.50k.txt /filepath/SRR054616.varbin.50k.stats.txt

    The output files of the varbin algorithm are provided in the Supplementary Data. A sample python program to do this is provided in the Supplementary Methods (varbin.50k.sam.py)
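The counting in Step 84 can be sketched with a binary search over the sorted bin start positions. This is not the varbin.50k.sam.py program. The sketch assumes that both the bin starts and the read positions are expressed as absolute genome coordinates (chromosome offset plus position), which is an assumption about the file layout rather than something stated in the text.

```python
import bisect

def count_reads_in_bins(bin_starts, read_positions):
    """Count reads per bin; bin_starts holds the sorted start coordinate of each bin."""
    counts = [0] * len(bin_starts)
    for pos in read_positions:
        # The bin containing pos is the last bin whose start is <= pos.
        idx = bisect.bisect_right(bin_starts, pos) - 1
        if idx >= 0:
            counts[idx] += 1
    return counts
```

Each lookup takes logarithmic time in the number of bins, so the cost is dominated by reading the .sam file.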

  • 85)

    Run the R script provided in the Supplementary Methods (SRR054616.cbs.r) to perform the GC content normalization and CBS segmentation and plot graphs:

    /usr/bin/R CMD BATCH /filepath/SRR054616.cbs.r /filepath/SRR054616.cbs.r.out

    The R script brings in the data file, adds one to each bincount, normalizes the bincount based on GC content using lowess smoothing, uses the CBS segmentor to find non-overlapping regions of differing copy number, and outputs genome plots (Figure 7 and Supplementary Data). Figure 7 shows a genome plot of normalized bin counts and segmentation. There is one gray point for each normalized bin count. The blue line shows the seg.mean value from CBS. The high peaks near the centromeres are artifacts of inaccurate genome assembly in the highly repetitive regions near some of the centromeres and telomeres. These bins can be masked using the file (hg19.50k.k50.bad.bins.txt), provided in the Supplementary Data. This file is a list of 50,000 0’s and 1’s with 1 indicating the bin is to be masked. These “bad bins” are from empirical observation of a number of samples sequenced in our lab. These are provided as an example.
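The idea behind the GC normalization can be illustrated with a deliberately simplified stand-in. The R script fits a lowess curve of bin count against GC content; the sketch below instead divides each bin by the median of bins with similar GC content. It is not the same smoother and is not a substitute for SRR054616.cbs.r. The function name and the grouping step of 0.01 are assumptions.

```python
from statistics import median

def gc_normalize(bin_counts, gc_fractions, gc_step=0.01):
    """Divide each (count + 1) by the median value of bins with similar GC content.

    Grouping bins by rounded GC fraction is a simplified stand-in for the
    lowess fit used in the protocol's R script.
    """
    values = [count + 1 for count in bin_counts]   # add one, as in the R script
    keys = [round(gc / gc_step) for gc in gc_fractions]
    groups = {}
    for value, key in zip(values, keys):
        groups.setdefault(key, []).append(value)
    expected = {key: median(vals) for key, vals in groups.items()}
    return [value / expected[key] for value, key in zip(values, keys)]
```

A bin whose count matches what is typical for its GC content gets a ratio of 1.0, while gains and losses appear as ratios above or below 1.0.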

  • 86)

    (Optional) Run the R script provided in the Supplementary Methods (SRR054616.copynumber.r) to estimate copy number. This will only work if there are enough regions of the genome at varying copy numbers to allow the algorithm to work. For genomes that are near diploid we assume the majority of the genome is copy number 2 and estimate other regions based on the segment ratio relative to two:

    /usr/bin/R CMD BATCH /filepath/SRR054616.copynumber.r /filepath/SRR054616.copynumber.r.out

Figure 7.

Genome plot of normalized bin counts and segmentation illustrating the genome wide copy number profile of example single cell SRR054616. Blue line shows the seg.mean values from the Circular-Binary-Segmentation (CBS).

This script will output a density plot of segment value differences and plots of each chromosome showing the adjusted bin counts, segmentation values and copy number estimates (Figure 8 and Supplementary Data). The density plot shows the Gaussian kernel smoothed density of differences in seg.mean values between all segments called by the segmentor, weighted by segment length. The second peak represents the mode of the seg.mean difference between segments one copy number apart. This is used to estimate copy number for the genome. Figure 9 shows a close up view of a region on chromosome 4 illustrating the normalized bin count for each bin on the chromosome. The blue line is the seg.mean as called by the CBS segmentation algorithm. The red line is the estimated copy number. Supplementary Data provides the output files from the copy number R script. As mentioned earlier in the text, we occasionally observe single cell copy number profiles that contain large homozygous deletions or what appears to be “shredding” of chromosomes. Figure 10 provides an illustration of those profiles and Box 4 provides a discussion.
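For the near-diploid case mentioned in Step 86, the estimate relative to copy number two can be sketched as below. This is not SRR054616.copynumber.r. Using the length-weighted median segment ratio as the reference for two copies is an assumption of this sketch, made to reflect the statement that the majority of the genome is taken to be at copy number 2.

```python
def estimate_copy_number(seg_ratios, seg_lengths, base_ploidy=2):
    """Estimate integer copy number per segment for a near-diploid genome."""
    # The length-weighted median ratio is taken to represent base_ploidy copies.
    half = sum(seg_lengths) / 2.0
    running = 0.0
    reference = None
    for ratio, length in sorted(zip(seg_ratios, seg_lengths)):
        running += length
        if running >= half:
            reference = ratio
            break
    return [int(round(base_ploidy * ratio / reference)) for ratio in seg_ratios]
```

With segment ratios of 1.0, 0.5, 1.5 and 1.0 over lengths of 100, 10, 20 and 80 bins, the reference ratio is 1.0 and the estimates are 2, 1, 3 and 2 copies.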

Figure 8.


Density plot of segment value differences. The density plot shows the Gaussian kernel smoothed density of differences in seg.mean values for differences between all segments called by the segmentor weighted by segment length. The second peak represents the mode of the seg.mean difference between segments one copy number apart.

Figure 9.


A close up view of a region on chromosome 4 illustrating normalized bin count for each bin on the chromosomal segment. The blue line is the seg.mean as called by the CBS segmentation algorithm. The red line is the estimated copy number.

Figure 10.


Representative illustration of Genome Sector Loss (GSL), where large homozygous deletions and patterns consistent with chromosomal “shredding” are evident. (a) Whole genome view of a single cell collected out of a “normal” diploid flow sorting gate. (b) A close up view of the profile of the same cell on chromosomes 7 and 8.

*TIMING

Steps 1-8, flow sorting single nuclei, 4 h

Steps 9-18, whole genome amplification, 6 h

Steps 19-30, QIAquick 96 well plate PCR purification, 1 h

Steps 31-35, DNA sonication, 30 min

Steps 36-39, end repair of sonicated WGA DNA, 45 min

Steps 40-43, 3′ A-overhang addition, 45 min

Steps 44-48, adaptor ligation to DNA, 25 min

Steps 49-55, size selection and library gel purification, 2.5 h

Steps 56-65, library enrichment and quantification, 2.5 h

Steps 66-86, informatic analysis, variable and depends on computer processing power

? TROUBLESHOOTING

Troubleshooting advice can be found in Table 1.

Table 1.

Troubleshooting table

| Step | Problem | Possible reason | Solution |
| 30 | Failure to amplify single cell genome | Generally, problems with amplification stem from flow sorting problems | Make sure flow cytometry parameters are set properly to capture single cells in 96 well plates |
| | | 96 well plate not properly aligned for sorting; single cell deposition device was not checked or device position was moved following alignment | Perform test sort using beads to determine that drops are deposited precisely in the center of each of the 96 wells. If necessary, use the instrument device positioning feature to make adjustments |
| | | Breakoff not stable; sample line, flow cell or nozzle is not clean | Perform proper cleaning of instrument (refer to Box 2) and check for air bubbles in the sample line and the flow cell |
| | | Breakoff not stable; room temperature has changed significantly (ambient air temperature affects the size and flight of sort droplets) | If the flow cytometry facility experiences temperature fluctuations, check breakoff and drop delay settings regularly and adjust accordingly |
| | | Breakoff not stable; fluidics pressure is not stable | Check sheath and sample pressures. Check in-line filters and tubing connections. Call instrument service engineer |
| | | Drop delay not correct (the drop delay value determines which drop will be deflected); breakoff has drifted | Monitor breakoff and repeat drop delay if any minor changes are observed |
| | | Drop delay not correct (the drop delay value determines which drop will be deflected); instrument sort setting was changed at some point following sort set-up | Changing some sort setting values will alter the drop delay. Perform drop delay determination again |
| 64 | Low yield of enriched DNA library (too many adaptor-adaptor linkers) | Low ratio of DNA to adaptor-adaptor ligation products | Lower the amount of adaptors used in ligation. Alternatively, DNA libraries with a lot of adaptor-adaptor amplification product contaminants can be re-purified and amplified using limited cycles (e.g. 4-5 cycles) |

ANTICIPATED RESULTS

In our previous report (ref. 24), we performed, as a proof of concept, SNS on multiple single nuclei isolated from the human breast cancer cell line SK-BR-3 and compared the genome wide copy number profiles with profiles obtained by sequencing bulk DNA from a million cells, as well as with profiles determined by aCGH. Copy number profiles from the different samples are highly concordant and reproducible, with R² correlation values of ~0.9. Additionally, SNS was performed on single nuclei from a diploid immortalized fibroblast cell line (SKN1) with a normal “flat” copy number profile. The results from SNS on single SKN1 nuclei, which show the expected “flat” profile, again demonstrate the reproducibility of the approach. Importantly, when many single cells from a cancer tissue specimen are analyzed, as done in Navin et al. (ref. 24), clustering of the copy number profiles yields evolutionary trees of tumor progression that are legible and interpretable. Furthermore, the quantitative nature of the data produced with the SNS method (Figures 7 and 9) allows for the accurate identification of genomic copy number alterations and will help further our understanding of cancer biology. Since our initial report, we have applied SNS to many additional breast tumors as well as tumors of different anatomical origins, and we reproducibly obtain quantitative and intelligible genome wide copy number profiles.
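As a rough illustration of the concordance measure mentioned above, the following Python sketch computes a squared Pearson correlation between two copy number profiles defined over the same bins. The text does not state whether the reported R² values were computed on bin ratios or on segmented values, so the input format and the function name `profile_r_squared` are assumptions.

```python
import numpy as np


def profile_r_squared(profile_a, profile_b):
    """Squared Pearson correlation between two profiles on the same bins.

    Hypothetical helper: inputs are assumed to be equal-length 1-D arrays
    (for example, per-bin ratios or segmented values for two samples).
    """
    a = np.asarray(profile_a, dtype=float)
    b = np.asarray(profile_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("profiles must be 1-D arrays of equal length")
    # Ignore bins that are missing (NaN or infinite) in either profile.
    keep = np.isfinite(a) & np.isfinite(b)
    r = np.corrcoef(a[keep], b[keep])[0, 1]
    return float(r ** 2)


if __name__ == "__main__":
    # Made-up per-bin copy number values for a single cell and a bulk sample.
    single_cell = [2, 2, 3, 3, 1, 1, 2, 4]
    bulk = [2.1, 1.9, 3.2, 2.8, 1.1, 0.9, 2.0, 3.9]
    print(profile_r_squared(single_cell, bulk))
```

Because the Pearson correlation is unchanged by linear rescaling, the two profiles do not need to be on the same scale, but they do need to be aligned bin by bin.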

Supplementary Material

Supp Data

Supplementary Data. Input/output data for the informatics procedure of the protocol required to reproduce copy number profiles from single cell sequencing data. Supplementary Data Legend outlines all the data with descriptions.

Supp Fig

Supplementary Figure 1. Outline of the informatics section. Provides a concise outline of all the steps with input-program-output labels.

Supp Methods

Supplementary Methods. Contains all programs required to process single cell sequencing data for copy number determination, from genome preparation to sequence mapping and copy number estimation. Supplementary Methods Legend outlines all the programs and their utilities.

Supp Note

Supplementary Note. Descriptions of data and program files. Text providing a brief overview of the data and program files included in the Supplementary Methods and Supplementary Data.

Box 4. Genome sector loss (GSL).

In approximately 5% of single cell profiles, we observe an as yet unexplained phenomenon in which one or more chromosomes have either been completely lost (homozygous loss) or appear ‘shredded’, as if multiple regions of up to 20 Mb in length from a single chromosome have been randomly lost from the nucleus. We observe this phenomenon to varying degrees in all types of samples, whether from cell culture, normal tissue or malignant tissue. Such Genome Sector Loss (GSL) can affect any chromosome, and the breakpoints are not shared among different cells from the same source or sorting session. The profiles of these cells are highly disordered and appear distinct from those reported for ‘pseudodiploid’ cells in our initial publication 24. In the absence of a biological or physical explanation for these cells, we consider them at least moribund, and although we include them in our lineage analysis, they do not contribute to the clonal lineage trees.

It is not clear whether the cause of GSL lies in the sorting process, perhaps through shear stress on the nuclei as they pass through the nozzle, or whether it has a biological explanation related to abortive cell division or the observed fragmentation of chromosomes during programmed cell death (apoptosis) 29. It is also tempting to relate the observation of ‘shredded’ chromosomes among the GSL profiles to the recently reported events leading to ‘chromothripsis’, in which segments of shredded chromosomes reform in a highly rearranged yet viable state 30. The potential explanations for GSL are currently under investigation. In any case, the phenomenon affects only a small minority of profiled nuclei.

ACKNOWLEDGEMENTS

We acknowledge all the members of the Hicks and Wigler labs for discussions during the technology development. We thank W. McCombie, E. Ghiban, and L. Gelley for their advice and technical assistance and A. Gordon for informatics support and assistance with figures. We also thank three anonymous reviewers for their valuable comments and suggestions. N.N. is supported by grants from Texas STARS and the Alice Kleberg Reynolds Foundation. This work was supported by grants to M.W. and J.H. from the Department of the Army (W81XWH04-1-0477), the Breast Cancer Research Foundation, and the Simons Foundation. M.W. is an American Cancer Society Research Professor.

Footnotes

AUTHOR CONTRIBUTIONS T.B., J.K., N.N., J.H. and M.W. designed methods and analyzed data. T.B., L.R., H.C., M.R., A.S. and J.T. performed experiments. R.K., D.E. and B.L. contributed to the technology development. J.K., N.N., T.B., J.H. and M.W. developed informatic analysis. T.B., J.K. and J.H. wrote the paper.

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

References

  • 1. Beckmann JS, Estivill X, Antonarakis SE. Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet. 2007;8:639–646. doi: 10.1038/nrg2149.
  • 2. Hasin Y, et al. High-resolution copy-number variation map reflects human olfactory receptor diversity and evolution. PLoS Genet. 2008;4:e1000249. doi: 10.1371/journal.pgen.1000249.
  • 3. Perry GH, et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet. 2007;39:1256–1260. doi: 10.1038/ng2123.
  • 4. Speliotes EK, et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010;42:937–948. doi: 10.1038/ng.686.
  • 5. Sebat J, et al. Strong association of de novo copy number mutations with autism. Science. 2007;316:445–449. doi: 10.1126/science.1138659.
  • 6. Beroukhim R, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463:899–905. doi: 10.1038/nature08822.
  • 7. Bignell GR, et al. Signatures of mutation and selection in the cancer genome. Nature. 2010;463:893–898. doi: 10.1038/nature08768.
  • 8. Russnes HG, et al. Genomic architecture characterizes tumor progression paths and fate in breast cancer patients. Sci Transl Med. 2010;2:38ra47. doi: 10.1126/scitranslmed.3000611.
  • 9. Shiu KK, Natrajan R, Geyer FC, Ashworth A, Reis-Filho JS. DNA amplifications in breast cancer: genotypic-phenotypic correlations. Future Oncol. 2010;6:967–984. doi: 10.2217/fon.10.56.
  • 10. Shinawi M, Cheung SW. The array CGH and its clinical applications. Drug Discov Today. 2008;13:760–770. doi: 10.1016/j.drudis.2008.06.007.
  • 11. Santarius T, Shipley J, Brewer D, Stratton MR, Cooper CS. A census of amplified and overexpressed human cancer genes. Nat Rev Cancer. 2010;10:59–64. doi: 10.1038/nrc2771.
  • 12. Pinkel D, Albertson DG. Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005;37(Suppl):S11–17. doi: 10.1038/ng1569.
  • 13. Praulich I, et al. Clonal heterogeneity in childhood myelodysplastic syndromes--challenge for the detection of chromosomal imbalances by array-CGH. Genes Chromosomes Cancer. 2010;49:885–900. doi: 10.1002/gcc.20797.
  • 14. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11:31–46. doi: 10.1038/nrg2626.
  • 15. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24:133–141. doi: 10.1016/j.tig.2007.12.007.
  • 16. Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010;11:685–696. doi: 10.1038/nrg2841.
  • 17. Chiang DY, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009;6:99–103. doi: 10.1038/nmeth.1276.
  • 18. Alkan C, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009;41:1061–1067. doi: 10.1038/ng.437.
  • 19. Shah SP, et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature. 2009;461:809–813. doi: 10.1038/nature08489.
  • 20. Ding L, et al. Genome remodelling in a basal-like breast cancer metastasis and xenograft. Nature. 2010;464:999–1005. doi: 10.1038/nature08989.
  • 21. Campbell PJ, et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature. 2010;467:1109–1113. doi: 10.1038/nature09460.
  • 22. Yachida S, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature. 2010;467:1114–1117. doi: 10.1038/nature09515.
  • 23. Anderson K, et al. Genetic variegation of clonal architecture and propagating cells in leukaemia. Nature. 2011;469:356–361. doi: 10.1038/nature09650.
  • 24. Navin N, et al. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472:90–94. doi: 10.1038/nature09807.
  • 25. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
  • 26. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
  • 27. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–663. doi: 10.1093/bioinformatics/btl646.
  • 28. Wersto RP, et al. Doublet discrimination in DNA cell-cycle analysis. Cytometry. 2001;46:296–306. doi: 10.1002/cyto.1171.
  • 29. Nagata S, Nagase H, Kawane K, Mukae N, Fukuyama H. Degradation of chromosomal DNA during apoptosis. Cell Death Differ. 2003;10:108–116. doi: 10.1038/sj.cdd.4401161.
  • 30. Stephens PJ, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011;144:27–40. doi: 10.1016/j.cell.2010.11.055.
