Biom J. 2024 Jul 11;66(5):e202300278. doi: 10.1002/bimj.202300278

Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Raphael O Betschart 1, Cristian Riccio 1, Domingo Aguilera‐Garcia 2, Stefan Blankenberg 1,3,4, Linlin Guo 3, Holger Moch 2, Dagmar Seidl 2, Hugo Solleder 1, Felicia Sandberg 1, Alexandre Thiéry 1, Raphael Twerenbold 3,4,5, Tanja Zeller 3,4,5, Martin Zoche 2, Andreas Ziegler 1,3,4,6
PMCID: PMC12859534  PMID: 38988195

ABSTRACT

Rapid advances in high‐throughput DNA sequencing technologies have enabled large‐scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not yet worked with WGS data, we first sketch Illumina's short‐read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics that are applied to WGS data at four stages: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS‐HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR‐free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross‐contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time increased linearly with genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.

Keywords: DNA sequencing, DRAGEN, high‐throughput sequencing, Illumina NovaSeq 6000, next‐generation sequencing

1. Introduction

Genome‐wide association studies (GWAS) using high throughput microarrays revolutionized genetic epidemiology in the past 15 years (Tam et al. 2019). For example, at the beginning of 2024, the GWAS Catalog contained more than half a million top associations (Sollis et al. 2023). A major technological shortcoming of microarray‐based GWAS is that only a few hundred thousand to a few million selected chromosomal positions of the genome are genotyped. Furthermore, rare variants and population‐specific variants are not well covered by microarrays for GWAS. Because of the clinical importance of mutations in patients, technologies enabling sequencing of every base in the genome are of great interest.

A wealth of different whole genome sequencing (WGS) and whole exome sequencing technologies was thus developed in the past 20 years (Barba, Czosnek, and Hadidi 2014; Hayden 2014). Sequencing a whole genome currently costs well below 1000 USD, and we can soon expect to have a WGS for just 100 USD (Mobley 2021). This cost will be substantially lower than the cost of microarray genotyping in 2007, which was around 400 USD per sample. Illumina is the market leader, holding about 80% of the global DNA sequencing market (Mobley 2021). In the past few years, Illumina has moved towards making sequencing a standardized just‐push‐a‐button system. For example, Illumina acquired Edico Genome's DRAGEN Bio‐IT platform (DRAGEN) in 2018, a computer system designed for analyzing WGS data. In 2020, Illumina bought Enancio, whose software Lena is now the compression software DRAGEN original read archive (ORA).

Although high‐throughput sequencing technology is already very advanced and generates plenty of data, the bottlenecks are the subsequent processing and analysis steps. This is best illustrated by the large volume of data: the raw compressed sequence data files are approximately 65 gigabytes (GB) in size for the project reported here. Moreover, even the best sequencing technology available is not error‐free. Even with high‐quality single samples, the total number of errors across all samples will be large. Quality control (QC) is, therefore, of utmost importance in all different steps of a sequencing study.

This sheer amount of data together with its need for QC has called for efficient approaches to

  • (i) process the raw sequencing data into data that contain comprehensive information about the polymorphic sites,

  • (ii) perform the union of all individual preprocessed sequences, a step coined joint calling or multisample calling,

  • (iii) efficiently perform QC on the sequence data at several steps during the sequencing study, and

  • (iv) efficiently compress and store the sequence data after preprocessing.

Statistical association analysis can only be performed after completing steps (i) to (iii) successfully.

The main aim of this paper thus is to describe the preprocessing and QC steps in detail and apply them to data from the GENEtic SequencIng Study Hamburg–Davos (GENESIS‐HD), a WGS study involving more than 9000 whole genomes, which is introduced in Section 2. In Section 3.1, we sketch Illumina's sequencing technology. The general preprocessing and QC criteria are described in Section 3.2. In Section 4, we summarize the most important QC metrics. These are illustrated with data from GENESIS‐HD in Section 5. Finally, we provide empirical data on the compression of raw data using the novel DRAGEN ORA in Section 6.

2. Motivating Study: The GENEtic SequencIng Study Hamburg–Davos

GENESIS‐HD is a collaborative effort. The plan was to sequence 9000 individuals in total, of which approximately 8000 were to come from the population‐based Hamburg City Health Study (HCHS); for a description of the HCHS design, see Jagodzinski, Johansen, and Koch‐Gromus (2020). About 1000 subjects were to be selected from patient‐based clinical cohorts with distinct cardiovascular characteristics. The clinical cohorts included subjects with myocardial infarction at a young age (premature MI), patients with coronary artery disease from the INTERCATH study, patients with cardiomyopathy from the GrAPHIC study, and patients with atrial fibrillation from the AFHRI study (Table 1). Recruitment into two different studies could not be avoided due to data regulations, and a total of 69 subjects participated in two different studies. A brief description of the cohorts is provided in the Appendix.

TABLE 1.

Samples included in the GENESIS‐HD project. Sample sizes include subjects who were sequenced repeatedly.

| Study | Specific properties | Sample size before QC^a | Sample size after QC^a |
| --- | --- | --- | --- |
| HCHS | Individuals from population‐based epidemiological Hamburg City Health Study (HCHS) with deep‐phenotyping, including multiomics data, comprehensive imaging, functional testing, and outcome data. | 8123 | 7397 |
| Premature MI | Patients with myocardial infarction at age ≤ 50 years. Harmonized dataset of Young Myocardial Infarction (MIYoung) study, Gutenberg Infarction Study (GIS), and Biomarkers in Acute Cardiac Care (BACC) study. | 517 | 441 |
| INTERCATH | For patients undergoing invasive coronary angiography, genetic analyses were performed in predefined phenotypes: (i) premature onset of coronary artery disease (CAD), (ii) absence of CAD despite numerous cardiovascular risk factors, and (iii) focal versus diffuse angiographic forms of CAD. | 468 | 405 |
| AFHRI | Patients from the Atrial Fibrillation with High‐Risk Individuals (AFHRI) study. Sequencing was performed in subjects with lone atrial fibrillation, which is defined as (unexplained) atrial fibrillation in the absence of conventional risk factors. | 149 | 108 |
| GrAPHIC | Highly selected patients with primary dilated cardiomyopathies from the Geno‐ And PHenotyping of PrImary Cardiomyopathy (GrAPHIC) study. | 47 | 46 |
| GIAB | Genome in a bottle (father/mother/son) | 4/8/72^b | 4/8/69^b |

^a A total of 72 subjects were included in two different studies because the studies were recruiting independently, and double inclusion into these two studies as well as for WGS could not be avoided due to data protection regulations. Three duplicates were removed in the genetic similarity analysis, and one subject during Het/Hom analysis.

^b Sample sizes for GIAB denote the number of replicates of the sequenced parents‐offspring trio.

For QC purposes, 16 samples of a single individual were sequenced in a blinded fashion to assess repeatability. This subject was sequenced before choosing between two different sequencing technologies. Five samples were sequenced three times each to assess reproducibility, and an additional 48 samples were sequenced in two different runs as a pilot phase (96 samples). The samples from the pilot phase are not considered further in this work.

Three subjects from the Genome in a Bottle (GIAB) consortium were also included as part of the QC to assess repeatability (Zook and Salit 2011). More specifically, the Ashkenazim trio (son NA24385, NIST ID HG002; father NA24149, NIST ID HG003; mother NA24143, NIST ID HG004) was sequenced four times in different runs to allow for investigating Mendelian inheritance patterns. The son HG002 was sequenced another 68 times in different runs, and the mother HG004 was additionally sequenced in the last four runs (Table 1).

3. State of the Art in Short‐Read Whole Genome Sequencing

3.1. Second‐Generation Sequencing Technologies

Second‐generation technologies nowadays mostly rely on sequencing by synthesis (SBS). The main steps of Illumina's short‐read SBS, used in GENESIS‐HD, are provided in Figure 1. SBS is a direct sequencing method (Drmanac, Drmanac, and Chui 2002) in which each base position in the DNA is determined individually, using fluorescently labeled nucleotides to identify the bases. For example, Illumina's NovaSeq 6000, used in GENESIS‐HD, is based on a two‐color system to determine the four different bases. Green and red images are interpreted as T and C, respectively. The base A is called when both green and red are present (displayed as yellow in Figure 1b), and unlabeled (gray) signals are interpreted as base G (Figure 1b). Since only two images are required per cycle, both sequencing and data processing can be done faster compared to the four‐color HiSeq technology.
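The two‐color decoding rule described above can be written down in a few lines. The following is a minimal sketch that assumes the green and red channel signals have already been thresholded into present/absent values; real base callers work on continuous intensities, and the function names are hypothetical.

```python
# Two-colour base decoding as described in the text:
# green only -> T, red only -> C, green and red -> A, no signal -> G.
# Assumption: each channel has already been reduced to a boolean.

def call_base(green: bool, red: bool) -> str:
    """Return the base implied by the two channel signals of one cycle."""
    if green and red:
        return "A"
    if green:
        return "T"
    if red:
        return "C"
    return "G"


def call_read(signals):
    """Decode a read from a list of (green, red) tuples, one per cycle."""
    return "".join(call_base(green, red) for green, red in signals)


cycles = [(True, True), (True, False), (False, True), (False, False)]
print(call_read(cycles))  # ATCG
```

Because G is encoded as the absence of signal, a cluster that emits no light at all is indistinguishable from a run of G bases in this scheme.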

FIGURE 1.

Key steps in the sequencing by synthesis approach as utilized, for example, in Illumina's short read sequencing. (a) During the library preparation step, DNA is extracted, purified, and fragmented into short pieces of a well‐defined size. Fragmentation approaches that randomly shear the DNA are preferable because the cut points of fragments vary. Next, “forward read” (violet in Figure 1a and b) and “reverse read” (violet in Figure 1a and b) adapters are attached at both ends of each DNA fragment, allowing fragments to bind to a glass surface. Additionally, attaching a so‐called unique dual index to each sample permits parallel sequencing of multiple samples on the same flow cell. (b) During the first bridge amplification and sequencing step, polymerase chain reaction (PCR) amplification is first used to increase the number of copies of each DNA. Then, reads 1 and 2 are performed using the “forward read” (Figure 1b, row 1) and “reverse read” (Figure 1b, row 2) adapters, respectively. For Read 1, “reverse read” adapters are washed away and the “forward read” adapters binding to the glass surface are kept. Sequencing of Read 1 is then performed from the top in the direction of the glass surface using labeled nucleotides binding to the DNA strand to determine the four bases. Specifically, bases seen as green and red images are interpreted as T and C, respectively. The base A is called in the case of both green and red (Figure 1b, displayed as yellow), and unlabeled (gray) signals are interpreted as base G (see below, Figure 1b). Once Read 1 is completed, a second bridge amplification is performed. Read 2 is then performed similarly, this time by washing away the “forward read” adapters and keeping “reverse read” adapters binding to the glass surface. Read 2 reads can be expected to be of lower quality than Read 1 reads (Tan et al. 2019). 
(c) During the alignment and data analysis step, the read sequences are mapped and aligned to a reference genome, resulting in an assembled sequence. Additionally, variant calling is performed to identify sites where the genotype of the assembled sequences deviates from the genotype of the reference genome. Figure created with BioRender.com.

Polymerase chain reaction (PCR) amplification can introduce biases and errors, impacting the accuracy of the sequencing data, so samples are often prepared using a PCR‐free library preparation kit. PCR‐free library preparation is a set of reagents and protocols designed for sequencing without the need for PCR amplification. The resulting libraries are loaded onto a flow cell (Figure 1a). Although library preparation can be done without PCR amplification, amplification is still involved in the next step. Here, clusters of DNA fragments are bridge‐amplified in a process termed cluster generation (Figure 1b). With this amplification step, millions of copies of single‐stranded DNA are obtained. A disadvantage of this technology is the short length of sequencing reads, generally between 75 and 300 base pairs, which makes it difficult to reliably call structural variants (SVs) and short tandem repeats (STRs) and to resolve homologous regions. An expansion to longer‐read sequences of 5–7 kilobases was recently released (Marx 2023). For a general overview of NGS technologies and generations, the reader is referred to Ari and Arikan (2016), Goodwin, McPherson, and McCombie (2016), or Slatko, Gardner, and Ausubel (2018). Laboratory‐based quality management and quality assurance procedures, including standard operating procedures, can be found in Endrullat et al. (2016).

3.2. Data Analysis Pipeline—From Raw Sequence Data to Files Ready for Association Analysis

The last step of Figure 1 shows that the raw data need to be processed before being used for further statistical analysis. The major steps are identical for all preprocessing pipelines (Figure 2). In the first step, the raw sequence data coming out of the machine as base call (BCL) files are demultiplexed and converted into the standard FASTQ file format. Next, all reads, that is, the raw sequences, are mapped and aligned against a reference genome, yielding a binary alignment map (BAM) file. Mapping is analogous to identifying the area in which a specific piece of a puzzle will fit, while alignment describes the process by which the puzzle piece is placed into the puzzle at its specific position.

FIGURE 2.

Primary and secondary analyses in next‐generation sequencing (NGS). The output files from the sequencer are Illumina's binary base call (BCL) files. They include the sequence information for all subjects pooled prior to the sequencing experiment, and which were sequenced on the same flow cell (glass slide). The primary analysis converts BCL files to de‐multiplexed FASTQ files. The FASTQ files may be compressed to save disk space. Mapping and alignment of the sequences is the first key step of secondary analysis. Either the .fastq.gz files or further compressed files may be used as the starting point. The output is generally written to a binary alignment map (BAM) file, which may also be compressed to save disk space. Single sample variant calling is the next main step. The standard output from variant calling is either genomic variant call format (GVCF) files or variant call format (VCF) files. GVCF is a special type of VCF that records all genomic sites, variant or not, compared to the reference genome. This is important for the joint genotyping, that is, multiple sample variant calling of a cohort, which results in a joint VCF (jVCF) or multisample VCF (MSVCF). Quality control metrics are calculated at four different levels. Additional QC metrics are calculated after multisample calling, which finalizes secondary analyses. Subsequent analyses are termed tertiary analyses in this context.

After mapping and alignment, positions at which a sample deviates from the sequence of the reference genome can be determined. This process is done individually for each sample and is, therefore, termed single‐subject variant calling. The standard output from single sample variant calling is a file in the genomic variant call format (GVCF). For downstream association analysis, the genotypes from all sequenced samples need to be merged. This final step in the data preprocessing is either called joint variant calling, yielding jVCF files, or multisample variant calling (MSVCF; Figure 2). FASTQ files are not required for further analysis once preprocessing has been completed. It is nevertheless reasonable to store them efficiently in case the preprocessing pipeline is updated and all analysis steps from FASTQ to MSVCF need to be repeated. BAM, GVCF, and VCF files may also be compressed to save disk space. Compression tools include DRAGEN ORA (Illumina Inc. 2023), Genozip (Lan et al. 2021), and VCFShark (Deorowicz, Danek, and Kokot 2021), among others.

Several steps of the preprocessing pipeline merit additional discussion.

From FASTQ to single sample GVCF: A brief comparison of preprocessing pipelines—The most frequently used pipeline for processing sequence data as described above is the Genome Analysis Toolkit (GATK; Lin, Chang, and Hsu 2022), which is a collection of command‐line tools for discovering genetic variation in sequencing data. The team from the Broad Institute provides an end‐to‐end workflow, called the GATK Best Practices (Van der Auwera, Carneiro, and Hartl 2013). This Best Practices guideline has more details than the overview provided in Figure 2; see, for example, Koboldt (2020) or Wright, Gola, and Ziegler (2017). The main benefits of the GATK pipeline are that it operates on regular infrastructure and open‐source code and that its software packages are free of charge.

An alternative pipeline is the Dynamic Read Analysis for Genomics (DRAGEN) pipeline, which operates on a separate single‐server hybrid hardware/software platform when run locally. DRAGEN combines a field‐programmable gate array (FPGA) and a CPU, enabling it to run massively parallel functions on the FPGA and single‐threaded functions on the CPU (Miller, Farrow, and Gibson 2015). One advantage of using the DRAGEN is that a single call of the proprietary DRAGEN software allows one to perform all steps from mapping and alignment to single subject variant calling in one run. It is fast (Betschart et al. 2022), and there is no need to switch from one software package at one stage of the preprocessing to another. However, this also restricts flexibility. Furthermore, both the software (closed source) and the hardware are proprietary, and there are additional licensing fees.

Although GATK is the standard pipeline for preprocessing of WGS data, we have recently shown that more variation could be detected and that called variation had higher accuracy when mapping and alignment were done with DRAGEN 3.8.4 compared to GATK 4.2.4.1 (Betschart et al. 2022). For this analysis, we used samples from the GIAB consortium (Zook and Salit 2011), for which a truth set of sequences exists.

Several single sample variant callers are available, which give higher accuracy than the standard GATK variant caller. For example, DeepVariant (Poplin, Chang, and Alexander 2018a) won the first precisionFDA Truth Challenge for short‐read sequencing in 2016 and had the highest accuracy of single nucleotide polymorphisms (SNPs). The winner of the second precisionFDA Truth Challenge for short‐read sequencing in 2020 was DRAGEN, which outperformed other pipelines, in particular in difficult‐to‐call genomic regions (Olson, Wagner, and McDaniel 2020). GATK 4.5.0.0 includes an option to perform variant calling in a so‐called DRAGEN mode, based on DRAGEN version 3.7.8.

From single sample GVCF to joint VCF or multisample VCF—The standard output from single sample variant calling is a GVCF file. The difference to the variant call format (VCF) is that GVCF files supplement the variant calls with information on nonvariant regions. Furthermore, confidence estimates that the regions match the reference genome are provided (Yun et al. 2021). GVCF files thus are a prerequisite for multisample variant calling. Joint genotyping of large cohorts is done by dividing the genome into smaller chunks, such as chromosomes or even smaller chunks of 100 kilobases.
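To make the difference between VCF and GVCF records concrete, the following minimal sketch classifies GVCF data lines as variant calls or nonvariant reference blocks. It assumes the GATK‐style convention in which a reference block carries only the symbolic `<NON_REF>` allele in the ALT column and an `END` key in the INFO column; other callers may use a different symbolic allele. The toy records and function names are made up for illustration.

```python
def parse_gvcf_record(line):
    """Classify one GVCF data line as a variant call or a reference block."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt, info = fields[0], int(fields[1]), fields[3], fields[4], fields[7]
    # Real alternate alleles, ignoring the symbolic <NON_REF> placeholder.
    alts = [a for a in alt.split(",") if a != "<NON_REF>"]
    if alts:
        return {"kind": "variant", "chrom": chrom, "start": pos,
                "end": pos + len(ref) - 1, "ref": ref, "alts": alts}
    # Reference block: END in INFO gives the last position of the block.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return {"kind": "ref_block", "chrom": chrom, "start": pos,
            "end": int(info_map.get("END", pos))}


def summarize(lines):
    """Return (number of variant records, bases covered by reference blocks)."""
    records = [parse_gvcf_record(l) for l in lines if not l.startswith("#")]
    n_variants = sum(r["kind"] == "variant" for r in records)
    ref_bases = sum(r["end"] - r["start"] + 1
                    for r in records if r["kind"] == "ref_block")
    return n_variants, ref_bases


# Toy records (hypothetical values): two reference blocks around one SNP.
TOY_GVCF = [
    "\t".join(["chr1", "10001", ".", "T", "<NON_REF>", ".", ".", "END=10050", "GT:DP:GQ", "0/0:34:99"]),
    "\t".join(["chr1", "10051", ".", "A", "G,<NON_REF>", "812.6", ".", ".", "GT:DP:GQ", "0/1:36:99"]),
    "\t".join(["chr1", "10052", ".", "C", "<NON_REF>", ".", ".", "END=10100", "GT:DP:GQ", "0/0:35:99"]),
]
print(summarize(TOY_GVCF))  # (1, 99)
```

A plain VCF of the same sample would contain only the middle record; the two reference blocks are what later allows a joint caller to distinguish "homozygous reference" from "no data" at sites that are variant in other samples.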

Harmonizing the representation of overlapping alleles is algorithmically challenging. The computational time can be high, and joint calling scales super‐linearly in both runtime and output size when the cohort grows and more rare variants are genotyped (Lin, Rodeh, and Penn 2018). The standard approach to joint genotyping is GATK's GenotypeGVCFs tool (Poplin, Ruano‐Rubio, and DePristo 2018b). As with alternative packages, such as GLnexus (Lin, Rodeh, and Penn 2018; Yun et al. 2021) or SparkGOR (Guðbjartsson, Ísleifsson, and Ragnarsson 2022a), GVCFs from all sequenced samples are transformed into a VCF that contains a complete matrix for every variant of all the samples. The disadvantage of these joint calling approaches is that the joint calling needs to start from scratch again if samples are added or removed.
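The idea of turning per‐sample GVCF information into one complete genotype matrix can be illustrated with a deliberately simplified sketch. It is not how GenotypeGVCFs or GLnexus work internally: it ignores genotype likelihoods, multiallelic sites, and the harmonization of overlapping alleles mentioned above. The input structure and all names are hypothetical.

```python
def joint_matrix(samples):
    """Build a site-by-sample genotype matrix.

    samples maps a sample name to a dict with
      "variants":   {(chrom, pos, ref, alt): genotype string}
      "ref_blocks": [(chrom, start, end), ...]  # confidently reference regions
    """
    names = sorted(samples)
    # Union of all variant sites seen in any sample.
    sites = sorted({site for s in samples.values() for site in s["variants"]})
    matrix = {}
    for site in sites:
        chrom, pos = site[0], site[1]
        row = []
        for name in names:
            s = samples[name]
            if site in s["variants"]:
                row.append(s["variants"][site])
            elif any(c == chrom and start <= pos <= end
                     for c, start, end in s["ref_blocks"]):
                row.append("0/0")   # covered by a reference block
            else:
                row.append("./.")   # no information at this site
        matrix[site] = row
    return names, matrix


samples = {
    "S1": {"variants": {("chr1", 100, "A", "G"): "0/1"},
           "ref_blocks": [("chr1", 1, 99), ("chr1", 101, 200)]},
    "S2": {"variants": {("chr1", 150, "C", "T"): "1/1"},
           "ref_blocks": [("chr1", 1, 149)]},
    "S3": {"variants": {}, "ref_blocks": [("chr1", 1, 120)]},
}
names, matrix = joint_matrix(samples)
for site, row in matrix.items():
    print(site, dict(zip(names, row)))
```

In this toy example, S3 receives `0/0` at position 100 but `./.` at position 150, because its reference blocks end at position 120. The sketch also shows why adding or removing a sample changes the set of sites and therefore, in these approaches, requires rebuilding the whole matrix.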

Another approach, in which multisample calling could be done iteratively in batches of, say, 1000 samples, would be preferable. If samples are added, a batch or several batches could be added. This approach would be especially beneficial in studies with several thousands of samples. Illumina's Iterative gVCF Genotyper (Illumina Inc. 2022) follows this workflow, allowing users to incrementally aggregate new batches of samples with existing batches, and it has three major advantages. First, multisample variant calling does not need to be repeated whenever new samples become available. Second, the size of the required input data is substantially reduced. Third, batches can be analyzed in parallel. For each batch, a census file is generated, which stores summary statistics of all the variants and homozygous reference blocks among samples in the batch (cns: census file). Information about the samples in the batch is also stored (cht: cohort file). Once the census files of the batches to be merged have been generated, users can aggregate them into a single global census file.
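The principle of batch‐wise aggregation can be sketched with per‐batch summary statistics that are merged by simple addition. This is only a conceptual illustration: the dictionaries below are hypothetical and do not reflect the actual census (cns) or cohort (cht) file formats, and the sketch assumes diploid genotypes with samples that lack a variant counted as homozygous reference.

```python
from collections import Counter


def build_census(batch_genotypes):
    """Summarize one batch.

    batch_genotypes: {sample: {variant_id: alternate allele count (0, 1, or 2)}}
    """
    census = {"n_samples": len(batch_genotypes), "alt_counts": Counter()}
    for calls in batch_genotypes.values():
        for variant, alt_count in calls.items():
            census["alt_counts"][variant] += alt_count
    return census


def merge_censuses(censuses):
    """Aggregate batch summaries into a global summary by adding counts."""
    merged = {"n_samples": 0, "alt_counts": Counter()}
    for census in censuses:
        merged["n_samples"] += census["n_samples"]
        merged["alt_counts"].update(census["alt_counts"])
    return merged


def alt_allele_frequency(census, variant):
    """Alternate allele frequency, assuming two alleles per sample."""
    return census["alt_counts"][variant] / (2 * census["n_samples"])


batch1 = build_census({"s1": {"chr1:100:A:G": 1}, "s2": {"chr1:100:A:G": 2}})
batch2 = build_census({"s3": {"chr1:100:A:G": 0, "chr1:150:C:T": 1}})
global_census = merge_censuses([batch1, batch2])
print(global_census["n_samples"], alt_allele_frequency(global_census, "chr1:100:A:G"))
```

When a further batch arrives, only its own summary has to be built and merged; the earlier batches are not reprocessed, and the batch summaries can be computed in parallel.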

Confidence in base calls (genotype calls), Phred score, and coverage—The more reads that are available at a particular chromosomal position (i.e., base position), the higher the confidence in the genotype call at this particular position. The quality of base calling is expressed as the Phred quality score $Q$, which is logarithmically related to the base calling error probability $p$: $Q = -10 \log_{10} p$. Phred scores of 10 and 30 correspond to base error probabilities of 1 in 10 and 1 in 1000, respectively.
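The relation between Phred score and error probability, $Q = -10 \log_{10} p$, can be checked with a short sketch. The decoding of FASTQ quality characters assumes the common Phred+33 ASCII encoding; the function names are hypothetical.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Phred quality score Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_fastq_quality(qual_string: str, offset: int = 33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual_string]


for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q))

print(decode_fastq_quality("+?I"))  # [10, 30, 40]
```

A base with Q30 thus has an error probability of 1 in 1000, which is why the percentage of bases with at least Q30 is a common run‐level quality summary.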

The average number of reads that align to a specific chromosomal position is termed coverage, with the median coverage around 35× in many current projects. The higher the coverage, the higher the degree of confidence in the genotype call at this particular base position.
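As a back‐of‐the‐envelope illustration, the expected average coverage can be approximated as the number of sequenced bases divided by the genome size, that is, number of reads times read length divided by genome length. The sketch below uses this approximation; the genome size of 3.1 billion bases is an assumed round figure for the human genome, and duplicates, unmapped reads, and trimmed reads are ignored.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate average coverage: sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 3_100_000_000  # assumption: approximate human genome size in bases
READ_LENGTH = 150            # read length in bases, as configured in the study

n = reads_needed(35, READ_LENGTH, GENOME_SIZE)
print(n, round(expected_coverage(n, READ_LENGTH, GENOME_SIZE), 2))
```

Under these assumptions, roughly 720 million reads of 150 bases (about 360 million read pairs) are needed per genome for an average coverage of 35×.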

Terminology: Primary, secondary, and tertiary analysis—Many sequencing companies, laboratories, and bioinformaticians call all the steps described in Figure 2 secondary analysis. In this context, all laboratory steps until the completion of sequencing are named primary analysis (Oliver, Hart, and Klee 2015; Pereira, Oliveira, and Sousa 2020). Finally, downstream genetic association analyses involving all samples are termed tertiary analyses in this context. The terminology is different in genetic epidemiology. Here, GWAS is usually called primary analysis and approaches for prioritizing the most important results secondary analysis or fine‐mapping (Cantor, Lange, and Sinsheimer 2010). In clinical trials, the terminology is yet different.

4. Important Quality Control Criteria: Concepts

QC procedures in WGS analysis have received less attention than statistical approaches for the association analysis of WGS data. However, several protocols for data preprocessing and QC have been made available (Adelson, Renton, and Li 2019; Craig, Vena, and Ramkissoon 2012; Guo, Zhao, and Sheng 2014b; Panoutsopoulou and Walter 2018). QC criteria have also been recently described for the TOPMed program (Taliun, Harris, and Kessler 2021) and the UK Biobank (Halldorsson, Eggertsson, and Moore 2022), which involved more than 50,000 and more than 150,000 WGS, respectively. The QC criteria that we aimed to apply to the GENESIS‐HD project are summarized in the following sections. They comprise a total of 35 QC criteria, nine of which are to be applied to the raw reads, that is, FASTQ files (Section 4.1). An additional 10 criteria for investigating the mapping quality were formulated for use with the BAM files after mapping and alignment (Section 4.2). A set of seven criteria was to be applied to the GVCF files after single sample variant calling (Section 4.3). Finally, a set of nine criteria is provided for variant filtering on the MSVCF files after joint calling (Section 4.4).

As outlined in Figure 2, QC is performed at four different stages during a WGS study. Below, we describe the most important QC criteria according to these stages.

4.1. Quality Control Step 1: Analysis of Raw Sequence Data

After de‐multiplexing and converting BCL files to FASTQ files, the first QC step can be performed.

  • QC criterion 1.1 – number of reads: The expected number of reads for an S4 flow cell on the NovaSeq 6000 is between 16 and 20 billion paired‐end reads (Illumina Inc. 2019).

  • QC criterion 1.2 – read length distribution: The NovaSeq 6000 was configured for 150 bp reads, so any reads shorter than 150 bp could indicate issues with the DNA fragmentation step or DNA degradation. Inserts smaller than 150 bp result in shorter reads during bcl2fastq conversion. Furthermore, DNA fragment size tends to decrease with sample age (Craig, Vena, and Ramkissoon 2012).

  • QC criterion 1.3 – per sequence GC content distribution: Per sequence GC content distribution is displayed as a histogram or an estimated density of the mean GC content for all reads. Reads sequenced from a random library are expected to follow a normal distribution centered on the overall GC content of the sequenced organism. A deviation from the normal distribution suggests that the sequenced library might either have been contaminated or that adapter sequences might have been present. Unusually shaped distributions with sharp peaks often suggest the presence of adapter sequences, whereas broader or multiple peaks indicate contamination with a different species. In contrast, a shift in the distribution can suggest a systematic bias (Pfeifer 2017). While all samples can be visually inspected to assess this QC criterion, establishing a quantitative threshold for large‐scale sequencing studies is challenging.

  • QC criterion 1.4 – per base quality profile: The per‐base quality profile can be displayed as quality score boxplots for each sequencing cycle. DNA templates in each cluster are synchronously elongated one base at a time in each cycle. The signal‐to‐noise ratio decreases over the cycles due to a variety of factors, so a decrease in base quality towards the end of reads is expected. In paired‐end sequencing, the R2 reads have a lower quality profile than the R1 reads (Tan et al. 2019). Quality profiles depend on the sequencing instrument model and the sequencing center (Ma, Shao, and Tian 2019). Other factors that may impair base quality are phasing and prephasing (Brockman, Alvarez, and Young 2008; Guo et al. 2014a; Thrash, Arick, and Peterson 2018) and template damage due to laser imaging or mechanical issues on the sequencing platform (Guo et al. 2014a).

  • QC criterion 1.5 – per sequence quality profile: The per sequence quality score can be displayed as a histogram of the mean sequence quality for all reads per run or as the percentage of bases with at least Q30 scores for each run. This display should be separated into R1 and R2 reads. An alternative grouping of samples may be by well position because the same unique dual indexes are used in the same well position across multiple runs. The factors influencing this metric are the same as the ones for the per‐base quality profile (QC criterion 1.4). Good sequencing runs show a sharp peak at high‐quality values, while bimodal or broader distributions indicate sets of poor‐quality reads. Poor‐quality runs also show a low percentage of Q30 bases.

  • QC criterion 1.6 – per base sequence content: Per base sequence composition represents the relative proportion of each nucleotide by sequencing cycle. The relative proportion of each nucleotide at each read position should reflect the overall nucleotide composition of the genome, without any significant differences in the nucleotide composition along the read (Pfeifer 2017). In PCR‐free WGS, biases originate mainly from adapter contamination or nonrandom DNA fragmentation.

  • QC criterion 1.7 – per base N content: The per base N content at each sequencing cycle represents the proportion of undetermined bases, that is, bases that could not be confidently called as A, T, C, or G. As sequencing quality decreases along cycles, a slight increase towards the end of reads is expected. However, sharp spikes indicate issues in the sequencing cycle. This QC criterion is closely related to QC criterion 1.6.

  • QC criterion 1.8 – sequence duplication: Sequence reads may be duplicated for various reasons. First, natural duplicates arise when independent DNA fragments obtained during fragmentation are incorporated into the library (Bailey, Krajewski, and Ladunga 2013). These fragments either arise at different genomic locations with high sequence similarities (Zhou, Ng, and Drautz‐Moses 2019) or from the same genomic location if the amount of input DNA is high. However, these should be negligible if random fragmentation is done in library preparation. Second, PCR duplicates arise when a PCR‐based protocol is used during library preparation (Craig, Vena, and Ramkissoon 2012). Third, optical duplicates originate in nonpatterned flow cells when a single cluster is mistakenly identified as two or more clusters during laser image analysis (Zhou and Rokas 2014). Finally, when a released library molecule hybridizes in an adjacent well, exclusion amplification duplicates may occur in patterned flow cells, such as the ones used in the NovaSeq instrument. The frequency of sequence duplication depends on multiple factors; the duplication fractions for TruSeq PCR‐free DNA libraries on the NovaSeq 6000 ranged from 6.22% (Arora, Shah, and Johnson 2019) to 10.78% (Broad Institute 2020). Nevertheless, the effect of duplicates on called variants is generally negligible (Ebbert, Wadsworth, and Staley 2016).

  • QC criterion 1.9 – adapter content: The adapter content is represented by the number of adapter sequence occurrences in the reads. If adapters are trimmed during bcl2fastq conversion, many reads will be shorter than 150 bp, and per base frequencies by sequencing position will vary.
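Several of the raw‐read criteria above (read length distribution, per sequence GC content, per base quality profile, and per base N content) are simple summaries of the FASTQ records. The following minimal sketch computes them for a tiny made‐up FASTQ snippet. It assumes uncompressed four‐line records and Phred+33 quality encoding, reports only the per‐cycle mean quality rather than full boxplots, and is meant as an illustration rather than a replacement for dedicated QC software.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()              # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def fastq_qc(handle, offset=33):
    """Compute simple raw-read QC summaries (Phred+33 encoding assumed)."""
    lengths = Counter()                # criterion 1.2: read length distribution
    gc_fractions = []                  # criterion 1.3: per sequence GC content
    qual_sum, n_count, depth = [], [], []
    for seq, qual in read_fastq(handle):
        lengths[len(seq)] += 1
        gc_fractions.append(sum(b in "GC" for b in seq) / len(seq) if seq else 0.0)
        for i, (base, q) in enumerate(zip(seq, qual)):
            if i == len(depth):        # first read that reaches this cycle
                depth.append(0)
                qual_sum.append(0)
                n_count.append(0)
            depth[i] += 1
            qual_sum[i] += ord(q) - offset   # criterion 1.4: per base quality
            n_count[i] += base == "N"        # criterion 1.7: per base N content
    return {
        "read_lengths": dict(lengths),
        "gc_fraction_per_read": gc_fractions,
        "mean_quality_per_cycle": [s / d for s, d in zip(qual_sum, depth)],
        "n_fraction_per_cycle": [n / d for n, d in zip(n_count, depth)],
    }


# Toy FASTQ content (made up): three reads, one of them shorter than the others.
TOY_FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nGGNA\n+\nII#5\n@r3\nAT\n+\n5I\n"
metrics = fastq_qc(io.StringIO(TOY_FASTQ))
print(metrics["read_lengths"])          # {4: 2, 2: 1}
print(metrics["n_fraction_per_cycle"])  # [0.0, 0.0, 0.5, 0.0]
```

For real data, the same function could be applied to a handle opened with `gzip.open(path, "rt")`, and the summaries would be computed separately for R1 and R2 reads.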

4.2. Quality Control Step 2: Analysis of Mapped and Aligned Data

Once mapping and alignment have been completed, QC step 2 can be performed. Many of the QC criteria at this stage deal with properties of coverage and read depth.

  • QC criterion 2.1 – average depth of coverage: The average depth of coverage is defined as the average number of overlapping reads within the sequenced area.

    We recommend computing coverage metrics without double‐counting bases with overlapping R1 and R2 reads, that is, by using the --qc-coverage-ignore-overlaps=true option in DRAGEN.

  • QC criterion 2.2 – breadth of coverage: The breadth of coverage is the percentage of the reference genome covered by aligned reads at a particular depth (Sims et al. 2014). It depends on various factors, including the sequencing platform (Hwang, Lee, and Park 2014; Lam, Clark, and Chen 2012) and the sequencing depth (Wang et al. 2011).

    Generally, one can focus on the genome percentage with a coverage of ≥1× and plot the breadth of coverage for each sample.

  • QC criterion 2.3 – insert size distribution: The insert size is the length of the sequence between the adapters. Inserts shorter than the length of a single read can lead to adapter contamination, also termed adapter read‐through (Costello, Fleharty, and Abreu 2018). Bimodal and broader insert size distribution can indicate the presence of chimeric library segments (Anvar, Khachatryan, and Vermaat 2014). While insert sizes should be larger than twice the read length, the TruSeq PCR‐free library on a MiSeq had decreasing quality with increasing insert size (Huptas, Scherer, and Wenning 2016).

  • QC criterion 2.4 – read depth distribution: The read depth distribution is the profile of read depths across all sequenced loci, displayed as a histogram of depth across the entire genome. Several factors influence the shape of this distribution, such as amplification techniques, which can lead to distribution bias (Hou, Wu, and Shi 2015), or the choice of the reference genome (Pan, Kusko, and Xiao 2019). In short‐read sequencing, studies have reported different expected distributions, such as Poisson (Wang, Sui, and Wu 2016), Gamma (Sarin et al. 2008), or even normal distributions.

  • QC criterion 2.5 – properly paired reads: Reads mapped to a reference genome are categorized as either mapped or unmapped. Mapped reads can be uniquely mapped or multimapped, with multimapping caused by repetitive regions in the reference genome (Yang et al. 2019). Mapping failures may be caused by PCR errors, sequencing errors, genomic diversity, reads from regions absent from the reference genome, or contamination from a different species.

    Mapped paired‐end reads may be further categorized into concordantly mapped reads, also termed properly paired reads, and discordantly mapped reads. Reads are properly paired when R1 and R2 reads are mapped in pairs in close vicinity. Reads are mapped discordantly, for example, when only one of the reads has been mapped or when the distance between reads exceeds a given threshold.

  • QC criterion 2.6 – cross‐sample contamination: There are two primary sources for sequence contamination: cross‐species contamination, that is, contamination from a species other than human, and within‐species contamination. While most reads from other species, such as bacteria or viruses would be unmapped and not affect the rest of the pipeline, within‐species cross‐sample contamination is harder to detect and may lead to more detected variants. Within‐species contamination may occur during any stage of the wet‐lab process, from sample collection to library preparation and flow cell loading (Moustafa, Xie, and Kirkness 2017). Index swapping or misassignment during de‐multiplexing can also cause within‐species contamination.

  • QC criterion 2.7 – duplicate samples: Samples may be sequenced repeatedly in WGS studies for QC purposes or because the same subject was included in different studies. Accidental duplications and the inclusion of monozygotic (MZ) twins are additional possibilities. Duplicate samples should be identified during QC. The standard approach involves estimating the kinship coefficient, which is the probability that two homologous alleles drawn from each of two individuals are identical by descent (IBD), and the Jacquard $\Delta_7$ coefficient, that is, the probability that a pair of subjects has two alleles IBD. In practice, the kinship coefficient is often plotted against the probability of sharing zero alleles IBD. Manichaikul et al. (2010) proposed to classify subjects as identical or MZ twins if the kinship coefficient exceeds $1/2^{3/2}$ (approximately 0.354), and the proportion of zero alleles shared IBD is <0.1.

  • QC criterion 2.8 – sex chromosome aneuploidies (SCAs): To apply this criterion, sex needs to be estimated from the read alignments. This involves scaling the estimated X and Y coverages by the estimated autosomal coverage for obtaining the X and Y chromosomal dosages. A scatterplot of X versus Y chromosomal dosages may be generated. An automated classification of male subjects may fail in WGS studies because of the age‐dependent mosaic loss of the Y chromosome (Guo, Dai, and Zhou 2020). Subjects with SCAs may be identified from this plot. For example, individuals with Klinefelter syndrome (XXY) are expected to have a dosage of two for the X chromosome and one for the Y chromosome.

  • QC criterion 2.9 – Sex chromosome composition versus reported sex: The estimated sex may be compared against the reported sex. Sex inconsistencies between observed and reported sex may be caused by handling errors, disagreement between genomic and self‐reported sex, or chromosomal anomalies. On average, only half of the sample swaps may be detected by investigating differences between the observed and the reported sex.

  • QC criterion 2.10 – duplicate marking: Duplicate marking is similar to sequence duplication (QC criterion 1.8). However, duplicate reads are defined here as fragments for which both ends, not just one read, map to the same position during alignment.
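
Two of the criteria above lend themselves to a short illustration. The following is a minimal sketch, not the study's implementation: it applies the duplicate/MZ twin rule of QC criterion 2.7, reading the kinship threshold as 1/2^(3/2) (about 0.354), and computes X and Y chromosomal dosages as in QC criterion 2.8. The function names are hypothetical, and the factor of 2 in the dosage is an assumption chosen so that diploid autosomes correspond to a dosage of two.

```python
# Thresholds as quoted in the text for Manichaikul et al. (2010).
KINSHIP_DUPLICATE_THRESHOLD = 1 / 2 ** 1.5  # about 0.354
IBD0_DUPLICATE_THRESHOLD = 0.1


def is_duplicate_or_mz_twin(kinship: float, ibd0: float) -> bool:
    """Return True if a pair of samples looks like one subject or MZ twins."""
    return kinship > KINSHIP_DUPLICATE_THRESHOLD and ibd0 < IBD0_DUPLICATE_THRESHOLD


def sex_chromosome_dosages(x_cov: float, y_cov: float, autosomal_cov: float):
    """Scale X and Y coverage by the autosomal coverage.

    Assumption: the autosomal coverage reflects two chromosome copies,
    so a dosage of 2 means two copies of the sex chromosome.
    """
    if autosomal_cov <= 0:
        raise ValueError("autosomal coverage must be positive")
    return 2 * x_cov / autosomal_cov, 2 * y_cov / autosomal_cov


# Hypothetical Klinefelter-like (XXY) sample: X as deep as the autosomes, Y half as deep.
x_dosage, y_dosage = sex_chromosome_dosages(35.0, 17.5, 35.0)
print(x_dosage, y_dosage)
```

Plotting the two dosages against each other for all samples gives the scatterplot described under QC criterion 2.8.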

4.3. Quality Control Step 3: After Single Sample Variant Calling

  • QC criterion 3.1 – SNP concordance with database: Concordance with an external database, such as dbSNP (Sherry, Ward, and Kholodov 2001) or the 1000 Genomes Project (Byrska‐Bishop, Evani, and Zhao 2021), measures the proportion of previously reported variants. Variants that are not found in these databases are considered to be novel mutations. In homogeneous populations, samples showing more or fewer novel variants than expected could indicate issues in the wet‐lab or dry‐lab processes (Jew and Sul 2019). SNP concordance values vary substantially between publications, and they range from approximately 70% to almost 100% (Ebbert, Wadsworth, and Staley 2016; Koboldt et al. 2010; Xu, Lin, and Tang 2019). Concordance depends, for example, on the underlying population, the coverage, the sequencing platform, and the bioinformatics pipeline, including variant filters. It may also be computed during QC step 4, that is, after multisample calling.

  • QC criterion 3.2 – allelic balance: Allelic balance is the proportion of alternative read counts among the total read count at a specific position. It serves as a variant quality indicator and is often used to filter variants (Abnizova, Te Boekhorst, and Orlov 2017; Jew and Sul 2019; Lam, Clark, and Chen 2012). Heterozygous sites for diploid organisms are expected to show an allelic balance centered around 0.5 (Guo et al. 2014a; Muyas, Bosio, and Puig 2019). However, in practice, the mean allelic balance tends to be biased towards the reference allele, with a reported value of 0.483 (range: 0.447−0.499) (Guo, Samuels, and Li 2013). Because the depth differs between subjects, the allelic balance (AB) distribution $AB_i$ of subject $i$ is modeled as a binomial distribution with parameters $D_i$ and $p_i$, where the depth $D_i$ of subject $i$ is modeled by a Poisson distribution with parameter $\lambda_D$, and the variant frequency $p_i$ follows a normal distribution with mean $\mu = p$ and variance $\sigma^2$, which denotes the error of sample $i$ in the pooling process (Guo, Samuels, and Li 2013).

  • QC criterion 3.3 – sample call missingness: We define the sample call missingness as the ratio of the number of missing genotypes over genome length. Sample call missingness depends on various technological factors, such as the DNA quality, sequencing coverage, sequencing technology, the sequencing instrument, and the sites considered. A common threshold is to exclude samples with more than 5% missing genotypes from further analysis (Somineni, Nagpal, and Venkateswaran 2021). This criterion may also be used at the subject level after multisample calling in QC step 4.

  • QC criterion 3.4 – Het/Hom ratio: The Het/Hom ratio is the ratio of heterozygous to nonreference homozygous variants. Mathematical analysis has shown that the Het/Hom ratio should be 2.0 for polymorphisms in the Hardy–Weinberg equilibrium (HWE; Guo et al. 2014a). However, empirical work revealed that the Het/Hom ratio is ancestry‐specific, necessitating the inference of ancestry information for all samples prior to its use. Because of population heterogeneity, expected Het/Hom ratios differ substantially between populations (Supernat et al. 2018; Wang et al. 2015). The expected Het/Hom ratio is around 1.5–1.6 for EUR‐like populations (Wang et al. 2015). For AFR‐like populations, it is approximately 1.9–2.0; a higher variability from 1.3 to 1.7 was observed for AMR‐like populations (Wang et al. 2015). The variability was low for EAS‐like populations (1.3–1.35) and between 1.5 and 1.6 for SAS‐like populations (Wang et al. 2015). Furthermore, different values are obtained for SNPs, insertions, and deletions.

  • QC criterion 3.5 – genetic similarity and genetic ancestry groups: While multiethnic studies, such as the TOPMed program (Taliun, Harris, and Kessler 2021) aim to adjust for population heterogeneity, that is, genetic dissimilarity between subjects, during association analysis, GENESIS‐HD aims for a genetically similar study population. Two approaches may be used in the literature. First, principal components (PCs) may be estimated (Price et al. 2006) and integrated into the general population substructure as determined by the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015). The more SNPs available for integration, the better the identification of genetic similarity (Koenig, Yohannes, and Nkambule 2023). The PC approach might not represent the true branched population structure if estimated directly from the study data (Elhaik 2022), a limitation common to many data reduction techniques. Hierarchical clustering is an alternative method to PCs for identifying genetic ancestry groups, commonly used in studies on genomic diversity (Chen, Zhang, and Kang 2012). A single linkage agglomeration can be used, for example, using the Nei genetic distance (Nei 1978) or the kinship coefficient (Kirkpatrick, Ge, and Wang 2019). Both genetic similarity and genetic ancestry can be a QC filter, for example, to detect a mismatch between self‐reported and genetically inferred genetic ancestry. However, genetic diversity is critical for the discovery, replication, and fine mapping of genetic loci associated with a phenotype (Peterson, Kuchenbaecker, and Walters 2019).

  • QC criterion 3.6 – Ti/Tv ratio: The Ti/Tv ratio is the ratio of transitions to transversions among SNPs and indicates overall SNP calling quality. In humans, the expected Ti/Tv ratio is around 2.0 across the whole genome, and 2.1 is the often‐cited point estimate (DePristo, Banks, and Poplin 2011; The International HapMap Consortium 2003). In coding regions, the Ti/Tv ratio is generally reported to be between 3 and 4 (Challis, Yu, and Evani 2012; Marth, Yu, and Indap 2011). The Ti/Tv ratio depends on the specific variants included in its calculation and on the filtering of samples, such as population heterogeneity. The Ti/Tv ratio may be used as a QC criterion beyond QC step 3 and act as a general quality indicator for the study.

  • QC criterion 3.7 – malformed GVCF: The GVCFs may be malformed and should be tested for validity. Specifically, a site may have identical reference and alternative alleles.
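
The Het/Hom ratio (QC criterion 3.4) and the Ti/Tv ratio (QC criterion 3.6) defined above are simple counts. The following is a minimal sketch under stated assumptions: genotypes are given as pairs of allele indices with 0 denoting the reference allele, SNPs are given as (reference, alternative) base pairs, and the function names are hypothetical. It is meant to illustrate the definitions, not to reproduce the output of any particular variant caller.

```python
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to nonreference homozygous genotype calls."""
    calls = list(genotypes)
    het = sum(1 for a, b in calls if a != b)
    hom_alt = sum(1 for a, b in calls if a == b and a != 0)
    if hom_alt == 0:
        raise ValueError("no nonreference homozygous calls")
    return het / hom_alt


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions among single-base substitutions."""
    pairs = list(snps)
    ti = sum(1 for pair in pairs if pair in TRANSITIONS)
    tv = len(pairs) - ti
    if tv == 0:
        raise ValueError("no transversions")
    return ti / tv


# Toy data: three heterozygous and two homozygous alternative calls.
print(het_hom_ratio([(0, 1), (0, 1), (1, 1), (0, 1), (1, 1)]))
# Toy data: two transitions and one transversion.
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

As noted in the text, both ratios depend on which variants enter the calculation, so the variant set should be fixed before values are compared across samples.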

4.4. QC Step 4: After Multisample Variant Calling

  • QC criterion 4.1 – Minor allele count (MAC): The minor allele count, also termed alternate allele count, removes sequencing errors and genetic markers with very low variability. A standard filter is to remove singletons, that is, mutations that occur only once among all subjects. However, if singletons are excluded, that is, if MAC = 1 is used as a filter, power may be reduced in the association study because only founder mutations are kept. MAC is also related to sequencing error fractions (Kofler, Orozco‐terWengel, and De Maio 2011).

  • QC criterion 4.2 – read depth (DP)/sequencing depth: Diverse criteria have been proposed for using read depth (DP) at a site for filtering. For example, the mean sequencing depth has been used as a filter, and in one article it was set to be ≥10 in the overall study population (Kelly, Sun, and He 2022). By investigating DP separately in cases and controls, differential bias may be avoided in association analysis (Chen, Graf, and Chen 2021). Setting minimum and maximum DP thresholds for the alternate allele ensures high‐quality polymorphisms. For example, Chen, Graf, and Chen (2021) required a DP of at least 4, whereas Natarajan, Peloso, and Zekavat (2018) specified a DP range of 10–200 for the alternate allele. Slightly weaker is the criterion that the DP had to be ≥8 in at least 90% of the samples containing the alternate allele (Chen, Graf, and Chen 2021).

  • QC criterion 4.3 – missing genotype calls: Sites with a large proportion of missing genotypes should be filtered out to avoid bias in association analysis (Ziegler, Thompson, and König 2008). In general, sites are filtered if more than 5% of the genotypes are missing (Kelly, Sun, and He 2022; Natarajan, Peloso, and Zekavat 2018). However, thresholds vary from 1% (Gilly, Park, and Png 2020) to 10% (Erikson, Bodian, and Rueda 2016). Filtering may be done for the overall sample or by group, such as separately for cases and controls to avoid potential bias in association analyses.

  • QC criterion 4.4 – Hardy–Weinberg equilibrium: Departure from the HWE can be caused by factors such as population stratification, copy number variation, inbreeding, assortative mating, selection, or migration (Ziegler, Van Steen, and Wellek 2011). In most human populations, these factors minimally impact HWE. However, selection plays an important role in some human traits, such as immunity (Mathieson, Lazaridis, and Rohland 2015; Ziegler, Van Steen, and Wellek 2011), and targets many chromosomal regions (Akey 2009). If selection plays a role, Hardy–Weinberg filtering should be done with care so that truly associated sites are not filtered out.

    One additional reason for a deviation from HWE is segmental duplications. If the reference genome contains only one copy of the two regions, both chromosomal segments align to the same part of the reference genome, making subjects appear heterozygous. Centromeres and telomeres are enriched with segmental duplications, which are thus not randomly distributed across the genome (Abdullaev, Umarova, and Arndt 2021). Such recurrent heterozygous sites can be filtered by investigating deviation from HWE.

    An inbreeding coefficient may be used to filter for deviation from HWE. For diallelic genetic markers, the most commonly used inbreeding coefficient is based on the ratio of observed to expected heterozygotes (Ziegler and König 2010), and it is estimated by $\hat{f} = 1 - \frac{n_{12}}{2n\hat{p}(1-\hat{p})}$, where $n_{12}$ denotes the number of observed heterozygotes, $n$ denotes the total sample size, and $\hat{p}$ denotes the allele frequency of the minor allele. To filter recurrent heterozygous sites, polymorphisms having a low inbreeding coefficient can be excluded, and thresholds of $\hat{f} < -0.3$ or $\hat{f} < -0.9$ have been used in the literature (Morrison, Huang, and Yu 2017; Natarajan, Peloso, and Zekavat 2018). However, the negative thresholds only filter sites with excessive heterozygosity but not sites with excessive homozygosity. We thus prefer filter criteria that filter out sites with any deviation from HWE.

    A wealth of literature is available on how the strength of deviation from HWE can be measured, and how corresponding statistical tests can be conducted (Wellek, Goddard, and Ziegler 2010; Ziegler and König 2010; Ziegler et al. 2011). The analysis of deviation from HWE on the X chromosome requires special attention and specific test statistics (Graffelman and Weir 2016; Wellek and Ziegler 2019).

    For variant filtering, a test for deviation from HWE is performed, excluding polymorphisms with p‐values below a specific threshold, ranging between $10^{-3}$ and $10^{-14}$ (Chen, Graf, and Chen 2021; Morrison, Huang, and Yu 2017; Zhan et al. 2021). We recommend using a liberal filtering threshold between $10^{-9}$ and $10^{-12}$, especially when selection might play a role.

  • QC criterion 4.5 – minor allele frequency (MAF): Instead of filtering by MAC, the MAF may be used for filtering (Ziegler, Thompson, and König 2008), and thresholds as low as 0.1% (Somineni, Nagpal, and Venkateswaran 2021) have been suggested. The MAF may also be considered as a filter for statistical power as power is low in single marker analysis in the case of a low MAF.

  • QC criterion 4.6 – genotype quality/mapping quality of polymorphism: A polymorphism may also be filtered by its genotype quality, and the root mean square of the mapping quality of the polymorphism may serve as a filter criterion. For example, it was set to ≥30 in Chen, Graf, and Chen (2021).

  • QC criterion 4.7 – allelic balance: Allelic balance serves as a QC criterion at both the subject (QC criterion 3.2) and the polymorphism level. A heterozygous genotype call should have an allelic balance of around 0.5. A filter criterion might require an allelic balance between 0.25 and 0.75 for a polymorphism (Chen, Graf, and Chen 2021). An allelic balance of at least 15% in at least one sample with an alternate allele is another filtering criterion (Taliun, Harris, and Kessler 2021).

  • QC criterion 4.8 – strand bias: If strand information is available after multisample calling, the balance of the alternate allele across forward and reverse reads can be tested.

  • QC criterion 4.9 – base repetitions and homopolymer runs: Homopolymer runs, that is, repetitions of the same base, should be short. Polymorphisms with other repetitive patterns, such as repetitions of the bases GC, should also be excluded. One approach is to mask such polymorphisms with software, such as RepeatMasker (Tempel 2012), or to use stringent thresholds, such as the restriction to homopolymer runs shorter than 6 bp (Chen, Graf, and Chen 2021).
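
The inbreeding coefficient described under QC criterion 4.4 can be illustrated with a short worked example. This is a minimal sketch, assuming the usual form of one minus the ratio of observed to expected heterozygotes for a diallelic marker, with the allele frequency estimated from the genotype counts; the function name and the toy counts are hypothetical.

```python
def inbreeding_coefficient(n11: int, n12: int, n22: int) -> float:
    """Estimate f = 1 - n12 / (2 * n * p * (1 - p)) from genotype counts.

    n11, n22: counts of the two homozygous genotypes; n12: heterozygotes.
    """
    n = n11 + n12 + n22
    if n == 0:
        raise ValueError("no genotypes observed")
    # Allele frequency of the second allele; p * (1 - p) is the same
    # whichever of the two alleles is taken as the minor allele.
    p = (2 * n22 + n12) / (2 * n)
    expected_het = 2 * n * p * (1 - p)
    if expected_het == 0:
        raise ValueError("monomorphic site")
    return 1 - n12 / expected_het


# Genotype counts in exact Hardy-Weinberg proportions give f = 0.
print(inbreeding_coefficient(25, 50, 25))
# A site at which every subject appears heterozygous gives f = -1.
print(inbreeding_coefficient(0, 100, 0))
```

A site such as the second one would be removed by either of the negative thresholds cited above, whereas a site with an excess of homozygotes yields a positive value and would pass such a filter. This is the asymmetry noted in the text.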

5. Short‐Read Sequencing and Preprocessing Applied to GENESIS‐HD

Core elements of the sequencing protocol for the GENESIS‐HD study are as follows: DNA was extracted from EDTA blood samples at the University Medical Center Eppendorf (UKE) in Hamburg. DNA derived from LCL for the three GIAB samples was obtained from the Coriell Institute for Medical Research. After the arrival of the DNA samples at the University Medical Center Zurich, the DNA concentration was measured with PicoGreen, followed by an automated DNA normalization with Hamilton Robotics. Libraries were prepared using the Illumina TruSeq DNA PCR Free Library Prep protocol HT (Illumina Inc., San Diego, CA, USA) for WGS. The protocol steps were (1) fragmenting 1 µg of genomic DNA to 350 bp with Covaris LE220‐plus, (2) fragment cleanup, (3) end repair, (4) size selection, (5) 3′‐end adenylation, and (6) adapter ligation. Libraries were quantified and quality‐checked on an iSeq100 (Illumina). Samples were normalized according to the quantification values, and 54 samples were pooled for sequencing on an Illumina NovaSeq 6000 sequencer. Samples were sequenced twice on S4 flow cells with 300 cycles (2 × 150 reads) with an estimated coverage of 15× each, following Illumina protocols. The target was for more than 95% of samples to achieve a coverage of at least 30×.

QC was ongoing throughout the sequencing, as the NovaSeq 6000 processed 500–700 samples monthly. During normal operation, data were transferred in batches of approximately 250 subjects, and QC reports were generated for these batches. General QC was performed at three timepoints: after sequencing 900 subjects, 6000 subjects, and at the end of sequencing.

DRAGEN version 3.8.4 was run on all samples without flow cell failure for preprocessing, that is, mapping, alignment, and single sample variant calling. By setting --soft-read-trimmers=polyg,adapter, we did not trim adapters and thus expected all reads to maintain a length of 151 bp.

The percentage of within‐species contamination in samples was estimated using DRAGEN with the --qc-cross-cont-vcf option, and the most likely contaminant was estimated using COMET (Thiéry et al. 2020). The duplicate marking option was enabled (--enable-duplicate-marking) for estimating the proportion of duplicated sequences per sample.

To identify a genetically similar population, we used the PC analysis approach (Section 4.3). The sequence data from the 1000 Genomes Project, Phase 3 (1KG) were used for the PC analysis (The 1000 Genomes Project Consortium 2015). All diallelic SNPs with a MAF of at least 10% and missing frequency below 10% were used to estimate the PCs. All GENESIS‐HD samples that passed QC up to step 2 and the Het/Hom ratio filter were then projected into the PC space estimated from the 1KG data. GIAB samples were added to the PC plot.

The ellipse to define EUR‐like ancestry was derived by first defining a rectangle using the GIAB sample, which was closest to the cluster of 1KG EUR‐like subjects from the 1KG samples. Second, the ellipse was derived vertically and horizontally following the first two axes from the PC analysis. The ellipse was chosen at the boundary of the GIAB sample closest to the EUR‐like subjects. To investigate the adequacy of the estimated PCs, we generated a genetic similarity plot, in which subjects from the 1KG were displayed together with the geographical origin of subjects from GENESIS‐HD. Subjects were assigned a geographic region using the United Nations M49 standard region codes (United Nations 1998) when the reported country of birth was the same for at least three grandparents.
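
Once such an ellipse has been fixed, assigning a projected sample to the EUR‐like group reduces to a point‐in‐ellipse test. The following is a minimal sketch, assuming an ellipse whose axes are aligned with the first two PCs; the function name, the center, and the semi‐axis lengths are hypothetical and are not the values used in GENESIS‐HD.

```python
def inside_axis_aligned_ellipse(pc1, pc2, center, semi_axes):
    """Return True if the point (pc1, pc2) lies inside or on the ellipse.

    center: (c1, c2) coordinates of the ellipse center in PC space.
    semi_axes: (a, b) half-lengths along the PC1 and PC2 axes.
    """
    c1, c2 = center
    a, b = semi_axes
    if a <= 0 or b <= 0:
        raise ValueError("semi-axes must be positive")
    return ((pc1 - c1) / a) ** 2 + ((pc2 - c2) / b) ** 2 <= 1.0


# Hypothetical ellipse centered at the origin of the PC plane.
center, semi_axes = (0.0, 0.0), (2.0, 1.0)
print(inside_axis_aligned_ellipse(0.5, 0.2, center, semi_axes))
print(inside_axis_aligned_ellipse(2.0, 1.0, center, semi_axes))
```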

Multisample variant calling was done with the Iterative gVCF Genotyper. Lossless compression of the .fastq.gz files was done with DRAGEN ORA version 2.5.5 alongside DRAGEN version 3.10 because it allowed for a hash‐based integrity check. Important code for this work is provided as the Supporting Information.

6. Important Quality Control Criteria: Application to the GENESIS‐HD Study

The actual sample sizes before QC and after QC are displayed in Table 1. Figure 3 illustrates the flowchart for the GENESIS‐HD study. After the pilot phase, a total of 9412 samples, with 8123 from HCHS, were sequenced between November 2019 and January 2022.

FIGURE 3.

Flowchart of the GENESIS‐HD study. FASTQ files were analyzed from 9412 samples. Two runs had flowcell failures and were repeated. DRAGEN failed on 22 subjects, and two GVCFs were malformed after variant calling. Since one sample may fail several criteria in one QC filtering step, sample sizes within a block of removed samples do not sum to the total number of removed samples in the specific stage. The QC filter criteria are explained in detail in Section 4 and illustrated in Section 6.

6.1. Quality Control Step 1: Raw Sequence Data Analysis During the Conduct of the Sequencing Experiment

All flow cells had between 16 billion and 20 billion reads, so all passed QC criterion 1.1. With a mean insert size of 350 bp, the NovaSeq configured for 151 bp reads and no adapter trimming (see Section 5), we expected all reads to be 151 bp long. All reads were 151 bp long and thus passed QC criterion 1.2. QC criterion 1.3 on the per sequence GC content distribution and QC criterion 1.4 on the per base quality profile were not used as filter criteria in the final QC analyses.

Three sequencing runs had quality issues when QC criterion 1.5 was applied. Figure 4 displays the percentage of Q30 bases for 20 runs. The runs displayed in black show that the percentage of Q30 bases was around 93% for Read 1 and around 90% for Read 2. The lower quality of Read 2 reads was expected (Section 4.1). The run displayed in green and shown first (flow cell HVNNNDSXX) had a lower percentage of Q30 bases for all subjects, even lower than the typical percentage of Q30 bases for Read 2 reads (Figure 4). Because of the low quality of the Read 1 reads for flow cell HVNNNDSXX, all samples on this flow cell were re‐sequenced. The run with the replacement flow cell HM5MKDSXY is shown in light blue in Figure 4 and had percentages of Q30 bases for both Reads 1 and 2 in the expected range.

FIGURE 4.

Percentage of Q30 bases by run for 20 selected flow cells. Each point represents one sample, and 54 samples were pooled per run. The upper panel shows Read 1; the lower panel shows Read 2. The run for flow cell HVNNNDSXX (shown in green) had a much lower proportion of Q30 bases for Read 1 than other runs (shown in black). A repeat of this run (flow cell HM5MKDSXY, shown in light blue) had a normal proportion of Q30 bases.

Samples failed QC criterion 1.6 if the difference between A and T, or C and G proportions at any sequencing position exceeded two standard deviations from the average base content. Visual inspection identified two unusual runs (Figures 5 and 6).

FIGURE 5.

Base content (proportion) for each base by read position for the runs with flow cell HMWFKDSXY. Panel A: Read 1 and panel B: Read 2. The Illumina instrument reported an error, and there was a writing error from the sequencing machine to the server for base positions 46, 63, and 64. This run was considered a flow cell failure and thus repeated.

FIGURE 6.

Base content (in percentage) for each base by read position for the runs with flow cell HJYVJDSX2. Panel A: Read 1 and panel B: Read 2. The Illumina instrument did not report an error, although there was a clear visible drop of bases for Read 1 at position 47. The run was kept for further analyses.

Figure 5 shows the percentage of each base per read, separately for Reads 1 (panel A) and 2 (panel B), for run HMWFKDSXY. The base content was lower for Read 2 for flow cell HMWFKDSXY at base positions 46, 63, and 64 for all four bases. The per base N content was correspondingly high at these positions (QC criterion 1.7, not shown). The Illumina instrument reported a writing error. The replacement flow cell HMWFKDSXY had normal base content. Figure 6 displays a similar error for Read 1 of flow cell HJYVJDSX2 at base position 45. The Illumina instrument reported no error, and the flow cell was kept for further analyses.

QC criteria 1.8 and 1.9 were not applied in this study because of a lack of acceptable duplication rates in the literature and because soft‐trimmed sequences were used in this study.

Two runs with a total of 106 samples had to be repeated because of flow cell failures. An additional 22 samples could not be processed by DRAGEN. This left us with 9284 samples after QC step 1 (Figure 3).

6.2. Quality Control Step 2: Data Analysis After Mapping and Alignment

We excluded 118 samples with a median coverage below 28× (QC criterion 2.1; Figure 7). We did not use the breadth of coverage to exclude samples (QC criterion 2.2). We excluded 16 samples with a median insert size below 350 bp to avoid adapter read‐through (QC criterion 2.3; Figure 8). We did not use the read depth distribution (QC criterion 2.4) to exclude samples, but we excluded 74 samples for not having at least 96% of properly paired reads (QC criterion 2.5; Figure 9).
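
The sample‐level thresholds stated in this paragraph can be applied as in the following minimal sketch. The thresholds are those given in the text; the class, field, and function names are hypothetical, and the metric values in the example are invented for illustration. A sample may fail more than one criterion, which is why the function returns a list.

```python
from dataclasses import dataclass


@dataclass
class SampleMetrics:
    sample_id: str
    median_coverage: float      # median autosomal coverage, in x
    median_insert_size: float   # in bp
    pct_properly_paired: float  # in percent


def failed_step2_criteria(m: SampleMetrics) -> list:
    """Return the QC criteria (as labeled in the text) that a sample fails."""
    failed = []
    if m.median_coverage < 28:
        failed.append("2.1")
    if m.median_insert_size < 350:
        failed.append("2.3")
    if m.pct_properly_paired < 96:
        failed.append("2.5")
    return failed


# Invented example values, not study data.
samples = [
    SampleMetrics("S1", 34.0, 410.0, 98.2),
    SampleMetrics("S2", 26.5, 340.0, 97.0),
]
for s in samples:
    print(s.sample_id, failed_step2_criteria(s))
```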

FIGURE 7.

Median autosomal coverage by run. Each point represents one sample, and the red dashed line represents a median autosomal coverage of 28×. Samples with median autosomal coverage < 28× were excluded.

FIGURE 8.

Median insert size for each sample by run. Since the read length was 150 bp, samples with a median insert size below 350 bp were excluded (red dashed line). Average median insert size plus/minus standard deviation are given as green solid and green dashed lines, respectively.

FIGURE 9.

Percentage of properly paired reads for each sample by run. Panel A: all samples; panel B: samples with at least 80% properly paired reads. All samples with less than 96% properly paired reads (red dashed line) were excluded from further analyses.

The percentage of within‐species contamination was estimated and plotted per sample (QC criterion 2.6; Figure 10), and 165 samples with a contamination level above 1% were excluded. One run exhibited contamination levels above 15% in all samples from the first column of the well plate, with no measurements recorded for the second column (Figure 11). The reason was a handling error of the eight‐channel multichannel pipette, reported by the laboratory technician before sequencing started.

FIGURE 10.

Estimated cross‐sample contamination (proportion) for each sample by run. The red dashed line corresponds to a cross‐sample contamination of 0.01. The base contamination level of DRAGEN was at 0.001. All samples with contamination exceeding 0.01 were excluded from further analyses.

FIGURE 11.

Plate layout for one run with sample cross‐contamination. Empty wells are displayed in light gray.

The analysis for identifying duplicate samples (QC criterion 2.7) had the biggest effect in QC step 2. One individual was sequenced 16 times for repeatability analysis, and only one of these datasets was kept. Because of initial low‐quality sequencing results, 130 samples were resequenced, and the low‐quality datasets were excluded. We found that 72 participants were included in two distinct studies. Subjects that passed the QC criteria on the raw FASTQ files and after mapping and alignment were included in the duplicate analysis. When multiple sequences for a sample had contamination levels of 1% or less (QC criterion 2.6), the dataset with the lowest contamination level was selected. If contamination levels were identical, the dataset with the highest median coverage was selected. Among the duplicate samples, 54 out of 72, so three‐quarters, belonged to participants from both the AHFRI and HCHS studies. For two‐thirds of these cases, the HCHS dataset was used for further analysis. Finally, seven samples were removed for which a sample swap was identified. For example, sequences of two different subjects recruited on the same day were identical. There obviously was an error in the handling of the corresponding blood samples.
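
The selection rule for duplicate datasets described above can be written compactly. The following is a minimal sketch under stated assumptions: each dataset is a dictionary with the hypothetical keys `contamination` (as a proportion) and `median_coverage`, and the example values are invented.

```python
def select_dataset(datasets):
    """Pick one dataset among duplicates of the same subject.

    Only datasets with contamination of 1% or less are eligible. Among those,
    the lowest contamination wins; ties are broken by highest median coverage.
    Returns None if no dataset is eligible.
    """
    eligible = [d for d in datasets if d["contamination"] <= 0.01]
    if not eligible:
        return None
    return min(eligible, key=lambda d: (d["contamination"], -d["median_coverage"]))


# Invented example: "a" is too contaminated; "b" and "c" tie on contamination.
duplicates = [
    {"id": "a", "contamination": 0.020, "median_coverage": 40.0},
    {"id": "b", "contamination": 0.003, "median_coverage": 30.0},
    {"id": "c", "contamination": 0.003, "median_coverage": 35.0},
]
print(select_dataset(duplicates)["id"])
```

Sorting on a tuple keeps the two‐stage rule in a single comparison: contamination is compared first, and the negated coverage only matters when contamination levels are equal.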

Sex inference (QC criteria 2.8 and 2.9) had a low impact (Figure 12). The two panels of this figure illustrate the effect of filtering up to and including QC criterion 2.7. The 39 subjects with an unclear chromosomal pattern, displayed in gray in panel A, had already been filtered out (panel B). The clouds for males and females were more homogeneous after filtering, and only clear sex patterns remained. Ten cases of SCAs were observed and excluded (QC criterion 2.8): three subjects with Turner syndrome (X0), three with Jacobs syndrome (XYY), and four with Klinefelter syndrome (XXY). These numbers match well the expected numbers for these syndromes in the general population: 1 in 2000 females for Turner syndrome, 1 in 658 males for Klinefelter syndrome, and 1 in 1020 males for Jacobs syndrome (Berglund, Stochholm, and Gravholt 2020). One subject was excluded because of a mismatch between reported and observed sex (QC criterion 2.9).

FIGURE 12.

Sex inference. The dosage of chromosome Y is plotted against the dosage of chromosome X, in panel A for all samples, and in panel B for all samples with filtering up to and including QC criterion 2.7. Ten subjects were estimated to have a sex chromosome aneuploidy. Three subjects were identified with Turner syndrome (X0), three with Jacobs syndrome (XYY), and four with Klinefelter syndrome (XXY). Subjects displayed in gray had high levels of cross‐sample contamination and were filtered out.

The proportion of duplicated sequences per sample was estimated, but no filtering was performed with this criterion (QC criterion 2.10).

6.3. Quality Control Step 3: After Single‐Sample Variant Calling

After QC step 2, 8748 samples remained (Figure 3). In QC step 3, 397 samples were removed. We measured SNP concordance with a public database in QC step 4 but did not filter based on this criterion (QC criterion 3.1). The check for allelic balance (QC criterion 3.2) would have to be established for every genomic position and every sample, and there is no established threshold for filtering. We thus refrained from using this QC criterion. In the present study, we did not filter samples by sample call missingness (QC criterion 3.3). However, we investigated site missingness in QC step 4 (Section 6.4), which had an immediate effect on the sample call missingness.

For QC criterion 3.4, Figure 13 displays the Het/Hom ratio for all samples (panel A) and for samples filtered up to QC step 2 (panel B). The Het/Hom ratio was more homogeneous after the QC filters up to step 2 were applied. The average Het/Hom ratio was close to the expected value of 1.65 (Wang et al. 2015; Wang, Beckmann, and Roussos 2018), and the filtering of samples with Het/Hom ratios below 1.5 or above 1.8 seems justified from the display in Figure 13.
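The Het/Hom filter can be illustrated with a short sketch. It assumes that the ratio is defined as the number of heterozygous variant calls divided by the number of homozygous alternative calls per sample; the function names and the example counts are hypothetical.

```python
HET_HOM_LOWER = 1.5  # lower filter threshold used in this study
HET_HOM_UPPER = 1.8  # upper filter threshold used in this study


def het_hom_ratio(n_het, n_hom_alt):
    """Ratio of heterozygous to homozygous alternative calls of one sample."""
    if n_hom_alt <= 0:
        raise ValueError("number of homozygous alternative calls must be positive")
    return n_het / n_hom_alt


def passes_het_hom_filter(n_het, n_hom_alt):
    """True if the sample is kept, i.e. its ratio lies within [1.5, 1.8]."""
    ratio = het_hom_ratio(n_het, n_hom_alt)
    return HET_HOM_LOWER <= ratio <= HET_HOM_UPPER


# Hypothetical per-sample call counts.
print(passes_het_hom_filter(2_475_000, 1_500_000))  # ratio 1.65 -> True
print(passes_het_hom_filter(2_100_000, 1_500_000))  # ratio 1.40 -> False
```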

FIGURE 13.

Het/Hom ratio by run. Panel A: all samples; panel B: all samples with filtering up to QC step 2. Samples displayed in yellow were excluded by genetic similarity analysis (QC criterion 3.5). The expected Het/Hom ratio was 1.65. The red dashed lines are at 1.5 and 1.8 and represent the filter thresholds for the Het/Hom ratio in this study.

PC analysis (QC criterion 3.5) excluded 293 subjects who were classified as being from a non‐EUR‐like ancestry group (Figure 14). Subjects of 1KG AFR‐like and 1KG SAS‐like ancestry groups match well between both studies. The genetic similarity plot, in which subjects from the 1KG project are displayed with the geographical origin of subjects from GENESIS‐HD, shows a good agreement between the estimated genetic similarity and the grandparental countries of birth (Figure 15).

FIGURE 14.

Genetic similarity analysis. The first two principal components (PC) are displayed for all subjects from GENESIS‐HD who passed QC up to step 2 and filtering by the Het/Hom ratio. Data from 2504 subjects from the 1000 Genomes (1KG) Project, Phase 3 were integrated into the plot (The 1000 Genomes Project Consortium 2015). All subjects outside of the ellipse were filtered due to genetic dissimilarity with the 1KG EUR‐like subjects. AFR: 1KG AFR‐like; AMR: 1KG AMR‐like (Mexican ancestry from Los Angeles, USA; Puerto Rican from Puerto Rico; Colombian from Medellín, Colombia; Peruvian from Lima, Peru); EAS: 1KG EAS‐like; EUR: 1KG EUR‐like; SAS: 1KG SAS‐like.

FIGURE 15.

Geographic origin of subjects and genetic similarity analysis. The first two principal components (PC) are displayed for all subjects from GENESIS‐HD with information on the geographical origin of grandparents according to standard United Nations M49 region codes (United Nations 1998). Subjects displayed had to have passed QC up to step 2 and filtering by the Het/Hom ratio. For a description of the 1000 Genomes (1KG) Project, Phase 3, data, see the legend to Figure 14.

The Ti/Tv ratio was calculated for all samples without applying it as a filter at this stage (QC criterion 3.6). However, it was used as a general quality indicator in QC step 4. DRAGEN output two malformed GVCFs, which had to be excluded from further analyses (QC criterion 3.7). These GVCFs were characterized by partly identical reference and alternative alleles.
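As an illustration of the Ti/Tv ratio, the following sketch counts transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or vice versa) over a list of single‐nucleotide variants. It is not the DRAGEN implementation; the function names are hypothetical, and records that are not simple SNVs, including records with identical reference and alternative alleles, are skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(variants):
    """Ti/Tv ratio over an iterable of (ref, alt) allele pairs."""
    n_ti = n_tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        # Skip Indels, non-ACGT alleles, and records with ref == alt.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            n_ti += 1
        else:
            n_tv += 1
    if n_tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return n_ti / n_tv


# Hypothetical variant list: three transitions, one transversion, one Indel.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(example))  # 3.0
```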

Following QC step 3 and the exclusion of the GIAB samples, we were left with a total of 8351 samples, which were then subjected to batch‐based multisample calling using the iterative gVCF genotyper (Illumina Inc., 2022).

6.4. Quality Control Step 4: After Multisample Variant Calling

The final step of QC investigated the polymorphic sites. We restricted our analysis to diallelic markers, as most analysis programs only work with these types of polymorphisms. We set the MAC threshold to at least 2 to exclude singletons (QC criterion 4.1). Depth per site is unavailable following multisample calling and, therefore, not used for filtering (QC criterion 4.2). However, DRAGEN outputs DP per sample per site and genotype quality. Variant calls with a depth of zero or one or a genotype quality below 5 were set to missing so that DP and genotype quality were indirectly considered through the missingness filter. Polymorphic sites with missing genotypes exceeding 5% were filtered out (QC criterion 4.3).
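The site filters described above can be sketched for a single diallelic site. The sketch makes two assumptions that the text does not settle: genotypes are coded as the number of alternative alleles (0, 1, 2), and the minor allele count is computed after low‐quality calls have been set to missing. Function names and example data are hypothetical.

```python
MIN_GQ = 5                   # genotype quality below 5 is set to missing
MAX_SITE_MISSINGNESS = 0.05  # QC criterion 4.3
MIN_MAC = 2                  # QC criterion 4.1: excludes singletons


def mask_genotype(genotype, depth, genotype_quality):
    """Return the genotype, or None if the call is set to missing."""
    if genotype is None or depth <= 1 or genotype_quality < MIN_GQ:
        return None
    return genotype


def site_passes(calls):
    """calls: list of (genotype, DP, GQ) tuples, one per sample."""
    if not calls:
        raise ValueError("at least one sample is required")
    masked = [mask_genotype(g, dp, gq) for g, dp, gq in calls]
    called = [g for g in masked if g is not None]
    n_missing = len(masked) - len(called)
    if n_missing / len(masked) > MAX_SITE_MISSINGNESS:
        return False
    alt_count = sum(called)
    ref_count = 2 * len(called) - alt_count
    return min(alt_count, ref_count) >= MIN_MAC


# Hypothetical site with 40 samples: 37 homozygous reference calls,
# 2 good heterozygous calls, and 1 heterozygous call with a depth of one.
site = [(0, 30, 60)] * 37 + [(1, 30, 60)] * 2 + [(1, 1, 60)]
print(site_passes(site))  # True: 2.5% missing, minor allele count 2
```

Because low‐depth and low‐quality calls are converted to missing values before the missingness filter is applied, DP and genotype quality influence the result only through the 5% threshold, as described in the text.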

A total of 210,953 (0.46%) sites were filtered out because of deviation from HWE using a p‐value filter criterion of 10⁻⁹ (QC criterion 4.4). The HWE filter removed all sites that were heterozygous for at least 98% of the individuals (Table 2). Most of these polymorphisms were found around the centromere, and the chromosomes with the largest number of these sites were chromosomes 17 and 20 (Figure 16). We did not apply QC criteria 4.5 through 4.9 in this study.
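To illustrate why sites with recurrent heterozygosity fail the HWE filter, the following sketch applies a Pearson chi‐square test with one degree of freedom to genotype counts. The text does not state which HWE test was used, and exact tests are common in practice, so the chi‐square test is an assumption made here for simplicity. The genotype counts are hypothetical and only use the study's sample size of 8351.

```python
import math

HWE_P_THRESHOLD = 1e-9  # QC criterion 4.4


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """P-value of the 1-df Pearson chi-square test for HWE at a diallelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 1.0  # monomorphic site: no deviation can be tested
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical site at which 98% of 8351 subjects are heterozygous.
p_value = hwe_chi2_pvalue(n_hom_ref=84, n_het=8184, n_hom_alt=83)
print(p_value < HWE_P_THRESHOLD)  # True
```

With allele frequencies close to 0.5, only about half of the subjects are expected to be heterozygous under HWE, so a heterozygote proportion of 98% yields a test statistic far beyond any conventional threshold.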

TABLE 2.

Descriptive statistics on recurrent heterozygosity by chromosome.

| Chr | # sites | # het sites | Prop het sites |
|---|---|---|---|
| 1 | 3,743,591 | 233 | 6.22 × 10⁻⁵ |
| 2 | 3,910,519 | 644 | 1.65 × 10⁻⁴ |
| 3 | 3,197,265 | 161 | 5.04 × 10⁻⁵ |
| 4 | 3,132,165 | 148 | 4.73 × 10⁻⁵ |
| 5 | 2,872,255 | 31 | 1.08 × 10⁻⁵ |
| 6 | 2,797,644 | 17 | 6.08 × 10⁻⁶ |
| 7 | 2,654,795 | 440 | 1.66 × 10⁻⁴ |
| 8 | 2,484,766 | 50 | 2.01 × 10⁻⁵ |
| 9 | 2,053,981 | 1268 | 6.17 × 10⁻⁴ |
| 10 | 2,253,519 | 258 | 1.15 × 10⁻⁴ |
| 11 | 2,230,600 | 133 | 5.96 × 10⁻⁵ |
| 12 | 2,144,645 | 53 | 2.47 × 10⁻⁵ |
| 13 | 1,668,778 | 63 | 3.77 × 10⁻⁵ |
| 14 | 1,453,822 | 77 | 5.30 × 10⁻⁵ |
| 15 | 1,334,443 | 21 | 1.57 × 10⁻⁵ |
| 16 | 1,522,733 | 462 | 3.03 × 10⁻⁴ |
| 17 | 1,337,091 | 2066 | 1.55 × 10⁻³ |
| 18 | 1,311,985 | 186 | 1.42 × 10⁻⁴ |
| 19 | 1,070,542 | 322 | 3.00 × 10⁻⁴ |
| 20 | 1,075,095 | 3468 | 3.23 × 10⁻³ |
| 21 | 644,814 | 180 | 2.79 × 10⁻⁴ |
| 22 | 689,902 | 814 | 1.18 × 10⁻³ |

Note: A site was called heterozygous (het site) if 98% or more of the subjects were heterozygous. Only sites are considered that had a minor allele count of at least two and had at most 5% missing genotypes per site over all subjects, that is, after application of quality control including QC criterion 4.3.

Abbreviations: Chr, chromosome; # het sites, the number of heterozygous sites; Prop het sites, the proportion of heterozygous sites.

FIGURE 16.

Recurrent heterozygous sites. The distribution of sites with excess heterozygosity per 50 kilobases is given in yellow. Regions with low mappability per 50 kilobases are shown in green. A site is defined as having excess heterozygosity if at least 98% of the samples have a heterozygous call. Mappability was calculated from the hg38 reference genome (github.com/broadinstitute/ichorCNA/tree/master/inst/extdata). Stretches of low mappability regions are observed on the short arms of the acrocentric chromosomes 13, 14, 15, 21, and 22, that is, the chromosomes in which one arm is substantially shorter than the other (Antonarakis 2022). Furthermore, they are observed in the centromeres of chromosomes 1 and 9, and on chromosome Y, which are all difficult to map (Aganezov, Yan, and Soto 2022; Rhie, Nurk, and Cechova 2023).

7. Compression of .fastq.gz Files Using the DRAGEN ORA Compression

After all samples were processed with the pipeline in Figure 2, the .fastq.gz files were compressed using DRAGEN ORA version 2.5.5, which allowed for a hash‐based integrity check. Figure 17 shows the file sizes before and after compression, and Figure 18 shows the compression ratio, which depends on the proportion of mismatched bases for both SNPs and Indels. On average, .fastq.ora files were approximately 5.6 times smaller than .fastq.gz files. Specifically, the median compression ratio was 5.62 (quartiles 5.51–5.72). The raw sequence data (.fastq.gz) of the study had almost 700 terabytes (TB), and after compression only about 160 TB were required for storing the sequence data. Median .fastq.gz size was 67.4 GB (quartiles 64.9–70.0 GB) per sample, and median .fastq.ora file size was 12.0 GB (quartiles 11.4–12.7 GB).

FIGURE 17.

File size for (A) .fastq.gz and (B) .fastq.ora files by median autosomal coverage. While the average file size per sample was approximately 65 GB for .fastq.gz files at 35× coverage, the corresponding .fastq.ora files were approximately 12 GB in size.

FIGURE 18.

Compression ratio for .fastq.gz versus .fastq.ora files by (A) median autosomal coverage and (B) mismatch bases. While there seems to be no relationship between compression ratio and autosomal coverage, there is an almost linear relationship between the proportion of mismatch bases, that is, deviation from the reference genome, and the compression ratio.

8. Discussion

This work provided a comprehensive summary of approaches for preprocessing and QC, starting from the raw FASTQ files to the final MSVCF. The compilation of the filter criteria (Section 4), together with the demonstration of their importance on the real data (Section 6), may serve as a blueprint for other studies. Its application to GENESIS‐HD, a WGS study involving more than 9000 samples, showed that five QC criteria filtered more than 100 samples each.

First, the genetic similarity filter led to the exclusion of 293 subjects. A strict criterion was used to ensure that the final sample represents a homogeneous EUR‐like population. Second, the relatedness filter resulted in the exclusion of 224 subjects (Figure 3). For many subjects, this relatedness was intended. One subject was sequenced 16 times, and the sequencing of 130 samples was repeated because of low quality in the first run. We additionally excluded the WGS of 72 samples because subjects were included in two different studies. These duplications could not be avoided because DNAs for the WGS study were selected using pseudonymized subject identifiers. Furthermore, seven subjects were excluded because of sample swaps that we detected. Lastly, filtering following contamination and coverage analyses led to the exclusion of 126 samples and 118 samples, respectively, while checks on the raw FASTQ data revealed technical problems with two runs resulting in the filtering of 106 samples.

The last QC step filters samples and polymorphisms on the MSVCF. Most polymorphisms that had to be filtered out were located close to the centromere and were characterized by heterozygous genotypes for almost all subjects. Despite the great technological improvement of sequencing over the past years, the genome sequence had unsolved genomic gaps with a total length of >150 megabases until quite recently, making up about 5% of the human genome sequence (Zhao et al. 2020b). Recent work has closed these gaps (Nurk, Koren, and Rhie 2022; Rhie, Nurk, and Cechova 2022). Nevertheless, sequencing studies, such as WGS studies, still have chromosomal gaps, especially around the centromeres of chromosomes 1, 9, and 16 and the short arms of chromosomes 13, 14, 15, 21, and 22 (Figure 16), which may partly be closed by recalling (Nurk, Koren, and Rhie 2022). These considerations highlight one important difference between genetic data and clinical trial data. While the data from clinical trials are considered final after a database closure, constant improvements can be expected for WGS data. In particular, recalling all WGS data from the very beginning, that is, the FASTQ file, has to be expected.

The pipeline described in this work from the FASTQ files to the single‐sample GVCFs represents one of the first applications of Illumina's DRAGEN, which outperforms the commonly used GATK pipeline (Betschart et al. 2022; Zhao et al. 2020a). One limitation of using the DRAGEN for preprocessing is that the QC criteria are restricted to those available for DRAGEN. Other QC criteria from the GATK pipeline could be calculated by running the corresponding procedures, but we found the QC criteria in the DRAGEN pipeline to be adequate. Another limitation of our work is that we only report results from the Germline Small Variant Caller of the DRAGEN. Specialized callers may be used for other types of genetic variation, such as short tandem repeats or copy number variation (Garg et al. 2022; Ziaei Jam, Li, and DeVito 2023).

The adequacy of the filter criteria used in our study was confirmed by a PC analysis that investigated the matching of the marker‐based genetic similarity with the reported countries of birth for the grandparents (Figure 15), in analogy to Novembre, Johnson, and Bryc (2008). In this analysis, we did not estimate the PCs anew but used the ones determined by the 1KG project (The 1000 Genomes Project Consortium 2015). Figure 12 contrasts the sex inference before and after filtering and demonstrates that the filter criteria up to QC criterion 2.7 were effective. The same conclusion can be drawn from Figure 13, in which the Het/Hom ratios are displayed before and after filtering up to QC step 2. Finally, we observed an increase in the Ti/Tv ratio from 1.96 for the full dataset to 2.20 for the final multisample VCF dataset. The final Ti/Tv ratio is at the upper limit of the expected Ti/Tv ratio for WGS studies (DePristo, Banks, and Poplin 2011; The International HapMap Consortium 2003).

The present work represents one of the first applications of DRAGEN ORA compression. While the standard .fastq.gz files had a size of approximately 65 GB, the average file size of the corresponding .fastq.ora compressed file was approximately 12 GB. The raw data from this study could be reduced from almost 700 TB to approximately 160 TB.

All analyses were run locally on a high‐performance computing cluster at Cardio‐CARE (Davos, Switzerland) as described before (Betschart et al. 2022). Each compute node was equipped with 2 AMD EPYC 7742 CPUs, 2 TB RAM, approximately 11 TB NVMe, and the CentOS 8 operating system. For pipeline comparisons, each pipeline was given 64 cores. Two DRAGEN Servers v3 were used for the DRAGEN pipeline. In this computational environment, the DRAGEN 3.8.4 required 36 ± 2 min (mean ± standard deviation) per sample from FASTQ to GVCF. Runtime per sample for the first part of the pipeline, that is, from FASTQ to BAM, was 182 ± 36 min for GATK and 18 ± 1 min for DRAGEN (Betschart et al. 2022). Runtimes per sample for variant calling were 134 ± 20 min for GATK in the standard mode and 18 ± 1 min for DRAGEN (Betschart et al. 2022). ORA compression required approximately 5 min per sample with a coverage of 35×, and decompression needed approximately 2 min. One batch with 1000 samples took approximately 9 h to be processed with the Iterative gVCF Genotyper on one compute node with 128 threads, that is, cores. Finally, the merging of all batches required a total of 15 h on one compute node with 128 threads. Most of this time was spent on indexing the MSVCF file. The pure computation time from FASTQ to GVCF was approximately 113 days for 9000 samples. Running eight batches with the Iterative gVCF Genotyper took approximately 3 days. With the additional 15 h for merging, multisample calling required less than 4 days in total.
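The runtime totals reported above can be reproduced with simple arithmetic. The sketch below assumes that the two DRAGEN servers processed samples in parallel, one sample at a time each, and that the eight batches were genotyped one after the other on a single compute node; these assumptions are not stated explicitly in the text but are consistent with the reported figures.

```python
# Figures reported in the text.
N_SAMPLES = 9000
MINUTES_PER_SAMPLE = 36  # DRAGEN, FASTQ to GVCF
N_DRAGEN_SERVERS = 2     # assumption: both servers run in parallel
N_BATCHES = 8
HOURS_PER_BATCH = 9      # Iterative gVCF Genotyper, 1000 samples per batch
MERGE_HOURS = 15         # merging of all batches


def preprocessing_days(n_samples, minutes_per_sample, n_servers):
    """Wall-clock days from FASTQ to GVCF when samples are spread over servers."""
    return n_samples * minutes_per_sample / n_servers / (60 * 24)


def multisample_calling_days(n_batches, hours_per_batch, merge_hours):
    """Wall-clock days for sequential batch genotyping plus merging."""
    return (n_batches * hours_per_batch + merge_hours) / 24


fastq_to_gvcf = preprocessing_days(N_SAMPLES, MINUTES_PER_SAMPLE, N_DRAGEN_SERVERS)
multisample = multisample_calling_days(N_BATCHES, HOURS_PER_BATCH, MERGE_HOURS)
print(f"FASTQ to GVCF: {fastq_to_gvcf:.1f} days")     # 112.5 days
print(f"Multisample calling: {multisample:.3f} days")  # 3.625 days
```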

Could we have performed the analyses about 10 years ago with such ease? The simple answer to this question is “no.” First, sequencing costs have substantially decreased. Second, the laboratory technology has improved. Third and most importantly, from a biostatistical point of view, efficient and accurate analysis of large‐scale WGS data has been the computational bottleneck for many years (Plüss, Kopps, and Keller 2017). Preprocessing, QC, and joint calling of large numbers of samples are computationally intensive (Poplin, Ruano‐Rubio, and DePristo 2018b). As a result, hardware performance limitations were sometimes exceeded until quite recently.

Computational efficiency increased substantially because of hardware improvements, such as the introduction of FPGA‐based computers. For example, while the preprocessing of a single WGS sample from the FASTQ file to the GVCF took several hours on a well‐equipped CPU‐based system a decade ago, it requires <30 min on an FPGA‐based computer (Betschart et al. 2022; Zhao et al. 2020a). Alternative approaches for data preprocessing have been developed, such as the GPU‐based NVIDIA Parabricks (O'Connell, Yosufzai, and Campbell 2022). Another major hardware improvement is related to the CPU‐based systems. For example, we used compute nodes equipped with 2 × 64 cores, >11 TB nonvolatile memory express (NVMe), and 2 TB RAM for multisample calling. Furthermore, faster network connections and cloud computing have become standard, allowing greater data throughput from the storage to the compute units. Cloud computing requires 25 min to preprocess one WGS sample with a coverage of 34× at the cost of 1 USD (Broad Institute 2020).

Software development has led to a further increase in efficiency. For example, the new mapper BWA‐MEME is substantially faster than the older BWA‐MEM2 and BWA‐MEM programs (Jung and Han 2022). The most critical step of a large‐scale WGS study is the joint calling of many samples, where runtime and memory usage are linear with the number of samples. Here, several programs have been developed to speed up the joint calling and to save storage compared to the joint calling with GATK (Guðbjartsson, Þór Ísleifsson, and Ragnarsson 2022b; Kendig, Baheti, and Bockol 2019; Lin, Rodeh, and Penn 2018). Additionally, this step can nowadays be accomplished easily with a batch‐based multisample calling approach. In our experience, this approach is even more efficient than joint calling with other software packages. First, it appears to be faster. For comparison, the batch‐based multisample caller could complete the multisample calling of more than 8000 samples within four days on our local high‐performance computing cluster. A classical joint sample caller required approximately 5 times longer. Second, samples can be easily added to the multisample call file without the need to start the joint calling from the beginning.

Accuracy has improved in addition to speed and the reduction in memory usage. For example, more precise mapping and alignment procedures have been developed (Betschart et al. 2022; Roddey, Catreux, and Chen 2022). In addition, more accurate reference genomes have become available, and the most recent one is the Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14), released in early 2022 (Genome Reference Consortium 2022). Furthermore, variant callers making more accurate calls have been proposed (Betschart et al. 2022; Koboldt 2020; Supernat et al. 2018; Zhao et al. 2020a), some of which have competed in the precisionFDA Truth Challenges. Improved and standardized QC procedures also contribute to the accuracy at all stages of a WGS study, that is, from the first steps in the laboratory to the last steps before association analysis. Another reason for higher accuracy is the improvement in laboratory technologies. For example, it is nowadays standard to work with a PCR‐free library preparation protocol, which outperforms PCR‐based protocols (Zhou, Zhou, and Zeng 2022).

All these improvements have made WGS studies a reality with thousands or even hundreds of thousands of samples. The sheer amount of WGS data creates new challenges for association analyses between genetic markers and traits of interest, both computationally and statistically. This wealth of data allows studying many kinds of variation, such as Indels, STRs, SVs, or diallelic genetic markers, say SNPs. The different types of data all need their specific statistical methods for association analyses. It is beyond the scope of this paper to address the aspects of association analyses. Instead, we refer readers to the literature for reviews on statistical methods for common variants, that is, polymorphisms with an MAF above a reasonable threshold, say 5% (Tam et al. 2019; Uffelmann, Huang, and Munung 2021; Ziegler, Thompson, and König 2008). Furthermore, several reviews have examined statistical methods for the analysis of rare variants (Asimit and Zeggini 2010; Bansal et al. 2010; Dering et al. 2011; Derkach, Lawless, and Sun 2014; Nicolae 2016; Povysil et al. 2019).

An important aspect of the application of tests for both common and rare variants is whether adjustments for population stratification and/or for (cryptic) relatedness need to be done. In our study, we decided to restrict the sequence data set to a homogeneous EUR‐like sample. Furthermore, in association analysis, we will only include subjects with an estimated familial relationship further apart than third degree, which corresponds to an estimated kinship coefficient lower than 1/2^(9/2), that is, approximately 0.044 (Manichaikul et al. 2010). Important familial relationships have been detected with the kinship coefficients plotted against the probability of sharing 0 alleles IBD (results not shown). Combining both approaches will allow us to perform association analyses for independent subjects that require an ethnically homogeneous population. Adjustments for population stratification and cryptic relatedness in downstream association analysis can thus be omitted.
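The kinship threshold can be placed in the context of the inference criteria of Manichaikul et al. (2010), in which the boundaries between degrees of relationship are powers of two with half‐integer exponents. The following sketch classifies an estimated kinship coefficient accordingly; the function name and labels are hypothetical.

```python
# Lower bounds of the kinship coefficient per relationship class
# (Manichaikul et al. 2010), ordered from closest to most distant.
KINSHIP_LOWER_BOUNDS = [
    ("duplicate or monozygotic twin", 2 ** -1.5),  # about 0.354
    ("first degree", 2 ** -2.5),                   # about 0.177
    ("second degree", 2 ** -3.5),                  # about 0.088
    ("third degree", 2 ** -4.5),                   # about 0.044
]
UNRELATED_LABEL = "more distant than third degree"


def classify_kinship(kinship):
    """Map an estimated kinship coefficient to a relationship class."""
    for label, lower_bound in KINSHIP_LOWER_BOUNDS:
        if kinship > lower_bound:
            return label
    return UNRELATED_LABEL


def include_in_association_analysis(kinship):
    """True if a pair is further apart than third degree."""
    return classify_kinship(kinship) == UNRELATED_LABEL


print(round(2 ** -4.5, 4))      # 0.0442
print(classify_kinship(0.25))   # first degree (e.g., parent-offspring)
print(classify_kinship(0.01))   # more distant than third degree
```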

In summary, preprocessing, single‐sample calling, multisample calling, and QC of large WGS studies are now feasible within reasonable time. Efficient QC procedures are readily available. With the 100–200 USD genome being available soon, limitations have shifted from the sequencing technology to bioinformatics and biostatistics.

Conflicts of Interest

R.O.B., S.B., C.R., F.S., and A.Z. are employees of Cardio‐CARE, a wholly owned subsidiary of the Kühne Foundation. A.T. and H.S. were employees of Cardio‐CARE. S.B., R.T., A.Z., and T.Z. are listed as co‐inventors of an international patent on the use of a computing device to estimate the probability of myocardial infarction (International Publication Number WO2022043229A1). Cardio‐CARE, R.T., and T.Z. are shareholders of the ART‐EMIS Hamburg GmbH. A.Z. is the associate editor of the Biometrical Journal.

Open Research Badges

This article has earned an Open Data badge for making publicly available the digitally‐shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section.

This article has earned an open data badge “Reproducible Research” for making publicly available the code necessary to reproduce the reported results. The results reported in this article could only be partially reproduced because of confidentiality issues.

Supporting information

Acknowledgments

We gratefully acknowledge funding of the whole genome sequencing study by the Kühne Foundation. Raphael Twerenbold holds a professorship in clinical cardiology at the University Medical Center Hamburg‐Eppendorf, supported by the Kühne Foundation, and reports research support from the German Center for Cardiovascular Research (DZHK). Tanja Zeller is supported by the German Research Foundation, the EU Horizon 2020 program, the EU ERANet and ERAPreMed Programmes, the German Centre for Cardiovascular Research (DZHK, 81Z0710102), and the German Ministry of Education and Research. We are grateful to Patricia Bartoschek, Satya Bhowmik, Anna Lena Engels, Tim Hartmann, Yumi Hartmann, Lilia Kisselmann, Anna‐Lena Post, and René Riedl for excellent laboratory work with DNA extraction, data management, and quality control.

INTERCATH

The INTERCATH cohort is a prospective observational study of patients who underwent invasive coronary angiography at the University Heart and Vascular Center Hamburg (Waldeyer, Seiffert, and Staebe 2017). Overall, approximately 3000 patients were enrolled between 2015 and 2020. Next to the precise classification of the coronary anatomy using different scoring methods, comorbidities, standard laboratory parameters, dietary habits, and lifestyle information were obtained. Also, genomic material stored in LVL tubes is available. Primary endpoints include major adverse cardiovascular events, revascularization procedures, cardiac hospitalizations, and all‐cause mortality. A median follow‐up of 3.5 years is available. The aim is to investigate the potential genetic background of different manifestations of coronary artery disease (CAD) according to the risk factor burden and different anatomical features. Endpoints of interest include premature CAD, that is, CAD at a young age, absence of CAD in elderly patients despite the presence of multiple cardiovascular risk factors, as well as focal CAD, angiographically defined as patients with a localized coronary artery stenosis, in comparison to a diffuse vascular disease. Ethics committee (EC): Ethik‐Kommission der Ärztekammer Hamburg, file number PV4303, date of vote: Jan 22, 2014. Registration: clinicaltrials.gov, NCT04936438, date of registration: June 23, 2021, recruitment: start of enrolment Jan 01, 2015, halted. Written informed consent has been obtained from all study participants.

GrAPHIC

The Geno‐ And PHenotyping of PrImary Cardiomyopathy (GrAPHIC) study is a disease cohort study to precisely genotype and phenotype patients with heart failure due to primary cardiomyopathy. The study focuses on dilated cardiomyopathy, while patients having a history of CAD, valvular heart disease, inflammatory and toxic conditions, or other causal factors for dilated cardiomyopathy are not included in GrAPHIC. The study evaluates clinical phenotypes as well as genetic and histological patterns of patients with stable disease. Patients undergo cardiac biomarker testing, broad genetic analyses, histological assessment of endomyocardial biopsies, and multimodal imaging including cardiac MRI. Enrolled patients were followed up after 12 months by repeated MRI and biobanking. EC: Ethik‐Kommission der Ärztekammer Hamburg, PV5990, date of vote: Aug 19, 2019. Written informed consent has been obtained from all study participants.

AFHRI

The AFHRI study is an epidemiological, prospective, single‐center cohort study to improve atrial fibrillation (AF) risk stratification in high‐risk individuals in the context of the establishment of a biobank. The objective of the study is to better understand the pathogenesis of AF and other cardiovascular diseases (coronary artery disease, stroke). Outpatients at the reception of the cardiology outpatient clinic are the targeted study population. Participants aged 18–85 years who personally signed an informed consent were enrolled. Individuals with insufficient knowledge of the German language or with a physical or psychological incapability to cooperate in the investigations were excluded from the study. A total of 1200 individuals participated with a median follow‐up of 2.6 years from 2012 until 2020, and enrolment is still ongoing.

The primary endpoint of the study is incident AF irrespective of its type of manifestation (paroxysmal, persistent, or permanent). Secondary endpoints are myocardial infarction, cardiovascular death, and heart failure. Follow‐up examinations are performed after 2 and 4 years. EC: Ethik‐Kommission der Ärztekammer Hamburg, AFHRI main study: PV5705, date of vote: 2013; AFHRI B: PV3982, date of vote: 2016; AFHRI C: PV5878, date of vote: Jan 29, 2019. Written informed consent has been obtained from all study participants.

MIYoung | Gutenberg Infarction Study (GIS) | Biomarker in Acute Cardiac Care (BACC)

All three cohorts (MIYoung, GIS, BACC) consist of patients with acute myocardial infarction (Neumann, Sörensen, and Schwemer 2016). Within this project, patients with early onset myocardial infarction, defined as any myocardial infarction <50 years of age, are analyzed. EC for MIYoung: Ethik‐Kommission der Ärztekammer Hamburg, PV4379, date of vote: August 19, 2013, and Ethik‐Kommission der Ärztekammer Schleswig‐Holstein, 048/15 (m), date of vote: Apr 30, 2015. EC for GIS: Ethik‐Kommission der Ärztekammer Rheinland‐Pfalz, 837.211.08 (6208), date of vote: Jun 26, 2008. EC for BACC: Ethik‐Kommission der Ärztekammer Hamburg, PV4306, date of vote: Jul 25, 2013. Registration for BACC: clinicaltrials.gov, NCT02355457, date of registration: Feb 4, 2015, recruitment: start of enrolment Jul 01, 2013, active. Written informed consent has been obtained from all study participants.

Hamburg City Health Study

The Hamburg City Health Study (HCHS) is an ongoing, prospective, long‐term, population‐based cohort study (Jagodzinski, Johansen, and Koch‐Gromus 2020). A random sample of up to 45,000 participants between 45 and 74 years of age from the general population of Hamburg, Germany, is to be enrolled with an extensive phenotypic baseline assessment. To date, more than 18,000 individuals have been enrolled. The primary objective of this study is to gain profound knowledge about risk factors and mechanisms involved in the development of major chronic diseases in the general population. A major focus will lie on the identification of genetic polymorphisms associated with major diseases, for example, coronary heart disease, heart failure, and atrial fibrillation. Participants undergo extensive phenotyping, including imaging examinations and biobanking. The protocol includes validated self‐reports via questionnaires regarding lifestyle and environmental conditions, dietary habits, physical condition and activity, sexual dysfunction, professional life, psychosocial context and burden, quality of life, digital media use, occupational, medical, and family history as well as healthcare utilization. The assessment is completed by genomic and proteomic characterization, which will be shared between UKE and Cardio‐CARE within this project. Beyond the identification of classical risk factors for major chronic diseases and survival, the core intention is to gather valid prevalence and incidence data and to develop complex models predicting health outcomes based on a multitude of examination data, imaging, biomarker, psychosocial, and behavioral assessments. EC: Ethik‐Kommission der Ärztekammer Hamburg, PV5131, date of vote: Oct 20, 2015. Registration: clinicaltrials.gov, NCT03934957, date of registration: May 2, 2019, recruitment start: Feb 08, 2016, active. Written informed consent has been obtained from all study participants.

Data Availability Statement

One GIAB trio dataset generated during the current study is available in the Sequence Read Archive (SRA) repository, accession number: PRJNA907182. All GIAB data, approximately 6 TB in total, are available from the corresponding author on reasonable request for collaborative projects. Patient and study participant data may not be shared due to privacy issues.

References

  1. Abdullaev, E. T. , Umarova I. R., and Arndt P. F.. 2021. “Modelling Segmental Duplications in the Human Genome.” BMC Genomics [Electronic Resource] 22: 496. 10.1186/s12864-021-07789-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Abnizova, I. , Te Boekhorst R., and Orlov Y. L.. 2017. “Computational Errors and Biases in Short Read Next Generation Sequencing.” Journal of Proteomics and Bioinformatics 10: 1–17. 10.4172/jpb.1000420. [DOI] [Google Scholar]
  3. Adelson, R. P. , Renton A. E., Li W., et al. 2019. “Empirical Design of a Variant Quality Control Pipeline for Whole Genome Sequencing Data Using Replicate Discordance.” Scientific Reports 9: 16156. 10.1038/s41598-019-52614-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Aganezov, S. , Yan S. M., Soto D. C., et al. 2022. “A Complete Reference Genome Improves Analysis of Human Genetic Variation.” Science 376: eabl3533. 10.1126/science.abl3533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Akey, J. M. 2009. “Constructing Genomic Maps of Positive Selection in Humans: Where Do We Go From Here?” Genome Research 19: 711–722. 10.1101/gr.086652.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Antonarakis, S. E. 2022. “Short Arms of Human Acrocentric Chromosomes and the Completion of the Human Genome Sequence.” Genome Research 32: 599–607. 10.1101/gr.275350.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Anvar, S. Y. , Khachatryan L., Vermaat M., et al. 2014. “Determining the Quality and Complexity of Next‐Generation Sequencing Data Without a Reference Genome.” Genome Biology 15: 555. 10.1186/s13059-014-0555-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Ari, Ş. , and Arikan M.. 2016. “Next‐Generation Sequencing: Advantages, Disadvantages, and Future.” In: Plant Omics: Trends and Applications, edited by K. R. Hakeem, H. Tombuloğlu, and G. Tombuloğlu, pp 109–135. Cham: Springer. [Google Scholar]
  9. Arora, K. , Shah M., Johnson M., et al. 2019. “Deep Whole‐Genome Sequencing of 3 Cancer Cell Lines on 2 Sequencing Platforms.” Scientific Reports 9: 19123. 10.1038/s41598-019-55636-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Asimit, J. , and Zeggini E.. 2010. “Rare Variant Association Analysis Methods for Complex Traits.” Annual Review of Genetics 44: 293–308. 10.1146/annurev-genet-102209-163421. [DOI] [Google Scholar]
  11. Bailey, T. , Krajewski P., Ladunga I., et al. 2013. “Practical Guidelines for the Comprehensive Analysis of ChIP‐seq Data.” PLOS Computational Biology 9: e1003326. 10.1371/journal.pcbi.1003326. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Bansal, V. , Libiger O., Torkamani A., and Schork N. J.. 2010. “Statistical Analysis Strategies for Association Studies Involving Rare Variants.” Nature Reviews Genetics 11: 773–785. 10.1038/nrg2867. [DOI] [Google Scholar]
  13. Barba, M. , Czosnek H., and Hadidi A.. 2014. “Historical Perspective, Development and Applications of Next‐Generation Sequencing in Plant Virology.” Viruses 6: 106–136. 10.3390/v6010106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Berglund, A. , Stochholm K., and Gravholt C. H.. 2020. “The Epidemiology of Sex Chromosome Abnormalities.” American Journal of Medical Genetics. Part C, Seminars in Medical Genetics 184: 202–215. 10.1002/ajmg.c.31805. [DOI] [PubMed] [Google Scholar]
  15. Betschart, R. O. , Thiéry A., Aguilera‑Garcia D., et al. 2022. “Comparison of Calling Pipelines for Whole Genome Sequencing: An Empirical Study Demonstrating the Importance of Mapping and Alignment.” Scientific Reports 12: 21502. 10.1038/s41598-022-26181-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Broad Institute . 2020. “Germline Short Variant Discovery (SNPs + Indels).” Accessed September 19, 2023. https://gatk.broadinstitute.org/hc/en‐us/articles/360035535932‐Germline‐short‐variantdiscovery‐SNPs‐Indels.
  17. Brockman, W. , Alvarez P., Young S., et al. 2008. “Quality Scores and SNP Detection in Sequencing‐by‐Synthesis Systems.” Genome Research 18: 763–770. 10.1101/gr.070227.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Byrska‐Bishop, M. , Evani U. S., Zhao X., et al. 2021. “High Coverage Whole Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios.” BioRxiv. 10.1101/2021.02.06.430068. [DOI]
  19. Cantor, R. M. , Lange K., and Sinsheimer J. S.. 2010. “Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application.” American Journal of Human Genetics 86: 6–22. 10.1016/j.ajhg.2009.11.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Challis, D. , Yu J., Evani U. S., et al. 2012. “An Integrative Variant Analysis Suite for Whole Exome Next‐Generation Sequencing Data.” BMC Bioinformatics [Electronic Resource] 13: 8. 10.1186/1471-2105-13-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Chen, D. , Zhang X., Kang H., et al. 2012. “Phylogeography of Quercus Variabilis Based on Chloroplast DNA Sequence in East Asia: Multiple Glacial Refugia and Mainland‐Migrated Island Populations.” PLoS ONE 7: e47268. 10.1371/journal.pone.0047268. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Chen, Y. , Graf L., Chen T., et al. 2021. “Rare Variant MX1 Alleles Increase Human Susceptibility to Zoonotic H7N9 Influenza Virus.” Science 373: 918–922. 10.1126/science.abg5953. [DOI] [PubMed] [Google Scholar]
  23. Costello, M. , Fleharty M., Abreu J., et al. 2018. “Characterization and Remediation of Sample Index Swaps by Non‐redundant Dual Indexing on Massively Parallel Sequencing Platforms.” BMC Genomics [Electronic Resource] 19: 332. 10.1186/s12864-018-4703-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Craig, J. M. , Vena N., Ramkissoon S., et al. 2012. “DNA Fragmentation Simulation Method (FSM) and Fragment Size Matching Improve aCGH Performance of FFPE Tissues.” PLoS ONE 7: e38881. 10.1371/journal.pone.0038881. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Deorowicz, S. , Danek A., and Kokot M.. 2021. “VCFShark: How to Squeeze a VCF File.” Bioinformatics 37: 3358–3360. 10.1093/bioinformatics/btab211. [DOI] [PubMed] [Google Scholar]
  26. DePristo, M. A. , Banks E., Poplin R., et al. 2011. “A Framework for Variation Discovery and Genotyping Using Next‐Generation DNA Sequencing Data.” Nature Genetics 43: 491–498. 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Dering, C. , Hemmelmann C., Pugh E., and Ziegler A.. 2011. “Statistical Analysis of Rare Sequence Variants: An Overview of Collapsing Methods.” Genetic Epidemiology 35: S12–S17. 10.1002/gepi.20643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Derkach, A. , Lawless J. F., and Sun L.. 2014. “Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results.” Statistical Science 29: 302–321. 10.1214/13-STS456. [DOI] [Google Scholar]
  29. Drmanac, R. , Drmanac S., Chui G., et al. 2002. “Sequencing by Hybridization (SBH): Advantages, Achievements, and Opportunities.” Advances in Biochemical Engineering/Biotechnology 77: 75–101. 10.1007/3-540-45713-5_5. [DOI] [PubMed] [Google Scholar]
  30. Ebbert, M. T. , Wadsworth M. E., Staley L. A., et al. 2016. “Evaluating the Necessity of PCR Duplicate Removal from Next‐Generation Sequencing Data and a Comparison of Approaches.” BMC Bioinformatics [Electronic Resource] 17, no. 7: 239. 10.1186/s12859-016-1097-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Elhaik, E. 2022. “Principal Component Analyses (PCA)‐Based Findings in Population Genetic Studies Are Highly Biased and Must Be Reevaluated.” Scientific Reports 12: 14683. 10.1038/s41598-022-14395-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Endrullat, C. , Glokler J., Franke P., and Frohme M.. 2016. “Standardization and Quality Management in Next‐Generation Sequencing.” Applied and Translational Genomics 10: 2–9. 10.1016/j.atg.2016.06.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Erikson, G. A. , Bodian D. L., Rueda M., et al. 2016. “Whole‐Genome Sequencing of a Healthy Aging Cohort.” Cell 165: 1002–1011. 10.1016/j.cell.2016.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Garg, P. , Jadhav B., Lee W., Rodriguez O. L., Martin‐Trujillo A., and Sharp A. J.. 2022. “A Phenome‐Wide Association Study Identifies Effects of Copy‐Number Variation of VNTRs and Multicopy Genes on Multiple Human Traits.” American Journal of Human Genetics 109: 1065–1076. 10.1016/j.ajhg.2022.04.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Genome Reference Consortium . 2022. “Genome Assembly GRCh38.p14.” Accessed September 19, 2023. https://www.ncbi.nlm.nih.gov/data‐hub/genome/GCF_000001405.40.
  36. Gilly, A. , Park Y. C., Png G., et al. 2020. “Whole‐Genome Sequencing Analysis of the Cardiometabolic Proteome.” Nature Communications 11: 6336. 10.1038/s41467-020-20079-2. [DOI] [Google Scholar]
  37. Goodwin, S. , McPherson J. D., and McCombie W. R.. 2016. “Coming of Age: Ten Years of Next‐Generation Sequencing Technologies.” Nature Reviews Genetics 17: 333–351. 10.1038/nrg.2016.49. [DOI] [Google Scholar]
  38. Graffelman, J. , and Weir B. S.. 2016. “Testing for Hardy–Weinberg Equilibrium at Biallelic Genetic Markers on the X Chromosome.” Heredity 116: 558–568. 10.1038/hdy.2016.20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Guðbjartsson, H. , Ísleifsson H. Þ., Ragnarsson B., et al. 2022a. “Ultra‐Fast Joint‐Genotyping with SparkGOR.” BioRxiv, 10.1101/2022.10.25.513331. [DOI]
  40. Guðbjartsson, H. , Þór Ísleifsson H., Ragnarsson B., et al. 2022b. “Ultra‐Fast Joint‐Genotyping with SparkGOR.” BioRxiv, 10.1101/2022.10.25.513331. [DOI]
  41. Guo, X. , Dai X., Zhou T., et al. 2020. “Mosaic Loss of Human Y Chromosome: What, How and Why.” Human Genetics 139: 421–446. 10.1007/s00439-020-02114-w. [DOI] [PubMed] [Google Scholar]
  42. Guo, Y. , Samuels D. C., Li J., et al. 2013. “Evaluation of Allele Frequency Estimation Using Pooled Sequencing Data Simulation.” Scientific World Journal 2013: 895496. 10.1155/2013/895496. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Guo, Y. , Ye F., Sheng Q., Clark T., and Samuels D. C.. 2014a. “Three‐Stage Quality Control Strategies for DNA Re‐sequencing Data.” Briefings in Bioinformatics 15: 879–889. 10.1093/bib/bbt069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Guo, Y. , Zhao S., Sheng Q., et al. 2014b. “Multi‐Perspective Quality Control of Illumina Exome Sequencing Data Using QC3.” Genomics 103: 323–328. 10.1016/j.ygeno.2014.03.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Halldorsson, B. V. , Eggertsson H. P., Moore K. H. S., et al. 2022. “The Sequences of 150,119 Genomes in the UK Biobank.” Nature 607: 732–740. 10.1038/s41586-022-04965-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Hayden, E. C. 2014. “Is the $1,000 Genome for Real?” Nature 10.1038/nature.2014.14530. [DOI] [Google Scholar]
  47. Hou, Y. , Wu K., Shi X., et al. 2015. “Comparison of Variations Detection Between Whole‐Genome Amplification Methods Used in Single‐Cell Resequencing.” Gigascience 4: 37. 10.1186/s13742-015-0068-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Huptas, C. , Scherer S., and Wenning M.. 2016. “Optimized Illumina PCR‐Free Library Preparation for Bacterial Whole Genome Sequencing and Analysis of Factors Influencing De Novo Assembly.” BMC Research Notes 9: 269. 10.1186/s13104-016-2072-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Hwang, K. B. , Lee I. H., Park J. H., et al. 2014. “Reducing False‐Positive Incidental Findings With Ensemble Genotyping and Logistic Regression Based Variant Filtering Methods.” Human Mutation 35: 936–944. 10.1002/humu.22587. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Illumina Inc . 2019. “NovaSeq 6000 System Specifications.” Accessed September 19, 2023. https://emea.illumina.com/systems/sequencing‐platforms/novaseq/specifications.html.
  51. Illumina Inc . 2022. “DRAGEN Iterative gVCF Genotyper Quick Start Guide. ILLUMINA.” Accessed September 19, 2023. https://support‐docs.illumina.com/SW/DRAGEN_v39/Content/SW/DRAGEN/gVCFGenotyper.htm.
  52. Illumina Inc . 2023. “DRAGEN ORA Compression and Decompression. Illumina.” Accessed February 4, 2024. https://support‐docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/ORACompression.htm.
  53. Jagodzinski, A. , Johanse C., Koch‐Gromus U., et al. 2020. “Rationale and Design of the Hamburg City Health Study.” European Journal of Epidemiology 35: 169–181. 10.1007/s10654-019-00577-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Jew, B. , and Sul J. H.. 2019. “Variant Calling and Quality Control of Large‐Scale Human Genome Sequencing Data.” Emerging Topics in Life Sciences 3: 399–409. 10.1042/ETLS20190007. [DOI] [PubMed] [Google Scholar]
  55. Jung, Y. , and Han D.. 2022. “BWA‐MEME: BWA‐MEM Emulated With a Machine Learning Approach.” Bioinformatics 38: 2404–2413. 10.1093/bioinformatics/btac137. [DOI] [PubMed] [Google Scholar]
  56. Kelly, T. N. , Sun X., He K. Y., et al. 2022. “Insights From a Large‐Scale Whole‐Genome Sequencing Study of Systolic Blood Pressure, Diastolic Blood Pressure, and Hypertension.” Hypertension 79: 1656–1667. 10.1161/HYPERTENSIONAHA.122.19324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Kendig, K. I. , Baheti S., Bockol M. A., et al. 2019. “Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy.” Frontiers in Genetics 10: 736. 10.3389/fgene.2019.00736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Kirkpatrick, B. , Ge S., and Wang L.. 2019. “Efficient Computation of the Kinship Coefficients.” Bioinformatics 35: 1002–1008. 10.1093/bioinformatics/bty725. [DOI] [PubMed] [Google Scholar]
  59. Koboldt, D. C. 2020. “Best Practices for Variant Calling in Clinical Sequencing.” Genome Medicine 12: 91. 10.1186/s13073-020-00791-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Koboldt, D. C. , Ding L., Mardis E. R., and Wilson R. K.. 2010. “Challenges of Sequencing Human Genomes.” Briefings in Bioinformatics 11: 484–498. 10.1093/bib/bbq016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Koenig, Z. , Yohannes M. T., Nkambule L. L., et al. 2023. “A Harmonized Public Resource of Deeply Sequenced Diverse Human Genomes.” BioRxiv, 10.1101/2023.01.23.525248. [DOI]
  62. Kofler, R. , Orozco‐terWengel P., De Maio N., et al. 2011. “PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals.” PLoS ONE 6: e15925. 10.1371/journal.pone.0015925. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Lam, H. Y. , Clark M. J., Chen R., et al. 2012. “Performance Comparison of Whole‐Genome Sequencing Platforms.” Nature Biotechnology 30: 78–82. 10.1038/nbt.2065. [DOI] [Google Scholar]
  64. Lan, D. , Tobler R., Souilmi Y., and Llamas B.. 2021. “Genozip: A Universal Extensible Genomic Data Compressor.” Bioinformatics 37: 2225–2230. 10.1093/bioinformatics/btab102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Lin, M. F. , Rodeh O., Penn J., et al. 2018. “GLnexus: Joint Variant Calling for Large Cohort Sequencing.” BioRxiv, 343970. 10.1101/343970. [DOI]
  66. Lin, Y. L. , Chang P. C., Hsu C., et al. 2022. “Comparison of GATK and DeepVariant by Trio Sequencing.” Scientific Reports 12: 1809. 10.1038/s41598-022-05833-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Ma, X. , Shao Y., Tian L., et al. 2019. “Analysis of Error Profiles in Deep Next‐Generation Sequencing Data.” Genome Biology 20: 50. 10.1186/s13059-019-1659-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Manichaikul, A. , Mychaleckyj J. C., Rich S., Daly K., Sale M., and Chen W. M.. 2010. “Robust Relationship Inference in Genome‐Wide Association Studies.” Bioinformatics 26: 2867–2873. 10.1093/bioinformatics/btq559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Marth, G. T. , Yu F., Indap A. R., et al. 2011. “The Functional Spectrum of Low‐Frequency Coding Variation.” Genome Biology 12: R84. 10.1186/gb-2011-12-9-r84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Marx, V. 2023. “Method of the Year: Long‐Read Sequencing.” Nature Methods 20: 6–11. 10.1038/s41592-022-01730-w. [DOI] [PubMed] [Google Scholar]
  71. Mathieson, I. , Lazaridis I., Rohland N., et al. 2015. “Genome‐Wide Patterns of Selection in 230 Ancient Eurasians.” Nature 528: 499–503. 10.1038/nature16152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Miller, N. A. , Farrow E. G., Gibson M., et al. 2015. “A 26‐Hour System of Highly Sensitive Whole Genome Sequencing for Emergency Management of Genetic Diseases.” Genome Medicine 7: 100. 10.1186/s13073-015-0221-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Mobley, I. 2021. “How Did Illumina Dominate the Sequencing Market?” Accessed September 19, 2023. https://frontlinegenomics.com/how‐did‐illumina‐monopolize‐the‐sequencing‐market.
  74. Morrison, A. C. , Huang Z., Yu B., et al. 2017. “Practical Approaches for Whole‐Genome Sequence Analysis of Heart‐ and Blood‐Related Traits.” American Journal of Human Genetics 100: 205–215. 10.1016/j.ajhg.2016.12.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Moustafa, A. , Xie C., Kirkness E., et al. 2017. “The Blood DNA Virome in 8,000 Humans.” PLoS Pathogens 13: e1006292. 10.1371/journal.ppat.1006292. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Muyas, F. , Bosio M., Puig A., et al. 2019. “Allele Balance Bias Identifies Systematic Genotyping Errors and False Disease Associations.” Human Mutation 40: 115–126. 10.1002/humu.23674. [DOI] [PMC free article] [PubMed] [Google Scholar]
  77. Natarajan, P. , Peloso G. M., Zekavat S. M., et al. 2018. “Deep‐Coverage Whole Genome Sequences and Blood Lipids Among 16,324 Individuals.” Nature Communications 9: 3391. 10.1038/s41467-018-05747-8. [DOI] [Google Scholar]
  78. Nei, M. 1978. “Estimation of Average Heterozygosity and Genetic Distance from a Small Number of Individuals.” Genetics 89: 583–590. 10.1093/genetics/89.3.583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Neumann, J. T. , Sörensen N. A., Schwemer T., et al. 2016. “Diagnosis of Myocardial Infarction Using a High‐Sensitivity Troponin I 1‐Hour Algorithm.” JAMA Cardiology 1: 397–404. 10.1001/jamacardio.2016.0695. [DOI] [PubMed] [Google Scholar]
  80. Nicolae, D. L. 2016. “Association Tests for Rare Variants.” Annual Review of Genomics and Human Genetics 17: 117–130. 10.1146/annurev-genom-083115-022609. [DOI] [Google Scholar]
  81. Novembre, J. , Johnson T., Bryc K., et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456: 98–101. 10.1038/nature07331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Nurk, S. , Koren S., Rhie A., et al. 2022. “The Complete Sequence of a Human Genome.” Science 376: 44–53. 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. O'Connell, K. A. , Yosufzai Z. B., Campbell R. A., et al. 2022. “Accelerating Genomic Workflows Using NVIDIA Parabricks.” BioRxiv, 10.1101/2022.07.20.498972. [DOI]
  84. Oliver, G. R. , Hart S. N., and Klee E. W.. 2015. “Bioinformatics for Clinical Next Generation Sequencing.” Clinical Chemistry 61: 124–135. 10.1373/clinchem.2014.224360. [DOI] [PubMed] [Google Scholar]
  85. Olson, N. , Wagner J., McDaniel J., et al. 2020. “precisionFDA Truth Challenge V2: Calling Variants from Short‐ and Long‐Reads in Difficult‐to‐Map Regions.” BioRxiv. 10.1101/2020.11.13.380741. [DOI]
  86. Pan, B. , Kusko R., Xiao W., et al. 2019. “Similarities and Differences Between Variants Called With Human Reference Genome HG19 or HG38.” BMC Bioinformatics [Electronic Resource] 20: 101. 10.1186/s12859-019-2620-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Panoutsopoulou, K. , and Walter K.. 2018. “Quality Control of Common and Rare Variants.” Methods in Molecular Biology 1793: 25–36. 10.1007/978-1-4939-7868-7_3. [DOI] [PubMed] [Google Scholar]
  88. Pereira, R. , Oliveira J., and Sousa M.. 2020. “Bioinformatics and Computational Tools for Next‐Generation Sequencing Analysis in Clinical Genetics.” Journal of Clinical Medicine 9, no. 1: 132. 10.3390/jcm9010132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Peterson, R. E. , Kuchenbaecker K., Walters R. K., et al. 2019. “Genome‐Wide Association Studies in Ancestrally Diverse Populations: Opportunities, Methods, Pitfalls, and Recommendations.” Cell 179: 589–603. 10.1016/j.cell.2019.08.051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Pfeifer, S. P. 2017. “From Next‐Generation Resequencing Reads to a High‐Quality Variant Data Set.” Heredity 118: 111–124. 10.1038/hdy.2016.102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Plüss, M. , Kopps A. M., Keller I., et al. 2017. “Need for Speed in Accurate Whole‐Genome Data Analysis: GENALICE MAP Challenges BWA/GATK More Than PEMapper/PECaller and Isaac.” Proceedings of the National Academy of Sciences of the United States of America 114: E8320–E8322. 10.1073/pnas.1713830114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  92. Poplin, R. , Chang P. C., Alexander D., et al. 2018a. “A Universal SNP and Small‐Indel Variant Caller Using Deep Neural Networks.” Nature Biotechnology 36: 983–987. 10.1038/nbt.4235. [DOI] [Google Scholar]
  93. Poplin, R. , Ruano‐Rubio V., DePristo M. A., et al. 2018b. “Scaling Accurate Genetic Variant Discovery to Tens of Thousands of Samples.” BioRxiv, 201178. 10.1101/201178. [DOI]
  94. Povysil, G. , Petrovski S., Hostyk J., Aggarwal V., Allen A. S., and Goldstein D. B.. 2019. “Rare‐Variant Collapsing Analyses for Complex Traits: Guidelines and Applications.” Nature Reviews Genetics 20: 747–759. 10.1038/s41576-019-0177-4. [DOI] [Google Scholar]
  95. Price, A. L. , Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., and Reich D.. 2006. “Principal Components Analysis Corrects for Stratification in Genome‐Wide Association Studies.” Nature Genetics 38: 904–909. 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  96. Rhie, A. , Nurk S., Cechova M., et al. 2022. “The Complete Sequence of a Human Y Chromosome.” BioRxiv. 10.1101/2022.12.01.518724. [DOI]
  97. Rhie, A. , Nurk S., Cechova M., et al. 2023. “The Complete Sequence of a Human Y Chromosome.” Nature 621: 344–354. 10.1038/s41586-023-06457-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  98. Roddey, C. , Catreux S., Chen W. T., et al. 2022. “Application of DRAGEN Graph Read Alignment to Challenging Medically Relevant Genes and Other Difficult Regions in GRCh38 and T2T‐CHM13 Genomes. Poster PB2906”. Accessed September 19, 2023. https://www.ashg.org/wp-content/uploads/2022/09/ASHG2022-PosterAbstracts.pdf#%5B%7B%22num%22%3A4172%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C70%2C707%2C0%5D.
  99. Sarin, S. , Prabhu S., O'Meara M. M., Pe'er I., and Hobert O.. 2008. “Caenorhabditis elegans Mutant Allele Identification by Whole‐Genome Sequencing.” Nature Methods 5: 865–867. 10.1038/nmeth.1249. [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Sherry, S. T. , Ward M. H., Kholodov M., et al. 2001. “dbSNP: The NCBI Database of Genetic Variation.” Nucleic Acids Research 29: 308–311. 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Sims, D. , Sudbery I., Ilott N. E., Heger A., and Ponting C. P.. 2014. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15: 121–132. 10.1038/nrg3642. [DOI] [Google Scholar]
  102. Slatko, B. E. , Gardner A. F., and Ausubel F. M.. 2018. “Overview of Next‐Generation Sequencing Technologies.” Current Protocols in Molecular Biology 122: e59. 10.1002/cpmb.59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  103. Sollis, E. , Mosaku A., Abid A., et al. 2023. “The NHGRI‐EBI GWAS Catalog: Knowledgebase and Deposition Resource.” Nucleic Acids Research 51: D977–D985. 10.1093/nar/gkac1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Somineni, H. K. , Nagpal S., Venkateswaran S., et al. 2021. “Whole‐Genome Sequencing of African Americans Implicates Differential Genetic Architecture in Inflammatory Bowel Disease.” American Journal of Human Genetics 108: 431–445. 10.1016/j.ajhg.2021.02.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  105. Supernat, A. , Vidarsson O. V., Steen V. M., and Stokowy T.. 2018. “Comparison of Three Variant Callers for Human Whole Genome Sequencing.” Scientific Reports 8: 17851. 10.1038/s41598-018-36177-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  106. Taliun, D. , Harris D. N., Kessler M. D., et al. 2021. “Sequencing of 53,831 Diverse Genomes from the NHLBI TOPMed Program.” Nature 590: 290–299. 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  107. Tam, V. , Patel N., Turcotte M., Bosse Y., Pare G., and Meyre D.. 2019. “Benefits and Limitations of Genome‐Wide Association Studies.” Nature Reviews Genetics 20: 467–484. 10.1038/s41576-019-0127-1. [DOI] [Google Scholar]
  108. Tan, G. , Opitz L., Schlapbach R., and Rehrauer H.. 2019. “Long Fragments Achieve Lower Base Quality in Illumina Paired‐End Sequencing.” Scientific Reports 9: 2856. 10.1038/s41598-019-39076-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  109. Tempel, S. 2012. “Using and Understanding RepeatMasker.” Methods in Molecular Biology 859: 29–51. 10.1007/978-1-61779-603-6_2. [DOI] [PubMed] [Google Scholar]
  110. The 1000 Genomes Project Consortium . 2015. “A Global Reference for Human Genetic Variation.” Nature 526: 68–74. 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  111. The International HapMap Consortium . 2003. “The International HapMap Project.” Nature 426: 789–796. 10.1038/nature02168. [DOI] [PubMed] [Google Scholar]
  112. Thiéry, A. , Zeller T., Blankenberg S., and Ziegler A.. 2020. “COMET: An R Package to Identify Sample Cross‐Contamination in Whole Genome Sequencing Studies.” Human Heredity 85: 93–94. [Google Scholar]
  113. Thrash, A. , Arick M. 2nd, and Peterson D. G.. 2018. “Quack: A Quality Assurance Tool for High Throughput Sequence Data.” Analytical Biochemistry 548: 38–43. 10.1016/j.ab.2018.01.028. [DOI] [PubMed] [Google Scholar]
  114. Uffelmann, E. , Huang Q. Q., Munung N. S., et al. 2021. “Genome‐Wide Association Studies.” Nature Reviews Methods Primers 1: 59. 10.1038/s43586-021-00056-9. [DOI] [Google Scholar]
  115. United Nations . 1998. “United Nations Standard Country Code, series M: Miscellaneous Statistical Papers, no. 49. United Nations.” Accessed September 19, 2023. https://unstats.un.org/unsd/classifications/Family/Detail/12.
  116. Van der Auwera, G. A. , Carneiro M. O., Hartl C., et al. 2013. “From FastQ Data to High Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline.” Current Protocols in Bioinformatics 43: 11.10.1–11.10.33. 10.1002/0471250953.bi1110s43. [DOI] [Google Scholar]
  117. Waldeyer, C. , Seiffert M., Staebe N., et al. 2017. “Lipid Management after First Diagnosis of Coronary Artery Disease: Contemporary Results From an Observational Cohort Study.” Clinical Therapeutics 39: 2311–2320 e2312. 10.1016/j.clinthera.2017.10.005. [DOI] [Google Scholar]
  118. Wang, J. , Samuels D. C., Shyr Y., and Guo Y.. 2015. “Population Structure Analysis on 2504 Individuals Across 26 Ancestries Using Bioinformatics Approaches.” BMC Bioinformatics [Electronic Resource] 16: P19. 10.1186/1471-2105-16-S15-P19. [DOI] [Google Scholar]
  119. Wang, M. , Beckmann N. D., Roussos P., et al. 2018. “The Mount Sinai Cohort of Large‐Scale Genomic, Transcriptomic and Proteomic Data in Alzheimer's Disease.” Scientific Data 5: 180185. 10.1038/sdata.2018.185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  120. Wang, W. , Wei Z., Lam T. W., and Wang J.. 2011. “Next Generation Sequencing Has Lower Sequence Coverage and Poorer SNP‐Detection Capability in the Regulatory Regions.” Scientific Reports 1: 55. 10.1038/srep00055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  121. Wang, X. , Sui W., Wu W., et al. 2016. “Whole‐Genome Resequencing of 100 Healthy Individuals Using DNA Pooling.” Experimental and Therapeutic Medicine 12: 3143–3150. 10.3892/etm.2016.3797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  122. Wellek, S. , Goddard K. A., and Ziegler A.. 2010. “A Confidence‐Limit–Based Approach to the Assessment of Hardy–Weinberg Equilibrium.” Biometrical Journal 52: 253–270. 10.1002/bimj.200900249. [DOI] [PubMed] [Google Scholar]
  123. Wellek, S. , and Ziegler A.. 2019. “Testing for Goodness Rather Than Lack of Fit of an X‐Chromosomal SNP to the Hardy‐Weinberg Model.” PLoS ONE 14: e0212344. 10.1371/journal.pone.0212344. [DOI] [PMC free article] [PubMed] [Google Scholar]
  124. Wright, M. N. , Gola D., and Ziegler A.. 2017. “Preprocessing and Quality Control for Whole‐Genome Sequences From the Illumina HiSeq X Platform.” Methods in Molecular Biology 1666: 629–647. 10.1007/978-1-4939-7274-6_30. [DOI] [PubMed] [Google Scholar]
  125. Xu, Y. , Lin Z., Tang C., et al. 2019. “A New Massively Parallel Nanoball Sequencing Platform for Whole Exome Research.” BMC Bioinformatics [Electronic Resource] 20: 153. 10.1186/s12859-019-2751-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  126. Yang, L. A. , Chang Y. J., Chen S. H., Lin C. Y., and Ho J. M.. 2019. “SQUAT: A Sequencing Quality Assessment Tool for Data Quality Assessments of Genome Assemblies.” BMC Genomics [Electronic Resource] 19: 238. 10.1186/s12864-019-5445-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  127. Yun, T. , Li H., Chang P. C., Lin M. F., Carroll A., and McLean C. Y.. 2021. “Accurate, Scalable Cohort Variant Calls Using DeepVariant and GLnexus.” Bioinformatics 36: 5582–5589. 10.1093/bioinformatics/btaa1081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  128. Zhan, L. , Li J., Jew B., and Sul J. H.. 2021. “Rare Variants in the Endocytic Pathway Are Associated With Alzheimer's Disease, Its Related Phenotypes, and Functional Consequences.” PLOS Genetics 17: e1009772. 10.1371/journal.pgen.1009772. [DOI] [PMC free article] [PubMed] [Google Scholar]
  129. Zhao, S. , Agafonov O., Azab A., Stokowy T., and Hovig E.. 2020a. “Accuracy and Efficiency of Germline Variant Calling Pipelines for Human Genome Data.” Scientific Reports 10: 20222. 10.1038/s41598-020-77218-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  130. Zhao, T. , Duan Z., Genchev G. Z., and Lu H.. 2020b. “Closing Human Reference Genome Gaps: Identifying and Characterizing Gap‐Closing Sequences.” G3: Genes—Genomes—Genetics 10: 2801–2809. 10.1534/g3.120.401280. [DOI] [PMC free article] [PubMed] [Google Scholar]
  131. Zhou, G. , Zhou M., Zeng F., et al. 2022. “Performance Characterization of PCR‐Free Whole Genome Sequencing for Clinical Diagnosis.” Medicine 101: e28972. 10.1097/MD.0000000000028972. [DOI] [PMC free article] [PubMed] [Google Scholar]
  132. Zhou, L. , Ng H. K., Drautz‐Moses D. I., et al. 2019. “Systematic Evaluation of Library Preparation Methods and Sequencing Platforms for High‐Throughput Whole Genome Bisulfite Sequencing.” Scientific Reports 9: 10383. 10.1038/s41598-019-46875-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  133. Zhou, X. , and Rokas A.. 2014. “Prevention, Diagnosis and Treatment of High‐Throughput Sequencing Data Pathologies.” Molecular Ecology 23: 1679–1700. 10.1111/mec.12680. [DOI] [PubMed] [Google Scholar]
  134. Ziaei Jam, H. , Li Y., DeVito R., et al. 2023. “A Deep Population Reference Panel of Tandem Repeat Variation.” Nature Communications 14: 6711. 10.1038/s41467-023-42278-3. [DOI] [Google Scholar]
  135. Ziegler, A. , and König I. R.. 2010. A Statistical Approach to Genetic Epidemiology: Concepts and Applications. 2nd ed. Weinheim: Wiley‐VCH. [Google Scholar]
  136. Ziegler, A. , Thompson J. R., and König I. R.. 2008. “Biostatistical Aspects of Genome‐Wide Association Studies.” Biometrical Journal 50: 8–28. 10.1002/bimj.200710398. [DOI] [PubMed] [Google Scholar]
  137. Ziegler, A. , Van Steen K., and Wellek S.. 2011. “Investigating Hardy–Weinberg Equilibrium in Case‐Control or Cohort Studies or Meta‐Analysis.” Breast Cancer Research and Treatment 128: 197–201. 10.1007/s10549-010-1295-z. [DOI] [PubMed] [Google Scholar]
  138. Zook, J. M. , and Salit M.. 2011. “Genomes in a Bottle: Creating Standard Reference Materials for Genomic Variation—Why, What and How?” Genome Biology 12: P31. 10.1186/1465-6906-12-S1-P31. [DOI] [Google Scholar]

Articles from Biometrical Journal. Biometrische Zeitschrift are provided here courtesy of Wiley
