Whole genome sequencing of 10K patients with acute ischaemic stroke or transient ischaemic attack: design, methods and baseline patient characteristics

Si Cheng; Zhe Xu; Yang Liu; Jinxi Lin; Yong Jiang; Yilong Wang; Xia Meng; Anxin Wang; Xinying Huang; Zhimin Wang; Guohua Chen; Songdi Wu; Zhengchang Jia; Yongming Chen; Xuerong Qiu; Jun Wu; Binbin Song; Weizhong Ji; Zhongping An; Wenjun Xue; Lili Zhao; Yu Geng; Hongyan Li; Hao Li; Yongjun Wang

doi:10.1136/svn-2020-000664

2020 Dec 18;6(2):291–297.

PMCID: PMC8258062 PMID: 33443231

Abstract

Background and purpose

Stroke is the second leading cause of death worldwide and the leading cause of mortality and long-term disability in China, but its underlying risk genes and pathways are far from being comprehensively understood. We here describe the design and methods of whole genome sequencing (WGS) for 10 914 patients with acute ischaemic stroke or transient ischaemic attack from the Third China National Stroke Registry (CNSR-III).

Methods

Baseline clinical characteristics of the included patients are reported. DNA is extracted from the white blood cells of participants. Libraries are constructed using qualified DNA, and WGS is conducted on the BGISEQ-500 platform. The average depth is intended to be greater than 30× for each subject. Afterwards, Sentieon software is applied to process the sequencing data under the Genome Analysis Toolkit best practice guidance to call genotypes of single nucleotide variants (SNVs) and insertion-deletions. For each included subject, 21 fingerprint SNVs are genotyped by MassARRAY assays to verify that the DNA sample and the sequencing data originate from the same individual. Copy number variations and structural variations are also called for each patient. All genetic variants are annotated, and their effects predicted, using bioinformatics software or by reviewing public databases.

Results

The average age of the 10 914 included patients was 62.2±11.3 years, and 31.4% of the patients were women. Most of the baseline clinical characteristics were balanced between the 10 914 included patients and the excluded patients.

Conclusions

The WGS data, together with the abundant clinical and imaging data of CNSR-III, could provide an opportunity to elucidate the molecular mechanisms of stroke and to discover novel therapeutic targets.

Keywords: stroke, genetic

Introduction

Stroke is the second leading cause of death worldwide, and the leading cause of mortality and long-term disability in China.¹ Being the most common type of stroke, ischaemic stroke (IS) accounts for about 80% of all strokes,² and more than 90% of IS cases are sporadic.³ IS is a complex multifactorial disease arising from complicated gene-environment interactions. Therefore, uncovering genetic contributions to IS could help to identify the genes, pathways and networks that are involved in IS pathogenesis. Although several novel genetic variants associated with IS susceptibility have been discovered in recent decades,⁴⁻⁹ few studies have explored the correlation between genetic variants and stroke outcomes. Moreover, previous genetic studies on IS were mainly conducted in European and African populations,⁴ ¹⁰ and there are limited data for the Chinese population. Because of the substantial ancestral differences,¹¹ whether these reported IS-associated genetic variants also contribute to IS pathogenesis in the Chinese population needs verification.

The Third China National Stroke Registry (CNSR-III) is a nationwide prospective registry of 15 166 patients with IS or transient ischaemic attack (TIA) in China.¹² A broad and comprehensive spectrum of individual-level data has been collected, including clinical phenotypes, aetiological classification, neuroimaging, biomarkers and clinical outcomes. The aetiological subtyping information was recorded centrally. Taking advantage of these resources, we perform whole genome sequencing (WGS) for 10 914 patients in the prespecified genetic substudy of CNSR-III to delineate the genetic landscape of IS and TIA in the Chinese population.

Methods

Patients

The CNSR-III is a nationwide prospective registry of patients who presented to hospitals with acute ischaemic cerebrovascular events between August 2015 and March 2018 in China.¹² A total of 15 166 patients with IS (n=14 146, 93.3%) or TIA (n=1020, 6.7%) were enrolled within 7 days of symptom onset. The CNSR-III involved 201 hospitals covering 22 provinces and 4 municipalities in China, including 163 grade III (central hospitals for a certain district or city, usually teaching hospitals) and 38 grade II (hospitals serving several communities) urban hospitals. A total of 12 603 patients participated in the prespecified genetic substudy. White blood cells (WBCs) from a total of 10 914 patients are used for WGS (figure 1). Written informed consent was obtained from all patients or their legally authorised representatives before entry into the study.

DNA extraction

For each sample, genomic DNA was extracted from WBCs, either by automated extraction and purification using the Magnetic Blood Genomic DNA Kit (DP329, TIANGEN Biotech Co Ltd, Beijing, China) on the KingFisher Flex system (Thermo Scientific Co, Massachusetts, USA) at iGeneTech Co Ltd. (Beijing, China), or by manual phenol–chloroform extraction at BGI Genomics (BGI-Shenzhen).

Evaluation of DNA quality

The concentration of genomic DNA was quantified using a Qubit 2.0 fluorometer (Thermo Scientific Co, Massachusetts, USA) and a SpectraMax Gemini XPS (Molecular Devices, San Francisco, USA) at BGI Genomics (BGI-Shenzhen). Electrophoresis was conducted on a 1% agarose gel to confirm that the majority of genomic DNA fragments were longer than 20 kb and were not substantially degraded. Genomic DNA samples with a concentration ≥12.5 ng/µL and a total amount ≥0.5 µg were qualified for further procedures. The DNA of each qualified sample is used for library construction and the subsequent WGS process, as well as for single nucleotide variant (SNV) genotyping (see details below).
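The two quantitative thresholds above can be written as a simple check. The following is a minimal sketch only: the `DnaSample` class, its field names and the use of an elution volume to derive the total amount are illustrative assumptions, and the gel-based integrity check is not modelled.

```python
from dataclasses import dataclass

MIN_CONCENTRATION_NG_PER_UL = 12.5  # threshold stated in the text
MIN_TOTAL_UG = 0.5                  # threshold stated in the text


@dataclass
class DnaSample:
    """Hypothetical record for one extracted DNA sample."""
    sample_id: str
    concentration_ng_per_ul: float
    volume_ul: float

    @property
    def total_ug(self) -> float:
        # ng/uL * uL = ng; divide by 1000 to convert to micrograms
        return self.concentration_ng_per_ul * self.volume_ul / 1000.0


def is_qualified(sample: DnaSample) -> bool:
    """Return True if both the concentration and total-amount thresholds are met."""
    return (
        sample.concentration_ng_per_ul >= MIN_CONCENTRATION_NG_PER_UL
        and sample.total_ug >= MIN_TOTAL_UG
    )
```

For example, a sample at 20 ng/µL in 50 µL holds 1.0 µg and passes, whereas a sample at 15 ng/µL in 20 µL holds only 0.3 µg and fails on total amount.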

Library construction

The qualified genomic DNA is randomly fragmented by ultrasound using a Covaris LE220 (Covaris, Massachusetts, USA) according to the manufacturer’s instructions. DNA fragments in the range of 200 to 400 bp are selected with VAHTS™ DNA Clean Beads (Vazyme Biotech Co, Ltd, Nanjing, China). The fragments are end-repaired and an ‘A’ nucleotide is added to the 3’ end of each strand. Afterwards, dTTP-tailed adapters are ligated to both ends of the repaired, dA-tailed DNA fragments. The ligation product is amplified by PCR, and the PCR products are purified with VAHTS™ DNA Clean Beads. Purified PCR products with a total mass ≥200 ng and a main peak between 300 and 500 bp are carried forward. Single-strand separation is conducted by heat-denaturing the PCR product at 95 °C. Circularisation is performed by mixing the single-stranded DNA fragments with splint oligos (sequence: GCCATGTCGTTCTGTGAGCCAAGG) and DNA Rapid Ligase to generate single-stranded DNA circles. The remaining linear molecules are digested with exonuclease. The enzymatic digestion products are purified with Agencourt AMPure XP medium (Beckman Coulter, Indiana, USA). The single-stranded circle DNA (ssCir DNA) forms the final library. The purified enzymatic digestion products are quantified with the Qubit ssDNA Assay Kit (Thermo Scientific Co, Massachusetts, USA), and the final yield should be ≥12 ng.

BGISEQ-500 WGS sequencing

Rolling circle amplification is performed on the qualified libraries to produce DNA Nanoballs (DNBs). The DNBs are then loaded into patterned nanoarrays, and 100 bp paired-end reads are sequenced on the BGISEQ-500 platform (BGI Genomics, Shenzhen, China). Raw image files from sequencing are processed by the BGISEQ-500 base-calling software (V.1.2.1.21840) with default parameter settings. The sequence data are stored in FASTQ format. The average depth for each subject is intended to be greater than 30×.
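To give a sense of scale for the 30× target with 100 bp paired-end reads, the relationship between read count and average depth can be sketched as below. The genome size of 3.1 Gb is an assumed round figure for the human reference and is not stated in the text; the function names are illustrative.

```python
import math

ASSUMED_GENOME_SIZE_BP = 3.1e9  # assumption: approximate human genome size
READ_LENGTH_BP = 100            # read length stated in the text


def mean_depth(n_read_pairs: int,
               genome_size_bp: float = ASSUMED_GENOME_SIZE_BP,
               read_length_bp: int = READ_LENGTH_BP) -> float:
    """Average depth = total sequenced bases / genome size (two reads per pair)."""
    return n_read_pairs * 2 * read_length_bp / genome_size_bp


def read_pairs_for_depth(target_depth: float,
                         genome_size_bp: float = ASSUMED_GENOME_SIZE_BP,
                         read_length_bp: int = READ_LENGTH_BP) -> int:
    """Smallest number of read pairs whose raw yield reaches the target depth."""
    bases_needed = target_depth * genome_size_bp
    return math.ceil(bases_needed / (2 * read_length_bp))
```

Under these assumptions, roughly 465 million read pairs of raw data correspond to 30×. The figure is an upper bound on usable depth, since filtered, duplicated and unmapped reads do not contribute.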

SNV genotyping

To make sure that the DNA samples are neither mixed up nor contaminated during the WGS process, we selected 21 biallelic fingerprint SNVs and planned to genotype them for each participant of WGS. These 21 SNVs are distributed across 15 different autosomes and are at least 13 Mb apart. The minor allele frequencies of these SNVs are between 0.16 and 0.5 in the Han Chinese in Beijing samples of the 1000 Genomes Project.¹³ The SNV genotyping experiments are performed at BGI Genomics (BGI-Shenzhen) independently of, and simultaneously with, WGS. For each sample, approximately 30 ng of qualified genomic DNA is used. Locus-specific PCR and detection primers are designed using the MassARRAY Assay Design software (Agena Bioscience, California, USA). Multiplex PCR and locus-specific single-nucleotide extension are performed for each DNA sample, and the products are then desalted and transferred to a 384-well SpectroCHIP array. After MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry, MassARRAY Typer software (V.4.1, Agena Bioscience, California, USA) is used to call the genotypes for each participant.
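The choice of common, well-separated SNVs can be motivated with a rough calculation of how often two unrelated people would share the same genotype at every fingerprint locus. This is an illustrative sketch, not a calculation reported by the authors; it assumes Hardy-Weinberg proportions, independent loci and unrelated individuals, and the function names are hypothetical.

```python
def genotype_match_probability(maf: float) -> float:
    """Chance that two unrelated individuals share a genotype at one biallelic SNV.

    Under Hardy-Weinberg proportions the genotype frequencies are
    p^2, 2pq and q^2; the match probability is the sum of their squares.
    """
    p, q = 1.0 - maf, maf
    genotype_freqs = (p * p, 2.0 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)


def panel_match_probability(mafs) -> float:
    """Chance of a full-panel match, assuming the loci are independent."""
    probability = 1.0
    for maf in mafs:
        probability *= genotype_match_probability(maf)
    return probability


# Bounds for a 21-SNV panel with every MAF at either end of the stated range
upper_bound = panel_match_probability([0.16] * 21)
lower_bound = panel_match_probability([0.5] * 21)
```

With all 21 minor allele frequencies between 0.16 and 0.5, the chance of a coincidental full match lies somewhere between roughly 10⁻⁹ and 10⁻⁵ under these assumptions, which is why a small panel is enough to detect most sample swaps.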

After WGS and SNV genotyping are completed, the genotypes of the 21 SNVs obtained from the WGS data analyses are compared with those obtained from MALDI-TOF mass spectrometry to verify that the DNA sample and the sequencing data originate from the same individual.
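The comparison described above amounts to counting concordant genotypes between the two platforms. A minimal sketch follows; the dictionary layout, the placeholder SNV identifiers and the handling of missing calls are assumptions, and no acceptance threshold is encoded because the text does not state one.

```python
def normalise(genotype: str) -> str:
    """Sort alleles so that unphased calls such as 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype.upper()))


def fingerprint_concordance(wgs_calls: dict, array_calls: dict):
    """Count matching genotypes over SNVs called on both platforms.

    Both arguments map an SNV identifier to a two-letter genotype string;
    an empty string or None marks a missing call (assumed convention).
    Returns (number of matching SNVs, number of SNVs compared).
    """
    shared = [
        snv for snv in wgs_calls
        if wgs_calls.get(snv) and array_calls.get(snv)
    ]
    matches = sum(
        normalise(wgs_calls[snv]) == normalise(array_calls[snv])
        for snv in shared
    )
    return matches, len(shared)


# Hypothetical identifiers and calls, for illustration only
wgs = {"snv01": "AG", "snv02": "CC", "snv03": "TT"}
array = {"snv01": "GA", "snv02": "CT", "snv03": ""}
matches, compared = fingerprint_concordance(wgs, array)
```

Reporting the number of SNVs compared alongside the number of matches keeps low-call-rate samples from appearing perfectly concordant on only a handful of loci.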

WGS data cleanup

Raw sequence reads are filtered using an in-house pipeline for quality control. Both reads of a pair are removed if either read (1) contains sequencing adapter, (2) has a low-quality base ratio (base quality less than or equal to 12) of more than 50%, or (3) has an unknown base (‘N’ base) ratio of more than 10%. Afterwards, fastp (V.0.20.0) is applied to filter out low-quality reads and bases,¹⁴ and downstream bioinformatics analyses are conducted on these qualified data.
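The three pair-level rules can be expressed compactly as below. This is a sketch of the stated criteria only, not the in-house pipeline itself, which is not published in the text. The adapter is passed in as a parameter because its sequence is not given, exact substring matching stands in for whatever adapter detection the pipeline actually uses, and treating empty reads as failures is an added assumption.

```python
LOW_QUALITY_THRESHOLD = 12   # bases with quality <= 12 count as low quality
MAX_LOW_QUALITY_RATIO = 0.5  # more than 50% low-quality bases fails the read
MAX_N_RATIO = 0.1            # more than 10% 'N' bases fails the read


def read_fails(seq: str, quals: list, adapter: str) -> bool:
    """Apply the three per-read criteria to one read (sequence, Phred scores)."""
    if not seq or not quals:
        return True  # assumption: empty reads are discarded
    if adapter and adapter in seq:
        return True
    low_quality = sum(q <= LOW_QUALITY_THRESHOLD for q in quals) / len(quals)
    if low_quality > MAX_LOW_QUALITY_RATIO:
        return True
    return seq.upper().count("N") / len(seq) > MAX_N_RATIO


def keep_pair(read1: tuple, read2: tuple, adapter: str) -> bool:
    """Keep a pair only if neither read fails; each read is (seq, quals)."""
    return not (read_fails(*read1, adapter) or read_fails(*read2, adapter))
```

Because the decision is made per pair, a single failing mate removes both reads, which keeps the two FASTQ files in step for paired-end alignment.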

Mapping and variant calling

The paired-end reads are processed under the Genome Analysis Toolkit (GATK) best practice guidance using Sentieon (release 201808.05, https://www.sentieon.com, bioRxiv 115717; doi:10.1101/115717).¹⁵ The reads are aligned to the hg38 human reference genome sequence, downloaded from the GATK bundle (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz), using the Burrows-Wheeler Alignment tool as implemented in Sentieon. SNVs and insertion-deletions (indels) in regions of segmental duplication and on unassigned chromosomes are ignored in the downstream analyses. For each sample, the base quality, sequencing depth, GC (guanine-cytosine) content, mapping rate, mismatch rate, duplication rate and coverage are calculated. After removing duplicated reads and recalibrating the base quality scores, SNVs and indels are first called for each individual using the Haplotyper of Sentieon and then jointly called across all participants. Variant quality score recalibration and hard filtering are then applied to obtain high-quality variant calls for SNVs and indels. The ‘*.bam’ and ‘*.vcf’ files generated in the above procedures are retained for future research. Copy number variations (CNVs) and structural variations (SVs) in the genomes of patients are mainly called using GraphTyper2 and Manta.¹⁶ ¹⁷

Population genetics analysis

To minimise problems arising from hidden family and population structure among the participants, we conduct the following quality control steps. First, kinship is explored by calculating pairwise identity-by-descent estimates for all pairs of individuals using PLINK (V.1.9).¹⁸ The existence of first-degree and second-degree relationships is checked using KING (V.2.1.8).¹⁹ Second, population structure is investigated using STRUCTURE software and by conducting principal component analysis.²⁰ All of these analyses are conducted using autosomal SNVs and indels.
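As an illustration of how first-degree and second-degree relationships are typically read off a kinship estimate, the sketch below applies the inference ranges given in the KING documentation (greater than 0.354 for duplicates or monozygotic twins, 0.177 to 0.354 for first degree, 0.0884 to 0.177 for second degree, 0.0442 to 0.0884 for third degree). Whether the authors used these exact cut-offs is not stated, and the function name is hypothetical.

```python
def relationship_from_kinship(kinship: float) -> str:
    """Map a pairwise kinship coefficient to a relationship category.

    Cut-offs follow the ranges published in the KING documentation;
    their use here is an assumption, not the authors' stated procedure.
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    if kinship > 0.0442:
        return "third-degree"
    return "unrelated"
```

The expected kinship coefficients are 0.25 for parent-offspring and full-sibling pairs and 0.125 for second-degree pairs, so both fall inside their respective ranges.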

Variant annotation

The impact of mutations on protein-coding sequence, including protein-truncating variants, is predicted using the Ensembl Variant Effect Predictor.²¹ The pathogenicity of SNVs and indels is evaluated using InterVar software (V.2.0.1) under the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.²² The potential impact of SNVs and indels on gene expression and regulation is investigated by reviewing GTEx, HaploReg and other databases or online tools.²³ ²⁴ The impact of intronic and exonic mutations on pre-messenger RNA splicing is mainly predicted using SpliceAI.²⁵

The biological significance of known or common CNVs and SVs is annotated by reviewing dbVar and the Database of Genomic Variants.²⁶ ²⁷ Novel CNVs and SVs are annotated by reviewing the PubMed literature on the structure and function of the genes affected by the corresponding CNVs and SVs.

Checking and reviewing

During the experimental procedures of this project, all WBC and DNA loading, packaging, transferring and storing operations were conducted by one technician and checked and supervised by a second technician.

For WGS and SNV genotyping data, an MD5 checksum is generated for each data file before transfer and is checked after the transfer. The commands and code for WGS data mapping and variant calling are written by one bioinformatician and reviewed by another. The log files are also reviewed and retained.
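The checksum step can be sketched with the Python standard library as follows. The function names are illustrative and the text does not say which tool the authors used to generate or compare the checksums.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that large FASTQ or BAM files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: str, expected_md5: str) -> bool:
    """Compare the digest of the received file with the one recorded before transfer."""
    return md5sum(path) == expected_md5.strip().lower()
```

MD5 is adequate here because the goal is to detect accidental corruption or truncation during transfer, not to defend against deliberate tampering.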

All of the genetic information, clinical data and biospecimens are managed following the Regulations of the People’s Republic of China on Administration of Human Genetic Resources 2019.

Research projects

The WGS data of the 10K patients will be used to assess the causality of certain risk factors for stroke outcomes, to investigate the pleiotropic effects of genes on multiple phenotypes, and to understand the genetic relationship between particular comorbidities and IS. The accurate sequencing data obtained at an average depth greater than 30× also allow us to obtain a panoramic view of individual-specific variation and the genetic structure of Chinese patients with IS or TIA. Some prespecified research topics are described below:

To draw a comprehensive genetic landscape of Chinese patients with IS or TIA, and characterise their geographical and lifestyle differences and their demographic origin;
To evaluate the genetic contribution to IS and its recurrent outcomes, especially the contribution of rare variants, CNVs and variants in certain regions of the genome (eg, telomere and mitochondrial DNA);
To determine the causality of serum biomarkers for IS outcomes using association analyses and Mendelian randomisation;
To investigate the relationship between genetic features and brain imaging changes in IS;
To conduct the pharmacogenomics analyses on certain secondary prevention of IS;
To better understand the genetic mechanisms of IS with particular comorbidities (eg, chronic kidney disease, diabetes mellitus and hypertension).

Results

Among the 15 166 patients with IS or TIA in CNSR-III, 12 603 patients participated in the prespecified genetic substudy. Among them, 1308 participants did not provide enough WBCs. After DNA extraction and quality evaluation, the DNA of 381 participants was insufficient or unqualified. Therefore, a total of 1689 participants were excluded, and WGS is conducted for 10 914 participants of CNSR-III (figure 1). The workflow of WGS and downstream bioinformatics analyses is shown in figure 2.
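The participant flow reported above can be checked with a few lines of arithmetic. The variable names are illustrative; all counts are taken from the text and Table 1.

```python
registry_total = 15_166    # patients with IS or TIA in CNSR-III
genetic_substudy = 12_603  # participants in the prespecified genetic substudy
insufficient_wbc = 1_308   # did not provide enough WBCs
unqualified_dna = 381      # DNA insufficient or unqualified after evaluation

excluded_from_substudy = insufficient_wbc + unqualified_dna
sequenced = genetic_substudy - excluded_from_substudy
not_sequenced = registry_total - sequenced

# Percentages of the whole registry, as used in the Table 1 column headers
sequenced_pct = round(100 * sequenced / registry_total, 1)
not_sequenced_pct = round(100 * not_sequenced / registry_total, 1)
```

The "excluded" column of Table 1 (n=4252) is therefore larger than the 1689 exclusions within the substudy, because it also contains the registry patients who did not take part in the genetic substudy.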

Figure 2. Workflow of WGS and bioinformatics analyses. The first two rows show the process of DNA extraction, quality control, library construction and WGS. The third row shows the downstream bioinformatics analyses of WGS data. Some of the images are retrieved or adapted from Servier Medical Art (https://smart.servier.com/), which is licensed under a Creative Commons Attribution 3.0 Unported License. The photos of instruments are downloaded from the websites of BGI Genomics (https://www.bgi.com/), Thermo Fisher (https://www.thermofisher.com/) and Agena Bioscience (https://agenabio.com/), respectively. DNB, DNA Nanoball; IS, ischaemic stroke; ssCir DNA, single-stranded circle DNA; TIA, transient ischaemic attack; WGS, whole genome sequencing.

Baseline clinical characteristics of the 10 914 included patients and of the excluded patients are presented in table 1. The average age was 62.2±11.3 years, and 31.4% of the patients were women. A total of 10 166 patients (93.2%) were diagnosed with IS, among whom 50.4% had minor stroke (National Institutes of Health Stroke Scale (NIHSS) score ≤3). A total of 31.8% of the included patients were current smokers, and 14.5% were heavy drinkers (defined as ≥2 standard alcoholic drinks per day). A total of 21.3% of the included patients had a history of IS. A total of 10.8%, 7.0% and 62.8% of the included patients had a history of coronary heart disease, atrial fibrillation and hypertension, respectively. The included and excluded groups were balanced with regard to baseline characteristics (table 1).

Table 1.

Baseline characteristics of the patients who underwent whole genome sequencing and of the remaining patients in CNSR-III

Characteristics	Included (n=10 914, 72.0%)	Excluded (n=4252, 28.0%)	Total (n=15 166, 100%)
Age (years), mean±SD	62.2±11.3	62.2±11.3	62.2±11.3
Female, n (%)	3429 (31.4)	1373 (32.3)	4802 (31.7)
Ethnicity (non-Han), n (%)	306 (2.8)	134 (3.2)	440 (2.9)
Stroke type, n (%)
IS	10 166 (93.2)	3980 (93.6)	14 146 (93.3)
TIA	748 (6.8)	272 (6.4)	1020 (6.7)
Current smoker, n (%)	3472 (31.8)	1280 (30.1)	4752 (31.3)
Heavy drinker, n (%)*	1582 (14.5)	544 (12.8)	2126 (14.0)
Medical history, n (%)
Ischaemic stroke	2322 (21.3)	827 (19.4)	3149 (20.8)
Coronary heart disease	1179 (10.8)	429 (10.1)	1608 (10.6)
Atrial fibrillation	765 (7.0)	254 (6.0)	1019 (6.7)
Hypertension	6858 (62.8)	2636 (62.0)	9494 (62.6)
Diabetes mellitus	2609 (23.9)	901 (21.2)	3510 (23.1)
Hypercholesterolaemia	903 (8.3)	288 (6.8)	1191 (7.9)
NIHSS at admission, median (IQR)†	3.0 (1.0 to 6.0)	3.0 (1.0 to 5.0)	3.0 (1.0 to 6.0)
NIHSS 0–3, n (%)	5120 (50.4)	2199 (55.2)	7319 (51.7)
NIHSS ≥4, n (%)	5046 (49.6)	1781 (44.8)	6827 (48.3)

*Heavy drinking was defined as ≥2 standard alcoholic drinks per day.

†NIHSS scores in this table are summarised for IS patients only.

CNSR-III, the Third China National Stroke Registry; IS, ischaemic stroke; NIHSS, National Institutes of Health Stroke Scale score; TIA, transient ischaemic attack.

Discussion

Stroke is a complex disease with multiple aetiologies. Genetic and genomic studies in populations of diverse ancestry could refine our understanding of the molecular mechanisms of stroke. Therefore, we conduct WGS for 10 914 patients from CNSR-III. The WGS procedures and the baseline characteristics of the patients are reported in this study. The WGS of CNSR-III builds a genomic data set that facilitates large-scale genetic analyses of IS in the Chinese population. The CNSR-III collected a comprehensive spectrum of phenotypic information under consistent and standardised criteria, which could increase the power and credibility of the genetic analyses. In addition, all of the patients are followed up for clinical outcomes,¹² which provides an opportunity to discover genetic variants associated with patients’ outcomes after stroke.

In contrast to the DNA microarrays that were mainly used in previous genetic association studies of IS,⁴ ¹⁰ the WGS technology applied in this study could capture nearly all SNVs and indels, and simultaneously provide genetic information on CNVs and SVs for each patient. Therefore, WGS enables a systematic evaluation of the genetic effect of rare variants (allele frequency <1% in the population) on IS and TIA. As the contribution of rare variants remains one of the top challenges in stroke genetics, the WGS study would provide a better understanding of IS and TIA pathophysiology.¹⁰ The average depth for WGS is intended to be greater than 30× in this project because, at this depth, both accurate variant calling and cost-effectiveness could be achieved.²⁸ ²⁹ Moreover, >95% of the genome could be covered by at least 10 sequencing reads, and >95% of heterozygous variants could be accurately identified under this design.³⁰ Therefore, the WGS could provide high-quality genetic data for further investigations of IS.
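The link between a 30× average depth and coverage by at least 10 reads can be illustrated with an idealised model in which the depth at each position follows a Poisson distribution. This is a simplifying assumption made here for illustration and is not the basis of the cited figures: real coverage is more uneven because of GC content, mappability and library effects, which is one reason the >95% figure quoted above is more conservative than the model suggests.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for a Poisson-distributed depth with the given mean."""
    return sum(
        math.exp(-mean) * mean ** i / math.factorial(i)
        for i in range(k + 1)
    )


def fraction_covered(min_reads: int, mean_depth: float) -> float:
    """Expected fraction of positions covered by at least `min_reads` reads."""
    return 1.0 - poisson_cdf(min_reads - 1, mean_depth)


# Idealised expectation for the design described in the text
covered_at_30x = fraction_covered(10, 30.0)
covered_at_15x = fraction_covered(10, 15.0)
```

Under this model almost every position reaches 10 reads at a mean depth of 30×, whereas at 15× a noticeable share of positions falls below that level, which illustrates why the depth target matters for calling heterozygous variants.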

In conclusion, the WGS and genome-wide analyses of CNSR-III would help to refine our understanding of the genetic contribution to IS/TIA and stroke outcomes, and possibly lead to the discovery of novel therapeutic targets for secondary prevention.

Footnotes

Twitter: @yilong

Contributors: Study concept and design: SC, HaL and YoW. Drafting of the manuscript: SC, ZX and YL. Statistical analysis: AW, XH and ZX. Study supervision and organisation of the project: JL, YJ, XM, HaL, YiW and YoW. Supplying patients: ZW, GC, SW, ZJ, YC, XQ, JW, BS, WJ, ZA, WX, LZ, YG and HoL.

Funding: This study was supported by grants from the Ministry of Science and Technology of the People’s Republic of China (2016YFC0901002, 2016YFC0901001), Beijing Municipal Science & Technology Commission (D171100003017002), Beijing Municipal Administration of Hospitals’ Mission Plan (SML20150502) and National Science and Technology Major Project (2017ZX09304018). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Competing interests: None declared.

Provenance and peer review: Not commissioned; internally peer reviewed.

Data availability statement

Data are available upon reasonable request.

Ethics statements

Patient consent for publication

Not required.

Ethics approval

The study was approved by the ethics committees of Beijing Tiantan Hospital and all other research centres according to the principles expressed in the Declaration of Helsinki.

References

1. GBD 2017 Causes of Death Collaborators. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980-2017: a systematic analysis for the global burden of disease study 2017. Lancet 2018;392:1736–88. doi:10.1016/S0140-6736(18)32203-7
2. Wang Y, Li Z, Wang Y, et al. Chinese stroke center alliance: a national effort to improve healthcare quality for acute stroke and transient ischaemic attack: rationale, design and preliminary findings. Stroke Vasc Neurol 2018;3:256–62. doi:10.1136/svn-2018-000154
3. Bersano A, Markus HS, Quaglini S, et al. Clinical pregenetic screening for stroke monogenic diseases: results from Lombardia GENS registry. Stroke 2016;47:1702–9. doi:10.1161/STROKEAHA.115.012281
4. Malik R, Chauhan G, Traylor M, et al. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes. Nat Genet 2018;50:524–37. doi:10.1038/s41588-018-0058-3
5. NINDS Stroke Genetics Network (SiGN), International Stroke Genetics Consortium (ISGC), Pulit SL, McArdle PF, Wong Q. Loci associated with ischaemic stroke and its subtypes (SiGN): a genome-wide association study. Lancet Neurol 2016;15:174–84. doi:10.1016/S1474-4422(15)00338-5
6. Neurology Working Group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, the Stroke Genetics Network (SiGN), and the International Stroke Genetics Consortium (ISGC). Identification of additional risk loci for stroke and small vessel disease: a meta-analysis of genome-wide association studies. Lancet Neurol 2016;15:695–707. doi:10.1016/S1474-4422(16)00102-2
7. Traylor M, Farrall M, Holliday EG, et al. Genetic risk factors for ischaemic stroke and its subtypes (the METASTROKE collaboration): a meta-analysis of genome-wide association studies. Lancet Neurol 2012;11:951–62. doi:10.1016/S1474-4422(12)70234-X
8. Holliday EG, Maguire JM, Evans T-J, et al. Common variants at 6p21.1 are associated with large artery atherosclerotic stroke. Nat Genet 2012;44:1147–51. doi:10.1038/ng.2397
9. International Stroke Genetics Consortium (ISGC), Wellcome Trust Case Control Consortium 2 (WTCCC2), Bellenguez C, et al. Genome-wide association study identifies a variant in HDAC9 associated with large vessel ischemic stroke. Nat Genet 2012;44:328–33. doi:10.1038/ng.1081
10. Dichgans M, Pulit SL, Rosand J. Stroke genetics: discovery, biology, and clinical applications. Lancet Neurol 2019;18:587–99. doi:10.1016/S1474-4422(19)30043-2
11. Sirugo G, Williams SM, Tishkoff SA. The missing diversity in human genetic studies. Cell 2019;177:26–31. doi:10.1016/j.cell.2019.02.048
12. Wang Y, Jing J, Meng X, et al. The third China national stroke registry (CNSR-III) for patients with acute ischaemic stroke or transient ischaemic attack: design, rationale and baseline patient characteristics. Stroke Vasc Neurol 2019;4:158–64. doi:10.1136/svn-2019-000242
13. Auton A, Brooks LD, 1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature 2015;526:68–74. doi:10.1038/nature15393
14. Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34:i884–90. doi:10.1093/bioinformatics/bty560
15. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;43:11.10.1-11.10.33. doi:10.1002/0471250953.bi1110s43
16. Eggertsson HP, Kristmundsdottir S, Beyter D, et al. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat Commun 2019;10:5402. doi:10.1038/s41467-019-13341-9
17. Chen X, Schulz-Trieglaff O, Shaw R, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 2016;32:1220–2. doi:10.1093/bioinformatics/btv710
18. Chang CC, Chow CC, Tellier LC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 2015;4:7. doi:10.1186/s13742-015-0047-8
19. Manichaikul A, Mychaleckyj JC, Rich SS, et al. Robust relationship inference in genome-wide association studies. Bioinformatics 2010;26:2867–73. doi:10.1093/bioinformatics/btq559
20. Hubisz MJ, Falush D, Stephens M, et al. Inferring weak population structure with the assistance of sample group information. Mol Ecol Resour 2009;9:1322–32. doi:10.1111/j.1755-0998.2009.02591.x
21. McLaren W, Gil L, Hunt SE, et al. The Ensembl variant effect predictor. Genome Biol 2016;17:122. doi:10.1186/s13059-016-0974-4
22. Li Q, Wang K. InterVar: clinical interpretation of genetic variants by the 2015 ACMG-AMP guidelines. Am J Hum Genet 2017;100:267–80. doi:10.1016/j.ajhg.2017.01.004
23. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 2015;348:648–60. doi:10.1126/science.1262110
24. Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 2012;40:D930–4. doi:10.1093/nar/gkr917
25. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535–48. doi:10.1016/j.cell.2018.12.015
26. Lappalainen I, Lopez J, Skipper L, et al. dbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res 2013;41:D936–41. doi:10.1093/nar/gks1213
27. MacDonald JR, Ziman R, Yuen RKC, et al. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res 2014;42:D986–92. doi:10.1093/nar/gkt958
28. Kishikawa T, Momozawa Y, Ozeki T, et al. Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data. Sci Rep 2019;9:1784. doi:10.1038/s41598-018-38346-0
29. Rashkin S, Jun G, Chen S, et al. Optimal sequencing strategies for identifying disease-associated singletons. PLoS Genet 2017;13:e1006811. doi:10.1371/journal.pgen.1006811
30. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53–9. doi:10.1038/nature07517

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data are available upon reasonable request. Data in this article are available upon reasonable request.
