eLife. 2020 Aug 10;9:e57390. doi: 10.7554/eLife.57390

Population-scale proteome variation in human induced pluripotent stem cells

Bogdan Andrei Mirauta 1, Daniel D Seaton 1,†,§, Dalila Bensaddek 2,†,#, Alejandro Brenes 2, Marc Jan Bonder 1, Helena Kilpinen 1; HipSci Consortium 1,2,3,4, Oliver Stegle 1,5,6,‡, Angus I Lamond 2,‡
Editors: Stephen CJ Parker7, Patricia J Wittkopp8
PMCID: PMC7447446  PMID: 32773033

Abstract

Human disease phenotypes are driven primarily by alterations in protein expression and/or function. To date, relatively little is known about the variability of the human proteome in populations and how this relates to variability in mRNA expression and to disease loci. Here, we present the first comprehensive proteomic analysis of human induced pluripotent stem cells (iPSC), a key cell type for disease modelling, analysing 202 iPSC lines derived from 151 donors, with integrated transcriptome and genomic sequence data from the same lines. We characterised the major genetic and non-genetic determinants of proteome variation across iPSC lines and assessed key regulatory mechanisms affecting variation in protein abundance. We identified 654 protein quantitative trait loci (pQTLs) in iPSCs, including disease-linked variants in protein-coding sequences and variants with trans regulatory effects. These include pQTL linked to GWAS variants that cannot be detected at the mRNA level, highlighting the utility of dissecting pQTL at peptide level resolution.

Research organism: Human

Introduction

Induced pluripotent stem cells (iPSC) hold great promise for advancing basic research and biomedicine. By enabling the in vitro reconstitution of development and cell differentiation, iPS cells allow the investigation of mechanisms underlying development and the aetiology of many forms of genetic disease. To realize this potential, it is essential to characterise genetic and non-genetic sources of variability of molecular and cellular phenotypes in human iPSCs.

Recently, multiple reference panels of human iPSC lines have been established (Kilpinen et al., 2017; Panopoulos et al., 2017; Carcamo-Orive et al., 2017), providing a valuable resource for functional experiments in pluripotent cells. These cell lines, together with associated data, have enabled the characterisation of variability in iPSC transcriptomes, identifying genetic and non-genetic determinants of expression variation, including expression quantitative trait loci (eQTL) (Kilpinen et al., 2017; Rouhani et al., 2014; DeBoever et al., 2017) in cis.

While RNA-centric analyses are informative for studying gene regulatory mechanisms at the transcriptional level, most cellular phenotypes ultimately involve downstream mechanisms that are mediated by proteins. Several proteogenomics studies, primarily in cancer (Zhang et al., 2014; Mertins et al., 2016), have underlined the relevance of protein measurements to interpreting how genomic changes act at the phenotypic level. Moreover, recent evidence has shown that genetic alterations can have effects on RNA that are attenuated at the protein level (Gonçalves et al., 2017; Roumeliotis et al., 2017). Vice versa, the mapping of protein quantitative trait loci (pQTL), predominantly in lymphoblast cell lines (Battle et al., 2015; Stark et al., 2014; Wu et al., 2013) and for the plasma proteome (Sun et al., 2018; Yao et al., 2018; Liu et al., 2015; Johansson et al., 2013; Lourdusamy et al., 2012), has revealed genetic effects on protein traits that do not manifest at the RNA level. However, the extent of RNA-independent protein regulation is not yet understood, with previous analyses performed only at gene-level resolution and, in some cases, without comparing protein and RNA data from the same cellular material.

Here, we report the first comprehensive, population-scale, combined proteomic and transcriptomic analysis of human iPSC lines. Our data provide matched quantitative proteomic (Tandem Mass Tag Mass Spectrometry) and transcriptomic (RNA-Seq) profiles of 202 iPSC lines, derived from 151 donors from the HipSci project (Kilpinen et al., 2017). We identify both genetic and non-genetic effects associated with variability in protein expression between individuals and describe the first high-resolution pQTL map in human iPSCs, including loci not detected as eQTLs at the RNA level.

Results

A population reference proteome for human iPSCs

A set of 217 iPSC lines from the HipSci project (Kilpinen et al., 2017), derived from 163 distinct donors, was selected for protein analysis, using material from the identical batches of cells that were used for RNA-Seq and other assays (Materials and methods). Quantitative mass spectrometry was carried out in batches of 10 lines, using tandem mass tagging (TMT; Thompson et al., 2003), with one common reference sample shared across batches (Brenes et al., 2018) (Materials and methods). Collectively, we identified 255,015 distinct (unmodified) peptide sequences, corresponding to 16,773 protein groups (groups of protein isoforms with no discriminating peptides; hereafter denoted proteins) with median sequence coverage of 46%.

After quality control, 202 lines (from 151 donors) with matched genotype, RNA-Seq and proteome data were selected for further analysis (Figure 1a; Figure 1—figure supplement 1; Figure 1—source data 1). We identified 11,140 recurrently detected proteins, corresponding to 10,198 genes (detected in at least 30 lines; Materials and methods) and RNA expression for 12,363 protein-coding genes (population average TPM >1). Out of these, 9013 protein-coding genes were detected at both the RNA and protein levels (Figure 1—source data 2 and 3).
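The two detection filters described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary-based data layout and the toy values are assumptions made for the example and are not taken from the study's pipeline. Only the two thresholds (detection in at least 30 lines; population average TPM >1) come from the text.

```python
from statistics import mean

MIN_LINES = 30      # protein must be detected in at least 30 lines (from the text)
MIN_MEAN_TPM = 1.0  # population average TPM must exceed 1 (from the text)


def recurrently_detected(protein_detections, min_lines=MIN_LINES):
    """Return proteins detected in at least `min_lines` lines.

    `protein_detections` maps a protein id to the set of lines in which it
    was detected (hypothetical layout, chosen for illustration).
    """
    return {p for p, lines in protein_detections.items() if len(lines) >= min_lines}


def expressed_genes(tpm_by_gene, min_mean_tpm=MIN_MEAN_TPM):
    """Return genes whose average TPM across lines is above `min_mean_tpm`."""
    return {g for g, tpms in tpm_by_gene.items() if mean(tpms) > min_mean_tpm}


# Toy data: 40 hypothetical lines, two proteins, two genes.
lines = [f"line_{i}" for i in range(40)]
detections = {"P_A": set(lines[:35]), "P_B": set(lines[:12])}
tpm = {"GENE_A": [5.0] * 40, "GENE_B": [0.2] * 40}

kept_proteins = recurrently_detected(detections)  # P_A only (35 >= 30 lines)
kept_genes = expressed_genes(tpm)                 # GENE_A only (mean TPM 5 > 1)
```

The genes analysed at both levels would then be those that pass both filters after mapping proteins to genes, a step omitted here.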

Figure 1. Characterising variation in the iPSC proteome and transcriptome.

(a) Experimental design and assays considered in this study. Genotype, RNA-Seq and quantitative proteomics data were obtained from the same cell material of 202 iPSC lines derived from 151 unrelated donors. (b) Variance component analysis of RNA and protein abundances across genes, considering different technical and biological factors. Shown is the distribution of the fraction of variance explained by different factors (upper panel) across proteins, and the number of genes with substantial variance contribution for each factor (>20% contribution; lower panel). Also shown are the number of genes that retain greater than 20% contribution after adjusting for the effect of the corresponding RNA profiles on protein abundance (light blue; see Materials and methods). (c) Association of protein level with X chromosome inactivation (XCI) status across 110 female iPSC lines. Shown are lowess regression curves for 322 and 312 proteins respectively that were identified as significantly up (red) - and down (blue) - regulated with loss of XCI in female iPSC lines (lower panel; 10% FDR). Selected gene ontology enrichments for these sets of proteins are shown (right-hand panel; Materials and methods). XCI status was estimated as the average fraction of allele-specific expression for the inactive chromosome across all chromosome X genes (Materials and methods). (d) Scatter plot of the fraction of variance explained by donor at the RNA (x-axis) versus protein (y-axis) level. Encoded in colour is the fraction of variance explained by donor effects at the protein level after adjusting for the effect of the corresponding RNA profiles on protein abundance (Materials and methods).

Figure 1—source data 1. HipSci proteomics iPSC lines.
The public ids, TMT batch, donor, gender, age and growth media for the HipSci iPSC lines used in this study are shown.
Figure 1—source data 2. RNA gene level expression across the 202 lines for genes recurrently detected at the protein level.
Lines are indexed by protein Ensembl gene Id. Columns are the line public names.
Figure 1—source data 3. Protein abundance values across the 202 and reference lines for genes recurrently detected at the protein level and with RNA expression (TPM >1).
Lines are indexed by protein Uniprot Id. First 229 columns contain intensity values after quality line filtering, batch correction and quantile normalisation. Line names are encoded as follows: [line public name]@[TMT batch]@[TMT channel]. Last columns include protein information: 'gene_chromosome', 'gene_start', 'gene_end', 'ensembl_gene_id', 'gene_name', 'gene_strand', 'number_of_peptides', 'in_CORUM'.
Figure 1—source data 4. Protein and RNA variance components.
Variance decomposition for 6009 genes with high RNA expression (TPM >1) that were detected at the protein level.
Figure 1—source data 5. Protein and RNA correlation with X chromosome inactivation.
Correlation with XCI status of protein and RNA profiles for the 6336 genes (6406 proteins) with high RNA expression (TPM >1) and detected in all female lines at the protein level.
Figure 1—source data 6. Functional enrichment of genes with protein or RNA profiles correlated with XCI.
This table enumerates the significant Gene Ontology terms and DNA regulatory motifs (FDR 0.05; fields 'source' and 'term_name') for different gene sets (fields 'molecular_layer' and 'change_direction'): 1) RNA positively correlated with XCI, 2) RNA negatively correlated with XCI, 3) proteins positively correlated with XCI and without RNA nominal significance, and 4) proteins negatively correlated with XCI and without RNA nominal significance.

Figure 1—figure supplement 1. Protein quantification, quality control and batch correction.

(a) Number of detected peptides across all 240 iPSC lines (including 23 replicates of the reference line) ordered by TMT processing batch (dashed vertical lines). The horizontal lines denote the QC cutoff of 75% of the median across lines (67,000 peptides). (b) Distribution of the number of lines in which the peptides and proteins were detected. The number of recurrently detected peptides, or protein groups (at least one detected peptide per group), are shown as a function of the recurrence (considering 202 lines with QC-passing RNA and protein data). (c) Fraction of expressed genes detected at the protein level for increasing levels of expression at the RNA level (for each decile of RNA expression; grey: values for each cell line; blue: average across cell lines). (d) Assessment of batch correction across TMT batches. Principal component analysis of all 202 iPS lines + 22 additional technical replicates of the reference cell line (HPSI0314i-bubh_3), which was included in each TMT batch. Colour denotes the TMT batch. (e) Scatter plot of the coefficient of variation of peptide abundance estimates across processing batches for the reference line, before (x-axis) and after (y-axis) batch correction and quantile normalisation. Note that the reference line is not used for estimating parameters of the batch adjustment (Materials and methods).
Figure 1—figure supplement 2. Comparison of iPSC proteome and somatic human tissues.

Spearman correlation coefficients between the average iPSC proteome abundance and proteome profiles of 23 tissues obtained from the Human Proteome Map, including Fetal (Red) and Adult (Blue) tissues are shown (see Materials and methods for details).
Figure 1—figure supplement 3. Comparison of the iPS transcriptome and proteome of disease and normal lines.

(a) Principal components analysis of protein (left panel) and RNA (right panel) profiles of 202 iPSC lines, with individual lines color coded by disease status. In total 6583 proteins detected in all 202 cell lines and 16,852 genes with RNA expression (TPM >1) in at least 30 lines were considered. (b) Differential expression analysis between the largest disease entities (Bardet-Biedl, N = 38; Monogenic diabetes, N = 38; and 'Normal', N = 112; Supplementary file 2) for protein (left panels) and RNA (right panels). p-Values and effect size estimates obtained from a linear model with the disease indicator as an exogenous variable are shown. For protein, the fold change is computed for each batch and averaged across batches. Significantly differential RNA or protein levels (FDR < 10%, Benjamini-Hochberg adjusted) are indicated in red.
Figure 1—figure supplement 4. Comparisons of variance component estimates before and after regressing out mRNA effects.

Briefly, for each protein-RNA pair, the effect of RNA abundance on protein abundance was accounted for using a linear model, regressing out its effect on the protein abundance. The analogous variance component model as considered in Figure 1b is then fitted on this adjusted protein abundance (Materials and methods).
Figure 1—figure supplement 5. Donor variance components of proteins differentially expressed between iPSC and ESC.

Donor variance component results for all proteins and subsets of highly expressed and highly variable proteins, compared to the donor variance components for proteins reported as differentially expressed between iPSC and ESC by Munoz et al., 2011 (a set of 81 proteins) and Phanstiel et al., 2011 (a set of 255 proteins) (see Materials and methods). Horizontal black bars denote median variance components.
Figure 1—figure supplement 6. Quantification of X chromosome inactivation status in female iPSC lines using chromosome X ASE SNPs.

(a) Mean allele-specific expression (ASE), averaged across all chromosome X heterozygous variants for all female iPSC lines included in this study (left), with illustrative examples of the distribution of SNP ASE measured across chromosome X (right). (b) Scatter plot between XIST RNA expression and mean ASE for chromosome X.

Collectively, these data provide the most comprehensive analysis of the human iPSC proteome reported to date, and one of the most comprehensive proteomic datasets reported for any human primary, or derived, cell type (Supplementary file 1). When overlaying our data with the Human Proteome Map (Kim et al., 2014), the iPSC proteome was most similar to foetal and reproductive organs (Figure 1—figure supplement 2), which is consistent with the expected expression of pluripotency markers in these tissues (Kerr et al., 2008a; Kerr et al., 2008b). We also assessed differences between healthy donors and disease-bearing donors, identifying no systematic expression differences in the iPSC state (Supplementary file 2; Figure 1—figure supplement 3).

RNA and proteome variability

We assessed a range of factors to explain the variation in protein expression between iPSC lines. Leveraging our experimental design with data from two or more lines for 34% of donors, we assessed the effects of donor, alongside age, sex, and the contributions of technical and cell culture related factors (on 6009 genes; Figure 1b,d; Figure 1—source data 4; using a linear mixed model; Materials and methods). Overall, the fraction of variance explained by biological factors was lower for protein levels, compared to RNA variation, which points to higher assay noise and/or stochastic variability of protein abundance. Consistent with previous results using RNA data (Kilpinen et al., 2017), we identified donor genome (i.e. DNA sequence variation) as the most relevant factor, followed by culture medium (Figure 1b). Critically, however, significant donor effects remained after accounting for RNA variability (Figure 1b,d; Figure 1—figure supplement 4; Materials and methods). This indicates that (i) genetic differences between individual donors and experimental differences between culture conditions play an important role in causing the observed variation in proteome expression between the iPSC lines and (ii) post-transcriptional mechanisms also contribute to these effects. Notably, many of the proteins showing strong donor effects were previously identified as differentially expressed between reprogrammed iPS cells and embryonic stem cells (ESCs) (Phanstiel et al., 2011; Munoz et al., 2011), suggesting that some of these previously reported differences could be due to genetic variation between donors, rather than intrinsic differences between the iPSC and ESC cell types (Figure 1—figure supplement 5).
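To make the notion of a "fraction of variance explained by donor" concrete, the sketch below estimates it for a single gene from lines grouped by donor. Note the substitution: the study fits a linear mixed model with several biological and technical factors, whereas this sketch uses a much simpler single-factor, one-way random-effects ANOVA (method-of-moments) estimator. The function name, data layout and toy values are assumptions for illustration.

```python
from statistics import mean


def donor_variance_fraction(values_by_donor):
    """Method-of-moments estimate of the fraction of variance explained by donor.

    `values_by_donor` maps a donor id to the abundance values of that donor's
    lines. Uses the one-way random-effects ANOVA estimator, which allows
    unequal numbers of lines per donor. This is a simplification, not the
    multi-factor linear mixed model used in the study.
    """
    groups = [v for v in values_by_donor.values() if v]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two donors and some replicate lines")

    grand = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective number of lines per donor for an unbalanced design.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    var_donor = max((ms_between - ms_within) / n0, 0.0)
    total = var_donor + ms_within
    return var_donor / total if total > 0 else 0.0


# Toy example 1: lines from the same donor agree closely, donors differ.
strong = donor_variance_fraction({"d1": [1.0, 1.1], "d2": [3.0, 2.9], "d3": [5.0, 5.1]})
# Toy example 2: all donor means are equal, so donor explains nothing.
null = donor_variance_fraction({"d1": [1.0, 5.0], "d2": [5.0, 1.0], "d3": [3.0, 3.0]})
```

Under the criterion used in Figure 1b, a gene would count as having a substantial donor contribution when this fraction exceeds 0.2. Only donors with two or more lines contribute to the within-donor term, which is why the replicate lines in the design matter.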

The sex of the donor affected proteome expression, including a subset of proteins uniquely encoded on the male-specific Y chromosome. There was also a strong (i.e. >20% of variance explained) sex-related effect on the expression of a subset of 88 proteins (Figure 1b), which are enriched for proteins encoded on the X chromosome (Odds ratio = 24.8, PV = 3×10⁻³², Fisher’s exact test). This reflected the partial erosion of X chromosome inactivation (XCI) observed in a subset of iPSC lines derived from female donors, as confirmed both by quantification of allele-specific expression and XIST expression in these lines (Figure 1—figure supplement 6). Incomplete XCI has been linked previously to poor iPSC differentiation and changes in RNA levels (Mekhoubad et al., 2012; Salomonis et al., 2016). However, our data provide the first opportunity to link the XCI status of 110 distinct female iPSC lines (as inferred from allele-specific expression; Materials and methods; Figure 1—figure supplement 6), with changes in the abundance of both proteins and RNAs. This identified 1374 genes for which either protein, or RNA levels, or both, showed changes correlated with XCI status (Figure 1c, FDR < 10%; Figure 1—source data 5). Further analysis indicated that XCI status preferentially impacts catabolic processes and mitochondria at the protein level, while this was not observed at the RNA level (Gene Ontology; Materials and methods; Figure 1c; Figure 1—source data 6). These data thus reveal an important effect of XCI status on global gene expression in iPSC lines from female donors, including specific effects at the protein level that are not detected by transcriptomic analysis.
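The chromosome X enrichment above is reported as an odds ratio with a Fisher's exact test. A minimal standard-library sketch of such a test on a 2x2 table follows. The underlying counts are not given in this section, so the table used here is a toy example, and the choice of a one-sided (enrichment) alternative is an assumption; the text does not state which alternative was used.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    Table layout (counts in the example below are a toy illustration):
                      in set   not in set
        on chrX          a          b
        not on chrX      c          d
    Returns (sample odds ratio, P(X >= a) under the hypergeometric null).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    p_value = tail / comb(n, col1)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Toy table, not the study's counts.
odds_ratio, p_value = fisher_enrichment(3, 1, 1, 3)  # odds ratio 9.0, p = 17/70
```

With real data, `a` would be the number of strongly sex-associated proteins encoded on chromosome X, and the remaining cells would be filled from the tested background set.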

Mapping cis genetic effects on protein abundance

Next, we mapped cis quantitative trait loci at both the RNA and protein levels, considering 8543 autosomal protein coding genes that were quantified at both levels (MAF >5%; within +/- 250 kb of the gene boundaries; using a linear mixed model; Materials and methods). The number of pQTLs identified was greatly increased by adapting the PEER-based adjustment, which was previously developed for mapping of eQTLs (Stegle et al., 2012), for use with proteomic data (Materials and methods; Figure 2—figure supplement 1). Across all autosomal genes, we report 654 genes with at least one cis pQTL and 3487 genes with a cis eQTL (FDR < 10% for both eQTL and pQTL mapping; lead variants only; Figure 2a; Figure 2—source data 1 and 2). Among these, 273 genes were shared and had identical or correlated lead variants, whereas 82 genes showed evidence for an eQTL and pQTL with independent lead variants (LD-based criterion, r² <0.1; Figure 2—source data 1). Genes with substantial donor components, as identified based on the variance component analysis (>20%; Figure 1b), were enriched for significant cis pQTL (215 genes out of 962; Odds ratio = 4.2; Figure 2—figure supplement 2).
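The variant selection criteria stated above (MAF >5%; within +/- 250 kb of the gene boundaries) can be sketched as a simple filter. The function names, the tuple layout for variants and the toy coordinates are hypothetical. The association test itself (a linear mixed model with PEER factors) is not shown.

```python
CIS_WINDOW = 250_000  # +/- 250 kb around the gene boundaries (from the text)
MIN_MAF = 0.05        # minor allele frequency must exceed 5% (from the text)


def minor_allele_frequency(dosages):
    """MAF from per-line alternative-allele dosages (0, 1 or 2)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def cis_variants(gene_start, gene_end, variants, window=CIS_WINDOW, min_maf=MIN_MAF):
    """Select the variants to test for one gene.

    `variants` is a list of (variant_id, position, dosages) tuples on the same
    chromosome as the gene; this layout is a hypothetical choice.
    """
    lo, hi = gene_start - window, gene_end + window
    return [
        vid
        for vid, pos, dosages in variants
        if lo <= pos <= hi and minor_allele_frequency(dosages) > min_maf
    ]


# Toy gene spanning 1,000,000-1,050,000 with three candidate variants.
common = [0, 1, 2, 1, 0, 1, 0, 2, 1, 0]  # MAF = 8/20 = 0.4
rare = [0] * 19 + [1]                     # MAF = 1/40 = 0.025
candidates = [
    ("v_in", 800_000, common),     # inside the window, common: kept
    ("v_far", 1_400_000, common),  # more than 250 kb downstream: dropped
    ("v_rare", 1_010_000, rare),   # inside the gene but too rare: dropped
]
tested = cis_variants(1_000_000, 1_050_000, candidates)
```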

Figure 2. Human iPSC cis protein and RNA QTLs.

(a) Number of genes with a protein (blue) or RNA (green) cis QTL (FDR < 10%) and pairwise replication of genetic effects. Left: Number of genes with a pQTL, either with (dark blue) or without (light blue) replicated RNA effect. Right: Number of genes with an eQTL, either with (dark green) or without (light green) replicated protein effect. Replication defined by assessing nominal significance (PV <0.01) of QTL in the respective other layer. (b) Local Manhattan plots displaying negative log p-values (PV) from cis RNA (top) and protein (bottom) QTL mapping for PEX6. The dashed line and the grey box indicate the genomic positions of the lead QTL and of the gene. Boxplots show RNA and protein expression for different alleles at the pQTL lead variant rs11752813, a variant in LD (r² = 1, 1000 Genomes European populations phase 3) with the Alzheimer risk variant rs1129187 (Jun et al., 2016) (OR 1.13). (c) Cumulative fraction of eQTLs with replicated protein effects as a function of the eQTL effect size (from highest to lowest). (d) Prediction of protein replication of eQTLs, considering features derived from gene annotations, eQTL, RNA and protein data. Predictions were obtained using a random forest model trained on the protein replication status of eQTL (as in a; Materials and methods). Left: Feature importance scores. Right: Precision-recall curve for the model, evaluated in independent test fractions. The model performance was assessed by random sampling of training/testing data with an 80/20 split, performed 50 times. Shown in red is the average precision-recall across all sampled training/test splits and in thin grey lines results of individual folds.

Figure 2—source data 1. pQTL_results.
The list of significant (FDR < 10%) genes with a pQTL provided as a supplementary file. Data fields are described in the table below.
Figure 2—source data 2. eQTL_results.
Reported are genes with a significant (FDR < 10%) QTL. It consists of variants mapped at RNA, gene resolution, for genes detected at both RNA and protein levels. This table includes the features used in the prediction of the pQTL status. The table columns are analogous to Figure 2—source data 1 pQTL_results.

Figure 2—figure supplement 1. Selection of the number of PEER factors to adjust for unwanted variation.

Pairwise correlation (Pearson correlation coefficient) between the PEER factors fitted on RNA (a) and protein data (b). Vertical red lines indicate the number of factors used within this study. (c) Number of genes with a pQTL (FDR < 10%) when accounting for increasing numbers of factors in the analysis. Dark blue denotes pQTLs replicated at RNA level (defined as nominal PV <0.01; Materials and methods).
Figure 2—figure supplement 2. Relationship between estimates of donor variance component and cis pQTLs.

(a) Barplot showing the fractions of genes with a significant donor variance component for genes with and without cis pQTLs. (b) Barplot showing the fractions of genes with a cis pQTL for genes stratified by the relative donor variance.
Figure 2—figure supplement 3. Comparison of eQTL and pQTL effect sizes and genomic positions.

(a) Scatter plot of QTL effect size estimates for eQTL lead variants (FDR < 0.1) at the RNA and protein level, respectively. Dark green denotes eQTLs nominally significant at pQTL; light green denotes eQTLs lacking protein replication. (b) Scatter plot of QTL effect size estimates for lead pQTL variants (FDR < 0.1) at the protein and RNA level, respectively. Dark blue denotes pQTLs nominally significant at eQTL; light blue denotes pQTLs lacking eQTL replication. (c) Distribution of eQTL (top) and pQTL (bottom) around the gene start. Y-axis indicates the QTL effect size. pQTLs are stratified by eQTL replication status.
Figure 2—figure supplement 4. Example iPS eQTL and pQTL variants with evidence for co-localisation with GWAS variants.

(a) Left: local Manhattan plot for the cis region of gene SMC2 (lead pQTL rs7872034), displaying eQTL and pQTL association negative log p-values, as well as negative log p-values obtained from a GWAS of invasive ovarian cancer (Phelan et al., 2017) (pQTL cumulative co-localisation posterior probability 0.8; eCAVIAR). Right: scatter plot with eQTL and pQTL effect sizes (y-axis) juxtaposed with effect sizes on invasive ovarian cancer (x-axis). The red triangle indicates the lead pQTL for protein and mRNA effects. (b) Left: local Manhattan plot of the cis region for the gene TRIM5 (lead pQTL missense variant rs11601507) displaying eQTL and pQTL association negative log p-values, as well as negative log p-values obtained from a GWAS for coronary artery disease risk (van der Harst and Verweij, 2018). Right: scatter plot with eQTL and pQTL effect sizes juxtaposed with effect sizes on coronary artery disease risk. Left plot insert shows the position in Q9C035, the protein encoded by TRIM5, of the missense variant rs11601507 and of the peptides used for protein quantification.

To identify DNA sequence variants with effects at both the RNA and protein levels, we considered the pairwise replication of pQTLs at the RNA level and vice versa (lead QTL variants, defining ‘replication’ as nominal PV <0.01 with consistent effect direction; Materials and methods), which is more sensitive than assessing overlapping QTL variants. This identified 473 pQTLs (72%) with replicated eQTL effects. Conversely, 893 eQTLs (26%) had replicated protein effects, with globally concordant effect size directions and distance from gene boundaries (Figure 2—figure supplement 3).
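The replication rule and the reported percentages can be checked with a short sketch. The function name and argument layout are assumptions for illustration; the threshold (nominal PV <0.01, consistent effect direction) and the four counts are taken from the text.

```python
def is_replicated(discovery_beta, other_beta, other_pvalue, alpha=0.01):
    """Replication as defined in the text: nominal P < 0.01 in the other
    molecular layer, with the same direction of effect as in discovery."""
    same_direction = discovery_beta * other_beta > 0
    return other_pvalue < alpha and same_direction


# Counts reported in the text.
n_pqtl, n_pqtl_replicated = 654, 473
n_eqtl, n_eqtl_replicated = 3487, 893

pqtl_rate = n_pqtl_replicated / n_pqtl  # about 0.72
eqtl_rate = n_eqtl_replicated / n_eqtl  # about 0.26
pqtl_not_replicated = n_pqtl - n_pqtl_replicated  # 181
eqtl_not_replicated = n_eqtl - n_eqtl_replicated  # 2594
```

The two differences, 181 pQTLs without eQTL replication and 2594 eQTLs without a replicated protein effect, are the subsets examined in the isoform-level analysis.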

Lack of replication of eQTLs at the protein level could arise from a combination of technical and/or biological factors. We identified the eQTL effect size as a strong predictor for protein replication, with larger effects being associated with increased replication rates (Figure 2c). To systematically characterise the determinants of eQTL replication, we considered a random forest model trained to predict the protein replication status (Figure 2d). In addition to the eQTL effect size, this identified other predictive factors, including the protein coefficient of error (estimated from technical replicate samples; Materials and methods) and the protein coefficient of variation across lines (Figure 2d; Figure 2—source data 2).

To explore the physiological relevance of iPSC pQTL variants, we examined their overlap with variants identified in genome-wide association studies (GWAS). Specifically, we probed for QTLs that tag GWAS variants contained in the GWAS catalogue (MacArthur et al., 2017) (i.e. are in LD, r² >0.8), identifying 136 (of 654) pQTL signals that tag a known GWAS variant (Figure 2—source data 1). In addition, we assessed the statistical evidence for co-localisation of pQTL and GWAS signals for 51 studies for which full summary statistics were obtained (using eCAVIAR; Hormozdiari et al., 2016; Materials and methods; Figure 2—source data 1), yielding 49 pQTLs with evidence of co-localisation (i.e. cumulative co-localisation probability greater than 0.1). Among these, examples of pQTLs with corresponding effects at the RNA level include the variant rs7872034, a pQTL for SMC2 with co-localisation evidence for serous invasive ovarian cancer (Phelan et al., 2017), and the variant rs11752813, a pQTL for PEX6 that is in LD with rs1129187, a risk variant for Alzheimer's disease in APOE e4+ carriers (Jun et al., 2016; Figure 2—figure supplement 4; Figure 2b).
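The LD tagging criterion can be illustrated with a small sketch that computes r² between a lead pQTL variant and a GWAS variant. Two assumptions apply: r² is approximated here as the squared Pearson correlation of genotype dosages (0, 1 or 2), which is a common approximation but not necessarily the procedure used in the study, and the dosage vectors are toy data. Only the 0.8 threshold comes from the text.

```python
def ld_r2(dosages_a, dosages_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(dosages_a)
    if n != len(dosages_b) or n < 2:
        raise ValueError("dosage vectors must have equal length >= 2")
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return cov * cov / (var_a * var_b)


def tags_gwas_variant(qtl_dosages, gwas_dosages, threshold=0.8):
    """True if the QTL variant tags the GWAS variant (r2 above threshold)."""
    return ld_r2(qtl_dosages, gwas_dosages) > threshold


# Toy dosages for eight hypothetical samples.
qtl = [0, 1, 2, 1, 0, 2, 1, 0]
linked = [0, 1, 2, 1, 0, 2, 1, 0]    # identical genotypes: r2 = 1
unlinked = [1, 1, 1, 0, 2, 1, 0, 2]  # weakly correlated genotypes

tag_linked = tags_gwas_variant(qtl, linked)
tag_unlinked = tags_gwas_variant(qtl, unlinked)
```

In practice LD is usually taken from a reference panel of matching ancestry rather than from the study samples themselves.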

Notably, for 33 pQTLs linked to GWAS variants, either via co-localisation or LD tagging, no replicated effect was identified at the RNA level, suggesting protein-specific regulation (Figure 2—source data 1). For example, rs11601507 has no RNA effects, and is associated with TRIM5 protein abundance and with coronary artery disease risk (van der Harst and Verweij, 2018; Figure 2—figure supplement 4). Such cases raise the question of the mechanisms by which these variants modulate protein abundance and, ultimately, phenotypic traits, as addressed below.

pQTL linked to isoform-specific transcript expression

To investigate the mechanisms that underlie discordant eQTLs and pQTLs in more detail, we performed transcript isoform and protein peptide QTL analyses. cis QTL mapping of 33,050 reference transcript isoforms (Zerbino et al., 2018) (quantified using Salmon; Patro et al., 2017; Materials and methods) and 119,747 peptides identified 3810 genes with a transcript QTL (tQTL) and 566 genes with a peptide QTL (pepQTL), respectively (Figure 3—figure supplement 1; Materials and methods; Figure 3—source data 1 and 2).

Transcript-level QTL mapping could explain the lack of protein effects for a small fraction of the 2594 eQTL without a replicated pQTL effect (Figure 2a). For 48 of these, the eQTL variant was identified as exclusively associated with abundance changes of non-coding transcript isoforms (nominal PV <0.01), which explains the absence of protein effects (Figure 3—figure supplement 2). Furthermore, when considering 1262 transcript QTL that neither replicate at the eQTL, nor at the pQTL level, in 45 instances we observed consistent replication when considering peptide QTL (Figure 3—figure supplement 2b).

Among 181 pQTLs without eQTL replication (Figure 2a), 61 had nominally significant transcript QTLs (PV <0.01; Figure 3a). For 12 of these, including a pQTL for MMAB (Figure 3b), we observed genetic effects with opposite directions on coding and non-coding transcript isoforms, which explains the lack of genetic effects when considering gene-level RNA abundance.
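The cancellation described above can be shown with a toy calculation. The sketch assumes that gene-level RNA abundance is the sum of the abundances of the gene's isoforms, and it uses invented genotypes and abundances in arbitrary units. The helper `slope` is an ordinary least-squares slope, used here as a stand-in for the per-allele effect size rather than the mixed-model estimate used in the study.

```python
def slope(x, y):
    """Ordinary least-squares slope of y on x (per-allele effect)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical genotype dosages and isoform abundances (arbitrary units).
genotype = [0, 0, 1, 1, 2, 2]
coding = [10.0, 10.4, 12.1, 11.9, 14.0, 13.6]      # rises with the alternative allele
non_coding = [14.0, 13.6, 11.9, 12.1, 10.0, 10.4]  # falls by the same amount
gene_level = [c + n for c, n in zip(coding, non_coding)]  # sum over isoforms

coding_effect = slope(genotype, coding)          # positive
non_coding_effect = slope(genotype, non_coding)  # negative, same magnitude
gene_effect = slope(genotype, gene_level)        # approximately zero
```

Because only the coding isoform is translated, the protein can follow the positive coding-isoform effect while the gene-level RNA signal stays flat, which matches the pattern of a pQTL without eQTL replication.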

Figure 3. Putative mechanisms of pQTL from transcript isoform regulation and protein-altering variants.

(a) Categorisation of 654 pQTL into four classes according to their putative mechanism: gene expression effect (i.e. replicated at eQTL level), transcript-isoform specific effect (i.e. not replicated at eQTL level, but significant at transcript isoform level), protein-altering variant (i.e. at least one in-frame variant in LD with lead pQTL variant) without expression effect at RNA level, and without any putative mechanism identified. (b) Example pQTL without eQTL replication (rs6663; gene MMAB), with a directional opposite effect on a coding and non-coding isoform (cyan: ENST00000540016; grey: ENST00000537496), resulting in no overall change in gene expression level. (c) The pQTL variant (rs1051061) is a protein-altering variant associated with VRK2 protein abundance (below), and lacks detectable effect on RNA expression. The pQTL signal is observed across 15 peptides spanning the VRK2 protein sequence (above, left). This variant is associated with schizophrenia risk, and is located at the kinase active site, proximal to the proton acceptor residue (above, right). The dashed line and the grey box indicate the genomic positions of the lead QTL and of the gene. (d) Enrichment of RNA-independent pQTL in different categories of predicted variant effects, using gene variants in high LD with pQTLs (proxy gene variants; r² >0.8; within the cis gene boundaries). Enrichment calculated using Fisher’s exact test.

Figure 3—source data 1. tQTL_results.
Consists of variants mapped at RNA, transcript isoform resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results.
Figure 3—source data 2. pepQTL results.
Consists of variants mapped at the protein level, peptide resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results.

Figure 3—figure supplement 1. Discovery and replication of cis QTLs at protein and RNA levels.

Upset plots showing from top to bottom: genome-wide significant cis QTLs (FDR < 10%) for protein (pQTLs), peptides (pepQTLs), gene expression (eQTLs) and transcript isoforms (tQTLs) summarised at gene level. For each discovery set the replication (i.e. nominal PV <0.01 and effect sizes of the same direction) is assessed in the respective other layers, and the number of genes with replicated effects for each intersection is displayed. For comparison purposes, we show only QTLs assessed in all the four layers.
Figure 3—figure supplement 2. Isoform-specific genetic effects.

(a) An eQTL arising from a non-coding transcript QTL. The variant rs2709373, an eQTL variant for METTL21A, was associated with the abundance of the non-coding transcript isoform ENST00000477919 without detectable effect on the abundance of any protein-coding transcript isoform and thus not altering protein expression levels from this locus. (b) A transcript QTL that is neither an eQTL nor a pQTL. The variant rs12795503 has effects in opposite directions on the coding transcripts ENST00000301843 (light blue) and ENST00000346329 (light red), resulting in no detectable effects on either the RNA or protein level. The transcript-specific effect on ENST00000301843 is detectable at the peptide level (peptide QDSAAVGFDYK; uniquely mapping to exon 11 of ENST00000301843). Subplot shows genetic effect sizes for all peptides mapped to CTTN, with the peptides that are shared by both isoforms, and unique to the ENST00000301843 isoform, labelled.
Figure 3—figure supplement 3. Peptide resolution assessment of pQTLs.

(a) Fraction of peptides supporting the genetic effects of missense variants detected at the protein level and not replicated in RNA. For each missense-variant cis pQTL lacking mRNA replication (either eQTL or tQTL), we show the fraction of peptides mapped to the protein with direction of genetic effects consistent with the effect at the protein level. The grey area indicates the random agreement fraction (assuming equal probability of either effect direction; CI: 5–95%). Right panels illustrate genetic effects at protein (dashed horizontal line) and peptide (vertical bars) levels. (b) Fraction of peptides supporting the 68 trans protein QTL. For peptides mapping to the trans protein, we show the fraction of peptide QTLs with direction of genetic effects consistent with the pQTL. (c) Assessment of protein sequence similarity for cis/trans gene pairs. For the top 69 (FDR 0.1) trans pQTLs, the number of mismatching amino acids is shown, based on the local alignment between peptides of the source cis pQTL and the affected trans protein (Materials and methods). For each cis-trans association, we report the minimal number of mismatches across all peptides used for the cis protein. Right: barplot showing the number of amino acid differences between pairs of detected peptides. The proportion was calculated from detected peptides ordered by sequence, using 10 random groups of 1000 peptides.
Figure 3—figure supplement 4. Quantification of peptides containing coding polymorphisms.

(a, b) For each peptide with an SNP changing the peptide sequence (Materials and methods), peptide intensities were averaged across samples from donors who were homozygous for the reference allele (AA), heterozygous (AB), and homozygous for the alternative allele (BB). To increase robustness, we limited the analysis to variants with at least three alternative homozygous and at least three heterozygous lines. (a) Histogram comparing the average intensity of homozygous samples for peptides containing coding polymorphisms. (b) Density plot of the ratio between the average of the heterozygous samples and the average of the reference homozygous samples, restricted to the peptides that were not detected in the alternative homozygous samples.

pQTL arising from protein-altering variants

Next, we set out to characterise further the remaining 120 pQTL without replication at either the eQTL or the transcript QTL level. When classifying the corresponding lead pQTL variants based on their predicted functional effect, we identified 24 missense variants, a striking enrichment compared to pQTL with replicated RNA effects (3.8-fold enrichment; PV = 4 × 10⁻⁵, Fisher’s exact test; Figure 3c and d). These findings are in line with previous observations in lymphoblast cell lines (Battle et al., 2015). Of note, peptides containing protein-altering variants were excluded from the quantifications (Materials and methods), and the reported pQTL effects were observed for multiple peptides from the same proteins (Figure 3—figure supplement 3), providing further confidence in genuine regulatory effects. We assessed whether the 24 pQTL have effects at the RNA level (eQTL) in other cell types, and for 11 of these pQTL we did not find evidence of a nominally significant eQTL in any of the 48 GTEx tissues (PV < 0.01/48; Battle et al., 2017; Figure 1—source data 1), which further points to RNA-independent mechanisms.

Missense variants have the potential to affect protein function. We estimated whether a variant is likely to be deleterious to protein function using SIFT scores, which capture evolutionary conservation and amino acid similarity (Ng and Henikoff, 2003). This revealed a clear enrichment of predicted deleterious variants among the 24 RNA-independent pQTL that tag missense variants: 10 of these have predicted deleterious effects (SIFT score < 0.05), compared to four among all other pQTL (odds ratio = 27.5, PV = 3.8 × 10⁻⁸, Fisher’s exact test; Figure 3d; Figure 1—source data 1). Putative effects of these variants on protein function include loss of enzymatic activity and disruption of protein structure. For example, the variant rs1051061 in VRK2 lies in a conserved sequence in the kinase domain, proximal to the proton acceptor residue, likely impacting kinase activity (Figure 3c). The identical variant has been identified as a GWAS risk variant for schizophrenia (Yu et al., 2017) (OR 1.17), with the risk allele being associated with decreased protein abundance. The effect direction is consistent with previous studies that have linked decreased VRK2 expression to neurological disorders including schizophrenia (Azimi et al., 2018; Tesli et al., 2016).
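The enrichment reported above is a Fisher's exact test on a 2×2 table, which can be sketched in Python with the standard library alone. The counts 10, 14 (= 24 − 10) and 4 are taken from the text. The fourth cell, `n_other_not_deleterious = 154`, is not reported in this section: it is a placeholder assumption chosen so that the odds ratio equals the reported 27.5, so the resulting p-value is illustrative only. The function name `fisher_exact_2x2` is hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (odds ratio, two-sided p-value) for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Two-sided p-value: sum over all tables no more likely than the observed one.
    p_value = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value

# Rows: the 24 RNA-independent pQTL vs all other pQTL.
# Columns: predicted deleterious (SIFT score < 0.05) vs not.
n_other_not_deleterious = 154  # placeholder assumption, not reported in the text
odds_ratio, p_value = fisher_exact_2x2(10, 24 - 10, 4, n_other_not_deleterious)
```

Because one cell is assumed, the sketch reproduces the odds ratio by construction but should not be expected to reproduce the reported p-value.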

These data show important roles of transcriptional regulation underlying cis pQTL effects, while also highlighting how isoform-specific effects, which are invisible to standard eQTL mapping approaches, can be detected at the protein level. For a substantial subset of pQTLs, we identified linked protein-altering variants, many with deleterious effects. Together with previous observations, these results suggest that proteomics information can aid understanding of pathogenic mechanisms of deleterious variants.

Proteome-wide effects of cis QTLs

Building on the compendium of cis pQTL identified here in iPS cells, we set out to characterise downstream proteome-wide changes. We mapped proteome-wide trans pQTL, considering 654 cis pQTL variants. This identified 51 cis-pQTL lead variants with trans effects on a total of 68 proteins (FDR < 10%; Figure 4—source data 1; Materials and methods). To rule out synthetic associations, we discarded associations with evidence for sequence similarity between cis and trans proteins, and we verified the consistency of the identified trans effects across multiple independent peptides (Materials and methods; Figure 3—figure supplement 3). The detected pairs of proteins with shared genetic regulation were strongly enriched for known protein-protein interactions (CORUM, Ruepp et al., 2010; IntAct, Orchard et al., 2014; StringDB, Szklarczyk et al., 2017; odds ratio = 9.1, PV = 1.5 × 10⁻¹⁰, Fisher’s exact test; Figure 4b). The cis and trans effects had similar effect directions and effect sizes, consistent with genetic effects mediated via stabilising protein-protein interactions (Figure 4c). This interpretation of our data in human iPSCs is consistent with the significant donor variance component we observed for many protein complexes (Figure 4d). It is also consistent with previous observations in an outbred mouse cross, showing that protein modules sharing genetic effects in trans are enriched in protein interactions (Chick et al., 2016), and with the identification of trans protein effects due to somatic aberrations in human cancer cell lines (Gonçalves et al., 2017; Roumeliotis et al., 2017). Importantly, our results generalise these previous observations to genetic effects of common variants that segregate in human populations.

Figure 4. Trans effects on the iPSC proteome.

(a) Strategy for mapping trans genetic effects on protein abundance. Lead cis pQTL variants were considered for proteome-wide association analysis. (b) Enrichment of previously catalogued protein-protein interactions among significant trans pQTLs. Shown is the fraction of cis-trans gene pairs linked by a trans pQTL with evidence of protein-protein interactions (based on the union of CORUM, IntAct, and StringDB), as a function of the considered FDR threshold for trans pQTL discovery. The dashed lines correspond to FDR < 10%. Numbers indicate the number of trans pQTL identified for each FDR threshold. (c) Comparison of genetic effect sizes, in cis and trans, for significant (FDR < 10%) trans pQTLs. Red points indicate cis-trans pairs with evidence for protein-protein interactions defined as in b. (d) Left: Protein co-expression of protein complex subunits defined based on CORUM. Right: (i) subunit with the most significant cis pQTL; (ii) fraction of subunits in association with the cis pQTL at nominal significance (PV < 0.01); (iii) fraction of the average cluster protein expression level explained by donor effects. (e) Trans regulation of the PEX26-PEX6-PEX1 complex. The variant rs11752813 (LD r² = 1 with rs1129187) is associated in cis with changes in the RNA and protein abundance of PEX6 and in trans with changes in the protein abundance of PEX1 and PEX26.

Figure 4—source data 1. trans-pQTL_results.
Reported are the trans pQTL (FDR < 10%).

In summary, the trans effects we detected appear to induce strong correlations across protein complex subunits (Figure 4d), whereby a variant associated in cis with one subunit was also associated in trans with other subunits. This is illustrated by PEX26-PEX6-PEX1, a protein complex involved in peroxisome biogenesis. As noted above, the underlying pQTL variant rs1129187 is associated in cis with an increase in both PEX6 RNA and protein abundance (Figure 2b) and is a known risk variant for Alzheimer's disease in APOE e4+ carriers (Jun et al., 2016). This cis pQTL in turn induces downstream associations on the remaining complex subunits, PEX26 and PEX1 (Figure 4e), suggesting that PEX6 acts as a limiting subunit of this complex in iPSCs. Thus, our results provide a potential biological mechanism underlying this risk variant, namely acting through changes in the abundance of the PEX26-PEX6-PEX1 complex. Notably, there is prior evidence for an implication of peroxisomal function in the development of Alzheimer’s disease and in other neurodegenerative processes (Lizard et al., 2012; Berger et al., 2016), providing further support for this hypothesis.

Discussion

Here, we report the first in-depth characterisation of the human iPSC proteome, connecting genetic variation to changes in RNA and protein levels. Beyond the relevance for iPS cell biology, this study, to our knowledge, provides the most detailed population-level analysis of parallel RNA/protein profiles in human cells. By quantifying genome-wide protein and transcript expression variation across more than 200 human iPSC lines, we identified both genetic and non-genetic mechanisms that underlie variation in both protein and RNA levels. We have mapped more than 600 cis protein quantitative trait loci (pQTLs) and analysed how these relate to cis eQTLs, how they impact other proteins in trans, and how pQTLs link to human disease variants.

The variance component analysis explained a lower overall fraction of variance in the protein data compared to RNA variation, which likely reflects larger technical effects and/or stochasticity in protein expression levels. Among the explainable fraction of variance, donor-specific genetic factors are a major contributor to the differences in protein expression detected across the iPSC lines. The corollary is that protein expression variation across iPS cells reflects genetic diversity in the human population. Consistent with this, we identified 654 common genetic variants associated with changes in protein abundance.

Globally, there were substantially fewer pQTLs than eQTLs, and while most pQTLs had effects of the same direction at eQTL, only 30% of eQTLs are nominally significant at the protein level. It is possible that technical factors resulting from the protein measurement methods may contribute, at least in part, towards attenuating the signal detected at the protein level. However, considering our data in light also of results from previous studies, some of which employed alternative technologies for protein detection to the MS methods used here, we suggest that the signal attenuation between eQTL and pQTL levels is not exclusively the result of limitations in protein measurements. Instead, many eQTLs may reflect variation in RNA abundance that does not cause significant changes in steady state protein levels.

By the systematic comparison of matched protein and RNA data, including detailed analysis of separate isoforms, we demonstrated that in order to fully understand the propagation of genetic effects to proteins, isoform-resolution protein and RNA phenotypes are indispensable. In particular, this approach identified additional RNA-dependent regulation that manifests in protein QTL, thereby improving the ability to identify genuine RNA-independent pQTL.

We showed that the pQTLs for which no corresponding changes in transcript levels were detected are enriched in deleterious missense variants. This result suggests that the phenotypic effects of such variants may be exerted through protein abundance changes. Because most deleterious variants, and in particular pathogenic variants, are rare, larger sample sizes will be required to fully assess the protein components of this class of regulatory genetic effects.

Our study presents the first comprehensive map of pQTLs at peptide resolution, considering a total of 119,747 peptides from 8543 proteins for genetic analysis. This identified 566 peptide QTL, several of which were not detectable when considering whole protein expression levels, as illustrated by the variant rs12795503, a pepQTL for the gene CTTN (Figure 3—figure supplement 2). While we mapped fewer significant pepQTLs than pQTLs, peptide-level analyses were shown here to overcome potential artefacts raised by protein quantification, in particular when mapping trans pQTL, and are invaluable in identifying isoform-specific effects.

Our data highlight the ability of protein-protein interactions to propagate genetic effects in human populations. A long-standing hypothesis has been that certain protein complexes may have a rate-limiting subunit that determines complex abundance, with any excess subunits produced being rapidly degraded (e.g. because of exposure of hydrophobic residues). This implies that cis eQTLs affecting the levels of rate-limiting subunits should also have effects in trans on the abundance of the whole complex, and on most, if not all, subunits therein. While trans genetic effects were previously reported to be mediated by protein interactions in high-heterozygosity samples, that is, outbred mice (Chick et al., 2016), and for somatic aberrations in cancer cell lines (Gonçalves et al., 2017; Roumeliotis et al., 2017), to our knowledge, this study provides the first example that such effects act through common genetic variants in untransformed human cells. In the future, the approach we have taken here could be extended by Mendelian randomisation-based approaches to formally assess the mediating role of the cis pQTL on protein complex members.

Understanding the mechanisms through which genetic variations act in the human population is of great relevance to characterising risk factors and susceptibility to disease. There is on-going interest in the potential for studying disease mechanisms using disease relevant tissues that are derived from panels of iPSCs (Cayo et al., 2017; Li et al., 2018; D'Aiuto et al., 2014; Schwartzentruber et al., 2018). This study provides important information showing how direct analysis of human iPSCs can advance our understanding of the genetic regulation of protein expression and how this influences cell phenotypes and disease mechanisms.

Materials and methods

Key resources table.

Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Cell line (Homo sapiens) | iPSC | www.hipsci.org | RRID:SCR_003909 |
Software, algorithm | MaxQuant | https://www.maxquant.org/ | RRID:SCR_014485 |
Software, algorithm | Trim Galore | https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ | RRID:SCR_011847 |
Software, algorithm | STAR | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/ | RRID:SCR_015899 |
Software, algorithm | Salmon | https://combine-lab.github.io/salmon/ | NA |
Software, algorithm | g:Profiler | http://biit.cs.ut.ee/gprofiler/ | RRID:SCR_006809 |
Software, algorithm | eCAVIAR | http://zarlab.cs.ucla.edu/tag/ecaviar/ | NA |
Software, algorithm | VEP | https://www.ensembl.org/info/docs/tools/vep/index.html | RRID:SCR_007931 |
Software, algorithm | MutFunc | http://www.mutfunc.com/ | NA |
Software, algorithm | Limix | https://github.com/limix/limix | NA |
Software, algorithm | Peer | http://www.sanger.ac.uk/resources/software/peer/ | RRID:SCR_009326 |
Software, algorithm | Scikit-learn | http://scikit-learn.org/ | RRID:SCR_002577 |

Cell lines

As described in Kilpinen et al., 2017, all HipSci samples were collected from consented research volunteers recruited from the NIHR Cambridge BioResource, and iPSC were generated from fibroblasts by transduction with Sendai vectors. In brief, cells were cultured on a fibroblast (MEF-CF1) feeder layer, with selected lines being transferred to Essential 8 (E8) medium. Pluripotency was assessed based on expression profiling, detection of pluripotency markers in culture and response to differentiation-inducing conditions. Mycoplasma screening was performed using a standard PCR kit. Sample identity was confirmed using a Fluidigm genotyping assay containing 24 SNPs. The ID numbers and details for each cell line included in this analysis are listed in Figure 1—source data 1.

RNA-Seq data processing

Raw RNA-Seq data for 331 samples were obtained from the ENA project ERP007111. CRAM files were merged at the sample level and converted to FASTQ format. Sequencing reads were trimmed to remove adapters and low-quality bases (using Trim Galore!), followed by read alignment using STAR (version: 020201) (Dobin et al., 2013), using the two-pass alignment mode and the default parameters as proposed by ENCODE (cf. STAR manual). All alignments were relative to the GRCh37 reference genome, using ENSEMBL 75 as transcript annotation (Zerbino et al., 2018).

Samples with low-quality RNA-Seq data were discarded if they had fewer than 2 billion bases aligned, less than 30% coding bases, or a duplication rate higher than 75%. This resulted in 323 lines for analysis, of which 202 had matched proteome data.
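The three sample-level thresholds above can be written as a single predicate. The following is a minimal sketch only: the function name, the metric names and the example values are hypothetical, and how each metric is computed upstream is not specified here.

```python
def passes_rnaseq_qc(bases_aligned, coding_fraction, duplication_rate):
    """Apply the three sample-level thresholds described in the text."""
    return (
        bases_aligned >= 2_000_000_000   # at least 2 billion aligned bases
        and coding_fraction >= 0.30      # at least 30% coding bases
        and duplication_rate <= 0.75     # duplication rate no higher than 75%
    )

# Hypothetical samples: (bases aligned, coding fraction, duplication rate)
samples = {
    "line_a": (3.1e9, 0.45, 0.40),
    "line_b": (1.5e9, 0.50, 0.30),  # too few aligned bases
    "line_c": (2.8e9, 0.25, 0.35),  # too few coding bases
}
kept = [name for name, metrics in samples.items() if passes_rnaseq_qc(*metrics)]
```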

Gene-level RNA expression was quantified from the STAR alignments using featureCounts (v1.6.0) (Liao et al., 2014), which was applied to the primary alignments using the ‘-B’ and ‘-C’ options in stranded mode, using the ENSEMBL 75 GTF file. Quantifications per sample were merged into an expression table using the following normalisation steps. First, gene counts were normalised by gene length. Second, the counts for each sample were normalised by sequencing depth using the edgeR adjustment (Robinson et al., 2010).

Transcript isoform expression was quantified directly from the (unaligned) trimmed reads using Salmon (Zerbino et al., 2018) (version: 0.8.2), using the ‘--seqBias’, ‘--gcBias’ and ‘VBOpt’ options in ‘ISR’ mode to match our inward stranded sequencing reads. The transcript database was built on transcripts derived from ENSEMBL 75. The TPM values as returned by Salmon were combined into an expression table.

Quantitative proteomics data generation

Upon establishment in culture, lines were expanded for banking and molecular assays (including genomics, transcriptomics and proteomics). We selected 217 lines (banked frozen pellets) for in-depth proteomic analysis with Tandem Mass Tag Mass Spectrometry. A subset of 202 lines (112 normal and 90 disease; Figure 1—source data 1) with matched mRNA and protein data were considered for further analysis. Previous studies have shown that this sample size enables the identification of a large number of cis RNA and protein QTLs (Battle et al., 2015).

Sample preparation

For protein extraction, frozen iPSC cell pellets were washed with ice-cold PBS and redissolved immediately in 200 μL of lysis buffer (8 M urea in 100 mM triethyl ammonium bicarbonate (TEAB)) and mixed at room temperature for 15 min. DNA in the cell lysates was sheared using ultrasonication (6 × 20 s at 10°C). The proteins were reduced using tris-carboxyethylphosphine (TCEP; 25 mM) for 30 min at room temperature, then alkylated in the dark for 30 min using iodoacetamide (50 mM). Total protein was quantified using the fluorescence-based EZQ assay (Life Technologies). The lysates were diluted four-fold with 100 mM TEAB for the first protease digestion with mass spectrometry grade lysyl endopeptidase, Lys-C (Wako, Japan), then diluted a further 2.5-fold before a second digestion with trypsin. Lys-C and trypsin were used at an enzyme to substrate ratio of 1:50 (w/w). Digestions were carried out for 12 hr at 37°C, then stopped by acidification with trifluoroacetic acid (TFA) to a final concentration of 1% (v:v). Peptides were desalted using C18 Sep-Pak cartridges (Waters) following the manufacturer’s instructions and dried.

Tandem Mass Tag mass spectrometry analysis

For Tandem Mass Tag (TMT)-based quantification, the dried peptides were redissolved in 100 mM TEAB (50 µL) and their concentration was measured using a fluorescent assay (CBQCA) (Life Technologies). 100 µg of peptides from each cell line to be compared, in 100 µL of TEAB, were labelled with a different TMT tag (20 µg mL⁻¹ in 40 µL acetonitrile) (Thermo Scientific) for 2 hr at room temperature. After incubation, the labelling reaction was quenched using 8 µL of 5% hydroxylamine (Pierce) for 30 min, and the different cell lines/tags were mixed and dried in vacuo. TMT-ten plex was used to label ten iPSC lines and quantify them in parallel. In total, 24 TMT-ten plex experiments were performed, where one iPSC line (HPSI0314i-bubh_3) was chosen as a reference cell line and was kept constant in all TMT batches. The other nine quantification channels were used to label nine different cell lines. No randomisation was applied in assigning the samples to batches.

The TMT samples were fractionated using off-line, high pH reverse phase chromatography: samples were loaded onto a 4.6 × 250 mm Xbridge BEH130 C18 column with 3.5 µm particles (Waters). Using a Dionex bioRS system, the samples were separated using a 25 min multistep gradient of solvents A (10 mM formate at pH 9) and B (10 mM ammonium formate pH 9 in 80% acetonitrile), at a flow rate of 1 ml/min. Peptides were separated into 48 fractions, which were consolidated into 24 fractions. The fractions were subsequently dried and the peptides redissolved in 5% formic acid and analysed by LC-MS.

5% of the material was analysed using an Orbitrap Fusion Tribrid mass spectrometer (Thermo Scientific), equipped with a Dionex ultra high-pressure liquid chromatography system (nano RSLC). RP-LC was performed using a Dionex RSLC nano HPLC (Thermo Scientific). Peptides were injected onto a 75 μm × 2 cm PepMap-C18 pre-column and resolved on a 75 μm × 50 cm RP-C18 EASY-Spray temperature-controlled integrated column-emitter (Thermo), using a 4-hr multistep gradient from 5% B to 35% B with a constant flow of 200 nL min⁻¹. The mobile phases were: 2% ACN incorporating 0.1% FA (Solvent A) and 80% ACN incorporating 0.1% FA (Solvent B). The spray was initiated by applying 2.5 kV to the EASY-Spray emitter and the data were acquired under the control of Xcalibur software in a data-dependent mode using top speed and 4 s duration per cycle. The survey scan was acquired in the orbitrap covering the m/z range from 400 to 1400 Th, with a mass resolution of 120,000 and an automatic gain control (AGC) target of 2.0 e5 ions. The most intense ions were selected for fragmentation using CID in the ion trap with 30% CID collision energy and an isolation window of 1.6 Th. The AGC target was set to 1.0 e4 with a maximum injection time of 70 ms and a dynamic exclusion of 80 s.

During the MS3 analysis for more accurate TMT quantifications, five fragment ions were co-isolated using synchronous precursor selection, using a window of 2 Th and further fragmented using a HCD collision energy of 55% (McAlister et al., 2014). The fragments were then analysed in the orbitrap with a resolution of 60,000. The AGC target was set to 1.0 e5 and the maximum injection time was set to 105 ms.

Proteomics data processing

The TMT-labelled samples (24 batches of TMT-ten plex) were analysed using MaxQuant v. 1.6.0.13 (Cox et al., 2011; Schwanhäusser et al., 2011). Proteins and peptides were identified using the UniProt human reference proteome database (Swiss Prot + TrEMBL) release-2017_03, using the Andromeda search engine. Run parameters and the raw MaxQuant output have been deposited at PRIDE (PXD010557).

The following search parameters were used: reporter ion quantification, mass deviation of 6 ppm on the precursor and 0.5 Da on the fragment ions; Tryp/P for enzyme specificity; up to two missed cleavages, ‘match between runs’, ‘iBAQ’. Carbamidomethylation on cysteine was set as a fixed modification. Oxidation on methionine, pyro-glu conversion of N-terminal Gln, deamidation of asparagine and glutamine and acetylation at the protein N-terminus, were set as variable modifications (Cox and Mann, 2008; Cox et al., 2011; Schwanhäusser et al., 2011; Tyanova et al., 2016).

Peptides and protein groups, that is groups of protein isoforms identified by common peptides (for details see Brenes et al., 2018), were reported at FDR < 5%. The same FDR threshold was used for reporting the Peptide Spectrum Matches (PSM). We performed the FDR calculation on an extended set and removed the Razor Protein FDR calculation constraint (for more details see reference Ramasamy et al., 2013). In total, we identified 255,015 peptides detected in at least one sample (after removing reverse and contaminant peptides; on the 217 lines and 23 replicates of the reference line), which corresponds to 16,773 protein groups.

Quality control and quantification

Peptides that overlap non-synonymous variants may be incorrectly detected and could result in synthetic associations, similarly to the polymorphism-in-probe effects in microarrays (Ramasamy et al., 2013). We therefore performed the following quality control and filtering steps. First, using PoGo (Schlaffner et al., 2017) (applied with Gencode release 30 mapped on GRCh37), we reconstructed the genomic loci of 250,171 peptides that could be assigned to one or multiple genomic locations. Next, we assessed the overlap of these peptides with common, non-synonymous variants in human populations (MAF > 0.01 in 1000 Genomes European populations phase 3). To define non-synonymous variants, we considered the overlap with any transcript isoform that contains the analysed peptide obtained from Gencode release 30 mapped on GRCh37. This resulted in 6273 peptides that overlap such polymorphisms, which were discarded from further analyses (Supplementary file 3). Peptides mapped to multiple locations in the genome were discarded if they overlapped non-synonymous variants for at least one of these locations. Finally, we discarded 4844 peptides for which a genomic position was not identified.

We discarded 10 lines with fewer than 67,000 identified peptides (corresponding to 75% of the median number of peptides identified; Figure 1—figure supplement 1), resulting in a proteomics dataset consisting of 207 lines, 202 of which had matched RNA-Seq data and hence were considered for further analysis. In addition, the technical replicates for the included reference line in each TMT batch were retained to aid the normalisation of protein quantifications between batches; see below.

Protein group abundances were estimated using the remaining peptides as the sum of the intensities of individual peptides mapped to the protein group. Peptide abundance estimates were obtained from the intensity values reported in the ‘Peptides’ file from MaxQuant.

For downstream analysis, we considered the subset of peptides that were recurrently detected in at least 30 of the 202 lines. Similarly, we discarded all protein groups that did not contain at least one recurrent peptide. This resulted in a final dataset of 11,140 recurrent protein groups and 132,078 recurrent peptides (Figure 1—figure supplement 1), corresponding to 10,198 genes (Ensembl ID).
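The recurrence filter described above can be sketched as follows. This is an illustration under stated assumptions: a peptide is treated as "detected" in a line when its intensity is not `None`, and the function names and the data layout (nested dictionaries) are hypothetical rather than taken from the study's pipeline.

```python
def recurrent_features(intensities, min_lines=30):
    """Return the features detected in at least `min_lines` lines.

    `intensities` maps feature -> {line: intensity or None (not detected)}.
    """
    return {
        feature
        for feature, per_line in intensities.items()
        if sum(value is not None for value in per_line.values()) >= min_lines
    }

def recurrent_protein_groups(peptide_to_group, recurrent_peptides):
    """Keep protein groups that contain at least one recurrent peptide."""
    return {peptide_to_group[p] for p in recurrent_peptides}

# Toy example with a lowered threshold (2 lines instead of 30).
toy = {
    "pepA": {"l1": 1.0, "l2": 2.0, "l3": None},
    "pepB": {"l1": None, "l2": None, "l3": 3.0},
}
kept_peptides = recurrent_features(toy, min_lines=2)
kept_groups = recurrent_protein_groups({"pepA": "PG1", "pepB": "PG2"}, kept_peptides)
```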

To adjust for technical effects during the acquisition of protein data in TMT batches, we scaled the abundance estimate for each feature (i.e. protein or peptide) as follows. For each feature and batch, we multiplied the intensity by a scaling coefficient defined as the ratio between the median intensity across all lines and the median across the subset of lines within a given TMT batch. Next, we applied quantile normalisation to the peptide and protein abundance estimates, normalising the feature distribution in each line relative to a normalisation reference line (the line with the highest number of total peptides detected). Briefly, for each line we assigned to each feature the value observed in the reference line corresponding to the rank of the value of that feature in the line to be normalised: $y'_{pl} = r[\mathrm{rank}(y_{pl})]$, where $y_{pl}$ are the intensity values for feature $p$ and line $l$ obtained after batch scaling, $r$ is the sorted vector of intensities from the normalisation reference line, and $y'_{pl}$ is the normalised value. In order to evaluate how the data transformation tackles batch effects, we performed a PCA analysis of protein quantifications and compared the peptide coefficient of error before and after data transformation (Figure 1—figure supplement 1d and e).
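The two normalisation steps described above (per-batch median scaling, then rank-based mapping onto a reference line) can be sketched with NumPy. This is a simplified illustration, not the study's code: it assumes a complete features × lines matrix with no missing values, breaks ties in rank arbitrarily, and uses hypothetical function names.

```python
import numpy as np

def scale_batches(y, batch):
    """Per feature, rescale each batch so its median matches the global median.

    y: (features x lines) intensities; batch: one batch label per line.
    """
    y = np.asarray(y, dtype=float)
    batch = np.asarray(batch)
    scaled = y.copy()
    global_median = np.median(y, axis=1)
    for b in np.unique(batch):
        cols = batch == b
        batch_median = np.median(y[:, cols], axis=1)
        scaled[:, cols] *= (global_median / batch_median)[:, None]
    return scaled

def quantile_normalise(y, ref_line):
    """Replace each value by the reference-line value of the same rank."""
    y = np.asarray(y, dtype=float)
    r = np.sort(y[:, ref_line])                        # sorted reference intensities
    ranks = np.argsort(np.argsort(y, axis=0), axis=0)  # 0-based rank within each line
    return r[ranks]

# Toy data: 2 features x 4 lines in two batches, then 3 features x 2 lines.
scaled = scale_batches([[1, 2, 10, 20], [2, 4, 20, 40]], [0, 0, 1, 1])
normalised = quantile_normalise([[5, 1], [3, 2], [1, 3]], ref_line=0)
```

After `quantile_normalise`, every line has exactly the same set of values as the reference line; only their assignment to features differs.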

Assessment of TMT ratio compression effects

We assessed quantitative compression of our proteomics data by examining changes in peptide abundance for those peptides that were discarded because they overlap non-synonymous variants, following Battle et al., 2015. The rationale behind this approach is that a non-synonymous variant in a peptide prevents detection of that peptide, as its sequence will not exist in the proteome reference. Thus, in samples heterozygous for the non-synonymous variants, the measured peptide abundance is expected to be half of that of samples homozygous for the reference variant. Although the data indicated that ratio compression effects can be noticed in our study (Figure 3—figure supplement 4), the QTL results show that protein measurements derived from TMT Mass spectrometry analyses are suitable for the detection of protein abundance changes.
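The rationale above can be expressed as a comparison between the observed heterozygous-to-reference intensity ratio and the expected value of 0.5. The sketch below uses hypothetical intensities and a hypothetical function name; it only illustrates the comparison, not the study's actual analysis.

```python
from statistics import mean

EXPECTED_RATIO = 0.5  # only one of the two alleles yields the reference peptide

def het_to_ref_ratio(intensities, genotypes):
    """Mean intensity in heterozygous (AB) lines over that in reference homozygous (AA) lines."""
    ref = [x for x, g in zip(intensities, genotypes) if g == "AA"]
    het = [x for x, g in zip(intensities, genotypes) if g == "AB"]
    return mean(het) / mean(ref)

# Hypothetical intensities for one variant-overlapping peptide.
intensities = [100.0, 110.0, 90.0, 70.0, 65.0, 75.0]
genotypes = ["AA", "AA", "AA", "AB", "AB", "AB"]
ratio = het_to_ref_ratio(intensities, genotypes)
compressed = ratio > EXPECTED_RATIO  # a ratio pulled towards 1 suggests compression
```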

Comparisons of iPSC proteome profiles to existing datasets

To compare our iPSC proteome dataset to the Human Proteome Map (HPM) (Kim et al., 2014; Figure 1—figure supplement 1d), we first mapped the RefSeq IDs of proteins quantified in the HPM to UniProt IDs. We considered the subset of 8333 proteins that were expressed in our iPSC dataset and in at least one HPM tissue, and for which IDs could be mapped. We then calculated Spearman correlation coefficients between the aggregate iPSC proteome abundance profile (averaged across lines) and each HPM tissue.

Variance component analysis

In order to calculate the contribution of each factor $k$ to variation in protein abundance, we fitted a random effects model as follows:

$$y = \mu + \sum_k u_k + \varepsilon, \qquad u_k \sim N(0, \sigma_k^2 M_k), \qquad \varepsilon \sim N(0, \sigma_r^2 I),$$

$$M_k[i, j] = \begin{cases} 1 & \text{if } f_k[i] = f_k[j] \\ 0 & \text{if } f_k[i] \neq f_k[j] \end{cases}$$

where $y$ denotes the ($N \times 1$) vector of normalised log protein abundances, $u_k$ are the random effects, $M_k$ is the ($N \times N$) covariance structure, $\sigma_k$ is the standard deviation, and $\varepsilon$ is the residual (i.i.d. noise). The random effect components are defined based on a categorical covariance function defined on covariates $f_k$, that is, the vector of observed values for factor $k$ (e.g. $f_k[i] \in \{\text{male}, \text{female}\}$ when $k$ is the donor sex component). We considered donor identity, donor sex, donor age (treated as a categorical variable, with each group in age windows of 5 years), culture medium, TMT batch, and TMT channel as random effect components. In order to accurately estimate the donor variance component, we restricted this analysis to the set of lines from the subset of 51 donors for which two cell lines were assayed, and to 6009 genes with TPM > 1 and with proteins detected in this subset of lines. Analogous analyses were considered for RNA abundance, leaving out the TMT-specific random effects.
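To make the categorical covariance structure concrete, the following NumPy sketch builds the indicator matrix for one factor and assembles the implied covariance of the abundance vector. The function names and the variance values are hypothetical, and the sketch does not fit the model (estimating the variances is what LIMIX was used for); it only shows how the matrices are constructed.

```python
import numpy as np

def categorical_covariance(f):
    """M[i, j] = 1 if samples i and j share the same level of the factor, else 0."""
    f = np.asarray(f)
    return (f[:, None] == f[None, :]).astype(float)

def total_covariance(factors, sigma2, sigma2_resid):
    """Cov(y) = sum_k sigma_k^2 * M_k + sigma_r^2 * I for the random effects model."""
    n = len(next(iter(factors.values())))
    cov = sigma2_resid * np.eye(n)
    for name, f in factors.items():
        cov += sigma2[name] * categorical_covariance(f)
    return cov

# Three lines: two from donor d1, one from donor d2 (variances are made up).
factors = {"donor": ["d1", "d1", "d2"], "sex": ["female", "female", "male"]}
cov = total_covariance(factors, {"donor": 0.5, "sex": 0.2}, sigma2_resid=0.3)
```

In this toy example, the two lines from the same donor share covariance 0.5 + 0.2 = 0.7, while lines from different donors of different sex share none.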

To study the effect of different factors on protein abundance independently of RNA abundance, we also applied the variance decomposition analysis to protein abundance values adjusted for RNA variation. Adjusted protein abundances were calculated by regressing out the effects of RNA abundance (i.e. gene-level quantifications of RNA) on protein abundance for each RNA-protein pair. To do this, we fitted a linear model between RNA and protein abundances across lines (using the NumPy function poly1d in Python), taking the model residuals as the adjusted protein abundance values. Variance decomposition models were then fitted as described above.
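The adjustment step can be sketched as a degree-1 least-squares fit followed by taking residuals. The text names only `poly1d`; the use of `np.polyfit` to obtain the coefficients is an assumption here, as is the function name `adjust_for_rna`.

```python
import numpy as np

def adjust_for_rna(protein, rna):
    """Residuals of a straight-line least-squares fit of protein on RNA abundance."""
    protein = np.asarray(protein, dtype=float)
    rna = np.asarray(rna, dtype=float)
    fit = np.poly1d(np.polyfit(rna, protein, deg=1))  # fitted line as a callable
    return protein - fit(rna)

# Toy RNA-protein pair: protein = 2 * rna + 1 plus a small deviation per line.
rna = np.array([1.0, 2.0, 3.0, 4.0])
deviation = np.array([0.5, -0.5, -0.5, 0.5])
adjusted = adjust_for_rna(2 * rna + 1 + deviation, rna)
```

The residuals are uncorrelated with RNA abundance by construction, so any variance component estimated from them reflects variation not explained linearly by RNA.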

All variance component models were fitted using the LIMIX package (https://github.com/PMBio/limix; https://doi.org/10.1101/003905) (Lippert et al., 2015).

Quantification of X chromosome inactivation (XCI)

The X chromosome inactivation (XCI) status of female cell lines was quantified using allele-specific counts from RNA-Seq reads mapping to the X chromosome. These allele-specific counts were obtained for SNPs present in dbSNP using GATK ASEReadCounter with the command ‘GenomeAnalysisTK.jar -T ASEReadCounter -U ALLOW_N_CIGAR_READS --minMappingQuality 10 --minBaseQuality 2’. For known heterozygous SNPs in each line, the allele-specific fraction of expression was defined as the fraction of reads mapping to the less expressed allele (i.e. the allele-specific fraction was ≤0.5). The XCI status of each cell line was then defined as the mean of the allele-specific fractions across all heterozygous X chromosome SNPs with at least 20 overlapping reads in the corresponding RNA-Seq sample.
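The XCI score defined above can be sketched from per-SNP allele counts as follows. The function names and the toy counts are hypothetical, and the input is assumed to be already restricted to known heterozygous X chromosome SNPs.

```python
def allele_specific_fraction(ref_count, alt_count):
    """Fraction of reads from the less expressed allele (always <= 0.5)."""
    return min(ref_count, alt_count) / (ref_count + alt_count)

def xci_status(snp_counts, min_reads=20):
    """Mean allele-specific fraction over SNPs covered by at least min_reads reads."""
    fractions = [
        allele_specific_fraction(ref, alt)
        for ref, alt in snp_counts
        if ref + alt >= min_reads
    ]
    if not fractions:
        return None  # no sufficiently covered heterozygous SNP in this line
    return sum(fractions) / len(fractions)

# Hypothetical (ref, alt) read counts at heterozygous X chromosome SNPs in one line.
counts = [(30, 10), (0, 25), (5, 5), (20, 20)]
status = xci_status(counts)  # the (5, 5) SNP is skipped: only 10 reads
```

Under this definition, values near 0 indicate expression from a single allele, and values near 0.5 indicate balanced expression from both alleles.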

Gene ontology enrichment was performed against the 6335 genes included in the XCI analysis using g:Profiler (Reimand et al., 2016), and p-values were adjusted for multiple testing using the Benjamini-Hochberg FDR procedure.

QTL mapping of RNA and protein traits

Cis QTL mapping

We used PEER (Stegle et al., 2012) to account for unwanted variation and confounding factors. PEER was applied to log normalised protein abundance and log normalised gene TPM, considering the most highly expressed 10,000 proteins and genes, respectively. We fit 7 PEER factors for protein and 13 PEER factors for RNA, settings that were determined as the largest number of PEER factors that retain statistical independence of the inferred factors (R < 0.7; Figure 2—figure supplement 1).

For cis genetic analyses, we considered common variants (MAF >5%) in gene-proximal regions of 250 kb upstream and downstream of gene transcription start and end sites (GRCh37). The size of this cis analysis window is a compromise between covering distal regulatory elements and limiting the multiple testing burden. We used a linear mixed model implemented in LIMIX, thereby controlling for both population structure and repeat lines from the same donor, using kinship as a random effect component. The population structure random effect was accounted for using the realized relationship covariance, that is, the dot product of the genotype matrices. PEER factors were included as fixed effect covariates.
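The realized relationship covariance mentioned above can be illustrated with a short sketch. Centring and scaling each variant, dropping monomorphic variants and dividing by the number of variants are common conventions assumed here; the text states only that the covariance is the dot product of the genotype matrices. The function name and genotypes are hypothetical.

```python
import numpy as np

def realized_relationship(genotypes):
    """Kinship from a samples-by-variants genotype matrix (0/1/2 allele counts)."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)             # centre each variant
    sd = g.std(axis=0)
    keep = sd > 0                      # drop monomorphic variants
    g = g[:, keep] / sd[keep]          # scale each variant to unit variance
    return g @ g.T / g.shape[1]        # dot product, averaged over variants

# Hypothetical genotypes: rows 0 and 1 are two lines from the same donor.
genotypes = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 0, 1],
    [1, 0, 2, 1, 1],
    [2, 2, 2, 0, 0],
]
K = realized_relationship(genotypes)
```

Lines derived from the same donor have identical genotypes, so their off-diagonal kinship equals their diagonal entries, which is how a single kinship term can account for repeat lines as well as population structure.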

To adjust for multiple testing across cis variants for each gene, we fitted an empirical null distribution using a permutation scheme combined with a parametric fit to the null distribution, similar to the approach taken in FastQTL (Ongen et al., 2016). Briefly, for each gene, we obtained p-values from 100 permutations of the tested variants. We then estimated an empirical null distribution by fitting a parametric Beta distribution to the obtained p-values. Using this null model, we estimated cis region adjusted p-values for QTL lead variants. For multiple testing adjustment across genes, we performed Benjamini-Hochberg adjustment.

For protein, peptide and transcript QTLs, because several of these features can map to the same gene, we used Bonferroni adjustment to correct for feature multiplicity within each gene, followed by Benjamini-Hochberg adjustment across genes, as performed for the gene-level eQTLs.
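The two-stage adjustment can be sketched as below, taking one cis-region adjusted p-value per feature as input. The gene names, p-values and function names are hypothetical, and this is an illustration of the general procedure rather than the study's implementation.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

def gene_level_pvalues(feature_pvalues):
    """Bonferroni within gene: smallest p-value times feature count, capped at 1."""
    return {gene: min(1.0, min(ps) * len(ps)) for gene, ps in feature_pvalues.items()}

# Hypothetical cis-region adjusted p-values for the features of three genes.
feature_pvalues = {
    "GENE_A": [0.001, 0.2, 0.5],
    "GENE_B": [0.04],
    "GENE_C": [0.3, 0.02],
}
gene_p = gene_level_pvalues(feature_pvalues)
genes = sorted(gene_p)
q_values = dict(zip(genes, benjamini_hochberg([gene_p[g] for g in genes])))
```

The Bonferroni step keeps genes with many quantified features from gaining an advantage, and the Benjamini-Hochberg step then controls the FDR across genes.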

Trans QTL mapping

For trans QTL analysis, we considered lead cis QTLs (FDR < 10%; 654 pQTLs) versus 11,140 recurrently detected proteins. Genome-wide adjustment for multiple testing was performed using Benjamini-Hochberg (BH) across all tests (7 × 10⁶ variants × proteins).

To rule out any potential artefact linked to the mis-mapping of cis protein peptides to the trans proteins, we aligned all trans peptide sequences to the cis protein sequences. We considered all peptides used for the quantification of proteins associated in trans and locally aligned them to the reference sequence of the proteins associated in cis (using pairwise2.align.localxs from the Biopython Project, with a gap penalty of 1). This identified a single trans association for which the peptides had fewer than two mismatches, which was excluded from the reported results.

Downstream analysis of QTL results

Cis eQTL and pQTL replication

To assess the replication of QTL across molecular layers, we considered the QTL detected in one layer and assessed the nominal significance in the other layer (PV <0.01), as well as the consistency of the effect directions in the second layer.
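A minimal sketch of this replication check for a single QTL follows. Combining the two criteria into one boolean, and the function and argument names, are assumptions made for illustration; the text reports nominal significance and direction consistency as two assessments.

```python
def replication_status(beta_discovery, beta_other, p_other, alpha=0.01):
    """Assess a QTL found in one layer against the other molecular layer."""
    nominally_significant = p_other < alpha
    same_direction = beta_discovery * beta_other > 0
    return {
        "nominally_significant": nominally_significant,
        "same_direction": same_direction,
        # Assumed combined call: both criteria must hold.
        "replicated": nominally_significant and same_direction,
    }

# Hypothetical eQTL effect tested at the protein level.
result = replication_status(beta_discovery=0.5, beta_other=0.3, p_other=0.001)
```

Keeping the two flags separate makes it possible to report, for example, how many nominally significant associations also agree in effect direction.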

To identify the determinants of eQTL to pQTL replication, we trained a random forest model to predict the replication status of 3487 eQTL (Figure 2—source data 2). For each RNA-protein pair, we defined eight features: ‘eQTL effect size’, ‘protein coefficient of error’, ‘protein coefficient of variation’, ‘protein abundance’, ‘SNP MAF’, ‘RNA abundance’, ‘number of peptides’, and ‘number of missing measurements’. The feature ‘protein coefficient of error’ was computed as the coefficient of variation of the reference line across the set of technical replicates (TMT batches). This model was fitted in Python using the scikit-learn library v0.21.3, with the sklearn.ensemble.RandomForestClassifier model with n_estimators = 100 (i.e. the suggested default). The model was trained and tested 50 times, each time training on a random sample of 80% of the data and testing on the remaining 20%.

Annotation of cis-trans protein pairs with protein-protein interactions

Protein-protein interactions were obtained from the union of CORUM (Ruepp et al., 2010), IntAct (Orchard et al., 2014) and protein-binding interactions from StringDB (Szklarczyk et al., 2017). For CORUM, we considered pairwise interactions between all protein complex subunits. We discarded any isoform extension from the protein UniProt IDs and intersected the cis-trans protein pairs with the protein-protein interaction reference list.
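The annotation steps above can be sketched as follows. The accessions are made up, and the assumption that an isoform extension is the part of the ID after a hyphen (e.g. `P12345-2`) follows the usual UniProt convention rather than anything stated in the text.

```python
from itertools import combinations

def strip_isoform(uniprot_id):
    """Drop an isoform extension such as '-2' from a UniProt accession."""
    return uniprot_id.split("-")[0]

def complex_pairs(complexes):
    """All unordered subunit pairs within each complex, isoform-free."""
    pairs = set()
    for subunits in complexes:
        ids = sorted({strip_isoform(s) for s in subunits})
        pairs.update(frozenset(pair) for pair in combinations(ids, 2))
    return pairs

# Hypothetical complexes and cis-trans protein pairs.
complexes = [["P12345-2", "Q11111", "O22222"], ["Q11111", "P99999"]]
reference = complex_pairs(complexes)

cis_trans_pairs = [("P12345", "O22222"), ("P12345", "P99999")]
annotated = {
    pair: frozenset(strip_isoform(p) for p in pair) in reference
    for pair in cis_trans_pairs
}
```

Storing pairs as unordered sets means that a cis-trans pair matches a reference interaction regardless of which protein is listed first.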

Overlap with disease variants and GTEx eQTLs

Following the approach in Kilpinen et al., 2017, we defined proxy variants of each cis QTL as variants in high LD (r² >0.8) based on the UK10K European reference panel (Walter et al., 2015) and located in the same cis window. A QTL was defined as GWAS-tagging if at least one proxy variant was annotated in the NHGRI-EBI GWAS Catalog (downloaded 2019-03) (Buniello, 2019). A pQTL was defined as replicating in GTEx eQTLs if it was mapped at nominal significance (PV <0.01/48) in any of the 48 tissues (Battle et al., 2017).

As a complementary analysis, we considered a more stringent criterion based on statistical co-localisation of GWAS signals with QTLs. Specifically, we used summary statistics from phenotypic traits mapped in 51 studies (Staley et al., 2016), using eCAVIAR (Hormozdiari et al., 2016) to test for co-localisation. For each pQTL cis region and each GWAS trait and study, we first intersected the variants assessed both in the GWAS and in our study, yielding between 10⁴ and 2 × 10⁵ variants. We then selected the traits with at least one variant with PV_GWAS <10⁻⁵ that was also genome-wide significant in our study. For each trait-QTL pair, we performed the co-localisation on z-scores (computed as the ratio between the effect size and the effect size standard deviation).

Annotation of pQTLs with variant effects

For each pQTL lead variant, a set of proxy variants was identified based on LD (r² >0.8). Each proxy variant located in the gene body of the corresponding gene was annotated based on its position and predicted effect, using the Variant Effect Predictor (McLaren et al., 2016). The variants were grouped in parent categories using the Sequence Ontology (2016 release; Eilbeck et al., 2005) as follows: 'inframe_variant' includes variants annotated as 'inframe_deletion', 'inframe_insertion', 'incomplete_terminal_codon_variant', 'stop_lost', 'stop_gained' and 'missense_variant'; 'splicing_variant' includes variants annotated as 'splice_acceptor_variant', 'splice_donor_variant' and 'splice_region_variant'; 'frameshift_variant' includes variants annotated as 'feature_elongation', 'feature_truncation' and 'frameshift_variant'. The set of inframe variants was further classified as deleterious or not according to their SIFT scores (Ng and Henikoff, 2003), as provided by MutFunc (Wagih et al., 2018) (a SIFT score <0.05 corresponds to a deleterious mutation). Each pQTL SNP with at least one proxy variant was annotated with the predicted effects of all its proxy variants. For each class of variants (inframe, splicing, etc.), enrichment in the set of pQTLs without any evidence of an RNA mechanism (i.e. no eQTL or transcript isoform QTL) compared to all pQTLs was evaluated by Fisher’s exact test, restricted to pQTLs with at least one proxy variant in the gene body (Figure 3d).
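The grouping into parent categories and the SIFT-based classification can be sketched as a lookup table. The mapping reproduces the categories listed above; the function name, the input layout (a consequence term paired with an optional SIFT score) and the example variants are assumptions for illustration.

```python
PARENT_CATEGORY = {
    "inframe_deletion": "inframe_variant",
    "inframe_insertion": "inframe_variant",
    "incomplete_terminal_codon_variant": "inframe_variant",
    "stop_lost": "inframe_variant",
    "stop_gained": "inframe_variant",
    "missense_variant": "inframe_variant",
    "splice_acceptor_variant": "splicing_variant",
    "splice_donor_variant": "splicing_variant",
    "splice_region_variant": "splicing_variant",
    "feature_elongation": "frameshift_variant",
    "feature_truncation": "frameshift_variant",
    "frameshift_variant": "frameshift_variant",
}

def annotate_pqtl(proxy_variants, sift_threshold=0.05):
    """Summarise a pQTL by the parent categories of all its proxy variants.

    proxy_variants: iterable of (consequence, sift_score or None) tuples.
    Returns (set of parent categories, True if any inframe proxy is deleterious).
    """
    categories = set()
    deleterious = False
    for consequence, sift_score in proxy_variants:
        parent = PARENT_CATEGORY.get(consequence)
        if parent is None:
            continue  # consequence outside the listed categories
        categories.add(parent)
        if parent == "inframe_variant" and sift_score is not None:
            deleterious = deleterious or sift_score < sift_threshold
    return categories, deleterious

# Hypothetical proxy variants of one pQTL lead variant.
proxies = [("missense_variant", 0.01), ("splice_region_variant", None), ("intron_variant", None)]
categories, deleterious = annotate_pqtl(proxies)
```

The per-pQTL category sets produced this way are the kind of input from which the contingency tables for the enrichment tests would be counted.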

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Oliver Stegle, Email: oliver.stegle@embl.de.

Angus I Lamond, Email: a.i.lamond@dundee.ac.uk.

Stephen CJ Parker, University of Michigan, United States.

Patricia J Wittkopp, University of Michigan, United States.

HipSci Consortium:

Chukwuma A Agu, Alex Alderton, Petr Danecek, Rachel Denton, Richard Durbin, Daniel J Gaffney, Angela Goncalves, Reena Halai, Sarah Harper, Christopher M Kirton, Anja Kolb-Kokocinski, Andreas Leha, Shane A McCarthy, Yasin Memari, Minal Patel, Ewan Birney, Francesco Paolo Casale, Laura Clarke, Peter W Harrison, Helena Kilpinen, Ian Streeter, Davide Denovi, Oliver Stegle, Angus I Lamond, Ruta Meleckyte, Natalie Moens, Fiona M Watt, Willem H Ouwehand, and Philip Beales

Funding Information

This paper was supported by the following grants:

  • Wellcome Trust Strategic Award and UK Medical Research Council WT098503 to Bogdan Andrei Mirauta, Dalila Bensaddek, Helena Kilpinen.

  • Wellcome Trust Strategic Award 105024/Z/14/Z to Bogdan Andrei Mirauta.

  • EMBL Interdisciplinary Postdoctoral (EIPOD) programme under Marie Sklodowska-Curie Actions COFUND grant number 291772 to Daniel D Seaton, Marc Jan Bonder.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Resources, Data curation, Investigation, Writing - original draft.

Resources, Data curation, Investigation, Writing - original draft, Writing - review and editing.

Data curation, Software, Investigation, Writing - original draft, Writing - review and editing.

Investigation, Writing - review and editing.

Resources, Funding acquisition, Conceptualization.

Conceptualization, Formal analysis, Supervision, Funding acquisition, Writing - original draft, Project administration, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Writing - original draft, Project administration, Writing - review and editing.

Additional files

Supplementary file 1. Comparison of proteome coverage across human proteomics datasets.

To facilitate comparison with other datasets we report here the number of proteins and peptides at FDR 1%.

Supplementary file 2. Disease status.

Shown are the number of lines and donors for which matched mRNA and protein data are available.

Supplementary file 3. Peptides overlapping protein altering variants detected in this study.

File containing the list of peptides overlapping protein altering variants or unmapped to the reference genome.

Transparent reporting form

Data availability

All data can be accessed via the HipSci data portal, which references EMBL-EBI archives that are used to store the HipSci data. This study includes both cell lines that are consented to be openly accessible and cell lines that are subject to managed access, which means a data access application needs to be filed prior to accessing the data. Managed access data from all assays are accessible via EGA under the study EGAS00001001465. Open access genotyping array data and RNA-Seq data are available from ENA under the studies PRJEB11752 and PRJEB7388. Proteomics quantifications (protein group and peptide resolution; MaxQuant output) and run parameters are available from the PRIDE Archive (PXD010557). Intermediate result files for this study, such as processed gene expression levels, can be found in Figure 1—source data 2 and 3. Complete summary statistics for the protein and RNA QTL analyses are available at: https://figshare.com/projects/QTL_datasets_for_Population-scale_proteome_variation_in_human_induced_pluripotent_stem_cells_/84233. Analysed data are included in the supplementary files. Scripts used to perform the statistical analyses presented are available at: https://github.com/hipsci/Elife2020 (copy archived at https://github.com/elifesciences-publications/Elife2020).

The following datasets were generated:

Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium. Stegle O, Lamond AI. 2020. HipSci: the iPSC proteomic compendium. PRIDE. PXD010557

Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium. Stegle O, Lamond AI. 2020. hipsci_peptide_qtls. figshare.

Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium. Stegle O, Lamond AI. 2020. hipsci_transcript_qtls_202_lines. figshare.

Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium. Stegle O, Lamond AI. 2020. hipsci_eqtls_202_lines. figshare.

Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium. Stegle O, Lamond AI. 2020. hipsci_pqtls. figshare.

References

  1. Azimi T, Ghafouri-Fard S, Davood Omrani M, Mazdeh M, Arsang-Jang S, Sayad A, Taheri M. Vaccinia related kinase 2 (VRK2) expression in neurological disorders: schizophrenia, epilepsy and multiple sclerosis. Multiple Sclerosis and Related Disorders. 2018;19:15–19. doi: 10.1016/j.msard.2017.10.017.
  2. Battle A, Khan Z, Wang SH, Mitrano A, Ford MJ, Pritchard JK, Gilad Y. Genomic variation. Impact of regulatory variation from RNA to protein. Science. 2015;347:664–667. doi: 10.1126/science.1260793.
  3. Battle A, Brown CD, Engelhardt BE, Montgomery SB, GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277.
  4. Berger J, Dorninger F, Forss-Petter S, Kunze M. Peroxisomes in brain development and function. Biochimica Et Biophysica Acta (BBA) - Molecular Cell Research. 2016;1863:934–955. doi: 10.1016/j.bbamcr.2015.12.005.
  5. Brenes AB, Hukelmann J, Afzal V, Lamond A. The iPSC proteomic compendium. bioRxiv. 2018. doi: 10.1101/469916.
  6. Buniello A. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research. 2019;47:D1005–D1012. doi: 10.1093/nar/gky1120.
  7. Carcamo-Orive I, Hoffman GE, Cundiff P, Beckmann ND, D'Souza SL, Knowles JW, Patel A, Papatsenko D, Abbasi F, Reaven GM, Whalen S, Lee P, Shahbazi M, Henrion MYR, Zhu K, Wang S, Roussos P, Schadt EE, Pandey G, Chang R, Quertermous T, Lemischka I. Analysis of transcriptional variability in a large human iPSC library reveals genetic and non-genetic determinants of heterogeneity. Cell Stem Cell. 2017;20:518–532. doi: 10.1016/j.stem.2016.11.005.
  8. Cayo MA, Mallanna SK, Di Furio F, Jing R, Tolliver LB, Bures M, Urick A, Noto FK, Pashos EE, Greseth MD, Czarnecki M, Traktman P, Yang W, Morrisey EE, Grompe M, Rader DJ, Duncan SA. A drug screen using human iPSC-derived hepatocyte-like cells reveals cardiac glycosides as a potential treatment for hypercholesterolemia. Cell Stem Cell. 2017;20:478–489. doi: 10.1016/j.stem.2017.01.011.
  9. Chick JM, Munger SC, Simecek P, Huttlin EL, Choi K, Gatti DM, Raghupathy N, Svenson KL, Churchill GA, Gygi SP. Defining the consequences of genetic variation on a proteome-wide scale. Nature. 2016;534:500–505. doi: 10.1038/nature18270.
  10. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M. Andromeda: a peptide search engine integrated into the MaxQuant environment. Journal of Proteome Research. 2011;10:1794–1805. doi: 10.1021/pr101065j.
  11. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008;26:1367–1372. doi: 10.1038/nbt.1511.
  12. D'Aiuto L, Zhi Y, Kumar Das D, Wilcox MR, Johnson JW, McClain L, MacDonald ML, Di Maio R, Schurdak ME, Piazza P, Viggiano L, Sweet R, Kinchington PR, Bhattacharjee AG, Yolken R, Nimgaonkar VL. Large-scale generation of human iPSC-derived neural stem cells/early neural progenitor cells and their neuronal differentiation. Organogenesis. 2014;10:365–377. doi: 10.1080/15476278.2015.1011921.
  13. DeBoever C, Li H, Jakubosky D, Benaglio P, Reyna J, Olson KM, Huang H, Biggs W, Sandoval E, D’Antonio M, Jepsen K, Matsui H, Arias A, Ren B, Nariai N, Smith EN, D’Antonio-Chronowska A, Farley EK, Frazer KA. Large-scale profiling reveals the influence of genetic variation on gene expression in human induced pluripotent stem cells. Cell Stem Cell. 2017;20:533–546. doi: 10.1016/j.stem.2017.03.009.
  14. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
  15. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The sequence ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6:R44. doi: 10.1186/gb-2005-6-5-r44.
  16. Gonçalves E, Fragoulis A, Garcia-Alonso L, Cramer T, Saez-Rodriguez J, Beltrao P. Widespread post-transcriptional attenuation of genomic copy-number variation in cancer. Cell Systems. 2017;5:386–398. doi: 10.1016/j.cels.2017.08.013.
  17. Hormozdiari F, van de Bunt M, Segrè AV, Li X, Joo JWJ, Bilow M, Sul JH, Sankararaman S, Pasaniuc B, Eskin E. Colocalization of GWAS and eQTL signals detects target genes. The American Journal of Human Genetics. 2016;99:1245–1260. doi: 10.1016/j.ajhg.2016.10.003.
  18. Johansson Å, Enroth S, Palmblad M, Deelder AM, Bergquist J, Gyllensten U. Identification of genetic variants influencing the human plasma proteome. PNAS. 2013;110:4673–4678. doi: 10.1073/pnas.1217238110.
  19. Jun G, Ibrahim-Verbaas CA, Vronskaya M, Lambert JC, Chung J, Naj AC, Kunkle BW, Wang LS, Bis JC, Bellenguez C, Harold D, Lunetta KL, Destefano AL, Grenier-Boley B, Sims R, Beecham GW, Smith AV, Chouraki V, Hamilton-Nelson KL, Ikram MA, Fievet N, Denning N, Martin ER, Schmidt H, Kamatani Y, Dunstan ML, Valladares O, Laza AR, Zelenika D, Ramirez A, Foroud TM, Choi SH, Boland A, Becker T, Kukull WA, van der Lee SJ, Pasquier F, Cruchaga C, Beekly D, Fitzpatrick AL, Hanon O, Gill M, Barber R, Gudnason V, Campion D, Love S, Bennett DA, Amin N, Berr C, Tsolaki M, Buxbaum JD, Lopez OL, Deramecourt V, Fox NC, Cantwell LB, Tárraga L, Dufouil C, Hardy J, Crane PK, Eiriksdottir G, Hannequin D, Clarke R, Evans D, Mosley TH, Letenneur L, Brayne C, Maier W, De Jager P, Emilsson V, Dartigues JF, Hampel H, Kamboh MI, de Bruijn RF, Tzourio C, Pastor P, Larson EB, Rotter JI, O'Donovan MC, Montine TJ, Nalls MA, Mead S, Reiman EM, Jonsson PV, Holmes C, St George-Hyslop PH, Boada M, Passmore P, Wendland JR, Schmidt R, Morgan K, Winslow AR, Powell JF, Carasquillo M, Younkin SG, Jakobsdóttir J, Kauwe JS, Wilhelmsen KC, Rujescu D, Nöthen MM, Hofman A, Jones L, Haines JL, Psaty BM, Van Broeckhoven C, Holmans P, Launer LJ, Mayeux R, Lathrop M, Goate AM, Escott-Price V, Seshadri S, Pericak-Vance MA, Amouyel P, Williams J, van Duijn CM, Schellenberg GD, Farrer LA, IGAP Consortium. A novel Alzheimer disease locus located near the gene encoding tau protein. Molecular Psychiatry. 2016;21:108–117. doi: 10.1038/mp.2015.23.
  20. Kerr CL, Hill CM, Blumenthal PD, Gearhart JD. Expression of pluripotent stem cell markers in the human fetal testis. Stem Cells. 2008a;26:412–421. doi: 10.1634/stemcells.2007-0605.
  21. Kerr CL, Hill CM, Blumenthal PD, Gearhart JD. Expression of pluripotent stem cell markers in the human fetal ovary. Human Reproduction. 2008b;23:589–599. doi: 10.1093/humrep/dem411.
  22. Kilpinen H, Goncalves A, Leha A, Afzal V, Alasoo K, Ashford S, Bala S, Bensaddek D, Casale FP, Culley OJ, Danecek P, Faulconbridge A, Harrison PW, Kathuria A, McCarthy D, McCarthy SA, Meleckyte R, Memari Y, Moens N, Soares F, Mann A, Streeter I, Agu CA, Alderton A, Nelson R, Harper S, Patel M, White A, Patel SR, Clarke L, Halai R, Kirton CM, Kolb-Kokocinski A, Beales P, Birney E, Danovi D, Lamond AI, Ouwehand WH, Vallier L, Watt FM, Durbin R, Stegle O, Gaffney DJ. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature. 2017;546:370–375. doi: 10.1038/nature22403.
  23. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, Thomas JK, Muthusamy B, Leal-Rojas P, Kumar P, Sahasrabuddhe NA, Balakrishnan L, Advani J, George B, Renuse S, Selvan LD, Patil AH, Nanjappa V, Radhakrishnan A, Prasad S, Subbannayya T, Raju R, Kumar M, Sreenivasamurthy SK, Marimuthu A, Sathe GJ, Chavan S, Datta KK, Subbannayya Y, Sahu A, Yelamanchi SD, Jayaram S, Rajagopalan P, Sharma J, Murthy KR, Syed N, Goel R, Khan AA, Ahmad S, Dey G, Mudgal K, Chatterjee A, Huang TC, Zhong J, Wu X, Shaw PG, Freed D, Zahari MS, Mukherjee KK, Shankar S, Mahadevan A, Lam H, Mitchell CJ, Shankar SK, Satishchandra P, Schroeder JT, Sirdeshmukh R, Maitra A, Leach SD, Drake CG, Halushka MK, Prasad TS, Hruban RH, Kerr CL, Bader GD, Iacobuzio-Donahue CA, Gowda H, Pandey A. A draft map of the human proteome. Nature. 2014;509:575–581. doi: 10.1038/nature13302.
  24. Li Y, Hermanson DL, Moriarity BS, Kaufman DS. Human iPSC-derived natural killer cells engineered with chimeric antigen receptors enhance anti-tumor activity. Cell Stem Cell. 2018;23:181–192. doi: 10.1016/j.stem.2018.06.002.
  25. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656.
  26. Lippert C, Casale FP, Rakitsch B, Stegle O. LIMIX: genetic analysis of multiple traits. bioRxiv. 2015. doi: 10.1101/003905.
  27. Liu Y, Buil A, Collins BC, Gillet LC, Blum LC, Cheng LY, Vitek O, Mouritsen J, Lachance G, Spector TD, Dermitzakis ET, Aebersold R. Quantitative variability of 342 plasma proteins in a human twin population. Molecular Systems Biology. 2015;11:786. doi: 10.15252/msb.20145728.
  28. Lizard G, Rouaud O, Demarquoy J, Cherkaoui-Malki M, Iuliano L. Potential roles of peroxisomes in Alzheimer's disease and in dementia of the Alzheimer's type. Journal of Alzheimer's Disease. 2012;29:241–254. doi: 10.3233/JAD-2011-111163.
  29. Lourdusamy A, Newhouse S, Lunnon K, Proitsi P, Powell J, Hodges A, Nelson SK, Stewart A, Williams S, Kloszewska I, Mecocci P, Soininen H, Tsolaki M, Vellas B, Lovestone S, Dobson R, AddNeuroMed Consortium, Alzheimer's Disease Neuroimaging Initiative. Identification of cis-regulatory variation influencing protein abundance levels in human plasma. Human Molecular Genetics. 2012;21:3719–3726. doi: 10.1093/hmg/dds186.
  30. MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, Junkins H, McMahon A, Milano A, Morales J, Pendlington ZM, Welter D, Burdett T, Hindorff L, Flicek P, Cunningham F, Parkinson H. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS catalog). Nucleic Acids Research. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133.
  31. McAlister GC, Nusinow DP, Jedrychowski MP, Wühr M, Huttlin EL, Erickson BK, Rad R, Haas W, Gygi SP. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical Chemistry. 2014;86:7150–7158. doi: 10.1021/ac502040v.
  32. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17:122. doi: 10.1186/s13059-016-0974-4.
  33. Mekhoubad S, Bock C, de Boer AS, Kiskinis E, Meissner A, Eggan K. Erosion of dosage compensation impacts human iPSC disease modeling. Cell Stem Cell. 2012;10:595–609. doi: 10.1016/j.stem.2012.02.014.
  34. Mertins P, Mani DR, Ruggles KV, Gillette MA, Clauser KR, Wang P, Wang X, Qiao JW, Cao S, Petralia F, Kawaler E, Mundt F, Krug K, Tu Z, Lei JT, Gatza ML, Wilkerson M, Perou CM, Yellapantula V, Huang KL, Lin C, McLellan MD, Yan P, Davies SR, Townsend RR, Skates SJ, Wang J, Zhang B, Kinsinger CR, Mesri M, Rodriguez H, Ding L, Paulovich AG, Fenyö D, Ellis MJ, Carr SA, NCI CPTAC. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature. 2016;534:55–62. doi: 10.1038/nature18003.
  35. Munoz J, Low TY, Kok YJ, Chin A, Frese CK, Ding V, Choo A, Heck AJ. The quantitative proteomes of human-induced pluripotent stem cells and embryonic stem cells. Molecular Systems Biology. 2011;7:550. doi: 10.1038/msb.2011.84.
  36. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
  37. Ongen H, Buil A, Brown AA, Dermitzakis ET, Delaneau O. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics. 2016;32:1479–1485. doi: 10.1093/bioinformatics/btv722.
  38. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research. 2014;42:D358–D363. doi: 10.1093/nar/gkt1115.
  39. Panopoulos AD, D'Antonio M, Benaglio P, Williams R, Hashem SI, Schuldt BM, DeBoever C, Arias AD, Garcia M, Nelson BC, Harismendy O, Jakubosky DA, Donovan MKR, Greenwald WW, Farnam K, Cook M, Borja V, Miller CA, Grinstein JD, Drees F, Okubo J, Diffenderfer KE, Hishida Y, Modesto V, Dargitz CT, Feiring R, Zhao C, Aguirre A, McGarry TJ, Matsui H, Li H, Reyna J, Rao F, O'Connor DT, Yeo GW, Evans SM, Chi NC, Jepsen K, Nariai N, Müller FJ, Goldstein LSB, Izpisua Belmonte JC, Adler E, Loring JF, Berggren WT, D'Antonio-Chronowska A, Smith EN, Frazer KA. iPSCORE: a resource of 222 iPSC lines enabling functional characterization of genetic variation across a variety of cell types. Stem Cell Reports. 2017;8:1086–1100. doi: 10.1016/j.stemcr.2017.03.012.
  40. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197.
  41. Phanstiel DH, Brumbaugh J, Wenger CD, Tian S, Probasco MD, Bailey DJ, Swaney DL, Tervo MA, Bolin JM, Ruotti V, Stewart R, Thomson JA, Coon JJ. Proteomic and phosphoproteomic comparison of human ES and iPS cells. Nature Methods. 2011;8:821–827. doi: 10.1038/nmeth.1699.
  42. Phelan CM, Kuchenbaecker KB, Tyrer JP, Kar SP, Lawrenson K, Winham SJ, Dennis J, Pirie A, Riggan MJ, Chornokur G, Earp MA, Lyra PC, Lee JM, Coetzee S, Beesley J, McGuffog L, Soucy P, Dicks E, Lee A, Barrowdale D, Lecarpentier J, Leslie G, Aalfs CM, Aben KKH, Adams M, Adlard J, Andrulis IL, Anton-Culver H, Antonenkova N, Aravantinos G, Arnold N, Arun BK, Arver B, Azzollini J, Balmaña J, Banerjee SN, Barjhoux L, Barkardottir RB, Bean Y, Beckmann MW, Beeghly-Fadiel A, Benitez J, Bermisheva M, Bernardini MQ, Birrer MJ, Bjorge L, Black A, Blankstein K, Blok MJ, Bodelon C, Bogdanova N, Bojesen A, Bonanni B, Borg Å, Bradbury AR, Brenton JD, Brewer C, Brinton L, Broberg P, Brooks-Wilson A, Bruinsma F, Brunet J, Buecher B, Butzow R, Buys SS, Caldes T, Caligo MA, Campbell I, Cannioto R, Carney ME, Cescon T, Chan SB, Chang-Claude J, Chanock S, Chen XQ, Chiew YE, Chiquette J, Chung WK, Claes KBM, Conner T, Cook LS, Cook J, Cramer DW, Cunningham JM, D'Aloisio AA, Daly MB, Damiola F, Damirovna SD, Dansonka-Mieszkowska A, Dao F, Davidson R, DeFazio A, Delnatte C, Doheny KF, Diez O, Ding YC, Doherty JA, Domchek SM, Dorfling CM, Dörk T, Dossus L, Duran M, Dürst M, Dworniczak B, Eccles D, Edwards T, Eeles R, Eilber U, Ejlertsen B, Ekici AB, Ellis S, Elvira M, Eng KH, Engel C, Evans DG, Fasching PA, Ferguson S, Ferrer SF, Flanagan JM, Fogarty ZC, Fortner RT, Fostira F, Foulkes WD, Fountzilas G, Fridley BL, Friebel TM, Friedman E, Frost D, Ganz PA, Garber J, García MJ, Garcia-Barberan V, Gehrig A, Gentry-Maharaj A, Gerdes AM, Giles GG, Glasspool R, Glendon G, Godwin AK, Goldgar DE, Goranova T, Gore M, Greene MH, Gronwald J, Gruber S, Hahnen E, Haiman CA, Håkansson N, Hamann U, Hansen TVO, Harrington PA, Harris HR, Hauke J, Hein A, Henderson A, Hildebrandt MAT, Hillemanns P, Hodgson S, Høgdall CK, Høgdall E, Hogervorst FBL, Holland H, Hooning MJ, Hosking K, Huang RY, Hulick PJ, Hung J, Hunter DJ, Huntsman DG, Huzarski T, Imyanitov EN, Isaacs C, Iversen ES, Izatt L, Izquierdo A, 
Jakubowska A, James P, Janavicius R, Jernetz M, Jensen A, Jensen UB, John EM, Johnatty S, Jones ME, Kannisto P, Karlan BY, Karnezis A, Kast K, Kennedy CJ, Khusnutdinova E, Kiemeney LA, Kiiski JI, Kim SW, Kjaer SK, Köbel M, Kopperud RK, Kruse TA, Kupryjanczyk J, Kwong A, Laitman Y, Lambrechts D, Larrañaga N, Larson MC, Lazaro C, Le ND, Le Marchand L, Lee JW, Lele SB, Leminen A, Leroux D, Lester J, Lesueur F, Levine DA, Liang D, Liebrich C, Lilyquist J, Lipworth L, Lissowska J, Lu KH, Lubinński J, Luccarini C, Lundvall L, Mai PL, Mendoza-Fandiño G, Manoukian S, Massuger L, May T, Mazoyer S, McAlpine JN, McGuire V, McLaughlin JR, McNeish I, Meijers-Heijboer H, Meindl A, Menon U, Mensenkamp AR, Merritt MA, Milne RL, Mitchell G, Modugno F, Moes-Sosnowska J, Moffitt M, Montagna M, Moysich KB, Mulligan AM, Musinsky J, Nathanson KL, Nedergaard L, Ness RB, Neuhausen SL, Nevanlinna H, Niederacher D, Nussbaum RL, Odunsi K, Olah E, Olopade OI, Olsson H, Olswold C, O'Malley DM, Ong KR, Onland-Moret NC, Orr N, Orsulic S, Osorio A, Palli D, Papi L, Park-Simon TW, Paul J, Pearce CL, Pedersen IS, Peeters PHM, Peissel B, Peixoto A, Pejovic T, Pelttari LM, Permuth JB, Peterlongo P, Pezzani L, Pfeiler G, Phillips KA, Piedmonte M, Pike MC, Piskorz AM, Poblete SR, Pocza T, Poole EM, Poppe B, Porteous ME, Prieur F, Prokofyeva D, Pugh E, Pujana MA, Pujol P, Radice P, Rantala J, Rappaport-Fuerhauser C, Rennert G, Rhiem K, Rice P, Richardson A, Robson M, Rodriguez GC, Rodríguez-Antona C, Romm J, Rookus MA, Rossing MA, Rothstein JH, Rudolph A, Runnebaum IB, Salvesen HB, Sandler DP, Schoemaker MJ, Senter L, Setiawan VW, Severi G, Sharma P, Shelford T, Siddiqui N, Side LE, Sieh W, Singer CF, Sobol H, Song H, Southey MC, Spurdle AB, Stadler Z, Steinemann D, Stoppa-Lyonnet D, Sucheston-Campbell LE, Sukiennicki G, Sutphen R, Sutter C, Swerdlow AJ, Szabo CI, Szafron L, Tan YY, Taylor JA, Tea MK, Teixeira MR, Teo SH, Terry KL, Thompson PJ, Thomsen LCV, Thull DL, Tihomirova L, Tinker AV, Tischkowitz 
M, Tognazzo S, Toland AE, Tone A, Trabert B, Travis RC, Trichopoulou A, Tung N, Tworoger SS, van Altena AM, Van Den Berg D, van der Hout AH, van der Luijt RB, Van Heetvelde M, Van Nieuwenhuysen E, van Rensburg EJ, Vanderstichele A, Varon-Mateeva R, Vega A, Edwards DV, Vergote I, Vierkant RA, Vijai J, Vratimos A, Walker L, Walsh C, Wand D, Wang-Gohrke S, Wappenschmidt B, Webb PM, Weinberg CR, Weitzel JN, Wentzensen N, Whittemore AS, Wijnen JT, Wilkens LR, Wolk A, Woo M, Wu X, Wu AH, Yang H, Yannoukakos D, Ziogas A, Zorn KK, Narod SA, Easton DF, Amos CI, Schildkraut JM, Ramus SJ, Ottini L, Goodman MT, Park SK, Kelemen LE, Risch HA, Thomassen M, Offit K, Simard J, Schmutzler RK, Hazelett D, Monteiro AN, Couch FJ, Berchuck A, Chenevix-Trench G, Goode EL, Sellers TA, Gayther SA, Antoniou AC, Pharoah PDP, AOCS study group.  EMBRACE Study. GEMO Study Collaborators. HEBON Study. KConFab Investigators. OPAL study group Identification of 12 new susceptibility loci for different histotypes of epithelial ovarian Cancer. Nature Genetics. 2017;49:680–691. doi: 10.1038/ng.3826. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ramasamy A, Trabzuni D, Gibbs JR, Dillman A, Hernandez DG, Arepalli S, Walker R, Smith C, Ilori GP, Shabalin AA, Li Y, Singleton AB, Cookson MR, Hardy J, Ryten M, Weale ME, NABEC, UKBEC. Resolving the polymorphism-in-probe problem is critical for correct interpretation of expression QTL studies. Nucleic Acids Research. 2013;41:e88. doi: 10.1093/nar/gkt069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Reimand J. G:profiler - a web server for functional interpretation of gene lists. Nucleic Acids Research. 2016;47:W191–W198. doi: 10.1093/nar/gkz369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Rouhani F, Kumasaka N, de Brito MC, Bradley A, Vallier L, Gaffney D. Genetic background drives transcriptional variation in human induced pluripotent stem cells. PLOS Genetics. 2014;10:e1004432. doi: 10.1371/journal.pgen.1004432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Roumeliotis TI, Williams SP, Gonçalves E, Alsinet C, Del Castillo Velasco-Herrera M, Aben N, Ghavidel FZ, Michaut M, Schubert M, Price S, Wright JC, Yu L, Yang M, Dienstmann R, Guinney J, Beltrao P, Brazma A, Pardo M, Stegle O, Adams DJ, Wessels L, Saez-Rodriguez J, McDermott U, Choudhary JS. Genomic determinants of protein abundance variation in colorectal cancer cells. Cell Reports. 2017;20:2201–2214. doi: 10.1016/j.celrep.2017.08.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Research. 2010;38:D497–D501. doi: 10.1093/nar/gkp914. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Salomonis N, Dexheimer PJ, Omberg L, Schroll R, Bush S, Huo J, Schriml L, Ho Sui S, Keddache M, Mayhew C, Shanmukhappa SK, Wells J, Daily K, Hubler S, Wang Y, Zambidis E, Margolin A, Hide W, Hatzopoulos AK, Malik P, Cancelas JA, Aronow BJ, Lutzko C. Integrated genomic analysis of diverse induced pluripotent stem cells from the progenitor cell biology consortium. Stem Cell Reports. 2016;7:110–125. doi: 10.1016/j.stemcr.2016.05.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Schlaffner CN, Pirklbauer GJ, Bender A, Choudhary JS. Fast, quantitative and variant enabled mapping of peptides to genomes. Cell Systems. 2017;5:152–156. doi: 10.1016/j.cels.2017.07.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M. Global quantification of mammalian gene expression control. Nature. 2011;473:337–342. doi: 10.1038/nature10098. [DOI] [PubMed] [Google Scholar]
  52. Schwartzentruber J, Foskolou S, Kilpinen H, Rodrigues J, Alasoo K, Knights AJ, Patel M, Goncalves A, Ferreira R, Benn CL, Wilbrey A, Bictash M, Impey E, Cao L, Lainez S, Loucif AJ, Whiting PJ, Gutteridge A, Gaffney DJ, HIPSCI Consortium. Molecular and functional variation in iPSC-derived sensory neurons. Nature Genetics. 2018;50:54–61. doi: 10.1038/s41588-017-0005-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Staley JR, Blackshaw J, Kamat MA, Ellis S, Surendran P, Sun BB, Paul DS, Freitag D, Burgess S, Danesh J, Young R, Butterworth AS. PhenoScanner: a database of human genotype-phenotype associations. Bioinformatics. 2016;32:3207–3209. doi: 10.1093/bioinformatics/btw373. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Stark AL, Hause RJ, Gorsic LK, Antao NN, Wong SS, Chung SH, Gill DF, Im HK, Myers JL, White KP, Jones RB, Dolan ME. Protein quantitative trait loci identify novel candidates modulating cellular response to chemotherapy. PLOS Genetics. 2014;10:e1004192. doi: 10.1371/journal.pgen.1004192. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Stegle O, Parts L, Piipari M, Winn J, Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protocols. 2012;7:500–507. doi: 10.1038/nprot.2011.457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Sun BB, Maranville JC, Peters JE, Stacey D, Staley JR, Blackshaw J, Burgess S, Jiang T, Paige E, Surendran P, Oliver-Williams C, Kamat MA, Prins BP, Wilcox SK, Zimmerman ES, Chi A, Bansal N, Spain SL, Wood AM, Morrell NW, Bradley JR, Janjic N, Roberts DJ, Ouwehand WH, Todd JA, Soranzo N, Suhre K, Paul DS, Fox CS, Plenge RM, Danesh J, Runz H, Butterworth AS. Genomic atlas of the human plasma proteome. Nature. 2018;558:73–79. doi: 10.1038/s41586-018-0175-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, Jensen LJ, von Mering C. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Research. 2017;45:D362–D368. doi: 10.1093/nar/gkw937. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Tesli M, Wirgenes KV, Hughes T, Bettella F, Athanasiu L, Hoseth ES, Nerhus M, Lagerberg TV, Steen NE, Agartz I, Melle I, Dieset I, Djurovic S, Andreassen OA. VRK2 gene expression in schizophrenia, bipolar disorder and healthy controls. British Journal of Psychiatry. 2016;209:114–120. doi: 10.1192/bjp.bp.115.161950. [DOI] [PubMed] [Google Scholar]
  59. Thompson A, Schäfer J, Kuhn K, Kienle S, Schwarz J, Schmidt G, Neumann T, Johnstone R, Mohammed AK, Hamon C. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical Chemistry. 2003;75:1895–1904. doi: 10.1021/ac0262560. [DOI] [PubMed] [Google Scholar]
  60. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols. 2016;11:2301–2319. doi: 10.1038/nprot.2016.136. [DOI] [PubMed] [Google Scholar]
  61. van der Harst P, Verweij N. Identification of 64 novel genetic loci provides an expanded view on the genetic architecture of coronary artery disease. Circulation Research. 2018;122:433–443. doi: 10.1161/CIRCRESAHA.117.312086. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Wagih O, Galardini M, Busby BP, Memon D, Typas A, Beltrao P. A resource of variant effect predictions of single nucleotide variants in model organisms. Molecular Systems Biology. 2018;14:e8430. doi: 10.15252/msb.20188430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Walter K, Min JL, Huang J, Crooks L, Memari Y, McCarthy S, Perry JR, Xu C, Futema M, Lawson D, Iotchkova V, Schiffels S, Hendricks AE, Danecek P, Li R, Floyd J, Wain LV, Barroso I, Humphries SE, Hurles ME, Zeggini E, Barrett JC, Plagnol V, Richards JB, Greenwood CM, Timpson NJ, Durbin R, Soranzo N, UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature. 2015;526:82–90. doi: 10.1038/nature14962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Wu L, Candille SI, Choi Y, Xie D, Jiang L, Li-Pook-Than J, Tang H, Snyder M. Variation and genetic control of protein abundance in humans. Nature. 2013;499:79–82. doi: 10.1038/nature12223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Yao C, Chen G, Song C, Keefe J, Mendelson M, Huan T, Sun BB, Laser A, Maranville JC, Wu H, Ho JE, Courchesne P, Lyass A, Larson MG, Gieger C, Graumann J, Johnson AD, Danesh J, Runz H, Hwang SJ, Liu C, Butterworth AS, Suhre K, Levy D. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nature Communications. 2018;9:3268. doi: 10.1038/s41467-018-05512-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Yu H, Yan H, Li J, Li Z, Zhang X, Ma Y, Mei L, Liu C, Cai L, Wang Q, Zhang F, Iwata N, Ikeda M, Wang L, Lu T, Li M, Xu H, Wu X, Liu B, Yang J, Li K, Lv L, Ma X, Wang C, Li L, Yang F, Jiang T, Shi Y, Li T, Zhang D, Yue W, Chinese Schizophrenia Collaboration Group. Common variants on 2p16.1, 6p22.1 and 10q24.32 are associated with schizophrenia in Han Chinese population. Molecular Psychiatry. 2017;22:954–960. doi: 10.1038/mp.2016.212. [DOI] [PubMed] [Google Scholar]
  67. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L, Gordon L, Haggerty L, Haskell E, Hourlier T, Izuogu OG, Janacek SH, Juettemann T, To JK, Laird MR, Lavidas I, Liu Z, Loveland JE, Maurel T, McLaren W, Moore B, Mudge J, Murphy DN, Newman V, Nuhn M, Ogeh D, Ong CK, Parker A, Patricio M, Riat HS, Schuilenburg H, Sheppard D, Sparrow H, Taylor K, Thormann A, Vullo A, Walts B, Zadissa A, Frankish A, Hunt SE, Kostadima M, Langridge N, Martin FJ, Muffato M, Perry E, Ruffier M, Staines DM, Trevanion SJ, Aken BL, Cunningham F, Yates A, Flicek P. Ensembl 2018. Nucleic Acids Research. 2018;46:D754–D761. doi: 10.1093/nar/gkx1098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Zhang B, Wang J, Wang X, Zhu J, Liu Q, Shi Z, Chambers MC, Zimmerman LJ, Shaddox KF, Kim S, Davies SR, Wang S, Wang P, Kinsinger CR, Rivers RC, Rodriguez H, Townsend RR, Ellis MJ, Carr SA, Tabb DL, Coffey RJ, Slebos RJ, Liebler DC, NCI CPTAC. Proteogenomic characterization of human colon and rectal cancer. Nature. 2014;513:382–387. doi: 10.1038/nature13438. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Stephen CJ Parker
Reviewed by: Arushi Varshney, Roderic Guigó

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This exciting study quantifies the genetic architecture of gene expression (eQTL) and protein abundance (pQTL) regulation in iPSCs and will serve as an important benchmark for the community. The rigorous comparison of these e/pQTL maps reveals differences that represent expected biological diversity, with a notable example in Figure 3B. Further, this work will serve as a foundation for many novel future studies, including GWAS colocalization and additional omics layers.

Decision letter after peer review:

Thank you for submitting your article "Population-scale proteome variation in human induced pluripotent stem cells" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Stephen CJ Parker as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Arushi Varshney (Reviewer #3); Roderic Guigó (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

Summary:

Here the authors report pQTL results on 202 iPSC lines from 151 donors which already had RNA-seq data, enabling direct population-scale comparisons of eQTL and pQTL. This manuscript is well-written, clear, and has impactful results. We think it would be a great resource for the field. This is excellent work, and the authors are to be commended on an exciting study. We have a few items of feedback:

Revisions:

1) It is unclear how the acquisition of RNA and proteomics data was related. For example, were the same batches prepped, split, and material frozen for later profiling? Or were the lines grown up at separate times for each of the different profiling modalities? If the latter, how could this potential RNA-protein confounding batch effect influence the results?

2) In subsection “pQTL arising from protein-altering variants”, paragraph three, the authors advocate using protein abundance instead of RNA to interpret pathogenic mechanisms of rare variants. However, QTL studies generally have low power to detect effects for SNPs with low MAF. In fact, in this study, the authors used MAF 5% or higher. So, how does one reconcile this assertion to use protein abundance to interpret rare variant effects with the massive sample size it would take to do so? Addressing this idea in the text would be helpful.

3) The authors should make the full summary scan results available so that the rest of the community can use them as a resource.

4) Discussion paragraph three, the word "significant" is missing a "ly" and we'd advise removing that word altogether. In general, it should be used when making a statistical comparison and the associated test and p-value should be provided. This is not mentioned anywhere in the sentence. So, either those results should be disclosed, or a different word without a statistical connotation should be used. The same issue is present in paragraph three of subsection “pQTL arising from protein-altering variants”.

5) The variance component analysis showed effects of culture medium, sex, age etc. Were these taken as covariates in the QTL analysis? Looking at effects of the culture medium, were the culture passage numbers comparable across lines and does that have an effect?

6) For the X chromosome inactivation (XCI) section, it is unclear how exactly the XCI status was quantified. Subsection “RNA and proteome variability” paragraph two and the legend to Figure 1 reference the Materials and methods section but a description of this analysis is missing there. These results are hard to interpret without methodological clarity.

7) Figure 1D – While the random forest model analysis is interesting, the authors should elaborate on their selection of these specific variables. Some other factors that would be informative to include would be MAF and RNA expression level.

8) Figure 2B and 3C – is the grey bar/shaded region in these plots the genomic location of the respective gene? If so, this should be specified in the legend.

9) Figure 3C and the corresponding text briefly describe a pQTL for the VRK2 gene where the pQTL SNP is also a GWAS SNP for schizophrenia risk. The authors should elaborate on the pQTL direction of effect with respect to GWAS risk. Is the variant the lead SNP for GWAS or what is the LD r2 with the lead SNP and do these signals colocalize in that case? Is there some evidence of this protein being relevant in schizophrenia related cellular mechanisms?

10) Figure 4E – This is an interesting example. What is the eQTL/pQTL and trans pQTL direction of effect with respect to the Alzheimer's GWAS risk allele? In this example, the SNP rs1129187 is associated with PEX6 mRNA expression and protein abundance, and also associated with PEX1 and PEX26 abundance. To directly test if the trans-association of the SNP with PEX1 and PEX26 is through the association with PEX6 (complex stability) and not through other mechanisms, have the authors tried to regress out the PEX6 abundance from the association between the SNP and PEX1 or PEX26 and check if the association disappears?

11) Wrong figures are referenced in some places in the manuscript. Figure 3D is referenced before 3B etc.

12) Were there genes for which significant eQTL and also pQTL associations were identified but the variants were independent (low LD r2)?

13) It was confusing that the authors do not clearly distinguish between the variant affecting the phenotype of the gene (transcript or protein expression) and the affected gene. They write "we report 654 genes with a cis pQTL and 3487 genes with a cis eQTL". I assume that one will in general find multiple p/eQTLs for a given gene (although these numbers do not appear to be reported). These are thus the numbers of p/eGenes, but not of p/eQTLs. However, when they set out to investigate replication of p/eQTLs, the numbers correspond to p/eGenes. The authors equate the number of QTLs with the number of affected genes. This part could be clearer.

14) Figure 1B. It looks to me that the fraction of the total variance explained by the factors the authors use in their model is much larger for transcriptomics than for proteomics data. I suggest the authors report this number. If I am correct, this would mean that proteomics data behave "somehow" more stochastically than transcriptomics data, maybe reflecting technical issues. It may also be linked to the lack of replication of eQTLs at the protein level.

15) I understand the rationale of using 250 kb for eQTL analysis, since much of the regulation of gene expression is likely to reside in the promoter region. However, I do not see a biological rationale for using the same window for pQTL analysis. I understand that using the same window may be the only way of making meaningful comparisons between eQTLs and pQTLs, and I think that this is ok. By using the same window the authors are implicitly assessing to what extent variation affecting gene expression also affects protein expression, that is, the genetic variation in which the impact on protein expression is mediated by the impact on gene expression. Maybe the authors should acknowledge this.

16) Related to the above. eQTLs tend to cluster around the TSS. Do they observe the same clustering for pQTLs? What is the comparative distribution of p/eQTLs along the tested region?

eLife. 2020 Aug 10;9:e57390. doi: 10.7554/eLife.57390.sa2

Author response


Revisions:

1) It is unclear how the acquisition of RNA and proteomics data was related. For example, were the same batches prepped, split, and material frozen for later profiling? Or were the lines grown up at separate times for each of the different profiling modalities? If the latter, how could this potential RNA-protein confounding batch effect influence the results?

The lines were expanded once for both banking of material and all molecular assays, including RNA and proteomics quantification. Hence, the same batch of cells was split and frozen material was distributed and used for both RNA and proteomics processing. We have clarified this in the main text (Results section), and included further details about the processing steps in the Materials and methods section.

2) In subsection “pQTL arising from protein-altering variants”, paragraph three, the authors advocate using protein abundance instead of RNA to interpret pathogenic mechanisms of rare variants. However, QTL studies generally have low power to detect effects for SNPs with low MAF. In fact, in this study, the authors used MAF 5% or higher. So, how does one reconcile this assertion to use protein abundance to interpret rare variant effects with the massive sample size it would take to do so? Addressing this idea in the text would be helpful.

We agree that this section was not fully clear. We intended to advocate the value of proteome level data in addition to RNA abundance data and to state that our results demonstrate that the proteomics data add another important level of information for annotating specific pathogenic variants. We agree, however, that the sample size is a concern, which limits the frequency spectrum that can be studied using our data (we consider MAF>5% for our analyses). We have reworded the relevant sections accordingly.
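
The dependence of mapping power on allele frequency can be made concrete with a standard relation: under an additive model and Hardy-Weinberg proportions, a variant with minor allele frequency p and per-allele effect β (in phenotype standard deviations) explains roughly 2p(1−p)β² of the phenotypic variance. The sketch below applies this relation. The function names, the effect size and the frequencies are illustrative assumptions, not values taken from the study.

```python
def variance_explained(maf, beta):
    """Fraction of variance of a unit-variance phenotype explained by an
    additive SNP with minor allele frequency `maf` and per-allele effect
    `beta`, assuming Hardy-Weinberg proportions."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2


def relative_sample_size(maf, reference_maf):
    """Approximate factor by which the sample size must grow to keep power
    constant when moving from `reference_maf` to `maf` at a fixed effect
    size. Valid when the variance explained is small, so that power
    depends mainly on N times the variance explained."""
    return (reference_maf * (1.0 - reference_maf)) / (maf * (1.0 - maf))


# Hypothetical per-allele effect of 0.5 standard deviations.
for maf in (0.25, 0.05, 0.01):
    print(maf, variance_explained(maf, beta=0.5), relative_sample_size(maf, 0.05))
```

Under these assumptions, a variant at 1% frequency needs a sample several times larger than one at 5% to be detected with the same power, which is the concern the reviewers raise.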

3) The authors should make the full summary scan results available so that the rest of the community can use them as a resource.

We agree that these data will be useful. We now provide the complete summary statistics for all eQTL, pQTL, transcript QTL and peptide QTL analyses we have conducted. These high-volume data are not suitable for including as supplementary tables and hence have been deposited on figshare. We have referenced them in the Data availability section.

4) Discussion paragraph three, the word "significant" is missing a "ly" and we'd advise removing that word altogether. In general, it should be used when making a statistical comparison and the associated test and p-value should be provided. This is not mentioned anywhere in the sentence. So, either those results should be disclosed, or a different word without a statistical connotation should be used. The same issue is present in paragraph three of subsection “pQTL arising from protein-altering variants”.

We thank the reviewers for highlighting this issue. We have revised the corresponding sections and now avoid terms with a statistical connotation in these instances, where we feel they are not necessary for the statements made.

5) The variance component analysis showed effects of culture medium, sex, age etc. Were these taken as covariates in the QTL analysis? Looking at effects of the culture medium, were the culture passage numbers comparable across lines and does that have an effect?

For the QTL analyses we used PEER factors (Stegle et al., 2012) to adjust for batch and other confounding factors. We have experimented with alternative strategies, including explicitly accounting for these additional covariates in the analysis. However, the inclusion of PEER factors performed best in terms of maximizing the number of discoveries. In general, we observed that these factors tend to capture broad covariates such as batch or culture media and hence there tends to be no significant benefit of including them separately. Regarding passage number – we agree that this is an interesting covariate, and in particular would have added value to the variance component analysis. Unfortunately, the records on passage number were not fully complete for all lines and hence we could not consider this in our study.

6) For the X chromosome inactivation (XCI) section, it is unclear how exactly the XCI status was quantified. Subsection “RNA and proteome variability” paragraph two and the legend to Figure 1 reference the Materials and methods section but a description of this analysis is missing there. These results are hard to interpret without methodological clarity.

We thank the reviewers and apologize for leaving out this detail from the Materials and methods section. We have now added this in the Materials and methods section.

7) Figure 1D – While the random forest model analysis is interesting, the authors should elaborate on their selection of these specific variables. Some other factors that would be informative to include would be MAF and RNA expression level.

We thank the reviewer for these suggestions. We have extended the analysis and the revised Figure 2D now depicts the relevance of MAF and RNA abundance.

8) Figure 2B and 3C – is the grey bar/shaded region in these plots the genomic location of the respective gene? If so, this should be specified in the legend.

The grey box does indeed indicate the genomic location of the respective gene. We have extended the figure caption accordingly.

9) Figure 3C and the corresponding text briefly describe a pQTL for the VRK2 gene where the pQTL SNP is also a GWAS SNP for schizophrenia risk. The authors should elaborate on the pQTL direction of effect with respect to GWAS risk. Is the variant the lead SNP for GWAS or what is the LD r2 with the lead SNP and do these signals colocalize in that case? Is there some evidence of this protein being relevant in schizophrenia related cellular mechanisms?

Thank you for raising this point. The pQTL lead variant is indeed identical to a reported GWAS risk variant for schizophrenia (Yu et al., 2017). Unfortunately, we could not obtain access to the summary statistics, and hence we have not conducted a formal colocalization test for this locus. However, the effect size direction of this pQTL is consistent with prior evidence for the disease relevance of VRK2. Briefly, the risk allele (OR 1.17) is associated with decreased protein expression. Several previous studies have implicated VRK2 in schizophrenia and have linked downregulation of VRK2 RNA abundance to schizophrenia and other neurological disorders (Azimi et al., 2018, Tesli et al., 2016). We have extended the main text accordingly to mention this (subsection “pQTL arising from protein-altering variants”).

10) Figure 4E – This is an interesting example. What is the eQTL/pQTL and trans pQTL direction of effect with respect to the Alzheimer's GWAS risk allele? In this example, the SNP rs1129187 is associated with PEX6 mRNA expression and protein abundance, and also associated with PEX1 and PEX26 abundance. To directly test if the trans-association of the SNP with PEX1 and PEX26 is through the association with PEX6 (complex stability) and not through other mechanisms, have the authors tried to regress out the PEX6 abundance from the association between the SNP and PEX1 or PEX26 and check if the association disappears?

The risk allele at rs1129187 is associated with increased abundance of both PEX6 RNA and protein level, and is also positively associated with the abundance of the remaining complex subunits PEX26 and PEX1. We have extended the text to include this information.

Regarding the mediating role of PEX6 in the trans association, two lines of evidence indicate that PEX6 is mediating this QTL effect. First, all the complex members are correlated (e.g. r=0.42 for PEX6 and PEX1). Second, we have conducted a regression-based analysis to compare the evidence for a genetic effect on the downstream targets before and after adjusting for PEX6 expression. This results in decreased correlation (r=0.06 vs r=0.29 for PEX1 and r=0.36 vs r=0.57 for PEX26), which is consistent with the hypothesized mediation. Nevertheless, this result remains a hypothesis at this stage and hence we have toned down this claim in the main text.
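
A regression-based check of this kind can be sketched as follows. This is a minimal illustration on simulated data, not the analysis code used for the study: the genotype, mediator and target vectors are hypothetical stand-ins for the variant, PEX6 abundance and a downstream subunit. The mediator is regressed out of the target by ordinary least squares, and the genotype-target correlation is compared before and after adjustment.

```python
import numpy as np


def residualize(y, x):
    """Return the residuals of y after ordinary least squares on x plus an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


# Simulated data (all values hypothetical): the genotype acts on the
# target only through the mediator, i.e. full mediation.
rng = np.random.default_rng(1)
n = 1000
genotype = rng.binomial(2, 0.3, size=n).astype(float)  # allele dosage 0/1/2
mediator = 0.8 * genotype + rng.normal(size=n)         # e.g. cis-regulated subunit
target = 0.6 * mediator + rng.normal(size=n)           # e.g. trans-affected subunit

r_before = np.corrcoef(genotype, target)[0, 1]
r_after = np.corrcoef(genotype, residualize(target, mediator))[0, 1]
print(f"before adjustment: r={r_before:.2f}, after adjustment: r={r_after:.2f}")
```

With full mediation, as simulated here, the correlation after adjustment shrinks towards zero. A partial drop, as reported for PEX26, is compatible with partial mediation or with measurement noise in the mediator, which is one reason such a result is better treated as supporting evidence than as proof.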

11) Wrong figures are referenced in some places in the manuscript. Figure 3D is referenced before 3B etc.

We thank the reviewer for highlighting these referencing errors. We have carefully revised and checked all references in the paper.

12) Were there genes for which significant eQTL and also pQTL associations were identified but the variants were independent (low LD r2)?

This is indeed a nice addition to our results as presented. In total, we identify 82 genes with significant eQTL and pQTL with independent lead variants (r2<0.1). The LD (r2) between the reported lead variants is now included in the Figure 1—source data 2, 3. We have added a statement to the main text (subsection “Mapping cis genetic effects on protein abundance”).
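
For readers who wish to apply the same independence criterion, LD between two lead variants can be approximated as the squared Pearson correlation of their genotype dosages. The sketch below assumes unphased dosages coded 0/1/2. The function names and toy genotypes are hypothetical, and the r2<0.1 threshold follows the text above.

```python
import numpy as np


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors,
    a common approximation to LD r2 for unphased genotypes."""
    r = np.corrcoef(dosage_a, dosage_b)[0, 1]
    return float(r ** 2)


def are_independent(dosage_a, dosage_b, threshold=0.1):
    """True if the two variants fall below the LD threshold."""
    return ld_r2(dosage_a, dosage_b) < threshold


# Toy genotypes (hypothetical): the eQTL lead variant and a pQTL lead
# variant that is uncorrelated with it across six samples.
eqtl_lead = np.array([0, 1, 2, 0, 1, 2], dtype=float)
pqtl_lead = np.array([0, 0, 0, 1, 1, 1], dtype=float)

print(ld_r2(eqtl_lead, pqtl_lead), are_independent(eqtl_lead, pqtl_lead))
```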

13) It was confusing that the authors do not clearly distinguish between the variant affecting the phenotype of the gene (transcript or protein expression) and the affected gene. They write "we report 654 genes with a cis pQTL and 3487 genes with a cis eQTL". I assume that one will in general find multiple p/eQTLs for a given gene (although these numbers do not appear to be reported). These are thus the numbers of p/eGenes, but not of p/eQTLs. However, when they set out to investigate replication of p/eQTLs, the numbers correspond to p/eGenes. The authors equate the number of QTLs with the number of affected genes. This part could be clearer.

We have carefully revised the manuscript to clarify the number of eGenes versus distinct eQTL/pQTL. Note that given the moderate sample size of our study, we have limited our analysis to lead e/pQTL variants and hence in most cases there is a one-to-one mapping of these terms. Nevertheless, we agree that correct terminology is important.

14) Figure 1B. It looks to me that the fraction of the total variance explained by the factors the authors use in their model is much larger for transcriptomics than for proteomics data. I suggest the authors report this number. If I am correct, this would mean that proteomics data behave "somehow" more stochastically than transcriptomics data, maybe reflecting technical issues. It may also be linked to the lack of replication of eQTLs at the protein level.

We agree that these differences are quite striking. The most likely explanation is higher assay noise reflecting the lower sensitivity of quantitative proteomics technologies compared to state-of-the-art deep RNA sequencing. This is indeed also the most likely explanation of the reduced mapping power. We have added a note in the main text (subsection “RNA and proteome variability”) as well as the Discussion section.

15) I understand the rationale of using 250 kb for eQTL analysis, since much of the regulation of gene expression is likely to reside in the promoter region. However, I do not see a biological rationale for using the same window for pQTL analysis. I understand that using the same window may be the only way of making meaningful comparisons between eQTLs and pQTLs, and I think that this is ok. By using the same window the authors are implicitly assessing to what extent variation affecting gene expression also affects protein expression, that is, the genetic variation in which the impact on protein expression is mediated by the impact on gene expression. Maybe the authors should acknowledge this.

We agree that these choices deserve some justification. Besides the expected biological mechanisms, e.g. promoter regulation or proximal enhancers, the choice is primarily motivated by tradeoffs in mapping power to identify genetic effects. Larger testing regions entail a greater multiple testing burden, thereby reducing the power to find eQTL/pQTL variants that are proximal to the TSS. There is no generic recipe for these tradeoffs and our choice can be considered a middle ground when compared to previous pQTL studies (e.g. 20kb around the gene boundaries in Battle et al., 2015 and 1Mb around the TSS in Sun et al., 2018). We acknowledge that a smaller window may mean that we miss interesting distal effects. We have included a brief justification in the Materials and methods section.
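
The multiple testing tradeoff described in this response can be illustrated with a minimal sketch. It assumes a plain Bonferroni correction within each gene's testing region and a uniform variant density; both are simplifications chosen for illustration and are not the correction procedure used in the study. The function name, the variant density and the region lengths are hypothetical.

```python
def per_gene_threshold(region_kb, variants_per_kb, alpha=0.05):
    """Return (number of tests, Bonferroni p-value threshold) for one gene.

    Assumes variants are spread uniformly over the testing region and
    treats every variant as an independent test. This is conservative,
    because variants in linkage disequilibrium are correlated.
    """
    n_tests = max(1, round(region_kb * variants_per_kb))
    return n_tests, alpha / n_tests


if __name__ == "__main__":
    # Hypothetical total region lengths (kb) and variant density.
    for region_kb in (40, 500, 2000):
        n_tests, threshold = per_gene_threshold(region_kb, variants_per_kb=3.0)
        print(f"{region_kb:>5} kb region: {n_tests:>5} tests, "
              f"per-variant threshold {threshold:.2e}")
```

Under these assumptions the per-variant threshold shrinks in proportion to the region length, so a variant near the TSS must reach a stronger association signal to be called when the window is wide.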

16) Related to the above. eQTLs tend to cluster around the TSS. Do they observe the same clustering for pQTLs? What is the comparative distribution of p/eQTLs along the tested region?

Yes, this locational clustering in the vicinity of the TSS is expected, and indeed prior studies have reported this for eQTL (Kilpinen et al., 2017). The distribution of pQTLs is similar to the one observed for eQTLs. While we did not initially report these results, we have followed the reviewer's suggestion and now include them (Figure 2—figure supplement 3).
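
As a rough sketch of how such a distribution could be computed, the snippet below derives a strand-aware distance between a variant and the TSS. It assumes the TSS is approximated by the gene boundary on the coding strand (gene start on the forward strand, gene end on the reverse strand) and that strand is encoded as 1 or -1; the function name and these conventions are assumptions for illustration, not a description of the analysis code used in the study.

```python
def signed_tss_distance(variant_pos, gene_start, gene_end, gene_strand):
    """Distance from the TSS to a variant, in the direction of transcription.

    Positive values lie downstream of the TSS, negative values upstream.
    gene_strand is assumed to be 1 (forward) or -1 (reverse).
    """
    if gene_strand not in (1, -1):
        raise ValueError("gene_strand must be 1 or -1")
    # Gene-level approximation of the TSS: the 5' boundary of the gene.
    tss = gene_start if gene_strand == 1 else gene_end
    return (variant_pos - tss) * gene_strand


if __name__ == "__main__":
    # Hypothetical coordinates: the same variant relative to a forward-strand
    # gene and a reverse-strand gene spanning the same interval.
    print(signed_tss_distance(1500, 1000, 2000, 1))
    print(signed_tss_distance(1500, 1000, 2000, -1))
```

Binning these distances for the lead variants of each molecular layer would give histograms that can be compared between eQTLs and pQTLs.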

Associated Data

    Data Citations

    1. Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium, Stegle O, Lamond AI. 2020. HipSci: the iPSC proteomic compendium. PRIDE. PXD010557
    2. Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium, Stegle O, Lamond AI. 2020. hipsci_peptide_qtls. figshare.
    3. Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium, Stegle O, Lamond AI. 2020. hipsci_transcript_qtls_202_lines. figshare.
    4. Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium, Stegle O, Lamond AI. 2020. hipsci_eqtls_202_lines. figshare.
    5. Mirauta BA, Seaton DD, Bensaddek D, Brenes MA, Bonder MJ, Kilpinen H, HipSci Consortium, Stegle O, Lamond AI. 2020. hipsci_pqtls. figshare.

    Supplementary Materials

    Figure 1—source data 1. HipSci proteomics iPSC lines.

    The public ids, TMT batch, donor, gender, age and growth media for the HipSci iPSC lines used in this study are shown.

    Figure 1—source data 2. RNA gene level expression across the 202 lines for genes recurrently detected at the protein level.

    Lines are indexed by protein Ensembl gene Id. Columns are the line public names.

    Figure 1—source data 3. Protein abundance values across the 202 and reference lines for genes recurrently detected at the protein level and with RNA expression (TPM >1).

    Lines are indexed by protein Uniprot Id. The first 229 columns contain intensity values after quality line filtering, batch correction and quantile normalisation. Line names are encoded as follows: [line public name]@[TMT batch]@[TMT channel]. The last columns include protein information: 'gene_chromosome', 'gene_start', 'gene_end', 'ensembl_gene_id', 'gene_name', 'gene_strand', 'number_of_peptides', 'in_CORUM'.
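
The column naming scheme above can be decoded with a short helper. This is a minimal sketch: it assumes that the batch and channel fields themselves contain no "@" character, and the function name and the example column name are hypothetical.

```python
from typing import NamedTuple


class LineColumn(NamedTuple):
    line_name: str
    tmt_batch: str
    tmt_channel: str


def parse_line_column(column):
    """Split '[line public name]@[TMT batch]@[TMT channel]' into its parts.

    Splitting from the right keeps any '@' inside the line name intact.
    Raises ValueError for columns that do not follow the scheme, such as
    the trailing protein information columns.
    """
    parts = column.rsplit("@", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not an intensity column: {column!r}")
    return LineColumn(*parts)


if __name__ == "__main__":
    # Hypothetical column name following the documented encoding.
    print(parse_line_column("lineA@batch1@channel2"))
```

Because the helper raises on names without the two separators, it can also be used to tell intensity columns apart from the protein information columns when loading the table.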

    Figure 1—source data 4. Protein and RNA variance components.

    Variance decomposition for 6009 genes with high RNA expression (TPM >1) that were detected at the protein level.

    Figure 1—source data 5. Protein and RNA correlation with X chromosome inactivation.

    Correlation with XCI status of protein and RNA profiles for the 6336 genes (6406 proteins) with high RNA expression (TPM >1) and detected in all female lines at the protein level.

    Figure 1—source data 6. Functional enrichment of genes with protein or RNA profiles correlated with XCI.

    This table enumerates the significant Gene Ontology terms and DNA regulatory motifs (FDR 0.05; fields 'source' and 'term_name') for different gene sets (fields 'molecular_layer' and 'change_direction'): 1) RNA positively correlated with XCI, 2) RNA negatively correlated with XCI, 3) proteins positively correlated with XCI and without RNA nominal significance, and 4) proteins negatively correlated with XCI and without RNA nominal significance.

    Figure 2—source data 1. pQTL_results.

    List of genes with a significant pQTL (FDR < 10%), provided as a supplementary file. Data fields are described in the table below.

    Figure 2—source data 2. eQTL_results.

    Reported are genes with a significant (FDR < 10%) QTL. The table consists of variants mapped at the RNA level, gene resolution, for genes detected at both RNA and protein levels. It includes the features used in the prediction of the pQTL status. The table columns are analogous to Figure 2—source data 1 pQTL_results.

    Figure 3—source data 1. tQTL_results.

    Consists of variants mapped at the RNA level, transcript isoform resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results.

    Figure 3—source data 2. pepQTL results.

    Consists of variants mapped at the protein level, peptide resolution, for genes detected at both RNA and protein levels. The table columns are analogous to Figure 2—source data 1 pQTL_results.

    Figure 4—source data 1. trans-pQTL_results.

    Reported are the trans pQTL (FDR < 10%).

    Supplementary file 1. Comparison of proteome coverage across human proteomics datasets.

    To facilitate comparison with other datasets we report here the number of proteins and peptides at FDR 1%.

    elife-57390-supp1.docx (60.8KB, docx)
    Supplementary file 2. Disease status.

    Shown are the number of lines and donors for which matched mRNA and protein data are available.

    elife-57390-supp2.docx (45.3KB, docx)
    Supplementary file 3. Peptides overlapping protein altering variants detected in this study.

    File containing the list of peptides overlapping protein altering variants or unmapped to the reference genome.

    elife-57390-supp3.tsv (901.7KB, tsv)
    Transparent reporting form

    Data Availability Statement

    All data can be accessed via the HipSci data portal, which references EMBL-EBI archives that are used to store the HipSci data. This study includes both cell lines that are consented to be openly accessible as well as cell lines that are subject to managed access, which means a data access application needs to be filed prior to accessing the data. Managed access data from all assays are accessible via EGA under the study EGAS00001001465. Open access genotyping array data and RNA-Seq data are available from ENA under the studies PRJEB11752 and PRJEB7388. Proteomics quantifications (protein group and peptide resolution; MaxQuant output) and run parameters are available on the PRIDE Archive (PXD010557). Intermediate result files for this study, such as processed gene expression levels, can be found in Figure 1—source data 2 and 3. Complete summary statistics for the protein and RNA QTL analyses are available at: https://figshare.com/projects/QTL_datasets_for_Population-scale_proteome_variation_in_human_induced_pluripotent_stem_cells_/84233. Analysed data is included in the supplementary files. Scripts used to perform the statistical analyses presented are available at: https://github.com/hipsci/Elife2020 (copy archived at https://github.com/elifesciences-publications/Elife2020).


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd