Abstract
Background
A fundamental goal of single nucleotide polymorphism (SNP) genotyping is to determine the sharing of alleles between individuals across genomic loci. Such analyses have diverse applications in defining the relatedness of individuals (including unexpected relationships in nominally unrelated individuals, or consanguinity within pedigrees), analyzing meiotic crossovers, and identifying a broad range of chromosomal anomalies such as hemizygous deletions and uniparental disomy, and analyzing population structure.
Principal Findings
We present SNPduo, a command-line and web accessible tool for analyzing and visualizing the relatedness of any two individuals using identity by state. Using identity by state does not require prior knowledge of allele frequencies or pedigree information, and is more computationally tractable and is less affected by population stratification than calculating identity by descent probabilities. The web implementation visualizes shared genomic regions, and generates UCSC viewable tracks. The command-line version requires pedigree information for compatibility with existing software and determining specified relationships even though pedigrees are not required for IBS calculation, generates no visual output, is written in portable C++, and is well-suited to analyzing large datasets. We demonstrate how the SNPduo web tool identifies meiotic crossover positions in siblings, and confirm our findings by visualizing meiotic recombination in synthetic three-generation pedigrees. We applied SNPduo to 210 nominally unrelated Phase I / II HapMap samples and, consistent with previous findings, identified six undeclared pairs of related individuals. We further analyzed identity by state in 2,883 individuals from multiplex families with autism and identified a series of anomalies including related parents, an individual with mosaic loss of chromosome 18, an individual with maternal heterodisomy of chromosome 16, and unexplained replicate samples.
Conclusions
SNPduo provides the ability to explore and visualize SNP data to characterize the relatedness between individuals. It is compatible with, but distinct from, other established analysis software such as PLINK, and performs favorably in benchmarking studies for the analyses of genetic relatedness.
Introduction
High-density single nucleotide polymorphism (SNP) genotyping is used in association studies to find markers linked to loci that contribute to human disease and variation. Data derived from this approach have led to the identification of candidate loci in several human diseases, including macular degeneration [1], [2], rheumatoid arthritis [3], [4], and breast cancer [5]–[8]. SNP data are also useful for the analysis of homozygosity and copy number alterations in individuals, and inheritance patterns in pedigrees [9]. Two principles central to SNP use in association studies are identity by state (IBS) and identity by descent (IBD). IBS is the sharing of alleles between individuals. IBD is the sharing of alleles between individuals with an identified, common ancestral source of the alleles. Consider a family with a father and mother having genotypes CC and AC, respectively, and two children with an AC genotype. Each sibling's A allele is shared IBD since it must have been inherited from the mother on the same physical chromosome. The C alleles, however, are shared IBS between the siblings. One of the father's two C alleles was transmitted to each sibling, but it is not possible to discern whether the C alleles were derived from the same physical chromosome.
There are many applications of IBD studies including the affected sib-pair method [10], tests for linkage [11], [12] and studies of genetic relatedness [13]. The Merlin software package is commonly used for linkage studies, and includes IBD calculations [14]. The use of IBD requires knowledge of the relationship between individuals, or population allele frequencies to calculate IBD probabilities. In contrast, IBS can be calculated without knowledge of pedigree structure and does not require allele frequency information. In this paper we describe the SNPduo software tools. SNPduo is available in command-line and web accessible versions, referred to as SNPduo++ and SNPduo, respectively. SNPduo provides a method for the visualization of IBS between two individuals in an intuitive and informative way. By analyzing and plotting the IBS states between two individuals by the chromosomal location of each SNP, blocks of shared (and unshared) chromosomal material can be located. SNPduo++ provides a method for analyzing the mean and standard deviation of IBS states in large datasets, facilitating the discovery of unexpected relationships and population structure. Furthermore we show how IBS analysis in families can visualize meiotic crossover points in siblings (supported with synthetic data), delineate regions of hemizygous deletion, and detect uniparental disomy based upon discrepant IBS patterns. SNPduo identifies this broad range of genetic phenomena because it relies on high-density SNP data which are now routinely available. As the size of data sets grow in terms of the number of samples, so does the possibility of misclassified relationships due non-paternity, sample mislabeling, unexpected relatedness within families, and the unintentional inclusion of multiple family members (possibly recruited at different times or at different centers). SNPduo can help identify such errors.
Results
SNPduo demonstrates interpretable patterns of allele sharing
We developed the SNPduo program that tabulates and visualizes genetic relatedness based on SNP genotypes between pairs of individuals. The tool is web-accessible at http:// pevsnerlab.kennedykrieger.org/SNPduo. In particular, at each SNP we consider the sharing of 0, 1, or 2 alleles IBS. SNPduo requires genotype input, but is not a genotyping algorithm. Therefore the accuracy of the results depends upon the accuracy of the genotyping algorithm i.e. how often the observed genotype is the true genotype. For this reason all of the IBS classifications and genotypes described in this publication were the observed states. We determine the chromosomal locations at which a given SNP shared both alleles between two individuals (AA to AA, AB to AB, or BB to BB; IBS-2), one allele (e.g. AA to AB or AB to BB; IBS-1), or no alleles (AA to BB or BB to AA; IBS-0).
The input to SNPduo is a set of genotype calls with annotated chromosomes and physical map positions; the output includes an image with series of tracks plotting single-point, pairwise IBS by physical map position, a text summary file, and an IBS block file viewable on the UCSC genome browser. Images for three pairwise comparisons are shown for the X chromosome of a trio (father, mother, and son, Fig. 1). The unrelated parents included many IBS-0 calls, typical of genotypes from unrelated individuals (Fig. 1a). The mother/son pairwise comparison included no IBS-0 calls since she transmitted one of her two X chromosomes to the son (Fig. 1b). Notably, although the male X was hemizygous (and had genotypes A or B), current SNP genotyping algorithms typically generate all biallelic calls (AA or BB). Therefore even though the son's genotype is interpreted as diploid he always shared at least one observed allele with the mother's genotype. The father/son comparison showed almost no IBS-1 calls (Fig. 1c). These two chromosomes are both hemizygous and unrelated, and so include IBS-2 (e.g. AA matching AA) and IBS-0 (e.g. AA matching BB) SNPs. Rare instances of IBS-1 likely represent genotyping errors in which A or B genotypes were mistakenly identified as heterozygous AB calls. Vertical blanks with no data points on any track of a plot, including the No Call track of the individual genotypes, represent areas with no SNPs on the genotyping platform, typically in regions comprised of repetitive DNA (e.g. pericentromeric regions or areas of segmental duplications). Furthermore, it should be noted that although many areas appear to be a solid or nearly solid bar in this figure, the image is actually composed of individual SNPs plotted as dots by chromosome position with vertical jittering to make the track density easier to see.
When examining inheritance patterns in families occasionally aberrant patterns are observed. In particular, consider the pairwise comparisons of parents to a child who has a de novo deletion. The parents can have AA, AB, or BB genotypes for each SNP. In the region of deletion the child may have the genotype of A or B. However, genotyping algorithms interpret this as either AA or BB. For the parent whose allele was not deleted, the pattern appeared as a normal parent/child relationship with IBS-2 and IBS-1 tracks overlaid (Fig. 2, middle panel). In contrast, the parent whose allele was deleted in the child showed a pattern of IBS-2 and IBS-1 overlaid in diploid regions, but showed IBS-2, IBS-1, and IBS-0 in the region of the hemizygous deletion (Fig. 2, bottom panel arrow). In this case, the proband had an interstitial hemizygous deletion of approximately 7 megabases (Mb) as determined by karyotyping and SNP analyses [15]. The 7 Mb IBS-0 region appeared because of the genotype interpretation artifacts.
SNPduo visualizes variably shared genomic regions in siblings
Siblings inherit genetic material from common parents, and therefore share some genetic material with each other. However, the amount of sharing between any pair of siblings for any given chromosome varies between 0% and 100%. We performed pairwise analyses of siblings and examined all chromosomes. An example of a SNPduo output comparing siblings from a previously described family [16] (Fig. 3, upper panel) showed alternating patterns of IBS-2, -1 and -0 tracks. Three IBS patterns were evident. (1) If both siblings inherit the same alleles from both parents then they share two alleles IBS, and only the IBS-2 track will be present on a SNPduo plot (Fig. 3, upper panel, regions A). (2) Now consider that sibling 1 inherits a portion of the maternal grandfather's chromosome and a portion of the paternal grandmother's chromosome, while sibling 2 inherits a portion of the maternal grandmother's chromosome (unshared between siblings) and a portion of the paternal grandmother's chromosome (shared between siblings). In this case the siblings share only one chromosome segment (paternal grandmother's chromosome), and an overlapping pattern of IBS-2 and IBS-1 is apparent (Fig. 3, upper panel, regions B). (3) Finally, sibling 1 might inherit a segment of maternal grandfather's chromosome and a segment of paternal grandmother's chromosome, while sibling 2 inherits a segment of maternal grandmother's chromosome (unshared between siblings) and a segment of paternal grandfather's chromosome (unshared between siblings). In this instance, since the siblings do not have any common ancestry in this chromosomal segment, overlapping patterns of IBS-2, IBS-1, and IBS-0 are apparent on the SNPduo plot (Fig. 3, upper panel, regions C). Note that in region C where there are zero shared alleles IBS-1 and IBS-2 patterns also occur, and in region B where there is one shared allele there are also IBS-2 SNPs. The nature of these patterns is explainable by the heterozygosity rate. To be considered a SNP (a form of polymorphism) the minor allele frequency need only be 1%. SNPs with a low minor allele frequency, especially within a population group, will always have some SNPs with identically shared (IBS-2) or partially shared (IBS-1) genotypes between individuals.
SNPduo reveals meiotic crossover points in sibling comparisons
The sharing of chromosomal segments between siblings is characterized by alternating regions of IBS-2, IBS-1, and IBS-0. Each sibling may inherit either recombinant or non-recombinant alleles from either grandparent for each chromosome from each parent; furthermore, every gamete has a different, specific set of recombinations.
The switch between IBS track types in sibling comparisons was not random and occurred by the following pattern: IBS-2 ↔ IBS-1 ↔ IBS-0 (Fig. 3 A↔B↔C). IBS-0 → IBS-2 transitions were not seen (although the region of IBS-1 was occasionally very small). We hypothesized that the change points represented meiotic crossover events. For an IBS-2 → IBS-0 or vice versa transition without the intermediate transition to an IBS-1 track type to occur, two crossovers would have to occur in a region between SNPs. It is worth clarifying that these transition patterns apply only to the overall IBS track type in a region, not the individual SNP interpretations. Consecutive single point SNP calls transition freely between IBS classes. The SNPduo web tool implements a density segmentation, which is optimized for sibling comparisons, to find blocks of IBS types (Fig. 3, lower panel). Data are saved to a text file that can be viewed on the UCSC Genome Browser [17].
To confirm that changes in block type corresponded to meiotic crossovers we generated a synthetic data set (see Supplementary Information S1) for a family consisting of two siblings, their parents, and four grandparents (Fig. S1); the positions of all crossover events were known in the generation II parents and the generation III siblings. SNPduo and an IBS block type segmentation algorithm (see Supplementary Information S1) were applied to generation III siblings. Analysis of the synthetic pedigree confirmed that each IBS type change point corresponded to a meiotic crossover in either the mother or father for all chromosomes (Fig. 4). The figure shown was representative of the findings for all autosomes observed in the synthetic siblings. We observed similar findings with real data from multigenerational pedigrees (data not shown).
SNPduo identifies six relationships in HapMap
We analyzed nominally unrelated individuals from the HapMap project including individuals designated as parents from Yoruba (YRI; n = 60) and northern Europe (CEU; n = 60), and all individuals from China (CHB; n = 45) and Japan (JPT; n = 45). For this analysis we obtained the genotypes of all 270 phase 1 / 2 HapMap individuals assayed on the Affymetrix 500K platform (see Methods). We calculated the IBS state for all 210 unrelated individuals (n = 21,945 pairwise comparisons) across 22 autosomes. We next calculated the mean and standard deviation of the IBS states across all autosomes for each comparison. Related individuals were determined by plotting the mean versus standard deviation (using an approach similar to that described in the GRR software package [18]) and identifying outliers. The plot showed that the majority of pairwise comparisons had mean and standard deviation values that resulted in distinct clusters for pairwise comparisons within the YRI and CEU populations, and overlapping clusters for pairwise comparisons within the Asian (JPT and CHB) groups (Fig. 5). A total of 6 unexpectedly close relationships were discovered among the nominally unrelated individuals (Fig. 5).
(1) Yoruba individual NA18913 is the father in a trio while Yoruba individual NA19238 is the mother in a separate trio. However, the IBS analysis suggested that NA18913 and NA19238 shared a child/parent relationship. The SNPduo pairwise visualization for these individuals included IBS-1 type and IBS-2 type tracks overlaid (Fig. S2); this is a typical pattern for IBS-1 regions. We observed several regions with IBS-2 only, indicating some degree of historic relatedness between the putative mother (NA19238) and unknown father.
We analyzed the allele sharing between the two individuals for non-redundant HapMap mitochondrial SNPs and identified a high degree of mitochondrial SNP sharing (180 shared alleles out of 181 mitochondrial SNP genotypes). Since mitochondrial SNPs are maternally inherited, this provided strongly suggestive, but not definitive, evidence for the model in Fig. 6 in which NA19238 is the mother of NA18913. Furthermore, while there are a limited number of mitochondrial haplotypes the greatest diversity is in the African population, increasing the significance of a shared haplotype in this group.
(2) The second relationship discovered was between Yoruba male individuals NA19130 and NA19192 (refer to pedigrees in Fig. S3 and SNPduo output in Fig. S4). There was evidence of segmental sharing of one allele on all chromosomes, including the X chromosome. This pattern was suggestive of a second degree relationship, for example half-sibling or avuncular. The X sharing indicated a shared female relative, since a common male ancestor would have passed on his Y chromosome to a male child. Two relationships were most plausible: half-sibling (Fig. S3a) and avuncular (Fig. S3b). In this arrangement the roles of uncle and nephew were interchangeable. The correct pedigree structure could not be identified from the data.
(3) The third unexpected relationship was between Yoruba individuals NA19092 and NA19101. There was evidence of segmental single allele sharing on about half the chromosomes of these two individuals (Fig. S5). The pair shared 154 HapMap mitochondrial SNPs IBS-2 with 25 IBS-0, and did not share X chromosome DNA. These data indicated that the pair did not share a female ancestor. The sharing pattern, as well as the mean and standard deviation compared to the unrelated Yoruba individuals, suggests a third degree relationship, such as first cousins (Fig. S6).
(4,5,6) The remaining relationships were between CEU individuals NA07022 and NA06993, NA12155 and NA12264, and NA07056 and NA06993. All of these individuals showed evidence of segmental single allele sharing throughout the genome. The pattern of allele sharing was similar to the previous case of putative first cousins, but on fewer chromosomes and with smaller shared regions. These data suggest the relationships between these individuals to be first cousins, once removed or an alternative relationship of a similar degree.
All of the relationships we described are consistent with those reported in a HapMap publication [19].
Expected higher diversity in Yoruba compared to other HapMap groups
As stated previously, IBS-2 classifications correspond with IBS-2, -1, and -0 track types. IBS-1 classifications correspond with both IBS-1 and -0 track types. IBS-0 is the least common classification and corresponds to only the IBS-0 track type. Given that information content is inversely proportional to its frequency [20], the IBS-0 classified SNPs have the highest information content for determining unrelated individuals and amount of population variation since unrelated individuals demonstrate genome-wide IBS-0 track types. All pairwise comparisons of unrelated individuals within each HapMap population were performed using downloaded Affymetrix genotyping data. The number of IBS-0 classifications was counted for each autosome in each comparison. The X chromosome was excluded due to the confounding biallelic genotypes in hemizygous males. The Tukey Honest Significant Differences (Tukey HSD) is useful to define the statistical significance of differences for pairwise comparisons. We calculated Tukey HSD from the analysis of variance (ANOVA) of all of the within group comparisons using the R statistical programming language [21] (Fig. S7). The Tukey HSD allows for multiple comparisons while limiting the overall family-wise error rate (in this case to 5%), whereas multiple single tests would inflate the error rate. The Tukey HSD p-values are adjusted for multiple comparisons. Specifically, the Tukey HSD tested whether the within group IBS-0 counts differed between examined groups.
Several characteristics are apparent from the data. The within group YRI comparisons had a significantly greater number of IBS-0 counts (adjusted p-value<10−7) than the within group comparisons of CHB, CEU, and JPT. The JPT and CHB groups behaved similarly when compared to the YRI group, with both having between 95 and 105 fewer IBS-0 counts on average than the YRI group comparisons. The YRI group also demonstrated a greater number of IBS-0 counts than the CEU group, although the means differed by fewer IBS-0 counts (22 to 42 more IBS-0 calls in YRI than in CEU). The JPT and CHB groups showed no difference in the mean numbers of IBS-0 counts (adjusted p-value 0.62) in their respective within group comparisons. The CHB and JPT groups demonstrated similar differences in the mean number of IBS-0 counts compared with CEU with between 64 and 94 fewer IBS-0 calls (adjusted p-value<10−7).
By using the count of IBS-0 SNPs as a proxy for genotype diversity we concluded the YRI group had the greatest genetic diversity between the sampled “unrelated” individuals. CEU samples were more genetically similar than the YRI samples, but less genetically similar than the JPT or CHB samples. The JPT and CHB samplings appeared to be the most similar groups with respect to the genotyped markers, and they did not appear to be different from one another in terms of within group similarity. All of these findings were as expected based on previous HapMap analyses [19], [22].
Differences in mitochondrial genome diversity between groups
The number of IBS-0 classified mitochondrial SNPs between all unrelated individuals within each HapMap group was tabulated, and the results were visualized by plotting the Tukey HSD values (Fig. S8). The data plotted were the square-root of the number of IBS-0 counts instead of the counts themselves. This was because in performing ANOVA one must assume that the variance within groups is equal, and especially that there is no mean-variance relationship. A simple box plot of the IBS-0 counts (data not shown) demonstrated increasing group variance with an increase in mean, an effect that the square root transformation removed. A log transformation was not used since the data contained counts of 0. The YRI group demonstrated a greater mitochondrial diversity than the CEU, CHB, or JPT groups (adjusted p-value<10−7). The CHB and CEU groups appeared to have no significant difference in the mean number of IBS-0 counts (adjusted p-value 0.59). However, the JPT group demonstrated less mitochondrial diversity as measured by the IBS-0 counts than either the CHB or CEU groups (adjusted p-value<10−7). This contrasts with the whole-genome result that the JPT and CHB group showed no difference in the number of IBS-0 counts between unrelated individuals for autosomal SNPs. A smaller amount of mitochondrial diversity with very similar genome diversity supports a small Japanese founder population.
These data indicated a hierarchy of mitochondrial SNP diversity within the sampled data. All groups had some individuals with shared mitochondrial haplotypes. The sampled YRI group had the greatest amount of mitochondrial diversity among unrelated individuals, while the JPT group had the lowest of amount of mitochondrial diversity in the sampled unrelated individuals. The diversity amongst the CEU and CHB groups was not significantly different.
Analysis of a large autism data set
We obtained SNP data from 2,883 individuals genotyped on Affymetrix arrays in 721 multiplex autism families from the Autism Genetic Resource Exchange (AGRE). SNPduo++ analysis was performed to calculate the mean and standard deviation of IBS, the specified relationship, and a calculated relationship (unrelated, first degree [parent, child, or sibling], or identical sample) using a PLINK formatted ped and map file as input.
We highlight the usefulness of SNPduo++ in analyzing large datasets in two ways. First, we used SNPduo++ to find identical samples from those that were annotated in the pedigree file as non-identical. Twenty-seven identical sample pairs were identified (Supplementary Table S1). Twenty-six of these pairs were specified as siblings according to the pedigree (.ped) file, but calculated to be identical by SNPduo++. In these cases accompanying AGRE data confirmed that all of these pairs of siblings were actually identical twins (or higher order multiples). This highlights a weakness of the ped format. Without additional information it would be impossible to tell if these were in fact true identical siblings or if the samples were mislabeled.
For the remaining pair, a pairwise SNPduo++ calculation indicated that two samples were identical, but the pedigree file indicated that the genotype data were from unrelated individuals (Supplementary Table S1). This could occur if a sample is genotyped two or more times and was mislabeled. Less likely was the possibility that the individuals are in fact identical siblings adopted apart.
A second useful aspect of SNPduo++ is the identification of first-, second-, or third-degree relatedness in individuals who are annotated as unrelated. Twenty individuals specified as unrelated based on the pedigree file shared a first degree relationship (either parent/child or siblings; Supplementary Table S2). Many other individuals shared second or third-degree relationships (data not shown). Such findings allow one to move from the large-scale summary data back to the level of individual pedigrees. Extending and properly annotating pedigrees increases statistical power by increasing the number of families in the study and decreasing bias caused by not properly annotating those families.
We obtained a second set of 3,817 samples (which included reruns of some individuals) from AGRE genotyped on the Illumina 550K platform (n = 561,466 SNPs). These data were processed using command-line R scripts (see Methods) to calculate the mean and standard deviation of IBS state for all autosomes individually and summarily across all autosomes. Using these data several interesting phenomena were apparent.
One individual, AU009204, exhibited an unusual IBS sharing with his parents on chromosome 16. Compared to the mother, chromosome 16 consisted of a single IBS-2 block, but compared to the father chromosome 16 consisted of a single IBS-0 block. To rule out programming bugs, the identified individuals were directly exported from the Illumina BeadStudio software. Analysis by SNPtrio [23] confirmed a maternal uniparental heterodisomy of chromosome 16 in the child (Fig. S9). Given that this was a heterodisomy rather than a uniparental isodisomy, this event did not produce homozygosity on chromosome 16 in the child. Furthermore, all other chromosomes exhibited standard Mendelian inheritance, supporting a true UPD event and not a cell line artifact or software error.
Second, a mother, AU068501, exhibited unusual mean and standard deviation values on chromosome 18 when compared to her child. Examination in Illumina BeadStudio software revealed a mosaic monosomy of chromosome 18. There are two hallmarks that support this conclusion. The first is the decrease in intensity corresponding to reduced copynumber. The Illumina software provides a measure of intensity defined as the base 2 logarithm of the intensity of the individual versus a reference intensity (referred to as the Log R Ratio). The total intensity for a SNP is defined by the sum of both A allele and B allele probe intensities. The reference intensity is included in the software, and is derived from the average total intensity of that SNP for greater than 100 HapMap individuals. The decrease in Log R Ratio supports an interpretation of monosomy. The second hallmark is the homozygosity, evidenced in the Illumina data as B Allele Frequency. The B Allele Frequency is the fraction of total alleles that are a ‘B’ allele (which is derived from a series of transformations of coordinate intensity data). For example, with an AA genotype 0 out of 2 alleles are B, giving a B Allele Frequency of 0. For an AB genotype 1 of 2 alleles are B, giving a B Allele Frequency of 0.5. For a mosaic mixture of AB and A0, there are three alleles, and only one of them is a B. This gives a B Allele Frequency of approximately 0.33 (the same as an AAB trisomy genotype). An AB / B0 mosaic has two out of three alleles as a B genotype, giving a B Allele Frequency of approximately 0.67 (the same as an ABB trisomy genotype). For this individual the coordinate drop in Log R Ratio and splitting of the B Allele Frequency panel into tracks at 0, 0.33, 0.67, and 1 indicate a mosaic monosomy (Fig. S10). Without the intensity information it would be impossible to determine if it was a mosaic monosomy or a trisomy.
A number of samples appeared to have been genotyped twice in this dataset, presumably due to low genotype call rates evident during the first run. The rerun set had accepted genotyping rates and less noise, but IBS analysis revealed the rerun samples were mislabeled and did not agree with the specified relationships in the pedigree information. We use SNPduo++ IBS analysis to reconstruct pedigrees for these individuals (Fig. S11). The sample which demonstrated maternal uniparental heterodisomy, AU09204, was included in the mislabeled group. The nuclear family was reconstructed using SNPduo data (Fig. S11, family B), but we cannot determine which true individual ID the data belongs to, i.e. whether AU09204 from the rerun group corresponds to the true identity of AU09204. In large datasets, particularly studies with thousands of individuals, sample mislabeling is routine. These findings highlight the need for applying quality control measures to data after genotyping to reduce the impact these types of errors could have on data analysis.
The number of unlinked markers required to discern unrelated individuals from those with closer relationships is on the order of a few hundred markers (Roberson and Pevsner, in preparation). However, randomly sampling markers from larger data pools would fail to identify smaller scale changes, such as a single chromosome UPD or a segmental copy number alteration.
Discussion
SNPs are useful markers for many types of genetics and population-based studies. A challenge of SNP data analysis is to identify shared chromosomal segments in both closely and distantly related individuals. In this work we present SNPduo, a tool that identifies and visualizes shared loci.
SNPduo is a useful analysis tool in several different situations. (1) Large-scale SNP association studies can involve hundreds or thousands of presumably unrelated samples. By plotting mean and standard deviation values generated by SNPduo++, unexpected first-, second-, and third-degree relationships can be revealed. The --conflicting switch for SNPduo++ can also identify identical samples and first degree relationships. (2) PLINK is a popular tool for running association studies and many users generate a file containing IBD probabilities and IBS counts using the --genome option. The output file can be fed into SNPduo++ directly without having to rerun the entire analysis, facilitating the examination of population structure and cryptic relationships by calculating the mean and standard deviation of IBS for each comparison. In this scenario, in which the PLINK output does not include pedigree information, the specified relationships cannot be determined. (3) The web-based version of SNPduo is useful for smaller data sets, particularly within families. Using the web based version of SNPduo in family studies can help reveal shared ancestry in parents, the transmission of alleles through generations (such as grandfather to grandson), uniparental inheritance, location of meiotic crossovers between siblings, and the parental chromosome that is affected in hemizygous deletions.
Two previously described software tools are often used to determine relationships between individuals, RELPAIR and GRR [18], [24]. RELPAIR is written in Fortran 77 and based on the source code the initial limit of SNPs is 10,000. Current studies use hundreds of thousands to millions of SNPs per individual, far more than the suggested maximum. RELPAIR also requires a priori knowledge of the allele frequencies for each marker to be listed in the locus file, which may not be easily derivable for all marker sets in all populations, and may require a certain amount of estimation. GRR is another widely used option. The GRR documentation suggests using at least 200 markers, but is more reliable with 300 or more. Modern SNP studies use several thousand times more markers, which may require attempting to pare a dataset down to a few hundred markers selected at random from the genome.
The Merlin and PLINK software packages can perform pairwise IBD probability (P IBD) estimations of the probability that two individuals share 0, 1, or 2 alleles IBD. The PLINK output is a per individual P IBD estimation, along with a summary of IBS counts for each IBS type across all chromosomes. This output can be imported into SNPduo++ to generate mean and standard deviation of IBS calculations. Merlin is capable of generating single-point P IBD estimations for many markers, but all such probability calculations rely on accurate assumptions about allele frequencies in a test population. The use of summary IBS in SNPduo++ and single-point IBS calculations in SNPduo avoids these population assumptions.
In conclusion, SNPduo offers several advantages in dataset analysis. It is designed to work with datasets that contain hundreds of thousands to millions of SNPs. The SNPduo++ source code is freely available for local download and implementation on a variety of computer architectures. The scripts for the web based SNPduo are also available for local implementation, though they are not portable to architectures other than Linux at this time. There are many available tools for analyzing SNP data in populations (e.g. RELPAIR, GRR, Merlin, PLINK). The SNPduo tools are designed to complement other available tools for large-scale data analysis without replacing their functionality, as well as provide visualization tools for smaller scale analysis. The web based SNPduo allows for researchers who are not programmers to directly export data from standard genotyping software (e.g. from Affymetrix, Illumina, or HapMap bulk data downloads), upload it to a website, and obtain both tabular and visual output.
The tracks visualized on the SNPduo web output have low noise given accurate genotype calls. The simple density segmenting algorithm for finding blocks of IBS type (see Supplementary Information S1) relies on this high signal to noise ratio; it is most well-suited to sibling data where there are a variety of discrete regions of different track types. More sophisticated models, such as a Hidden Markov Model or Circular Binary Segmentation, could be implemented as well. However, the density algorithm is faster and less computationally intensive for high-density data than model-based alternatives.
The SNPduo web tool is useful for visualizing meiotic crossover points that occurred in siblings. Recent publications examining recombination use two [25], [26] or three [27] generations of a family to discover the location of a recombination event. SNPduo uses a single generation of siblings, which is more practically attainable than two or three generations. However, there are limitations in using only one generation. SNPduo can visualize the position of crossovers in siblings, but is unable to phase which sibling's chromosome the crossover occurred on, or to report which parent the crossover was derived from. Therefore it is useful in determining the position of crossovers, but not in determining the interference distance between crossovers on a chromosome or sex-specific recombination rates, since both require allele phasing.
Methods
SNPduo: web implementation
The SNPduo web tool consists of scripts that manipulate input into the proper format, determine the single-point IBS state of each SNP in a comparison, and generate the SNPduo output, which includes web tables, downloadable files, and images. The tool is web-accessible at http://pevsnerlab.kennedykrieger.org/SNPduo.
The source scripts are available for download from the SNPduo website. Local implementation of the SNPduo web tool requires an appropriate operating system, required software packages, and some code customization (for file and directory locations). The CGI script that parses input data, triggers system commands, performs text manipulations, converts image formats, and outputs the results is written in Perl. SNP level analysis is performed using R functions interfaced with compiled C code; postscript images are created using R as well. IBS classification is performed according to the SNPduo schema (Table 1). All of the R and C code used for processing IBS analyses were custom generated for this tool and not ported from existing packages.
Table 1. SNPduo IBS Calling Schema.
Genotype 1 | Genotype 2 | Identity-by-State |
AA | AA | 2 |
AA | AB | 1 |
AA | BB | 0 |
AA | No Call | NA |
AB | AA | 1 |
AB | AB | 2 |
AB | BB | 1 |
AB | No Call | NA |
BB | AA | 0 |
BB | AB | 1 |
BB | BB | 2 |
BB | No Call | NA |
No Call | AA | NA |
No Call | AB | NA |
No Call | BB | NA |
No Call | No Call | NA |
Genotype 1 and 2 are the genotypes of the two individuals being compared. Genotypes are shown in the binary A/B genotype convention. Uncalled alleles in either individual are uninformative of IBS and are not plotted. Abbreviation: NA, not applicable.
Data are uploaded via a web form. The upload form gives customizable options including how columns are delimited, the data platform, the columns of individual's data to compare, the chromosome(s) of interest, output page size, the genome build the data are based on, and whether to generate PNG images and Postscript files of the output. Analysis can be run on two individuals, or a batch of all possible comparisons can be performed on a group of columns. It should be noted that the number of operations increases nearly exponentially with each individual added to batch mode operations. Comparisons that include image generation are best suited for comparisons between up to five to ten individuals due to the length of time required for image creation. A single chromosome, a subset of chromosomes, or all included chromosomes can be examined in separate plots or in a combined genome plot. Data are delivered on an output page that shows a PNG, if created, and provides download links to all of the generated files. These include an output file with the chromosome, position, genotypes, and IBS classification of every SNP, a BED file of shared segments for display on the UCSC Genome Browser [17] and a publication quality postscript of the data visualization, if selected. Included on the output page are tables that show summary data for the genotype distribution and IBS class distribution for the analyzed data.
For larger datasets (tens of individuals) where all possible pairwise comparisons are desired without image generation, the tabulate option, which is more optimized for speed, can be used. The results are two summary files. The first lists the individual, the chromosome, and the count of each genotype class. The second file lists the two individuals compared, the chromosome, and the count of each IBS class. The data are also displayed on the result page as two separate tables.
For datasets with tens of individuals segmenting may also be accomplished in a reasonable amount of time if no images are generated (by deselecting all image generation options during batch mode). For example, segmenting all pairwise comparisons of 25 individuals (n = 300) for ∼50,000 markers on one chromosome required approximately 7 minutes on a server with 3.0 GHz Xeon processors, and did not trigger a browser timeout.
SNPduo++: command-line implementation
SNP datasets with high-density, genome-wide coverage for many individuals (hundreds to thousands) are intractable via a web interface using interpreted code. The time spent uploading the file, generating images, and performing computations would cause the connection to timeout. However, optimized and compiled code run locally is well-suited suited to this task.
A version of the SNPduo algorithm was written in C++ to make the SNPduo++ program. The command-line program accepts either pedigree format files, including a “.ped” and a “.map” file, or a transposed pedigree format, including a .tped and a .tfam file. These formats were chosen because they are the same input formats used by the popular analysis tool PLINK. A ped file requires seven columns containing family ID, individual ID, father ID, mother ID, sex (integer format), and phenotype (integer format), followed by individual alleles of each genotype. The map files include chromosome, RSID, genetic distance (optional), and physical position. The transposed format puts family information in a “.tfam” file, while annotation and genotypes go in a “.tped” file. The tfam file consists of the first seven columns found in a ped file. The tped file contains the columns of a map file, followed by genotypes.
When running the command-line version of SNPduo++, an input source is specified along with one or more analysis or recoding options. Data are read in from one of the supported file formats into memory for analysis. The selected analysis options are performed, and output is written to disk in files. Since the command-line version is written in C++ it is highly portable, and should run everywhere a C++ compiler works, i.e. Windows, Mac, and Linux platforms in both 32-bit and 64-bit environments. It should be noted that SNPduo++ seeks an end of line character (\n) to delimit new lines, ignoring carriage returns (\r). Mac users may have to convert end of line carriage returns to newline characters for the tool to parse files correctly.
Memory in excess of 4 Gb can be used if the code is compiled in a 64-bit environment. However, genotypes are stored in specialized vectors to reduce memory usage. As a result, a file greater than 4 Gb on disk can be represented at ∼500 Mb in memory. A recoding switch, --recode --webduo, will generate a text file compatible with Custom Format of the web implementation of SNPduo to explore datasets further. For SNPduo++ (as for PLINK), the genotypes of a given SNP are specified by any two alphanumeric characters, e.g. A/B format, 1/2 format, A/C/G/T format, etc. The --webduo option converts the read alleles to the A/B format to allow analysis of the genotype data with the web-based SNPduo that requires A/B coded genotypes.
Benchmarking SNPduo++ and PLINK
We benchmarked the performance of PLINK (v1.03) and SNPduo++ (v1.01b) to compare functionality and speed. Tests were performed on a server with four 3.00 GHz Xeon processors and 6 Gb of RAM running Red Hat Linux kernel 2.6.9–78.0.1.ELsmp. Both programs were compiled from source code using make (v3.80) and g++ (v3.4.6). Compile flags used optimization level three, and march / mtune with the parameter “nocona”. Benchmarking data was the AGRE Affymetrix autism family dataset analyzed elsewhere in this paper, in the standard ped / map file formats. Each program was tested for the ability to analyze all possible comparisons of individuals in the dataset (n = 4,154,403). The PLINK --genome finished in 39 hours 25 minutes. The performance of SNPduo++ to count IBS states between individuals was benchmarked using the --counts switch. The analysis completed in 12 hours 5 minutes, an approximately 70% shorter run time. Since SNPduo++ does not perform marker sorting, IBD calculations, marker missingness calculations, or tests of Hardy-Weinberg equilibrium it can perform IBS counting operations much more quickly. As such, SNPduo++ is not a replacement for PLINK functionality, but instead is software optimized for specific SNP calculations.
Analysis of HapMap data
The genotypes of 270 HapMap individuals genotyped on the Affymetrix 500K platform using the BRLMM algorithm were downloaded from the Affymetrix website (http://www.affymetrix.com/support/technical/sample_data/500k_hapmap_genotype_data.affx, accessed March 26, 2008). The genotypes were coded into a numeric format and stored into a local MySQL database. The SNPduo R code was used to analyze the data. The pedigree data from HapMap were used to select all individuals with both parents designated as 0, i.e. unrelated individuals (n = 210). On a chromosome by chromosome basis these individual's genotypes were queried from the database to be used as input genotypes, and the counts of each IBS class for each comparison were tallied. In total this involved 21,945 inter-individual comparisons for 23 chromosomes.
The most closely related individuals were defined by calculating the mean and standard deviation of IBS state across all autosomes for each comparison. The mean and standard deviation data were visualized in the Partek Genomics Suite software (Partek Incorporated, St. Louis, MO). Data were imported such that the individuals compared were text labels, the comparison group was a categorical factor, and the mean and standard deviation were double precision response variables. Outliers were visualized by plotting the mean versus standard deviation for each comparison. The outliers were then individually selected and further studied to define the relationship.
Analysis of Illumina autism data
Illumina 550K autism family data were exported from BeadStudio, including only chromosome, physical position, and genotypes. Data were numerically coded into a 0, 1, 2, 3 convention representing NC, AA, AB, and BB, respectively. Converted data were stored in a MySQL database.
Counts of each IBS state were determined for each chromosome. The counts were added for all autosomes, and used to calculate the autosomal mean and standard deviation of IBS. The per-chromosome counts were visualized and examined for outliers (reported in results) using the R Statistical Programming Language and Partek Genomics Suite.
Availability and requirements
SNPduo++ is platform independent; the SNPduo webtool requires Linux for local installation, and the webtool is accessible from any browser. SNPduo was programmed in Perl, R, C/C++, PHP, and is available without restrictions.
Supporting Information
Acknowledgments
We thank Jason Ting for his advice on debugging the CGI code, and N. Varg for helpful comments on this manuscript. We thank Dr. Ingo Ruczinski for his advice on statistical treatments of data and comments on the manuscript, and Dr. Donald Coppock for providing helpful comments.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: J.P. was supported by NIH grant HD24061. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.DeWan A, Liu M, Hartman S, Zhang SS-M, Liu DTL, et al. HTRA1 Promoter Polymorphism in Wet Age-Related Macular Degeneration. Science. 2006;314:989–992. doi: 10.1126/science.1133807. [DOI] [PubMed] [Google Scholar]
- 2.Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, et al. Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PIW, et al. Two independent alleles at 6q23 associated with risk of rheumatoid arthritis. Nat Genet. 2007;39:1477–1482. doi: 10.1038/ng.2007.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, et al. TRAF1-C5 as a Risk Locus for Rheumatoid Arthritis – A Genomewide Study. N Engl J Med. 2007;357:1199–1209. doi: 10.1056/NEJMoa073491. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, et al. Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proceedings of the National Academy of Sciences. 2008;105:4340–4345. doi: 10.1073/pnas.0800441105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Thompson D, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447:1087–1093. doi: 10.1038/nature05887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 2007;39:870–874. doi: 10.1038/ng2075. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Stacey SN, Manolescu A, Sulem P, Rafnar T, Gudmundsson J, et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet. 2007;39:865–869. doi: 10.1038/ng2064. [DOI] [PubMed] [Google Scholar]
- 9.Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. Journal of Clinical Investigation. 2008;118:1590–1605. doi: 10.1172/JCI34772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Thomson G, Motro U. Affected sib pair identity by state analyses. Genetic Epidemiology. 1994;11:353–364. doi: 10.1002/gepi.1370110405. [DOI] [PubMed] [Google Scholar]
- 11.Bishop D, Williamson J. The power of identity-by-state methods for linkage analysis. Am J Hum Genet. 1990;46:254–265. [PMC free article] [PubMed] [Google Scholar]
- 12.Kong A, Cox NJ. Allele-Sharing Models: LOD Scores and Accurate Linkage Tests. The American Journal of Human Genetics. 1997;61:1179–1188. doi: 10.1086/301592. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Chakraborty R, Jin L. Pena SDJ, Chakraborty R, Epplen JT, Jeffreys AJ, editors. A unified approach to study hypervariable polymorphisms: Statistical considerations of determining relatedness and population distances. Basel: Birkhauser Verlag. 1993. pp. 153–175 p. [DOI] [PubMed]
- 14.Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2001;30:97–101. doi: 10.1038/ng786. [DOI] [PubMed] [Google Scholar]
- 15.Miller N, Nance M, Wohler E, Hoover-Fong J, Lisi E, et al. Molecular (SNP) analyses of overlapping hemizygous deletions of 10q25.3 to 10qter in four patients: Evidence for HMX2 and HMX3 as candidate genes in hearing and vestibular function. Am J Med Genet A: In press. 2009 doi: 10.1002/ajmg.a.32705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Lisi EC, Hamosh A, Doheny KF, Squibb E, Jackson B, et al. 3q29 interstitial microduplication: A new syndrome in a three-generation family. American Journal of Medical Genetics Part A. 2008;146A:601–609. doi: 10.1002/ajmg.a.32190. [DOI] [PubMed] [Google Scholar]
- 17.Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, et al. The UCSC Genome Browser Database: update 2009. Nucl Acids Res. 2009;37:D755–761. doi: 10.1093/nar/gkn875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Abecasis GR, Cherny SS, Cookson WOC, Cardon LR. GRR: graphical representation of relationship errors. Bioinformatics. 2001;17:742–743. doi: 10.1093/bioinformatics/17.8.742. [DOI] [PubMed] [Google Scholar]
- 19.The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shannon CE. A Mathematical Theory of Communication. The Bell System Technical Journal. 1948;27:379–423, 623-656. [Google Scholar]
- 21.R Development Core Team. 2007. R: A language and environment for statistical computing. 2.6.0 ed. Vienna, Austria.
- 22.The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Ting JC, Roberson EDO, Miller ND, Lysholm-Bernacchi A, Stephan DA, et al. Visualization of uniparental inheritance, Mendelian inconsistencies, deletions, and parent of origin effects in single nucleotide polymorphism trio data with SNPtrio. Human Mutation. 2007;28:1225–1235. doi: 10.1002/humu.20583. [DOI] [PubMed] [Google Scholar]
- 24.Epstein MP, Duren WL, Boehnke M. Improved Inference of Relationship for Pairs of Individuals. The American Journal of Human Genetics. 2000;67:1219–1231. doi: 10.1016/s0002-9297(07)62952-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Coop G, Wen X, Ober C, Pritchard JK, Przeworski M. High-Resolution Mapping of Crossovers Reveals Extensive Variation in Fine-Scale Recombination Patterns Among Humans. Science. 2008;319:1395–1398. doi: 10.1126/science.1151851. [DOI] [PubMed] [Google Scholar]
- 26.Wirtenberger M, Hemminki K, Chen B, Burwinkel B. SNP microarray analysis for genome-wide detection of crossover regions. Human Genetics. 2005;117:389–397. doi: 10.1007/s00439-005-1323-y. [DOI] [PubMed] [Google Scholar]
- 27.Cheung VG, Burdick JT, Hirschmann D, Morley M. Polymorphic Variation in Human Meiotic Recombination. American Journal of Human Genetics. 2007;80:526–530. doi: 10.1086/512131. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.