Small et al. 10.1073/pnas.0700890104.

Supporting Information

Files in this Data Supplement:

SI Text
SI Table 2
SI Table 3
SI Figure 5
SI Figure 6
SI Table 4
SI Table 5
SI Table 6
SI Table 7
SI Figure 7




SI Figure 5

Fig. 5. Distributions of SNP and microindel frequency. (Left) Pink circles show SNP frequency across the haplome alignments; turquoise triangles show the SNP frequency calculated after randomizing the positions within each alignment. The standard deviation of the observed SNP distribution is 0.026; the standard deviation of the randomized alignment is 0.013. (Right) The frequency of 1-bp microindels; green circles show the observed frequency, blue triangles the frequency measured after randomizing alignment positions. The standard deviation of the observed microindel distribution is 1.2 × 10⁻³; in the randomized alignment the standard deviation is 0.8 × 10⁻³.





SI Figure 6

Fig. 6. Comparison of test distributions to simulated distributions. Each chart displays the comparison of 10 individual test simulations at the specified Ne to the full simulation panel. Each colored line represents a single test simulation. Each data point represents the mean similarity of the test distribution to 1,000 independent simulated distributions at the Ne specified on the x axis.





SI Figure 7

Fig. 7. Characteristics of dinucleotide indels. In both plots each dinucleotide is represented by a separate bar. (Left) Blue bars show the percentage of total instances of each dinucleotide that are contained within a 2-bp indel. (Right) The green bars show the ratio of observed instances of a dinucleotide indel occurring adjacent to an identical dinucleotide to that expected given the global dinucleotide frequency. The dashed line marks a ratio of one.





SI Text

Alignment

Alignment Construction.

To separate the original Arachne assembly (1) into two haplomes, a purpose-built alignment pipeline was constructed and applied to the assembly. The pipeline consists of, in brief, an initial anchoring stage in which unique matches between all contigs were identified with BLASTN (http://blast.wustl.edu), followed by pairwise LAGAN (2) alignments of all anchored contig pairs and a binning algorithm to sort supercontig pairs, weighted by their contig LAGAN alignment scores, into allelic bins. Many small, highly repetitive supercontigs (constituting 13% of the bases in the original Arachne assembly) lacked a unique anchor and were therefore not included in the subsequent steps. We suspect that most of these are uncondensed repetitive regions, and hence if fully assembled would represent significantly less than 13% of the genome. Once in bins, the weighted alignments were used to assign each supercontig to a haplotype, A or B. Contigs in each bin were then ordered and oriented with respect to contigs in the opposite haplotype (3), discrepancies in order and orientation of contigs were resolved (and, if pertinent evidence was present, recorded as inversions), and identifiable assembly artifacts were corrected. The ordered and oriented contigs in each haplotype were then concatenated into a single "hypercontig." Hypercontig pairs were aligned with LAGAN to produce the final alignment, and a single reference sequence was parsed from each hypercontig alignment.

The haplome assembly consists of 374 allelic pairs of hypercontigs. The total length of the two haplomes is 336,005,028 positions, of which 12,758,832 positions (3.8%) are N contig break placeholders. LAGAN alignments of each pair of hypercontigs are provided at http://mendel.stanford.edu/SidowLab/Ciona.html.

Alignment Statistics.

The total alignment length of the two haplomes is 213,620,997 positions. We categorized 47,394,759 alignment positions as assembly break regions (described below), indicating that the draft assembly was incomplete in one of the two haplomes, and eliminated these regions from further analyses. A total of 118,881,546 alignment positions contain a base aligned to a base, of which 109,608,915 positions pass our alignment quality filter (described below). The remaining 47,343,769 alignment positions consist of bases aligned to gaps, indicating that they are haplome-specific sequence. The total number of bases (both haplomes) in aligned and haplome-specific positions is 284,873,932. The total number of bases (both haplomes) in assembly break regions is 38,139,160 (12% of input bases).

Alignment Filtering.

We used a conservative filtering procedure for calling SNPs and 1-bp microindels from the aligned haplomes. Each alignment position was analyzed for a SNP only if the position was considered "well aligned," defined as being at the center of a 21-bp window of at least 70% identity (counting mismatches and bases aligned to gaps as nonidentities). This strict criterion prevents inevitably occurring alignment artifacts from inflating the SNP counts. The short window size was chosen to prevent the elimination from analysis of well aligned regions flanking the numerous haplome-specific gaps. In total, 110,513,493 alignment positions across the genomewide alignment were identified as well aligned, of which 109,608,915 represent bases aligned to bases (as opposed to bases aligned to gaps).
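The "well aligned" criterion can be illustrated with a minimal sketch. The function names, the gap character "-", and the treatment of columns too close to an alignment end to have a full window (marked as not well aligned) are assumptions of this example, not details given in the text.

```python
def well_aligned_mask(seq_a, seq_b, window=21, min_identity=0.70):
    """Return one boolean per alignment column.

    A column is True when the `window`-column window centred on it contains
    at least `min_identity` identical columns. Mismatches and base-to-gap
    columns both count as non-identities. Columns without a full window
    are marked False (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    half = window // 2
    # 1 where both rows carry the same base, 0 otherwise (gaps never match).
    ident = [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]
    # Prefix sums make each window total an O(1) lookup.
    prefix = [0]
    for x in ident:
        prefix.append(prefix[-1] + x)
    mask = [False] * n
    for i in range(half, n - half):
        matches = prefix[i + half + 1] - prefix[i - half]
        mask[i] = matches / window >= min_identity
    return mask


def call_snps(seq_a, seq_b, mask):
    """Indices of well-aligned columns in which two different bases are aligned."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if mask[i] and a != "-" and b != "-" and a != b]
```

With a 21-column window, 15 identical columns (71%) pass the filter and 14 (67%) do not.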

Alignment positions were categorized as assembly breaks if an aligned position contained an N, or if a gapped region contained an N in either sequence at any position in the entire gapped region, or if an N was encountered on either side of a gapped region before encountering a well aligned position. All regions distal to the first and last well aligned position in each alignment were also categorized as assembly breaks. These distal regions comprised a total alignment length of 5,675,332 positions, containing 5,194,825 bases.

Coordinate System for Fig. 2.

For ease of display, Fig. 2 is presented in "reftig coordinates," which means that the coordinate system was based on the reference sequence parsed from the alignment, and not on alignment coordinates (see Alignment Pipeline). Because of the numerous haplome-specific indel events, which may or may not be represented in the reference sequence in any given window, reftig coordinates do not correspond linearly to alignment or haplome coordinates. To report polymorphism instances as frequencies in reftig coordinates, we first calculated the number of well aligned positions (center of 21-bp window of >70% identity) in each nonoverlapping 50-kb reftig window and then reported the frequency of SNPs and microindels as the number per well aligned base in that window. This avoids the artifactual drop in SNP frequency that occurs in regions with many haplome-specific insertions if alignment or haplome coordinates are used. No frequency is reported in windows that had <2,000 well aligned positions.
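The normalization by well aligned positions can be sketched as follows. The function name and the input representation (lists of reftig coordinates) are hypothetical; only the 50-kb window size and the 2,000-position threshold come from the text.

```python
from collections import Counter


def windowed_frequency(well_aligned_pos, variant_pos,
                       window=50_000, min_sites=2_000):
    """Map window index -> variants per well-aligned position.

    `well_aligned_pos` and `variant_pos` are reftig coordinates. Windows
    with fewer than `min_sites` well-aligned positions map to None, i.e.
    no frequency is reported for them.
    """
    sites = Counter(p // window for p in well_aligned_pos)
    hits = Counter(p // window for p in variant_pos)
    return {w: (hits[w] / n if n >= min_sites else None)
            for w, n in sorted(sites.items())}
```

Dividing by the number of well aligned positions, rather than by the window length, keeps a window that is mostly haplome-specific insertion from appearing to have a low SNP frequency.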

Polymorphism

SNPs.

SNPs were identified only in aligned positions that passed a strict alignment quality filter (center of 21-bp window of >70% identity).

Finished sequence from seven BACs, comprising random regions of the genome from different individuals collected from San Francisco Bay, was generated in a previous study (1). The seven BACs and the C. savignyi assembly represent the totality of available genomic sequence from C. savignyi. An alignment of each BAC sequence with the two alleles of the sequenced individual comprises a total of 213,775 well aligned positions, which contain 14,386 SNPs. Of these SNPs, 250 (1.74%) are tri-allelic. The nucleotide heterozygosity (θ) in the BAC regions is 0.04486, and the nucleotide diversity (π) is 0.04525, which is in close agreement with the SNP frequency of 0.04485 observed in the haplome alignment of the sequenced individual.
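The reported heterozygosity of 0.04486 is reproduced by Watterson's estimator applied to the counts above, with three sequences per alignment (one BAC plus the two alleles). The text does not name the estimator, so treating it as Watterson's, and counting each tri-allelic SNP as a single segregating site, are assumptions of this sketch.

```python
def watterson_theta(segregating_sites, n_sequences, n_sites):
    """Watterson's estimator per site: S / (a_n * L),
    where a_n = sum of 1/i for i = 1 .. n-1."""
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)


# 14,386 SNPs in 213,775 well-aligned positions, three sequences (a_3 = 1.5).
theta = watterson_theta(14_386, 3, 213_775)
print(f"{theta:.5f}")
```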

Indels.

Indels were parsed directly from the haplome alignment by counting the number and size of alignment gaps in regions that did not contain or border assembly breaks. We estimate that 16.6% of the sequence in the diploid sequenced individual is in an indel. This estimate was derived by dividing the total number of bases in haplome-specific regions of the alignment by the sum of bases in haplome-specific regions and aligned positions. Our genomewide estimate is in broad agreement with estimates from seven allelic pairs of finished fosmids from the sequenced individual (total alignment length of 218 kb), from which we estimate that 18.8% of sequence is haplome-specific, as well as with the seven finished BACs from different individuals, in which we estimate that 15.7% of sequence is haplome-specific.
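As a worked check, the ratio can be evaluated with the counts given under Alignment Statistics. That those exact counts were the inputs to the 16.6% estimate is an assumption of this sketch; each base-to-gap position is taken to contribute one haplome-specific base.

```python
def indel_fraction(haplome_specific_bases, aligned_bases):
    """Fraction of bases that lie in haplome-specific (indel) sequence."""
    return haplome_specific_bases / (haplome_specific_bases + aligned_bases)


HAPLOME_SPECIFIC = 47_343_769   # positions with a base aligned to a gap
TOTAL_BASES = 284_873_932       # bases in aligned plus haplome-specific positions

fraction = indel_fraction(HAPLOME_SPECIFIC, TOTAL_BASES - HAPLOME_SPECIFIC)
print(f"{fraction:.3f}")
```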

An indel was considered repetitive if at least 25% of its sequence was annotated as a repeat. Repeat annotation is described in Materials and Methods. By this criterion, 37% of all indel events, which contain 78% of all haplome-specific sequence, are repetitive. At indel sizes of >150 bp, 80% of all events, which comprise 85% of the sequence in this category, are repetitive. Requiring 80% of an indel to be repeat masked still annotates 61% of sequence in indels >150 bp as repetitive.

Microindels.

We identified 748,230 microindels (gaps of 10 bp or less) between the aligned haplomes. Microindels account for 72% of total indel events but only 1.9% of the total sequence present in indels (Table 1). The microindel size distribution within C. savignyi is strikingly similar to that seen between diverged species (4, 5), which is to be expected given that both distributions should be generated by the same underlying molecular processes (Fig. 1D). Microindels occur at 1/6.5 the rate of SNPs, which is roughly similar to the ratio observed in Drosophila species (6).

We examined microindels of size two in more depth to shed light on the dynamics of the mutational process. Dinucleotide indels have an A/T bias and, in agreement with previous studies between diverged species, exhibit an enrichment for tandem repeats (SI Fig. 7). The largest bias was observed in the dinucleotide TA, which displayed a 2-fold excess of occurrence in dinucleotide indels and a 3.5-fold enrichment for the presence of a neighboring TA. The A/T-rich dinucleotides AA/TT and AT exhibit a 1.4- and 1.3-fold excess of occurrence in dinucleotide indels, respectively. The dinucleotide indels AT, CC/GG, GC, and CG exhibited a >2-fold enrichment for a neighboring identical dinucleotide.

Distribution and Correlation of SNPs and Microindels.

In all analyses comparing SNPs and microindels, microindels of 1 bp were used as a proxy and called with exactly the same procedure as SNPs. This was done to minimize potential biases due to the stringent requirement of 70% identity in 21-base windows. The frequencies of SNPs and microindels (using microindels of size one that occur in well aligned positions as a proxy) are correlated across the C. savignyi genome, with an R² of 0.25 when measured in nonoverlapping windows of 1,000 bp. When the order of alignment positions is randomized, the correlation is substantially smaller (R² = 0.06).
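The windowed correlation and its randomized control can be sketched as below. The inputs are assumed to be per-position 0/1 flags for SNPs and 1-bp microindels along the alignment; the function names, the handling of a trailing partial window (dropped), and the use of raw counts rather than per-well-aligned-base frequencies are simplifications of this example.

```python
import random


def window_counts(flags, window=1000):
    """Sum 0/1 flags in nonoverlapping windows; a trailing partial window is dropped."""
    return [sum(flags[i:i + window])
            for i in range(0, len(flags) - window + 1, window)]


def r_squared(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation undefined; report 0 in this sketch
    return sxy * sxy / (sxx * syy)


def shuffled_r_squared(snp, indel, window=1000, seed=0):
    """Control: permute alignment positions (same permutation for both tracks)."""
    order = list(range(len(snp)))
    random.Random(seed).shuffle(order)
    return r_squared(window_counts([snp[i] for i in order], window),
                     window_counts([indel[i] for i in order], window))
```

Applying one permutation to both tracks preserves any per-position association between the two variant types while removing regional clustering, which is what the windowed comparison is meant to detect.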

Repeats

Repeat Identification.

All repeats were identified with RepeatMasker (http://repeatmasker.org). RepeatMasker was run using WUBLAST with the curated RECON C. savignyi library (see below), and separately with cross-match (www.phrap.org) to identify low complexity and simple repeats. A total of 32% of bases in the haplome assemblies were masked.

RECON Repeat Library.

A de novo repeat library for C. savignyi was created by Zhirong Bao and Sean Eddy using their RECON (7) program. RECON libraries have been used in several genome projects including C. briggsae (8) and Gallus gallus (9), in the analysis of mobile elements in the rice genome (10, 11) and to identify repeats in mammalian sequence (12). RECON is designed to identify repeat element families in a completely unannotated assembly, but does not provide information on the identity of elements. A subset of the C. savignyi RECON library has been annotated by Arian Smit and colleagues and deposited in the Repbase Database (13).

RECON was run on the original, redundant, Arachne assembly, with parameters set to identify all repeat element families of copy number 20 or more (which corresponds to 10 elements per haplome). RECON identified 1,072 repeat elements, ranging in size from 41 to 8,016 bp, with a median length of 778 bp. Copy number ranged from 20 to 144,713, with a median of 159 copies. The 144,713 copy element is 289 bp in length and has been identified as a SINE by the Repbase Database (13).

Because RECON identifies elements that are present in multiple instances within a genome, it will also identify multicopy gene families. To eliminate these functional sequences from the repeat library, we curated the C. savignyi library by blasting all elements against SwissProt (14) and by blasting human tRNA and rRNA sequences against the C. savignyi library. After manual inspection of all SwissProt blast hits, 77 elements were removed from the library based on hits to known genes. These included cytochrome, EGF families, actin, tubulin, myosin, and multiple zinc finger proteins. Twenty additional elements were removed on the basis of the tRNA blast, and seven on the basis of the rRNA blast. The 104 removed elements represent 2.42% of the total elements in the library, as determined by copy number, and 0.93% of the total base pairs in the library (copy number × length). The curated version of the RECON C. savignyi library is available at http://mendel.stanford.edu/SidowLab/Ciona.html.

Calculating the Population Recombination Parameter, ρ

Data.

The seven BACs cited above were aligned to the two alleles of the sequenced individual, and the three-way alignment was used to estimate haplotype block lengths (measured in number of concordant SNPs). Across the seven regions, the mean haplotype block length is eight SNPs and the longest is 133 (Fig. 3a). The 250 tri-allelic SNPs were not included in the analysis for technical reasons. As a control, we randomized the order of positions within each region and reparsed haplotype blocks. In the randomized alignment there was virtually no haplotype structure: the mean haplotype block length was 1.5 and the longest block was 10 (Fig. 3a).
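One way to parse blocks from a three-way alignment is sketched below. The text does not define "concordant" precisely; this example assumes that two neighboring bi-allelic SNPs are concordant when the same one of the three sequences carries the odd allele, and that a block is a maximal run of such SNPs. The function names are hypothetical.

```python
from itertools import groupby


def snp_pattern(column):
    """Index (0, 1 or 2) of the sequence carrying the odd allele at a
    bi-allelic three-sequence column; None if the column is invariant,
    tri-allelic, or contains a gap."""
    if len(column) != 3 or "-" in column or len(set(column)) != 2:
        return None
    for i, base in enumerate(column):
        if column.count(base) == 1:
            return i
    return None


def block_lengths(columns):
    """Lengths, in SNPs, of maximal runs of consecutive SNPs that share a
    pattern (the concordance definition is an assumption of this sketch)."""
    patterns = [p for p in map(snp_pattern, columns) if p is not None]
    return [len(list(run)) for _, run in groupby(patterns)]
```

The randomized control corresponds to shuffling `columns` before calling `block_lengths`; under this definition, shuffled data give short blocks because neighboring SNPs no longer share ancestry.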

Simulations.

MS (15) is a coalescent-based program that generates independent replicate samples under a Wright-Fisher neutral model of evolution. It has been used extensively to test methods of calculating population parameters (16-21) and to compare different methods of estimating ρ (22, 23). We therefore deemed the use of MS for generating simulated data to be appropriate for our purposes.

A total of 1,000 independent replicates were generated with MS for values of ρ (where ρ = 4Nec) corresponding to each of the following values of Ne (in millions): 0.1, 0.25, 0.5, 1, 1.5, 2, 2.5, 3, 5, 7, 10. A smaller number of samples were generated for Ne of 4 million, 4.5 million, 15 million, and 20 million. To match the observed data, the length of each simulated sequence was set to 25,000, and an n of three sequences was specified for each run. Theta (heterozygosity) was set to 0.045 for all replicates.
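A run with these settings might be invoked roughly as sketched below. MS takes its mutation and recombination parameters for the whole locus (`-t` theta and `-r` rho nsites), so this example assumes that the stated theta of 0.045 is per site and scales both parameters by the locus length. The exact command lines used are not given in the text, and the per-site recombination rate in the usage line is a purely hypothetical value.

```python
def ms_command(ne, c_per_site, nsam=3, nreps=1000,
               nsites=25_000, theta_per_site=0.045):
    """Build an argument list of the form: ms nsam nreps -t theta -r rho nsites.

    theta and rho are locus-wide: theta = theta_per_site * nsites and
    rho = 4 * Ne * c * (nsites - 1). The scaling is an assumption of this sketch.
    """
    theta_locus = theta_per_site * nsites
    rho_locus = 4 * ne * c_per_site * (nsites - 1)
    return ["ms", str(nsam), str(nreps),
            "-t", f"{theta_locus:.6g}",
            "-r", f"{rho_locus:.6g}", str(nsites)]


# Hypothetical per-site recombination rate; not a value from the text.
print(" ".join(ms_command(ne=1_000_000, c_per_site=1e-8)))
```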

The recombination model in MS assumes a constant rate of recombination. This is an unrealistic assumption given that our independent samplings of the C. savignyi per site recombination rate from five regions of the genome vary by 5-fold. To account for this discrepancy, each simulation replicate was run separately under each of the five per site recombination rate estimates and the resultant haplotype block length counts were combined across the five simulations to produce a single frequency distribution.

To test the reproducibility of the metric, 10 additional sets of simulations were generated for each value of Ne and compared to the same panel of simulated distributions as the observed data (SI Fig. 6).

Measuring Similarity Between Simulations and Actual Data

Haplotype block length frequency distributions were constructed by tallying the number of SNPs contained in all blocks of each length and dividing by the total number of SNPs in the set (Fig. 3b). A set was defined as all seven observed regions or five independent MS simulations, with one simulation for each of the five sampled values of the per-site recombination rate. Block lengths of >20 were condensed into one category, as the sampling in the observed data is incomplete at longer lengths. Distance between distributions was calculated as the sum of the absolute value of the difference in frequency at each block length.
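The distribution and distance described here can be written as a short sketch. The function names are hypothetical, and pooled blocks longer than 20 SNPs are stored under the key 21 for convenience.

```python
from collections import Counter


def block_distribution(block_lengths, cap=20):
    """Fraction of SNPs that fall in blocks of each length.

    Blocks longer than `cap` SNPs are pooled into one category (key cap + 1).
    """
    snps = Counter()
    for length in block_lengths:
        snps[min(length, cap + 1)] += length  # each block contributes its SNPs
    total = sum(snps.values())
    if total == 0:
        raise ValueError("no SNPs in set")
    return {k: snps[k] / total for k in range(1, cap + 2)}


def distance(p, q):
    """Sum over block lengths of the absolute difference in frequency."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

For a simulated set, the block lengths from the five runs (one per recombination rate) would be concatenated before calling `block_distribution`. The distance ranges from 0 for identical distributions to 2 for distributions with no block length in common.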

Defining Gene Locations

We identified C. savignyi genes by similarity to JGI Ciona intestinalis v2.0 gene models (http://genome.jgi-psf.org/Cioin2). All C. intestinalis gene models were aligned by TBLASTN against the C. savignyi two-haplome assembly, which contains two reconstructed alleles of most regions separated into A and B. The haplome alignment was used to map positions between the A and B alleles.

The C. intestinalis v2.0 gene set contains 14,548 gene models comprising 6,533,453 codons. A total of 12,843 gene models had at least one good TBLASTN hit in the two-haplome C. savignyi assembly. We eliminated 2,858 gene models that had a hit in only one haplome or whose top two hits did not map to the exact same alignment position in both haplomes as determined by the haplome pairwise alignments. Individual HSPs were eliminated if they were <15 aa long, contained a gap in C. intestinalis or either C. savignyi haplome, or overlapped another HSP with a higher bit score. A total of 51,987 HSPs (a total of 8.6 Mb) passed our filtering procedure. Of the 2,885,017 aligned codons in this set, 92.8% were identical, 5.94% had a synonymous change, and 1.15% had a nonsynonymous change (SI Table 4). A matrix with counts of all aligned codons is provided in SI Table 5, and a translated matrix of aligned amino acid counts in SI Table 6.
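Classifying a pair of aligned codons from the two haplomes as identical, synonymous, or nonsynonymous can be sketched as follows. The sketch assumes the standard nuclear genetic code and compares only the encoded amino acids; how codon pairs with more than one nucleotide difference were treated is not stated in the text.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def classify(codon_a, codon_b):
    """Compare the codons found at one aligned position in haplomes A and B."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"
```

Tallying the three outcomes over all aligned codons gives percentages of the kind reported above.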

Annotating genes by orthology may bias our data set toward more conserved genes, and hence bias our estimate of πA/πS downward. However, we expect this effect to be minor for three reasons: (i) The chimp genome paper (24) reports only a minor drop of 0.03 in πA/πS between human and chimp sequence when calculated with 13,454 human/chimp 1:1 unambiguous orthologs vs. 7,043 human/chimp/mouse/rat 1:1:1:1 unambiguous orthologs. (ii) The number of codons in our analysis of C. savignyi is almost one half of the total codon count in C. intestinalis. (iii) The estimate is similar to the πA/πS ratio we calculated using annotations from the recent Ensembl C. savignyi gene set (http://www.ensembl.org/Ciona_savignyi).

1. Vinson JP, Jaffe DB, O'Neill K, Karlsson EK, Stange-Thomann N, Anderson S, Mesirov JP, Satoh N, Satou Y, Nusbaum C, et al. (2005) Genome Res 15:1127-1135.

2. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S (2003) Genome Res 13:721-731.

3. Sundararajan M, Brudno M, Small KS, Sidow A, Batzoglou S (2004) Lecture Notes Comput Sci 3240:326-337.

4. Cooper GM, Brudno M, Stone EA, Dubchak I, Batzoglou S, Sidow A (2004) Genome Res 14:539-548.

5. Petrov DA, Hartl DL (1998) Mol Biol Evol 15:293-302.

6. Sinha S, Siggia ED (2005) Mol Biol Evol 22:874-885.

7. Bao Z, Eddy SR (2002) Genome Res 12:1269-1276.

8. Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al. (2003) PLoS Biol 1:e45.

9. Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MA, Delany ME, et al. (2004) Nature 432:695-716.

10. Jiang N, Bao Z, Zhang X, Hirochika H, Eddy SR, McCouch SR, Wessler SR (2003) Nature 421:163-167.

11. Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Nature 431:569-573.

12. Margulies EH, Maduro VV, Thomas PJ, Tomkins JP, Amemiya CT, Luo M, Green ED (2005) Proc Natl Acad Sci USA 102:3354-3359.

13. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005) Cytogenet Genome Res 110:462.

14. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al. (2003) Nucleic Acids Res 31:365-370.

15. Hudson RR (2002) Bioinformatics 18:337-338.

16. Crawford DC, Bhangale T, Li N, Hellenthal G, Rieder MJ, Nickerson DA, Stephens M (2004) Nat Genet 36:700-706.

17. Fearnhead P, Donnelly P (2001) Genetics 159:1299-1318.

18. Hey J, Wakeley J (1997) Genetics 145:833-846.

19. Kuhner MK, Yamato J, Felsenstein J (2000) Genetics 156:1393-1401.

20. McVean G, Awadalla P, Fearnhead P (2002) Genetics 160:1231-1241.

21. Wakeley J (1997) Genet Res 69:45-48.

22. Wall JD (2000) Mol Biol Evol 17:156-163.

23. Smith NG, Fearnhead P (2005) Genetics 171:2051-2062.

24. Chimpanzee Sequencing and Analysis Consortium (2005) Nature 437:69-87.