Frank et al. 10.1073/pnas.0706625104.

Supporting Information

Files in this Data Supplement:

SI Table 2
SI Figure 6
SI Table 3
SI Figure 7
SI Figure 8
SI Figure 9
SI Table 4
SI Figure 10
SI Figure 11
SI Figure 12
SI Table 5
SI Table 6
SI Table 7
SI Table 8
SI Figure 13
SI Text
SI Table 9




SI Figure 6

Fig. 6. Distribution of operational taxonomic units (OTUs). SSU rRNA sequences were clustered into OTUs at a range of phylogenetic depths (% identity). Bars represent the total number of OTUs observed for each % identity cutoff used to assign sequences to clusters. 99% OTU and 97% OTU clusters correspond, respectively, to liberal and conservative definitions of species.





SI Figure 7

Fig. 7. Comparison of SSU rRNA sequences isolated from colon and small intestine specimens. The phylogenetic trees depict classification of sequences isolated from non-IBD colon (Left) and non-IBD small intestine (Right) samples. Numbers within wedges represent the proportion of sequences within each sample set (i.e., colon or small intestine) that were assigned to a particular clade (values <2% are omitted). Wedge widths represent the taxa with the longest (top bar) and shortest (bottom bar) distances within the clade. Wedge areas represent the number of taxa in each clade. The scale bar represents base changes per site.





SI Figure 8

Fig. 8. Prevalence of 97% OTUs in samples. Heights of bars depict the percentage of samples in a particular diagnostic category in which a particular phylogenetic group was detected by sequence analysis. NI, non-IBD controls; UC, ulcerative colitis; CD, Crohn's disease; IBD-Sub and CON-Sub denote classification of samples into the nominal IBD-subset or Control-subset (see text).





SI Figure 9

Fig. 9. Principal components analysis (PCA). In these analyses, OTUs were encoded as presence/absence data for each sample and the corresponding data matrix subjected to PCA. Each circle, which is representative of a single sample and shaded according to disease status, is plotted along the first two principal component axes. Analyses were performed for data clustered at the 99%, 97%, 95%, 90%, and 85% levels.





SI Figure 10

Fig. 10. Agglomerative hierarchical clustering of samples. Samples were clustered on the basis of their first two principal component scores. Samples are color-coded by disease status. A cluster corresponding to the nominal IBD subset, which was identified by principal components analysis, is denoted by the bar labeled IBD-subset. The leaves (rectangles) marked by circles were grouped with the IBD subset when clustering of raw OTU presence/absence data was performed (SI Fig. 11).





SI Figure 11

Fig. 11. Agglomerative hierarchical clustering of samples. OTUs were encoded as presence/absence data for each sample and the corresponding data matrix subjected to clustering of samples, based on similarity of OTU data. Samples are color-coded by disease status. A cluster corresponding to the nominal IBD subset, which was identified by principal components analysis, is denoted by the bar labeled IBD subset. Leaves marked by circles were not included in the IBD subset, as defined by PCA and clustering based on principal components scores.





SI Figure 12

Fig. 12. Phylogenetic analysis of community structure. Images show the results of UPGMA clustering of environments (i.e., diagnostic categories, indicated to the left of each tree) based on the UniFrac metric (17). Jackknife resampling (1,000 replicates) was performed to assess the statistical support for the topology of each tree. Nodes with high support are indicated by the fraction of trees in which the nodes were recovered (>70: >70% to <90%; >90: >90% to <99.9%; >99.9: >99.9% to 100%). NI, non-IBD controls; UC, ulcerative colitis; CD, Crohn's disease; Con-subset and IBD-subset denote classification of samples based on exclusion from, or inclusion in, the nominal IBD subset (see text for details). ABN, abnormal gross pathology; N, normal gross pathology; NA, no data for pathology (two libraries).





SI Figure 13

Fig. 13. Microbial diversity of the human gastrointestinal tract. OTU richness was estimated for a data set that combined sequences obtained in this study with those of Eckburg et al. (18) and Ley et al. (19). Sequences were clustered into OTUs, and averaged collector's curves were simulated by 1,000 random replicate samplings of the OTUs for each sample size, ranging from 1,000 to 45,000 sequences (total sequences = 45,172). (A) Plots of sample size vs. the Chao1 estimate of OTU richness (S_Chao1) for sample sets clustered at 99%, 97%, and 95% sequence identity thresholds. (B) Double reciprocal plots, similar to Lineweaver-Burk analysis of enzyme kinetics, of mean S_obs^-1 vs. Sequences^-1 for 99%, 97%, and 95% OTUs. The Y intercepts of the graphs were extrapolated by linear regression of S_obs^-1 on Sequences^-1 using Microsoft Excel, from which expected OTU richnesses (i.e., S_max) were inferred. The dotted lines extending from each graph approximate the extrapolations.
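The double reciprocal extrapolation described for panel B can be sketched as follows. This is a minimal illustration, not the authors' Excel worksheet: the function names are hypothetical, and the input values are synthetic points generated from a saturating curve with a known plateau of 500 OTUs, chosen only so that the recovered intercept can be checked. The regression is ordinary least squares of 1/S_obs on 1/(number of sequences), and S_max is taken as the reciprocal of the Y intercept.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_smax(sample_sizes, mean_sobs):
    """Regress 1/S_obs on 1/N and return 1/intercept as the S_max estimate."""
    inv_n = [1.0 / n for n in sample_sizes]
    inv_s = [1.0 / s for s in mean_sobs]
    _slope, intercept = fit_line(inv_n, inv_s)
    return 1.0 / intercept


# Synthetic collector's curve (NOT data from the study): S = 500 * N / (2000 + N)
sizes = [1000, 5000, 10000, 20000, 45000]
sobs = [500.0 * n / (2000.0 + n) for n in sizes]
smax = estimate_smax(sizes, sobs)
print(round(smax, 3))
```

Because the synthetic curve is exactly hyperbolic, the reciprocal plot is exactly linear; real collector's curves will deviate from linearity, so the extrapolated value should be read as an approximation.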





SI Text

Patients.

IBD was diagnosed on the basis of combined gross and microscopic features. Control subjects were recruited from patients with diverse non-IBD conditions: abscess (1), cancer (38), chronic constipation (1), diverticulosis/diverticulitis (5), hernia (1), ischemic colitis (1), Meckel's diverticulum (1), nonspecific inflammatory adhesions (1), perforation (2), pneumatosis intestinalis (1), polypoid granulations (1), radiation proctitis (1), rectal prolapse (2), transplant (2), or ulcers (3). All surgery was performed at Mt. Sinai School of Medicine. The protocol was approved by the Institutional Review Boards of Mt. Sinai School of Medicine and the University of Colorado.

Sample Collection.

Resected tissue samples (1.5 × 1.5 cm) were collected and placed in 15-ml sterile plastic tubes along with 10 ml of 70% ethanol, 15% buffer-saturated phenol (pH 8; Sigma-Aldrich, St. Louis, MO), and 15% H2O (molecular-biology grade; Sigma-Aldrich). Samples were shipped on dry ice to Boulder, CO, where they were stored at -80°C until further processing.

DNA Extraction.

Mixed-community genomic DNAs were isolated from tissues by a rigorous solvent and mechanical DNA extraction protocol. All steps were performed in a laminar flow hood decontaminated by UV light. Tissue samples were rinsed in 3 ml of sterile TE (10 mM Tris-Cl, pH 8.0/1 mM EDTA; Sigma-Aldrich) plus 0.15 M NaCl and then finely minced with sterile scalpel blades. Approximately 200 mg of each tissue sample was placed in 2-ml microcentrifuge tubes to which 700 µl of Buffer A (200 mM NaCl/200 mM Tris-Cl, pH 8.0/20 mM EDTA/5% SDS), 500 µl of phenol (pH 8; Sigma-Aldrich), and 0.5 g of zirconium beads (0.1 mm; Biospec Products, Bartlesville, OK) were added. The samples were agitated in a Mini Beadbeater-8 (Biospec Products) on the highest setting for 8 min and then subjected to centrifugation (13,000 rpm) for 5 min. The aqueous phase was reextracted twice with phenol/chloroform and once with chloroform. DNA was precipitated by addition of 0.5 volumes of 7.5 M ammonium acetate and 1.5 volumes of isopropanol followed by centrifugation (13,000 rpm) for 30 min. DNA pellets were washed with 70% ethanol, dried in a laminar flow hood, and resuspended in 150 µl of sterile TE. Aliquots of each DNA sample were diluted 1:10 in sterile H2O. All DNA extraction and PCR amplification were conducted by D.N.F. in the laboratory of N.R.P. Although samples were not processed simultaneously, frozen aliquoted reagents were used for DNA extractions and PCR amplifications to minimize sample-to-sample variation. Reagent-only negative extraction controls were included in each batch of DNA extractions.

rRNA Library Construction.

SSU rRNA genes were amplified from DNA samples by PCR with primers specific for all bacterial SSU rRNA genes: 8F (5'-AGAGTTTGATCCTGGCTCAG) and 805R (5'-GACTACCAGGGTATCTAAT). Each 30-µl PCR included 3 µl of 10× PCR buffer, 2.25 µl of dNTP mix (2.5 mM each dNTP), 1.5 µl of 50 mM MgCl2, 37.5 ng of each primer, 1.5 µl of genomic DNA lysate, and 1 unit of Taq polymerase (Biolase polymerase; Bioline USA, Boston, MA). Negative-extraction and no-template controls were run in parallel with each set of PCRs to control for exogenous contamination.

Initial tests of a variety of PCR protocols indicated that a "Touchdown" PCR protocol (1) produced the most reproducible results, presumably due to the dominance of human genomic DNA in the samples. Consequently, all PCR libraries were generated by using the following touchdown PCR protocol: (i) initial denaturation at 94° for 4 min; (ii) 20 cycles of 94° (30 sec), 65°-45° (1° decrement in annealing temperature per cycle; 30 sec), 72° (60 sec); (iii) 14-20 cycles of 94° (30 sec), 45° (30 sec), 72° (60 sec); and (iv) 72° (20 min). Initially, all samples were subjected to 14 amplification cycles in the third phase of the protocol. If insufficient product was generated under these conditions to permit cloning of a particular sample, the PCR was repeated with the third phase of the touchdown protocol incremented by two cycles. Samples that failed to amplify after incrementing to 20 cycles (40 cycles total) were judged to be negative. No correlations were evident between the number of cycles needed to amplify product and the composition of the rRNA sequence libraries that were constructed.
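The cycling program above can be written out step by step as a sketch. The function name is hypothetical, and one reading of the protocol is assumed: annealing starts at 65° and drops by 1° in each of the 20 touchdown cycles (so the last touchdown cycle anneals at 46°), after which the third phase anneals at 45°. The retry rule (start at 14 cycles in the third phase, add two cycles per repeat, stop at 20) is shown as a simple range.

```python
def touchdown_schedule(final_phase_cycles=14):
    """Return the program as a list of (step, temperature_C, seconds) tuples."""
    if final_phase_cycles not in range(14, 21):
        raise ValueError("the third phase runs for 14-20 cycles")
    steps = [("initial denaturation", 94, 4 * 60)]
    # Phase ii: 20 touchdown cycles, annealing temperature drops 1 degree per cycle.
    for cycle in range(20):
        steps.append(("denature", 94, 30))
        steps.append(("anneal", 65 - cycle, 30))
        steps.append(("extend", 72, 60))
    # Phase iii: fixed annealing temperature of 45 degrees.
    for _ in range(final_phase_cycles):
        steps.append(("denature", 94, 30))
        steps.append(("anneal", 45, 30))
        steps.append(("extend", 72, 60))
    steps.append(("final extension", 72, 20 * 60))
    return steps


schedule = touchdown_schedule()
anneal_temps = [temp for step, temp, _secs in schedule if step == "anneal"]
# Third-phase cycle counts tried in turn when a sample yields too little product.
retry_cycle_counts = list(range(14, 21, 2))
print(anneal_temps[:3], len(anneal_temps), retry_cycle_counts)
```

With the maximum of 20 third-phase cycles, the program contains 40 amplification cycles in total, matching the limit stated in the text.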

Both undiluted and 1:10 diluted DNA templates were subjected to duplicate PCRs. PCR products were visualized under low-wavelength UV irradiation after agarose gel electrophoresis (1.5% agarose gel in Tris-borate EDTA stained with ethidium bromide). PCRs were judged to be positive only if bands of the appropriate size were visible for both duplicate samples of the undiluted and/or 1:10 diluted DNA templates. Agarose gel slices encompassing all positive bands of a particular sample were excised with a sterile razor blade (a new blade was used for each sample), and pooled. DNA was purified by using the QIAquick gel extraction kit (Qiagen, Valencia, CA). Genes were cloned into the pCR4-TOPO vector of the Invitrogen TOPO TA Cloning kit and transformed into One Shot TOP 10 competent cells following the manufacturer's instructions (Invitrogen, Carlsbad, CA).

For each clone library, 96 transformants were grown overnight at 37°C in a 96-well culture plate filled with 1.5 ml of 2× YT medium per well. Twenty microliters of each overnight culture was added to 20 µl of 10 mM Tris-Cl (pH 8.0), heated 10 min at 95°C, and centrifuged 10 min at 4,000 rpm in a 96-well plate centrifuge (Eppendorf, Westbury, NY). One microliter of culture supernatant was used as template in a 30-µl PCR with vector-specific primers (T7 and T3 sites). Ten microliters of each PCR product was first treated with the ExoSap-IT kit (USB, Cleveland, OH) and then subjected to cycle sequencing with the Big-Dye Terminator kit (Applied Biosystems, Foster City, CA) following the manufacturers' protocols. Sequencing was performed on MegaBACE 1000 (Amersham Pharmacia Biosciences, Piscataway, NJ) automated DNA sequencers.

Quantitative PCR.

Select groups of microbes were quantified in a subset of the samples by Q-PCR. Thirty-microliter Q-PCRs contained nanograms of each primer, 15 µl of PowerSYBR green PCR Master Mix (Applied Biosystems, Foster City, CA), 5 µg of BSA, 11.5 µl of H2O, and 172-1,920 pg of template DNA (1:100 dilutions of sample genomic DNA preparations). PCR primers used were as follows: (i) total bacteria: 515F (5'-GTGCCAGCMGCCGCGGTAA) and 805R (5'-GACTACCAGGGTATCTAAT); (ii) Bacteroidetes (2): Bac32F (5'-AACGCTAGCTACAGGCTT) and Bac303R (5'-CCAATGTGGGGGACCTTC); (iii) Enterobacteriaceae (3): Eco1457F (5'-CATTGACGTTACCCGCAGAAGAAGC) and Eco1652R (5'-CTCTACGAGACTCAAGCTTGC); and (iv) Lachnospiraceae (4): Ccocc1F (5'-CGGTACCTGACTAAGAAGC) and Ccocc1R (5'-AGTTTYATTCTTGCGAACG). Annealing temperatures were determined empirically by temperature gradient PCR of cognate templates and primers. The cycling protocol for the total bacteria, Enterobacteriaceae, and Lachnospiraceae primer sets was as follows: (i) initial denaturation at 95°C (10 min) and (ii) 45 cycles of 95°C (15 sec), 56°C (15 sec), 60°C (30 sec followed by fluorescence plate read). The cycling protocol for the Bacteroidetes primer set was as follows: (i) initial denaturation at 95°C (10 min) and (ii) 45 cycles of 95°C (15 sec), 60°C (45 sec followed by fluorescence plate read). Denaturation curves were determined from 60°C to 95°C for all products for quality assurance.

Plasmid quantification standards for Q-PCR assays were prepared from representative clones of Bacteroidetes and Lachnospiraceae phylotypes. Plasmids were purified using the HiSpeed Plasmid Midi Kit according to the manufacturer's protocol (Qiagen) and quantified by A260 determination. A 10-fold dilution series, ranging from 1 × 10^8 copies to one copy, was made by serial dilution into H2O. PCRs were performed in 96-well plates using a DNA Engine Opticon System (MJ Research, Incline Village, NV). Each Q-PCR experiment assayed all of the DNA templates under consideration along with duplicate reactions of the plasmid dilution series and multiple negative-template controls. Quantification of template concentrations was made by linear extrapolation of baseline-subtracted data from the plasmid dilution series standard curves. Q-PCR products were analyzed by temperature gradient denaturation profile to qualitatively assess reaction product specificity. PCR products generated in the first experiment for each primer set also were inspected by agarose gel electrophoresis to confirm that amplicons were of the predicted length. We estimate the sensitivity of these PCR assays to be ≈10-20 bacterial cells (or genes) per milligram of resected tissue.
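Quantification against a plasmid dilution series is commonly done by fitting a line to threshold cycle (Ct) vs. log10(copy number) and reading unknowns off that line. The sketch below assumes this Ct-based form of the standard curve, which the text does not spell out; the function names are hypothetical and the Ct values are synthetic (an ideal curve with slope -3.3 and intercept 38), not measurements from this study.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template copies for one reaction."""
    return 10 ** ((ct - intercept) / slope)


# Synthetic 10-fold dilution series from 1e8 copies down to one copy.
standard_copies = [10 ** k for k in range(8, -1, -1)]
standard_cts = [38.0 - 3.3 * math.log10(c) for c in standard_copies]

slope, intercept = fit_standard_curve(standard_copies, standard_cts)
unknown_copies = copies_from_ct(24.8, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(unknown_copies))
```

In practice the duplicate standard reactions would each contribute a point to the fit, and unknowns falling outside the range of the standards should be treated with caution.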

Sequence Analysis.

Sequence base calling and contig assembly were performed with the applications phred and phrap (5, 6), as implemented by XplorSeq (D.N.F., unpublished data). Vector and primer sequences were removed along with flanking nucleotides of poor quality (Q <20). Initial microbial species identifications were made by a batch BLAST search of both GenBank and a local database of rRNA sequences culled of environmental/uncultured sequences using the client applications blastcl3 and blastall (National Center for Biotechnology Information). Sequences with BLAST bit scores lower than 400 and/or lengths less than 400 nt were discarded. Potentially chimeric sequences, as suggested by either Bellerophon (7) or Mallard (8), also were removed from the dataset (these accounted for ≈1% of the total clones screened). Cloned sequences were aligned to an existing database of rRNA gene sequences (9) using the NAST alignment algorithm (10). Phylogenetic analyses, including phylogenetic tree estimations, used the application ARB (11).
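The discard rules in this paragraph amount to a simple filter, sketched below. The record type, field names, and clone names are hypothetical; in particular, the chimera flag is assumed to have been set beforehand from the Bellerophon or Mallard output, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass

MIN_BIT_SCORE = 400   # sequences with lower BLAST bit scores are discarded
MIN_LENGTH_NT = 400   # sequences shorter than this are discarded


@dataclass
class CloneSequence:
    name: str
    length_nt: int
    bit_score: float
    flagged_chimeric: bool  # True if either chimera checker flagged the clone


def passes_quality_filter(seq):
    """Keep a clone only if it clears both thresholds and is not flagged as chimeric."""
    return (
        seq.bit_score >= MIN_BIT_SCORE
        and seq.length_nt >= MIN_LENGTH_NT
        and not seq.flagged_chimeric
    )


clones = [
    CloneSequence("clone_A", 780, 1350.0, False),  # kept
    CloneSequence("clone_B", 350, 620.0, False),   # too short
    CloneSequence("clone_C", 760, 310.0, False),   # bit score too low
    CloneSequence("clone_D", 790, 1400.0, True),   # flagged as chimeric
]
kept = [c.name for c in clones if passes_quality_filter(c)]
print(kept)
```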

Operational Taxonomic Units.

Sequences were clustered into relatedness-groups (OTUs) by average-linkage clustering, using the application sortx (D.N.F., unpublished data). Phylogenetic distances were uncorrected and calculated using only conserved positions [Lane-mask (12)]. Multiple data sets were created by clustering sequences with distance thresholds ranging from 99% to 85%. Sampling completeness was assessed by Good's Coverage estimator (13), and estimates of species richness were calculated by using the nonparametric estimators ACE [abundance-based coverage estimator (14)] and Chao1 (15).
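Two of the estimators named above have short closed forms and can be sketched from a list of per-OTU sequence counts. The function names and the example counts are illustrative only. Good's coverage is computed as 1 - n1/N, where n1 is the number of singleton OTUs and N the number of sequences. Chao1 is computed in its classic form, S_obs + n1^2/(2 n2), with the bias-corrected form used as a fallback when there are no doubletons (n2 = 0). ACE is not shown.

```python
def goods_coverage(otu_counts):
    """Good's coverage: 1 - (singleton OTUs / total sequences)."""
    total = sum(otu_counts)
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total


def chao1(otu_counts):
    """Chao1 richness estimate from per-OTU abundance counts."""
    s_obs = sum(1 for c in otu_counts if c > 0)
    n1 = sum(1 for c in otu_counts if c == 1)
    n2 = sum(1 for c in otu_counts if c == 2)
    if n2 > 0:
        return s_obs + n1 * n1 / (2.0 * n2)
    # Bias-corrected form avoids division by zero when no doubletons exist.
    return s_obs + n1 * (n1 - 1) / 2.0


# Illustrative library: 7 OTUs, 22 sequences, 3 singletons, 2 doubletons.
counts = [10, 5, 2, 2, 1, 1, 1]
coverage = goods_coverage(counts)
richness = chao1(counts)
print(round(coverage, 4), richness)
```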

Statistical Comparisons of Communities.

Multivariate statistical analyses (PCA, MANOVA, clustering) were performed by using the R software package (ver. 2.0.1; www.r-project.org). Sequence data were encoded as presence/absence of OTUs for each sample. Principal components calculations used correlation matrices. Statistical relationships between data sets and diagnostic categories (e.g., disease state) were examined by General Linear Modeling of response (OTU data) and explanatory (e.g., disease state) variables. GLMs included either single or multiple explanatory variables to statistically eliminate hidden effects of variables. Cluster analysis of samples used an agglomerative hierarchical clustering algorithm with either Jaccard distances (OTU presence/absence data) or Euclidean distances (principal components scores) (16). The results of the latter clustering method were used to classify samples as members of the IBD subset or Control subset. Associations between medical treatments and membership in the IBD or Control subsets were examined by Fisher's exact test.
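For presence/absence data, the Jaccard distance between two samples is 1 minus the number of shared OTUs divided by the number of OTUs found in either sample. The sketch below computes the pairwise distances that would feed a hierarchical clustering step; it is written in Python for illustration rather than R, and the sample and OTU names are hypothetical.

```python
def jaccard_distance(otus_a, otus_b):
    """1 - |A intersect B| / |A union B| for two sets of detected OTUs."""
    a, b = set(otus_a), set(otus_b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples):
    """Pairwise Jaccard distances keyed by (sample, sample) name pairs."""
    names = sorted(samples)
    return {(p, q): jaccard_distance(samples[p], samples[q])
            for p in names for q in names}


# Hypothetical presence/absence encoding: each sample maps to its detected OTUs.
samples = {
    "sample_1": {"otu_1", "otu_2", "otu_3"},
    "sample_2": {"otu_2", "otu_3", "otu_4"},
    "sample_3": {"otu_5"},
}
distances = distance_matrix(samples)
print(distances[("sample_1", "sample_2")], distances[("sample_1", "sample_3")])
```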

Q-PCR experiments were performed in triplicate for each primer set. Q-PCR data (expressed as gene copies per volume) were logarithmically transformed, normalized to human actin levels (measured by Q-PCR for each sample) and then analyzed by two-tailed Student's t test without treating variances as equivalent. For statistical analysis of Bacteroidetes, Enterobacteriaceae, and Lachnospiraceae, results were first normalized to total bacterial loads.
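A two-tailed t test that does not assume equal variances is usually called Welch's t test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom for two groups of log-transformed values; the function names and input numbers are illustrative, normalization to actin is not shown, and the P value (which needs the t distribution) is left to a statistics package.

```python
import math
from statistics import mean, variance


def log10_all(values):
    """Log-transform raw gene-copy values before testing."""
    return [math.log10(v) for v in values]


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    se2_a = variance(group_a) / n_a  # squared standard error, group A
    se2_b = variance(group_b) / n_b  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Illustrative values only (already on a log scale), not data from the study.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(round(t_stat, 4), round(dof, 4))
```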

UniFrac tests (17) were performed by using the Web service provided at http://bmf.colorado.edu/unifrac/index.psp (Web-link valid as of 7/2007). The phylogenetic tree tested in this analysis was constructed by neighbor joining of sequences representative of the 95% OTU clusters, followed by parsimony insertion of remaining sequences, by use of the ARB software package (11). UniFrac significance was ascertained for all environments together and all pairs of environments via 1,000 resamplings. The significance of environment clusters was assessed by 1,000 jackknife resamplings using normalized weighted UniFrac.

1. Don RH, Cox PT, Wainwright BJ, Baker K, Mattick JS (1991) Nucleic Acids Res 19:4008.

2. Bernhard AE, Field KG (2000) Appl Environ Microbiol 66:1587-1594.

3. Bartosch S, Fite A, Macfarlane GT, McMurdo ME (2004) Appl Environ Microbiol 70:3575-3581.

4. Rinttila T, Kassinen A, Malinen E, Krogius L, Palva A (2004) J Appl Microbiol 97:1166-1177.

5. Ewing B, Green P (1998) Genome Res 8:186-194.

6. Ewing B, Hillier L, Wendl MC, Green P (1998) Genome Res 8:175-185.

7. Huber T, Faulkner G, Hugenholtz P (2004) Bioinformatics 20:2317-2319.

8. Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ (2006) Appl Environ Microbiol 72:5734-5741.

9. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL (2006) Appl Environ Microbiol 72:5069-5072.

10. DeSantis TZ, Jr, Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL (2006) Nucleic Acids Res 34:W394-W399.

11. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, et al. (2004) Nucleic Acids Res 32:1363-1371.

12. Lane DJ (1991) in Nucleic Acid Techniques in Bacterial Systematics, eds Stackebrandt E, Goodfellow M (Wiley, New York), pp 115-175.

13. Good IJ (1953) Biometrika 40:237-264.

14. Chao A, Lee S-M (1992) J Amer Stat Assoc 87:210-217.

15. Chao A (1984) Scand J Statist 11:265-270.

16. Kaufman L, Rousseeuw PJ (2005) Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, New York).

17. Lozupone C, Knight R (2005) Appl Environ Microbiol 71:8228-8235.

18. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA (2005) Science 308:1635-1638.

19. Ley RE, Turnbaugh PJ, Klein S, Gordon JI (2006) Nature 444:1022-1023.