Supporting information for Laub et al. (2002) Proc. Natl. Acad. Sci. USA 99 (7), 4632–4637. (10.1073/pnas.062065699)

Formaldehyde Crosslinking and Immunoprecipitation (IP)

This is the procedure described in ref. 1, with some modifications. Media used in these studies were M2G and PYE. TE buffer is 10 mM Tris/1 mM EDTA, pH 8, unless specified otherwise.

Ten milliliters of CB15N was grown in M2G minimal media at 30°C to an optical density of 0.3-0.4 at 660 nm. One hundred microliters of 1 M sodium phosphates, pH 7.6, and 270 µl of 37% formaldehyde were added, and the culture was incubated for 10 min at room temperature with occasional shaking, then for 30 min on ice. Cells were pelleted and washed twice with 10 ml of phosphate-buffered saline, pH 7.4. Cells were resuspended in lysis buffer (10 mM Tris, pH 8/20% sucrose/50 mM NaCl/10 mM EDTA) with 20 mg/ml of lysozyme (freshly added), then incubated at 37°C for 30 min. Five hundred microliters of 2× IP wash buffer with 1 mM PMSF (freshly prepared) was added, then cells were incubated at 37°C for 10 min. The lysed cells were placed on ice to cool. Samples were then sonicated with a Branson sonifier at power 3-4 for 30-50 sec in 10-sec pulses. After centrifugation, 75 µl of the supernatant containing sheared DNA was saved as the "total DNA" sample and kept on ice during IP of the rest of the sample.

Two microliters of anti-CtrA serum was added to 1 ml of crosslinked sonicated DNA. After shaking gently at room temperature for 1 h, 25 µl of a 50% slurry of protein A-agarose beads (equilibrated in IP wash buffer) was added. Samples were shaken gently at room temperature for an additional 1 h. Beads were collected by centrifugation and washed five times with IP wash buffer and two times with TE buffer. Mock IP samples were handled identically, except no serum was added in the first incubation step.

To the total DNA sample, 5 µl of 10% SDS was added. Agarose beads from immunoprecipitated samples were resuspended in 50 µl of TE buffer, pH 8. Then all samples were incubated at 65°C overnight to reverse crosslinks. Samples were spun 5 min to remove any beads or insoluble materials. Twenty microliters of each sample was purified by using a Qiagen (Chatsworth, CA) PCR purification kit, resulting in a final volume of ≈50 µl of purified total or immunoprecipitated DNA.

Blunting DNA

Twenty microliters of purified immunoprecipitated DNA or 2 µl of purified total DNA was blunted in a 100-µl reaction containing 20 µl of 5× T4 DNA polymerase buffer, 2.5 µl of 4 mM dNTPs, 0.5 µl of 10 mg/ml BSA, and 0.6 µl of 1 unit/µl T4 DNA polymerase. Reactions were incubated at 15°C for 1 h. Then 11.5 µl of 3 M NaOAc, pH 5.2, and 1 µl of 10 mg/ml glycogen were added. One hundred twenty microliters of 25:24:1 phenol/chloroform/isoamyl alcohol was added, samples were vortexed and centrifuged, and the upper aqueous layer was transferred to a new tube. Two hundred forty microliters of cold 100% ethanol was added, tubes were vortexed, placed at -20°C for 1 h, then centrifuged for 15 min at 4°C. After removing the supernatant, the pellet was washed with 700 µl of 70% ethanol and spun 10 min at 4°C. The supernatant was removed and the pellet allowed to air dry 15 min at room temperature.

The blunted DNA pellet was resuspended in 25 µl of ddH2O and placed on ice. To each sample was added 5 µl of 10× T4 DNA ligase buffer, 6.7 µl of 15 µM annealed linkers, 0.5 µl of T4 DNA ligase, and 13 µl of ddH2O. The ligation reactions were incubated overnight at 15°C. DNA was precipitated by adding 6 µl of 3 M NaOAc, pH 5.2, and 130 µl of cold 100% ethanol. Samples were vortexed and placed at -20°C for 1 h, then centrifuged 15 min at 4°C. After removing the supernatant, the pellet was washed with 700 µl of 70% ethanol and spun 10 min at 4°C. The supernatant was removed and the pellet allowed to air dry 15 min at room temperature.

The linker-ligated DNA pellet was resuspended in 25 µl of ddH2O and placed on ice. To each sample was added 4 µl of 10× Thermopol buffer (NEB, Beverly, MA)/2.5 µl of 4 mM dNTPs/0.5 µl of 100 µM oJW102/8 µl of ddH2O. Samples were transferred to PCR tubes and heated in a PCR machine to 55°C for 2 min. Then 8 µl of ddH2O/1 µl of 10× Thermopol buffer/1 µl of 5 units/µl AmpliTaq DNA polymerase/0.01 µl of Pfu Turbo was added. Samples were heated to 72°C for 5 min, then 95°C for 2 min. Then they were cycled 35 times with the following cycle: 95°C for 30 sec, 55°C for 30 sec, 72°C for 1 min. After PCR, samples were purified with a Qiagen PCR purification kit. Five microliters of this was run on a 1% agarose gel to verify the product size, typically a smear ranging from 200 to 1,000 bp, with average size around 400 bp. DNA was quantified by its absorbance at 260 and 280 nm, and 2 µg was concentrated by using a Centricon-30 ultrafiltration device (Millipore).
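As a quick sanity check on the amplification mix, the component volumes listed above can be summed and the final concentrations worked out. This is only an arithmetic sketch; it assumes all volumes are in microliters and the primer stock is 100 µM, and the dictionary keys are labels chosen for illustration.

```python
# Volumes in microliters, taken from the paragraph above (units assumed to be µl).
first_mix = {
    "resuspended DNA": 25,
    "10x Thermopol buffer": 4,
    "4 mM dNTPs": 2.5,
    "100 uM oJW102": 0.5,
    "ddH2O": 8,
}
second_mix = {
    "ddH2O": 8,
    "10x Thermopol buffer": 1,
    "AmpliTaq": 1,
    "turbo polymerase": 0.01,
}

total_ul = sum(first_mix.values()) + sum(second_mix.values())


def final_concentration(stock, volume_ul):
    """Dilute a stock of the given concentration into the full reaction volume."""
    return stock * volume_ul / total_ul


buffer_x = final_concentration(
    10, first_mix["10x Thermopol buffer"] + second_mix["10x Thermopol buffer"]
)
dntp_mM = final_concentration(4, first_mix["4 mM dNTPs"])
primer_uM = final_concentration(100, first_mix["100 uM oJW102"])

print(f"total volume: {total_ul:.2f} ul")
print(f"buffer: {buffer_x:.2f}x, dNTPs: {dntp_mM:.2f} mM, primer: {primer_uM:.2f} uM")
```

Under these assumptions the reaction comes to about 50 µl with buffer at roughly 1×, which is consistent with reading the listed volumes as microliters.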

Primer Sequences

Blunt linkers were composed of equal molar amounts of oJW102 and oJW103. To anneal them, 150 µl of 100 µM oJW102/150 µl of 100 µM oJW103/250 µl of 1 M Tris·HCl (pH 7.9)/450 µl of ddH2O were mixed, split into 50-µl aliquots, and heated to 95°C for 5 min in a heating block. The heating block was removed, cooled quickly to 70°C by adding ddH2O to fill the holes, and allowed to sit at room temperature to cool. Then the samples were incubated at 4°C overnight and stored at -20°C.

oJW102 GCGGTGACCCGGGAGATCTGAATTC

oJW103 GAATTCAGATC
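The relationship between the two oligonucleotides can be checked directly from the sequences above. The sketch below confirms that oJW103 is the reverse complement of the 3' end of oJW102, so the annealed pair has one blunt end and a single-stranded 5' extension on oJW102. It also computes the linker concentration in the annealing mix, assuming the volumes are in microliters and the oligo stocks are 100 µM.

```python
OJW102 = "GCGGTGACCCGGGAGATCTGAATTC"
OJW103 = "GAATTCAGATC"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


rc_short = reverse_complement(OJW103)

# oJW103 pairs with the last len(OJW103) bases of oJW102, giving a blunt end there.
pairs_at_3prime_end = OJW102.endswith(rc_short)

# The remaining 5' portion of oJW102 stays single stranded.
overhang = OJW102[: len(OJW102) - len(OJW103)]

# Annealing mix: 150 + 150 + 250 + 450 (assumed µl), each oligo from a 100 µM stock.
total_ul = 150 + 150 + 250 + 450
linker_uM = 100 * 150 / total_ul

print(pairs_at_3prime_end, overhang, linker_uM)
```

The computed linker concentration matches the concentration of annealed linkers used in the ligation step.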

Motif Searches

We used the programs meme (http://meme.sdsc.edu) and bioprospector (http://bioprospector.stanford.edu), both run from a command line. For regulatory intergenic regions upstream of genes in the CtrA cell cycle regulon, we used the entire intergenic region as input into the program. Each program was run twice, once to look for short (ungapped) and once for long (gapped) motifs. The parameters for meme were:

Short:

-mod tcm -nmotifs 5 -minw 6 -maxw 9 -revcomp

Long:

-mod tcm -nmotifs 5 -minw 14 -maxw 17 -revcomp

The parameters for bioprospector were:

Short:

-W 10 -n 40 -e 10

Long:

-W 5 -w 5 -G 20 -g 1 -n 40 -e 10

Several other combinations of width parameters were tried for each without significant differences in the final results. Nearly all motifs found by bioprospector contained a nearly exact match to the top motif found by meme. Therefore, we used the top motif matrix output by both meme runs for further searches. For each of the meme runs, the E value for the top motif was over 10 orders of magnitude smaller than the next best motif, and the log likelihood ratio was also greater by at least 100. These matrices were:

Long (simplified pos.-specific probability matrix; rows are bases, columns are motif positions 1-17):

A  : : : 7 6 2 4 1 1 : 1 4 : : 8 a :
C  3 1 1 : 1 5 3 3 2 3 6 : : : 1 : 7
G  6 1 : : 3 2 3 2 4 2 2 6 1 : 1 : 3
T  2 9 9 2 : 1 1 4 3 4 1 : 9 a 1 : :

Short (simplified pos.-specific probability matrix; rows are bases, columns are motif positions 1-9):

A  6 1 1 : : : a a :
C  1 : 2 1 : : : : 5
G  3 3 7 9 : 1 : : 2
T  : 5 : : a 9 : : 3
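To make the simplified matrices above easier to read, the sketch below decodes them into approximate probabilities and reports the most likely base at each position. It assumes the usual reading of the meme simplified matrix, in which each digit is a probability in tenths, ":" stands for roughly zero, and "a" stands for roughly one; that encoding is an assumption here, not something stated in the text. The variable names are illustrative.

```python
LONG = {
    "A": ":::762411:14::8a:",
    "C": "311:1533236:::1:7",
    "G": "61::323242261:1:3",
    "T": "2992:114341:9a1::",
}
SHORT = {
    "A": "611:::aa:",
    "C": "1:21::::5",
    "G": "3379:1::2",
    "T": ":5::a9::3",
}


def decode(symbol):
    """Map one simplified-matrix symbol to an approximate probability."""
    if symbol == ":":
        return 0.0
    if symbol == "a":
        return 1.0
    return int(symbol) / 10.0


def consensus(matrix):
    """Pick the highest-scoring base in each column (ties go to A, C, G, T order)."""
    width = len(matrix["A"])
    bases = "ACGT"
    return "".join(
        max(bases, key=lambda base: decode(matrix[base][col])) for col in range(width)
    )


long_consensus = consensus(LONG)
short_consensus = consensus(SHORT)
print(long_consensus)
print(short_consensus)
```

Read this way, the long motif has a TTAA half-site at positions 2-5 and TTAAC at positions 13-17, separated by seven weakly constrained positions, and the short motif ends in the same TTAAC.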

Motif searches were done with mast. The command line options were: -c 1 -w -ev 600. Input sequences were a set of all intergenic regions and the set of regulatory intergenic regions upstream of CtrA cell cycle regulon genes. Motifs with a combined P value less than 0.05, as reported by mast, were considered reasonable matches, without regard to E value. The -ev 600 switch was needed to get all matches with similar P value despite the larger data set when all intergenic regions were used.

Overview

As discussed elsewhere, we have three sets of microarray data for each regulatory intergenic region. These are ratios of (i) mock IP to genomic DNA; (ii) CtrA IP to genomic DNA; and (iii) total DNA to genomic DNA. From these data, we wish to identify those intergenic regions that are enriched specifically by IP by anti-CtrA antibody. The basic experimental design is predicated on the following logic: the total DNA sample can be used to control for enrichment due to crosslinking, sonication, or amplification procedures. The mock IP sample can be used to control further for enrichment due to any IP procedures. Finally, the CtrA IP sample will be used to identify regulatory intergenic regions enriched solely by the addition of anti-CtrA antibody.

All samples were compared to genomic DNA so that we could compare any combination of mock IP, CtrA IP, or total DNA. To go from these three sets of microarray data to a list of intergenic regions enriched by CtrA IP, we did the following:

(i) Calculated mock IP/total DNA and CtrA IP/total DNA ratios for each repetition.

(ii) For each repetition, rank ordered the mock IP/total DNA ratios and calculated a percent rank, with 0% being the smallest ratio and 100% the largest. Separately, we did the same for the CtrA IP/total DNA ratios.

(iii) For each spot (i.e., regulatory intergenic region), we took all of the mock IP/total DNA ranks and CtrA IP/total DNA ranks. If there was not good data for at least 5/8 mock IP/total DNA ranks and 4/7 CtrA IP/total DNA ranks, we did not consider that intergenic region further.

(iv) Performed a one-sided Mann-Whitney test separately for each intergenic region:

(A) We took all ranks (both mock IP/total DNA and CtrA IP/total DNA) and reranked them, with 1 being the lowest rank. We will refer to these ranks as "secondary ranks" (these are not percentile ranks).

(B) We calculated the sum of the mock IP/total DNA secondary ranks (X) and the sum of the CtrA IP/total DNA secondary ranks (Y).

(C) We calculated the Mann-Whitney U score, which depends on the number of data points, as follows:

U = mn + m(m + 1)/2 – X,

where m is the number of mock IP/total DNA secondary ranks (there could be fewer than eight if good microarray data was not present in all repetitions), and n is the number of CtrA IP/total DNA secondary ranks.

(D) By looking up U in a critical value table for the given m and n, we obtained a P value for the null hypothesis that the mock IP/total DNA and CtrA IP/total DNA percentile ranks were drawn from distributions with the same mean. The test was one-sided, so the alternative hypothesis is that the CtrA IP/total DNA percentile ranks are larger than the mock IP/total DNA percentile ranks.
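Steps (A) through (D) can be sketched for a single intergenic region as follows. This is a minimal illustration, not the authors' code: the P value is obtained by exact enumeration of all ways to assign the pooled secondary ranks to the mock group, which stands in for the critical value table lookup described in step (D), and the input percentile ranks are hypothetical.

```python
from itertools import combinations
from math import comb


def secondary_ranks(values):
    """Rank pooled values from 1 (lowest); tied values share the average rank."""
    return [
        sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
        for v in values
    ]


def mann_whitney_one_sided(mock, ctra):
    """Return (U, P) for the alternative that CtrA ranks exceed mock ranks."""
    m, n = len(mock), len(ctra)
    ranks = secondary_ranks(list(mock) + list(ctra))  # step (A)
    x = sum(ranks[:m])                                # step (B): mock rank sum X
    u = m * n + m * (m + 1) / 2 - x                   # step (C)
    # Step (D), by enumeration instead of a table: a mock rank sum at or below
    # the observed X is at least as extreme (equivalently, U at least as large).
    as_extreme = sum(1 for group in combinations(ranks, m) if sum(group) <= x)
    return u, as_extreme / comb(m + n, m)


# Hypothetical percentile ranks for one region (5 mock and 4 CtrA repetitions).
mock_ranks = [12.0, 30.5, 41.0, 47.2, 55.0]
ctra_ranks = [71.3, 80.0, 88.9, 96.4]

u, p = mann_whitney_one_sided(mock_ranks, ctra_ranks)
print(f"U = {u:g}, one-sided P = {p:.4f}")
```

With the formula in step (C), U counts the (mock, CtrA) pairs in which the CtrA rank is the larger one, so complete separation of the two groups gives U = mn and the smallest possible P value for those sample sizes.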

Why Use Ranks?

We decided to rank our data prior to analysis for two main reasons. First, ranks are more robust to experimental variation, which we think arises mainly from differences in IP efficiency. Second, ranks are far easier to model than ratios.

We noticed that the absolute level of enrichment varied between repetitions of the CtrA IP experiment. We attributed this mostly to variation in IP efficiency. (Several other factors could contribute, such as specificity of DNA shearing, efficiency of crosslink reversal, or specificity of crosslink reversal. What follows is therefore equally applicable to these other artifacts, but we will refer to the entire experiment-to-experiment variability as if it were wholly due to variation in IP efficiency.) This variation presents a problem if raw ratios are used when combining results from different repetitions. If IP efficiency for one repetition was much greater than for any other, averaging raw ratios would allow that one experiment to dominate the final result. A false positive in that one repetition could then have a relatively high average ratio, even if it had a low ratio in all other repetitions. To gain confidence, however, we wanted to see consistent enrichment across all repetitions of the CtrA IP experiment. We could therefore try to weight the different repetitions according to IP efficiency, to keep one experiment from carrying the others. Alternatively, we could try to remove the magnitude of enrichment from the data, i.e., reduce it to a simpler measure of relative enrichment. This could be accomplished by ranking the data, which preserves information about whether a spot had a higher ratio than another but eliminates information about how much higher the ratio was.
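The effect described above can be illustrated with a toy example. The ratios below are invented for illustration only: five spots, three repetitions, with the third repetition standing in for an unusually efficient IP in which spot 1 is a one-off false positive and spot 3 is consistently enriched. Percent ranks are computed with 0% for the smallest ratio and 100% for the largest, assuming no ties.

```python
def percent_ranks(ratios):
    """Percent rank per spot: 0 for the smallest ratio, 100 for the largest."""
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    ranks = [0.0] * len(ratios)
    for position, spot in enumerate(order):
        ranks[spot] = 100.0 * position / (len(ratios) - 1)
    return ranks


def column_means(rows):
    """Average each spot across repetitions."""
    return [sum(values) / len(values) for values in zip(*rows)]


# Hypothetical ratios; each inner list is one repetition, each position one spot.
repetitions = [
    [1.0, 1.1, 0.9, 2.0, 1.2],
    [0.9, 1.0, 1.2, 2.2, 1.1],
    [1.0, 30.0, 2.0, 20.0, 3.0],  # high-efficiency repetition
]

mean_ratio = column_means(repetitions)
mean_rank = column_means([percent_ranks(rep) for rep in repetitions])

best_by_ratio = max(range(5), key=lambda i: mean_ratio[i])
best_by_rank = max(range(5), key=lambda i: mean_rank[i])
print(best_by_ratio, best_by_rank)
```

Averaging raw ratios puts the one-off spot on top because a single repetition dominates, whereas averaging percent ranks favors the spot that is high in every repetition.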

Another advantage of using ranks was that we could use modeling to test hypotheses about our experimental techniques. We performed a mock IP experiment to control for any enrichment of DNA pieces due to the IP procedure. We expected there to be very few such enriched fragments; the vast majority we assumed would be present at background levels, and their microarray data should thus be pure noise. After ranking the data, every spot on the microarray should have equal probability of getting any rank. To look at all the data at once, we took the mean of the percentile ranks for each intergenic piece and plotted a histogram of mean ranks. For the mock IP experiment, this was the mean of eight repetitions. Then we calculated the probability of obtaining a mean of x (0% ≤ x ≤ 100%) for eight numbers, assuming that each number was drawn from a uniform probability distribution between 0 and 100% (a random rank model). The derivation of the theoretical distributions is taken from ref. 7 and expanded on in Random Rank Derivation. This theoretical distribution fit very well with the experimental data both visually (Fig. 6) and by a χ² test with the theoretical cumulative distribution function (P = 0.30, 21 df, bin width 3.03%, bottom bins combined to give expected value >5 for all bins, top bins excluded after expected = total), validating both our experimental technique and our assumption that the IP procedures themselves introduce no gross bias into relative levels of different DNA sequences. Individual sequences may still be enriched, but the vast majority are not.
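The random rank model can be sketched numerically. The code below uses the standard closed form for the distribution of a sum of independent uniform variables (the Irwin-Hall distribution) to get the cumulative distribution of the mean of eight percent ranks, and from that the expected fraction of spots in a histogram bin. It is offered as an illustration of the model under that assumption; the derivation the authors used may be written differently, and the function names are hypothetical.

```python
from math import comb, factorial, floor


def mean_rank_cdf(x_percent, n=8):
    """P(mean of n independent Uniform(0, 100) percent ranks <= x_percent)."""
    s = n * x_percent / 100.0  # equivalent sum of n Uniform(0, 1) values
    if s <= 0:
        return 0.0
    if s >= n:
        return 1.0
    total = sum((-1) ** k * comb(n, k) * (s - k) ** n for k in range(floor(s) + 1))
    return total / factorial(n)


def bin_probability(low_percent, high_percent, n=8):
    """Expected fraction of spots whose mean rank falls in [low, high)."""
    return mean_rank_cdf(high_percent, n) - mean_rank_cdf(low_percent, n)


# Expected fractions for 33 equal-width bins (width about 3.03%).
width = 100.0 / 33
expected = [bin_probability(i * width, (i + 1) * width) for i in range(33)]

print(f"P(mean <= 50%) = {mean_rank_cdf(50.0):.3f}")
print(f"most populated bin holds {max(expected):.3f} of spots")
```

Multiplying these expected fractions by the number of spots gives the expected bin counts that a χ² comparison against the observed histogram of mean ranks would use.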

The Mann-Whitney Test and Outlier Analysis

Once we decided to rank our data, the most powerful statistical test available to us was the Mann-Whitney test. We used it to ask whether the ranks in the CtrA IP experiments were higher than the ranks in the mock IP experiments. The Mann-Whitney test, also referred to as the Wilcoxon rank-sum test, makes no assumption about the underlying distribution of the data; the only requirement is that the data can be described meaningfully by an ordinal scale. Conceptually, the Mann-Whitney test is similar to the t test, in that it can be used to ask whether two sets of observations came from distributions with the same mean. We performed a separate Mann-Whitney test for each regulatory intergenic region represented on the arrays.

We selected those intergenic fragments that had P < 0.05 by using a one-sided Mann-Whitney test. To test for outliers in the mock IP ranks, we did the following analysis: for each fragment, we removed the highest percentile rank from the list of mock IP ranks and recalculated the Mann-Whitney P value. Those fragments that had a P value less than 0.05 in this second test (but not the first) and that showed a decrease in P value greater than 3-fold were also selected as significantly enriched by the CtrA IP procedure. This outlier analysis led to the inclusion of 32 additional fragments. The results of these analyses are in Table 3.
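The selection rule, including the outlier step, can be sketched as below. This is an illustration under stated assumptions: the one-sided P value is computed by exact enumeration rather than from a critical value table, the percentile ranks are hypothetical, and `select_fragment` is a name introduced here.

```python
from itertools import combinations
from math import comb


def one_sided_p(mock, ctra):
    """Exact one-sided Mann-Whitney P value (alternative: CtrA ranks are larger)."""
    pooled = list(mock) + list(ctra)
    ranks = [
        sum(w < v for w in pooled) + (sum(w == v for w in pooled) + 1) / 2
        for v in pooled
    ]
    m = len(mock)
    x = sum(ranks[:m])  # observed mock rank sum
    hits = sum(1 for group in combinations(ranks, m) if sum(group) <= x)
    return hits / comb(len(pooled), m)


def select_fragment(mock, ctra, alpha=0.05, min_fold=3.0):
    """Return (selected, p_all, p_trimmed); p_trimmed is None if not needed."""
    p_all = one_sided_p(mock, ctra)
    if p_all < alpha:
        return True, p_all, None
    trimmed = sorted(mock)[:-1]  # drop the highest mock percentile rank
    p_trimmed = one_sided_p(trimmed, ctra)
    rescued = p_trimmed < alpha and p_all / p_trimmed > min_fold
    return rescued, p_all, p_trimmed


# Hypothetical fragment: one mock repetition has an unusually high rank.
mock = [5.0, 11.0, 18.0, 24.0, 30.0, 97.0]
ctra = [62.0, 70.0, 81.0, 90.0]

selected, p_all, p_trimmed = select_fragment(mock, ctra)
print(selected, round(p_all, 4), round(p_trimmed, 4))
```

In this example the single high mock rank keeps the first test just above 0.05, and removing it lowers the P value by more than 3-fold, so the fragment would be included by the outlier rule.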