Author manuscript; available in PMC: 2011 Sep 1.
Published in final edited form as: J Struct Biol. 2010 Mar 27;171(1):64–73. doi: 10.1016/j.jsb.2010.03.016

PREDICTION OF PROTEIN CRYSTALLIZATION OUTCOME USING A HYBRID METHOD

Frank H Zucker 1, Christine Stewart 1, Jaclyn dela Rosa 1, Jessica Kim 1, Li Zhang 1, Liren Xiao 1, Jenni Ross 1, Alberto J Napuli 1, Natascha Mueller 1, Lisa J Castaneda 1, Stephen R Nakazawa Hewitt 1, Tracy L Arakaki 1, Eric T Larson 1, Easwara Subramanian 1, Christophe LMJ Verlinde 1, Erkang Fan 1, Frederick S Buckner 1, Wesley C Van Voorhis 1, Ethan A Merritt 1, Wim G J Hol 1,*
PMCID: PMC2957526  NIHMSID: NIHMS199569  PMID: 20347992

Abstract

The great power of protein crystallography to reveal biological structure is often limited by the tremendous effort required to produce suitable crystals. A hybrid crystal growth predictive model is presented that combines both experimental and sequence-derived data from target proteins, including novel variables derived from physico-chemical characterization such as R30, the ratio between a protein's DSF intensity at 30 °C and at Tm. This hybrid model is shown to be more powerful than sequence-based prediction alone – and more likely to be useful for prioritizing and directing the efforts of structural genomics and individual structural biology laboratories.

Keywords: Crystal growth, protein characterization, thermal shift assay, dynamic light scattering, limited proteolysis, regression partition tree

1. Introduction

Detailed knowledge of protein and nucleic acid structures is of central importance for understanding life at its molecular and atomic level, and benefits human health by guiding design of therapeutics, vaccines and diagnostics. For decades protein crystallography has been the primary technique for obtaining structural information of biomacromolecules but, despite huge technical advances, obtaining crystals of good diffraction quality often remains a major bottleneck. Data from 17 structural genomics projects in TargetDB indicate that only 13% of soluble proteins yield crystals suitable for structure determination[1]. Protein crystallization is a complex, relatively poorly understood process driven by many thermodynamic, kinetic, and stochastic factors[2]. However, certain properties of a protein sample that are expected to impact crystallizability, e.g. homogeneity, solubility, stability and flexibility[3], can be characterized by biophysical methods available to most laboratories. Several of these methods, including dynamic light scattering (DLS)[4], limited proteolysis (LP)[5], differential scanning fluorimetry (DSF)[3,6] and size-exclusion chromatography (SEC) [6,7] assays have been suggested singly as predictors of success in crystal growth. However, there is still considerable scope for improvement in prediction of crystallization outcome[8].

The wealth of data capturing the success or failure of crystallization attempts by large structural genomics efforts has provided a basis for analyses that attempt to correlate crystallization success with variables derived from amino acid sequence. Sequence-based variables such as size, hydrophobicity, and isoelectric point have long been used to predict solubility[9], which appears to be inversely related to crystallizability[6]. In addition, newer algorithms examine additional variables such as homology to proteins in TargetDB[10,11], amino acid composition[12], co-location of amino acids[13,14], side chain entropy and buried glycines[6]. Significant limitations of such methods include reduced accuracy for proteins larger than 200 residues[13,14], reliance on availability of previously-studied homologs[10], or a priori assumptions about structure[6]. For example, the predictive value of homology appears to drop rapidly below 90% sequence identity[11]. This is not surprising, given that changes to only a few residues may introduce or remove favorable protein:protein interaction surfaces that stabilize the formation of a crystal lattice. Indeed, deliberate introduction of small changes in sequence constitutes an established strategy for addressing difficulty in crystallization[15,16]. Variation in sequence, position and cleavage of affinity tags is also widely used to improve crystallization, an effect confirmed in this study (Supplementary Table 1a, e.g. for targets Cpar071490AAB and Tbru022584AAA).

A possible further concern is that a disproportionate number of structural genomics target sequences are derived from prokaryotic and archaeal genomes, which may reduce the predictive power of TargetDB when applied to predicting the crystallizability of eukaryotic target proteins. Indeed, a recent sequence-based predictor of crystallization for expressed proteins did not have the same predictive power for overall success of human proteins[6], an observation confirmed by our studies reported below.

Quantitative comparison of existing crystal growth prediction methods is difficult for several reasons, including the fact that the criteria for judging a prediction as ‘correct’ vary [6,10,12,13,14]. In several cases only overall success from expression to crystal growth is scored[10, PXS-C-Hs in 6], rather than distinguishing between success in protein expression and success in crystallization of purified protein. In the current paper we focus on the latter step.

The hypothesis underlying the current paper is that a more powerful approach to predicting crystallizability of a given protein sample is to combine sequence-derived information with multiple experiments that measure a range of biophysical properties of the actual sample to be crystallized. The reasoning is that multiple factors regarding the protein sample under consideration jointly determine the success of a crystal growth experiment. Since during crystal growth protein-protein contacts need to be established, the nature of the surface of a protein is obviously of special importance. Hence in addition to the homogeneity and stability of individual folded proteins, it makes sense to consider (i) the average physico-chemical properties of the atoms making up the surface of the protein, such as charged versus uncharged, hydrophilic versus hydrophobic, etc; (ii) the degree of deviations from that average, e.g. the flexibility of side chains, loops, motifs and domains; and (iii) the degree of uniformity in the association of the protein molecules in solution, i.e. whether or not the protein forms well-defined single chain entities or well-defined multi-chain particles.

Estimates of the nature and flexibility of exposed side chains can be derived from sequence information provided that a good prediction of which residues are at the surface can be obtained[6]. Flexible loops are the subject of several sequence-based prediction methods[6,10], while limited proteolysis also gives information about the dynamics of surface loops[17]. The mobility of motifs and domains of a protein with respect to each other is likely reflected in the accessibility of hydrophobic pockets measured by fluorescent probes which increase in quantum yield when the probe is shielded from the solvent, i.e. when the probe interacts with hydrophobic patches of the protein in DSF assays[3]. Homogeneity of a protein sample with regard to aggregation state and impurities can be assessed by combining information from DLS measurements[4,18], SDS-PAGE and SEC[19]. These complementary classes of information should be considered together, as suggested by a survey of SPINE quality assessment data[20]. Some of the parameters derived from sequence and from biophysical data might overlap. For example, it was reported that side chain entropy (SCE) could replace individual experimental measures of stability for predicting crystallization of expressed prokaryotic proteins in a recent predictor[6]. Statistical methods are therefore needed to discover the best combination of parameters for optimal prediction of crystallization results.

We describe here the use of statistical analysis methods to develop a predictor of crystallization and diffraction quality that is based on several types of biophysical experiments combined with protein sequence analysis. New variables are derived for several of the biophysical measurements of protein solutions. The value of these variables is explored in combination with variables derived from sequence to find an optimal combination of variables for predicting the outcome of crystallization experiments. Although we expect that performance of the prediction model will continue to improve as larger training sets and additional categories of physical data are brought to bear, our current best hybrid crystal growth prediction model, HyXG-1, already demonstrates the power of this approach. In contrast to previous work[6], HyXG-1 is substantially better than methods based on sequence alone in predicting outcome for our validation set.

2. Methods

2.1 Protein Expression and Purification

Proteins were prepared by the SGPP consortium[21] (www.sgpp.org) and the MSGPP program project (www.msgpp.org) using N-terminal His6 tags, NiNTA and size exclusion chromatography as described previously[22,23]. SGPP targets (as indicated in Supplementary Table 2) were cloned using the BG1861 vector giving an uncleavable tag. MSGPP targets were also cloned using AVA0421 with a cleavable tag. Thus three tag variants of each target were possible: the 8-residue uncleavable tag, the 21-residue uncleaved tag, or the 4-residue cleaved tag. The SGPP procedure for high-throughput soluble expression screening[22] was modified for MSGPP targets (as indicated in Supplementary Table 2) by the replacement of sonication with freezing at −80 °C and thawing in lysis buffer containing 0.04 g lysozyme, 0.5 g CHAPS, 0.2 g MgCl2(H2O)6 and 6 µL benzonase per 100 mL SGPP buffer (see below) with 30 mM imidazole. Proteins were stored in SGPP buffer (25 mM HEPES pH 7.25, 500 mM NaCl, 5% glycerol) except where noted in Supplementary Table 4 and flash frozen[24] before further characterization and crystallization.

2.2 Experimental Protein Characterization

Protein samples were thawed and characterized in the following ways.

SDS-PAGE Analysis

Samples were flash thawed in a 30 °C water bath, DTT was added to 5 mM and samples were spun at 25000g at 4 °C for 30 minutes prior to sample dilution. SDS dye with 5% β-mercaptoethanol was added and samples were boiled at 90 °C for 4 minutes and then run on an 8–16% Tris-HCl Ready gel (Bio-Rad).

Differential scanning fluorimetry curves

DSF curves were collected using an Opticon 2 real-time PCR detector (BioRad) to measure the fluorescence of SYPRO Orange (Sigma) in the presence of protein at 0.5 mg/ml in SGPP buffer with 5 mM DTT in 96-well plates as the temperature increased from 20 or 30 to 90 °C in increments of 0.2 °C. Proteins were centrifuged for 30 minutes at 25000g, 4 °C before sample preparation. SYPRO Orange dye was diluted from initial concentration of “5000X” to “2.5X” in the final sample.

Limited proteolysis

Purified protein at 1 mg/ml in SGPP buffer + 5 mM CaCl2 was exposed to 20 µg/ml trypsin, chymotrypsin, subtilisin A, or endoproteinase Glu-C for 0, 1 and 24 hours. After each time period, the reaction was stopped with 0.17 M acetic acid and SDS dye was added. All samples were boiled and run on SDS-PAGE; gels were then stained with Coomassie Blue stain.

Dynamic Light Scattering

Measurements were made using a DynaPro light scattering instrument (Protein Solutions Inc). All samples were centrifuged for 30 minutes at 4 °C and 25000g immediately before the experiment in order to remove possible dust particles and diluted to 5–10 mg/mL in SGPP buffer + 5 mM DTT. Measurements were performed at 5 °C and 30 readings were taken for each sample.

2.3 Crystallization

Crystallization screening was performed at the Hauptman–Woodward Institute as previously described[23],[25] and using the JCSG+ Suite of screens (QIAGEN). After rapid thawing samples were centrifuged for 30 min. at 25000g at 4 °C to remove possible precipitate, and kept on ice afterwards until used in crystallization experiments. Crystallization leads from initial screens were optimized for pH, precipitant and additive concentrations as well as protein concentration and temperature. MSGPP crystallization trials were set up using a Phoenix crystallization robot (Art Robbins Instruments) using various commercially available screens. Each screen was set up at varying ratios of protein to reservoir volumes. Conditions for the best-diffracting crystals are shown in Supplementary Table 4.

2.4 Determination of Diffraction Quality

Suitable crystal cryoprotection solutions were determined as needed. Typically, a synthetic mother liquor was prepared that contained an increased amount of precipitants, salts, and/or additives relative to the crystallization solution, and was then diluted with varying concentrations of glycerol, ethylene glycol, low molecular weight polyethylene glycols (MW<400 Da), or concentrated salt solutions. Crystals were subjected to the cryoprotection solution for varying amounts of time and in some cases had to be transferred gradually from low to high concentration of the cryoprotectant. On occasion, oils such as paratone-N, mineral oil, paraffin oil, or mixtures were used for cryoprotection. Following cryoprotection (if needed), crystals were mounted in suitably-sized CryoLoops (Hampton Research), flash frozen in liquid nitrogen and tested for diffraction at 100 K on our home x-ray source (Rigaku MM007HF, Saturn detector) or on various synchrotron beamlines (SSRL, ALS, and APS).

2.5 Quantification of Experimental and Sequence Variables

Yield

Expression of soluble protein in high-throughput screens was evaluated from the staining of protein from the equivalent of ~8% of a 600 µL culture. YldS was scored on a scale from 1, no detectable soluble protein, to 5, extremely high soluble protein expression (Supplementary Fig. 2). A score of 5 indicates approximately 5 µg of protein from 48 µL of cultured cells or more, i.e. at least 100 mg/L. YldM is the total mass of protein sent from protein production to crystal screening and growth after large scale expression. Large scale expression was carried out using several different aeration methods and volumes were not consistently recorded, so this measure of yield is not normalized for volume of cell culture.

Size-exclusion chromatography

SEC curves obtained during protein purification were exported from PrimeView Evaluation (Amersham Pharmacia Biotech) and analyzed using Microsoft Excel and gnuplot (http://gnuplot.sourceforge.net) as described by Kawate and Gouaux [19]. After fitting a linear background and a single Gaussian to the peak with the highest absorbance (Fig. 1a), we calculated the total residual Rabs in Excel as Rabs = Σ|Yobs − Ycalc| / ΣYobs. We then iteratively fit additional Gaussians to the largest residual peaks (Fig. 1b, 1c) until a plateau in Rabs was reached (Fig. 1d). The Gaussian which gave maximal improvement in Rabs was taken as the last Gaussian in the optimal model. SECR1 is Rabs with one Gaussian fit (Fig. 1a). SECPP is the percent purity of the pooled fractions using the optimal model (Fig. 1b).
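The residual defined above can be illustrated with a short sketch. It only evaluates Rabs for models that are already given; it does not perform the gnuplot fitting described in the text. The chromatogram values, the function names and the omission of the linear background are assumptions made for illustration.

```python
import math

def gaussian(x, height, center, width):
    """Value at x of a Gaussian with the given peak height, center and standard deviation."""
    return height * math.exp(-0.5 * ((x - center) / width) ** 2)

def r_abs(y_obs, y_calc):
    """Rabs = sum(|Yobs - Ycalc|) / sum(Yobs), as defined in the text."""
    return sum(abs(o - c) for o, c in zip(y_obs, y_calc)) / sum(y_obs)

# Hypothetical chromatogram (invented values): a main peak plus a small early shoulder.
volumes = [0.5 * k for k in range(61)]  # elution volume in mL, 0 to 30
y_obs = [gaussian(v, 100.0, 15.0, 1.5) + gaussian(v, 10.0, 10.0, 1.0) for v in volumes]

# One-Gaussian model: captures the main peak only (background assumed to be zero here).
y_one = [gaussian(v, 100.0, 15.0, 1.5) for v in volumes]
# Two-Gaussian model: also captures the shoulder.
y_two = [gaussian(v, 100.0, 15.0, 1.5) + gaussian(v, 10.0, 10.0, 1.0) for v in volumes]

sec_r1 = r_abs(y_obs, y_one)  # analogous to SECR1
r_two = r_abs(y_obs, y_two)   # residual once the shoulder is modeled
print(f"Rabs with 1 Gaussian: {sec_r1:.3f}; with 2 Gaussians: {r_two:.3f}")
```

In this toy case the one-Gaussian residual equals the fraction of absorbance in the shoulder, which is why a high SECR1 signals material eluting outside the main peak.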

Figure 1. Analysis of size exclusion chromatography profiles.

Gaussian peaks fit to the SEC curve for E. histolytica aspartate-tRNA ligase batch 24058. In (a), (b) and (c) open black circles are observed absorbance at 280 nm in milli-absorbance units (mAu); vertical dashes bound the fractions pooled for further characterization and crystallization; red line is calculated mAu using a linear background plus 1, 2 or 3 Gaussian curves fit to the observed mAu using gnuplot. In (b) and (c) dotted lines in blue, green and violet show individual Gaussians. (A 4th Gaussian, not shown, can be fit as another small curve under the main peak.) (d) Residuals and calculated pool purity for fitting 1 to 4 Gaussians to observed mAu. Left axis: solid black circles, total Rabs, the absolute value of the difference between observed and calculated mAu divided by the total observed mAu; magenta squares, Rabs for the pooled fractions; green triangles, root mean square of the residuals as a fraction of the mean. Right axis: red diamonds, purity of the pooled fractions i.e. the maximum area under a single Gaussian in the pooled fractions divided by the total pool area. SECR1 is Rabs for one Gaussian: i.e. the area between the red and black curves in (a) over the area under the black curve. For this sample SECR1 = 0.16. SECPP is the purity of the pooled fractions calculated in the optimal model. For this sample SECPP=0.99 from (b). (Figures prepared in the R statistical environment)

SDS-PAGE Analysis

Coomassie Blue-stained gels were scored visually on a scale of 1 (lowest purity) to 5 (highest purity); none of the samples scored below 3.

Differential scanning fluorimetry curves

In theory a protein undergoing a two-state unfolding transition (folded to unfolded with no stable intermediate states) should produce a sigmoid fluorescence intensity curve[3,26]:

$$I = I_{min} + \frac{I_{max} - I_{min}}{1 + e^{(T_m - T)/T_w}}$$

Ideally, the change in intensity with temperature, dI/dT, should be maximal at Tm, the temperature at which half the protein is unfolded, also referred to as the melting point[26]. Tw is a measure of the width of the transition, proportional to the full width at half the maximal dI/dT (FWHM). To derive Tw, we calculated FWHM from the data (see Supplementary Methods) and divided this value by the constant 2*ln[(2+√2)/(2−√2)] ≈ 3.525.
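As a check on the constant quoted above, the following sketch generates an ideal sigmoid, takes Tm as the temperature of maximal dI/dT, measures the FWHM of the derivative and converts it to Tw. The function names and the simulated curve parameters are assumptions for illustration; the authors' actual procedure is given in their Supplementary Methods.

```python
import math

def sigmoid_intensity(t, i_min, i_max, t_m, t_w):
    """Two-state curve: I = Imin + (Imax - Imin) / (1 + exp((Tm - T) / Tw))."""
    return i_min + (i_max - i_min) / (1.0 + math.exp((t_m - t) / t_w))

# Ratio of the FWHM of dI/dT to Tw for an ideal sigmoid (about 3.525).
FWHM_PER_TW = 2.0 * math.log((2.0 + math.sqrt(2.0)) / (2.0 - math.sqrt(2.0)))

def estimate_tm_tw(temps, intensities):
    """Return (Tm, Tw): Tm at maximal dI/dT, Tw as FWHM of dI/dT divided by FWHM_PER_TW."""
    # Central-difference slope at each interior point.
    slopes = [(intensities[k + 1] - intensities[k - 1]) / (temps[k + 1] - temps[k - 1])
              for k in range(1, len(temps) - 1)]
    inner_t = temps[1:-1]
    peak = max(range(len(slopes)), key=lambda k: slopes[k])
    half = slopes[peak] / 2.0
    # Walk outward from the peak while the slope stays at or above half its maximum.
    lo = peak
    while lo > 0 and slopes[lo - 1] >= half:
        lo -= 1
    hi = peak
    while hi < len(slopes) - 1 and slopes[hi + 1] >= half:
        hi += 1
    fwhm = inner_t[hi] - inner_t[lo]
    return inner_t[peak], fwhm / FWHM_PER_TW

# Simulated ideal curve: 20 to 90 °C in 0.2 °C steps, Tm = 55 °C, Tw = 3 °C.
temps = [20.0 + 0.2 * k for k in range(351)]
curve = [sigmoid_intensity(t, 0.05, 1.0, 55.0, 3.0) for t in temps]
t_m, t_w = estimate_tm_tw(temps, curve)
print(f"FWHM/Tw = {FWHM_PER_TW:.4f}, Tm = {t_m:.1f}, Tw = {t_w:.2f}")
```

The recovered Tw is slightly below the simulated value because the FWHM is only resolved to the 0.2 °C sampling step.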

In practice the intensity curve for most of the samples in our study followed a sigmoid curve near Tm but deviated in one or more ways at other temperatures. We therefore used the simple estimate of Tm as the temperature at (dI/dT)max to avoid dependence on deviations, and quantified the deviations separately. Deviations included high initial intensity, which we quantified as R30 (Fig. 2b and d); multiple transitions with increasing intensity, quantified as RMT (Fig. 2c and Supplemental Figure 1c, right side); and a decrease in intensity at high temperature, seen in all samples. In the cases of samples with multiple transitions, the transition with the highest dI/dT always had the highest total change in intensity. We therefore assumed that the major intensity transition represented the major unfolding step, or at least the step in which the plurality of hydrophobic pockets were exposed to dye. We took the midpoint in that major unfolding step as Tm rather than attempting to fit a single sigmoid curve to data showing a multi-step transition, or attempting to determine the midpoint of a multi-step transition.

Figure 2. Analysis of differential scanning fluorimetry curves.

Four protein samples illustrate different curve shapes. Black solid lines : fluorescence intensity of SYPRO Orange dye vs. temperature, smoothed over 15 points (3 °C) and normalized to the minimum and maximum observed intensities. Blue dashed vertical lines: Tm, the temperature with the steepest positive slope, (dI/dT)max. Blue horizontal dashes: ITm, the intensity at Tm. (a) L. guyanensis 6-phosphogluconolactonase with ideal shape: low intensity at low temperature and a single transition. Blue horizontal arrow: temperature range over which the slope is at least ½ of (dI/dT)max i.e. full width at half maximum (FWHM) of the derivative, proportional to the melting transition width Tw. (b) E. histolytica aspartate-tRNA ligase batch 21516 with high intensity at low temperature and a single transition. Red horizontal dashes: I30, intensity at 30 °C. R30 is the ratio of I30 to ITm. Green dot-dash line: I30 threshold based on the R30 criterion in the decision tree, Figure 3b, i.e. I30/ITm=0.105. (c) T. gondii porphobilinogen synthase amino acids 320–658, with two distinct transitions. Magenta dotted line: sigmoid curve fit to observed intensity at Tm and at 2∙Tw below Tm. At low temperatures this curve approaches Imin, the estimated starting intensity of the major transition. Since in many cases intensity decays above Tm, and in others a minor transition is seen above Tm, the amplitude of the major transition is estimated as twice the intensity change between Imin and ITm. When there is a minor transition below Tm as in this case, Imin is also used as an estimate of the amplitude of that minor transition. RMT, the transition fraction, is calculated as the amplitude of the minor transition(s) over the total amplitude of all transitions. (d) L. major methionyl-tRNA synthetase, amino acids 206 to 747, with high R30 and high RMT. Both I30, red dashes, and Imin from the curve fit to the transition, magenta dots, are near ITm, blue dashes. (Figures prepared in Excel.)

We quantified minor transitions (Fig. 2c and Supplemental Fig. 1c, right) as RMT, the fraction of intensity change observed outside the major transitions. We fit the above equation to observed intensities at Tm and Tm – 2Tw to find Imin, estimated the major transition intensity ΔImain as 2*(ITm – Imin), and calculated RMT as the ratio of the remaining intensity change to the intensity of the major transition (see Supplementary Methods for details). In cases such as Figure 2d, the major positive transition was dwarfed by the overall negative slope of the curve; here, RMT approached its maximum of 1 while R30 was between 1 and its maximum of 2.

Low-temperature fluorescence was quantified using the intensity at 30 °C since this temperature was consistently included in the temperature range of DSF experiments performed in our laboratory. We calculated R30 as I30/ITm, the ratio of the intensity at 30 °C to the intensity at Tm (Fig. 2b), with intensity measured in arbitrary units from the minimum value for each curve. For an ideal sigmoid curve, ITm would be equal to Imax/2. For real curves, the intensity decrease at high T made it difficult to directly observe Imax; ITm was less sensitive to this common deviation from the ideal. For curves with multiple positive transitions (Fig. 2c, Supplementary Fig. 2c right), using ITm as the denominator to determine R30 gave similar results in most cases to using the overall positive intensity change (ΔItotal). Using ITm resulted in a substantially lower R30 compared to using the estimated intensity change of the main transition (ΔImain as described above). In all cases, the ratio using ITm had the strongest correlation with crystallization outcome.

For curves with overall downward trends (Fig. 2d), any of these denominators (ITm, ΔItotal or ΔImain) would lead to extremely high ratios. Since the intensity was minimal and still dropping at the highest temperature used, the values and thus the ratio of I30 and ITm depended on the highest temperature used. Setting the baseline to the minimum intensity before Tm would have avoided this effect. However, the ratio was still so high in all such cases that this effect did not significantly alter the resulting model or predictions made using R30. Further, this effect was quantified as a high RMT value. In pathological cases where the intensity at 30 °C was far greater than the intensity at Tm, we assigned an arbitrary maximum value of 2 for R30.
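A minimal sketch of the R30 calculation described above follows. It assumes the curve is available as paired lists of temperature and intensity and that the reading nearest each target temperature is used; the function name, the nearest-point lookup and the handling of a zero denominator are illustrative assumptions rather than the authors' implementation.

```python
def r30(temps, intensities, t_m, cap=2.0):
    """R30 = I30 / ITm, with intensities measured from the curve minimum and capped at `cap`."""
    baseline = min(intensities)

    def intensity_at(target):
        # Assumption: use the reading closest to the target temperature.
        k = min(range(len(temps)), key=lambda j: abs(temps[j] - target))
        return intensities[k] - baseline

    i_30 = intensity_at(30.0)
    i_tm = intensity_at(t_m)
    if i_tm <= 0.0:
        return cap  # assumption: treat a zero denominator as a pathological curve
    return min(i_30 / i_tm, cap)

# Hypothetical curves (invented values) sampled every 10 °C.
temps = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
well_behaved = [0.0, 0.1, 0.2, 1.0, 1.8, 1.5]   # low start, single rise
downward = [5.0, 4.0, 1.2, 1.0, 0.5, 0.0]       # high start, overall decline

print(r30(temps, well_behaved, t_m=50.0))  # small ratio
print(r30(temps, downward, t_m=50.0))      # capped at the maximum
```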

In most cases we had at least two measurements of the sample in standard buffer. The average of all valid values was used. Curves with no positive slope above 0.001 raw intensity units per degree were not included in averaging. This threshold is 0.0002 units per 0.2 degree increment, twice the Opticon Monitor's precision in reporting intensity of 4 decimal places. One sample had no curves with any positive slope; this sample was given arbitrary values of 0 for Tm and Tw, 2 for R30 and 1 for RMT.

Limited Proteolysis

Each protease was scored visually on a scale of 1 to 5 (most stable) according to the criteria in Supplementary Table 3, and the scores for the 4 proteases were averaged to calculate LPav.

Dynamic Light Scattering

Hydrodynamic radius (RH), polydispersity, intensity and fraction of mass in each peak were recorded. For each sample a dominant peak was chosen as the consistent peak with the highest fraction of mass. DLSP was assigned as the polydispersity of that peak. DLSI was calculated as the intensity of that peak over the total intensity of that peak and all peaks with larger RH. Smaller peaks were assumed to be salts and other small molecules. DLSMW was derived from RH for that peak according to the formula from the Dynamics Version 5 software: DLSMW = (1.68 × RH)^2.3398. DLSMR is the ratio of DLSMW to the molecular weight of the monomer calculated from the sequence of the expressed protein. An additional categorical score DLSSC was assigned: 4 (<30% polydispersity in a single major peak), 3 (≥30% polydispersity in a single major peak), or 2 (more than one peak, regardless of polydispersity); none of the proteins in this study were in category 1 (unmeasurable).
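The two mass-related DLS variables can be sketched as follows. Table 1 lists DLSMW in kDa and the sequence-derived MW in Da; the unit of RH is not stated in the text, so nanometres are assumed here. The function names and the example values are hypothetical.

```python
def dls_mw_kda(r_h_nm):
    """DLSMW = (1.68 * RH) ** 2.3398, as quoted in the text (RH assumed in nm, result in kDa)."""
    return (1.68 * r_h_nm) ** 2.3398

def dls_mr(r_h_nm, monomer_mw_da):
    """DLSMR: ratio of DLSMW to the monomer molecular weight calculated from sequence."""
    return dls_mw_kda(r_h_nm) * 1000.0 / monomer_mw_da

# Hypothetical sample: RH of 3.0 nm for a 22 kDa monomer.
print(f"DLSMW = {dls_mw_kda(3.0):.1f} kDa")
print(f"DLSMR = {dls_mr(3.0, 22000.0):.2f}")
```

A DLSMR near 2 for this invented sample would be consistent with a dimer, although the formula presumes a globular shape and elongated particles would also raise the ratio.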

Sequence variables

We explored a limited set of parameters derived directly from the protein sequence: MW, calculated molecular weight of the monomer; HYDav, average hydropathy using Kyte and Doolittle values[27]; Dismax, number of amino acids in the longest contiguous stretch of disorder predicted by DisEMBL[28] (http://dis.embl.de/); Dis-t, longest stretch of predicted disorder excluding the N-terminal His tag; and XP, the score of 1 to 5, optimal to difficult, from XtalPred, a predictor based on 9 sequence parameters (http://ffas.burnham.org/XtalPred-cgi/xtal.pl)[10]. Other summary metrics such as PXS and PC-XS-Hs[6] were also tested but did not contribute to the predictive power of the models.
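Of the sequence variables listed above, average hydropathy is simple enough to sketch directly: it is the mean of the per-residue Kyte and Doolittle values. The function name and the example sequences are hypothetical, and the sketch assumes a sequence containing only the 20 standard amino acids.

```python
# Kyte and Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def average_hydropathy(sequence):
    """Mean Kyte-Doolittle value over all residues (raises KeyError on non-standard residues)."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return sum(values) / len(values)

# Hypothetical sequences: a His6 tag alone and a short hydrophobic stretch.
print(average_hydropathy("HHHHHH"))
print(average_hydropathy("LIVFA"))
```

The possible values lie between −4.5 (all arginine) and 4.5 (all isoleucine), which matches the ±4.5 range given for this variable in Table 1.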

2.6 Statistical Analysis

Development of Predictive Model

Predictive models were constructed and tested in the R statistical environment (http://www.R-project.org) version 2.8.0. For recursive regression partition trees, parameters were tuned using leave-one-out cross-validation on the training set to optimize predictive power for biophysically valid trees. For SVM, variables were selected using 10-fold cross validation on the training set by cycles of incremental variable addition and automated combinatorial surveys; parameters were retuned after each round of variable selection.

Analysis of Predictive Model

Predictive power for regression models was measured by DSPred Error, the root mean squared error = √[Σ(O – P)^2 / N], where O and P are observed and predicted diffraction scores respectively; by Pearson's correlation coefficient; and by area under the ROC curve of true positive rate versus false positive rate. Since P and O had bimodal rather than normal distributions, probabilities of observed correlations were estimated using synthetic data. For binary classifications Matthews correlation coefficient, accuracy, sensitivity and selectivity were also measured. Standard deviations for measures of predictive power were calculated using cross-validation results and synthetic data. See Supplementary Methods for further details on model development and analysis.
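The two regression measures named above can be written out in a few lines. This is a generic sketch of root mean squared error and Pearson's correlation, not the authors' R code; the function names and the example scores are invented for illustration.

```python
import math

def ds_pred_error(observed, predicted):
    """Root mean squared error: sqrt(sum((O - P)^2) / N)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical observed and predicted diffraction scores for four samples.
observed = [0, 6, 3, 5]
predicted = [1.0, 5.0, 3.0, 4.0]
print(f"DSPred Error = {ds_pred_error(observed, predicted):.3f}")
print(f"Pearson r = {pearson_r(observed, predicted):.3f}")
```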

3. Results

3.1 Quantification of Experimental and Sequence Variables

We considered 107 eukaryotic protein samples (Supplementary Tables 1 and 2, Supplementary Fig. 1) originating from the Structural Genomics of Pathogenic Protozoa (SGPP; www.sgpp.org) and Medical Structural Genomics of Pathogenic Protozoa (MSGPP; www.msgpp.org) pipelines, described in Supplementary Methods. This sample set includes both widely divergent genes and minor sequence variations, and represents the full range of diffraction outcomes, from failure to crystallize to diffraction better than 2 Å resolution. The full set was divided into a training set of 77 samples and a test set of 30 samples, such that the two sets contained similar distributions of crystallization outcome. The training set contained 41 sequences with less than 90% sequence identity to each other. Training set samples with similar sequences but distinct experimental characteristics and outcomes included multiple batches of the same sequence, tag variants, truncations, and homologs from related organisms. All 30 sequences in the test set had less than 85% identity to other proteins in either set.

We derived and quantified 21 experimental and sequence variables based on biophysical characterizations using SDS PAGE, SEC, DSF, DLS and LP (Table 1). Novel quantitative measures were developed for SEC profiles, DSF curves and LP gels as described in Figs. 1, 2 and Supplementary Table 3. Crystallization outcome, ranging from 0 to 6, was quantified as diffraction score (DS): no mountable protein crystals after extensive crystal screening (DS=0), no diffraction (DS=1), diffraction worse than 10 Å (DS=2), 10 Å or better (DS=3), 4 Å or better (DS=4), 2.8 Å or better (DS=5), or 2.0 Å or better (DS=6).
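The diffraction score scale above maps directly onto a small lookup function. The sketch below assumes each sample is summarized by whether mountable crystals were obtained and, if they diffracted, by the best resolution in Å; the function name and argument conventions are hypothetical.

```python
def diffraction_score(best_resolution=None, mountable=True):
    """Map a crystallization outcome to the 0-6 diffraction score (DS) defined in the text.

    best_resolution: best diffraction limit in angstroms, or None if no diffraction was seen.
    mountable: False if no mountable crystals were obtained after extensive screening.
    """
    if not mountable:
        return 0          # no mountable protein crystals
    if best_resolution is None:
        return 1          # crystals, but no diffraction
    if best_resolution <= 2.0:
        return 6          # 2.0 A or better
    if best_resolution <= 2.8:
        return 5          # 2.8 A or better
    if best_resolution <= 4.0:
        return 4          # 4 A or better
    if best_resolution <= 10.0:
        return 3          # 10 A or better
    return 2              # diffraction worse than 10 A

for outcome in [dict(mountable=False), dict(), dict(best_resolution=12.0),
                dict(best_resolution=3.5), dict(best_resolution=1.9)]:
    print(outcome, "->", diffraction_score(**outcome))
```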

Table 1. Experimental and sequence variables tested

| Source | Variable | Description (see Supplementary Methods for full definitions) | Range (a) | Mean (sd) (b) | Correlation (c) |
|---|---|---|---|---|---|
| Protein Production | YldS | Score for soluble expression screening gels | 1–5 | 3.4 (1.0) | 0.16 |
| Protein Production | YldM | Total mass of protein produced (mg) | >0 | 52 (39) | 0.18 |
| SDS-PAGE | SDS | Average of 4 visual scores; reducing conditions | 1–5 | 4.4 (0.6) | −0.01 |
| Limited Proteolysis | LPav | Average of scores for 4 proteases | 1–5 | 3.3 (0.9) | 0.39 |
| Size Exclusion Chromatography | SEChu | Visual scoring of chromatogram image | 1–5 | 3.4 (1.0) | 0.08 |
| Size Exclusion Chromatography | SECR1 | Residual (Rabs) with 1 Gaussian fit, as fraction of total area | 0–1 | 0.4 (0.3) | −0.11 |
| Size Exclusion Chromatography | SECPP | Percent purity of pooled fractions at plateau of Rabs | 0–1 | 0.8 (0.2) | −0.17 |
| Dynamic Light Scattering | DLSP | Percent polydispersity | 0–100 | 23 (14) | −0.09 |
| Dynamic Light Scattering | DLSI | Percent intensity in major peak | 0–100 | 92 (11) | 0.05 |
| Dynamic Light Scattering | DLSSC | Composite score: 4, DLSP ≤ 30 and DLSI = 100; 3, DLSP > 30 and DLSI = 100; 2, DLSI < 100 | 2–4 | 2.6 (0.8) | 0.19 |
| Dynamic Light Scattering | DLSMW | MW calculated from hydrodynamic radius (kDa) | >0 | 190 (332) | −0.01 |
| Dynamic Light Scattering | DLSMR | MW from hydrodynamic radius / predicted monomer MW | >0 | 4 (7) | 0.04 |
| Differential Scanning Fluorimetry | Tm | Melting temperature (°C) or 0 if no valid melting point | 20–90 | 53 (10) | 0.08 |
| Differential Scanning Fluorimetry | Tw | Melting width (°C) | ≥0 | 7 (3) | 0.07 |
| Differential Scanning Fluorimetry | R30 | Ratio of intensity at 30 °C to intensity at Tm | 0–2 | 0.4 (0.5) | −0.37 |
| Differential Scanning Fluorimetry | RMT | Fraction of intensity change in other transitions | −1 to 1 | 0.28 (0.24) | −0.31 |
| Sequence Analysis | MW | Predicted molecular weight of monomer including tag (Da) | >0 | 49k (16k) | −0.34 |
| Sequence Analysis | Hydav | Average hydropathy (GRAVY) | ±4.5 | −0.32 (0.14) | 0.05 |
| Sequence Analysis | Dismax | Longest stretch of disordered residues | ≥0 | 19 (9) | −0.19 |
| Sequence Analysis | Dis-t | Longest stretch of disorder excluding N-terminal tag | ≥0 | 8 (8) | −0.07 |
| Sequence Analysis | XP | Score from XtalPred web server | 1–5 | 3.4 (1.3) | −0.23 |

(a) Range of possible values.

(b) Mean (and standard deviation) of values for training set of 77 samples.

(c) Correlation of training set values to diffraction score.

Large, bold variables are those used in partition trees in Table 2.

3.2 Development of Best Predictive Model

Many statistical methods can in principle be used to develop predictive models based on experimental and sequence variables (Fig. 3a). We evaluated linear regression, naïve Bayesian, several varieties of support vector machines (SVM), clustering, and recursive regression partition trees as described in Supplementary Methods. Regression partitioning and SVM gave the best results in cross-validation tests using only training data (Supplementary Results). However, regression partitioning gave the best results in predicting test set diffraction scores of the protein samples and will therefore be discussed here further.

Figure 3. Development of diffraction predictor using experimental results and sequence.

(a) Predictive model design. Top: train the model on experimental and sequence data and known crystallization outcomes quantified as diffraction scores (DS). Bottom: use the model to predict DS for new samples from new experimental and sequence data. (b) Hybrid crystal growth predictor (HyXG-1) decision tree prediction trained on 77 samples: start with experimental and sequence data for a new protein sample (top left); travel to the right across the tree branching according to criteria shown; arrive at the predicted DS for each category (center). Predicted DS is the mean DS for all training samples in that category; from top to bottom, there were 9, 7, 10, 14, 7, 12 and 18 training samples in each category. To the right are the percent of all test and training samples in each category diffracting to at least 10 Å or at least 2.8 Å, and suggestions for actions if no crystals are seen in initial trials. Possible changes include: change construct tag, tag placement or promoter; change expression host, scale-up volume, aeration method, or time and temperature regime; change purification columns (e.g. add ion exchange), tag cleavage, lysis and column buffers, or final concentration step.

3.3 Analysis of Hybrid Experimental Characterization and Sequence Model

The best partition tree (Fig. 3b, hereafter also called the HyXG-1 tree) obtained from consideration of all 21 variables (Table 1) applies four experimental and two sequence criteria. The experimental variables used in the model are: (i) the ratio of intensity at 30 °C to intensity at the melting point in differential scanning fluorimetry curves (R30); (ii) soluble protein expression level in high-throughput screening (YldS); (iii) residual after fitting one Gaussian to a SEC curve (SECR1); and (iv) ratio of molecular weight from hydrodynamic radius to calculated weight of the monomer (DLSMR). The sequence variables incorporated into the model are: (v) calculated monomer molecular weight (MW) in daltons; and (vi) number of amino acids in the longest disordered region predicted by DisEMBL[28] (Dismax). The model predicts good diffraction for samples with low MW (i.e. monomer under 36.3 kDa) and low R30 (i.e. I30 / ITm less than 0.105), but poor outcomes for samples with low MW and high R30. Moderate outcomes are predicted for samples with high MW and very high YldS scores (over 100 mg/L soluble expression in HT screening). Poor outcomes are predicted for other high MW samples, with slightly better outcomes for samples with low SECR1 (less than 21.5% of A280 outside a single Gaussian curve) or with low Dismax (fewer than 19 amino acids in the longest stretch of predicted disorder) and high DLSMR (MWRH / MWmonomer greater than 1.88).
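The branching criteria described above can be written as a small function. This is a minimal sketch and not the authors' implementation: the function name `hyxg1_category`, the argument names and the returned labels are illustrative choices. The thresholds are those quoted in the text. The order in which the SECR1, Dismax and DLSMR tests are nested for high-MW samples is an assumption (Fig. 3b shows the actual layout), and the qualitative labels stand in for the predicted DS values, which are given only in the figure.

```python
def hyxg1_category(mw_da, r30, ylds_mg_per_l, secr1, dismax, dlsmr):
    """Return a qualitative outcome label from the criteria quoted in the text.

    mw_da          calculated monomer molecular weight in daltons (MW)
    r30            I30 / ITm from the DSF curve (R30)
    ylds_mg_per_l  soluble expression level in HT screening, mg/L (YldS)
    secr1          fraction of A280 outside a single Gaussian (SECR1)
    dismax         longest predicted disordered stretch, residues (Dismax)
    dlsmr          MW from hydrodynamic radius / monomer MW (DLSMR)
    """
    if mw_da < 36300:                    # low MW: monomer under 36.3 kDa
        if r30 < 0.105:                  # low R30
            return "good"
        return "poor"                    # low MW but high R30

    if ylds_mg_per_l > 100:              # high MW with very high soluble yield
        return "moderate"

    # Assumption: the nesting order of the remaining tests is not stated in
    # the text, so they are applied here as two independent "rescue" rules.
    if secr1 < 0.215:
        return "poor, slightly better"
    if dismax < 19 and dlsmr > 1.88:
        return "poor, slightly better"
    return "poor"


# Hypothetical sample: 30 kDa monomer with R30 = 0.08
print(hyxg1_category(30000, 0.08, 20, 0.30, 25, 1.0))
```

A tree written this way makes the action suggestions in Fig. 3b easy to attach to each branch, since every return statement corresponds to one category of samples.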

The predictive power of this HyXG-1 tree was evaluated by applying the model to the test set of 30 samples (Fig. 4 and Table 2 row A). With success defined as 2.8 Å or better diffraction (DS≥5), 25 samples (83%) were correctly predicted. With success defined as better than 10 Å diffraction (DS>3, dotted line in Fig. 4a), 26 samples were correctly predicted, 6 as successful, 20 as unsuccessful. The resulting Matthews correlation coefficient is 0.67; selectivity is high, 20/21 = 95%; sensitivity is moderate, 6/9 = 67%; and the overall accuracy of the prediction model is high, 26/30 = 87%. For comparison, the highest Matthews correlation coefficient on our test set using previously reported sequence-only predictors[6,10] was 0.48, with an accuracy of 60%.
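The figures quoted for the 10 Å threshold follow from a 2×2 confusion matrix. The sketch below recomputes them; the counts of one false positive and three false negatives are inferred from the ratios 20/21 and 6/9 given in the text, and the function name `binary_metrics` is illustrative. The quantity called selectivity in the text is computed here as TN / (TN + FP), which is usually termed specificity.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),   # "selectivity" in the text
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Test set of 30 samples, success defined as DS>3:
# 6 true positives, 20 true negatives, 1 false positive, 3 false negatives.
m = binary_metrics(tp=6, tn=20, fp=1, fn=3)
print({name: round(value, 2) for name, value in m.items()})
# prints {'accuracy': 0.87, 'sensitivity': 0.67, 'specificity': 0.95, 'mcc': 0.67}
```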

Figure 4. Diffraction score predictions using experimental results and sequence.

(a) DS observed vs. DS predicted by the HyXG-1 model shown in Fig. 3b for the test set of 30 new samples. DS is:
  • 0, no mountable protein crystals after extensive crystal screening;
  • 1, no diffraction;
  • 2, diffraction worse than 10 Å;
  • 3, 10 to 4.81 Å diffraction;
  • 4, 4.80 to 2.81 Å diffraction;
  • 5, 2.80 to 2.01 Å diffraction;
  • 6, 2.00 Å or better diffraction.
Bars: ±1 standard deviation based on the deviation of training DS. Dotted lines and coloring based on success threshold of better than 10 Å (DS>3). (b) Receiver operating characteristic (ROC) curves: area under curve is a measure of predictive power. Blue lines, predictions from combined experimental and sequence data (Table 2, row A); red, predictions leaving out experimental data (row C). Dashes, ROC curve for success threshold of better than 10 Å (DS>3); solid, success threshold of 2.8 Å or better (DS≥5). Shading added to visually clarify the association of lines.

Table 2. Effects of experimental and sequence variables on prediction power

| Model | Experimental variables used in prediction model | Sequence variables used in prediction model | DSPred error^f | Correlation^g | ROC area^h, DS>3 | ROC area^h, DS≥5 |
| --- | --- | --- | --- | --- | --- | --- |
| A. Best with expt & seq^a | R30, YldS, SECR1, DLSMR | MW, Dismax | 1.96 (0.13) | 0.56 (0.06) | 0.77 (0.04) | 0.87 (0.05) |
| B. Leave out seq from A^b | R30, (YldS), SECR1, DLSMR | | 2.73 (0.08) | −0.07 (0.06) | 0.61 (0.05) | 0.49 (0.06) |
| C. Leave out expt from A^c | | MW, Dismax | 2.46 (0.10) | 0.18 (0.07) | 0.65 (0.05) | 0.69 (0.06) |
| D. Best with expt only^d | R30, YldS, SECPP^d, DLSMW^d, LPav^d | | 1.90 (0.06) | 0.57 (0.04) | 0.70 (0.08) | 0.71 (0.08) |
| E. Best with seq only^e | | MW, Dismax, Hydav^e, XP^e | 2.58 (0.12) | 0.17 (0.08) | 0.64 (0.05) | 0.63 (0.06) |

For descriptions of variables see Table 1.

a. Best partition model combining experimental and sequence variables from 77-sample training set.

b. The 4 experimental variables from model A were supplied to the partition algorithm. The algorithm discarded YldS as a criterion.

c. The 2 sequence variables from A were supplied to the algorithm; the algorithm used both as criteria.

d. All experimental variables were supplied. The algorithm used 2 of the same variables as in A, replaced SECR1 and DLSMR with related variables SECPP and DLSMW, and added LPav.

e. All sequence variables were supplied; hydropathy (Hydav) and XtalPred score (XP) were added to the sequence variables used in A.

f,g,h. Three measures of predictive power for the 30-sample test set (parentheses: standard deviation estimated from synthetic data).

f. Square root of the mean square difference between predicted and observed diffraction scores (DS).

g. Pearson's correlation coefficient for predicted and observed DS.

h. Area under ROC curves as in Figure 4b, with success defined as “better than 10 Å diffraction” (DS > 3) or as “2.8 Å or better diffraction” (DS ≥ 5).

3.4 Relative Importance of Experimental and Sequence Variables

To test the relative importance of the two classes of variables, those from experimental results and those from sequence analysis, new decision trees based on only one of the two classes were constructed. First, we considered only those variables of one class that contributed to the best hybrid tree. Next, we constructed trees from all variables of one class from the full set of 21 variables. In each case we used the same parameters and training set as for the best hybrid tree. There is a substantial increase in predictive power of the best hybrid tree compared to trees without experimental variables (Fig. 4b and Table 2, row A compared to C or E). For example, the correlation rose from 0.18 (p>0.16) to 0.56 (p<0.0014) with the addition of experimental variables. The improvement in predictive power is more than twice the estimated standard deviation for prediction error, for correlation and also for the area under the receiver operating characteristic (ROC) curve with a diffraction score cutoff of DS≥5 (Fig. 4b). Interestingly, the error and correlation for the best experiment-only tree (Table 2, row D) were significantly better than those for the best sequence-only tree (Table 2, row E).
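The three measures of predictive power compared here (Table 2, footnotes f to h) can be computed from paired lists of predicted and observed DS. The following is a minimal sketch with illustrative function names; the ROC area is obtained from the equivalent rank statistic (the fraction of success/failure pairs in which the successful sample has the higher predicted DS, with ties counted as one half), which is an implementation choice and not necessarily how the authors computed it. The six data points are hypothetical and are not taken from the study.

```python
import math


def rms_error(pred, obs):
    """Square root of the mean square difference between predicted and observed DS."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def roc_area(pred, obs, is_success):
    """Area under the ROC curve via pairwise comparison of predicted scores."""
    pos = [p for p, o in zip(pred, obs) if is_success(o)]
    neg = [p for p, o in zip(pred, obs) if not is_success(o)]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: predicted DS (category means) and observed DS.
predicted = [4.5, 4.5, 2.0, 2.0, 1.0, 1.0]
observed = [6, 2, 5, 0, 1, 0]

error = rms_error(predicted, observed)
corr = pearson_r(predicted, observed)
area_ds5 = roc_area(predicted, observed, lambda ds: ds >= 5)
print(error, corr, area_ds5)
```

Because a partition tree predicts the same DS for every sample in a category, ties between predicted scores are common, which is why the pairwise ROC calculation counts them explicitly.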

4. Discussion

The HyXG-1 decision tree suggested by recursive regression partition (Fig. 3b) is consistent with correlations of individual protein characteristics to crystallization found in previous work[3,6,10,19] and in this study (Table 1). For instance, low initial intensity followed by a sharp increase on melting in DSF has been reported as favorable for crystallization[3]. High fluorescence intensity at 30 °C indicates existence of hydrophobic pockets, possibly due to flexibility of loops, secondary structure elements or motifs, in which the fluorophore can bind. Upon increasing the temperature, unfolding of the environment of these pockets may lead to increased exposure of the fluorophore to the surrounding solvent and concomitant decreased fluorescence intensity. When the temperature is sufficiently high to initiate unfolding of one or more major domains, an increase in fluorescence intensity is observed when new binding sites for the fluorophore become available. Determining the precise mechanism leading to high R30 is beyond the scope of this paper, but it appears from our analysis that R30 quantifies a property of proteins which is more significant than the Tm, which might be due to the fact that R30 reports on features of the target protein at a temperature generally closer to the conditions of crystallization than Tm.
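For readers who wish to compute R30 from their own melting curves, the sketch below shows one possible calculation. It assumes a single melting transition, takes the inflection point as the sample with the steepest central-difference slope, and obtains I30 by linear interpolation; these choices, the function names and the synthetic sigmoidal curve are assumptions made for illustration and may differ from the procedure used in this study.

```python
import math


def interpolate(xs, ys, x):
    """Linear interpolation of y at x; xs must be increasing and bracket x."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            frac = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + frac * (ys[i + 1] - ys[i])
    raise ValueError("x outside the measured temperature range")


def r30_from_curve(temps, intensities):
    """Return (R30, Tm), where R30 = I30 / ITm and Tm is the inflection point."""
    # Central-difference slope at every interior point of the curve.
    slopes = [
        (intensities[i + 1] - intensities[i - 1]) / (temps[i + 1] - temps[i - 1])
        for i in range(1, len(temps) - 1)
    ]
    # Index (in the original arrays) of the steepest upward slope.
    i_tm = 1 + max(range(len(slopes)), key=slopes.__getitem__)
    i30 = interpolate(temps, intensities, 30.0)
    return i30 / intensities[i_tm], temps[i_tm]


# Synthetic curve: baseline 0.1 plus a sigmoidal transition centred at 55 °C.
temps = [float(t) for t in range(25, 96)]
curve = [0.1 + 1.0 / (1.0 + math.exp(-(t - 55.0) / 2.0)) for t in temps]
r30, tm = r30_from_curve(temps, curve)
print(round(r30, 3), tm)
```

Real DSF curves are noisier than this synthetic example, so some smoothing before differentiation would probably be needed in practice.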

Though the DSF properties of some proteins are sensitive to buffer conditions [29], results in our lab (unpublished) and others [30,31,32] suggest that for many proteins DSF results are consistent across a variety of buffers and protein concentrations. This may partially explain why characterization experiments done in one buffer have considerable power in predicting crystallization, even though crystallization conditions essentially always differ from any buffer used to test solution properties of the protein (Supplementary Table 4).

While it is not clear precisely what roles overall protein stability and local flexibility play in crystallization[6], low predicted disorder has been shown to be important for crystallographic success[6,10]. High predicted stability, moderate fraction of predicted loops and no long stretches of predicted disorder were favorable for crystallization in one set of mostly prokaryotic proteins[10]. In another set of proteins, no predictive power was seen for either experimentally measured overall stability or limited proteolysis, which may monitor loop flexibility, but low predicted disorder was important for success in crystallizing soluble prokaryotic proteins and also in expressing and crystallizing soluble eukaryotic proteins[6]. These findings are in agreement with our results showing that proteins with smaller predicted disordered regions (low Dismax) tend to crystallize better.

Most proteins require relatively pure solutions to crystallize. Gaussian SEC profiles indicate homogeneous protein solutions, or at least homogeneity of protein size. In some cases, protein crystallization requires SEC profiles close to Gaussian[19]. Our measure of SECR1 quantifies the purity of the protein sample in terms of hydrodynamic radius, which reflects the homogeneity of monomer or oligomer size and shape. A value of SECR1 less than 0.215 is incorporated in the partition tree obtained (Fig. 3b).
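To illustrate what a residual of this kind measures, the sketch below estimates the fraction of A280 signal lying outside a single Gaussian. It is not the fitting procedure used for SECR1: the Gaussian is matched to the profile by its moments (mean, variance and area) rather than by least-squares fitting, the samples are assumed to be evenly spaced and baseline-corrected, and the function name and the synthetic profiles are hypothetical.

```python
import math


def gaussian_residual_fraction(volumes, a280):
    """Fraction of total A280 not covered by one moment-matched Gaussian."""
    total = sum(a280)
    mean = sum(v * a for v, a in zip(volumes, a280)) / total
    var = sum(a * (v - mean) ** 2 for v, a in zip(volumes, a280)) / total
    step = volumes[1] - volumes[0]            # assumes evenly spaced samples
    # Amplitude chosen so that the Gaussian has the same area as the profile.
    amp = total * step / math.sqrt(2.0 * math.pi * var)
    fit = [amp * math.exp(-((v - mean) ** 2) / (2.0 * var)) for v in volumes]
    outside = sum(max(a - f, 0.0) for a, f in zip(a280, fit))
    return outside / total


def peak(volumes, centre, sigma):
    """Synthetic unit-height Gaussian elution peak."""
    return [math.exp(-((v - centre) ** 2) / (2.0 * sigma ** 2)) for v in volumes]


volumes = [i * 0.1 for i in range(301)]       # elution volume 0 to 30 mL
single = peak(volumes, 15.0, 1.0)
double = [a + b for a, b in zip(peak(volumes, 10.0, 1.0), peak(volumes, 18.0, 1.0))]

res_single = gaussian_residual_fraction(volumes, single)
res_double = gaussian_residual_fraction(volumes, double)
print(res_single, res_double)
```

In this toy case the single symmetric peak gives a residual close to zero, while the two well-separated peaks give a residual well above the 0.215 threshold used in the tree.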

Our DLSMR threshold near 2 in the partition tree is consistent with the finding that dimers and oligomers are favored for crystallization over monomers[6]. Other DLS-derived variables do not contribute to predictive power, possibly because the properties they measure were already accounted for by other variables used in the model. Our samples did not show the strong negative correlation between multidispersity and well-diffracting crystals seen in other work[18]. The YldS criterion of the decision tree is consistent with the high success rate observed in our structural genomics work for proteins that express very well, probably due to the relative ease of selecting highly purified fractions from purification columns (unpublished results). Thus for the decision tree from regression partitioning on combined experimental and sequence variables, the criteria are plausible given the known and expected correlates of those biophysical properties.

Combined consideration of several variables likely enhances prediction of crystallization outcome because multiple factors determine success in crystal growth. The molecular weight criterion in the predicting partition tree might reflect that larger proteins tend to contain multiple domains, some of which may be flexible with respect to each other. R30 from DSF experiments likely indicates a degree of flexibility of loops, motifs and domains. The symmetry of sizing chromatographic peaks is related to the homogeneity of the molecular species in the sample and its state of oligomerization. Long stretches of amino acids that are predicted to be disordered decrease the likelihood of forming regular crystal contacts. From the results obtained it appears that the well-crystallizing protein tends to be, in general, one with homogeneous particle size, stable folding at 30 °C, and few flexible domains, motifs or loops.

The analysis presented here was necessarily limited to protein samples for which full biophysical characterization data were available. Despite this relatively small set as compared to the number of targets available for sequence-only analysis, it is clear that joint consideration of multiple experimental variables in addition to sequence significantly improves prediction of crystallization and diffraction (Table 2), yielding higher accuracy than previously reported for methods based on sequence alone[6,10,12]. The improved predictive power gained by joint consideration of multiple experimental variables stands in contrast to the relatively poor correlation with success reported for single experimental measures[6]. It is quite possible that incorporating other experimental methods such as mass spectrometry[33], NMR data[34] and static light scattering[35] may further increase the predictive power of hybrid models.

The HyXG-1 hybrid predictor may be most useful in cases where proteins fail to crystallize on initial setup and the prediction is strongly positive or negative. The prediction can then help investigators prioritize their efforts towards an increased likelihood of success in producing diffracting crystals (Fig. 3b, right side). For instance, if the protein sample prepared has a high R30 and a molecular weight less than 36 kDa, strategies to lower the R30 are likely to be most effective. This might be achieved in several ways such as removing flexible termini by limited proteolysis; or by designing, cloning and expressing new truncations of the protein; or by switching to other species which contain fewer stretches of predicted disorder; or by replacing flexible segments by shorter linkers or by domains of known structure with little disorder.

We are developing a web site which will provide researchers with tools for assigning standardized quantitative descriptions to their experimental results, and for using these results to predict crystallization outcome and prioritize further efforts. Researchers will be invited to upload sets of protein characterizations and crystallization outcomes to help improve the predictive model by increasing the number of samples in the training set and adding new experimental methods to be considered.

5. Conclusion

We have developed a set of novel variables derived from biophysical data. Several of these such as R30 and DLSMR appear to be useful in predicting crystallization outcome. A predictive hybrid model, combining multiple biophysical characterization and sequence-derived data, such as the HyXG-1 decision tree derived by regression partition (Fig. 3b), is more powerful than sequence-based prediction alone – and therefore likely to be useful in guiding crystallization efforts.

Supplementary Material

01
02

Acknowledgements

We thank Drs. Chris Mehlin and Juergen Bosch for their contributions to this work in protein production and structure determination, respectively. We are indebted to Thomas E. Kammeyer for assistance in deciphering the Opticon Monitor 3 raw data format for DSF data. Funded by NIH grants GM64655, GM088518 and P01AI067921.

ABBREVIATIONS

DLS: dynamic light scattering

DS: diffraction score

DSF: differential scanning fluorimetry

HyXG-1: hybrid crystal growth prediction model-1

I30: intensity at 30 °C in DSF

ITm: intensity at the inflection point of a DSF curve

LP: limited proteolysis

R30: ratio of I30 to ITm

RMT: ratio of the intensity of minor transition(s) to the total intensity transition in a DSF curve

SCE: side chain entropy

SEC: size-exclusion chromatography

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Author Contributions

Database development and statistical analysis: Frank H. Zucker, Christine Stewart, Wim G.J. Hol

Bioinformatics and target selection: Frank H. Zucker, Christophe Verlinde, Easwara Subramanian, Fred Buckner

Protein production: Alberto J. Napuli, Natascha Mueller, Lisa J. Castaneda, Stephen Nakazawa Hewitt, Wesley C. Van Voorhis

Protein characterization: Jaclyn dela Rosa, Jessica Kim, Li Zhang, Liren Xiao, Jenni Ross, Alberto J. Napuli, Natascha Mueller, Lisa J. Castaneda, Stephen Nakazawa Hewitt

Protein crystallization: Jaclyn dela Rosa, Jessica Kim, Li Zhang, Liren Xiao, Jenni Ross, Wim G. J. Hol

X-ray data collection: Tracy Arakaki, Eric Larson, Ethan Merritt

Project coordination: Erkang Fan, Wim G.J. Hol

Manuscript writing: Frank H. Zucker, Christine Stewart, Ethan Merritt, Wim G. J. Hol

REFERENCES

  • 1. Chayen NE, Saridakis E. Protein crystallization: from purified protein to diffraction-quality crystal. Nat Methods. 2008 Feb;vol. 5:147–153. doi: 10.1038/nmeth.f.203.
  • 2. Rupp B, Wang J. Predictive models for protein crystallization. Methods. 2004 Nov;vol. 34:390–407. doi: 10.1016/j.ymeth.2004.03.031.
  • 3. Ericsson UB, Hallberg BM, Detitta GT, Dekker N, Nordlund P. Thermofluor-based high-throughput stability optimization of proteins for structural studies. Anal Biochem. 2006 Oct 15;vol. 357:289–298. doi: 10.1016/j.ab.2006.07.027.
  • 4. D'Arcy A. Crystallizing proteins - a rational approach? Acta Crystallogr D Biol Crystallogr. 1994 Jul 1;vol. 50:469–471. doi: 10.1107/S0907444993014362.
  • 5. Gao X, Bain K, Bonanno JB, Buchanan M, Henderson D, Lorimer D, Marsh C, Reynes JA, Sauder JM, Schwinn K, Thai C, Burley SK. High-throughput limited proteolysis/mass spectrometry for protein domain elucidation. J Struct Funct Genomics. 2005;vol. 6:129–134. doi: 10.1007/s10969-005-1918-5.
  • 6. Price WN, 2nd, Chen Y, Handelman SK, Neely H, Manor P, Karlin R, Nair R, Liu J, Baran M, Everett J, Tong SN, Forouhar F, Swaminathan SS, Acton T, Xiao R, Luft JR, Lauricella A, DeTitta GT, Rost B, Montelione GT, Hunt JF. Understanding the physical properties that control protein crystallization by analysis of large-scale experimental data. Nat Biotechnol. 2009 Jan;vol. 27:51–57. doi: 10.1038/nbt.1514.
  • 7. Graslund S, Sagemark J, Berglund H, Dahlgren LG, Flores A, Hammarstrom M, Johansson I, Kotenyova T, Nilsson M, Nordlund P, Weigelt J. The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins. Protein Expr Purif. 2008 Apr;vol. 58:210–221. doi: 10.1016/j.pep.2007.11.008.
  • 8. Rupp B. High-Throughput Crystallography at an Affordable Cost: The TB Structural Genomics Consortium Crystallization Facility. Acc. Chem. Res. 2003;vol. 36:173–181. doi: 10.1021/ar020021t.
  • 9. Bertone P, Kluger Y, Lan N, Zheng D, Christendat D, Yee A, Edwards AM, Arrowsmith CH, Montelione GT, Gerstein M. SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res. 2001 Jul 1;vol. 29:2884–2898. doi: 10.1093/nar/29.13.2884.
  • 10. Slabinski L, Jaroszewski L, Rychlewski L, Wilson IA, Lesley SA, Godzik A. XtalPred: a web server for prediction of protein crystallizability. Bioinformatics. 2007 Dec 15;vol. 23:3403–3405. doi: 10.1093/bioinformatics/btm477.
  • 11. Jaroszewski L, Slabinski L, Wooley J, Deacon AM, Lesley SA, Wilson IA, Godzik A. Genome pool strategy for structural coverage of protein families. Structure. 2008 Nov 12;vol. 16:1659–1667. doi: 10.1016/j.str.2008.08.018.
  • 12. Overton IM, Padovani G, Girolami MA, Barton GJ. ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction. Bioinformatics. 2008 Apr 1;vol. 24:901–907. doi: 10.1093/bioinformatics/btn055.
  • 13. Chen K, Kurgan L, Rahbari M. Prediction of protein crystallization using collocation of amino acid pairs. Biochem Biophys Res Commun. 2007 Apr 13;vol. 355:764–769. doi: 10.1016/j.bbrc.2007.02.040.
  • 14. Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S. CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Structural Biology. 2009;vol. 9:1–15. doi: 10.1186/1472-6807-9-50.
  • 15. Cooper DR, Boczek T, Grelewska K, Pinkowska M, Sikorska M, Zawadzki M, Derewenda Z. Protein crystallization by surface entropy reduction: optimization of the SER strategy. Acta Crystallogr D Biol Crystallogr. 2007 May;vol. 63:636–645. doi: 10.1107/S0907444907010931.
  • 16. Klock HE, Koesema EJ, Knuth MW, Lesley SA. Combining the polymerase incomplete primer extension method for cloning and mutagenesis with microscreening to accelerate structural genomics efforts. Proteins Structure, Function and Bioinformatics. 2007;vol. 71:982–994. doi: 10.1002/prot.21786.
  • 17. Hubbard S. The structural aspects of limited proteolysis of native proteins. Biochimica et Biophysica Acta - Protein Structure and Molecular Enzymology. 1998 Feb 17;vol. 1382:191–206. doi: 10.1016/s0167-4838(97)00175-1.
  • 18. Niesen FH, Koch A, Lenski U, Harttig U, Roske Y, Heinemann U, Hofmann KP. An approach to quality management in structural biology: biophysical selection of proteins for successful crystallization. J Struct Biol. 2008 Jun;vol. 162:451–459. doi: 10.1016/j.jsb.2008.03.007.
  • 19. Kawate T, Gouaux E. Fluorescence-detection size-exclusion chromatography for precrystallization screening of integral membrane proteins. Structure. 2006 Apr;vol. 14:673–681. doi: 10.1016/j.str.2006.01.013.
  • 20. Geerlof A, Brown J, Coutard B, Egloff MP, Enguita FJ, Fogg MJ, Gilbert RJ, Groves MR, Haouz A, Nettleship JE, Nordlund P, Owens RJ, Ruff M, Sainsbury S, Svergun DI, Wilmanns M. The impact of protein characterization in structural proteomics. Acta Crystallogr D Biol Crystallogr. 2006 Oct;vol. 62:1125–1136. doi: 10.1107/S0907444906030307.
  • 21. Fan E, Baker D, Fields S, Gelb MH, Buckner FS, Van Voorhis WC, Phizicky E, Dumont M, Mehlin C, Grayhack E, Sullivan M, Verlinde C, Detitta G, Meldrum DR, Merritt EA, Earnest T, Soltis M, Zucker F, Myler PJ, Schoenfeld L, Kim D, Worthey L, Lacount D, Vignali M, Li J, Mondal S, Massey A, Carroll B, Gulde S, Luft J, Desoto L, Holl M, Caruthers J, Bosch J, Robien M, Arakaki T, Holmes M, Le Trong I, Hol WG. Methods in Molecular Biology, Structural Proteomics - High-throughput Methods. 2008/06/11 ed. vol. 426. Totowa, NJ: Humana Press, Inc.; 2008. Structural Genomics of Pathogenic Protozoa: An Overview; pp. 497–513.
  • 22. Mehlin C, Boni E, Buckner FS, Engel L, Feist T, Gelb MH, Haji L, Kim D, Liu C, Mueller N, Myler PJ, Reddy JT, Sampson JN, Subramanian E, Van Voorhis WC, Worthey E, Zucker F, Hol WG. Heterologous expression of proteins from Plasmodium falciparum: results from 1000 genes. Mol Biochem Parasitol. 2006 Aug;vol. 148:144–160. doi: 10.1016/j.molbiopara.2006.03.011.
  • 23. Arakaki T, Le Trong I, Phizicky E, Quartley E, DeTitta G, Luft J, Lauricella A, Anderson L, Kalyuzhniy O, Worthey E, Myler PJ, Kim D, Baker D, Hol WGJ, Merritt EAM. Structure of Lmaj006129AAA, a hypothetical protein from Leishmania major. Acta Cryst. 2006;vol. F62:175–179. doi: 10.1107/S1744309106005902.
  • 24. Deng J, Davies DR, Wisedchaisri G, Wu M, Hol WGJ, Mehlin C. An improved protocol for rapid freezing of protein samples for long-term storage. Acta Crystallographica Section D: Biological Crystallography. 2004;vol. 60:203–204. doi: 10.1107/s0907444903024491.
  • 25. Luft JR, Collins RJ, Fehrman NA, Lauricella AM, Veatch CK, DeTitta GT. A deliberate approach to screening for initial crystallization conditions of biological macromolecules. J Struct Biol. 2003 Apr;vol. 142:170–179. doi: 10.1016/s1047-8477(03)00048-0.
  • 26. Niesen FH, Berglund H, Vedadi M. The use of differential scanning fluorimetry to detect ligand interactions that promote protein stability. Nat Protoc. 2007;vol. 2:2212–2221. doi: 10.1038/nprot.2007.321.
  • 27. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982 May 5;vol. 157:105–132. doi: 10.1016/0022-2836(82)90515-0.
  • 28. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;vol. 11:1453–1459. doi: 10.1016/j.str.2003.10.002.
  • 29. Vedadi M, Niesen FH, Allali-Hassani A, Fedorov OY, Finerty PJ, Jr, Wasney GA, Yeung R, Arrowsmith C, Ball LJ, Berglund H, Hui R, Marsden BD, Nordlund P, Sundstrom M, Weigelt J, Edwards AM. Chemical screening methods to identify ligands that promote protein stability, protein crystallization, and structure determination. Proc Natl Acad Sci U S A. 2006 Oct 24;vol. 103:15835–15840. doi: 10.1073/pnas.0605224103.
  • 30. Lavinder JJ, Hari SB, Sullivan BJ, Magliery TJ. High-throughput thermal scanning: a general, rapid dye-binding thermal shift screen for protein engineering. J Am Chem Soc. 2009 Mar 25;vol. 131:3794–3795. doi: 10.1021/ja8049063.
  • 31. Yeh AP, McMillan A, Stowell MH. Rapid and simple protein-stability screens: application to membrane proteins. Acta Crystallogr D Biol Crystallogr. 2006 Apr;vol. 62:451–457. doi: 10.1107/S0907444906005233.
  • 32. Jarvest RL, Berge JM, Brown MJ, Brown P, Elder JS, Forrest AK, Houge-Frydrych CS, O'Hanlon PJ, McNair DJ, Rittenhouse S, Sheppard RJ. Optimisation of aryl substitution leading to potent methionyl tRNA synthetase inhibitors with excellent gram-positive antibacterial activity. Bioorg Med Chem Lett. 2003 Feb 24;vol. 13:665–668. doi: 10.1016/s0960-894x(02)01027-2.
  • 33. Jeon WB, Aceti DJ, Bingman CA, Vojtik FC, Olson AC, Ellefson JM, McCombs JE, Sreenath HK, Blommel PG, Seder KD, Burns BT, Geetha HV, Harms AC, Sabat G, Sussman MR, Fox BG, Phillips GN, Jr. High-throughput Purification and Quality Assurance of Arabidopsis thaliana Proteins for Eukaryotic Structural Genomics. Journal of Structural and Functional Genomics. 2005;vol. 6:143–147. doi: 10.1007/s10969-005-1908-7.
  • 34. Page R, Peti W, Wilson IA, Stevens RC, Wüthrich K. NMR screening and crystal quality of bacterially expressed prokaryotic and eukaryotic proteins in a structural genomics pipeline. PNAS. 2005 Feb 8;vol. 102:1901–1905. doi: 10.1073/pnas.0408490102.
  • 35. Wilson W. Light scattering as a diagnostic for protein crystal growth—A practical approach. Journal of Structural Biology. 2003;vol. 142:56–65. doi: 10.1016/s1047-8477(03)00038-8.
