Author manuscript; available in PMC: 2025 Aug 8.
Published in final edited form as: Structure. 2024 May 15;32(8):1248–1259.e5. doi: 10.1016/j.str.2024.04.017

DomainFit: Identification of Protein Domains in cryo-EM maps at Intermediate Resolution using AlphaFold2-predicted Models

Jerry Gao 1,2,#, Maxwell Tong 1,2,#, Chinkyu Lee 3, Jacek Gaertig 3, Thibault Legal 1,2,*, Khanh Huy Bui 1,2,**,4
PMCID: PMC11316655  NIHMSID: NIHMS1990910  PMID: 38754431

Summary

Cryo-electron microscopy (cryo-EM) has revolutionized the structural determination of macromolecular complexes. With the paradigm shift to structure determination of highly complex endogenous macromolecular complexes ex vivo and in situ structural biology, there are an increasing number of structures of native complexes. These complexes often contain unidentified proteins, related to different cellular states or processes. Identifying proteins at resolutions lower than 4 Å remains challenging because side chains cannot be visualized reliably. Here, we present DomainFit, a program for semi-automated domain-level protein identification from cryo-EM maps, particularly at resolutions lower than 4 Å. By fitting domains from AlphaFold2-predicted models into cryo-EM maps, the program performs statistical analyses and attempts to identify the domains and protein candidates forming the density. Using DomainFit, we identified two microtubule inner proteins, one of which contains a CCDC81 domain and is exclusively localized in the proximal region of the doublet microtubule in Tetrahymena thermophila.

Keywords: AlphaFold2 modelling, cryo-EM, cryo-ET, subtomogram averaging, protein identification, doublet, cilia, microtubule inner protein


eTOC Blurb

Cryo-electron microscopy enables structure determination of native macromolecular complexes, which may contain unidentified densities at resolutions too low for direct identification. Gao and Tong et al. present DomainFit. By fitting protein domains from AlphaFold2-predicted models, DomainFit identifies the best-fitting domains in a given density, allowing protein identification.

Introduction

In the last decade, cryo-electron microscopy (cryo-EM) has become a powerful technique to determine the structures of macromolecular complexes. With advances in high-throughput data acquisition and cryo-EM image processing algorithms, large endogenous complexes have been solved at high resolutions such as the phycobilisomes1, the mitochondrial membrane bending supercomplex2 and the doublet microtubules of the cilia3,4. Recently, in situ cryo-electron tomography (cryo-ET) from thin lamellae of cells prepared by a cryo-focused ion beam (cryo-FIB) instrument has emerged as a transformative technique, revolutionizing our understanding of cellular structures and molecular processes5. With high-throughput tilt series acquisition and improvement of subtomogram averaging software6–13, subtomogram averaging of complexes can reach sub-nanometer resolution and in some cases better than 4 Å resolution11,14–16. Structures of large complexes obtained ex vivo and in situ often contain varying unknown components depending on cell type, environmental factors, purification conditions and the stage of a given cellular process. Identification of these unknown components could help the understanding of the functions of the corresponding complexes in cellulo.

For high-resolution structures, there are many methods that can be used for the modelling and identification of unknown proteins including CryoID, DeeptracerID, FindMySequence, and ModelAngelo17–20. These programs trace and model the backbone of the protein density. Then, the identity of the protein is predicted by comparing the side-chain densities against a database of protein sequences. This approach was successful in determining many proteins in the doublet microtubule21,22, radial spokes and central apparatus and the phycobilisomes20. However, all these methods only work reliably at a resolution better than 4 Å.

In the last few years, there has been a breakthrough in artificial intelligence (AI)-assisted protein structure prediction. Programs like AlphaFold23, ColabFold24 and RoseTTAFold25 can produce accurate structure predictions. With the establishment of the AlphaFold Protein Structure Database (AlphaFold DB), everyone now has access to over 200 million predicted models. For certain organisms such as humans, mice, zebrafish and nematodes, AlphaFold2-predicted models of the entire proteome are available.

Given the high accuracy of the AI-predicted models, it is possible to identify the fold signature of a protein by fitting the AI-predicted models into the density. Using complementary data such as in situ chemical crosslinking, quantitative and proximity mass spectrometry, it is possible to identify proteins. For example, microtubule inner proteins (MIPs) were identified from the in situ subtomogram average of human sperm doublet microtubules at ~6–7 Å by fitting over 21,000 AlphaFold2-predicted protein models of the mouse proteome with the Colores program from the Situs package26. In another study, 38 proteins were identified from the mouse sperm by manually fitting AlphaFold2-predicted models of proteins found in the mass spectrometry analysis of the same sample15.

While those studies illustrate that it is possible to identify well-structured proteins from a map at an intermediate resolution (4–8 Å), the methods are not automated or are difficult to use. In addition, the criteria to evaluate and identify proteins from the fitting results of all proteins are not clear from these studies. While the accuracy of the AI-predicted model is high for compact domains, the tertiary structure of the predicted model might not be correct, due in part to poor prediction of flexible regions. Therefore, protein identification is easier when domains are extracted from each AI-predicted model and used instead of the entire model.

In this work, we developed a program called DomainFit to identify domains of proteins from cryo-EM maps, particularly at intermediate resolution when it is not possible to do so with high-resolution modelling methods. DomainFit uses popular programs such as Phenix27, R28 and UCSF ChimeraX (ChimeraX)29 making it easy to install and accessible to people. In addition, DomainFit is flexible and has a clear statistical and visual approach to evaluate the fit and the identity of proteins. Our workflow will be useful for the upcoming wave of highly complex ex vivo structures using single-particle cryo-EM structure and in situ structures from cryo-FIB and cryo-ET.

Results

An automated pipeline for domain parsing and fitting into cryo-EM map

With the aim of domain/protein identification from unidentified cryo-EM density, we developed DomainFit to automate the process of domain parsing, fitting and evaluation (Fig. 1). First, the unidentified density is segmented manually from the cryo-EM map (See Materials & Methods). Then, a database of AI-predicted models of candidate proteins (i.e. full species proteome or a limited list based on mass spectrometry studies) is downloaded automatically from the AlphaFold DB or generated using a local/cloud instance of a prediction program such as AlphaFold, RoseTTAFold or ColabFold23–25. Next, every model is divided into different domains automatically. Each individual domain is then fitted into the segmented cryo-EM density. Finally, DomainFit generates a statistical evaluation of all the fitted domains. The identification of the correct domain can be based purely on P value statistical analysis for high resolution density but also on other complementary data such as surrounding densities, protein size, chemical crosslinking and quantitative mass spectrometry. The process of domain parsing, fitting and statistical evaluation is done entirely by command line, enabling its parallelization and automation. In addition, the program provides visualization to quickly evaluate and identify the top domain candidate from the fitting process.

Figure 1: Workflow of DomainFit.

First, the AI-predicted models database for all candidate proteins is built using a local prediction program or downloaded from AlphaFold DB based generally on proteomics data (Database Build-Up). Then, the domains are parsed from the predicted models into single PDB files (Domain Parsing). After that, every domain is fitted into the unknown segmented density (Domain Fitting). Finally, all domain fitting results are analysed to identify the top hit based on statistical analysis (Domain Identification).

Using P value as an indicator for domain identification

While cross correlation (CC) is usually used directly as the criterion to evaluate the best fit of each domain and the overall top hit among all domains, DomainFit uses a P value calculated from the unique fits of each domain to assess the significance of the best fit and to compare all domains.

The program uses the ChimeraX fitmap (fitmap) command to fit each domain in the density from a fixed number of randomly generated, initial search positions within the density, i.e. initial search placements (Fig. 2A). Instead of an exhaustive six-dimensional search, fitmap performs a global search where it places the model in random orientations and shifts within the map, and performs rigid-body local optimization from each initial placement. fitmap then groups the fitting solutions based on similarity in orientations and translations to generate a list of “unique fits” with corresponding CC coefficients. Provided that enough initial search placements are used, fitmap performs well to identify the correct fit30.
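As an illustration of how one such global search might be issued, the sketch below assembles a ChimeraX fitmap command string in Python. The helper name `build_fitmap_command`, the model IDs and all option values are assumptions made for this example; the options shown (search, resolution, metric, clusterAngle, clusterShift) are taken from the ChimeraX fitmap command as best understood and should be checked against the installed ChimeraX version. The settings actually used by DomainFit are not specified in this section.

```python
def build_fitmap_command(model_id, map_id, n_search=200, resolution=4.0,
                         cluster_angle=6.0, cluster_shift=3.0):
    """Assemble a ChimeraX 'fitmap' global-search command string.

    All defaults are illustrative assumptions, not DomainFit's settings.
    n_search is the number of random initial search placements.
    """
    if n_search < 1:
        raise ValueError("n_search must be a positive integer")
    parts = [
        f"fitmap #{model_id} inMap #{map_id}",
        f"search {n_search}",              # random initial placements
        f"resolution {resolution:g}",      # simulate a map from the model
        "metric correlation",              # score fits by CC
        f"clusterAngle {cluster_angle:g}", # degrees; merges similar fits
        f"clusterShift {cluster_shift:g}", # angstroms; merges similar fits
    ]
    return " ".join(parts)


# Hypothetical IDs: domain model is #1, segmented density is #2.
command = build_fitmap_command(1, 2, n_search=1000, resolution=6)
print(command)
```

The resulting string could be pasted into the ChimeraX command line or written to a command script for batch use.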

Figure 2. Domain fitting and evaluation in DomainFit.

(A) Schematic of ChimeraX fitmap function. The domain is placed in the density at random orientation and translation (red dots in the grid of the map). Local optimization is done for each placement and the results are clustered into a unique fit list based on rotation and translation. (B) Best fit of a correct domain inside the density (left) and the histogram of the Fisher z-transformed correlation score of unique fits from such domain (right). The best fit is separated from the score distribution, resulting in a significantly low P value. (C) Best fit of a wrong domain inside the same density (left) and the histogram of the Fisher z-transformed correlation score of unique fits from such domain (right). Since the Fisher z-transformed correlation score of the best fit is not well separated from the rest, the P value is not low. (D) Scatter plot of -log10 of the Benjamini-Hochberg-adjusted P value versus the CC coefficient of each domain fitted into the density of interest. The point in the top right corner is well separated from the rest of the point cloud, indicating a domain that matches the density well. See also Figure S1.

For each domain, the CC coefficients of the unique fits can be utilized for statistical evaluation, to check whether the best fit of that domain is likely correct. To do so, the CC coefficients of the unique fits are Fisher z-transformed to yield an approximately normal distribution and then used for the calculation of the P value of the best fit for each domain, independently of other domains. When the correct domain is fitted into a density, the Fisher z-transformed score of the best fit is clearly distinguishable from the other fits (i.e. it has a much higher CC coefficient) (Fig. 2B); hence, its associated P value is significantly lower. On the other hand, when an incorrect domain is fitted into a density, the Fisher z-transformed score of the best fit is not noticeably higher than that of the other fits; therefore, its P value is not as low as the one from the correct domain (Fig. 2C) (See Materials & Methods). The fitmap and P value calculation of every domain are done during the fitting process of DomainFit. It is worth noting that the overall pipeline, built on fitmap and the command line, is similar to the Assembline and efitter scripts31.
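The idea of scoring the best fit against the other unique fits of the same domain can be sketched as follows. This is a minimal illustration, not the DomainFit implementation: it assumes that the remaining unique fits are treated as a normal sample and that the P value is the one-sided upper-tail probability of the best fit under that sample. The function names and the CC values are hypothetical.

```python
import math
from statistics import mean, stdev


def fisher_z(cc):
    """Fisher z-transform of a correlation coefficient (-1 < cc < 1)."""
    return math.atanh(cc)


def best_fit_p_value(cc_values):
    """One-sided P value of the best unique fit versus the remaining fits.

    Assumption for this sketch: the Fisher z-scores of the remaining
    unique fits are modelled as a normal sample.
    """
    z = sorted((fisher_z(cc) for cc in cc_values), reverse=True)
    if len(z) < 3:
        raise ValueError("need at least three unique fits")
    best, rest = z[0], z[1:]
    spread = stdev(rest)
    if spread == 0:
        raise ValueError("remaining fits have zero variance")
    score = (best - mean(rest)) / spread
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(score / math.sqrt(2))


# Hypothetical CC coefficients of unique fits for two domains.
correct_domain = [0.82, 0.41, 0.39, 0.43, 0.40, 0.38, 0.42]
wrong_domain = [0.45, 0.44, 0.41, 0.43, 0.40, 0.42, 0.39]

print(best_fit_p_value(correct_domain))  # very small: best fit stands out
print(best_fit_p_value(wrong_domain))    # larger: best fit is not separated
```

The guard on the number of unique fits reflects the point that too few fits give an unreliable estimate of the score distribution.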

The P value is a more reliable indication of true positives than the CC since it is less sensitive to the size of the domain fitted in the density. Small domain models, either by nature or due to over-partitioning, tend to have high CC coefficients because they fit perfectly in a small area of the density. To easily visualize the top hit and identify the correct domain, we plotted the P value vs the CC for all domains (Fig. 2D).
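The coordinates of such a scatter plot can be computed as in the sketch below, which applies the standard Benjamini-Hochberg adjustment across all domains and returns the CC coefficient together with -log10 of the adjusted P value. The function names, the floor used to avoid log10(0) and the example values are assumptions for illustration.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg-adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def scatter_coordinates(hits, floor=1e-300):
    """Map (name, p_value, cc) tuples to (name, cc, -log10 adjusted P)."""
    adjusted = benjamini_hochberg([p for _, p, _ in hits])
    return [
        (name, cc, -math.log10(max(q, floor)))
        for (name, _, cc), q in zip(hits, adjusted)
    ]


# Hypothetical domains: (name, raw P value, CC coefficient).
hits = [
    ("domain_A", 1e-10, 0.80),
    ("domain_B", 0.02, 0.85),
    ("domain_C", 0.30, 0.60),
]
for name, cc, score in scatter_coordinates(hits):
    print(f"{name}: CC={cc:.2f}, -log10(adj P)={score:.2f}")
```

A domain that is well separated on both axes, like domain_A on the vertical axis here, is the kind of point that stands apart from the point cloud.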

For the P value to be estimated properly, a reasonable number of data points, i.e. unique fits, is needed. Our simulation for a small density of ~100–150 amino acids and its corresponding domain shows that the resulting P value plateaus above 150 initial search placements (Figure S1). Larger domains and densities may require more initial search placements so that the local optimization of fitmap covers the larger volume. Thus, setting a high number of initial search placements, e.g. 1000, should be enough to ensure that the global optimum is found for larger densities and domains.

P value outperforms CC for domain identification

To assess the efficiency of P value over CC, we tested DomainFit on segmented densities from the single-particle cryo-EM map of the doublet microtubule from Tetrahymena thermophila (EMD-29685) at ~4 Å resolution21. More than 40 proteins (MIPs) were identified and modelled using an AI-assisted modelling approach (Fig. 3A). Our goal for this test was to assess whether we could identify the MIPs using DomainFit independently and compare the performance of P value with CC.

Figure 3: Identification of MIPs in the T. thermophila doublet microtubule.

(A) Structure of the doublet microtubule with MIPs (PDB: 8g2z). (B) Densities (red) in the cryo-EM map of the doublet microtubule (EMD-29685) segmented for testing with DomainFit. (C) DomainFit ranking by P value and CC coefficient of the correct domain for the 24 segmented densities. (D) Example of the perfect match between the AlphaFold2-predicted model of PACRGB fitted within the corresponding density. (E) The domain of RIB22 is not partitioned as expected. However, the bigger domain was still fitted correctly into the density due to good agreement between the AlphaFold2-predicted model and the density. Due to under-partitioning and the fact that the EF-hand domain is abundant in the cilium, the fit of the correct domain is ranked 9th by CC coefficient but still 1st by P value. (F) When the fold of the density is common (EF-hand domain RIB57(1–91), Density 10), the top hits consist of domains with a similar fold. The top five hits are all domains from CFAP115, which have the same EF-hand fold as the correct domain RIB57(1–91). (G) The experimentally determined model of RIB57 (PDB: 8g2z) is shown inside its corresponding density. (H) There is a big discrepancy between the experimentally determined model (cyan) and AlphaFold2-predicted model (purple) of RIB57(1–91) (RMSD: 5.164 Å), which leads to the poor fitting of RIB57. See also Figure S2 and Table S1 and S3.

We segmented 24 densities from the cryo-EM map corresponding to different domains of MIPs (Table S1, Figure 3B). We searched each density against a database comprising 856 AlphaFold2-predicted models based on the mass spectrometry data of the doublet microtubules21.

For 22 of the 24 densities tested, the correct domain was among the top five hits ranked by P value, most significant first (Fig. 3C, Figure S2 and Table S1). Our results showed that the P value is a better predictor of correct identification than the CC (Fig. 3C). Often, the correct domain was the top hit with both the highest CC coefficient and the best P value, as for PACRGB, an inner junction protein32 (Fig. 3D). Hereafter, top hits will refer to the top P value ranked hits.
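The comparison between the two ranking criteria can be illustrated with a short sketch that reports the rank of the known correct domain under each one. The domain names and scores below are hypothetical and were chosen to mimic the situation in which small domains reach high CC coefficients while the correct domain has the most significant P value.

```python
def rank_hits(hits, correct_domain):
    """Return (rank by P value, rank by CC) of the correct domain, 1-based.

    P values are ranked in ascending order (most significant first) and
    CC coefficients in descending order (highest first).
    """
    by_p = sorted(hits, key=lambda h: h["p_value"])
    by_cc = sorted(hits, key=lambda h: h["cc"], reverse=True)
    p_rank = [h["domain"] for h in by_p].index(correct_domain) + 1
    cc_rank = [h["domain"] for h in by_cc].index(correct_domain) + 1
    return p_rank, cc_rank


# Hypothetical fitting results for one segmented density.
hits = [
    {"domain": "correct_D1", "p_value": 1e-12, "cc": 0.71},
    {"domain": "small_D2", "p_value": 3e-3, "cc": 0.83},
    {"domain": "small_D1", "p_value": 2e-2, "cc": 0.79},
    {"domain": "other_D3", "p_value": 4e-1, "cc": 0.55},
]

p_rank, cc_rank = rank_hits(hits, "correct_D1")
print(f"rank by P value: {p_rank}, rank by CC: {cc_rank}")
```

Counting how often the P value rank falls within the top five across all test densities gives the kind of summary reported for the 24 densities.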

There are cases where the partitioned domain is bigger than the segmented density (Table S1, Figure S2, Density 6, 10–13, 16–17, 19–22, 24). For example, EF-hand domain pairs are usually parsed as a single domain instead of two separate EF-hand domains, as in RIB22 (Fig. 3E). The correct domain was still found: the density represents only RIB22(1–91), yet it was matched by the bigger domain model RIB22(4–191) (Fig. 3E, Table S1).

Interestingly, for Density 10, corresponding to the EF-hand domain RIB57(1–91), DomainFit failed to find the correct domain, although the top hits did have the correct EF-hand fold (Fig. 3F). There was a big difference between the experimentally determined model of RIB57 (Fig. 3G) and the AlphaFold2-predicted model (RMSD 5.164 Å, Fig. 3H). Recently, it was reported that even high-confidence AlphaFold2-predicted models can differ from experimental maps through global-scale distortion and domain orientation33.

DomainFit works across resolutions

Next, we wanted to know whether DomainFit could successfully identify domains at lower resolutions. We first ran a simulation with lowpass-filtered density maps of the MIPs at 4, 6, 8, and 10 Å (Fig. 4A). We found that the search consistently identified the correct domain within the top 10 hits at 4 to 8 Å resolution (Fig. 4B). At lower resolutions, there were significantly fewer unique fits found even with the same number of initial search placements, likely due to the lack of details in the density (Figure S3A). Therefore, at low resolution, fitmap produces a small number of unique fits from a large number of initial search placements which leads to less confidence in P value estimation of the best fit. If we compare the Fisher z-transformed correlation scores at 4, 6, 8 and 10 Å of the correct domain (Figure S3B, C), we can see that at 6 Å, the distribution of the unique fits and the best fit still maintains its shape, allowing reliable P value calculation. At 8 Å, while the best fit is still well separated from the rest, the Fisher z-transform of the CC of the unique fits does not look normally distributed. At 10 Å, not only is the distribution of the Fisher z-transformed scores not normally distributed but the difference between the best fit and the rest gets smaller. As a result, DomainFit seems to work best in the 4-to-6 Å resolution range. At resolutions lower than that, DomainFit is still useful for finding types of folds that fit well in the density by visual inspection of the top hits but, in the absence of other information, it should not be used for domain identification.

Figure 4: Performance of DomainFit at different resolutions.

(A) Appearance changes of a density filtered at 4, 6, 8 and 10 Å resolution. (B) P value rank of the correct domains fitted to different densities of MIPs at 4, 6, 8, and 10 Å resolution. (C) Map of a partial microtubule doublet along with radial spokes 1 and 2 from Chlamydomonas reinhardtii (EMD-22475, EMD-22481, EMD-22483). The segmented densities used to test DomainFit are colored in green and pink. (D) Segmented density from radial spoke 1 corresponding to RSP25 and RSP26. The parsed domains of RSP25 and RSP26 are fitted into the density. Two other high-ranking domains A0A2K3DZR1 and P05434 are also fitted into the density. (E) Benjamini-Hochberg-adjusted P value vs CC coefficient of the domains fitted into the RSP25 density. (F) Segmented density from radial spoke 2 corresponding to RSP24. The two parsed domains of RSP24 are fitted into the density. (G) Benjamini-Hochberg-adjusted P value vs CC coefficient of the domains fitted into the RSP24 density. (H) Segmented density corresponding to TrxL1 and TrxL2 from the subpellicular microtubule of T. gondii used to test DomainFit (EMD-26019). (I) Benjamini-Hochberg-adjusted P value vs CC coefficient of the domains fitted into the TrxL density. See also Figure S3 and Table S3.

To test the efficiency of DomainFit on a cryo-EM map at higher resolution, we chose the cryo-EM maps of the radial spokes from the axoneme of Chlamydomonas reinhardtii34. Although most proteins were identified by the authors, some densities were unassigned until ModelAngelo was used to identify new radial spokes proteins including RSP24, RSP25 and RSP2620. We segmented those densities and ran DomainFit to identify the corresponding domains from 2181 proteins from the C. reinhardtii ciliome35 (Fig. 4C). DomainFit successfully identified RSP25 and RSP26 (Fig. 4D, E). Two other proteins, P05434 and A0A2K3DZR1 had low P values and high CC coefficients (Fig. 4D, E). They indeed fitted well into the segmented density but did not fill it as well as RSP25 and RSP26 (Fig. 4D). This reinforces our conclusion above that P value is a better indicator than CC.

DomainFit also identified both domains of RSP24 successfully (Fig. 4F and G). Each domain was fitted into a different part of the map by DomainFit, similar to the model that was automatically built by ModelAngelo. Both domains had the smallest P values and higher CC coefficients than the other domains that were fitted into the density (Fig. 4G).

To test whether DomainFit can distinguish homologs in a density, we segmented the densities corresponding to TrxL1 and TrxL2 in the subtomogram average of the subpellicular microtubule of Toxoplasma gondii reported at 6.7 Å resolution36 (Fig. 4H). TrxL1 and TrxL2 are homologs and therefore have a very similar structure (Figure S3D). DomainFit successfully found TrxL1 and TrxL2 as the best solutions from 95 proteins associated with the subpellicular cytoskeleton37; however, it fitted them both in the same part of the density (Fig. 4H, I). Therefore, DomainFit cannot differentiate between homologs when the details in the map corresponding to the differences between the homologs' structures are not resolved.

New MIP identified in the doublet microtubule

Following our success in testing and validating DomainFit in different cases, we attempted to identify unknown proteins from cryo-EM maps. An unidentified MIP was previously reported in the subtomogram average of the T. thermophila doublet microtubule38 (Fig. 5A) but not in the single particle cryo-EM map of the doublet microtubule21 (Fig. 5B).

Figure 5: New proteins identified from the doublet microtubule of T. thermophila.

(A) Overview of the unknown density (yellow) from the 48-nm repeating unit subtomogram average map of doublet microtubule (EMD-24376)38. (B) The same density is not visible in the same view of the single particle cryo-EM map of the 48-nm repeating unit of the doublet microtubule from the T. thermophila K40R map (EMD-29692)21. (C) The unknown density exists in the doublet microtubule map after 3D classification. (D) Segmentation of the unknown density into different small unknown densities for DomainFit. (E) Top hits for each unknown density using DomainFit search on an AlphaFold database of salt-treated ciliome. (F) Corrected model of protein domains after inspection, showing that BMIP1 (UniprotID I7MB72) is density 1 and CCDC81B (UniprotID I7M688) forms densities 2–12. (G) Fit of AlphaFold2-predicted model BMIP1(12–168) into density 1. (H) Fit of AlphaFold2-predicted model CCDC81B(38–218) into density 2. See also Figure S4, S5 and Table S2.

Using 3D classification and refinement of the cryo-EM dataset of the T. thermophila K40R doublet microtubule21, we obtained a 4.3 Å resolution cryo-EM map with the same density as the subtomogram average (Fig. 5C, Figure S4A, B).

To identify the proteins in the unknown density, we manually segmented the density into 12 smaller densities (Fig. 5D) and performed a DomainFit search with 166 proteins identified in the proteomics of the salt-treated doublet microtubules3.

Interestingly, the top hits for the 12 densities came from only three proteins (Fig. 5E, Table S2). Upon inspecting the results, we concluded that the 12 densities are composed of only two proteins, I7MB72 (TTHERM_00525130) and I7M688 (TTHERM_00649260), because of the overall architecture (Fig. 5F) and the fit of unique domains in both proteins (Fig. 5G, H).

I7MB72 matches unknown density 1 with high confidence in P value and CC coefficient (Fig. 5F), hereafter BMIP1. Upon inspection, unknown density 1 was always present in the doublet microtubule (EMD-29692) but was not identified previously. We could model extra regions of the protein BMIP1 in the T. thermophila K40R map (EMD-29692) at 3.5 Å resolution, confirming the identity of unknown density 1 and the success of DomainFit (Fig. 6A, B, Table 1). BMIP1 has a crosslink to CFAP45, which exists as two copies between protofilaments B8B9 and B7B8, as shown in studies that used chemical crosslinking coupled with mass spectrometry of cilia39 (Fig. 6C). While the crosslinked residues (Lysines 213 and 222) from BMIP1 are not modelled, they are within the crosslink distance (< 35 Å) from lysines 403 and 283 of CFAP45.
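A crosslink distance check of this kind can be expressed in a few lines. The sketch below is only an illustration: the coordinates are hypothetical placeholders rather than values from the deposited model, the helper name is invented, and the 35 Å cutoff is the threshold quoted above.

```python
import math

MAX_CROSSLINK_DISTANCE = 35.0  # angstroms, threshold quoted in the text


def crosslink_satisfied(atom_a, atom_b, max_distance=MAX_CROSSLINK_DISTANCE):
    """Return True if two (x, y, z) positions lie within the cutoff."""
    return math.dist(atom_a, atom_b) < max_distance


# Hypothetical C-alpha coordinates in angstroms (not from the real model).
lysine_in_protein_a = (102.4, 88.1, 230.7)
lysine_in_protein_b = (118.9, 97.3, 251.2)

distance = math.dist(lysine_in_protein_a, lysine_in_protein_b)
print(f"distance = {distance:.1f} A, "
      f"satisfied = {crosslink_satisfied(lysine_in_protein_a, lysine_in_protein_b)}")
```

When a crosslinked residue is not modelled, the position of the nearest modelled residue could serve as a stand-in, with the caveat that the true distance is then only approximated.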

Figure 6: Validation of the newly identified MIPs.

(A) Model of BMIP1 (UniprotID I7MB72). (B) Side-chain densities of a helix from BMIP1. (C) Intra-crosslinks (orange) within CCDC81B and inter-crosslinks (yellow) between CCDC81B and CFAP106, a known MIP protein32. (D) Merged super resolution-structured illumination microscopy image of T. thermophila cells with CCDC81B-GFP (red), polyglycylated tubulin (green) and DNA (blue). CCDC81B only localizes to the proximal region of the cilia. Rectangle indicates inset shown in (E). (E) Inset from (D) with merged image (top) and GFP channel (bottom) showing the localization of CCDC81B to the proximal region.

Table 1: Refinement statistics of the proximal MIP models

Bonds (RMSD)
 Length (Å) (# > 4σ) 0.003 (0)
 Angles (°) (# > 4σ) 0.628 (4)
MolProbity score 1.75
Clash score 12.26
Ramachandran plot (%)
 Outliers 0.15
 Allowed 2.70
 Favored 97.15
Rama-Z (Ramachandran plot Z-score, RMSD)
 whole (N = 1335) 1.47 (0.23)
 helix (N = 862) 1.47 (0.18)
 sheet (N = 27) −0.31 (0.86)
 loop (N = 446) 0.25 (0.31)
Rotamer outliers (%) 0.00
Cβ outliers (%) 0.00
Peptide plane (%)
 Cis proline/general 0.0/0.0
 Twisted proline/general 0.0/0.0

Densities 2–12 (Fig. 5D) belong to I7M688, a CCDC81-domain containing protein, referred to as CCDC81B here onwards. The CCDC81 domain is unambiguously fitted into unknown density 2 (Fig. 5H). There are a few CCDC81-containing proteins in the cilia of T. thermophila: IJ34 (UniprotID I7M9T0), CCDC81A (UniprotID I7MLF6), CCDC81B (UniprotID I7M688), Q240Y1 and Q22HG4. IJ34 is a MIP near the inner junction21 and CCDC81A was identified as a ciliary tip protein40. Q240Y1 and Q22HG4 are clear paralogs of CCDC81B with similar architecture but are not abundant21.

In situ crosslinking mass spectrometry shows that CCDC81B is crosslinked to α-tubulin, CFAP106, FAP210, and interestingly RIB43A39. All these crosslinks are satisfied relative to the CCDC81B position in the unknown density, except for RIB43A (Fig. 6C). Therefore, we are confident that CCDC81B corresponds to unknown densities 2–12.

For a quick comparison of DomainFit with complete AlphaFold2 model fitting, we also fitted 166 complete AlphaFold2-predicted models into the full segmented density (Figure S5). The top hits were IJ34 and CCDC81B. Both proteins contain a CCDC81 domain (Figure S5B). BMIP1 ranks only 22nd by P value. In this case, the AlphaFold2-predicted model of CCDC81B has good tertiary structure prediction, which helps the complete model fitting. This comparison shows the weaknesses of complete model fitting. First, complete model fitting seems to penalize smaller domains such as BMIP1. Second, it does not allow size estimation of the fitted domains, which is useful to visually eliminate false positive fits. Finally, with DomainFit, we can visualize the top hits from different densities (Figure S5A), allowing a quicker identification of multidomain proteins without relying on the quality of the tertiary structure prediction. On the other hand, complete model fitting of 166 AlphaFold2-predicted models in the density filtered at 10 Å resolution found CCDC81B as the top hit with good fitting accuracy, thanks to good tertiary structure prediction (Figure S5C, D).

To further confirm that CCDC81B is a MIP found only in a subset of the doublet microtubule, we generated a T. thermophila strain with CCDC81B fused to GFP. Using super-resolution structured illumination microscopy, we observed that the signal of CCDC81B was limited to the 1–1.5 μm proximal region of the cilia (Fig. 6D, E). This observation explains the substoichiometric fraction of the particles containing CCDC81B densities in the single-particle cryo-EM data (Figure S5). As a result, we confirm that CCDC81B is a MIP that localizes to the proximal region of the cilia.

Discussion

In this paper, we demonstrated that it is possible to identify compact domains of proteins in maps at different resolutions using the DomainFit program. We benchmarked DomainFit using known structures and successfully identified CCDC81B and BMIP1 as two new proteins in the doublet microtubule of T. thermophila. Notably, we showed that CCDC81B only exists in the proximal region of the cilium using super-resolution structured illumination microscopy of T. thermophila cells expressing CCDC81B fused to GFP.

However, in the absence of complementary data, the domains identified by DomainFit cannot be assigned with as much confidence as with high-resolution cryo-EM data where side-chain information is available. We showed that the P value serves as a robust indicator for domain identification. When the fold of the domain is unique, the identification works excellently. When the fold is not unique, DomainFit still finds the correct fold. In this case, homologs or isoforms may actually have better P values than the correct protein. Tools like Foldseek41 can then be used to list the available proteins with the same fold. With other complementary information such as chemical crosslinking mass spectrometry, BioID, and stoichiometry of proteins from quantitative mass spectrometry, it is possible to narrow down and identify the right proteins, as demonstrated by the identification of CCDC81B in this work and by other integrated structural biology approaches42. There are certain cases in which the AlphaFold2-predicted model is not similar to the protein fold in the map, leading to a misidentification. Therefore, exact protein identification must be done conservatively when no additional concrete evidence is available.

Despite being flexible and more automated than approaches proposed in the literature, DomainFit is not fully automated because segmentation is manual. Segmentation matters for successful identification, as it can influence the number of initial search positions needed and the P value. Incorporation of automatic segmentation methods such as Segger43 and MAINMASTSeg44 might improve the automation of the program and the identification success. In addition, there are other fitting approaches with better speed and accuracy such as Vesper30 and others (reviewed in Alnabati and Kihara45). Implementation of these fitting approaches in a similar workflow to DomainFit might improve fitting and, hence, successful identification.

The validation of DomainFit using known MIPs highlights a caveat of the workflow in the partitioning of domains. Breaking proteins into compact domains will never be perfect, since over-partitioning produces smaller domains and therefore more false positive fits, while under-partitioning produces bigger domains that require a high number of initial search positions. The future development of domain partitioning with better customization may improve the usability of DomainFit. Despite that caveat, our work shows that domain fitting is in general more accurate than complete model fitting since AlphaFold2 domain prediction is more accurate than AlphaFold2 tertiary and quaternary structure prediction.

Another point to consider for a successful identification is the quality of the map, including proper segmentation and uniform, isotropic resolution. DomainFit works well between 4 and 6 Å resolution where detailed secondary structural features can be visualized. It is recommended to use appropriate post-processing methods to improve the interpretability of the map such as DeepEMhancer46, LocSpiral47 or Density Modification48. These post-processing approaches improve map isotropy and uniformity, which facilitates better segmentation and fitting.

At the lower resolutions of 8–10 Å, DomainFit still works but requires complementary data for validation. For lower resolution, perhaps using domains with bigger partitions or even the entire AlphaFold2-predicted model might help with the fitting and identification if the tertiary structure is predicted correctly. With the flexibility of DomainFit, both approaches should be run to evaluate the fitting visually.

In addition, it is possible to use DomainFit to construct the structures of complexes of known composition. A database of domains from all the known components can be fitted into different densities of the map. This allows a more unbiased way of building up the structure of a large complex. This approach is similar to Assembline31 and the Integrative Modeling Platform49, but uses only models and maps without considering other experimental constraints.

In conclusion, we presented here the DomainFit program, which allows the unbiased fitting of domains of AI-predicted models into the cryo-EM map. At resolutions better than 6 Å, the program can be used reliably as a tool to identify compact proteins in the map.

STAR Methods

RESOURCE AVAILABILITY

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Khanh Huy Bui (huy.bui@mcgill.ca).

Materials availability

All unique/stable reagents generated in this study will be made available on request, but we may require a payment and/or a completed materials transfer agreement if there is potential for commercial application.

Data and code availability

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Antibodies
mouse monoclonal anti-polyglycylated tubulin AXO49 | Bre et al.61 | N/A
polyclonal anti-GFP | Rockland | Cat#600–401–215
goat-anti-mouse IgG-FITC | Jackson Immunoresearch | Cat#115–095–146
goat-anti-rabbit-Cy3 | Jackson Immunoresearch | Cat#111–165–003
Deposited data
PDB model of the identified proximal proteins from the 48-nm T. thermophila K40R doublet microtubule | This study | 8VL3
The map from the proximal 48-nm T. thermophila K40R doublet microtubule | This study | EMD-42949
PDB model of the 48-nm T. thermophila WT doublet microtubule | Kubo et al.21 | 8G2Z
48-nm T. thermophila K40R doublet microtubule | Kubo et al.21 | EMD-29692
48-nm T. thermophila WT doublet microtubule | Kubo et al.21 | EMD-29685
Subpellicular microtubule map from T. gondii | Sun et al.36 | EMD-26019
Subpellicular microtubule PDB model from T. gondii | Sun et al.36 | 7TNS
Radial spoke map from C. reinhardtii | Gui et al.34 | EMD-22475
Radial spoke map from C. reinhardtii | Gui et al.34 | EMD-22481
Radial spoke PDB model from C. reinhardtii | Gui et al.34 | 7JTK
Radial spoke PDB model from C. reinhardtii | Gui et al.34 | 7JU4
Experimental models: Organisms/strains
Tetrahymena thermophila: Strain CCDC81B-GFP | This study | N/A
Oligonucleotides
Primer 5F: ctatagggcgaattggagctttgtgaaatagatggaagag | This study | N/A
Primer 5R: atcaagcttgccatccgcggacttgtgaatttttaaagagat | This study | N/A
Primer 3F: gcttatcgataccgtcgaccatcaattatttcaaagtattaa | This study | N/A
Primer 3R: agggaacaaaagctgggtacgcattatccaaaatatattctaa | This study | N/A
Software and algorithms
AlphaFold | Jumper et al.23 | https://github.com/google-deepmind/alphafold
ColabFold | Mirdita et al.24 | https://github.com/sokrypton/ColabFold
Phenix | Afonine et al.57 | https://phenix-online.org/
UCSF ChimeraX | Pettersen et al.29 | https://www.cgl.ucsf.edu/chimerax/
R | R Core Team28 | https://www.r-project.org/
Coot | Emsley et al.56 | https://www2.mrc-lmb.cam.ac.uk/personal/pemsley/coot/
Cryosparc | Punjani et al.55 | https://cryosparc.com/
FIJI | Schindelin et al.62 | https://imagej.net/software/fiji/
DomainFit | This study | https://github.com/builab/DomainFit
DeepEMhancer | Sanchez-Garcia et al.46 | https://github.com/rsanchezgarc/deepEMhancer
XMAS | Lagerwaard et al.58 | https://github.com/ScheltemaLab/ChimeraX_XMAS_bundle

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS

Tetrahymena thermophila strain CCDC81B-GFP was grown in SPP media in a shaker incubator at 30°C and 120 rpm.

METHOD DETAILS

Cryo-EM density map segmentation

To segment the density of interest for DomainFit from the cryo-EM map, we manually placed markers onto the density of interest in ChimeraX29. We then colored the density around the markers and used the volume splitbyzone function of ChimeraX to segment the density out and save it. Next, we used the same volume rendered at a lower threshold as the surface mask to trim the volume size using the volume mask function. The masked volume therefore contains only the segmented density and is significantly smaller than the original volume.

To prepare for the fitting of domains, a compact density, i.e. either a globular density or a coiled-coil bundle, should be segmented from the cryo-EM map. Ideally, the segmented density should be equivalent to a domain of 50 to 300 amino acids in size. This serves two purposes: (i) to reduce computational time, and (ii) to avoid wrongly predicted tertiary conformations caused by long flexible loops between different domains.

Domain parsing and saving

We used phenix.process_predicted_model program from the Phenix package27 for the partitioning of the PDB into domains because of its flexibility. phenix.process_predicted_model can function based on PAE but also based on the 3D arrangement of the proteins without PAE information. Therefore, we implemented a Python wrapper process_predicted_models.py to partition PDBs into domains in batch. Users can customize the options for parsing based on their needs.

For domain parsing, we used the option split_model_by_compact_regions=True from phenix.process_predicted_model. In addition, we set the option maximum_domains for each protein to the protein size in amino acids divided by 100.
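The maximum_domains rule can be sketched in a few lines of Python. This is an illustration only: the function names, the rounding up, the minimum of one domain, and passing the model file as a positional argument are assumptions, since the text only states that maximum_domains is the protein length divided by 100 amino acids.

```python
import math


def max_domains_for_length(n_residues, residues_per_domain=100):
    """Return a maximum_domains value scaled to the protein length.

    Assumptions (not stated in the text): the quotient is rounded up,
    and at least one domain is always allowed.
    """
    if n_residues <= 0:
        raise ValueError("n_residues must be a positive integer")
    return max(1, math.ceil(n_residues / residues_per_domain))


def build_phenix_args(pdb_path, n_residues):
    """Assemble an argument list for phenix.process_predicted_model.

    Only the two options named in the text are set. Passing the model
    file as a positional argument is an assumption to check against
    the Phenix documentation.
    """
    return [
        "phenix.process_predicted_model",
        pdb_path,
        "split_model_by_compact_regions=True",
        f"maximum_domains={max_domains_for_length(n_residues)}",
    ]


if __name__ == "__main__":
    # Hypothetical 250-residue model: up to 3 domains are allowed.
    print(build_phenix_args("example_model.pdb", 250))
```

Scaling the limit with protein length keeps short proteins from being over-partitioned while still allowing long, multi-domain proteins to be split.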

The script process_predicted_models.py writes out the information of the predicted domains using the format employed by DPAM50. As a result, domain information data can be used from either phenix.process_predicted_model or existing domain data of certain organisms such as human, mouse, and zebrafish from DPAM50 to generate the PDBs of individual domains. In the case of a small number of target proteins, the domain information files can be edited manually to generate customized domain parsing.

We imposed a filter on the minimum and maximum domain sizes for fitting. At the lower end, a minimum size of 40 amino acids reduces the number of domains used without affecting the results, since domains under 40 amino acids tend to be either wrongly partitioned or not compact. At the upper end, we recommend using a large value such as 1,000 amino acids as a fail-safe in case domain parsing does not work properly.
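The size filter amounts to a simple range check on the residue count of each domain. The sketch below assumes that domain lengths are already available in a dictionary; the function name, the domain identifiers, and the inclusive upper bound are hypothetical choices made for illustration.

```python
def filter_domains_by_size(domain_lengths, min_len=40, max_len=1000):
    """Keep domains whose residue count lies within [min_len, max_len].

    domain_lengths maps a domain identifier to its number of residues.
    Treating both bounds as inclusive is an assumption.
    """
    return {
        name: n_res
        for name, n_res in domain_lengths.items()
        if min_len <= n_res <= max_len
    }


if __name__ == "__main__":
    # Hypothetical domains: one too small, one unparsed giant, two usable.
    example = {"protA_D1": 35, "protA_D2": 120, "protB_D1": 1500, "protC_D1": 40}
    print(filter_domains_by_size(example))
```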

Domain fitting

Using the ChimeraX fitmap command, the program fits each domain in the density based on a fixed number of randomly generated initial search positions within the density (Fig. 2A). The number of initial search placements is essential to identify the correct position and to generate enough unique fits for the P value calculation. If the density is big, more initial search placements are needed to cover the grid points and random orientations. If the model is big and, especially, not compact, more initial search placements are also needed since the center of rotation imposed by ChimeraX is the center of the bounding box covering the model, which can be far from the center of rotation of the correct domain. Although the number of initial search placements is a crucial parameter, a simple choice is to use 1,000 initial search placements as the default for a comprehensive search with large models/domains or segmented maps. However, understanding the minimum number of initial placements is important if users want to run DomainFit on modest hardware. In practice, we often start with 200–400 initial search placements to speed up computation. If no fit with a significant P value was found, we ran DomainFit again with 1,000 initial search placements.
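The two-pass strategy described above (a quick search first, then a more exhaustive one only if needed) can be sketched as follows. This is not DomainFit code: run_fit is a hypothetical callable standing in for one full fitting run that returns the best P value, and the 0.05 significance threshold is an assumption, since the text does not state a cut-off.

```python
def search_with_escalation(run_fit, placements=(200, 1000), p_threshold=0.05):
    """Run a fit with increasing numbers of initial search placements.

    run_fit(n) is a hypothetical callable that performs one fitting run
    with n initial placements and returns the best P value found.
    Returns (n_placements, best_p) for the lowest P value seen, stopping
    as soon as a run falls below p_threshold (assumed cut-off).
    """
    best = None
    for n in placements:
        p = run_fit(n)
        if best is None or p < best[1]:
            best = (n, p)
        if p < p_threshold:
            break  # significant hit found; skip the more expensive runs
    if best is None:
        raise ValueError("placements must contain at least one value")
    return best


if __name__ == "__main__":
    # Toy stand-in: the correct pose is only found with many placements.
    def fake_fit(n):
        return 0.3 if n < 1000 else 1e-6

    print(search_with_escalation(fake_fit))
```

Starting with fewer placements keeps the common case cheap, at the cost of a second run for large or non-compact domains.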

DomainFit analysis of MIP proteins

We segmented 24 densities from 13 identified MIPs in the cryo-EM map corresponding to different domains of MIPs (Table S1).

We used DomainFit to download a database comprising 856 AlphaFold2-predicted models found in the mass spectrometry data of the WT doublet microtubules (994 detected proteins)21. 138 proteins were not available from the AlphaFold DB because of a lack of predictions due to their high molecular weight. We ignored them for this test since the 856 AlphaFold2-predicted models covered all the tested MIPs.

We divided the 856 AlphaFold2-predicted models into 2819 domains. Upon removing domains of less than 40 amino acids, 1561 domains remained (Fig. 3C). These 1561 domains were fitted into each density filtered at 4 Å using 200 initial search position placements (Table S3).

DomainFit analysis of radial spoke proteins

Radial spoke maps (EMD-22475 and EMD-22481) were downloaded as well as the respective models (PDB 7JTK and 7JU4)34. The selected densities were segmented manually in ChimeraX based on the ModelAngelo publication. The density corresponding to both RSP25 and RSP26 was segmented as one density. The density corresponding to both domains of RSP24 was segmented as a separate density.

2736 proteins from the Chlamydomonas Flagella Proteome Project (V5) were identified35. The 2179 available AlphaFold2 predicted structures were downloaded from AlphaFold DB. The structures of RSP24 (Cre08.g800895_4532.1) and RSP26 (Cre17.g802036_4532.1) were predicted using ColabFold since they were not available in the AlphaFold DB. A total of 2557 domains were fitted into each density (Table S3).

DomainFit analysis of proteins in the lumen of subpellicular microtubules from T. gondii

The map of the subpellicular microtubule (EMD-26019) and associated model (PDB 7TNS) were downloaded36. The densities around chains u and w corresponding to TrxL1 and TrxL2 respectively were segmented in ChimeraX. 94 proteins were identified from the proteomic characterization of the subpellicular cytoskeleton of T. gondii37. Out of these proteins, 56 had a UniProt ID and their structures were downloaded. A total of 75 domains were fitted into the density (Table S3).

Cryo-EM and image analysis of the doublet microtubule

The single particle cryo-EM data of the 48-nm repeating unit of the doublet microtubule from T. thermophila were originally published by Kubo et al.21.

To obtain a density map with a resolution better than the 12 Å resolution of the subtomogram average, we constructed a focused classification mask at the region of the density based on the EMD-24376 subtomogram average map and performed 3D classification without alignment into four classes on the 48-nm particle cryo-EM dataset of the T. thermophila K40R doublet microtubule21 using Cryosparc55 (Figure S4A). After classification, class 1 contained 45,618 particles showing the same density as the subtomogram average (Fig. 5C). This suggests that the density feature observed by subtomogram averaging is not uniformly located in the cilia. We further refined class 1 in Cryosparc with focused local refinement using a refinement mask slightly larger than the classification mask, resulting in a cryo-EM map of 4.3 Å resolution (Fig. 5C, Figure S4A, B). The details visible in the density map suggest a resolution of about 5 Å.

The final map was post-processed using DeepEMhancer46.

All visualization of maps and models was done in ChimeraX29.

Domain search of the unknown MIP density

The unknown density from the map above was segmented into 12 smaller densities (Fig. 5D). To improve the accuracy of the search for the unknown MIP density, we used the ciliome of the salt-treated doublet microtubules containing 166 proteins3. The reason is that the cryo-EM map of salt-treated doublet microtubules retains most of the MIPs and the unknown density. The salt-treated ciliome of 166 proteins was divided into 334 domains of at least 40 amino acids. We ran DomainFit with these 334 domain models for the above 12 densities (Table S3).

Entire model search for the unknown MIP density

For a quick comparison of DomainFit with complete AlphaFold2 model fitting, we fitted 166 complete AlphaFold2-predicted models into the full segmented density (Figure S5). The top hits were IJ34 and CCDC81B. Both proteins contain a CCDC81 domain (Figure S5). BMIP1 ranked only 22nd by P value.

Modelling

For modelling of BMIP1 (UniProt ID I7MB72, TTHERM_00525130), we started with the AlphaFold2-predicted model of the globular domain of BMIP1 fitted in its density. We fixed the model manually in Coot56 and modelled some extra regions of the protein. The final model was then real-space refined in Phenix57.

For the modelling of CCDC81B (UniProt ID I7M688, TTHERM_00649260), the best fits of the different domains of CCDC81B were found using DomainFit. We joined all the domains together using Coot56. The model was then manually adjusted in Coot and refined in Phenix57 (Table 1).

Crosslinking mass spectrometry visualization

Crosslinks to BMIP1 and CCDC81B were obtained from a chemical crosslinking mass spectrometry dataset of T. thermophila cilia (reported in Table S1 of McCafferty et al.39). The crosslinks were visualized in ChimeraX using the XMAS bundle58.

Cell culture and gene editing

All T. thermophila strains used in this study were grown in SPP media59 in a shaker incubator at 30°C and 120 rpm.

The CCDC81B gene (TTHERM_00649260) was edited by homologous DNA recombination using a targeting plasmid carrying the neo4 selectable marker. The portions of TTHERM_00649260 required for gene targeting were amplified and cloned into the pNeo24-GFP plasmid60: primers 5F (5’-ctatagggcgaattggagctttgtgaaatagatggaagag-3’) and 5R (5’-atcaagcttgccatccgcggacttgtgaatttttaaagagat-3’) amplified a terminal portion of the coding region, and primers 3F (5’-gcttatcgataccgtcgaccatcaattatttcaaagtattaa-3’) and 3R (5’-agggaacaaaagctgggtacgcattatccaaaatatattctaa-3’) amplified a portion of the 3’ UTR. The resulting edited TTHERM_00649260 fragment was targeted to the native locus using biolistic bombardment of T. thermophila cells and paromomycin selection.

Immunofluorescence

For immunofluorescence, T. thermophila cells were fixed and stained as described61. The primary antibodies used were the mouse monoclonal anti-polyglycylated tubulin AXO49 (diluted 1:200)61 and polyclonal anti-GFP antibodies (Rockland, 1:800). The secondary antibodies used were goat-anti-mouse IgG-FITC and goat-anti-rabbit-Cy3 antibodies (Jackson Immunoresearch). SR-SIM imaging was conducted on an ELYRA S1 microscope equipped with a 63× NA 1.4 Oil Plan-Apochromat DIC objective. The optical slices were analyzed by Fiji/ImageJ (Z project tool)62.

QUANTIFICATION AND STATISTICAL ANALYSIS

P value calculation

The P value calculation for the fitting of each domain has been implemented in R28 in the integrative modelling software Assembline31 and used for the identification of the nuclear pore complex scaffold components51–53 and the integrative modelling of the Elongator complex54. The R script was modified for use within DomainFit.

In brief, for each domain, ChimeraX clusters the results into a number of unique fits with distinct rotational and translational parameters and CC coefficients. As a result, many searches with different initial rotational and translational placements that converge to a similar final fitting orientation are clustered into one unique fit. The CC coefficients of all unique fits are transformed to z-scores using the Fisher z-transform and centered to yield an approximately normal distribution (Fig. 2F).

Then, the two-sided P value and the Benjamini-Hochberg adjusted P value are calculated from the Fisher z-scores using the false discovery rate (fdr) package in R. The best fit for each domain, which corresponds to the highest CC coefficient and the lowest P value, is recorded in an aggregated list. Once this process is done for all the domains, the list is sorted by P value. When the correct domain is evaluated, the Fisher z-transformed score of the best fit is clearly discriminated from those of the rest of the unique fits, while the difference is not as clear when a wrong domain is evaluated (Fig. 2B, C). That translates to a lower P value for the best fit of the correct domain.
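The statistical steps above can be illustrated with a minimal Python sketch. It is not the R script used by DomainFit: the function names are hypothetical, and the null distribution is simplified to a normal distribution whose spread is the sample standard deviation of the Fisher z-scores, which may differ from the null model fitted by the R package.

```python
import math
import statistics


def two_sided_p_values(cc_values):
    """Fisher z-transform, center, and convert to two-sided P values.

    Assumption: scores are scaled by their sample standard deviation and
    compared with a standard normal distribution (a simplification).
    """
    if len(cc_values) < 2:
        raise ValueError("at least two unique fits are required")
    z = [math.atanh(cc) for cc in cc_values]  # Fisher z-transform
    mean = statistics.fmean(z)
    sd = statistics.stdev(z)
    if sd == 0:
        raise ValueError("all scores are identical")
    # Two-sided normal tail probability: P = erfc(|z| / sqrt(2))
    return [math.erfc(abs((v - mean) / sd) / math.sqrt(2)) for v in z]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical unique fits: one clear outlier among similar scores.
    ccs = [0.90, 0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.47, 0.53, 0.50]
    p = two_sided_p_values(ccs)
    print(min(p), benjamini_hochberg(p)[0])
```

In this toy example the fit with CC = 0.90 stands apart from the other unique fits and therefore receives the lowest P value, mirroring the behaviour described for a correctly identified domain.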

In cases where domains have a common fold, multiple domains returned a P value of 2.22 × 10⁻¹⁶ (the numerical limit in R). Thus, we sorted the fitting solutions first by their P values, and then by their CC coefficients. We found that the Benjamini-Hochberg adjusted P value behaves similarly to the P value.
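The tie-breaking rule can be expressed as a two-key sort: ascending P value first, then descending CC coefficient. The record layout and the domain names below are hypothetical and serve only to illustrate the ordering.

```python
def rank_best_fits(best_fits):
    """Sort (domain, p_value, cc) records by P value, then by CC.

    Lower P values come first; ties in P value are broken by the
    higher CC coefficient.
    """
    return sorted(best_fits, key=lambda rec: (rec[1], -rec[2]))


if __name__ == "__main__":
    # Hypothetical best fits; two domains are tied at the numerical floor.
    fits = [
        ("protA_D1", 2.22e-16, 0.71),
        ("protB_D2", 1.0e-4, 0.80),
        ("protC_D1", 2.22e-16, 0.78),
    ]
    for record in rank_best_fits(fits):
        print(record)
```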

Refinement statistics of the proximal MIP models

The proximal MIP model was manually adjusted and refined in Phenix57. The refinement statistics are shown in Table 1.

Supplementary Material

Highlights.

  • Easy-to-use program to automatically fit protein domains into cryo-EM/ET maps

  • Testing shows DomainFit performs best in 3–8 Å resolution range

  • New ciliary protein discovered with DomainFit

Acknowledgements

We thank Drs. Mike Strauss, Muneyoshi Ichikawa and Corbin Black for critically reading the manuscript. KHB is supported by grants from the Canadian Institutes of Health Research (PJT-190195) and the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-04774). JG is supported by NIH grant R01GM135444.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Interests

The authors declare no competing interest.

References

  • 1. Ma J, You X, Sun S, Wang X, Qin S, and Sui SF (2020). Structural basis of energy transfer in Porphyridium purpureum phycobilisome. Nature 579, 146–151. 10.1038/s41586-020-2020-7.
  • 2. Muhleip A, Flygaard RK, Baradaran R, Haapanen O, Gruhl T, Tobiasson V, Marechal A, Sharma V, and Amunts A (2023). Structural basis of mitochondrial membrane bending by the I-II-III2-IV2 supercomplex. Nature 615, 934–938. 10.1038/s41586-023-05817-y.
  • 3. Ichikawa M, Khalifa AAZ, Kubo S, Dai D, Basu K, Maghrebi MAF, Vargas J, and Bui KH (2019). Tubulin lattice in cilia is in a stressed form regulated by microtubule inner proteins. Proc Natl Acad Sci U S A 116, 19930–19938. 10.1073/pnas.1911119116.
  • 4. Ma M, Stoyanova M, Rademacher G, Dutcher SK, Brown A, and Zhang R (2019). Structure of the Decorated Ciliary Doublet Microtubule. Cell 179, 909–922 e912. 10.1016/j.cell.2019.09.030.
  • 5. Berger C, Premaraj N, Ravelli RBG, Knoops K, Lopez-Iglesias C, and Peters PJ (2023). Cryo-electron tomography on focused ion beam lamellae transforms structural cell biology. Nat Methods 20, 499–511. 10.1038/s41592-023-01783-5.
  • 6. Castano-Diez D, Kudryashev M, Arheit M, and Stahlberg H (2012). Dynamo: a flexible, user-friendly development tool for subtomogram averaging of cryo-EM data in high-performance computing environments. J Struct Biol 178, 139–151. 10.1016/j.jsb.2011.12.017.
  • 7. Heumann JM, Hoenger A, and Mastronarde DN (2011). Clustering and variance maps for cryo-electron tomography using wedge-masked differences. J Struct Biol 175, 288–299. 10.1016/j.jsb.2011.05.011.
  • 8. Chen M, Bell JM, Shi X, Sun SY, Wang Z, and Ludtke SJ (2019). A complete data processing workflow for cryo-ET and subtomogram averaging. Nat Methods 16, 1161–1168. 10.1038/s41592-019-0591-8.
  • 9. Himes BA, and Zhang PJ (2018). emClarity: software for high-resolution cryo-electron tomography and subtomogram averaging. Nature Methods 15, 955–961. 10.1038/s41592-018-0167-z.
  • 10. Hrabe T, Chen Y, Pfeffer S, Cuellar LK, Mangold AV, and Forster F (2012). PyTom: a python-based toolbox for localization of macromolecules in cryo-electron tomograms and subtomogram analysis. J Struct Biol 178, 177–188. 10.1016/j.jsb.2011.12.003.
  • 11. Tegunov D, Xue L, Dienemann C, Cramer P, and Mahamid J (2021). Multi-particle cryo-EM refinement with M visualizes ribosome-antibiotic complex at 3.5 angstrom in cells. Nature Methods 18, 186–193. 10.1038/s41592-020-01054-7.
  • 12. Wan W, Khavnekar S, Wagner J, Erdmann P, and Baumeister W (2020). STOPGAP: A Software Package for Subtomogram Averaging and Refinement. Microscopy and Microanalysis 26, 2516–2516. 10.1017/S143192762002187X.
  • 13. Zivanov J, Oton J, Ke Z, von Kugelgen A, Pyle E, Qu K, Morado D, Castano-Diez D, Zanetti G, Bharat TAM, et al. (2022). A Bayesian approach to single-particle electron cryo-tomography in RELION-4.0. Elife 11, e83724. 10.7554/eLife.83724.
  • 14. Schur FKM, Obr M, Hagen WJH, Wan W, Jakobi AJ, Kirkpatrick JM, Sachse C, Krausslich HG, and Briggs JAG (2016). An atomic model of HIV-1 capsid-SP1 reveals structures regulating assembly and maturation. Science 353, 506–508. 10.1126/science.aaf9620.
  • 15. Tai L, Yin G, Huang X, Zhu Y, and Sun F (2023). In-cell structural insight into the stability of sperm microtubule doublet. Cell Discov 9, 116. 10.1038/s41421-023-00606-3.
  • 16. Xing HP, Taniguchi R, Khusainov I, Kreysing JP, Welsch S, Turonova B, and Beck M (2023). Translation dynamics in human cells visualized at high resolution reveal cancer drug action. Science 381, 70–75. 10.1126/science.adh1411.
  • 17. Ho CM, Li X, Lai M, Terwilliger TC, Beck JR, Wohlschlegel J, Goldberg DE, Fitzpatrick AWP, and Zhou ZH (2020). Bottom-up structural proteomics: cryoEM of protein complexes enriched from the cellular milieu. Nat Methods 17, 79–85. 10.1038/s41592-019-0637-y.
  • 18. Pfab J, Phan NM, and Si D (2021). DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes. Proc Natl Acad Sci U S A 118, e2017525118. 10.1073/pnas.2017525118.
  • 19. Chojnowski G, Simpkin AJ, Leonardo DA, Seifert-Davila W, Vivas-Ruiz DE, Keegan RM, and Rigden DJ (2022). findMySequence: a neural-network-based approach for identification of unknown proteins in X-ray crystallography and cryo-EM. IUCrJ 9, 86–97. 10.1107/S2052252521011088.
  • 20. Jamali K, Kall L, Zhang R, Brown A, Kimanius D, and Scheres SHW (2024). Automated model building and protein identification in cryo-EM maps. Nature. 10.1038/s41586-024-07215-4.
  • 21. Kubo S, Black CS, Joachimiak E, Yang SK, Legal T, Peri K, Khalifa AAZ, Ghanaeian A, McCafferty CL, Valente-Paterno M, et al. (2023). Native doublet microtubules from Tetrahymena thermophila reveal the importance of outer junction proteins. Nat Commun 14, 2168.
  • 22. Leung MR, Zeng JW, Wang XL, Roelofs MC, Huang W, Chiozzi RZ, Hevler JF, Heck AJR, Dutcher SK, Brown A, et al. (2023). Structural specializations of the sperm tail. Cell 186, 2880–2896. 10.1016/j.cell.2023.05.026.
  • 23. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. 10.1038/s41586-021-03819-2.
  • 24. Mirdita M, Schutze K, Moriwaki Y, Heo L, Ovchinnikov S, and Steinegger M (2022). ColabFold - Making protein folding accessible to all. Nat Methods 19, 679–682.
  • 25. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876. 10.1126/science.abj8754.
  • 26. Chen Z, Shiozaki M, Haas KM, Skinner WM, Zhao S, Guo C, Polacco BJ, Yu Z, Krogan NJ, Lishko PV, et al. (2023). De novo protein identification in mammalian sperm using in situ cryoelectron tomography and AlphaFold2 docking. Cell 186, 5041–5053. 10.1016/j.cell.2023.09.017.
  • 27. Oeffner RD, Croll TI, Millan C, Poon BK, Schlicksup CJ, Read RJ, and Terwilliger TC (2022). Putting AlphaFold models to work with phenix.process_predicted_model and ISOLDE. Acta Crystallographica Section D-Structural Biology 78, 1303–1314. 10.1107/S2059798322010026.
  • 28. R Core Team (2021). R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria).
  • 29. Pettersen EF, Goddard TD, Huang CC, Meng EC, Couch GS, Croll TI, Morris JH, and Ferrin TE (2021). UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci 30, 70–82. 10.1002/pro.3943.
  • 30. Han X, Terashi G, Christoffer C, Chen S, and Kihara D (2021). VESPER: global and local cryo-EM map alignment using local density vectors. Nat Commun 12, 2090. 10.1038/s41467-021-22401-y.
  • 31. Rantos V, Karius K, and Kosinski J (2022). Integrative structural modeling of macromolecular complexes using Assembline. Nat Protoc 17, 152–176. 10.1038/s41596-021-00640-z.
  • 32. Khalifa AAZ, Ichikawa M, Dai D, Kubo S, Black CS, Peri K, McAlear TS, Veyron S, Yang SK, Vargas J, et al. (2020). The inner junction complex of the cilia is an interaction hub that involves tubulin post-translational modifications. Elife 9, e52760. 10.7554/eLife.52760.
  • 33. Terwilliger TC, Liebschner D, Croll TI, Williams CJ, McCoy AJ, Poon BK, Afonine PV, Oeffner RD, Richardson JS, Read RJ, and Adams PD (2023). AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination. Nat Methods. 10.1038/s41592-023-02087-4.
  • 34. Gui M, Ma M, Sze-Tu E, Wang X, Koh F, Zhong ED, Berger B, Davis JH, Dutcher SK, Zhang R, and Brown A (2021). Structures of radial spokes and associated complexes important for ciliary motility. Nat Struct Mol Biol 28, 29–37. 10.1038/s41594-020-00530-0.
  • 35. Pazour GJ, Agrin N, Leszyk J, and Witman GB (2005). Proteomic analysis of a eukaryotic cilium. J Cell Biol 170, 103–113. 10.1083/jcb.200504008.
  • 36. Sun SY, Segev-Zarko LA, Chen M, Pintilie GD, Schmid MF, Ludtke SJ, Boothroyd JC, and Chiu W (2022). Cryo-ET of Toxoplasma parasites gives subnanometer insight into tubulin-based structures. Proc Natl Acad Sci U S A 119. 10.1073/pnas.2111661119.
  • 37. Gomez de Leon CT, Diaz Martin RD, Mendoza Hernandez G, Gonzalez Pozos S, Ambrosio JR, and Mondragon Flores R (2014). Proteomic characterization of the subpellicular cytoskeleton of Toxoplasma gondii tachyzoites. J Proteomics 111, 86–99. 10.1016/j.jprot.2014.03.008.
  • 38. Li S, Fernandez JJ, Fabritius AS, Agard DA, and Winey M (2022). Electron cryo-tomography structure of axonemal doublet microtubule from Tetrahymena thermophila. Life Sci Alliance 5, e202101225. 10.26508/lsa.202101225.
  • 39. McCafferty CL, Papoulas O, Lee C, Bui KH, Taylor DW, Marcotte EM, and Wallingford JB (2023). An amino acid-resolution interactome for motile cilia illuminates the structure and function of ciliopathy protein complexes. bioRxiv, 2023.07.09.548259. 10.1101/2023.07.09.548259.
  • 40. Legal T, Parra M, Tong M, Black CS, Joachimiak E, Valente-Paterno M, Lechtreck K, Gaertig J, and Bui KH (2023). CEP104/FAP256 and associated cap complex maintain stability of the ciliary tip. J Cell Biol 222. 10.1083/jcb.202301129.
  • 41. van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Soding J, and Steinegger M (2023). Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 10.1038/s41587-023-01773-0.
  • 42. Ghanaeian A, Majhi S, McCaffrey CL, Nami B, Black CS, Yang SK, Legal T, Papoulas O, Janowska M, Valente-Paterno M, et al. (2023). Integrated modeling of the Nexin-dynein regulatory complex reveals its regulatory mechanism. Nat Commun 14, 5741. 10.1038/s41467-023-41480-7.
  • 43. Pintilie GD, Zhang J, Goddard TD, Chiu W, and Gossard DC (2010). Quantitative analysis of cryo-EM density map segmentation by watershed and scale-space filtering, and fitting of structures by alignment to regions. J Struct Biol 170, 427–438. 10.1016/j.jsb.2010.03.007.
  • 44. Terashi G, Kagaya Y, and Kihara D (2020). MAINMASTseg: Automated Map Segmentation Method for Cryo-EM Density Maps with Symmetry. J Chem Inf Model 60, 2634–2643. 10.1021/acs.jcim.9b01110.
  • 45. Alnabati E, and Kihara D (2019). Advances in Structure Modeling Methods for Cryo-Electron Microscopy Maps. Molecules 25. 10.3390/molecules25010082.
  • 46. Sanchez-Garcia R, Gomez-Blanco J, Cuervo A, Carazo JM, Sorzano COS, and Vargas J (2021). DeepEMhancer: a deep learning solution for cryo-EM volume post-processing. Commun Biol 4, 874. 10.1038/s42003-021-02399-1.
  • 47. Kaur S, Gomez-Blanco J, Khalifa AAZ, Adinarayanan S, Sanchez-Garcia R, Wrapp D, McLellan JS, Bui KH, and Vargas J (2021). Local computational methods to improve the interpretability and analysis of cryo-EM maps. Nat Commun 12, 1240. 10.1038/s41467-021-21509-5.
  • 48. Terwilliger TC, Ludtke SJ, Read RJ, Adams PD, and Afonine PV (2020). Improvement of cryo-EM maps by density modification. Nature Methods 17, 923–927. 10.1038/s41592-020-0914-9.
  • 49. Russel D, Lasker K, Webb B, Velazquez-Muriel J, Tjioe E, Schneidman-Duhovny D, Peterson B, and Sali A (2012). Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol 10, e1001244. 10.1371/journal.pbio.1001244.
  • 50. Zhang J, Schaeffer RD, Durham J, Cong Q, and Grishin NV (2023). DPAM: A domain parser for AlphaFold models. Protein Science 32, e4548. 10.1002/pro.4548.
  • 51. Kosinski J, Mosalaganti S, von Appen A, Teimer R, DiGuilio AL, Wan W, Bui KH, Hagen WJ, Briggs JA, Glavy JS, et al. (2016). Molecular architecture of the inner ring scaffold of the human nuclear pore complex. Science 352, 363–365. 10.1126/science.aaf0643.
  • 52. Bui KH, von Appen A, DiGuilio AL, Ori A, Sparks L, Mackmull MT, Bock T, Hagen W, Andres-Pons A, Glavy JS, and Beck M (2013). Integrated structural analysis of the human nuclear pore complex scaffold. Cell 155, 1233–1243. 10.1016/j.cell.2013.10.055.
  • 53. von Appen A, Kosinski J, Sparks L, Ori A, DiGuilio AL, Vollmer B, Mackmull MT, Banterle N, Parca L, Kastritis P, et al. (2015). In situ structural analysis of the human nuclear pore complex. Nature 526, 140–143. 10.1038/nature15381.
  • 54. Dauden MI, Kosinski J, Kolaj-Robin O, Desfosses A, Ori A, Faux C, Hoffmann NA, Onuma OF, Breunig KD, Beck M, et al. (2017). Architecture of the yeast Elongator complex. EMBO Rep 18, 264–279. 10.15252/embr.201643353.
  • 55. Punjani A, Rubinstein JL, Fleet DJ, and Brubaker MA (2017). cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat Methods 14, 290–296. 10.1038/nmeth.4169.
  • 56. Emsley P, Lohkamp B, Scott WG, and Cowtan K (2010). Features and development of Coot. Acta Crystallogr D Biol Crystallogr 66, 486–501. 10.1107/S0907444910007493.
  • 57. Afonine PV, Poon BK, Read RJ, Sobolev OV, Terwilliger TC, Urzhumtsev A, and Adams PD (2018). Real-space refinement in PHENIX for cryo-EM and crystallography. Acta Crystallogr D Struct Biol 74, 531–544. 10.1107/S2059798318006551.
  • 58. Lagerwaard IM, Albanese P, Jankevics A, and Scheltema RA (2022). Xlink Mapping and AnalySis (XMAS) - Smooth Integrative Modeling in ChimeraX. bioRxiv, 2022.04.21.489026. 10.1101/2022.04.21.489026.
  • 59. Williams NE, Wolfe J, and Bleyman LK (1980). Long-term maintenance of Tetrahymena spp. J Protozool 27, 327. 10.1111/j.1550-7408.1980.tb04270.x.
  • 60. Gaertig J, Wloga D, Vasudevan KK, Guha M, and Dentler W (2013). Discovery and functional evaluation of ciliary proteins in Tetrahymena thermophila. Methods Enzymol 525, 265–284. 10.1016/B978-0-12-397944-5.00013-4.
  • 61.Bre MH, Redeker V, Quibell M, Darmanaden-Delorme J, Bressac C, Cosson J, Huitorel P, Schmitter JM, Rossler J, Johnson T, et al. (1996). Axonemal tubulin polyglycylation probed with two monoclonal antibodies: widespread evolutionary distribution, appearance during spermatozoan maturation and possible function in motility. J Cell Sci 109 ( Pt 4), 727–738. 10.1242/jcs.109.4.727. [DOI] [PubMed] [Google Scholar]
  • 62.Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, Pietzsch T, Preibisch S, Rueden C, Saalfeld S, Schmid B, et al. (2012). Fiji: an open-source platform for biological-image analysis. Nat Methods 9, 676–682. 10.1038/nmeth.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Data Availability Statement

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Antibodies
mouse monoclonal anti-polyglycylated tubulin AXO49 | Bre et al.61 | N/A
polyclonal anti-GFP | Rockland | Cat#600-401-215
goat-anti-mouse IgG-FITC | Jackson Immunoresearch | Cat#115-095-146
goat-anti-rabbit-Cy3 | Jackson Immunoresearch | Cat#111-165-003

Deposited data
PDB model of the identified proximal proteins from the 48-nm T. thermophila K40R doublet microtubule | This study | 8VL3
The map from the proximal 48-nm T. thermophila K40R doublet microtubule | This study | EMD-42949
PDB model of the 48-nm T. thermophila WT doublet microtubule | Kubo et al.21 | 8G2Z
48-nm T. thermophila K40R doublet microtubule | Kubo et al.21 | EMD-29692
48-nm T. thermophila WT doublet microtubule | Kubo et al.21 | EMD-29685
Subpellicular microtubule map from T. gondii | Sun et al.36 | EMD-26019
Subpellicular microtubule PDB model from T. gondii | Sun et al.36 | 7TNS
Radial spoke map from C. reinhardtii | Gui et al.34 | EMD-22475
Radial spoke map from C. reinhardtii | Gui et al.34 | EMD-22481
Radial spoke PDB model from C. reinhardtii | Gui et al.34 | 7JTK
Radial spoke PDB model from C. reinhardtii | Gui et al.34 | 7JU4

Experimental models: Organisms/strains
Tetrahymena thermophila: Strain CCDC81B-GFP | This study | N/A

Oligonucleotides
Primer 5F: ctatagggcgaattggagctttgtgaaatagatggaagag | This study | N/A
Primer 5R: atcaagcttgccatccgcggacttgtgaatttttaaagagat | This study | N/A
Primer 3F: gcttatcgataccgtcgaccatcaattatttcaaagtattaa | This study | N/A
Primer 3R: agggaacaaaagctgggtacgcattatccaaaatatattctaa | This study | N/A

Software and algorithms
AlphaFold | Jumper et al.23 | https://github.com/google-deepmind/alphafold
ColabFold | Mirdita et al.24 | https://github.com/sokrypton/ColabFold
Phenix | Afonine et al.57 | https://phenix-online.org/
UCSF ChimeraX | Pettersen et al.29 | https://www.cgl.ucsf.edu/chimerax/
R | R Core Team28 | https://www.r-project.org/
Coot | Emsley et al.56 | https://www2.mrc-lmb.cam.ac.uk/personal/pemsley/coot/
cryoSPARC | Punjani et al.55 | https://cryosparc.com/
FIJI | Schindelin et al.62 | https://imagej.net/software/fiji/
DomainFit | This study | https://github.com/builab/DomainFit
DeepEMhancer | Sanchez-Garcia et al.46 | https://github.com/rsanchezgarc/deepEMhancer
XMAS | Lagerwaard et al.58 | https://github.com/ScheltemaLab/ChimeraX_XMAS_bundle

RESOURCES