STAR Protocols. 2024 Jan 5;5(1):102814. doi: 10.1016/j.xpro.2023.102814

Identification of homologous protein models via 3D comparisons using predicted structures

Anyu Pan 1,2, Jieyi Shentu 1,2, Yangfan Zeng 1, Rong Guo 1, Yang Yu 1,3,4
PMCID: PMC10789644  PMID: 38183654

Summary

Recent advances in protein structure prediction enable 3D homology alignment and domain annotation using tertiary structures. Here, we present a protocol to identify homologous structures and annotate protein domains through in silico comparisons using the AlphaFold database. We describe steps for downloading and installing PyMOL software, preparing the query structure, and conducting a 3D homology search. The example provided highlights the application of this protocol in reevaluating an mpox viral protein annotation.

For complete details on the use and execution of this protocol, please refer to Pan et al. (2023).1

Subject areas: Bioinformatics, Protein Biochemistry, Structural Biology


Highlights

  • AI-based method for protein 3D homology searches

  • Protocol for annotating unknown domains using predicted structures

  • In silico method to validate protein domain identity


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

Download and install PyMOL

Timing: 1 h

This protocol is based on the 64-bit Windows 7 or Windows 10 operating system (OS), which is more user-friendly than Linux. Alternatively, all the required software is available in the Ubuntu repositories and can be installed with the “sudo apt install” command.

  • 1.
    Ensure PyMOL runs in a functional Python environment. To install Python and the pmw package:
    • a.
      Download Python from https://www.python.org/downloads/. We recommend using a version later than 3.5 to match the requirements of recent releases of the open-source PyMOL.
    • b.
      Run the Python installer, following the instructions. Ensure "Add Python to environment variables" is selected.
      Note: This allows you to use Python from the Command Prompt or any other directory on your computer.
    • c.
      To install the "pmw" package, open the Command Prompt with the "Run as administrator" option (available from the right-click menu) and enter the following command:
      >python -m pip install pmw
  • 2.
    To install the open-source PyMOL:
    • a.
      Download the open-source PyMOL package from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pymol.
    • b.
      Make sure to select the version of the PyMOL package that matches your Windows architecture (32-bit or 64-bit) and Python version.
      Note: For example, if you have a 64-bit Windows OS with Python 3.7 installed, download "pymol-2.3.0-cp37-cp37m-win_amd64.whl".
    • c.
      To install PyMOL, launch the Command Prompt window and enter the following command:
      >python -m pip install pymol-2.3.0-cp37-cp37m-win_amd64.whl
  • 3.
    Alternatively, PyMOL can be installed using Miniconda or Anaconda.

>conda install -c conda-forge pymol-open-source
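
After either installation route, a quick check from Python can confirm that the modules are visible to the interpreter. The following is a minimal sketch rather than part of the original protocol. It assumes that the two packages are imported under the names `pymol` and `Pmw`, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "pymol" and "Pmw" are the assumed import names of PyMOL and the pmw package.
    missing = missing_modules(["pymol", "Pmw"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("PyMOL and Pmw are importable.")
```

If a module is reported as not importable, the most likely cause is that pip or conda installed it into a different Python environment from the one running the check.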

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Software and algorithms
AlphaFold database | Varadi et al.2 | https://alphafold.ebi.ac.uk/
UniProt database | UniProt3 | https://www.uniprot.org/
Protein Data Bank (PDB) | Berman et al.4 | https://www.rcsb.org/
Expasy Translation Tool | Duvaud et al.5 | https://web.expasy.org/translate/
UniProt Species Proteomes database | UniProt3 | https://www.uniprot.org/proteomes?query=*
ColabFold | Mirdita et al.6 | https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
LocalColabFold | Mirdita et al.6 | https://github.com/YoshitakaMo/localcolabfold
Average B factors tool | N/A | https://swift.cmbi.umcn.nl/servers/html/listavb.html
Dali server | Holm et al.7 | http://ekhidna2.biocenter.helsinki.fi/dali/
FoldSeek | van Kempen et al.8 | https://search.foldseek.com/search
Python | Python | https://www.python.org/downloads/
Open Source PyMOL | Schrödinger, LLC | https://www.lfd.uci.edu/~gohlke/pythonlibs/#pymol
PyMOL | Schrödinger, LLC | https://pymol.org/2/
Miniconda | Anaconda, Inc. | https://docs.conda.io/projects/miniconda/en/latest/index.html
Anaconda | Anaconda, Inc. | https://www.anaconda.com/download

Other
AMD PRO A10-8770 R7 (CPU) | Advanced Micro Devices, Inc. | https://amd.com/en/support/apu/amd-pro-series-processors/amd-pro-series-a10-apu-for-desktops/6th-gen-amd-pro-a10-8770
HMA81GU6CJR8N-VKN0 (UDIMM 8 GB x8 RAM) | SK hynix, Inc. | https://product.skhynix.com/products/dram/module.go

Step-by-step method details

Prepare protein structure files as queries in 3D comparison

Timing: 1 h

  • 1.
    Prepare query models for subsequent 3D homology searches (Figure 1). Use either experimentally solved protein structures or structures predicted de novo with AlphaFold2. Process these protein structures to extract the specific domains or motifs of interest, which serve as search queries in the following steps. To retrieve protein structure files from public databases for analysis:
    • a.
      Navigate to the UniProt homepage (https://www.uniprot.org/) and search for the gene or protein name. Open the main page of the protein from the search results.
    • b.
      Within the protein’s main page, find the "Structure" tab, which provides links to structure files in different databases.
      Note: Solved structures from the Protein Data Bank (PDB) are labeled as "PDB," while predicted structures from the AlphaFold database are labeled as "AlphaFold".
      Note: If there is no structure provided, copy the amino acid sequence of the protein and skip to step 2.
    • c.
      Click the given links to view the structure’s details in the PDB database or the AlphaFold database. Download the structure file in pdb format.
      Note: Solved structures usually only cover a fraction of the full-length protein. Ensure that the existing structures cover the region of interest.
  • 2.
    Predict protein structures de novo when they are not available in databases. Use ColabFold for carrying out these predictions.
    • a.
      Retrieve protein primary sequences from databases such as UniProt (https://www.uniprot.org/) or NCBI (https://www.ncbi.nlm.nih.gov/), as explained in step 1.
    • b.
      Open the Google ColabFold webpage and log in with a Google account.
    • c.
      In the "query_sequence" box, enter the amino acid sequences for structure prediction. Install the necessary dependencies using the default settings. Run the cells on the webpage in their default order by clicking the run button of each cell.
      Note: The key parameters are outlined in Table 1.
    • d.
      The output files may include multiple ranked pdb files, typically named “rank_number_model”. We recommend using the rank 1 model for the subsequent steps.
    • e.
      Store the pdb files sequentially for future use.
      Alternatives: If you have a workstation equipped with GPU, you can also set up a local AlphaFold2 working environment.9 In addition to the online Google ColabFold, a local ColabFold platform can be configured using the LocalColabFold installer (https://github.com/YoshitakaMo/localcolabfold).6
      Note: The pLDDT score (per-residue confidence score) of the predicted model generated by AlphaFold2 and saved in pdb file format is stored as the "B-factor" value. This score will be reviewed in the subsequent steps of the analysis.
  • 3.
    Isolate protein structures as query modules.
    • a.
      Launch PyMOL and load the protein structure file by clicking "File" and selecting "Open" in the PyMOL interface.
    • b.
      Click on the "S" button located in the bottom right corner of the PyMOL interface to display the amino acid sequence.
    • c.
      Select the amino acids you wish to exclude from the query modules, right-click, then choose the 'remove' option from the pop-up “sele” menu to remove the unwanted residues.
    • d.
      When working with an experimentally determined structure, we recommend removing molecules such as water (which appears as “O” in the sequence), ions, and nucleic acids (which appear as “DA”, “DT”, “DC”, or “DG” for DNA, and “A”, “U”, “G”, or “C” for RNA).
    • e.
      The resulting protein structures/modules can be exported as an independent file. Navigate to “File”, then click “Export Molecule”. In the window that opens, select the “PDB options” tab, and tick all the boxes below. Archive the file.

Note: In this protocol, the initial structure submitted for Dali search is called the "query". Structures found by Dali that show structural similarities to the query are termed "subject" structures.

Note: Predicted structures, unlike solved structures, require pLDDT checks. Predicted structures are based on estimations, and the accuracy of each position is not always guaranteed. This accuracy is quantified using pLDDT, which is stored in the “B-factor” field of a pdb file for a predicted structure. See step 6 for the procedure to assess pLDDT.

Pause point: Keep the prepared query structure files readily accessible for use as needed.
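
The isolation described in step 3 can also be scripted for batch work. The sketch below is not the authors' method, which uses the PyMOL interface. It is a plain-Python approximation that reads the fixed-width columns of the PDB format directly. The function name `extract_module`, the chain identifier, and the residue range are hypothetical and must be chosen for each structure.

```python
# Residue names treated as nucleic acids, following the labels listed in step 3d.
NUCLEIC_ACID_NAMES = {"DA", "DT", "DC", "DG", "A", "U", "G", "C"}


def extract_module(pdb_lines, chain, start, end):
    """Return the protein ATOM records of one chain within residues start..end (inclusive).

    HETATM records (water, ions, ligands) and nucleic acid residues are dropped.
    Field positions follow the fixed-width PDB format.
    """
    kept = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skips HETATM, header, and connectivity records
        if line[17:20].strip() in NUCLEIC_ACID_NAMES:
            continue  # skips DNA and RNA residues
        if line[21] != chain:
            continue  # skips other chains, e.g., the second monomer of a dimer
        if start <= int(line[22:26]) <= end:
            kept.append(line.rstrip("\n"))
    kept.append("END")
    return kept
```

The returned lines can be written to a new pdb file and used as the query. Because only ATOM records are kept, modified residues stored as HETATM records would also be lost, so the output should still be inspected in PyMOL.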

Figure 1.

Schematic representation of preparing protein structure files as queries for 3D comparison

Schematic flowchart illustrating the step-by-step method to obtain raw structures and process them into query structures for 3D homology search. Steps 1–3 are indicated within the processing workflow.

Table 1.

Essential parameter settings for protein structure prediction with Google ColabFold

Parameter name | Suggestion (for this step) | Default?
query_sequence | The input amino acid sequences of interest | No
jobname | A file name with taxonomy, gene name, and protein ID | No
num_relax | 0 | Yes
template_mode | none | Yes
msa_mode | mmseqs2_uniref_env | Yes
pair_mode | unpaired_paired | Yes
num_recycles | Default setting is 3; higher numbers can yield more accurate results but increase running time | Depends
recycle_early_stop_tolerance | auto | Yes
pairing_strategy | greedy | Yes
max_msa | auto | Yes
num_seeds | 1 | Yes
dpi | 200 | Yes

Conduct 3D homology search

Timing: 1 day

Conduct a 3D homology search and inspect the search results. Identify structurally equivalent regions and validate them through a reverse search (Figure 2). The aim is to identify protein modules with a 3D conformation similar to the query structure.

  • 4.
    Use the Dali Server to conduct a 3D homology search across different proteomes.
    • a.
      Navigate to the “AF-DB search” tab on the Dali Server homepage (http://ekhidna2.biocenter.helsinki.fi/dali/).
    • b.
      Upload the processed pdb format query file.
    • c.
      Choose the target organism and submit the job. It may take several hours for the results to be ready. Optionally, provide an email address to receive job status notification. Troubleshooting 2.
    • d.
      When the job is completed, save the resulting webpage as an independent file.
    • e.
      Document the search results in csv format for future reference. The file should include the following parameters (Table 2):
    • f.
      For each subject structure identified by the Dali search, use the AlphaFold ID to obtain its pdb formatted structure from the AlphaFold database. Troubleshooting 1.
      Alternatives: In addition to the Dali server, alternative 3D comparison engines such as FoldSeek can be used to identify homologous structures. To conduct a search in FoldSeek, upload the pdb files as queries on FoldSeek website (https://search.foldseek.com/search) and initiate the search. In the main page's 'DBs & search settings' tab, an optional 'Taxonomic filter' allows for species-specific searches. The search results will be displayed in the outcomes section. Retrieve key information, including AlphaFold ID, gene name, and structurally equivalent regions, which can be found by expanding the alignment details, into a csv file for downstream analysis.
      Alternatives: For proteomes not included in Dali Server, a local Dali tool can be set up using Dali Lite (http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html).
      Pause point: Retain the collected csv file for immediate access and future use whenever downstream analysis is required.
  • 5.
    Manually inspect the 3D homology search results to analyze the similarity between the query structure and proteins from the Dali search results, identify structurally equivalent regions in the subject structure, and annotate unrecognized protein modules, domains of unknown function (DUFs), or orphan domains.
    • a.
      Use PyMOL to analyze the similarity between the query structure and proteins from the Dali search results. Identify the precise structurally equivalent region in the subject structure and generate the processed file accordingly.
      • i.
        Launch PyMOL and load both the query and subject structures. Assume the files are named “Query.pdb” and “Subject_A.pdb”.
      • ii.
        Superimpose the two structures. In the command line denoted by “PyMOL>”, type the command:
        >cealign Query, Subject_A
        This command structurally aligns the two modules in the display window. Be aware that only the selected structures in the right column are shown. Users can left-click the file name in the right column to show or hide the structure.
        Alternatives: The "cealign" command is particularly effective for aligning proteins even when they have limited similarities. However, if you only focus on proteins with closer homology, you can opt for the "super" or "align" command to carry out the alignments.
      • iii.
        Conduct a visual assessment to evaluate the similarities between the query and subject structures. Verify whether the subject structures contain the defining features of the query protein models.
        Note: For example, the BEN domain is defined by the presence of five α-helices and a middle loop arranged in a globular conformation. Therefore, the presence of these elements in the structurally equivalent region of the subject structure, especially the central helix α5 and the middle loop, becomes the criterion for evaluation.
        CRITICAL: As shown in Figure 3, the Insv-BEN structure superimposes well with the zebrafish AF-A0A2R8QJE3-F1 protein, which encompasses all five helices and the middle loop. Consequently, we can deduce that the AF-A0A2R8QJE3-F1 protein contains a region that is structurally analogous to the queried Insv-BEN domain.
        In contrast, when the queried Insv-BEN is compared with AF-P52945-F1, a human protein with a Dali Z-score of 4.7, the aligned region in the subject model covers only specific segments of the BEN domain. Specifically, the 150–213 region of human AF-P52945-F1 contains only three α-helices and lacks the corresponding middle loop. This discrepancy indicates that the AF-P52945-F1 protein does not contain a BEN domain and should be excluded from further analysis.
        Note: Color coding the protein by secondary structure can speed up the manual inspection. In the internal GUI window (the panel to the right of the visualization area), left-click the “C” color button, select “by ss”, and then select any of the color combinations provided. This colors the protein by secondary structure and helps to determine the boundaries of the domain or motif of interest (Figure 4).
        CRITICAL: There might be multiple structurally equivalent regions in one subject protein. However, the alignment commands in PyMOL reveal only the most similar region. To uncover additional analogous segments after the initial inspection, we recommend a two-step process: first, remove the previously identified region, and then perform the superimposition once more. This approach helps identify any other regions that may align with the query structure. Here are the necessary steps:
        To delete the structurally equivalent region in the original subject structure, right-click on the highlighted area from the first-round superposition and select "remove." Rerun the "cealign" command in PyMOL to superimpose the query structure and the remaining subject structure.
        Continue the refinement and manual comparison process as previously outlined. Repeat this procedure until no further matching regions are identified within the subject model (Figure 5). Troubleshooting 5.
      • iv.
        Document the start and end positions of the structurally equivalent region in the subject structure.
      • v.
        Extract the structurally equivalent regions from the original subject pdb file as separate pdb files. To achieve this, load the original subject structure in PyMOL. Select the residues located outside the defined structurally equivalent region. Then, right-click and choose the “remove” option to delete these residues. Save each of these structurally equivalent modules as independent pdb files.
        Pause point: Temporarily stop the manual inspection process between the completion of individual structure analyses, ensuring each structure is fully analyzed before pausing.
    • b.
      One of the most powerful applications of this protocol is the annotation of unrecognized protein modules, domains of unknown function (DUFs), or orphan domains. To check whether the newly identified structurally equivalent regions have been annotated as any domains or DUFs, follow these steps (Figure 6):
      • i.
        Locate the UniProt entries of the proteins containing the subject structure. To do this, retrieve the UniProt IDs from the Dali search results and search for each protein on the UniProt main page. Alternatively, the AlphaFold database provides a direct link to the corresponding UniProt information page for each protein.
      • ii.
        Access the "Family & Domains" tab on the UniProt main page to obtain information about the annotated domains and their positions. This tab provides a chart describing the known domains of the protein. Expand the hidden amino acid sequences by clicking the triangle-shaped icon to examine the detailed sequence information associated with each domain.
        Note: In the annotation section, some proteins include links to InterPro/PROSITE, which provide detailed explanations about the domain's annotation rules.
      • iii.
        If the structurally equivalent region has been annotated as a domain, thoroughly review the literature for structure-defining features. Then, determine if the query structure belongs to this known domain. It is also reasonable to experimentally test if they share similar molecular functions. Troubleshooting 4.
        Pause point: Interrupt the domain function analysis only after fully analyzing each individual structure, ensuring a complete evaluation before any pause.
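
For step 5a-iv, the start and end positions can be drafted from the Dali alignment text before they are refined by visual inspection. The sketch below is illustrative only. It relies on the Dali convention that structurally equivalent residues are shown in uppercase, and it assumes that the aligned subject sequence is contiguous and that the number of its first residue is known. The function name `equivalent_ranges` and the gap characters are assumptions.

```python
def equivalent_ranges(aligned_seq, first_residue=1, gap_chars="-."):
    """Return (start, end) residue ranges for each run of uppercase letters.

    aligned_seq is the subject line of a pairwise alignment in which uppercase
    letters mark structurally equivalent residues and lowercase letters mark
    non-equivalent ones. Gap characters do not consume a residue number.
    """
    ranges = []
    position = first_residue
    run_start = None
    for char in aligned_seq:
        if char in gap_chars:
            continue
        if char.isupper():
            if run_start is None:
                run_start = position  # a new equivalent segment begins here
        elif run_start is not None:
            ranges.append((run_start, position - 1))  # segment ended at the previous residue
            run_start = None
        position += 1
    if run_start is not None:
        ranges.append((run_start, position - 1))
    return ranges
```

Short lowercase stretches between two uppercase runs usually correspond to loops inside one domain, so adjacent ranges often need to be merged by hand after viewing the superimposition in PyMOL.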
  • 6.
    Use the pLDDT score for each predicted structure as a quality control measure to evaluate the confidence of the structure prediction. Employ the Average B factors tool to facilitate this inspection, following these necessary steps:
    • a.
      On the Average B factors tool webpage (https://swift.cmbi.umcn.nl/servers/html/listavb.html), select and upload the pdb file obtained from step 5a.
    • b.
      Display the outcome as a chart, detailing the type and number of amino acids. The pLDDT score for the predicted structure is embedded in the “B-factor” field of the pdb file. The score under "All" signifies the pLDDT score of each amino acid within the predicted structure.
    • c.
      Ideally, the average pLDDT score for amino acids within the query structure should exceed 70. Optionally, you can visualize the pLDDT scores of the structure using the following PyMOL command:

> spectrum b, blue_white_red, minimum=20, maximum=80

Figure 2.

Schematic representation of conducting a 3D homology search

This figure outlines the process of inputting the query into a 3D homology search. Each subject structure identified was manually inspected and refined to highlight structurally equivalent regions. Some of the resulting hits were checked for the pLDDT score and then used in a reverse search. Steps 4–7 are marked within the depicted workflow.

Table 2.

Key parameters and information to collect for subsequent structural analysis

Parameter name | Suggestion (for this step) | Default?
AlphaFold ID | | Yes
Z Score | We recommend setting the Z score cutoff at 3 or above. Use the Z score to rank the search outcomes | Yes
Rmsd | | Yes
Lali | | Yes
Nres | | Yes
Percent of sequence identity (% id) | | Yes
Gene name | Retrieved by a database search using the AlphaFold ID |
Structurally equivalent residues in Dali | Displayed as protein sequences in uppercase letters within the "Pairwise Structural Alignments" section on the Dali search results page |
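
The record keeping described in step 4e and Table 2 can be scripted once the hits have been transcribed from the results page. The sketch below is one possible layout and not a parser for Dali output. The column names in `FIELDS` and the function name `write_hits_csv` are hypothetical, and each hit is assumed to be a dictionary filled in by the user.

```python
import csv

# Hypothetical column names that mirror the rows of Table 2.
FIELDS = ["alphafold_id", "z_score", "rmsd", "lali", "nres",
          "pct_id", "gene_name", "equivalent_residues"]


def write_hits_csv(hits, handle, z_cutoff=3.0):
    """Write hits with a Z score at or above z_cutoff, ranked by descending Z score.

    hits is an iterable of dictionaries keyed by the names in FIELDS.
    Missing fields are written as empty cells. Returns the number of rows kept.
    """
    kept = sorted(
        (hit for hit in hits if float(hit["z_score"]) >= z_cutoff),
        key=lambda hit: float(hit["z_score"]),
        reverse=True,
    )
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(kept)
    return len(kept)
```

When writing to disk, open the output file with `newline=""` as recommended for the Python `csv` module, so that spreadsheet software does not show blank rows between records.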

Figure 3.

Visual assessment of query and subject structures in the PyMOL interface

This figure illustrates the critical step of visually comparing the query and subject structures. Using the defining features from the query as benchmarks, the aligned subject molecule in PyMOL is inspected to determine if it contains the expected secondary structures.

Figure 4.

Color-coded representation of protein by secondary structures

This figure shows how the 'color by ss' option color-codes different secondary structures within a molecule.

Figure 5.

Workflow for identifying multiple structurally equivalent regions in the same subject

The figure outlines the procedure to handle subjects with multiple structurally equivalent regions. After aligning the subject structure with the query in PyMOL, the first structurally equivalent region is identified and then deleted. The resulting subject structure is realigned with the query to check for the presence of a second structurally equivalent region.

Figure 6.

Workflow to verify protein annotation information of structurally equivalent regions

Using the structure CG42854 as an example, prior steps determined that it contains a region structurally equivalent to the query Insv-BEN. CG42854 amino acids 48–126 are annotated as DUF4806 in UniProt. This led to further investigation of the relationship between DUF4806 and the BEN domain.

The "spectrum" command in step 6c colors the displayed molecule, highlighting regions with high pLDDT scores in red, regions with moderate scores in white, and regions with low scores in blue. Adjust the color scale by modifying the minimum and maximum values as needed.

Note: Within motifs or domains, amino acids of loop regions might be unstructured and display pLDDT scores below the designated confidence threshold. However, as long as these positions don't alter the defining features of the query protein model, they remain valid for analysis.
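
As an offline alternative to the Average B factors tool in step 6, the average pLDDT can be computed directly from the pdb file. The sketch below is illustrative and assumes a predicted model in standard fixed-width PDB format, in which the pLDDT is stored in the B-factor columns. It reads one value per residue from the Cα atom. The function name `mean_plddt` is hypothetical.

```python
def mean_plddt(pdb_lines, atom_name="CA"):
    """Return the mean B-factor (pLDDT for predicted models) over one atom per residue.

    Only ATOM records whose atom name matches atom_name are used, so each
    residue contributes a single value.
    """
    scores = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == atom_name:
            scores.append(float(line[60:66]))  # B-factor field, columns 61-66
    if not scores:
        raise ValueError("no matching ATOM records found")
    return sum(scores) / len(scores)
```

Passing an open file handle of the extracted module to `mean_plddt` returns a value that can be compared with the threshold of 70 suggested in step 6c. For an experimentally solved structure, the same field holds true B-factors and the result should not be read as a confidence score.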

  • 7.
    Use the reverse 3D search as a validation process after completing a 3D homology comparison and manual inspection. This involves using the identified structurally equivalent regions for a 3D search on the Dali server, aiming to recognize the original query protein. Follow these steps to execute this process:
    • a.
      Prepare the isolated structurally equivalent modules obtained from step 5a.
    • b.
      Upload the structurally equivalent modules to the Dali server and carry out the 3D homology search. During this reverse search, ensure that the target proteome is that of the species from which the original query structure originated.
    • c.
      Search for the original query’s AlphaFold ID within the outcome page in the web browser. If the reverse search does not identify the initial query structure, it indicates that the conclusion drawn from the initial 3D search result is invalid.

Note: The initial query structure might appear across protein isoforms. As such, it can manifest under multiple AlphaFold IDs.

Alternatives: If the structure query originates from an experimentally determined structure, the "pdb search" function provided by the Dali Server may be used. This function conducts a reverse search in all structures in the PDB database.
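
The check in step 7c can be reduced to a membership test once the AlphaFold IDs of the reverse-search hits have been collected. The sketch below is illustrative. The function name `reverse_search_confirms` is hypothetical, and the caller is assumed to supply every AlphaFold ID under which the original query may appear, including isoforms.

```python
def reverse_search_confirms(hit_ids, query_ids):
    """Return True if any AlphaFold ID of the original query appears among the hits.

    hit_ids: AlphaFold IDs listed on the reverse-search results page.
    query_ids: all AlphaFold IDs of the original query protein, including isoforms.
    """
    return not set(query_ids).isdisjoint(hit_ids)
```

A result of `False` corresponds to the case in step 7c in which the reverse search does not recover the initial query.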

Expected outcomes

The proposed protocol consists of three parts: the preparation of the query 3D structure, the execution of the 3D homology search, and the interpretation and validation of the 3D homology search results. Here are examples of the search outcomes:

Example 1:

To investigate whether there are undiscovered BEN domains in the proteome of C. elegans, a 3D homology search can be conducted (Figure 7). The query is the BEN domain of the Drosophila Insv protein, which has been solved in complex with DNA (PDB: 4IX7). The Insv-BEN domains form dimers in the crystal structure. Consequently, process the DNA-protein complex structure to eliminate unnecessary molecules and retain solely the monomeric BEN domain. The resulting monomeric BEN model serves as the “query structure” for the 3D homology search.

Figure 7.

Identification of LIN-14 as a novel BEN domain

This figure outlines the step-by-step process of determining that LIN-14 contains an unrecognized BEN domain.

After manual inspection of the search hits, 12 C. elegans proteins containing BEN-like regions were identified. For example, the LIN-14 (AF-Q21446-F1) protein, which is known to have an orphan DNA binding domain at amino acids 244–465, harbors a typical BEN domain in the 334–456 region. This region shows a high pLDDT score and can be validated through a reverse 3D search in the PDB database, which not only reveals Drosophila Insv-BEN (4IX7) but also identifies the human BANP-BEN (7YUG) and BEND6-BEN (7YUL) domains. Consequently, these observations suggest that the LIN-14 344–456 region is a novel BEN domain.

Example 2:

This protocol can also uncover potential inaccuracies in annotations. Previous bioinformatics analysis has identified multiple BEN factors within poxvirus proteomes (Figure 8). Considering the increasing public health risks linked with the mpox virus, we directed our attention to this virus. We focused on the mpox viral protein Q3I8W1, which is annotated as a “BEN domain-containing protein”. To validate the presence of a BEN domain within Q3I8W1, we used this protein as a query to perform 3D homology searches in the Drosophila and human proteomes.

Figure 8.

mpox Q3I8W1 protein does not contain a BEN domain

This figure details the step-by-step evaluation that concluded that mpox Q3I8W1, which is annotated as a “BEN domain-containing protein”, does not contain a BEN domain.

However, the search results yielded no match to any known BEN domain-containing protein. This implies that Q3I8W1 is unlikely to be a BEN factor. To validate this inference, we conducted structural alignments. Using the "cealign" command in PyMOL, we superimposed Q3I8W1 with the Drosophila Insv-BEN and human BEND3-BEN4, respectively. The results revealed very limited similarity between Q3I8W1 and either Insv-BEN or BEND3-BEN4, demonstrating that Q3I8W1 is not a BEN domain protein.

Limitations

The effectiveness of our method relies on the accuracy of structure prediction. However, AlphaFold2 and other multiple sequence alignment (MSA)-based prediction algorithms are insensitive to structure-disrupting mutations in input proteins,10 which poses challenges for annotating protein domains with subtle yet pivotal sequence changes. Furthermore, a considerable number of experimentally determined structures come from complexes formed by several proteins or molecules, and the current AlphaFold and AlphaFold-Multimer algorithms are still limited in accurately predicting conformational changes within protein complexes. Consequently, using structures derived from solved complexes to identify protein models in a predicted database may give inadequate results.

Troubleshooting

Problem 1

The structure matching an AlphaFold ID provided by the Dali server cannot be located in the AlphaFold database (related to Step 4f).

Potential solution

If no model can be found for a given AlphaFold ID, use its corresponding UniProt ID, which is sandwiched between the two hyphens, to conduct the search on the UniProt website. Find the structure under the "Structure" tab on the protein detail page.
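
When many hits are affected, the UniProt ID can be extracted from each AlphaFold ID programmatically. The following is a minimal sketch that assumes IDs of the form seen in this protocol, such as AF-P52945-F1, where the part after the last hyphen is the fragment label. The function name `uniprot_accession` is hypothetical.

```python
def uniprot_accession(alphafold_id):
    """Extract the UniProt accession from an AlphaFold ID such as "AF-P52945-F1"."""
    prefix = "AF-"
    if not alphafold_id.startswith(prefix):
        raise ValueError(f"not an AlphaFold ID: {alphafold_id!r}")
    # Split off the trailing fragment label (e.g., "F1"); the rest is the accession.
    accession, _, fragment = alphafold_id[len(prefix):].rpartition("-")
    if not accession or not fragment.startswith("F"):
        raise ValueError(f"unexpected AlphaFold ID layout: {alphafold_id!r}")
    return accession
```

For example, `uniprot_accession("AF-A0A2R8QJE3-F1")` returns `"A0A2R8QJE3"`, which can then be searched on the UniProt website.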

Problem 2

Dali Server does not include the species of interest (related to Step 4c).

Potential solution

If the species of interest is not included in the Dali Server database (e.g., mpox in Example 2), an alternative approach is to use the UniProt Proteomes database and ColabFold. First, retrieve the proteome of the species of interest from the UniProt Proteomes database (https://www.uniprot.org/), which provides the amino acid sequence of each protein in the proteome. Next, use ColabFold, introduced in step 2b, to predict the 3D structures of these proteins from their amino acid sequences. Then, set up a local Dali tool, as described in step 4, to perform the 3D homology search against the predicted structures. This process makes it possible to identify homologous structures for the proteins of interest even if the species is not present in the Dali Server database.

To address the absence of the species of interest in the Dali Server database, FoldSeek, a 3D comparison engine introduced in Step 4, can be employed. FoldSeek offers a broader range of species coverage compared to the Dali Server. However, it is important to note that there is a possibility that the AlphaFold2 Database may not have predicted the proteome of the species of interest. In such cases, the FoldSeek search may not yield any results, indicating either the absence of a similar structure or the absence of the species in the database. Therefore, it is crucial to carefully interpret FoldSeek search results.
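
For the first solution, the downloaded proteome has to be split into individual sequences before structure prediction. The sketch below is illustrative and assumes a FASTA file with UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`. The function name `read_fasta` is hypothetical.

```python
def read_fasta(lines):
    """Return a dictionary mapping each accession to its amino acid sequence.

    For UniProt-style headers (">db|ACCESSION|ENTRY_NAME ..."), the accession is
    the second pipe-separated field. Otherwise the first word of the header is used.
    """
    chunks = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            fields = header.split("|")
            accession = fields[1] if len(fields) >= 3 else header
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)  # sequences may span several lines
    return {acc: "".join(parts) for acc, parts in chunks.items()}
```

Each sequence in the returned dictionary can be submitted to ColabFold with a job name that contains its accession, which keeps the predicted pdb files traceable back to UniProt.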

Problem 3

Due to the isoform issue, the structure homology search results might contain multiple structures that belong to the same gene and interfere with downstream analysis (related to Step 4e).

Potential solution

As introduced in the “Conduct 3D homology search” section, the gene name is included in the collected csv file. To highlight duplicate genes using spreadsheet software, open the csv file, select the column containing the gene names, and navigate to the "Home" tab. Click "Conditional Formatting" and select "Highlight Cells Rules" followed by "Duplicate Values." Choose the desired highlighting options (duplicate or unique values) and click "OK." The duplicate genes will be highlighted in the chosen color.
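
The same check can be done without spreadsheet software. The sketch below is illustrative and assumes that the csv file has a header row with a gene name column. The column name `gene_name` and the function name `duplicated_genes` are hypothetical.

```python
from collections import Counter


def duplicated_genes(rows, column="gene_name"):
    """Return the sorted gene names that occur in more than one row.

    rows is an iterable of dictionaries, such as those produced by csv.DictReader.
    Empty or missing gene names are ignored.
    """
    names = [(row.get(column) or "").strip() for row in rows]
    counts = Counter(name for name in names if name)
    return sorted(name for name, count in counts.items() if count > 1)
```

The rows can be read with `csv.DictReader` from the collected csv file. Genes reported by the function have several hits, for example from isoforms, and only one representative per gene usually needs to be carried into downstream analysis.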

Problem 4

It is crucial to verify whether structurally equivalent regions adequately represent independent novel domains for downstream analysis (related to Step 5b).

Potential solution

Relying solely on structure similarity to identify protein domains can lead to incomplete or inaccurate domain assignments. Downstream biological validation experiments are essential for confirming the existence and functionality of newly identified protein domains. However, comparing protein structures using PyMOL can reveal shared structural features beyond the current region of interest for multiple potential candidates suspected to belong to the same structural domains. These shared structural features may provide valuable clues about the functionalities of the newly identified structural domains.

Problem 5

When a query structure is searched in the Dali Server, the proteins containing a matching subject structure are ranked by Z score. However, regardless of how many subject structures a protein contains, the Dali Server aligns only the one with the highest similarity to the query. As a result, a second subject structure in the same protein may be missed (related to Step 5a).

Potential solution

Download the PDB files of the proteins that contain the subject structures, and then perform the alignment in PyMOL following the process outlined in Figure 5. This allows you to check whether the target protein contains additional subject structures.
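One possible way to look for a second copy is to repeat the superposition while excluding the residues that were already matched. The sketch below is an illustration under stated assumptions, not the procedure shown in Figure 5: it only writes a PyMOL command script that uses the standard `load`, `create`, and `super` commands, and the file names, object names, and residue range are hypothetical placeholders.

```python
def build_pymol_script(query_pdb, target_pdb, first_hit_range):
    """Build a PyMOL command script that superposes the query twice.

    The first `super` call aligns the query to the whole target. The second
    call aligns a copy of the query to the target with the residues of the
    first hit excluded, so a second matching region (if any) can be inspected.
    `first_hit_range` is (start, end) in target residue numbering and is an
    assumed input taken from the first alignment.
    """
    start, end = first_hit_range
    if start > end:
        raise ValueError("first_hit_range must be given as (start, end)")
    lines = [
        f"load {query_pdb}, query",
        f"load {target_pdb}, target",
        # Best overall superposition of the query onto the target.
        "super query, target",
        # Independent copy so that the first superposition stays visible.
        "create query_copy, query",
        # Mask the region already matched and superpose again.
        f"super query_copy, target and not resi {start}-{end}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file names and residue range, for illustration only.
script = build_pymol_script("query.pdb", "target.pdb", (10, 80))
print(script)
```

The printed text can be saved to a `.pml` file and run from the PyMOL command line. PyMOL reports an RMSD for each `super` call; whether the second superposition reflects a genuine additional subject structure still needs visual inspection.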

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Yang Yu (yuy@ibms.pumc.edu.cn).

Technical contact

Technical questions on executing this protocol should be directed to and will be answered by the technical contact, Yang Yu (yuy@ibms.pumc.edu.cn).

Materials availability

There are no newly generated materials associated with this protocol.

Data and code availability

This study did not generate any datasets or code.

Acknowledgments

Work in Y.Y.’s group was supported by the National Natural Science Foundation of China (32170550); State Key Laboratory Special Fund (2060204); the Special Research Fund for Central Universities, Peking Union Medical College (2022-RC310-07); and CAMS (Chinese Academy of Medical Sciences) Innovation Fund for Medical Sciences (CIFMS) 2021-I2M-1-019.

Author contributions

Y.Y., A.P., and Y.Z. developed the analytics pipeline. Y.Y. supervised this study. Y.Y., A.P., R.G., and J.S. wrote the manuscript.

Declaration of interests

The authors declare no competing interests.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used ChatGPT by OpenAI to enhance the clarity of the language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

References

  • 1. Pan A., Zeng Y., Liu J., Zhou M., Lai E.C., Yu Y. Unanticipated broad phylogeny of BEN DNA-binding domains revealed by structural homology searches. Curr. Biol. 2023;33:2270–2282.e2. doi: 10.1016/j.cub.2023.05.011.
  • 2. Varadi M., Anyango S., Deshpande M., Nair S., Natassia C., Yordanova G., Yuan D., Stroe O., Wood G., Laydon A., et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
  • 3. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
  • 4. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 5. Duvaud S., Gabella C., Lisacek F., Stockinger H., Ioannidis V., Durinx C. Expasy, the Swiss Bioinformatics Resource Portal, as designed by its users. Nucleic Acids Res. 2021;49:W216–W227. doi: 10.1093/nar/gkab225.
  • 6. Mirdita M., Schütze K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger M. ColabFold: making protein folding accessible to all. Nat. Methods. 2022;19:679–682. doi: 10.1038/s41592-022-01488-1.
  • 7. Holm L., Laiho A., Törönen P., Salgado M. DALI shines a light on remote homologs: One hundred discoveries. Protein Sci. 2023;32. doi: 10.1002/pro.4519.
  • 8. van Kempen M., Kim S.S., Tumescheit C., Mirdita M., Lee J., Gilchrist C.L.M., Söding J., Steinegger M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2023. doi: 10.1038/s41587-023-01773-0.
  • 9. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
  • 10. Buel G.R., Walters K.J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 2022;29:1–2. doi: 10.1038/s41594-021-00714-2.

Articles from STAR Protocols are provided here courtesy of Elsevier
