Skip to main content
Computational and Structural Biotechnology Journal logoLink to Computational and Structural Biotechnology Journal
. 2025 Jan 22;27:467–477. doi: 10.1016/j.csbj.2025.01.016

AlphaFold 2, but not AlphaFold 3, predicts confident but unrealistic β-solenoid structures for repeat proteins

Olivia S Pratt a, Luc G Elliott a, Margaux Haon a,b, Shahram Mesdaghi a,c, Rebecca M Price a, Adam J Simpkin a, Daniel J Rigden a,
PMCID: PMC11795689  PMID: 39911842

Abstract

AlphaFold 2 (AF2) has revolutionised protein structure prediction but, like any new tool, its performance on specific classes of targets, especially those potentially under-represented in its training data, merits attention. Prompted by a highly confident prediction for a biologically meaningless, randomly permuted repeat sequence, we assessed AF2 performance on sequences composed of perfect repeats of random sequences of different lengths. AF2 frequently folds such sequences into β-solenoids which, while ascribed high confidence, contain unusual and implausible features such as internally stacked and uncompensated charged residues. A number of sequences confidently predicted as β-solenoids are predicted by other advanced methods as intrinsically disordered. The instability of some predictions is demonstrated by molecular dynamics. Importantly, other deep learning-based structure prediction tools predict different structures or β-solenoids with much lower confidence suggesting that AF2 alone has an unreasonable tendency to predict confident but unrealistic β-solenoids for perfect repeat sequences. The potential implications for structure prediction of natural (near-)perfect sequence repeat proteins are also explored.

Keywords: Alphafold, Structure prediction, Beta-solenoid, Model confidence, Repeat proteins

Graphical Abstract

graphic file with name ga1.jpg

1. Introduction

Accurate protein structure determination is essential for understanding biological processes and protein-related disease. Protein structures are experimentally determined mainly by protein crystallography or cryogenic electron microscopy (cryoEM) but these approaches can be very time-consuming, with single structures sometimes taking months or even years to solve [32]. For this reason, there has been great interest in methods for computational structure prediction, especially the ab initio or de novo methods that are independent of the availability of already known similar structures (templates). In 2020, at the 14th Critical Assessment of Protein Structure Prediction (CASP14), AlphaFold2 (AF2), a deep-learning based model, was introduced by Google DeepMind [38] and was recognised as a huge step forward in prediction accuracy in the absence of templates [19]. AF2 was then deployed on a large scale resulting in the AlphaFold Protein Structure Database (AFDB) [53] which has over 214 million models.

Like previous generations of modelling software [31], accurate template-independent structure prediction with AF2 depends [19] on the availability of a sufficiently large and diverse multiple sequence alignment (MSA). Analysis of the MSA highlights amino acid covariance i.e., which pairs of residues have co-evolved and are therefore likely to be close in 3D space in the folded structure. In addition to coordinates, AF2 produces two quality estimates. The pLDDT (predicted Local Distance Difference Test; [30]) is a per-residue local structure confidence score, ranging from 0 to 100, which is returned in the B-factor column of the structure. By convention, pLDDT is coloured from blue (high confidence) to red or orange (low confidence). The second measure, the PAE (Predicted Align Error), informs on global structural prediction confidence e.g., whether inter-domain orientations can be trusted [19].

AF2 can be used to explore the structures of understudied proteins. Here we apply AF2 to repeat proteins. These can be defined on the basis of repetitive sequences or repeating structure: repeating structure may or may not be visible as repetition at a sequence level. Repeat proteins are very diverse, resulting from the repetition of a single residue to much larger repeating sequences or structures [20]. Despite occurring in ∼14 % of proteins [29], they remain poorly characterised [26]. To address this, in 2014 the database RepeatsDB was launched to automate the annotation and classification of repetitive structures [6], [8]. It recognises five classes [20].

Class I consists of crystalline aggregates which have short repeat units between 1 and 2 residues, often associated with neurodegenerative disorders [22]. Class II refers to fibrous structures with repeats between 3 and 4 residues, for example, collagen. The most researched class, class III, are elongated structures which have repeats between 5 and 40 residues, consisting of solenoid and non-solenoid proteins. Class IV are ‘closed’ structures that have repeat lengths overlapping with classes III and V, but fold circularly. Finally, class V or ‘beads on a string’ structures can fold into independent stable domains, such as the DNA binding Zinc-finger domain [24].

Recently there has been increased interest in beta-solenoid (β-solenoid) proteins within class III. With repeat units between 5 and 30 residues, β-solenoids display repeating parallel beta strands separated by tight turns [21], but due to their vast architectural diversity, they remain largely uncharacterised. In 2023, AF2 was used to predict the structures of several new β-solenoids [33], highlighting disease-related repeats such as functional amyloids and the Chlamydia trachomatis protein PmpD.

As mentioned, for natural proteins (designed proteins can be an exception) lacking templates in the Protein Data Bank [3], AF2 requires a sufficiently information-rich MSA for accurate structure prediction [19], [46]. Thus, for singleton proteins that lack homologues, AF2 typically struggles to make confident predictions. However, surprisingly, we observed that when the repetitive sequence of human protein mucin 22 (UniProt: E2RYF6) is randomly permuted, AF2 still predicts a confident ꞵ-solenoid. In fact the randomly permuted version has a higher average pLDDT (90.3) than the ꞵ-solenoid portion of the native sequence in the AFDB entry (77.6; [53]). Nevertheless, the model of the permuted sequence appears implausible, showing uncompensated internal stacking of glutamic acid, which is generally unrealistic for natural and stable protein structures due to repulsion of the negative charges (Fig. 1). An exception might be proteins with internal counter-ions eg Ca2 + ([13]; PDB: 3p4g), but positioning Ca2+ in the model for Glu interaction reveals that it would clash with other parts of the structure and that a favourable binding site is not present (Fig S1). Consequently, we hypothesised that AF2 may have a bias towards predicting ꞵ-solenoids, which could affect structure prediction of natural repeat proteins, including those involved in disease. This report therefore aims to investigate this potential blind-spot.

Fig. 1.

Fig. 1

AF2 model from randomly permuting the repetitive sequence of protein mucin 22. The sequence was randomised into a 20-residue unit (sequence QTSTVIGTIASTETSSSTGI), and concatenated 10 times to form a 200-residue repeat protein . There is an uncompensated internal glutamic acid stack (pink), which is generally unfavourable to be within the protein core. Despite this, the model is highly confident (pLDDT 90.3) Like later figures, the model backbone is coloured by pLDDT: Dark blue for pLDDT> 90, cyan for 70 < pLDDT < 90, yellow for 50 < pLDDT < 70, orange for pLDDT < 50. Figure made using PyMOL (pymol.org).

2. Methods

2.1. Sequence & model generation

Firstly, amino acids were grouped into three categories based on their propensities for different secondary structure: favouring beta (C, F, I, T, V, W, Y), alpha (A, E, K, M, Q, R, L) and neutral (G, N, D, P, H, S) [7]. The ‘RANDOM’ function in Python (v3.11) was then used to create random 20-residue sequences, concatenated 10 times, producing 200-residue repeat sequences. In order to cover a range of propensities, amino acids were randomly selected from the different categories. They were drawn from one, two or all three of the pools. For initial experiments two sequences were generated from the alpha pool, three sequences from the beta pool and two sequences from the neutral pool. In addition, three sequences were generated from the following: the alpha and neutral pools combined, the beta and neutral pools combined, and the alpha and beta pools combined. Finally, fifteen sequences were generated from all three pools (i.e. a free choice from all 20 amino-acids). Nineteen mutants were also created by replacing aromatic or charged residues with alanine within each rung of several sequences that generated β-solenoid structures. The average propensity scores for both secondary structures were calculated by taking the mean for each sequence. 50 sets of models were then generated using the AlphaFold2 Colab page (v1.5.5) [35] under default settings.

The same process was then repeated, altering the repeat unit length to be between 5 and 30 residues. For each length, three random sequences were created: one with residues selected from the pool biased towards beta, one from the pool of alpha and beta combined, and one from all three pools combined. The sequences were then concatenated 10 times. Models were generated by AF2 locally (v1.5.2), using the parameter --num-models 5, resulting in a dataset of 78 sets of models.

2.2. Model classification using STRPsearch

To classify the fold of the rank 1 AF2 models, the software STRPsearch was used [40], and the e-value set between 0.1 and 10.0 with the --max-eval parameter. As this software uses a library of structures, for models that received no classification or multiple, the most appropriate classifier was manually selected by visualising secondary and tertiary structure.

2.3. Calculating RMSD between models 1–5

Using the ‘ALIGN’ command in PyMOL (v2.5.2), for each β-solenoid model, model 1 (the number here referring to the rank) was compared to models 2 through 5, and the root mean square deviation (RMSD) was calculated using the parameters cycles= 0, transform= 0.

2.4. Model validation with Verify3D

For the following analysis, model refers to rank 1 models only. To investigate the plausibility of the β-solenoid models, they were submitted to the Verify3D server [11]. Verify3D operates by first assigning an environment to each position in a 3D structure, based on secondary structure, burial vs exposure, and the polarity of the surroundings. A per-residue score is given to measure the compatibility of each amino-acid with its environment, based on statistical preferences observed in the PDB. The average Verify3D score was calculated by taking the mean of the per-residue average scores. A categorical failure is awarded by Verify3D if fewer than 80 % of residues within a model score > =0.1 in the 3D/1D profile.

As a standard, β-solenoid crystal structures identified by RepeatsDB were also analysed by Verify3D. First, the sequences were input into RADAR [14], to identify those that most exemplified ‘perfect’ sequence repetition. For normalisation, the RADAR score was divided by the repeat unit length. Those with a score above 7.0 were selected for comparison.

2.5. Modelling with ESMFold, AlphaFold3 and RoseTTAFold-All-Atom

The sequences were also modelled using the ESMFold Colab page or server [25] and AlphaFold3 (AF3) at the server or locally [1]. To compare the outputs, TM-scores between models for the same sequence were calculated [56], [57]. Average pLDDT values for the AF3 models were determined with the GEMMI library [55] in Python as AF3 gives atomic pLDDT. Finally, the sequences were modelled with RoseTTAFold-All-Atom (RFAA) [23] for additional comparison.

2.6. Sequence-based predictions of disorder and secondary structure

The predicted disorder for each sequence was determined with the AIUPred server [12]. The average probability was calculated by taking the mean of the per residue probabilities. Secondary structure was predicted with JPred [9].

2.7. All-atom molecular dynamics with GROMACS

All-atom molecular dynamics (MD) simulations were conducted using GROMACS 2022.2 [51] to study models containing internally stacked charged residues and the β-solenoid crystal structure 3JX8:A. The systems were solvated using the TIP3P water model [18] and neutralised with Na+/Cl⁻ ions. Long-range electrostatic interactions were determined using the particle-mesh Ewald algorithm [42]. Energy minimisation was performed for up to 10,000 steps using the steepest descent followed by the conjugate gradient algorithm. The temperature of the system was incrementally raised from 0 K to 300 K, followed by two 20 ps equilibration steps. Three 1 µs simulations were then carried out with a 2fs timestep using the CHARMM36 force field [4].

The production simulations employed a leap-frog integrator with linear centre-of-mass motion removal, running for a total of 500 million steps (equivalent to 1 µs) per simulation. Output data for coordinates, velocities, energies, and logs were recorded every 10,000 steps. Bond constraints for hydrogen atoms were applied using the LINCS algorithm with a sixth-order expansion and a single iteration. The Verlet cutoff scheme was utilised for neighbour searching with a grid-based approach and a neighbour list update interval of 10 fs. Short-range electrostatic and van der Waals interactions were truncated at 1.2 nm, with force-switching applied for van der Waals interactions starting at 1.0 nm. Temperature coupling was managed using a velocity-rescale thermostat at 310 K for protein and non-protein groups, while isotropic pressure coupling at 1 bar was maintained using the Berendsen algorithm.

Simulation results were visualised using VMD 1.9.3 [17]. Subsequent analysis was performed with the ‘MD-DaVis’ [28] and ‘MDAnalysis’ [34] Python modules, with RMSD values calculated using the ‘rms’ command.

2.8. Database searches

Sequences were searched against the UniRef90 [48] and PDB [3] databases using BLAST [2] and an e-value of 0.001 used to define a significant match. Pfam [37] and PDB [3] databases were also queried with HHPred [58] to search for chance sequence matches. Structural similarities of models to entries in the PDB or AFDB were determined using FoldSeek [52].

3. Results & discussion

3.1. Preliminary exploration of 20-residue repeat units highlights potentially problematic predictions

To begin, a small preliminary dataset of β-solenoid models was produced in which each model was of a randomly generated 20-residue repeat unit concatenated ten times. Of the 50 models generated, 36 resulted in a β-solenoid structure but there was some redundancy in the set as some sequences differed only at single positions with others. Nevertheless, these initial models served to identify interesting factors to explore, for example, investigating how model propensity for different secondary structure influences AF2 structure prediction. One particularly concerning observation identified involved the internal stacking of negatively charged residues within the protein core. Additionally, the protein structure validation software Verify3D highlighted several models as implausible, whilst alternative deep-learning modelling methods showed contradictory outputs to AF2. To test these observations on a larger scale, 78 sets of models were generated with AF2, covering a range of repeat unit lengths. The results for this dataset are presented below.

3.2. Expanding the exploration across different repeat unit lengths

3.2.1. AF2 generates confident β-solenoids across different repeat unit lengths

Following generation of the larger dataset in which the repeat unit lengths were altered, for each sequence modelled, the average propensity scores for alpha and beta secondary structure were determined and the model (here model refers to the rank 1 AF2 model), was fold classified by STRPsearch (Fig. 2). 32 of the 78 models were classified as β-solenoids, with a few having a potential repeat region identified by STRPsearch.

Fig. 2.

Fig. 2

Each point represents a model, its coordinates showing the average sequence propensity score for alpha secondary structure on the x-axis and beta secondary structure on the y-axis. The different markers represent the fold that each model was classified as, coloured by pLDDT - Dark blue for pLDDT> 90, cyan for 70 < pLDDT < 90, yellow for 50 < pLDDT < 70, orange for pLDDT < 50. ‘None’ refers to disordered proteins.

Despite being random repeat sequences that lack homologues, AF2 generates many confident β-solenoid models. With regard to secondary structure propensity, it appears that there is no strong pattern, with the exception that above a beta propensity of ∼1.3, there are no low-quality models. Figure S2 highlights that high pLDDT β-solenoids were produced across the size range in 32 of 78 (41 %) proteins, suggesting that the potential blind-spot observed in the preliminary exploration may extend across β-solenoids of all lengths.

3.2.2. Initial model observations show potentially implausible structures

Upon close examination of the β-solenoid models, it appeared that some possessed problematic features. Fig. 3a shows a confident model (pLDDT 83.5) with several hydrophobic residues located on the protein surface and obvious clashing of atoms. In addition, Fig. 3b displays another confident model (pLDDT 82.4) with aspartic acids stacked internally. Both observations are unlikely to occur within natural and stable proteins. Commonly, hydrophobic residues are found within the protein core, minimising their contact with solvent [50], although membrane proteins are an exception. Similarly, internal negatively charged stacking is energetically unfavourable due to repulsion [16] with few exceptions such as calcium-dependent antifreeze β-solenoid proteins, which have internal negative stacking to coordinate calcium ions [13]. As with the example mentioned in the Introduction, a favourable calcium binding site is not present in the model (Figure S1). Nevertheless, AF2’s confidence may therefore be explained by the fact that these calcium-bound structures were present in the training data, misleading its predictions. However, this does not explain other potentially problematic observations, which reinforce the suggestion that AF2 often produces confident but unrealistic β-solenoid predictions for artificial repetitive sequences.

Fig. 3.

Fig. 3

A) A model (sequence: FCTTCCIY) showing hydrophobic residues (shown as sticks), those on the protein surface are coloured pink. Hydrophobic amino acids are more likely to be found within the protein core. Clashing between residues is also observed. B) A different model (sequence: SMKLTDNQQDAAIIIIVDCFCI) showing uncompensated internal stacking of aspartic acid (pink) which is energetically unfavourable due to its negative charge. Figure made using PyMOL (pymol.org).

When analysing the 5 models for each sequence (the number corresponding to the rank), for one interesting case, models 1 and 2 differed significantly in secondary structure. Model 1 displays the β-solenoid fold (pLDDT 91.9), but model 2 appears as a long α-helix (pLDDT 87.2) (Fig. 4). Theoretically, both models may be plausible as some proteins are able to adopt multiple conformations [41] and proteins such as the prion protein (Prp) have both a normal and pathogenic form [43], which can differ in secondary structure [45]. However, these radically different predictions, both confident, more likely further highlight a contradiction in AF2’s predictions.

Fig. 4.

Fig. 4

Comparison of AF2 models 1 & 2 from the same random repeat sequence (WCFVCVIVTVFYCW). A) Model 1 (pLDDT 91.9) B) Model 2 (pLDDT 87.2). The RMSD between the two structures is 56.0 Å despite AF2 being confident in both. Figure made using PyMOL (pymol.org).

3.2.3. Model validation with Verify3D highlights model implausibility

To investigate the plausibility of the β-solenoid models further, they were validated quantitatively through Verify3D. Verify3D considers residue solvent accessibility, secondary structure, and the overall environment to determine whether a protein structure is plausible [11]. To establish the threshold for valid protein structures, seven β-solenoid crystal structures, chosen for the near-perfection of their sequence repeats, were first analysed. For three structures, Verify3D gave a categorical fail, possibly due to the much smaller number of β-solenoid structures available in the PDB when the software was developed. Nevertheless, all seven crystal structures received an average score > 0.1, allowing this to be adopted as threshold for interpretation of models.

After analysing the β-solenoid models, 20 out of 32 received an average score < 0.1 (Fig. 5), 16 of which had a pLDDT > 70. This highlights that AF2 generates confident β-solenoid models despite their scoring poorly by validation tools. It was hypothesised that models with a potential repeat region identified by STRPsearch may be more similar to natural proteins, as the software uses a library of structures for matching, and therefore receive a higher Verify3D score. However, no trend is observed.

Fig. 5.

Fig. 5

Verify3D results showing the average score for each model on the x-axis against average pLDDT on the y-axis. Each model is coloured by whether a repeat region was identified during classification via STRPsearch (blue for a positive identification, red otherwise).

As the Verify3D score for each model is an average, it was considered possible that problematic features may be neutralised by more favourable ones. For instance, one confident AF2 model received an average Verify3D score > 0.1, despite having several charged residues stacked internally, which would likely cause significant repulsion (Fig. 6). Consequently, AF2 models with a pLDDT > 70, but a Verify3D score < 0.1 or internal charged stacking, were defined as ‘problematic’.

Fig. 6.

Fig. 6

AF2 model (sequence: QNGITTKTKYGGRSHRDDDYD) with several charged residues inside the protein despite the pLDDT of 82. In some cases, a salt-bridge may be formed, but the large amount of charge inside this protein is more likely to destabilise the structure. Significant clashing of the stacked lysines with histidine residues is also observed. Figure made using PyMOL (pymol.org).

3.2.4. Comparison with alternative modelling methods

Alongside AF2, several other deep-learning protein prediction programs have been released in recent years. These include ESMFold, which uses a language model [25], RFAA [23] and AF3, which incorporates a diffusion model [1]. Interestingly, across the complete set of 78 sequences, AF3 predicted even more β-solenoid structures (44) than did AF2 (32) (Table S1). Crucially, however, the average pLDDT of the predictions was much lower with AF3 (60.9) than with AF2 (76.7). Furthermore, there were many fewer highly confident β-solenoid predictions with AF3: only two exceeded a threshold of 80 while 16 AF2 predictions exceeded that score and six of those had a mean pLDDT of greater than 90. One example, sequence IVYCYYVIFCVFC (13b in Table S1), was noted where AF3 produced a highly confident β-solenoid structure (mean pLDDT) that was instead predicted as an α-solenoid by AF2, but this exception apart, it is clear that AF3 does not share AF2’s tendency to predict highly confident β-solenoid structures for random repeat sequences. ESMfold and RFAA produced both fewer (25 and 8, respectively) and lower confidence (mean pLDDT 54.7 and 0.56 [RFAA pLDDTs are on a 0–1 scale]) β-solenoid predictions (Table S1).

For the 32 sequences that gave a β-solenoid prediction with AF2, a comparison with AF3 and ESMFold models was carried out. The TM-scores between AF2 and ESMFold models and AF2 and AF3 models were then calculated for each sequence. Overall, both ESMFold and AF3 frequently disagreed with AF2 in structure, pLDDT or both (e.g., Fig. 7).

Fig. 7.

Fig. 7

From left to right: AF2, ESMFold, AF3 models. pLDDT values are displayed beneath models. A) Example of a problematic model identified by Verify3D (sequence: CKMVKFQIQICF). B) Example of another problematic model (sequence: AVIHTPTYEAMNSVNKEHD). All three methods disagree in pLDDT, despite sometimes producing similar structures. Figure made using PyMOL (pymol.org).

Fig. 8 maps the structural divergence (difference in TM-score) between AF2 models and the predictions from ESMFold (x-axis) and AF3 (y-axis). Notably, a cluster of AF2 models defined as ‘problematic’ are found near the origin. This set generally have much low pLDDT values (especially in the case of ESMFold) meaning that the sequences in question give high confidence β-solenoids with AF2, but very different and low-confidence predictions with the alternative methods. These results suggest that any blind spot AF2 may have with regard to perfect repeat sequences and β-solenoids, is not shared by ESMFold and AF3.

Fig. 8.

Fig. 8

Comparison of AF2 models to ESMFold and AF3. The TM-score between AF2 and ESMFold models is shown on x-axis, and the TM-score between AF2 and AF3 is shown on the y-axis. Each point is coloured by both ESMFold pLDDT and AF3 pLDDT. Squares represent models that are considered to be ‘problematic’.

ESMFold also often predicted disordered structures. Additionally, AF3 sometimes generated long helical structures (e.g., Fig. 9), consistent with the observation that it can ‘hallucinate’ such structure for sequence that is in fact intrinsically disordered [1]. This led to the hypothesis (tested below) that these proteins may actually be intrinsically disordered, despite AF2’s confidence in the β-solenoid fold.

Fig. 9.

Fig. 9

From left to right: AF2, ESMFold and AF3 for the same sequence (CITYVCWFYFYCFFCC). Models are coloured by pLDDT. It can be seen that ESMFold models this protein as disordered, despite the confident β-solenoid structure from AF2. AF3 displays what could be a helical ‘hallucination’ for an intrinsically disordered sequence. Figure made using PyMOL (pymol.org).

Finally, it was noted that, for a large majority of the β-solenoid sequences, RFAA predicted a disordered structure (e.g., Fig. 10), supporting the hypothesis that these proteins if synthesised would be disordered. To explore this idea, sequence disorder predictions with AIUPred were carried out [12].

Fig. 10.

Fig. 10

From left to right: AF2, ESMFold, AF3 and RFAA predictions for the same sequence (VCFCCFVTTCYITVWCYFYIFWCFI). An example of a confident β-solenoid model from AF2 showing as a disordered protein when modelled with RFAA. Figure made using PyMOL (pymol.org).

3.2.5. Disorder predictions highlight another piece in the puzzle

Following disorder predictions, eight of the 32 β-solenoids had an average predicted disorder probability of ∼0.5 or above, seven of which had a pLDDT > 70 (Fig. 11). For those, ESMFold, RFAA, or both also predicted a disordered structure. Thus, AF2 may have a tendency to confidently fold some predicted disordered proteins composed of sequence repeats into secondary structure in a way that alternative modelling methods do not.

Fig. 11.

Fig. 11

AIUPred disorder prediction results showing the AF2 pLDDT of each model on the y-axis and the average probability of disorder on the x-axis. Eight out of the 32 β-solenoid models were predicted likely to be disordered (probability ∼0.5 or above), seven of which have a pLDDT > 70 (coloured red).

The comparison between AF2 and AF3 is also informative here. As mentioned, AF3 predicts more but lower confidence β-solenoids (Table S1). Analysis of mean pLDDT per model vs disorder prediction shows that the overall distribution across all models is comparable between the two methods, but there is a marked lower pLDDT for β-solenoids from AF3 compared to AF2 at higher values of predicted disorder (Fig S3). Thus, the ability of AF3 to correctly assign lower confidence to often implausible β-solenoid predictions may depend in part on detection of sequences with the characteristics to predict as intrinsically disordered.

3.2.6. Database searches demonstrate occasional chance matching of artificial sequences to natural sequences

It was observed that RFAA rarely agreed with the other methods both structurally and in confidence. However, where there seemed to be a consensus across all methods, it was hypothesised that the artificial sequences may be matching to natural ones coincidentally, which could be influencing structure prediction. The UniRef90 redundancy-reduced version of UniProt [48] and the PDB were therefore queried with all sequences using BLAST [2]. Somewhat surprisingly, all of the set of 78 sequences, though of random origin, had matches in UniRef90 with e-values below a 0.001 threshold (see Table S1). The number of sequences matching ranged from 1 to 873. However, only three sequences matched PDB entries and with much higher e-values (Table S1).

To further investigate any impact of chance sequence matches against experimental structures, the 32 sequences that AF2 modelled as β-solenoids were searched against the Pfam and PDB databases with HHPred [58]. Only three of the sequences produced significant hits, one of which was 6ZT4 from the Pentapeptide family (PF13599) (probability=98.4, E-value <0.01) [37]. This family does consist of β-solenoid structures, which were included in the training data for the modelling methods. Fig. 12 shows how the AF3 and RFAA models resemble 6ZT4, suggesting that the chance resemblance of the sequence in question, TGGDF (in its circularly permuted form, GDFTG), to the Pentapeptide family (consensus ADLSG) has influenced the predictions.

Fig. 12.

Fig. 12

A) From left to right: AF2, ESMFold, AF3 and RFAA models for the same artificial repeat sequence (TGGDF) coloured by pLDDT. Both AF2 and ESMFold agree structurally, but not in pLDDT, and disagree with AF3 and RFAA, which do agree structurally and in confidence. B) From left to right: AF3 and RFAA models, and PDB structure 6ZT4. It appears that both AF3 and RFAA may have been influenced by this Pfam family (PF13599) through a chance matching of their sequences, explaining their structural similarity and high confidence. Figure made using PyMOL (pymol.org).

For some sequences, the false positive hits detected are likely due to overrepresentation of cysteine, a rare amino acid [36]. It is possible that the different database search protocols implemented by the different deep learning-based methods are differently susceptible to detection of false positive hits. Fig. 13 illustrates another example where HHpred found a chance similarity to PF19627 (Activity-dependent neuroprotector homeobox protein N-terminal) domain sequences that may have influenced the RFAA prediction but not the others.

Fig. 13.

Fig. 13

From left to right: AF2, ESMFold, AF3 and RFAA models for the same artificial repeat sequence (CMDIMPQHSYTCIPCQTFVPIEMNHR). It can be seen that RFAA disagrees structurally to the other methods, possibly reflecting differences in how each modelling tool implements its database search. Figure made using PyMOL (pymol.org).

All models were queried for structural similarity against the PDB and AFDB (Table S1). There were very few significant matches (TMscore > 0.5) against experimental structures but a number of predictions, including some β-solenoids, did significantly resemble AFDB entries.

3.2.7. Molecular dynamics demonstrates model instability when compared to natural proteins

To test whether the models generated with high confidence, but internally stacked charged residues are unstable, all-atom MD was implemented with GROMACS. Within the preliminary dataset, there were two high confidence models with internally charged stacks, both of which were the same size. These models were selected for simulation along with a natural β-solenoid crystal structure (PDB: 3JX8:A), in which there is no internal charged stacking.

The results of these simulations showed that the crystal structure remained stable throughout the simulation (Fig. 14). In contrast, the models with negatively charged stacks were unstable. The model with internal glutamic acids (model 1 in Fig. 14) became twisted during the simulation, albeit without unfolding completely: it may be that the observed aromatic stacks prevent further distortions. For the model with internal aspartic acids (model 2), the instability is even more evident with unfolding starting around the charged stack and leading to a distorted structure that is bent along the solenoid axis. The clear instability of these β-solenoid structures, recalling their high pLDDT values, again argues that AF2 has a blind spot with regard to some sequence repeat structures.

Fig. 14.

Fig. 14

RMSD values over 1 microsecond for the three models tested with MD. Model 1 sequence: TVCWCMIWQCEEKFVIVEVM. Model 2 sequence: GDDPIYTKDNINPIGPYMGN. Model structures are highlighted at the start and end of the simulation. Negatively charged residues are shown as red sticks. It can be seen that both models with internal charged stacks are more unstable than the crystal structure β-solenoid.

3.2.8. β-solenoid bias may impact structure prediction of natural repeat proteins

As highlighted throughout this report, AF2 seems to have blind-spot surrounding β-solenoids when given random but ‘perfect’ repeat sequences. This has been demonstrated by the appearance of implausible features such as stacked charged residues, and quantified with Verify3D, which highlights model unfavourability despite AF2’s confidence. Additionally, when modelling the sequences with alternative modelling methods, a clear disagreement in confidence (and sometimes in fold prediction) reinforces this hypothesis. To test whether this bias towards β-solenoids is likely to influence prediction of natural proteins (some of which contain very near-perfect sequence repeats) we asked if the same pattern of higher pLDDT from AF2 compared to other methods was observed for these.

Accordingly, the sequences of Pfam families that AF2 had confidently predicted as possessing the β-solenoid fold [33], were modelled with the alternative deep-learning methods. For families PF07634, PF07012, PF02415, and PF01469, all four methods agreed structurally and in pLDDT, highlighting that these models are likely accurate. In each of these families the sequence repeats have diverged significantly (Table S2). Attention was then switched to the admittedly small numbers of near-perfect natural repeat sequences that may be influenced by this bias.

As Mesdaghi et al., [33] show, AF2 predicts the human mucin proteins to adopt the β-solenoid fold with confidence. The sequence of mucin 1 in particular, closely resembles the artificial sequences in this study, consisting of an exact repetition of a 20-residue sequence twelve times. Despite this, AF2’s confidence is low. Mucin 22 on the other hand is less ‘perfect’, but AF2 predicts the β-solenoid fold with confidence. When modelling the whole protein in AF3, the confidence is lower, suggesting that this prediction may be problematic (Fig. 15a, b).

Fig. 15.

Fig. 15

A) AF2 model of human protein mucin 22 from the AlphaFold Protein Structure Database [53]. B) AF3 model of mucin 22. C) AF2 model of mucin 22 fragment (residues 185–433). D) ESMFold model of mucin 22 fragment. E) AF3 model of mucin 22 fragment. F) RFAA model of mucin 22 fragment. Figure made using PyMOL (pymol.org).

Another pattern observed within this study was ESMFold and RFAA predicting disordered structures. Due to size constraints of ESMFold and RFAA, a fragment of mucin 22 (residues 185–433) was modelled for further comparison. Both AF2 and AF3 generated a β-solenoid model with confidence (Figs. 15c, 15e); however, both ESMFold and RFAA predict a more disordered structure (Figs. 15d, 15f). This suggests that this blind-spot has impacted this protein’s structure prediction, which should therefore be considered with scepticism.

4. Conclusions & further studies

The emergence of AlphaFold 2 has hugely advanced protein structure prediction, offering the capacity to model most proteins to unprecedented accuracy. Nevertheless, as with any new tool, it is important to consider its strengths and weaknesses and recent work has described its relatively poor performance on predicting coiled-coil protein topology [27] as well as a perceived blind-spot with regard to fold-switching proteins [5]. Evidence presented here suggests that AF2 has an unreasonable propensity to fold perfect sequence repeat proteins into β-solenoid structures, many of which – despite containing unusual and unstable features – it assigns high pLDDT values to.

The evidence for an AF2 blind-spot regarding β-solenoids is that more than 40 % of random repeat sequences over a range of lengths fold that way, a large majority with average pLDDT > = 70 (moderate confidence) and several with pLDDT > 90 (very high confidence). In some cases, there are dramatic inconsistencies within AF2’s sets of five predictions – sequences that highly confidently predict as a β-solenoid in one model, but a similarly confident α-helix in another. And the models themselves contain features – external hydrophobic stacks and, particularly, internal charged stacks – that are rarely observed in experimental protein structures. These peculiarities are detected, to some extent, by the structure validation tool Verify3D which assesses protein structural viability by considering the secondary structure-aware environments of different amino-acids. The internal charged stacks are demonstrated to lead to structural instability by molecular dynamics simulations. A quarter of sequences folded by AF2 into β-solenoids, all but one of moderate confidence or above, had mean intrinsic disorder scores from AIUPred of ∼0.5 or above suggesting that these sequences were in fact unlikely to have defined 3D folds. Finally, and importantly, alternative deep learning-based structure prediction tools, AF3, ESMFold and RFAA all behaved quite differently. Frequently, they produced quite different structure predictions but even when they did agree with the β-solenoid predicted by AF2 their confidence was much lower. The lower confidence of AF3 models may relate to the different way in which pLDDT is calculated for AF3 vs AF2 predictions [1]. Overall, for the sequences considered here, only AF2 has the twin concern of implausible structure predictions combined with high confidence estimates.

These findings raise questions as to why AF2, but not other tools, exhibits this apparent blind spot. It is particularly interesting to compare the unreasonable internal stacking of charged residues in many models with the observation that AF2 is considered to have effectively learned an energy function [44]. Conceivably, the explanation lies, as mentioned, with the internal calcium binding sites observed in some β-solenoids [13] at which the charge of acidic residues is compensated by the bound ion. Even though the models discussed here contain no positions resembling natural metal binding sites, AF2 might have learned that stacked acidic residues are acceptable in the context of the β-solenoid fold. Ultimately, the welcome and impressive ability of AF2 to accurately model ligand binding sites [19], even though the ligands are absent from its calculations, may come at a price.

Another factor to consider is the potential for the randomly generated repeat sequences to accidentally resemble natural repeat proteins or known structures in general. The modelling methods used here differ in their approaches to capture known sequence-structure relationships, with AF3, for example, using a simpler procedure than AF2 to search for homologous sequences [1]. However, chance resemblance to natural sequences and structures, revealed by HHPred analysis, appears to explain only a small part of the story: indeed, in the case of Pfam family PF13599 and its structure 6ZT4, only AF3 and RFAA models closely resembled the experimental structure.

Evidence from some natural repeat protein families suggests that the bias towards β-solenoids may only be consequential for repeat sequences that are (near-)perfect. Future work could explore whether this bias affects more imperfect sequences. For example, for the sequences in this study, each rung of the solenoid could be mutated randomly to disrupt their ‘perfection’ and the same methods applied, although whether this would meaningfully mimic the process by which repeat proteins become less perfect, post-duplication is open to debate. It would also be interesting to follow up this work with perfect repeat sequences that better followed natural amino-acid abundance rather than, as here, representing each amino-acid with equal frequency. Ultimately, experimental validation of β-solenoids predicted for natural proteins will be required for any confidence that the AF2 blind spot around β-solenoids affects only theoretical, perfect repeats.

Despite the fully justified excitement around AF2 and following structure prediction tools, it remains important to fully understand their strengths, weaknesses and quirks; particularly so since, like many deep learning-based tools, they have limited interpretability [49]. Exercises focused on performance against specific fold types, especially those that might be under-represented in training sets, are therefore important for each new tool that emerges (eg [54]). An alternative perspective on the blind spot reported here would view the perfect repeat sequences as “adversarial inputs” to AF2. Accordingly, and despite evidence that AF2 extrapolates beyond its training set (eg [10], [15]), they may be inputs that reveal unexpected weaknesses in neural networks. The long α-helices frequently predicted by AF2 for spurious and purely random sequences may represent a further case, and there too AF2 assigned high confidence to some models, although only to the shortest sequences [39]. The increasing use of deep learning methods for protein design brings the risk that undue confidence is placed on the results from adversarial inputs, emphasising the continued need for experimental structural biology validation of structure predictions of designed proteins. In conclusion, as computed structures [47], [53] become as broadly available as experimental structures, probing the performance boundaries of each new method is crucially important: only when all limitations and biases are comprehensively understood, can computational methods illuminate all uncharted areas of the protein universe.

Funding

This research was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) Ph.D. studentship to RP (project reference BB/T008695/1) and by CCP4 Collaborative Framework Funding for AJS. LE’s studentship is co-funded by CCP-EM.

CRediT authorship contribution statement

Haon Margaux: Investigation. Mesdaghi Shahram: Investigation. Pratt Olivia S.: Writing – original draft, Investigation, Formal analysis, Data curation. Elliott Luc G.: Investigation. Rigden Daniel: Writing – review & editing, Supervision, Project administration, Methodology, Investigation, Funding acquisition, Conceptualization. Price Rebecca M.: Investigation. Simpkin Adam J.: Investigation.

Declaration of Competing Interest

No conflicts to declare.

Footnotes

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.01.016.

Appendix A. Supplementary material

Supplementary material

mmc1.docx (4.9MB, docx)

Supplementary material

mmc2.xlsx (28.2KB, xlsx)

References

  • 1.Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., et al. Accurate structure prediction of biomolecular interactions with AlphaFold3. Nature. 2024 doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Berman H.M., Henrick K., Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol. 2003;10(12):980. doi: 10.1038/nsb1203-980. [DOI] [PubMed] [Google Scholar]
  • 4.Bjelkmar P., Larsson P., Cuendet M.A., Hess B., Lindahl E. Implementation of the CHARMM Force Field in GROMACS: analysis of protein stability effects from correction maps, virtual interaction sites, and water models. J Chem Theory Comput. 2010;6(2):459–466. doi: 10.1021/ct900549r. [DOI] [PubMed] [Google Scholar]
  • 5.Chakravarty D., Lee M., Porter L.L. Proteins with alternative folds reveal blind spots in AlphaFold-based protein structure prediction. arXiv. arXiv preprint arXiv:2410.14898. 2024 doi: 10.1016/j.sbi.2024.102973. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Clementel D., Arrías P.N., Mozaffari S., Osmanli Z., Castro X.A. In: RepeatsDB in 2025: expanding annotations of structured tandem repeats proteins on AlphaFoldDB. Ferrari C., Kajava A.V., Tosatto S.C., Monzon A.M., editors. Nucleic Acids Research; 2024. RepeatsDB curators. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Costantini S., Colonna G., Facchiano A.M. Amino acid propensities for secondary structures are influenced by the protein structural class. Biochem Biophys Res Commun. 2006;342(2):441–451. doi: 10.1016/j.bbrc.2006.01.159. [DOI] [PubMed] [Google Scholar]
  • 8.Di Domenico T., Potenza E., Walsh I., Parra R.G., Giollo M., Minervini G., et al. RepeatsDB: a database of tandem repeat protein structures. Nucleic Acids Res. 2014;42:D352–D357. doi: 10.1093/nar/gkt1175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Drozdetskiy A., Cole C., Procter J., Barton G.J. JPred4: a protein secondary structure prediction server. Nucleic Acids Res. 2015;43(W1):W389–W394. doi: 10.1093/nar/gkv332. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Durairaj J., Waterhouse A.M., Mets T., Brodiazhenko T., Abdullah M., Studer G., et al. Uncovering new families and folds in the natural protein universe. Nature. 2023;622(7983):646–653. doi: 10.1038/s41586-023-06622-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Eisenberg D., Luthy R., Bowie J.U. VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzym. 1997;277:396–404. doi: 10.1016/s0076-6879(97)77022-8. [DOI] [PubMed] [Google Scholar]
  • 12.Erdős G., Dosztányi Z. AIUPred: combining energy estimation with deep learning for the enhanced prediction of protein disorder. Nucleic Acids Res. 2024;52(W1):176–181. doi: 10.1093/nar/gkae385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Garnham C.P., Campbell R.L., Davies P.L. Anchored clathrate waters bind antifreeze proteins to ice. Biochemistry. 2011;108(18):7363–7367. doi: 10.1073/pnas.1100429108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Heger A., Holm L. Rapid automatic detection and alignment of repeats in protein sequences. Proteins. 2000;41(2):224–237. doi: 10.1002/1097-0134(20001101)41:2<224::aid-prot70>3.0.co;2-z. [DOI] [PubMed] [Google Scholar]
  • 15.Herzberg O., Moult J. More than just pattern recognition: prediction of uncommon protein structure features by AI methods. Proc Natl Acad Sci. 2023;120(28) doi: 10.1073/pnas.2221745120. e2221745120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Honig B., Nicholls A. Classical electrostatics in biology and chemistry. Science. 1995;268(5214):1144–1149. doi: 10.1126/science.7761829. [DOI] [PubMed] [Google Scholar]
  • 17.Humphrey W., Dalke A., Schulten K. VMD: visual molecular dynamics. J Mol Graph. 1996;14(1):27–28. doi: 10.1016/0263-7855(96)00018-5. 33-8. [DOI] [PubMed] [Google Scholar]
  • 18.Jorgensen W.L., Chandrasekhar J., Madura J.D., Impey R.W., Klein M.L. Comparison of simple potential functions for simulating liquid water. J Chem Phys. 1983;79:926–935. [Google Scholar]
  • 19.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., et al. Highly accurate protein structure with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Kajava A.V. Tandem Repeats in proteins: from sequence to structure. J Struct Biol. 2012;179(3):279–288. doi: 10.1016/j.jsb.2011.08.009. [DOI] [PubMed] [Google Scholar]
  • 21.Kajava A.V., Steven A.C. β-rolls, β-helices, and other β-solenoid proteins. Adv Protein Chem. 2006;73:55–96. doi: 10.1016/S0065-3233(06)73003-0. [DOI] [PubMed] [Google Scholar]
  • 22.Karlin S., Brocchieri L., Bergman A., Mrazek J., Gentles A.J. Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA. 2002;99(1):333–338. doi: 10.1073/pnas.012608599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Krishna R., Wang J., Ahern W., Strumeels P., Venkatesh P., Kalvet I., et al. Generalized biomolecular modeling and design with RoseTTAFold all-atom. Science. 2024;384(6693) doi: 10.1126/science.adl2528. [DOI] [PubMed] [Google Scholar]
  • 24.Lee M.S., Gippert G.P., Soman K.V., Case D.A., Wright P.E. Three-dimensional solution structure of a single zinc finger DNA-binding domain. Science. 1989;245(4918):635–637. doi: 10.1126/science.2503871. [DOI] [PubMed] [Google Scholar]
  • 25.Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–1130. doi: 10.1126/science.ade2574. [DOI] [PubMed] [Google Scholar]
  • 26.Luo H., Nijveen H. Understanding and identifying amino acid repeats. Brief Bioinform. 2014;15(4):582–591. doi: 10.1093/bib/bbt003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Madaj, R., Martinez-Goikoetxea, M., Kaminski, K., Ludwiczak, J.Dunin-Horkawicz, S. (2024) Applicability of AlphaFold2 in the modeling of dimeric, trimeric, and tetrameric coiled-coil domains. bioRxiv doi: 10.1101/2024.03.07.583852. [DOI] [PMC free article] [PubMed]
  • 28.Maity D., Pal D. MD DaVis: interactive data visualization of protein molecular dynamics. Bioinformatics. 2022;38(12):3299–3301. doi: 10.1093/bioinformatics/btac314. [DOI] [PubMed] [Google Scholar]
  • 29.Marcotte E.M., Pellegrini M., Yeates T.O., Eisenberg D. A census of protein repeats. J Mol Biol. 1999;293(1):151–160. doi: 10.1006/jmbi.1999.3136. [DOI] [PubMed] [Google Scholar]
  • 30.Mariani V., Biasini M., Barbato A., Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013;29(21):2722–2728. doi: 10.1093/bioinformatics/btt473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Marks D.S., Hopf T.A., Sander C. Protein structure prediction from sequence variation. Nat Biotechnol. 2012;30(11):1072–1080. doi: 10.1038/nbt.2419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.McPherson A., Gavira J.A. Introduction to protein crystallization. Acta Crystallogr Sect F Struct Biol Commun. 2014;1(70):2–20. doi: 10.1107/S2053230X13033141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Mesdaghi S., Price R.M., Madine J., Rigden D.J. Deep Learning-based structure modelling illuminates structure and function in uncharted regions of β-solenoid fold space. J Struct Biol. 2023;215(3) doi: 10.1016/j.jsb.2023.108010. [DOI] [PubMed] [Google Scholar]
  • 34.Michaud-Agrawal N., Denning E.J., Woolf T.B., Beckstein O. MDAnalysis: a toolkit for the analysis of molecular dynamics simulations. J Comput Chem. 2011;32(10):2319–2327. doi: 10.1002/jcc.21787. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Mirdita M., Schutze K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger ColabFold: making protein folding accessible to all. Nat Methods. 2022;19:697. doi: 10.1038/s41592-022-01488-1. -682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Miseta A., Csutora P. Relationship between the occurrence of cysteine in proteins and the complexity of organisms. Mol Biol Evol. 2000;17(8):1232–1239. doi: 10.1093/oxfordjournals.molbev.a026406. [DOI] [PubMed] [Google Scholar]
  • 37.Mistry J., Chuguransky S., Williams L., Qureshi M., Salazar G.A., Sonnhammer E.L.L., et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49(D1):D412–D419. doi: 10.1093/nar/gkaa913. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Topf, M. (2020) Critical Assessment of techniques for protein structure prediction, fourteenth round. CASP 14 Abstract Book. Available from: 〈https://www.predictioncenter.org/casp14/doc/CASP14_Abstracts.pdf〉.
  • 39.Monzon V., Haft D.H., Bateman A. Folding the unfoldable: using AlphaFold to explore spurious proteins. Bioinform Adv. 2022;2(1):vbab043. doi: 10.1093/bioadv/vbab043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Mozaffari, S., Arrias, P.N., Clementel, D., Piovesan, D., Ferrari, C., Tosatto, S.C.E., (2024) STRPsearch: fast detection of structured tandem repeat proteins. bioRxiv. Available from: doi: 10.1101/2024.07.10.602726. [DOI] [PMC free article] [PubMed]
  • 41.Murzin A.G. Metamorphic proteins. Science. 2008;320(5884):1725–1726. doi: 10.1126/science.1158868. [DOI] [PubMed] [Google Scholar]
  • 42.Pronk S., Pall S., Schulz R., Larsson P., Bjelkmar P., Apostolov R., et al. GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics. 2013;29(7):845–854. doi: 10.1093/bioinformatics/btt055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Prusiner S.B. Prions. Proc Natl Acad Sci USA. 1998;95(23):13363–13383. doi: 10.1073/pnas.95.23.13363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Roney J.P., Ovchinnikov S. State-of-the-art estimation of protein model accuracy using AlphaFold. Phys Rev Lett. 2022;129(23) doi: 10.1103/PhysRevLett.129.238101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Rufai S.B., Gupta A., Singh S.(2019) Prion diseases: a concern for mankind. In: Hameed S., Fatima, Z. (eds) doi: 10.1007/978-981-32-9449-3_14. Pathogenicity and Drug Resistance of Human Pathogens. Available from: . [DOI]
  • 46.Simpkin A.J., Mesdaghi S., Sánchez Rodríguez F., Elliott L., Murphy D.L., Kryshtafovych A., et al. Tertiary structure assessment at CASP15. Protein Struct Funct Bioinform. 2023;91(12):1616–1635. doi: 10.1002/prot.26593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Steinegger, M., Levy Karin, E., & Kim, R.S. (2024). BFVD-a large repository of predicted viral protein structures. bioRxiv, 2024-09. Available from: 10.1101/2024.09.08.611582. [DOI] [PMC free article] [PubMed]
  • 48.Suzek B.E., Wang Y., Huang H., McGarvey P.B., Wu C.H., UniProt Consortium UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926–932. doi: 10.1093/bioinformatics/btu739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Tan, J., & Zhang, Y. (2023). Explainablefold: understanding alphafold prediction with explainable ai. In: Proceedings of the twenty ninth ACM SIGKDD conference on knowledge discovery and data mining: pp. 2166-76.
  • 50.Tanford C. Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. J Am Chem Soc. 1962;84(22):4240–4247. [Google Scholar]
  • 51.Van Der Spoel D., Lindhal E., Hess B., Groenhof G., Mark A.E., Berendsen H.J.C. GROMACS: fast, flexible and free. J Comput Chem. 2005;26(16):1701–1718. doi: 10.1002/jcc.20291. [DOI] [PubMed] [Google Scholar]
  • 52.Van Kempen M., Kim S.S., Tumescheit C., Mirdita M., Lee J., Gilchrist C.L., et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024;42(2):243–246. doi: 10.1038/s41587-023-01773-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Varadi M., Bertoni D., Magana P., Paramval U., Pidruchna I., Radhakrishnan M., et al. AlphaFold protein structure database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024;52(D1):368–375. doi: 10.1093/nar/gkad1011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Wojciechowska, A.W., Wojciechowski, J.W. & Kotulska, M. (2024) Non-standard proteins in the lenses of AlphaFold3 - case study of amyloids. bioRxiv. Available from: doi: 10.1101/2024.07.09.602655. [DOI]
  • 55.Wojdyr M. GEMMI: a library for structural biology. J Open Source Softw. 2022;7(73):4200. [Google Scholar]
  • 56.Xu J., Zhang Y. How significant is a protein structure similarity with TM-score = 0.5? Struct Bioinforma. 2010;26(7):889–895. doi: 10.1093/bioinformatics/btq066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Zhang Y., Skolnick J. Scoring function for automated assessment of protein structure template quality. Protein Struct Funct Bioinform. 2004;57(4):702–710. doi: 10.1002/prot.20264. [DOI] [PubMed] [Google Scholar]
  • 58.Zimmermann L., Stephens A., Nam S., Rau D., Kubler J., Lozajic M., et al. A completely reimplemented MPI bioinformatics Toolkit with a New HHpred server at its core. J Mol Biol. 2018;430(15):2237–2243. doi: 10.1016/j.jmb.2017.12.007. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary material

mmc1.docx (4.9MB, docx)

Supplementary material

mmc2.xlsx (28.2KB, xlsx)

Articles from Computational and Structural Biotechnology Journal are provided here courtesy of AAAS Science Partner Journal Program

RESOURCES