Skip to main content
Genome Biology and Evolution logoLink to Genome Biology and Evolution
. 2023 Jul 18;15(7):evad129. doi: 10.1093/gbe/evad129

A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C3 and C4 Distinguishing Sequence Information

Nilanth Yogadasan 1,, Andrew C Doxey 2, Simon D X Chuong 3
Editor: Itay Mayrose
PMCID: PMC10368328  PMID: 37462292

Abstract

C4 photosynthesis is known to have at least 61 independent origins across plant lineages making it one of the most notable examples of convergent evolution. Of the >60 independent origins, a predicted 22–24 origins, encompassing greater than 50% of all known C4 species, exist within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade is therefore primed with species ideal for the study of genomic changes associated with the acquisition of the C4 photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes harboring C3 and C4 distinguishing information in PACMAD species. We demonstrate that certain plastid-encoded protein sequences possess distinguishing and informative sequence information that allows them to train accurate ML C3/C4 classification models. Our RbcL-trained model, for example, informs a C3/C4 classifier with greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant, and potentially differing roles of these sequence products in C3 versus C4 metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C3/C4 status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within, further highlighting their potential significance in the evolution and/or maintenance of C4 photosynthetic machinery. This general approach can be applied to uncover intricate associations between other similar genotype-phenotype relationships.

Keywords: C4 photosynthesis, convergent evolution, PACMAD, grasses, plastome, machine learning


Significance.

The molecular determinants of C4 evolution, although extensively studied, remain largely unclear. A collective interest in these determinants is driven by their potential to guide efforts of engineering the formidable C4 photosynthetic pathway into key crop species. Several studies have uncovered associations between specific genomic and proteomic changes with C3–C4 transitions, but few have focused on plastomic associations. By uniquely employing a machine learning framework to screen plastomic information for C3 and C4 distinguishing features we have identified key sites and sequences undergoing changes associated with the C3–C4 transition in grasses. Presently identified sites include those yet-to-be-discussed meaningfully in the field, and we begin to outline here their potential significance in driving C4 evolution.

Introduction

The C4 photosynthetic pathway arose as an adaptation to falling atmospheric carbon dioxide (CO2) levels approximately 30 million years ago. Changes in the anatomy and metabolism of C4 plants allowed for CO2 concentration to take place in photosynthetic leaf tissue and cells, thereby inhibiting the negative effects of CO2 limitation (Sage et al. 2012). Under high temperature, high light and/or dry conditions, limited CO2 availability is further exacerbated by competition with oxygen (O2) for fixation by the key photosynthetic enzyme Rubisco and this is suggested to have led to the evolution, expansion, and diversification of C4 plant species across warmer climate regions (Bouchenak-Khelladi et al. 2009; Watcharamongkol et al. 2018). With only ∼3% (∼8,500 species) of all plant species employing C4 metabolism, C4 plant species are responsible for ∼23% of terrestrial gross primary production and this is in large part due to their outperformance of the ancestral C3 state in uncovered, hot to warm and arid regions (Still et al. 2003; Sage et al. 2012; Atkinson et al. 2016).

Despite extensive knowledge of the existence of various C4 pathways and a general understanding of how these pathways achieve and utilize CO2 concentration to negate photorespiration, exact genomic determinants and enablers of C4 evolution remain to be uncovered (Christin et al. 2007; Hibberd and Covshoff 2010; Gowik et al. 2011; Bianconi et al. 2020). Due to the complexity of the C4 trait, it has been suggested that changes to, potentially, hundreds of genes would be necessary for its evolution (Hibberd and Covshoff 2010; Sage 2017). Coordinated changes between the plastid and nuclear genomes (plastid genome hereafter referred to as plastome) would also be required to establish the necessary regulatory mechanisms governing the differences in plastid gene expression between C3 and C4 photosynthesis (Woodson and Chory 2008; Hibberd and Covshoff 2010). With all this in mind, it is surprising to note that C4 photosynthesis has evolved independently at least 61 times, making the trait a leading example of convergent evolution (Sage 2017).

Several studies have previously investigated evolutionary dynamics and selection acting on individual protein-encoding genes with known altered functionality and/or regulation during C4 photosynthesis compared to C3 (Hibberd and Covshoff 2010). These include genes encoding Rubisco small subunit (rbcS) (Berry et al. 2016), Rubisco large subunit (rbcL) (Kapralov and Filatov 2007; Christin et al. 2008; Kapralov et al. 2012, Piot et al. 2018), phosphoenolpyruvate carboxylase (Christin et al. 2007; Paulus et al. 2013; Rosnow et al. 2014; Moody et al. 2020), carbonic anhydrases (Tanz et al. 2009), C4-acid decarboxylases and more (Huang et al. 2017, Bianconi et al. 2020). However, apart from the plastome-encoded rbcL gene, little work has gone into uncovering selection and adaptation events associated with C4 photosynthesis and the C3—C4 photosynthetic transition within plastomes (Christin et al. 2008; Piot et al. 2018; Casola and Li 2022). The focus on rbcL has led to the identification of sites under strong positive selection in this gene that is, at least in part, driven by a C3—C4 photosynthetic shift (Christin et al. 2008; Kapralov et al. 2011; Kapralov et al. 2012; Piot et al. 2018; Casola and Li 2022). The existence of this positive selective pressure on one plastome-encoded gene suggests the potential for similar pressures being exerted at other sites in the plastome. Identified substitutions associated with the metabolic shift could be functionally and evolutionarily relevant in the establishment and maintenance of C4 photosynthesis.

In addition to the globe's most economically important crops—including rice, wheat, and corn—the grass family (Poaceae) provides a unique set of species from which a study of C3–C4 transitions can be conducted. Greater than 50% of all known C4 species belong to the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD), clade found within the Poaceae family (Edwards et al. 2010; Christin et al. 2013). The sister Bambusoideae, Ehrhartoideae, Pooideae clade contains roughly the same number of species all of which utilize C3 photosynthesis (Edwards et al. 2010; GPWG II 2012). Of the, at minimum, 61 origins of the C4 trait, a predicted 22–24 origins exist within the PACMAD clade alone (GPWG II 2012). In this study, we assess and compare the abilities of plastomic protein-coding sequences in providing C3 and C4 distinguishing information to a machine learning (ML) model for subsequent classification. With several C4 origins in grasses, we were able to compile a dataset of plastomes from C4 and closely related C3 species to ensure patterns identified by our models are due to photosynthetic type instead of lineage. After screening for genes that inform highly accurate classifiers, we determine the specific sites within the region sequence data that are most strongly influencing our classification models. The identified C3/C4 determining combinations of sites are potential changes—parallel, convergent, or otherwise—associating with photosynthetic type.

While there are several biochemical, anatomical, and physiological methods to determine the C3/C4 status of a plant species, our ML pipeline employed here does not aim to classify species but rather to answer 1) whether a photosynthetic type can be predicted from specific sequence information alone; and 2) what sequence features (genes, residues) enable the accurate prediction of C3/C4 phenotype and are, therefore, likely subject to key driving and/or adaptive forces separating C3 and C4 metabolism. Our framework answers these questions and, by the very nature of ML, uniquely arrives at patterns of changes associated with photosynthetic type by interpreting entire protein, and even whole complex, sequence data. This is in contrast to conventionally applied selection tests (Christin et al. 2008; Piot et al. 2018; Casola and Li 2022) which examine parallel/convergent associations on a residue-by-residue basis and independently of one another. By analyzing patterns across a functional unit (complex, subunit, protein, domain, motif, etc.), biologically relevant phenomena such as complementary mutations, compensatory mutations, functional redundancies, and more can be accounted for. However, our ML framework cannot provide insight into the selective pressures acting on identified associations and is not meant as a replacement for conventional selection testing. Instead, our work here offers alternative insight into plastomic associations with photosynthetic type from the unique perspective granted by the nature of ML and classification modeling. On a more expansive PACMAD dataset, than ones previously considered (Piot et al. 2018; Casola and Li 2022), we employed our ML framework and identified plastidic complexes, proteins, and patterns of substitutions within, capable of informing highly accurate C3/C4 classifiers. These include both known, via positive selection testing, and yet-to-be-considered sites and genes worth investigating further for a greater understanding of mechanisms driving the evolution of C4 photosynthesis.

Results

Plastomic Screen for Protein Sequences Informing Accurate C3/C4 Classification Models

With our assembled dataset of 134 PACMAD species, consisting of 76 C4 species and 58 C3 species (supplementary table S1, Supplementary Material online), we screened each of 76 plastid-encoded protein sequences based on their ability to train an accurate C3/C4 ML logistic regression classification model. Our dataset encompasses 16 different C4 lineages (fig. 1A, supplementary table S1, Supplementary Material online) with independent C4 origins (GPWG II 2012). A phylogenetic tree of the species in our dataset (fig. 1A) highlights the independent origins as well as the intent of the dataset curation; to include plastomes from C4 and closely related C3 species to ensure pattern identification and classification are driven by independent C3/C4 transitions instead of phylogenetic bias due to common ancestry. This tree is largely congruent with previous studies of the PACMAD clade (GPWG II 2012; Casola and Li 2022; Huang et al. 2022). Figure 1B illustrates the general framework employed in this study to first arrive at an unbiased summary of plastid-encoded protein sequences capable of informing accurate C3/C4 classification models, followed by algorithmic methods used to identify specific residues within each sequence that are most heavily influencing classification decisions.

Fig. 1.


Fig. 1.

Overview of ML framework employed to screen PACMAD plastomes for C3 and C4 distinguishing information. (A) Cladogram of 134 C3 and C4 PACMAD species from our assembled dataset (supplementary table S1, Supplementary Material online). Phylogeny was obtained using RAxML (GTR+Γ model) (Stamatakis 2014) from concatenated alignments of plastid marker genes (rbcL, ndhF, matK, rpoC2, psaA, psaB, psaI, rpl32, and ccsA) and rooted to Anomochloa marantoidea, a grass species from outside of the PACMAD clade. Independent C4 lineages, as previously reported, are labeled. Shaded and unshaded branches indicate C4 lineages and C3 lineages, respectively. The size of circles represents bootstrap support. (B) Pipeline employed for identification of residues and substitution patterns in plastid protein sequences from the PACMAD clade associating with photosynthetic type.

Full-length protein sequences from each species of our dataset were aligned and converted to feature sets for training, testing, and validation of a classification model (classification model v1 in fig. 1B, see methods for details). Of the 76 plastid-encoded proteins screened via classification model v1, the best predictor of C3/C4 status is RbcL (fig. 2A). Using RbcL amino acid (AA) sequence data, our v1 classification model can correctly classify species as C3 or C4 with >99% accuracy, on average, after cross-validation. The top ten performing v1 models are based on the full-length sequences from RbcL, NdhA, MatK, NdhI, NdhF, RpoC2, RpoA, Rps18, RpoC1, and AtpB. Each of these models correctly classifies C3 and C4 species of our dataset of 134 PACMAD species with greater than 84% accuracy (fig. 2A). These models perform significantly better (P < 0.001) than the performance achieved by 80% of v1 models. Of note, apart from RbcL, are the performances of v1 models built on NdhA, MatK, and NdhI which each score classification accuracies greater than 90% and perform significantly better (P < 0.001) than the next best-performing model built on NdhF.

Fig. 2.


Fig. 2.

Performance of C3/C4 classification models trained on plastid-encoded, protein sequence data from PACMAD grasses. (A) Average C3/C4 classification accuracy, after cross validation (with repeated random subsampling, n = 500), of models trained on each of 76 plastid-encoded, full-length protein sequences (70:30 train:test split) from 134 Poaceae species. Red bars highlight the ten best-performing classifiers. Absolute values of the regression coefficients associated with (B) RbcL alignment positions, (C) NdhA alignment positions, (D) MatK alignment positions, and (E) NdhI alignment positions from the initial (A) classification models are plotted as examples of the initial feature selection method. The ten highest peaks corresponding to the ten strongest features from each model are labeled. (F) Average C3/C4 classification accuracy, after cross validation (n = 500), of models trained on identified ten strongest features from each of 76 plastid-encoded, protein sequences from 134 Poaceae species (70:30 train:test split). Red bars highlight the ten best-performing classifiers.

The 10 strongest features, corresponding to 10 alignment positions most heavily influencing classification in each v1 model, were identified (fig. 2B-E four examples shown, supplementary table S2, Supplementary Material online) and then used exclusively to train our v2 classification models (classification model v2 in fig. 1B). Our v2 models, making use of only ten features per model, are not subject to the same potential for overfitting as our v1 models where the number of features is equivalent to the number of positions in each sequence alignment. Restricting feature size to below the number of independent C4 lineages in our dataset also limits the ability of our models to rely on phylogenetic signals for classification.

The ten strongest features from RbcL, NdhI, NdhA, MatK, AtpB, CcsA, NdhH, Rps18, Rpl2, PsbC all train v2 models with classification accuracies >84% and perform significantly better (P < 0.001) than 80% of all v2 models (fig. 2F). The ten strongest features alone from RbcL are enough to inform a classifier with >99% accuracy making it the strongest performing v2 model as well. Again, of note is the v2 model built on the ten strongest features from NdhI which has a classification accuracy of 93.2%, after cross validation. The identification of v2 models that perform classification with >90% accuracy from a restrictive feature set, that is an order of magnitude smaller (<10%) than our sample size and is also fewer than the number of independent C4 lineages in our dataset, strongly suggests informative ability that is likely linked to biology and is not a product of training biases or overfitting. This being said, screening with v2 models would be the most stringent, of our study, in identifying potentially informative sequences and sequence information. As such, only sequence information from RbcL and NdhI informed v2 models with >90% accuracy.

Recursive Feature Elimination to Identify Optimal Residue Combinations Informing Accurate Classification Models

Our v1 models (fig. 2A) were trained on entire protein sequence alignment data where each position in the alignment acted as a feature to be used for classification modeling. With an RbcL alignment of length 475, our RbcL v1 model would have, therefore, been trained on 475 features. With a dataset of 134 PACMAD species there exists a potential for overfitting due to a number of features being much greater than the dataset size (Vabalas et al. 2019; Chen et al. 2020). Our v2 models (fig. 2F) were trained on the ten strongest features, or ten alignment positions most heavily influencing classification, parsed from one iteration of each v1 model. By employing recursive feature elimination with cross validation (RFE-CV), a robust algorithm by which classification models are recursively built using fewer features in each iteration by dropping the previous iteration's weakest feature (Guyon et al. 2002; Arif et al. 2020), we were able to identify optimal feature combinations from each of the sequences informing top-performing v1 and v2 models (fig. 3).

Fig. 3.


Fig. 3.

RFE-CV to identify key feature combinations informing top C3/C4 classification models. (A) Cross validation (repeated random subsampling, n = 500) scores of classification models informed, initially, by full-length alignments and recursively modeled with a single identified least valuable feature (alignment position) removed in each iteration until a minimum of two features remained. The boxed and highlighted region corresponds to (B) which encompasses an “optimal” zone where models are built on ten or fewer features and have classification accuracy scores greater than or equal to 0.9 or 90%. (C) Comparison of a number of initial features used for classification to a number of features after optimal feature selection. (D) The average accuracy of the best-performing models after RFE-CV, from iterations making use of <11 features, is plotted against MPD from corresponding sequence alignments, and Pearson's correlation coefficient (r) is determined.

RbcL, MatK, NdhI, NdhF, NdhA, NdhD, RpoA, and RpoC2 are the only sequences identified to possess a subset of ten or fewer features that can inform a classifier with >90% average accuracy after cross validation. The minimal feature usage (fig. 3C) and high classification accuracies of our optimal v3 models (classification model v3 in fig. 1B) more confidently, but with less stringency than v2 models, suggest an association between the corresponding protein sequences and C3/C4 status.

To assess the propensity of our optimized feature models to rely on initial sequence alignment length or variation across species, we determined correlations of these factors with model accuracy. Accuracy of the top-performing model—identified from iterations making use of <11 features after RFE-CV—for each sequence was plotted against respective mean pairwise distance (MPD) (fig. 3D) calculated from corresponding alignments. C3 sequences were used exclusively to determine MPD so that variation due to C3–C4 photosynthetic transitions, which is expected to correlate to top model performance, was not captured. MPD is used here as a measure of the amount of information inherently present in queried alignments due to length and variation. Low-moderate correlation (r = 0.4935) of model accuracy with MPD was noted. Individually, both length (supplementary fig. S1A, Supplementary Material online) (r = 0.56) and variation (supplementary fig. S1B, Supplementary Material online) (r = 0.55) exhibited moderate levels of positive correlation with optimal model accuracy. However, neither factor exclusively dictated classification ability. Our optimized RbcL model was our strongest performing model despite ten other models trained on lengthier sequence alignments and 31 other models with greater variation/noise across species for coincident pattern detection. NdhI informed our third best-performing model despite 32 lengthier sequence alignments and 27 alignments with greater variation across sequences to learn from. The top performance of these models despite being provided with comparatively less information for pattern detection suggests an identification of C3/C4 specific signals and a lack of reliance on phylogenetics-associated sequence information. The high accuracy scores of optimal RpoC2, MatK, and NdhF models, however, are potentially artifacts of the corresponding sequence alignments being among the lengthiest and most variable (fig. 3D, supplementary fig. S1, Supplementary Material online) of all sequences assessed.

A subset of only six features from RbcL, identified via RFE-CV, is capable of informing a 99.98% accurate C3/C4 classifier, after cross validation. Specific residues at 101, 143, 281, 309, 418, and 468 (numbered according to positions in Zea mays RbcL) are all that is necessary for near-perfect classification, according to all relevant classification metrics, of PACMAD species in our dataset by photosynthetic type (table 1). Nine and ten feature subsets were identified as the optimal combinations from NdhI and NdhA, respectively, with both optimal models achieving >94% accuracy. Ten features from RpoA and ten features from NdhD were used to inform corresponding optimal models, both of which achieve accuracy scores >90% (table 1). Again, quite interestingly, three out of five of the top v3 models (not including potentially inflated performances from RpoC2, MatK, and NdhF) are trained on optimized features from sequences encoding subunits of the NDH complex. This enrichment of NDH subunits being identified among top informers of our v1, v2, and v3 classification models strongly implicates the NDH complex in roles related to, and likely differing between C3 and C4 metabolism.

Table 1.

Summary of Top Performing v3 Models After Feature Selection Using RFE-CV

AA seq Optimal Features (ap) Residue Positions (Zm) ROC AUC Precision Recall F1 Acc
RbcL 101, 143, 281, 309, 418, 468 101, 143, 281, 309, 418, 468 1.0 1.0 0.9997 0.9998 0.9998
RpoC2 357, 368, 450, 774, 1050, 1132, 1290 357, 368, 450, 706, 942, 1011, 1151 0.9639 0.9728 0.9471 0.9591 0.955
NdhI 5, 25, 38, 49, 84, 88, 89, 147, 153 5, 25, 38, 49, 84, 88, 89, 147, 153 0.9583 0.9784 0.9197 0.9464 0.9423
NdhA 5, 29, 36, 98, 110, 187, 298, 301, 319, 320 5, 27, 34, 96, 108, 184, 293, 296, 314, 315 0.9805 0.9910 0.9031 0.9436 0.9401
RpoA 6, 14, 146, 163, 180, 243, 279, 326, 329, 336 6, 14, 146, 161, 176, 237, 270, 317, 327 0.969 0.9444 0.9305 0.9359 0.9295
MatK 49, 147, 159, 314, 378, 417, 436 16, 111, 123, 274, 338, 377, 396 0.9673 0.8491 0.9716 0.93 0.9191
NdhD 64, 76, 114, 334, 364, 376, 442, 451, 497, 501 62, 74, 112, 332, 362, 374, 440, 449, 495, 499 0.963 0.9214 0.9197 0.9183 0.9089
NdhF 89, 145, 287, 340, 400, 566, 568, 597, 659 89, 145, 287, 340, 400, 555, 557, 586, 646 0.9653 0.9860 0.8497 0.9114 0.9082

ap—positions numbered according to alignment position.

Z. m—Residue positions numbered according to positions in Zea mays protein sequence.

Average scores of classification metrics (ROC AUC, Precision, Recall, F1, Accuracy) after cross validation (using repeated random subsampling, n = 500, 70/30 train/test split) are shown. Gray cells indicate models and model performances that are potential artifacts of corresponding sequence length/variation. ROC AUC, area under receiver operating characteristic curve.

Average accuracy and AUROC (area under receiver operating characteristic curve) scores were determined after cross-validation using repeated random subsampling (n = 500) for each v1, v2, and v3 model. These were compared to scores achieved by models trained using the same sequence information (feature data) with randomized C3 and C4 labels. Supplementary figure S2, Supplementary Material online summarizes the evaluation of our top optimal v3 models. The top-performing models, all have significantly higher accuracy and AUROC scores (P < 0.001, Student's t-test) than the associated model trained on randomized labels (supplementary fig. S2, Supplementary Material online). This is further confirmation that our top model performance is not a product of overfitting and that our ML framework is likely picking up on biologically relevant patterns in the sequence data rather than relying on coincident patterns expected to be found in datasets with many input features.

Combined Subunit Classification Modelling to Assess Entire Plastidic Complexes

To assess whether entire plastidic complexes harbor C3/C4 distinguishing information, we expanded upon our ML framework to determine whether accurate classification models could be trained on optimal features from across constituent plastid-encoded subunits (fig. 4). We assessed seven plastidic complexes, each of which including at least five different plastid-encoded subunits, for their abilities to inform accurate classifiers. The plastid ndh genes (ndhA—ndhK) encode 11 different subunits that are part of the NAD(P)H dehydrogenase complex (hereafter NDH complex) (Ifuku et al. 2011; Laughlin et al. 2020), the plastid atp genes (atpA, atpB, atpE, atpF, atpH, atpI) encode 6 different subunits of the ATP-synthase complex (Hennig and Herrmann 1986), the plastid pet genes (petA, petB, petD, petG, petL, petN) encode six different subunits of the cytochrome b6f complex (Stroebel et al. 2003), the plastid psa genes (psaA, psaB, psaC, psaB, psaJ) encode five different subunits included in the Photosystem I (PSI) supercomplex (Nelson and Yocum 2006), the plastid psb genes (psbA—psbF, psbH—psbM, psbT, psbZ) encode 14 different subunits included in the Photosystem II (PSII) supercomplex (Nelson and Yocum 2006), and the plastid rps and rpl genes typically encode 12 proteins for the small ribosomal subunits, and 9 proteins for the large ribosomal subunits respectively (Yamaguchi and Subramanian 2000; Yamaguchi and Subramanian 2003). By employing the pipeline summarized in figure 4A we were able to build classification models for each of the plastidic complexes based on optimal features selected from across their respective plastid-encoded subunits. In short, we obtained the strongest feature subset (v2 model features) for each of the subunits (as outlined in fig. 1B), concatenated these feature sets according to complex, and used RFE-CV to arrive at an optimal subset of ten or fewer features per complex for C3/C4 classification.

Fig. 4.


Fig. 4.

Performance of C3/C4 classification models trained on optimal features selected from plastidic, multimeric complexes. (A) Pipeline employed for generating optimal feature combinations from across all plastid-encoded subunits making up a specific plastidic complex. NDH complex is used here as an example. (B) Average C3/C4 classification accuracy of the top-performing model with ten or fewer optimal features after RFE-CV from each of seven plastidic complexes. NDH—NAD(P)H Dehydrogenase complex, ATP Synth—ATP synthase, cyt b6f—cytochrome b6f complex, PSI—Photosystem I, PSII—Photosystem II, 50S—50S ribosomal subunit, 30S—30S ribosomal subunit. Significance was determined using Student's t-tests (P < 0.001) with Bonferroni correction and significant differences are denoted by letters.

Figure 4 B compares the top-performing models, of ten or fewer features, for each plastidic complex. In agreement with what has been hinted at by our individual sequence models, NDH complex sequence information is capable of informing highly accurate classification models. With an average accuracy of 97.07% after cross validation, the NDH complex model performs significantly better (P < 0.001, Student's t-test) than all six other models trained on optimized, minimal (ten or fewer) feature sets from queried complexes. No other complex achieved average accuracies >92% (table 2). This significant difference in the informative power of the NDH complex versus other major plastid complexes suggests a specific association with factors driving C4 evolution. The NDH complex model also performs significantly better (P < 0.001, Student's t-test) than any model trained on data from its individual constituent subunits, while still making use of a minimal feature set.

Table 2.

Summary of Optimized Plastidic Complex Models and Identified Key Residues

Complex Combined Subunits Optimal Features (Subunit-ap) Optimal Residues (Subunit-Zm) Acc
NDH (ndh) A,B,C,D,E,F,G,H,I,J,K A-5, A-36, A-319, D-114, H-274, I-25 A-5, A-34, A-314, D-112, H-269, I-25 0.9707
ATP synthase (atp) A,B,E,F,H,I B-96, B-105, B-383, B-450, E-55, E-102, E-106, E-130, F-14, I-92 B-96, B-105, B-383, B-450, E-55, E-102, E-106, E-130, F-14, I-92 0.9190
Cyt b6f (pet) A,B,D,G,L,N A-5, A-20, A-26, A-63, A-145, A-177, D-17, L-19, L-29, N-5 A-5, A-20, A-26, A-63, A-145, A-177, D-3, L-19, L-29, N-3 0.7857
PSI (psa) A,B,C,I,J A-34, A-310, A-484, B-492, C-29, I-4, I-5, I-24 A-34, A-310, A-484, B-492, C-29, I-24, I-5, I-24 0.8682
PSII (psb) A,B,C,D,E,F,H,I,J,K,L,M,T,Z B-182, C-191, C-221, C-291, C-441, D-5, E-79, K-8, K-41, K-47 B-182, C-177, C-207, C-277, C-427, D-5, E-79, K-8, K-39, K-45 0.8909
50S (rpl) 2,14,16,20,22,23,33,36 2–185, 14–67, 16–120, 16–126, 16–142, 20–109, 22–103, 22–114, 23–92, 33–4 2–179, 14–67, 16–113, 16–119, 16–135, 20–103, 22–102, 22–113, 23–92, 33–4 0.8835
30S (rps) 2,3,4,7,8,11,12,14,15,16,18,19 3–118, 3–174, 3–179, 3–197, 11–13, 11–122, 14–14, 16–55, 16–37, 16–48 3–105, 3–161, 3–166, 3–184, 11–13, 11–122, 14–14, 16–54, 16–36, 16–47 0.7873

Subunit-ap—positions numbered according to alignment positions from the specific subunit.

Subunit-Zm—positions numbered according to protein sequence positions of the specific subunit in Zea mays.

Acc-average accuracy after cross validation (using repeated random subsampling, n = 500, 70/30 train/test split).

NDH, NAD(P)H Dehydrogenase complex; Cyt b6f, cytochrome b6f complex; PSI, Photosystem I; PSII, Photosystem II; 50S, 50S ribosomal subunit; 30S, 30S ribosomal subunit.

Assessment of Key Residue Combinations Informing C3/C4 Classification

Of all models (v1, v2, v3, and complex) trained, tested, and validated in our study, the models informed by optimal features from RbcL and the NDH complex are the strongest performing in terms of accuracy and all relevant classification metrics. Six features, or sites, from RbcL (table 1) and six combined features from plastid-encoded subunits of the NDH complex (table 2) inform classifiers with 99.98% and 97.07% accuracy, respectively. These both perform significantly better (P < 0.001, Student's t-test) than the next best model, of all considered, which is informed by optimal features from the individual NdhI subunit (94.23% average accuracy). Supplementary figure S3, Supplementary Material online maps all of the residues at each optimal feature position (relative to Zea mays orthologous sequences) from these top two performing models to the corresponding species of our PACMAD dataset.

Position 281 from RbcL is associated with the most consistent parallel substitution that had taken place during C3–C4 photosynthetic transitions of all features identified in our study. In all C3 species considered the residue aligned with position 281 is an alanine (A), while for >66% of C4 species in our dataset, the corresponding residue is a serine (S) and this A-281-S substitution is observed in 15 of the 16 C4 lineages examined. Similarly parallel are the changes at identified optimal feature positions 101, 143, 309, and 468 of RbcL. Residues aligned with position 101 are valine (V) for >98% of C3 species, but <56% of C4 species. Isoleucine (I) appears at this position in ∼44% of C4 species, and in 11 of 16 C4 lineages, but in only a single C3 species. All C3 species of our dataset have threonine (T) for residues aligned with position 143, but >64% of C4 species from across five C4 lineages have alanine (A) instead. Although methionine (M) appears at position 309 in the majority of C3 (100%) and C4 (∼65%) species of our dataset, an M-309-I substitution is only found in C4 species and across 11 different lineages. Residues aligned with position 468 are glutamic acid (E) for >96% of C3 species but only ∼30% of C4 species. Aspartic acid (D) appears instead in ∼70% of C4 species, across eight C4 lineages, and yet in only a single C3 species. Position 328, while not included as one of the optimal six features in our RbcL v3 model, had been one of the ten strongest features identified from our RbcL v1 model (fig. 2B) and exhibits notable parallel changes associated with C3—C4 transition >86% of C3 species of our dataset have alanine (A) aligned with position 328, while serine (S) appears instead in >78% of C4 species and in 12 of 16 C4 lineages.

Of the six optimal features informing our NDH complex model, the most notable level of parallel changes associated with the C3—C4 transition was observed at position 25 of the NdhI subunit. Here, serine (S) is present in >86% of C3 species and <16% of C4 species. Glycine (G) is the predominant C4 residue at this position, as it is present in >84% of C4 species, across 12 C4 lineages. Less notable, but relevant, changes associated with photosynthetic type are found at identified optimal feature positions 36 of the NdhA subunit and 274 of the NdhH subunit. ∼95% of C3 species and ∼65% of C4 species have threonine (T) as the residue corresponding to NdhA-36, but alanine (A) is found instead in ∼35% of C4 species, appearing in five lineages, and only 2 C3 species. The residues corresponding to NdhH-274 in ∼95% of C3 species and only ∼60% of C4 species is arginine (R). Arginine (R) is substituted with either lysine (K), glycine (G), or glutamic acid (E) in ∼40% of C4 species, across five lineages, while the only change from arginine across our C3 dataset is to lysine in three species.

Based on the crystal structure of rice (Oryza sativa) Rubisco (Matsumura et al. 2020) (Protein Data Bank [PDB] id: 6KYI), six of the ten strongest RbcL features (v2 model)—including four of the six identified optimal (v3 model) features—residues 143, 262, 281, 309, 328, and 418 of RbcL are surface exposed, interface proximal and/or active site proximal amino acids (fig. 5). Residue 262, although not exposed as part of the Rubisco complex, resides on the exterior of Rubisco LSU and is found at the interface of the LSU and Rubisco SSU. Residue 418 also lies at the interface between the LSU and SSU. Residue 328 is exposed and neighbors a sulfate-bound site which represents an active site of RbcL where the phosphate group of RuBP would normally bind (Matsumura et al. 2020). Residue 309 resides on the interior of RbcL and next to a RuBP binding site.

Fig. 5.


Fig. 5.

Structure of rice (Oryza sativa) Rubisco in complex with sulfate ions (Matsumura, et al. 2020) (PDB id: 6KYI). The strongest and optimal feature residues encoded by rbcL that are surface exposed, at subunit interfaces and/or near the active site are highlighted. Graphics were obtained using PyMol (DeLano 2002).

Discussion

The objective of our study was to compare the abilities of specific plastid-encoded protein sequence information to inform accurate ML C3/C4 classification models. We converted sequence alignments to feature sets for modeling and then, as part of a progressively optimizing pipeline, assessed full-length sequences (v1 models), selected features (v2 models) and algorithmically obtained “optimal” feature sets (v3 models) for classification ability. RbcL came out as the top informer at every iteration of modeling in our framework. Sequence data from RbcL alone was all that was necessary to train classifiers capable of distinguishing C3 from C4 species with >99% accuracy. The presence of distinguishing information here is in alignment with previous work suggesting positive selection acting on specific codons of rbcL in C4 species (Christin et al. 2008; Kapralov et al. 2011; Kapralov et al. 2012; Piot et al. 2018; Casola and Li 2022), but the extreme accuracy with which our RbcL models can classify PACMAD species as C3 or C4 further suggests a level of adherence to identified patterns of changes that are possibly critical for C4 evolution and/or maintenance.

Positions 101, 143, 281, 309, 418, and 468 of RbcL (all positions herein numbered relative to orthologous Zea mays sequences) were identified via RFE-CV as optimal features and were all that were necessary to inform a classifier with 99.98% accuracy. 101, 281, and 309 had previously been identified as strong candidates for being under positive selective pressure in C4 clades (Christin et al. 2008). Four hundred sixty-eight was later determined to be under the same selective pressure (Piot et al. 2018), and 143 had separately been identified as exhibiting convergent AA replacements in C4 clades (Casola and Li 2022). Three hundred and twenty-eight, although not one of the identified optimal features, had been one of the ten strongest selected features for our v2 model and this position had been highlighted in all three studies (Christin et al. 2008; Piot et al. 2018; Casola and Li 2022). Positive selection acting on residues 281 and 309 was not limited to grasses or even monocots. An Ala-281-Ser substitution and a Met-309-Ile substitution, the same substitutions distinguishing certain C4 species from ancestral C3 species in our dataset, were found to be positively selected for in C4 Amaranthaceae species as well (Kapralov et al. 2012). The Met-309-Ile substitution was again found to be positively selected for in C4 Flaveria species (Kapralov et al. 2011). This specific substitution was shown, in transplastomic experiments, to increase the carboxylation rate of Rubisco while reducing CO2 affinity (Whitney et al. 2011). Increased speed at the cost of CO2 affinity is a property associated with the C4-like enzyme due to the CO2 concentrating mechanisms employed by C4 species (Kapralov et al. 2011; Whitney et al. 2011). Although not discussed in the context of photosynthetic type, 143 and 262 (v2 model strongest feature) were identified as sites having undergone adaptive evolution in gymnosperms (Sen et al. 2011). Ser-143-Ala and Thr-262-Val substitutions were connected to ecological diversification while the ancestral state remained geographically localized (Sen et al. 2011). In our dataset, a similar polar-143-hydrophobic substitution (Thr-143-Ala) is exclusive to C4 species and arises in five distinct lineages. Structurally, 143, 262, 281, 309, 328, and 418 are all either surface exposed, interface proximal, and/or active site proximal residue positions (fig. 5) (Spreitzer and Salvucci 2002; Spreitzer et al. 2005; Sen et al. 2011). Substitutions at these key areas could alter complex stability, Rubisco holoenzyme expression, substrate binding efficiency, and/or regulation of Rubisco activation and therefore prove to be functionally relevant and possibly C3 and C4 distinguishing, mechanistically. Four hundred and eighteen which lies on an LSU–SSU interface was previously identified as part of a cluster that potentially plays a role in altering Rubisco's CO2/O2 specificity (Yu et al. 2005), a quality that differs between C3 and C4 species (Sage et al. 2012). This considerable alignment of identified optimal features with previously reported key sites associated with the evolution of C4 photosynthesis or related adaptive traits is both proof and an example of our framework's ability to arrive at biologically significant findings.

The rbcL gene has been the focus of the vast majority of comparative, plastid-encoded gene studies between C3 and C4 species. Selection tests, as discussed, have been employed to uncover sites in rbcL that are being acted upon by positive selective pressures in C4 lineages (Christin et al. 2008; Kapralov et al. 2011; Kapralov et al. 2012; Piot et al. 2018), but rarely have entire plastomes been screened for such sites. Piot et al. (2018) conducted one such plastome-wide screen in 113 PACMAD grasses and arrived at the conclusion that no obvious adaptive evolution occurred in protein-coding plastid genes, outside of rbcL, which could be attributed to the C3—C4 photosynthetic transition. Casola and Li (2022) more recently uncovered molecular convergence taking place in multiple chloroplast proteins via a comparative analysis of inferred AA replacements in C3 and C4 lineages, but from a more limited dataset of 64 PACMAD species. The results of our plastome wide screen, from 134 PACMAD species, do not attempt to dispute either finding because our framework is designed to answer different questions. While the aforementioned studies were designed to uncover positive selective pressures and convergence on individual amino acids independently of one another, our framework makes use of ML to consider all sites of a sequence of interest simultaneously. This allows for the detection of patterns or combinations of informative changes, across sites. Our work does demonstrate that plastid-encoded proteins, other than RbcL, harbor patterns of changes across PACMAD species that inform accurate classification models. These would be changes that are potentially associated with photosynthetic type. Sequences from NdhI, NdhA, RpoA, and NdhD were identified, in addition to RbcL, as being capable of informing classifiers with >90% accuracy.

During our investigations, we noted the enrichment of Ndh-subunit sequences among top informers of our v1, v2, and v3 models. This enrichment strongly implicates the NDH complex in roles related to, and likely differing between C3 and C4 metabolism. An increase in accumulation of the NDH complex in C4 species is suggested to aid in meeting the additional ATP demand required of C4 photosynthesis via NDH-mediated cyclic electron flow, an alternative to photosynthetic electron transport (Horváth et al. 2000; Martín and Sabater 2010; Nakamura et al. 2013; Ishikawa et al. 2016a, 2016b). While the complex is understood to play this pivotal role during photosynthesis few investigations into specific residue and substitution associations with C4 metabolism have taken place. We took advantage of our framework's ability to detect patterns across sites to determine if the combined plastid-encoded subunits of the NDH complex could inform accurate classification models and compared this performance to other plastidic complexes. Biologically relevant information that is lost when assessing individual subunits, such as functional redundancies and/or compensatory mutations among different subunits of a larger complex, is captured here by combining related subunits in our models. Our optimized NDH complex model had a classification accuracy of 97.07% and performed significantly better than the next best complex, ATP synthase, which informed an optimized model capable of 91.90% accuracy. Despite starting with the ten strongest features from all plastid-encoded Ndh subunits, after RFE-CV the final model made use of a combination of features from only NdhA, NdhD, NdhI, and NdhH, all of which inform a top-performing individual v1, v2, and/or v3 model. This again reinforces a potential biological relevance for these specific subunits, along with the entire complex, in the evolution and/or maintenance of C4 photosynthesis.

Residue 25 of NdhI exhibited the most notable level of parallel changes associated with the C3—C4 transition from identified NDH complex optimal features. NdhI exists as part of Subcomplex A (SubA) of the NDH complex as shown in recent structural work in Arabidopsis thaliana (PDB id: 7WFG) (Su et al. 2022). In this subcomplex (NdhH- NdhO), domains of NdhI are suggested to interact with ferredoxin (Fd) in the Fd-binding pocket (Laughlin et al. 2019; Zhang et al. 2020; Su et al. 2022), however residues near this binding site did not coincide with key sites identified by our framework. Instead, residue 25 of NdhI exists closer to the N-terminal end of this subunit, as part of a helical domain that is in contact with NdhT of Subcomplex E (SubE) (Yamamoto et al. 2011; Su et al. 2022). This contact between NdhT and SubA has been suggested to be necessary for the accumulation of SubA (Yamamoto et al. 2011) and shown to be strictly required for the assembly and functionality of the similarly structured human mitochondrial complex I (Stroud et al. 2016). Ishikawa et al. (2016a, 2016b) demonstrated that NDH accumulation is greatly increased in C4 species, relative to C3 species, and quite interestingly Nakamura et al. (2013) observed that accumulation of NdhH, a member of SubA, increased with the progression of C4 evolution (C3, C3–C4, C4-like, C4) in the genus Flaveria. Substitutions at NdhI-25 could therefore be altering NdhT-SubA interaction dynamics that are in turn corresponding to observed changes in accumulation levels of SubA and the NDH complex between C3 and C4 species. NdhT, however, does not just contact SubA at the NdhI interface. NdhT makes contact with NdhH, NdhI, NdhJ of SubA, and NdhA of SubM (Su et al. 2022). Features from NdhH and NdhA were identified as key sites from our NDH complex model, optimal features from NdhA informed the third best performing v3 model after RbcL and NdhI, and NdhH informed the sixth best performing v2 model. NdhH-269, an identified optimal feature informing our NDH complex model, is exposed on the same SubA face as NdhI-25 and is in close proximity to a possible NdhT–NdhH interface, as modeled in Su et al. (2022). Here, one can see how substitutions at various combinations of our identified key sites from NdhI, NdhA, or NdhH could similarly alter subunit interaction dynamics to potentially bring about the necessary increase in NDH accumulation for C4 photosynthesis (Ishikawa et al. 2016a, 2016b) and thus alleviate selective pressures at unchanged sites. This is an example of the kind of patterns of changes, associated with a convergent trait, which our ML framework can pick up on that might be overlooked with selection testing.

C4 species are often grouped according to the distinct biochemical pathways and Kranz anatomical variations they employ in the establishment of C4 photosynthesis (Hatch et al. 1975; Muhaidat et al. 2007). Three main C4 biochemical subtypes have been identified and they mainly differ in the mechanisms used for shuttling and decarboxylation of CO2. These subtypes are named according to the main decarboxlyation enzymes employed in each pathway, specifically, nicotinamide adenine dinucleotide phosphate (NADP)-malic enzyme (NADP-ME), nicotinamide adenine dinucleotide (NAD)-malic enzyme (NAD-ME), and phosphoenolpyruvate carboxykinase (PEP-CK) (Kanai and Edwards 1999). There are several more Kranz anatomy variations (Giussani et al. 2001; Muhaidat et al. 2007; Christin et al. 2008; Christin et al. 2009), or C4 anatomical subtypes, and it is expected that C4 subtypes likely differ in genotype-photosynthetic type associations due to the evolution of different biochemical and mechanistic approaches to establishing C4 photosynthesis. Variation in biochemical subtype is well represented within our PACMAD species dataset, however, less anatomical variation is present due to Poaceae C4 species largely employing Classical Kranz anatomy (supplementary table S1, Supplementary Material online) (Christin et al. 2009; Sage et al. 2011; Rudov et al. 2020). Supplementary Figure S4, Supplementary Material online summarizes the misclassification rates per C4 subtype of each of our identified top models in an attempt to determine if false classifications for certain models are associated with specific subtypes. Our RbcL v3 model only misclassified species of our dataset at a rate of 0.02%, and the only C4 species it would misclassify as C3, albeit rarely, was Alloteropsis angusta, a species with Neurachneoid Kranz-typing and NAD-ME + PEP-CK biochemical subtype. Apart from this, our RbcL model misclassified all C4 subtypes equally at a rate of 0%, suggesting that changes to RbcL are key to all C4 subtypes. Our NDH complex model misclassified NADP-ME, NAD-ME, and PEP-CK subtypes similarly at low rates of 4–6%, again suggesting that changes to the NDH complex, as a whole, and their associations with the evolution of C4 photosynthesis likely do not depend on biochemical subtype, or if dependance does exist, it is accounted for at the complex level. Other top v3 models (NdhI, NdhA, RpoA, and NdhD) showed variation in their rates of misclassification of the three biochemical subtypes. This suggests that these proteins, and changes to them, may play lesser or larger roles in C4 development depending on the evolved biochemical subtype. Misclassification based on Kranz-type did vary more obviously. Our NDH complex model misclassified Aristidoid Kranz-type species at a rate of 59.7%, Neurachneoid Kranz-type species at a rate of 33.1%, Classical Kranz-type species at a rate of 2.2%, and all other Kranz-types at 0%. This and similar variations in Kranz-type misclassification rates from other top models suggest that changes to certain plastid sequences might only be required for certain Kranz-typings, however, with the very limited representation of certain Kranz-types in our PACMAD dataset (e.g., only one Aristidoid species) it is difficult to effectively elaborate on such findings.

Several factors were considered in the implementation and validation of our models before attempting to draw conclusions from classification metrics. Importantly, it was necessary for our ML study to be designed in a manner that allowed for the deciphering of true C3/C4 associations from potential phylogenetic signals in the dataset. Our PACMAD species dataset, used for training and testing, encompasses a reported 16 independent origins of C4 photosynthesis (GPWG II, 2012) with intervening C3 lineages (fig. 1, supplementary table S1, Supplementary Material online), and this was to ensure the resulting classification metrics were a measure of C3 or C4 specific pattern detection instead of phylogenetic correlation. Cross validation with repeated (500 iterations) random subsampling (Dubitzky et al. 2007; Vabalas et al. 2019) was used for all models to generate average classification metrics for comparison that were not based solely on any one coincidentally optimal or suboptimal sampling of the dataset. Initial modeling on full-length sequences revealed great variation in informative power across the 76 protein sequences screened. This observed variation suggests a specificity when it comes to sequences capable of informing a strong classifier and this was important to our aim both for assigning biological significance and to again alleviate concerns of phylogenetic biases. Feature selection and RFE-CV were employed to generate models from a minimal number of features to avoid the potential for overfitting due to the feature number being much greater than the dataset size. RFE-CV was combined with permutation testing to ensure that coincident patterns found in randomly labeled sets could not inform classifiers with comparable performance metrics to true models. Length-accuracy, variation-accuracy, and MPD-accuracy correlation analyses (supplementary fig. S1, Supplementary Material online, fig. 3D) revealed only moderate associations present, meaning our optimized v3 models were not exclusively relying on the length or variation/noise present in sequence alignments to achieve strong classification metrics. We expect that if models were relying solely on phylogenetic signals to distinguish C3 from C4 species then sequences of greater length, variation, and/or MPD should typically inform better classifiers due to greater opportunity to pick up on phylogenetic-specific information. RbcL and NdhI were neither among the most variable nor lengthy sequences of all considered, yet they informed the two of the strongest performing v1, v2, and v3 models. However, we do acknowledge that RpoC2, MatK, and NdhF were among the lengthiest and most variable sequences of all queried and so for this reason we did not attempt to draw conclusions from the relatively strong performance of their corresponding models. A smaller study to identify optimal features/residues from eudicot species was conducted, using our pipeline, and summarized in supplementary table S3, Supplementary Material online and supplementary table S4, Supplementary Material online. Despite the species of our eudicot dataset being distant relatives to grasses, several identified key residues for distinguishing C3/C4 species in eudicots overlapped with those we identified as key from PACMAD sequences. The identification of overlapping key residue combinations in species far removed from those of the PACMAD clade is a further indication that our study is identifying true C3/C4 distinguishing signals. Complete agreement on key residues identified from eudicots and grasses is not to be expected because of the differences in Kranz-types and biochemical types between the C4 species of these two groups (supplementary table S1, Supplementary Material online, supplementary table S3, Supplementary Material online). Furthermore, the current limited eudicot sequence data availability does not allow for as robust of a analysis as was carried out with our PACMAD dataset. C3 and C4 distinguishing information found in eudicot plastomes, and/or other specific families with several C4 origins, will be very interesting to explore further, as soon as relevant sequence data becomes available. The same potential for discovery and validation of key residue combinations in C4 nuclear markers, such as phosphoenolpyruvate carboxylase and pyruvate, phosphate dikinase, exists with our pipeline, again, if and when sufficient sequence data becomes available.

Countless research efforts today are geared toward contributing to a greater understanding of the C4 pathway, not least because of the obvious potential for applications in synthetic biology and crop improvement (Matsuoka et al. 2001; Sage and Sage 2009; Kajala et al. 2011; Schuler et al. 2016; Jobe et al. 2020; Roell et al. 2021). The C4 trait is a formidable one because it has allowed inheriting species to outperform the ancestral C3 state in hot and arid regions due to a convergence of several biochemical, physiological, and anatomical features with adaptive significance (Still et al. 2003; Sage et al. 2012; Atkinson et al. 2016). The identification of sequence-photosynthetic type associations in grasses, a family to which many of the world's staple crops belong, would be key to furthering an understanding of what might be necessary to emulate C4 efficiency in crop species. The presented work is a demonstration and validation of the use of an ML framework in offering an alternative perspective from which to determine key sequences and sites, from across grass plastomes, potentially driving the evolution of this complex, convergent trait.

Materials and Methods

Dataset Assembly

Complete plastomes for 135 Poaceae species (134 PACMAD + Anomochloa marantoidea for rooting) were retrieved from the Chloroplast Genome Database (CpGDB) (Singh et al. 2020). Fifty-nine C3 species and 76 C4 species (supplementary table S1, Supplementary Material online) were selected based on the availability of experimental verification of photosynthetic type as summarized in Osborne et al. (2014) and used in Piot et al. (2018). Species sampling spanned multiple reported C4 origins (GPWG II 2012) from the PACMAD clade and was curated specifically to include C3 and C4 species more closely related to each other than to other species of the same photosynthetic type in the dataset. C4 species were sampled from 16 different lineages with independent C4 origins (fig. 1) and C3 species were sampled from closely related lineages of each.

Protein-encoding regions were extracted from each plastome according to gene coordinates obtained from the CpGDB. Each of 76 plastid-encoded coding sequences from across 135 Poaceae species was translated using the plant plastid code (translation table 11) to obtain AA sequences.

Phylogenetic Analysis

Sequence alignments of nine plastid marker genes (rbcL, ndhF, matK, rpoC2, psaA, psaB, psaI, rpl32, and ccsA) from 134 PACMAD species of our dataset was performed using MUltiple Sequence Comparison by Log-Expectation (Edgar 2004) and concatenated. Maximum likelihood phylogenies were estimated from the concatenated alignments using RAxML (GTR+Γ model) (Stamatakis 2014) with 1,000 bootstrap replicates and rooted to Anomochloa marantoidea, a grass species existing outside of the PACMAD clade.

Feature Engineering From Alignment Data

Feature lists were created for sets of sequence alignment data from each of 76 plastid-encoded protein sequences of the 134 PACMAD species of our dataset (see Data Availability). Here we vectorize each of our sets of alignment data to obtain a list of features corresponding to each position in each alignment for entire sequence interpretability by our ML algorithms. To do this, we created keys for each position in each alignment that map corresponding residues/gaps to a value up to N–1 where N is the number of different characters (AA or gap) that appear at each position. For the vast majority of positions in each alignment, N <= 2 and therefore each AA at each position is typically represented by [0] or [1]. However, a small portion of amino acids are represented by [2], [3], or greater if N > 2 at any position. We acknowledge that it is inadvisable to represent nonlinearly related variables as such but combining the small frequency of occurrence with a relatively large number of features makes the negative effects on the overall classifier near negligible while still extracting potentially distinguishing information from highly variable positions. To demonstrate this, we trained models with oppositely swapped AA encodings (with respect to our initial feature representation set) for 50% random positions, and randomly swapped AA encodings at each alignment position and compared these to models trained on our initial feature representation set (supplementary fig. S5, Supplementary Material online). Identification of top classifiers and most informative residues, after recursive feature elimination, remained the same regardless of the specific AA encoding at each alignment position. Each sequence represented as a list of predominantly binary features, was labeled as C3 or C4 in accordance with the specific species/plastome it had been extracted from.

C3/C4 Classification Modelling With Cross Validation

The following was achieved in Python using tools available with the Scikit-learn library (Pedregosa et al. 2011). For each protein sequence, labeled (C3/C4) feature data representing the sequence alignment datasets of the curated 135 Poaceae species were split into training and independent test sets (70:30 split). Each feature was standardized and the labeled, standardized feature data from the training sets were used for supervised learning with a logistic regression classification algorithm.

Repeated random subsampling (Dubitzky et al. 2007), also known as Nested Cross-Validation, was used for cross-validation of each model as described in Vabalas et al. 2019. Five hundred different models were generated based on 500 different and random 70:30 splits into training and test sets respectively for each sequence dataset while keeping all other parameters equal. Each of these 500 models was independent of one another and each set of test data was kept independent from the training of their respective model to prevent data leakage. The average accuracy across 500 models was computed and these average accuracies are what we compare in this study. Classification metrics (AUROC, precision, recall, F1) were also computed as averages after cross validation. Models built on randomized labels were generated, for comparison, by providing the logistic regression classification algorithm with randomly labeled (C3/C4), but in equal proportion, feature data.

Feature Selection and Recursive Feature Elimination

Initial logistic regression models (v1 models, see fig. 1) were parsed, and the regression coefficients attributed to each feature in the final models were extracted. Specifically for regression coefficient parsing and extraction, entire datasets, instead of split datasets, were used. The ten strongest features, for use in our v2 models, were the positions corresponding to the ten largest (absolute) coefficient values of the final decision function. The v2 models were generated, as described above, using only the identified ten strongest features from the corresponding v1 models.

To determine specific feature combinations from each feature set that informed top-performing classification models we employed recursive feature elimination (Guyon et al. 2002) with cross validation (RFE-CV). Starting with all initial features, 500 different logistic regression models would be generated using 500 different and random 70:30 splits into training and test sets, respectively, and averages for all classification metrics would be recorded. The most consistent weakest feature—lowest (absolute) attributed regression coefficient—from across 500 samplings would be eliminated from the subsequent feature set and the process repeated until one feature remains. We employed a custom implementation of RFE-CV that allowed recursion to continue past the point of no improvement in accuracy so that we were always able to determine performance for feature sets of ten or fewer features. The strongest performing feature combination of ten or fewer features was extracted as the optimal feature set. This generates classification metric data, after cross validation, for as many iteratively stronger combinations of features as there were initial features.

For the comparative analysis of informative information held by plastid-encoded genes and the identification of potential key residues informing C3/C4 status, we made use of our entire PACMAD dataset (134 species) for optimal feature selection via RFE-CV. This was done in order to screen as much sequence information as possible for potentially informative residues. Supplementary Table S5, Supplementary Material online summarizes the results of conducting RFE-CV with 90% of our PACMAD dataset and testing models trained with optimal features on an independent validation set (10% of our PACMAD dataset).

Accuracy Correlation Analyses

Mean pairwise distance was calculated as a measure of the total variation across the various lengths of our sequence alignments (Shpak et al. 2017). For each alignment, the average number of differences between each possible pair of sequences in a set was computed and then plotted against the respective classification model's accuracy. For variation-accuracy correlation analyses, MPD as a proportion of sequence length (per 100 residues) was calculated. A length-accuracy correlation plot was also generated by plotting model accuracy against respective sequence alignment lengths.

Combined Subunit/Complex Classification Modeling

For each complex queried (table 2), feature selection was used to obtain the ten strongest features from each of its constituent subunits as listed in supplementary table S2, Supplementary Material online. The optimal features from constituent subunits were concatenated and RFE-CV was employed to obtain classification metrics from iteratively stronger combinations of features across subunits. The strongest performing models, of iterations making use of ten or fewer features, from each complex were compared in our study. Recent work has revealed that PsbN might not be a constituent member of the PSII supercomplex (Torabi et al. 2014) and so was not included in our PSII model development. Rpl32 was not included in our 50S model because of its absence from several Poaceae plastid genomes.

Supplementary Material

evad129_Supplementary_Data

Acknowledgments

This work was supported by Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) [Grant number RGPIN-2017-04416 to S.D.X.C.]. The authors declare no competing interests. We would also like to thank our reviewers for their insightful comments and suggestions that have led to an improved manuscript draft, both in terms of quality of work and clarity of significance.

Contributor Information

Nilanth Yogadasan, Department of Biology, University of Waterloo, Waterloo, ON, Canada.

Andrew C Doxey, Department of Biology, University of Waterloo, Waterloo, ON, Canada.

Simon D X Chuong, Department of Biology, University of Waterloo, Waterloo, ON, Canada.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Author Contributions

N.Y. developed study components, carried out the study, and wrote the manuscript. A.C.D. verified study components, and provided expertise on methods implementation, and revised manuscript drafts. S.D.X.C. supervised the study and revised manuscript drafts.

Data Availability

The data underlying this article are available in the article and its online Supplementary material. All sequence alignment data informing components of this study are openly available on figshare at https://doi.org/10.6084/m9.figshare.21379725.v1 and the necessary code for pipeline implementation is publically available at https://github.com/Nil29/C4PredictorAnalysis

Literature Cited

  1. Arif M, et al. 2020. Pred-BVP-Unb: fast prediction of bacteriophage virion proteins using unbiased multi-perspective properties with recursive feature elimination. Genomics 112:1565–1574. [DOI] [PubMed] [Google Scholar]
  2. Atkinson RR, et al. 2016. C4 photosynthesis boosts growth by altering physiology, allocation and size. Nat Plants. 2(5):16038. [DOI] [PubMed] [Google Scholar]
  3. Berry JO, Mure CM, Yerramsetty P. 2016. Regulation of Rubisco gene expression in C4 plants. Curr Opin Plant Biol. 31:23–28. [DOI] [PubMed] [Google Scholar]
  4. Bianconi ME, et al. 2020. Continued adaptation of C4 photosynthesis after an initial burst of changes in the Andropogoneae grasses. Syst Biol. 69(3):445–461. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bouchenak-Khelladi Y, et al. 2009. The origins and diversification of C4 grasses and savanna-adapted ungulates. Glob Chang Biol. 15(10):2397–2417. [Google Scholar]
  6. Casola C, Li J. 2022. Beyond RuBisCO: convergent molecular evolution of multiple chloroplast genes in C4 plants. PeerJ 10:e12791. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chen RC, Dewi C, Huang SW, Caraka RE. 2020. Selecting critical features for data classification based on machine learning methods. J Big Data. 7(1):1–26. [Google Scholar]
  8. Christin PA, et al. 2008. Evolutionary switch and genetic convergence on rbcL following the evolution of C4 photosynthesis. Mol Biol Evol. 25(11):2361–2368. [DOI] [PubMed] [Google Scholar]
  9. Christin PA, et al. 2013. Anatomical enablers and the evolution of C4 photosynthesis in grasses. Proc Natl Acad Sci U S A. 110(4):1381–1386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Christin PA, Salamin N, Kellogg EA, Vicentini A, Besnard G. 2009. Integrating phylogeny into studies of C4 variation in the grasses. Plant Physiol. 149(1):82–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Christin PA, Salamin N, Savolainen V, Duvall MR, Besnard G. 2007. C4 photosynthesis evolved in grasses via parallel adaptive genetic changes. Curr Biol. 17(14):1241–1247. [DOI] [PubMed] [Google Scholar]
  12. DeLano WL. 2002. Pymol: an open-source molecular graphics tool. CCP4 Newsletter Pro Crystallogr. 40(1):82–92. [Google Scholar]
  13. Dubitzky W, Granzow M, Berrar DP. 2007. Fundamentals of data mining in genomics and proteomics. Springer Science & Business Media. [Google Scholar]
  14. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Edwards EJ, et al. 2010. The origins of C4 grasslands: integrating evolutionary and ecosystem science. Science 328(5978):587–591. [DOI] [PubMed] [Google Scholar]
  16. Giussani LM, Cota-Sánchez JH, Zuloaga FO, Kellogg EA. 2001. A molecular phylogeny of the grass subfamily Panicoideae (Poaceae) shows multiple origins of C4 photosynthesis. Am J Bot. 88(11):1993–2012. [PubMed] [Google Scholar]
  17. Gowik U, Bräutigam A, Weber KL, Weber AP, Westhoff P. 2011. Evolution of C4 photosynthesis in the genus Flaveria: how many and which genes does it take to make C4?. Plant Cell. 23(6):2087–2105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Grass Phylogeny Working Group II. 2012. New grass phylogeny resolves deep evolutionary relationships and discovers C4 origins. New Phytol. 193(2):304–312. [DOI] [PubMed] [Google Scholar]
  19. Guyon I, Weston J, Barnhill S, Vapnik V. 2002. Gene selection for cancer classification using support vector machines. Mach Learn. 46(1–3):389–422. [Google Scholar]
  20. Hatch MD, Kagawa T, Craig S. 1975. Subdivision of C4-pathway species based on differing C4 acid decarboxylating systems and ultrastructural features. Funct Plant Biol. 2(2):111–128. [Google Scholar]
  21. Hennig J, Herrmann RG. 1986. Chloroplast ATP synthase of spinach contains nine nonidentical subunit species, six of which are encoded by plastid chromosomes in two operons in a phylogenetically conserved arrangement. Mol Gen Genet. 203(1):117–128. [Google Scholar]
  22. Hibberd JM, Covshoff S. 2010. The regulation of gene expression required for C4 photosynthesis. Annu Rev Plant Biol. 61:181–207. [DOI] [PubMed] [Google Scholar]
  23. Horváth EM, et al. 2000. Targeted inactivation of the plastid ndhB gene in tobacco results in an enhanced sensitivity of photosynthesis to moderate stomatal closure. Plant Physiol. 123(4):1337–1350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Huang W, et al. 2022. A well-supported nuclear phylogeny of Poaceae and implications for the evolution of C4 photosynthesis. Mol Plant. 15(4):755–777. [DOI] [PubMed] [Google Scholar]
  25. Huang P, Studer AJ, Schnable JC, Kellogg EA, Brutnell TP. 2017. Cross species selection scans identify components of C4 photosynthesis in the grasses. J Exp Bot. 68(2):127–135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Ifuku K, Endo T, Shikanai T, Aro EM. 2011. Structure of the chloroplast NADH dehydrogenase-like complex: nomenclature for nuclear-encoded subunits. Plant Cell Physiol. 52(9):1560–1568. [DOI] [PubMed] [Google Scholar]
  27. Ishikawa N, et al. 2016a. NDH-mediated cyclic electron flow around photosystem I is crucial for C4 photosynthesis. Plant Cell Physiol. 57(10):2020–2028. [DOI] [PubMed] [Google Scholar]
  28. Ishikawa N, Takabayashi A, Sato F, Endo T. 2016b. Accumulation of the components of cyclic electron flow around photosystem I in C4 plants, with respect to the requirements for ATP. Photosyn Res. 129(3):261–277. [DOI] [PubMed] [Google Scholar]
  29. Jobe TO, Rahimzadeh Karvansara P, Zenzen I, Kopriva S. 2020. Ensuring nutritious food under elevated CO2 conditions: a case for improved C4 crops. Front Plant Sci. 11:1267. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Kajala K, et al. 2011. Strategies for engineering a two-celled C4 photosynthetic pathway into rice. J Exp Bot. 62(9):3001–3010. [DOI] [PubMed] [Google Scholar]
  31. Kanai R, Edwards GE. 1999. The biochemistry of C4 photosynthesis. C4 Plant Biol. 49:87. [Google Scholar]
  32. Kapralov MV, Filatov DA. 2007. Widespread positive selection in the photosynthetic rubisco enzyme. BMC Evol Biol. 7(1):73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Kapralov MV, Kubien DS, Andersson I, Filatov DA. 2011. Changes in rubisco kinetics during the evolution of C4 photosynthesis in Flaveria (Asteraceae) are associated with positive selection on genes encoding the enzyme. Mol Biol Evol. 28(4):1491–1503. [DOI] [PubMed] [Google Scholar]
  34. Kapralov MV, Smith JAC, Filatov DA. 2012. Rubisco evolution in C4 eudicots: an analysis of Amaranthaceae sensu lato. PLoS One. 7(12):e52974. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Laughlin TG, Bayne AN, Trempe JF, Savage DF, Davies KM. 2019. Structure of the complex I-like molecule NDH of oxygenic photosynthesis. Nature 566(7744):411–414. [DOI] [PubMed] [Google Scholar]
  36. Laughlin TG, Savage DF, Davies KM. 2020. Recent advances on the structure and function of NDH-1: the complex I of oxygenic photosynthesis. Biochim Biophys Acta Bioenerg. 1861(11):148254. [DOI] [PubMed] [Google Scholar]
  37. Martín M, Sabater B. 2010. Plastid ndh genes in plant evolution. Plant Physiol Biochem. 48(8):636–645. [DOI] [PubMed] [Google Scholar]
  38. Matsumura H, et al. 2020. Hybrid rubisco with complete replacement of rice Rubisco small subunits by sorghum counterparts confers C4 plant-like high catalytic activity. Mol Plant. 13(11):1570–1581. [DOI] [PubMed] [Google Scholar]
  39. Matsuoka M, Furbank RT, Fukayama H, Miyao M. 2001. Molecular engineering of C4 photosynthesis. Annu Rev Plant Physiol Plant Mol Biol. 52(1):297–314. [DOI] [PubMed] [Google Scholar]
  40. Moody NR, Christin PA, Reid JD. 2020. Kinetic modifications of C4 PEPC are qualitatively convergent, but larger in Panicum than in Flaveria. Front Plant Sci. 11:1014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Muhaidat R, Sage RF, Dengler NG. 2007. Diversity of Kranz anatomy and biochemistry in C4 eudicots. Am J Bot. 94(3):362–381. [DOI] [PubMed] [Google Scholar]
  42. Nakamura N, Iwano M, Havaux M, Yokota A, Munekage YN. 2013. Promotion of cyclic electron transport around photosystem I during the evolution of NADP–malic enzyme-type C 4 photosynthesis in the genus Flaveria. New Phytol. 199(3):832–842. [DOI] [PubMed] [Google Scholar]
  43. Nelson N, Yocum CF. 2006. Structure and function of photosystems I and II. Annu Rev Plant Biol. 57:521–565. [DOI] [PubMed] [Google Scholar]
  44. Osborne CP, et al. 2014. A global database of C 4 photosynthesis in grasses. New Phytol. 204(3):441–446. [DOI] [PubMed] [Google Scholar]
  45. Paulus JK, Schlieper D, Groth G. 2013. Greater efficiency of photosynthetic carbon fixation due to single amino-acid substitution. Nat Commun. 4(1):1518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Pedregosa F, et al. 2011. Scikit-learn: machine learning in python. J Mach Learn Res. 12:2825–2830. [Google Scholar]
  47. Piot A, Hackel J, Christin PA, Besnard G. 2018. One-third of the plastid genes evolved under positive selection in PACMAD grasses. Planta 247(1):255–266. [DOI] [PubMed] [Google Scholar]
  48. Roell MS, et al. 2021. A synthetic C4 shuttle via the β-hydroxyaspartate cycle in C3 plants. Proc Natl Acad Sci U S A. 118(21):e2022307118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Rosnow JJ, Edwards GE, Roalson EH. 2014. Positive selection of Kranz and non-Kranz C4 phosphoenolpyruvate carboxylase amino acids in Suaedoideae (Chenopodiaceae). J Exp Bot. 65(13):3595–3607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Rudov A, Mashkour M, Djamali M, Akhani H. 2020. A review of C4 plants in southwest Asia: an ecological, geographical and taxonomical analysis of a region with high diversity of C4 eudicots. Front Plant Sci. 11:546518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Sage RF. 2017. A portrait of the C4 photosynthetic family on the 50th anniversary of its discovery: species number, evolutionary lineages, and Hall of fame. J Exp Bot. 68(2):4039–4056. [DOI] [PubMed] [Google Scholar]
  52. Sage RF, Christin PA, Edwards EJ. 2011. The C4 plant lineages of planet earth. J Exp Bot. 62(9):3155–3169. [DOI] [PubMed] [Google Scholar]
  53. Sage TL, Sage RF. 2009. The functional anatomy of rice leaves: implications for refixation of photorespiratory CO2 and efforts to engineer C4 photosynthesis into rice. Plant Cell Physiol. 50(4):756–772. [DOI] [PubMed] [Google Scholar]
  54. Sage RF, Sage TL, Kocacinar F. 2012. Photorespiration and the evolution of C4 photosynthesis. Annu Rev Plant Biol. 63:19–47. [DOI] [PubMed] [Google Scholar]
  55. Schuler ML, Mantegazza O, Weber AP. 2016. Engineering C4 photosynthesis into C3 chassis in the synthetic biology age. Plant J. 87(1):51–65. [DOI] [PubMed] [Google Scholar]
  56. Sen L, et al. 2011. Molecular evolution of rbcL in three gymnosperm families: identifying adaptive and coevolutionary patterns. Biol Direct. 6(1):29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Shpak M, Ni Y, Lu J, Müller P. 2017. Variance in estimated pairwise genetic distance under high versus low coverage sequencing: the contribution of linkage disequilibrium. Theor Popul Biol. 117:51–63. [DOI] [PubMed] [Google Scholar]
  58. Singh BP, Kumar A, Kaur H, Singh H, Nagpal AK. 2020. CpGDB: a comprehensive database of chloroplast genomes. Bioinformation 16(2):171–175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Spreitzer RJ, Peddi SR, Satagopan S. 2005. Phylogenetic engineering at an interface between large and small subunits imparts land-plant kinetic properties to algal Rubisco. Proc Natl Acad Sci U S A. 102(47):17225–17230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Spreitzer RJ, Salvucci ME. 2002. Rubisco: structure, regulatory interactions, and possibilities for a better enzyme. Annu Rev Plant Biol. 53(1):449–475. [DOI] [PubMed] [Google Scholar]
  61. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Still CJ, Berry JA, Collatz GJ, DeFries RS. 2003. Global distribution of C3 and C4 vegetation: carbon cycle implications. Global Biogeochem Cycles. 17(1):6–1. [Google Scholar]
  63. Stroebel D, Choquet Y, Popot JL, Picot D. 2003. An atypical haem in the cytochrome b6f complex. Nature 426(6965):413–418. [DOI] [PubMed] [Google Scholar]
  64. Stroud DA, et al. 2016. Accessory subunits are integral for assembly and function of human mitochondrial complex I. Nature 538(7623):123–126. [DOI] [PubMed] [Google Scholar]
  65. Su X, et al. 2022. Supramolecular assembly of chloroplast NADH dehydrogenase-like complex with photosystem I from Arabidopsis thaliana. Mol Plant. 15(3):454–467. [DOI] [PubMed] [Google Scholar]
  66. Tanz SK, Tetu SG, Vella NG, Ludwig M. 2009. Loss of the transit peptide and an increase in gene expression of an ancestral chloroplastic carbonic anhydrase were instrumental in the evolution of the cytosolic C4 carbonic anhydrase in Flaveria. Plant Physiol. 150(3):1515–1529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Torabi S, et al. 2014. Psbn is required for assembly of the photosystem II reaction center in Nicotiana tabacum. Plant Cell. 26(3):1183–1199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Vabalas A, Gowen E, Poliakoff E, Casson AJ. 2019. Machine learning algorithm validation with a limited sample size. PLoS One. 14(11):e0224365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Watcharamongkol T, Christin PA, Osborne CP. 2018. C4 photosynthesis evolved in warm climates but promoted migration to cooler ones. Ecol Lett. 21(3):376–383. [DOI] [PubMed] [Google Scholar]
  70. Whitney SM, et al. 2011. Isoleucine 309 acts as a C4 catalytic switch that increases ribulose-1, 5-bisphosphate carboxylase/oxygenase (rubisco) carboxylation rate in Flaveria. Proc Natl Acad Sci U S A. 108(35):14688–14693. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Woodson JD, Chory J. 2008. Coordination of gene expression between organellar and nuclear genomes. Nat Rev Genet. 9(5):383–395. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Yamaguchi K, Subramanian AR. 2000. The plastid ribosomal proteins: identification of all the proteins in the 50 S subunit of an organelle ribosome (chloroplast). J Biol Chem. 275(37):28466–28482. [DOI] [PubMed] [Google Scholar]
  73. Yamaguchi K, Subramanian AR. 2003. Proteomic identification of all plastid-specific ribosomal proteins in higher plant chloroplast 30S ribosomal subunit: PSRP-2 (U1A-type domains), PSRP-3α/β (ycf65 homologue) and PSRP-4 (Thx homologue). Eur J Biochem. 270(2):190–205. [DOI] [PubMed] [Google Scholar]
  74. Yamamoto H, Peng L, Fukao Y, Shikanai T. 2011. An Src homology 3 domain-like fold protein forms a ferredoxin binding site for the chloroplast NADH dehydrogenase-like complex in Arabidopsis. Plant Cell. 23(4):1480–1493. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Yu GX, Park BH, Chandramohan P, Geist A, Samatova NF. 2005. An evolution-based analysis scheme to identify CO2/O2 specificity-determining factors for ribulose 1, 5-bisphosphate carboxylase/oxygenase. Protein Eng Des Sel. 18(12):589–596. [DOI] [PubMed] [Google Scholar]
  76. Zhang C, et al. 2020. Structural insights into NDH-1 mediated cyclic electron transfer. Nat Commun. 11(1):888. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

evad129_Supplementary_Data

Data Availability Statement

The data underlying this article are available in the article and its online Supplementary material. All sequence alignment data informing components of this study are openly available on figshare at https://doi.org/10.6084/m9.figshare.21379725.v1 and the necessary code for pipeline implementation is publically available at https://github.com/Nil29/C4PredictorAnalysis


Articles from Genome Biology and Evolution are provided here courtesy of Oxford University Press

RESOURCES