Abstract
Chronic Helicobacter pylori infection is associated with the development of peptic ulcer, gastric cancer, and gastric lymphoma. Bacterial factors of H. pylori are reported to be important in the development of these gastroduodenal diseases. The CagA protein, encoded by the cagA gene, is the best-studied virulence factor of H. pylori. Pathogenic CagA contains a highly polymorphic Glu-Pro-Ile-Tyr-Ala (EPIYA) repeat region in its C-terminal region, and this repeat region is reported to be involved in the pathogenesis of gastroduodenal diseases. The segments containing EPIYA motifs have been designated as segments A, B, C, and D; however, their classification and relation to disease are still unclear. This study used 560 unique CagA sequences containing 1,796 EPIYA motifs collected from public resources, including 274 Western and 286 East Asian strains, with clinical data obtained for 433 entries. Fifteen types of EPIYA or EPIYA-like sequences are defined. In addition to the four previously reported major segment types, several minor segment types (e.g., segments B′ and B′′) and more than 30 sequence types (e.g., ABC, ABD) were defined using our classification method. We confirm that the sequences from Western and East Asian strains contain segments C and D, respectively. We also confirm that strains with two copies of EPIYA segment C are more likely to be associated with gastric cancer than those with one copy. Our results shed light on the relationships between the types of CagA, the country of origin of each sequence type, and the frequency of gastric disease.
Introduction
Helicobacter pylori is a Gram-negative bacterium etiologically involved in peptic ulcer disease, gastric adenocarcinoma, and primary gastric B-cell lymphoma [1]. Although infection with H. pylori almost always results in chronic active gastritis, only a fraction of those infected develop clinical disease. This phenomenon remains unexplained, but host genetics, the host immune response, and the relationship of the host response to bacterial virulence factors are likely to be important. Many groups have investigated the roles of putative virulence factors of H. pylori, of which the best studied is the CagA protein [2]–[7]. CagA-producing strains are reported to be associated with severe clinical outcomes, especially in Western countries [8]–[11].
CagA is a highly immunogenic protein with a molecular weight between 120 and 140 kDa [12], [13]. Variation in the size of CagA is due to the presence of a variable number of repeat sequences located in the 3′ region of the gene [12], [14]–[16]. The repeat regions contain the Glu-Pro-Ile-Tyr-Ala (EPIYA) motif. At least four classification methods have been reported to characterize the different sequence patterns in the 3′ region. First, the terms D1, D2, and D3 are used to designate three specific sequences [12]. Second, sequences are denoted with combinations of R1, R2, and R3 [14], [15]. Third, each EPIYA motif is assigned a motif type (e.g., EPIYA-A, -B, -C, or -D motif) [17], [18], [19]. Finally, sequences are annotated according to the segments (20–50 amino acids) flanking the EPIYA motifs (segments EPIYA-A, -B, -C, or -D) [20–23], following the identification of the essential CagA phosphorylation sites, as confirmed by mutagenesis during infection and transfection [24]. Initially, the two Csk binding sites were designated as segments EPIYA-A and -B, and the sites that bind the Src homology 2 (SH2) domain of Src homology 2 phosphatase (SHP-2) in Western and East Asian type CagA were designated as segments EPIYA-C and -D, respectively. Here, “motif” designates the five-residue sequence (EPIYA) and “segment” designates the short sequence surrounding the EPIYA motif (Figure 1). However, none of the four classification methods works well with non-standard sequences, and a modified classification method was deemed necessary.
CagA is encoded by the cagA gene, which is located at one end of the cag pathogenicity island (PAI) [25]. The cag PAI encodes a type IV secretion system, by which CagA proteins are delivered into host cells [26–30]. CagA interacts with various target molecules in addition to Csk and SHP-2, including Src [31], [32] and Abl [33]. A recent study confirmed that nearly a dozen factors, such as SHP-1, Grd2, Grb2, and phosphatidylinositol 3-OH kinase (PI3K), can also bind to CagA phosphorylation sites [34]. Mutations of SHP-2 have been found in various human malignancies, and altered SHP-2 signaling culminates in the development of gastric adenocarcinoma in genetically engineered mice [35], [36], indicating that SHP-2 is involved in the development of gastric cancer. Recent studies reported that East Asian type CagA containing segment EPIYA-D exhibits stronger binding activity for SHP-2 and a greater ability to induce morphological changes in epithelial cells than Western type CagA containing segment EPIYA-C [17], [20], [23]. A recent study also showed that H. pylori strains possessing East Asian type CagA induce higher amounts of interleukin-8 from gastric epithelial cells than strains possessing Western type CagA [37]. Accordingly, East Asian strains are believed to be more virulent than Western strains, which might explain why the incidence of gastric cancer in East Asian countries is higher than in Europe, North America, and Australia (data available at http://www-dep.iarc.fr/). In addition, the incidence of gastric cancer is reported to be higher in patients infected with strains carrying multiple EPIYA repeats than in those infected with strains carrying a single repeat [14], [15], [38], [39].
However, there are also conflicting reports that the genotypes (DNA analysis) of the CagA repeat region are not associated with clinical outcomes [40]–[43]. This controversy might be due in part to the fact that genotype differences do not necessarily change the protein sequence, and that previous studies of CagA diversity and of the relationship between disease and protein sequence type used only limited information, mostly relying on their own data sets. Indeed, no comprehensive study has considered all CagA sequences deposited in GenBank (http://www.ncbi.nlm.nih.gov/). Moreover, although CagA EPIYA repeats can be assigned to consensus sequence types, existing sequence analyses did not fully consider the patterns of sequence variation in the CagA repeat region. An in-depth analysis of the non-typical repeats [15], [44] is therefore necessary. In this study, we used sequence comparison and statistical methods to analyze 560 unique CagA sequences selected from 4,534 CagA entries from three data sources. Our results shed light on the relationships between the types of CagA, the country of origin of each sequence type, and the frequency of gastric disease.
Results and Discussion
EPIYA Motifs Classification
By sequence alignment or pattern comparison, we found that there were sequences similar to EPIYA (such as EPIYT, ESIYT), although most sequences contained EPIYA. In this study, the EPIYA or EPIYA-like sequences were defined as any five member amino acid sequence with at least three amino acids corresponding to the sequence, EPIYA (where Y is always constant). By searching all sequences before data filtering, we obtained 16 types of EPIYA or EPIYA-like sequences. Of these, 15 types were chosen for further study because their surrounding sequences were similar to those of EPIYA (Table 1), indicating that these sequences might have a function similar to EPIYA. One sequence, MAIYA, from entry ABA26023 was excluded because the pattern of its flanking sequences was very different from those of the other 15 types of EPIYA or EPIYA-like sequences (Table 1). The 15 types listed in Table 1 are called EPIYA “motifs” for simplicity, in this work.
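The matching rule described above (a five-residue window with Y fixed at the fourth position and at least three residues identical to EPIYA) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB program; the function names are hypothetical, and the additional manual screening by flanking sequence (which excluded MAIYA) is not modeled here.

```python
REFERENCE = "EPIYA"

def is_epiya_like(pentamer: str) -> bool:
    """True if the 5-residue window keeps Y at position 4 and matches
    EPIYA at three or more positions (the conserved Y counts as one)."""
    if len(pentamer) != 5 or pentamer[3] != "Y":
        return False
    matches = sum(a == b for a, b in zip(pentamer, REFERENCE))
    return matches >= 3

def find_epiya_like(sequence: str):
    """Return (0-based start, pentamer) for every EPIYA-like window."""
    hits = []
    for start in range(len(sequence) - 4):
        window = sequence[start:start + 5]
        if is_epiya_like(window):
            hits.append((start, window))
    return hits

# Representative segment C from Table 2.
segment_c = "FPLKRHDKVDDLSKVGRSVSPEPIYATIDDLGGP"
hits = find_epiya_like(segment_c)
```

Note that the rule alone also accepts MAIYA (I, Y, and A match), which is why the flanking-sequence comparison is needed as a second criterion.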
Table 1. Frequencies of the 15 types of EPIYA motifs.
Motif | EPIYA | EPIYT | ESIYA | ESIYT | EPIYV | EHIYA | ELIYA | EPVYA |
Freq. | 1657 | 92 | 24 | 7 | 3 | 2 | 2 | 2 |
Motif | EPIYD | EPIYS | EPKYA | EPRYA | ETIYA | KPIYA | NPIYA | Total |
Freq. | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1,796 |
The frequency of each EPIYA motif in the filtered data set is listed in Table 1. In total, 1,796 EPIYA motifs were obtained from the 560 CagAs. On average, each CagA sequence contained approximately three EPIYA motifs. The three most frequent EPIYA motifs were EPIYA (1,657/1,796 = 92.3%), EPIYT (92/1,796 = 5.1%), and ESIYA (24/1,796 = 1.3%).
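As a quick arithmetic check, the percentages and the per-sequence average quoted above can be recomputed from the counts in Table 1. The snippet below is only an illustrative sketch; the variable names are ours.

```python
# Motif counts copied from Table 1.
motif_counts = {
    "EPIYA": 1657, "EPIYT": 92, "ESIYA": 24, "ESIYT": 7,
    "EPIYV": 3, "EHIYA": 2, "ELIYA": 2, "EPVYA": 2,
    "EPIYD": 1, "EPIYS": 1, "EPKYA": 1, "EPRYA": 1,
    "ETIYA": 1, "KPIYA": 1, "NPIYA": 1,
}
N_SEQUENCES = 560  # unique CagA sequences in the filtered data set

total_motifs = sum(motif_counts.values())

def share(motif: str) -> float:
    """Percentage of all motifs accounted for by one motif type."""
    return 100.0 * motif_counts[motif] / total_motifs

motifs_per_sequence = total_motifs / N_SEQUENCES

for name in ("EPIYA", "EPIYT", "ESIYA"):
    print(f"{name}: {share(name):.1f}%")
print(f"average motifs per sequence: {motifs_per_sequence:.1f}")
```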
EPIYA Segments Classification
We categorized the EPIYA segments according to the sequences flanking the EPIYA motifs (Figure 1). In addition to the four major segments originally designated, EPIYA-A, -B, -C, and -D [20], [22], we designated several minor segments, including EPIYA-B′ and -B′′. Representative examples of these types of segments, derived from the 560 CagAs, are listed in Table 2 (a few other types of segments with frequencies less than 10 are given in Table S1). For simplicity, we refer to segment EPIYA-A, -B, -C, or -D as segment A, B, C, or D. Segments A, B, B′, and B′′ carry a subscript C or D, which indicates that the CagA sequence containing the segment also contains segment C or D, respectively (Figure 1). However, 19 short sequences contained neither segment C nor segment D, and we manually assigned a subscript C or D to the segment type according to their sequence patterns.
Table 2. Representative segments of EPIYA motifs.
Type | Freq. | Representative sequence |
AC | 272 | KELNAKLGNFNNNNNNGLKN..EPIYAKVNKKK |
AD | 295 | KELNEKLFGNSNNNNNGLKNNTEPIYAQVNKKK |
BC | 262 | TGQVASPEEPIYAQVAKKVNAKIDRLNQIASGLGGVGQAAG |
BD | 281 | TGQATSPEEPIYAQVAKKVSAKIDQLNEATS |
C | 343 | FPLKRHDKVDDLSKVGRSVSPEPIYATIDDLGGP |
D | 284 | AINRKIDRINKIASAGKGVGGFSGAGRSASPEPIYATIDFDEAN |
B′C | 10 | AGQAASPEEPIYAKVNKKK |
B′D | 14 | AGQATSPEEPIYAQVNKKK |
B′′D | 19 | AINRKIDRINKIASAGKGVGGFSGAGRSANPEPIYAQVARKVSA-KIDQLNEATS |
Total | 1,780 |
Note: the values in the table are the frequencies of similar sequences, not the number of identical sequences within a sequence type. Other segments of 16 EPIYA motifs are listed in Table S1.
We named the minor segments according to the patterns of the sections immediately following EPIYA (Table 2). This was because the four amino acids, TIDD and TIDF, following EPIYA in segments C and D, respectively are reported to be important for the binding of SHP-2 [17], [24]. For example, segments B′C and B′D are shorter versions of segments BC and BD, respectively (Table 2). In segment B′′D, the sequences before EPIYA are similar to those of segment D, whereas the sequences after EPIYA are similar to those of segment BD.
Segment B showed the greatest variation in the five amino acids of the EPIYA motif (Table S2). Of the three most frequent motifs other than EPIYA, 89 out of 92 EPIYTs, all 23 ESIYAs, and all 7 ESIYTs appear in segment B. Interestingly, 88 EPIYT motifs belong to segment BC, and only 1 EPIYT belongs to segment BD. In contrast, the variation of the five amino acids in segments A, C, and D was relatively small. In other reports [18], [19], the NPIYA, EPIYT, ESIYA, and ESIYT motifs were named A′, B′, B″, and B″′, respectively. However, this terminology seems confusing, because by the same logic all 15 types of EPIYA-like motifs would need different names. Their motif A′ belonged to our segment A, and their B′, B″, and B″′ fell into our segments B, B′, or B′′ (Table S2).
CagA Sequence Type Classification
Each CagA sequence was assigned a sequence type consisting of the names of the EPIYA segments in its sequence (such as ABC or ABD) (Table S3). Depending on the number of EPIYA segments, the types are written as AnBnCn or AnBnDn, where “n” is the number of repeats of each segment and does not have to be equal for A, B, C, and D (e.g., ABCCCC). When an additional segment lacking an EPIYA motif was present between two neighboring EPIYA segments, a hyphen was added between the two EPIYA segments (e.g., A-C, A-D). In total, there were 28 such segments without EPIYA motifs between two neighboring EPIYA segments among the 560 CagAs (Table S3). These 28 interval segments vary in length and content. In total, 41 different sequence types were found (Table S4). Of these 41 sequence types, 32 remained (Table 3) after removing the types containing rare EPIYA segment types (i.e., B′′C, C′, D′, C′′, and D′′). The majority of the sequences were of types ABD (43%) and ABC (30%). Interestingly, there were no CagA sequences containing both segments C and D. This suggests that hybridization (recombination) between Western and East Asian CagA is very rare.
Table 3. Frequencies of the 32 sequence types.
Seq. Type | Freq. | Seq. Type | Freq. | Seq. Type | Freq. | Seq. Type | Freq. |
ABD | 240 | AB′-ABD | 4 | C | 2 | ABCCCC | 1 |
ABC | 167 | A-D | 4 | A | 1 | A-B″D | 1 |
ABCC | 51 | A-ABD | 3 | AB′B′BC | 1 | AB-D | 1 |
ABB″D | 16 | AB-ABD | 2 | ABB″BD | 1 | ABD-ABD | 1 |
AB | 15 | AB′B′BD | 2 | AB′BCC | 1 | ABD-BD | 1 |
ABCCC | 10 | AB′BD | 2 | AB′-C | 1 | ABD-D | 1 |
AB′BC | 6 | ABCCCCC | 2 | AB-C | 1 | A-CCC | 1 |
A-C | 5 | AB′D | 2 | ABCB″CC | 1 | CC | 1 |
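The naming convention for sequence types can be sketched in a few lines of Python. This is an illustration of the rule as stated in the text, not the authors' implementation: it assumes each CagA has already been split into an ordered list of segment labels, and it uses `None` (our own convention) to mark an interval segment that lacks an EPIYA motif.

```python
def sequence_type(segments):
    """Concatenate segment labels in order; an interval segment without
    an EPIYA motif (represented here as None) becomes a hyphen."""
    return "".join("-" if seg is None else seg for seg in segments)

def count_epiya_segments(segments):
    """Number of EPIYA-containing segments (interval segments excluded)."""
    return sum(seg is not None for seg in segments)

# Hypothetical inputs reproducing type names that occur in Table 3.
print(sequence_type(["A", "B", "C", "C"]))               # ABCC
print(sequence_type(["A", None, "C"]))                   # A-C
print(sequence_type(["A", "B′", None, "A", "B", "D"]))   # AB′-ABD
```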
A small number of CagAs were classified differently in our current study and in previous studies (examples are shown in Table S5). For example, the CagA sequence of BAF45291 was classified as AC in a previous study [44]. In our classification, however, the sequence type was A-C, which means that an interval segment (VKAKIDQLNQAASGFGNVGQAG) lacking an EPIYA-like motif was present between segments A and C. For the CagA sequence of BAF45283, the sequence type was reported to be ABDD in a previous study [44], whereas it was classified as ABB″D in this work. The third segment, which differs between the two studies (D vs. B″), is AINRKIDRINKIASAGKGVGGFSGAGRSASPEPIYAQVAKKVSAKIDQLNEATS. In this segment, the part before the EPIYA motif is similar to segment D, whereas the part after the EPIYA motif is similar to segment B. This segment is therefore neither D nor B, but rather B″, a variant of segment B (Table 2). Overall, we believe that the segment definitions and sequence classifications used in this study are more meaningful and accurate than those used in previous studies.
Each of the 560 CagAs was found to have at least one, and as many as seven, EPIYA segments (or EPIYA motifs). The numbers of sequences containing 1 through 8 EPIYA segments are 3, 27, 416, 86, 23, 3, 2, and 0, respectively (Table S6). For example, a sequence of type A has only one EPIYA segment (A), and a sequence of type ABCCCCC has seven EPIYA segments, including five repeats of segment C. The majority (74% = 416/560) of sequences had three EPIYA segments.
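The distribution above can be cross-checked against the totals reported elsewhere in the paper. The following sketch (variable names are ours) confirms that the counts sum to 560 sequences and that the implied number of EPIYA segments equals the 1,796 motifs of Table 1.

```python
# Number of sequences (value) having a given number of EPIYA segments (key).
segments_per_sequence = {1: 3, 2: 27, 3: 416, 4: 86, 5: 23, 6: 3, 7: 2, 8: 0}

n_sequences = sum(segments_per_sequence.values())
n_segments = sum(k * n for k, n in segments_per_sequence.items())
share_with_three = 100.0 * segments_per_sequence[3] / n_sequences

print(n_sequences, n_segments, f"{share_with_three:.0f}%")
```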
Detailed Analyses of EPIYA Segments
The EPIYA segment types were defined according to the segment patterns (Table 2); however, the constituent amino acids varied slightly within each segment type. The two most frequent sequences for each of segments A, B, C, and D are shown in Table 4. Segments AC and AD contain from two to eight Ns (Asn) upstream of the EPIYA or EPIYA-like motif. Segments C and D are more conserved than segments AC, AD, BC, and BD.
Table 4. Two most frequent EPIYA segments.
Type | Segment | Ratio |
AC | KELNAKLGNFNNNNNNGLKN..EPIYAKVNKKK | 53/272 |
AC | KELNAKLGNFNNNNNNGLKNSTEPIYAKVNKKK | 22/272 |
AD | KELNEKLFGNSNNNNNGLKNNTEPIYAQVNKKK | 53/272 |
AD | XXXXXKLFGNSNNNNNGLKNNTEPIYAQVNKKK | 22/272 |
BC | TGQVASPEEPIYAQVAKKVNAKIDRLNQIASGLGGVGQAAG | 25/262 |
BC | AGQAASPEEPIYAQVAKKVNAKIDRLNQIASGLGGVGQAAG | 19/262 |
BD | TGQATSPEEPIYAQVAKKVSAKIDQLNEATS | 25/262 |
BD | TGQVASPEEPIYAQVAKKVSAKIDQLNEATS | 19/262 |
C | FPLKRHDKVDDLSKVGRSVSPEPIYATIDDLGGP | 144/343 |
C | FPLKRHDKVDDLSKVGRAVSPEPIYATIDDLGGP | 50/343 |
D | AINRKIDRINKIASAGKGVGGFSGAGRSASPEPIYATIDFDEAN | 144/343 |
D | AINRKIDRINKIASAGKGVGGFSGAGRSASPEPIYATIDFDETN | 50/343 |
X represents unknown amino acids; the amino acids which are different in two sequences shown are highlighted; Ratio = (Frequency of the type)/(Total frequency).
There were obvious differences between segments C and D when analyzed using the program WebLogo (Figure 2). The segments were aligned using BioEdit before they were entered into WebLogo. Because WebLogo had a problem analyzing a column of aligned sequences if BioEdit had added many spaces, all spaces in the sequence alignments were replaced by Z (meaning zero or nothing). In this way, the inserted spaces (Z) and the minor amino acids were easily identified. In the alignments, X indicates that an amino acid was not available. As shown in Figure 2, the lengths of segments AC and AD are the same and their sequences are very similar. However, the lengths of segments BC and BD, and of segments C and D, are quite different. The sequences after the stretch of amino acids QVAKKV in segments BC and BD were highly variable, while the sequences of segments C and D were completely different. Overall, the main sequence variation between Western and East Asian strains starts after QVAKKV in segments BC and BD.
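The gap-masking step described above is simple to reproduce. The sketch below assumes that the alignment is available as a list of equal-length strings and that gaps appear as `-`, `.`, or a blank; the exact gap character written by BioEdit is an assumption here, and the function name is hypothetical.

```python
def mask_gaps(aligned_sequences, gap_chars="-. ", placeholder="Z"):
    """Replace every gap character with a placeholder residue so that
    gap-rich columns are still displayed by a sequence-logo tool."""
    table = str.maketrans({ch: placeholder for ch in gap_chars})
    return [seq.translate(table) for seq in aligned_sequences]

# The two most frequent AC segments (Table 4); ".." marks a two-residue gap.
aligned = [
    "KELNAKLGNFNNNNNNGLKN..EPIYAKVNKKK",
    "KELNAKLGNFNNNNNNGLKNSTEPIYAKVNKKK",
]
masked = mask_gaps(aligned)
```

Z is a convenient placeholder because it does not denote one of the 20 standard amino acids, so it cannot be confused with a real residue in the logo.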
The four amino acids TIDD and TIDF following the EPIYA motifs in segments C and D, respectively, are reported to be important for the binding of SHP-2 [17], [24]; therefore, the frequency of the four amino acids following EPIYA motifs in all EPIYA segments may be informative. As illustrated in Table S7, the sequences KVNK and QVNK occupy this position in the majority of segments AC and AD, respectively. QVAK occupied this position in most segments BC and BD. In the literature [17], the criteria for identifying EPIYA segments C and D are that the EPIYA motif is followed by TIDD and TIDF, respectively. However, by sequence pattern comparison, we found that an EPIYA motif also belongs to segment C if it is followed by TIEE, TIDE, SIDD, TIDG, TIAE, or TIAD. If EPIYA is followed by TIDS, then it belongs to segment D. As shown in Table S2, segments B, B′, and B′′ had the greatest variation in the five amino acids of the motif. However, the four amino acids following the EPIYA motif were most variable in segment A (Table S7).
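The extended criteria for segments C and D amount to a lookup on the four residues that follow the motif. The sketch below encodes only the tetrapeptides listed in the text; it is a simplification, since the actual classification also relied on comparison of the whole segment pattern, and the function name and the handling of unlisted tetrapeptides (returning `None`) are our own choices.

```python
# Tetrapeptides following EPIYA, as listed in the text.
C_FOLLOWERS = {"TIDD", "TIEE", "TIDE", "SIDD", "TIDG", "TIAE", "TIAD"}
D_FOLLOWERS = {"TIDF", "TIDS"}

def classify_c_or_d(segment: str, motif: str = "EPIYA"):
    """Return 'C' or 'D' from the four residues after the first motif
    occurrence, or None if the motif or a listed follower is absent."""
    start = segment.find(motif)
    if start == -1:
        return None
    follower = segment[start + len(motif):start + len(motif) + 4]
    if follower in C_FOLLOWERS:
        return "C"
    if follower in D_FOLLOWERS:
        return "D"
    return None

# Representative segments from Table 2.
seg_c = "FPLKRHDKVDDLSKVGRSVSPEPIYATIDDLGGP"
seg_d = "AINRKIDRINKIASAGKGVGGFSGAGRSASPEPIYATIDFDEAN"
seg_bd = "TGQATSPEEPIYAQVAKKVSAKIDQLNEATS"
```

With these inputs the function separates the representative C and D segments and leaves the B-type segment (follower QVAK) unassigned.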
Correlation of Sequence Types and Geographic Areas
H. pylori strains from different geographic areas show clear phylogeographic differentiation, and H. pylori populations tend to spread along the lines of human migration [45]–[50]. Furthermore, several studies concluded that CagA isoforms with segments C and D are related to Western and East Asian countries, respectively [14]–[16]. We tested this hypothesis using our comprehensive system of CagA classification. The frequency of each sequence class in individual countries is shown in Table 5. As expected, all 227 (100%) samples from Western countries contain EPIYA segment C. In contrast, of 307 sequences from East Asian countries (Japan, China, Korea, and Viet Nam), 26 (∼8%) contain EPIYA segment C instead of segment D. Interestingly, of the 21 Japanese strains with CagA sequence types related to segment C, 17 have names beginning with OK (Table S8), signifying that they were isolated in Okinawa, Japan (discussed below). In the Southeast Asian countries (Thailand and Malaysia), sequences containing segment C and sequences containing segment D were equally prevalent, and all samples from Iran, Kazakhstan, and India contained segment C, although these are Asian countries. Overall, we found that it is largely true that CagA sequences with segments C and D are related to Western and East Asian countries, respectively; however, there are some exceptions among East Asian strains. Southeast Asian countries appear to form the geographical border between segment C and segment D. The presence of Western type CagA in some East Asian countries may reflect partial transmission of H. pylori from Western to East Asian populations, either during ancient human migration or through recent transmission.
Table 5. Frequency of CagAs with respect to country.
Country | total # | # of seq. containing EPIYA-C | # of seq. containing EPIYA-D |
Japan | 249 | 21 | 228 |
China | 48 | 4 | 44 |
Korea | 6 | 1 | 5 |
Viet Nam | 4 | 0 | 4 |
Thailand | 5 | 2 | 3 |
Malaysia | 3 | 2 | 1 |
Iran | 5 | 5 | 0 |
India | 4 | 4 | 0 |
Kazakhstan | 3 | 3 | 0 |
Greece | 100 | 100 | 0 |
Italy | 34 | 34 | 0 |
Sweden | 5 | 5 | 0 |
Ireland | 3 | 3 | 0 |
USA | 22 | 22 | 0 |
Costa Rica | 33 | 33 | 0 |
Colombia | 24 | 24 | 0 |
Austria, Chile, and Germany each have one strain. The country information of 11 sequences or strains is not available.
As mentioned above, there are 21 strains from Japan with sequences related to EPIYA segment C instead of segment D (Table 5). Detailed information on these 21 strains is given in Table S8. Most of these segment C strains were isolated in Okinawa, which was governed by the United States from the end of World War II until 1972, and a large US population still lives in Okinawa today. These data suggest that transmission of H. pylori between different populations may not be a rare event. In fact, a previous report on native Americans in Peru showed that all H. pylori strains in this population were of the Western type [51], while only 4 of 17 strains isolated from an isolated native American group living in the Amazonian jungles of Colombia were East Asian type strains [48]. Based on our data, Western strains appear to be transferred to East Asian people more easily than the other way around. Another possible explanation for Western type CagA in Okinawa is that the Okinawan CagA is a novel type of CagA that did not originate from modern Western people but came to Japan long ago. Further studies will be necessary to test this hypothesis. If it proves true, elucidating the mechanism will be important for understanding the transmission of H. pylori in human populations.
Among the 21 strains from Okinawa, 20 contain EPIYA segment B (Table S8). Of the 20 EPIYA motifs in segment B, 15 are EPIYT and 4 are ESIYT. Comparing this information with the data in Table S2, we found that the frequencies of the EPIYT and ESIYT motifs among the sequences of the 21 Okinawa strains are relatively high. Detailed analyses of a larger number of strains from Okinawa should provide information about the roles and evolution of EPIYA motifs.
Correlation of Sequence Types and Strain Diseases
We were able to obtain clinical information for 433 of the 560 strains in our data set (Table 6). In our data sheet, disease G comprises gastritis, atrophic gastritis, epigastrial pain, gastric hyperplastic polyp, non-ulcer dyspepsia, chronic gastritis, chronic atrophic gastritis, and chronic gastritis-associated dyspepsia, as well as “gastris”, which is regarded as a typo for gastritis. Diseases DU and GU represent duodenal ulcer and gastric ulcer, respectively (peptic ulcer, PU = DU + GU). Disease GC comprises gastric cancer, gastric carcinoma, gastric adenocarcinoma, gastric adenoma, and adenomatous polyps. Disease MALT comprises MALT lymphoma and MALToma. Disease E represents esophagitis. Among those 433 samples, 42%, 32%, and 20% of the patients had diseases G, PU, and GC, respectively, which indicates a potential selection bias in the sequence samples. For comparison, the prevalence of GC is approximately 3% in H. pylori-positive patients [52]. Nonetheless, the data are useful for comparing patterns of sequence types among diseases.
Table 6. Frequency and percentage of strains by disease type.
Disease | G | DU | GU | GC | E | MALT | Total |
Occurrence | 181 | 90 | 43 | 87 | 21 | 5 | 433 |
Percentage | 42% | 21% | 10% | 20% | 5% | 1% | 100% |
The diseases are designated in the text.
We compared the three types ABC, ABD, and ABCC in relation to clinical outcomes. Other EPIYA types were excluded because the number of strains of each minor type was relatively small. As shown in Table 7, the prevalence of ABCC was 22% (17/[22 + 38 + 17]) in GC, whereas it was only 12% (18/[65 + 66 + 18]) in G and 7% (8/[42 + 64 + 8]) in PU. The ratio of ABCC/ABC was therefore significantly higher in GC (17/22 = 0.77) than in PU (8/42 = 0.19) and G (18/65 = 0.28) (the calculated chi-square values are 8.24 and 6.22, and the probabilities of the null hypothesis are less than 0.03 and 0.01, respectively). The finding that strains with more copies of EPIYA segment C are more often associated with gastric cancer is consistent with previous studies [15], [38]. The ratio of ABD/ABC was also higher in GC (38/22 = 1.73) than in PU (64/42 = 1.52) and G (66/65 = 1.02); however, the differences were not statistically significant (the calculated chi-square values are 0.14 and 2.79, and the probabilities of the null hypothesis are more than 0.90 and 0.10, respectively).
Table 7. EPIYA types and clinical outcomes.
Type | Total | G | PU | GC |
ABC | 129 | 65, 50%, 1.0 | 42, 33%, 1.0 | 22, 17%, 1.0 |
ABD | 168 | 66, 39%, 0.8 | 64, 38%, 1.2 | 38, 23%, 1.3 |
ABCC | 43 | 18, 42%, 0.8 | 8, 19%, 0.6 | 17, 40%, 2.4 |
PU = DU + GU. Other diseases are designated in the text. The strains with unavailable disease information are not included.
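The prevalences and ratios quoted in the text follow directly from the counts in Table 7. The sketch below recomputes them; the data layout and function names are ours, and only the counts (not the percentages or relative values in the table cells) are used as input.

```python
# Strain counts by sequence type and disease, from Table 7.
counts = {
    "ABC":  {"G": 65, "PU": 42, "GC": 22},
    "ABD":  {"G": 66, "PU": 64, "GC": 38},
    "ABCC": {"G": 18, "PU": 8,  "GC": 17},
}

def prevalence(seq_type: str, disease: str) -> float:
    """Fraction of strains of one type among the three types in a disease group."""
    group_total = sum(row[disease] for row in counts.values())
    return counts[seq_type][disease] / group_total

def ratio(numerator: str, denominator: str, disease: str) -> float:
    """Ratio of strain counts of two sequence types within a disease group."""
    return counts[numerator][disease] / counts[denominator][disease]

for disease in ("GC", "G", "PU"):
    print(disease,
          f"ABCC prevalence {100 * prevalence('ABCC', disease):.0f}%",
          f"ABCC/ABC {ratio('ABCC', 'ABC', disease):.2f}",
          f"ABD/ABC {ratio('ABD', 'ABC', disease):.2f}")
```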
The 145, 169, and 44 sequences of types ABC, ABD, and ABCC, respectively, from strains with disease information were used for phylogenetic analysis with ClustalW (http://align.genome.jp/). The resulting trees are shown in Table S9, S10 and S11 in the supplementary material. The phylogenetic analysis did not reveal any association between a particular disease and a specific CagA sequence.
Conclusion
In this study, 560 unique CagA sequences containing EPIYA-like motifs were analyzed. In addition to the four previously reported major CagA segment types (A, B, C, and D), we found various novel types. Our results allow a clearer classification of CagA protein sequences and provide a basis for further molecular studies of the pathogenicity of this important protein. In addition, we confirmed that strains with two copies of EPIYA segment C are more likely to be associated with gastric cancer than those with one copy. However, we did not find any association between a particular disease and specific CagA sequences through phylogenetic tree analysis. Further studies with larger numbers of sequences might be necessary to determine whether specific CagA sequences are involved in the development of clinical outcomes.
Materials and Methods
Data Collection
Three databases, NCBI (National Center for Biotechnology Information, U.S. National Library of Medicine, www.ncbi.nlm.nih.gov), UniProtKB/Swiss-Prot (the Swiss Institute for Bioinformatics and the European Bioinformatics Institute, www.ebi.ac.uk/swissprot/), and DDBJ (DNA Data Bank of Japan, the National Institute of Genetics, www.ddbj.nig.ac.jp/), were used to obtain CagA sequence data. As of Apr 16, 2007, 1,423 entries were retrieved by searching “protein” at NCBI for “Helicobacter pylori CagA” with the display format “GenPept (Full)”. All related data were saved to a local disk. A further 1,034 entries were retrieved by searching the library “UniProtKB/Swiss-Prot & UniProtKB/TrEMBL” at Swiss-Prot for “Helicobacter pylori CagA”, and the related data were downloaded in “Flat File Format”. Similarly, 2,077 entries were retrieved by searching “protein” at DDBJ for “Helicobacter pylori CagA”. By choosing “Complete entries”, the data were saved as ASCII text on a local disk. The data from DDBJ include the data from NCBI and UniProtKB/Swiss-Prot. We found that the sequences from NCBI included all sequences from UniProtKB/Swiss-Prot and DDBJ; therefore, only the NCBI data were used for sequence analyses. We collected clinical information for 433 strains related to H. pylori CagA. The information is from our own database (from Y.Y.), the NCBI database, and the literature [53], [54], [18], [19].
Data Filtering
EPIYA motifs are located in the C-terminal region of the CagA protein. The 1,423 entries annotated as CagA in NCBI were downloaded from GenBank. Two rounds of data filtering were used to refine the data obtained from NCBI: (1) removing 832 sequences not containing EPIYA or EPIYA-like motifs (Table S12) and (2) removing 31 redundant sequences (Table S13). Of these 31 sequences, 18 are identical to other sequences and 13 are fragments of other sequences. After the two rounds of filtering, 560 unique CagAs containing EPIYA or EPIYA-like motifs remained (Table S3).
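The two filtering rounds can be sketched as follows. This is an illustrative Python outline, not the authors' MATLAB code: the function name, the accession-to-sequence dictionary, and the motif predicate passed in by the caller are all assumptions. Sequences are processed from longest to shortest so that both exact duplicates and fragments are detected with a single substring test.

```python
def filter_cagA(entries, has_motif):
    """entries: dict mapping accession -> protein sequence.
    Round 1 drops sequences without an EPIYA-like motif.
    Round 2 drops sequences identical to, or contained in, a kept sequence."""
    with_motif = {acc: seq for acc, seq in entries.items() if has_motif(seq)}
    kept = {}
    # Longest first: a fragment is always compared with longer sequences.
    for acc, seq in sorted(with_motif.items(), key=lambda item: -len(item[1])):
        if any(seq in other for other in kept.values()):
            continue  # duplicate or fragment of a sequence already kept
        kept[acc] = seq
    return kept

# Toy example with made-up accessions and sequences.
toy = {
    "s1": "AAEPIYAQQ",   # kept
    "s2": "AAEPIYAQQ",   # identical to s1
    "s3": "EPIYAQ",      # fragment of s1
    "s4": "MKLVNN",      # no motif
    "s5": "GGESIYTKK",   # kept
}
unique = filter_cagA(toy, lambda s: "EPIYA" in s or "ESIYT" in s)

# Bookkeeping reported in the text: 1,423 - 832 - 31 = 560.
remaining = 1423 - 832 - 31
```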
Statistical Analyses
The chi-square test was used to assess the statistical significance of the difference in the numbers of strains of sequence types ABCC and ABC among disease groups GC, PU, and G. From Table 7, 17 and 22 strains of types ABCC and ABC, respectively, appear in the GC group, and 8 and 42 strains of types ABCC and ABC, respectively, appear in the PU group. The chi-square calculated from this 2×2 matrix (http://math.hws.edu/javamath/ryan/ChiSquare.html) is 8.24. Similarly, 17 and 22 strains of types ABCC and ABC appear in the GC group, and 18 and 65 strains of types ABCC and ABC appear in the G group. The chi-square calculated from this 2×2 matrix is 6.22. From a chi-square table, the probabilities of the null hypothesis are then less than 0.03 and 0.01, respectively, with df = 1 (df: degrees of freedom).
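For readers who wish to reproduce the statistic, the sketch below applies the standard Pearson chi-square formula for a 2×2 table to the counts from Table 7. Whether the online calculator applied a continuity correction is not stated, so the uncorrected form used here is an assumption; with it, the two comparisons give approximately 8.26 and 6.22, close to the reported values. The function name is ours.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],
         [c, d]]
    computed as n * (ad - bc)^2 / (row1 * row2 * col1 * col2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: disease groups; columns: ABCC, ABC (counts from Table 7).
chi2_gc_vs_pu = chi_square_2x2(17, 22, 8, 42)
chi2_gc_vs_g = chi_square_2x2(17, 22, 18, 65)

CRITICAL_0_05_DF1 = 3.841  # chi-square critical value for df = 1, alpha = 0.05

for label, value in (("GC vs PU", chi2_gc_vs_pu), ("GC vs G", chi2_gc_vs_g)):
    verdict = "significant" if value > CRITICAL_0_05_DF1 else "not significant"
    print(f"{label}: chi-square = {value:.2f} ({verdict} at 0.05)")
```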
Software for Data Analysis
In-house programs written in MATLAB were used to extract information from the original data retrieved from NCBI, search the sequences, sort the sequences according to disease, create files in FASTA format, etc. BioEdit and WebLogo were used to align and display protein sequences [55], [56]. ClustalW (http://align.genome.jp/) and TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html) were used to build and view phylogenetic trees.
Supporting Information
Acknowledgments
The authors thank Dr. Tongbin Li at the University of Minnesota for his helpful suggestions in numerous discussions.
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: This research was in part supported by grants from NIH (R42 GM067364 to XG and RO1 DK62813 to YY) and the Robert A. Welch Foundation (E-1027 to XG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Suerbaum S, Michetti P. Helicobacter pylori infection. N Engl J Med. 2002;347:1175–1186. doi: 10.1056/NEJMra020542. [DOI] [PubMed] [Google Scholar]
- 2.Ferreira AC, Isomoto H, Moriyama M, Fujioka T, Machado JC, et al. Helicobacter and gastric malignancies. 2008;Helicobacter(Suppl 1):28–34. doi: 10.1111/j.1523-5378.2008.00633.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Franco AT, Johnston E, Krishna U, Yamaoka Y, Israel DA, et al. Regulation of gastric carcinogenesis by Helicobacter pylori virulence factors. Cancer Res. 2008;68:379–87. doi: 10.1158/0008-5472.CAN-07-0824. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Hatakeyama M. Helicobacter pylori and gastric carcinogenesis. J Gastroenterol. 2008;44:239–48. doi: 10.1007/s00535-009-0014-1. [DOI] [PubMed] [Google Scholar]
- 5.Rizwan M, Alvi A, Ahmed N. Novel protein antigen (JHP940) from the genomic plasticity region of Helicobacter pylori induces tumor necrosis factor alpha and interleukin-8 secretion by human macrophages. J Bacteriol. 2008;190:1146–51. doi: 10.1128/JB.01309-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Snider JL, Cardelli JA. Helicobacter pylori induces cancer cell motility independent of the c-Met receptor. J Carcinog. 2009;8:7. doi: 10.4103/1477-3163.50892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Umeda M, Murata-Kamiya N, Saito Y, Ohba Y, Takahashi M, et al. Helicobacter pylori CagA causes mitotic impairment and induces chromosomal instability. J Biol Chem. 2009 Jun 22. [Epub ahead of print] 2009 doi: 10.1074/jbc.M109.035766. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Blaser MJ, Perez-Perez GI, Kleanthous H, Cover TL, Peek RM, et al. Infection with Helicobacter pylori strains possessing cagA is associated with an increased risk of developing adenocarcinoma of the stomach. Cancer Res. 1995;55:2111–2115.
- 9. Kuipers EJ, Perez-Perez GI, Meuwissen SGM, Blaser MJ. Helicobacter pylori and atrophic gastritis: importance of the cagA status. J Natl Cancer Inst. 1995;87:1777–1780. doi: 10.1093/jnci/87.23.1777.
- 10. Nomura AM, Lee J, Stemmermann GN, Nomura RY, Perez-Perez GI, et al. Helicobacter pylori CagA seropositivity and gastric carcinoma risk in a Japanese American population. J Infect Dis. 2002;186:1138–1144. doi: 10.1086/343808.
- 11. Parsonnet J, Friedman GD, Orentreich N, Vogelman H. Risk for gastric cancer in people with CagA positive or CagA negative Helicobacter pylori infection. Gut. 1997;40:297–301. doi: 10.1136/gut.40.3.297.
- 12. Covacci A, Censini S, Bugnoli M, Petracca R, Burroni D, et al. Molecular characterization of the 128-kDa immunodominant antigen of Helicobacter pylori associated with cytotoxicity and duodenal ulcer. Proc Natl Acad Sci USA. 1993;90:5791–5795. doi: 10.1073/pnas.90.12.5791.
- 13. Tummuru MK, Cover TL, Blaser MJ. Cloning and expression of a high-molecular-mass major antigen of Helicobacter pylori: evidence of linkage to cytotoxin production. Infect Immun. 1993;61:1799–1809. doi: 10.1128/iai.61.5.1799-1809.1993.
- 14. Yamaoka Y, Kodama T, Kashima K, Graham DY, Sepulveda AR. Variants of the 3′ region of the cagA gene in Helicobacter pylori isolates from patients with different H. pylori-associated diseases. J Clin Microbiol. 1998;36:2258–2263. doi: 10.1128/jcm.36.8.2258-2263.1998.
- 15. Yamaoka Y, El-Zimaity HM, Gutierrez O, Figura N, Kim JG, et al. Relationship between the cagA 3′ repeat region of Helicobacter pylori, gastric histology, and susceptibility to low pH. Gastroenterology. 1999;117:342–349. doi: 10.1053/gast.1999.0029900342.
- 16. Yamaoka Y, Osato MS, Sepulveda AR, Gutierrez O, Figura N, et al. Molecular epidemiology of Helicobacter pylori: separation of H. pylori from East Asian and non-Asian countries. Epidemiol Infect. 2000;124:91–96. doi: 10.1017/s0950268899003209.
- 17. Higashi H, Tsutsumi R, Fujita A, Yamazaki S, Asaka M, et al. Biological activity of the Helicobacter pylori virulence factor CagA is determined by variation in the tyrosine phosphorylation sites. Proc Natl Acad Sci USA. 2002;99:14428–14433. doi: 10.1073/pnas.222375399.
- 18. Satomi S, Yamakawa A, Matsunaga S, Masaki R, Inagaki T, et al. Relationship between the diversity of the cagA gene of Helicobacter pylori and gastric cancer in Okinawa, Japan. J Gastroenterol. 2006;41:668–673. doi: 10.1007/s00535-006-1838-6.
- 19. Yamazaki S, Yamakawa A, Okuda T, Ohtani M, Suto H, et al. Distinct diversity of vacA, cagA, and cagE genes of Helicobacter pylori associated with peptic ulcer in Japan. J Clin Microbiol. 2005;43:3906–3916. doi: 10.1128/JCM.43.8.3906-3916.2005.
- 20. Hatakeyama M. Oncogenic mechanisms of the Helicobacter pylori CagA protein. Nat Rev Cancer. 2004;4:688–694. doi: 10.1038/nrc1433.
- 21. Higashi H, Yokoyama K, Fujii Y, Ren S, Yuasa H, et al. EPIYA motif is a membrane-targeting signal of Helicobacter pylori virulence factor CagA in mammalian cells. J Biol Chem. 2005;280:23130–23137. doi: 10.1074/jbc.M503583200.
- 22. Hatakeyama M. Helicobacter pylori CagA – a bacterial intruder conspiring gastric carcinogenesis. Int J Cancer. 2006;119:1217–1223. doi: 10.1002/ijc.21831.
- 23. Naito M, Yamazaki T, Tsutsumi R, Higashi H, Onoe K, et al. Influence of EPIYA-repeat polymorphism on the phosphorylation-dependent biological activity of Helicobacter pylori CagA. Gastroenterology. 2006;130:1181–1190. doi: 10.1053/j.gastro.2005.12.038.
- 24. Backert S, Moese S, Selbach M, Brinkmann V, Meyer TF. Phosphorylation of tyrosine 972 of the Helicobacter pylori CagA protein is essential for induction of a scattering phenotype in gastric epithelial cells. Mol Microbiol. 2001;42:631–644. doi: 10.1046/j.1365-2958.2001.02649.x.
- 25. Censini S, Lange C, Xiang Z, Crabtree JE, Ghiara P, et al. cag, a pathogenicity island of Helicobacter pylori, encodes type I-specific and disease-associated virulence factors. Proc Natl Acad Sci USA. 1996;93:14648–14653. doi: 10.1073/pnas.93.25.14648.
- 26. Asahi M, Azuma T, Ito S, Ito Y, Suto H, et al. Helicobacter pylori CagA protein can be tyrosine phosphorylated in gastric epithelial cells. J Exp Med. 2000;191:593–602. doi: 10.1084/jem.191.4.593.
- 27. Backert S, Ziska E, Brinkmann V, Zimny-Arndt U, Fauconnier A, et al. Translocation of the Helicobacter pylori CagA protein in gastric epithelial cells by a type IV secretion apparatus. Cell Microbiol. 2000;2:155–164. doi: 10.1046/j.1462-5822.2000.00043.x.
- 28. Odenbreit S, Püls J, Sedlmaier B, Gerland E, Fischer W, Haas R. Translocation of Helicobacter pylori CagA into gastric epithelial cells by type IV secretion. Science. 2000;287:1497–1500. doi: 10.1126/science.287.5457.1497.
- 29. Segal ED, Cha J, Lo J, Falkow S, Tompkins LS, et al. Altered states: involvement of phosphorylated CagA in the induction of host cellular growth changes by Helicobacter pylori. Proc Natl Acad Sci USA. 1999;96:14559–14564. doi: 10.1073/pnas.96.25.14559.
- 30. Stein M, Rappuoli R, Covacci A. Tyrosine phosphorylation of the Helicobacter pylori CagA antigen after cag-driven host cell translocation. Proc Natl Acad Sci USA. 2000;97:1263–1268. doi: 10.1073/pnas.97.3.1263.
- 31. Higashi H, Tsutsumi R, Muto S, Sugiyama T, Azuma T, et al. SHP-2 tyrosine phosphatase as an intracellular target of Helicobacter pylori CagA protein. Science. 2002;295:683–686. doi: 10.1126/science.1067147.
- 32. Backert S, Moese S, Selbach M, Brinkmann V, Meyer TF. Phosphorylation of tyrosine 972 of the Helicobacter pylori CagA protein is essential for induction of a scattering phenotype in gastric epithelial cells. Mol Microbiol. 2001;42:631–644. doi: 10.1046/j.1365-2958.2001.02649.x.
- 33. Tammer I, Brandt S, Hartig R, König K, Backert S. Activation of Abl by Helicobacter pylori: a novel kinase for CagA and crucial mediator of host cell scattering. Gastroenterology. 2007;132:1309–1319. doi: 10.1053/j.gastro.2007.01.050.
- 34. Selbach M, Paul FE, Brandt S, Guye P, Daumke O, et al. Host cell interactome of tyrosine-phosphorylated bacterial proteins. Cell Host & Microbe. 2009;5:397–403. doi: 10.1016/j.chom.2009.03.004.
- 35. Judd LM, Alderman BM, Howlett M, Shulkes A, Dow C, et al. Gastric cancer development in mice lacking the SHP2 binding site on the IL-6 family co-receptor gp130. Gastroenterology. 2004;126:196–207. doi: 10.1053/j.gastro.2003.10.066.
- 36. Tebbutt NC, Giraud AS, Inglese M, Jenkins B, Waring P, et al. Reciprocal regulation of gastrointestinal homeostasis by SHP2 and STAT-mediated trefoil gene activation in gp130 mutant mice. Nat Med. 2002;8:1089–1097. doi: 10.1038/nm763.
- 37. Argent RH, Hale JL, El-Omar EM, Atherton JC, et al. Differences in Helicobacter pylori CagA tyrosine phosphorylation motif patterns between Western and East Asian strains, and influences on interleukin-8 secretion. J Med Microbiol. 2008;57:1062–1067. doi: 10.1099/jmm.0.2008/001818-0.
- 38. Argent RH, Kidd M, Owen RJ, Thomas RJ, Limb MC, et al. Determinants and consequences of different levels of CagA phosphorylation for clinical isolates of Helicobacter pylori. Gastroenterology. 2004;127:514–523. doi: 10.1053/j.gastro.2004.06.006.
- 39. Azuma T, Yamakawa A, Yamazaki S, Fukuta K, Ohtani M, et al. Correlation between variation of the 3′ region of the cagA gene in Helicobacter pylori and disease outcome in Japan. J Infect Dis. 2002;186:1621–1630. doi: 10.1086/345374.
- 40. Kidd M, Lastovica AJ, Atherton JC, Louw JA, et al. Heterogeneity in the Helicobacter pylori vacA and cagA genes: association with gastroduodenal disease in South Africa? Gut. 1999;45:499–502. doi: 10.1136/gut.45.4.499.
- 41. Rota CA, Pereira-Lima JC, Blaya C, Nardi NB. Consensus and variable region PCR analysis of Helicobacter pylori 3′ region of cagA gene in isolates from individuals with or without peptic ulcer. J Clin Microbiol. 2001;39:606–612. doi: 10.1128/JCM.39.2.606-612.2001.
- 42. Jenks PJ, Mégraud F, Labigne A. Clinical outcome after infection with Helicobacter pylori does not appear to be reliably predicted by the presence of any of the genes of the cag pathogenicity island. Gut. 1998;43:752–758. doi: 10.1136/gut.43.6.752.
- 43. Zhou J, Zhang J, Xu C, He L. cagA genotype and variants in Chinese Helicobacter pylori strains and relationship to gastroduodenal diseases. J Med Microbiol. 2004;53:231–235. doi: 10.1099/jmm.0.05366-0.
- 44. Uchida T, Kanada R, Tsukamoto Y, Hijiya N, Matsuura K, et al. Immunohistochemical diagnosis of the cagA-gene genotype of Helicobacter pylori with anti-East Asian CagA-specific antibody. Cancer Sci. 2007;98:521–528. doi: 10.1111/j.1349-7006.2007.00415.x.
- 45. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. Traces of human migrations in Helicobacter pylori populations. Science. 2003;299:1582–1585. doi: 10.1126/science.1080857.
- 46. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. Gain and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet. 2005;1:e43. doi: 10.1371/journal.pgen.0010043.
- 47. Linz B, Balloux F, Moodley Y, Manica A, Liu H, et al. An African origin for the intimate association between humans and Helicobacter pylori. Nature. 2007;445:915–918. doi: 10.1038/nature05562.
- 48. Yamaoka Y, Orito E, Mizokami M, Gutierrez O, Saitou N, et al. Helicobacter pylori in North and South America before Columbus. FEBS Lett. 2002;517:180–184. doi: 10.1016/s0014-5793(02)02617-0.
- 49. Yamaoka Y, Kato M, Asaka M. Geographic differences in gastric cancer incidence can be explained by differences between Helicobacter pylori strains. Intern Med. 2008;47:1077–1083. doi: 10.2169/internalmedicine.47.0975.
- 50. Moodley Y, Linz B, Yamaoka Y, Windsor HM, Breurec S, et al. The peopling of the Pacific from a bacterial perspective. Science. 2009;323:527–530. doi: 10.1126/science.1166083.
- 51. Kersulyte D, Mukhopadhyay AK, Velapatiño B, Su W, Pan Z, et al. Differences in genotypes of Helicobacter pylori from different human populations. J Bacteriol. 2000;182:3210–3218. doi: 10.1128/jb.182.11.3210-3218.2000.
- 52. Uemura N, Okamoto S, Yamamoto S, Matsumura N, Yamaguchi S, et al. Helicobacter pylori infection and the development of gastric cancer. N Engl J Med. 2001;345:784–789. doi: 10.1056/NEJMoa001999.
- 53. Azuma T, Yamakawa A, Yamazaki S, Ohtani M, Ito Y, et al. Distinct diversity of the cag pathogenicity island among Helicobacter pylori strains in Japan. J Clin Microbiol. 2004;42:2508–2517. doi: 10.1128/JCM.42.6.2508-2517.2004.
- 54. Hoshino FB, Katayama K, Watanabe K, Takahashi S, Uchimura H, et al. Heterogeneity found in the cagA gene of Helicobacter pylori from Japanese and non-Japanese isolates. J Gastroenterol. 2000;35:890–897. doi: 10.1007/s005350070002.
- 55. Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–98.
- 56. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004.