Peptide Vocabulary Analysis Reveals Ultra-Conservation and Homonymity in Protein Sequences

Derek Gatherer

2009 Nov 24;1:101–126. doi: 10.4137/bbi.s415

PMCID: PMC2789693 PMID: 20066129

Abstract

A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%–70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.

Keywords: peptide vocabulary, vocabulary analysis, word detection, motif, protein structure, bioinformatics, gene families, genome signature, peptide conservation, peptide homonymity

Introduction

First used at least as early as the beginning of the 1970s, the concept of “the language of the genes” has become a recurring explanatory tool in popular presentations of molecular genetics (Chargaff, 1971; Jones, 1993). Genomes may be compared to libraries of genetic information, with each chromosome as a book, genes as chapters, and DNA bases as the letters in which the text is written (Ridley, 1999). In principle, the linguistic analogy may be applied equally to protein sequences as to DNA, simply by increasing the alphabet from 4 to 20 letters. The prevalence, and utility, of this metaphor in undergraduate teaching and the popular science media, obscures a deeper controversy concerning its genuine applicability in research (Searls, 1993; Ji, 1999; Searls, 2002; Sakakibara, 2005). Attempts have been made to apply generative grammar structures to gene organization in bacteria (Collado-Vides, 1991, 1992, 1996), DNA-protein interaction (Bentolila, 1996; Wang et al. 2005), the problem of gene prediction (Dong and Searls, 1994; Muggleton et al. 2001), protein folding (Gimona, 2006) and RNA structure prediction (Matsui et al. 2004). These efforts in molecular biology are in the tradition of wider attempts to create formal grammars, or to use the grammatical metaphor, for other kinds of biological data (Gutfreund, 1976; Jerne, 1985; Hamilton, 1993; Wang, 2004). A related metaphor is that of genome sequence as a code to be deciphered by the molecular biologist, who thus becomes a “biomolecular cryptologist” (Konopka, 1994; Bodnar et al. 1997). Conversely, techniques developed in molecular biology are now being recycled back into cryptography (Spencer et al. 2004).

Under the terms of these general analogies, short sequences of DNA may be regarded as words. Often, any k-mer is referred to as a word (Mantegna et al. 1994; Chatzidimitriou-Dreismann et al. 1996) but here these will be designated strings. Where a string has some local functional significance in a sequence and consequently has been conserved throughout the evolutionary process, it may be referred to as a motif (Waterman, 1989; Hu et al. 2000). Identification of motifs is usually based on large-scale comparative analysis and alignment of related sequences.

Counts of DNA string frequency have been used as a means of differentiating classes of DNA sequence, such as exons, introns and promoters (Beckmann et al. 1986; Solovyev and Lawrence, 1993; Solovyev et al. 1994b, 1994a; Bains, 1997; Frontali and Pizzi, 1999; Bultrini et al. 2003), although the meaning of such differences in terms of the linguistic metaphor of the genome has been disputed (Konopka and Martindale, 1995; Chatzidimitriou-Dreismann et al. 1996; Martindale and Konopka, 1996; Tsonis et al. 1997). String counts, after correction for underlying base composition, have been assembled into vectors known as genome signatures, reflecting their apparent distinctiveness between genomes (Karlin and Mrázek, 1997; Karlin et al. 1997; Karlin, 1998; Karlin et al. 1998; Campbell et al. 1999). Such composition-corrected string frequency vectors have proved useful in detecting horizontal gene transfer events between species of bacteria (Karlin, 2001). A further development based on genome signatures is that of compositional spectra, designed to reduce vector size and increase technical tractability (Bolshoy, 2003; Kirzhner et al. 2003).

This paper investigates the meaning of the linguistic metaphor in more detail in protein sequences, with particular emphasis on the identification of words. A protein word, rather than a string, is here taken to be more literally comparable to a word within a text of human origin. Therefore, words are only a subset of strings. Likewise, a word differs from a motif, in that motifs are often fuzzy (meaning tolerant to substitution) and are best viewed in the context of an alignment of related sequences. Within a text of human origin, a word has some context-independence. It has clear boundaries and may appear flanked by very different text in different cases. Fuzziness is also not tolerated; a word has a correct spelling. The total assembly of detected words is referred to as the vocabulary, and the word detection process as vocabulary analysis.

The pioneering vocabulary analysis in DNA sequences was carried out by Brendel et al. (1986). Their metric was based on contrasting frequencies of substrings within the candidate word. For a string, s, of length k, its expected occurrence, E, is the product of the occurrences of its left and right substrings, divided by the occurrence of its internal substring.

E(s_{1} \dots s_{k}) = \frac{f(s_{1} \dots s_{k-1}) \, f(s_{2} \dots s_{k})}{f(s_{2} \dots s_{k-1})}

For each string, s, the difference between its expected occurrence, E(s) as calculated above, and actual occurrence, f(s), is quantified by:

\mathrm{std}(s) = \frac{f(s) - E(s)}{\max\{\sqrt{E(s)},\, 1\}}

This provides a z-score for the actual occurrence of the string. Brendel et al. (1986) define a contrast word as any string where std(s) ≥ 3. Brendel et al. (1986) were able to identify several contrast words of lengths k = 3 to 6 in the genomes of E. coli and two coliphages. Conversely, avoided words could also be detected, where std(s) ≤ −3. An essentially similar metric has been implemented by others (Phillips et al. 1987a, 1987b; Merkl et al. 1992; Colosimo et al. 1993; Castrignanò et al. 1997; Rocha et al. 1998; Apostolico et al. 2003).
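The contrast-word score can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the implementation of Brendel et al. (1986) nor the Perl scripts used in this paper: the function names (count_strings, expected, std, contrast_words) are hypothetical, and the score is only evaluated for strings of length k ≥ 3, where the internal substring is non-empty.

```python
from collections import Counter
from math import sqrt


def count_strings(text, max_k):
    """Count every overlapping substring of length 1..max_k."""
    counts = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return counts


def expected(s, f):
    """E(s) = f(left substring) * f(right substring) / f(internal substring)."""
    left, right, middle = s[:-1], s[1:], s[1:-1]
    if f[middle] == 0:  # only happens for k < 3, where the middle is empty
        return 0.0
    return f[left] * f[right] / f[middle]


def std(s, f):
    """Contrast score: (f(s) - E(s)) / max(sqrt(E(s)), 1)."""
    e = expected(s, f)
    return (f[s] - e) / max(sqrt(e), 1.0)


def contrast_words(text, min_k=3, max_k=6, threshold=3.0):
    """Return {string: score} for strings whose score reaches the threshold."""
    f = count_strings(text, max_k)
    scores = {s: std(s, f) for s in f if min_k <= len(s) <= max_k}
    return {s: z for s, z in scores.items() if z >= threshold}
```

On the toy string ABCXBY, f(AB) = f(BC) = 1 and f(B) = 2, so E(ABC) = 0.5 and std(ABC) = 0.5. Avoided words would be obtained in the same way by testing for scores at or below −3.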

In principle, this method could also be applied to detect contrast words in protein sequences, but the combinatorial explosion caused by the presence of a 20-letter code in proteins as opposed to the 4-letter code in DNA, has restricted work on string frequency in proteins to k = 2 (i.e. dipeptides) only (Solovyev and Makarova, 1993). Application of the contrast words method to human texts was extended by Schmitt et al. (1996). Analysing Alice in Wonderland, they found that it performed relatively poorly, essentially due to the fact that the 26-letter alphabet of a text in English has a string combinatorial explosion problem even worse than that of 20-letter protein sequences.

This paper proposes improvements on the contrast words method, initially comparing their performance, in the tradition of Schmitt et al. (1996), on Alice in Wonderland. The most accurate method for identifying true words is then applied to several other human texts, to the NRL3D set of proteins of solved structure, and to the proteome sets of several species from all three superkingdoms (NCBI Taxonomy Browser classification) of cellular organisms.

The concept of synonymity is familiar in molecular biology. Within the degenerate genetic code, many amino acids may be encoded by more than one codon. A protein sequence may therefore be potentially coded by a combinatorially vast number of synonymous DNA sequences. Here the term is used in a more general sense. When two protein strings have different sequences, but perform the same function in their respective proteins, they are said to be functionally synonymous. Fuzzy motifs are an example of functional synonymity within protein families. The converse concept, that of homonymity, has not been explored (although see Lennon and Nussinov, 1984). Where a non-fuzzy word occurs in two different proteins and performs a different function in each, that peptide word is functionally homonymous. At a trivial level, it is immediately possible to see that the longer a peptide, the less likelihood it has of functional homonymity. The questions of the longest existing homonymous word, the prevalence of peptide homonymity, and its origins are all explored in this paper.

Methods

Texts and protein sequence sources

Public domain texts were downloaded from Project Gutenberg (http://www.gutenberg.org). Punctuation, non-alphabetic characters, numbers and spaces were removed. Word counts were case-insensitive.
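The clean-up step can be sketched as follows. This is a minimal Python sketch, assuming plain ASCII English text; the function name clean_text is hypothetical and is not taken from the Perl scripts used in this paper.

```python
def clean_text(raw):
    """Upper-case the text and keep only the letters A-Z.

    Spaces, punctuation and digits are dropped, so the result is one
    unbroken string with no word boundaries.
    """
    return "".join(ch for ch in raw.upper() if "A" <= ch <= "Z")


if __name__ == "__main__":
    print(clean_text("Alice was beginning, 1 day!"))
```

Because spaces are removed, word boundaries are lost, which is why strings spanning two words, such as SAIDTHE, can be counted alongside true words.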

The NRL3D set of sequences of proteins of solved structure (Pattabiraman et al. 1990) was downloaded from the University of Hong Kong (http://bioinfo.hku.hk/db/nrl_3d/NRL3D/nrl_3d.seq). Non-contiguous sequences (those annotated as “fragments”), sequences containing ambiguities and exact duplicates were removed using a Perl script. This reduces the number of sequences from 23301 to 6168. Further trimmings were performed using CD-HIT (Li and Godzik, 2006), which can produce datasets with maximum degrees of pairwise identity. Such reduced sets are subsequently referred to as NRL3D_nn, where nn is the maximum pairwise identity. The justification for this trimming is that most words will occur in closely related sequences, and will consequently be explicable at a trivial level. Trimming with CD-HIT reduces the number of words detected and maximises the likelihood that they will be found in less closely related proteins, and thereby be potentially more interesting from a functional point of view. As a negative control, trimmed NRL3D data sets were shuffled using shuffleseq (http://emboss.sourceforge.net/apps/release/4.0/emboss/apps/shuffleseq.html) from EMBOSS (Rice et al. 2000).

Proteomes (meaning predicted protein sets derived from genome projects) were downloaded from the EBI Integr8 database (http://www.ebi.ac.uk/integr8). They were similarly reduced by CD-HIT.

Vocabulary analysis algorithms

For each text or proteome, and for NRL3D, overlapping strings of all lengths from k = 1 to 20 were counted using a Perl script running the BioPerl (Stajich et al. 2002) SeqWords module (http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Tools/SeqWords.html). The SeqWords output was then analysed in the following ways. Each metric is given an acronym for easier reference.

1). CW: Contrast words method (see Introduction)

This is the method of Brendel et al. (1986), except that the std(s) threshold was lowered from 3 to 0.1 to maximise the number of candidate words.

2). RS: Raw strings

The simplest possible method: all strings of length k ≥ 5, occurring at n ≥ 20, were assessed as candidate words.
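A minimal Python sketch of the raw strings selection is given below. It is an illustration only: the counting in this paper was done with the BioPerl SeqWords module, and the function name raw_strings and its parameter names are hypothetical.

```python
from collections import Counter


def raw_strings(text, min_k=5, max_k=20, min_n=20):
    """Return {string: occurrences} for overlapping strings passing the RS cut-offs.

    Every window of length min_k..max_k is counted, then strings seen
    fewer than min_n times are discarded.
    """
    counts = Counter(
        text[i:i + k]
        for k in range(min_k, max_k + 1)
        for i in range(len(text) - k + 1)
    )
    return {s: n for s, n in counts.items() if n >= min_n}
```

For example, raw_strings("ALICEALICEALICEX", min_k=5, max_k=5, min_n=3) keeps only ALICE, which occurs three times; every other 5-letter window occurs once or twice.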

3). ES: Equal substrings

The raw strings extracted as above were trimmed to include only those having equal occurrences of left and right substrings.

f(s_{1} \dots s_{k-1}) = f(s_{2} \dots s_{k})

The rationale for this approach is that many true words tend to satisfy this criterion. For instance, in Alice in Wonderland, the true word ALICE is revealed by:

f(\mathrm{ALIC}) = f(\mathrm{LICE})

owing to the fact that Alice in Wonderland, despite referring to several species, does not mention lice.
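The equal substrings test can be sketched in Python as follows. This is a hedged illustration rather than the author's code: the helper names (count_strings, has_equal_substrings, equal_substring_words) are hypothetical, and the cut-offs are simply passed in as parameters.

```python
from collections import Counter


def count_strings(text, max_k):
    """Count every overlapping substring of length 1..max_k."""
    counts = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return counts


def has_equal_substrings(s, f):
    """ES test: the left and right (k-1)-substrings of s occur equally often."""
    return f[s[:-1]] == f[s[1:]]


def equal_substring_words(text, min_k=5, max_k=20, min_n=20):
    """Raw strings (length and occurrence cut-offs) that also pass the ES test."""
    f = count_strings(text, max_k)
    return sorted(
        s for s, n in f.items()
        if min_k <= len(s) <= max_k and n >= min_n and has_equal_substrings(s, f)
    )
```

In the toy text SAIDALICETOALICE, ALIC and LICE both occur twice, so ALICE passes. In ALICEHASLICE, LICE occurs twice but ALIC only once, so ALICE fails the test.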

4). CW-ESM: Equal substrings of middle substring of contrast words

Combining methods 1 and 3, middle substrings were extracted from contrast words with std(s) ≥ 0.1. These were then examined for equal substrings:

f (s_{2} \dots s_{k - 2}) = f (s_{3} \dots s_{k - 1})

The rationale for this approach is the ad hoc empirical observation that false positive contrast words, of which there are many (Schmitt et al. 1996), frequently have true words embedded within them as middle substrings.

5). RS-ESM: Equal substrings of middle substring of raw strings

Combining methods 2 and 3, since equality of substrings within the middle strings of contrast words was frequently found to be an indicator of a true word, the same was applied to raw strings. The additional proviso was that the left and right substrings of the raw string were not of equal occurrence to each other or the middle substring.

\begin{array}{l} f(s_{2} \dots s_{k-2}) = f(s_{3} \dots s_{k-1}) \\ \text{and} \\ f(s_{1} \dots s_{k-1}) \neq f(s_{2} \dots s_{k}) \\ \text{and} \\ f(s_{1} \dots s_{k-1}) \neq f(s_{2} \dots s_{k-1}) \\ \text{and} \\ f(s_{2} \dots s_{k}) \neq f(s_{2} \dots s_{k-1}) \end{array}

The rationale for this was that, for instance, within the raw string DALICET, the true word ALICE is revealed by:

\begin{array}{l} f(\mathrm{ALIC}) = f(\mathrm{LICE}) \\ \text{and} \\ f(\mathrm{DALICE}) \neq f(\mathrm{ALICET}) \\ \text{and} \\ f(\mathrm{DALICE}) \neq f(\mathrm{ALICE}) \\ \text{and} \\ f(\mathrm{ALICET}) \neq f(\mathrm{ALICE}) \end{array}

CW-ESM and RS-ESM are equivalent, except that CW-ESM takes contrast words as its starting point, and RS-ESM uses raw strings. In both cases the candidate word is the middle substring, should it satisfy the criteria given.
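The middle-substring test can be sketched in Python as below, following the DALICET example. This is an illustrative sketch under stated assumptions, not the author's Perl code: the names count_strings and rs_esm_candidate are hypothetical, and the starting string must have at least 4 letters so that the middle substring has non-empty left and right substrings. For CW-ESM, the first (equality) test would be applied to the middle substrings of contrast words instead of raw strings.

```python
from collections import Counter


def count_strings(text, max_k):
    """Count every overlapping substring of length 1..max_k."""
    counts = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return counts


def rs_esm_candidate(raw, f):
    """Return the middle substring of raw if it passes the RS-ESM tests, else None."""
    if len(raw) < 4:
        return None
    left, right, middle = raw[:-1], raw[1:], raw[1:-1]
    if (
        f[middle[:-1]] == f[middle[1:]]  # e.g. f(ALIC) == f(LICE)
        and f[left] != f[right]          # e.g. f(DALICE) != f(ALICET)
        and f[left] != f[middle]         # e.g. f(DALICE) != f(ALICE)
        and f[right] != f[middle]        # e.g. f(ALICET) != f(ALICE)
    ):
        return middle
    return None
```

In the toy text DALICETDALICEXALICETYALICETZALICEQ, ALIC and LICE each occur 5 times, DALICE twice, ALICET three times and ALICE 5 times, so the raw string DALICET yields the candidate word ALICE.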

Measurement of accuracy

In human texts it is possible to score true words among the detected candidate words. Accuracy is measured using the Sen2 statistic (Milanesi and Rogozin, 1998):

\mathrm{Sen2} = \frac{TP}{TP + FP}

where TP are those candidate words identified as true positives, and FP are those identified as false positives.
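As a worked illustration, the statistic can be computed with the short Python sketch below. The function names sen2 and score_candidates are hypothetical, and scoring against a set of known true words is an assumption about how candidates would be checked, not a description of the author's scripts.

```python
def sen2(true_positives, false_positives):
    """Sen2 = TP / (TP + FP): the fraction of candidate words that are true words."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no candidate words to score")
    return true_positives / total


def score_candidates(candidates, true_words):
    """Score a list of candidate words against a set of known true words."""
    tp = sum(1 for word in candidates if word in true_words)
    return sen2(tp, len(candidates) - tp)
```

Assuming the number of hits equals TP + FP, a method returning 1312 hits of which 895 are true words scores 895/1312, or about 0.682.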

Perl scripts are available on request from the author.

Assessment of hits

Protein domains were determined by reference to Pfam (http://www.sanger.ac.uk/Software/Pfam; Finn et al. 2006) and Prosite motifs were detected using ScanProsite (http://www.expasy.ch/tools/scan-prosite; de Castro et al. 2006). Alignments were performed using ClustalW (Chenna et al. 2003) or bl2seq (http://www.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi; Tatusova and Madden, 1999).

Structural visualization

Solved protein structures were downloaded from PDB (http://www.pdb.org) and visualization was carried out in MOE (http://www.chemcomp.com).

Results

Vocabulary analysis in human texts

Alice in Wonderland is a short novel of 26587 words. The total vocabulary is 2593 different words, of which 1475 are used more than once and 1072 more than twice. For illustrative purposes, the 10 commonest words are shown in Table 1. As might be expected, these are all short function words (articles, conjunctions, prepositions and pronouns), except for the name “Alice”, which has 386 occurrences and is the 10th commonest word, and the past-tense verb “said”, with 462 occurrences.

Table 1.

Commonest 10 words in Alice in Wonderland, sorted by their occurrence, n.

Word	n
THE	1631
AND	865
TO	728
A	628
SHE	541
IT	530
OF	512
SAID	462
I	410
ALICE	386

Word	n
ALICE	397
SAIDT	266
AIDTH	224
SAIDTH	222
IDTHE	221
SAIDTHE	212
AIDTHE	212
ANDTH	169
THING	169
DALICE	162

Word	k	n	std(s)
OFTHE	5	68	5.18
LITTLE	6	44	4.18
ALICE	5	198	3.21
SHOULD	6	14	3.20
THEMARCHHARE	12	4	3.17
THEDORMOUSE	11	9	3.09
BEGIN	5	11	2.63
WHICH	5	8	2.51
MINUTE	6	21	2.50
VENTURE	7	10	2.50

Method	Hits	TP	Sen2
RS-ESM, k = 2–18, n ≥ 2	1312	895	0.682
CW-ESM, k = 4–18, n ≥ 2	673	388	0.577
ES, k = 5–20, n ≥ 20	241	61	0.253
RS, k = 3–20, n ≥ 10	2293	540	0.235
RS, k = 5–20, n ≥ 20	1213	206	0.170
CW, k = 7–20, n ≥ 4	1927	117	0.061

Word	k	n	n-L	n-R	n-M	std(s)
ROUGHTH	7	11	14	13	114	7.44
TOTHINK	7	7	7	7	43	5.49
TOFTHEW	7	10	14	25	156	5.18
OINTHED	7	9	21	10	109	5.10
AIDNOTH	7	6	6	6	34	4.80
POFTHEE	7	5	7	6	156	4.73
DIDTHEY	7	5	9	8	221	4.67
THECOUR	7	16	16	18	52	4.45
ESAIDTO	7	26	40	77	266	4.24
RLITTLET	8	7	15	14	128	4.18

Word	k	n	n-R	n-L	n-M
ALICE	5	397	397	397	401
LITTLE	6	128	128	128	128
LITTL	5	128	128	128	193
SAIDALICE	9	116	116	116	116
SAIDALIC	8	116	116	116	116
THOUGH	6	91	91	91	91
HERSELF	7	83	83	83	83
THEQUE	6	77	77	77	78
THEKING	7	62	62	62	62
HEKING	6	62	62	62	64

Word	Length	Protein family	No. of proteins
TCNVAHPASSTKVDKKI	17	immunoglobulin	3
LLQLTVWGIKQLQAR	15	gp41	3
DATDRCCFVHDCCY	14	phospholipase	5
EKPYKCPECGKSFS	14	zinc finger domain	1 (internal repeat)
LGGTCVNVGCVPKK	14	2 kinds of reductase	3
LGRSGYTVHVQCNA	14	viral coat protein	3
TLGNSTITTQEAAN	14	viral coat protein	3
AFLGIPFAEPPVG	13	lipase/acetylcholinesterase	3
LGNGGLGRLAACF	13	phosphorylase	3
LLRISLLLIQSWL	13	growth hormone	3
TTPPSVYPLAPGS	13	immunoglobulin	3
AVLPGDGIGPEV	12	dehydrogenase	3
CLNVGCIPSKAL	12	dehydrogenase	3
FDTGSSNLWVPS	12	pepsin	3
HVQCNASKFHQG	12	viral coat protein	3
LRKAMKGLGTDE	12	annexin	3
PKDATDRCCFVH	12	phospholipase	4
QSQIVSFYFKLF	12	interferon	3
SDGIMVARGDLG	12	pyruvate kinase	3
SHVSTGGGASLE	12	phosphoglycerate kinase	3
SNASCTTNCLAP	12	phosphatase	4

Superkingdom	Number of species	Total proteins	Average proteome size (kres)	Average protein length (residues)	Words/kres	Words/protein
eukarya	36	382698	4902	461	1.615	0.745
euk. (eub. range)	7	15205	1025	472	0.957	0.452
euk. (arc. range)	4	3459	374	432	0.424	0.183
eubacteria	35	104006	947	319	0.670	0.214
archaea	28	65197	681	292	0.665	0.194
human texts	9	N/A	1322	N/A	13.939	N/A

Species	Proteins	kres	Words
H. sapiens	37993	16405.3	30360
M. musculus	32971	14645.0	29463
A. thaliana	34712	14124.8	47516
T. nigroviridis	27836	11286.0	26742
C. elegans	22434	9699.5	14167
D. melanogaster	14396	8055.1	7509
D. discoideum	13017	6817.5	16628
A. gambiae	15145	6125.0	6509
C. briggsae	13192	6038.7	6687
G. zeae	11636	5952.6	5302
B. rerio	14049	5940.9	12267
A. oryzae	12053	5410.2	5498
R. norvegicus	11839	5350.2	10466
L. major	8010	5137.4	4795
D. pseudoobscura	9877	5115.2	4412
A. fumigatus	9906	4782.5	3891
P. falciparum (3D7)	5282	4001.3	10494
C. neoformans	6569	3558.9	2787
C. neoformans (JEC21)	6437	3449.1	2461
P. yoelii	7590	3385.6	9444
Y. lipolytica	6524	3118.5	3661
D. hansenii	6309	2902.5	2401
S. cerevisiae	5800	2891.7	2227
B. taurus	8292	2890.4	3652
C. glabrata	5180	2610.3	1742
K. lactis	5326	2504.6	1249
G. gallus	5387	2443.6	3384
S. pombe	5011	2351.3	1306
A. gossypii	4720	2314.3	1103
T. annulata	3790	2025.0	3409
T. parva	4070	1895.3	1899
C. hominis	3886	1757.6	924
E. cuniculi	1909	693.5	308
T. gondii	489	377.9	230
G. theta	598	178.5	38

Species	Proteins	kres	Words
R. baltica	7271	2290.5	1658
Anabaena sp.	6069	1955.5	2196
A. tumefaciens (Cereon)	5305	1687.1	1195
A. bacterium	4771	1677.5	1011
B. fragilis (ATCC 25285)	4234	1537.1	841
A. dehalogenans	4345	1516.6	1902
B. anthracis (Sterne)	5288	1460.1	996
Azoarcus sp.	4490	1393.5	764
G. violaceus	4406	1377.7	1412
E. coli (K12)	4323	1372.0	687
C. difficile	3711	1164.1	884
L. interrogans (icterohaemorrhagiae)	3654	1150.7	705
Acinetobacter sp.	3310	1048.4	266
A. ehrlichei	2862	984.1	524
A. borkumensis	2752	908.2	236
D. geothermalis	2821	901.0	538
B. abortus	3023	877.9	273
T. denticola	2753	863.6	573
S. elongatus	2451	770.0	378
S. haemolyticus	2634	756.6	370
C. chlorochromatii	1991	750.9	907
T. thermophilus (HB27)	2200	667.6	552
P. amoebophila	2023	658.9	1210
F. nucleatum	2046	641.1	250
B. longum	1723	638.8	181
T. maritima	1852	582.8	191
C. jejuni	1836	538.6	104
H. pylori (26695)	1551	491.8	172
A. aeolicus	1552	488.9	121
P. marinus (CCMP 1378)	1707	484.4	105
D. ethenogenes	1502	416.5	95
B. afzelii	1257	357.8	223
C. muridarum	916	324.3	40
M. pneumoniae	687	239.7	380
A. yellows	690	176.6	266

Species	Proteins	kres	Words
M. acetivorans	4467	1392.1	2317
H. marismortui	4234	1200.1	1006
M. barkeri	3616	1126.3	1701
M. mazei	3302	1004.3	939
M. hungatei	3095	997.2	792
S. solfataricus	2910	827.9	823
N. pharaonis	2784	815.9	571
H. walsbyi	2644	787.4	387
S. tokodaii	2816	757.9	399
H. salinarium	2426	680.0	449
M. burtonii	2242	676.7	313
A. fulgidus	2398	660.8	262
P. aerophilum	2589	654.9	473
P. kodakaraensis	2301	637.7	219
S. acidocaldarius	2221	631.7	188
P. furiosus	2045	577.7	202
P. horikoshii	2077	569.4	159
P. abyssi	1785	539.2	180
M. thermoautotrophicum	1869	524.7	145
M. jannaschii	1782	504.9	213
M. kandleri	1687	501.0	205
M. stadtmanae	1533	493.6	277
M. maripaludis	1722	490.7	113
A. pernix	1576	482.7	143
P. torridus	1535	471.3	79
T. acidophilum	1482	453.2	49
T. volcanium	1523	452.7	56
N. equitans	536	151.5	17

Word	k	Protein family	n
GGKGGLTVQIGG	12	adherence factor	3
FQEEHGHCRVP	11	helicase	4
GPNGAGKSTL	10	ABC transporter	3
EPAPEPAPE	9	low complexity	3
SDTESTNGN	9	low complexity	3
NPQLASWV	8	helicase	4
SGSGKSSL	8	ABC transporter	3
IHDVEQNG	8	DUF1547 (PF07577)	3
GPTGSGK	7	various	4
FRVTDPN	7	adherence factor	3
GIEGLIH	7	S1 RNA binding	3
DDDDDDD	7	low complexity	3
EGRCMGL	7	adherence factor	3
NDVTPAD	7	adherence factor	3
KTAAKKA	7	histone-like	3
LGGGAIL	7	Chlam_PMP (PF02415)	3
HGIWIAG	7	adherence factor	3
SSSSSS	6	low complexity	5
RSLLNK	6	homonym	4
VLLGLG	6	homonym	4
GKLSED	6	helicase	4
SFRAIP	6	adherence factor	3
LPLFSL	6	homonym	3
SSSFAL	6	homonym	3
IAILLS	6	homonym	3
RLKTIL	6	homonym	3
ALGIAA	6	homonym	3
VVLFDE	6	homonym	3
AASLIR	6	homonym	3
SLQEGL	6	homonym	3
ALPGVG	6	homonym	3
PNVGKS	6	MMR_HSR1 (PF01926)	3
EKILSL	6	homonym	3
VLSYEL	6	homonym	3

Word	k	Protein family	n
STDDSTDDSTDDSTDDST	18	low complexity	27
EIHHRIKNNLQVISSLL	17	histidine kinase	27
HHRIKNNLQVISSLLDL	17	histidine kinase	21
NMPVEYFDFNGN	12	PKD domain	20
VAYFHNMDWIE	11	PKD domain	20
GDGLYEDLTGNGEFSFVD	18	PKD domain	19
DLDGDGLYEDLTG	13	PKD domain	17
VVLATLTVSGKEKGSAN	17	PKD domain	15
VSGKEKGSANLSIGV	15	PKD domain	15
ISSLLDLQAEKF	12	histidine kinase	14
PLGIIVNELVSNSLKHAF	18	histidine kinase	13
GSANLSIGVKRLE	13	PKD domain	13
YSFLPVYSFLPVYSFLPV	18	low complexity	12
EGAADVVLATLTVSGKE	17	PKD domain	11
TVPEENITVPEEN	13	low complexity	11
AVPLGIIVNELVSNSLK	17	histidine kinase	10
GTAPLTVNFTDQSTGSP	17	PKD domain	9
STGSPTSWFWDFGDG	15	PKD domain	9
VSEASGSTVTLYFDP	15	PKD domain	9
PTSWFWDFGDGANST	15	PKD domain	9
LSPLPDQEYAPKDL	14	PKD domain	9
DITERKKAEEAL	12	histidine kinase	9
MDTAVPLGII	10	histidine kinase	9

Word	Protein family	k	n
QQQQQQQQQQQQQQQQQQ	low complexity	18	67
ATDTGATATDTGATATDT	low complexity	18	22
TVTGPTAGTTTITGTDGK	low complexity	18	15
SYSPTSPSYSPTSPSYSP	low complexity	18	14
IFIFIFIFIFIFIFIFIF	low complexity	18	14
YDSYDSYDSYDSYDSYDS	low complexity	18	11
PLAEPMPLPLAEPMPLPL	low complexity	18	10
SGSGSSGSGSSGSGSSGS	low complexity	18	9
ATDTATDTAATDTATDT	low complexity	17	31
GSGSGSGSEGSGSGSGS	low complexity	17	19
SQSQSQSQSQSQSQSQS	low complexity	17	17
NGNGSDGSNGNGSDGSN	low complexity	17	16
GSGSGSGSDSGSGSGSG	low complexity	17	13
SSSIPTGDVSSATPTGD	low complexity	17	11
DASSSIPTGDVSSATPT	low complexity	17	11
PTGDVSSATPTGDASSS	low complexity	17	10
TGGADASSTGGADASST	low complexity	17	10
TATDTGATDTATDTGAT	low complexity	17	9
TEQITVAPTGPVTTKTV	low complexity	17	9
KQKQKQKQKQKQKQKQK	low complexity	17	9
ATQTGGNGNNSGSNTAT	low complexity	17	9
ATDTGATATDTGATDT	low complexity	16	12
SPSYSPTSPSYSPTS	low complexity	15	13
ATDTGATATDTATD	low complexity	14	12
EPVTSEPVTSEPVT	low complexity	14	10
PGPAPSPGPGPAPS	low complexity	14	10
SDSDSDSDSDSDS	low complexity	13	32
DSDSDSDSDSDSD	low complexity	13	29
PSSTEAPSSTEAP	low complexity	13	14
GSNTATQTGGNGN	low complexity	13	9
TKTVTGPTAGT	low complexity	11	13
ASASASASASA	low complexity	11	9

Word	Protein family	k	n
DDDDDDDDDDDDDDDDDD	low complexity	18	184
QQQQQQQQQQQQQQQQQQ	low complexity	18	38
YQCKCEGLFVWPNDTCHA	EGF domain	18	22
GSFNCSCLSAFTVTDRNQ	EGF domain	18	19
AQAQAQAQAQAQAQAQAQ	low complexity	18	16
KKKKKKKKKKKKKKKKKK	low complexity	18	16
NGTEYECKCEVDHVWPSN	EGF domain	18	14
CGLNGTEYECKCEVDHVW	EGF domain	18	14
CGPNSICNNTIGSYNCSC	EGF domain	18	14
MSDPEPCRIKQEETEELI	zinc finger	18	13
YSNCTNEIGSYNCSCLDG	EGF domain	18	12
CDVITNGSCTCINGLPA	EGF domain	17	22
NGSCTCINGLPADGQFC	EGF domain	17	21
VCSLNETRYQCKCEGLF	EGF domain	17	19
ECLFSPPVCGPYSNCTN	EGF domain	17	17
THTHTHTHTHTHTHTHT	low complexity	17	17
CRELDCGAPVQVLRAA	SCRC (PF00530)	16	19
DINECEDAASVCGQYS	EGF domain	16	17
TDRNQPVSNSNPCNVC	EGF domain	16	17
TCGCIQALPSEGSLCQ	EGF domain	16	15
CDAAFDQQDAEVVCR	SCRC (PF00530)	15	25
NSIGSFNCSCLSAFT	EGF domain	15	19
IGGYMCSCWNGFNVS	EGF domain	15	15
QVCDSIVGSTCGCIQ	EGF domain	15	14
SINNTCEDVNECLKS	EGF domain	15	12
SNSNPCNVCSLNET	EGF domain	14	18
PERPPVSAPAPERP	low complexity	14	16
LTETQVKIWFQNRR	homeobox	14	12
DIDECLFSPPVCG	EGF domain	13	12
PVCGPYSNCTNE	EGF domain	12	15
PGGVGGVPGGVG	low complexity	12	13
NLPINSNNTCTD	EGF domain	12	13
LRAAAFDKGD	SCRC (PF00530)	10	13
