EDGAR 2.0: an enhanced software platform for comparative gene content analyses

Jochen Blom; Julian Kreis; Sebastian Spänig; Tobias Juhre; Claire Bertelli; Corinna Ernst; Alexander Goesmann

doi:10.1093/nar/gkw255

Nucleic Acids Res. 2016 Apr 20;44(Web Server issue):W22–W28. doi: 10.1093/nar/gkw255

PMCID: PMC4987874 PMID: 27098043

Abstract

The rapidly increasing availability of microbial genome sequences has led to a growing demand for bioinformatics software tools that support functional analysis based on the comparison of closely related genomes. By utilizing comparative approaches at the gene level it is possible to gain insights into the core genes, which represent the set of shared features for a set of organisms under study. Conversely, singleton genes can be identified to elucidate the specific properties of an individual genome. Since initial publication, the EDGAR platform has become one of the most established software tools in the field of comparative genomics. Over the last years, the software has been continuously improved and a large number of new analysis features have been added. For the new version, EDGAR 2.0, the gene orthology estimation approach was newly designed and completely re-implemented. Among other new features, EDGAR 2.0 provides extended phylogenetic analysis features like AAI (Average Amino Acid Identity) and ANI (Average Nucleotide Identity) matrices, genome set size statistics and modernized visualizations like interactive synteny plots or Venn diagrams. Thereby, the software supports a quick and user-friendly survey of evolutionary relationships between microbial genomes and simplifies the process of obtaining new biological insights into their differential gene content. All features are offered to the scientific community via a web-based and therefore platform-independent user interface, which allows easy browsing of precomputed datasets. The web server is accessible at http://edgar.computational.bio.

INTRODUCTION

The revolutionary improvements in high-throughput DNA sequencing during the last 10 years have dramatically increased the availability of complete and draft microbial genome sequences. As a result, thousands of sequences are now available in the public sequence repositories, and tens of thousands of sequencing projects are ongoing. Thanks to this treasure of available data, the comparative analysis of the differential gene content of genomes quickly became a routine task in modern genomics. In particular, the estimation of the core genome, the pan genome and singleton genes as defined by Tettelin et al. (1) and Medini et al. (2) is an important step in the analysis of groups of genomes. Several software platforms for comparative gene content analyses have been developed in the last decade, such as IMG (3), MicrobesOnline (4), MBGD (5) or OrtholugeDB (6). IMG and MicrobesOnline are designed as general purpose genomics databases for a broad variety of genomic information, but provide only a limited range of comparative analysis features. MBGD and OrtholugeDB are focused on comparative genomics, but neither places much emphasis on result visualization nor provides phylogenetic analyses. To support comparative gene content analyses combined with visual result representation, the software EDGAR (7) was developed. The initial version of EDGAR, referred to as ‘EDGAR 1.0’ in the following, supported only a limited range of analysis features, namely the calculation of genomic subsets and visualizations like Venn diagrams and pairwise synteny plots. The collection of features provided by EDGAR has since been extended to include a range of sophisticated analyses, with a focus on phylogenetic and statistical analyses. Existing features have been modernized and updated continuously. In the following chapters the updated and new features will be presented in detail.

TECHNICAL UPGRADES IN EDGAR 2.0

Since the publication of EDGAR 1.0 in 2009, several changes of the back-end and front-end of the software have been realized.

In EDGAR 1.0, all mathematical calculations were implemented in Perl. Most graphics were created using gnuplot (http://gnuplot.info) and Perl/CGI graphics. For the release of EDGAR 2.0, the visualization frameworks and libraries were changed to allow more up-to-date interactive graphics. All statistical and curve fitting calculations are now implemented in the statistical computing language R ((8), https://www.r-project.org/), as well as the respective plots. For interactive result visualization, a combination of HTML5, JavaScript in general and the Highcharts (http://www.highcharts.com/) charting library in particular was used. The database back-end was changed from one local SQLite (http://www.sqlite.org/) database per EDGAR project to a central MySQL server (http://www.mysql.com) running the InnoDB storage engine. Project calculations are distributed to a 1000 CPU core compute cluster.

IMPROVED AND MODERNIZED FEATURES

For the high-throughput computation of comparative analyses it is crucial to rely on a robust orthology criterion consistent within the analyzed genome set. For this purpose, EDGAR utilizes the so called BLAST Score Ratio Values (SRVs) suggested by Lerat et al. (9). The basic principle used in EDGAR is still the same as described in (7), but significant improvements have been made to the method. A detailed description of the updated orthology calculation of EDGAR 2.0 is provided in the Supplementary Data.
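The idea behind a score ratio value can be illustrated with a minimal sketch: a BLAST bit score is divided by the bit score of the query against itself, which gives a length-independent value between 0 and 1. The function names, the bidirectional check and the cutoff of 0.3 are illustrative assumptions and are not taken from EDGAR; the criterion actually used is described in the Supplementary Data.

```python
def score_ratio_value(hit_bitscore: float, self_bitscore: float) -> float:
    """Normalize a BLAST bit score by the query's self-hit bit score."""
    if self_bitscore <= 0:
        raise ValueError("self-hit bit score must be positive")
    return hit_bitscore / self_bitscore


def is_ortholog_pair(a_vs_b: float, a_self: float,
                     b_vs_a: float, b_self: float,
                     cutoff: float = 0.3) -> bool:
    """Accept a gene pair only if both search directions pass the cutoff.

    The cutoff value is a hypothetical placeholder, not EDGAR's threshold.
    """
    return (score_ratio_value(a_vs_b, a_self) >= cutoff
            and score_ratio_value(b_vs_a, b_self) >= cutoff)


print(score_ratio_value(450.0, 600.0))               # 0.75
print(is_ortholog_pair(450.0, 600.0, 440.0, 580.0))  # True
print(is_ortholog_pair(120.0, 600.0, 440.0, 580.0))  # False
```

Normalizing by the self-hit score makes hits of short and long genes comparable, which a raw bit score or E-value threshold would not.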

Genomic subset calculation

The main feature of EDGAR was and still is the fast calculation of the genomic subsets defined in the introduction: the core genome, pan genome and singleton genes. All calculations require the selection of one reference genome, and a set of genomes to which the reference should be compared. The reference genome acts as starting point for iterative extension or reduction of the result gene set, which is presented in tabular form. The result table shows the locus tags as well as descriptions of the genes. In addition, result tables now provide multiple alignments of the ortholog sets on nucleotide as well as on protein level. Results can be saved as multiple FASTA file (DNA or protein sequence) or as a TAB separated flat file.
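The three genomic subsets can be sketched with plain Python collections. The example assumes that orthology has already been resolved into sets mapping a genome name to a locus tag; this data layout, the function names and the toy genomes A, B and C are hypothetical and do not reflect EDGAR's internal data model or its iterative procedure.

```python
from typing import Dict, List

# Hypothetical toy input: each ortholog set maps genome name -> locus tag.
OrthologSet = Dict[str, str]


def core_genome(sets: List[OrthologSet], reference: str,
                others: List[str]) -> List[OrthologSet]:
    """Sets with a gene in the reference and in every comparison genome."""
    return [s for s in sets if reference in s and all(g in s for g in others)]


def singletons(sets: List[OrthologSet], reference: str,
               others: List[str]) -> List[OrthologSet]:
    """Reference genes without an ortholog in any comparison genome."""
    return [s for s in sets
            if reference in s and not any(g in s for g in others)]


def pan_genome(sets: List[OrthologSet],
               genomes: List[str]) -> List[OrthologSet]:
    """Every set with a member in at least one genome, counted once."""
    return [s for s in sets if any(g in s for g in genomes)]


ortholog_sets = [
    {"A": "a1", "B": "b1", "C": "c1"},  # shared by all three genomes
    {"A": "a2", "B": "b2"},             # dispensable: A and B only
    {"A": "a3"},                        # singleton of A
    {"C": "c4"},                        # singleton of C
]

print(len(core_genome(ortholog_sets, "A", ["B", "C"])))  # 1
print(singletons(ortholog_sets, "A", ["B", "C"]))        # [{'A': 'a3'}]
print(len(pan_genome(ortholog_sets, ["A", "B", "C"])))   # 4
```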

Venn diagrams

Venn diagrams show the number of genes for all possible logical combinations of a selection of genomes. They allow an easy visual inspection of the core genome size and the gene numbers in every subset of the dispensable genome. The EDGAR web interface features the creation of Venn diagrams with an upper limit of five genomes because the number of logical combinations within a Venn diagram of higher order results in too many areas for a meaningful graphical representation. Genome comparisons of a higher order are possible, though, via a new interface that enables calculation of any possible intersection of any arbitrary number of genomes. In this interface the user can select single genomes as included, excluded, or ignored, and EDGAR will calculate the gene set matching the query and present the results in tabular form. The diagram layout has been notably improved since EDGAR 1.0, providing more even sized areas and an improved coloring scheme. An example of the new Venn diagram layout used in EDGAR 2.0 is shown in the Supplementary Data.
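Two points from this paragraph can be made concrete in a short sketch: the number of areas in a Venn diagram grows as 2^n − 1, and an included/excluded/ignored query is a simple filter on presence patterns. The data layout (ortholog sets mapping genome name to locus tag) and the function names are assumptions for illustration, not the interface of the EDGAR web server.

```python
from typing import Dict, List, Set


def venn_regions(n_genomes: int) -> int:
    """Number of non-empty presence/absence combinations for n genomes."""
    return 2 ** n_genomes - 1


def query_gene_sets(sets: List[Dict[str, str]], included: Set[str],
                    excluded: Set[str]) -> List[Dict[str, str]]:
    """Keep sets present in all included and absent from all excluded genomes.

    Genomes named in neither argument are ignored.
    """
    result = []
    for s in sets:
        present = set(s)
        if included <= present and not (excluded & present):
            result.append(s)
    return result


ortholog_sets = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b2"},
    {"A": "a3"},
    {"C": "c4"},
]

print(venn_regions(5))   # 31
print(venn_regions(10))  # 1023
print(query_gene_sets(ortholog_sets, {"A", "B"}, {"C"}))  # [{'A': 'a2', 'B': 'b2'}]
```

Ten genomes already produce 1023 areas, which is why a tabular query is the more practical representation for larger comparisons.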

Synteny plots

Synteny describes the co-localization of genes on a stretch of DNA. A synteny plot showing the conservation of gene order among several genomes is an easy way to identify large scale evolutionary events like genome rearrangements. The original EDGAR web server provided an interface to create synteny plots of pairs of genomes based on the stop positions of genes that were identified as being orthologous. Plots were generated as static images with gnuplot. In EDGAR 2.0 synteny plots can be created for up to 20 genomes at a time. The genomes are compared to a selected reference genome, and a track is plotted in a different color for each of them (see Figure 1). The individual tracks can be switched on and off, and the order in which the genome tracks are superimposed on each other can be changed dynamically. Thus, the synteny plot is now a highly interactive tool for the analysis of large scale genome rearrangements.

Figure 1. — Synteny plot of four *Xanthomonas campestris* chromosomes compared to *X. campestris pv. campestris* strain B100.

Genome browser

To gain more convenient visual access to the genomic neighborhood of orthologous genes, a new genome browser was added to the EDGAR web interface as replacement for the comparative viewer presented in the original publication. In EDGAR 2.0, we introduce a JavaScript and HTML5 based Genome Browser. This interactive tool allocates the same color to orthologous genes, and shows the genomic context in a window of 20 kb. Thereby the genome browser allows rapid detection of the presence or absence of orthologous genes and variations in the gene order. Additionally, users can interactively realign the genes in the genome browser window by clicking on a gene. Moreover, a multiple alignment of a selected gene set can be generated, allowing biologists to verify the ortholog relationship. All gene sets visible in the 20 kb window at a given time are additionally presented in tabular form below the interactive genome browser.

NEW FEATURES ADDED TO THE EDGAR WEB SERVER

Besides the presented improvements, EDGAR 2.0 also provides novel features and concepts that have not been available before. For example, in EDGAR 1.0 only chromosomes could be compared, but organisms with multiple replicons could not be handled properly. In EDGAR 2.0, multi-replicon organisms are fully supported, and all analysis features can be run either on the single replicons or on a virtual container comprising all genes of an organism. These containers are automatically generated during the EDGAR project calculation and are named “ALL_<organism name>”. In the following the most important new features of EDGAR 2.0 are presented in detail.

Genomic subset statistics

A calculated genomic subset, e.g. a core genome calculated on a specific set of genomes, is always only a snapshot of the situation for the given genome set. One possible solution to obtain a more comprehensive estimation of genomic subset sizes is to calculate the respective numbers for every possible combination of all available genomes and to use the resulting data to extrapolate how subset sizes would develop for an infinite number of genomes. Mathematical approaches for genomic subset extrapolation were proposed by Tettelin et al. in 2005 (1) and 2008 (10) and are now implemented in EDGAR 2.0.

Core genome and singleton development extrapolation

The development of the core genome size for increasing numbers of genomes can be predicted by a curve fitting approach using an exponential decay function. An identical approach is used to extrapolate the development of the expected number of singletons; to simplify the mathematical description, only the core genome development calculation is described here.

If k genomes are available, one estimates the number of core genes for all $k!$ possible permutations of the genomes. Subsequently, the number of core genes is plotted as a function of the number of compared genomes. Using a non-linear least squares curve fitting approach, an exponential decay function of the form:

$$f(n) = c \cdot \exp\left(-\frac{n}{\tau}\right) + \Omega \qquad (1)$$

is fitted to the data, where c is the amplitude of the exponential function, n is the number of compared genomes, τ is the decay constant that defines the speed at which f converges to its asymptotic value and Ω is the extrapolated size of the core genome for n → ∞. Thus, the Ω value indicates how well the core genome size of the currently available genomes represents the ‘real’ core genome size of the analyzed genus. Figure 2A shows the core genome development plot for 14 Xanthomonas genomes.
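As a rough illustration of the fit, the following standard-library sketch estimates the amplitude c, the decay constant τ and the asymptote Ω of an exponential decay with offset. EDGAR performs this step with non-linear least squares in R; the sketch swaps in a simpler technique, a grid search over τ combined with ordinary linear least squares for c and Ω, and runs on synthetic data generated from known parameters rather than on real core genome counts. All names and values are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def fit_exponential_decay(ns: Sequence[float], ys: Sequence[float],
                          taus: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = c * exp(-n / tau) + omega; returns (c, tau, omega).

    For a fixed tau the model is linear in c and omega, so each candidate
    tau is scored by a simple linear regression and the best one is kept.
    """
    best = None
    for tau in taus:
        xs = [math.exp(-n / tau) for n in ns]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        c = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        omega = my - c * mx
        sse = sum((c * x + omega - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, c, tau, omega)
    _, c, tau, omega = best
    return c, tau, omega


# Synthetic data from known parameters (c=1200, tau=3.0, omega=2000).
ns: List[int] = list(range(1, 15))
ys = [1200.0 * math.exp(-n / 3.0) + 2000.0 for n in ns]
tau_grid = [0.5 + 0.1 * i for i in range(100)]

c, tau, omega = fit_exponential_decay(ns, ys, tau_grid)
print(f"c={c:.1f} tau={tau:.1f} omega={omega:.1f}")
```

The resolution of the τ grid limits the accuracy of this sketch; a proper non-linear solver refines all three parameters jointly and also yields the confidence intervals shown in the plots.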

Figure 2. — (A) Core genome development plot for 14 *Xanthomonas* genomes. The red curve shows the fitted exponential decay function, blue and green curves indicate the upper and lower boundary of the 95% confidence interval. The extrapolated core genome size is 2364 genes. (B) Pan genome development plot for 14 *Xanthomonas* genomes. The red curve shows the fitted exponential Heaps’ law function, blue and green curves indicate the upper and lower boundary of the 95% confidence interval. Based on these results the pan genome is considered to be open with a growth exponent of 0.409.

Pan genome development extrapolation

The development of the pan genome size can be estimated using a Heaps’ law function. Heaps’ law is an empirical law mainly used in linguistics describing the number of distinct words in a document (or a set of documents) as a function of the document length. When an increasing number of texts is analyzed, the number of different words grows according to a sub-linear power law of the total number of scanned words. The development of the pan genome shows a comparable development and can be extrapolated by a power law of the form:

$$f(n) = c \cdot n^{\gamma} \qquad (2)$$

where n is the number of compared genomes, c is a proportionality constant and γ the growth exponent. As in the core genome and singleton statistics the parameters c and γ can be estimated by non-linear least squares curve fitting to the data points from a calculation of the pan genome size for all possible permutations of the available genomes. An exemplary pan genome size extrapolation for 14 Xanthomonas genomes is shown in Figure 2B.
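A power law of this kind can be estimated in a few lines. The sketch below is a simplification: instead of the non-linear least squares fit described above, it uses linear regression on log-transformed values, which gives the same parameters for noise-free data but weights errors differently on real data. The function name and the synthetic pan genome sizes are assumptions.

```python
import math
from typing import Sequence, Tuple


def fit_heaps_law(ns: Sequence[float],
                  sizes: Sequence[float]) -> Tuple[float, float]:
    """Fit size = c * n ** gamma via regression in log-log space.

    Returns (c, gamma).
    """
    lx = [math.log(n) for n in ns]
    ly = [math.log(s) for s in sizes]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    gamma = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    c = math.exp(my - gamma * mx)
    return c, gamma


# Synthetic pan genome sizes generated with c=4000 and gamma=0.4.
ns = list(range(1, 15))
sizes = [4000.0 * n ** 0.4 for n in ns]

c, gamma = fit_heaps_law(ns, sizes)
print(f"c={c:.1f} gamma={gamma:.3f}")
```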

Pan vs. Core development plot

When the aforementioned statistical features are used, it is crucial to ensure that consistent genomic data are used. The results can be strongly influenced by genomes with a high evolutionary distance to the rest of the dataset. Furthermore, the calculations can be disturbed by genomes with highly differing gene content due to poor gene prediction accuracy or highly fragmented draft genomes. To identify such outliers, the pan versus core development plot is the ideal tool. Starting with one genome, a sequence of core and pan genome sizes is calculated by iteratively adding one genome at a time to the comparison in a user-defined order. Outliers can be easily detected in the resulting pan versus core plot as demonstrated by Figure 3.
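The iterative calculation behind such a plot can be sketched with set operations: the core genome is the running intersection and the pan genome the running union of gene families. The input layout (genome name mapped to a set of ortholog family identifiers) and the toy genomes, including the deliberately divergent genome X, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def pan_core_development(families: Dict[str, Set[int]],
                         order: List[str]) -> List[Tuple[str, int, int]]:
    """Return (genome, core size, pan size) after adding each genome in order."""
    core = set(families[order[0]])
    pan = set(core)
    rows = [(order[0], len(core), len(pan))]
    for genome in order[1:]:
        core &= families[genome]  # families shared by all genomes so far
        pan |= families[genome]   # families seen in any genome so far
        rows.append((genome, len(core), len(pan)))
    return rows


families = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 3, 6},
    "X": {1, 7, 8, 9},  # distant genome acting as an outlier
}

for row in pan_core_development(families, ["A", "B", "C", "X"]):
    print(row)
# ('A', 4, 4)
# ('B', 3, 5)
# ('C', 3, 6)
# ('X', 1, 9)
```

Adding X makes the core size drop from 3 to 1 while the pan size jumps from 6 to 9, the same pattern that marks an outlier in the plot.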

Figure 3. — Pan versus core development plot of 15 *Xanthomonas campestris* genomes. The drastic drop of the core genome size with the introduction of *Xanthomonas albilineans* strain GPE PC73 is clearly visible. The outlier status of this genome is confirmed by the phylogenetic tree.

Phylogenetic analysis features

While phylogenetic analyses were not part of the web server in EDGAR 1.0, a phylogenetic tree of all available genomes is now calculated by default for all EDGAR projects. For that purpose, EDGAR 2.0 uses the phylogenetic analysis pipeline developed on the basis of the ideas of Zdobnov et al. (11) which was described in the use case in (7). This pipeline analyzes the phylogenetic relationships between genomes based on the thousands of orthologous genes in the complete core genome. Multiple alignments of each orthologous gene set of the core genome are calculated using the MUSCLE software (12). The resulting alignments are concatenated to one large complete core alignment which is used to create a phylogenetic tree using the neighbor joining method as implemented in the PHYLIP package (13).
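The concatenation step of this pipeline can be sketched as follows, assuming each per-gene multiple alignment is available as a mapping from genome name to aligned sequence. The function names and toy sequences are hypothetical, and the p-distance helper is only a simple stand-in for the distance calculation that precedes neighbor joining; it is not the model used by PHYLIP.

```python
from typing import Dict, List


def concatenate_core_alignments(alignments: List[Dict[str, str]],
                                genomes: List[str]) -> Dict[str, str]:
    """Join per-gene alignments into one long alignment row per genome."""
    for aln in alignments:
        if len({len(aln[g]) for g in genomes}) != 1:
            raise ValueError("all rows of an alignment must have equal length")
    return {g: "".join(aln[g] for aln in alignments) for g in genomes}


def p_distance(a: str, b: str) -> float:
    """Fraction of differing columns, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


alignments = [
    {"A": "MKT-L", "B": "MKTAL", "C": "MRTAL"},
    {"A": "GGS", "B": "GAS", "C": "GAS"},
]
core_alignment = concatenate_core_alignments(alignments, ["A", "B", "C"])

print(core_alignment["A"])                                    # MKT-LGGS
print(p_distance(core_alignment["B"], core_alignment["C"]))   # 0.125
```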

Subtrees

For some genera subbranches in the phylogenetic tree might be hard to resolve due to the close phylogenetic proximity of a certain species, e.g. for Mycobacterium tuberculosis within the Mycobacterium genus. For such cases EDGAR 2.0 offers an interface to calculate phylogenetic trees of a subset of genomes in the project. This feature enables a more detailed view of the selected subset of genomes. At the same time the reliability of the result is increased compared to the parent tree, since the size of the core genome, which is the basis of the tree calculation, increases for the reduced genome set.

ANI and AAI

While the computation of a phylogenetic tree based on the complete core genome shows good results, it is still a computationally intensive task. Two different approaches toward a phylogenetic evaluation based on the increasing availability of whole-genome sequences were proposed by Konstantinidis et al. (14–16), i.e. the average amino acid identity (AAI) and the average nucleotide identity (ANI). Both methods are provided by the EDGAR 2.0 web server.

For the AAI method, the average AAIs of all conserved genes in the core genome as computed by the BLAST algorithm (17) are collected. The results can be easily extracted from the EDGAR database. ANI values are computed as described in (18) and as implemented in the popular JSpecies package (19). For both methods, the resulting phylogenetic distance values are arranged in an AAI/ANI matrix, clustered according to their distance patterns and visualized as heatmaps. The heatmap images as well as the raw AAI/ANI values can be exported from the web server.
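How such a matrix is assembled can be shown with a small sketch that averages the percent identities of core gene pairs for every pair of genomes. The input layout, the assumption that one value per unordered genome pair is sufficient, and the toy identities are all illustrative; clustering and heatmap rendering are omitted.

```python
from statistics import mean
from typing import Dict, List, Tuple


def aai_matrix(genomes: List[str],
               identities: Dict[Tuple[str, str], List[float]]
               ) -> Dict[str, Dict[str, float]]:
    """Build a symmetric matrix of mean percent identities per genome pair."""
    matrix = {g: {g: 100.0} for g in genomes}  # diagonal: genome vs itself
    for (a, b), values in identities.items():
        matrix[a][b] = matrix[b][a] = mean(values)
    return matrix


# Hypothetical percent identities of core gene pairs from BLAST results.
identities = {
    ("A", "B"): [98.0, 96.0, 97.0],
    ("A", "C"): [80.0, 84.0],
    ("B", "C"): [81.0, 83.0],
}
matrix = aai_matrix(["A", "B", "C"], identities)

print(matrix["A"]["B"])  # 97.0
print(matrix["C"]["A"])  # 82.0
```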

Retrieval of orthologous gene sets

The EDGAR 2.0 web server provides several ways to search and retrieve data. One of them is the retrieval of orthologous gene sets, which allows users to define a set of query genes, e.g. all genes of an operon. All genes that are orthologous to the query genes in all selected comparison genomes are identified and presented as detailed tables. This feature is thus the perfect tool to quickly find genes of interest for scientists focusing on a certain type of genes.

Upstream motif search

The EDGAR 2.0 database not only stores all coding sequences of a set of genomes, but also stores up to 400 bp of the sequence upstream of the gene start. This allows a search for conserved motifs in these upstream regions, such as the Pribnow box (20), σ^B-binding motifs (21), cold shock protein binding motifs (22), etc.

Inspired by the GECO software (23), an upstream motif search was implemented in EDGAR 2.0 using the fuzznuc software provided by the EMBOSS package (24). Users can search for PROSITE-style nucleotide patterns, either in an exact search or with up to two allowed mismatches. Genes that have the query motif in their upstream region will be displayed in a table showing the exact position of the motif.
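The principle of a mismatch-tolerant motif search can be sketched in pure Python. This is not fuzznuc and does not parse PROSITE-style patterns; it matches a plain IUPAC string on the forward strand only, and the function name, the toy upstream sequence and the use of the Pribnow consensus TATAAT as query are assumptions for illustration.

```python
from typing import List, Tuple

# IUPAC nucleotide codes and the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(upstream: str, motif: str,
               max_mismatches: int = 0) -> List[Tuple[int, str, int]]:
    """Return (0-based start, matched text, mismatch count) for each hit."""
    hits = []
    width = len(motif)
    for start in range(len(upstream) - width + 1):
        window = upstream[start:start + width]
        mismatches = sum(base not in IUPAC[symbol]
                         for base, symbol in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


upstream = "GGCTATAATGCCTACAATGG"
print(find_motif(upstream, "TATAAT"))
# [(3, 'TATAAT', 0)]
print(find_motif(upstream, "TATAAT", max_mismatches=1))
# [(3, 'TATAAT', 0), (12, 'TACAAT', 1)]
```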

Core HMM scan

As EDGAR stores huge amounts of data from millions of BLAST comparisons, the question arose how this data could be used to analyze data that was not included in the EDGAR project calculation. One approach in this direction is the creation of profile Hidden Markov Models (HMMs, (25)) from orthologous gene sets. During the calculation of an EDGAR project, the protein sequences of all sets of orthologous genes with more than five members are aligned using MUSCLE (12) and a profile HMM is created using the HMMER3 package (26). The resulting HMM database can be queried in the EDGAR 2.0 web interface.

Higher level analysis features

As already described, EDGAR provides data structures to compare single replicons or complete organisms. For comparing the gene content of all plasmids of one organism to the genes of all plasmids of another organism, a higher level of abstraction is needed. For such cases EDGAR allows users to create groups consisting of all genes of these replicons.

These groups are well suited to join replicons from one organism, but if replicons from different organisms need to be grouped, the orthologous genes from the involved organisms act as artificial paralogs and prevent a reasonable analysis. If such multi-organism groups are needed, e.g. if a researcher wishes to compare the gene content of a set of pathogenic bacteria against a set of non-pathogenic bacteria, disjoint gene sets representing a group of organisms are required. A straightforward solution is to calculate the core or pan genome of a set of genomes and to store one representative of each gene set for subsequent calculations. Such non-redundant representations of genomic subsets are called ‘meta contigs’ and can be created in EDGAR 2.0. These meta contigs allow higher level comparisons. For instance, it is possible to check which genes of a genome are unique in comparison to the complete pan genome of a genome set.
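The meta contig idea can be sketched as picking one representative per ortholog set. The data layout (ortholog sets mapping genome name to locus tag), the choice of the first group member as representative and the function names are assumptions for illustration and do not describe how EDGAR selects representatives.

```python
from typing import Dict, List


def meta_contig(sets: List[Dict[str, str]], group: List[str],
                mode: str = "pan") -> List[str]:
    """One representative locus tag per ortholog set of the group.

    mode="pan" keeps sets found in any group member, mode="core" keeps
    only sets found in every group member.
    """
    representatives = []
    for s in sets:
        members = [g for g in group if g in s]
        if mode == "core":
            keep = len(members) == len(group)
        else:
            keep = bool(members)
        if keep:
            representatives.append(s[members[0]])
    return representatives


def unique_against_group(sets: List[Dict[str, str]], genome: str,
                         group: List[str]) -> List[str]:
    """Genes of a genome without an ortholog in the group's pan genome."""
    return [s[genome] for s in sets
            if genome in s and not any(g in s for g in group)]


ortholog_sets = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b2"},
    {"A": "a3"},
    {"C": "c4"},
]

print(meta_contig(ortholog_sets, ["A", "B"], mode="pan"))   # ['a1', 'a2', 'a3']
print(meta_contig(ortholog_sets, ["A", "B"], mode="core"))  # ['a1', 'a2']
print(unique_against_group(ortholog_sets, "C", ["A", "B"]))  # ['c4']
```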

REQUIREMENTS

The precomputed EDGAR databases can be accessed via the EDGAR 2.0 web server at http://edgar.computational.bio. EDGAR projects are organized on genus level, and an alphabetically ordered overview of available projects is provided on the start page. The EDGAR website is free and open to all users and there is no login requirement.

In addition, for the analysis of unpublished data, password protected private databases can be created on request. For such private databases, any arbitrary collection of annotated genomes can be used. The accepted input files are all DDBJ/ENA/GenBank feature table formats.

Incomplete genomes

EDGAR 2.0 is capable of processing incomplete genomes by joining the contigs of such draft genomes to pseudo-chromosomes. As the EDGAR method is based on the comparison of coding sequences, incomplete genomes do not pose any technical problems. Nevertheless, one should be aware that every draft genome adds bias to the EDGAR results, as every gap in a sequence may split, truncate or completely mask a gene. Thus, the usage of heavily fragmented genomes or too many draft genomes should be avoided when using the EDGAR platform. This is also the reason why the public EDGAR databases only use completed genomes.

DISCUSSION AND CONCLUDING REMARKS

Since the publication of EDGAR 1.0, the EDGAR web server has become one of the most popular resources for comparative genomics. The number of publicly available projects has increased from 75 genus-based projects with 582 genomes to 167 genera and 2160 genomes. Furthermore, more than 300 private projects are currently provided to users from more than 100 universities and institutes all over the world. The largest EDGAR user base is located in Europe, but more than 50% of the researchers using EDGAR are from outside of Europe. The popularity of the EDGAR service is also reflected by the fact that in 2015 alone more than 10 000 genomes have been processed.

During the last years, the web interface has been modernized with up-to-date graphical visualizations, and the feature set was significantly extended. With the genomic subset statistics, higher level comparison features and several data search and retrieval methods EDGAR 2.0 offers a unique and comprehensive set of comparative analyses which in most cases were developed based on user feedback. The added value of the new features has been proven in numerous studies that successfully used EDGAR for phylogenetic and taxonomic analyses (27). Furthermore EDGAR was used in studies with medical (28), ecological (29,30) or agricultural (31) background.

The EDGAR platform has been continuously developed and improved, and the next upcoming features are already planned. One new concept will be the usage of EDGAR data for genome annotation. Another field for improvement is the phylogenetic analysis features, where more in silico genome-to-genome comparison features will be implemented. The main task for the mid-term development of EDGAR will be to replace the data back end once again. While SQLite was sufficient in the 454 sequencing era and the MySQL server works fine for the amounts of data that have to be analyzed today, the ever increasing amounts of data provided by modern sequencing systems make a further stage of development necessary. Thus, a change of the EDGAR data model and the development of a NoSQL data back end have already been started.

With the presented features, EDGAR 2.0 supports a quick survey of evolutionary relationships among microbial organisms, simplifies the search for genes of interest and provides new biological insights into the differential gene content of kindred genomes.

Acknowledgments

The authors further wish to thank the Bioinformatics Core Facility (BCF) for expert technical support.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The EDGAR platform is financially supported by the German Federal Ministry of Education and Research within the de.NBI network [FKZ 031A533]. Funding for open access charge: Open Access publication fund of the Justus-Liebig-University.

Conflict of interest statement. None declared.

REFERENCES

1. Tettelin H., Masignani V., Cieslewicz M., Donati C., Medini D., Ward N., Angiuoli S., Crabtree J., Jones A., Durkin A., et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl. Acad. Sci. U.S.A. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
2. Medini D., Donati C., Tettelin H., Masignani V., Rappuoli R. The microbial pan-genome. Curr. Opin. Genet. Dev. 2005;15:589–594. doi: 10.1016/j.gde.2005.09.006.
3. Markowitz V.M., Chen I.-M.A., Palaniappan K., Chu K., Szeto E., Grechkin Y., Ratner A., Jacob B., Huang J., Williams P., et al. IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res. 2012;40:D115–D122. doi: 10.1093/nar/gkr1044.
4. Dehal P.S., Joachimiak M.P., Price M.N., Bates J.T., Baumohl J.K., Chivian D., Friedland G.D., Huang K.H., Keller K., Novichkov P.S., et al. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010;38:D396–D400. doi: 10.1093/nar/gkp919.
5. Uchiyama I., Mihara M., Nishide H., Chiba H. MBGD update 2013: the microbial genome database for exploring the diversity of microbial world. Nucleic Acids Res. 2013;41:D631–D635. doi: 10.1093/nar/gks1006.
6. Whiteside M.D., Winsor G.L., Laird M.R., Brinkman F.S. OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis. Nucleic Acids Res. 2013;41:D366–D376. doi: 10.1093/nar/gks1241.
7. Blom J., Albaum S.P., Doppmeier D., Pühler A., Vorhölter F.-J., Zakrzewski M., Goesmann A. EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics. 2009;10:154. doi: 10.1186/1471-2105-10-154.
8. R Development Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2008. ISBN 3-900051-07-0.
9. Lerat E., Daubin V., Moran N. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol. 2003;1:101–109. doi: 10.1371/journal.pbio.0000019.
10. Tettelin H., Riley D., Cattuto C., Medini D. Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol. 2008;11:472–477. doi: 10.1016/j.mib.2008.09.006.
11. Zdobnov E., Bork P. Quantification of insect genome divergence. Trends Genet. 2007;23:16–20. doi: 10.1016/j.tig.2006.10.004.
12. Edgar R. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
13. Felsenstein J. PHYLIP-phylogeny inference package (version 3.2). Cladistics. 1989;5:163–166.
14. Konstantinidis K.T., Tiedje J.M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 2005;187:6258–6264. doi: 10.1128/JB.187.18.6258-6264.2005.
15. Konstantinidis K.T., Tiedje J.M. Genomic insights that advance the species definition for prokaryotes. Proc. Natl. Acad. Sci. U.S.A. 2005;102:2567–2572. doi: 10.1073/pnas.0409727102.
16. Konstantinidis K.T., Ramette A., Tiedje J.M. The bacterial species definition in the genomic era. Philos. Trans. R. Soc. Lond B Biol. Sci. 2006;361:1929–1940. doi: 10.1098/rstb.2006.1920.
17. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
18. Goris J., Konstantinidis K.T., Klappenbach J.A., Coenye T., Vandamme P., Tiedje J.M. DNA–DNA hybridization values and their relationship to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 2007;57:81–91. doi: 10.1099/ijs.0.64483-0.
19. Richter M., Rosselló-Móra R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl. Acad. Sci. U.S.A. 2009;106:19126–19131. doi: 10.1073/pnas.0906412106.
20. Pribnow D. Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc. Natl. Acad. Sci. U.S.A. 1975;72:784–788. doi: 10.1073/pnas.72.3.784.
21. Hain T., Hossain H., Chatterjee S.S., Machata S., Volk U., Wagner S., Brors B., Haas S., Kuenne C.T., Billion A., et al. Temporal transcriptomic analysis of the Listeria monocytogenes EGD-e regulon. BMC Microbiol. 2008;8:1–12. doi: 10.1186/1471-2180-8-20.
22. Morgan H.P., Estibeiro P., Wear M.A., Max K.E., Heinemann U., Cubeddu L., Gallagher M.P., Sadler P.J., Walkinshaw M.D. Sequence specificity of single-stranded DNA-binding proteins: a novel DNA microarray approach. Nucleic Acids Res. 2007;35:e75. doi: 10.1093/nar/gkm040.
23. Kuenne C., Ghai R., Chakraborty T., Hain T. GECO–linear visualization for comparative genomics. Bioinformatics. 2007;23:125–126. doi: 10.1093/bioinformatics/btl556.
24. Rice P., Longden I., Bleasby A., et al. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2.
25. Eddy S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996;6:361–365. doi: 10.1016/s0959-440x(96)80056-x.
26. Eddy S. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195.
27. Borriss R., Chen X.-H., Rueckert C., Blom J., Becker A., Baumgarth B., Fan B., Pukall R., Schumann P., Spröer C., et al. Relationship of Bacillus amyloliquefaciens clades associated with strains DSM 7T and FZB42T: a proposal for Bacillus amyloliquefaciens subsp. amyloliquefaciens subsp. nov. and Bacillus amyloliquefaciens subsp. plantarum subsp. nov. based on complete genome sequence comparisons. Int. J. Syst. Evol. Microbiol. 2011;61:1786–1801. doi: 10.1099/ijs.0.023267-0.
28. Sangal V., Blom J., Sutcliffe I.C., von Hunolstein C., Burkovski A., Hoskisson P.A. Adherence and invasive properties of Corynebacterium diphtheriae strains correlates with the predicted membrane-associated and secreted proteome. BMC Genomics. 2015;16:1–15. doi: 10.1186/s12864-015-1980-8.
29. Glaeser S.P., Imani J., Alabid I., Guo H., Kumar N., Kämpfer P., Hardt M., Blom J., Goesmann A., Rothballer M., et al. Non-pathogenic Rhizobium radiobacter F4 deploys plant beneficial activity independent of its host Piriformospora indica. ISME J. 2015;10:871–884. doi: 10.1038/ismej.2015.163.
30. Ngugi D.K., Blom J., Stepanauskas R., Stingl U. Diversification and niche adaptations of Nitrospina-like bacteria in the polyextreme interfaces of Red Sea brines. ISME J. 2015. doi: 10.1038/ismej.2015.214.
31. Mann R., Blom J., Bühlmann A., Plummer K., Beer S., Luck J., Goesmann A., Frey J., Rodoni B., Duffy B., et al. Comparative analysis of the Hrp pathogenicity island of Rubus- and Spiraeoideae-infecting Erwinia amylovora strains identifies the IT region as a remnant of an integrative conjugative element. Gene. 2012;504:6–12. doi: 10.1016/j.gene.2012.05.002.
