Abstract
Background
Sequence analysis of immunoglobulin heavy chain (IGH) gene rearrangements and frequency analysis is a powerful tool for studying the immune repertoire, immune responses and immune dysregulation in health and disease. The challenge is to provide user friendly, secure and reproducible analytical services that are available for both small and large laboratories which are determining VDJ repertoire using NGS technology.
Results
In this study we describe ImmunoGlobulin Galaxy (IGGalaxy)- a convenient web based application for analyzing next-generation sequencing results and reporting IGH gene rearrangements for both repertoire and clonality studies. IGGalaxy has two analysis options one using the built in igBLAST algorithm and the second using output from IMGT; in either case repertoire summaries for the B-cell populations tested are available. IGGalaxy supports multi-sample and multi-replicate input analysis for both igBLAST and IMGT/HIGHV-QUEST. We demonstrate the technical validity of this platform using a standard dataset, S22, used for benchmarking the performance of antibody alignment utilities with a 99.9 % concordance with previous results. Re-analysis of NGS data from our samples of RAG-deficient patients demonstrated the validity and user friendliness of this tool.
Conclusions
IGGalaxy provides clinical researchers with detailed insight into the repertoire of the B-cell population per individual sequenced and between control and pathogenic genomes. IGGalaxy was developed for 454 NGS results but is capable of analyzing alternative NGS data (e.g. Illumina, Ion Torrent). We demonstrate the use of a Galaxy virtual machine to determine the VDJ repertoire for reference data and from B-cells taken from immune deficient patients. IGGalaxy is available as a VM for download and use on a desktop PC or on a server.
Electronic supplementary material
The online version of this article (doi:10.1186/s12865-014-0059-7) contains supplementary material, which is available to authorized users.
Keywords: Next generation sequencing, Immunoglobulin heavy chain, Repertoire, IMGT/HIGHV-QUEST, igBLAST
Background
B lymphocytes recognize pathogens (antigens) with an antigen-specific receptor, also called antibody or immunoglobulin. This immunoglobulin is unique for every B lymphocyte, and consists of a variable and a constant domain. The locus encoding for immunoglobulin contains many different variable (V), diversity (D), and joining (J) genes. The recombination of one V and J gene, or one V, D and J gene results in the variable domain of the immunoglobulin (Figure 1A) [1] which when combined with additional processing of the gene junctions results in a potential repertoire of more than 1012 different immunoglobulins. However, individuals with defects in the generation of the immunoglobulins have a more restricted immunoglobulin repertoire, and are therefore vulnerable for infections.
The immunoglobulin repertoire can be studied by sequencing the rearrangements in the immunoglobulin heavy chain (IGH) locus. The use of next generation sequencing technology has increased the number of IgH rearrangements that can be assessed, but has also increased the amount of raw data requires technical knowledge to analyze. To ensure accurate and reproducible determination of immunoglobulin repertoire within and between clinical laboratories thus requires standardized and reproducible data analysis. Analysis of IGH rearrangements is complicated since potentially all sequence reads contain a unique combination of a V, D and J gene, and stretches of nucleotides that are randomly added or deleted at the V-D and D-J junctions which is spanned by the CDR3 region (Figure 1B).
Previously, bioinformatics tools have been developed that are able to align the IGH rearrangement to the reference sequences. Most frequently used tools are the ImMunoGeneTics (IMGT)/HighV-Quest [2], and IgBLAST [3]. These tools provide detailed information on the rearrangement, including the specific V, D and J genes and the composition of the junctions however information extraction and visualization of these data must be completed by the user.
Implementation
To deliver a client and a server web-based immunoglobulin repertoire analysis and reporting tool required two implementation efforts. First the necessary analytical and reporting tools had to be implemented and the second was to implement a web-based application capable of use on and individual PC and on a server. The analytical reporting tools were integrated into the Galaxy framework (http://galaxyproject.org) [4,5] as separate tools and then as ‘end to end’ workflows to utilize Galaxy’s graphical user interface (GUI) [6]. To enable IGGalaxy to be deployed more easily on both a local PC and a server all the dependencies required to run IGGalaxy were combined into a standalone VMware (http://www.vmware.com) virtual machine (VM) - a pre-packaged environment that is used like any physical computer [7] but can be downloaded ‘ready-to-run’.
Components for web-based analysis and visualisation
A set of Perl, Python and R components have been developed to generate the framework components which include: (1) igBLAST service; (2) igBLAST parser; (3) IMGT Loader; (4) ExperimentalDesign and (5) Results summarizer and reporting (Figure 2). All tools and their respective functional description are summarized in Table 1. Instructions for installation and use of IGGalaxy are available at the project website (http://bioinformatics.erasmusmc.nl/wiki/index.php/Immunoglobulin_Galaxy).
Table 1.
Tool name | Description |
---|---|
igBLASTn | Galaxy wrapper for the igBLASTn executable |
igBLAST Report Parser | Parsing igBLASTn report |
IMGT Loader | Converting IMGT output (a zip file) into our proprietary format. |
Experimental Design | Merging multiple BLAST or IMGT/H files and creating an experimental design |
Report | Plotting the merged data Plotting the merged data Creating a graph of the merged reports |
igBLASTn ImmuneRepertoire | Automatically runs all the tools needed for a repertoire analysis with igBLASTn |
IMGT ImmuneRepertoire | Automatically runs all the tools needed for a repertoire analysis with output from IMGT (zip files). |
IMGT Convert | Converting the 1_, 5_ and 6_ files |
All tools with the name including BLAST or IMGT are used for analysis of igBAST and IMGT respectively. The remaining tools work for both igBLAST and IMGT.
IGGalaxy VDJ CDR3 identification
FASTA reads must be preprocessed prior to analysis for tag removal and by trimming for low quality reads either prior to loading into IGGalaxy or by utilizing existing tools in Galaxy and IGGalaxy. FASTA sequences for all samples and replicates can be uploaded to the server quality checked, and are then ready for further analysis in IGGalaxy.
The identification of the CDR3 region and corresponding V, D, and J chains from the submitted FASTA sequences is achieved in IGGalaxy either with igBLAST or using the external IMGT/High-V-QUEST server. The igBLAST service (igBLAST) and corresponding human BLAST database (ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/) have been implemented in IGGalaxy. The standardized output using version 2.2.0.7 of igBLAST is delivered by wrapping igBLASTn with default parameters and configured to use the human reference database for germline V(D)J alignment. The output from the igBLAST service is extracted using a purpose built BLAST parser (igBLASTreport parser) designed to extract the V,D,J and CDR3 regions and convert them into the IGGalaxy feature format (Figure 3 and Additional file 1: Table S1).
In addition the user can submit their FASTA sequences to the IMGT/High-V-QUEST service (http://www.imgt.org/) [2] which has complementary features to igBLAST. The standard IMGT/High-V-QUEST analysis delivers eleven output files, a description of which is provided at http://www.imgt.org/IMGT_vquest/share/textes/imgtvquest.html#output3. To allow our users access to this content in as the alternative input to igBLAST in IGGalaxy we developed a Python loader (Upload Zip File) which transfers the compressed IMGT/High-V-QUEST output to IGGalaxy. In addition a Python parser (IMGT Loader) extracts the contents from specific files that contain the columns needed to form the proprietary format (the “1_Summary”, “5_AA-sequence” and “6_Junction” files). The parsed output from IMGT/High-V-QUEST uses common definitions with the igBLAST ReportParser (Additional file 1: Table S1) which thus normalizes the data for further analysis with IGGalaxy (Additional file 1: Table S2).
IGGalaxy experimental design
This module allows the user to combine the parsed igBLAST results or the IMGT/High-V-QUEST with an experimental design (i.e. name samples and replicates) prior to calculation and reporting of the resultant IGH repertoires frequencies. The merger tool (ExperimentalDesign) is a python module which accepts either the parsed igBLAST file or the parsed IMGT/High-V-QUEST files and outputs a standardized file, which is required for IGGalaxy Reporting-(see Additional file 1: Tables S1 and S2). This feature of IGGalaxy allows users to utilize both formats for repertoire analysis and provides IGGalaxy extensible and backward compatibility with igBLAST and IMGT/High-V-QUEST should requirements change.
To allow the output of other annotation tools - in addition to igBLAST and IMGT - to be used with this pipeline, a clear definition of the data format has been created (Additional file 1: Table S1).
IGGalaxy reporting
IGGalaxy provides a flexible web based analysis, viewing and downloading of results, and further analysis options using existing Galaxy genomic and statistical analysis functionality via the Report tool.
The Report tool uses a combination of several columns to determine if a sequence is functional (i.e. not out of frame or in frame with a stop codon and the CDR3 region identified) when using igBLAST and alternatively the “unproductive” annotation supplied by IMGT/HighV-QUEST.
The output is presented in an HTML report page which summarizes the frequency of V, D, and J chains as bar charts as well as the combination V-D, V-J and D-J heatmaps based on the definition of unique sequences (e.g. V-J-CDR3 amino acid) selected in the drop down menu (Figure 4). A hyperlink embedded in the HTML report page provides the user with access to the underlying sequence results from which these plots were generated. The categories include the sample identifier as defined by the user, the unique sequence identifier relating to the original FASTA input files, the class for V-DV-J & D-J-and associated CDR (1, 2 and 3) regions and if the V-D-J, class was found in both nucleotide and protein sequences (Figure 4). The distribution of colors from yellow (low counts) to purple (high counts) normalized within a sample represent a visual cue to the variation on V, D and J repertoire for each individual sample analyzed (Figure 4).
“End to end” analysis workflows
The implementation of the VDJ CDR3 identification, Experimental Design and Reporting in Galaxy provides the capability to chain components together in a workflow. The two Immune Repertoire analysis workflows for both igBLAST and IMGT input provide the user with an “end to end” analysis from data upload to reporting. Note that the QC component remains separate and should still be performed prior to using either of these workflows.
Results and discussion
Evaluation of searching method
IMGT/High-V-QUEST and igBLAST are two pattern matching algorithms used to determine the IgH repertoire from long read NGS data (e.g. 454 reads) of S22 which has been used to demonstrate the validity of IGH VDJ assignment by several alignment utilities [8]. The S22 genotype [8] was used as the gold standard for subsequent calculation. We utilized the functionality within IGGalaxy to process both the IMGT and the parsed igBLAST results and to determine similarity of VDJ recombination detection obtained by both methods. The output of both workflows can be summarized in a single view to compare the frequency of V, D and J recombination as determined by each application using the igReport functionality (Figure 5).
The majority of VDJ recombination events determined for S22 with IGGalaxy achieved 99.9% similarity to the published genotype of S22 according to Jackson et al [8] (supplementary data) whilst the remaining 0.1% composed mainly of V4-61 or V4-59 assignment represented by 249 of the 283 sequences (Table 2). The comparison between IGGalaxy output and that of Jackson et al [8] was achieved by matching the sequences based on their VDJ identification and then determining similarity using the sequence read identifier number.
Table 2.
Method | IGHV | IGHD | IGHJ | Total |
---|---|---|---|---|
IMGT | 0.48 | 5.45 | 0.00 | 5.93 |
igBLAST | 0.75 | 4.55 | 0.40 | 5.70 |
The values represent the % errors compared to the total number of reads assigned to the golden standard S22 genotype.
Case study: RAGD project
We used IGGalaxy to characterize the IgH repertoire related to the immunological mechanisms causing the clinical spectrum of recombination-activating gene (RAG) deficiency (RAGD) [9]. Mutations in the recombination activating genes 1 (RAG1) and RAG2 result in loss or reduction of V(D)J recombination. In those patients that had detectable peripheral B cells we studied the IGH V(D)J recombination repertoire. IgH gene rearrangements were amplified from mononuclear cells derived from peripheral blood (PB) and/or bone marrow (BM), subsequently sequenced using next generation sequencing in healthy controls (PB and BM) and in three patients with combined immunodeficiency (CID) [9].
We previously analyzed these samples using IMGT/High-V-QUEST in combination with excel to generate the V(D)J usage frequencies [9]. The output from this reanalysis but using the IGGalaxy resulted in a similar trend to that found by Ijspeert [9], which is for control samples to have more unique sequences (Table 3).
Table 3.
Samples | All sequences | Unique sequences | Unproductive | Productive |
---|---|---|---|---|
Control BM | 35800 | 18553 (51.8) | 9198 (25.7) | 26602 (74.3) |
P16 BM | 12166 | 3282 (27) | 2812 (23.1) | 9354 (76.9) |
Control PB | 18901 | 11912 (63) | 5043 (26.7) | 13858 (73.3) |
P15 PB | 16080 | 7667 (47.7) | 2433 (15.1) | 14375 (89.4) |
P16 PB | 14543 | 3712 (25.5) | 2126 (14.6) | 12417 (85.4) |
P18 PB | 25093 | 3656 (14.6) | 4421 (17.6) | 20672 (82.4) |
Unproductive = non- functional sequences and productive = functional sequences as determined with IgBLAST. PB = peripheral blood, BM = bone marrow, control = unaffected individual.
The distribution of V, D and J regions as determined by the igBLAST ImmunoRepertoire function in IGGalaxy is shown as a summary between individuals in a bar chart (Figure 6) and also as a heatmap. There is clear difference between the J gene usage in the control population as compared with the RAGD patients sampled from either peripheral blood or from the bone marrow.
Performance
We tested IGGalaxy VM using one sample (3.37 Mb, 13153 sequences) on a dual socket X5550 running at 3Ghz with 24GB of ram; to represent a more realistic environment for a standard PC the IGGalaxy VM was allocated 1 core and 1GB of memory. The igBLASTn component is the rate limiting step in the workflow (Table 4) which can be improved by assigning more cores to execute IgBLAST code with an optimal number between 4 to 8 cores (unpublished observations).
Table 4.
IGGalaxy tool | Average time (seconds) | Sequences per second |
---|---|---|
igBLASTn | 604 | 21 |
igBLASTn parser | 14 | 939 |
IMGT Loader | 9 | 1461 |
Experimental Design | 8 | 1644 |
Report Repertoire | 12 | 1096 |
Conclusions
In summary IGGalaxy provides the immunologist with a user friendly standardized client (or server) application to study repertoire analysis by two methods, igBLAST and IMGT/HighV-QUEST. IGGalaxy simplifies pre- and post-processing of FASTA sequences based on user defined clonal type selection and providing extensive visual reporting of the repertoire and associated downloadable report.
We have demonstrated accuracy of determining clonal types using IGGalaxy with the S22 data used as a golden standard and the added value of IGGalaxy for accuracy and user friendliness in re-analysis of the previously determined repertoires for RAGD patients [9]. To achieve these results we have implemented several open source services (including – of primary importance igBLAST parser, and IGGalaxy VM) which are freely available for download and for use by the research community to support immune repertoire analysis.
The common format for parsed BLAST and imported IMGT/HighV-QUEST results by IGGalaxy allows users to utilize both formats for repertoire analysis and provides IGGalaxy with future backward compatibility for both igBLAST and IMGT/High-V-QUEST and to extend the functionality of IGGalaxy beyond its current capability.
Availability and requirements
Project name: ImmonoGlobulin galaxy
Availability: http://bioinformatics.erasmusmc.nl/wiki/index.php/Immunoglobulin_Galaxy
Operating system: Linux running in a virtual machine, VMWare, which is available for Windows, Linux and Mac OS. The virtual machine requires VMware player (https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0).
Other Dependencies: The web application is implemented independently of operating system and has been successfully tested with Microsoft Internet Explorer 11.0 and Firefox 2/3 (under different versions of Linux, Microsoft Windows).
Programming Language: Python (IGGalaxy), Perl (igBLAST parser), R (IGGalaxy)
License: None required
Any restrictions to use by non-academics: No
Acknowledgements
This study was supported by a grant from the Dutch Organization for Scientific Research (NWO/ZonMw
VIDI grant 91712323). We would like to thank Ivo Palli from the Bioinformatics Department for his IT support for this project.
Abbreviations
- BLAST
Basic local alignment and search tool
- RAG
Recombination-activating gene
- RAGD
Recombination-activating gene deficiency
- VM
Virtual machine
- IMGT/HighV-Quest
ImMunoGeneTics (IMGT)/HighV-Quest
- CID
Combined immunodeficiency
Additional file
Footnotes
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MM, DvZ, SH and APS developed IGGalaxy webserver, analysis pipeline and drafted the manuscript. HI, APS and MvdB developed the experimental design and conception of this study. MvdB, HI and PvdS participated in the pipeline development and writing of the final manuscript. All authors read and approved the final version of the manuscript.
Contributor Information
Michael J Moorhouse, Email: michael.moorhouse@oicr.on.ca.
David van Zessen, Email: d.vanzessen@erasmusmc.nl.
Hanna IJspeert, Email: h.ijspeert@erasmusmc.nl.
Saskia Hiltemann, Email: s.hiltemann@erasmusmc.nl.
Sebastian Horsman, Email: s.horsman@erasmusmc.nl.
Peter J van der Spek, Email: p.vanderspek@erasmusmc.nl.
Mirjam van der Burg, Email: m.vanderburg@erasmusmc.nl.
Andrew P Stubbs, Email: a.stubbs@erasmusmc.nl.
References
- 1.Alt FW, Yancopoulos GD, Blackwell TK, WoodAlt FW, Yancopoulos GD, Blackwell TK, Wood C, Thomas E, Boss M, Coffman R, Rosenberg N, Tonegawa S, Baltimore D. Ordered rearrangement of immunoglobulin heavy chain variable region segments. EMBO J. 1984;3:1209–1219. doi: 10.1002/j.1460-2075.1984.tb01955.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Alamyar E, Duroux P, Lefranc MP, Giudicelli V. IMGT((R)) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol Biol. 2012;882:569–604. doi: 10.1007/978-1-61779-842-9_32. [DOI] [PubMed] [Google Scholar]
- 3.Ye J, Ma N, Madden TL, Ostell JM. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013;41:W34–W40. doi: 10.1093/nar/gkt382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15:1451–1455. doi: 10.1101/gr.4086505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010;89:19.10.1–19.10.21. doi: 10.1002/0471142727.mb1910s89. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Goecks J, Nekrutenko A, Taylor J, The Galaxy Team Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;25:R86. doi: 10.1186/gb-2010-11-8-r86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Smith JE, Nair R. The architecture of virtual machines. Computer. 2005;3:32–38. doi: 10.1109/MC.2005.173. [DOI] [Google Scholar]
- 8.Jackson KJ, Boyd S, Gaëta BA, Collins AM. Benchmarking the performance of human antibody gene alignment utilities using a 454 sequence dataset. Bioinformatics. 2010;26:3129–3130. doi: 10.1093/bioinformatics/btq604. [DOI] [PubMed] [Google Scholar]
- 9.IJspeert H, Driessen GJ, Moorhouse MJ, Hartwig NG, Wolska-Kusnierz B, Kalwak K, Pituch-Noworolska A, Kondratenko I, van Montfrans JM, Mejstrikova E, Lankester AC, Langerak AW, van Gent DC, Stubbs AP, van Dongen JJ, van der Burg M. Similar recombination-activating gene (RAG) mutations result in similar immunobiological effects but in different clinical phenotypes. J Allergy Clin Immunol. 2014;133:1124–1133. doi: 10.1016/j.jaci.2013.11.028. [DOI] [PMC free article] [PubMed] [Google Scholar]