ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

Aine Fairbrother-Browne; Sonia García-Ruiz; Regina Hertfelder Reynolds; Mina Ryten; Alan Hodgkinson

doi:10.46471/gigabyte.91

. 2023 Sep 14;2023:1–10. doi: 10.46471/gigabyte.91

ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

Aine Fairbrother-Browne ^1,^2,^3,^*, Sonia García-Ruiz ^1,⁴, Regina Hertfelder Reynolds ^1,⁴, Mina Ryten ^1,^4,^†, Alan Hodgkinson ^2,^†

PMCID: PMC10507293 PMID: 37732134

Abstract

We present ensemblQueryR, an R package for querying Ensembl linkage disequilibrium (LD) endpoints. This package is flexible, fast and user-friendly, and optimised for high-throughput querying. ensemblQueryR uses functions that are intuitive and amenable to custom code integration, familiar R object types as inputs and outputs as well as providing parallelisation functionality. For each Ensembl LD endpoint, ensemblQueryR provides two functions, permitting both single- and multi-query modes of operation. The multi-query functions are optimised for large query sizes and provide optional parallelisation to leverage available computational resources and minimise processing time. We demonstrate improved computational performance of ensemblQueryR over an exisiting tool in terms of random access memory (RAM) usage and speed, delivering a 10-fold speed increase whilst using a third of the RAM. Finally, ensemblQueryR is near-agnostic to operating system and computational architecture through Docker and singularity images, making this tool widely accessible to the scientific community.

Statement of need

Background

Linkage disequilibrium (LD) is the non-random association of alleles arising from different loci [1]. In population genetics, LD is a measure of the frequency with which an allele of one variant is correlated with an allele of a proximal variant within a particular population [2]. There are many applications for LD measures in genomics workflows. For example, in the context of genome-wide association studies (GWAS), which have been used to detect associations between genetic variants and a wide range of human phenotypes, downstream interrogation of local LD structure is required to identify the potential ‘causal’ variant at a nominated locus that exerts an effect on the downstream phenotype. Equally, in expression quantitative trait loci (eQTL) analyses, which aim to uncover associations between genetic variants and the expression of a cis or trans gene (eGene), LD information is required for the identification of the potential causal variant affecting the expression of the eGene. Further downstream, LD information is useful for functional annotations, where genetic variants or regions in LD with a target variant can aid in the identification of biological processes that might be affected by the GWAS- or eQTL-implicated target variant. As such, it is important that the LD information for a range of human populations can be easily queried by researchers in an efficient and accessible way.

Despite the widespread usage of LD measures in genomic research, the majority of tools available at present are web-based. Although these offer user-friendly interfaces and can be useful for one-off or small queries, they do not promote reproducibility and are not suited to workflow-oriented researchers wishing to submit multiple large queries. Programmatic tools offer a solution to these problems; however, very few tools for the retrieval of LD metrics exist.

To our knowledge, only one R package provides a programmatic interface for LD metric retrieval. LDlinkR (version 1.2.3) [3] provides an R-based interface to the web-based tool LDlink [4], permitting retrieval of LD metrics using a range of query types. However, LDlinkR has a number of key limitations with respect to speed and query handling. Firstly, the user is required to obtain an access token by signing up on the NCBI website, which is then supplied as an argument to all LDlinkR functions. This requirement is in place to limit user queries, meaning that attempts to speed up the tool using parallelisation easily exceed query limits and cause the tool to return timeout errors. This can result in the user’s access token being blocked. Secondly, a number of functions for retrieving LD metrics are configured for singular queries only – such as the LDpair and LDproxy functions – meaning that the user must write custom code to submit more than one query at one time. As such, although LDlinkR is a useful programmatic alternative to the LDlink web tool, it is not suited to fast, high-throughput multi-query retrieval of LD metrics.

Ensembl (RRID:SCR_002344) is another widely used source of LD metrics, offering an application programming interface (API) that supports an array of query configurations[5, 6]. However, some challenges are presented by direct API usage as its usage requires some technical expertise. Additionally, it is not easily integrable with typical R workflows, precludes the input of standard R objects (such as data frames, lists or vectors), does not output data in an intuitive format and is not easily adaptable to high-throughput workflows. To our knowledge, no R package has been developed to facilitate querying the Ensembl API and, in particular, to retrieve Ensembl LD metrics. In light of this, and to address the limitations of current tools, we present ensemblQueryR. Our R package provides fast, efficient, user-friendly querying of Ensembl LD data, with a focus on intuitive, high-throughput R workflow integration.

ensemblQueryR has been made freely available (DOI: 10.5281/zenodo.7837882) [7, 8]. The package can also be used in Docker (RRID:SCR_016445) [9] or Singularity [10] containers, for which the images can be found on Docker Hub [11] or the Singularity image repository [12].

Implementation

Our approach

ensemblQueryR provides a suite of functions that wrap around three Ensembl API ‘endpoints’. These endpoints operate to retrieve data from Ensembl databases through the following query configurations:

1.
Window: retrieval of the LD metrics for a variant and all the other variants in a window around the target variant;
2.
Pair: retrieval of the LD metrics between a pair of target variants;
3.
Region: retrieval of the LD metrics between all pairs of variants in a defined target region.

Single-query and multi-query wrapper functions are provided for each of these Ensembl API endpoints, all of which are described in detail in Table 1.

Table 1.

The functions comprising the ensemblQueryR package and their relationship to the three LD Ensembl API endpoints.

ensemblQueryR function	Ensembl API endpoint	Arguments	Output	Description
ensemblQueryGetPops	Information [5]	N/A	A list of human populations for which LD metrics can be retrieved.	This function retrieves a list of the Ensembl populations for which LD metrics can be queried (for further information on the available populations, see Ensembl [13]). These can be supplied to the ‘pop’ argument across ensemblQueryR functions.
pingEnsembl	Information [5]	N/A	An integer (and a message) to indicate the status of the Ensembl API. Returns 1 and reports “Server OK.” if the server is up.	This function checks and informs the user of the status of the Ensembl API.
ensemblQueryLDwithSNPwindow	Window [5]	rsid r2 d.prime window.size pop	A data frame with five columns: ‘query’ (the variant input to ‘rsid’), snp_in_ld (variant(s) in LD with ‘query’), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function retrieves variants in LD with the query variant within a given genomic window.
ensemblQueryLDwithSNPwindowDataframe	Window [5]	in.table r2 d.prime window.size pop cores	A data frame with five columns: ‘query’ (the variant input to ‘rsid’), snp_in_ld (variant(s) in LD with ‘query’), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function takes a data frame with a column of variant rsIDs (Reference SNP cluster IDs). It retrieves variants in LD with each query variant within a given genomic window.
ensemblQueryLDwithSNPpair	Pair [5]	rsid1 rsid2 pop	A data frame with five columns: ‘query1’ (the variant input to ‘rsid1’), ‘query2’ (the variant input to ‘rsid2’), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function takes a pair of rsIDs and retrieves their LD metrics (D′ and R²).
ensemblQueryLDwithSNPpairDataframe	Pair [5]	in.table pop cores	A data frame with five columns: ‘query1’ (the variant input to ‘rsid1’), ‘query2’ (the variant input to ‘rsid2’), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function takes a data.frame containing paired rsIDs, retrieving LD metrics (D′ and R²) for all pairs.
ensemblQueryLDwithSNPregion	Region [5]	chr start end pop	A data frame with eight columns: ‘query_chr’ (the query chromosome supplied to ‘chr’), ‘query_start’ (the query start coordinate supplied to ‘start’), ‘query_end’ (the query end coordinate supplied to ‘end’), ‘rsid1’ (variant one of two in the pair), ‘rsid2’ (variant two of two in the pair), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function takes a genomic coordinate, retrieving LD metrics (D′ and R²) for all rsID within the defined region.
ensemblQueryLDwithSNPregionDataframe	Region [5]	in.table pop cores	A data frame with eight columns: ‘query_chr’ (the query chromosome supplied to ‘chr’), ‘query_start’ (the query start coordinate supplied to ‘start’), ‘query_end’ (the query end coordinate supplied to ‘end’), ‘rsid1’ (variant one of two in the pair), ‘rsid2’ (variant two of two in the pair), r2 (r-squared statistic), d_prime (D′ statistic), population_name (the population supplied to ‘pop’).	This function takes a data frame containing genomic coordinate(s) and retrieves LD metrics (D′ and R²) for all rsID within the defined region(s).

Open in a new tab

To make ensemblQueryR useful in a high-throughput context, the main challenge is that the Ensembl API endpoints are configured to handle single queries. To address this, ensemblQueryR’s three multi-query functions (with names ending in ‘Dataframe’, as described in Table 1) take data frame objects as input, where each row needs to be submitted as a separate query to the Ensembl API. The base R lapply function (Version 4.0.5) [14] is then used to apply the corresponding single-query function over the input data frame, iteratively formulating an API query from each data frame row.

Building on ensemblQueryR’s high-throughput capabilities, we implemented optional parallelisation for all multi-query functions. Each multi-query function (those with names ending in ‘Dataframe’ in Table 1) has an argument that allows the user to set a number of ‘cores’ to parallelise the query across. Using this functionality can significantly reduce run-time, particularly for larger queries where the parallelisation overheads represent a small proportion of the overall memory requirements. For example, with a query size of 1,000, the ensemblQueryLDwithSNPpairDataframe function running on a single core takes ∼2.4 min, whereas ten cores speed the process up by 22 times, reducing the execution time to ∼0.11 min (System tested on: Ubuntu server 16.04 LTS with kernel version 4.4.0-210-generic, total RAM 251G).

Consistency of the data output format is an important feature of ensemblQueryR, making it amenable to R workflow integration. All functions (single- and multi-query) return a data frame object, including instances where the query returns a null result or an error. This consistent output format simplifies writing custom code, allowing users to incorporate ensemblQueryR functionality into bespoke workflows. It is important to note that when data cannot be retrieved, for example, if a variant is not found in Ensembl databases or the user input is invalid, console messages alert the user of this and a data frame row containing ‘NA’ values is returned for that query.

Benchmarking

LDlinkR is an alternative R package that offers LD metric retrieval. As such, it was important to benchmark against this tool to demonstrate the utility of ensemblQueryR for high-throughput querying. Of the functions contained in the LDlinkR and ensemblQueryR packages, two functions are particularly comparable in their functionality. Both LDpair (from LDlinkR) and ensemblQueryLDwithSNPpair (from ensemblQueryR) take a pair of reference SNP cluster identifiers (rsIDs) as input, while the output is a table containing the LD metrics for the query pair. As such, these functions were selected for benchmarking. To compare the performance of the two functions, the computation speed and RAM usage at three query sizes representing a range of throughputs – 100, 1,000 and 10,000 queries – were assessed (Figure 1). For each function and query size combination, performance (speed and peak RAM usage) was tested ten times to account for temporal fluctuations in processing speed and peak RAM usage, thus enabling a precise performance assessment.

Figure 1. — Comparison of performance metrics between analogous functions of the ensemblQueryR and LDlinkR R packages. (a) Plot showing, for query sizes of 100, 1,000 and 10,000, the maximum RAM usage (mebibytes, MiB) during the execution of ensemblQueryR’s ensemblQueryLDwithSNPpair (red) and LDlinkR’s LDpair (blue). (b) Plot showing, for query sizes of 100, 1,000 and 10,000, the execution time (minutes, mins) of ensemblQueryR’s ensemblQueryLDwithSNPpair (red) and LDlinkR’s LDpair (blue).

Firstly, comparing execution speed, we found that, on average (across the ten tests), ensemblQueryLDwithSNPpair was 10.2 times faster in the 100-query test, taking an average of 0.208 min compared to the 2.12 min for LDpair (Figure 1b). The 1,000-query test found that ensemblQueryLDwithSNPpair was, on average, 9.92 times faster than LDlinkR, taking an average of 1.97 min compared to the 19.5 min for LDpair. Finally, in the 10,000-query test, LDpair was unable to produce a final results table in seven out of ten tests, in these instances returning an error message (‘Bad Gateway (HTTP 502)’). In contrast, ensemblQueryLDwithSNPpair produced a final results table in all tests, demonstrating its reliability for high-throughput querying. Looking at the only three successful tests of LDpair, we found that ensemblQueryLDwithSNPpair was, on average, 10.9 times faster than LDpair, taking an average of 18.5 min compared to the 202 min (>3 h) for LDpair. These speed improvements are likely due to request rate limits from the server side, which are higher for Ensembl, thus enabling fast concurrent or parallel requests.

Secondly, we compared the peak RAM usage – the maximum RAM utilised at any time during function execution – between ensemblQueryLDwithSNPpair and LDpair (Figure 1a). We found that across query sizes, ensemblQueryLDwithSNPpair had approximately a third (range: 20.8–49.6%) of the peak RAM usage of LDpair. These peak RAM usage improvements are likely due to a focus on within-function reductions of intermediate object storage and a reduction of the number of operations carried out within the ensemblQueryLDwithSNPpair function.

We conclude that, by comparing analogous functions (ensemblQueryLDwithSNPpair and LDpair), ensemblQueryR provides a performance improvement over LDlinkR with respect to both speed (×10) and memory usage (1/3), underscoring the utility of our tool in the context of high-throughput workflows.

Usage

The following code provides an implementation example for ensemblQueryR. In this case, the target variant was taken from the OpenTargets homepage [15] with rsID rs4129267. To find out which variants, within a set of genomic window sizes, are in LD (R² > 0.8 and D′ > 0.8) with the target variant, the ensemblQueryLDwithSNPWindow function was implemented as in Figure 2.

Figure 2. — Working example of an ensemblQueryR implementation.

Methods

Docker and singularity images

In order to make this tool widely accessible to the research community, we have provided a Docker image (Figure 3). It is based on a Rocker (RRID:SCR_024215) [16] with R version 4.0.0 and Ubuntu version 20.04 LTS (Focal Fossa) pre-installed. To start a container using the ‘ensemblqueryr’ Docker image and launch an interactive R session within which the tool can be used, the following command can be executed on the command line:

docker pull ainefairbrotherbrowne/ensemblqueryr:1.0; ∖

docker run -t -d --name ensemblqueryr ainefairbrotherbrowne/ensemblqueryr:1.0; ∖

docker exec -i -t ensemblqueryr R

To mount a local volume that contains a file of variant IDs, the following command can be used:

docker pull ainefairbrotherbrowne/ensemblqueryr:1.0; ∖

docker run -t -d --name ensemblqueryr ainefairbrotherbrowne/ensemblqueryr:1.0 -- path/to/volume; ∖

docker exec -i -t ensemblqueryr R

Additionally, for use-cases wherein the user does not have sufficient permissions to use Docker, such as on HPC clusters, a singularity image can be used instead. To generate a singularity image, we converted the ensemblQueryR docker image [11] into a singularity image using the Docker-based tool docker2singularity (Version 2.6) [17]. This image can be found on the sylabs.io singularity repository [12]. The singularity image can be pulled and executed, launching a command line R session as follows:

singularity pull --arch amd64 library://ainefairbrother/ensemblqueryr/ensemblqueryr:sha256.e387ea11ae4eaea8f94d81c625c2c1d5a22dd351858ebcd03910a7736d76ca30; ∖

singularity exec ensemblqueryr_sha256.e387ea11ae4eaea8f94d81c625c2c1d5a22dd351858ebcd03910a7736d76ca30.sif R

Benchmarking

To benchmark our package against LDlinkR, we first selected functions that performed similar tasks. The most comparable functions in this context were LDlinkR’s ‘LDpair’ and ensemblQueryR’s ‘ensemblQueryLDwithSNPpair’. Both are configured for single queries and take a pair of variant IDs as input, retrieving the corresponding LD metrics through an API query. To run multiple queries using these single-query functions, both functions were applied over variant identifier (ID) vectors using the base R lapply (Version 4.0.5) [14].

To compare the utility of these functions for LD metric querying in the context of high-throughput workflows, we measured the peak RAM usage and execution speed using the R package peakRAM (version 1.0.2) [18] 10 times for each function and at three query sizes (100, 1,000 and 10,000). The standard deviation (sd) of the performance metric for each function at each time point across the ten iterations was calculated. Mean performance metrics were plotted with error bars to show the sd from the mean for peak RAM usage (Figure 1a) and time measured in minutes (Figure 1b) for the three query sizes.

Limitations

It is important to note that there are some limitations to this R package. First, although this package enables high-throughput querying of the Ensembl API, there is an inherent limit to the number of queries that can be submitted arising from the API query limit, which is set at 54,000 requests per hour [5, 6]. On the Ubuntu system used to develop ensemblQueryR (Ubuntu server 16.04 LTS with kernel version 4.4.0-210-generic, total RAM 251 G), 54,000 API requests via ensemblQueryLDwithSNPpairDataframe took ∼1.93 h on a single core, making the per-hour request rate 27,867. As such, even query sizes of 54,000 can be run unparallelised and are unlikely to exceed the Ensembl API hourly rate. However, this request limit must be considered by users when applying parallelisation to large queries. The second limitation is that the parallelisation library used to enable the multi-core functionality is the R package ‘parallel’ (version 3.6.2) [19], which works on OSIX systems (Mac, Linux and other Unix-based operating systems) but does not work on Windows.

Scope for future development

At present, this package provides wrappers for the Ensembl API endpoints that retrieve LD data. However, the Ensembl API offers ∼109 other endpoints [5], all of which have the potential to be wrapped into R functions and included in this package. As such, there is scope for the usefulness of this package beyond LD metrics and further development will expand its utility to R users across an array of bioscience disciplines.

Availability of source code and requirements

Project name: ensemblQueryR
Project home page: https://github.com/ainefairbrother/ensemblQueryR
Operating system(s): Platform independent
Programming language: R
Licence: MIT
RRID: SCR_024216.

Funding Statement

AFB was supported through the award of a Biotechnology and Biological Sciences Research Council (BBSRC UK) London Interdisciplinary Doctoral Fellowship. SGR and MR were supported through the award of a Tenure Track Clinician Scientist Fellowship to MR (MR/N008324/1). AH was funded by a BBSRC award (BB/R006075/1).

Data Availability

Snapshots of the code are available in GigaDB [20].

Abbreviations

API, Application Programming Interface; eGene, expression of a cis or trans gene; eQTL: Expression Quantitative Trait Loci; GWAS: Genome-Wide Association Study; ID, identifier; LD: Linkage Disequilibrium; RAM: Random Access Memory; REST: Representational State Transfer; rsID: Reference SNP cluster IDs; sd: standard deviation.

Declarations

Ethical approval

The authors declare that ethical approval was not required for this type of research.

Competing Interests

RHR is currently employed by CoSyne Therapeutics (Lead Bioinformatician). All work performed for this publication was performed in her own time, and not as a part of her duties as an employee.

Author contributions

AFB and RHR: conceptualisation. AFB: software. AFB and SGR: docker image. AFB: benchmarking. AFB: writing - original draft. AFB, SGR, AH, MR and RHR: writing - review and editing. MR and AH: supervision.

Funding

References

1.Slatkin M. . Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet., 2008; 9: 477–485. doi: 10.1038/nrg2361. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Bush WS, Moore JH. . Chapter 11: Genome-wide association studies. PLoS Comput. Biol., 2012; 8(12): e1002822. doi: 10.1371/journal.pcbi.1002822. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Myers TA, Chanock SJ, Machiela MJ. . LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet., 2020; 11: 157. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Machiela MJ, Chanock SJ. . LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 2015; 31(21): 3555–3557. doi: 10.1093/bioinformatics/btv402. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Ensembl Rest API - Ensembl REST API Endpoints . 2015; https://rest.ensembl.org/. Accessed 5 Nov 2022.
6.Yates A, Beal K, Keenan S et al. The ensembl REST API: Ensembl data for any language. Bioinformatics, 2015; 31(1): 143–145. doi: 10.1093/bioinformatics/btu613. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Fairbrother-Browne A. ensemblQueryR (Version 2.0.0). Zenodo, 23 May 2023; 10.5281/zenodo.7837882. [DOI]
8.Fairbrother-Browne A. ensemblQueryR (Version 2.0.0). GitHub, 23 May 2023; https://github.com/ainefairbrother/ensemblQueryR.
9.Merkel D. . Docker: lightweight Linux containers for consistent development and deployment. Linux J., 2014; 2014(239): 2. [Google Scholar]
10.Kurtzer GM, Sochat V, Bauer MW. . Singularity: Scientific containers for mobility of compute. PLOS One, 2017; 12(5): e0177459. doi: 10.1371/journal.pone.0177459. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Fairbrother-Browne A. ensemblqueryr (Version 1.0). Docker Hub, 22 May 2023; https://hub.docker.com/r/ainefairbrotherbrowne/ensemblqueryr.
12.Fairbrother-Browne A. ensemblqueryr (Version 1.0). sylabs.io, 22 May 2023; https://cloud.sylabs.io/library/ainefairbrother/ensemblqueryr/ensemblqueryr.
13.Ensembl Population Frequencies & Genotypes . https://www.ensembl.org/info/genome/variation/species/populations.html. Accessed 23 May 2023.
14.R Core . R: A language and environment for statistical computing, 2021; Available at: https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.
15.Open Targets Genetics . https://genetics.opentargets.org/. Accessed 23 May 2023.
16.Boettiger C, Eddelbuettel D. . An introduction to Rocker: Docker containers for R. R J., 2017; 9(2): 527–536. doi: 10.32614/RJ-2017-065. [DOI] [Google Scholar]
17.Singularity Hub . Docker2singularity (Version 2.6), 2018; https://github.com/singularityhub/docker2singularity/releases/tag/v2.6-release.
18.Quinn T. peakRAM (Version 1.0.2), 2017; https://cran.r-project.org/web/packages/peakRAM/vignettes/peakRAM.html.
19.R-core. Parallel (Version 3.6.2), . 2023; https://www.rdocumentation.org/packages/parallel/versions/3.6.2.
20.Fairbrother-Browne A, García-Ruiz S, Reynolds RH et al. Supporting data for “ensemblQueryR: fast, flexible and high-throughput querying of Ensembl API LD endpoints in R”. GigaScience Database, 2023; 10.5524/102435. [DOI] [PMC free article] [PubMed] [Google Scholar]

GigaByte. 2023 Sep 14;2023:1–10.

Article Submission

Aine Fairbrother-Browne

GigaByte.

Assign Handling Editor

Editor: Scott Edmunds

GigaByte.

Editor Assess MS

Editor: Hongfang Zhang

GigaByte.

Curator Assess MS

Editor: Chris Armit

GigaByte.

Review MS

Editor: Aziz Khan

Reviewer name and names of any other individual's who aided in reviewer	Aziz Khan
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Please add additional comments on language quality to clarify if needed
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Additional Comments
Is the source code available, and has an appropriate Open Source Initiative license <a href="https://opensource.org/licenses" target="_blank">(https://opensource.org/licenses)</a> been assigned to the code?	No
Additional Comments	There are two license files (the LICENSE, LICENSE.md) and both are conflicting. The authors should use one to make sure it is an appropriate Open Source Initiative license.
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	No
Additional Comments	There are no guidelines on how to contribute.
Is the code executable?	Yes
Additional Comments
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Yes
Additional Comments	Installation was straightforward.
Is the documentation provided clear and user friendly?	Yes
Additional Comments
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?	Yes
Additional Comments
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Additional Comments	The DESCRIPTION file has all the required packages.
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	No
Additional Comments
Additional Comments
Are there (ideally real world) examples demonstrating use of the software?	Yes
Additional Comments	There are some use cases in the minimal documentation. Some of these are not working.
Is automated testing used or are there manual steps described so that the functionality of the software can be verified?	Yes
Additional Comments	The package is already in the CRAN, which means it passed its checks. It does have example snippets of code to be run.
Any Additional Overall Comments to the Author	The ensemblQueryR package offers an R interface for querying Ensembl REST API linkage disequilibrium (LD) endpoints. The package aims to support high-throughput querying, utilizing familiar R object types and parallelization. The package is already in CRAN and comes with Docker/Singularity options. While the paper highlights notable strengths, such as the R interface, high-throughput support, parallelization, and performance improvements, a few points warrant further consideration and clarification. (1). The package is already in CRAN, which is good, but there are issues with the use-case examples. When I executed the test case to query LD metrics for a variant within a window, the package returned a tibble containing NAs. (2). The title claims the package supports querying "several" Ensembl API endpoints, but the current version is limited to LD endpoints. Revising the title to reflect the package's current capabilities accurately is recommended. (3). Two conflicting license files exist (the LICENSE, LICENSE.md). The authors should select one to ensure it aligns with an appropriate Open Source Initiative license. Additionally, there are no guidelines provided for contributing to the project. (4). I have encountered this R package to get Ensembl data in R using the REST API - https://github.com/dwinter/rensembl. Can the authors comment on this?
Recommendation	Minor Revisions

Open in a new tab

GigaByte.

Review MS

Editor: Andrew Yates

Reviewer name and names of any other individual's who aided in reviewer	Andy Yates
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Please add additional comments on language quality to clarify if needed
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Additional Comments
Is the source code available, and has an appropriate Open Source Initiative license <a href="https://opensource.org/licenses" target="_blank">(https://opensource.org/licenses)</a> been assigned to the code?	Yes
Additional Comments
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	Yes
Additional Comments	Whilst nothing explicit is there the code is hosted on GitHub & therefore using the issue tracker would appear to be the right way to do this
Is the code executable?	Yes
Additional Comments
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Yes
Additional Comments
Is the documentation provided clear and user friendly?	Yes
Additional Comments
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?	Yes
Additional Comments
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Additional Comments
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Yes
Additional Comments
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?	No
Additional Comments	The code uses data from Ensembl so this is somewhat sufficient and uses rsIDs
Are there (ideally real world) examples demonstrating use of the software?	Yes
Additional Comments
Is automated testing used or are there manual steps described so that the functionality of the software can be verified?	Yes
Additional Comments
Any Additional Overall Comments to the Author	The authors present in their manuscript “ensemblQueryR: fast, flexible and high-throughput querying of Ensembl API endpoints in R”; a library for the popular R programming language and specifically addresses fast retrieval of linkage disequilibrium from Ensembl’s REST APIs. The toolkit provides a convenient set of functions which mediates queries into the Ensembl REST API and presents results in a manner which is optimised and familiar for R users. Overall I find the manuscript good and provides a useful toolkit for R developers. It is pleasing to see more people use the Ensembl REST APIs and to expand its user base. The analysis benchmarks also appear appropriate. I have only two minor comments. 1. Code examples Whilst I was able to execute the given example I did encounter some issues. Namely that when executing the code ensemblQueryR::ensemblQueryLDwithSNPwindow(rsid="rs4129267", r2=0.8, d.prime=0.8, window.size=500, pop="1000GENOMES:phase_3:EUR") I got the following result query snp_in_ld r2 d_prime population_name <chr> <lgl> <lgl> <lgl> <lgl> 1 rs4129267 NA NA NA NA It was disconcerting to not get any results back and now I am unsure if I did the right thing. I then found the GitHub repository was a far better reference for the correct commands to run and would encourage perhaps less information in the manuscript and the authors pointing readers to the GitHub repository or the site https://ainefairbrother.github.io/ensemblQueryR/. 2. Extensions The authors suggest possible extensions to ensemblQueryR but it would be useful to know if there are any specific developments planned by the authors.
Recommendation	Minor Revisions

Open in a new tab

GigaByte.

Editor Decision

Editor: Hongfang Zhang

GigaByte. 2023 Sep 14;2023:1–10.

Minor Revision

Aine Fairbrother-Browne

GigaByte.

Assess Revision

Editor: Hongfang Zhang

GigaByte.

Final Data Preparation

Editor: Chris Armit

GigaByte.

Editor Decision

Editor: Hongfang Zhang

GigaByte.

Accept

Editor: Scott Edmunds

Editor’s Assessment

ENSEMBL is a crucial resource for the vertebrate genomics community. And ensemblQueryR is an R interface to the Ensembl API that facilitates flexible, fast, user-friendly and R workflow integrable querying of the database linkage disequilibrium endpoints. It’s been presented as easy to deploy, and after some debugging the work demonstrate that ensemblQueryR has improved performance memory usage and speed, delivering a 10-fold speed increase over analogous software whilst using a third of the RAM. As the Ensembl API offers >100 other endpoints, all of these have the potential to be wrapped into R functions and included in this package.

Open in a new tab

GigaByte.

Export to Production

Editor: Scott Edmunds

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Snapshots of the code are available in GigaDB [20].

[ref1] 1.Slatkin M. . Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet., 2008; 9: 477–485. doi: 10.1038/nrg2361. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] 2.Bush WS, Moore JH. . Chapter 11: Genome-wide association studies. PLoS Comput. Biol., 2012; 8(12): e1002822. doi: 10.1371/journal.pcbi.1002822. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref3] 3.Myers TA, Chanock SJ, Machiela MJ. . LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet., 2020; 11: 157. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4.Machiela MJ, Chanock SJ. . LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics, 2015; 31(21): 3555–3557. doi: 10.1093/bioinformatics/btv402. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5.Ensembl Rest API - Ensembl REST API Endpoints . 2015; https://rest.ensembl.org/. Accessed 5 Nov 2022.

[ref6] 6.Yates A, Beal K, Keenan S et al. The ensembl REST API: Ensembl data for any language. Bioinformatics, 2015; 31(1): 143–145. doi: 10.1093/bioinformatics/btu613. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7.Fairbrother-Browne A. ensemblQueryR (Version 2.0.0). Zenodo, 23 May 2023; 10.5281/zenodo.7837882. [DOI]

[ref8] 8.Fairbrother-Browne A. ensemblQueryR (Version 2.0.0). GitHub, 23 May 2023; https://github.com/ainefairbrother/ensemblQueryR.

[ref9] 9.Merkel D. . Docker: lightweight Linux containers for consistent development and deployment. Linux J., 2014; 2014(239): 2. [Google Scholar]

[ref10] 10.Kurtzer GM, Sochat V, Bauer MW. . Singularity: Scientific containers for mobility of compute. PLOS One, 2017; 12(5): e0177459. doi: 10.1371/journal.pone.0177459. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11.Fairbrother-Browne A. ensemblqueryr (Version 1.0). Docker Hub, 22 May 2023; https://hub.docker.com/r/ainefairbrotherbrowne/ensemblqueryr.

[ref12] 12.Fairbrother-Browne A. ensemblqueryr (Version 1.0). sylabs.io, 22 May 2023; https://cloud.sylabs.io/library/ainefairbrother/ensemblqueryr/ensemblqueryr.

[ref13] 13.Ensembl Population Frequencies & Genotypes . https://www.ensembl.org/info/genome/variation/species/populations.html. Accessed 23 May 2023.

[ref14] 14.R Core . R: A language and environment for statistical computing, 2021; Available at: https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

[ref15] 15.Open Targets Genetics . https://genetics.opentargets.org/. Accessed 23 May 2023.

[ref16] 16.Boettiger C, Eddelbuettel D. . An introduction to Rocker: Docker containers for R. R J., 2017; 9(2): 527–536. doi: 10.32614/RJ-2017-065. [DOI] [Google Scholar]

[ref17] 17.Singularity Hub . Docker2singularity (Version 2.6), 2018; https://github.com/singularityhub/docker2singularity/releases/tag/v2.6-release.

[ref18] 18.Quinn T. peakRAM (Version 1.0.2), 2017; https://cran.r-project.org/web/packages/peakRAM/vignettes/peakRAM.html.

[ref19] 19.R-core. Parallel (Version 3.6.2), . 2023; https://www.rdocumentation.org/packages/parallel/versions/3.6.2.

[ref20] 20.Fairbrother-Browne A, García-Ruiz S, Reynolds RH et al. Supporting data for “ensemblQueryR: fast, flexible and high-throughput querying of Ensembl API LD endpoints in R”. GigaScience Database, 2023; 10.5524/102435. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

Aine Fairbrother-Browne

Sonia García-Ruiz

Regina Hertfelder Reynolds

Mina Ryten

Alan Hodgkinson

Roles

Abstract

Statement of need

Background

Implementation

Our approach

Table 1.

Benchmarking

Figure 1.

Usage

Figure 2.

Methods

Docker and singularity images

Figure 3.

Benchmarking

Limitations

Scope for future development

Availability of source code and requirements

Funding Statement

Data Availability

Abbreviations

Declarations

Ethical approval

Competing Interests

Author contributions

Funding

References

Article Submission

Dr Aine Fairbrother-Browne

Roles

Assign Handling Editor

Roles

Editor Assess MS

Roles

Curator Assess MS

Roles

Review MS

Roles

Review MS

Roles

Editor Decision

Roles

Minor Revision

Dr Aine Fairbrother-Browne

Roles

Assess Revision

Roles

Final Data Preparation

Roles

Editor Decision

Roles

Accept

Roles

Export to Production

Roles

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases