Table 2.
Online comparison of Bystro and recent programs in filtering 8.49 × 10⁷ variants from 1000 Genomes
Group | Search query | Time (s) | Variants | Tr:Tv |
---|---|---|---|---|
1 | Exonic | 0.030 ± 0.030 | 993,343 | 2.96 |
2 (a) | cadd > 20 maf < .001 pathogenic expert review missense | 0.029 ± 0.009 | 65 | 1.71 |
2 (b) | cadd > 20 maf < .001 pathogenic expert’s review non-synonymous | 0.036 ± 0.019 | 65 | 1.71 |
2 (c) | cadd > 20 maf < .001 pathogen expert-reviewed nonsynonymous | 0.044 ± 0.025 | 65 | 1.71 |
3 (a) | Early onset breast cancer | 0.046 ± 0.029 | 4335 | 2.51 |
3 (b) | Early-onset breast cancer | 0.037 ± 0.020 | 4335 | 2.51 |
3 (c) | Early onset breast cancers | 0.033 ± 0.015 | 4335 | 2.51 |
4 (a) | Pathogenic nonsense Ehlers-Danlos | 0.038 ± 0.027 | 1 | NA |
4 (b) | Pathogenic nonsense E.D.S | 0.078 ± 0.087 | 1 | NA |
4 (c) | Pathogenic stopgain eds | 0.040 ± 0.022 | 1 | NA |
The full 1000 Genomes Phase 3 VCF file (853 GB, 8.49 × 10⁷ variants, 2504 samples) was filtered in the publicly available Bystro web application using the Bystro natural-language search engine. VEP, GEMINI, and wANNOVAR (not shown) were also tested, but were unable to annotate or filter this dataset. Bystro’s search engine uses a natural-language parser that allows for unstructured queries: the queries in groups 2, 3, and 4 show phrasing variations that did not affect the results returned, as would be expected of a search engine that handles normal language variation. “Tr:Tv” is the transition-to-transversion ratio, automatically calculated for each query by the search engine. The transition-to-transversion ratio of 2.96 for the “exonic” query is close to the ~2.8–3.0 ratio expected in coding regions, suggesting that the search engine accurately identified exonic (coding) variants.
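For readers unfamiliar with the Tr:Tv statistic, the following is a minimal sketch of how a transition-to-transversion ratio can be computed from a set of single-nucleotide variants. It is not Bystro's implementation; the function names `is_transition` and `tr_tv_ratio`, the `(ref, alt)` input format, and the choice to return `None` when the ratio is undefined are all assumptions made for illustration.

```python
# Illustrative sketch only; names and input format are hypothetical,
# not taken from Bystro.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref: str, alt: str) -> bool:
    """Return True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def tr_tv_ratio(snvs):
    """Compute Tr:Tv for an iterable of (ref, alt) allele pairs.

    Records that are not simple single-base substitutions (indels,
    ambiguous bases, ref == alt) are skipped. Returns None when no
    transversions are present, since the ratio is then undefined.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single-nucleotide substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


if __name__ == "__main__":
    # Three transitions (A>G, C>T, G>A) and one transversion (A>C).
    example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(tr_tv_ratio(example))  # 3.0
```

Under this definition, a variant set enriched for coding changes would be expected to yield a ratio near the ~2.8–3.0 range cited above, while very small result sets may have no defined ratio at all.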