Abstract
During the last decades, bioinformatics has undergone a vast data explosion. Big data centres are trying to face this crisis by reaching ever higher storage capacities, yet, although several scientific giants are examining how to handle the enormous piles of information in their cupboards, the problem remains unsolved. Every day, massive quantities of valuable information are permanently lost due to infrastructure and storage-space problems, and the motivation for sequencing has fallen behind: the time spent solving storage problems is sometimes longer than the time dedicated to collecting and analysing data. To bring sequencing back to the foreground, scientists have to work around such obstacles and find alternative ways to approach the issue of data volume. The scientific community is experiencing a data-crisis era in which out-of-the-box solutions may ease the typical research workflow, until technological development meets the needs of bioinformatics.
Introduction
Since 1956, and especially over the last decades, storage space needs have grown spectacularly. The problem is that, as time passes, the funding required for storage has increased faster than that required for sequencing itself. This is a major problem that the modern scientist has to face: sequencing has become more troublesome because storage difficulties complicate the whole procedure, and the motivation for sequencing and producing new data has started to fall away (De Silva and Ganegoda, 2016).
Such data comes in the form of short sequencing reads, i.e. short character strings (typically having lengths in the range 75–150). Each character represents a nucleotide (also called a “base”), and can assume the values A (adenine), C (cytosine), G (guanine), T (thymine), or N (failure in the base-calling process) (Langmead, 2010). The nucleotide string is usually accompanied by a corresponding string of ASCII characters, encoding the “quality” (that is, the error probability of the base calling) of each of the nucleotides. This is a representative case of how a typical sequencing setup works when a resequencing problem is considered. In such a case, a reference (possibly not 100% accurate) for the genome/transcriptome of the organism being sequenced is already known. One has to map the DNA/RNA sequence reads to the reference (i.e., understand where such reads come from in the reference) and find variants present in the genetic code of the specific organism compared to the reference (Xu et al., 2014).
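For illustration, a minimal sketch of how such reads and their qualities might be parsed and decoded in Python. It assumes the common Phred+33 ASCII encoding of qualities; the helper names are ours, not part of any standard tool.

```python
def phred_to_error_prob(qual_char, offset=33):
    """Convert one ASCII quality character (Phred+33 assumed) to an error probability."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)

def parse_fastq(lines):
    """Yield (name, sequence, qualities) triples from FASTQ-formatted lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # the '+' separator line
        quals = next(it).strip()
        yield header.strip().lstrip('@'), seq, quals

# a single toy FASTQ record: 'I' is Phred 40, '!' is Phred 0 (certain error)
record = ["@read1", "ACGTN", "+", "IIII!"]
name, seq, quals = next(parse_fastq(record))
probs = [phred_to_error_prob(c) for c in quals]
```

The error probabilities make the “quality” notion concrete: a Phred score of 40 corresponds to a 1-in-10,000 chance that the base call is wrong.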
Depending on the biological application at hand, one might need to perform several tasks on the data, possibly in several steps, with both per-read and global computations required (Libbrecht and Noble, 2015). A typical workflow corresponding to the above use case might be as follows:
store the reads in compressed searchable form (necessary to avoid excessive storage consumption);
retrieve (a subset of) the reads based on some criterion, possibly depending on the experiment metadata (for instance, select all the sequencing reads derived from a given tissue subject to a specific biological condition);
select/process the reads, for example: identify all the reads containing long stretches of low-quality nucleotides, and trim/eliminate them;
pattern-match the surviving data, read by read, onto a reference genome;
store the reads and their alignments to the reference genome (that is, the matches found in the genome for each read) in compressed searchable form again.
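The steps above can be sketched as a toy pipeline. Everything here is a deliberate simplification: a tab-separated gzip file stands in for a real compressed searchable store, mean-quality filtering stands in for proper read processing, and naive exact string matching stands in for a real aligner such as Bowtie. All function names are hypothetical.

```python
import gzip

def store_compressed(path, reads):
    # step 1: keep reads in compressed (line-oriented, hence searchable) form
    with gzip.open(path, "wt") as f:
        for name, seq, quals in reads:
            f.write(f"{name}\t{seq}\t{quals}\n")

def retrieve(path, predicate=lambda r: True):
    # step 2: stream reads back, selecting by an arbitrary criterion
    with gzip.open(path, "rt") as f:
        for line in f:
            name, seq, quals = line.rstrip("\n").split("\t")
            if predicate((name, seq, quals)):
                yield name, seq, quals

def high_quality(read, min_q=20, offset=33):
    # step 3: drop reads whose mean Phred quality falls below a threshold
    _, _, quals = read
    mean_q = sum(ord(c) - offset for c in quals) / len(quals)
    return mean_q >= min_q

def map_to_reference(read, reference):
    # step 4: naive exact matching; a real pipeline would use an aligner
    _, seq, _ = read
    pos = reference.find(seq)
    return pos if pos >= 0 else None
```

Step 5 would simply call `store_compressed` again on the surviving reads together with their mapped positions.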
In the meantime, the CERN data centre has upgraded its storage capacity to 200 petabytes, breaking the previous record of 100 petabytes. The detectors produce on the order of one petabyte of data per second, which would exhaust the available capacity within about three minutes. All this information therefore has to be filtered in real time: any findings are stored for later use, and after three minutes everything else is deleted, with three minutes being a very short period in which to trace back all this information (Britton and Lloyd, 2014).
All the data that needs to be retrieved and handled is held up in I/O traffic because processing power cannot keep pace (Fan et al., 2014). Even if processing power does not yet satisfy such needs, there are other ways to work around this obstacle. Technology and science go hand in hand, and one has to think out of the box to solve emerging problems rather than remain stuck with conventional approaches. The other suggested path is data compression. By limiting the space needed not only for the information we already have, but also for the new information we acquire, we can move forward in a less chaotic and more organised environment, throwing away unnecessary information (repeats) (Fan et al., 2014).
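As a trivial illustration of throwing away repeats, identical reads can be deduplicated by hashing their content, keeping a single copy of each distinct sequence plus an occurrence count. This is only a sketch of the idea; production deduplication systems are far more elaborate.

```python
from hashlib import sha1

def deduplicate(reads):
    """Keep one copy of each distinct read, together with how often it occurred."""
    seen = {}
    for seq in reads:
        # content-addressing: identical sequences hash to the same key
        key = sha1(seq.encode()).hexdigest()
        if key in seen:
            seen[key][1] += 1
        else:
            seen[key] = [seq, 1]
    return list(seen.values())
```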
The important thing is to compress information without losing the data that is needed. One should keep in mind not only that huge amounts of data will need to be processed each day, but also that some operations might need to be performed incrementally. For instance, the data produced at some point might be used to refine the results obtained from other data generated previously, implying the reprocessing of a possibly much bigger dataset. For these reasons, the development of a robust and extensible high-throughput storage/matching/processing system is necessary. Many other workflows might be envisaged, but most of them share the same skeleton structure, that is storage, retrieval, filtering/processing, and final storage of the results.
Clustering information around a representative model (within some permissible limits) is an interesting way to approach the problem (Slonim et al., 2005). For instance, when records are written to the output, those that do not differ from the representative already recorded need not be stored again; the differences are the essential information for our search.
To some extent, sequencing data are intrinsically noisy, as they depend on chemical reactions which are stochastic in nature (Alvarez et al., 2015). On the other hand, high-throughput sequencing techniques have now reached a high degree of reliability, so sequencing errors are relatively rare (Pareek et al., 2011). Also, as mentioned above, sequencing machines provide a quantification of the sequencing error at each nucleotide in the form of “qualities”, which can be used to pinpoint problematic nucleotides/regions in a read.
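Using qualities to pinpoint problematic regions can be as simple as trimming a read back to its last reliable base. The sketch below assumes Phred+33 encoding and a default threshold of Q20 (a 1% error probability); both are common conventions, but the function itself is our own toy helper.

```python
def trim_low_quality_tail(seq, quals, min_q=20, offset=33):
    """Trim a read back to the last position whose Phred quality meets the threshold.

    Low-quality stretches tend to accumulate at the 3' end of a read,
    so a simple right-to-left scan is often enough.
    """
    end = len(seq)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], quals[:end]
```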
Storage state of the art
For several years now, under the pressure of increasing volumes of data and thanks to reduced hardware costs, the view of databases as centralised data access points has become vaguer (Sreenivasaiah and Kim, 2010). Fundamental paradigms of data organisation and storage have been revised to accommodate parallelisation, distribution and efficiency. The storage mechanics, the querying methods and the analysis and aggregation of results follow new models and practices. Search has gone beyond the boolean match, being directly linked to efficient indexes that allow approximate matching in domains ranging from string to graph matching (Pienta et al., 2016). The main points of this progress can be summarised as follows.
The trend nowadays is to move from row-oriented representations to column-oriented representations and database systems (Abadi et al., 2009), which are the evolution of what was called “large statistical databases” in earlier literature (Corwin et al., 2007; Turner et al., 1979). Column-oriented database systems allow high compressibility per column (Abadi et al., 2008), by direct application of existing ratio-optimised compression algorithms (Abadi et al., 2006). Furthermore, several threads are pulling current database practices away from the relational paradigm. Large-scale storage and access may include dynamic control over data layout. Peer-to-peer (P2P) overlays are also used in distributed stores, exchanging, e.g., index information among contributing nodes in distributed data warehouses (Doka et al., 2011), where even the queries can be executed in a peer-based fashion, spreading the processing load. Another alternative, related to large-scale analysis, is Pig Latin (Gates et al., 2009), where an SQL-like syntax is used to express the data-flow requirements for analysis over a map-reduce infrastructure. Other efforts offer partial SQL support, as is the case of Hive (Ashish et al., 2010) and the corresponding query language, named HiveQL.
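Why a columnar layout compresses better is easy to demonstrate with a toy table of variant-like records: each column is homogeneous (one chromosome name, a run of nearby positions, one reference base), so a general-purpose compressor finds far more redundancy per column than across interleaved rows. The serialisation below is our own throwaway format, used only to compare the two layouts.

```python
import zlib

# a toy table of variant records: (chromosome, position, ref base, alt base)
rows = [("chr1", 100 + i, "A", "G") for i in range(1000)]

# row-oriented serialisation: fields of each record interleaved
row_blob = "\n".join("\t".join(map(str, r)) for r in rows).encode()

# column-oriented serialisation: each column stored contiguously
columns = list(zip(*rows))
col_blob = b"\x00".join("\n".join(map(str, col)).encode() for col in columns)

row_size = len(zlib.compress(row_blob))
col_size = len(zlib.compress(col_blob))
# the homogeneous columns (all "chr1", all "A", ...) compress far better
```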
Recently, parallel databases (e.g., Oracle Exadata, Teradata) allowed high efficiency at the expense of failure recovery and elasticity (Pavlo et al., 2009). Newer approaches and versions of these parallel databases integrate a map-reduce approach into the systems to alleviate these drawbacks, see (Abouzeid et al., 2009) for more information.
The increased availability of low-cost, legacy computers has brought cloud computing settings to the front line. Shared-nothing architectures, implying self-sufficient storage or computation nodes, are applied to storage settings (O’Driscoll et al., 2013). There also exist alternative clouds based on active data storage (Delmerico et al., 2009; Fan et al., 2014), where part of the computational database effort is distributed among the processing units of storage peripherals. Such an example is the case of DataLab (Moretti et al., 2010), where data operations, both read and write, are based on “sets” - essentially named collections of files - distributed across several active storage units (ASUs).
Finally, task-focused storage solutions have been devised to face problems in bioinformatics (Hsi-Yang Fritz et al., 2011), social networks (Ruflin et al., 2011) and network monitoring and forensics (Giura and Memon, 2010), showing how strongly data requirements drive the need for research on storage systems. Especially in bioinformatics, there exist approaches that combine compressed storage and indexing under a common scheme, based on sequence properties and on work on indexed string storage (Arroyuelo and Navarro, 2011; Ferragina and Manzini, 2005). There are cases where the system provides tunable parameters that allow a balance between data reuse and space recovery (Hsi-Yang Fritz et al., 2011), by keeping only the data that may be reused in the near future. At this point it must be stressed that there still exist relational databases used for high-throughput data storage, an example being the NCBI GEO archive (Barrett et al., 2009), which supports the submission of experimental outputs and provides a set of tools to retrieve, explore and visualise data. However, even in the case of NCBI GEO, the relational nature of the underlying database is used to identify specific datasets and not specific sequences (i.e., instances); further analysis tools are used to locate sequences and aggregate information from them. In time series and sensor networks, storage can also be a severe problem. In the literature, there are methods such as Sparse Indexing (Lillibridge et al., 2009), where sampling and backup streams are used to create indexes that avoid disk bottlenecks and storage limitations.
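The core idea behind reference-based compression, as in Hsi-Yang Fritz et al. (2011), can be illustrated with a minimal sketch: an aligned read is stored as a position plus a list of mismatches against the reference, rather than as its full sequence. The encoding below is our own simplification; the real scheme also handles insertions, deletions, unmapped reads and qualities.

```python
def encode_read(read, reference, pos):
    """Store an aligned read as (position, length, mismatches) instead of its bases."""
    mismatches = [(i, base) for i, base in enumerate(read)
                  if reference[pos + i] != base]
    return pos, len(read), mismatches

def decode_read(encoded, reference):
    """Reconstruct the original read from the reference and the stored differences."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for i, base in mismatches:
        bases[i] = base
    return "".join(bases)

ref = "ACGTACGTACGT"
read = "ACGAA"                    # aligns at position 0, one mismatch at offset 3
enc = encode_read(read, ref, 0)   # -> (0, 5, [(3, "A")])
```

Since most reads match the reference almost perfectly, the mismatch lists are short, which is where the compression comes from.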
Beyond the full-text indexing - combined with compressed storage, as explained above - often met in bioinformatics, there are several works on time-series indexing and graph indexing. These two types of indexes, together with string (and, thus, sequence) indexes, provide a full arsenal of methods that can cope with a great variety of problems and settings. Graph indexing is under intensive research, due to its applicability to such cases as chemical compounds, protein interactions, XML documents, and multimedia.
Graph indexes are often based on frequent subgraphs (Yan et al., 2005), or otherwise “semantically” interesting ones (Jiang et al., 2007). There exist hierarchical graph index methods (Abello and Kotidis, 2003), as well as hash-based ones. A related recent work (Schafer et al., 2017) relies on “fingerprints” of graphs - derived from hashing on cycles and trees within a graph - for efficient indexing. The method is part of an open-source software tool, named “Scaffold Hunter”, for the visual analysis of chemical compound databases.
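A hashing-based graph fingerprint, in the spirit of such approaches, can be sketched as follows. The edge-hashing scheme below is our own toy construction, not the Scaffold Hunter algorithm: each labelled edge sets one bit of a fixed-size vector, and a query graph can only be contained in a database graph whose fingerprint covers all of the query's bits. The filter admits false positives (hash collisions) but never false negatives, so it can safely prune candidates before an exact subgraph-matching step.

```python
from hashlib import sha256

def fingerprint(edges, nbits=64):
    """Hash each labelled, undirected edge (u, v, label) into a fixed-size bit vector."""
    fp = 0
    for u, v, label in edges:
        key = f"{min(u, v)}|{max(u, v)}|{label}".encode()
        bit = int.from_bytes(sha256(key).digest()[:8], "big") % nbits
        fp |= 1 << bit
    return fp

def may_contain(graph_fp, query_fp):
    """Filtering step: containment is possible only if every query bit is set."""
    return graph_fp & query_fp == query_fp
```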
In the case of time series, to efficiently process and analyse large volumes of data, one must consider operating on summaries (or approximations) of these data series. Several techniques have been proposed in the literature (Anguera et al., 2016), including the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Piecewise Aggregate Approximation (PAA), Discrete Wavelet Transform (DWT), Adaptive Piecewise Constant Approximation (APCA), Symbolic Aggregate approXimation (SAX), and others. Recent works (Emil Gydesen et al., 2015) based on the iSAX (Shieh and Keogh, 2009) algorithm have focused on the batch-update process of indexing very large collections of time series and have proposed highly efficient algorithms with optimised disk I/O, managing to index “one billion time series” very efficiently on a single machine. Another system, Cypress (Reeves et al., 2009), applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains, allowing reduced storage requirements. Furthermore, this method can answer many statistical queries without the need to reconstruct the original data.
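PAA and SAX, the building blocks of iSAX, fit in a few lines. The sketch below follows the standard construction - z-normalise, average over equal-width segments, then discretise against Gaussian breakpoints - but the four-symbol alphabet and the breakpoint values hard-coded here are just one common choice.

```python
import math

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    n = len(series)
    return [sum(series[n * i // n_segments : n * (i + 1) // n_segments]) /
            (n * (i + 1) // n_segments - n * i // n_segments)
            for i in range(n_segments)]

def sax(series, n_segments, breakpoints=(-0.67, 0.0, 0.67)):
    """SAX: z-normalise, reduce with PAA, then discretise into a symbol string.

    The breakpoints split a standard Gaussian into four equiprobable
    regions, mapped to the alphabet 'a'..'d'.
    """
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series)) or 1.0
    normed = [(x - mean) / std for x in series]
    symbols = "abcd"
    return "".join(symbols[sum(v > b for b in breakpoints)]
                   for v in paa(normed, n_segments))
```

Because two series with similar shapes yield similar symbol strings, such words can serve as compact index keys for approximate similarity search.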
Conclusions
The life sciences are becoming a “big data business”. Modern science’s needs have changed, and the lack of storage space has become a matter of great interest in the scientific community. There is an urgent need for advances in computational power and storage capacity. In a short period, several scientists have found themselves unable to extract full value from the large amounts of data becoming available. The revolution that has happened in next-generation sequencing, bioinformatics and biotechnology is unprecedented. Sequencing ought to come first in priority but, because of technical problems along the way, the time spent solving storage problems is often longer than the time dedicated to collecting and analysing data, and in the meantime a huge amount of the data produced every day is being lost. The scientist must therefore overcome a series of hurdles, from storing and moving data to integrating and analysing it, which will require a substantial cultural shift. Moreover, similar problems will appear in many other fields of the life sciences. For example, the challenges that neuroscientists will face in the future will be even greater than those we deal with today in next-generation sequencing in genomics: the nervous system and the brain are far more complicated entities than the genome. Today, the whole genome of a species can fit on a CD, but how will we handle, in the future, a brain whose complexity is comparable to the digital content of the world? New, more effective and efficient technological methods must be found to serve the needs of scientific research. Solving this “bottleneck” will have enormous consequences for human health and the environment.
Acknowledgments
The research reported in the present paper was partially supported by the FrailSafe Project (H2020-PHC-21-2015 - 690140) “Sensing and predictive treatment of frailty and associated co-morbidities using advanced personalized models and advanced interventions”, co-funded by the European Commission under the Horizon 2020 research and innovation programme.
Footnotes
Competing interests: LP none; PE none; SR none; MM none; VM none; DM none
References
- 1.Abadi D, Boncz P, Harizopoulos S. Column-oriented database systems. Proceedings of the VLDB Endowment. 2009;2(2):1664–1665. http://dx.doi.org/10.14778/1687553.1687625. [Google Scholar]
- 2.Abadi D, Madden S, Ferreira M. Integrating Compression and Execution in Column-Oriented Database Systems. Proceedings of the 2006 ACM SIGMOD international conference on Management of data; 2006. pp. 671–682. https://doi.org/10.1145/1142473.1142548. [Google Scholar]
- 3.Abadi D, Madden S, Hachem N. Column-stores vs. row-stores: how different are they really?. Proceedings of the 2008 ACM SIGMOD international conference on Management of data; 2008. pp. 967–980. http://dx.doi.org/10.1145/1376616.1376712. [Google Scholar]
- 4.Abello J, Kotidis Y. Hierarchical graph indexing. Proceedings of the twelfth international conference on Information and knowledge management. ACM, New Orleans, LA, USA. 2003;1:453–460. https://doi.org/10.1145/956863.956948. [Google Scholar]
- 5.Abouzeid A, Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment. 2009;2(1):922–933. http://dx.doi.org/10.14778/1687627.1687731. [Google Scholar]
- 6.Alvarez JR, Skachkov D, Massey SE, Kalitsov A, Velev JP. DNA/RNA transverse current sequencing: intrinsic structural noise from neighboring bases. Frontiers in genetics. 2015;6(213):1–11. doi: 10.3389/fgene.2015.00213. https://doi.org/10.3389/fgene.2015.00213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Anguera A, Barreiro JM, Lara JA, Lizcano D. Applying data mining techniques to medical time series: an empirical case study in electroencephalography and stabilometry. Computational and structural biotechnology journal. 2016;14:185–199. doi: 10.1016/j.csbj.2016.05.002. https://doi.org/10.1016/j.csbj.2016.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Arroyuelo D, Navarro G. Space-efficient construction of Lempel–Ziv compressed text indexes. Information and Computation. 2011;209(7):1070–1102. https://doi.org/10.1016/j.ic.2011.03.001. [Google Scholar]
- 9.Ashish T, Joydeep S, Namit J, Zheng S, Prasad C, et al. Hive - A Petabyte Scale Data Warehouse Using Hadoop. Proceedings of the 26th International Conference on Data Engineering (ICDE); 2010. pp. 996–1005. http://dx.doi.org/10.1109/ICDE.2010.5447738. [Google Scholar]
- 10.Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic acids research. 2009;37:885–890. doi: 10.1093/nar/gkn764. https://doi.org/10.1093/nar/gkn764. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Britton D, Lloyd SL. How to deal with petabytes of data: the LHC Grid project. Reports on Progress in Physics. 2014;77(6):065902. https://doi.org/10.1088/0034-4885/77/6/065902. [DOI] [PubMed] [Google Scholar]
- 12.Corwin J, Silberschatz A, Miller PL, Marenco L. Dynamic tables: an architecture for managing evolving, heterogeneous biomedical data in relational database management systems. Journal of the American Medical Informatics Association : JAMIA. 2007;14:86–93. doi: 10.1197/jamia.M2189. https://doi.org/10.1197/jamia.M2189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.De Silva PY, Ganegoda GU. New Trends of Digital Data Storage in DNA. BioMed research international. 2016 doi: 10.1155/2016/8072463. http://dx.doi.org/10.1155/2016/8072463. [DOI] [PMC free article] [PubMed]
- 14.Doka K, Tsoumakos D, Koziris N. Online Querying of Dimensional Hierarchies. Journal of Parallel and Distributed Computing. 2011;71(3):424–437. http://dx.doi.org/10.1016/j.jpdc.2010.10.005. [Google Scholar]
- 15.Delmerico JA, Byrnes NA, Bruno AE, Jones MD, Gallo SM, et al. Comparing the performance of clusters, Hadoop, and Active Disks on microarray correlation computations. 2009 International Conference on High Performance Computing (HiPC); 2009. pp. 378–387. http://dx.doi.org/10.1109/HIPC.2009.5433190. [Google Scholar]
- 16.Emil Gydesen J, Haxholm H, Sonnich Poulsen N, Wahl S, Thiesson B. HyperSAX: Fast Approximate Search of Multidimensional Data. 2015 http://dx.doi.org/10.5220/0005185201900198.
- 17.Fan J, Han F, Liu H. Challenges of Big Data Analysis. National science review. 2014;1:293–314. doi: 10.1093/nsr/nwt032. https://doi.org/10.1093/nsr/nwt032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ferragina P, Manzini G. Indexing compressed text. J ACM. 2005;52:552–581. http://dx.doi.org/10.1145/1082036.1082039. [Google Scholar]
- 19.Gates A, Natkovich O, Chopra S, Kamath P, Narayanamurthy S, et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment. 2009;2(2):1414–1425. http://dx.doi.org/10.14778/1687553.1687568. [Google Scholar]
- 20.Giura P, Memon N. NetStore: An Efficient Storage Infrastructure for Network Forensics and Monitoring. International Workshop on Recent Advances in Intrusion Detection; 2010. pp. 277–296. https://doi.org/10.1007/978-3-642-15512-3_15. [Google Scholar]
- 21.Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome research. 2011;21(5):734–740. doi: 10.1101/gr.114819.110. http://dx.doi.org/10.1101/gr.114819.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Jiang H, Wang H, Yu PS, Zhou S. GString: A Novel Approach for Efficient Search in Graph Databases. 2007 IEEE 23rd International Conference on Data Engineering; 2007. http://dx.doi.org/10.1109/ICDE.2007.367902. [Google Scholar]
- 23.Langmead B. Aligning short sequencing reads with Bowtie. Current protocols in bioinformatics. 2010 doi: 10.1002/0471250953.bi1107s32. http://dx.doi.org/10.1002/0471250953.bi1107s32. [DOI] [PMC free article] [PubMed]
- 24.Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics, Nature reviews. Genetics. 2015;16(6):321–332. doi: 10.1038/nrg3920. http://dx.doi.org/10.1038/nrg3920. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, et al. Sparse indexing: large scale, inline deduplication using sampling and locality. Proceedings of the 7th conference on File and storage technologies; San Francisco, California: USENIX Association; 2009. pp. 111–123. [Google Scholar]
- 26.Moretti C, Bui H, Hollingsworth K, Rich B, Flynn P, et al. All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids. IEEE Transactions on Parallel and Distributed Systems. 2010;21(1):33–46. http://dx.doi.org/10.1109/TPDS.2009.49. [Google Scholar]
- 27.O’Driscoll A, Daugelaite J, Sleator RD. ‘Big data’, Hadoop and cloud computing in genomics. Journal of biomedical informatics. 2013;46(5):774–781. doi: 10.1016/j.jbi.2013.07.001. https://doi.org/10.1016/j.jbi.2013.07.001. [DOI] [PubMed] [Google Scholar]
- 28.Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. Journal of applied genetics. 2011;52(4):413–435. doi: 10.1007/s13353-011-0057-x. https://doi.org/10.1007/s13353-011-0057-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Pavlo A, Paulson E, Rasin A, Abadi D, Dewitt D, et al. A comparison of approaches to large-scale data analysis. ACM SIGMOD, International Conference on Management of data; 2009. http://dx.doi.org/10.1145/1559845.1559865. [Google Scholar]
- 30.Pienta R, Navathe S, Tamersoy A, Tong H, Endert A, et al. VISAGE: Interactive Visual Graph Querying. Proceedings of the Workshop on Advanced Visual Interfaces (AVI); 2016. http://dx.doi.org/10.1145/2909132.2909246. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Reeves G, Liu J, Nath S, Zhao F. Managing massive time series streams with multi-scale compressed trickles. Proc VLDB Endow. 2009;2(1):97–108. http://dx.doi.org/10.14778/1687627.1687639. [Google Scholar]
- 32.Ruflin N, Burkhart H, Rizzotti S. Social-data storage systems. Databases and Social Networks; ACM; Athens, Greece: 2011. http://dx.doi.org/10.1145/1996413.1996415. [Google Scholar]
- 33.Schafer T, Kriege N, Humbeck L, Klein K, Koch O, et al. Scaffold Hunter: a comprehensive visual analytics framework for drug discovery. Journal of cheminformatics. 2017;9(1):28–32. doi: 10.1186/s13321-017-0213-3. https://doi.org/10.1186/s13321-017-0213-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Shieh J, Keogh E. iSAX: disk-aware mining and indexing of massive time series datasets. Data Min Knowl Discov. 2009;19:57. https://doi.org/10.1007/s10618-009-0125-6. [Google Scholar]
- 35.Slonim N, Atwal GS, Tkacik G, Bialek W. Information-based clustering. Proceedings of the National Academy of Sciences of the United States of America; 2005. pp. 18297–18302. https://doi.org/10.1073/pnas.0507432102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Sreenivasaiah PK, Kim DH. Current trends and new challenges of databases and web applications for systems driven biological research. Frontiers in physiology. 2010 doi: 10.3389/fphys.2010.00147. http://dx.doi.org/10.3389/fphys.2010.00147. [DOI] [PMC free article] [PubMed]
- 37.Turner M, Hammond R, Cotton P. A DBMS for large statistical databases. Proceeding VLDB ‘79 Proceedings of the fifth international conference on Very Large Data Bases; 1979. pp. 319–327. http://dx.doi.org/10.1109/VLDB.1979.718147. [Google Scholar]
- 38.Xu P, Zhang X, Wang X, Li J, Liu G, et al. Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nature genetics. 2014;46(11):1212–1219. doi: 10.1038/ng.3098. http://dx.doi.org/10.1038/ng.3098. [DOI] [PubMed] [Google Scholar]
- 39.Yan X, Yu PS, Han J. Graph indexing based on discriminative frequent structure analysis. ACM Trans Database Syst. 2005;30(4):960–993. http://dx.doi.org/10.1145/1114244.1114248. [Google Scholar]