2014 Oct 29;7:22. doi: 10.1186/1756-0381-7-22

Table 2.

Description of Hadoop-related projects and ecosystems

Hadoop-related project or technology	Description	Download URL
Avro	• Avro is a framework for performing remote procedure calls and data serialization.	http://avro.apache.org
Flume	• Flume is a tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop.	http://flume.apache.org
HBase	• Based on Google’s Bigtable, HBase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive datasets.	http://hbase.apache.org
HCatalog	• An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS.	http://Incubator.apache.org/hcatalog/
Hive	• Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources.	http://hive.apache.org
Mahout	• Mahout is a scalable machine-learning and data mining library.	http://mahout.apache.org
Oozie	• Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs.	http://oozie.apache.org
Pig	• Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce jobs on a Hadoop cluster.	http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
Sqoop	• Sqoop (SQL-to-Hadoop) is a tool that transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.	http://sqoop.apache.org
ZooKeeper	• ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization and providing group services.	http://zookeeper.apache.org
YARN	• YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications.	http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html
Cascading	• Cascading is an alternative API to Hadoop MapReduce. Cascading now has support for reading and writing data to and from an HBase cluster.	http://wiki.apache.org/hadoop/Hbase/Cascading
Twitter Storm	• Twitter Storm is a free, open-source, distributed real-time computation system.	http://storm.incubator.apache.org/
High performance computing cluster (HPCC)	• HPCC is an open-source, data-intensive computing system platform developed by LexisNexis Risk Solutions.	http://hpccsystems.com/
Dremel	• Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data.	http://research.google.com/pubs/pub36632.html
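
The HBase entry above contrasts column-based and row-based storage. The following is a minimal sketch of that distinction in plain Python, assuming a small in-memory dataset; the field names, the sample values, and the helper functions (`to_columns`, `mean_from_rows`, `mean_from_columns`) are hypothetical and are not part of HBase or its API. It only illustrates why an operation over one attribute touches less data when values are grouped by column.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and values here are illustrative assumptions, not HBase code.

rows = [
    {"sample_id": 1, "age": 34, "glucose": 5.1},
    {"sample_id": 2, "age": 58, "glucose": 7.4},
    {"sample_id": 3, "age": 45, "glucose": 6.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping field name to value list."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_from_rows(records, name):
    """Row layout: every full record is visited to pull out one field."""
    return sum(record[name] for record in records) / len(records)


def mean_from_columns(columns, name):
    """Column layout: the values of one field are already stored together."""
    values = columns[name]
    return sum(values) / len(values)


columns = to_columns(rows)
print(columns["age"])
print(mean_from_rows(rows, "glucose") == mean_from_columns(columns, "glucose"))
```

Both functions return the same mean; the difference is that the column layout reads only the list for the requested field, whereas the row layout walks every record. A real column-oriented store adds persistence, versioning and distribution on top of this idea.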