. Author manuscript; available in PMC: 2017 Jul 21.

Published in final edited form as: Proc SPIE Int Soc Opt Eng. 2017 Mar 13;10138:101380B. doi: 10.1117/12.2254712

Table 1.

Theoretical model parameter definition

Definition	Description
I/O speed (V_sourceR, V_sourceW, V_hostR, V_hostW)	I/O speed is the data read/write rate on source where data stores and retrieves, and on host where data get processed. Source for SGE is the Network storage accessed by NFS. Source for Hadoop is the hard disk allocated for HBase.
Bandwidth (B, B_real):	B is the bandwidth of cluster. In our case, it is based on a gigabit network. B_real is the real bandwidth that is shared by one job. Once there is a network congestion, the value of B_real is actually smaller than or equal to the cluster's given fixed bandwidth B.
Disk speed (V_diskR, V_diskW)	Data read/ write speed of local hard drive.
Input/output dataset (data_in / data_out)	The total data size for one job that is downloaded/uploaded while processing.
Core (#core, #allowed_core, core_i)	#core is the number of processor cores in the cluster. #allowed_core is the maximum number of allowed concurrent cores can be used without causing network congestion. We use core_i to represent the number of individual cores on each machine.
Job (#job, job_i)	#job is the total number of jobs. job_i is the number of jobs will be dispatched to each machine.
Job time per dataset (T_j)	Processing time per dataset.
Round (#Round)	The total number of round for parallel jobs.
Region (#region, region_i)	Each part of a Hbase table is split into several regions (blocks of data).The total number of regions # region per Regionserver is approximately balanced[7]. Region_i denotes the number of region on specified machine.
Machine ( #machine)	The number of machines (workstations) of the cluster.
File ( #file)	All files that are involved in the whole process.
Negligible overhead (η)	An insignificant overhead that could be eliminated and not counted.
α	α is a binary parameter. It helps the equations explain when do they only valid for Hadoop scenario (α = 1) rather than SGE scenario (α = 0).
β	β is a binary parameter. It helps the equations explain when do they only valid for SGE scenario (β = 1) rather than Hadoop scenario (β = 0).
γ	γ is an experimental empirical parameter to represent the ratio of rack-local map task for Hadoop scenario, namely the data is loaded/stored via network. Similarly, the value of γ is always 1 for SGE scenario since all data is transferred through network.