2020 Nov 4;1(8):100136. doi: 10.1016/j.patter.2020.100136

Table 2.

Characteristics of the Dataset Repository Corpus Used in This Study

| Type | Characteristic | Mean (±SD) | Quantiles |
|---|---|---|---|
| Data file | no. of rows (csv) | 4,115 (±50,094) | [39.0, 92.0, 108.0] |
| Data file | no. of columns (csv) | 20.5 (±373) | [3.0, 5.0, 12.0] |
| Data file | no. of rows (xls(x)) | 607 (±13,610) | [28.0, 65.0, 108.0] |
| Data file | no. of columns (xls(x)) | 30.5 (±412.1) | [8.0, 15.0, 19.0] |
| Data file | ratio of missing values (csv) | 8.9 (±17.5) | [0.0, 0.0, 11.5] |
| Data file | average size of data files (csv) | 331,343 (±3,719,328) | [1,625.0, 8,375.0, 47,752.5] |
| Data file | average size of data files (xlsx) | 428,586 (±2,595,222) | [18,804.0, 34,723.0, 121,633.0] |
| Repository | size of repository (kilobytes) | 51,372 (±211,729) | [983.0, 7,740.0, 32,715.0] |
| Repository | no. of open issues | 5.2 (±51.2) | [0.0, 0.0, 0.0] |
| Repository | no. of closed issues | 40.6 (±552.3) | [0.0, 0.0, 2.0] |
| Repository | description length | 7.2 (±9.2) | [1.0, 5.0, 10.0] |
| Repository | ratio of data files per repo | 7.2% (±13%) | [0.3, 1.9, 8.0] |
| Repository | age of repository (days) | 1,521.9 (±539.7) | [1,108.0, 1,478.0, 1,844.0] |
| Repository | ratio of problematic files with respect to a standard config (Pandas) | 0.3% (±2.6%) | [0.0, 0.0, 0.0] |
| README | no. of words in README (non-code related) | 378.2 (±1,126.6) | [10.0, 112.0, 431.0] |
| README | no. of tables | 0.1 (±1.0) | [0.0, 0.0, 0.0] |
| README | no. of code blocks | 1.4 (±4.7) | [0.0, 0.0, 1.0] |
| README | no. of headers | 3.6 (±17.0) | [1.0, 1.0, 5.0] |
| README | no. of urls | 9.1 (±36.9) | [1.0, 3.0, 12.0] |
| README | no. of images | 0.7 (±3.9) | [0.0, 0.0, 0.0] |

Average values are reported in the “mean (±SD)” format. Quantile values are reported in the [x25, x50, x75] format, where x25, x50, and x75 represent the 25th, 50th, and 75th percentiles of a particular group's characteristic.
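
As an illustration of the reporting format, the following minimal sketch computes a "mean (±SD) [x25, x50, x75]" summary for one characteristic using only Python's standard library. The function names and the sample values are hypothetical, and the choice of the sample standard deviation and the "inclusive" quantile method is an assumption; the source does not state which variants were used.

```python
import statistics


def summarize(values):
    """Return (mean, sd, [x25, x50, x75]) for a list of numbers."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    # Assumption: "inclusive" interpolation, which treats the data as
    # containing its own minimum and maximum.
    x25, x50, x75 = statistics.quantiles(values, n=4, method="inclusive")
    return mean, sd, [x25, x50, x75]


def format_summary(values):
    """Render the summary in the table's 'mean (±SD) [quantiles]' style."""
    mean, sd, quartiles = summarize(values)
    return f"{mean:,.1f} (±{sd:,.1f}) {quartiles}"


# Hypothetical row counts for five data files; one large file skews the mean.
row_counts = [10, 20, 30, 40, 1000]
print(format_summary(row_counts))
```

With a skewed sample like this one, the mean and SD are dominated by the single large file while the quartiles stay small, which is the same pattern visible in many rows of the table (for example, the number of rows in csv files).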