2020 Nov 4;1(8):100136. doi: 10.1016/j.patter.2020.100136

Table 6.

Characteristics of the Dataset Corpus and for Four Groups of Reuse: 1 = Lowest Reuse, 4 = Highest Reuse

| Type | Characteristics | Mean G1 | Mean G2 | Mean G3 | Mean G4 | Quantile G1 | Quantile G2 | Quantile G3 | Quantile G4 |
|---|---|---|---|---|---|---|---|---|---|
| README | no. of words in README (non-code related)^a | 286.2 (±963.8) | 345.1 (±835.6) | 541.9 (±1,509.7) | 801.9 (±1,808.7) | [6.0, 48.0, 287.0] | [15.0, 125.0, 389.8] | [63.0, 250.0, 626.0] | [151.5, 416.0, 869.0] |
| | no. of tables^a | 0.0 (±0.5) | 0.1 (±0.6) | 0.1 (±1.6) | 0.3 (±2.2) | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] |
| | no. of code blocks^a | 0.9 (±3.5) | 1.3 (±4.2) | 2.3 (±6.1) | 3.5 (±8.1) | [0.0, 0.0, 1.0] | [0.0, 0.0, 1.0] | [0.0, 0.0, 2.0] | [0.0, 1.0, 4.0] |
| | no. of headers^a | 2.3 (±4.1) | 3.6 (±5.6) | 5.3 (±7.9) | 8.8 (±54.6) | [0.0, 1.0, 3.0] | [1.0, 1.0, 5.0] | [1.0, 3.0, 7.0] | [2.0, 6.0, 10.0] |
| | no. of URLs^a | 6.0 (±10.4) | 8.1 (±18.4) | 12.8 (±21.1) | 25.2 (±113.7) | [1.0, 2.0, 8.0] | [1.0, 4.0, 11.0] | [2.0, 8.0, 17.0] | [6.0, 15.0, 28.0] |
| | no. of images^a | 0.3 (±1.7) | 0.7 (±5.5) | 1.1 (±4.8) | 2.5 (±6.1) | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] | [0.0, 0.0, 1.0] | [0.0, 1.0, 3.0] |
| Repository | repository size^a | 33,689.8 (±152,529) | 50,916.3 (±194,154) | 70,511.1 (±225,835) | 133,307.1 (±423,076) | [580.0, 5,386.5, 22,780.2] | [1,230.0, 7,667.0, 33,723.8] | [2,174.5, 14,557.0, 52,912.2] | [4,896.5, 27,393.0, 113,130.0] |
| | no. of open issues^a | 1.1 (±10.8) | 2.0 (±13.2) | 6.4 (±21.8) | 38.1 (±163.7) | [0.0, 0.0, 0.0] | [0.0, 0.0, 1.0] | [0.0, 1.0, 4.0] | [0.0, 5.0, 25.0] |
| | no. of closed issues^a | 1.9 (±13.5) | 7.6 (±31.7) | 38.4 (±130.8) | 374.7 (±1,823.4) | [0.0, 0.0, 0.0] | [0.0, 0.0, 3.0] | [0.0, 2.0, 19.0] | [2.0, 25.0, 175.5] |
| | description length^a | 6.2 (±8.3) | 7.7 (±9.2) | 8.9 (±11.2) | 9.6 (±10.2) | [0.0, 4.0, 9.0] | [2.0, 6.0, 11.0] | [4.0, 7.0, 11.0] | [4.0, 7.0, 12.0] |
| | ratio of data files per repository^a | 8.2 (±14.0) | 7.1 (±12.7) | 5.4 (±10.9) | 3.6 (±8.7) | [0.2, 2.3, 10.0] | [0.4, 2.2, 7.7] | [0.3, 1.4, 5.3] | [0.1, 0.7, 2.8] |
| | age of repository (days)^a | 1,467.9 (±490.0) | 1,513.4 (±545.2) | 1,627.7 (±592.3) | 1,725.3 (±653.0) | [1,067.0, 1,448.0, 1,791.0] | [1,093.2, 1,453.0, 1,816.0] | [1,214.0, 1,562.0, 1,964.0] | [1,256.5, 1,628.0, 2,082.5] |
| | ratio of problematic files for a standard config (Pandas)^b | 0.3 (±2.7) | 0.4 (±2.8) | 0.3 (±2.6) | 0.2 (±1.5) | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] | [0.0, 0.0, 0.0] |
| Data File | average size of data files (csv)^b | 309,999.4 (±4,314,537) | 337,453.3 (±2,901,912) | 532,226.8 (±3,595,252) | 248,120.4 (±2,268,705) | [1,732.0, 7,017.0, 33,942.0] | [1,419.0, 6,046.5, 53,402.0] | [1,692.0, 10,398.0, 79,279.0] | [4,763.8, 28,315.0, 73,671.0] |
| | average size of data files (xls(x))^b | 426,555.6 (±2,755,034.2) | 528,439.2 (±2,953,938) | 360,737.8 (±2,050,485.3) | 330,846.9 (±1,518,167.8) | [20,430.2, 30,511.0, 83,968.0] | [20,287.0, 45,568.0, 147,138.5] | [16,856.8, 45,056.0, 203,837.5] | [16,896.0, 34,462.0, 95,356.0] |
| | no. of rows (csv)^a | 3,845.2 (±50,528) | 4,324.6 (±52,089) | 6,221.6 (±55,637) | 3,087.6 (±35,192.0) | [41.0, 85.0, 569.0] | [33.0, 79.0, 719.0] | [42.0, 147.0, 930.0] | [41.0, 118.0, 293.0] |
| | no. of columns (csv)^b | 23.3 (±340.0) | 16.3 (±376.5) | 23.7 (±524.6) | 14.7 (±363.2) | [3.0, 7.0, 18.0] | [2.0, 4.0, 7.0] | [3.0, 6.0, 13.0] | [4.0, 11.0, 11.0] |
| | no. of rows (xls(x)) | 1,337.2 (±22,013.9) | 409.4 (±10,184.4) | 324.2 (±8,992.9) | 1,105.0 (±16,615.8) | [26.0, 64.0, 141.0] | [64.0, 86.0, 122.0] | [19.0, 31.0, 52.0] | [20.0, 46.0, 176.0] |
| | no. of columns (xls(x)) | 29.8 (±397.2) | 36.2 (±531.0) | 23.8 (±155.0) | 25.6 (±423.3) | [5.0, 9.0, 16.0] | [19.0, 19.0, 19.0] | [9.0, 12.0, 16.0] | [6.0, 10.0, 15.0] |
| | missing values ratio (csv)^a | 8.7 (±16.6) | 7.2 (±19.1) | 10.5 (±20.5) | 13.0 (±13.6) | [0.0, 0.0, 11.3] | [0.0, 0.0, 0.0] | [0.0, 0.0, 11.7] | [0.0, 19.0, 19.8] |

Quantile values are reported in the [x25, x50, x75] format, where x25, x50, and x75 represent the 25th, 50th, and 75th percentiles of a particular group's characteristic.
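As a rough illustration of how one cell pair in this table could be produced, the sketch below formats a mean (± standard deviation) and a [x25, x50, x75] triple for a single group. The input values are hypothetical, and the choice of sample standard deviation and inclusive quantile interpolation is an assumption; the table does not state which conventions were used.

```python
import statistics


def summarize(values):
    """Return (mean, sd, [x25, x50, x75]) for one group's values."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation; the source does not specify.
    sd = statistics.stdev(values)
    # Assumption: inclusive interpolation between observed data points.
    x25, x50, x75 = statistics.quantiles(values, n=4, method="inclusive")
    return mean, sd, [x25, x50, x75]


def format_cells(values):
    """Format the 'Mean' and 'Quantile' cells in the table's style."""
    mean, sd, quartiles = summarize(values)
    mean_cell = f"{mean:,.1f} (±{sd:,.1f})"
    quantile_cell = "[" + ", ".join(f"{q:,.1f}" for q in quartiles) + "]"
    return mean_cell, quantile_cell


# Hypothetical header counts for nine READMEs in one reuse group.
headers_per_readme = [0, 1, 1, 2, 3, 3, 5, 7, 10]
mean_cell, quantile_cell = format_cells(headers_per_readme)
print(mean_cell, quantile_cell)
```

Reporting quartiles next to the mean is useful here because many characteristics are heavily skewed: several rows have a standard deviation far larger than the mean, while the median stays small.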

^a Indicates statistically significant differences (p < 0.05) of pairwise comparisons across all four groups.

^b Denotes cases for which statistically significant differences are observed between the values of groups 1 and 4 but not necessarily between the remaining pairwise comparisons.
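One way to read the two footnote markers is as a rule over the six pairwise comparisons among the four groups. The sketch below encodes that reading; it assumes that marker a requires every pair to be significant at the 0.05 level, which is an interpretation of the footnote wording rather than something the table states. The function name and the p-values are hypothetical, and no particular statistical test is implied.

```python
from itertools import combinations

ALPHA = 0.05
GROUPS = (1, 2, 3, 4)


def footnote_marker(p_values, alpha=ALPHA):
    """Return 'a', 'b', or '' for one characteristic.

    p_values maps each group pair, e.g. (1, 4), to the p-value of
    its pairwise comparison (hypothetical inputs).
    """
    pairs = list(combinations(GROUPS, 2))  # six pairs for four groups
    if all(p_values[pair] < alpha for pair in pairs):
        return "a"  # every pairwise comparison is significant
    if p_values[(1, 4)] < alpha:
        return "b"  # groups 1 and 4 differ; other pairs may not
    return ""  # no marker


# Hypothetical p-values for a single characteristic.
example = {
    (1, 2): 0.20, (1, 3): 0.03, (1, 4): 0.001,
    (2, 3): 0.40, (2, 4): 0.01, (3, 4): 0.08,
}
print(footnote_marker(example))
```

Under this reading, rows without a marker, such as the xls(x) row and column counts, would correspond to the empty-string case.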