2020 Nov 4;1(8):100136. doi: 10.1016/j.patter.2020.100136

Table 6.

Characteristics of the Dataset Corpus and for Four Groups of Reuse: 1 = Lowest Reuse, 4 = Highest Reuse

Type	Characteristics	Mean G1	Mean G2	Mean G3	Mean G4	Quantile G1	Quantile G2	Quantile G3	Quantile G4
README	no. of words in README (non-code related)^a	286.2 ( $\pm$ 963.8)	345.1 ( $\pm$ 835.6)	541.9 ( $\pm$ 1,509.7)	801.9 ( $\pm$ 1,808.7)	[6.0, 48.0, 287.0]	[15.0, 125.0, 389.8]	[63.0, 250.0, 626.0]	[151.5, 416.0, 869.0]
	no. of tables^a	0.0 ( $\pm$ 0.5)	0.1 ( $\pm$ 0.6)	0.1 ( $\pm$ 1.6)	0.3 ( $\pm$ 2.2)	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]
	no. of code blocks^a	0.9 ( $\pm$ 3.5)	1.3 ( $\pm$ 4.2)	2.3 ( $\pm$ 6.1)	3.5 ( $\pm$ 8.1)	[0.0, 0.0, 1.0]	[0.0, 0.0, 1.0]	[0.0, 0.0, 2.0]	[0.0, 1.0, 4.0]
	no. of headers^a	2.3 ( $\pm$ 4.1)	3.6 ( $\pm$ 5.6)	5.3 ( $\pm$ 7.9)	8.8 ( $\pm$ 54.6)	[0.0, 1.0, 3.0]	[1.0, 1.0, 5.0]	[1.0, 3.0, 7.0]	[2.0, 6.0, 10.0]
	no. of URLs^a	6.0 ( $\pm$ 10.4)	8.1 ( $\pm$ 18.4)	12.8 ( $\pm$ 21.1)	25.2 ( $\pm$ 113.7)	[1.0, 2.0, 8.0]	[1.0, 4.0, 11.0]	[2.0, 8.0, 17.0]	[6.0, 15.0, 28.0]
	no. of images^a	0.3 ( $\pm$ 1.7)	0.7 ( $\pm$ 5.5)	1.1 ( $\pm$ 4.8)	2.5 ( $\pm$ 6.1)	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]	[0.0, 0.0, 1.0]	[0.0, 1.0, 3.0]
Repository	repository size^a	33,689.8 ( $\pm$ 152,529)	50,916.3 ( $\pm$ 194,154)	70,511.1 ( $\pm$ 225,835)	133,307.1 ( $\pm$ 423,076)	[580.0, 5,386.5, 22,780.2]	[1,230.0, 7,667.0, 33,723.8]	[2,174.5, 14,557.0, 52,912.2]	[4,896.5, 27,393.0, 113,130.0]
	no. of open issues^a	1.1 ( $\pm$ 10.8)	2.0 ( $\pm$ 13.2)	6.4 ( $\pm$ 21.8)	38.1 ( $\pm$ 163.7)	[0.0, 0.0, 0.0]	[0.0, 0.0, 1.0]	[0.0, 1.0, 4.0]	[0.0, 5.0, 25.0]
	no. of closed issues^a	1.9 ( $\pm$ 13.5)	7.6 ( $\pm$ 31.7)	38.4 ( $\pm$ 130.8)	374.7 ( $\pm$ 1,823.4)	[0.0, 0.0, 0.0]	[0.0, 0.0, 3.0]	[0.0, 2.0, 19.0]	[2.0, 25.0, 175.5]
	description length^a	6.2 ( $\pm$ 8.3)	7.7 ( $\pm$ 9.2)	8.9 ( $\pm$ 11.2)	9.6 ( $\pm$ 10.2)	[0.0, 4.0, 9.0]	[2.0, 6.0, 11.0]	[4.0, 7.0, 11.0]	[4.0, 7.0, 12.0]
	ratio of data files per repository^a	8.2 ( $\pm$ 14.0)	7.1 ( $\pm$ 12.7)	5.4 ( $\pm$ 10.9)	3.6 ( $\pm$ 8.7)	[0.2, 2.3, 10.0]	[0.4, 2.2, 7.7]	[0.3, 1.4, 5.3]	[0.1, 0.7, 2.8]
	age of repository (days)^a	1,467.9 ( $\pm$ 490.0)	1,513.4 ( $\pm$ 545.2)	1,627.7 ( $\pm$ 592.3)	1,725.3 ( $\pm$ 653.0)	[1,067.0, 1,448.0, 1,791.0]	[1,093.2, 1,453.0, 1,816.0]	[1,214.0, 1,562.0, 1,964.0]	[1,256.5, 1,628.0, 2,082.5]
	ratio of problematic files for a standard config (Pandas)^b	0.3 ( $\pm$ 2.7)	0.4 ( $\pm$ 2.8)	0.3 ( $\pm$ 2.6)	0.2 ( $\pm$ 1.5)	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]	[0.0, 0.0, 0.0]
Data File	average size of data files (csv)^b	309,999.4 ( $\pm$ 4,314,537)	337,453.3 ( $\pm$ 2,901,912)	532,226.8 ( $\pm$ 3,595,252)	248,120.4 ( $\pm$ 2,268,705)	[1,732.0, 7,017.0, 33,942.0]	[1,419.0, 6,046.5, 53,402.0]	[1,692.0, 10,398.0, 79,279.0]	[4,763.8, 28,315.0, 73,671.0]
	average size of data files (xls(x))^b	426,555.6 ( $\pm$ 2,755,034.2)	528,439.2 ( $\pm$ 2,953,938)	360,737.8 ( $\pm$ 2,050,485.3)	330,846.9 ( $\pm$ 1,518,167.8)	[20,430.2, 30,511.0, 83,968.0]	[20,287.0, 45,568.0, 147,138.5]	[16,856.8, 45,056.0, 203,837.5]	[16,896.0, 34,462.0, 95,356.0]
	no. of rows (csv)^a	3,845.2 ( $\pm$ 50,528)	4,324.6 ( $\pm$ 52,089)	6,221.6 ( $\pm$ 55,637)	3,087.6 ( $\pm$ 35,192.0)	[41.0, 85.0, 569.0]	[33.0, 79.0, 719.0]	[42.0, 147.0, 930.0]	[41.0, 118.0, 293.0]
	no. of columns (csv)^b	23.3 ( $\pm$ 340.0)	16.3 ( $\pm$ 376.5)	23.7 ( $\pm$ 524.6)	14.7 ( $\pm$ 363.2)	[3.0, 7.0, 18.0]	[2.0, 4.0, 7.0]	[3.0, 6.0, 13.0]	[4.0, 11.0, 11.0]
	no. of rows (xls(x))	1,337.2 ( $\pm$ 22,013.9)	409.4 ( $\pm$ 10,184.4)	324.2 ( $\pm$ 8,992.9)	1,105.0 ( $\pm$ 16,615.8)	[26.0, 64.0, 141.0]	[64.0, 86.0, 122.0]	[19.0, 31.0, 52.0]	[20.0, 46.0, 176.0]
	no. of columns (xls(x))	29.8 ( $\pm$ 397.2)	36.2 ( $\pm$ 531.0)	23.8 ( $\pm$ 155.0)	25.6 ( $\pm$ 423.3)	[5.0, 9.0, 16.0]	[19.0, 19.0, 19.0]	[9.0, 12.0, 16.0]	[6.0, 10.0, 15.0]
	missing values ratio (csv)^a	8.7 ( $\pm$ 16.6)	7.2 ( $\pm$ 19.1)	10.5 ( $\pm$ 20.5)	13.0 ( $\pm$ 13.6)	[0.0, 0.0, 11.3]	[0.0, 0.0, 0.0]	[0.0, 0.0, 11.7]	[0.0, 19.0, 19.8]

Quantile values are reported in the [$x_{25}$, $x_{50}$, $x_{75}$] format, where $x_{25}$, $x_{50}$, and $x_{75}$ represent the 25th, 50th, and 75th percentiles of a particular group's characteristic.
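As an illustration of how one Mean cell and one Quantile cell of this table could be derived, the following minimal sketch summarizes a single characteristic for a single group. The function names (`summarize`, `format_cell`) and the sample values are hypothetical. The sketch assumes that the value after $\pm$ is a sample standard deviation and that quartiles are linearly interpolated; the table itself does not state either choice.

```python
import statistics


def summarize(values):
    """Return (mean, sd, [x25, x50, x75]) for one group's characteristic."""
    mean = statistics.mean(values)
    # Assumption: the "±" term is a sample standard deviation.
    sd = statistics.stdev(values)
    # Assumption: quartiles use linear interpolation over the observed range.
    x25, x50, x75 = statistics.quantiles(values, n=4, method="inclusive")
    return mean, sd, [x25, x50, x75]


def format_cell(values):
    """Format the summary in the style of the table's Mean and Quantile cells."""
    mean, sd, quartiles = summarize(values)
    mean_cell = f"{mean:,.1f} (± {sd:,.1f})"
    quantile_cell = "[" + ", ".join(f"{q:,.1f}" for q in quartiles) + "]"
    return mean_cell, quantile_cell


# Hypothetical README word counts with one very long README.
word_counts = [0, 10, 20, 30, 1000]
mean_cell, quantile_cell = format_cell(word_counts)
print(mean_cell)
print(quantile_cell)  # [10.0, 20.0, 30.0]
```

In this hypothetical sample a single large value pulls the mean (212.0) far above the median (20.0), which is the kind of skew that makes the quantile columns a useful complement to the mean columns.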

^a Indicates statistically significant differences ($p \leq 0.05$) in pairwise comparisons across all four groups.

^b Denotes cases in which statistically significant differences are observed between the values of groups 1 and 4 but not necessarily in the remaining pairwise comparisons.
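The two significance notes can be read as a simple decision rule over the six pairwise comparisons among the four groups. The sketch below encodes that reading. It assumes that marker a in the table corresponds to the note on all pairwise comparisons and marker b to the note on groups 1 and 4. The function name `footnote_marker` and the p-values are hypothetical, and the statistical test that would produce the p-values is not specified here.

```python
from itertools import combinations

ALPHA = 0.05          # significance threshold stated in the notes
GROUPS = (1, 2, 3, 4)  # 1 = lowest reuse, 4 = highest reuse


def footnote_marker(p_values, alpha=ALPHA):
    """Map pairwise p-values, keyed by (group_i, group_j), to a table marker.

    Returns "a" if every one of the six pairs differs significantly,
    "b" if groups 1 and 4 differ significantly but some other pair does not,
    and "" if groups 1 and 4 do not differ significantly.
    """
    pairs = list(combinations(GROUPS, 2))  # (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    if all(p_values[pair] <= alpha for pair in pairs):
        return "a"
    if p_values[(1, 4)] <= alpha:
        return "b"
    return ""


# Hypothetical p-values for one characteristic.
example = {(1, 2): 0.20, (1, 3): 0.03, (1, 4): 0.001,
           (2, 3): 0.40, (2, 4): 0.01, (3, 4): 0.04}
print(footnote_marker(example))  # b
```

Under this reading, rows without a marker (such as the xls(x) row and column counts) are those for which no significant difference between groups 1 and 4 is reported.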