Table 8.
The total number of overlapping code instances and their license distribution for each dataset (OL denotes overlap).
| Dataset | # OL | OL% | Public Domain | Permissive | Weak Copyleft | Strong Copyleft |
|---|---|---|---|---|---|---|
| GCPY | 526k | 17.17% | 1.49% | 69.75% | 3.18% | 25.57% |
| CodeParrot-Clean | 424k | 97.03% | 1.68% | 64.89% | 3.41% | 30.02% |
| CodeSearchNet | 79k | 69.30% | 0.14% | 92.43% | 2.16% | 5.27% |
| The Pile | 65k | 34.03% | 1.34% | 84.66% | 2.64% | 11.35% |