Table 6. License composition for individual code corpus datasets.

| Dataset | Total | Public Domain | Permissive | Weak Copyleft | Strong Copyleft |
|---|---|---|---|---|---|
| GCPY | 3063k | 2.18% | 72.77% | 3.18% | 21.88% |
| CodeParrot-Clean | 437k | 1.67% | 64.98% | 3.41% | 29.94% |
| CodeSearchNet | 114k | 0.50% | 91.88% | 1.26% | 6.37% |
| The Pile | 191k | 40.52% | 50.82% | 1.58% | 7.08% |