Author manuscript; available in PMC: 2025 Aug 26.
Published in final edited form as: Proc Mach Learn Res. 2023 Jul;23:40373–40389.

Table 6.

License composition for individual code corpus datasets.

Dataset            Total   Public Domain   Permissive   Weak Copyleft   Strong Copyleft

GCPY               3063k   2.18%           72.77%       3.18%           21.88%
CodeParrot-Clean   437k    1.67%           64.98%       3.41%           29.94%
CodeSearchNet      114k    0.50%           91.88%       1.26%           6.37%
The Pile           191k    40.52%          50.82%       1.58%           7.08%
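
The table reports weak and strong copyleft separately. A reader who wants the combined copyleft share per dataset can add the two columns, as in the minimal sketch below. The percentages are transcribed from Table 6; the names LICENSE_SHARES and copyleft_share are illustrative and do not come from the paper.

```python
# Percentages transcribed from Table 6, in the table's column order:
# (public domain, permissive, weak copyleft, strong copyleft).
# The dictionary and function names are illustrative assumptions.
LICENSE_SHARES = {
    "GCPY": (2.18, 72.77, 3.18, 21.88),
    "CodeParrot-Clean": (1.67, 64.98, 3.41, 29.94),
    "CodeSearchNet": (0.50, 91.88, 1.26, 6.37),
    "The Pile": (40.52, 50.82, 1.58, 7.08),
}


def copyleft_share(dataset: str) -> float:
    """Return the weak + strong copyleft share in percent, rounded to 2 decimals."""
    _public_domain, _permissive, weak, strong = LICENSE_SHARES[dataset]
    return round(weak + strong, 2)


if __name__ == "__main__":
    for name in LICENSE_SHARES:
        print(f"{name}: {copyleft_share(name):.2f}% copyleft")
```

Because the published percentages are rounded to two decimals, a row may sum to slightly more or less than 100%, so the combined figures are approximate to the same precision.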