Author manuscript; available in PMC: 2025 Aug 26.
Published in final edited form as: Proc Mach Learn Res. 2023 Jul;23:40373–40389.

Table 8.

The total amount of overlapping code and its license distribution for each dataset (OL stands for overlap).

Dataset           # OL  OL%     Public Domain  Permissive  Weak Copyleft  Strong Copyleft
GCPY              526k  17.17%  1.49%          69.75%      3.18%          25.57%
CodeParrot-Clean  424k  97.03%  1.68%          64.89%      3.41%          30.02%
CodeSearchNet     79k   69.30%  0.14%          92.43%      2.16%          5.27%
The Pile          65k   34.03%  1.34%          84.66%      2.64%          11.35%
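
As a reading aid, the percentages in the table above can be turned into approximate absolute counts per license class. The sketch below is illustrative only: it assumes the four license percentages are shares of the "# OL" count in the same row, and it uses the rounded counts (e.g. 526k) exactly as printed, so the results are estimates. The names TABLE_8, license_counts, and share_total are hypothetical helpers, not part of the original work.

```python
# Values transcribed from Table 8. Counts are rounded to thousands in the source,
# so everything derived from them is approximate.
LICENSE_CLASSES = ("Public Domain", "Permissive", "Weak Copyleft", "Strong Copyleft")

# dataset -> (overlap count, overlap %, license shares in % in LICENSE_CLASSES order)
TABLE_8 = {
    "GCPY": (526_000, 17.17, (1.49, 69.75, 3.18, 25.57)),
    "CodeParrot-Clean": (424_000, 97.03, (1.68, 64.89, 3.41, 30.02)),
    "CodeSearchNet": (79_000, 69.30, (0.14, 92.43, 2.16, 5.27)),
    "The Pile": (65_000, 34.03, (1.34, 84.66, 2.64, 11.35)),
}


def license_counts(dataset):
    """Estimate overlap counts per license class.

    Assumption: each license percentage is a share of the row's overlap count.
    """
    overlap, _, shares = TABLE_8[dataset]
    return {
        name: round(overlap * share / 100)
        for name, share in zip(LICENSE_CLASSES, shares)
    }


def share_total(dataset):
    """Sum the four license shares; this should be close to 100 if they partition the overlap."""
    return sum(TABLE_8[dataset][2])


if __name__ == "__main__":
    for dataset in TABLE_8:
        print(dataset, license_counts(dataset), f"shares sum to {share_total(dataset):.2f}%")
```

In every row the four license shares sum to within rounding error of 100%, which is consistent with the assumption that they partition the overlapping code.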