Author manuscript; available in PMC: 2025 Aug 26.

Published in final edited form as: Proc Mach Learn Res. 2023 Jul;23:40373–40389.

Table 2.

Statistics of the prompts constructed with CODEIPPROMPT. Token counts are measured with the tokenizer of the multi-lingual CodeGen model and reported as the mean with the standard deviation as a subscript (written here as mean_std).

Breakdown	Category	# Prompts	# Tokens (mean_std)
Overall	Total	179.1K	13.2_7.1
License	Permissive	52.0K	13.2_7.3
License	Weak Copyleft	77.1K	13.2_7.5
License	Strong Copyleft	50.0K	13.3_6.2
Language	C	6.2K	18.2_9.5
Language	C++	6.2K	18.3_11.5
Language	C#	37.6K	14.6_7.9
Language	Python	30.1K	11.6_6.0
Language	Java	99.0K	12.6_6.1
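
The per-prompt token statistics reported in the table could be computed along the following lines. This is a minimal sketch, not the authors' pipeline: `token_stats` and `toy_tokenize` are hypothetical names, the regex tokenizer is only a stand-in for the CodeGen tokenizer, the example prompts are invented, and the use of the population standard deviation is an assumption, since the table does not say which variant was used.

```python
import re
import statistics


def token_stats(prompts, tokenize):
    """Return (mean, std) of the number of tokens per prompt.

    `tokenize` is any callable mapping a string to a list of tokens.
    Assumption: population standard deviation (statistics.pstdev).
    """
    counts = [len(tokenize(p)) for p in prompts]
    return statistics.mean(counts), statistics.pstdev(counts)


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: words/numbers and single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


# Invented example prompts (function signatures), for illustration only.
prompts = [
    "def add(a, b):",
    "public static void main(String[] args) {",
    "int main(void) {",
]

mean, std = token_stats(prompts, toy_tokenize)
# Print in the same mean_std notation as the table.
print(f"{mean:.1f}_{std:.1f}")
```

To approximate the reported numbers more closely, one would pass the real tokenizer's tokenization function in place of `toy_tokenize`, for example the `tokenize` method of a CodeGen tokenizer loaded through a library such as Hugging Face `transformers`.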