Table 5.
Evaluation results of models on prompts derived from individual training datasets. Left: prompts derived by directly sampling the training datasets. Right: prompts derived from filtered datasets in which overlapping corpora were excluded (*NO stands for non-overlap).
| Model╲Dataset | CodeRL EM | CodeRL EP | CodeGen EM | CodeGen EP | CodeParrot EM | CodeParrot EP | CodeRL-NO* EM | CodeRL-NO* EP | CodeGen-NO EM | CodeGen-NO EP | CodeParrot-NO EM | CodeParrot-NO EP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeT5-large | 0.28±0.26 | 0.18 | 0.19±0.26 | 0.07 | 0.32±0.26 | 0.22 | 0.31±0.14 | 0.19 | 0.22±0.16 | 0.13 | 0.26±0.19 | 0.17 |
| CodeT5-large-ntp-py | 0.92±0.13 | 0.98 | 0.74±0.21 | 0.84 | 0.64±0.11 | 0.94 | 0.91±0.15 | 0.98 | 0.38±0.21 | 0.24 | 0.40±0.06 | 0.14 |
| CodeGen-350M | 0.59±0.23 | 0.74 | 0.76±0.16 | 0.95 | 0.65±0.25 | 0.68 | 0.33±0.06 | 0.34 | 0.78±0.14 | 0.94 | 0.32±0.05 | 0.28 |
| CodeGen-2.7B | 0.54±0.12 | 0.80 | 0.78±0.15 | 0.96 | 0.66±0.24 | 0.66 | 0.28±0.04 | 0.33 | 0.75±0.11 | 0.98 | 0.36±0.08 | 0.26 |
| CodeParrot-110M | 0.50±0.20 | 0.20 | 0.55±0.17 | 0.63 | 0.66±0.17 | 0.76 | 0.31±0.04 | 0.22 | 0.23±0.06 | 0.30 | 0.71±0.23 | 0.80 |
| CodeParrot-1.5B | 0.58±0.17 | 0.65 | 0.60±0.23 | 0.68 | 0.65±0.17 | 0.73 | 0.34±0.07 | 0.27 | 0.27±0.09 | 0.36 | 0.70±0.20 | 0.73 |
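The caption refers to filtered *-NO datasets from which overlapping corpora were excluded, but does not spell out the filtering step. A minimal sketch of one plausible approach, assuming exact-match deduplication by content hash (the function names and the normalization rule here are illustrative assumptions, not the paper's actual pipeline):

```python
import hashlib

def fingerprint(doc: str) -> str:
    # Assumed normalization: strip surrounding whitespace, then hash the text.
    return hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()

def filter_overlap(target_corpus, other_corpora):
    """Keep only documents in target_corpus that appear in none of the others."""
    seen = set()
    for corpus in other_corpora:
        seen.update(fingerprint(doc) for doc in corpus)
    return [doc for doc in target_corpus if fingerprint(doc) not in seen]

# Toy corpora (hypothetical): one document is shared between the two.
coderl_docs = ["def add(a, b): return a + b", "print('hello')"]
codegen_docs = ["print('hello')", "x = 1"]

# The non-overlap variant keeps only the document unique to the first corpus.
coderl_no = filter_overlap(coderl_docs, [codegen_docs])
```

Prompts for the right-hand columns of the table would then be sampled from a filtered corpus like `coderl_no` rather than from the raw training set.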