| Bug reports | Runeson, Alexandersson & Nyholm (2007) | Ranking problem | Defect reports from development projects. | | 24–42% | | The approach uses NLP techniques. |
| | Alipour, Hindle & Stroulia (2013) | Classification task | Android ecosystem bug repository. | | | 91% | The approach uses ML algorithms. |
| | Kukkar et al. (2020) | Classification task | Six datasets from (Lazar, Ritchey & Sharif, 2014; Lerch & Mezini, 2013) | | 79–94% R@20 | 85–99% | The approach is a CNN-based strategy. |
| | Cooper et al. (2021) | Ranking problem | RICO dataset (Deka et al., 2017) | | | 83% top-2 | The approach uses computer vision, optical character recognition, and text retrieval techniques. |
| Q&A forums | Zhang et al. (2015) | Ranking problem | Pre-labeled Stack Overflow database | | 64% R@20 | | The approach combines the similarity scores of four features. |
| | Ahasanuzzaman et al. (2016) | Classification task | Pre-labeled Stack Overflow database | | 66% R@20 | | The approach is a supervised classification strategy. |
| | Zhang et al. (2017) | Classification task | Pre-labeled Stack Overflow database | | 87% | | The approach is based on ML algorithms. |
| | Mizobuchi & Takayama (2017) | Ranking problem | Pre-labeled Stack Overflow database | | 43% R@20 | | The approach is based on Word2vec models. |
| | Zhang et al. (2018) | Ranking-classification task | Pre-labeled Stack Overflow database | 75–86% | 66–86% | | The approach uses ranking strategies, deep learning, and IR techniques. |
| | Wang, Zhang & Jiang (2020) | Classification task | Pre-labeled Stack Overflow database | | 76–79% R@5 | | The approach is based on CNNs, RNNs, and LSTMs. |
| | Mohomed Jabbar et al. (2021) | Classification task | Pre-labeled datasets from the Stack Exchange sub-communities | | | 75–78% | The approach is based on deep learning and transfer learning techniques. |
| | Pei et al. (2021) | Classification task | Pre-labeled Stack Overflow database | 82% | 82% | | The approach is based on the Attention-based Sentence pair Interaction Model (ASIM). |
| | Gao, Wu & Xu (2022) | Classification task | Pre-labeled Stack Overflow database | | 68–79% | | The approach is based on word embedding and CNNs. |
| GitHub activities | Wang et al. (2019) | Classification task | DupPR (Yu et al., 2018) | 73% P@1 | 65% R@1 | | The approach is based on the AdaBoost algorithm. |
| | Li et al. (2017) | Ranking problem | The authors constructed a dataset of duplicate PRs | | 54–83% R@20 | | The approach is based on IR and NLP techniques. |
| | Ren et al. (2019) | Classification task | DupPR (Yu et al., 2018) | 83% | 11% | | The approach is based on IR and NLP techniques. |
| | Zhang et al. (2020) | Recommendation task | The authors constructed a dataset of duplicates (https://github.com/yangzhangs/iLinker) | | 45–61% R@10 | | The approach is based on IR and deep learning techniques. |
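
Several entries in the table report rank-based figures such as R@20, R@5, or P@1. The sketch below illustrates one common way such figures are computed for duplicate detection; it is not taken from any of the cited studies, and the exact definitions vary between papers. The function names `recall_rate_at_k` and `precision_at_k`, and the dictionary-based input format, are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def recall_rate_at_k(
    rankings: Dict[str, List[str]],
    duplicates: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one known duplicate in the top-k candidates.

    rankings:   query id -> candidate ids, best match first (hypothetical format).
    duplicates: query id -> set of ids known to be duplicates of that query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Only queries that actually have a labeled duplicate can be evaluated.
    queries = [q for q in rankings if duplicates.get(q)]
    if not queries:
        return 0.0
    hits = sum(
        1
        for q in queries
        if any(candidate in duplicates[q] for candidate in rankings[q][:k])
    )
    return hits / len(queries)


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k candidates for a single query that are true duplicates."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for candidate in ranking[:k] if candidate in relevant) / k


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
    duplicates = {"q1": {"b"}, "q2": {"z"}}
    print(recall_rate_at_k(rankings, duplicates, k=2))
    print(precision_at_k(rankings["q1"], duplicates["q1"], k=2))
```

Under this definition a query counts as a hit when any of its labeled duplicates appears among the first k candidates, so the value can only stay the same or grow as k increases. That is one reason figures reported at different cutoffs (for example R@5 versus R@20) are not directly comparable.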