RF [23]
A random forest (RF) classifier is a machine learning algorithm that combines the output of multiple decision trees to produce a single result. In more detail, it fits a number of decision tree classifiers on subsamples of the data set and uses averaging to improve predictive accuracy and control overfitting.
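As an illustration of this ensemble-averaging idea, the following is a minimal sketch using scikit-learn's RandomForestClassifier; the library, toy data, and parameter choices are assumptions for illustration and are not taken from the source.

```python
# Minimal sketch: a random forest averages the votes of many decision trees,
# each fit on a subsample of the training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators controls how many decision trees are combined.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```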
SVM [24]
A support vector machine (SVM) is a supervised machine learning algorithm that is primarily used for classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature serving as a coordinate. The algorithm then finds the optimal decision boundary (i.e., the hyperplane) that separates the classes by using the extreme points/vectors (i.e., the support vectors).
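A minimal sketch of this hyperplane-fitting idea, using scikit-learn's SVC with a linear kernel (the library, kernel choice, and toy data are assumptions for illustration; the source does not specify an implementation):

```python
# Minimal sketch: a linear SVM finds the maximum-margin hyperplane separating
# two classes; the support vectors are the points that define that margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = SVC(kernel="linear")
clf.fit(X, y)

# Hyperplane parameters and the support vectors that determine them.
print("weights:", clf.coef_, "bias:", clf.intercept_)
print("support vectors per class:", clf.n_support_)
```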
BERT [18]
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a general-purpose language model that was pretrained on the BookCorpus (800 million words) and English Wikipedia (2,500 million words) data sets. By employing self-attention mechanisms, BERT can be fine-tuned with additional layers to complete new tasks with new data, making it a foundation for many transformer-based language models. At the time of its release, BERT achieved state-of-the-art performance on several tasks of the General Language Understanding Evaluation (GLUE) benchmark. The base model is composed of approximately 100 million parameters.
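A minimal sketch of the fine-tuning setup described above, using the Hugging Face transformers library; the library, the bert-base-uncased checkpoint, and the toy input are assumptions for illustration, not the source's own method.

```python
# Minimal sketch: BERT is adapted to a new task by adding a small task-specific
# head (here, a sequence-classification layer) on top of the pretrained encoder.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new classification head, randomly initialized
)

inputs = tokenizer("BERT can be fine-tuned for new tasks.", return_tensors="pt")
logits = model(**inputs).logits  # trained further on task-specific labeled data
print(logits.shape)  # (1, 2)

# Rough parameter count of the pretrained encoder plus the new head.
print(sum(p.numel() for p in model.parameters()))
```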
ALBERT [20]
ALBERT, or A Lite BERT, addresses the problems of memory limitation and lengthy training times by modifying BERT through the incorporation of two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. Despite having far fewer parameters (i.e., only 12 million parameters in the base model), ALBERT experiences minimal loss in language understanding, performing almost equally to BERT on the GLUE benchmark.
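To illustrate the effect of these parameter reduction techniques (rather than the techniques themselves), a short sketch that compares parameter counts, assuming the Hugging Face transformers library and the public bert-base-uncased and albert-base-v2 checkpoints:

```python
# Minimal sketch: cross-layer parameter sharing and factorized embeddings make
# ALBERT-base far smaller than BERT-base, as a simple parameter count shows.
from transformers import AutoModel

for name in ("bert-base-uncased", "albert-base-v2"):
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```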
DistilBERT [21]
DistilBERT, a distilled version of BERT, addresses concerns about the computational efficiency of large transformer-based language models by applying knowledge distillation during BERT's pretraining phase. This method reduced the size of BERT by 40% while retaining 97% of its language understanding capabilities and running 60% faster.
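A minimal sketch of the knowledge distillation objective, written in PyTorch; the temperature value and the toy tensors are assumptions for illustration, and DistilBERT's actual training loss combines this soft-target term with other objectives (e.g., masked language modeling).

```python
# Minimal sketch of knowledge distillation: the student is trained to match the
# teacher's softened output distribution via a KL-divergence loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then penalize divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example with random logits standing in for the two models' outputs.
teacher_logits = torch.randn(8, 30522)  # e.g., vocabulary-sized outputs
student_logits = torch.randn(8, 30522, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```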
RoBERTa [22]
RoBERTa, a robustly optimized BERT pretraining approach, modifies BERT by training the model longer and with more data, training on longer sequences, and dynamically changing the masking pattern applied to the training data. In more detail, this model's training data is composed of the BookCorpus and English Wikipedia (16GB), CC-News (76GB), OpenWebText (38GB), and Stories (31GB) data sets. With these modifications, RoBERTa improved on the results obtained by BERT, achieving several state-of-the-art results. The base model is composed of approximately 110 million parameters.
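As an illustration of the dynamic masking idea, a short sketch using the Hugging Face transformers library's DataCollatorForLanguageModeling, which re-samples masked positions every time a batch is built; this tooling is an assumption for illustration and is not RoBERTa's original pretraining code.

```python
# Minimal sketch of dynamic masking: the masking pattern is drawn anew each time
# a batch is assembled, rather than being fixed once during preprocessing.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

encoding = tokenizer("RoBERTa resamples its masking pattern on the fly.")
batch = [encoding, encoding]

# Two passes over the same text generally yield different masked positions.
print(collator(batch)["input_ids"])
print(collator(batch)["input_ids"])
```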