Abstract
Test developers typically use alternate test forms to protect the integrity of test scores. Because test forms may differ in difficulty, scores on different test forms are adjusted through a psychometric procedure called equating. When conducting equating, psychometricians often apply smoothing methods to reduce the random equating error that results from sampling. With the cubic spline postsmoothing method, they compare plots produced with different smoothing degrees and choose the optimal value. This manual process, however, could be automated with the help of deep learning—a machine learning technique commonly used for image classification. In this study, a convolutional neural network was trained using human-classified postsmoothing plots. The trained network was used to choose optimal smoothing values with empirical testing data, and its choices were compared to human choices. The agreement rate between humans and the trained network was as high as 71%, suggesting the potential of deep learning for choosing optimal smoothing values for equating.
Keywords: equating, smoothing, deep learning, convolutional neural network, cubic spline, automation
Introduction
A panel of three experts sat at a desk, on which a screen showed equating curves with different degrees of smoothness. As experienced psychometricians, they examined the curves on the screen, checked the statistics on pieces of paper, and discussed which curve represented the optimal balance between smoothness and fidelity. On the one hand, the optimal curve should be smooth so that random equating error was reduced; on the other hand, it should not stray far from the original, typically rough curve so that systematic error was kept under control.
After extensive deliberation and debate, the psychometricians reached a consensus on the optimal value, and the screen started to show a second set of curves.
Imagine the process repeated for more than one hundred sets of curves. It would be a daunting task for the psychometricians to complete, especially under the pressure of quick turnarounds. This situation, however, is not uncommon during equating and score reporting for large-scale assessments when many forms are used.
Equating and Smoothing
Test developers of large testing programs typically construct alternate test forms to protect the integrity of test scores. Differences in difficulty among test forms would likely result in different raw scores if the same test taker took different forms, even under the same testing conditions. To ensure comparability, scores on different test forms are adjusted through a psychometric procedure called equating (Kolen & Brennan, 2014). After equating, an examinee would earn the same equivalent score regardless of the test form taken. There are various equating methods, one of which is equipercentile equating. With this method, test developers often apply smoothing techniques to reduce random equating error, an inherent component when equating is based on a random sample from a population. The magnitude of random equating error is influenced by several factors, including sample size, the number of scored items, the equating and smoothing methods used, and the content of the exam (e.g., Cui & Kolen, 2009; Kolen & Brennan, 2014; Liu & Kolen, 2018, 2020). Although the application of smoothing may increase bias or systematic error, Cui and Kolen (2009, p. 155) showed that “smoothing can improve estimation of equating relationships by reducing total error.” Other researchers (e.g., Hanson et al., 1994; Kolen, 1991; Liu & Kolen, 2018, 2020) have also supported the use of smoothing in equating.
Kolen and Brennan (2014) described a cubic spline postsmoothing method designed to smooth the equating relationship after the equating process has been completed. This method, however, requires subjective judgement in choosing the smoothing degree, a parameter controlling the smoothness of a curve. More specifically, Kolen and Brennan (2014, p. 72) suggested experimenting with different smoothing degrees and inspecting the graphs and central moments of equated scores. Liu and Kolen (2020) proposed a method that automatically selects the smoothing degree by combining the estimated bias and standard error of equating. However, their approach did not account for the smoothness of the equating curve—a property typically evaluated by psychometricians during postsmoothing. This omission may explain why their method has not been widely adopted by testing programs and why manual inspection of smoothing graphs remains common practice.
The Challenge
The need to manually choose the smoothing degree leads to two potential problems.
First, the choice of smoothing degree is subjective. Different psychometricians may choose different smoothing degrees. Even the same psychometrician may choose a different smoothing degree on a different occasion. The authors participated in an exercise to choose the optimal smoothing degrees, during which the three-person team reached consensus on the same values 39 out of 65 times (i.e., 60%), even though the differing selections often fell into adjacent categories. While having a panel of experts vote on the optimal smoothing value can help to mitigate the subjectivity issue of a single person to some degree, there is still no guarantee that the same panel will choose the same value on another occasion. In addition, a different panel is likely to choose a different value.
Second, the process is slow and unscalable. When the number of test forms increases, the manual work can easily become a bottleneck for the scoring and score reporting process, which typically faces the pressure of quick turnarounds. In a recent survey of a large-scale testing program (unpublished), both parents and students ranked fast score reporting at the top of their wish lists.
The Solution
One possible solution to this problem is machine learning, a technique that has gained popularity in recent years. Its applications span a vast range: from agriculture to astronomy, from business to biology, from communication to chemistry, from data mining to dentistry, from education to economics—the list goes on. The application of machine learning—especially deep learning—in the field of educational measurement, however, is still in its early stages compared to its use in other fields. Only a few studies can be found in this field, although they have already shown promising results in automating processes traditionally done by humans. For example, Liu et al. (2016) found that a machine-learning-based technology could reduce the cost and complexity of scoring constructed response items in a science assessment. Cui (2018, 2021) found that machine learning techniques could help screen irregularity reports generated during the administration of a test. Additionally, von Davier (2018) applied machine learning to generate test items automatically.
The purpose of this study is to broaden the application of machine learning in educational measurement by automating the smoothing process in equating. In particular, deep learning (a special case of machine learning) was proposed for automatically choosing the optimal smoothing degree for the cubic spline method. An empirical study was carried out to investigate the feasibility of using deep learning. Specifically, a convolutional neural network (CNN) was trained using human-classified smoothing plots. The CNN was chosen because it is specifically designed to process image data and is widely used for image classification tasks. The trained network was used to choose optimal smoothing degrees for another set of testing data. The optimal smoothing degrees chosen by the network were compared to human choices. The impact on equating was evaluated using the root mean square deviation (RMSD) in equating.
Method
In this section, the equipercentile equating method and the cubic spline postsmoothing method are briefly reviewed. The term “postsmoothing” refers to the fact that smoothing is conducted after equating. Readers may consult Kolen and Brennan (2014) for details on both methods. The proposed deep learning technique is described next, followed by the empirical study which closes this section.
The Equipercentile Equating Method
With equipercentile equating, a score on one test form should have the same percentile rank as the equivalent score on another test form. Mathematically, the equivalent raw score on Form Y of a raw score $x$ on Form X can be obtained by

$$e_Y(x) = P_Y^{-1}\left[P_X(x)\right],$$

where $P$ stands for the cumulative distribution of the scores on a test form (Y or X).
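To make the idea concrete, below is a minimal Python sketch of equipercentile equating based on discrete score distributions. It is only an illustration, not the operational procedure used in this study (which relied on RAGE-RGEQUATE); the function name, the toy frequencies, and the simple linear interpolation between score points are assumptions made for the example.

```python
import numpy as np

def equipercentile_equate(freq_x, freq_y):
    """Map each Form X raw score to an approximate Form Y equivalent by
    matching cumulative (percentile) positions of the two distributions."""
    F = np.cumsum(freq_x) / np.sum(freq_x)   # P(X <= x) at each Form X score
    G = np.cumsum(freq_y) / np.sum(freq_y)   # P(Y <= y) at each Form Y score
    scores_y = np.arange(len(freq_y))
    # For each Form X cumulative proportion, interpolate the Form Y score
    # that has the same cumulative proportion.
    return np.interp(F, G, scores_y)

# Toy frequency distributions for two 5-item forms (raw scores 0..5)
freq_x = np.array([2, 8, 20, 30, 25, 15])
freq_y = np.array([1, 5, 15, 35, 30, 14])
print(np.round(equipercentile_equate(freq_x, freq_y), 2))
```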
The Cubic Spline Postsmoothing Method
After equating, the equated raw scores can be smoothed with a cubic spline smoothing function:
$$\hat{d}_Y(x) = v_{0i} + v_{1i}(x - x_i) + v_{2i}(x - x_i)^2 + v_{3i}(x - x_i)^3, \qquad x_i \le x < x_{i+1},$$

where $x_i$ stands for the observed score at point $i$, typically taking integer values from zero to the maximum score, and $v_{0i}$, $v_{1i}$, $v_{2i}$, and $v_{3i}$ are coefficients of the cubic spline (Kolen, 1984; Kolen & Brennan, 2014).
The function is constrained such that it is continuous up to the second derivative at each score point. For example, the following equations should hold at score $x_i$:

$$\hat{d}_Y(x_i^-) = \hat{d}_Y(x_i^+), \qquad \hat{d}_Y'(x_i^-) = \hat{d}_Y'(x_i^+), \qquad \hat{d}_Y''(x_i^-) = \hat{d}_Y''(x_i^+).$$

The second derivative controls the smoothness of the function, and the coefficients of the spline function can be found by minimizing $\int \left[\hat{d}_Y''(x)\right]^2 dx$ subject to the following constraint:

$$\sum_{i=1}^{m}\left[\frac{\hat{d}_Y(x_i)-\hat{e}_Y(x_i)}{\widehat{SE}\left[\hat{e}_Y(x_i)\right]}\right]^{2} \leq S \times m,$$

where $m$ is the number of score points, $S$ is the smoothing parameter, and $\widehat{SE}\left[\hat{e}_Y(x_i)\right]$ stands for the standard error of equating at score $x_i$; it is typically computed by either an analytical procedure (Lord, 1982) or a bootstrap procedure (Cui & Kolen, 2008; Kolen & Brennan, 2014). The analytical procedure developed by Lord (1982) was used to estimate the standard error of equating in this study.
Reinsch (1971) described an algorithm to find the coefficients for the spline function. The algorithm was implemented in the computer program RAGE-RGEQUATE (Zeng et al., 2004) which can be used for both equipercentile equating and cubic spline postsmoothing.
As shown in the constraint above, larger values of S allow a larger squared difference between the smoothing function and the observed data, yielding a smoother fit function. The parameter S ranges from zero to infinity. If S takes the value of zero, this method produces the same result as when no smoothing is applied, because in this case no difference between the fit function and the observed data is allowed. Conversely, if S takes the value of infinity, this method yields a straight line, which is the smoothest curve because it has no curvature at all.
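For readers who want to experiment with the effect of S, the sketch below fits cubic smoothing splines to a toy unsmoothed equating curve with scipy. This is only an analogy to the procedure described above, not the Reinsch/RAGE-RGEQUATE implementation: scipy's UnivariateSpline bounds the weighted sum of squared residuals by s, so weighting by the reciprocal of the standard error and setting s = S × m roughly mirrors the constraint; the toy curve and the stand-in standard errors are assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Toy unsmoothed equipercentile equivalents at raw scores 0..40 and
# stand-in standard errors of equating at each score point.
x = np.arange(41, dtype=float)
unsmoothed = x + 0.8 * np.sin(x / 3.0)      # deliberately rough curve
se = np.full_like(x, 0.25)

m = len(x)
for S in [0.01, 0.05, 0.10, 0.15, 0.30, 0.50]:
    # UnivariateSpline keeps sum_i [w_i * (y_i - spline(x_i))]^2 <= s, so
    # w_i = 1/SE_i and s = S * m approximate the postsmoothing constraint.
    spline = UnivariateSpline(x, unsmoothed, w=1.0 / se, s=S * m, k=3)
    print(f"S = {S:4.2f}, first equivalents: {np.round(spline(x[:3]), 3)}")
```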
The optimal value is typically chosen to balance smoothness of the curve—which can be influenced by both sample size and the number of score points—and its closeness to the observed data points. In the context of equating, there are also other considerations such as the central moments of the equated scores and the properties of the scaled scores. Kolen and Brennan (2014, pp. 80-89) suggest that the smoothed curve should fall within ±1 standard error of equating for raw scores within the 0.5% and 99.5% percentiles, and a smoothing value between 0 and 1 usually yields adequate equating results in practice. In this study, the optimal smoothing value was chosen from six candidates typically used in practical equating situations: 0.01, 0.05, 0.10, 0.15, 0.30, and 0.50. While the process of choosing the optimal value is typically done manually, it may be automated using deep learning.
Deep Learning
Deep learning is a special case of machine learning, so it is easier to understand under the umbrella of machine learning. The next subsection is devoted to a brief introduction of machine learning.
Machine Learning
“Each machine learning problem can be precisely defined as the problem of improving some measure of performance P when executing some task T, through some type of training experience E,” said Mitchell (2017). Task T can be classifying emails as spam or non-spam, classifying images as dog or cat, or predicting the probability of a consumer buying a specific product. The list goes on.
With the help of training data, a computer can learn to establish a relationship between the inputs (e.g., features of an email such as the length of the message body, the subject, the inclusion of specific words, etc.) and the outputs (e.g., spam or non-spam). If machine learning is successful, the computer can produce outputs consistent with or even better than a human judgement. For the case of choosing the optimal smoothing value, the inputs are the plots examined by a panel of psychometricians and the outputs are the optimal values.
Machine learning with images typically requires special attention because the input data are large. A small image of size 200×200 has 40,000 pixels. If color is considered, the number of input data points becomes 40,000×3 = 120,000. If logistic regression, a commonly used classification method, were applied, the number of independent variables would exceed one hundred thousand, without considering interactions.
Pixels, however, depend on each other to make an image meaningful. If the interaction between each pair of pixels were considered, the number of independent variables for logistic regression would be around seven billion. The interactions among pixels, however, are not limited to pairs. The number of variables would skyrocket if more complicated interactions were considered, making the learning task computationally intractable.
To the rescue comes deep learning.
Deep Learning and Convolutional Neural Networks
“In essence, deep learning is a neural-networks technique that organizes the neurons in many layers” (Kubat, 2015). An artificial neural network mimics the function of a brain in that it has neurons that connect with each other. In the network, a neuron receives signals from sensors or other neurons, processes the information, and creates new signals which can serve as inputs for other neurons. Figure 1 illustrates a simple artificial neural network with four layers. The input layer has three sensors as inputs, while the output layer consists of two outputs. Between the input and output layers are two hidden layers with five and three neurons, respectively. Typically, in deep learning, there are two or more hidden layers in the artificial neural network.
Figure 1.
An Example of an Artificial Neural Network. Note. S1-S3 Denote Three Sensors as Inputs; N11-N15 Denote Five Neurons in the First Layer; N21-N23 Denote Three Neurons in the Second Layer; O1 and O2 Denote Two Outputs
For image classification, the hidden layers in the artificial neural network typically include convolutional layers and pooling layers.
Convolution
A convolutional layer convolves the input with a filter and passes the result to the next layer. Mathematically, the output layer can be computed by $O = I * F$, where $I$ stands for the input layer and $F$ for the filter. The operator ‘$*$’ denotes the operation of convolution. Shown below is an example of the computation.

For a 4×3 input layer $I$ and a 2×2 filter $F$, the output layer $O$ can be computed as follows:

$$I=\begin{bmatrix} i_{11} & i_{12} & i_{13}\\ i_{21} & i_{22} & i_{23}\\ i_{31} & i_{32} & i_{33}\\ i_{41} & i_{42} & i_{43} \end{bmatrix},\qquad F=\begin{bmatrix} f_{11} & f_{12}\\ f_{21} & f_{22} \end{bmatrix},\qquad O = I * F = \begin{bmatrix} o_{11} & o_{12}\\ o_{21} & o_{22}\\ o_{31} & o_{32} \end{bmatrix},$$

where $o_{11} = i_{11}f_{11} + i_{12}f_{12} + i_{21}f_{21} + i_{22}f_{22}$, $o_{12} = i_{12}f_{11} + i_{13}f_{12} + i_{22}f_{21} + i_{23}f_{22}$, and so on.
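The same computation can be written in a few lines of Python. The sketch below is illustrative only; it implements the sliding dot product (cross-correlation) that deep learning libraries label "convolution," with no padding and a stride of one, and the input values are arbitrary.

```python
import numpy as np

def convolve2d_valid(I, F):
    """Slide filter F over input I (no padding, stride 1) and return the
    resulting output layer, as a convolutional layer would."""
    out_h = I.shape[0] - F.shape[0] + 1
    out_w = I.shape[1] - F.shape[1] + 1
    O = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Element-wise product of the filter with the current patch
            O[r, c] = np.sum(I[r:r + F.shape[0], c:c + F.shape[1]] * F)
    return O

I = np.arange(12).reshape(4, 3)      # a 4x3 input layer
F = np.array([[1, 0],
              [0, -1]])              # a 2x2 filter
print(convolve2d_valid(I, F))        # a 3x2 output layer
```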
Pooling
A pooling layer reduces the size of the data by combining small neuron clusters in one layer into a single neuron in the next layer. Typical techniques for pooling are max pooling and average pooling. With max pooling, a cluster is replaced with the maximum value in the cluster; by contrast, the average value is used for the average pooling.
Shown below is an example of applying the max pooling technique to a 4×4 input layer with a 2×2 window:

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} \;\rightarrow\; \begin{bmatrix} \max(a_{11}, a_{12}, a_{21}, a_{22}) & \max(a_{13}, a_{14}, a_{23}, a_{24})\\ \max(a_{31}, a_{32}, a_{41}, a_{42}) & \max(a_{33}, a_{34}, a_{43}, a_{44}) \end{bmatrix}$$

In this example, the input matrix is divided into four subsets, each with a size of 2×2. For each subset, the maximum value is chosen as the representative of the subset. The representatives form a new matrix with a smaller size (2×2) than the input matrix (4×4).
Average pooling is similar to max pooling with only one difference – the average number rather than the maximum is used as the representative.
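Both pooling variants are easy to express with a reshape. The following sketch is illustrative (the input values are arbitrary) and assumes the input dimensions are multiples of the window size.

```python
import numpy as np

def pool_2x2(A, reduce="max"):
    """Apply 2x2 pooling with stride 2: split A into non-overlapping 2x2
    blocks and keep one representative value per block."""
    h, w = A.shape
    blocks = A.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if reduce == "max" else blocks.mean(axis=(1, 3))

A = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [3, 1, 4, 8]])
print(pool_2x2(A, "max"))    # [[6 4] [7 9]]
print(pool_2x2(A, "mean"))   # averages of the same 2x2 blocks
```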
An artificial neural network with one or more convolutional layers is typically referred to as a convolutional neural network. Figure 2 shows the convolutional neural network used in this study, which included two convolutional layers – each followed by a layer of max pooling. In this network, the input layer receives the raw pixel data of the image, the convolutional layers extract features such as edges and textures, and the pooling layers reduce the spatial dimensions of the feature maps, improving computational efficiency and helping to control overfitting. Two convolutional layers are typically used for classifying images of curves. The first layer learns basic features such as edges and lines, while the second layer captures more complex patterns, such as curves and arcs.
Figure 2.
The Convolutional Neural Network Used in This Study
Figure 2 also shows a Flatten & Dense layer, which is different from a convolutional layer and a pooling layer. For a convolutional layer or a pooling layer, a data element only interacts with its neighbors before entering the next layer. By contrast, for a Flatten & Dense layer, all data elements work as one group before entering the next layer. The Flatten layer takes the multi-dimensional output (i.e., height, width, and channels) from the convolutional or pooling layers and converts it into a one-dimensional vector. This conversion is necessary because Dense layers operate only on flat vectors. A Dense layer connects every neuron in the previous layer to every neuron in the current layer, enabling the model to combine the extracted features for tasks such as classification or regression.
Empirical Study
Data
Equating data from a large-scale reading assessment were used for the empirical study. There were 161 test forms, each administered to approximately 3,000 examinees. Each form consisted of 40 items, with each item worth one point. The resulting score distributions were approximately bell-shaped, indicating a roughly normal distribution of examinee performance. Equipercentile equating with cubic spline postsmoothing was employed to equate new forms to base reference forms. Three psychometricians compared the postsmoothing plots with different smoothing values and chose the optimal smoothing values through consensus decision-making.
The plots and the optimal smoothing values were used as the input and output data for deep learning, respectively. The operational equating results were used to evaluate the performance of the deep learning model used in the present study.
Procedure
For each test form, cubic spline curves with different smoothing values (0.01, 0.05, 0.1, 0.15, 0.3, and 0.5) were plotted in two ways. Figure 3 shows one way, combining all curves in one plot, while Figure 4 shows the other, plotting each curve in a separate plot which was placed on a panel of plots. The curve of the unsmoothed equating relationship and the equating error bands were also included because they were part of the plots examined by psychometricians. Each plot had a width of 768 pixels and a height of 512 pixels. Although the plots could have been slightly trimmed to reduce the number of white pixels, this was intentionally avoided to ensure that both humans and the machine analyzed identical plots. While removing white pixels might reduce computation time, it does not impact the method’s performance, as CNNs learn from foreground features rather than the white background. If computation time had been a major concern, more advanced models such as Faster R-CNN or Vision Transformers— which focus on the most relevant regions of an image—could have been considered.
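As an illustration of how such images can be generated, the sketch below draws a 2×3 panel of subplots (one per smoothing value) and saves it at 768×512 pixels. It is a simplified stand-in for the operational plotting code; the function and argument names are hypothetical, and the styling choices are assumptions.

```python
import matplotlib.pyplot as plt

def save_smoothing_panel(scores, unsmoothed, smoothed_by_s, se, path):
    """Save one 768x512-pixel image with a subplot per smoothing value,
    each showing the smoothed curve, the unsmoothed curve, and +/-1 SE bands."""
    fig, axes = plt.subplots(2, 3, figsize=(7.68, 5.12), dpi=100)  # 768x512 pixels
    for ax, (s_value, smoothed) in zip(axes.ravel(), sorted(smoothed_by_s.items())):
        ax.fill_between(scores, unsmoothed - se, unsmoothed + se, alpha=0.3)
        ax.plot(scores, unsmoothed, color="gray", linewidth=1)   # unsmoothed curve
        ax.plot(scores, smoothed, color="black", linewidth=1)    # smoothed curve
        ax.set_title(f"S = {s_value}")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```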
Figure 3.
A Sample Image of a Plot of Cubic Spline Curves for Six Smoothing Values. Note. The Curves of the Unsmoothed Equating Relationship and the Standard Error Bands Were Also Included
Figure 4.
A Sample Image of a Panel With Plots of Cubic Spline Curves for Six Smoothing Values. Note. 1. The Curves of the Unsmoothed Equating Relationship and the Standard Error Bands Were Also Included. 2. Smoothed Curves are Associated With Smoothing Values of 0.01, 0.05, 0.1, 0.15, 0.3, and 0.5 From Left to Right and Top to Bottom, Respectively
The two figures capture different relationships: Figure 3 shows all smoothing curves on top of each other, thus revealing the interplay among all curves, while Figure 4 displays each smoothing curve in one plot of the panel, thus emphasizing the relationship between the smoothed curve, the unsmoothed curve, and the equating error bands. It was not clear which approach was better for deep learning; the present study aimed to address this question.
The test forms were randomly split into three data sets: the training data set (100 test forms or 62%), the validation data set (33 test forms or 20%), and the testing data set (28 test forms or 18%). Each data set had its unique purpose: the training data set was used to fit the model, the validation data set was used to tune the parameters of the model, and the testing data set was used to evaluate the performance of the model.
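A split along these lines could be produced as in the sketch below. sklearn's train_test_split is used here only for illustration; the fixed random seed is an assumption, and the actual split procedure was not specified beyond being random.

```python
import numpy as np
from sklearn.model_selection import train_test_split

form_ids = np.arange(161)                                  # 161 test forms
train_ids, rest_ids = train_test_split(form_ids, train_size=100, random_state=0)
val_ids, test_ids = train_test_split(rest_ids, train_size=33, random_state=0)
print(len(train_ids), len(val_ids), len(test_ids))         # 100 33 28
```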
Keras (Chollet & Others, 2015) was used to train the convolutional neural network. Specifically, two convolutional layers and two pooling layers were used. The first convolutional layer used a 3×3 window as the kernel (i.e., a small matrix of learnable weights) for convolution and yielded 32 filters or nodes. The second convolutional layer used a 3×3 window as the kernel for convolution and yielded 32 filters or nodes. Both pooling layers used a 2×2 window and employed the max pooling approach. After the second pooling layer and flattening, a regular densely connected neural network layer with 128 neurons was added before the output layer. The activation function for this densely connected layer and both convolutional layers was a rectified linear unit (ReLU), the most commonly used activation function in neural networks. The output layer applied the softmax function because the classification had more than two categories; for dichotomous categories, the sigmoid function would be used. The Adam algorithm was used for optimization with categorical cross entropy being the loss function. The model was trained in 128 seconds on a ThinkPad equipped with 16 GB of RAM and an Intel Core i7-1185G7 processor.
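A minimal Keras sketch consistent with this description is shown below. Details not stated in the text (such as pixel rescaling, the input channel count, and the training call) are assumptions, and the operational code may have differed.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 6                    # six candidate smoothing values
IMG_HEIGHT, IMG_WIDTH = 512, 768   # plot images: 768 pixels wide, 512 high

model = keras.Sequential([
    layers.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    layers.Rescaling(1.0 / 255),                       # assumed preprocessing
    layers.Conv2D(32, (3, 3), activation="relu"),      # first convolutional layer
    layers.MaxPooling2D((2, 2)),                       # first max pooling layer
    layers.Conv2D(32, (3, 3), activation="relu"),      # second convolutional layer
    layers.MaxPooling2D((2, 2)),                       # second max pooling layer
    layers.Flatten(),
    layers.Dense(128, activation="relu"),              # densely connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"),   # one probability per class
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels,
#           validation_data=(val_images, val_labels), epochs=10)
```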
Evaluation
The performance of the deep learning model was evaluated in terms of accuracy of the selected optimal smoothing values, i.e., the agreement rate between human and machine on the optimal smoothing values. A successfully trained deep learning algorithm should result in decisions consistent with those of the psychometricians.
Because equated scores were used to determine the scale scores reported to examinees, the impact of using deep learning on equated scores was also evaluated. The root mean square deviation (RMSD) in equating was computed between the equated scores yielded by the deep learning model and those by humans. The statistic was calculated by
$$RMSD = \sqrt{\sum_{i} f_i \left[\hat{e}_{Y,\mathrm{machine}}(x_i) - \hat{e}_{Y,\mathrm{human}}(x_i)\right]^2},$$

where $\hat{e}_{Y,\mathrm{machine}}(x_i)$ and $\hat{e}_{Y,\mathrm{human}}(x_i)$ are equated scores yielded by the model and humans, respectively, $f_i$ is the relative frequency of test takers who earned a particular score, and $i$ indexes the score points. RMSD quantifies the weighted average of the equating difference between human- and machine-generated equating results across all raw score points. We would like to highlight that the RMSD equals zero when both humans and machines select the same smoothing parameter.
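A direct implementation of this statistic is straightforward; the sketch below is illustrative, and the function name is ours.

```python
import numpy as np

def rmsd_of_equating(equated_machine, equated_human, freq):
    """Frequency-weighted root mean square deviation between machine- and
    human-based equated scores across raw score points."""
    rel_freq = np.asarray(freq, dtype=float) / np.sum(freq)
    diff = np.asarray(equated_machine, dtype=float) - np.asarray(equated_human, dtype=float)
    return float(np.sqrt(np.sum(rel_freq * diff ** 2)))

# Identical smoothing choices yield identical equated scores and an RMSD of zero
print(rmsd_of_equating([1.2, 2.4, 3.1], [1.2, 2.4, 3.1], [10, 25, 15]))  # 0.0
```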
Results
Figure 5 shows the model accuracy of the convolutional neural network in choosing optimal smoothing values with all curves combined in one plot for the training and validation data sets. As can be seen from this figure, the accuracy values obtained from both the training data set and the validation data set stabilized after two epochs (i.e., two complete passes of the training data set through the algorithm), centering around .60. The model performed better on the training data set than on the validation data set, which is not surprising because the model was fit on the training data set. The difference between the two values, however, was small after stabilization. This finding suggests that the model generalizes reasonably well.
Figure 5.
Model Accuracy of the Convolutional Neural Network for Choosing Optimal Smoothing Values With Curves in One Plot
Figure 6 shows the model accuracy of the convolutional neural network in choosing optimal smoothing values with curves in separate plots on a panel. The results suggest that the model accuracies are similar to those presented in Figure 5, except that it took more epochs—six in Figure 6 compared to two in Figure 5—for the accuracy to stabilize.
Figure 6.
Model Accuracy of the Convolutional Neural Network for Choosing Optimal Smoothing Values With Curves in Separate Plots on a Panel
When the trained model was used to choose optimal smoothing values for the testing data, the choices agreed with the humans’ for 18 out of 28 test forms using the plots with combined curves, and 20 out of 28 using the plots with separate curves on the same panel. In other words, 64% and 71% of the machine’s choices agreed with those of the humans’ when using the combined curves and separate curves, respectively—slightly outperforming the 60% agreement rate observed among human raters. This finding appears to favor the use of a panel of separate curves for deep learning.
To examine the magnitude of the difference in the selected optimal smoothing values, Table 1 shows the optimal smoothing values chosen by humans and machine with combined curves on one plot. To save space, test forms with the same optimal values chosen by humans and machine (18 test forms) were not included. As can be seen from this table, most differences were small in that they were in adjacent categories (e.g., a smoothing value of 0.01 vs. 0.05). Out of the 10 forms with different smoothing parameters selected by humans and machine, only the second and 10th forms were not in adjacent categories. Table 1 also shows the root mean square deviation (RMSD) in equating for the test forms where the machine chose different smoothing values from humans. No RMSD in equating was larger than 0.08. Considering that random equating error at most score points was larger than 0.20 (results not shown), these differences were small.
Table 1.
Different Smoothing Values Chosen by Humans and Machine with Smoothing Curves on One Plot and Their Impact on Equating
Human selection | Machine selection | RMSD of equating |
---|---|---|
0.01 | 0.05 | 0.04 |
0.01 | 0.10 | 0.06 |
0.01 | 0.05 | 0.04 |
0.05 | 0.10 | 0.02 |
0.05 | 0.10 | 0.03 |
0.10 | 0.05 | 0.03 |
0.10 | 0.05 | 0.03 |
0.10 | 0.05 | 0.03 |
0.15 | 0.05 | 0.07 |
0.30 | 0.10 | 0.08 |
Note. The optimal smoothing values were chosen from six candidates: 0.01, 0.05, 0.10, 0.15, 0.30, and 0.50. The same smoothing values chosen by humans and machine (64% of total examined curve sets) were not included.
Table 2 presents similar information to Table 1, the only difference being that the curves for different smoothing values were individually plotted on a panel instead of being combined in one plot. Similar conclusions can be drawn from Table 2. We would like to point out that even though the machine selections shown in Tables 1 and 2 include only smoothing values of 0.05 and 0.10, the classification was not binary, because other values were also selected by humans or machine for other forms. Those forms were not included in the tables because humans and machine chose the same values. The actual task was a six-class classification.
Table 2.
Different Smoothing Values Chosen by Humans and Machine with Smoothing Curves on Separate Plots and Their Impact on Equating
Human selection | Machine selection | RMSD of equating |
---|---|---|
0.01 | 0.05 | 0.04 |
0.01 | 0.05 | 0.04 |
0.01 | 0.05 | 0.04 |
0.10 | 0.05 | 0.03 |
0.10 | 0.05 | 0.03 |
0.10 | 0.05 | 0.03 |
0.15 | 0.05 | 0.07 |
0.30 | 0.05 | 0.10 |
Note. The optimal smoothing values were chosen from six candidates: 0.01, 0.05, 0.10, 0.15, 0.30, and 0.50. The same smoothing values chosen by humans and machine (71% of total examined curve sets) were not included.
In summary, the deep learning model used in this study tended to choose the same optimal smoothing values two times out of three as those chosen by humans. When the choices differed, the differences were small, and so was the impact on equating.
Discussion and Conclusions
Process automation plays an important role in almost every corner of the world. The importance of developing an automatic procedure for choosing optimal smoothing values in equating cannot be overemphasized. Visually inspecting equating graphs is the current practice for choosing the optimal degree of smoothing. This manual process is time-consuming, especially when it is conducted multiple times. It is also prone to inconsistency because a person may make a different decision if asked to make the judgement a second time. By contrast, a well-trained deep learning procedure not only saves time and money for operational testing programs but also leads to more consistent equating results, with the same smoothing degree chosen regardless of who the psychometricians or expert panels are.
Although deep learning with convolutional neural networks has been widely used for image recognition and classification, the present study concerns a novel usage – choosing optimal smoothing values in the context of equating. The results are promising because the trained deep learning model successfully predicted human choices 64% of the time or more. The level of agreement was comparable to—if not better than—that observed among human raters.
The results from this study also suggest that, when the choices differed, the differences were small—most choices by humans and machine fell in adjacent categories. As a result, the impact on equating results was also minimal, even smaller than random equating error. In other words, if equating were conducted using a different random sample, the difference in equating results would be much larger than the difference resulting from using a machine to choose the optimal smoothing degree. The comparable agreement between humans and machine choices, coupled with the minor impact on equating, suggests the potential of using deep learning to automatically choose optimal smoothing values for equating. Even if a testing program chooses not to fully automate the process—due to concerns such as insufficient accuracy or limited control—machines can still play a valuable role as expert assistants. For example, a traditional panel of three psychometricians could be restructured to include two psychometricians and one machine, thereby reducing manpower by one third while maintaining the quality of decision-making.
The promising results of this study also shed light on the potential of using deep learning with small datasets. Typically, deep learning models require large sample sizes to perform effectively. For example, Kaggle (2013) used a training set of 25,000 images for classifying dogs versus cats, while Wang et al. (2017) trained their model on 112,000 X-ray images for disease classification. However, Cui (2021) demonstrated that the effectiveness of machine learning is not necessarily restricted to big data. The results of this study also indicate that deep learning can yield robust performance, even when the sample size is relatively small.
Although other smoothing methods—for example, the strong true score method (Lord, 1965) and the polynomial log-linear method (Kolen & Brennan, 2014)—can be used to test the feasibility of automatic smoothing degree selection using deep learning, the present study focused on the cubic spline method for two reasons. First, the literature has shown that the strong true score method is not flexible enough to handle certain score distributions in practice (Cui & Kolen, 2009; Hanson et al., 1994; Kolen, 1991). For example, Cui and Kolen (2009) found that the strong true score method did not work well when the score distribution was bimodal. Hanson (1991) also indicated that sometimes only three out of the four parameters were estimated in the strong true score method. The second reason for focusing on the cubic spline method is that some practitioners may prefer this method over others because smoothing is applied directly on the equipercentile equating function. As Kolen and Brennan (2014, p. 98) indicated, this smoothing method “[has] been used in practice in testing programs with good results.”
For testing programs that do use the polynomial log-linear method, however, the deep learning model investigated in this study could still be applied. In this case, the plots of polynomial log-linear smoothed curves and human choices of the optimal polynomial degrees could be used to train and test the deep learning model. The results could also be compared to other smoothing value selection methods, such as AIC (Akaike, 1981).
There are two ways to plot the curves: one with all curves combined in one plot and the other with each curve plotted separately on a panel. The present study shows that the differences between the two were small in terms of model accuracy and impact on equating. Although the panel approach performed slightly better than the combined-plot approach, additional research is warranted to confirm its advantage. The combined-plot approach is appealing because of the simplicity of graph creation and the smaller graph size. It is slightly easier to draw one plot than multiple plots on a panel, and a single plot is smaller than a panel at the same resolution, resulting in less data for the neural network to process and thus less computation time.
In this study, we focused on using convolutional neural networks (CNNs) to select optimal smoothing values, due to their inherent suitability for image analysis. Other simpler machine learning methods—such as k-Nearest Neighbors, Support Vector Machines (SVM), and Random Forests—could also be considered, particularly given their typically lower sample size requirements. However, since these methods are not specifically designed for image data, extensive preprocessing would be required. One possible approach involves extracting statistical features associated with different smoothing values and using them as input for these methods. Furthermore, a hybrid approach that combines CNNs for image analysis with one or more of these methods applied to collected statistics may yield more accurate results in selecting optimal smoothing values. This approach warrants further research.
This study could also be extended to include other measures of model quality beyond accuracy. We chose accuracy because it is commonly used for simple classification tasks and during the early stages of model development (Sammut & Webb, 2010). It is also easy for a broad audience to understand, as it directly answers the question: ‘How often is the model right?’ However, other measures—such as precision, recall, F1 score, receiver operating characteristic (ROC) curve, and confusion matrix—could offer complementary insights into model performance from different perspectives.
With more research on deep learning in the field of educational measurement, it is reasonable to expect that deep learning can help automate many processes traditionally done by humans, easing the demand for manual work. Caution, however, should be taken when applying deep learning to reap the benefits of automation. It is true that automation has brought tremendous benefits to every corner of the world: labor saved, time saved, reliability improved, just to name a few. Automation, however, is not without problems. For example, the crashes of two Boeing 737 Max aircraft were the result of an automatic stability control system being fooled by a failed sensor. Additionally, we would like to point out that, given the unique characteristics of different testing programs, such as test length, test content, number of examinees, etc., machine learning models need to be constructed through training, validating, and testing data sets specifically for each testing program before they can be used to select the optimal postsmoothing parameter.
The use of deep learning would not eliminate the need for experts. They would still need to work at their desks – but this time, with a different task. Instead of repeating the process for more than one hundred sets of curves, they would randomly select several sets and monitor the performance of the deep learning model. Time would be saved, and accuracy would be improved. Machines should not replace humans completely, but they can clearly help.
Footnotes
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
ORCID iDs
Chunyan Liu https://orcid.org/0000-0002-7863-8229
Zhongmin Cui https://orcid.org/0000-0003-2426-6762
References
- Akaike H. (1981). Likelihood of a model and information criteria. Journal of Econometrics, 16(1), 3–14. 10.1016/0304-4076(81)90071-3
- Chollet F., & others. (2015). Keras. https://keras.io
- Cui Z. (2018). Can a machine learn to screen irregularities? [ACT data bytes]. ACT.
- Cui Z. (2021). Machine learning and small data. Educational Measurement: Issues and Practice, 40(4), 8–12. 10.1111/emip.12472
- Cui Z., Kolen M. J. (2008). Comparison of parametric and nonparametric bootstrap methods for estimating random error in equipercentile equating. Applied Psychological Measurement, 32(4), 334–347. 10.1177/0146621607300854
- Cui Z., Kolen M. J. (2009). Evaluation of two new smoothing methods in equating: The cubic B-spline presmoothing method and the direct presmoothing method. Journal of Educational Measurement, 46(2), 135–158. 10.1111/j.1745-3984.2009.00074.x
- Hanson B. A. (1991). Method of moments estimates for the four-parameter beta compound binomial model and the calculation of classification consistency indexes [ACT Research Report]. ACT.
- Hanson B. A., Zeng L., Colton D. A. (1994). A comparison of presmoothing and postsmoothing methods in equipercentile equating [ACT Research Report]. ACT.
- Kaggle. (2013). Dogs vs. Cats. https://www.kaggle.com/c/dogs-vs-cats
- Kolen M. J. (1984). Effectiveness of analytic smoothing in equipercentile equating. Journal of Educational Statistics, 9(1), 25–44. 10.2307/1164830
- Kolen M. J. (1991). Smoothing methods for estimating test score distributions. Journal of Educational Measurement, 28(3), 257–282. 10.1111/j.1745-3984.1991.tb00358.x
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
- Kubat M. (2015). An introduction to machine learning (1st ed.). Springer Publishing Company.
- Liu C., Kolen M. J. (2018). A comparison of strategies for smoothing parameter selection for mixed-format tests under the random groups design. Journal of Educational Measurement, 55(4), 564–581. 10.1111/jedm.12192
- Liu C., Kolen M. J. (2020). A new statistic for selecting the smoothing parameter for polynomial loglinear equating under the random groups design. Journal of Educational Measurement, 57(3), 458–479. 10.1111/jedm.12257
- Liu O. L., Rios J. A., Heilman M., Gerard L., Linn M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215–233. 10.1002/tea.21299
- Lord F. M. (1965). A strong true-score theory, with applications. Psychometrika, 30(3), 239–270. 10.1007/BF02289490
- Lord F. M. (1982). The standard error of equipercentile equating. Journal of Educational Statistics, 7(3), 165–174. 10.2307/1164642
- Mitchell T. M. (2017). Key ideas in machine learning. https://www.cs.cmu.edu/~tom/mlbook/keyIdeas.pdf
- Reinsch C. H. (1971). Smoothing by spline functions. II. Numerische Mathematik, 16(5), 451–454. 10.1007/BF02169154
- Sammut C., Webb G. I. (2010). Encyclopedia of machine learning. Springer.
- von Davier M. (2018). Automated item generation with recurrent neural networks. Psychometrika, 83(4), 847–857. 10.1007/s11336-018-9608-y
- Wang X., Peng Y., Lu L., Lu Z., Bagheri M., Summers R. M. (2017). ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. IEEE CVPR, 2097–2106. https://arxiv.org/abs/1705.02315
- Zeng L., Kolen M., Hanson B., Cui Z., Chien Y. (2004). RAGE-RGEQUATE [Computer software]. University of Iowa.