Bayesian Modeling of the Mnemonic Similarity Task Using Multinomial Processing Trees

Michael D Lee; Craig EL Stark

doi:10.1007/s41237-023-00193-3

Author manuscript; available in PMC: 2024 Mar 13.

Published in final edited form as: Behaviormetrika. 2023 Jan 20;50(2):517–539. doi: 10.1007/s41237-023-00193-3


PMCID: PMC10936565 NIHMSID: NIHMS1948122 PMID: 38481469

Abstract

The Mnemonic Similarity Task (MST: Stark et al., 2019) is a modified recognition memory task designed to place strong demand on pattern separation. The sensitivity and reliability of the MST make it an extremely valuable tool in clinical settings. We develop new cognitive models, based on the multinomial processing tree framework, for two versions of the MST. The models are implemented as generative probabilistic models and applied to behavioral data using Bayesian graphical modeling methods. We demonstrate how the combination of cognitive modeling and Bayesian methods allows for flexible and powerful inferences about performance on the MST. These demonstrations include latent-mixture extensions for identifying individual differences in decision strategies, and hierarchical extensions that measure fine-grained differences in the ability to detect lures. One key finding is that the availability of a “similar” response in the MST reduces individual differences in decision strategies and allows for more direct measurement of recognition memory.

Keywords: Mnemonic Similarity Task, multinomial processing trees, recognition memory, Bayesian graphical models

Introduction

The Mnemonic Similarity Task (MST: Stark et al., 2015, 2019) is a modified recognition memory task designed to place strong demands on pattern separation. The most common form of the task involves two phases. The first is a study phase, in which a participant is presented with a sequence of stimuli, and is given a simple task to encourage active encoding. The second phase is a test phase, in which a sequence of items is presented and the participant must indicate whether or not each item was presented on the study list. The key innovation of the MST is that the test phase includes lure stimuli that are similar to, but not the same as, studied stimuli. The degree of the similarity is often quantified, so that an MST involves a range of different levels of lure stimuli. In one version of the MST, the only possible response options are “old” and “new”, which means that lure stimuli are correctly classified as new. In another version, the possible response options are “old”, “similar”, and “new”, which means the lure stimuli are correctly classified as similar.

Through the introduction of lure stimuli, the MST provides a more fine-grained test of recognition memory. Its sensitivity and reliability have made it an extremely valuable tool in clinical settings for identifying hippocampal dysfunction associated with healthy aging, dementia, schizophrenia, depression, and other disorders. The standard empirical measure of lure performance is the Lure Discrimination Index (LDI), which is the difference in the probability of a “similar” response to a lure compared to the probability of a “similar” response to a new item. An alternative, and potentially complementary, approach to empirical measures like the LDI is to use cognitive models of task behavior to make inferences about people’s underlying memory systems and decision processes. Model-based approaches have a long history in clinical assessment and have been applied to diagnostic tasks involving recognition (Snodgrass & Corwin, 1988), recall (Alexander et al., 2016; Lee et al., 2020), and semantics (Chan et al., 2001; Westfall & Lee, 2021).
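The LDI definition above can be written as a short calculation. The following is a minimal sketch, assuming response counts are available per item type; the function name, argument names, and example counts are illustrative and are not taken from the MST software or from the data reported here.

```python
def lure_discrimination_index(similar_to_lures, n_lures, similar_to_new, n_new):
    """LDI = p("similar" | lure) - p("similar" | new item)."""
    if n_lures <= 0 or n_new <= 0:
        raise ValueError("item counts must be positive")
    return similar_to_lures / n_lures - similar_to_new / n_new


# Hypothetical counts for one participant (not data from the experiment).
ldi = lure_discrimination_index(similar_to_lures=40, n_lures=64,
                                similar_to_new=8, n_new=64)
print(ldi)
```

Subtracting the “similar” rate for new items acts as a correction for any general bias toward responding “similar”.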

Signal Detection Theory (SDT: Green & Swets, 1966; MacMillan & Creelman, 2004) is widely used as a model of recognition memory in tasks that do not incorporate lure stimuli. In these models, the old and new stimuli correspond to the SDT signal and noise distributions. One way to extend the SDT model to the MST would be to introduce additional distributions located between the signal and noise distributions representing the various levels of lure stimuli. This extension has been successfully used for the old-new task (e.g. Villarreal et al., 2022). For the old-similar-new task, however, an extended SDT approach also requires introducing a second decision criterion associated with the “similar” response. This makes the strong assumption that a “similar” response is based on a recognition memory strength that is too strong for a “new” response but too weak for an “old” response. Conceptually, as argued by Stark et al. (2019, p. 940), this assumption seems problematic. When presented with a lure item, it seems at least possible that a participant actively remembers the similar item presented during study, and is able to discriminate that item from the one being presented. Such a mental comparison would provide active evidence for a “similar” response, contrary to the assumptions of a SDT model.

A different cognitive modeling framework is provided by discrete models based on multinomial processing trees (MPTs: Batchelder & Riefer, 1980; Erdfelder et al., 2009; Kellen & Klauer, 2014), which are also widely used to model standard recognition memory tasks. The most common MPT model of recognition memory is the two-high threshold model. It assumes that when an old item is presented, there is some probability that it is remembered and an “old” response is made. If it is not remembered, the participant is assumed to guess, which could lead to either an “old” or “new” response. On the other hand, if a new item is presented, there is some probability that the participant remembers that this item was not studied, leading to a “new” response. Otherwise, the same guessing process is used. The two-high threshold model, in this sense, is consistent with the intuition that it is possible for an old item to be remembered, but also for a new item to be remembered not to have been studied (Klauer & Kellen, 2018).

In this article, we develop new MPT models of the study-test MST tasks with both old-new and old-similar-new response options. The models are implemented as graphical models (Lee & Wagenmakers, 2013). This allows Bayesian methods of inference to be applied, and also allows the core task models to be extended to provide accounts of individual differences in lure discriminability and response strategies. We demonstrate these features of the modeling approach in case studies using previously collected data. The structure of the remainder of this article is as follows. In the next section, we develop the two basic MPT models of the MST. We then describe the data used in the case studies. The first case study focuses on the old-new MST. The second case study focuses on the old-similar-new MST. We conclude by discussing how the models help measure and understand individual differences in recognition memory, and emphasize the role of Bayesian graphical models in allowing the flexible development of useful models while providing rigorous statistical inference.

New MPT Models of the MST

MPT Model of the Old-New Task

The two-high threshold model can naturally be extended to the MST, using the probability trees shown in Figure 1. There are three trees, corresponding to old, new, and lure stimuli. The processing of old stimuli is identical to the two-high threshold model, with probabilities $ρ$ of remembering and $γ$ of guessing old. The processing of new stimuli is almost identical to the two-high threshold model, except that the probability of remembering that an item was not studied is now $ψ$ , which is potentially different from $ρ$ . A weakness of the two-high threshold model is that it has to assume for identifiability that these probabilities are equal. It seems psychologically plausible, however, that these probabilities could be different, and the additional information provided by the MST design and its introduction of lure stimuli allows the equality constraint to be removed.

Most importantly, the MST model adds assumptions about processing lures. It is assumed that the first step is an attempt to remember the studied item on which the lure is based. If memory succeeds, there is then a probability $δ_{l}$ that the presented lure is discriminated from the remembered item. If both memory and discrimination succeed, a “new” response is made. If memory succeeds but discrimination fails, an “old” response is made. If the studied item is not remembered, the same guessing process applies as for old and new stimuli. The $δ_{l}$ discriminability applies to lures of type $l$ , allowing for different probabilities depending on how similar the lure is to the previously studied item.

Collectively, these assumptions mean that the probability of responding “old” to an old item is

$$θ = ρ + (1 - ρ) γ, \tag{1}$$

because it could arise either by remembering or guessing. The probability of responding “old” to a new item is

$$θ = (1 - ψ) γ, \tag{2}$$

because it can only arise by guessing. Finally, the probability of responding “old” to a level $l$ lure item is

$$θ = ρ (1 - δ_{l}) + (1 - ρ) γ, \tag{3}$$

because it could arise by remembering the relevant study item but failing to discriminate it from the presented item, or by failing to remember and then guessing.
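The three response probabilities for the old-new task can be collected into one function. This is a minimal sketch of Equations 1 to 3, not the authors' JAGS implementation; the function name and the parameter values in the example are illustrative assumptions.

```python
def p_old_response(item, rho, psi, gamma, delta=None):
    """Probability of an "old" response in the old-new MPT model.

    item  : "old", "new", or an integer lure level l (1-based)
    rho   : probability of remembering a studied item
    psi   : probability of remembering that a new item was not studied
    gamma : probability of guessing "old"
    delta : sequence of lure discriminabilities; delta[l - 1] is used for level l
    """
    if item == "old":
        return rho + (1 - rho) * gamma          # Equation 1: remember, or guess "old"
    if item == "new":
        return (1 - psi) * gamma                # Equation 2: only by guessing "old"
    d = delta[item - 1]
    return rho * (1 - d) + (1 - rho) * gamma    # Equation 3: fail to discriminate, or guess


# Illustrative parameter values (assumed, not estimates from the paper).
delta = [0.1, 0.3, 0.5, 0.7, 0.9]
probs = {item: p_old_response(item, rho=0.7, psi=0.6, gamma=0.5, delta=delta)
         for item in ["old", 1, 2, 3, 4, 5, "new"]}
print(probs)
```

Two limiting cases are a useful check: with $δ_{l} = 0$ a lure is treated exactly like an old item, and with $δ_{l} = 1$ an “old” response to a lure can only arise from guessing.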

MPT Model of Old-Similar-New Task

The MPT model of the old-similar-new MST requires two additional changes, as shown in Figure 2. The first is that guessing can produce “old”, “similar”, or “new” responses. The model has probabilities for all three possibilities, $γ^{o}$ , $γ^{s}$ , and $γ^{n}$ , with the constraint $γ^{o} + γ^{s} + γ^{n} = 1$ . The second change is that a “similar” response is made if a remembered item is distinguished from a presented lure. In the old-new task, this response is “new”, because all that can be indicated is that the item was not on the study list. In the old-similar-new task, it can be identified as a lure item via a “similar” response.

The model now makes predictions about the probability of “old”, “new”, and “similar” responses for each type of presented item. For an old item, these probabilities are

$$θ = \left(ρ + (1 - ρ) γ^{o},\ (1 - ρ) γ^{n},\ (1 - ρ) γ^{s}\right), \tag{4}$$

where the vector $θ$ lists the probabilities for “old”, “new”, and “similar” responses, in that order. For a new item, the probabilities are

$$θ = \left((1 - ψ) γ^{o},\ ψ + (1 - ψ) γ^{n},\ (1 - ψ) γ^{s}\right). \tag{5}$$

Finally, the probabilities for a level $l$ lure item are

$$θ = \left(ρ (1 - δ_{l}) + (1 - ρ) γ^{o},\ (1 - ρ) γ^{n},\ ρ δ_{l} + (1 - ρ) γ^{s}\right). \tag{6}$$
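The three-response model can be sketched in the same way. The function below is an illustrative assumption rather than the authors' code. It returns probabilities in the order “old”, “new”, “similar”, and for a lure it places the remember-and-discriminate path $ρ δ_{l}$ on the “similar” response, which the tree description requires for the three probabilities to sum to one.

```python
def response_probs_osn(item, rho, psi, gamma, delta=None):
    """Return (p_old, p_new, p_similar) for the old-similar-new MPT model.

    item  : "old", "new", or an integer lure level l (1-based)
    gamma : (gamma_old, gamma_similar, gamma_new), which must sum to 1
    delta : sequence of lure discriminabilities; delta[l - 1] is used for level l
    """
    g_old, g_sim, g_new = gamma
    if abs(g_old + g_sim + g_new - 1.0) > 1e-9:
        raise ValueError("guessing probabilities must sum to 1")
    if item == "old":
        return (rho + (1 - rho) * g_old, (1 - rho) * g_new, (1 - rho) * g_sim)
    if item == "new":
        return ((1 - psi) * g_old, psi + (1 - psi) * g_new, (1 - psi) * g_sim)
    d = delta[item - 1]
    return (rho * (1 - d) + (1 - rho) * g_old,   # remembered but not discriminated, or guess "old"
            (1 - rho) * g_new,                   # guess "new"
            rho * d + (1 - rho) * g_sim)         # remembered and discriminated, or guess "similar"


# Illustrative parameter values (assumed, not estimates from the paper).
gamma = (0.4, 0.3, 0.3)
delta = [0.1, 0.3, 0.5, 0.7, 0.9]
for item in ["old", 1, 3, 5, "new"]:
    print(item, response_probs_osn(item, rho=0.7, psi=0.6, gamma=gamma, delta=delta))
```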

Experiment

Participants

A total of 21 participants (13 female, mean age 21 years, age range 18–24 years) were recruited through the Sona Systems experimental management system at the University of California at Irvine, which organizes the participation of students in science experiments for course credit. All participants signed consent forms approved by the Institutional Review Board of the University of California at Irvine, and the study was conducted in compliance with the Board.

Methods

Participants were given both old-new and old-similar-new versions of the MST using different stimulus sets. In both tasks 128 images were studied while completing an indoor/outdoor task to encourage active encoding (2.0s duration, 0.5s ISI). During testing, 192 images were presented (2.0s duration, 0.5s ISI), made up of 64 repeated old images, 64 new images, and 64 lure images. The order of the task and the two stimulus sets assigned to each task were counterbalanced across participants.

Participants indicated their responses via on-screen button clicks. We used a web-based version of the MST (Stark et al., 2021) written in JavaScript using the jsPsych library (De Leeuw, 2015) to allow for remote testing. It is a mature, free, stable library that has been rigorously tested even for demanding reaction time-based experiments (De Leeuw & Motz, 2016; Hilbig, 2016; Pinet et al., 2017). In addition, we integrated the task with the open-source JATOS package (Lange et al., 2015) to provide a reliable means of securely administering test sessions on the web and managing the data.

In the MST, lure images have an empirically-derived “mnemonic similarity” in relation to their studied counterpart based on an independent assessment of how frequently each image is incorrectly judged to be identical to the study image (Lacy et al., 2011; Yassa et al., 2011). In the MST, this continuous probability is binned into five levels with level 1 lures being the most similar and level 5 lures being the least similar.

Behavioral Results

Accuracy ranged between 30% and 86% for the old-new task, with a mean of 68%, and between 22% and 84% for the old-similar-new task, with a mean of 62%. The product-moment correlation for participant accuracy across the two tasks was $r = 0.77$ . This is consistent with previous findings. For example, Stark et al. (2015) found LDI measures for these tasks to have a correlation of 0.79.

MPT Model for the Old-New MST

Basic Model

Figure 3 shows a graphical model representation of the MPT model for the old-new MST. In graphical models, nodes represent latent parameters and observed information, and specify how they are assumed to be related to one another. Children depend on their parents in the graph structure, and encompassing plates identify independent replications of the graph structure. Using the notation adopted by Lee & Wagenmakers (2013), the model parameters $ρ$ , $ψ$ , $γ$ , and $δ = (δ_{1}, \dots, δ_{5})$ are shown as circular and unfilled nodes, because they represent continuous latent parameters. Whether the stimulus item on trial $t$ is an old, new, or level $l$ lure is represented by $s_{t}$ . This node is square and shaded, because it represents discrete observed information. The model parameters and item information together determine the probability $θ_{t}$ of an “old” response. This is shown as a double-bordered node because it is a deterministic function of the parameters and item information. Finally, the observed behavior on trial $t$ is $y_{t} = 1$ if the response is “old” and $y_{t} = 0$ if the response is “new”. This is discrete and observed, and depends on the response probability $θ_{t}$ . The trial-level information (the item information, response probability, and observed response) is encompassed by a plate that indicates the replication of this graph structure across trials.

We implemented the graphical model in JAGS (Plummer, 2003), which is a high-level scripting language that automates fully Bayesian inference based on computational methods. Throughout this article, we generally sampled models using 8 independent chains with 1000 samples each after discarding 1000 burn-in samples. Convergence was assessed by visual inspection of traceplots and the standard $\hat{R}$ statistic (Brooks & Gelman, 1997). JAGS scripts and MATLAB code for applying the models are provided in supplementary information.
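For readers unfamiliar with the $\hat{R}$ convergence check, the following is a minimal sketch of the basic potential scale reduction factor for equal-length chains of a single parameter. It is an illustration only: it omits the corrections discussed by Brooks & Gelman (1997), it is not necessarily the computation performed by the software used here, and the function name is an assumption.

```python
from statistics import mean, variance


def r_hat(chains):
    """Basic potential scale reduction factor for equal-length MCMC chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    w = mean([variance(c) for c in chains])   # average within-chain variance
    b = n * variance(chain_means)             # between-chain variance
    var_hat = (n - 1) / n * w + b / n         # pooled estimate of posterior variance
    return (var_hat / w) ** 0.5


# Toy chains: the second pair disagrees about the location, so R-hat is large.
print(r_hat([[0, 1, 2, 3], [0, 1, 2, 3]]))
print(r_hat([[0, 1, 2, 3], [10, 11, 12, 13]]))
```

Values close to 1 indicate that the chains agree with each other; values well above 1 indicate that they have not mixed.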

Descriptive Adequacy

Figure 4 summarizes the descriptive adequacy (or “fit”) of the model to each participant. Each panel corresponds to a participant, and their behavior is shown by the orange line. The line follows the item ordering old $(O)$ , level 1 lure $(L_{1})$ , level 2 lure $(L_{2}), \dots$ , level 5 lure $(L_{5})$ , and new $(N)$ . This ordering corresponds to decreasing similarity of the test item to a studied item. The line then indicates the proportion of “old” responses to each item type made by the participant. Optimal performance would be 100% “old” responses for old items and 0% “old” responses for all other items. The participants are labeled A–U and ordered in terms of decreasing overall accuracy across the panels. It is visually clear that more accurate participants show closer to optimal behavior, with lines that decrease as items become less similar.

The blue cross markers in Figure 4 show the mean of the posterior predictive distribution for the model and the lines above and below show the 95% credible interval. The posterior predictive distribution can be interpreted as the model’s attempt to re-describe the behavioral data. A descriptively adequate model should match the behavioral data. It is clear that the MPT model does this well for almost all of the participants. It closely matches the proportion of “old” responses to each item type, except for a couple of the worst-performing participants (e.g., Participants S and T). Most importantly, the model is able to capture the individual differences in participant behavior. For example, Participant A mostly responds “new” to the lure items, while Participant Q mostly responds “old”. The model is able to describe both of these extremes, and the intermediate sorts of behavior shown by the other participants.

Inferences

Figure 5 shows the parameter inferences made by the models. The left panel shows the relationship between the two memory parameters $ρ$ and $ψ$ . The right panel shows the relationship between $ρ$ and the guessing old probability $γ$ . Each participant’s joint posterior with respect to the pair of parameters is summarized by a marker for the posterior mean and error bars for 95% credible intervals for each marginal posterior. The two memory parameters are modestly correlated with each other, but are not identical. This is evidence in favor of the modeling assumption to distinguish between remembering that an item was studied versus remembering that an item was not studied. The memory and guessing old parameters appear to vary independently across participants, consistent with them measuring different memory and decision making processes.

Figure 5 superimposes colored squares that suggest a possible interpretation of the individual differences revealed by the parameter inferences. Each color corresponds to a potential interpretable subgroup. The yellow and green squares each represent a different sort of contaminant behavior. In the left panel, it is clear that Participants S, T, and U in these subgroups have very low memory performance. In the right panel they are separated into Participants S and T, who guess “old” and “new” about equally often, and Participant U (and potentially Q), who almost always respond “old”. Random responding and repeated responding are two common sorts of contaminant behavior (Zeigenfuse & Lee, 2010), and seem likely explanations for the poor performance of these participants. These interpretations are highly consistent with the observed behavior of Participants S, T, and U in Figure 4.

The right panel of Figure 5 identifies subgroups among participants who performed well in the task. The participants in the darker blue subgroup (L, M, N, O, P, and R) are inferred to have higher base-rates of guessing “old” and often have worse memory. The lighter blue subgroup includes the participants with the best memory and a lower probability of guessing “old”. This second subgroup generally corresponds to the participants with the greatest task accuracy.

Extension to Model Individual Differences

One way to incorporate the exploratory suggestion of subgroups into formal modeling is by extending the basic model to have hierarchical latent mixtures. As demonstrated by Bartlema et al. (2014, see also Lee 2018), hierarchical latent-mixture models provide a general statistical way of extending the cognitive models to account for two sorts of individual differences. The latent mixture extension allows for qualitatively different subgroups of participants, while the hierarchical extension allows for continuous variation within the subgroups. The latent mixture extension thus addresses both the contaminant behavior and the different types of attentive behavior. Intuitively, each component in the latent-mixture model corresponds to one of the colored boxes, and represents a different task strategy. The hierarchical extension then allows for variation in the parameters of participants within the same component, and represents fine-grained variability in exactly how a strategy is executed.

Figure 6 shows a graphical model that incorporates these extensions. There are five components in the latent-mixture model, indexed by a latent discrete parameter $z_{i}$ for the $i$th participant. The first two components correspond to attentive responding and focus on individual differences in the memory $ρ_{i}$ and guessing base-rate $γ_{i}$ individual parameters. These parameters are assumed to be drawn from overarching truncated Gaussian distributions that are different depending on whether $z_{i} = 1$ or $z_{i} = 2$ . In particular, the subgroup means are constrained to be different, with the first “low memory” subgroup having lower mean memory $μ_{ρ}^{1} < μ_{ρ}^{2}$ but greater base-rate of guessing “old” $μ_{γ}^{1} > μ_{γ}^{2}$ than the second “high memory” subgroup. Intuitively, this means $z_{i} = 1$ corresponds to the dark blue strategy in Figure 5, in which worse memory for old items is compensated for by increasing guessing “old”, while $z_{i} = 2$ corresponds to the light blue strategy in which old items are remembered better and the base-rate of guessing “old” is lower. The remaining three components correspond to different contaminant strategies, so that if $z_{i} = 3$ , there is always a 50-50 probability of responding “old” $(θ = \frac{1}{2})$ , if $z_{i} = 4$ there is a high probability of responding “old” $(θ = 0.99)$ , and if $z_{i} = 5$ there is a high probability of responding “new” $(θ = 0.01)$ .

The latent subgroup indicator is sampled for each participant from a categorical distribution $z_{i} \sim \mathrm{categorical}(ϕ)$ so that $ϕ = (ϕ_{1}, \dots, ϕ_{5})$ represents the base rate of each strategy. The base-rate itself is given a Dirichlet distribution $ϕ \sim \mathrm{Dirichlet}(1, \dots, 1)$ which corresponds to the assumption that all possible base-rates are equally likely. The memory for absence $ψ_{i}$ and the discriminability probabilities $δ_{i}$ do not define the proposed strategy differences and so continue to be allowed to vary independently across individuals.
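The generative side of the latent-mixture prior can be illustrated with a short simulation. This is a sketch under stated assumptions, not the authors' JAGS script: it draws the base rates from a uniform Dirichlet by normalizing Gamma(1, 1) draws, samples a strategy indicator for each of 21 participants, and records the fixed “old” probabilities of the three contaminant strategies. All names and the seed are illustrative.

```python
import random

# Fixed probabilities of responding "old" for the contaminant strategies (from the text).
CONTAMINANT_THETA = {3: 0.5, 4: 0.99, 5: 0.01}


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_strategy(phi, rng):
    """Draw z in {1, ..., len(phi)} from a categorical distribution with weights phi."""
    return rng.choices(range(1, len(phi) + 1), weights=phi, k=1)[0]


rng = random.Random(1)                      # arbitrary seed for reproducibility
phi = sample_dirichlet([1.0] * 5, rng)      # phi ~ Dirichlet(1, ..., 1)
z = [sample_strategy(phi, rng) for _ in range(21)]
print(phi)
print(z)
```

In the full model, participants with $z_{i} = 1$ or $z_{i} = 2$ would then have $ρ_{i}$ and $γ_{i}$ drawn from the corresponding truncated Gaussian group distributions, which this sketch leaves out.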

Inferences

Figure 7 shows the inferences about subgroup membership, represented by the posterior distribution of $z_{i}$ , made by the model for each participant. The subgroup inferences largely match the proposed groupings in Figure 5. The most accurate Participants A–K are inferred to belong to the “high memory” group, and less accurate Participants L–R are inferred to belong to the “low memory” group. The exception is Participant J, whose classification is uncertain. There is also significant uncertainty for Participant H. For the remainder of the participants the inferences about group membership are relatively confident, because the most likely subgroup has most of the posterior mass. The three participants anticipated to show contaminant behavior are inferred to belong to the appropriate “random” or “always respond old” subgroups.

While the subgroup inferences make sense, given the assumption of the different possible strategies, it is important to test for the evidence of these strategies as an account of individual differences. In particular, evidence for dividing the non-contaminant participants into two subgroups is important. (It could be argued that it is always useful to allow for the possibility of contaminant behavior, even if it is not observed in a specific data set). Evidence for two qualitatively different forms of task-attentive behavior is provided by the Bayes factor comparing a model that includes the subgroups to one that does not.

We approximated this Bayes factor using the posterior distribution of $ϕ_{1}$ , which corresponds to the base-rate of “low memory” participants. If this base-rate is zero, the model reduces to having just one account of task-attentive behavior. Thus the Savage-Dickey method for estimating Bayes factors, which compares the ratio of prior to posterior densities at the point in parameter space where a more general model reduces to a nested special case, can be used (Wetzels et al., 2010). The Bayes factor is $\mathrm{BF}_{10} = 5$ , meaning that the data are five times more likely under the model that incorporates the two subgroups. A figure showing the posterior distributions of the base-rate probabilities $ϕ$ is provided in the supplementary information.
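The Savage-Dickey ratio can be sketched as follows. The sketch rests on two assumptions that are not taken from the paper. First, under a Dirichlet$(1, \dots, 1)$ prior over five components, the marginal prior on $ϕ_{1}$ is Beta(1, 4), whose density at zero is 4. Second, the posterior density at zero is estimated crudely from the proportion of posterior samples falling in a narrow interval next to zero; a smoother density estimator would usually be preferable, and the estimator actually used by the authors is not stated here.

```python
import random


def savage_dickey_bf10(posterior_samples, n_components=5, width=0.05):
    """Approximate BF10 for phi_1 > 0 versus the nested case phi_1 = 0.

    Prior density at 0: phi_1 is marginally Beta(1, n_components - 1) under a
    uniform Dirichlet prior, so the density at 0 equals n_components - 1.
    Posterior density at 0: proportion of samples in [0, width], divided by width.
    """
    prior_density = n_components - 1
    near_zero = sum(1 for s in posterior_samples if 0 <= s <= width)
    posterior_density = near_zero / (len(posterior_samples) * width)
    if posterior_density == 0:
        return float("inf")   # no posterior mass near zero in the samples
    return prior_density / posterior_density


# Sanity check with samples drawn from the prior itself: BF10 should be near 1.
rng = random.Random(0)
prior_like = [rng.betavariate(1, 4) for _ in range(20000)]
print(savage_dickey_bf10(prior_like))
```

When the posterior for $ϕ_{1}$ moves away from zero, less mass falls near zero than under the prior and the ratio exceeds 1, which is evidence for the model with two attentive subgroups.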

Extension to Model Discriminability

The inferred discriminabilities $δ_{i} = (δ_{i 1}, \dots, δ_{i 5})$ measure the ability of the $i$th person to discriminate a remembered study item from a current item on a test trial. In the modeling thus far, they are assumed to be independent. A stronger assumption would impose an order constraint, corresponding to the idea that more similar lures are more difficult to discriminate. Such an order constraint could be imposed in a strong way for each individual $δ_{i 1} < \dots < δ_{i 5}$ , or in a weaker way in a hierarchically extended model by applying it to the mean of group distributions for each level of lure.

We pursue an alternative approach using a model of the relationship between similarity and discriminability. This is also a form of hierarchical extension, specifying additional modeling assumptions about the relationship in terms of parameters that can vary across individuals. In an exploratory way, we observed that the inferences about $δ_{i}$ in the unconstrained model showed a mostly monotonic and often S-shaped (sigmoidal) pattern of increase in the probability of discrimination as lures become easier to distinguish. A figure showing these results is provided in the supplementary information.

On this basis, we propose the logistic model relating lure similarity to discrimination probability

$$δ_{il} = \frac{1}{1 + \exp\left\{-β_{i} \left(\frac{l}{n_{l} + 1} - τ_{i}\right)\right\}}, \tag{7}$$

where $τ$ can be interpreted as a threshold for discriminability and $β$ can be interpreted as a slope controlling how quickly discriminability increases. Figure 8 shows three specific examples of this relation that help understand the interpretation of the parameters. The lure similarity is normalized to lie between 0 and 1, so that the model is applicable to MST designs with any number of lure levels. The five levels for the current experiment have been mapped to normalized similarities of $1 ∕ 6, \dots, 5 ∕ 6$ . This is achieved by the $(\frac{l}{n_{l} + 1} - τ)$ term in the shift of the logistic. Accordingly, smaller values of $τ_{i}$ mean that the discriminability probability begins to increase for more difficult lures. The specific examples show that for $τ_{i} = 0.25$ there is already some probability of discriminating level 1 lures and discrimination is near perfect for level 2 lures. For $τ_{i} = 0.67$ , in contrast, level 4 lures are discriminated about half the time and only the least similar level 5 lures are consistently discriminated. The specific examples also show how slopes ranging from $β_{i} = 10$ to $β_{i} = 20$ impact the rate of increase in discriminability.

Figure 8. Modeled relationship between the similarity of a lure to a studied item and the probability of discriminating the lure.
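The logistic relationship in Equation 7 is simple to evaluate directly. The following minimal sketch computes the discriminability for each lure level given a slope and threshold; the function name is an assumption, and the parameter values are chosen from the ranges mentioned in the text purely for illustration.

```python
import math


def discriminability(level, beta, tau, n_levels=5):
    """Equation 7: probability of discriminating a lure of the given level.

    level    : lure level l, from 1 (most similar) to n_levels (least similar)
    beta     : slope of the logistic function
    tau      : threshold on the normalized similarity scale
    n_levels : number of lure levels n_l in the design
    """
    x = level / (n_levels + 1)                       # normalized position in (0, 1)
    return 1.0 / (1.0 + math.exp(-beta * (x - tau)))


# Discriminability across the five lure levels for one illustrative setting.
curve = [discriminability(l, beta=10, tau=0.67) for l in range(1, 6)]
print([round(d, 3) for d in curve])
```

Because level 4 maps to a normalized value of $4/6$, a threshold of $τ_{i} = 0.67$ places that level almost exactly at the midpoint of the logistic, where discrimination succeeds about half the time.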

Figure 9 shows a graphical model that implements the hierarchical extension for discriminability. The key change is that $δ_{i}$ now becomes a deterministic function of $β_{i}$ and $τ_{i}$ . In this model, only the latent-mixture components corresponding to attentive task behavior are included. Participants S, T, and U identified as contaminants by the analysis in Figure 7 are excluded from the analysis.

This model remains descriptively adequate using the same approach to posterior predictive analysis presented in Figure 4. This is an important finding, since the model has been significantly extended by introducing the two strategies, and significantly constrained by the logistic relationship between lure similarity and discriminability. A figure showing the posterior predictive analysis for this model is provided in the supplementary information.

Inferences

Figure 10 summarizes the inferences about the relationship between lure similarity and discriminability for each participant. The solid lines show the logistic function corresponding to the inferred posterior means of the slope $β_{i}$ and threshold $τ_{i}$ for the $i$th participant. The violin plots show the posterior distributions for each lure level. The slopes are reasonably similar: almost all of the participants have low discriminability for the most difficult level 1 lures, near perfect discriminability for the easiest level 5 lures, and a steady rate of increase between those extremes. There is much more variability in the thresholds at which the increase in discriminability begins. For example, Participants D and E begin to show increasing discriminability for the level 2 lures, Participants G and H begin to increase for the level 3 lures, and Participants K and L begin to increase for level 4 lures. The interpretable variability in the threshold $τ_{i}$ suggests that it is a useful measure of lure discrimination. It can be interpreted as a measure of how similar or different lures need to be so that a participant has some ability to discriminate them from remembered studied items.

Figure 11 shows one way in which the model can be used to measure participants’ memory based on their MST performance. The left panel summarizes the joint posterior distribution of the memory $ρ$ and threshold $τ$ parameters. Markers correspond to posterior means and error bars show 95% credible intervals. These two parameters capture the different aspects of memory assessed by the MST. The memory parameter corresponds to the ability to recall studied items. It essentially corresponds to the memory ability assessed by a standard recognition memory task. In support of this, the correlations of the posterior mean of $ρ$ with a standard frequentist $d'$ measure of discriminability are $r = 0.81$ between old and new items and $r = 0.54$ between old and lure items. The threshold parameter, on the other hand, corresponds to the ability to discriminate remembered items from similar ones presented at test. It corresponds to the pattern separation capabilities that the MST aims to measure.
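For reference, a standard $d'$ calculation is sketched below. It uses the difference between the z-transformed hit rate and false-alarm rate. The log-linear correction (adding 0.5 to each count and 1 to each total) is an assumption made here to avoid infinite values at rates of 0 or 1; the text does not state which correction, if any, was used. Function and argument names are illustrative.

```python
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear correction.

    hits         : number of "old" responses to old items
    n_signal     : number of old items
    false_alarms : number of "old" responses to the comparison items (new or lure)
    n_noise      : number of comparison items
    """
    hit_rate = (hits + 0.5) / (n_signal + 1)
    fa_rate = (false_alarms + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf          # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts (not data from the experiment): old vs. new, then old vs. lure.
print(d_prime(hits=50, n_signal=64, false_alarms=10, n_noise=64))
print(d_prime(hits=50, n_signal=64, false_alarms=30, n_noise=64))
```

Using lure items rather than new items as the comparison set generally gives a smaller $d'$, because “old” responses to lures are more frequent than “old” responses to new items.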

The results in Figure 11 show that these abilities vary largely independently across the participants. Memory recall probabilities can vary from about 40% to about 80% and thresholds can vary from about 20% to 80% across the normalized similarity scale. Four illustrative participants are highlighted in Figure 11 to emphasize the usefulness of this analysis. They approximately represent the four possibilities of low versus high memory and low versus high discriminability. The behavioral performances of these participants, as originally presented in Figure 4, are shown in the four right-hand panels. Participants N and L are inferred to have worse discriminability than Participants C and A, which is evident behaviorally because they often respond “old” to lures. Participants N and C are inferred to have worse memory than Participants A and L, which is evident behaviorally because they fail to respond “old” to old stimuli more often.