Abstract
We encountered a problem in which a study's experimental design called for paired data, but the pairing between subjects was lost during the data collection procedure. We were thus presented with a data set consisting of pre and post responses but with no way of determining the dependencies between the observed pre and post values. The aim of the study was to assess whether an intervention called Self-Revelatory Performance had an impact on participants' perceptions of Alzheimer's disease. Participants' responses were measured on an Affect Grid before the intervention and on a separate grid after. To address the underlying question in light of the lost pairing, we used a modified bootstrap approach to create a null distribution for our test statistic, the distance between the two Affect Grids' centers of mass. Using this approach we were able to reject our null hypothesis and conclude that there was evidence the intervention influenced perceptions about the disease.
Keywords: Affect Grid, Bootstrap, Alzheimer’s disease, Center of Mass
1. Introduction
A joint study between The University of Kansas Alzheimer's Disease Center (ADC) and Arts and AGEing KC investigated whether a particular type of therapy called Self-Revelatory Performance (SRP) [4] had an impact on participants' perceptions of Alzheimer's disease. The research hypothesis was that the SRP had a positive effect on participants' emotional stance. The study's design called for collecting a survey, which consisted of two 5-point Likert items [10] and an Affect Grid [14], from each participant before and after the performance. Using these paired data, the goal was to analyze whether each individual's response was affected by the performance and to quantify the average individual's shift in perceptions about the disease. Unfortunately, the surveys were collected in such a way (they were placed into two piles) that the pairing between subjects was lost. Because of this, the original analysis was no longer viable, and we were brought on to salvage as much information from the data as possible in order to address the research question. This paper focuses on the methodology used to analyze the responses to the Affect Grid. The Likert items were analyzed by permuting 10,000 possible unique pairings and using an ordinal quasi-symmetry model as a sensitivity analysis to the effect of the lost pairing. All statistical analyses and data management procedures were conducted in R [13].
1.1. The Affect Grid
Participants were presented with an Affect Grid identical to the one in Figure 1, with the exception that we added the row and column labels 1 through 9 after data collection. These labels were added in order to analyze the results, and we have no reason to believe the choice of these values affected the analysis. The Affect Grid is an item that addresses two questions at once: the horizontal axis measures valence (on a range from unpleasant feelings to pleasant feelings), while the vertical axis measures arousal (on a range from sleepiness to high arousal) [7]. Participants are prompted to mark the cell that best describes their current combination of valence and arousal. Affect Grids were originally designed to measure a single instance of a participant's emotional state, and in that setting the grid has been shown to be a "moderately valid measure of the general dimensions of pleasure and arousal" [8]. However, many authors have used the Affect Grid in paired and longitudinal studies despite the lack of validation in these scenarios. While some work has been done on specific issues related to multiple measurements on the Affect Grid in response to participants' tendency to exaggerate [16], we are unaware of any studies that address multiple paired measurements using an Affect Grid to study population-wide changes. The original developers of the Affect Grid intended for the item to be scored as two separate measures, which is similar to simply using the Affect Grid to graphically visualize the combination of two Likert items. However, for our analysis we treated each response to the Affect Grid as a pair, using the coordinates as a combined measure of emotional state. Despite the parsimony that could be achieved by using a response to the Affect Grid as a single item, no validation studies for its use in this scenario have been undertaken, and by analyzing the data as a single item we could have unknowingly introduced bias into the responses.
Figure 1.
The Affect Grid with added row and column numbers
Survey items are often susceptible to extreme responding bias (ER) and central tendency bias (CTB) [5], and both are concerns with this type of two-dimensional Likert item. Other common types of bias, such as acquiescence or social desirability bias, would not be of concern because of the lack of "yes" or "no" questions and the apolitical nature of our topic, respectively. If our approach had introduced ER or CTB we would expect to see the responses on the extremes of the grid or clustered near the middle. The pre responses in Figure 2a show some evidence of ER, with most responses toward the left side of the grid; however, the post responses in Figure 2b appear to show evidence of CTB, with most responses clustered in the middle of the grid. Taken together these patterns provide no evidence of either ER or CTB, since the two biases cannot occur at the same time. The more likely explanation for the apparent pattern of responses is that the SRP had an effect and shifted the participants' more extreme responses to more neutral responses. Overall we cannot rule out the possibility of response bias; however, there is no evidence that bias was introduced by treating the Affect Grid as a single item.
Figure 2.
Heatmaps of participants' responses before and after the SRP
We had a total of 180 observations: 93 completed pre Affect Grids and 87 completed post Affect Grids. In light of the lost pairing we modified the research hypothesis to the broader statement that "the SRP had an effect on participants' emotional state concerning Alzheimer's disease." If we concluded there was an effect, then based on the shifts in the Affect Grid we would claim that the shift appeared to be in a certain direction, while noting that we did not test for a directional shift but merely for any shift. Thus our null hypothesis ($H_0$) was that the SRP had no effect on the participants' emotional state. Under $H_0$ any differences between the two samples (the pre and the post) are assumed to be due to sampling error, and therefore the two samples can be viewed as draws from some true distribution of the participants' emotional state, which we refer to as the null distribution. To analyze the data we reframed $H_0$ as the equivalent statement that the pre and post responses are samples from this null distribution. Under this hypothesis we were able to obtain an estimate of the null distribution.
2. Methods
2.1. The Estimated Null Distribution
The counts of the cells in the Affect Grid can be modeled with a multinomial distribution, where each cell has a certain probability of being chosen and the probabilities sum to 1. To form our null distribution these cell probabilities were estimated by the frequency of the total (pre and post) observed responses in each cell divided by the total number of responses. For example, 17 pre and post participants in total marked the cell (1,5), so the relative frequency for this cell was $17/180 \approx 0.094$. We justified combining the pre and post responses to estimate the distribution for two reasons: 1) under $H_0$ these samples are draws from the same distribution, and 2) the sample size was small relative to the number of cell probabilities we needed to estimate.
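For concreteness, this estimation step can be sketched in R, the language used for all analyses. The sketch assumes hypothetical data frames `pre` and `post` with integer columns `x` and `y` (each 1 through 9) recording the cell each participant marked; these names are ours, not from the original analysis code.

```r
## Combine the pre and post responses, as justified above.
combined <- rbind(pre, post)

## 9 x 9 contingency table of counts; factor() keeps unobserved cells.
counts <- table(factor(combined$x, levels = 1:9),
                factor(combined$y, levels = 1:9))

## Relative frequencies: the observed estimate of the null distribution.
## e.g. cell (1,5), marked 17 times, gets 17/180 = 0.094.
p_hat <- counts / sum(counts)
```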
There were 93 completed pre Affect Grids and 87 completed post Affect Grids, which are graphically displayed as discrete heat maps in Figures 2a and 2b. Note that the pre and post samples each contained 43 empty cells out of the 81 total cells. Combining these samples into the estimated null distribution resulted in 54 cells with at least one observation; however, even after this combination there were still 27 cells with no responses.
Figure 3a displays a heat map of the estimated null distribution with the relative frequencies for each cell. In estimating this distribution we did not want to force the cells with no observations to have an expected value of 0, which would result in a degenerate conditional binomial distribution for those cells and which we did not believe accurately described the true null distribution. Instead we believe these cells are sampling zeros, and following the advice of Agresti [1] we adjusted the observed null distribution by adding small constants. We added 0.0005 to the estimated probability of each of the 27 empty cells and reduced the estimated probability of each of the 54 occupied cells by 0.00025 so that the total probability still summed to 1. We considered more sophisticated approaches to dealing with these empty cells, such as assigning probabilities by weighting the responses near each cell, or by using a function of the row probability multiplied by the column probability. Nevertheless, we felt these approaches would have smoothed the distribution too much, especially given the scarcity of the data. By adding small equal constants to each empty cell we avoided cell probabilities of 0 while staying as close as possible to the observed values.
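A minimal sketch of this adjustment, continuing from the `p_hat` estimate above; the constants are those reported in the text (the 27 empty cells gain 0.0005 each, the 54 occupied cells lose 0.00025 each, so the gain of 27 × 0.0005 exactly offsets the loss of 54 × 0.00025):

```r
empty  <- p_hat == 0          # the 27 cells with no pre or post responses
p_null <- p_hat
p_null[empty]  <- 0.0005
p_null[!empty] <- p_null[!empty] - 0.00025

## Sanity check: the adjusted probabilities still sum to 1.
stopifnot(abs(sum(p_null) - 1) < 1e-12)
```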
Figure 3.
Observed and adjusted estimates of the null distribution
The occupied cells most influenced by having their relative frequency reduced were those with only one response. We can see from Figure 3a that these cells had a relative frequency of approximately 0.0056. After reducing each occupied cell's relative frequency by 0.00025, the relative frequency of the cells with only one observation was approximately 0.0053, a reduction of 4.5 percent. While this percentage is much greater than the 0.25 percent reduction in relative frequency experienced by the most common cell response (1,5), we felt comfortable that overall it was a fairly modest reduction. Given that we added such small constants, the total probability assigned to the formerly empty cells is only 1.35 percent ($27 \times 0.0005$). Figure 3b shows the adjusted null distribution on the same scale as the observed estimate of the null distribution. A sensitivity analysis on the effect of different constants is included in the Results section.
2.2. Bootstrap Testing Procedure
In order to analyze our observed data we needed to measure the central tendency of the pre and post samples. To do this we used the center of mass (COM). In a physical system the COM is the point at which the system balances or rests. If we had a physical 9×9 grid with weights corresponding to the participants' responses, the COM would be the point at which we could put a fulcrum and balance the grid. If the responses on the grid shifted, so too would the COM. For a two-dimensional discrete system of points the center of mass is $(\bar{x}, \bar{y})$, where

$$\bar{x} = \frac{\sum_{i=1}^{n} m_i x_i}{\sum_{i=1}^{n} m_i}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} m_i y_i}{\sum_{i=1}^{n} m_i} \tag{1}$$

and $m_i$ represents the mass at each point [12]. We summed over the n = 93 participants for the pre center of mass and the n = 87 participants for the post center of mass to get the corresponding centers of mass $(\bar{x}_{\text{pre}}, \bar{y}_{\text{pre}})$ and $(\bar{x}_{\text{post}}, \bar{y}_{\text{post}})$ for the observed pre and post responses.
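Since each participant marks exactly one cell, every response carries mass $m_i = 1$ and Eq. (1) reduces to the coordinate-wise mean. A one-line R helper, using the same hypothetical `x`/`y` columns as in the earlier sketches:

```r
## Center of mass of a sample of grid responses (all masses equal 1).
com <- function(resp) c(x = mean(resp$x), y = mean(resp$y))
```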
To test our null hypothesis that the performance had no effect on participants' responses, we used the Euclidean distance [2] between the COMs of the pre and post samples, which we refer to as COMD. Under our null hypothesis these samples come from the same distribution, and we would therefore expect COMD to be small, attributing the difference in the central tendencies of the two samples to random error. To obtain the distribution of distances under the null we used a bootstrap approach [3] on our adjusted null distribution. We drew 10,000 samples of size 93 and 10,000 samples of size 87 from the modified null distribution to function as bootstrapped pre and post samples. We then calculated the distance between the COMs of each of the 10,000 simulated pre and post sample pairs and compared our observed COMD to the bootstrapped distribution of COMD generated under $H_0$. Our observed pre COM was (2.796, 4.774) and the COM of our post sample was (3.920, 5.333), giving an observed distance of 1.255.
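The full procedure can be sketched as follows, reusing `p_null` and `com()` from the sketches above; the cell enumeration and helper names are ours.

```r
cells <- expand.grid(x = 1:9, y = 1:9)   # all 81 cells, x varying fastest
probs <- as.vector(p_null)               # matching vector of probabilities

## Draw n responses from the adjusted null distribution.
draw_sample <- function(n, probs)
  cells[sample(nrow(cells), n, replace = TRUE, prob = probs), ]

## Euclidean distance between two samples' centers of mass (COMD).
comd <- function(a, b) sqrt(sum((com(a) - com(b))^2))

set.seed(1)
boot_d <- replicate(10000, comd(draw_sample(93, probs),
                                draw_sample(87, probs)))

obs_d  <- comd(pre, post)        # 1.255 for the observed data
p_boot <- mean(boot_d >= obs_d)  # bootstrap p-value
```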
We calculated a bootstrap p-value as the number of bootstrapped distances that were as or more extreme than our observed distance divided by the total number of bootstraps. Due to the inherent sampling variation of a bootstrap p-value, we followed the advice of Li, Tai, and Nott [9] and calculated a confidence interval for the bootstrap p-value. We used the confidence interval originally suggested by Wilson [17] with a continuity correction [15]; the confidence interval (L, U) is given as
$$L = \max\left(0,\; \frac{2n\hat{p} + z^2 - 1 - z\sqrt{z^2 - 2 - \frac{1}{n} + 4\hat{p}\left(n(1-\hat{p}) + 1\right)}}{2(n + z^2)}\right) \tag{2}$$

$$U = \min\left(1,\; \frac{2n\hat{p} + z^2 + 1 + z\sqrt{z^2 + 2 - \frac{1}{n} + 4\hat{p}\left(n(1-\hat{p}) - 1\right)}}{2(n + z^2)}\right) \tag{3}$$
where $\hat{p}$ represents the estimated p-value, $z$ is the appropriate standard normal quantile, and $n$ is the total number of bootstrap samples. In order to get a good estimate of our p-value we used 10,000 bootstrap samples, which, under the assumption that the p-value would be near 0.05, results in an estimated length of the 95% confidence interval of less than 0.01.
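Equations (2) and (3) can be computed directly; the function below is our sketch of the continuity-corrected Wilson interval, and on the values reported in the Results section ($\hat{p} = 0.0002$, $n = 10{,}000$) it reproduces the reported interval.

```r
## Continuity-corrected Wilson interval for a proportion (Eqs. 2-3).
wilson_cc <- function(p_hat, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  L <- (2*n*p_hat + z^2 - 1 -
        z*sqrt(z^2 - 2 - 1/n + 4*p_hat*(n*(1 - p_hat) + 1))) / (2*(n + z^2))
  U <- (2*n*p_hat + z^2 + 1 +
        z*sqrt(z^2 + 2 - 1/n + 4*p_hat*(n*(1 - p_hat) - 1))) / (2*(n + z^2))
  c(lower = max(0, L), upper = min(1, U))
}

wilson_cc(0.0002, 10000)   # approximately (0.00003, 0.0008)
```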
3. Results
Our bootstrapped distances ranged from 0.0025 to 1.339; however, only 2 of the 10,000 distances were as or more extreme (i.e., larger) than our observed COMD value of 1.255, resulting in a bootstrap p-value of 0.0002. Our bootstrap confidence interval for the p-value was (0.00003, 0.0008).
Figure 4 displays a histogram of the simulated distances with a dotted line marking our observed distance. While the theoretical distribution of the test statistic COMD is unknown, the observed value of 1.255 clearly lies in the extreme tail of the empirically derived distribution. Under the null hypothesis we would not expect to draw two such different samples from the same distribution.
Figure 4.
Simulated Distances (with observed distance represented by dotted line).
Thus we rejected our null hypothesis that the samples came from the same distribution and therefore rejected the hypothesis that the SRP had no effect. The original research question was whether the performance had a positive effect. While we did not test for a shift in a specific direction, given that we found evidence a shift had occurred, and by examining the observed responses in Figure 2, we concluded that the shift appeared to be in the positive direction (i.e., toward the right side of the grid). We take this as evidence that the SRP had an effect, specifically a positive shift in participants' perceptions about Alzheimer's disease.
3.1. Sensitivity Analysis
To evaluate whether the value we chose to add to the empty cells, or possible dependencies between the pre and post responses, could have affected our results, we conducted a sensitivity analysis for each potential problem.
For the analysis of the effect of the added constant, we allowed the value to vary among six levels, the smallest being 0.0001 and the largest 0.011. As the weight increases, so does the amount of smoothing applied to the observed data. We expected smaller values to have little effect on our results, since they stay closest to the actual observed data, which is why only two of the new values are below our original value of 0.0005. The maximum weight was restricted to 0.011 because we did not want to smooth the data too much, and we restricted ourselves to removing an equal probability from each of the cells that already contained an observation. The cells with only one observation had an estimated probability of 0.0056, so we could take at most 0.0055 away from them (allowing for a very small remaining probability), which corresponds to adding a value of 0.011 to the empty cells. For each weight we drew 10,000 samples of size 93 for the pre sample and size 87 for the post sample, calculated their centers of mass, and recorded how many of the resulting distances were as or more extreme than our observed distance.
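A sketch of this sensitivity loop, reusing the helpers from Section 2; the reduction applied to occupied cells is scaled so the probabilities always sum to one (with 27 empty and 54 occupied cells this is half the added weight, matching the 0.011/0.0055 pairing above):

```r
weights <- c(0.0001, 0.00025, 0.00075, 0.001, 0.0025, 0.011)

sapply(weights, function(w) {
  p_w <- p_hat
  p_w[empty]  <- w
  p_w[!empty] <- p_w[!empty] - w * sum(empty) / sum(!empty)
  d <- replicate(10000, comd(draw_sample(93, as.vector(p_w)),
                             draw_sample(87, as.vector(p_w))))
  mean(d >= obs_d)   # empirical p-value for this weight
})
```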
Table 1 shows that as the weight increased so did the empirical p-value; nevertheless, for all of the weights chosen the p-value remained far below the significance level of 0.05. Of course, by making the weight arbitrarily large the p-value could potentially exceed our significance threshold, but the weight would need to be larger than 0.011. Achieving that would require reducing the estimated probability of the occupied cells in an unequal manner (taking more away from cells with more observations, or perhaps based upon regions of observations). There is no evidence to suggest that such an invasive redistribution of the weight of observed values is warranted, so we concluded that our analysis was not sensitive to the choice of weight.
Table 1.
Sensitivity Analysis
| Weight | Empirical P-value |
|---|---|
| 0.00010 | 0.0001 |
| 0.00025 | 0.0001 |
| 0.00075 | 0.0005 |
| 0.00100 | 0.0004 |
| 0.00250 | 0.0006 |
| 0.01100 | 0.0024 |
Our original weight of 0.0005 resulted in an empirical p-value of 0.0002.
We also examined whether different levels of correlation among the responses could have affected the results. We generated 100,000 pre and post draws from the null distribution and recorded the Pearson correlation between the pre and post x-coordinates, the correlation between the pre and post y-coordinates, and the distance between the pre and post centers of mass. We then fit several linear models with COMD as the response and with main effects and interactions of the correlation in the x-axis and the correlation in the y-axis. We fit this model with the full 100,000 draws and also with the subset of draws that resulted in large (i.e., greater than 1) values of COMD. In both models neither the main effects nor the interaction term were statistically significant. Thus we do not believe that possible correlation between the pre and post responses would have greatly impacted our results.
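Because a Pearson correlation requires equal-length vectors while the original samples differ in size (93 vs. 87), the sketch below assumes equal draws of size 87; this detail is not specified in the text and is our assumption.

```r
## Correlation sensitivity sketch (assumes equal sample sizes of 87 so
## that the pre/post Pearson correlations are defined).
sim <- as.data.frame(t(replicate(100000, {
  a <- draw_sample(87, probs)
  b <- draw_sample(87, probs)
  c(cor_x = cor(a$x, b$x), cor_y = cor(a$y, b$y), d = comd(a, b))
})))

summary(lm(d ~ cor_x * cor_y, data = sim))                 # all draws
summary(lm(d ~ cor_x * cor_y, data = subset(sim, d > 1)))  # large COMD only
```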
4. Discussion
4.1. Limitations
While we feel confident in the result of this study, there are limitations to the method we used to analyze the data. One concern is that the lack of validation of the Affect Grid as a single item introduced response bias that we are unaware of. However, we did not see any evidence of the most common types of bias. We reiterate that we only tested for a shift in perceptions; we found evidence that this shift occurred, and based on the apparent positive direction of the shift we believe the SRP had a positive impact. While it is highly unlikely that the Affect Grid would be biased in such a way that the apparent shift toward the positive side of the grid was actually indicative of a different type of response, this possibility must be entertained and is an unavoidable symptom of the lost pairing.
Another issue with our approach is that it has poor power to detect certain types of shifts. As an example, Figure 5a displays a heat map of possible pre responses indicating that the population had extreme feelings about Alzheimer's disease, while Figure 5b displays possible post responses indicating very neutral feelings. While these two populations clearly differ in their observed emotional state, they have the same center of mass, so our test statistic would detect no difference between them (a small numerical illustration follows Figure 5). In general, any systematic shift or rotation around the pre center of mass would result in an identical or very similar post center of mass. In theory there could be many shifts within the population that do not alter the center of mass, none of which would be detected by our test statistic. In particular, the change could merely be one of dispersion or variance between the two samples: if both the dispersion and the COM changed we would detect the difference, but if the dispersion changed from pre to post while the COM remained unchanged our approach would not detect it, because we were concerned with a change in the central tendency of the samples. If this method were used and the null hypothesis were not rejected, the method provides no way to know definitively whether there was no shift or an undetectable shift. In our study we rejected the null, so this limitation was not an issue for this particular dataset.
Figure 5.
Theoretical shift which would result in COMD = 0 despite obvious differences in the samples
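A toy illustration of this limitation (constructed data, not the actual Figure 5 distributions): two clearly different response patterns share the same center of mass, so the COMD between them is exactly zero.

```r
## Extreme corner responses vs. all-neutral responses: identical COM.
extreme <- data.frame(x = c(1, 1, 9, 9), y = c(1, 9, 1, 9))
neutral <- data.frame(x = rep(5, 4),     y = rep(5, 4))
com(extreme)            # (5, 5)
com(neutral)            # (5, 5)
comd(extreme, neutral)  # 0 -- the test statistic cannot separate them
```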
4.2. Validation and Generalizations
In order to check the validity of our method we conducted a simulation study to compare its power and Type I error rate to those of Hotelling's $T^2$ [6]. We chose Hotelling's $T^2$, a generalization of the t-statistic for multivariate testing, because we were testing a change of central tendency across two dimensions. There is no standard test designed for the situation we found ourselves in; however, Hotelling's $T^2$ is a logical choice in the same way that the one-dimensional t-test is often used to analyze Likert data. While Hotelling's $T^2$ assumes normality and data from the Affect Grid are clearly discrete, the test is fairly robust to departures from normality [11]. In addition, we used sample sizes of 100 per sample so that, by the multidimensional central limit theorem, the sample mean vectors of the coordinate pairs should be approximately bivariate normal.
To compare COMD with Hotelling's $T^2$ we created nine distributions that we felt were representative of what might be encountered when using the Affect Grid. Figure 6 shows heatmaps of the distributions. These were created by thinking of the data as a combination of two 9-point Likert items corresponding to the x and y axes. Likert data often exhibit either extreme responses or central tendency bias, which provides three options for each axis: values can be positively skewed, negatively skewed, or centered in the middle of the nine points. For instance, in Figure 6 the distribution (Positive, Centered) is a combination of right-skewed responses on the x-axis and centered responses on the y-axis. The probability vectors for the positively skewed (RT), centered (CT), and negatively skewed (LT) cases are found in Table 2, and the cell probabilities for a given distribution are obtained by multiplying the appropriate pair of vectors together as an outer product: for instance, the probability of cell (i, j) under the (Positive, Centered) distribution is $RT_i \times CT_j$. The cell probabilities for the other eight distributions in Figure 6 can be calculated similarly; a code sketch follows Table 2.
Figure 6.
Heatmaps showing the differences in cell probabilities for the nine distributions
Table 2.
Probability Vectors used to Generate Distributions
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| RT | .220 | .200 | .150 | .135 | .085 | .060 | .055 | .050 | .045 |
| CT | .055 | .075 | .100 | .160 | .220 | .160 | .100 | .075 | .055 |
| LT | .045 | .050 | .055 | .060 | .085 | .135 | .150 | .200 | .220 |
These values represent the probability of a given number 1,…,9 being chosen. Using these vectors we can get the cell probabilities for the distributions.
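The construction of the nine distributions is a pair of outer products; a short sketch using the Table 2 vectors:

```r
RT <- c(.22, .20, .15, .135, .085, .06, .055, .05, .045)  # positive skew
CT <- c(.055, .075, .10, .16, .22, .16, .10, .075, .055)  # centered
LT <- rev(RT)                                             # negative skew

## (Positive, Centered): right-skewed x combined with centered y.
p_pos_cen <- outer(RT, CT)   # 9 x 9 matrix, [i, j] = RT[i] * CT[j]
stopifnot(abs(sum(p_pos_cen) - 1) < 1e-12)
```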
We wanted to compare how COMD performed relative to Hotelling's $T^2$ when drawing equal sample sizes from two distributions. The nine distributions shown in Figure 6 give 36 combinations of two different distributions and 9 scenarios where both samples come from the same distribution. For the 36 combinations of different distributions we estimated the power of COMD and Hotelling's $T^2$ by drawing 1,000 pairs of random samples and calculating the p-value for Hotelling's $T^2$ and the bootstrapped p-value for COMD. Our estimate of power was the percentage of those simulations that resulted in a p-value ≤ 0.05; similarly, the estimate of Type I error was the percentage of significant results when both samples came from the same distribution. For the COMD method, each of the 1,000 samples used a bootstrap of size 1,000.
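The comparator statistic is standard; since the text does not say which implementation was used, the sketch below computes the two-sample Hotelling's $T^2$ and its usual F-based p-value directly, with no external package.

```r
## Two-sample Hotelling's T^2 p-value for n1 x 2 and n2 x 2 matrices of
## (x, y) coordinates, using the F approximation.
hotelling_p <- function(a, b) {
  n1 <- nrow(a); n2 <- nrow(b); p <- ncol(a)
  d  <- colMeans(a) - colMeans(b)
  S  <- ((n1 - 1) * cov(a) + (n2 - 1) * cov(b)) / (n1 + n2 - 2)  # pooled
  T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(S) %*% d)
  Fs <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
  pf(Fs, p, n1 + n2 - p - 1, lower.tail = FALSE)
}

## e.g. one simulated comparison at n = 100 per sample:
a <- as.matrix(draw_sample(100, as.vector(p_pos_cen)))
b <- as.matrix(draw_sample(100, as.vector(outer(RT, RT))))
hotelling_p(a, b)
```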
Table 3 provides estimates of the power in detecting differences between the pairs of distributions. The estimated power of COMD is comparable to that of Hotelling's $T^2$ in every instance we examined; both methods achieved greater than 90% power for comparing these specific distributions. A more comprehensive simulation study that drew samples from more similar distributions or examined different sample sizes would likely have yielded lower power estimates for both COMD and Hotelling's $T^2$. Nevertheless, both methods performed very well under these conditions.
Table 3.
Estimated Power for COMD and Hotellings
| | PC | PN | CP | CC | CN | NP | NC | NN |
|---|---|---|---|---|---|---|---|---|
| PP | .914 (.915) | 1 (1) | .918 (.929) | .999 (.999) | 1 (1) | 1 (1) | 1 (1) | .909 (.910) |
| PC | — | .908 (.913) | 1 (1) | .943 (.924) | .996 (.996) | 1 (1) | 1 (1) | .917 (.918) |
| PN | | — | 1 (1) | .900 (.912) | .907 (.911) | 1 (1) | 1 (1) | 1 (1) |
| CP | | | — | .927 (.915) | 1 (1) | .920 (.926) | .998 (.998) | 1 (1) |
| CC | | | | — | .934 (.921) | .999 (.999) | .931 (.923) | .999 (.999) |
| CN | | | | | — | 1 (1) | .999 (.999) | .907 (.911) |
| NP | | | | | | — | .913 (.921) | 1 (1) |
| NC | | | | | | | — | .915 (.920) |
Power estimates for the combination of the distributions in the relevant row and column. The values are given as COMD (Hotelling's $T^2$).
Similarly, Table 4 provides estimates of the Type I error rates when the two samples came from the same distribution. Again COMD and Hotelling's $T^2$ provided very similar results, with the Type I error rate controlled around the nominal 0.05 level for both methods. A more extensive simulation study comparing these two methods is needed to determine which performs better in specific scenarios, and more work on the operating characteristics of COMD would be needed before we would be comfortable using this approach outside of this specific instance. Nevertheless, in this set of simulations COMD has shown itself to be at least as good as Hotelling's $T^2$ while making fewer assumptions about the data. We believe it shows promise as a way to analyze discrete grid data, specifically data that can be thought of as combinations of Likert items.
Table 4.
Estimated Type I error for COMD and Hotelling's $T^2$
| | PP | PC | PN | CP | CC | CN | NP | NC | NN |
|---|---|---|---|---|---|---|---|---|---|
| Hotelling's $T^2$ | 0.062 | 0.039 | 0.044 | 0.050 | 0.036 | 0.045 | 0.044 | 0.055 | 0.052 |
| COMD | 0.058 | 0.045 | 0.045 | 0.053 | 0.039 | 0.054 | 0.047 | 0.060 | 0.054 |
Estimated Type I error rates when drawing samples from the same distribution
5. Conclusions
The loss of pairing and the type of data encountered in this problem presented a unique challenge for analysis. Using our methodology we were able to address the research hypothesis and provide evidence in its favor. The extreme nature of the observed COMD and the apparent direction of the shift toward the right side of the grid lead us to believe that Self-Revelatory Performance had an effect on participants' responses, specifically a positive effect. More work would need to be done to validate SRP as an effective tool for educating seniors about Alzheimer's disease, but both the therapy and the methodology developed here show promise for application outside of this study.
Acknowledgments
A previous version of this work was presented under the same name at the 2017 Joint Statistical Meetings. This work was supported in part by a grant from the U.S. National Institute on Aging (P30 AG035982).
References
- [1] Agresti A, Categorical Data Analysis, 3rd ed., Wiley, New Jersey, 2013.
- [2] Deza MM and Deza E, Encyclopedia of Distances, Springer, Berlin, 2009.
- [3] Efron B, Bootstrap methods: Another look at the jackknife, The Annals of Statistics, 7 (1979), pp. 1–26.
- [4] Emunah R, Self-revelatory performance: A form of drama therapy and theatre, Drama Therapy Review, 1 (2015), pp. 71–85.
- [5] Furnham A, Response bias, social desirability and dissimulation, Personality and Individual Differences, 7 (1986), pp. 385–400.
- [6] Hotelling H, The generalization of Student's ratio, Annals of Mathematical Statistics, 2 (1931), pp. 360–378.
- [7] Kensinger EA, Remembering emotional experiences: The contribution of valence and arousal, Reviews in the Neurosciences, 15 (2004), pp. 241–252.
- [8] Killgore WDS, The Affect Grid: A moderately valid, nonspecific measure of pleasure and arousal, Psychological Reports, 83 (1998), pp. 639–642.
- [9] Li J, Tai BC, and Nott DJ, Confidence interval for the bootstrap P-value and sample size calculation of the bootstrap test, Journal of Nonparametric Statistics, 21 (2009), pp. 649–661.
- [10] Likert R, A technique for the measurement of attitudes, Archives of Psychology, 140 (1932), pp. 5–55.
- [11] Mardia KV, Assessment of multinormality and the robustness of Hotelling's $T^2$ test, Journal of the Royal Statistical Society, Series C (Applied Statistics), 24 (1975), pp. 163–171.
- [12] Protter MH and Protter CB, Calculus with Analytic Geometry, 4th ed., Jones and Bartlett, 2009.
- [13] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2013.
- [14] Russell JA, Weiss A, and Mendelsohn GA, Affect Grid: A single-item scale of pleasure and arousal, Journal of Personality and Social Psychology, 57 (1989), pp. 493–502.
- [15] Newcombe RG, Two-sided confidence intervals for the single proportion: comparison of seven methods, Statistics in Medicine, 17 (1998), pp. 857–872.
- [16] Russell YI and Gobet F, Sinuosity and the affect grid: A method for adjusting repeated mood scores, Perceptual and Motor Skills, 114 (2012), pp. 125–136.
- [17] Wilson EB, Probable inference, the law of succession, and statistical inference, Journal of the American Statistical Association, 22 (1927), pp. 209–212.