Aesthetic Surgery Journal. 2024 Jan 5;44(7):733–743. doi: 10.1093/asj/sjad374

“I Want It to Look Natural”: Development and Validation of the FACE-Q Aesthetics Natural Module

Anne F Klassen 1, Stefan Cano 2, Jasmine Mansouri 3, Lotte Poulsen 4, Charlene Rae 1, Manraj Kaur 5, Steven Dayan 6, Elena Tsangaris 5, Kathleen Armstrong 7, Jennifer Klok 8, Katherine Santosa 9, Andrea Pusic 5
PMCID: PMC11177552  PMID: 38180487

Abstract

Background

The concept of “natural” after a facial aesthetic treatment represents an understudied area. We added scales to FACE-Q Aesthetics to provide a means to measure this concept from the patient's perspective.

Objectives

The objective of this study was to develop and validate the FACE-Q Aesthetic Natural module.

Methods

Concept elicitation interviews with people having minimally invasive treatments were conducted to explore the natural concept and develop scales. Patient and expert input refined scale content. An online sample (ie, Prolific) of people who had a facial aesthetic treatment was analyzed with Rasch measurement theory to examine psychometric properties. A test-retest reliability study was performed, and construct validity was examined.

Results

Interviews with 26 people were conducted. Three scales were developed and refined with input from 12 experts, 11 patients, and 184 online survey participants. Data from 1358 online participants provided evidence of scale reliability and validity. Reliability was high, with person separation index, Cronbach alpha, and intraclass correlation coefficient values (without extremes) all ≥0.82. Tests of construct validity confirmed that the scales functioned as hypothesized. Higher scores on the Expectations scale were associated with how important it was to have a natural look and movement after treatment. In addition, higher scores on the Natural Appearance and Natural Outcome scales correlated with better scores on other FACE-Q Aesthetics scales and were associated with the face looking and feeling natural and with overall satisfaction with facial appearance.

Conclusions

Many people seeking facial aesthetic treatments want to look natural after treatment. These new FACE-Q Aesthetics scales provide a means to measure the concept of natural from the patient's perspective.

Level of Evidence: 3




The American Academy of Facial Plastic and Reconstructive Surgery member survey for 2021 reported that one of patients’ top concerns was having an unnatural result following treatment.1 Because opinions about how natural someone looks are highly subjective, measuring this concept after facial aesthetic treatments should involve patients. To measure the patient’s perspective, questionnaires (ie, patient-reported outcome measures, or PROMs) are needed.2 When rigorously designed, such tools can provide a specific and powerful means of incorporating the patient’s perspective into outcome assessments.3,4

Most people who have facial aesthetic treatments report improvements in appearance and health-related quality of life, but treatments carry risks and have associated complications that can result in poor outcomes. Our research team developed FACE-Q Aesthetics to provide a means to measure outcomes that matter to people having any type of surgical or nonsurgical facial aesthetic treatment from their perspective.5-11 This PROM comprises 40 independently functioning scales and checklists that measure a range of outcomes important to patients. An advantage of its modular design is that new scales can be added when a gap in measurement is identified. One such gap was the need for scales measuring the concept of “natural.” While not all people having facial aesthetic treatments seek a natural outcome, most do. Providing a means to measure how natural the result of treatment appears from the patient's perspective could help to improve the quality and outcomes of the care provided.

The specific aims of our study were to elicit concepts important to measuring how natural the outcome of facial treatment was from the patient's perspective, and to develop, refine, and test new FACE-Q scales to measure this important concept.

METHODS

Research Ethics

This study was coordinated at McMaster University (Hamilton, ON, Canada). Ethics board approval was obtained from the Hamilton Integrated Research Ethics Board (Canada; #13603).

Approach

This study was conducted between October 2021 and March 2023 as part of a larger mixed-methods study to develop and validate outcome scales for minimally invasive skin and facial aesthetic treatments.12,13 We followed international guidelines for PROM development.2,14-17 The qualitative component was guided by an interpretive description approach.18 Participants for concept elicitation interviews were recruited between October 2021 and March 2022 from 3 private practice clinics in Canada and 3 in the USA. Staff in the clinics were asked to recruit adults who varied by age, gender, race, and minimally invasive treatment. A qualitative researcher contacted recruited patients to describe the study in more detail and obtain informed consent. Experienced interviewers conducted interviews by phone or on a secure web conferencing platform (ie, Zoom; Zoom Video Communications, San Jose, CA). Participants were asked if the concept “natural” was important to them, how they would define natural, and how they would describe a natural look for the treatment they had or planned to have. Appendix A shows the complete interview guide, available at www.aestheticsurgeryjournal.com.

All interviews were audio recorded, transcribed, and coded by 2 coders who met to achieve consensus on discrepant codes. All concepts were labeled with a domain and major and/or minor theme. Codes were transferred to MS Excel (Microsoft, Redmond, WA) with DocTools (DocTools ApS, Frederikssund, Denmark) and refined through constant comparison.19 Interviews continued until saturation of most concepts was reached.20 Participants received a gift card of $100 USD to thank them for their time.

Concept Elicitation and Scale Refinement

Concepts related to natural were used to form the Natural scales described here. These scales were refined through several steps, as shown in Figure 1. In October 2022, to establish content validity, participants from the qualitative interviews were invited through REDCap (Vanderbilt University, Nashville, TN) to review each item and select the most appropriate answer: (1) I do not understand the question; (2) I understand the question, but it could be worded better; (3) I understand the question but it is not relevant to me; or (4) I understand the question and it is relevant to me.21 A comment box was provided for missing concepts. Participants were provided a gift card of $30 USD to thank them for their time. In this step, comments were reviewed, and problematic items were dropped or reworded.

Figure 1. Methods flow diagram. CTT, classical test theory; DIF, differential item functioning; ICC, intraclass correlation coefficient; PSI, person separation index.

Next, an experienced interviewer conducted cognitive debriefing interviews over Zoom. In the interviews we obtained feedback on instructions, items, and response options and identified missing items. The interviews were audio recorded, transcribed, and analyzed. We provided a $70 USD gift card to thank participants for their time. In addition, experts in aesthetics and representatives from the aesthetics industry were emailed the scales and invited to identify any items they considered not relevant to patients and to suggest missing concepts.

Content validity was further examined using an online platform (Prolific, Oxford, UK). A screening survey was conducted in December 2022 for Canada and the USA, and in August 2023 for the UK. When we conducted the survey, the number of Prolific participants fluent in English was 121,170 for the Canada and USA sample and 37,458 for the UK sample. The rate of pay was the equivalent of 10.80 GBP. Screening was used to identify a sample of people who in the past 12 months had been to a plastic surgery or dermatology clinic and had 1 or more of the facial aesthetic treatments listed in Appendix B, available online at www.aestheticsurgeryjournal.com. We excluded people who did not specify a treatment (ie, chose “none” or “other” for the treatment type). The invited sample was asked to select 1 answer for each item: (1) I do not understand the question; (2) I understand the question, but it is not relevant to me; or (3) I understand the question and it is relevant to me. A comment box was provided for missing concepts.

Scale Testing

In February 2023, a pilot field test was conducted using the Prolific sample described above. Rasch measurement theory (RMT) analysis was performed with RUMM2030 software (RUMM Laboratory Pty Ltd., Perth, Australia), utilizing the unrestricted Rasch model for polytomous ordered responses. Items with extreme misfit to the Rasch model were removed.22-24
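The study's Rasch analyses were run in RUMM2030. Purely as an illustration of the model being fitted, the following is a minimal sketch in Python (with numpy) of how response-category probabilities are computed under the unrestricted (partial credit) Rasch model for polytomous ordered responses; the person location, thresholds, and scoring shown are hypothetical values, not estimates from this study.

import numpy as np

def pcm_category_probs(theta, thresholds):
    # Category probabilities for one item under the unrestricted (partial credit)
    # polytomous Rasch model. `theta` is the person location in logits and
    # `thresholds` holds the item's threshold parameters delta_1..delta_m,
    # so the item has m + 1 ordered response categories (0..m).
    exponents = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds, dtype=float))))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Hypothetical example: a person located at 1.0 logits answering a 4-category item
# whose thresholds are ordered, as required for interpretable response options.
probs = pcm_category_probs(theta=1.0, thresholds=[-1.5, 0.0, 1.2])
expected_score = float(np.dot(np.arange(probs.size), probs))
print(probs.round(3), round(expected_score, 2))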

A new sample was then identified in Prolific for the field test using the same inclusion criteria (see Appendix B). The field test took place in March 2023. The Natural scales were included alongside other scales developed in the mixed-methods study.12,13 Given its length, the survey was divided into 2 parts. At the end of Part 1, people were asked if they would like to continue. Those who agreed and whose facial aesthetic treatment had not worn off were sent Part 2 several days after completing Part 1. Table 1 describes the psychometric tests we performed.

Table 1.

Psychometric Tests Performed

Test Description Summary of findings
Thresholds for item response options Examines whether the item response options are ordered on a continuum (eg, a score of 1, on a continuum, should be lower than scores of 2 and higher). This approach creates a hierarchy of items to determine how items are ordered from easiest to hardest to endorse. Item thresholds are shown in Appendix F. The threshold map should display a probabilistic Guttman pattern reflecting a clinically sensible hierarchy. The items located at the lower end of the scale should be the hardest to endorse positively, and vice versa. This basic pattern was observed for all 3 scales and demonstrates that item hierarchies existed within the scales.
Item fit Examines the extent to which observed data align with expected values based on the Rasch model. Item fit is assessed with fit residuals and chi-square statistics. Fit residuals summarize the observed and expected responses to an item by the sample and should ideally lie within the range −2.5 to +2.5. Chi-square values summarize the difference between observed and expected responses to an item for participants in different class interval groups and should be nonsignificant after Bonferroni adjustment. Item characteristic curves can be viewed graphically.22,23 For the item fit analysis, the sample size was amended to 500 to adjust the P values given the large sample.23 Item fit was assessed for each scale and showed that the majority of items had nonsignificant P values (>.05). See Appendix G. These findings provide support that the data fit the Rasch model.
Local dependency Determines the extent of local dependency among items. Residual correlations were examined to identify any greater than 0.30 above the average correlations. Items deemed locally dependent were included in subtests to determine their impact on scale reliability.25 All but a few items were found to function independently of each other, providing support that the reliability statistics were not artificially inflated.23
Scale-to-sample targeting Inspects the spread of person locations (ie, natural outcome in the sample) and item locations (ie, range of concept of natural measured by the set of items). A scale that is better targeted has more coverage and has the mean person location close to the center of the items.26 Appendix G shows a visual representation of targeting (see images in right-hand column). For all 3 scales, the distribution of the scores for persons in the sample (pink bars) covered that of the items in the scale (blue bars). The means of the person locations were also relatively close to the center for each scale. These findings provide support that the scales had acceptable targeting.
Differential item functioning (DIF) The extent to which items were invariant across age (ie, 20-29, 30-39, 40-49, ≥50), gender (females vs males), and country (US, Canada, UK). The sample was amended to 500 to adjust the P values for the large sample. A significant difference between subgroups identifies potential DIF. For this analysis, we chose random samples of equal size in the subgroups. When potential DIF was identified, variables were split for the relevant items, and the original and split person locations were correlated to examine the impact of DIF on scale scoring.27 The analysis was repeated 3 times to determine if the results were stable. DIF was found for age group in 2 items and for gender in 6 items. When further examined, the DIF did not impact the scale scoring. These findings provide support that the scales worked the same in different subgroups (eg, men and women).23
Reliability Examines the accuracy of scores for a scale. Reliability statistics range from 0 to 1 with higher scores indicating greater reliability; scores should be above 0.70.28,29 We explore 3 types of reliability:
  1. Person separation index (PSI)—in RMT, this test determines the extent to which people in the sample are separated by the scale items.30

  2. Cronbach alpha—in RUMM2030, we computed this statistic to measure internal consistency. COSMIN guidelines require Cronbach alpha to be ≥0.70 for a scale to demonstrate internal consistency (a computational sketch of this statistic is shown after the table).

  3. Test-retest reliability (TRT)—a subset of participants completed the survey twice. For each scale, we excluded anyone who reported an important change in the natural concepts or who completed the TRT outside of 7-14 days. We identified extremes using box plots and computed intraclass correlation coefficients (ICC) with a 2-way random effects model, with and without extremes included.28,29

PSI—All 3 scales met the criterion both with and without extremes (≥0.81; Table 4). These findings show that the items in the scales could discriminate between participants.23
Cronbach alpha—All 3 scales met the criterion both with and without extremes (≥0.87; see Table 4). These findings further demonstrate the reliability of each scale.
TRT—ICC values ranged from 0.62 to 0.88 with extremes, with only the Expectations scale not meeting the ≥0.70 criterion. When extreme values were removed, ICC values ranged from 0.88 to 0.97, with all scales meeting the criterion. These findings provide support that the participants who were stable gave reproducible scores (Table 4).
Construct validity Examines the extent to which the scale accurately measures what it purports to measure. We utilized the transformed scores for these analyses. Table 5 lists the predefined hypotheses for each scale. Criteria for evaluating hypotheses were in line with COSMIN criteria.29 For correlation hypotheses, coefficients were interpreted as follows: >0.50 for similar constructs and 0.3-0.5 for related but dissimilar constructs.29 To assess overall construct validity, we applied the COSMIN threshold for sufficient construct validity of 75% or more of the hypotheses accepted.29 For each scale, more than 75% of the hypotheses tested were accepted. These results provide evidence that the scales were able to detect differences between groups as expected. The scales also correlated as expected with validated scales measuring similar constructs. These correlations provide evidence that each scale worked to accurately measure the construct of interest. (Table 5, Figures 2, 3, Supplement 8)

COSMIN, Consensus-based standards for the selection of health measurement instruments; RMT, Rasch measurement theory.
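As a concrete illustration of the internal-consistency statistic summarized in Table 1, below is a minimal sketch, in Python with numpy, of how Cronbach alpha can be computed from a persons-by-items matrix of complete responses and compared against the ≥0.70 COSMIN criterion. The response matrix is a hypothetical example, not study data.

import numpy as np

def cronbach_alpha(scores):
    # Cronbach alpha for a persons x items matrix of complete item responses.
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    sum_item_vars = scores.var(axis=0, ddof=1).sum()
    total_score_var = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - sum_item_vars / total_score_var)

# Hypothetical 5-person x 4-item example on a 0-3 response scale.
responses = [[3, 2, 3, 3],
             [1, 1, 2, 1],
             [2, 2, 2, 3],
             [0, 1, 1, 0],
             [3, 3, 2, 3]]
alpha = cronbach_alpha(responses)
print(round(alpha, 2), "meets >=0.70 criterion:", alpha >= 0.70)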

RESULTS

Concept Elicitation

Study sample characteristics are shown in Tables 2 and 3. Coding and analysis identified that the concept of natural was highly important. Most participants wanted a natural look, but a few preferred the “frozen” look. A natural look was described in positive terms (eg, more refreshed, rejuvenated, enhanced, well rested, younger, some movement) and an unnatural look was described in negative terms (eg, overtreated, overdone, overfilled, fake, artificial, plastic). Participants saw a natural look as age-appropriate and not so noticeable that someone might comment, that is, a subtle change. Looking like oneself after treatment was important: “Well, it's something when I look in the mirror, it's still me. It's not a different person.” A natural look was also described as one in which the parts of the face were proportionate: “So for me, a natural look would be my lips are proportionate with one another. They aren’t too big for my face. They are the perfect size. They aren’t too small. And everything is coordinated together.”

Table 2.

Characteristics for the 26 Qualitative Interview and Prolific Survey Participants

Characteristic | Qualitative sample (n = 26): n | Cognitive sample (n = 184): n, % | Field test sample (n = 1358): n, %
Country Canada 6 21 11.4 107 7.9
USA 20 40 21.7 721 53.1
UK 0 123 66.8 528 38.9
Missing 0 0 0 2 0.1
Age 20-29 3 39 21.2 231 17.0
30-39 6 43 23.4 305 22.5
40-49 7 45 24.5 442 32.5
50-59 10 57 31.0 380 28.0
Gender Female 23 154 83.7 1005 74.0
Male 3 28 15.2 339 25.0
Gender diverse 0 2 1.1 11 0.8
Prefer to not answer 0 0 0 3 0.2
Race White 22 139 75.5 1036 76.3
Black 2 9 4.9 96 7.1
Latin American 0 7 3.8 35 2.6
East Asian 0 6 3.3 45 3.3
Middle Eastern 0 3 1.6 10 0.7
South Asian 1 5 2.7 40 2.9
Southeast Asian 1 2 1.1 13 1.0
Indigenous 0 1 0.5 1 0.1
Mixed race 0 10 5.4 69 5.0
Other/missing/prefer to not answer 0 2 1.0 14 1.1
Marital status Married/common law 16 88 47.9 746 55.0
Single 7 60 32.6 423 31.1
Divorced 2 23 12.5 122 9.0
Separated 0 7 3.8 35 2.6
Widowed 1 3 1.6 14 1.0
Other/prefer not to answer 0 18 1.3
Fitzpatrick skin type Always burn and never tan 2 10 5.4 97 7.1
Usually burn and minimally tan 9 45 24.5 363 26.7
Mild burn and then tan 9 81 44.0 503 37
Rarely burn and always tan 4 25 13.6 263 19.4
Rarely burn and tan very easily 1 16 8.7 110 8.1
Never burn and never tan 1 4 2.2 22 1.6
Missing 0 3 1.6 0 0
Highest education Some high school 0 3 1.6 7 0.5
High school 1 10 5.4 105 7.7
Some college, trade, or university 4 25 13.6 200 14.7
College, trade, or university degree 9 103 56.0 686 50.5
Some master’s or doctoral 0 7 3.8 86 6.3
Master’s or doctoral degree 11 36 19.6 273 20.1
Missing/prefer to not answer 1 0 0 1 0.1

Table 3.

Treatment History Reported by Qualitative Sample and Prolific Participants

Treatment | Qualitative sample (n = 26): n | Cognitive sample (n = 184): n, % | Field test sample (n = 1358): n, %
Injectable Botox 18 124 67.4 592 43.6
Filler 17 121 65.8 395 29.1
Platelet-rich plasma (PRP) 1 13 7.1 61 4.5
Skin booster 0 17 9.3 114 8.4
Skin resurfacing Microdermabrasion 7 81 44.0 496 36.5
Chemical peel 16 74 40.2 489 36.0
Hydrafacial 2 65 35.3 580 42.7
Laser 14 47 25.5 239 17.6
Microneedling 2 46 25.0 301 22.2
Light therapy 14 40 21.7 209 15.4
Skin tightening Radiofrequency 7 21 11.4 150 11.0
High intensity ultrasound 0 17 9.2 137 10.1
Thread lift 1 13 7.1 95 7.0
Fat removal Fat removal 1 14 7.6 87 6.4

From the natural concepts, we created items that formed 2 scales. One scale (Natural Appearance) measured how natural the face looked after facial aesthetic treatment (eg, My facial features [eg, lips, cheeks] look balanced). The second scale (Natural Outcome) focused on evaluating what patients thought about the results of the facial aesthetic treatment they had (eg, The treatment was not overdone).

Scale Refinement

Eleven participants from the concept elicitation interviews provided feedback on the scales with REDCap. For the face scale, composed of 23 items (253 ratings), no one chose “I do not understand”; 11 (0.9%) ratings were “I understand this question, but it could be worded better”; 31 (12.3%) ratings were “I understand this question, but it is not relevant to me”; and 211 (83.4%) ratings were “I understand this question and it is relevant to me.”

For the treatment scale, composed of 32 items (352 ratings), no one chose “I do not understand”; 3 (0.9%) ratings were “I understand this question, but it could be worded better”; 46 (13.1%) ratings were “I understand this question, but it is not relevant to me”; and 303 (86.1%) ratings were: “I understand this question and it is relevant to me.”

Appendices C-E, available at www.aestheticsurgeryjournal.com, show the item-level decisions made in each round. In Round 1, 7 cognitive debriefing interviews were performed. Round 1 also included feedback from 3 aesthetic plastic surgeons and 1 plastic surgery resident from Canada. Based on this round, 39 items were retained, 8 items revised, 8 items dropped, and 10 items added. Based on the suggestion of the experts, the Natural Appearance scale was adapted to provide a means to measure expectations before treatment.

Round 2 included 5 plastic surgeons, 1 dermatologist, and 2 industry experts from Denmark, Canada, Sweden, and the USA. The total number of items tested in Round 2 was 74. Based on this round, 57 items were retained, 14 items revised, 3 items dropped, and 8 items added, resulting in 79 items.

In Round 3, 556 Prolific participants accessed the screening survey, 194 who met the screening inclusion criteria were invited to complete the cognitive survey, 156 completed the survey, and 144 were found to meet the study inclusion criteria. A sample of 40 participants from the UK sample was invited to complete the cognitive survey before the field test. The full cognitive sample included 184 participants. For Expectations, Natural Appearance, and Natural Outcome, respectively, the option “I do not understand the question” was chosen 0.3%, 0.4%, and 1.4% of the time, and the option “I understand the question and it is relevant to me” was chosen 81.6%, 81.7% and 82.7% of the time. Based on this round, 2 items were revised, and 9 items were dropped, resulting in 70 items.

Pilot Field Test

The 144 participants were invited to complete the pilot field test, and 123 completed 1 or more of the Natural scales. Based on the RMT analysis, 6 items were dropped. The field test survey included the 2 face scales with 15 items each and the treatment scale with 34 items.

Field Test

A total of 4301 Prolific participants completed a screening survey. After removing 1365 duplicates, incompletes, and ineligible participants, 1895 met the screening inclusion criteria. Of these, 1458 responded to the invitation, but 199 were not eligible for the study for the following reasons: survey incomplete (n = 95), no treatment (n = 84), reported “other” for the type of treatment (n = 14), and provided unreliable answers (n = 6). The Part 1 sample included 1259 participants, of whom 1201 completed the Expectations scale. For Part 2, 10 participants requested no follow-up, and treatment had worn off for 249. Of the 1000 potential participants, 960 completed the face (Natural Appearance) scale, and 962 completed the treatment (Natural Outcome) scale. A total of 1235 participants completed 1 or more Natural scales.

For the RMT analysis, data from the 123 pilot field test participants were combined with data from the 1235 field test participants. Characteristics for the 1358 participants are shown in Tables 2 and 3. Participant age ranged from 20 to 76 years with a mean age of 42.4 years (SD = 12.1). Of 1358 participants, 1005 (74%) were females, 339 (25%) were males, and the remaining participants were gender diverse or preferred not to answer. The RMT analysis resulted in 3 scales: a 15-item Expectations scale, a 10-item Natural Appearance scale, and a 12-item Natural Outcome scale. All 37 items had ordered thresholds (Appendix F, available at www.aestheticsurgeryjournal.com). Items had good fit to the Rasch model, with nonsignificant chi-square P values after Bonferroni adjustment (Appendix G, available at www.aestheticsurgeryjournal.com). For 19 items, the fit residuals were outside ±2.5. Significant differential item functioning (DIF) was present in 2 items for age group and 6 items for gender (Appendix G). The DIF had little impact on scoring; Pearson correlations between person locations before and after splitting for DIF were ≥0.999.

Table 4 shows the scale-level results. Reliability was high, with person separation index (PSI) and Cronbach alpha values >0.80. The face scales had 1 or 2 pairs of items that evidenced local dependency. Subtests reduced reliability by a maximum of 0.04 for PSI values and 0.06 for Cronbach alpha values. Appendix F shows the person-item threshold distributions. The sample (upper histograms) was targeted to the scales (lower histograms). While most participants (≥74.4%) scored within the range of measurement provided by the scales, scores were skewed toward the upper end, which indicated that participants had high expectations for a natural outcome and that a natural result had been achieved.
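To make the local dependency check concrete: these analyses were run in RUMM2030, but the criterion described in Table 1 (residual correlations more than 0.30 above the average residual correlation) can be illustrated with a short Python sketch. The residuals input is assumed to be a persons-by-items matrix of standardized person-item residuals exported from the Rasch software; the function name, the simulated data, and the 0.30 margin default are ours, chosen to match the criterion cited in the article.

import numpy as np

def locally_dependent_pairs(residuals, margin=0.30):
    # Flag item pairs whose residual correlation exceeds the average residual
    # correlation by more than `margin` (the Yen's Q3-style criterion of
    # Christensen et al, reference 25).
    corr = np.corrcoef(np.asarray(residuals, dtype=float), rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)
    mean_corr = corr[upper].mean()
    flagged = [(i, j, round(corr[i, j], 3))
               for i, j in zip(*upper)
               if corr[i, j] > mean_corr + margin]
    return mean_corr, flagged

# Hypothetical usage with simulated residuals for 200 persons and 10 items;
# items 0 and 1 are made to share variance so the pair is flagged.
rng = np.random.default_rng(0)
simulated = rng.normal(size=(200, 10))
simulated[:, 1] = 0.8 * simulated[:, 0] + rng.normal(scale=0.5, size=200)
print(locally_dependent_pairs(simulated))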

Table 4.

RMT Scale Level Statistics and Other Psychometric Results

Expectations scale (15 items; sample n = 1321; RMT n = 1066)
  Score on scale: 80.7%. Model fit: χ2 = 161.6, DF = 135, P = .06. Floor: 0.5%; ceiling: 19.2%.
  PSI: 0.81 (+extr), 0.82 (−extr). α: 0.93 (+extr), 0.90 (−extr).
  ICC +extr: n = 113, ICC = 0.62 (LB 0.45, UB 0.74); −extr: n = 103, ICC = 0.88 (LB 0.82, UB 0.92).
Natural Appearance scale (10 items; sample n = 1072; RMT n = 947)
  Score on scale: 88.3%. Model fit: χ2 = 96.28, DF = 70, P = .02. Floor: 0.4%; ceiling: 11.3%.
  PSI: 0.84 (+extr), 0.83 (−extr). α: 0.90 (+extr), 0.87 (−extr).
  ICC +extr: n = 111, ICC = 0.86 (LB 0.79, UB 0.90); −extr: n = 100, ICC = 0.94 (LB 0.91, UB 0.96).
Natural Outcome scale (12 items; sample n = 1070; RMT n = 796)
  Score on scale: 74.4%. Model fit: χ2 = 135.1, DF = 72, P = .00. Floor: 0.1%; ceiling: 25.5%.
  PSI: 0.86 (+extr), 0.88 (−extr). α: 0.95 (+extr), 0.92 (−extr).
  ICC +extr: n = 115, ICC = 0.88 (LB 0.83, UB 0.92); −extr: n = 98, ICC = 0.97 (LB 0.96, UB 0.98).

α, Cronbach alpha; CI, confidence interval; DF, degrees of freedom; +extr, with extremes; −extr, without extremes; ICC, intraclass correlation coefficient; LB, lower bound; PSI, person separation index; RMT, Rasch measurement theory; UB, upper bound.

Regarding construct validity, all but 3 correlations between the natural post-treatment scales and FACE-Q Aesthetics scales met consensus-based standards for the selection of health measurement instruments (COSMIN) criteria as hypothesized (see Table 5). Figure 2 shows the mean scores for the test of construct validity for the Expectations scale. As hypothesized, scores were incrementally higher by how important it was to achieve a natural look after treatment. Figure 3 shows the mean scores for the Natural Appearance and Natural Outcome scales for one of the construct validity tests. Descriptive statistics supporting construct validation results are in Appendix H (available at www.aestheticsurgeryjournal.com). As hypothesized, the Natural scale scores were incrementally higher for participants who reported that their face looked and felt more natural after treatment. Similarly, higher scores were associated with greater satisfaction with facial appearance overall. All construct validity tests for differences between subgroups were significant (ie, P < .001).
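As an illustration of how the correlation hypotheses in Table 5 are evaluated against the COSMIN thresholds (≥0.5 for similar constructs, 0.3-0.5 for related but dissimilar constructs), here is a minimal Python sketch using numpy and scipy; the function name and the simulated scores are ours, for demonstration only.

import numpy as np
from scipy import stats

def evaluate_correlation_hypothesis(scale_scores, comparator_scores, expected):
    # expected: "strong" (r >= 0.5, similar construct) or
    #           "moderate" (0.3 <= r < 0.5, related but dissimilar construct).
    r, p = stats.pearsonr(scale_scores, comparator_scores)
    accepted = (r >= 0.5) if expected == "strong" else (0.3 <= r < 0.5)
    return round(r, 3), p, accepted

# Hypothetical usage with simulated transformed scores for 300 participants.
rng = np.random.default_rng(1)
natural_outcome = rng.normal(50, 15, size=300)
faceq_outcome = 0.7 * natural_outcome + rng.normal(0, 12, size=300)
print(evaluate_correlation_hypothesis(natural_outcome, faceq_outcome, expected="strong"))
# Construct validity is considered sufficient when >= 75% of predefined hypotheses are accepted.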

Table 5.

Summary of Construct Validation Hypotheses and Results

Hypothesis | Expectations | Natural Appearance | Natural Outcome
Mean scores will be incrementally higher . . .
  with increased importance of having a natural look | Yes** | NA | NA
  as self-report responses to “How natural your face looks?” increase | NA | Yes** | Yes**
  as self-report responses to “How natural your face feels?” increase | NA | Yes** | Yes**
  with greater satisfaction with how the face looks overall | NA | Yes** | Yes**
Scores will correlate with . . .
  FACE-Q Face overall scores moderately (0.3-0.5) | NA | Yes** (r = 0.383) | Yes** (r = 0.372)
  FACE-Q Psychological scores moderately (0.3-0.5) | NA | Yes** (r = 0.345) | Yes** (r = 0.334)
  FACE-Q Social scores moderately (0.3-0.5) | NA | Yes** (r = 0.319) | No** (r = 0.272)
  FACE-Q Outcome scores strongly (≥0.5) | NA | Yes** (r = 0.526) | Yes** (r = 0.791)
  FACE-Q Decision scores strongly (≥0.5)a | NA | No** (r = 0.420) | Yes** (r = 0.577)
No. of hypotheses tested | 1 | 8 | 8
No. of hypotheses accepted | 1 | 7 | 7
Percentage accepted | 100% | 87.5% | 87.5%

aFACE-Q Aesthetics decision scale included in field test sample only. **P ≤ .001 (2-tailed). NA, not applicable.

Figure 2. Mean scores for the Expectations scale by how important it is for participants to achieve a natural look.

Figure 3. Mean scores for the Natural Appearance and Outcome scales by participant ratings of how natural their face looks.

Finally, for the test-retest (TRT) data, we excluded 4 participants from the Expectations scale, 15 from the Natural Appearance scale, and 11 from the Natural Outcome scale because they completed the survey outside 7 to 14 days or reported an important change in the scales’ concepts. The intraclass correlation coefficient (ICC) values for the TRT reliability and the upper and lower bounds were all ≥0.82 after excluding extremes (see Table 4).
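To illustrate the test-retest statistic reported here, the following is a minimal Python sketch of a 2-way random effects, absolute agreement, single-measurement ICC, often written ICC(2,1), computed from a persons-by-occasions matrix. The scores shown are hypothetical, and the confidence bounds reported in Table 4 would additionally require an interval estimate that this sketch omits.

import numpy as np

def icc_2_1(ratings):
    # ICC(2,1): 2-way random effects, absolute agreement, single measurement.
    # `ratings` is a persons x occasions matrix (eg, time 1 and time 2 scores).
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between persons
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()   # between occasions
    ss_error = ((y - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical test-retest scores (time 1, time 2) for 6 participants.
scores = [[72, 75], [60, 58], [85, 88], [90, 91], [55, 60], [78, 77]]
print(round(icc_2_1(scores), 2))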

DISCUSSION

Facial aesthetic treatments can improve how patients look and feel about themselves. Most people in our qualitative study desired a natural-looking result from their treatment. Participants described a natural result as one that looked age-appropriate and was subtle rather than obvious to others. Furthermore, a natural result was one that made people look better, rested, refreshed, and rejuvenated.

Our team developed and validated the FACE-Q Aesthetics Natural module to provide a means for patients to self-report their expectations before treatment as well as their outcomes after treatment. Patients’ own words were employed to ensure that the content of the scales resonated with them. Our large, diverse, and international sample provided evidence to support the reliability and validity of the natural scales in a wide range of minimally invasive treatments. Table 1 provides an interpretation of study results. As far as we are aware, these scales represent the first rigorously developed and validated scales to measure the concept of natural from the patient's perspective.31

Previous studies on the topic of natural have focused on its meaning and attempted to measure this concept in various ways.31-36 Our study is unique in its use of a mixed-methods approach to understand and develop a means to measure natural from the patient’s perspective. Another PROM with a component relating to the concept of natural is the Facial Line Satisfaction Questionnaire (FLSQ).37 This PROM was designed to measure the outcome of treatment of crow's feet, glabellar, and forehead lines.37 The concept of natural was measured with a single item in each of the baseline and follow-up versions of the FLSQ. In contrast to the FLSQ, the FACE-Q Natural scales can be used to evaluate outcomes for any minimally invasive facial aesthetic treatment, and all items are focused on the concept of natural. Other studies have measured the concept of natural in the context of treatment with hyaluronic acid fillers.31,34-36 To address a limitation of these studies, we included a more diverse sample of treatment types and patient characteristics.

Our study had limitations. While our sample was large and diverse, only English-speaking people were included. Our sample may not reflect the general population of Canada, the USA, or the UK, or of people seeking facial aesthetic treatments. Some treatment types in our study sample were represented by a small number of participants. While the concept of natural is highly important to people having minimally invasive treatments, we recognize that natural is also relevant to people having facial aesthetic surgery. Future research to determine if the Natural scales are applicable to cosmetic surgery is warranted. It is important to note that the treatment and clinical data collected in our study were self-reported. Finally, online platforms for research have limitations. Participants self-selected to participate and were paid for their involvement in the research. However, there is evidence that the data provided through Prolific are of high quality when compared with data from other online platforms.38

CONCLUSIONS

Each person undergoing facial aesthetic treatment has unique priorities and expectations. A PROM allows the treatment provider to understand whether goals of care have been reached from the patient's unique point of view. The new FACE-Q Aesthetics Natural module provides a means for clinicians and researchers to measure expectations and outcomes from the patient’s perspective.

Supplemental Material

This article contains supplemental material located online at www.aestheticsurgeryjournal.com.


Disclosures

Drs Klassen, Cano, and Pusic are codevelopers of the FACE-Q Aesthetics patient-reported outcome measure and receive a share of license revenues as royalties based on their institution’s inventor sharing policy. Dr Cano is the chief scientific officer of Modus Outcomes (Cambridge, MA), a division of THREAD (Cary, NC). Dr Klassen provides research consulting services to the pharmaceutical industry through EVENTUM Research (Hamilton, ON, Canada). All other authors have nothing to disclose.

Funding

The authors received no external financial support for the research, authorship, and publication of this article.

REFERENCES

  • 1. American Academy of Facial Plastic and Reconstructive Surgery. AAFPRS announces annual survey results: demand for facial plastic surgery skyrockets as pandemic drags on. Accessed September 27, 2023. https://www.aafprs.org/Media/Press_Releases/2021%20Survey%20Results.aspx
  • 2. US Food and Drug Administration. Guidance for industry patient-reported outcome measures: use in medical product development to support labeling claims. Accessed September 27, 2023. https://www.fda.gov/downloads/drugs/guidances/ucm193282.pdf
  • 3. Calvert M, Kyte D, Price G. Maximising the impact of patient reported outcome assessment for patients and society. BMJ. 2019;365:k5267. doi: 10.1136/bmj.k5267
  • 4. Black N. PROMs could help transform healthcare. BMJ. 2013;346:16. doi: 10.1136/bmj.f167
  • 5. Pusic A, Klassen AF, Scott AM, Cano SJ. Development and psychometric evaluation of the FACE-Q satisfaction with appearance scale: a new patient-reported outcome instrument for facial aesthetics patients. Clin Plast Surg. 2013;40(2):249–260. doi: 10.1016/j.cps.2012.12.001
  • 6. Panchapakesan V, Klassen AF, Cano SJ, Scott AM, Pusic AL. Development and psychometric evaluation of the FACE-Q aging appraisal scale and patient-perceived age visual analog scale. Aesthet Surg J. 2013;33(8):1099–1109. doi: 10.1177/1090820X13510170
  • 7. Klassen AF, Cano SJ, Scott AM, Pusic AL. Measuring outcomes that matter to face-lift patients: development and validation of FACE-Q appearance appraisal scales and adverse effects checklist for the lower face and neck. Plast Reconstr Surg. 2014;133(1):21–30. doi: 10.1097/01.prs.0000436814.11462.94
  • 8. Klassen AF, Cano SJ, Schwitzer J, Scott A, Pusic AL. FACE-Q scales for health-related quality of life, early life impact and satisfaction with outcomes and decision to have treatment: development and validation. Plast Reconstr Surg. 2015;135(2):375–386. doi: 10.1097/PRS.0000000000000895
  • 9. Klassen AF, Cano SJ, East CA, et al. Development and psychometric evaluation of the FACE-Q scales for patients undergoing rhinoplasty. JAMA Facial Plast Surg. 2016;18(1):27–35. doi: 10.1001/jamafacial.2015.1445
  • 10. Klassen AF, Cano SJ, Schwitzer JA, et al. Development and psychometric validation of the FACE-Q skin, lips, and facial rhytides appearance scales and adverse effects checklists for cosmetic procedures. JAMA Dermatol. 2016;152(4):443–451. doi: 10.1001/jamadermatol.2016.0018
  • 11. Klassen AF, Cano SJ, Grotting JC, et al. FACE-Q eye module for measuring patient-reported outcomes following cosmetic eye treatments. JAMA Facial Plast Surg. 2017;19(1):7–14. doi: 10.1001/jamafacial.2016.1018
  • 12. Klassen AF, Pusic AL, Kaur M, et al. The SKIN-Q: an innovative patient-reported outcome measure for evaluating minimally invasive skin treatments for the face and body. Facial Plast Surg Aesthet Med. In press.
  • 13. Klassen AF, Cano SJ, Kaur M, et al. Extending the range of measurement: the FACE-Q Aesthetics face overall item library and banks. Submitted.
  • 14. Regnault A, Willgoss T, Barbic S; International Society for Quality of Life Research Mixed Methods Special Interest Group. Towards the use of mixed methods inquiry as best practice in health outcomes research. J Patient Rep Outcomes. 2018;2(1):1–4. doi: 10.1186/s41687-018-0043-8
  • 15. Aaronson N, Alonso J, Burnam A, et al. Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res. 2002;11(3):193–205. doi: 10.1023/a:1015291021312
  • 16. Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity–establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report: part 1–eliciting concepts for a new PRO instrument. Value Health. 2011;14(8):967–977. doi: 10.1016/j.jval.2011.06.014
  • 17. Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity–establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO Good Research Practices Task Force report: part 2–assessing respondent understanding. Value Health. 2011;14(8):978–988. doi: 10.1016/j.jval.2011.06.013
  • 18. Thorne S, Kirkham SR, MacDonald-Emes J. Interpretive description: a noncategorical qualitative alternative for developing nursing knowledge. Res Nurs Health. 1997;20(2):169–177.
  • 19. Pope C, Ziebland S, Mays N. Qualitative research in health care. Analysing qualitative data. Br Med J. 2000;320(7227):114–116. doi: 10.1136/bmj.320.7227.114
  • 20. Sandelowski M. Theoretical saturation. In: Given LM, ed. The Sage Encyclopedia of Qualitative Methods. Vol 1. Sage; 2008:875–876.
  • 21. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377–381. doi: 10.1016/j.jbi.2008.08.010
  • 22. Andrich D, Marais I. A Course in Rasch Measurement Theory: Measuring in the Educational, Social and Health Sciences. Springer; 2019:161–171.
  • 23. Hobart J, Cano S. Improving the evaluation of therapeutic interventions in multiple sclerosis: the role of new psychometric methods. Health Technol Assess. 2009;13(12). doi: 10.3310/hta13120
  • 24. Andrich D, Sheridan BS, Luo G. RUMM2030Plus: Rasch Unidimensional Models for Measurement. RUMM Laboratory; 2021. www.rummlab.com.au
  • 25. Christensen KB, Makransky G, Horton M. Critical values for Yen's Q3: identification of local dependence in the Rasch model using residual correlations. Appl Psychol Meas. 2017;41(3):178–194. doi: 10.1177/0146621616677520
  • 26. Cleanthous S, Bongardt S, Marquis P, Stach C, Cano S, Morel T. Psychometric analysis from EMBODY1 and 2 clinical trials to help select suitable fatigue PRO scales for future systemic lupus erythematosus studies. Rheumatol Ther. 2021;8(3):1287–1301. doi: 10.1007/s40744-021-00338-4
  • 27. Andrich D, Hagquist C. Real and artificial differential item functioning. J Educ Behav Stat. 2012;37(3):387–416. doi: 10.3102/1076998611411913
  • 28. Nunnally JC. Psychometric Theory. 3rd ed. McGraw-Hill; 1994.
  • 29. Prinsen CA, Mokkink LB, Bouter LM, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147–1157. doi: 10.1007/s11136-018-1798-3
  • 30. Andrich D. An index of person separation in latent trait theory, the traditional KR.20 index, and the Guttman scale response pattern. Educ Res Perspect. 1982;9(1):95–104.
  • 31. Solish N, Bertucci V, Percec I, Wagner T, Nogueira A, Mashburn J. Dynamics of hyaluronic acid fillers formulated to maintain natural facial expression. J Cosmet Dermatol. 2019;18(3):738–746. doi: 10.1111/jocd.12961
  • 32. Dayan S, Romero DH. Introducing a novel model: the special theory of relativity for attractiveness to define a natural and pleasing outcome following cosmetic treatments. J Cosmet Dermatol. 2018;17(5):925–930. doi: 10.1111/jocd.12732
  • 33. Michaud T, Gassia V, Belhaouari L. Facial dynamics and emotional expressions in facial aging treatments. J Cosmet Dermatol. 2015;14(1):9–21. doi: 10.1111/jocd.12128
  • 34. Philipp-Dormston WG, Schuster B, Podda M. Perceived naturalness of facial expression after hyaluronic acid filler injection in nasolabial folds and lower face. J Cosmet Dermatol. 2020;19(7):1600–1606. doi: 10.1111/jocd.13205
  • 35. Swift A, von Grote E, Jonas B, Nogueira A. Minimal recovery time needed to return to social engagement following nasolabial fold correction with hyaluronic acid fillers produced with XpresHAn technology. Clin Cosmet Investig Dermatol. 2017;10:229–238. doi: 10.2147/CCID.S138155
  • 36. Dayan S, Fabi S, Nogueira A. Lay rater evaluation of naturalness and first impression following treatment of lower face wrinkles with hyaluronic acid fillers. J Cosmet Dermatol. 2021;20(4):1091–1097. doi: 10.1111/jocd.13927
  • 37. Pompilus F, Burgess S, Hudgens S, Banderas B, Daniels S. Development and validation of a novel patient-reported treatment satisfaction measure for hyperfunctional facial lines: facial line satisfaction questionnaire. J Cosmet Dermatol. 2015;14(4):274–285. doi: 10.1111/jocd.12166
  • 38. Strickland JC, Stoops WW. The use of crowdsourcing in addiction science research: Amazon Mechanical Turk. Exp Clin Psychopharmacol. 2019;27(1):1–18. doi: 10.1037/pha0000235
