PLOS One. 2023 Nov 8;18(11):e0292578. doi: 10.1371/journal.pone.0292578

Comparing text mining and manual coding methods: Analysing interview data on quality of care in long-term care for older adults

Coen Hacking 1,2,*, Hilde Verbeek 1,2, Jan P H Hamers 1,2, Sil Aarts 1,2
Editor: Baby Gobin
PMCID: PMC10631650  PMID: 37939098

Abstract

Objectives

In long-term care for older adults, large amounts of text are collected relating to the quality of care, such as transcribed interviews. Researchers currently analyze textual data manually to gain insights, which is a time-consuming process. Text mining could provide a solution, as this methodology can be used to analyze large amounts of text automatically. This study aims to compare text mining to manual coding with regard to sentiment analysis and thematic content analysis.

Methods

Data were collected from interviews with residents (n = 21), family members (n = 20), and care professionals (n = 20). Text mining models were developed and compared to the manual approach. The results of the manual and text mining approaches were evaluated based on three criteria: accuracy, consistency, and expert feedback. Accuracy assessed the similarity between the two approaches, while consistency determined whether each individual approach found the same themes in similar text segments. Expert feedback served as a representation of the perceived correctness of the text mining approach.

Results

An accuracy analysis revealed that more than 80% of the text segments were assigned the same themes and sentiment using both text mining and manual approaches. Interviews coded with text mining demonstrated higher consistency compared to those coded manually. Expert feedback identified certain limitations in both the text mining and manual approaches.

Conclusions and implications

While these analyses highlighted the current limitations of text mining, they also exposed certain inconsistencies in manual analysis. This information suggests that text mining has the potential to be an effective and efficient tool for analysing large volumes of textual data in the context of long-term care for older adults.

Introduction

In recent years, client perspectives have become increasingly important in long-term care (LTC) for older adults when assessing the quality of care [1–3]. To gain insight into these perspectives, textual data are often collected, such as electronic health records, policy documents or transcribed interviews with various stakeholders, including residents of nursing homes [2,4]. When interviews are conducted with stakeholders in nursing homes, textual data may be collected by transcribing audio recordings verbatim (i.e. literally converting voice into text); these documents are often referred to as transcripts. This type of data collection often results in large amounts of textual data. To analyse these data, researchers often conduct a so-called coding analysis, which involves manually analysing each transcript (stemming from an interview) to identify text fragments that are relevant to the objective at hand (often a research question) [2,5]. Each key fragment is summarised using codes (i.e. summaries of several words) that reflect the condensed meaning of that specific fragment [5]. The codes are then clustered based on their similarity and grouped into themes [5]. These themes convey a topic of relevance to the transcript at hand, which often provides a direct or indirect answer to the research question [5]. Although this type of coding is typically performed in a bottom-up manner, it is also possible to apply a top-down approach, in which case a set of themes is constructed in advance [3]. Since text analysis through coding is known to be very time-consuming and prone to bias due to the subjectivity of the researchers, coding is often performed independently by two or more researchers, thereby ensuring a certain level of objectivity. Nevertheless, manual analysis is never completely objective, as researchers are prone to human biases such as generalisations, inferences, and interpretations [6,7], which compromise reproducibility and limit the amount of data that can be analysed.

To overcome the aforementioned drawbacks, text mining could offer a possible solution. Text mining is the process of transforming unstructured text into structured data in order to gain new information and knowledge [8], and has already been used for knowledge discovery in other domains of health care [4,9–13]. Knowledge discovery is the process of extracting useful information from a collection of data; for example, a study conducted on electronic health records discussed how text mining could be used to group pathology reports and discharge summaries based on similar word occurrences [10]. Another study that focused on organising clinical narratives concluded that text mining could be applied to clinical narratives to identify keywords that could help in classifying physiotherapy treatments [4]. These examples highlight the usefulness of text mining in the health care domain.

Recent advancements in the field of text mining have ushered in a variety of new techniques, each with its unique focus and application [14–19]. Some models are particularly good at generating context-aware, human-like text, while others excel at incorporating multi-modal data, such as text and images, for a more comprehensive analysis [14–16]. Moreover, there is a growing emphasis on adapting these models to run efficiently on consumer-grade hardware [17]. Despite these strides in technology, there are still significant challenges in achieving the level of accuracy required for some tasks, and in many cases, human expertise continues to outperform automated methods [17].

To understand the potential usefulness of text mining for qualitative research in long-term care for older adults, it should be compared to the current gold standard of manual coding [20]. This study aims to compare a text mining approach to a manual approach in terms of accuracy, consistency, and expert feedback. Accuracy is a measure of the degree to which the results from the text mining approach are similar to those of the manual approach, whereas consistency is defined as the degree to which an approach (i.e. text mining or manual) finds the same themes for similar pieces of text. Expert feedback is collected to show whether the analyses conducted through text mining are perceived to be correct.

Materials and methods

Study design

In this study, a comparison was conducted between the use of manual and text mining approaches in a sentiment analysis and a thematic content analysis of qualitative data accumulated in an LTC setting. Two different text mining models were constructed: (i) a sentiment analysis model, and (ii) a thematic content analysis model [21,22]. Each model was then compared to the respective manual coding approach, based on an accuracy evaluation, a consistency evaluation and expert feedback.

Sample and participants

Data were collected as part of a project entitled ‘Connecting Conversations’, which aimed to assess the experienced quality of care in nursing homes from different perspectives [2,23]. This was achieved by interviewing residents, family members and care professionals at different nursing homes in the South of Limburg [2,23].

A total of n = 250 interviews were conducted at five different LTC organisations in the southern part of the Netherlands. Of those interviews, 234 were transcribed (16 could not be transcribed due to poor audio quality). Of these 234 interviews, 61 were analysed manually using thematic content analysis and 103 were analysed manually using sentiment analysis. All analyses in this manuscript were performed using those 61 and 103 interviews for the thematic content analysis and the sentiment analysis, respectively.

All interviews were conducted between January 2018 and December 2019. A diverse set of wards was included, including wards for older people with dementia [23]. A total of n = 35 interviewers conducted the interviews. These interviewers were part of the project ‘Connecting Conversations’, which aims to assess the experienced quality of care in nursing homes from the resident’s perspective. They primarily came from a long-term care setting and received specialised training to conduct these interviews. For a more comprehensive understanding of the ‘Connecting Conversations’ project, see Sion et al. 2020a [2]. The medical ethical committee of Zuyderland (the Netherlands) approved the study protocol (17-N-86). Information about the study was provided to all interviewers, residents, family members and caregivers by means of an information letter. All participants provided written informed consent; residents with legal representatives gave informed consent themselves, as did their legal representatives, before and during the conversations.

Data

The interviews were anonymously collected in the form of audio recordings and were transcribed verbatim (in Dutch) [2]. Personally identifiable information was removed from the transcripts before they were coded. The data were coded by three research experts, each of whom had worked in the Living Lab on Ageing and Long-Term Care for over 5 years. All of these experts have a minimum of ten years of experience in conducting qualitative research. A total of 103 transcripts were manually coded for sentiment [24]. In this analysis, text segments were manually coded as being either ‘positive’ or ‘negative’; however, text segments were only coded if the text discussed a topic relevant to the nursing home. A total of 61 transcripts were manually coded using INDEXQUAL, a thematic framework for defining the quality of LTC [3]. The themes provided by INDEXQUAL are ‘context’, ‘nursing home’, ‘person’, ‘expectations’, ‘personal needs’, ‘past experiences’, ‘word of mouth’, ‘experiences’, ‘care environment’, ‘relationship-centred care’, ‘experienced quality of care’, ‘perceived care services’, ‘perceived care outcomes’ and ‘satisfaction’ [2,3]. In both cases, transcripts were coded using MAXQDA, and these codes were exported to develop a text mining approach [25].
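As a minimal sketch of this hand-off, the exported codes can be loaded as a flat table before model training. The file name and column names below are hypothetical; the actual MAXQDA export layout may differ.

    # Sketch: loading coded segments exported from MAXQDA (hypothetical CSV layout).
    import pandas as pd

    segments = pd.read_csv("coded_segments.csv")  # assumed columns: "segment", "code"
    print(segments["code"].value_counts())        # distribution of codes over segments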

Text mining models

The models presented in the current study were created using deep learning, a method in which artificial neural networks (ANNs) learn automatically from input data [26]. A Dutch base language model called RobBERT was used [22]. The advantage of using such a model is that language knowledge can be learned from a large dataset of arbitrary (Dutch) text. Two models were developed in the current study: a sentiment analysis model and a thematic content analysis model. The code for the models can be found at: https://doi.org/10.5281/zenodo.8391747.
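For illustration, a minimal sketch of loading RobBERT for fine-tuning with the “huggingface transformers” library is shown below. The model identifier is the publicly released RobBERT v2 checkpoint; the study’s exact training configuration is in the Zenodo repository and may differ from this sketch.

    # Sketch: loading the pretrained Dutch RobBERT model for fine-tuning.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "pdelobelle/robbert-v2-dutch-base",
        num_labels=2,  # e.g. two classes for the sentiment model
    )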

Sentiment analysis

Sentiment analysis is the process of computationally identifying the sentiment expressed in a piece of text [8,27]. For example, the sentence ‘It’s a good day’ could be identified as being positive, while the sentence ‘It’s a bad day’ could be identified as being negative. The sentence ‘Today I went for a walk,’ could be neutral, as it does not convey whether the walk was experienced as a positive or negative event. Coded text segments were passed directly as input to the model, without modification. The sentiment analysis model was trained to classify the sentiment of a given piece of text into one of two categories, i.e. positive or negative. A positive or negative code was only assigned when it was perceived as being relevant to improving the quality of care [24].
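A hedged sketch of how such a classifier could be applied at inference time is given below; the checkpoint path is a placeholder for a locally fine-tuned model, not a published one.

    # Sketch: classifying a text segment with a fine-tuned sentiment model.
    from transformers import pipeline

    classify = pipeline(
        "text-classification",
        model="path/to/finetuned-robbert-sentiment",  # hypothetical local checkpoint
    )
    print(classify("Het is een goede dag."))  # e.g. [{'label': 'positive', 'score': ...}]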

Thematic content analysis

As part of the thematic content analysis, the model was trained to identify the themes present in each piece of text and to classify them into the relevant themes of the INDEXQUAL coding scheme. Since the number of coded text segments (n = 3867) was insufficient to allow the model to learn all the themes and sub-themes (n = 16), only the main themes were used: ‘Experienced quality of care’, ‘Experiences’, ‘Expectations’ and ‘Context’ [3]. Each code containing a sub-theme was changed to one of these main themes, and the model was designed to be able to identify multiple themes that may be present in a text segment.
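Because a segment may carry several themes at once, this is a multi-label classification problem. A minimal sketch of such a configuration with the transformers library is shown below; the four labels follow the main INDEXQUAL themes, while the base checkpoint and setup are assumptions rather than the study’s exact configuration.

    # Sketch: multi-label theme classification over the four main INDEXQUAL themes.
    from transformers import AutoModelForSequenceClassification

    themes = ["Experienced quality of care", "Experiences", "Expectations", "Context"]
    model = AutoModelForSequenceClassification.from_pretrained(
        "pdelobelle/robbert-v2-dutch-base",
        num_labels=len(themes),
        problem_type="multi_label_classification",  # independent sigmoid per theme
    )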

Evaluation

The text mining models were analysed in three ways: an accuracy evaluation, a consistency evaluation, and expert feedback. The accuracy analysis assessed the ability of each model to correctly classify or predict outcomes based on the input data, while the consistency analysis evaluated their ability to produce consistent results over multiple runs or when applied to different datasets, and expert feedback was used to provide additional insight into the performance and potential biases of the models [28–30].

Accuracy

The accuracy evaluation aimed to calculate the percentage of text segments that were assigned the same codes in both the text mining approach and the manual approach [8,27,31]. For example, if the text mining model for sentiment analysis assigned the same sentiment as the manual approach for all of the sentences, then the model would be considered 100% accurate. To calculate the accuracy, training and validation sets were used: the training set was used to provide feedback to the model to help improve it (i.e. supervised learning), while the validation set was used to evaluate whether what the model had learned so far could be generalised to data that it had not had the chance to learn from [28]. The total amount of data was split, with 90% forming the training set and 10% the validation set. The accuracy score from the validation set was reported, as this is more representative of how a model would perform on unseen data [28]. A confusion matrix was used to display the results of the accuracy evaluation. Such a matrix shows the different cases for each possible choice that either the manual or text mining approach can make. Accuracy was calculated using the formula: (TP + TN) / (TP + TN + FP + FN). In this case, TP is the true positive (i.e. where a code is present in both analyses), TN is the true negative (i.e. where a code is absent in both analyses), and FP is the false positive (i.e. where a code is predicted to be present but is absent in the manual analysis), while FN is the false negative (i.e. where a code is predicted to be absent but is present in the manual analysis). These components help us assess the accuracy of the model’s predictions and its performance overall [28].
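The split and the accuracy formula can be expressed concisely with scikit-learn, as in the sketch below; the labels are illustrative placeholders, not the study’s data.

    # Sketch: 90/10 train/validation split and accuracy = (TP + TN) / (TP + TN + FP + FN).
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

    segments = [f"segment {i}" for i in range(10)]  # placeholder text segments
    labels   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]       # 1 = code present, 0 = absent

    train_x, val_x, train_y, val_y = train_test_split(
        segments, labels, test_size=0.10, random_state=42
    )

    predictions = val_y  # stand-in for model output on val_x
    print(accuracy_score(val_y, predictions))    # equals (TP + TN) / total
    print(confusion_matrix(val_y, predictions))  # rows: manual; columns: predicted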

Consistency

In the consistency evaluation, both the manual and text mining approach were analysed to determine the consistency of each approach individually. When a coded text is consistent, the expected outcome is that each sentence that is semantically similar will be coded in the same way. A consistency evaluation was conducted by comparing the assigned themes or sentiment between similar sentences; for example, if two sentences were semantically very similar, then it would be expected that these sentences would also be coded with the same themes, and if two sentences were semantically very different, it would be less likely that these would be coded in the same way [30,32].
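A sketch of this idea, assuming a Sentence-BERT-style encoder [30], is given below; the encoder name and similarity cut-off are illustrative choices rather than the study’s exact settings.

    # Sketch: checking whether semantically similar segments received the same code.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    segments = ["De zorg hier is erg goed.", "De verzorging is hier prima."]
    codes = ["positive", "positive"]  # codes assigned by one approach

    embeddings = encoder.encode(segments, convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    if similarity > 0.8:  # treat the pair as semantically similar
        print("consistent" if codes[0] == codes[1] else "inconsistent")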

Expert feedback

To determine whether the output of the models was reliable and comparable to that of manual coding, feedback was collected from the original research experts. This information was collected from three of the research experts who coded the original data, all of whom worked at the Living Lab on Ageing and Long-Term Care for over 5 years. All their feedback was captured in an audio-recorded interview.

The research experts were shown three coded transcripts and were asked to give feedback on them. Without their knowledge, one of the transcripts shown to the research experts was an unmodified manually coded transcript (i.e. a transcript that contained the codes as previously assigned by the research experts themselves). After being shown each individual transcript, the research experts were asked to provide feedback on that transcript overall. Their feedback was then analysed to discover potential issues with the text mining approach.

Following this, the research experts were given one large transcript from the validation set in which they were shown both the manual and text mining versions next to each other. This type of comparison allowed them to comment on why the differences between the approaches arose. Their feedback was also used to highlight issues with the accuracy analysis.

Results

Accuracy

Sentiment analysis

The results show that the overall accuracy for the sentiment between the manual approach and the model was 81.8%. Fig 1 displays the results of the sentiment analysis in the form of a confusion matrix. It can be seen from the figure that most of the text in the transcripts was not coded with a sentiment, either through the manual process or through text mining. Manually coded text with a negative sentiment was recognised as positive by text mining in only 0.1% of cases, and only 0.3% of the text that was manually coded with a positive sentiment was recognised by text mining as negative. The average accuracy over all transcripts was 88.7%, with a standard deviation of 8.6%. The minimum accuracy was 52.1% and the maximum accuracy was 99.6%.

Fig 1. Confusion matrix comparing sentiment analysis results of the manual and text mining approach.


The matrix compares manual coding (rows) against text mining predictions (columns) for sentiment values of the text. Each cell within the matrix represents the percentage occurrence of a particular sentiment alignment (or misalignment) between the manual and text mining approaches. The y-axis of each matrix represents the sentiment as determined through manual analysis, while the x-axis indicates the text mining predictions. The diagonal cells (from top left to bottom right) illustrate the percentage of agreement between the two methods, whereas all off-diagonal cells indicate discrepancies. For instance, the cell at the intersection of the "Positive" row and the "Negative" column displays instances where text was manually coded as positive but was predicted as negative by text mining.

Thematic content analysis

A comparison was conducted between the manually coded INDEXQUAL themes and the codes predicted by the model; the results indicated that the model achieved an accuracy of 83.7%. Fig 2 shows the confusion matrices for the validation set. Across all themes, most of the text segments that were not coded by the manual approach were also not coded by the text mining approach. For the theme ‘Context’, the text mining approach assigned a code to a text segment much more often than the manual approach. The themes ‘Context’ and ‘Expectations’ were absent from most of the manually coded text (in 87.9% and 95.2% of cases, respectively). The themes ‘Experienced Quality of Care’ and ‘Experiences’ were identified correctly by the text mining approach in a higher percentage of text segments compared to ‘Context’ and ‘Expectations’; however, ‘Experienced Quality of Care’ and ‘Experiences’ also had higher rates of false positives and false negatives. False positives were cases where text mining incorrectly assigned a particular theme to a text segment, and false negatives were cases where text mining incorrectly failed to assign a theme. The average accuracy over all transcripts was 81.9%, with a standard deviation of 8.5%. The minimum accuracy of any transcript was 43.1% and the maximum was 93.4%.

Fig 2. Comparison of results from the thematic content analysis.


A confusion matrix is shown for each of the main INDEXQUAL themes (Experienced quality of care, Experiences, Expectations and Context). The y-axis of each matrix represents the presence or absence of a theme as determined through manual analysis, while the x-axis indicates the text mining predictions. Cells on the diagonals capture instances of agreement between manual coding and text mining for each theme. Off-diagonal cells detail discrepancies, indicating false positives or false negatives. Percentages within cells show the proportion of occurrences for each scenario in relation to the total dataset.

Consistency

Sentiment analysis

Consistency scores were calculated as part of the sentiment analysis, as shown in Table 1. The consistency score is a value between 0% and 100%, where a higher percentage indicates that the coding was more consistent [33]. On average, the transcripts coded using the sentiment analysis model were more consistent than those coded using the manual approach.

Table 1. Overview of the consistency of the manual and text mining approaches regarding the sentiment analysis.
Sentiment Manual (%) Text mining (%)
Positive 68.3 74.4
Negative 67.6 73.8

Thematic content analysis

As shown in Table 2, the text mining approach was more consistent when coding sentences related to experienced QoC and experiences; these were also the themes that occurred most often in the interviews. On average, the text mining approach was more consistent according to the current metric. While the results displayed low consistency overall, it should be noted that only limited context was taken into account; this increased the perceived similarity of sentences and therefore decreased the consistency scores.

Table 2. Overview of the consistency of the manual and text mining approaches regarding various themes.
Theme Manual (%) Text mining (%)
Experienced QoC 51.8 58.9
Experiences 54.0 59.1
Expectations 59.5 61.8
Context 59.4 62.2
Average 56.2 60.5

Expert feedback

Overall, the research experts expressed a mixed-to-positive assessment of the analysis of the transcripts. While they were most positive about the manually coded transcript, they were unable to distinguish it from the transcript coded by the text mining algorithm in the training set. In contrast, the text mining approach in the validation set was recognized by the research experts as having a lower level of accuracy (e.g., smaller coded text segments compared to the manual codes). The research experts identified certain themes, such as “Context” and “Expectations,” as posing greater difficulties for the algorithm, whereas other themes, such as “Experienced Quality of Care” and “Experiences,” were coded more similarly by both the algorithm and the research experts. The experts acknowledged that coding was generally a challenging task.

“I don’t find the coding to be poor. I notice that for the codes the text mining approach gets wrong, we’ve also had deliberations.”

“[Text mining] isn’t all perfect; however, it does allow us to analyse many more interviews.”

The research experts were presented with a transcript from the validation set, where both the manual and text mining versions were presented side by side to enable the research experts to explain the differences between the approaches. Most of the feedback from the research experts focused on codes that were similar between the two approaches or where the text mining approach incorrectly coded something. However, according to the experts, some codes were coded correctly by the text mining approach, but not by the manual approach.

“Yes, we’ve missed that one, seems logical to me.”

“Yes, [similar to the other] we missed that one as well.”

Although the instances of text mining finding errors in the manual codes were few, they negatively impacted the accuracy analysis. This is because such codes were regarded as false positives. Additionally, there was at least one instance where the text mining algorithm had coded the same information at a different location in the text.

“Here, the model applied the theme of quality of care [instead of where we coded it].”

Discussion

This study compared two approaches to coding text, a text mining approach and a manual approach, and carried out two types of analysis: a sentiment analysis and a thematic content analysis. The two approaches were compared in terms of their accuracy and consistency, and based on expert feedback. The results showed that most text segments were coded in a similar fashion by the two approaches. However, further analyses also showed that there were key differences in coding between the text mining approach and the manual approach in terms of accuracy and consistency.

The results of the accuracy analysis showed that the text mining models coded text with the same themes as the manual approach in more than 80% of cases. However, the numbers of false positives and false negatives were relatively high compared to the true positives, indicating that the actual similarity between the methods (i.e. for text containing more coded segments) may be lower. One of the reasons for the discrepancies between the manual and text mining approaches is that many manually coded text segments contain more than one theme; for example, 19% of all of the text coded with the theme ‘Experiences’ was also coded by the research experts with other themes, such as ‘Experienced quality of care’ or ‘Expectations’. The presence of overlapping themes in text can pose a challenge for text mining models, as it makes it more difficult to accurately determine which text characteristics correspond to each theme. In addition, the complexity and variability of natural language and the current limitations of text mining algorithms may also contribute to the lower accuracy of text mining models [34–37]. The variance in accuracy between transcripts suggests that lower accuracies could be due to factors that vary between transcripts, such as the quality of the transcription, the nature of the language used by the participants, or contextual factors that were not taken into account by the text mining or manual approach.

The results of a consistency analysis suggested that the current text mining models were able to produce more consistent codes for semantically similar sentences across all interviews compared to the manual analyses. However, the measured difference in consistency between the approaches was less than 5% on average. This could be explained by the fact that the text mining approach learned from the manual codes, and hence the text mining models also exhibited the same type of inconsistencies to a certain degree [38,39].

Feedback from the research experts suggested that text mining could be a valuable supplement to traditional qualitative analysis methods, and could provide a more efficient and objective way of analysing large amounts of text data [40,41]. However, the research experts were able to identify flaws in both methods of analysis. This could be because the research experts had more knowledge about the subject of the analyses and could therefore recognise wider patterns [42,43]. At the same time, it was difficult for the human experts to distinguish between the codes they had assigned manually and the codes that were assigned by the text mining model. When the experts were able to compare the codes created by the text mining approach with their own manual codes, they reported that they had missed certain text segments when they originally coded the interviews; these segments were discovered and coded by the text mining models. This finding suggests that text mining models could be helpful for manual analysis, as demonstrated by recent methods such as InstructGPT and MM-CoT [14–16]. These methods show that language models can aid in a variety of tasks, from writing cover letters to creating SPSS or Python scripts. However, these language models require human guidance to achieve the best results, as many of these tasks may be subject to human bias [38].

Using deep learning models, such as those highlighted in this study, offers a distinct advantage in terms of speed. While deep learning models can process and analyse data within seconds, manual analysis, depending on the complexity and volume of the data, can span weeks to even months [44]. However, it’s essential to recognize that the results from deep learning models might not always align perfectly with those of manual analysis. As such, researchers might find the need to fine-tune the outputs generated by text mining models. Despite this, the integration of deep learning significantly accelerates the qualitative analysis process, offering a more efficient alternative to traditional methods.

Some methodological limitations must be acknowledged. Firstly, in large parts of the interviews, no codes were identified by either the text mining or the manual approach. As a result, the average accuracy of the text mining models was higher than it might have been if the text had been coded with a higher density. Secondly, it is important to consider the limitations of the algorithm used to calculate sentence similarity, which has an accuracy limited to 66% for the classification of similar sentences [30]. This is also challenging because it is difficult to define which properties of a text segment are important in terms of semantic similarity. For example, given four sentences regarding a resident, a nurse, a resident’s family member, and a visiting doctor, it is possible to split them based on whether a person is a healthcare professional or not; however, it is also possible to split them based on whether a person is part of the nursing home staff or an outsider. Which property is more important to the similarity depends on factors such as the research question, and determining the similarity becomes more difficult with complex sentences. Moreover, it is important to consider the potential for human bias in qualitative analysis. Bias can arise from a variety of sources, including the research experts’ own preconceptions and assumptions, the sampling and recruitment of participants, and the methods and techniques used to collect and analyse data [36,37]. As the text mining model learns from inherently subjective data, it also learns to apply codes with the biases that exist in those data. While the expert feedback showed that few of these cases existed, such cases can negatively impact the evaluated accuracy of text mining models. Lastly, the analysis conducted in the current study only had a context window of 512 words at most, which represents a technical limitation of the method [21,22]; this limits the textual context that the models have access to. These issues can be mitigated by using large language models that are better able to capture the nuances and complexities of natural language (e.g. GPT-3) [25,37,45]. Such models can also handle a larger context of words: whereas RobBERT has a maximum context length of 512, GPT-3 has a context of 4,096. However, such large language models cannot be used on most personal computers, as they require specialised hardware to run efficiently (i.e. GPUs or TPUs with large amounts of memory) [46]. Using these via online (cloud) systems could give rise to issues regarding the privacy of the interview participants. However, recent advances have shown that ‘smaller’ (i.e. more efficient) large language models can achieve similar results, and such models, unlike GPT-3, can be used on personal computers [19,47].

Future work

Future research could focus on applying a hybrid approach that combines the text mining and manual methods. Using this approach, a text mining algorithm could be used to pre-process the text data and identify potential themes and patterns, which could then be reviewed and refined by human experts. This would allow for an efficient and objective analysis of large datasets, while also allowing for the expertise and knowledge of human experts to be incorporated. Future research should investigate whether this approach could help to reduce the potential for bias and improve the accuracy of the results.

Future work could compare multiple novel text mining models such as GPT-4 and LLaMA to show whether larger models can generate results that are better with respect to the context and more similar to the manual analysis. Comparing different models side-by-side could offer a useful way to visualize the main features and capabilities of each model, and could also facilitate the identification of any common weaknesses or limitations that may exist across some or all of the models being investigated. This could also enable the identification of areas where specific models may excel relative to others.

Conclusions

The current study shows that text mining can be an effective tool for quickly and accurately identifying sentiment and thematic content from large amounts of textual data. Text mining can help to reduce the amount of time and resources needed to analyse textual data, making it a valuable tool for analysing large amounts of qualitative data. However, as shown in the current study, text mining has certain limitations regarding language understanding; in its current state, text mining is no substitute for manual coding, but can be seen as a helpful addition.

Acknowledgments

The authors would like to thank the Data Science Research Infrastructure (DSRI) at Maastricht University. Without the use of their computational resources, the current study could not have been conducted. Moreover, thanks to the contributors of the “huggingface transformers” library, as the library provided many of the components for developing the models in the current study. Lastly, a special thanks to Katya Sion, Audrey Beaulen and Erica de Vries, the research experts who manually coded the transcripts and gave their feedback.

Data Availability

The code is available on Zenodo: https://zenodo.org/doi/10.5281/zenodo.8391746. Our interview data will not be made publicly available due to the privacy of our participants. Upon request, our interview data may be provided with restrictions. Data are available from the Living Lab in Ageing and Long-Term Care (contact via Sil Aarts, ouderenzorg@maastrichtuniversity.nl) for researchers who meet the criteria for access to confidential data.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Pols J. Enacting appreciations: Beyond the patient perspective. Health Care Analysis. 2005;13: 203–221. doi: 10.1007/s10728-005-6448-6
2. Sion K, Verbeek H, de Vries E, Zwakhalen S, Odekerken-Schröder G, Schols J, et al. The feasibility of connecting conversations: A narrative method to assess experienced quality of care in nursing homes from the resident’s perspective. International Journal of Environmental Research and Public Health. 2020;17: 5118. doi: 10.3390/ijerph17145118
3. Sion KY, Haex R, Verbeek H, Zwakhalen SM, Odekerken-Schröder G, Schols JM, et al. Experienced quality of post-acute and long-term care from the care recipient’s perspective – a conceptual framework. Journal of the American Medical Directors Association. 2019;20: 1386–1390. doi: 10.1016/j.jamda.2019.03.028
4. Delespierre T, Denormandie P, Bar-Hen A, Josseran L. Empirical advances with text mining of electronic health records. BMC Medical Informatics and Decision Making. 2017;17: 1–15.
5. Strauss A, Corbin J. Basics of qualitative research techniques. Thousand Oaks, CA: Sage Publications; 1998.
6. Norris N. Error, bias and validity in qualitative research. Educational Action Research. 1997;5: 172–176.
7. Mackieson P, Shlonsky A, Connolly M. Increasing rigor and reducing bias in qualitative research: A document analysis of parliamentary debates using applied thematic analysis. Qualitative Social Work. 2019;18: 965–980.
8. Hofmann M, Chisholm A. Text mining and visualization: Case studies using open-source tools. CRC Press; 2016.
9. Popowich F. Using text mining and natural language processing for health care claims processing. ACM SIGKDD Explorations Newsletter. 2005;7: 59–66.
10. Raja U, Mitchell T, Day T, Hardin JM. Text mining in healthcare. Applications and opportunities. J Healthc Inf Manag. 2008;22: 52–56.
11. Moqurrab SA, Ayub U, Anjum A, Asghar S, Srivastava G. An accurate deep learning model for clinical entity recognition from clinical notes. IEEE Journal of Biomedical and Health Informatics. 2021;25: 3804–3811. doi: 10.1109/JBHI.2021.3099755
12. Azeemi AH, Waheed A. Covid-19 tweets analysis through transformer language models. arXiv preprint arXiv:2103.00199. 2021.
13. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388: 1233–1239. doi: 10.1056/NEJMsr2214184
14. Thiergart J, Huber S, Übellacker T. Understanding emails and drafting responses – an approach using GPT-3. arXiv preprint arXiv:2102.03062. 2021.
15. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. 2022.
16. Zhang Z, Zhang A, Li M, Zhao H, Karypis G, Smola A. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923. 2023.
17. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
18. Percha B. Modern clinical text mining: A guide and review. Annual Review of Biomedical Data Science. 2021;4: 165–187. doi: 10.1146/annurev-biodatasci-030421-030931
19. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
20. Song H, Tolochko P, Eberl J-M, Eisele O, Greussing E, Heidenreich T, et al. In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Communication. 2020;37: 550–572.
21. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 2019.
22. Delobelle P, Winters T, Berendt B. RobBERT: A Dutch RoBERTa-based language model. Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics; 2020. pp. 3255–3265.
23. Sion K, Verbeek H, Aarts S, Zwakhalen S, Odekerken-Schröder G, Schols J, et al. The validity of connecting conversations: A narrative method to assess experienced quality of care in nursing homes from the resident’s perspective. International Journal of Environmental Research and Public Health. 2020;17: 5100. doi: 10.3390/ijerph17145100
24. Sion KYJ, Rutten JER, Hamers JPH, de Vries E, Zwakhalen SMG, Odekerken-Schröder G, et al. Listen, look, link and learn: A stepwise approach to use narrative quality data within resident-family-nursing staff triads in nursing homes for quality improvements. BMJ Open Quality. 2021;10. doi: 10.1136/bmjoq-2021-001434
25. VERBI Software. MAXQDA 2020 online manual. 2019. Available: maxqda.com/help-max20/welcome.
26. Yegnanarayana B. Artificial neural networks. PHI Learning Pvt. Ltd.; 2009.
27. Hotho A, Nürnberger A, Paaß G. A brief survey of text mining. LDV Forum. Citeseer; 2005. pp. 19–62.
28. Zhou Z-H. Machine learning. Springer Nature; 2021.
29. Kotsiantis SB, Zaharakis I, Pintelas P, et al. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering. 2007;160: 3–24.
30. Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2019. http://arxiv.org/abs/1908.10084.
31. Schrauwen S. Machine learning approaches to sentiment analysis using the Dutch Netlog corpus. Computational Linguistics and Psycholinguistics Research Center. 2010; 30–34.
32. Yin W, Hay J, Roth D. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. CoRR. 2019;abs/1909.00161. http://arxiv.org/abs/1909.00161.
33. Bölücü N, Can B, Artuner H. A siamese neural network for learning semantically-informed sentence embeddings. Expert Systems with Applications. 2023;214: 119103.
34. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2020. pp. 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
35. Floridi L, Chiriatti M. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines. 2020;30: 681–694.
36. Easton KL, McComish JF, Greenberg R. Avoiding common pitfalls in qualitative data collection and transcription. Qualitative Health Research. 2000;10: 703–707. doi: 10.1177/104973200129118651
37. Maycock M. “I do not appear to have had previous letters.” The potential and pitfalls of using a qualitative correspondence method to facilitate insights into life in prison during the Covid-19 pandemic. International Journal of Qualitative Methods. 2021;20: 16094069211047129.
38. Kim B, Kim H, Kim K, Kim S, Kim J. Learning not to learn: Training deep neural networks with biased data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 9012–9020.
39. Goyal A, Bengio Y. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A. 2022;478: 20210068.
40. Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems. 2019;32.
41. Zhong Q, Ding L, Zhan Y, Qiao Y, Wen Y, Shen L, et al. Toward efficient language model pretraining and downstream adaptation via self-evolution: A case study on SuperGLUE. arXiv preprint arXiv:2212.01853. 2022.
42. Fan A, Lavril T, Grave E, Joulin A, Sukhbaatar S. Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402. 2020.
43. Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. ACM Computing Surveys. 2022;55: 1–28.
44. Wang H. Efficient algorithms and hardware for natural language processing. PhD thesis, Massachusetts Institute of Technology; 2020.
45. BigScience Workshop: Scao TL, Fan A, Akiki C, Pavlick E, et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv; 2022.
46. Rajbhandari S, Ruwase O, Rasley J, Smith S, He Y. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021. pp. 1–14.
47. Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, Guestrin C, et al. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository: https://github.com/tatsu-lab/stanford_alpaca; 2023.

Decision Letter 0

Corinne Jola

26 Jul 2023

PONE-D-23-11158
A comparison of text mining and manual coding methods in long-term care for older adults regarding quality of care
PLOS ONE

Dear Dr. Hacking,

Thank you for submitting your manuscript to PLOS ONE and for your patience in awaiting the reviewers' response. Both reviewers noted the importance of your work but after careful consideration, we feel that whilst your manuscript has merit, it does not fully meet PLOS ONE’s publication criteria as it currently stands. I shared the reviewers' view on the applied relevance of your work but also felt that methodological information and background/references about the methodological approaches and analyses were limited and at times confusing, thus impacting the reproducibility of your work. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please ensure that your decision is justified on PLOS ONE’s publication criteria and not, for example, on novelty or perceived impact. For a potential acceptance of your submission, we expect you to address all concerns raised by the two reviewers that are possible based on the data you have collected for this submission. For concerns where you would require additional data collection but are not in a position to do so, please consider their points in your response and potentially in the limitation section of your manuscript. 

Please submit your revised manuscript by Sep 09 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Corinne Jola

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for this possibility to peer review your interesting and important study paper. The study aims to understand the potential usefulness of text mining for qualitative research in long-term care for older adults. It compares text mining approach to manual approach regarding sentiment analysis and thematic content analysis in terms of accuracy, consistency, and expert feedback. This study is meaningful for the reasons which the authors state in their paper - manual analysis of large free-text-data-sets is time consuming and has risk of objectivity bias. Thus new effective analysis methods are needed.

These peer review comments are given from the point of view of non-native English user and nursing scientist who has used text-mining method in own research.

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

-Line 1: The title might be stronger if the "interview data" would be stated in the title

-Line 38: In the abstract conclusions, I am not sure if you can say, based on your results, that text mining is potential for large data sets, as your sample was just couple of hundreds interview texts. I would prefer to conclude, that text mining is potential for analysis of free text data in the context of LTC for older adults. For future work you might suggest to test the method in large data sets (e.g. thousands-tens of thousands).

-The study presents results of original research and the sections are constructed according to journal's guidelines and are named and organized in logical order.

-The introduction section describes the judgement for the need of this study and states the aim of the study.

-Line 75: There should be reference for the gold standard of manual coding.

-Materials and Methods paragraph: You could add here your study design.

-Lines 85-86: References for the two Text mining models should be presented.

- Line 93 in Sample and Participants paragraph: A total of n=250 is confusing as earlier was said different n-numbers. In addition in Line 105 and Line 108 -> are presented different numbers. The numbers in different places needs more clarification for transparency.

Should you describe also the experts' interviews in this section?

-Line 94: Who were interviewing the participants? How many interviewers?

2. Has the statistical analysis been performed appropriately and rigorously?

-Line 113: Is the MAXQDA coding system free of charge or commercial software? Can you describe this?

-Line 121-123: Move this sentence to limitation section in discussion " Both models were capable of identifying where and how to code a text segment using a context of 512 words at most, which represents a technical limitation of the method [17,18]"

-Line 125: Who coded the sentiment analysis? One, two or more researchers?

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

-Line 160: Could you present sample table of accuracy scores and confusion matrix?

-Line 169-171: What was semantics consistency scale between very similar - very different?

-Line 176-177: Research experts interview were not described in methods section.

-Line 199-202: The sentence is unclear: "The text mining model was less accurate in determining what may be relevant to the organisation of a nursing home, as text which was not coded manually was often coded by text mining as either negative (4.3%) or positive (5.2%)."

-Line 278: Discussion: The text is logical and clear in discussion, however I would suggest to add comparison of your results with previous studies. You have some references presented, but not the comparison or mirroring of your own results towards other studies.

-Line 328...: Could you present some health care related example instead of the phrase ".... For example, given four sentences regarding a cat, a dog, a lion, and a wolf, it is ...."

-Line 359 in Conclusions section: Here also you use wording "large data sets" ... preferably should not overestimate suitability for large data sets based on your results, just suitability for free text analysis in this context.

-References: Majority of the references are of high quality and from the last five years. However, 1/4 of the references are older than 10 years and the peer-review of the reference is missing or unclear among the following references: Lines: 393, 400, 412, 417, 419, 428, 435, 447, 456, 459, 463, 465, 468, 472 and 475.

The references with "arXiv preprint arXiv" are submissions to a computer science preprint database, but might not yet be peer reviewed or might even have been rejected from publication after submission. You should try to find the peer-reviewed versions of these articles and write the references according to the accepted/published versions of the papers.

-Figure 1 and Figure 2: These figures do not open up for me. They need revision and clarification. At the moment, it looks to me as if text mining and manual analysis had exactly the same statistical values. I do not know how to read the figures. In addition, the figures should be understandable as stand-alone, without reading the manuscript text.

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

-As I am not a native English user, I do not have comments on the English grammar or typography.

I hope these comments help you to further develop your paper and make it even stronger.

This paper deserves to be published, but needs some more detail for transparency and to enable replication of your method.

Reviewer #2: Summary Statement: This reviewer thanks the authors for their submission. Indeed, evaluation of information from clients in long-term care is relevant, as are methods to assess their quality-of-care experience. More specifically, the authors aim to evaluate whether text mining can be accurate, consistent, and similar to expert review in both sentiment analysis and thematic content analysis. The manuscript is straightforward and quite readable, but could be improved with further methodological and quantitative depth, as well as specific details in the discussion about how to improve the next study in the domain.

Strengths where no changes are required:

1) The methods of inclusion are well described and include a robust number of participants (n=250) with written informed consent from 5 sites, which seems adequate and appropriate for the evaluation.

2) The method of grouping themes into (14) key areas seemed appropriate and was well described.

3) Evaluation of the text mining was performed using accuracy, consistency, and expert review which seemed appropriate.

Weaknesses and areas of the manuscript that could be improved through further efforts:

1) The types of text mining that are used later in the methods section could be further scientifically described in the introduction:

Authors could better represent the current challenges with coding of qualitative data, with a clear description of methods and their performance characteristics. We recommend adding 2-3 references in the introduction that outline challenges in current methods with more specificity. Other authors have described the text mining methods in greater detail, such as in the following (or alternatives): Pranita Mahajan, Dipti P. Rana; International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-2S, December 2019; or Annu Rev Biomed Data Sci. 2021 Jul 20;4:165-187. doi: 10.1146/annurev-biodatasci-030421-030931. Epub 2021 May 26. Additionally, further details about text mining models, including InstructGPT and MM-CoT, would be appropriate to cover in 1 sentence in the background/introduction.

2) While the authors used a Dutch-language model, only a limited context window was available (n=512 words). While this may not require a modification, the authors should further elaborate in the discussion on the implications for the findings.

3) Expert feedback is inadequately described and should include a sentence stating exactly in what capacity the individuals are considered experts. The statistical methods utilized in the comparison should also be briefly presented in the methods.

4) The authors describe the difficulties inherent in the analyses, where discrepancies can also be due to the multiple themes that are present, but could expand in the discussion upon how this can be mitigated.

5) Overall, the manuscript could be improved through additional enrichment of the quantitative findings as well as depth in the methodological approach.

Minor Editing Recommendations:

1) Correction recommendation: In the introduction, there is a period that needs to be replaced by a comma, line 48:

To be able to analyze these data, 48 researchers often conduct a so-called coding analysis [2,5]. which involves manually.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Nov 8;18(11):e0292578. doi: 10.1371/journal.pone.0292578.r002

Author response to Decision Letter 0


9 Sep 2023

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

We’ve checked and updated the styling accordingly.

2. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

Ethical restrictions apply here, as the data contains stories regarding the lives of residents, from the perspective of clients themselves, family and care professionals. Therefore, even if the names and other personally identifiable details were removed, the stories could still be linked to one of the residents. Because of the nature of the data, we have opted to not release it publicly. The data can still be inspected upon request through the AWO-L (s.aarts@maastrichtuniversity.nl).

We will update your Data Availability statement on your behalf to reflect the information you provide.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

The interview data won’t be shared publicly, but are available on request. However, the code and the models will become available on Zenodo and GitHub.

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for the opportunity to peer review your interesting and important study paper. The study aims to understand the potential usefulness of text mining for qualitative research in long-term care for older adults. It compares a text mining approach to a manual approach regarding sentiment analysis and thematic content analysis in terms of accuracy, consistency, and expert feedback. This study is meaningful for the reasons the authors state in their paper: manual analysis of large free-text data sets is time-consuming and puts objectivity at risk. Thus, new effective analysis methods are needed.

Dear reviewer, thank you for your time and expertise in thoroughly reviewing our manuscript. Your detailed feedback has provided us with valuable insights and improved the comprehensibility of the manuscript. We appreciate the statement regarding the meaningfulness of our research, as we strive to explore efficient analysis methods that can overcome the challenges posed by manual analysis. We’ve made the necessary adjustments to the manuscript to reflect your feedback.

These peer review comments are given from the point of view of a non-native English user and a nursing scientist who has used the text-mining method in their own research.

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

-Line 1: The title might be stronger if "interview data" were stated in the title.

The title has been adjusted to: “Comparing text mining and manual coding methods: analyzing interview data on quality of care in long-term care for older adults.”

-Line 38: In the abstract conclusions, I am not sure if you can say, based on your results, that text mining has potential for large data sets, as your sample was just a couple of hundred interview texts. I would prefer to conclude that text mining has potential for the analysis of free-text data in the context of LTC for older adults. For future work you might suggest testing the method on large data sets (e.g. thousands to tens of thousands).

While our model was not tested on many thousands of samples, we can extrapolate from the literature on which our model was based [4]. Moreover, when conducting the analysis for this study, it took less than a minute to analyse the entire dataset on our hardware (an RTX 2060). We’ve added a paragraph to the discussion to clarify this: “Using deep learning models, such as those highlighted in this study, offers a distinct advantage in terms of speed. While deep learning models can process and analyse data within seconds, manual analysis, depending on the complexity and volume of the data, can span weeks to even months [41]. However, it's essential to recognize that the results from deep learning models might not always align perfectly with those of manual analysis. As such, researchers might find the need to fine-tune the outputs generated by text mining models. Despite this, the integration of deep learning significantly accelerates the qualitative analysis process, offering a more efficient alternative to traditional methods.” (p 19, l 362)

4. Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. 2020.

-The study presents results of original research, and the sections are constructed according to the journal's guidelines and are named and organized in logical order.

Thank you.

-The introduction section describes the justification for the need for this study and states the aim of the study.

Thank you.

-Line 75: There should be a reference for the gold standard of manual coding.

We have added a reference for this:

Song H, Tolochko P, Eberl JM, Eisele O, Greussing E, Heidenreich T, Lind F, Galyga S, Boomgaarden HG. In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Communication. 2020 Jul 3;37(4):550-72.

-Materials and Methods paragraph: You could add your study design here.

We have added the ‘Study design’ heading.

-Lines 85-86: References for the two text mining models should be presented.

Both the sentiment analysis model and the thematic content analysis model were based on the same base model (i.e. RobBERT). We’ve added the appropriate reference:

Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. 2020.
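
As an illustration of this setup, a classifier built on RobBERT can be loaded with the Hugging Face transformers library roughly as follows. This is a minimal sketch, not our exact pipeline: the checkpoint name and the three-label sentiment scheme are assumptions made for the example.

    # Hypothetical sketch of a RobBERT-based classifier (not the authors'
    # exact pipeline). The checkpoint name and label count are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_ID = "pdelobelle/robbert-v2-dutch-base"  # assumed RobBERT checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        num_labels=3,  # e.g. negative / neutral / positive (assumed labels)
    )
    model.eval()

    def classify_segment(segment: str) -> int:
        """Return the index of the most probable label for one text segment."""
        inputs = tokenizer(segment, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return int(logits.argmax(dim=-1))

In practice such a model would still need fine-tuning on labelled interview segments before the predicted labels are meaningful.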

- Line 93 in Sample and Participants paragraph: A total of n = 250 is confusing, as different n-numbers were given earlier. In addition, different numbers are presented in Line 105 and Line 108. The numbers in these different places need more clarification for transparency.

The manuscript was adjusted to make this clearer. “A total of n = 250 interviews were conducted at five different LTC organizations in the southern part of the Netherlands. Of those interviews, 234 were transcribed; 16 could not be transcribed due to poor audio quality. Of the 234 transcribed interviews, 61 were analysed manually using thematic content analysis. In addition, 103 interviews were analysed manually using sentiment analysis. All analyses in the manuscript were performed using those 61 and 103 interviews for the thematic content analysis and sentiment analysis, respectively.” (p 7, l 104)

Should you also describe the experts' interviews in this section?

We have altered this part of the method section. "The data were coded by three research experts, each working in the Living Lab on Ageing and Long-Term Care for over 5 years. All these experts have a minimum of ten years of experience in conducting qualitative research." (p 8, l 126)

-Line 94: Who interviewed the participants? How many interviewers were there?

To be more precise about who the interviewers were, we have added this information. "A total of n = 35 interviewers conducted the interviews. These interviewers were part of the project 'Connecting Conversations,' which aims to assess the experienced quality of care in nursing homes from the resident’s perspective. They primarily come from a long-term care setting and have received specialized training to conduct these interviews. For a more comprehensive understanding of the 'Connecting Conversations' project, see Sion et al. 2020a." (p 7, l 112) [2].

2. Has the statistical analysis been performed appropriately and rigorously?

-Line 113: Is the MAXQDA coding system free of charge or commercial software? Can you describe this?

MAXQDA is commercial software, although free alternatives exist.

-Line 121-123: Move this sentence to the limitations section in the discussion: "Both models were capable of identifying where and how to code a text segment using a context of 512 words at most, which represents a technical limitation of the method [17,18]"

We have adjusted this in the limitations section: “Lastly, the analysis conducted in the current study had a context window of at most 512 words, which represents a technical limitation of the method [17,18]. This limits the textual context that the models have access to. These issues can be mitigated by using large language models that are better able to capture the nuances and complexities of natural language (e.g. GPT-3) [25,37]. Such models can also handle a larger context of words. Whereas RobBERT has a maximum context length of 512, GPT-3 has a context of 4,096. However, such large language models cannot be used on most personal computers, as they require specialised hardware to run efficiently (i.e. GPUs or TPUs with large amounts of memory) [38]. Using these via online (cloud) systems could give rise to issues regarding the privacy of the interview participants. However, recent advances have shown that ‘smaller’ (i.e. more efficient) large language models can achieve similar results, and these models can be used on personal computers, unlike GPT-3 [39,40].” (p 20, l 388)
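
To make the practical consequence of this limit concrete: a transcript longer than the context window has to be processed in pieces. One common mitigation, sketched below under the assumption of a simple overlapping-window strategy (the window and stride sizes are arbitrary illustrative choices), is to slide a window over the token sequence so that no segment exceeds the model's maximum length.

    # Illustrative only: split a long transcript into overlapping windows so
    # that each piece fits a 512-token context. Window and stride sizes are
    # arbitrary choices for this sketch.
    def split_into_windows(tokens: list, window: int = 512, stride: int = 256):
        """Return overlapping token windows; the overlap preserves some
        context across chunk boundaries."""
        chunks = []
        start = 0
        while True:
            chunks.append(tokens[start:start + window])
            if start + window >= len(tokens):
                break  # the last window reaches the end of the transcript
            start += stride
        return chunks

    # A 1,000-token transcript yields windows starting at tokens 0, 256 and 512.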

-Line 125: Who coded the sentiment analysis? One, two or more researchers?

Transcripts were coded by three researchers. This information was added to the manuscript.

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

-Line 160: Could you present sample table of accuracy scores and confusion matrix?

The authors are unsure what the reviewer means by this specific comment. Confusion matrices are shown in Figures 1 and 2. We have added additional statistics to the manuscript, including the average accuracy over all transcripts and the standard deviations.

For the sentiment analysis we added the line: “The average accuracy over all transcripts was 88.7%, with a standard deviation of 8.6%. The minimum accuracy was 52.1% and the maximum accuracy was 99.6%.” (p 13, l 227) and for the thematic content analysis: “The average accuracy over all transcripts was 81.9%, with a standard deviation of 8.5%. The minimum accuracy of any transcript was 43.1% and the maximum was 93.4%.” (p 14, l 256) We have elaborated on this in the discussion: “The variance of the accuracy between transcripts shows that a possible reason for lower accuracies could be factors that vary between transcripts, such as the quality of the transcription, the nature of the language used by the participants, or contextual factors that were not taken into account by the text mining or manual approach.” (p 18, l 335)

-Line 169-171: What was the semantic consistency scale between 'very similar' and 'very different'?

We’ve adjusted the text to clarify this: “Semantic similarity is a value between 0% and 100%, where a higher percentage indicates that the results were more consistent.” In addition, we included a reference [5]. (p 14, l 271).

5. Bölücü, Necva, Burcu Can, and Harun Artuner. "A Siamese neural network for learning semantically-informed sentence embeddings." Expert Systems with Applications, 214, 2023: 119103.
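
To illustrate how such an embedding-based score can be expressed on a 0-100% scale, consider the following sketch. The linear rescaling of cosine similarity from [-1, 1] to [0%, 100%] is an assumption made for this illustration, not necessarily the exact mapping used in the study.

    # Hypothetical sketch: cosine similarity of two sentence embeddings,
    # rescaled to the 0-100% consistency scale (the rescaling is assumed).
    import numpy as np

    def semantic_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity of two embedding vectors, expressed as 0-100%."""
        cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return (cos + 1.0) / 2.0 * 100.0

    # Identical embeddings score 100%; orthogonal embeddings score 50%.
    v = np.array([0.2, 0.7, 0.1])
    print(semantic_similarity(v, v))  # -> 100.0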

-Line 176-177: The research experts' interviews were not described in the methods section.

As stated before, more information regarding the experts is provided in the methods section.

-Line 199-202: The sentence is unclear: "The text mining model was less accurate in determining what may be relevant to the organisation of a nursing home, as text which was not coded manually was often coded by text mining as either negative (4.3%) or positive (5.2%)."

This was clarified by changing the sentence to: “Text mining often coded text as positive (4.3%) or negative (5.2%) that was not coded in the manual analysis”.

-Line 278: Discussion: The text in the discussion is logical and clear; however, I would suggest adding a comparison of your results with previous studies. You have presented some references, but not a comparison or mirroring of your own results against other studies.

Thank you for the statement regarding the clarity of the discussion. In many studies, a model’s performance is compared to that of similar models on public datasets. However, in the current manuscript we are working with a non-public dataset that has its own unique characteristics. This makes direct comparison with other studies impossible, as there are no established benchmarks or standards for our specific dataset.

-Line 328...: Could you present some health care related example instead of the phrase ".... For example, given four sentences regarding a cat, a dog, a lion, and a wolf, it is ...."

The reviewer stated an important point here. Hence, this has been altered in the manuscript. “For example, given four sentences regarding a resident, a nurse, a resident's family member, and a visiting doctor, it is possible to split these sentences based on whether a person is a healthcare professional or not; however, it is also possible to split them based on whether a person is part of the nursing home staff.” (p 20, l 377)

-Line 359 in Conclusions section: Here you also use the wording "large data sets"; preferably you should not overestimate suitability for large data sets based on your results, just suitability for free-text analysis in this context.

We have addressed this in an earlier comment and adjusted the discussion. We’ve added a paragraph to the discussion to clarify this: “Using deep learning models, such as those highlighted in this study, offers a distinct advantage in terms of speed. While deep learning models can process and analyse data within seconds, manual analysis, depending on the complexity and volume of the data, can span weeks to even months [3]. However, it's essential to recognize that the results from deep learning models might not always align perfectly with those of manual analysis. As such, researchers might find the need to fine-tune the outputs generated by text mining models. Despite this, the integration of deep learning significantly accelerates the qualitative analysis process, offering a more efficient alternative to traditional methods.” (p 19, l 362)

-References: The majority of the references are of high quality and from the last five years. However, 1/4 of the references are older than 10 years

We appreciate your acknowledgment of the quality and recency of the majority of our references. Regarding the older references, it is crucial to highlight that, while currency in citations is often indicative of relevance in fast-evolving fields, foundational works can remain pertinent long after their publication, as they form the basis of our models. In our selection of references, the older citations were included to provide historical context or foundational understanding, or to reference methodologies and theories that remain central to the topic even after a decade or more. We’ve added the following references:

1. Maycock, M. ‘I Do Not Appear to Have had Previous Letters’. The Potential and Pitfalls of Using a Qualitative Correspondence Method to Facilitate Insights Into Life in Prison During the Covid-19 Pandemic. International Journal of Qualitative Methods, 2021, 20: 16094069211047129.

2. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233-1239.

and the peer-review status is missing or unclear for the following references: Lines 393, 400, 412, 417, 419, 428, 435, 447, 456, 459, 463, 465, 468, 472 and 475.

The references with "arXiv preprint arXiv" are submissions to a computer science preprint database, but might not yet be peer reviewed or might even have been rejected from publication after submission. You should try to find the peer-reviewed versions of these articles and write the references according to the accepted/published versions of the papers.

Many cutting-edge methods, particularly in the field of computer science, are initially published on platforms like arXiv. While these submissions might not have undergone the traditional scientific peer-review processes, they often come from reputable researchers or institutions and are accompanied by transparent resources such as source code and model weights. This open-access approach allows the broader community to inspect, validate, and build upon the work. We acknowledge the importance of peer-reviewed articles and always strive to reference them when available. However, given the rapid advancements in the field, we also believe in the value of these preprints as they represent the latest developments. Hence, we decided to keep the included references (as they are the foundation of our models) but also add some new, peer-reviewed references:

“43. Maycock, M. ‘I Do Not Appear to Have had Previous Letters’. The Potential and Pitfalls of Using a Qualitative Correspondence Method to Facilitate Insights Into Life in Prison During the Covid-19 Pandemic. International Journal of Qualitative Methods, 2021, 20: 16094069211047129.

44. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233-1239.”

-Figure 1 and Figure 2: These figures do not open up for me. They need revision and clarification. At the moment, it looks to me as if text mining and manual analysis had exactly the same statistical values. I do not know how to read the figures. In addition, the figures should be understandable as stand-alone, without reading the manuscript text.

We’ve updated the captions of the figures to improve their clarity. These captions should allow the reader to understand the figures without reading the rest of the manuscript. (p 13, l 230)

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

-As I am not a native English user, I do not have comments on the English grammar or typography.

I hope these comments help you to further develop your paper and make it even stronger.

This paper deserves to be published, but needs some more detail for transparency and to enable replication of your method.

Thank you for your constructive feedback. We believe the manuscript has improved greatly based on this feedback.

Reviewer #2: Summary Statement: This reviewer thanks the authors for their submission. Indeed, evaluation of information from clients in long-term care is relevant, as are methods to assess their quality-of-care experience. More specifically, the authors aim to evaluate whether text mining can be accurate, consistent, and similar to expert review in both sentiment analysis and thematic content analysis. The manuscript is straightforward and quite readable, but could be improved with further methodological and quantitative depth, as well as specific details in the discussion about how to improve the next study in the domain.

Strengths where no changes are required:

1) The methods of inclusion are well described and include a robust number of participants (n=250) with written informed consent from 5 sites, which seems adequate and appropriate for the evaluation.

2) The method of grouping themes into (14) key areas seemed appropriate and was well described.

3) Evaluation of the text mining was performed using accuracy, consistency, and expert review which seemed appropriate.

We deeply appreciate your time and constructive feedback on our manuscript. We’ve carefully considered your comments and have adjusted the manuscript accordingly.

Weaknesses and areas of the manuscript that could be improved through further efforts:

1) The types of text mining that are used later in the methods section could be further scientifically described in the introduction:

Authors could better represent the current challenges with coding of qualitative data, with a clear description of methods and their performance characteristics. We recommend adding 2-3 references in the introduction that outline challenges in current methods with more specificity. Other authors have described the text mining methods in greater detail, such as in the following (or alternatives): Pranita Mahajan, Dipti P. Rana; International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-9 Issue-2S, December 2019; or Annu Rev Biomed Data Sci. 2021 Jul 20;4:165-187. doi: 10.1146/annurev-biodatasci-030421-030931. Epub 2021 May 26. Additionally, further details about text mining models, including InstructGPT and MM-CoT, would be appropriate to cover in 1 sentence in the background/introduction.

Thank you for the insightful comments regarding the need for a more detailed discussion of text mining methods in the introduction section of our manuscript. We agree that clarifying the methods used and citing relevant challenges in the current methodologies can enrich the manuscript. Currently, Large Language Models (LLMs) are outperforming older rule-based and machine learning methods [6]. We have added the following to the introduction: “Recent advancements in the field of text mining have ushered in a variety of new techniques, each with its unique focus and application [34–36, 39, 46, 47]. Some models are particularly good at generating context-aware, human-like text (e.g. InstructGPT), while others excel at incorporating multi-modal data, such as text and images, for a more comprehensive analysis [34, 35, 36]. Moreover, there is a growing emphasis on adapting these models to run efficiently on consumer hardware [39]. Despite these strides in technology, there are still significant challenges in achieving the level of accuracy required for some tasks, and in many cases, human expertise continues to outperform automated methods [39].” (p 5, l 75)

6. Zhong Q, Ding L, Zhan Y, Qiao Y, Wen Y, Shen L, et al. Toward efficient language model pretraining and downstream adaptation via self-evolution: A case study on SuperGLUE. arXiv preprint arXiv:2212.01853. 2022.

2) While the authors used a Dutch-language model, only a limited context window was available (n=512 words). While this may not require a modification, the authors should further elaborate in the discussion on the implications for the findings.

We have adjusted this in the limitations section: “Lastly, the analysis conducted in the current study had a context window of at most 512 words, which represents a technical limitation of the method [17,18]. This limits the textual context that the models have access to. These issues can be mitigated by using large language models that are better able to capture the nuances and complexities of natural language (e.g. GPT-3) [25,37]. Such models can also handle a larger context of words. Whereas RobBERT has a maximum context length of 512, GPT-3 has a context of 4,096. However, such large language models cannot be used on most personal computers, as they require specialised hardware to run efficiently (i.e. GPUs or TPUs with large amounts of memory) [38]. Using these via online (cloud) systems could give rise to issues regarding the privacy of the interview participants. However, recent advances have shown that ‘smaller’ (i.e. more efficient) large language models can achieve similar results, and these models can be used on personal computers, unlike GPT-3 [39,40].” (p 20, l 388)

3) Expert feedback is inadequately described and should include a sentence stating exactly in what capacity the individuals are considered experts.

We have altered this part of the method section. "The data were coded by three research experts, each working in the Living Lab on Ageing and Long-Term Care for over 5 years. All these experts have a minimum of ten years of experience in conducting qualitative research." (p 8, l 126)

The statistical methods utilized in the comparison should also be briefly presented in the methods.

We added the text: “Accuracy was calculated using the formula: (TP + TN) / (TP + TN + FP + FN). In this case, TP is the true positive (i.e. where a code is present in both analyses), TN is the true negative (i.e. where a code is absent in both analyses), and FP is the false positive (i.e. where a code is predicted to be present but is absent in the manual analysis), while FN is the false negative (i.e. where a code is predicted to be absent but is present in the manual analysis). These components help us assess the accuracy of the model's predictions and its performance overall.” (p 10, l 184) A reference was included:

Zhou Z-H. Machine Learning. Springer Nature 2021. Available from: https://doi.org/10.1007/978-981-15-1967-3 [Accessed April 12, 2023].
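
A small worked example of this formula, with invented counts purely for illustration:

    # Accuracy as defined in the methods: (TP + TN) / (TP + TN + FP + FN).
    # The counts below are invented for illustration only.
    def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
        return (tp + tn) / (tp + tn + fp + fn)

    # Suppose 70 segments carry a code in both analyses (TP), 15 in neither
    # (TN), 10 only in the text mining output (FP) and 5 only in the manual
    # coding (FN): accuracy = (70 + 15) / 100 = 0.85.
    assert accuracy(tp=70, tn=15, fp=10, fn=5) == 0.85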

For the consistency analysis we’ve added references to relevant literature that uses vector embeddings to check similarity:

Bölücü, Necva, Burcu Can, and Harun Artuner. "A Siamese neural network for learning semantically-informed sentence embeddings." Expert Systems with Applications, 214, 2023: 119103.

For the expert feedback, we didn’t have any quantitative results to apply statistics to.

4) The authors describe the difficulties inherent in the analyses, where discrepancies can also be due to the multiple themes that are present, but could expand in the discussion upon how this can be mitigated.

The reviewer raises an interesting and important point here. The problem with multiple themes is very difficult to mitigate for the current dataset, as the themes are all highly correlated. In the ‘Limitations’ section we suggest that larger models that have been trained for longer are more capable of distinguishing between the nuances in the text. Other possibilities to mitigate the issue are using more data or creating themes that are split more clearly. Data quality therefore also plays a role.

5) Overall, the manuscript could be improved through additional enrichment of the quantitative findings as well as depth in the methodological approach.

We’ve clarified the statistical methods in the ‘methods’ section. Moreover, we’ve added the average and standard deviation of the accuracy scores over all transcripts.

For the sentiment analysis we added the line: “The average accuracy over all transcripts was 88.7%, with a standard deviation of 8.6%. The minimum accuracy was 52.1% and the maximum accuracy was 99.6%.” (p 13, l 227) and for the thematic content analysis: “The average accuracy over all transcripts was 81.9%, with a standard deviation of 8.5%. The minimum accuracy of any transcript was 43.1% and the maximum was 93.4%.” (p 14, l 256) We have elaborated on this in the discussion: “The variance of the accuracy between transcripts shows that a possible reason for lower accuracies could be factors that vary between transcripts, such as the quality of the transcription, the nature of the language used by the participants, or contextual factors that were not taken into account by the text mining or manual approach.” (p 18, l 335)

Minor Editing Recommendations:

1) Correction recommendation: In the introduction, there is a period that needs to be replaced by a comma, line 48:

To be able to analyze these data, 48 researchers often conduct a so-called coding analysis [2,5]. which involves manually.

Thank you for your thorough review. We’ve altered this.

References

1. Song H, Tolochko P, Eberl JM, Eisele O, Greussing E, Heidenreich T, Lind F, Galyga S, Boomgaarden HG. In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Communication. 2020 Jul 3;37(4):550-72.

2. Sion K, Verbeek H, de Vries E, Zwakhalen S, Odekerken-Schröder G, Schols J, Hamers J. The Feasibility of Connecting Conversations: A Narrative Method to Assess Experienced Quality of Care in Nursing Homes from the Resident's Perspective. Int J Environ Res Public Health. 2020 Jul 15;17(14):5118. doi: 10.3390/ijerph17145118. PMID: 32679869; PMCID: PMC7400298.

3. Wang H. Efficient algorithms and hardware for natural language processing (Doctoral dissertation, Massachusetts Institute of Technology). 2020.

4. Delobelle P, Winters T, Berendt B. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286. 2020.

5. Bölücü, Necva, Burcu Can, and Harun Artuner. "A Siamese neural network for learning semantically-informed sentence embeddings." Expert Systems with Applications 214 (2023): 119103.

6. Zhong Q, Ding L, Zhan Y, Qiao Y, Wen Y, Shen L, et al. Toward efficient language model pretraining and downstream adaptation via self-evolution: A case study on SuperGLUE. arXiv preprint arXiv:2212.01853. 2022.

Attachment

Submitted filename: Response to reviewers.docx

Decision Letter 1

Baby Gobin

25 Sep 2023

Comparing text mining and manual coding methods: analysing interview data on quality of care in long-term care for older adults

PONE-D-23-11158R1

Dear Dr. Hacking,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Baby Gobin

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Authors,

Thank you for your thorough and detailed responses to all of the presented questions and suggestions. You have made your article stronger. Also, your reasoning for using the references with “arXiv preprint arXiv” is acceptable. I am delighted to have had the possibility to review this excellent scientific paper advancing evidence about text mining possibilities in health sciences research. I wish you interesting research projects in the future, and I am looking forward to reading your further publications!

Reviewer #2: Summary Statement: This reviewer thanks the authors for their revision of the submission. The authors have addressed each reviewer concern point by point, with careful and thoughtful attention to every detail. Specifically, they have improved the description of the methodological and quantitative approach as well as the discussion about how to improve the next study in the domain. As a result, the overall manuscript reads much better and contributes to the knowledge in this important domain.

(No Change in Comments) Strengths where no changes are required:

1) The methods of inclusion are well described and include a robust number of participants (n=250) with written informed consent from 5 sites, which seems adequate and appropriate for the evaluation.

2) The method of grouping themes into (14) key areas seemed appropriate and was well described.

3) Evaluation of the text mining was performed using accuracy, consistency, and expert review which seemed appropriate.

Weaknesses and areas of the manuscript that could be improved through further efforts were addressed to the degree appropriate in the revision:

1) The types of text mining that are used later in the methods section could be further scientifically described in the introduction:

Authors have added the recommended references in the introduction that specifically outline challenges in current methods with more specificity.

2) While the authors used a Dutch-language model, only a limited context window was available (n=512 words). The authors have adequately elaborated on the implications for the findings.

3) Authors have added further details related to the expert feedback and their capacity to do so.

4) Authors have further expanded upon how the discrepancies can be mitigated in the discussion.

5) Authors have further enriched the quantitative findings as well as depth in the methodological approach.

Minor Editing Recommendations:

Authors have made the appropriate minor editing changes, and no further changes to recommend.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Amy M Sitapati, MD

**********

Acceptance letter

Baby Gobin

23 Oct 2023

PONE-D-23-11158R1

Comparing text mining and manual coding methods: analysing interview data on quality of care in long-term care for older adults

Dear Dr. Hacking:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Baby Gobin

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to reviewers.docx

    Data Availability Statement

    The code is now available on Zenodo: https://zenodo.org/doi/10.5281/zenodo.8391746. Our interview data will not be publicly available due to the privacy of our participants. Upon request, our interview data may be provided with restrictions. Data are available from the Living Lab in Ageing and Long-Term Care (contact via Sil Aarts, ouderenzorg@maastrichtuniversity.nl) for researchers who meet the criteria for access to confidential data.

