PLOS One. 2020 Jun 29;15(6):e0235209. doi: 10.1371/journal.pone.0235209

Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system

Christian Burkhart 1,*, Andreas Lachner 2, Matthias Nückles 1
Editor: Maciej Huk3
PMCID: PMC7323980  PMID: 32598359

Abstract

CohViz is a feedback system that provides students with concept maps as feedback on the cohesion of their writing. Although previous studies demonstrated the effectiveness of CohViz, the accuracy of CohViz remains unclear. Thus, we conducted two comprehensive validation studies to assess the accuracy of CohViz in terms of its reliability and validity. In a reliability study, we compared the concept maps generated by CohViz with concept maps generated by four human expert raters based on a text corpus comprising students' explanatory texts (N = 100). Regarding the depiction of cohesion gaps, we obtained high agreement between the CohViz concept maps and the concept maps generated by the human expert raters. However, CohViz tended to overestimate the number of relations within the concept maps. In a validity study, we examined the validity of CohViz and compared central features of the CohViz concept maps with convergent linguistic features and divergent linguistic features based on a Wikipedia text corpus (N = 1020). We found medium to high agreement with the convergent cohesion features and low agreement with the divergent features. Together, these findings suggest that CohViz can be regarded as an accurate feedback system to provide feedback on the cohesion of students' writing.

Introduction

Comprehensible writing can be regarded as a crucial writing skill in today’s knowledge society [1–3]. A central feature that contributes to the comprehensibility of a text is cohesion [4,5]. Cohesion refers to grammatical or lexical devices which explicitly relate ideas within and across text segments [6]. It can be established in two ways. On a local level, writers can improve the cohesion of their texts by relating information between neighboring sentences, for instance by reiterating arguments or using near-synonyms (i.e., argument overlap) [6,7]. On a global level, students can improve the cohesion of their texts by explicitly relating text passages [8], for instance, by adding headings or by following a particular rhetorical structure [1,9]. Despite its central role in supporting readers’ comprehension, college students often face difficulties in writing cohesive texts [10,11]. Therefore, students in particular require ample formative feedback on the cohesion of their writing [12]. Providing instant feedback, however, is relatively time-consuming and often not feasible during regular teaching. Thus, a variety of computer-based feedback systems have recently been developed to improve students’ writing for specific linguistic features such as text cohesion, particularly in the early stages of writing instruction [12–15]. The advantage of these systems is that they provide students with instant and time-independent information about the quality of their writing. Many of these systems generate graphical visualizations from students’ texts in the form of concept maps [16,17]. These concept maps provide students with an additional external representation of their text and direct their attention to distinct textual deficits to activate appropriate revision activities [15]. In the current paper, we provide a thorough validation study of a promising graphical feedback system, CohViz, which was explicitly designed to improve the cohesion of students’ writing.
In two studies we investigated whether CohViz accurately provides feedback. To this end, we measured the reliability and the validity of the generated feedback information using two corpus-based studies. Such a validation of the accuracy of CohViz's feedback is particularly important, as recent studies provided evidence that the effectiveness of feedback largely depends on the accuracy of the information provided within the feedback [18,19].

CohViz: A feedback system to support students’ cohesive writing

CohViz is a graphical feedback system we developed which is freely available online [20]. CohViz automatically provides students with concept maps as external representations of their texts to improve the cohesion of their writing (see Fig 1) [15,21,22]. As such, using CohViz has been demonstrated to be feasible in online or blended learning settings [15,23] but also in large lecture classes [22] with high levels of students’ self-regulation during writing. By combining distinct state-of-the-art natural language processing (NLP) technologies [24–26], CohViz automatically generates concept maps from students’ self-written texts (see Fig 1). Local cohesion deficits are highlighted by colored fragments, i.e., sub-concept maps that are not related to the rest of the concept map. Global cohesion deficits are visualized in terms of the text’s macrostructure [27], i.e., by the way the concepts of the texts are used and related. The key advantage of CohViz is that it supplements teachers’ feedback in that it allows teachers to provide students with quick and specific feedback on the degree of cohesion of their texts. In practice, the feedback can be embedded as a homework assignment [22] or modeled by teachers to show textual deficits of non-cohesive texts [15]. For example, an instructor could use the tool to inform students why a particular text is high or low in cohesion (e.g., by showing texts which yield numerous fragments or texts which do not have central concepts with many relations). Similarly, students can use the tool to check short texts such as abstracts or summaries for cohesion during writing.

Fig 1. Example of a concept map generated from CohViz.

Fig 1

Colored fragments indicate concepts that are not related to the rest of the concepts and are thus indicative of a problem in cohesion. For example, the writer did not relate the first two sentences, leading to a fragmented concept map. The concept map contains three fragments, 13 concepts, and 13 relations.

Generation of the CohViz concept maps

To generate the concept maps, CohViz uses a four-step procedure (see Fig 2). First, CohViz segments each text into sentences (i.e., the segmentation phase). Second, in the extraction phase, CohViz determines the grammatical function (i.e., subject, direct object, indirect object) of each concept using the RFTagger [25], and reduces the identified concepts to lexical lemmas (e.g., theory instead of theories) to improve comparisons across concepts [24]. Third, in the relation phase, CohViz extracts relationships between concepts according to their grammatical relationship within sentences [28], and co-references between sentences: Within sentences, a relation depicts the semantic relationship between two concepts in terms of distinct propositions. Each proposition encompasses two arguments which are grammatically related (i.e., subject, direct object, indirect object). To reduce the complexity of the concept maps, the type of relation (i.e., its predicate) is not visualized. For example, for the sentence “Hollywood baked the rolls”, CohViz builds one relation: the subject “Hollywood” is combined with the object “roll”. Hence, CohViz represents semantic information between arguments in terms of partial propositions in which the predicate is not explicitly labeled (see also [17,29] for related approaches). Additionally, CohViz visualizes unambiguous (but not ambiguous) co-references. Unambiguous pronouns refer to co-references which can be matched to concepts of the previous sentences (e.g., unambiguous co-reference: “The sailors boarded the ship. They were ready to start.”; ambiguous co-reference: “The sailors boarded the ship of the companions. They were ready to start.”). Fourth, the relations are stored as word-pairs and are visualized in the form of a concept map.
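The four-step procedure can be sketched in simplified form. This is not CohViz's actual implementation (which relies on the RFTagger and further German-specific NLP models); the regex sentence splitter, the capitalized-word heuristic for concept extraction, and the all-pairs relation rule are illustrative stand-ins:

```python
import re

def segment(text):
    """Step 1: split a text into sentences (naive punctuation-based sketch)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def extract_concepts(sentence, lemmatize):
    """Step 2: extract and lemmatize candidate concepts. CohViz uses the
    RFTagger to determine grammatical roles; here a caller-supplied
    lemmatizer and a capitalized-word heuristic (a rough stand-in for
    German nouns) are used instead."""
    return [lemmatize(w) for w in re.findall(r'\b[A-Z][a-zäöüß]+\b', sentence)]

def relate(concepts_per_sentence):
    """Step 3: build relations as word pairs. Here every concept pair
    within a sentence is related, a simplification of CohViz's
    proposition-based relations. Step 4 (visualization) is omitted."""
    relations = set()
    for concepts in concepts_per_sentence:
        for i, a in enumerate(concepts):
            for b in concepts[i + 1:]:
                relations.add(tuple(sorted((a, b))))
    return relations
```

The word pairs returned by `relate` correspond to the stored relations from which the concept map is drawn.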

Fig 2. Depiction of the four-step procedure of the CohViz feedback.

Fig 2

Previous research on the effectiveness of CohViz

So far, the effectiveness of CohViz has been tested in various settings between 2017 and 2019, including controlled laboratory studies and ecologically valid field studies in a wide range of disciplines (e.g., biology, ethics, philosophy, educational psychology, teacher education). In these experimental studies, the authors examined the effectiveness of CohViz with college students [15,21,22,30]. Overall, in a mini meta-analysis, Burkhart, Lachner, and Nückles [31] showed that CohViz had a medium effect on both local and global cohesion and was thus effective in improving the cohesion of students’ texts (local cohesion g = 0.62; global cohesion g = 0.57). A potential explanation for the obtained effects of CohViz can be found in the think-aloud study by Lachner, Burkhart, and Nückles [15], which showed that students who processed the concept maps could directly infer both local and global writing plans from them. In addition, the analysis of the concept maps triggered negative monitoring processes (i.e., students thought about the macrostructure of their text), which led to further planning processes.

These findings may be indicative of the prognostic validity of CohViz. However, it is less clear whether the information provided by the feedback is accurate. Investigating this accuracy is important, as prior research indicates that the effectiveness of feedback depends on its accuracy [18,19]. Therefore, in-depth research on the accuracy is mandatory for potential future advancements of such feedback systems.

Overview of the current study

Against this background, we conducted two corpus-based studies in which we investigated the accuracy of the CohViz feedback in terms of its reliability and validity (see Fig 4).

Fig 4. Depiction of the data analysis procedure for the reliability and validity study.

Fig 4

In the reliability study, we investigated the reliability of CohViz concept maps compared to concept maps based on propositional segmentations by human expert raters on a corpus of 100 expository texts of novice writers collected from previous experimental studies. To compare the concept maps, we used structural indicators commonly used in research on concept maps [32–35]. We assumed that CohViz concept maps yield similar results in the number of concepts, relations, and fragments compared to the human expert raters.

Regarding the validity of CohViz, we assessed the convergent and divergent validity of the CohViz system with automated measures from well-established assessment systems [36–39], as used for example by the well-known system Coh-Metrix [40]. As CohViz was predominantly designed to provide feedback on the cohesion of students’ writing, we focused on the number of fragments as the central indicator of cohesion provided by CohViz and compared it with convergent (i.e., argument overlap, semantic overlap) and divergent (i.e., syntactic complexity, lexical diversity, and word concreteness) linguistic features of text cohesion. To benchmark the fragments against the convergent and divergent measures, we used a representative text corpus comprising 1020 expository texts from the German Wikipedia. Regarding the convergent validity, we expected medium to large correlations between the number of CohViz fragments as an indicator of text cohesion and the convergent features of text cohesion (i.e., argument overlap, semantic similarity). Regarding the divergent validity, we assumed no substantial correlations between the CohViz fragments and features on the syntactic and lexical level.

Methods

Corpora collection

Reliability corpus

The corpus to test the reliability of CohViz was compiled by the authors as a sample from an entire set of 901 expository texts written in German by college students. All texts were produced by novice writers in the course of different experimental studies conducted by the authors between 2015 and 2018 at German universities. We made sure to compile texts with a representative range of topics in the natural sciences and the humanities. In all of these studies, the dependent variables of interest were measures of text quality (e.g., local and global cohesion). In those studies, students had 15 minutes to accomplish a writing task. For each of these studies, students gave their written consent for the scientific use of the data. As manual segmentation and coding of the entire corpus of 901 texts would be enormously time-intensive, we decided to use a random yet representative subset of 100 texts from the entire text corpus (see Table 1). To test whether the sample was large enough to find small to medium effects, we conducted a power analysis with a sample size of 100, an effect size of r > .3, and Pearson correlation as the statistical test. We obtained a satisfactory power of .93, indicating that the sample size was large enough to detect small to medium effects. To reduce experimenter bias, the selection was carried out via a computer algorithm written by the authors that randomly selected ten texts per topic. For each topic within a domain, the algorithm selected ten expository texts, resulting in a total set of 100 texts. We aimed at a text corpus with high linguistic variability, as we were interested in testing the reliability of CohViz under realistic conditions.
That said, the text corpus varied on central linguistic dimensions per topic, such as the number of words, F(9, 90) = 6.03, p < 0.01, ηp2 = .38 (large effect), the number of sentences, F(9, 90) = 7.72, p < 0.01, ηp2 = .43 (large effect), and the Flesch-Kincaid readability scores, F(9, 90) = 6.44, p < 0.01, ηp2 = .39 (large effect). The Flesch-Kincaid is a measure for calculating the potential comprehensibility of a text [41] and uses various linguistic features (i.e., number of total words, number of total sentences, number of total syllables) to which different weighting factors are assigned. Texts with a high Flesch-Kincaid score are difficult to read, whereas texts with a low readability score are easy to read. To increase the comprehensibility of the score, it is commonly standardized to grade levels. The obtained readability score of the reliability corpus thus suggests that the texts were suitable for 12th-grade students.

Table 1. Descriptive statistics of the corpus of the reliability study.
Topic N Mean number of words Flesch-Kincaid readability score Mean number of sentences
All texts 100 200.47 (72.14)a 12.5 (2.09) 12.42 (5.02)
Natural sciences
Combustion engines 10 261.3 (66.32) 10.43 (1.35) 19.2 (5.61)
Osmosis 10 136.9 (18.84) 11.03 (1.39) 9.0 (2.83)
Natural selection 10 166.3 (28.04) 11.43 (1.77) 10.0 (2.36)
Data preservation 10 234.6 (73.66) 13.64 (1.73) 12.5 (3.98)
Humanities
School systems 10 212.9 (82.99) 15.16 (2.13) 11.5 (3.75)
Knowledge representations 10 204.5 (51.11) 12.37 (2.26) 14.0 (4.22)
Cognitive Load Theory 10 255.8 (90.78) 12.63 (1.09) 15.9 (6.01)
Reading skills 10 124.2 (19.25) 13.37 (1.80) 7.1 (1.10)
Instructional explanations 10 193.6 (43.66) 12.79 (2.08) 11.7 (2.63)
Formative assessment 10 214.6 (44.61) 12.01 (1.07) 13.3 (4.39)

aValues in brackets refer to standard deviations.

Validity corpus

As a crucial feature of validity is the degree of representativeness [42], we sought to construct a corpus representing a broad range of topics and writing styles in expository writing. One easily-accessible database of expository texts is Wikipedia. Currently, the German Wikipedia comprises 2,409,327 articles. Thus, to test the validity of CohViz, we extracted 1020 German Wikipedia entries from the overall Wikipedia corpus (see Table 2). As students’ expository texts in common classroom studies were rather short, we only analyzed the extended summary sections of the articles. To extract texts from a wide variety of topics and writing styles and because of the enormous size of the Wikipedia corpus, we used a computer algorithm that randomly extracted the texts from the corpus.

Table 2. Descriptive statistics of the corpus of the validity study.
Features M SD
Number of words per sentence 16.23 3.66
Number of syllables per word 1.96 0.18
Number of sentences 12.85 3.59
Number of words per text 208.36 71.80
CohViz features
Number of fragments 3.68 1.99
Number of concepts 53.16 19.06
Number of relations 158.50 233.43
Convergent features on the cohesion level
Adjacent semantic overlap .25 .15
Text level semantic overlap .21 .10
Adjacent argument overlap .27 .17
Text level argument overlap .23 .13
Divergent features on the syntactic level
Average length of longest dependency 9.31 2.48
Average number of complex nominals per clause 0.68 0.27
Divergent features on the lexical level
Word concreteness 1.07 0.35
Root type-token ratio 9.36 1.32

Construction of the concept maps

To compare the CohViz concept maps to concept maps generated by human expert raters, we asked four human expert raters to segment the texts from the reliability corpus into propositions as a basis for the generation of the concept maps (see Fig 4 for the full processing of the corpus). To this end, we looked for experienced raters with a solid background in linguistics who had previous experience in propositional segmentation. Among the raters who fulfilled these criteria, we asked four advanced master's students with a major in applied linguistics or learning and instruction to analyze the corpus. All raters came from the same German university as the authors. Their mean age was 24 (SD = 2.94). They were already familiar with the procedure of propositional segmentation since it was part of their studies’ curriculum. To ensure a uniform prior knowledge of the procedure, each rater was provided with multiple in-depth training sessions (five hours on average) on propositional segmentation and text cohesion. In these training sessions, raters were instructed on different cohesion strategies (e.g., argument overlap, connectives, bridging information). Additionally, they were trained in propositional segmentation with authentic practice material. Each rater independently segmented the texts from the corpus into propositions. To standardize the segmentation procedure, the raters were supported with a segmentation manual when segmenting the texts. First, they were instructed to split each text into sentences. Next, they were asked to divide each sentence into propositions consisting of a predicate and its arguments (e.g., subject, possessor, direct object) [11,43]. For instance, the sentence “Neil went to the moon” would be broken down into the following proposition: P(subject = Neil, object = moon, predicate = go) and transferred to the word-pair: Neil–moon.
Additionally, the raters were asked to relate adjacent sentences when they were explicitly related by meaningful cohesive devices (i.e., argument overlap, co-references, superordinate concepts, subordinate concepts, bridging information, and connectives). Each proposition was entered into a spreadsheet with the id of the participant, the number of each sentence, and the subject and object of each proposition as two variables. From this list, and to reduce errors by manual coding, we automatically calculated three structural indicators commonly studied in concept mapping research [34,35] using the CohViz algorithm (see Fig 1 for an example, see [44] for the full computer algorithm): Concepts represented the concepts from the text, relations represented the relationships among these concepts based on the propositional segmentation, and fragments represented clusters of concepts within a text which were not related to the rest of the text and should thus indicate potential breaks of text cohesion. To guarantee the consistency and representativeness of the human expert segmentations, we followed a two-step procedure. First, all four raters coded a subset of 30 expository texts from the corpus of 100 texts. Interrater reliability for this subset was good to excellent: number of fragments (ICC = .76), number of relations (ICC = .91), and number of concepts (ICC = .97). Second, as the interrater agreement between the four raters was high, two of the four raters coded the remaining texts of the corpus. Interrater reliability between the two raters again was excellent for all indices: number of fragments (ICC = .89), number of relations (ICC = .99), and number of concepts (ICC = .99). Differences between the two remaining raters were resolved by discussion.

To compare the structural indicators of the human expert raters to CohViz, we generated the concept maps from the same reliability corpus using the above-mentioned CohViz algorithm. In contrast to the propositional segmentation by the human expert raters, the CohViz concept maps were generated by state-of-the-art natural language processing techniques [24–26]. From these concept maps we computed the same structural indicators, that is, the number of concepts, the number of relations, and the number of fragments.

Measures

Structural indicators of the concept maps

The generated concept maps yielded three structural indicators commonly used in concept mapping research [34,35]: The number of concepts in each text, the number of relations between these concepts, and the number of fragments (see Fig 1). Fragments are the main indicator of text cohesion in CohViz and depict distinctly colored sub-concept maps that are not related to the rest of the concept map. An increase in the number of fragments should indicate a decrease in local text cohesion.

Formally, the number of fragments (F) can be defined as the sum of all sub-concept maps in a cluster graph [45]: $F = \sum_{f=1}^{n} F_f(C, R)$. C denotes all concepts within a sub-concept map, $C = \{c_1, c_2, c_3, \ldots, c_n\}$. For example, in Fig 3, the green fragment includes the concepts psychology, neuroscience, phenomenon, domain, and linguistic. R denotes all relations within a sub-concept map, for example, $R = \{\{c_1, c_2\}, \{c_1, c_3\}, \ldots, \{c_1, c_n\}\}$. In Fig 3, for example, the green fragment has four relations with five concepts. Fragments are disjoint from each other, so that fragments do not share concepts. For example, the green and the orange fragment in Fig 3 do not share a common concept. Formally, this fact can be described by $C(F_1) \cap C(F_2) = \emptyset$. The full algorithm for calculating these structural indicators can be found in the public data repository of this study [44].
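Because fragments are, by this definition, the connected components of the concept-map graph, counting them is straightforward. A minimal union-find sketch, not CohViz's actual code, assuming relations arrive as word pairs (concepts without any relation would need to be added separately):

```python
def count_fragments(relations):
    """Count fragments (disjoint sub-concept maps) of a concept map.

    `relations` is an iterable of (concept, concept) word pairs. A fragment
    is a connected component of the resulting graph, so the disjointness
    C(F1) ∩ C(F2) = ∅ holds by construction.
    """
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Union the two endpoints of every relation.
    for a, b in relations:
        parent[find(a)] = find(b)

    # Number of distinct roots = number of fragments.
    return len({find(c) for c in parent})
```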

Fig 3. Convergent and divergent features of the validity study.

Fig 3

Convergent and divergent measures of text cohesion

To investigate the validity of CohViz, we benchmarked the three structural indicators obtained by CohViz to convergent and divergent measures of text cohesion commonly used in well-established assessment systems for writing quality [36,46]. Convergent validity can be assessed by comparing the fragments as the central measure of text cohesion in CohViz to other text cohesion measures (e.g., argument overlap, LSA-dependent cohesion measures) [39]. Divergent validity can be assessed by comparing CohViz to divergent measures of text cohesion (e.g., syntactic or lexical complexity as indicators of writing quality). Since a comprehensive coverage of all available linguistic features of texts would result in a confusing plethora of computer-based indicators (n > 200), we followed suggestions by McNamara, Crossley, and McCarthy [47] and MacArthur et al. [5] and selected convergent and divergent measures commonly associated with writing quality and applied these measures to the validity corpus.

To obtain these convergent and divergent measures of text cohesion, we chose to analyze the texts with well-established assessment systems as they have been thoroughly studied in the past and represent the state-of-the-art tools in text-processing. As the Wikipedia entries were written in German we used the assessment tool developed by Hancke et al. [37], which has been designed for the German language. The system provides researchers with validated and commonly used linguistic indicators regarding the cohesion, syntactic, and lexical complexity of a text [4,5,48]. This system is comparable to Coh-Metrix, which was developed for the English language by Graesser et al. [46]. To compare these measures with CohViz, we first calculated the structural indicators from the validity corpus and then extracted the convergent and divergent validity measures from the validity corpus with the assessment system developed by Hancke et al. [37].

Convergent features on the cohesion level. Argument overlap is a central indicator of cohesion and measures the co-occurrence of arguments between two sentences [6]. To measure argument overlap, texts are first segmented into sentence pairs. Afterward, the sentence pairs that contain a common argument (e.g., a noun phrase) are counted and divided by the total number of sentence pairs (see Fig 3 for an example). There are two versions of argument overlap.

Adjacent argument overlap measures the overlap of arguments between neighboring sentences. Formally, adjacent argument overlap is defined as the ratio between the sum of the adjacent sentence pairs that share a common argument, $P_{i,i+1}$, and the number of adjacent sentence pairs in a text, $n-1$:

$$\text{Adjacent argument overlap} = \sum_{i=1}^{n-1} P_{i,i+1} \cdot \frac{1}{n-1}$$
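Given per-sentence argument sets (their extraction, e.g. lemmatized noun phrases, is assumed to happen upstream), the adjacent measure reduces to a few lines; this is an illustrative sketch, not the code of any of the cited systems:

```python
def adjacent_argument_overlap(sentence_args):
    """Share of neighboring sentence pairs with at least one common
    argument. `sentence_args` is a list of sets of arguments, one set
    per sentence, in text order."""
    n = len(sentence_args)
    if n < 2:
        return 0.0
    shared = sum(1 for i in range(n - 1)
                 if sentence_args[i] & sentence_args[i + 1])
    return shared / (n - 1)
```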

Text level argument overlap measures the overlap of each sentence with every other sentence in a text and is a proxy for the global cohesion of a text. Formally, text level overlap is defined as the ratio of the sum of all sentence pairs within a text that share a common argument, $P_{ij}$, and the number of all possible sentence pairs within a text, $\frac{n(n-1)}{2}$. To calculate text level argument overlap, the sentences are stored in a lower triangular matrix, in which the rows and columns represent the sentences of the text.

$$\text{Text level argument overlap} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} P_{ij} \cdot \frac{1}{n(n-1)/2}$$
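The text level variant only changes the denominator and the set of pairs considered; again a sketch over the same assumed per-sentence argument sets:

```python
from itertools import combinations

def text_level_argument_overlap(sentence_args):
    """Share of all sentence pairs (i < j) with at least one common
    argument, i.e. the sum of P_ij divided by n(n-1)/2 possible pairs."""
    n = len(sentence_args)
    if n < 2:
        return 0.0
    shared = sum(1 for a, b in combinations(sentence_args, 2) if a & b)
    return shared / (n * (n - 1) / 2)
```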

Semantic overlap indicates the semantic similarity between sentences. To measure the semantic overlap we applied the common statistical procedure latent semantic analysis (LSA) as introduced by Graesser et al. [46]. LSA values can be interpreted as a correlation that takes values between 0 (low semantic relatedness) and 1 (high semantic relatedness). LSA is a sensitive feature for the semantic relatedness of sentence pairs and is a central indicator of text cohesion in Coh-Metrix [48]. The LSA-cosine is computed as follows (see Fig 3): First, LSA counts how often each word occurs in each sentence. Then, a statistical procedure called singular value decomposition is applied which reduces the number of words to a smaller number of main concepts. From this reduced form, the semantic similarity between sentences is then computed by taking their cosine similarity. Adjacent semantic overlap measures the cosine similarity between neighboring sentences, and text level semantic overlap measures the cosine similarity between all possible sentence pairs of a text.
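The count-then-decompose-then-cosine pipeline can be sketched with a truncated SVD. Note the hedge: production LSA systems train the semantic space on a large external corpus, whereas this toy version fits the space on the text itself, purely for illustration:

```python
import numpy as np

def lsa_adjacent_overlap(sentences, k=2):
    """Toy LSA-based adjacent semantic overlap: term-by-sentence counts,
    truncated SVD, then the average cosine of neighboring sentence vectors."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    counts = np.array([[s.lower().split().count(w) for s in sentences]
                       for w in vocab], dtype=float)
    # Singular value decomposition; keep the k strongest latent dimensions.
    u, sing, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(sing))
    sent_vecs = (np.diag(sing[:k]) @ vt[:k]).T  # one vector per sentence
    sims = []
    for a, b in zip(sent_vecs, sent_vecs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom else 0.0)
    return sum(sims) / len(sims)
```

Two identical sentences yield a cosine of 1, i.e. maximal adjacent semantic overlap.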

Divergent features on the syntactic level. Average length of longest dependency refers to the longest distance between a word and its dependent within a sentence [49]. For example, the sentence “The tall and energetic politician who emphasized the value of freedom did not like the reporter” (see Fig 3) has a longest dependency of 10: The subject “politician” must be related to the predicate “like” [50] and the distance between the subject and the predicate is 10 words. Long dependencies are challenging for readers, as the first element of the dependency has to be kept in working memory until the next element can be related to the first element [50]. Thus, the average length of the longest dependency for each sentence in a text is a precise feature for the syntactic complexity of a text.
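Given a dependency parse (obtaining the parse is assumed to happen upstream, e.g. with the parser the assessment system uses), the measure itself is a simple aggregation over arcs; a sketch with arcs given as word-index pairs:

```python
def average_longest_dependency(parsed_sentences):
    """Average, over all sentences, of the longest dependency span.
    Each sentence is a list of (head_index, dependent_index) arcs; the
    span of an arc is the word distance between head and dependent."""
    longest = [max(abs(h - d) for h, d in arcs) for arcs in parsed_sentences]
    return sum(longest) / len(longest)
```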

Average number of complex nominals per clause refers to groups of words that relate to a common noun [51]. For example, a noun can be described in more detail by one or more adjectives (e.g., “the tall and energetic politician”), by placing multiple noun phrases side by side (e.g., “the man who drank beer”), or by a gerund phrase (e.g., “doing sports regularly is good for your health”). To obtain the average number of complex nominals per clause, the number of complex nominals is counted and then averaged by the number of clauses. Complex nominals are challenging for readers since they put a high demand on working memory, especially for less proficient readers [52].

Divergent features on the lexical level. Word concreteness. Concrete words are characterized by the fact that they refer to tangible sensory experiences. By using concrete words, a reader can see agents performing actions that affect objects. For example, the word “reporter” refers to an object and a sensory experience, whereas the word “freedom” refers to a concept that cannot be captured by sensory experiences and is thus more abstract. To measure word concreteness, several computational models have been proposed [53]. Among those, word concreteness models based on WordNets are commonly used. A WordNet is a lexical database of words from a particular language that form a hierarchical tree structure. The closer words are to the root of the tree structure, the more abstract they are. For example, the word “sports” is related to “soccer” since “sports” is a superordinate of soccer. Hence, “sports” has fewer hypernyms (i.e., superordinates) than “soccer” and is thus more abstract than “soccer”. To compute the hypernym score, we calculated the average number of hypernyms among all concepts within a text. To compute these scores, we used GermaNet, a WordNet for the German language containing over 16,000 words [54]. High values in the hypernym score signify high levels of concreteness, whereas low values characterize low levels of concreteness.
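The hypernym score amounts to walking up the taxonomy for each word and averaging the path lengths. A sketch with a tiny hand-made child-to-parent mapping standing in for GermaNet:

```python
def hypernym_depth(word, hypernym_of):
    """Number of hypernyms above `word` in a WordNet-style taxonomy,
    given as a child -> parent dict. More hypernyms = more concrete.
    The dict here is an illustrative stand-in for GermaNet."""
    depth = 0
    while word in hypernym_of:
        word = hypernym_of[word]
        depth += 1
    return depth

def word_concreteness(words, hypernym_of):
    """Hypernym score: average hypernym count over a text's concepts."""
    return sum(hypernym_depth(w, hypernym_of) for w in words) / len(words)
```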

Root type-token ratios are a common feature of lexical diversity. Type-token ratios are calculated by dividing the number of unique words (types) within a text by all words within a text (tokens). For example, the example in Fig 3 consists of 13 words (tokens) and 12 unique words (types), since politician is repeated once in the second sentence and stop words such as articles are ignored. Hence, this example would yield a type-token ratio of 12/13, or 0.92. As type-token ratios are sensitive to text length, root type-token ratios are adjusted for text length by taking the square root of the number of tokens [55]. Texts with a high root type-token ratio are more difficult to understand because readers need a greater vocabulary to understand the content of a text.
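The root type-token ratio is a one-liner once tokenization and stop-word filtering (assumed to happen upstream) are done:

```python
import math

def root_type_token_ratio(tokens):
    """Root type-token ratio: unique words (types) divided by the square
    root of the token count, correcting plain TTR for text length.
    `tokens` is a pre-tokenized, stop-word-filtered word list."""
    return len(set(tokens)) / math.sqrt(len(tokens))
```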

Data analysis

We tested the reliability and the validity of the CohViz concept maps using the two corpora (see Fig 4). All texts from both corpora were automatically analyzed by the CohViz algorithm. The algorithm generated the concept maps from the texts of the respective corpus and computed the three structural indicators from these concept maps (i.e., number of fragments, number of concepts, number of relations). These structural indicators were the basis for testing the reliability and validity of the feedback.

Assessing the reliability of CohViz

To test the reliability of CohViz, we first created the reliability corpus. These texts were fed into the CohViz engine and additionally analyzed by human expert raters who segmented each text into propositions. From these propositions we computed the concept maps and extracted their structural indicators. To compare the three structural indicators between the CohViz concept maps and the human expert concept maps, we computed product-moment correlations, interrater agreement, and bias scores. Interrater agreement was measured by intraclass correlations. We used intraclass correlations in addition to product-moment correlations because product-moment correlations do not take potential differences among raters into account and only capture the general association between two raters. Intraclass correlations are therefore generally preferred, as they additionally correct for potential divergence among two or more raters [56]. Interrater reliability measures, however, only measure the (dis)agreement between the computer feedback and the human expert raters but do not provide additional information about the direction of potential disagreements (i.e., over- or underestimations) [57]. The direction of disagreement is commonly computed in terms of bias measures [58]. Bias refers to the signed difference between the values provided by the computer-based feedback tool and the values by the human expert raters (X_Computer − X_Rater). Positive values indicate overestimations (i.e., the computer provides higher numbers than the expert raters), negative values indicate underestimations (i.e., the computer provides lower numbers than the expert raters). These general quantitative methods are often accompanied by qualitative analyses of the disagreements to inspect the underlying reasons for these disagreements [6,59,60]. We used an alpha level of .05 for all statistical analyses.
A power analysis indicated an excellent test power of 1−β = .93 (with α = .05, a sample size of N = 100, and a smallest detectable effect of r = .30).
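A power calculation of this kind can be approximated with the Fisher z transformation, as in the following sketch (our illustration, not the authors' procedure; the approximation may differ from exact-test power values such as the 1−β reported above, especially at moderate sample sizes):

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_correlation(r, n, alpha=0.05):
    """Approximate two-tailed power for detecting a population
    correlation r with sample size n (Fisher z approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # critical value, two-tailed
    delta = atanh(r) * sqrt(n - 3)              # noncentrality of Fisher z
    return (1 - nd.cdf(z_crit - delta)) + nd.cdf(-z_crit - delta)
```

As expected, the approximate power grows with both the sample size and the size of the effect to be detected.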

Assessing the validity of CohViz

To test the validity of CohViz, we proceeded similarly. First, we fed the texts from the validity corpus into the CohViz engine, which extracted the three structural indicators. Next, we computed the convergent and divergent measures of text cohesion from the texts of the validity corpus with the assessment system developed by Hancke et al. [37]. To compare the three structural indicators generated by CohViz with the convergent and divergent measures of cohesion, we computed product-moment correlations, focusing in particular on the number of fragments as the central measure of text cohesion delivered by CohViz. Post-hoc analyses indicated an excellent test power of 1−β = .94 (with α = .05, a sample size of N = 1020, and a smallest detectable effect of r = .01). Given this high test power, the direction and the size of an effect (i.e., the size of the correlation) is more important for interpreting the convergent and divergent validity of CohViz than the mere establishment of significance, as even small and less meaningful correlations may become significant.
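The point about significance versus effect size can be made concrete with a short sketch (our illustration; the p value uses the Fisher z approximation, which is adequate for large samples):

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-tailed p value for a Pearson correlation of
    size r observed in a sample of n, via the Fisher z transformation."""
    z = abs(atanh(r)) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(z))
```

With N = 1020, even a trivially small correlation such as r = .08 falls below the conventional .05 threshold, whereas the same correlation in a sample of 100 does not; this is why the interpretation above rests on the direction and size of the correlations rather than on significance alone.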

Results

Reliability

First, to investigate the relationships between the concept maps generated by CohViz and those generated by the human expert raters, we computed product-moment correlations and intraclass correlations (ICC) per dependent measure (see Table 3). For the number of fragments, we found a high correlation between the CohViz concept maps and the concept maps based on the human expert raters; the intraclass correlation for the number of fragments was likewise high. For the number of relations, we obtained a high correlation between the maps generated by CohViz and the maps based on the human raters, and a medium-to-high intraclass correlation. The correlation for the number of concepts in the CohViz-generated and human-rater-based maps was excellent, as was the intraclass correlation (see Table 3). Together, these results indicate that the concept maps generated by CohViz were generally highly consistent with the concept maps by the human raters, as the product-moment and intraclass correlations showed at least medium but mostly high values for all three measures (the number of fragments, the number of concepts, and the number of relations).

Table 3. Accuracy of the CohViz methodology compared to human raters.

Feature | Human raters M (SD) | CohViz M (SD) | r(98) | p | ICC | Mdiff (SDdiff) | t(99) | p | 95% CI [LL, UL] | Cohen's d
Number of fragments | 2.63 (1.71) | 2.76 (1.79) | .76 | < .001 | .76 | -0.13 (1.22) | -1.07 | .289 | [-0.37, 0.11] | -0.15
Number of concepts | 30.90 (12.13) | 32.63 (12.32) | .83 | < .001 | .78 | -1.73 (3.64) | -4.75 | < .001 | [-2.45, -1.01] | -0.67
Number of relations | 40.61 (20.98) | 59.21 (29.90) | .95 | < .001 | .96 | -18.60 (17.00) | -10.93 | < .001 | [-21.92, -15.18] | -1.39

Mdiff = mean difference scores between human raters and CohViz; SDdiff = standard deviation of difference scores between human raters and CohViz; ICC = intraclass correlations; CI = confidence intervals; LL = lower limit, UL = upper limit. All t-tests were two-tailed.

To investigate the direction of potential discrepancies, we computed bias scores [58] between the human expert concept maps and the CohViz concept maps for each dependent variable (i.e., the number of fragments, the number of concepts, the number of relations, see Table 3). To test whether these differences were significant, we computed one-sample t-tests against zero for each dependent variable [58]. As the correlations already indicated, we did not find a significant difference in the number of fragments between the human expert raters and CohViz (d = 0.15, small effect), indicating that CohViz was highly accurate in detecting cohesion fragments in the text corpus. However, CohViz tended to overestimate the number of relations (d = 1.39, large effect), and the number of concepts (d = 0.67, medium effect), indicating lower accuracy regarding the extraction of concepts and relations from the texts.
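The bias analysis above amounts to a one-sample t-test of the difference scores against zero, with Cohen's d as the standardized effect size. A minimal sketch (our illustration, not the authors' code; the sign convention follows the bias definition XComputer − XRater):

```python
from math import sqrt
import numpy as np

def bias_test(computer, rater):
    """One-sample t statistic (test of mean difference against zero)
    and Cohen's d for the signed computer-minus-rater differences."""
    diff = np.asarray(computer, dtype=float) - np.asarray(rater, dtype=float)
    n = diff.size
    m = diff.mean()
    sd = diff.std(ddof=1)        # sample standard deviation
    t = m / (sd / sqrt(n))       # one-sample t against zero
    d = m / sd                   # Cohen's d for one-sample designs
    return float(t), float(d)
```

Note the convenient relation t = d · √n, which lets readers check the reported t values against the effect sizes in Table 3.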

To investigate the reasons for these overestimations regarding the extraction of concepts and relations, we conducted two follow-up qualitative analyses.

First, we analyzed the overestimation in the number of concepts for the 10 texts with the largest overestimations of concepts. On average, the selected subsample of CohViz concept maps contained 10 concepts more than the human expert raters’ concept maps (which corresponds to an overall overestimation of 101 concepts across all 10 texts). Of these overestimations, 22% (n = 22) occurred due to parsing errors of the CohViz engine during the extraction phase (see Fig 2; e.g., incorrect categorizations of n-grams or lemmas). The remaining 78% (n = 79) occurred because the raters accidentally omitted concepts or entire sentences in their propositional segmentations. Thus, the qualitative analysis suggests that the disagreement between CohViz and the human expert raters regarding the number of concepts resulted largely from segmentation errors by the human expert raters rather than from errors of the CohViz system, which further corroborates the reliability of CohViz.

We proceeded similarly for the number of relations and analyzed the ten texts with the highest overestimations of relations. On average, each CohViz concept map contained 47 relations more than the human expert raters’ concept maps (which corresponds to an overall overestimation of 476 relations across all 10 texts). Of these overestimations, 66% (n = 315) were due to errors of the CohViz system. The qualitative analyses revealed that most overestimations stemmed from texts that contained complex sentence structures (e.g., long sentences comprising several subordinate clauses) or realized enumerations in the form of incomplete sentences. Thus, most of the errors must be attributed to parsing problems in the extraction phase, as computer-linguistic approaches usually require well-formed grammatical constructions. In contrast, 34% (n = 161) of the overestimations resulted from omissions by the human raters during the segmentation of the students’ texts. Thus, the qualitative analysis suggests that the overestimations regarding the relations among concepts likely resulted when a text contained relatively complex syntactic structures and incomplete sentences.

Validity

Correlations of CohViz with convergent features of text cohesion

We computed product-moment correlations to investigate the relationship between the CohViz features, in particular the number of fragments as the central measure of text cohesion delivered by CohViz, and the different convergent features. In line with our expectations, we found medium to large negative correlations between the CohViz indicator number of fragments and most of the convergent features of text cohesion (except for adjacent semantic overlap, see Table 4), indicating that an increase in CohViz fragments was associated with a decrease in cohesion. The text level measures of cohesion correlated more strongly with the CohViz measures than the adjacent cohesion measures did, presumably because the number of fragments primarily captures cohesion gaps at the level of the entire text.

Table 4. Correlations between CohViz features and convergent and divergent features of text cohesion.
Level Feature 1 2 3 4 5 6 7 8 9 10 11
CohViz features
1. Number of fragments
2. Number of concepts .12***
3. Number of relations -.04 .57***
Convergent features on the cohesion level
4. Adjacent semantic overlap -.25*** -.02 .10***
5. Text level semantic overlap -.43*** -.11*** .10** .66***
6. Adjacent argument overlap -.46*** -.00 .09** .40*** .32***
7. Text level argument overlap -.60*** -.06 .09** .25*** .44*** .75***
Divergent features on the syntactic level
8. Average length of longest dependency -.24*** .45*** .29*** .17*** .27*** .21*** .28***
9. Average number of complex nominals per clause -.08** .23*** .19*** .12*** .20*** .06 .07* .29***
Divergent features on the lexical level
10. Word concreteness -.02 -.25*** -.22*** -.03 -.03 .01 .02 .07* -.30***
11. Root type-token ratio .15*** .81*** .36*** -.06 -.15*** -.12*** -.17*** .52*** .15 .08*

The central results from the validity study are in boldface. Correlations represent Pearson product-moment correlation coefficients.

*p ≤ .05

**p ≤ .01

***p ≤ .001

Correlations with divergent features of text cohesion

Next, we tested associations of the number of fragments provided by CohViz with the divergent measures of writing quality (i.e., on the syntactic and the lexical level). Except for the average length of the longest dependency, we found non-substantial to weak correlations with the number of fragments generated by CohViz for both the syntactic and the lexical features (see Table 4). One reason for the moderate negative correlation between the average length of the longest dependency and the number of fragments might be that CohViz generates more relations as the average length of dependencies increases, leading to a decrease in the number of fragments. In summary, these results indicate that the number of fragments was not substantially related to most divergent features of writing quality. These findings on the discriminant features further underscore the conclusion that CohViz is a valid indicator of text cohesion.

Discussion

The purpose of the presented studies was to examine the reliability and validity of CohViz, a computer-based feedback approach, which provides (novice) writers with informative feedback on the cohesion of their writing in the form of concept maps. We successfully demonstrated the reliability and construct validity of the CohViz feedback in two studies.

The findings of the reliability study showed that CohViz is particularly reliable in visualizing cohesion deficits and concepts. Regarding the depiction of relations, however, CohViz tended to be less accurate than the human expert raters, indicating room for technical improvement. Regarding the number of concepts, our qualitative analyses revealed that CohViz was even superior to the human expert raters in validly visualizing the text’s concepts, as the human coding was often accompanied by coding errors. This finding is surprising, as the expert raters had received several hours of in-depth training on propositional segmentation and the instantiation of cohesion. Apparently, the concept maps generated by the CohViz engine were less error-prone than the human-generated concept maps and may thus offer valid opportunities to provide students with accurate feedback on the concepts chosen when writing.

In the validity study, we demonstrated the convergent and divergent validity of the CohViz system using features of writing quality from well-established assessment systems [36,46], such as Coh-Metrix [46]. In particular, we compared the fragments as the main indicator of text cohesion in CohViz with convergent features of text cohesion (i.e., argument overlap, semantic overlap) and divergent features on the syntactic and lexical level (i.e., average length of the longest dependency, average number of complex nominals per clause, word concreteness, root type-token ratio). We found that the number of fragments was moderately to highly associated with the convergent features of text cohesion. In line with our expectation of divergent validity, there were no substantial relations between the number of CohViz fragments and the lexical and syntactic features. These results suggest that the fragments, as the central indicator of text cohesion, are particularly valid for assessing the degree of argument overlap in students’ texts; the fragments seem to capture how connected the central concepts are at the text level. This result is particularly interesting as it sheds light on the mechanisms that may contribute to the effectiveness of the feedback. Given this result, it can be assumed that the fragments help students, in particular, to identify issues of textual cohesion at the level of argument overlap.

The two studies have both theoretical and practical implications. First, our findings add to the scarce evidence that human expert raters, as the presumed gold standard, are not always superior to computer-based language processing technology in segmenting and parsing texts. In our study, we recruited expert raters with ample training on propositional segmentation. Nevertheless, these expert raters still faced difficulties in accurately segmenting expository texts into concepts. On the other hand, the expert raters were better able to detect relations among concepts, presumably because human raters have a more profound knowledge of grammar and linguistics. Our findings thus show that human expert raters are not necessarily the gold standard for analyzing texts in terms of their structural indicators. The superiority of computer-based approaches over human raters has already been suggested for surface-level writing features such as spelling and orthography [61]. Our findings add to this work by showing that computer-based systems may also be suited to assess the cohesion of texts, which can be regarded as an important feature contributing to text quality [11,62]. Thus, we see computer-based assessments as a potential supplement to human assessments to further advance students’ writing [12].

Finally, our study highlights the central role of triangulation in validation studies. Many validation studies solely relied on reporting relationships with other assessment technologies or with human expert raters [63–66]. The findings of these studies provide relevant insights into potential relationships among tools and human raters. Such unidimensional analyses, however, may not suffice to fully establish the reliability and validity of the features under investigation, as they provide no information about the direction of potential discrepancies or their underlying reasons. In our studies, by contrast, we implemented multiple quantitative measures to examine the overall validity of the CohViz feedback. These analyses were complemented by qualitative analyses, which additionally exemplified potential reasons for the inaccuracies. As such, we regard the triangulation approach in our studies as a valuable method to investigate the validity of computer-based assessments.

Limitations of the study and future research

Since the results of the current studies suggest that the fragments in CohViz are indicative of problems in argument overlap at the text level, future research should conduct empirical studies to test whether argument overlap is indeed the aspect of cohesion best supported by the feedback. Since the ability to establish cohesion by argument overlap typically develops at a younger age [67], future studies should also investigate the effectiveness of the feedback with younger students. Such studies would not only provide crucial information about the aspects of cohesion for which the tool is most effective but also increase the generalizability of the feedback.

Since we only used expository texts (i.e., explanatory texts) in our studies, it is not clear whether our findings would generalize to other text genres such as narrative texts. For instance, text cohesion is lower in narrative texts than in expository texts [8,68]. Moreover, genres may not only differ in the extent of text cohesion but may also rely on different linguistic devices to instantiate cohesion. For instance, in expository texts, experienced writers tend to use argument overlap and bridging information as linguistic devices, whereas in narratives, cohesion is often achieved through pronominal references. Given that the instantiation of cohesion may strongly depend on the particular genre, future studies should therefore replicate our findings in other genres.

Another related issue refers to the text length in our studies (around 200–300 words). Given that particularly semantic features based on latent semantic analyses work better for longer texts [6], the question remains whether our findings would replicate with longer texts. Then, however, CohViz concept maps might become too large and complex. One solution could be to provide students with several small and manageable concept maps portraying the cohesion for the different subsections of the text. A less detailed but more inclusive “macro-concept map” could visualize the central theme of the text by showing global relations between text elements. Whether novice writers will benefit from feedback consisting of such a combination of concept maps that differ in the level of detail is also a question to be addressed in further studies.

Conclusion

In conclusion, the two studies showed that CohViz is a reliable and valid approach to provide students with feedback on the cohesion of their writing. Verifying the accuracy of the feedback is an important step toward uncovering the mechanisms underlying the effectiveness of concept map feedback for text cohesion. The findings of the two studies provide evidence that CohViz is an effective tool to improve the cohesion of students’ texts because it accurately depicts deficits of text cohesion. Thus, CohViz may serve as a vehicle to supplement instructors’ feedback and enhance the cohesion of students’ writing.

Acknowledgments

We would like to thank Christina Schuba, Ricarda Budde, Arne Kappis, and Helen Dambach for coding the corpus from the reliability study. We furthermore express our gratitude to Zarah Weiß and Detmar Meurers from the Department of Theoretical Computer Linguistics (ISCL) of the University of Tübingen for providing the automated measures in the validity study.

Data Availability

The CohViz system is available on the Open Science Framework (OSF) under the following address: https://doi.org/10.17605/OSF.IO/SA53E. All data files and analyses of this manuscript are available on the OSF under the following address: https://doi.org/10.17605/OSF.IO/UHPF3.

Funding Statement

The authors received no specific funding for this work. The article processing charge was funded by the Baden-Wuerttemberg Ministry of Science, Research and Art and the University of Freiburg in the funding programme Open Access Publishing.

References

  • 1.Ozuru Y, Dempsey K, McNamara DS. Prior knowledge, reading skill, and text cohesion in the comprehension of science texts. Learn Instr. 2009;19: 228–242. 10.1016/j.learninstruc.2008.04.003
  • 2.Britton BK, Gülgöz S. Using Kintsch’s computational model to improve instructional text: Effects of repairing inference calls on recall and cognitive structures. J Educ Psychol. 1991;83: 329–345. 10.1037/0022-0663.83.3.329
  • 3.McNamara DS, Kintsch E, Songer NB, Kintsch W. Are Good Texts Always Better? Interactions of Text Coherence, Background Knowledge, and Levels of Understanding in Learning From Text. Cogn Instr. 1996;14: 1–43. 10.1207/s1532690xci1401_1
  • 4.Wiley J, Hastings P, Blaum D, Jaeger AJ, Hughes S, Wallace P, et al. Different Approaches to Assessing the Quality of Explanations Following a Multiple-Document Inquiry Activity in Science. Int J Artif Intell Educ. 2017;27: 758–790. 10.1007/s40593-017-0138-z
  • 5.MacArthur CA, Jennings A, Philippakos ZA. Which linguistic features predict quality of argumentative writing for college basic writers, and how do those features change with instruction? Read Writ. 2018;32: 1553–1574. 10.1007/s11145-018-9853-6
  • 6.McNamara DS, Louwerse MM, McCarthy PM, Graesser AC. Coh-Metrix: Capturing Linguistic Features of Cohesion. Discourse Process. 2010;47: 292–330. 10.1080/01638530902959943
  • 7.Halliday MAK, Hasan R. Cohesion in English. Routledge; 1976.
  • 8.Graesser AC, Millis KK, Zwaan RA. Discourse Comprehension. Annu Rev Psychol. 1997;48: 163–189. 10.1146/annurev.psych.48.1.163
  • 9.McNamara DS. Reading Both High-Coherence and Low-Coherence Texts: Effects of Text Sequence and Prior Knowledge. Can J Exp Psychol. 2001;55: 51–62. 10.1037/h0087352
  • 10.Concha S, Paratore JR. Local coherence in persuasive writing: An exploration of Chilean students’ metalinguistic knowledge, writing process, and writing products. Writ Commun. 2011;28: 34–69. 10.1177/0741088310383383
  • 11.Lachner A, Nückles M. Bothered by Abstractness or Engaged by Cohesion? Experts’ Explanations Enhance Novices’ Deep-Learning. J Exp Psychol Appl. 2015;21: 101–115. 10.1037/xap0000038
  • 12.Kellogg RT, Whiteford AP. Training Advanced Writing Skills: The Case for Deliberate Practice. Educ Psychol. 2009;44: 250–266. 10.1080/00461520903213600
  • 13.Allen LK, Jacovina ME, McNamara DS. Computer-Based Writing Instruction. In: MacArthur CA, Graham S, Fitzgerald J, editors. Handbook of Writing Research. 2nd ed. New York, London: The Guilford Press; 2016. pp. 316–329.
  • 14.Roscoe RD, Snow EL, McNamara DS. Feedback and revising in an intelligent tutoring system for writing strategies. Lect Notes Comput Sci. 2013;7926 LNAI: 259–268. 10.1007/978-3-642-39112-5-27
  • 15.Lachner A, Burkhart C, Nückles M. Mind the gap! Automated concept map feedback supports students in writing cohesive explanations. J Exp Psychol Appl. 2017;23: 29–46. 10.1037/xap0000111
  • 16.Villalon J, Calvo RA. Concept maps as cognitive visualizations of writing assignments. J Educ Technol Soc. 2011;14: 16–27.
  • 17.Ferrera L, Butcher K. Visualizing feedback: Using graphical cues to promote self-regulated learning. Proceedings of the Annual Meeting of the Cognitive Science Society. 2011.
  • 18.Hirst JM, DiGennaro Reed FD, Reed DD. Effects of Varying Feedback Accuracy on Task Acquisition: A Computerized Translational Study. J Behav Educ. 2013;22: 1–15. 10.1007/s10864-012-9162-0
  • 19.Lachner A, Backfisch I, Nückles M. Does the accuracy matter? Accurate concept map feedback helps students improve the cohesion of their explanations. Educ Technol Res Dev. 2018;66: 1051–1067. 10.1007/s11423-018-9571-4
  • 20.Burkhart C, Lachner A, Nückles M. CohViz—a computer-based feedback system to provide students with concept map feedback on the cohesion of their texts. 2020. 10.17605/OSF.IO/SA53E
  • 21.Lachner A, Neuburg C. Learning by writing explanations: computer-based feedback about the explanatory cohesion enhances students’ transfer. Instr Sci. 2018;47: 19–37. 10.1007/s11251-018-9470-4
  • 22.Lachner A, Burkhart C, Nückles M. Formative computer-based feedback in the university classroom: Specific concept maps scaffold students’ writing. Comput Human Behav. 2017;72: 459–469. 10.1016/j.chb.2017.03.008
  • 23.Lachner A, Thomas P, Breil P, Stankovic N. Optimierung von Flipped Classrooms durch konstruktive und interaktive Lernaktivitäten: Eine explorative Studie in der Philosophiedidaktik. In: Forschungs- und Entwicklungsfelder der Lehrerbildung auf dem Prüfstand: Ergebnisse der ersten Förderphase der Qualitätsoffensive Lehrerbildung an der Tübingen School of Education. Tübingen: Tübingen University Press; 2020.
  • 24.Roberts W. pygermanet. GermaNet API for Python. 2016. Available: https://github.com/wroberts/pygermanet
  • 25.Schmid H, Laws F. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1; 2008. pp. 777–784.
  • 26.Bird S, Loper E. NLTK: the natural language toolkit. Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. Association for Computational Linguistics; 2004. p. 31.
  • 27.Kintsch W, van Dijk TA. Toward a Model of Text Comprehension and Production. Psychol Rev. 1978;85.
  • 28.Van Valin RD. An introduction to syntax. Cambridge University Press; 2001.
  • 29.Berlanga AJ, van Rosmalen P, Boshuizen HPA, Sloep PB. Exploring formative feedback on textual assignments with the help of automatically created visual representations. J Comput Assist Learn. 2012;28: 146–160. 10.1111/j.1365-2729.2011.00425.x
  • 30.Lachner A, Schurer T. Effects of the Specificity and the Format of External Representations on Students’ Revisions of Fictitious Others’ Texts. J Writ Res. 2018;9: 333–351. 10.17239/jowr-2018.09.03.04
  • 31.Burkhart C, Lachner A, Nückles M. Applying Principles of Multimedia Learning to Support Students’ Expository Writing: Differential Effects of Spatial Contiguity and Signaling on Students’ Cohesive Writing. J Exp Psychol. 2019.
  • 32.Ifenthaler D. Toward automated computer-based visualization and assessment of team-based performance. J Educ Psychol. 2014;106: 651–665. 10.1037/a0035505
  • 33.Ifenthaler D, Masduki I, Seel NM. The mystery of cognitive structure and how we can detect it: tracking the development of cognitive structures over time. Instr Sci. 2009;39: 41–61. 10.1007/s11251-009-9097-6
  • 34.Novak JD. Concept mapping: A useful tool for science education. J Res Sci Teach. 1990;27: 937–949. 10.1002/tea.3660271003
  • 35.Novak JD, Cañas AJ. The theory underlying concept maps and how to construct and use them. 2008.
  • 36.Berendes K, Vajjala S, Meurers D, Bryant D, Wagner W, Chinkina M, et al. Reading demands in secondary school: Does the linguistic complexity of textbooks increase with grade level and the academic orientation of the school track? J Educ Psychol. 2018;110: 518–543. 10.1037/edu0000225
  • 37.Hancke J, Vajjala S, Meurers D. Readability classification for German using lexical, syntactic, and morphological features. Proceedings of COLING 2012. 2012. pp. 1063–1080.
  • 38.McNamara DS, Crossley S, Roscoe R. Natural language processing in an intelligent writing strategy tutoring system. Behav Res Methods. 2013;45: 499–515. 10.3758/s13428-012-0258-1
  • 39.Crossley SA, Kyle K, Dascalu M. The Tool for the Automatic Analysis of Cohesion 2.0: Integrating semantic similarity and text overlap. Behav Res Methods. 2018;51: 14–27. 10.3758/s13428-018-1142-4
  • 40.Graesser AC, McNamara DS, Louwerse MM, Cai Z. Coh-Metrix: Analysis of text on cohesion and language. Behav Res Methods Instrum Comput. 2004;36: 193–202. 10.3758/BF03195564
  • 41.Kincaid JP, Fishburne RP Jr, Rogers RL, Chissom BS. Derivation of New Readability Formulas for Navy Enlisted Personnel. 1975.
  • 42.Taylor CS. Validity and Validation. Oxford University Press; 2013. 10.1093/acprof:osobl/9780199791040.001.0001
  • 43.Kintsch W. Information accretion and reduction in text processing: Inferences. Discourse Process. 1993;16: 193–202. 10.1080/01638539309544837
  • 44.Burkhart C. Data Repository—Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system. 2020. 10.17605/OSF.IO/UHPF3
  • 45.Bondy A, Murty USR. Graph Theory: An Advanced Course. 1st ed. Springer; 2007.
  • 46.Graesser AC, McNamara DS, Louwerse MM, Cai Z. Coh-Metrix: Analysis of text on cohesion and language. Behav Res Methods Instrum Comput. 2004;36: 193–202. 10.3758/bf03195564
  • 47.McNamara DS, Crossley SA, McCarthy PM. Linguistic Features of Writing Quality. Writ Commun. 2009;27: 57–86. 10.1177/0741088309351547
  • 48.McNamara DS, Louwerse MM, McCarthy PM, Graesser AC. Coh-Metrix: Capturing Linguistic Features of Cohesion. Discourse Process. 2010;47: 292–330. 10.1080/01638530902959943
  • 49.Temperley D. Minimization of dependency length in written English. Cognition. 2007;105: 300–333. 10.1016/j.cognition.2006.09.011
  • 50.Gibson E. The dependency locality theory: A distance-based theory of linguistic complexity. Image, Language, Brain. 2000. pp. 95–126.
  • 51.Cooper TC. Measuring Written Syntactic Patterns of Second Language Learners of German. J Educ Res. 1976;69: 176–183. 10.1080/00220671.1976.10884868
  • 52.Schmidt JE. Die deutsche Substantivgruppe und die Attribuierungskomplikation. Tübingen: Niemeyer; 1993.
  • 53.Feng S, Cai Z, Crossley S, McNamara DS. Simulating human ratings on word concreteness. Twenty-Fourth International FLAIRS Conference. 2011.
  • 54.Hamp B, Feldweg H. GermaNet—a Lexical-Semantic Net for German. Proceedings of the ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Madrid; 1997.
  • 55.Linderholm T, Everson MG, Van Den Broek P, Mischinski M, Crittenden A, Samuels J. Effects of causal text revisions on more- and less-skilled readers’ comprehension of easy and difficult texts. Cogn Instr. 2000;18: 525–556. 10.1207/S1532690XCI1804_4
  • 56.Jinyuan L, Wan T, Guanqin C, Yin L, Changyong F, et al. Correlation and agreement: overview and clarification of competing concepts and measures. Shanghai Arch Psychiatry. 2016;28: 115–120. 10.11919/j.issn.1002-0829.216045
  • 57.Müller R, Büttner P. A critical discussion of intraclass correlation coefficients. Stat Med. 1994;13: 2465–2476. 10.1002/sim.4780132310
  • 58.Schraw G. A conceptual analysis of five measures of metacognitive monitoring. Metacognition Learn. 2009;4: 33–45. 10.1007/s11409-008-9031-3
  • 59.Brannen J. Mixing methods: Qualitative and quantitative research. Routledge; 2017.
  • 60.Harwood N. What Do Proofreaders of Student Writing Do to a Master’s Essay? Differing Interventions, Worrying Findings. Writ Commun. 2018;35: 474–530. 10.1177/0741088318786236
  • 61.Kellogg RT, Whiteford AP. Training Advanced Writing Skills: The Case for Deliberate Practice. Educ Psychol. 2009;44: 250–266. 10.1080/00461520903213600
  • 62.McNamara DS, Kintsch W. Learning from texts: Effects of prior knowledge and text coherence. Discourse Process. 1996;22: 247–288. 10.1080/01638539609544975
  • 63.Attali Y, Burstein J. Automated Essay Scoring With E-Rater® V.2.0. ETS Res Rep Ser. 2005; i–21. 10.1002/j.2333-8504.2004.tb01972.x
  • 64.Kyle K, Crossley S, Berger C. The tool for the automatic analysis of lexical sophistication TAALES: version 2.0. Behav Res Methods. 2017;50: 1030–1046. 10.3758/s13428-017-0924-4
  • 65.Liu OL, Rios JA, Heilman M, Gerard L, Linn MC. Validation of automated scoring of science assessments. J Res Sci Teach. 2016;53: 215–233. 10.1002/tea.21299
  • 66.Mao L, Liu OL, Roohr K, Belur V, Mulholland M, Lee H-S, et al. Validation of Automated Scoring for a Formative Assessment that Employs Scientific Argumentation. Educ Assess. 2018;23: 121–138. 10.1080/10627197.2018.1427570
  • 67.Crossley SA, Kyle K, McNamara DS. The development and use of cohesive devices in L2 writing and their relations to judgments of essay quality. J Second Lang Writ. 2016;32: 1–16. 10.1016/j.jslw.2016.01.003
  • 68.Graesser AC, McNamara DS. Computational Analyses of Multilevel Discourse Comprehension. Top Cogn Sci. 2011;3: 371–398. 10.1111/j.1756-8765.2010.01081.x

Decision Letter 0

Maciej Huk

24 Jan 2020

PONE-D-19-33124

Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system

PLOS ONE

Dear Mr. Burkhart,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular:

  • Some of the analysed measures should be defined formally (e.g. "mean readability score based on the Flesch-Kincaid measure"),

  • Details of statistical analysis procedure should be given,

  • Demographic information of students and human experts is missing,

  • The information about the authors of CohViz system should be clearly presented,

  • The text should be changed to remove not needed repetitions of sentences and information.

We would appreciate receiving your revised manuscript by Mar 09 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified whether consent was informed.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the quality and overall rigor of the studies are very good; however, the writing about the statistics in Study 2 is confusing enough that I can't assert that the work meets the Plos One standard. (Although I do think it likely does.)

I also attached the below comments in a separate document, but overall, I do think this is a publishable paper. I would like to see more practical advice for using the tool.

Major comments:

This is a helpful paper because it validates the use of a specific tool for student feedback on their writing. It is unclear how the authors think this tool is being used and how it can be used.

The paper is very long and some information seems unnecessary. There are, for example, many examples. There is a lot of repetition both on a sentence level and in terms of information repeated again in later explanations that should be adjusted.

The paper is flawed as a piece of writing because of the way that the two studies were basically pasted together as well as the chronological presentation of information. The paper needs to be reorganized to explain that the authors studied validity and reliability of the CohViz system and have one methods section that outlines all the methods (preplanned and post hoc) and one results and discussion section. The paper does need to be rewritten and some of the framing and examples have to be sacrificed to make the paper work as a whole.

The current overall discussion section is very good—really concise, helpful, and well-written. The rest of the paper should follow this model and state things once and without so much explanation.

The results should be tabulated and not embedded in the text.

Specific comments:

Abstract:

• Overall this is clear and well written, describing a helpful pair of studies about a tool already in use.

• It would be helpful to know the discipline of the teachers using CohViz

• The authors should be careful not to use the same word multiple times in a sentence or phrase

• It would be helpful to know the general length of the texts assessed

Introduction:

• Please delete the information about aspirin and the following sentence. Your prior point is cogent and adequate and does not need more support.

• It would be helpful to have a little more information about the setting for the studies. Just a couple of words—college? High school? English? Writing? Wikipedia?

• Overall—the introduction does not seem to me to be introducing the major concepts that occur later in the paper and therefore it is confusing

Computer-Based Systems on Students’ Cohesive Writing

• I’m not sure what this section adds. Would it be sufficient to have a single paragraph providing a short overview of the idea of using a computer-based system for evaluating cohesion in the earlier section?

CohViz: A Feedback System to Support Students’ Cohesive Writing

• I still would like more information about the setting for using this tool—who uses it?

• I am unsure when and where the cited studies by Lachner and colleagues were completed. In a college? Recently? Why?

Accuracy as a Critical Component of Effective Computer-Based Feedback, Reliability, Validity

• I’m not sure what these sections add. Ideally, the section about CohViz should simply state the limitations of the prior studies, the actual needs of users, and the need for the current studies much more concisely, following the style of the current overall discussion.

Overview of the Present Studies

• I would not describe the studies as “empirical”

• I would describe the studies in terms of what they are testing (“reliability” and “validity”) instead of calling them Study 1 and Study 2 and expecting the reader to keep track of the shifts

Study 1 Methods:

• I find this section confusing, largely because I’m not sure who is doing what where and when and to which corpus of texts. Ideally the methods would identify in order:

o The goal of the study (assess reliability of concept maps between machine and human coders)

o How human coders were identified

o Which corpus of work was used for CohViz versus human coders. How was the full corpus generated, in which setting, and by whom? How was the smaller representative sample generated, in which setting, and by whom?

o How were the outputs generated? Who reviewed them?

• I would move the explanation about how CohViz works to the introduction

• I would strongly recommend avoiding the use of “Lincoln freed the slaves” as an example. It would be highly offensive to many US readers. Please choose something more neutral. Maybe “Hollywood baked the rolls”?

Study 1: Results and discussion:

• Please move the statistical methods back to the methods section

• This discussion is far too long and has far too many little asides and explanations. As a reader, it is helpful to have the results presented as:

o How big were the samples reviewed? Did the representative sample for human coders adequately mimic the overall sample? How did you know that?

o What actually was completed? How many concept maps were generated and reviewed?

o How well did the two methods correlate?

Study 2:

• This seems very long and unclear because information that should be in the introduction is interspersed throughout the text, there are methods in the results section, and there is too much explanation as the results are presented. It simply feels very similar to the problems I noted in Study 1, where the methods and results are not clearly laid out in a logical order, but rather seem to demand that the reader carefully track with the researchers.

Overall discussion:

This is generally well written and concise. Please describe the studies in terms of what they tested and not by numbers. It makes the work very hard to read.

Limitations and future directions:

This seems very wordy and a bit too speculative. I applaud the correctness of the limitations identified, but the lengthy caveats and explanations are not really helpful. More helpful are the practical modes of moving forward.

Reviewer #2: The authors showed that CohViz is a reliable and valid approach to provide students with feedback on the cohesion of their writing. A major issue which might lead to an almost-reject decision is that their research purpose is not convincing. Specifically, the authors claim at the beginning that the effectiveness of CohViz may come from a superficial warning or from an essential improvement in cohesion, and they find in the end that its effectiveness comes from the essential improvement in cohesion. The reason behind their questioning that the effectiveness of CohViz may only come from a superficial warning is not solid. Generally speaking, CohViz-like systems are developed from the idea that bad writing usually goes along with bad cohesion. It is hard to believe that these systems do not benefit from this design idea. As in the aspirin example illustrated by the authors themselves, the authors’ work on CohViz is like proving that aspirin is not a placebo. I believe this is not enough in terms of innovation. If the authors can provide an example showing that some students do not actually improve the cohesion suggested by CohViz-like systems, but make other changes which also improve the quality of their writing, I will be more than happy to recommend their revision for acceptance. Other problems are listed as follows.

1) Second Paragraph in Introduction:

The first question is: Which references among [4,5,10,11] do the authors believe to document CohViz is a beneficial tool? I believe there is none, otherwise they ought to provide a specific reference indicating where Figure 1 comes from. I suggest that the authors first define CohViz as a family/cluster of computer-based feedback systems which help students’ writing using cohesion. If CohViz is really a specific tool, then the success of W-pal (the real system developed in [10]) does not necessarily mean CohViz is also a success. At least, the authors may add that CohViz is developed based on a duplication of W-pal.

The second question is: I understand the purpose of the authors using the aspirin example, but it is far from CohViz. The authors may simply indicate that they know CohViz is effective, but do not know how it works or what detailed mechanism makes CohViz effective. That is enough. There is no need to use an irrelevant example to point out the logic error (there is even no need to mention the logic error, because it is probably the authors’ own logic, not others’). Other readers may simply want to know more, e.g., why is it effective? If this is true, then there is actually no logic error. Please go quickly and straight to the point. I am also disappointed when reading the conclusion of this manuscript that it only confirms CohViz’s effectiveness but seemingly forgets the why question.

To sum up, the 2nd paragraph in Introduction needs to be re-written.

2) Page 5, the first sentence under the subtitle “CohViz: a feedback system to support students’ cohesive writing”. I did not find any “CohViz” in [5]. Do you mean [5] developed “CohViz”? I believe not.

3) Page 6, Lachner, Burkhart, and Nückles are not the authors of [5], and they are also not the authors of [4].

4) Related work is not reviewed adequately. Other works can be used as a starting point, such as [5]. The authors may tell readers what [5] has already achieved in the mechanism investigation.

5) The commonly accepted notions of reliability and validity do not need to be re-introduced.

6) The demographic information of students and human experts is missing.

Reviewer #3: >>> 1. Language problems:

1.1 hypernyms => hyperonyms (many times)

1.2 funders => founders

>>> 2. Presentation problems:

2.1 Table 1: neither the table title nor its header specifies the meaning of the values in brackets

2.2. The format of references is not uniform (others => et al.)

2.3 Fig 1 and Fig 3: the quality is low; please consider a vector format

>>> 3. Other problems:

3.1 Some of the analyzed measures were not defined formally, e.g. "mean readability score based on the Flesch-Kincaid measure", "Adjacent semantic overlap", "Word concreteness".

For example, "Adjacent semantic overlap measures the cosine similarity between neighboring sentences, and text level semantic overlap measures the cosine similarity between all possible sentence pairs of the text." does not explain how the measure is calculated.
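For concreteness, the quoted definitions can be sketched in code. The following is only an illustrative bag-of-words implementation; the manuscript's actual computation may rely on semantic vectors such as LSA, and all function names here are hypothetical. The Flesch-Kincaid grade-level formula uses the standard published coefficients.

```python
import math
from collections import Counter

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts (standard coefficients)."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

def cosine_similarity(sent_a, sent_b):
    """Cosine similarity between two sentences, here using bag-of-words count vectors."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def adjacent_semantic_overlap(sentences):
    """Mean cosine similarity between neighboring sentence pairs."""
    sims = [cosine_similarity(s, t) for s, t in zip(sentences, sentences[1:])]
    return sum(sims) / len(sims) if sims else 0.0

def text_level_semantic_overlap(sentences):
    """Mean cosine similarity over all possible sentence pairs of the text."""
    sims = [cosine_similarity(sentences[i], sentences[j])
            for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
    return sum(sims) / len(sims) if sims else 0.0
```

Higher overlap values indicate more lexical continuity between sentences, which is the kind of signal cohesion feedback builds on; a semantic-vector variant would replace the count vectors with sentence embeddings.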

3.2 The obtained results can be biased by the selection of the human experts. The process of their selection is not given.

3.3 It would be good to give information on who the authors of the CohViz system are. In the current form of the text, this is unclear.

Summary: The language is good. The quality of the illustrations is low and the presentation should be improved. But the most important areas that need fixing are the definitions of the analyzed measures and the analysis of the characteristics of the population of human experts used.

Recommendation: major rework

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: L DeTora

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PLoS One.docx

PLoS One. 2020 Jun 29;15(6):e0235209. doi: 10.1371/journal.pone.0235209.r002

Author response to Decision Letter 0


9 May 2020

Reviewer #1:

I think the quality and overall rigor of the studies are very good; however, the writing about the statistics in Study 2 is confusing enough that I can't assert that the work meets the Plos One standard. (Although I do think it likely does.)

I also attached the below comments in a separate document, but overall, I do think this is a publishable paper. I would like to see more practical advice for using the tool.

Major comments:

#1: This is a helpful paper because it validates the use of a specific tool for student feedback on their writing. It is unclear how the authors think this tool is being used and how it can be used.

In the introduction of the revised manuscript we added a section in which we explain how and for what purposes the tool can be used (i.e., “CohViz: A feedback system to support students’ cohesive writing”, see p. 4). We argue that the main advantage of CohViz is that it provides students with instant feedback on the cohesion of their texts and thus supplements teachers‘ feedback capabilities. In particular, we provide users with use cases in which we explain how teachers can use the feedback as a modeling tool and how students can revise their texts, for example in terms of a homework assignment. In addition, in the discussion, we now explain how the results of the current study can contribute to the utility of the tool. In that section we argue that the results have shown that CohViz is particularly valid for deficits in students’ ability to establish argument overlap. As a practical recommendation we therefore suggest that in future studies the effectiveness of the tool should be tested with writers who have specific problems with argument overlap.

#2: The paper is very long and some information seems unnecessary. There are, for example, many examples. There is a lot of repetition both on a sentence level and in terms of information repeated again in later explanations that should be adjusted.

We thoroughly went over the whole manuscript and tried to remove all repetitions. In the course of restructuring the manuscript (see #3), we additionally removed many repetitions concerning the overall structure of the manuscript.

#3: The paper is flawed as a piece of writing because of the way that the two studies were basically pasted together as well as the chronological presentation of information. The paper needs to be reorganized to explain that the authors studied validity and reliability of the CohViz system and have one methods section that outlines all the methods (preplanned and post hoc) and one results and discussion section. The paper does need to be rewritten and some of the framing and examples have to be sacrificed to make the paper work as a whole.

We have thoroughly reorganized the entire manuscript. The revised manuscript now has only one methods, results, and discussion section. In the introduction we focused more on the limitations of previous studies and the need to study the accuracy of CohViz. In the method section we followed the rhetorical structure of similar studies published in PLOS ONE and divided the methods section into the subsections “Corpora collection”, “Measures”, and “Data Analysis” (see Abu-Shaheen et al., 2018; Hara et al., 2016; Kogure et al., 2014; Kruizinga et al., 2012; Steenson et al., 2018). In the results section we again followed the outline of previous studies and separately presented the results for the reliability and validity of the feedback. In addition, as mentioned in #2, we removed unnecessary repetitions and examples from the manuscript.

#4: The current overall discussion section is very good—really concise, helpful, and well-written. The rest of the paper should follow this model and state things once and without so much explanation.

As noted in our response to comment #1, we have tried to follow the style of the overall discussion and to remove repetitions from the manuscript.

#5: The results should be tabulated and not embedded in the text.

We went through the results carefully and looked for data that was not reported in tabular form. We found one case in Study 1 (now the reliability study) and integrated these results into Table 3. In addition, we have moved information about the two corpora presented in the body text to their corresponding tables.

Specific comments:

#6: Abstract: Overall this is clear and well written, describing a helpful pair of studies about a tool already in use. It would be helpful to know the discipline of the teachers using CohViz.

So far, CohViz has mainly been used for research purposes in laboratory settings and in ecologically valid classroom situations (e.g., philosophy education, educational research, teacher education). On page 6, we have included the information accordingly:

"So far, the effectiveness of CohViz has been tested in various settings between 2017 and 2019, including controlled laboratory studies and ecologically valid field studies in a wide range of disciplines (e.g., biology, ethics, philosophy, educational psychology, teacher education). In these experimental studies, the authors examined the effectiveness of CohViz with college students [15,20,21,29]. Overall, in a mini meta-analysis, Burkhart, Lachner, and Nückles [30] showed that CohViz has a medium effect on both local and global cohesion and was thus effective in improving the cohesion of students’ texts (i.e., local cohesion g = 0.62; global cohesion g = 0.57). A potential explanation for the obtained effects of CohViz can be found in the think-aloud study by Lachner, Burkhart, and Nückles [15], which showed that students who processed the concept maps could infer both local and global writing plans directly from the concept maps. In addition, the analysis of the concept maps triggered negative monitoring processes (i.e., students thought about the macrostructure of a text), which led to further planning processes."

#7: The authors should be careful not to use the same word multiple times in a sentence or phrase

As mentioned in comments #1 and #4, we have carefully revised the manuscript for potential word redundancies and deleted them accordingly.

#8: It would be helpful to know the general length of the texts assessed.

The length of the texts in the two corpora is reported in Table 1 and Table 2. In the revised manuscript we also moved some descriptions of the corpora (e.g., number of sentences) from the body text to the tables. For both studies we now report the number of words and sentences in their respective tables (see pages 10 and 12).

#9: Introduction: Please delete the information about aspirin and the following sentence. Your prior point is cogent and adequate and does not need more support.

We removed the information about aspirin from the manuscript.

#10: It would be helpful to have a little more information about the setting for the studies. Just a couple of words—college? High school? English? Writing? Wikipedia?

In the revised manuscript, we added some information about the setting of the studies (see p. 8):

"The corpus to test the reliability of CohViz was compiled by the authors as a sample from an entire set of 901 expository texts written in German by college students. All texts were produced by novice writers in the course of different experimental studies conducted by the authors between 2015 and 2018 at German universities. We made sure to compile texts with a representative range of topics in the natural sciences and the humanities. In all of these studies, the dependent variables of interest were measures of text quality (e.g., local and global cohesion)."

#11: Overall—the introduction does not seem to me to be introducing the major concepts that occur later in the paper and therefore it is confusing.

We have thoroughly restructured the introduction to make the major concepts of the manuscript clearer. After providing the reader with an introduction to the lack of research on the feedback’s accuracy, we introduce CohViz as the main computer-based feedback system of this manuscript (see heading "CohViz: A feedback system to support students’ cohesive writing"). Next, we explain that previous research only measured the external validity of the tool and introduce the reader to the lack of research on the reliability and validity of the feedback (see heading "Previous research on the effectiveness of CohViz"). We then explain the two main research questions of the study. In addition, we decided to remove the heading "Computer-Based Systems on Students’ Cohesive Writing" (see #12) from the manuscript, since the concepts discussed in this section probably misled readers about the purpose of the manuscript.

#12: Computer-Based Systems on Students’ Cohesive Writing. I’m not sure what this section adds. Would it be sufficient to have a single paragraph providing a short overview of the idea of using a computer-based system for evaluating cohesion in the earlier section?

The section "Computer-Based Systems on Students’ Cohesive Writing" introduced the reader to the concept of text cohesion and computer-based feedback systems. However, in line with comments #2 and #7, some information in this section was redundant with the introduction. Therefore, we removed the section from the manuscript and incorporated the information on computer-based feedback in the introduction. Overall, we shortened the text segment on computer-based feedback systems to a single paragraph.

"Despite its central role in supporting readers’ comprehension, college students often face difficulties in writing cohesive texts [10,11]. Therefore, particularly students require ample formative feedback on the cohesion of their writing [12]. Providing instant feedback, however, is relatively time-consuming and often not feasible during regular teaching. Thus, a variety of computer-based feedback systems have recently been developed to improve students’ writing for specific linguistic features such as text cohesion, particularly in the early stages of writing instruction [12–15]. The advantage of these systems is that they provide students with instant and time-independent information about the quality of their writing. Many of these systems generate graphical visualizations from students’ texts in the form of concept maps [16,17]. These concept maps provide students with an additional external representation of their text and direct their attention to distinct textual deficits in order to activate appropriate revision activities [15]."

#13: CohViz: A Feedback System to Support Students’ Cohesive Writing. I still would like more information about the setting for using this tool—who uses it?

As explained in comment #6, CohViz has so far mainly been used as a research tool in studies conducted by the authors. In these studies, we have used the tool both in controlled laboratory situations and in ecologically valid classroom contexts. However, since the utility of CohViz was not explained in sufficient detail in the manuscript, we have added a short section to the introduction in which we explain for which purposes CohViz is particularly valuable.

"The key advantage of CohViz is that it supplements teachers’ feedback by providing students with quick and specific feedback on the degree of cohesion of their texts. In practice, the feedback can be embedded in a homework assignment [21] or modelled by teachers to show the textual deficits of non-cohesive texts [15]. For example, an instructor could use the tool to show students why a particular text is high or low in cohesion (e.g., by showing texts which yield numerous fragments or texts which do not have central concepts with many relations). Similarly, students can use the tool to check short texts such as abstracts or summaries for cohesion during writing."

#14: I am unsure when and where the cited studies by Lachner and colleagues were completed. In a college? Recently? Why?

We added the information accordingly; see also comment #6. To date, five studies with CohViz have been published (between 2017 and 2019). The studies were conducted with German university students. The main reason for conducting these studies was students’ difficulties in writing cohesive texts. Since writing cohesive texts is a central problem for students, we developed an automated feedback system to give them immediate feedback on the cohesion of their texts. Earlier work by Lachner and Nückles (2015) showed that the explanations of university students are more fragmented and thus less cohesive than the explanations of experts. In the revised manuscript, we added the information about the studies accordingly:

"So far, the effectiveness of CohViz has been tested in various settings between 2017 and 2019, including controlled laboratory studies and ecologically valid field studies in a wide range of disciplines (e.g., biology, ethics, philosophy, educational psychology, teacher education). In these experimental studies, the authors examined the effectiveness of CohViz with college students [15,20,21,29]. Overall, in a mini meta-analysis, Burkhart, Lachner, and Nückles [30] showed that CohViz has a medium effect on both local and global cohesion and was thus effective in improving the cohesion of students’ texts (i.e., local cohesion g = 0.62; global cohesion g = 0.57). A potential explanation for the obtained effects of CohViz can be found in the think-aloud study by Lachner, Burkhart, and Nückles [15], which showed that students who processed the concept maps could infer both local and global writing plans directly from the concept maps. In addition, the analysis of the concept maps triggered negative monitoring processes (i.e., students thought about the macrostructure of a text), which led to further planning processes."

Additionally, due to the comment, we became aware that some of the references were not updated correctly because of technical issues. For example, in the manuscript we incorrectly credited Temperley as a reference for CohViz. We carefully revised the manuscript accordingly. Thank you again for making us aware of potential flaws in the references.

#15: Accuracy as a Critical Component of Effective Computer-Based Feedback, Reliability, Validity. I’m not sure what these sections add. Ideally, the section about CohViz should simply state the limitations of the prior studies, the actual needs of users, and the need for the current studies much more concisely, following the style of the current overall discussion.

In the revised manuscript, we have followed the reviewer’s recommendations and removed the headings mentioned from the manuscript. In the introduction, we first provide a thorough introduction to CohViz, its efficacy and applicability, and then present previous research on its effectiveness. We then go straight to the limitations of previous studies and highlight that the basic assumptions of reliability and validity have not been rigorously tested so far.

#16: Overview of the Present Studies. I would not describe the studies as “empirical”.

In the revised manuscript, we refrained from using the term "empirical" and now speak of a "corpus-based study".

#17: I would describe the studies in terms of what they are testing (“reliability” and “validity”) instead of calling them Study 1 and Study 2 and expecting the reader to keep track of the shifts.

In the revised manuscript, we changed the labelling of the studies: Study 1 is now the reliability study and Study 2 the validity study.

#18: Study 1 Methods: I find this section confusing, largely because I’m not sure who is doing what where and when and to which corpus of texts. Ideally the methods would identify in order: The goal of the study (assess reliability of concept maps between machine and human coders); How human coders were identified; Which corpus of work was used for CohViz versus human coders. How was the full corpus generated and in which setting and by whom? How was the smaller representative sample generated in which setting and why whom?; How were the outputs generated? Who reviewed them?

Indeed, the description of the process was not clearly laid out in the manuscript. Therefore, with regard to the reviewer’s comment, we have tried to address each question raised by the reviewer in the corpus section of the first study on reliability (see p. 8). In addition, to increase the comprehensibility of the methods section, we have added a figure (Fig 2) in which we depict the data analysis procedure for each study individually. With regard to the ordering of the methods section, as mentioned in comment #3, we have followed the style of previous studies published in PLOS ONE and report the subsections in the following order: “Corpora collection”, “Measures”, and “Data Analysis”.

#19: I would move the explanation about how CohViz works to the introduction.

We followed the recommendations of the reviewer and moved the section on how CohViz works to the introduction (see heading “Generation of the CohViz concept maps”).

#20: I would strongly recommend avoiding the use of “Lincoln freed the slaves” as an example. It would be highly offensive to many US readers. Please choose something more neutral. Maybe “Hollywood baked the rolls”?

We changed the example and now report the example provided by the reviewer.

#21: Study 1: Results and discussion: Please move the statistical methods back to the methods section

On inspection of the manuscript, we saw that the power analyses for both studies were presented in the results section. Accordingly, we have moved these power analyses to the data analysis sections of the methods.

#22: This discussion is far too long and has far too many little asides and explanations. As a reader, it is helpful to have the results presented as: How big were the samples reviewed? Did the representative sample for human coders adequately mimic the overall sample? How did you know that? What actually was completed? How many concept maps were generated and reviewed? How well did the two methods correlate?

We followed the suggestions in comment #3 and now provide only a general discussion of both studies. Therefore, we removed the discussion of Study 1 (the reliability study) from the manuscript.

#23: Study 2: This seems very long and not clear because information that should be in the introduction is interspersed throughout the text, there are methods in the results section and too much explanation as the results are presented. It simply feels very similar to the problems I noted in Study 1, where the methods and results are not clearly laid out in a logical order, but rather seem to demand that the reader carefully track with the researchers.

As mentioned in #18, we have thoroughly restructured the methods and results section of the manuscript. To increase the comprehensibility of the results, we divided this section into a subsection on reliability and validity (see also comment #3).

#24: Overall discussion: This is generally well written and concise. Please describe the studies in terms of what they tested and not by numbers. It makes the work very hard to read.

As mentioned in #17, we have renamed the studies: Study 1 is now the reliability study and Study 2 the validity study.

#25: Limitations and future directions: This seems very wordy and a bit too speculative. I applaud the correctness of the limitations identified, but the lengthy caveats and explanations are not really helpful. More helpful are the practical modes of moving forward.

In the revised manuscript, we have shortened the technical limitations of the current study and added a practical limitation which, to our knowledge, would be a logical next step to improve the utility of the feedback for improving students’ ability to write cohesive texts.

Reviewer #2:

#1: The authors showed that CohViz is a reliable and valid approach to provide students with feedback on the cohesion of their writing. A major cause which might lead to an almost-reject decision is that their research purpose is not convincing. Specifically, the authors claim at the beginning that the effectiveness of CohViz may come from a superficial warning or an essential improvement in cohesion. And they find at last that the effectiveness of CohViz comes from the essential improvement in cohesion. The reason behind their questioning that the effectiveness of CohViz may only come from a superficial warning is not solid. Generally speaking, CohViz-like systems are developed from the idea that bad writing usually goes along with bad cohesion. It is hard to believe that these systems do not benefit from the above design idea. As in the aspirin example illustrated by the authors themselves, the authors’ work on CohViz is like proving that aspirin is not a placebo. I believe this is not enough in terms of innovation. If the authors can provide an example showing that some students do not actually improve the cohesion suggested by CohViz-like systems, but make other changes which also improve the quality of writing, I will be more than happy to recommend their revisions to be accepted. Other problems are listed as follows.

We carefully thought about the comment. We agree that the tool is only capable of providing feedback on the cohesion of texts. However, we want to note that besides other textual features such as lexical complexity, syntactic complexity, and text length, cohesion has been demonstrated to be a crucial textual feature which contributes to the comprehensibility of texts, and as such to writing quality (see MacArthur, Jennings, & Philippakos, 2018; Wiley et al., 2017). In addition, previous research has shown that cohesion contributes to the overall comprehensibility of texts (McNamara, 2001; McNamara, Kintsch, Songer, & Kintsch, 1996; Ozuru, Dempsey, & McNamara, 2007; O'Reilly & McNamara, 2007). Empirical research provided further evidence that students face difficulties in writing cohesive texts (Connor, 1984; Granger & Tyson, 1996; Hinkel, 2001; Zhang, 2000). The need to support students in writing cohesive texts is also reflected by the fact that a plethora of writing guides at universities aim to improve students’ ability to write cohesive texts (e.g., Education Development Unit, 2020; Purdue Online Writing Lab, 2020; The University of Melbourne, 2020; Writing Center, 2020). Against this background, the central educational goal of CohViz is to improve students' ability to write cohesive texts. Previous research has confirmed the effectiveness of CohViz. For example, on page 6 of the revised manuscript we report the results of a mini meta-analysis which showed that CohViz has a medium effect on the local and global cohesion of students’ texts. Unfortunately, with regard to comment #5, some references, including references to the effectiveness of CohViz, were incorrect because the reference list was not updated correctly. We have corrected this error in the revised manuscript and now report on previous research on CohViz correctly.

The central purpose of the study results from the fact that previous studies have not addressed the central question of the reliability and validity of the feedback. Although we know the (meta-)cognitive processes triggered by the feedback, we have lacked the fundamental basis for these processes, namely whether the system provides students with accurate feedback on textual cohesion. These results may help to better understand the chain of effects of the feedback. In the field of automated writing evaluation systems, such studies are of utmost importance in order to establish the credibility of the system (Allen et al., 2016; Attali, 2013). What automated writing evaluation systems and computer-based feedback systems have in common is that they have to diagnose certain textual deficits from the texts. The accurate representation of these deficits is therefore of central importance for the feedback system. In fact, most computer-based feedback systems rely directly on automated text evaluation systems that have been tested for reliability and validity (see Roscoe & McNamara, 2013; Wade-Stein & Kintsch, 2004) to provide students with feedback on the quality of their texts. The current study therefore closes this research gap and helps to better understand the chain of effects of concept map feedback to improve students’ ability to write cohesive texts.

#2: Second Paragraph in Introduction: The first question is: Which references among [4,5,10,11] do the authors believe to document CohViz is a beneficial tool? I believe there is none, otherwise they ought to provide a specific reference indicating where Figure 1 comes from. I suggest that the authors first define CohViz as a family/cluster of computer-based feedback systems which help students’ writing using cohesion. If CohViz is really a specific tool, then the success of W-Pal (the real system developed in [10]) does not necessarily mean CohViz is also a success. At least, the authors may add that CohViz is developed based on a duplication of W-Pal.

Indeed, some references in the manuscript were not correct as the reference list was not updated correctly. For example, in the issue raised by the reviewer, we mistakenly credited Temperley as a reference for CohViz. As mentioned in #1, we have already conducted several studies on the effectiveness of CohViz. In a comprehensive meta-analysis, for example, we were able to show that CohViz has a medium effect on both the local and global cohesion of students' texts (Burkhart, Lachner, & Nückles, 2020). We have carefully corrected these references in the revised manuscript and hope that all references are now given correctly.

#3: The second question is: I understand the purpose of authors using the Aspirin example, but it is far from CohViz. The authors may simply indicate that they know CohViz is effective, but do not know how it works or do not know the detail mechanism makes CohViz effective. That is enough. NO need to use an irrelevant example to point out the logic error (even do not need to mention the logic error because it may probably be the authors’ own logic, not others’). Other readers may simply want to know more, e.g., why is it effective? If this is true, then there is actually no logic error. Please go quick and straight to the point.

In the revised manuscript, we have removed the information on aspirin from the manuscript and tried to make it clear that the main research question of the current study was to investigate how reliable and valid the feedback is in terms of text cohesion. We have also removed the sentence about the logic error from the section.

#4: And I am also disappointed when reading the conclusion of this manuscript that it only confirms CohViz’s effectiveness, but seemingly forgets the why question. To sum up, the 2nd paragraph in Introduction needs to be re-written.

We agree with the reviewer that the why question has not been sufficiently addressed in the manuscript. We therefore added a section on the (meta-)cognitive processes triggered by the feedback to the introduction (see p. 6).

#4: 2) Page 5, the first sentence under the subtitle “CohViz: a feedback system to support students’ cohesive writing”. I did not find any “CohViz” in [5]. Do you mean [5] developed “CohViz”? I believe not.

This comment directly relates to #2. We corrected the mistake in the revised manuscript.

#5: 3) Page 6, Lachner, Burkhart, and Nückles are not the authors of [5], and they are also not the authors of [4].

This comment directly relates to #2. We corrected the mistake in the revised manuscript.

#6: 4) Related work is not reviewed adequately. Other works can be used as a start point, such as [5]. The authors may tell readers what [5] has already achieved in the mechanism investigation.

This comment directly relates to #2. We corrected the mistake in the revised manuscript.

#7: 5) The common-sense notions of reliability and validity do not need to be re-introduced.


We have followed the recommendations of the reviewer and removed the general description of reliability and validity from the manuscript.

#8: The demographic information of students and human experts is missing.

In the methods section of the revised manuscript, we have added the mean age of the human expert raters and the standard deviation. Regarding the demographic information of the students from the corpus used for the reliability study, we now report that they were college students from German universities. However, since the sample was randomly drawn from 901 texts, we are not able to report the exact mean age of the students (see p. 13):

"To compare the CohViz concept maps to concept maps generated from human expert raters, we asked four human expert raters to segment the texts from the reliability corpus into propositions as a basis for the generation of the concept maps (see Fig 4 for the full processing of the corpus). Therefore, we aimed for experienced raters with a solid background in linguistics who had previous experience in propositional segmentation. Among the raters who fulfilled these criteria, we asked four advanced master students with a major in applied linguistics or learning and instruction to analyze the corpus. All raters came from the same German university as the authors. Their mean age was 24 (SD = 2.94). They were already familiar with the procedure of propositional segmentations since propositional segmentation was part of their studies’ curriculum. To ensure a uniform prior knowledge on the procedure, each rater was provided with multiple in-depth training sessions (five hours on average) on propositional segmentation and text cohesion. In these training sessions, raters were instructed on different cohesion strategies (e.g., argument overlap, connectives, bridging information). Additionally, they were trained in propositional segmentation with authentic practice material."

Reviewer #3:

#1: 1. Language problems: 1.1. hypernyms => hyperonyms (many times); 1.2. funders => founders

We carefully revised the manuscript for language problems and had a native English speaker proofread the entire manuscript. In the literature, there is a distinction between the terms hypernym and hyponym. Hyponyms refer to subordinate concepts (e.g., red in relation to color) whereas hypernyms refer to superordinate concepts (e.g., color in relation to red). We introduced the term hypernym in the section on "Word concreteness". In this section, we argue that an abstract word is defined by the fact that it has more hypernyms than a non-abstract word. We think, therefore, that the use of the term hypernym in this case is warranted. The concept of hyponyms has only been mentioned once, as an example of a cohesion device between adjacent sentences (see "Generation of the human expert concept maps"). However, to make it more comprehensible, we now speak of subordinate concepts. As for the terms funders and founders, we have not found any occurrences in the manuscript. Nevertheless, we checked the entire manuscript for potential flaws and corrected them where necessary.

#2: 2. Presentation problems: 2.1 Table 1: neither the table title nor its header specifies the meaning of the values in brackets

In fact, the explanation of the values in brackets was missing from Table 1. We have therefore added a note explaining that these values refer to standard deviations: "aValues in brackets refer to standard deviations."

#3: 2.2. The format of references is not uniform (others => et al.)

We went through the manuscript carefully and used et al. for all references where appropriate for the Vancouver style.

#4: 2.3 Fig 1., Fig 3., the quality is low, please consider vector format

We have redesigned all figures to ensure the quality of the figures.

#5: 3. Other problems: 3.1 some of the analyzed measures were not defined formally, e.g. "mean readability score based on the Flesch-Kincaid measure", "Adjacent semantic overlap", "Word concreteness"; e.g. "Adjacent semantic overlap measures the cosine similarity between neighboring sentences, and text level semantic overlap measures the cosine similarity between all possible sentence pairs of the text." does not explain how the measure is calculated.

Indeed, the Flesch-Kincaid measure was not introduced in the manuscript. Hence, we added a short explanation of what it measures and how to interpret it (see p. 9). As for the divergent and convergent measures (e.g., adjacent semantic overlap), we thoroughly explained how these measures were generated in the section on "Measures of convergent and divergent validity". In addition, we added an explanation of the measures root type-token ratio and semantic overlap in Fig 4. However, if the reviewer would like to see a more detailed explanation of how these measures were generated, we will be happy to provide it.
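For illustration, the two overlap measures described above can be sketched in a few lines. This is a minimal sketch, not the CohViz implementation: it assumes each sentence has already been mapped to a numeric vector (e.g., by a semantic space model), and the function names are ours.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)

def adjacent_semantic_overlap(sentence_vectors):
    """Mean cosine similarity between neighboring sentences."""
    sims = [cosine(u, v)
            for u, v in zip(sentence_vectors, sentence_vectors[1:])]
    return sum(sims) / len(sims)

def text_level_semantic_overlap(sentence_vectors):
    """Mean cosine similarity over all distinct sentence pairs."""
    n = len(sentence_vectors)
    sims = [cosine(sentence_vectors[i], sentence_vectors[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)
```

For a three-sentence text, the adjacent measure averages two neighboring-pair similarities, while the text-level measure averages all three possible pairs; a cohesive text with high argument overlap should score higher on both.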

#6: 3.2 Obtained results can be biased by the selection of the human experts. The process of their selection is not given.

Even though there is evidence that raters differ in their rating behaviors (Eckes, 2008), the results of our interrater reliability analysis indicate that the raters showed a high degree of consistency in segmenting the corpus into propositions (ICC > .76). However, to avoid potential biases, we also added additional safeguards. First, we aimed to find expert raters with a solid background in linguistics who had prior experience with propositional segmentation. Additionally, these raters received in-depth training on how to segment texts into propositions (see the text segment below). In addition, we chose to recruit four expert raters in total. In other studies using expert raters, it is common practice to use only two raters (see Callender & McDaniel, 2009; Chi et al., 2018; Mueller & Oppenheimer, 2014; Ozuru, Dempsey, & McNamara, 2009). We therefore believe that a total of four raters is above the current standard, and given our selection process, we are confident that the results are not biased. Nevertheless, we added crucial information about the selection process of the human expert raters to the manuscript.

#7: 3.3. It would be good to give information on who the author of the CohViz system is. In the actual form of the text it is unclear.

Unfortunately, some references in the manuscript were wrong as the reference list was not updated correctly. For example, in the issue raised by the reviewer, we mistakenly credited Temperley as a reference for CohViz. Hence, it was not clear who the authors of CohViz are. We have carefully corrected these references in the revised manuscript and hope that the authors of CohViz can now be correctly identified (e.g., Lachner, Burkhart, & Nückles, 2017).

#8: Summary: The language is good. The quality of the illustrations is low and the presentation should be improved. But the most important areas that need fixing are the definitions of the analyzed measures and the analysis of the characteristics of the population of human experts used.

Recommendation: major rework

References

Abu-Shaheen, A., Yousef, S., Riaz, M., Nofal, A., AlFayyad, I., Khan, S., & Heena, H. (2018). Testing the validity and reliability of the Arabic version of the painDETECT questionnaire in the assessment of neuropathic pain. PLoS ONE, 13(4), 1–13. https://doi.org/10.1371/journal.pone.0194358

Allen, L. K., Jacovina, M. E., & McNamara, D. S. (2016). Computer-Based Writing Instruction. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of Writing Research (2nd ed., pp. 316–329). New York, London: The Guilford Press.

Attali, Y. (2013). Validity and Reliability of Automated Essay Scoring. In M. D. Shermis & J. Burstein (Eds.), Handbook of Automated Essay Evaluation: Current Applications and New Directions (pp. 181–198). Routledge.

Callender, A. A., & McDaniel, M. A. (2009). The limited benefits of rereading educational texts. Contemporary Educational Psychology, 34(1), 30–41. https://doi.org/10.1016/j.cedpsych.2008.07.001

Writing Center. (2020). Flow and cohesion. Retrieved from https://www.umass.edu/writingcenter/flow-and-cohesion

Chi, M. T. H., Adams, J., Bogusch, E. B., Bruchok, C., Kang, S., Lancaster, M., … Yaghmourian, D. L. (2018). Translating the ICAP Theory of Cognitive Engagement Into Practice. Cognitive Science, 42(6), 1777–1832. https://doi.org/10.1111/cogs.12626

Connor, U. (1984). A study of cohesion and coherence in English as a second language students’ writing. Paper in Linguistics, 17(3), 301–316. https://doi.org/10.1080/08351818409389208

Granger, S., & Tyson, S. (1996). Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes, 15(1), 17–27. https://doi.org/10.1111/j.1467-971X.1996.tb00089.x

Hara, N., Matsudaira, K., Masuda, K., Tohnosu, J., Takeshita, K., Kobayashi, A., … Kikuchi, N. (2016). Psychometric assessment of the Japanese version of the zurich claudication questionnaire (ZCQ): Reliability and validity. PLoS ONE, 11(7), 1–10. https://doi.org/10.1371/journal.pone.0160183

Hinkel, E. (2001). Matters of cohesion in L2 academic texts. Applied Language Learning, 12(2), 111–132.

Kogure, T., Sumitani, M., Suka, M., Ishikawa, H., Odajima, T., Igarashi, A., … Kawahara, K. (2014). Validity and reliability of the Japanese version of the newest vital sign: A preliminary study. PLoS ONE, 9(4), 1–6. https://doi.org/10.1371/journal.pone.0094582

Kruizinga, I., Jansen, W., de Haan, C. L., van der Ende, J., Carter, A. S., & Raat, H. (2012). Reliability and validity of the dutch version of the brief infant-toddler social and emotional assessment (BITSEA). PLoS ONE, 7(6). https://doi.org/10.1371/journal.pone.0038762

Purdue Online Writing Lab. (2020). Revising for cohesion. Retrieved from https://owl.purdue.edu/owl/general_writing/the_writing_process/proofreading/revising_for_cohesion.html

Zhang, M. (2000). Cohesive features in the expository writing of undergraduates in two Chinese universities. RELC Journal, 31(1), 61–95.

University of Melbourne. (2020). Improving cohesion. Retrieved from https://services.unimelb.edu.au/__data/assets/pdf_file/0011/1264790/Improving_cohesion_Update_051112.pdf

Mueller, P. A., & Oppenheimer, D. M. (2014). The pen is mightier than the keyboard. Psychological Science, 25(6), 1159–1168. https://doi.org/10.1177/0956797614524581

Ozuru, Y., Dempsey, K., & McNamara, D. S. (2009). Prior knowledge, reading skill, and text cohesion in the comprehension of science texts. Learning and Instruction, 19(3), 228–242. https://doi.org/10.1016/j.learninstruc.2008.04.003

Education Development Unit. (2020). Editing your writing for content, coherence and cohesion. Retrieved from http://wwwdocs.fce.unsw.edu.au/fce/EDU/educoncohcoh.pdf

Roscoe, R. D., & McNamara, D. S. (2013). Writing pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4), 1010–1025. https://doi.org/10.1037/a0032340

Steenson, S., Özcebe, H., Arslan, U., Ünlü, H. K., Araz, Ö. M., Yardim, M., … Huang, T. T. K. (2018). Assessing the validity and reliability of family factors on physical activity: A case study in Turkey. PLoS ONE, 13(6), 1–15. https://doi.org/10.1371/journal.pone.0197920

Wade-Stein, A. D., & Kintsch, E. (2004). Summary Street: Interactive Computer Support for Writing. Cognition and Instruction, 22(3), 333–362.

Wiley, J., Hastings, P., Blaum, D., Jaeger, A. J., Hughes, S., Wallace, P., … Britt, M. A. (2017). Different Approaches to Assessing the Quality of Explanations Following a Multiple-Document Inquiry Activity in Science. International Journal of Artificial Intelligence in Education, 27(4), 758–790. https://doi.org/10.1007/s40593-017-0138-z

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Maciej Huk

22 May 2020

PONE-D-19-33124R1

Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system

PLOS ONE

Dear Dr. Burkhart,

Thank you for submitting your manuscript to PLOS ONE. It was reviewed by the three Reviewers including me as an Academic Editor (reviewer #3). After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular:

  • PLOS ONE data and software availability criteria should be met,

  • quantitative comparison of selected measures of CohViz and other tools for the analysis of writing quality considered by the authors should be clearly presented,

  • formal definitions of methods and analyzed measures would help to reproduce the results.

Please submit your revised manuscript by Jul 06 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper as initially submitted was already a valuable addition to the literature and based on a well-designed study. Therefore the prior comments largely had to do with how the information was presented as well as the completeness of small details that are necessary to understand how the study was designed. This is an excellent writing revision, following my recommendations very well while also addressing the concerns of reviewer two.

More importantly, this rewrite made it possible to see that the statistics are well-done. It is evident that the rigor is adequate to answer the researchers' questions. This paper now provides more adequate information to help readers and researchers consider how they might use this technology in various academic settings.

Reviewer #2: (No Response)

Reviewer #3: >>> 1. Language problems: not detected

>>> 2. Presentation problems:

2.1 The functioning of the CohViz system is mainly textual. It would be good to present the mathematical formulas used to calculate the important measures of the text cohesion features used by CohViz.

>>> 3. Other problems:

3.1 The link to data given by the Authors is invalid.

Authors write: "All files are available on github at the following address:

" ext-link-type="uri" xlink:type="simple">https://github.com/ch-bu/dataassisting-students-writing-with-computer-based-concept-map-feedback"

This link goes to "Error 404 - page not found"

3.2 It would be good to give direct information about who the author of the CohViz system is. In the current form of the text, this is unclear without reading the cited references (e.g., [15]). Please consider the following change:

CohViz is a graphical feedback which automatically provides students with concept maps as external representations to improve the cohesion of their writing (see Fig 1) [15,20,21].

=>

CohViz is a graphical feedback tool we developed, which automatically provides students with concept maps as external representations to improve the cohesion of their writing (see Fig 1) [15,20,21].

3.3 The main focus of the manuscript is the CohViz system developed by the Authors. The manuscript includes a description of the system and a validation of its reliability and accuracy. In such a case, the PLOS ONE software availability criteria should be met:

"PLOS ONE will consider submissions that present new methods, software, databases, or tools as the primary focus of the manuscript if they meet the following criteria:

(...)

Validation: Submissions presenting methods, software, databases, or tools must demonstrate that the new tool achieves its intended purpose. (...)

Availability: If the manuscript’s primary purpose is the description of new software or a new software package, this software must be open source, deposited in an appropriate archive, and conform to the Open Source Definition. (...)

Please see:

https://journals.plos.org/plosone/s/submission-guidelines#loc-methods-software-databases-and-tools

Validation is presented. Availability is not met.

3.4 Authors write:

"In the validity study, we demonstrated the convergent and divergent validity of the CohViz system in comparison to measures of other well-established research tools for writing quality [35,43]"

Can Authors point out where in their manuscript the "measures of other well-established research tools for writing quality" are presented and compared with results for CohViz? Any comparison of values of selected measures for CohViz and considered "other" tools? Any table or figure?

Summary: The language is good. The presentation should be improved. Data availability and software availability are not met. But the most important area that needs fixing is the formal definitions of the analyzed measures and the comparison with other similar tools.

Recommendation: major rework

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Lisa DeTora

Reviewer #2: Yes: Tai Wang

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jun 29;15(6):e0235209. doi: 10.1371/journal.pone.0235209.r004

Author response to Decision Letter 1


4 Jun 2020

Reviewer #1:

The paper as initially submitted was already a valuable addition to the literature and based on a well-designed study. Therefore the prior comments largely had to do with how the information was presented as well as the completeness of small details that are necessary to understand how the study was designed. This is an excellent writing revision, following my recommendations very well while also addressing the concerns of reviewer two.

More importantly, this rewrite made it possible to see that the statistics are well-done. It is evident that the rigor is adequate to answer the researchers' questions. This paper now provides more adequate information to help readers and researchers consider how they might use this technology in various academic settings.

Thank you very much. We appreciate your feedback.

Reviewer #2:

(No Response)

Reviewer #3:

1. Language problems: not detected

2. Presentation problems:

2.1 The functioning of the CohViz system is mainly textual. It would be good to present the mathematical formulas used to calculate the important measures of the text cohesion features used by CohViz.

Thank you for the suggestion. Under the sub-chapter Structural indicators of the concept maps, we have now added a formal definition of the three central structural indicators of CohViz using graph-theory notation (p. 15). We now describe the fragments, relations, and concepts in the language of graph theory and explain these definitions using the concept map in Figure 3 as an example. At the end of the sub-chapter, we also refer to the public repository containing the algorithms (written in Python) for calculating these indicators. In addition, we have added and described formulas that explain the calculation of the measures of adjacent argument overlap and text-level overlap (pp. 16-17).
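The graph-theoretic definitions the response refers to are on p. 15 of the manuscript and are not reproduced here. As a rough illustration of that reading (concepts as nodes, relations as edges, fragments as connected components of the map), a minimal Python sketch might look as follows. The function name, input format, and returned keys are assumptions made for this example and are not taken from the published CohViz source.

```python
from collections import defaultdict

def structural_indicators(relations):
    """Compute three structural indicators of a concept map viewed as an
    undirected graph: the number of concepts (nodes), the number of
    relations (edges), and the number of fragments (connected components).
    `relations` is a list of (concept_a, concept_b) pairs."""
    adjacency = defaultdict(set)
    concepts = set()
    for a, b in relations:
        concepts.update((a, b))
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Count connected components with an iterative traversal.
    seen = set()
    fragments = 0
    for node in concepts:
        if node in seen:
            continue
        fragments += 1
        stack = [node]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(adjacency[current] - seen)

    return {"concepts": len(concepts),
            "relations": len(relations),
            "fragments": fragments}
```

On a map with two disconnected clusters, this sketch would report two fragments, mirroring how a cohesion gap between unrelated parts of a text would surface in the concept map.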

3. Other problems:

3.1 The link to data given by the Authors is invalid. Authors write: "All files are available on github at the following address: https://github.com/ch-bu/dataassisting-students-writing-with-computer-based-concept-map-feedback". This link goes to "Error 404 - page not found"

We apologize for any inconvenience caused by the incorrect link. We have now uploaded the data and the analyses of the data to a public repository under the following address: https://doi.org/10.17605/OSF.IO/UHPF3 on the OSF portal. We have also published the algorithms we used to calculate the structural indicators of CohViz. The files for these calculations can be found in the analysis/stuctural_indicators directory of the repository. Also, a reference to the repository can be found in the manuscript on page 15.

3.2 It would be good to give direct information about who the author of the CohViz system is. In the current form of the text, this is unclear without reading the cited references (e.g., [15]). Please consider the following change: CohViz is a graphical feedback which automatically provides students with concept maps as external representations to improve the cohesion of their writing (see Fig 1) [15,20,21]. => CohViz is a graphical feedback tool we developed, which automatically provides students with concept maps as external representations to improve the cohesion of their writing (see Fig 1) [15,20,21].

We have incorporated the reviewer's suggestion and now report that we are the authors of the system (see the revised manuscript, p. 4).

3.3 The main focus of the manuscript is the CohViz system developed by the Authors. The manuscript includes a description of the system and a validation of its reliability and accuracy. In such a case, the PLOS ONE software availability criteria should be met:

"PLOS ONE will consider submissions that present new methods, software, databases, or tools as the primary focus of the manuscript if they meet the following criteria:

(...) Validation: Submissions presenting methods, software, databases, or tools must demonstrate that the new tool achieves its intended purpose. (...) Availability: If the manuscript’s primary purpose is the description of new software or a new software package, this software must be open source, deposited in an appropriate archive, and conform to the Open Source Definition. (...) Please see:

https://journals.plos.org/plosone/s/submission-guidelines#loc-methods-software-databases-and-tools. Validation is presented. Availability is not met.

Indeed, we had not included a reference to the CohViz system in the manuscript. Accordingly, on page 4 of the revised manuscript, we have now included a reference to the system, which has been uploaded to the OSF public repository under the following address: https://doi.org/10.17605/OSF.IO/SA53E. We chose to publish the system under the MIT license, as it conforms to the Open Source Definition (https://opensource.org/licenses) recommended by PLOS ONE. Under the MIT license, the software is freely available and may be freely distributed and modified.

3.4 Authors write:

"In the validity study, we demonstrated the convergent and divergent validity of the CohViz system in comparison to measures of other well-established research tools for writing quality [35,43]"

Can Authors point out where in their manuscript the "measures of other well-established research tools for writing quality" are presented and compared with results for CohViz? Any comparison of values of selected measures for CohViz and considered "other" tools? Any table or figure?

Indeed, the sentence was misleading. By "other" we referred to the measures we reported in the chapter on convergent and divergent measures of text cohesion, which are reported in Table 2. We have revised the sentence, removed the word “other”, and now report that we used convergent and divergent measures from well-established assessment systems (see p. 27). In addition, we have added a box to Figure 2 in which we explain the processing of the Wikipedia corpus and explicitly refer to the assessment system used by Hancke et al. During the revision, we also noticed that we had used two terms to describe these systems in the manuscript: computer-linguistic frameworks and assessment systems. We now speak uniformly of assessment systems.

Summary: The language is good. The presentation should be improved. Data availability and software availability are not met. But the most important area that needs fixing is the formal definitions of the analyzed measures and the comparison with other similar tools.

Recommendation: major rework

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Maciej Huk

11 Jun 2020

Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system

PONE-D-19-33124R2

Dear Dr. Burkhart,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. The acceptance was suggested by all three Reviewers: two "accept" decisions for version R1 and a third "accept" by me as Academic Editor for version R2.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Maciej Huk, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: 1. Language problems:

1.1 "for the full computer algorithm" = "for the full algorithm" / "for the full source code of the CohViz program"

2. Presentation problems: not detected

3. Other problems:

3.1 It would be good to extend the text with a comparison of CohViz with other systems supporting cohesive writing.

Summary:

Except for minor problems, the language is good and thoughts are presented clearly. The presentation of the CohViz system is detailed and the software is freely available as open source.

Comparison of described system with other similar tools would increase the value of the text.

Recommendation: accept


**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

Acceptance letter

Maciej Huk

17 Jun 2020

PONE-D-19-33124R2

Assisting students’ writing with computer-based concept map feedback: A validation study of the CohViz feedback system

Dear Dr. Burkhart:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Maciej Huk

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: PLoS One.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    The CohViz system is available on the Open Science Framework (OSF) under the following address: https://doi.org/10.17605/OSF.IO/SA53E. All data files and analyses of this manuscript are available on the OSF under the following address: https://doi.org/10.17605/OSF.IO/UHPF3.

