Identification of sentence stems characteristic of Chinese learner English writing

Jingjie Li; Wenjie Hu

doi:10.1016/j.heliyon.2024.e37166

2024 Aug 30;11(3):e37166.

PMCID: PMC11947701 PMID: 40196792

Abstract

Phraseological units in academic English texts have been a central focus in recent corpus linguistic research. This paper describes a special category of clause-level phraseological units, namely, Characteristic Sentence Stems (CSSs), with a view to describing their identifying criteria and their extraction method. CSSs are contiguous lexico-grammatical sequences which contain a subject-predicate structure and which are frame expressions characteristic of academic writing. The extraction method of a CSS consists of six steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range calculation, and overlapping sequence reduction. The significance of occurrence calculation is the crux of this method. It includes the computing of both the internal association and the boundary independence of a CSS, and it tests the occurring significance of the CSS from both the inside and the outside perspectives. Our methods and results suggest that CSSs can be statistically defined and extracted from corpora and can be employed in large-scale studies to more fully account for the phraseological features of non-native English academic writing.

Keywords: Sentence stem, Phraseological unit, Identification measures, Chinese EFL learners

1. Introduction

The sentence stem is commonly known as a phraseological unit that is of clause length or longer, contains a subject-predicate structure, and serves as the frame of a sentence, such as the final point is …, another thing is …, it will be shown that …. As clause-level phraseological units, sentence stems are indispensable elements with which our utterances are largely made [1]. They tend to appear in recurrent clusters, representing the “preferred ways of saying things” of a speech community ([2]: 35). Sentence stems reflect the idiomaticity of language use.

Studies in diverse fields have referenced the importance of sentence stems in language use. Spoken fluency research has found that “fluent and idiomatic control of a language rests to a considerable extent on knowledge of a body of sentence stems which are ‘institutionalized’ or ‘lexicalized’” ([1]: 191). These sentence stems carry the authority of regular and accepted use by members of the speech community (ibid: 209). English for academic purposes (EAP) studies have shown that sentence stems can serve as key disciplinary markers, being commonly used in certain disciplines while being less prevalent in others. As noted by Hyland (2008: 5), certain sentence stems help to “shape text meanings and contributing to our sense of distinctiveness in a register”, and can act as crucial indicators for exploring typicality in language use across disciplines.

In the field of EFL teaching, research has noticed a lack of phraseological competence in the use of sentence stems among non-native learners. Flowerdew and Li [3] interviewed nine Chinese L1 writers about how they utilized specialist literature in composing their own articles in English. The participants in their interviews stated that, to prepare for their own writing, they refer to a selection of published articles and keep notes of “good” and “potentially useful” expressions therein and that the expressions include a number of sentence stems, such as In this article/In this paper/Here we report …, This figure reveals that …, and The fit shown in Fig. 1 corresponds to … (ibid: 449–457). Their study suggests that the shortage of sentence stems stored in non-native learners’ minds has restricted their achievement in fluent and independent writing.

Fig. 1 — Distribution of border entropy values.

This underscores the importance and potential value of sentence stems, particularly for non-native academic teaching and learning. Supporting this view, Simpson-Vlach and Ellis [4] created a “pedagogically useful” list of formulaic sequences for academic speech and writing; it includes a large number of sentence stems, such as I'm talking about, it is obvious that, and if you look (at) (the) (ibid: 500). Hammond [5] also designed a formulaic frame phrasebank to facilitate first-year students' skill development in writing. A number of example phrases in the phrasebank also turned out to be sentence stems, such as This theory describes …, This stage is followed by …, and An important concept in this stage is … (ibid: 4). It can be seen that some sentence stems are practically useful for non-native learners' writing and, to some extent, they have become essential language resources for their writing (e.g., Refs. [5–7]).

However, unlike other types of phraseological units such as chunks, lexical bundles, and semantic sequences, sentence stems have not been sufficiently studied in previous research on EFL learners' writing. Some research (e.g., Refs. [3,8–11]) touched upon sentence stems when examining the collocational features of students' writing, but few studies have focused on and systematically investigated sentence stems in EFL learners’ writing. As far as we know, no measures or tools have yet been developed to extract sentence stems from large English corpora.

In this study, we use the term Characteristic Sentence Stem (CSS) to refer to the sentence stem notable for its phraseological salience and characteristic of EFL learners' writing. Phraseological salience implies that a sentence stem extends beyond being a mere random creation by an individual learner. Instead, it is a sequence characterized by a prominent co-occurrence of lexical and grammatical elements that consistently establish form-meaning pairings. As phraseologically salient sequences, CSSs cohere significantly more than would be expected by chance, effectively “glued together” to reveal a strong and recurring pattern of clause-level phraseological use. In keeping with this perspective, we have conducted the present study to develop a feasible method for automatic extraction of CSSs to facilitate large-scale studies of sentence stems, enrich the contents of learner phraseology, and broaden our research perspective to analyze EFL learners’ writing performance. To delve further into these objectives, this article will address the following research questions:

(a) How can we identify and extract CSSs automatically from Chinese learner corpora if we are unable to foresee and predetermine any specific CSS at the start of our extraction?
(b) Do the extraction results of CSSs differ when using different word association measures and, if so, in what ways?
(c) Using the described method, what CSSs have been extracted from the corpus, and what pedagogical implications could these CSSs have for Chinese EFL teaching?

2. Literature review

2.1. Previous studies of the sentence stem in phraseology

Sentence stems have long received attention in the field of phraseology. They consist of at least a subject and a verb, and may optionally include other thematic elements, such as discourse item, linking word, etc., and a rhematic post-verbal element, such as an object or complement [12]. Sentence stems can manifest as full clause-like constructions like NP be-TENSE sorry to keep-TENSE you waiting (I am sorry to keep you waiting) and “Who (the EXPLET) do-PRES NPi think PROi be-PRES!” (Who the hell do you think you are!). They can also constitute multiple clause constituents, such as my name is and It seems to me (that). In this case, sentence stems act as extended onsets, forming the springboard of utterances leading up to the lexically most variable element (Altenberg 1998: 113).

Phraseological analysis reveals that only a minority of sentence stems are entirely novel, featuring new combinations of words that follow regular syntactic rules. More commonly, sentence stems are recurrent, familiar sequences, retrieved as more or less prefabricated or routinized strings readily available for the production of discourse. Such sentence stems are the essential “building blocks” of an utterance, shaping regular expression patterns and embodying typical ways of meaning-making. Sentence stems of this kind are typically categorized under two terms, each highlighting different characteristics: “lexicalized sentence stems” [1], defined by the fixedness of the sequence, and “textual sentence stems” [2], defined by the functionality of the sequence.

Pawley and Syder [1] first proposed the concept of the lexicalized sentence stem, or regular form-meaning pairing in language. By “lexicalized”, they mean that the “grammatical form and lexical content” of a sentence stem are “wholly or largely fixed” (ibid: 191–192). An example of lexicalized sentence stems is NP be-TENSE sorry to keep-TENSE you waiting, as in I am sorry to keep you waiting. Pawley and Syder (1983: 202) held that fluent communication largely relies on native speakers’ adoption of a clause-chaining style to string lexicalized sentence stems together. Their research was the first to identify the sentence stem as a phraseological unit. However, “lexicalization is a matter of degree” (ibid: 212). Fully lexicalized sentence stems take up only a very small proportion, and most of them are only partially lexicalized or are lexicalized to a low degree ([14]: 37). Moreover, lexicalization is not particularly operationalizable in technical terms. It is hard to measure whether a sentence stem is lexicalized and to what degree it is lexicalized. All of these lower the feasibility of the empirical studies of lexicalized sentence stems.

Further developments in phraseology have highlighted that a large number of recurrent sentence stems are structurally incomplete yet notably functional; they are routinized expressions tied to specific discourse-pragmatic functions. This usage is so pervasive that these sentence stems, to a large extent, establish regular form-meaning pairings with the functions. Expanding on this functional emphasis, Granger and Paquot [2] put forward the concept of the textual sentence stem, defining it as a “routinized sentence fragment that consists of a subject and a verb and serves textual functions” (ibid: 44). Examples include I will discuss…and it will be shown that … (ibid: 44). This concept accentuates the functionality of a sentence stem rather than its lexicalization. Regrettably, Granger and Paquot [2] only offered a concept, with no specific research into it. Furthermore, they defined textual sentence stems as primarily “serving textual functions” (ibid: 44), but we discovered that many sentence stems perform other functions besides textual functions, such as interpersonal functions (e.g. it would be interesting to).

The two types of sentence stems discussed, though representing different perspectives, both emphasize the phraseological salience of a sentence stem. Such sentence stems signify the conventional and typical language use within the speech community. Building on these insights, we try to introduce the notion of CSS, aiming to identify phraseologically salient sentence stems that represent an important clause-level phraseological resource in Chinese EFL learners' writing. In phraseology, the salience of a multi-word sequence can be statistically measured through the significance of the co-occurrence of its constituent words. This statistical approach enhances the feasibility of the automatic extraction of CSSs from corpora, thus facilitating large-scale, corpus-driven studies of CSSs to account for clause-level phraseological features in Chinese EFL learners’ writing.

2.2. Statistics-based extraction of phraseological units

Most current statistical approaches for extracting phraseological units rely on word association measures (e.g., Mutual Information, Dice coefficient, Odds Ratio, Fisher Exact p-value, Log-Likelihood Ratio). These measures calculate the strength of association between the components of a sequence based on their occurrence and co-occurrence in a corpus [15]. The association strength indicates the chance of a candidate to be a phraseological unit [16]. Notably, current word association measures are primarily designed to quantify the association within two-word sequences. However, a CSS is a multi-word sequence, varying in length. When dealing with such multi-word sequences, a common practice involves the transformation of a multi-word sequence into a set of two-word sequences based on its constituent words. Subsequently, current word association measures are applied to calculate the association within each two-word sequence. Specific algorithms are then used to integrate these individual association values, yielding a composite value determining the overall internal association for the multi-word sequence. This research follows the mentioned methodology to calculate the internal association of a sentence stem.

The current word association measures can be broadly categorized into two types: non-directional word association calculation and directional word association calculation. The former posits a mutual association between words “a” and “b”, quantifying the overall degree of their association. Consequently, the association degree between “a” and “b” is computed as a single value. Notably, this calculation does not examine whether the association degree of “a→b” is equivalent to the association degree of “b→a.” The latter, on the other hand, assumes that the association between words “a” and “b” has directionality. The association from “a→b” should be distinguished from the association from “b→a.” Therefore, when calculating the association degree between “a” and “b”, it is imperative to differentiate the direction and obtain two distinct values: the association degree of “a→b” and the association degree of “b→a.”

(a) Non-directional word association measures

Church and Hanks [17] were the first to introduce the pointwise mutual information (MI) algorithm to measure the significance of mutual associations between words, enabling the automatic identification of typical collocations. This algorithm remains one of the most widely used word association measures in linguistic studies. The MI is an algorithm for quantifying shared information between two known words “a” and “b” in a text, that is, the mutual interaction force arising from the co-occurrence of “a” and “b”; or, in other words, the reduction in uncertainty about “b” given knowledge of “a”, or how much information “b” reveals about “a”. This statistical measure possesses the characteristics of non-negativity and is commonly denoted as $I(a, b)$. The MI formula is shown below (Expression 1). Here, $p(a)$ and $p(b)$ represent the probabilities of the individual occurrences of words “a” and “b” in the corpus, and $p(a, b)$ represents the probability of the co-occurrence of “a” and “b.”

$$I(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)} \tag{1}$$
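As a minimal sketch of the MI calculation described above, the function below estimates the probabilities as relative frequencies over a corpus of `n` tokens. That estimation, the function name, and the counts in the usage note are illustrative assumptions rather than the authors' implementation.

```python
import math


def pointwise_mi(f_a: int, f_b: int, f_ab: int, n: int) -> float:
    """Pointwise mutual information in bits: log2(p(a,b) / (p(a) * p(b))).

    Probabilities are estimated as relative frequencies (an assumption):
    p(a) = f_a / n, p(b) = f_b / n, p(a, b) = f_ab / n.
    """
    if min(f_a, f_b, f_ab, n) <= 0:
        raise ValueError("all counts must be positive")
    p_a = f_a / n
    p_b = f_b / n
    p_ab = f_ab / n
    return math.log2(p_ab / (p_a * p_b))
```

With hypothetical counts, `pointwise_mi(100, 50, 25, 10000)` scores a pair that co-occurs 50 times more often than independence would predict, while a pair that co-occurs exactly as often as chance predicts scores zero.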

After the introduction of the MI algorithm to linguistics studies, various statistics-based non-directional word association algorithms emerged. The Dice coefficient (Dice) algorithm, proposed by Dice [18], was introduced to identify collocations by Smadja et al. [19], and the formula is expressed as follows (Expression 2). Here, $f(a)$ represents the frequency of occurrence of word “a” in the corpus, $f(b)$ is the frequency of occurrence of word “b” in the corpus, and $f(a, b)$ represents the co-occurrence frequency of “a” and “b.”

$$\mathrm{Dice}(a, b) = \frac{2\,f(a, b)}{f(a) + f(b)} \tag{2}$$
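The Dice coefficient needs only the three occurrence frequencies defined above. The sketch below is a direct transcription under that reading; the function name and the example counts are hypothetical.

```python
def dice(f_a: int, f_b: int, f_ab: int) -> float:
    """Dice coefficient: 2 * f(a, b) / (f(a) + f(b)), ranging from 0 to 1."""
    if f_a + f_b <= 0:
        raise ValueError("at least one word must occur")
    return 2 * f_ab / (f_a + f_b)
```

Unlike MI, the score does not depend on corpus size: two words that always occur together (`dice(10, 10, 10)`) reach the maximum of 1.0 regardless of how large the corpus is.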

As noted above, the formulas for MI and Dice are relatively simple. The variables involved in Expressions (1) and (2) are only associated with the occurrence (rather than non-occurrence) of words. These variables include the probabilities (or frequencies) of the individual occurrences of the word “a” or “b” (i.e., $p(a)$, $p(b)$, $f(a)$, $f(b)$), as well as the co-occurrence probability (or frequency) of “a” and “b” (i.e., $p(a, b)$, $f(a, b)$). However, many other word association algorithms involve the computation of variables associated with both the occurrence of words (referred to as “occurrence variables”) and the non-occurrence of words (referred to as “non-occurrence variables”), and often require the consideration of multiple combinations of occurrence variables and non-occurrence variables.

For ease of exposition in the following text, we employ a contingency or two-by-two table (Gries, 2013; [21]), which represents the frequency distribution resulting from the cross-classification of two or more variables, to illustrate the variable combinations typically involved in word association algorithms (as shown in Table 1). The cells (cell1, cell2, cell3, cell4) in Table 1 respectively denote different variable combinations. Cell1 corresponds to f(a, b), representing the frequency of co-occurrence of words “a” and “b”; cell2 corresponds to f(a, ¬b), representing the frequency of occurrence of “a” while “b” does not occur; cell3 represents f(¬a, b), indicating the frequency of occurrence of “b” while “a” does not occur; and cell4 represents f(¬a, ¬b), signifying the frequency of non-occurrence of both “a” and “b.” Of the four cells, cell1 corresponds to an exclusive “occurrence variable” combination and cell4 to an exclusive “non-occurrence variable” combination, while cell2 and cell3 represent mixed combinations involving both “occurrence” and “non-occurrence” variables.

Table 1.

Contingency table for the association calculation between words “a” and “b”.

	Frequency of occurrence of a: f(a)	Frequency of non-occurrence of a:f(¬a)
Frequency of occurrence of b: f(b)	cell1: f(a, b)	cell3: f(¬a, b)
Frequency of non-occurrence of b: f(¬b)	cell2: f(a, ¬b)	cell4: f(¬a, ¬b)


The Odds Ratio is a relatively simple statistical measure that considers both “occurrence” and “non-occurrence” variables; it is commonly used to determine the degree of association between the occurrence or non-occurrence of “a” and “b.” The formula is expressed as follows (Expression 3). It is a ratio of two odds: the frequency of co-occurrence of “a” and “b” relative to the frequency of “a” occurring without “b”, divided by the frequency of “b” occurring without “a” relative to the frequency with which neither “a” nor “b” occurs. If the final calculated Odds Ratio value is greater than 1, it indicates that “a” and “b” are positively associated, meaning that the presence of one word increases the probability of the other word's occurrence.

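$$\mathrm{OR}(a, b) = \frac{\mathit{cell1} / \mathit{cell2}}{\mathit{cell3} / \mathit{cell4}} = \frac{\mathit{cell1} \times \mathit{cell4}}{\mathit{cell2} \times \mathit{cell3}} \tag{3}$$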
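To make the cell layout of Table 1 concrete, the following sketch derives the four cells from the marginal frequencies and computes the Odds Ratio from them. The class and function names are hypothetical, `n` is assumed to be the total number of observations in the table, and the zero-denominator handling is one possible choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    cell1: int  # f(a, b): "a" and "b" co-occur
    cell2: int  # f(a, not b): "a" occurs without "b"
    cell3: int  # f(not a, b): "b" occurs without "a"
    cell4: int  # f(not a, not b): neither occurs


def from_counts(f_a: int, f_b: int, f_ab: int, n: int) -> Contingency:
    """Build the two-by-two table from marginal and joint frequencies."""
    return Contingency(f_ab, f_a - f_ab, f_b - f_ab, n - f_a - f_b + f_ab)


def odds_ratio(t: Contingency) -> float:
    """(cell1 / cell2) / (cell3 / cell4), written as a cross-product ratio."""
    denominator = t.cell2 * t.cell3
    if denominator == 0:
        raise ValueError("odds ratio is undefined when cell2 or cell3 is zero")
    return (t.cell1 * t.cell4) / denominator
```

For the hypothetical counts `from_counts(100, 50, 25, 10000)`, the table is (25, 75, 25, 9875) and the resulting ratio is well above 1, indicating positive association.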

The Fisher Exact p-value (p) and the Log-Likelihood Ratio (LLR) are also widely used measures of word associations involving multiple sets of “occurrence” and “non-occurrence” variables. The formulas for the two algorithms (Expression 4 and Expression 5) are complex and are expressed as follows. Here, $\mathit{cell1}$, $\mathit{cell2}$, $\mathit{cell3}$, and $\mathit{cell4}$ have the same meanings as described above, and $n$ represents the sum of $\mathit{cell1}$, $\mathit{cell2}$, $\mathit{cell3}$, and $\mathit{cell4}$. In the context of Fisher Exact p-value, a lower p-value indicates a stronger association between words. Typically, when the p-value is less than 5 %, it signifies a significant association between words; when it is less than 1 %, the association is considered highly significant. Nevertheless, there is currently no consensus on the optimal threshold for LLR to determine the significance of word associations.

Fisher Exact p-value Formula:

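$$p = \sum_{k = \mathit{cell1}}^{\min(\mathit{cell1} + \mathit{cell2},\; \mathit{cell1} + \mathit{cell3})} \frac{\dbinom{\mathit{cell1} + \mathit{cell2}}{k} \dbinom{\mathit{cell3} + \mathit{cell4}}{\mathit{cell1} + \mathit{cell3} - k}}{\dbinom{n}{\mathit{cell1} + \mathit{cell3}}} \tag{4}$$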
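As a rough illustration of how a Fisher exact p-value can be obtained from the four cells, the sketch below sums hypergeometric probabilities for all tables with at least as many co-occurrences as observed. It assumes the one-sided ("greater") variant of the test, which may differ from the variant used in the study, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """One-sided Fisher exact p-value: P(co-occurrence count >= cell1).

    The margins f(a) = cell1 + cell2 and f(b) = cell1 + cell3 are held fixed.
    """
    n = cell1 + cell2 + cell3 + cell4
    f_a = cell1 + cell2
    f_b = cell1 + cell3
    k_max = min(f_a, f_b)  # largest possible co-occurrence count
    favourable = sum(
        comb(f_a, k) * comb(n - f_a, f_b - k) for k in range(cell1, k_max + 1)
    )
    return favourable / comb(n, f_b)
```

Exact integer arithmetic via `math.comb` avoids underflow, but the binomial coefficients become very large for corpus-sized tables, so a statistics library may be the more practical choice there.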

Log-Likelihood Ratio (LLR) Formula:

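$$\begin{aligned} \mathrm{LLR} = 2 \big[ & \mathit{cell1} \ln \mathit{cell1} + \mathit{cell2} \ln \mathit{cell2} + \mathit{cell3} \ln \mathit{cell3} + \mathit{cell4} \ln \mathit{cell4} \\ & - (\mathit{cell1} + \mathit{cell2}) \ln (\mathit{cell1} + \mathit{cell2}) - (\mathit{cell1} + \mathit{cell3}) \ln (\mathit{cell1} + \mathit{cell3}) \\ & - (\mathit{cell2} + \mathit{cell4}) \ln (\mathit{cell2} + \mathit{cell4}) - (\mathit{cell3} + \mathit{cell4}) \ln (\mathit{cell3} + \mathit{cell4}) + n \ln n \big] \end{aligned} \tag{5}$$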
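A log-likelihood ratio over the same two-by-two table can be sketched as follows. This assumes the commonly used observed-versus-expected formulation, in which each expected count is the product of the corresponding row and column totals divided by `n`; the function name is hypothetical, and empty cells are taken to contribute zero.

```python
import math


def log_likelihood_ratio(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """LLR = 2 * sum(observed * ln(observed / expected)) over the four cells."""
    n = cell1 + cell2 + cell3 + cell4
    f_a, f_not_a = cell1 + cell2, cell3 + cell4  # totals for "a" / not "a"
    f_b, f_not_b = cell1 + cell3, cell2 + cell4  # totals for "b" / not "b"
    observed = [cell1, cell2, cell3, cell4]
    expected = [
        f_a * f_b / n,
        f_a * f_not_b / n,
        f_not_a * f_b / n,
        f_not_a * f_not_b / n,
    ]
    # A cell with zero observations contributes nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

A table whose cells match the independence expectation exactly, such as (10, 20, 30, 60), yields a score of zero, and the score grows as the observed counts depart from that expectation.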

In comparison to measures considering only “occurrence” variables (e.g., MI, Dice), measures including both “occurrence” and “non-occurrence” variables (e.g., Odds Ratio, Fisher Exact p-value, LLR) provide a more comprehensive perspective on the assessment of association between words “a” and “b.” Consequently, the resulting statistical values seem to be more comprehensive and accurate. However, in practical applications, the formulas for measures including both “occurrence” and “non-occurrence” variables are conspicuously intricate, demanding additional computational resources. Additionally, these measures are sensitive to factors such as corpus size, word classes, data distribution, and parameter choices.

Conversely, measures considering only “occurrence” variables demonstrate general computational simplicity and efficiency, rendering them suitable for large-scale text data. Thus, whether measures including both “occurrence” and “non-occurrence” variables perform significantly better than those considering only “occurrence” variables remains an unsettled matter. It is widely acknowledged that the choice of measures hinges upon considerations of data size, computational efficiency, dataset characteristics, and research objectives.

(b) Directional word association measures

Non-directional word association measures, such as the aforementioned MI, Dice, Odds Ratio, etc., quantify the overall strength of mutual attraction between words “a” and “b.” Those measures do not distinguish the directionality of attraction, i.e., whether the attraction from “a” to “b” differs from that of “b” to “a.” In other words, non-directional word association measures compute the overall probability of co-occurrence between “a” and “b” in the form of (a, b) without considering which word (“a” or “b”) serves as the predominant factor in their co-occurrence. To further investigate the directionality issue of attractions between “a” and “b,” it is necessary to employ directional word association measures, among which the series of pairwise measures such as Attraction and Reliance, ΔP Attraction and ΔP Reliance, are the most widely used.

The pairwise measures of Attraction and Reliance were proposed by Schmid ([22]: 54–55). The former (Attraction) signifies the strength of attraction from “a” to “b” (i.e., the association strength of “a→b”), calculated by dividing the frequency of co-occurrence of “a” and “b” by the frequency of occurrence of “b” (Expression 6a in Table 2 below). The latter (Reliance) represents the degree to which “a” is attracted by “b”, or in other words, the strength of attraction from “b” to “a” (i.e., the association strength of “b→a”), calculated by dividing the frequency of co-occurrence of “a” and “b” by the frequency of occurrence of “a” (Expression 6b). To be able to render the scores as percentages, the dividend is multiplied by 100 in both divisions.

Table 2.

Calculating Attraction and Reliance scores.

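Attraction:

$$\mathrm{Attraction} = \frac{\mathit{cell1} \times 100}{\mathit{cell1} + \mathit{cell3}} \tag{6a}$$

Reliance:

$$\mathrm{Reliance} = \frac{\mathit{cell1} \times 100}{\mathit{cell1} + \mathit{cell2}} \tag{6b}$$

ΔP Attraction:

$$\Delta P\ \mathrm{Attraction} = \frac{\mathit{cell1}}{\mathit{cell1} + \mathit{cell3}} - \frac{\mathit{cell2}}{\mathit{cell2} + \mathit{cell4}} \tag{7a}$$

ΔP Reliance:

$$\Delta P\ \mathrm{Reliance} = \frac{\mathit{cell1}}{\mathit{cell1} + \mathit{cell2}} - \frac{\mathit{cell3}}{\mathit{cell3} + \mathit{cell4}} \tag{7b}$$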


Note: $\mathit{cell1} + \mathit{cell3}$ (in Expressions 6a and 7a) yields the frequency of occurrence of “b”, and $\mathit{cell1} + \mathit{cell2}$ (in Expressions 6b and 7b) yields the frequency of occurrence of “a.”
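The four directional scores can be sketched from the cells of Table 1 as below. Attraction and Reliance follow the verbal definitions given above (scaled to percentages). The ΔP variants are written as plain differences of proportions, which is the usual form of ΔP and is an assumption here, as are the function names.

```python
def attraction(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Share of the occurrences of "b" that co-occur with "a", in percent."""
    return 100 * cell1 / (cell1 + cell3)


def reliance(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Share of the occurrences of "a" that co-occur with "b", in percent."""
    return 100 * cell1 / (cell1 + cell2)


def delta_p_attraction(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Attraction proportion minus the same proportion where "b" is absent."""
    return cell1 / (cell1 + cell3) - cell2 / (cell2 + cell4)


def delta_p_reliance(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Reliance proportion minus the same proportion where "a" is absent."""
    return cell1 / (cell1 + cell2) - cell3 / (cell3 + cell4)
```

For the hypothetical table (25, 75, 25, 9875), Attraction is 50 % while Reliance is only 25 %, which illustrates how the two directions of the same word pair can differ. The plain measures ignore `cell4`; the ΔP variants use it through the subtracted baseline term.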

Schmid and Küchenhoff [23] posited that the pairwise measures of Attraction and Reliance lack consideration of the variable $\mathit{cell4}$ (see Table 2), thus failing to establish a connection between the frequency distribution of words “a” and/or “b” and the frequency of non-occurrence of both “a” and “b.” Therefore, they refined the measures of Attraction and Reliance using the Delta P algorithm (ΔP, [24]) and introduced the pairwise measures of ΔP Attraction (Expression 7a) and ΔP Reliance (Expression 7b), with the formulas provided in Table 2.

As discussed above, both non-directional and directional word association measures have distinct characteristics. Non-directional measures are better suited for calculating the overall strength of association between words, while directional measures can further explore the directionality issues of word associations. Both non-directional and directional measures have been extensively used in phraseological unit extraction (e.g., Refs. [16,25–28]). Different association measures may excel at identifying varying types of phraseological units, but their overall performance and effectiveness have been repeatedly validated and verified across different data types, making them widely accepted in phraseological studies (e.g., Refs. [15,23,29]). As Su et al. (2024: 62) noted, most current approaches are statistically based on word association measures.

In this study, we will use the aforementioned association measures to individually calculate the co-occurring salience of the constituent words in a sentence stem (internal association calculation) and then compare the extracted CSSs using these measures. Specifically, in Section 4.4, we will illustrate the process of internal association calculation using MI as an example. In Section 5.2, we will compare and evaluate the results derived from the combined use of MI and border entropy with those obtained through the utilization of alternative association measures. In addition, as mentioned earlier, most association measures are designed for evaluating the association between two words; it is therefore necessary to adjust existing two-word association algorithms to measure longer multi-word sequences, with a specific focus on sentence stems. The adjusted approach will be further detailed in Section 4.4.

3. The concept of CSS and the corpus

3.1. Identifying criteria of CSS

In this research, a CSS is temporarily defined as a recurrent contiguous lexico-grammatical sequence which contains a subject-predicate structure and which is of phraseological salience characteristic of Chinese EFL learners’ writing. This definition indicates two identifying criteria.

First, a CSS must have a subject-predicate structure. Three points are worth noting regarding this criterion. (a) A CSS does not necessarily contain all of the subject and predicate elements. Following Granger and Paquot's [2] identifying criteria, our identification requires that a CSS have at least a subject and a predicate verb, such as years have witnessed and we must admit. (b) A CSS may contain additional syntactic functional elements such as objects (e.g., we should spare no effort to) and adverbials (e.g., from the picture we can see); in fact, this is mostly the case. In sum, a CSS takes the “subject + predicate verb(s)” as the core components but varies greatly in length and structure. It can be either a clause constituent such as we can draw the conclusion that … and it is convenient for …, or a full clause (see Ref. [1]) such as practice makes perfect and reasons are as follows. (c) Subject-omission structures (e.g., as noted above) and predicate-omission structures (e.g., if possible) require individual treatment. As-introduced sequences, whether with a subject (e.g., as we have seen) or without a subject (e.g., as can be seen), are both counted as sentence stems in our identification.

Second, not all of the sequences with a subject-predicate structure are CSSs. A CSS should be phraseologically salient, marked by a higher-than-expected co-occurrence of its constituent words and demonstrating a strong and recurring pattern of clause-level phraseological use in Chinese learners' writing. This standard emphasizes the salience or the typicality of a CSS and filters out such sentence stems as she is also, we have done. In our study, the salience of each CSS will be measured in terms of two statistical parameters: internal association (i.e., the co-occurring salience of the constituent words of a CSS) and boundary independence (i.e., the clarity of the left and right borders of a CSS). Please refer to Section 4.4 for detailed discussion. Only the sentence stems that reach high statistical standards will be considered characteristic of Chinese EFL learners’ writing and will become candidate CSSs.

3.2. The corpus

We based our study on the TECCL corpus; it contains approximately 10,000 English essays written by various groups of Chinese EFL learners, among which university students constitute the overwhelming majority of the writers. The corpus consists of texts written in class, in testing, and after class; the writing samples included were produced between 2011 and 2015. To the best of our knowledge, the TECCL corpus, totaling 1,817,335 word tokens, is one of the largest Chinese learner corpora publicly available, and it has been widely used for studies of Chinese EFL learners' English. The corpus figures prominently for its representativeness in two respects. (a) The geographical spread of the writers in the corpus is by far the widest of all Chinese EFL learners’ English corpora. (b) The proportion of the essays written by top-notch university students and by non-top-notch university students corresponds well to the actual distribution of top-notch universities in China.

In this study, we selected the essays written by university undergraduate students from TECCL and formed a 1.3-million-word corpus, TECCL-Sample, to use as the data source for illustrating the extraction procedures of CSSs and for analyzing the idiomatic usages of sentence stems in Chinese EFL learners’ writing. The TECCL-Sample contains 6886 essays with the size of 1,387,716 tokens (running words), 27,851 types (distinct words), and 91,747 sentences. Its standardized TTR amounts to 41.58 %, and its mean sentence length is 15.13 (in words).

4. CSS extraction method

This section will take TECCL-Sample as an example and will enlarge on our CSS extraction method that consists of six main steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range setting, and overlapping sequence reduction. POS tagging is used to label each word in the corpus with its Part of Speech (POS); n-gram segmentation is used to segment each running text in the corpus into different groups of linear sequences; structure identification is used to identify sentence stems out of all the linear sequences; significance of occurrence calculation is used to measure the salience of each sentence stem in learners’ writing and its possibility to become a CSS; text range setting is used to examine the inter-textual frequency distribution of each sentence stem; and overlapping sequence reduction is used to ensure that each of the extracted CSSs does not substantially overlap the others. In what follows, we will discuss these six steps in more detail.

4.1. POS tagging

We POS-tag all the texts in TECCL-Sample with the tagging software CLAWS7 and its C7 tagging set. It should be pointed out that the selection of tagging software is rather flexible. In addition to CLAWS7, other common tagging software, such as the NLTK tagging packages, can also be used. We chose CLAWS7 based on two considerations. First, CLAWS7 provides high tagging accuracy. According to the official website, CLAWS has consistently achieved 96–97 % accuracy. Second, the C7 tagging set has very detailed tag categories, which are crucial for the structure identification in Section 4.3.

4.2. N-gram segmentation

All the tagged texts in TECCL-Sample are segmented into linear sequences, whose lengths vary from two to eight words. That is, every single text in TECCL-Sample is repeatedly segmented into different groups of n-grams, such as 2-grams, 3-grams, and 4-grams. Then we process all the n-grams by (a) deleting sequences that span two paragraphs or sentences, or that include punctuation marks such as semicolons, colons, unpaired brackets, and unpaired quotation marks, and by (b) calculating the frequency of each n-gram. It is worth noting that, at this stage, we do not take the POS tags into account, and thus the obtained n-grams may include both sentence stems and sequences with other structures.
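The segmentation step can be sketched as follows, assuming the corpus has already been split into sentences and tokenized, so that no n-gram can span a sentence or paragraph boundary. The function names and the `BANNED` set are hypothetical, and the check for unpaired brackets and quotation marks is omitted for brevity.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence

# Simplified stand-in for the punctuation filter described above.
BANNED = {";", ":"}


def sentence_ngrams(
    tokens: Sequence[str], min_n: int = 2, max_n: int = 8
) -> Iterator[tuple]:
    """Yield every contiguous n-gram (min_n..max_n words) within one sentence."""
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            gram = tuple(tokens[start:start + n])
            if BANNED.isdisjoint(gram):  # drop n-grams containing ';' or ':'
                yield gram


def count_ngrams(sentences: Iterable[Sequence[str]]) -> Counter:
    """Frequency of each n-gram across all sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(sentence_ngrams(tokens))
    return counts
```

For example, `count_ngrams([["practice", "makes", "perfect"], ["practice", "makes", "progress"]])` counts the 2-gram `("practice", "makes")` twice and each 3-gram once.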

4.3. Structure identification

Based on POS tags, we search for sequences with a subject-predicate structure in the obtained n-grams. This step, though seemingly simple, is quite complex in practice because many kinds of parts of speech can serve as subjects, and sometimes there is no subject. After repeated discussions and examinations, we identify five broad categories and, altogether, 36 subcategories of POS tags that can function as subjects: nouns (22 types), pronouns (9 types), determiners (3 types), as-introduced structure (1 type), and existential-there (1 type). We also identify four broad categories (16 subcategories) of verbs that can function as predicates: i.e., VB (VBDR, VBDZ, VBI, VBM, VBR, VBZ), VD (VD0, VDD, VDZ), VH (VH0, VHD, VHZ), VV (VV0, VVD, VVI, VVZ) in the C7 tagging set.
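A much simplified version of this check is sketched below. The 16 predicate tags are the ones listed above. The 36 subject subcategories are not enumerated here, so the prefix test in `SUBJECT_PREFIXES` is a hypothetical and deliberately over-broad stand-in, as is the treatment of the word "as". The sketch only tests whether a subject candidate precedes a predicate verb in the sequence.

```python
from typing import Sequence, Tuple

# The 16 C7 verb tags named in the text as possible predicates.
PREDICATE_TAGS = {
    "VBDR", "VBDZ", "VBI", "VBM", "VBR", "VBZ",
    "VD0", "VDD", "VDZ",
    "VH0", "VHD", "VHZ",
    "VV0", "VVD", "VVI", "VVZ",
}

# Hypothetical approximation: nouns (N*), pronouns (P*), determiners (D*),
# and existential there (EX). The study uses a narrower, explicit tag list.
SUBJECT_PREFIXES = ("N", "P", "D", "EX")


def is_subject_candidate(word: str, tag: str) -> bool:
    return tag.startswith(SUBJECT_PREFIXES) or word.lower() == "as"


def has_subject_predicate(tagged: Sequence[Tuple[str, str]]) -> bool:
    """True if a subject candidate is followed, anywhere later, by a predicate verb."""
    seen_subject = False
    for word, tag in tagged:
        if seen_subject and tag in PREDICATE_TAGS:
            return True
        if is_subject_candidate(word, tag):
            seen_subject = True
    return False
```

Under these assumptions, `[("we", "PPIS2"), ("must", "VM"), ("admit", "VVI")]` passes the check, whereas a bare noun phrase such as `[("the", "AT"), ("picture", "NN1")]` does not.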

4.4. Significance of occurrence calculation

The most challenging point during the extraction process is to rule out non-salient sequences using statistical measures. This research employs two frequency-based algorithms to calculate the salience of each sequence in order to limit the potential interference from absolute frequency. The two algorithms measure the internal association and the boundary independence of each sequence respectively: the former focuses on the inside of the sequence and looks at the attractions among its constituent words, while the latter takes the sequence as a whole and measures the variability of its outside neighboring words. Only when the internal association and the boundary independence of the sequence are both larger than their respective thresholds will the sequence be identified as a candidate CSS. By this means, each sequence is examined for significance from both the inside and the outside perspectives. This idea is adapted from Jiang et al. (2007a: 9–16), who used a hybrid method to extract Chinese chunks. We modify their integrated algorithm and propose a normalization algorithm for overlapping sequence reduction, in order to optimize the extraction of English sentence stems.

4.4.1. Calculation of internal association

Many association algorithms have been proposed to date, such as MI, Log Likelihood Ratio, Dice, Fisher Exact p-value, and Odds Ratio, all of which are confined to measuring the association or the attraction between two words (2-grams). In order to measure that of a longer n-gram (where n > 2), we use pseudo-bigram transformation and the probability-weighted average algorithm to calculate the internal association of CSSs, as exemplified by MI [25,30]. The principles are:

(a)
We adopt the idea of pseudo-bigram transformation to turn every single n-gram ($n \geq 2$) into n-1 pseudo-bigrams [31]. The concept of pseudo-bigram transformation assumes that every n-gram (w₁, w₂, w₃, …, w_n) has n-1 dispersion points, i.e., the spaces located between the positions of the constituent words of the n-gram. Each dispersion point transforms the n-gram into a pseudo-bigram consisting of two parts: a left part (w₁…w_i) and a right part (w_(i+1)…w_n) ( $1 \leq i \leq n - 1$ ). Therefore, n-1 dispersion points can transform the n-gram into n-1 pseudo-bigrams. For example, the tri-gram “practice makes perfect” can be transformed into two pseudo-bigrams: “practice ∗ makes perfect” and “practice makes ∗ perfect”. Once n-grams have undergone this transformation, current association measures can be applied to calculate the internal associations for these pseudo-bigrams.
(b)
We calculate the expected value of joint probability for each pseudo-bigram, obtaining n-1 expected values of joint probability for the n-gram. We then compute the probability-weighted average of all the expected values to obtain the weighted expected value of joint probability for the whole n-gram, denoted by WAP. Finally, we take the base-2 logarithm of the empirical joint probability divided by WAP, which gives the MI for the whole n-gram as its internal association value. The formula is shown below (Expression 8), where S_n represents a multi-word sequence consisting of n words, $S_{n} = \{w_{1}, w_{2}, w_{3}, \dots, w_{n}\}$ , and $i$ represents the dispersion point which transforms the sequence S_n into a pseudo-bigram and is “located” between the left and right parts of the pseudo-bigram: w₁, w₂, …, w_i and w_(i+1), …, w_n ( $1 \leq i \leq n - 1, n \geq 2$ ).

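Equation 8.

$MI(S_{n}) = \log_{2} \frac{P(S_{n})}{WAP(S_{n})}, \quad WAP(S_{n}) = \sum_{i = 1}^{n - 1} P(E_{i}) \cdot E_{i}, \quad E_{i} = P(w_{1} \dots w_{i}) \times P(w_{i + 1} \dots w_{n}), \quad P(E_{i}) = \frac{E_{i}}{\sum_{j = 1}^{n - 1} E_{j}}$ (8)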
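The pseudo-bigram transformation described in principle (a) can be sketched in a few lines of Python. The function name `pseudo_bigrams` is hypothetical and the sketch is for illustration only.

```python
def pseudo_bigrams(ngram):
    """Split an n-gram (a tuple of words) at each of its n-1 dispersion
    points, returning a list of (left part, right part) pairs."""
    return [(ngram[:i], ngram[i:]) for i in range(1, len(ngram))]

# The tri-gram "practice makes perfect" has two dispersion points.
splits = pseudo_bigrams(("practice", "makes", "perfect"))
```

Here `splits` holds the two pseudo-bigrams “practice ∗ makes perfect” and “practice makes ∗ perfect”, each as a pair of word tuples.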

Once again using the tri-gram “practice makes perfect” as an example, the data related to MI calculation is as follows:

(i)
$P_{\text{practice}} = \frac{893}{1361171} = 6.5605 \times 10^{-4}$
(ii)
$P_{\text{perfect}} = \frac{238}{1361171} = 1.7485 \times 10^{-4}$
(iii)
$P_{\text{practice makes}} = \frac{17}{1292954} = 1.3148 \times 10^{-5}$
(iv)
$P_{\text{makes perfect}} = \frac{15}{1292954} = 1.1601 \times 10^{-5}$
(v)
$P_{\text{practice makes perfect}} = \frac{11}{1225132} = 8.9786 \times 10^{-6}$

To calculate the WAP for the trigram, we will have the expected joint probability E1 for the pseudo-bigram “practice ∗ makes perfect” and E2 for the pseudo-bigram “practice makes ∗ perfect” respectively as follows:

(vi)
$E_{1} = E_{(\text{practice} \ast \text{makes perfect})} = P_{\text{practice}} \times P_{\text{makes perfect}} = 6.5605 \times 10^{-4} \times 1.1601 \times 10^{-5} \approx 7.6108 \times 10^{-9}$
(vii)
$E_{2} = E_{(\text{practice makes} \ast \text{perfect})} = P_{\text{practice makes}} \times P_{\text{perfect}} = 1.3148 \times 10^{-5} \times 1.7485 \times 10^{-4} \approx 2.2989 \times 10^{-9}$

Applying the formula of probability-weighted average (Expression 8), we have:

(viii)
$WAP_{(\text{practice makes perfect})} = \sum_{i = 1}^{2} P(E_{i}) \cdot E_{i} = P(E_{1}) \cdot E_{1} + P(E_{2}) \cdot E_{2} = \frac{7.6108 \times 10^{-9}}{7.6108 \times 10^{-9} + 2.2989 \times 10^{-9}} \times 7.6108 \times 10^{-9} + \frac{2.2989 \times 10^{-9}}{7.6108 \times 10^{-9} + 2.2989 \times 10^{-9}} \times 2.2989 \times 10^{-9} \approx 6.3785 \times 10^{-9}$

According to the refined MI algorithm (Expression 8), the internal association of the trigram “practice makes perfect” is finally calculated as follows:

(ix)
$MI_{(\text{practice makes perfect})} = \log_{2} \left( \frac{P_{\text{practice makes perfect}}}{WAP} \right) = \log_{2} \left( \frac{8.9786 \times 10^{-6}}{6.3785 \times 10^{-9}} \right) \approx 10.4590$
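The worked example in (i) to (ix) can be checked with a short Python sketch. The function name `internal_association` and the dictionary layout are assumptions for illustration; the frequencies and totals are those given in (i) to (v).

```python
import math

def internal_association(ngram, prob):
    """MI of an n-gram under the probability-weighted average scheme.
    `prob` maps word tuples to their empirical probabilities."""
    # Expected joint probability E_i for each of the n-1 pseudo-bigrams.
    expected = [prob[ngram[:i]] * prob[ngram[i:]] for i in range(1, len(ngram))]
    total = sum(expected)
    # WAP: each E_i is weighted by its share of the total, P(E_i) = E_i / sum(E).
    wap = sum((e / total) * e for e in expected)
    return math.log2(prob[ngram] / wap)

prob = {
    ("practice",): 893 / 1361171,
    ("perfect",): 238 / 1361171,
    ("practice", "makes"): 17 / 1292954,
    ("makes", "perfect"): 15 / 1292954,
    ("practice", "makes", "perfect"): 11 / 1225132,
}
mi = internal_association(("practice", "makes", "perfect"), prob)
```

Using unrounded probabilities, `mi` agrees with the value in (ix) to about two decimal places.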

4.4.2. Calculation of boundary independence

Boundary independence is another approach used to calculate the significance of occurrence of a multi-word sequence ([32]: 476–481; [33]: 9–16). It measures the clarity of the outside borders (i.e., the left and right sides) of the sequence by considering the uncertainty of its adjacent collocates. Its logic is that the borders of a sequence are clearer if its collocates are more varied and more evenly distributed; in other words, the more different words a sequence can collocate with, the more independent the borders of that sequence. Here, we employ the concept of “border entropy” to measure the degree of boundary independence of a sentence stem. If we consider the adjacent collocates of a sequence as random distributions, the larger the value of border entropy, the more uncertain the sequence's collocates will be; hence, the more independent the sequence is, and the more likely it is to be a CSS. The procedures are described as follows.

(a)
For every sentence stem, the respective sets of its left and right adjacent collocates are automatically generated. Each set contains information about the collocates, such as how many different words the sequence can collocate with, what those words are, and how many times each word co-occurs with the sequence. In order to improve the processing efficiency, we use nested Python dictionaries to store the relevant data of each sequence's left and right adjacent words. It is worth noting that two possible situations can lead to an increase in the border entropy value. One is the normal case in which the left and right adjacent words of a sequence are of various types and are evenly distributed, which indicates an unstable collocation between the sequence and its adjacent words. The other is that punctuation marks, such as colons, semicolons, periods, parentheses, and quotation marks, appear right before or after a sequence, which also indicates a very small possibility that the sequence will cross those punctuation boundaries to form a CSS with other words. For this reason, we create a special category of “empty border” to annotate the punctuation marks which occur on either the left or the right side of a sequence. In order to maximize the border entropy of a sequence with empty borders, we assign a different key name to each occurrence of an empty border, such as { ‘none_1’: 1, ‘none_2’: 1, ‘none_3’: 1, … }, even though the same punctuation mark may occur repeatedly on the sequence's borders.
(b)
With reference to the statistics generated in step (a), we calculate the respective left and right border entropy of the sequence with Expression 9 below. Let S be the candidate sequence. A represents the set of words that occur to the left side of S, a is an element in the set A, and $P(aS | S)$ refers to the probability of co-occurrence of word a and sequence S under the condition that S has occurred. B refers to the set of words that occur to the right side of S, b is an element in B, and $P(Sb | S)$ means the conditional probability that word b occurs with S, given S. We also derive an algorithm to integrate the left border entropy ( ${H(S)}_{\text{left}}$ ) and the right border entropy ( ${H(S)}_{\text{right}}$ ) of the sequence S in order to determine the overall value of boundary independence for S ( $H(S)$ ). The integrated algorithm is shown in Expression 10 below.

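Equation 9.

${H(S)}_{\text{left}} = - \sum_{a \in A} P(aS | S) \log_{2} P(aS | S), \quad {H(S)}_{\text{right}} = - \sum_{b \in B} P(Sb | S) \log_{2} P(Sb | S)$ (9)

Equation 10.

$H(S) = \sqrt{{H(S)}_{\text{left}} \times {H(S)}_{\text{right}}}$ (10)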

Here, we take the sequence there is a widespread concern over, which has occurred 5 times in TECCL-Sample, as an example to illustrate the calculation of boundary independence. Table 3 shows the concordances of there is a widespread concern over, with its left and right adjacent words or its punctuation marks highlighted in bold and shade.

Table 3.

Concordances of the sentence stem there is a widespread concern over.

1.		There is a widespread concern over	the issue that whether you prefer to study
2.		There is a widespread concern over	the topic about the formal examination, it
3.		There is a widespread concern over	whether famous people shoulder more res
4.	Currently,	there is a widespread concern over	hunting wild animals for meals. A recent s
5.	s by reading literature.	There is a widespread concern over	the issue the importance of Reading Litera


As shown in Table 3, the left side of there is a widespread concern over consists of comma (1 time), period (1 time), and “null” (the first sentence of an essay, 3 times); the right adjacent words and their respective frequency are: the (3 times), whether (1 time), and hunting (1 time). Note that “null” has occurred 3 times on the left border of there is a widespread concern over, but in our calculation, we regard it as three different “empty borders”, each of which occurs once, because three different “empty borders” will yield a larger value of the left border entropy than one “empty border” occurring 3 times, and a larger entropy value indicates a greater clarity on the left border for the sequence (refer to step (a) for more details). The comma and the period are likewise treated as empty borders, so the left side is recorded as five empty borders in total. Based on this consideration, the sequence there is a widespread concern over is stored as follows in our program:

‘there is a widespread concern over ’: { ‘Freq’ : 5,

‘leftWords’:{ ‘none_1’: 1, ‘none_2’: 1, ‘none_3’: 1, ‘none_4’: 1, ‘none_5’: 1},

‘rightWords’:{ ‘the’ : 3, ‘whether’ : 1, ‘hunting’ : 1} }
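A record of this shape could be built along the following lines. This is a hedged sketch rather than the program used in this study: the function name `collect_borders`, the `PUNCTUATION` set, and the use of `None` for a text-initial or text-final position are all assumptions.

```python
from collections import Counter

# Assumption: punctuation tokens that count as "empty borders".
PUNCTUATION = {",", ".", ":", ";", "(", ")", '"', "'"}

def collect_borders(occurrences):
    """Build the nested record for one sequence from a list of
    (left_neighbor, right_neighbor) pairs; None means the sequence
    starts or ends the text."""
    record = {"Freq": len(occurrences), "leftWords": Counter(), "rightWords": Counter()}
    for side, index in (("leftWords", 0), ("rightWords", 1)):
        empty_count = 0
        for pair in occurrences:
            neighbor = pair[index]
            if neighbor is None or neighbor in PUNCTUATION:
                # Each empty border gets its own key, so it counts once.
                empty_count += 1
                record[side][f"none_{empty_count}"] = 1
            else:
                record[side][neighbor] += 1
    return record

# The five concordance lines of "there is a widespread concern over".
occurrences = [
    (None, "the"), (None, "the"), (None, "whether"),
    (",", "hunting"), (".", "the"),
]
record = collect_borders(occurrences)
```

Giving every empty border a unique key is what makes the left border entropy of this sequence maximal for its frequency.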

According to Expressions 9 and 10, the boundary independence of there is a widespread concern over is calculated as follows:

${H(\text{there is a widespread concern over})}_{\text{left}} = - \left( \frac{1}{5} \times \log_{2} \frac{1}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} \right) \approx 2.321928$

${H(\text{there is a widespread concern over})}_{\text{right}} = - \left( \frac{3}{5} \times \log_{2} \frac{3}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} + \frac{1}{5} \times \log_{2} \frac{1}{5} \right) \approx 1.3709506$

$H(\text{there is a widespread concern over}) = \sqrt{{H(\text{there is a widespread concern over})}_{\text{left}} \times {H(\text{there is a widespread concern over})}_{\text{right}}} = \sqrt{2.321928 \times 1.3709506} \approx 1.784166$
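The same calculation can be reproduced with a minimal Python sketch. The function names are hypothetical; the neighbor counts are those stored for there is a widespread concern over, and the left and right entropies are combined by the geometric mean used in the calculation above.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a neighbor-frequency dictionary."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def boundary_independence(left_words, right_words):
    """Geometric mean of the left and right border entropies."""
    return math.sqrt(entropy(left_words) * entropy(right_words))

left = {"none_1": 1, "none_2": 1, "none_3": 1, "none_4": 1, "none_5": 1}
right = {"the": 3, "whether": 1, "hunting": 1}
h = boundary_independence(left, right)
```

Because the two entropies are multiplied, a sequence with a fixed neighbor on either side receives an overall value of zero, however varied the other side is.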

4.4.3. Threshold setting

The combined use of internal association calculation (i.e., MI) and boundary independence calculation (i.e., border entropy) can effectively reduce the redundancy of candidate CSSs, but this approach also increases the complexity of setting threshold values. In our case, the MI threshold setting is relatively simple, as a threshold value of 3 has been determined empirically and is widely used in linguistic research for choosing the best candidates and for assigning a fairly high weight (see Refs. [34]: 217; [35]: 227). We follow this tradition and use 3 as the cut-off value for the internal association MI.

On the other hand, no specific threshold value has been established for border entropy, leaving us without a reference for setting our threshold. To address this, we first plotted the distribution of border entropy values for all sentence stems, as shown in Fig. 1.

The distribution of border entropy values shows a complex pattern. For nearly four-fifths of the plot, the values form staircase-like lines with a noticeable gap, followed by a small-scale fluctuation and a sharp rise near the end. Upon examining the data, we found that among the 29,914 different candidate sequences, 15,182 have a border entropy value of 0, forming the lower horizontal line in the plot, and 7687 have a value of 1, forming the higher, shorter line. Only 518 sequences have a value between 0 and 1, creating the small gap in the plot. We now consider the cases with border entropy values of 0 and 1 for threshold setting.4

(a)
When the border entropy value of a sequence is 0. According to Expressions 9 and 10, there is only one possibility for a border entropy value of 0: the sequence co-occurs with one and the same word on its left or right border. In other words, the sequence has a very strong tendency to collocate with one word and, thus, its border is not clear at all. For example, it may be true occurs 4 times in TECCL-Sample, each time collocating with the word that on its right side. This leads to a border entropy value of 0. For this reason, we decide to delete the sentence stems whose border entropy value is 0.
(b)
When the border entropy value of a sequence is 1. Based on our observations of the data, almost all the sequences with a border entropy value of 1 occur twice in the corpus and collocate with different words on both the left and right sides. For example, it is widely acknowledged that the occurs twice, with two “empty borders” on its left side and two words (world and main) on the right side, which results in a border entropy value of 1. Considering that the text range parameter has a threshold value of “more than four different essays” (see Section 4.5 for a detailed account), we exclude the sequences whose frequency of occurrence is 2 and whose border entropy is 1.

Based on the above considerations, we empirically set the threshold value of MI to 3 and the threshold of border entropy to 1.
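The two boundary cases can be verified directly, assuming the entropy form of Expression 9 and the geometric-mean integration applied in the worked example of 4.4.2. For it may be true, whose only right neighbor is that (4 times), the right border entropy vanishes, and so does the integrated value regardless of the left border:

${H(\text{it may be true})}_{\text{right}} = - \left( \frac{4}{4} \times \log_{2} \frac{4}{4} \right) = 0, \quad H(\text{it may be true}) = \sqrt{{H(\text{it may be true})}_{\text{left}} \times 0} = 0$

For it is widely acknowledged that the, with two distinct neighbors on each side, each occurring once:

${H(S)}_{\text{left}} = {H(S)}_{\text{right}} = - \left( \frac{1}{2} \times \log_{2} \frac{1}{2} + \frac{1}{2} \times \log_{2} \frac{1}{2} \right) = 1, \quad H(S) = \sqrt{1 \times 1} = 1$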

4.5. Text range calculation

We also include “text range” in our extraction of CSSs, because only when the occurrences of a sequence are statistically significant and sufficiently dispersed across texts do we have reason to treat the sequence as an expression characteristic of Chinese learner English. In other words, the parameter of text range is employed to ensure that the use of a CSS is not the idiosyncrasy of an individual student but, rather, an expression shared with other peers. The threshold value of text range (R) is set at a fairly low level in our extraction: R > 4, which means that a sequence meets our text range requirement as long as it appears in more than four essays.

In total, we have set three parameters to delimit CSSs: internal association (MI > 3), boundary independence (H > 1), and text range (R > 4). Only sentence stems that satisfy all of the three requirements are identified as candidate sequences for the next procedure. After taking this step, 2408 varied sequences have been extracted.
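The text range count and the combined filter can be sketched as follows. The function names and the representation of essays as token lists are assumptions for illustration; the default cut-offs are the three thresholds stated above.

```python
def text_range(sequence, essays):
    """Number of essays (token lists) that contain the sequence at least once."""
    n = len(sequence)
    return sum(
        any(tuple(tokens[i:i + n]) == sequence for i in range(len(tokens) - n + 1))
        for tokens in essays
    )

def passes_thresholds(mi, h, r, mi_min=3, h_min=1, r_min=4):
    """Apply the three cut-offs: MI > 3, H > 1, and R > 4."""
    return mi > mi_min and h > h_min and r > r_min
```

All three comparisons are strict, so a sequence with a border entropy of exactly 1 or a text range of exactly 4 is rejected.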

4.6. Overlapping sequence reduction

The cut-off scores of the above three parameters (i.e., MI, border entropy, and text range) have excluded a large number of noise sequences, but we still extract sequences like it is convenient (MI = 5.48, H = 3.21, R = 55), it is convenient for (MI = 6.31, H = 3.53, R = 32), it is convenient for us (MI = 6.53, H = 2.34, R = 14), it is convenient for people (MI = 5.28, H = 1.68, R = 6), it is convenient for us to (MI = 5.6, H = 3.01, R = 10), since their MI, border entropy, and text range scores are all above the threshold values. A noticeable feature of such sequences of different lengths is that the shorter sequence is part of, or is included in, the longer one; that is, they are overlapping sequences. In the present context, the shorter sequence is called a “sub-string” and the longer sequence a “super-string.” The problem now is how to choose an appropriate sequence as the CSS from among all of the sub-strings and super-strings.

In this study, we use the LocalMax algorithm to remove the overlapping sequences [31]. Let S_n be the candidate sequence that consists of n words. S_n-1 represents any substring of S_n that has the size of (n-1), and S_n+1 is any super-string of S_n that has the size of (n+1). After we have extracted the sequences whose MI, border entropy, and text range scores are all larger than their cut-off values, we calculate the product of normalized MI, $N (MI)$ , and normalized border entropy, $N (H)$ , for each sequence S_n, and assign the obtained value to a new variable GI (Global Index). The formula for LocalMax is as follows:

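Equation 11.

$GI(S_{n}) = N(MI(S_{n})) \times N(H(S_{n})); \quad S_{n} \text{ is retained if } GI(S_{n}) > GI(S_{n-1}) \text{ and } GI(S_{n}) > GI(S_{n+1}) \text{ for all } S_{n-1} \text{ and } S_{n+1}$ (11)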

However, it should be noted that, in Expression 11 above, we use the normalized values of MI and border entropy instead of their absolute values in the calculation of GI, i.e., $GI = N(MI) \times N(H)$ . This is because the ranges of the absolute values of MI and border entropy differ greatly; to be more specific, MI values are distributed within the range of [-3.56, 19.50], while border entropy values are in the range of [0, 7.08]. By our count, 97.56 % of the sequences obtained in Step 4.4 have a higher or even a much higher value of MI than of border entropy. If we calculated the GI of each sequence simply by multiplying the absolute values of MI and border entropy, the MI would tend to have a larger impact on the final value of GI than the border entropy. As a consequence, the choice to keep or to delete a sequence would depend more on its internal association than on its boundary independence. However, we argue that the inside measurement of a sequence (i.e., the MI in this study) and the outside measurement (i.e., the border entropy) should carry equal weight when judging the significance of occurrence of the sequence. Therefore, we introduce the Min-Max algorithm from statistics to normalize and convert the values of MI and border entropy, respectively, into the range [0, 1] by way of a linear transformation. This normalization serves, on the one hand, to offset the unbalanced impact of MI and border entropy on the final result of our extraction and, on the other, to retain as far as possible the respective inner distributions of MI values and border entropy values.
Suppose that $MI = \{MI_{i}, i = 1, 2, \dots, n\}$ is the set of MI absolute values, $MI_{\min}$ and $MI_{\max}$ are the respective minimum and maximum value of the set, $N(MI_{i})$ is the normalized value of any $MI_{i}$ in the set; $H = \{H_{i}, i = 1, 2, \dots, n\}$ is the set of border entropy values with $H_{\min}$ and $H_{\max}$ as its minimum and maximum values, and $N(H_{i})$ is the normalized value of any $H_{i}$ . The Min-Max normalization algorithm is shown in Expression 12 below.

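Equation 12.

$N(MI_{i}) = \frac{MI_{i} - MI_{\min}}{MI_{\max} - MI_{\min}}, \quad N(H_{i}) = \frac{H_{i} - H_{\min}}{H_{\max} - H_{\min}}$ (12)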
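As a rough illustration of Min-Max normalization and the resulting GI, consider the following Python sketch. The function names are hypothetical, and it assumes that the minimum and maximum used for normalization are the range endpoints reported above ([-3.56, 19.50] for MI and [0, 7.08] for border entropy).

```python
def min_max(value, lower, upper):
    """Linearly rescale value from [lower, upper] to [0, 1]."""
    return (value - lower) / (upper - lower)

def global_index(mi, h, mi_range=(-3.56, 19.50), h_range=(0.0, 7.08)):
    """GI = N(MI) * N(H), using the value ranges reported in the text."""
    return min_max(mi, *mi_range) * min_max(h, *h_range)

# "it is convenient for": MI = 6.31, H = 3.53
gi = global_index(6.31, 3.53)
```

Under these assumptions, `gi` comes to roughly 0.21 for it is convenient for; after normalization both factors lie in [0, 1], so neither measure dominates the product merely because of its scale.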

Here, we take the above example it is convenient for, which contains four words, into consideration. Table 4 shows all the 3-word sub-strings and 5-word super-strings of the sequence with their respective GI values.

Table 4.

3-word sub-strings and 5-word super-strings of it is convenient for (with GI value).

3-word sub-strings	Candidate CSS	5-word super-strings
it is convenient (GI = 0.18)	it is convenient for (GI = 0.21)	it is convenient for us (GI = 0.15)
		it is convenient for people (GI = 0.09)
		more it is convenient for (GI = 0.05)
		firstly it is convenient for (GI = 0.11)


Note that we only compare the sub-strings and super-strings which are sentence stems and whose MI and border entropy values are both larger than the cut-off scores. Altogether, one 3-word sub-string and four different 5-word super-strings of it is convenient for are selected for the calculation of LocalMax; other sub-strings and super-strings are discarded because either they are not sentence stems (e.g., is convenient for) or they do not satisfy the threshold requirements of MI, border entropy, or text range (e.g., it is convenient for students, if it is convenient for). By performing the LocalMax algorithm, the 4-word sentence stem it is convenient for is finally identified as a CSS because its GI value, which stands at 0.21, is higher than that of any of its 3-word sub-strings and 5-word super-strings.
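The selection just described can be sketched in Python with the GI values from Table 4. The function name `is_local_max` is hypothetical, and the strict comparison is an assumption that follows the wording “higher than” above.

```python
def is_local_max(gi, neighbor_gis):
    """True if gi is strictly higher than the GI of every qualifying
    sub-string and super-string."""
    return all(gi > other for other in neighbor_gis)

# GI values from Table 4.
gi_values = {
    "it is convenient": 0.18,
    "it is convenient for": 0.21,
    "it is convenient for us": 0.15,
    "it is convenient for people": 0.09,
    "more it is convenient for": 0.05,
    "firstly it is convenient for": 0.11,
}
candidate = "it is convenient for"
neighbors = [v for k, v in gi_values.items() if k != candidate]
selected = is_local_max(gi_values[candidate], neighbors)
```

In a full implementation, each candidate would be compared only with its own (n-1)-word sub-strings and (n+1)-word super-strings that have already passed the three thresholds.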

5. Results and discussions

5.1. Overall data profile of the extracted CSSs

With the aforementioned steps, 1293 different CSSs (types), which occur 16,324 times in total (tokens), were automatically extracted, with their lengths varying from three to seven words. We then manually checked through each extracted CSS for precision and filtered out 320 varied dubious sequences that did not fit our intuition. In the end, we identified 973 different CSSs (types) with a total of 12,249 instances (tokens) from the corpus (See Appendix A for a list of 500 examples of finally-identified CSSs). In what follows, we will demonstrate the structural and functional distribution of the extracted CSSs to offer a more refined profile of the overall CSS data.

(1)
Structural distribution of CSSs

Drawing on Altenberg's [13] taxonomy, we classify CSSs into two broad structural categories: full clauses and clause constituents (see Table 5).

Table 5.

Distribution of CSSs of different structural categories.

Structural categories		Types	Percentage (%)	Tokens	Percentage (%)
Full clauses		122	12.538	1725	14.083
(a). Independent clause: maxims, proverbs, etc.		39	4.008	447	3.649
(b). Independent clause: others		58	5.961	457	3.731
(c). Dependent clause: as-introduced CSSs		25	2.569	821	6.703

Clause constituents		851	87.460	10524	85.917
(a). Personal subject		429	44.090	5676	46.338
(b). Impersonal subject	general	49	5.036	495	4.041
	specific	100	10.277	797	6.507
	demonstrative pronoun	58	5.961	761	6.213
(c). Dummy-it construction		128	13.155	1566	12.785
(d). Existential construction		87	8.941	1229	10.033

Total		973	100	12249	100


As shown in Table 5, 122 types and 1725 tokens of CSSs are identified to be full clauses; they are classified into three sub-categories: (a) Independent clauses that are maxims, proverbs, and other fragments of rhetoric (e.g., every coin has two sides). (b) Other independent clauses that are mainly expressed by two groups of subject: the general subject that consists of the expressions commonly used in developing an argument (e.g., advantages outweigh the disadvantages), and the specific subject that is linked to the semantic content of the topic of an essay (e.g., online shopping has many advantages). (c) As-introduced CSSs (e.g., as can be seen), which are identified as dependent clauses, a main subcategory of full clauses according to Altenberg (1998: 109).

The clause constituent CSSs, with 851 types and 10,524 tokens, outnumber the full clause CSSs by about seven to one (in types) and about six to one (in tokens). CSSs in this category are divided into four structural sub-categories: (a) Personal subject CSSs, which are expressed by personal pronouns or nouns as subject (e.g., we should pay attention to). (b) Impersonal subject CSSs, whose subject position is occupied by impersonal nouns or pronouns in three groups: the topic-specific subject (e.g., appearance is more important than), the general subject (e.g., the main reason is that), and the demonstrative pronoun (e.g., that is the reason why). (c) Dummy-it CSSs, which are introduced by the anticipatory it (e.g., it is obvious that). (d) Existential CSSs, which are introduced by existential-there (e.g., there is no doubt that).

(2)
Functional distribution of argumentation-related CSSs

Scrutiny of the list of sentence stems (Appendix A) shows that most of the extracted CSSs are argumentation-related. It is found that the argumentation-related CSSs, with 931 types and 11,882 tokens, consist mostly of the expressions that are commonly used in the two components of developing an argument [36]: “analyzing and evaluating content knowledge” and “developing the writer's own position” (see Table 6).

Table 6.

Distribution of argumentation-related CSSs of different functional types.

Functional categories	Types	Percentage (%)	Tokens	Percentage (%)
Analyzing and evaluating content knowledge	313	33.62	3846	32.37
(a). describing the current situation or background	214	22.99	2116	17.81
(b). stating others' views	39	4.19	612	5.15
(c). stating popular assumptions	26	2.79	696	5.86
(d). indicating source of the opinion	23	2.47	330	2.78
(e). identifying conflicting points of view	11	1.18	92	0.77

Developing the writer's own position	618	66.38	8036	67.63
(a). stating an opinion or expressing a stance	443	47.58	5935	49.95
(b). giving reasons or explanations	97	10.42	1116	9.39
(c). supporting a claim with maxims, proverbs, etc.	39	4.19	447	3.76
(d). indicating conditions	24	2.58	429	3.61
(e). raising a question	7	0.75	43	0.36
(f). concluding or summarizing	8	0.86	66	0.56

Total	931	100.00	11882	100.00


The first element of argumentation “analyzing and evaluating content knowledge” requires that students possess adequate subject knowledge and are capable of distinguishing relevant from irrelevant information in the literature ([37]: 147). It is shown that 313 types and 3846 tokens of the extracted CSSs are used in relation to this element; specifically, they are used to realize five discourse-pragmatic functions: (a) “describing the current situation or background” (e.g., people suffer from), (b) “stating others’ views” (e.g., some people hold the opinion that), (c) “stating popular assumptions” (e.g., there is a widespread concern over), (d) “indicating source of the opinion” (e.g., as an old saying goes), (e) “identifying conflicting points of view” (e.g., there are different opinions among people).

The second element of argumentation “developing the writer's own position” requires that students express their opinion or establish a position based on their subject knowledge and be able to show a ‘workable balance between self and sources’ ([38]: 65). 618 types and 8036 tokens of CSSs are identified in relation to this element, about twice as many as those for “analyzing and evaluating content knowledge” either in types (618/313) or in tokens (8036/3846). Specifically, the CSSs of this type are found to realize six discourse-pragmatic functions: (a) stating personal opinion or expressing a stance (e.g., there is no doubt that), (b) giving reasons or explanations (e.g., that is the reason why), (c) quoting maxims, proverbs or fragments of rhetoric, typically, to support a claim, view, etc. (e.g., practice makes perfect), (d) indicating conditions (e.g., when it comes to), (e) raising a question (e.g., how should we deal with), (f) concluding or summarizing (e.g., we can safely draw the conclusion that).

5.2. Comparison of the extracted CSSs using different association measures

We employed six association measures, including four non-directional measures (Dice, Odds Ratio, Fisher Exact p-value, and LLR) and two directional measures (ΔP Attraction and ΔP Reliance), as substitutes for MI, to individually calculate the internal association of a sentence stem. These measures were applied in conjunction with boundary independence calculation and text range calculation to extract CSSs from the corpus. We selected the top 500 CSSs (types) from the results of each association measure (including MI) for comparison. Fig. 2 shows the overall distribution of CSS types in relation to sequence length. The figure consists of seven clustered bar charts, each representing the distribution of CSSs of different lengths extracted using different association measures.

The data analysis of Fig. 2 reveals distinctions in the sequence length distribution based on the choice of association measures. It is shown that the distributions derived from Dice, LLR, and Fisher Exact p-value are remarkably similar. Specifically, the highest number of CSSs corresponds to three-word sequences, as indicated by the light blue bars in Fig. 2. This prevalence of three-word sequences is significantly higher than sequences in other lengths. On the other hand, MI, ΔP Attraction, ΔP Reliance, and Odds Ratio exhibit a relatively similar distribution, with the highest number of CSSs being four-word sequences, represented by the orange bars. This dominance of four-word sequences is noticeably greater than sequences in other lengths. However, subtle variations exist among the four algorithms in terms of their performance on three-, five-, and six-word sequences.

We then conducted a comparative analysis by juxtaposing two sets of CSSs. The first set encompasses the top 500 CSSs extracted using MI, while the second set comprises the top 500 CSSs extracted individually using each of the alternative association measures (Odds Ratio, ΔP Reliance, ΔP Attraction, Dice, LLR, and Fisher Exact p-value). Through the application of set intersection operations to these sets, we aim to delineate the common or shared sentence stems that are extracted by MI and the association measures. The overall intersection of CSSs between each of the two sets is graphically presented in Fig. 3.

Fig. 3 provides an overarching perspective on the degree of overlap between the sentence stems obtained through six alternative association measures and those obtained through MI. From the number of sentence stems (types) in the intersections, ΔP Attraction demonstrates the highest level of overlap with MI, with 280 shared sentence stems. This is followed by ΔP Reliance with 266, Dice with 223, LLR with 219, and p-value with 206. Odds Ratio exhibits the lowest level of overlap, with 187 shared sentence stems. Turning to sequence length, a notable observation emerges in terms of the shared sentence stems; that is, four-word sequences consistently manifest a substantial overlap across all six “association measure-MI” pairings. This overlap also maintains a relatively stable count across the pairings, ranging from 82 (MI-Odds Ratio) to 135 (MI-ΔP Attraction).

To offer a more detailed examination of each comparative pairing, Fig. 4 provides a granular breakdown of the intersection results of CSSs in terms of shared and distinct sentence stems for each “association measure-MI” pairing. The figure comprises six bar charts, each illustrating the comparative results of MI and the other association measure in a pairing. The grey bars represent the number of shared sentence stems extracted by the two measures (i.e., MI∩association measure). The blue bars denote the number of distinct sentence stems extracted exclusively by MI but not by the other association measure in a pairing (i.e., MI - association measure), while the orange bars represent the number of sentence stems extracted by the other association measure but not by MI (i.e., association measure - MI). The blue and orange bars collectively depict the difference in extraction results between the two measures in a pairing.
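The set operations behind such a comparison can be sketched in Python as follows. The function names are hypothetical, and the two short lists used in the usage note are toy inputs made up for illustration, not the extracted top-500 lists.

```python
def compare_extractions(mi_stems, other_stems):
    """Return the shared, MI-only, and other-only sentence stems."""
    mi_set, other_set = set(mi_stems), set(other_stems)
    return {
        "shared": mi_set & other_set,      # MI ∩ association measure
        "mi_only": mi_set - other_set,     # MI - association measure
        "other_only": other_set - mi_set,  # association measure - MI
    }

def count_by_length(stems):
    """Count sentence stems by their length in words."""
    counts = {}
    for stem in stems:
        n = len(stem.split())
        counts[n] = counts.get(n, 0) + 1
    return counts

# Toy input for illustration only.
mi_top = ["as is known to all", "it is obvious that", "there is no doubt that"]
other_top = ["as is known", "it is obvious that", "there is no doubt that"]
result = compare_extractions(mi_top, other_top)
```

Applying `count_by_length` to each of the three sets gives the per-length counts that bar charts of this kind display.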

The analysis of Fig. 4 reveals that MI tends to favor the extraction of longer sequences, compared to Dice, LLR, and Fisher Exact p-value. As shown in Fig. 4, in the pairings of “MI-Dice” (Fig. 4A), “MI-LLR” (Fig. 4C), and “MI-pValue” (Fig. 4E), the orange bars for three-word sequences are significantly higher than their corresponding blue bars. This indicates that the Dice, LLR, and Fisher Exact p-value algorithms extract far more three-word sequences than MI. Conversely, for four- to seven-word sequences, the blue bars are notably higher than their orange counterparts. This suggests that MI consistently extracts more four- to seven-word sequences than Dice, LLR, and Fisher Exact p-value. However, in the pairings of “MI-Odds Ratio” (Fig. 4B), “MI-ΔP Attraction” (Fig. 4D), and “MI-ΔP Reliance” (Fig. 4F), MI does not exhibit the pronounced tendency to favor the extraction of longer sequences.

In summary, based on the above comparative analysis of association measures for sentence stem extraction, we can outline two main findings as follows.

(a)
In the comparison between MI and the six alternative association measures, MI and ΔP Attraction stand out by extracting the highest number of shared sentence stems (as illustrated in Fig. 3). This suggests a higher degree of similarity in the results extracted using the two measures. Among the shared sentence stems across the six “MI-association measure” pairings, four-word sequences are the most prevalent.
(b)
Three association measures (Dice, LLR, and Fisher Exact p-value) show a notable inclination toward shorter sequences, with three-word sequences being the most frequently extracted. In contrast, MI, ΔP Attraction, Odds Ratio and ΔP Reliance exhibit a preference for extracting longer sequences, which, while potentially less frequent than their shorter counterparts, are likely to be more informative. For example, the five-word sequence as is known to all (extracted by MI) yields more specific information than the four-word sequence as is known to (extracted by Fisher Exact p-value), which in turn provides more information than the corresponding three-word sequence as is known (extracted by LLR). The sequence as is known to all, as opposed to the simpler sequence as is known, manifests a higher level of information richness and collocational specificity and is thus more beneficial for non-native students' writing.
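The nesting relation among as is known, as is known to, and as is known to all can be checked mechanically. The sketch below only flags shorter sequences that occur as contiguous word spans inside longer ones; the helper names `is_nested` and `nested_pairs` are hypothetical, and this is not the overlapping sequence reduction procedure used in the study.

```python
def is_nested(shorter, longer):
    """Return True if `shorter` is a contiguous word span inside `longer`."""
    s, l = shorter.split(), longer.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def nested_pairs(stems):
    """List (shorter, longer) pairs in which the first is nested in the second."""
    return sorted((a, b) for a in stems for b in stems if is_nested(a, b))


# Toy input taken from the example above plus one unrelated stem.
stems = ["as is known", "as is known to", "as is known to all", "we can see"]
pairs = nested_pairs(stems)
for shorter, longer in pairs:
    print(f"{shorter!r} is nested in {longer!r}")
```

A list of such pairs makes it easy to inspect which measure stopped at the shorter variant and which one recovered the fuller expression.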

5.3. Pedagogical implications

The phraseological deficits experienced by learners have long been noted. Errors stemming from a lack of phraseological competence, although not always major and varying in their impact on intelligibility, have an appreciable impact on the effectiveness of student writing. To enhance EFL learners' phraseological competence, it has been widely argued that approaches to second-language instruction should ensure that learners develop a rich repertoire of formulaic sequences ([39]: 142). Our investigation into CSSs underscores the expansive nature of this repertoire, extending beyond phrase-level sequences to encompass clause-level sequences. The identified CSSs demonstrate the formulaic patterns at the clause level in Chinese EFL learners’ essays and reveal the typical way in which Chinese learners write essays. Additionally, categorizing CSSs into lists based on their sentence patterns and discourse-pragmatic functions enables teachers to target either the structural or functional aspects of clause-level idiomatic expressions in Chinese learner English. Thus, from a pedagogical perspective, CSSs could have potential implications for Chinese EAP teaching and learning.

Next, we will present a specific example illustrating the pedagogical implication of CSSs.

It has been argued that the element of argumentation “developing the writer's own position” poses considerable difficulties for the novice writer ([37]: 147). Our extraction captures 39 different maxim-like expressions to realize this element in Chinese EFL learners' essays. The use of those expressions, which consist of maxims, proverbs, and other fragments of rhetoric, is deemed a good indicator to evaluate the sophistication of lexical use of an EFL learner in essay writing. Our extraction shows that Chinese EFL learners are aware of using maxims, proverbs, etc., in developing arguments in their essays. However, their usages of maxim-like expressions reveal two notable issues.

(a)
Some maxims have been used too often to be considered striking or interesting usages; a few of them have become overused to the point of being trite and clichéd. For example, among the 73 essays on the topic of failure or success, the expression failure is the mother of success occurred 13 times, which means that roughly one in six (17.81 %, 13/73) essays on the topic used the expression. The repetitive occurrence of some maxims across Chinese learners' essays undermines the effectiveness of those maxims and makes the argument less interesting.
(b)
Some idiomatic expressions can be traced back to their origins in Chinese. For example, the expression long time no see is derived from the Chinese greeting “好久不见.” A few expressions are also found to be direct translations of Chinese idioms or maxims, such as practice is the sole criterion for testing truth (实践是检验真理的唯一标准) and knowledge is power (知识就是力量). Those English expressions may sound unidiomatic or strange to the native ear, but their corresponding Chinese expressions are highly familiar to native Chinese speakers.

From the analysis above, we can see that Chinese EFL learners are willing to express themselves with maxim-like expressions in developing an argument, but they seem to lack the ability to quote widely varying maxims, especially under (time) pressure. Explicit instruction on subject-related idiomatic expressions, especially those suitable for argumentative contexts, could potentially empower Chinese EFL learners to quote a diverse range of maxims more effectively. To enhance writing quality, instructors could also foster students' awareness of the importance of carefully selecting and using a broader range of nuanced maxim-like expressions to avoid the overuse of clichéd maxims. Moreover, instructors could foster students' cultural awareness regarding the use of maxim-like expressions. This involves guiding students to develop sensitivity and originality by expanding their repertoire of culturally-loaded expressions, ensuring a nuanced and culturally-aware use of maxim-like expressions in Chinese EFL learners’ writing.

6. Conclusion and limitations

This article explored the feasibility of automatic extraction of CSSs, a special category of clause-level phraseological units, from Chinese learner corpora. It also compared the extraction results of CSSs by using different association measures and discussed potential implications that the extracted CSSs could have for Chinese EFL teaching and learning.

The extraction method of CSSs is the focal point of this article. It involves six steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range calculation, and overlapping sequence reduction. The procedure starts with the preliminary extraction of formally qualified sequences with subject-predicate structures. Then, three parameters (internal association, boundary independence, and text range) are used to measure the typicality of each sequence in academic texts. Internal association measures the cohesion inside a CSS, boundary independence measures the clarity of a CSS's outside borders, and text range calculates the inter-textual dispersion of a CSS. Finally, the Min-Max normalization algorithm is applied to remove overlapping sequences. Using this method, the study extracted 973 different CSSs from the corpus. This paper also compared the CSSs extracted using different association measures for internal association calculation.
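Two of the three parameters summarized above can be illustrated with a short sketch. It assumes that boundary independence is operationalized as the Shannon entropy (in bits) of the words adjacent to a sequence, and that text range is the number of texts containing the sequence at least once. The function names, the whitespace tokenization, and the toy texts are assumptions made for illustration; the exact formulas and thresholds applied in the study are not reproduced here.

```python
import math
from collections import Counter


def find_occurrences(tokens, seq):
    """Return the start indices at which `seq` occurs in `tokens`."""
    n = len(seq)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq]


def border_entropy(texts, seq, side="right"):
    """Shannon entropy (bits) of the words adjacent to `seq` on one side.

    Higher entropy means more varied neighbours, i.e. a clearer border.
    Occurrences at a text edge have no neighbour and are skipped.
    """
    neighbours = Counter()
    for tokens in texts:
        for i in find_occurrences(tokens, seq):
            j = i + len(seq) if side == "right" else i - 1
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


def text_range(texts, seq):
    """Number of texts in which `seq` occurs at least once."""
    return sum(1 for tokens in texts if find_occurrences(tokens, seq))


# Toy corpus (hypothetical): each text is a list of word tokens.
texts = [
    "as we all know practice makes perfect".split(),
    "as we all know time is money".split(),
    "i think it is true".split(),
]
seq = "as we all know".split()

right_h = border_entropy(texts, seq, side="right")
left_h = border_entropy(texts, seq, side="left")
n_texts = text_range(texts, seq)
print(right_h, left_h, n_texts)
```

In this toy corpus the sequence is followed by two different words with equal frequency, so its right border is maximally varied for two observations, while no left neighbours are observed because both occurrences are text-initial.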

Our methods for the automatic extraction of CSSs from corpus data offer significant potential for advancing research on the phraseology of EFL learners. As Ellis ([40]: 41) stated, “language acquisition is essentially a sequence learning problem.” Our method and results suggest that CSSs (a specific type of sequence) can be statistically measured and automatically extracted from corpora, enabling a detailed examination of the clause-level formulaic patterns characteristic of EFL learners’ writing. By categorizing CSSs according to their primary structures and functions, we could enhance our understanding of learner phraseology and support a targeted analysis of clause-level phraseological use within this demographic. Moreover, the adaptability of our method to different corpora enables customized analyses of phraseology in EFL.

However, we have to admit that this article is only a preliminary exploration of CSSs, focusing on methodological issues, and that it has not yet investigated the patterns of co-selection of a CSS when realizing a specific function. For example, the CSS we should pay attention to frequently co-occurs with result/inference adverbials (e.g., so, as a result) or hedging expressions (e.g., I think, as far as I'm concerned) to emphasize a cautious recommendation based on previous information or reasoning. As noted by Lee and Swales ([41]: 57), “what apprentice writers may be mostly missing is fine tuning of lexical and syntactic subtleties, particularly in terms of their strategic and rhetorical implications.” Therefore, careful scrutiny of the combinatory behavior of CSSs with their patterns and functions will facilitate our understanding of how Chinese learners apply their clause-level phraseological competence in essay writing. In our follow-up study, we will examine the co-selection patterns of each of the extracted CSSs. It is believed that the CSSs, combined with their co-selection patterns for realizing specific functions, would have greater potential value in application to non-native EAP teaching and learning. Another limitation of this study is that the extracted CSSs are constrained to a length ranging from three to seven words. It must be acknowledged that some idiomatic sequences may exceed this limit. Expanding the range of sequence length for extracting CSSs could capture a broader range of idiomatic expressions. Nevertheless, the inclusion of longer sequences may also increase the computational complexity of statistical analyses.

CRediT authorship contribution statement

Jingjie Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Wenjie Hu: Writing – review & editing.

Data availability statement

Data are available from the corresponding author upon reasonable request.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jingjie Li reports that financial support was provided by the Shanghai Planning Office of Philosophy and Social Science.

Acknowledgements

This work has been supported by the Shanghai Planning Office of Philosophy and Social Science (grant ref. 2021BYY001). The authors are grateful to the anonymous reviewers for their detailed and helpful comments on earlier drafts of this paper.

Footnotes

It is noteworthy that Gries [20] also concurrently introduced the ΔP measure for calculating directional word associations and proposed the pairwise measure of ΔP_{left-to-right} and ΔP_{right-to-left}, the formula of which is consistent with that of ΔP Attraction and ΔP Reliance.

The Natural Language Toolkit (NLTK) is an open-source suite of Python programs and data for Natural Language Processing, providing “a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning” (http://nltk.org/).

In the C7 tagging system, words that can serve either as pronouns or as determiners, such as any, some, this, and that, are all tagged as determiners. To avoid omissions, we deliberately incorporated five out of thirteen subcategories of determiners into subject recognition.

We exclude the sequences with the border entropy value between 0 and 1 from our consideration, as their number is very small, only accounting for 1.73 % of the total data under processing.

Appendix A. 500 examples of CSSs categorized by structure

1. Full Clause
(1). Independent clause: maxims, proverbs, fragments of rhetoric, etc.:
(every/each) coin has (its) two sides	we are what we read
nothing succeeds without a strong will	water is the source of life
everything has two sides	book is the ladder of human progress
failure is the mother of success	dream will come true
knowledge is power	time is money
practice makes perfect	practice is the sole criterion [for testing truth]
books are the ladder of human progress	the early bird catches the worm
interest is the best teacher	[a] friend in need is a friend indeed
nothing is impossible	histories make men wise
long time no see	[the years were a mirage and] there had been no years
actions speak louder than words	[if you try your] best everything can be done
everyone is equal	everyone has a dream
life is the greatest teacher	classics represent the wisdom of the past
life is short	everything is possible
nothing in the world is difficult [for one who sets his mind to it]	smoking is harmful
(2). Independent clause: others
there are different opinions among people	we should limit the development of tourism
reasons are as follows	students lack social practice
advantages outweigh the disadvantages	existing trade agreements should be repaired
skills and creativity are both worthwhile goals	it has both advantages and disadvantages
one thing is certain	life (will be/is) meaningful
online shopping has many advantages	the government should establish free libraries
online shopping has become a fashion	life will be colorful
generation gap is very common at present	company has won a large export order
water shortage is becoming an urgent problem	students are encouraged to make comments
college students should participate in social practice	parents love their children
we should read more books	college life is wonderful
my mother is a housewife	online shopping is convenient
life will be better	the death penalty is a step back
air is fresh	we should help each other
answer is yes	students choose to take part-time jobs
winter is coming	reading is very important
spring festival is a traditional festival	newspaper is a better source of news
families have only one child	English is very important
life is boring	life is not easy
going to classes should be optional	college life is different
(3). Dependent clause: as-introduced CSSs
as we (all) know	as the proverb goes
as is known to (all/us)	as we can see (in/from) the picture
as (the/a) saying goes	as the proverb says
(just) as (the/an) old saying goes	as time went by
as everyone knows	as is shown (in the picture)
as time goes by	as can be seen
as we all known	as is vividly depicted
as time goes on	as mentioned above

2. Clause Constituent
(1). Dummy-it CSSs:
in my opinion it is necessary	it is never too late to
it is obvious that	it is evident that
it is easy to	it is not difficult to find
it is said that	it is worthwhile
it is necessary for	it seems to me that
it is true that	it is well-known to us
it is important for	it is advisable
it is necessary for us to	it is known to us
it is reported that	it is very difficult
it is convenient for	it can be said
it is universally acknowledged that	it would be better
it is very hard to	it is not useful
it is significant to	it was not until
it is high time that	it is our responsibility
it goes without saying that	it is convenient for us to
it is undeniable that	it is not wise
it is clear that	it is no denying that
it is well known that	it is suitable for
it is unnecessary for	it is imperative for
it is very convenient	it is beneficial to
it is impossible for	it is our duty to
it is difficult for us to	it is different from
it is known to all that	it is time to
it is widely acknowledged that	it can be seen
it was the first time	it is harmful to
it is helpful for us to	it is impossible for us to
it is a pity	it is the best way to
it is likely that	it is necessary for me to
it is essential for	it is much easier to
it is time for us to	it is better to
it is believed that	it is wise to
it (doesn't/does not) matter	it is time for
it is the same with	it is useful for
it is important for me to	it is likely to
it is not fair	it is not easy for
(2). Existential CSSs:
there is no doubt that	there are some disadvantages
there are (many/several/some/two/three/numerous) reasons	there are still (many/some)
there is no denying that	there are many interesting
there is a saying	there will be a lot of
there is an old saying	there is a widespread concern over
there are a large number of	there is a phenomenon that
there are plenty of	there are many places
there are a variety of	there is something wrong
there are many kinds of	there are many factors
there is no doubt	there are thousands of
there are all kinds of	there are many differences between
there are (many/some) advantages	there is some truth in
there is no denying the fact that	there is no better
there is only one	there is no need
there are many problems	there is no one
there are more and more people	there is no real
there are so (many/much)	there is no way
there is an increasing	there is one thing
there are many benefits	there may be some
there are a number of	there were so many
there are many disadvantages	there will be many
on the other hand there are some	there will be some
(3). Personal CSSs:
different people have different	students should learn how to
everyone has their own	we can make friends with
some people think	some people don't think
everyone has his own	we should be grateful
we can see	people are pursuing
different people hold different	students think that
some people say	in this way can we keep
we should pay attention to	we will be able to
when they grow up	people are aware of
we are supposed to	we look forward to your
we should try our best to	in the picture we can see
we should make full use of	people insist that
we should cherish	we should balance
some people hold the opinion that	we shall fight him by
some people believe that	students pay attention to
we are able to	we must be careful
from the picture we can see	other people can reach them
when they graduate	students can apply
we can not afford to lose	people are concerned about
we can communicate with	only in this way can we live
some people suppose	many people argue
we can draw a conclusion that	students spend too much time
some people hold the belief that	we can benefit a lot from
first of all we have to	people are afraid of
we should take part in	we are no longer
some students think	we should bear in mind
so that we can get	people suggest that
some people hold the view that	students can learn how to
others believe that	some people agree
we should pay more attention to	can we solve the problem
we should cultivate	we can make full use of
we can draw the conclusion that	we all recognize
when they were young	we are enclosing
some people consider that	more and more people prefer to
some people argue that	students do not pay
if they want to	can we improve our
we should learn how to	we have enough time to
some people claim that	how can we harness
we often see	some people are in favor of
if we insist	we must try our best to
other people think	we should focus on
others argue that	we don't know how to
we are faced with	we should attach importance to
some people support	we should spare no effort to
we should take some measures to	we can't imagine
people pay more attention to	many people want to
we can not deny	if one wants to
teacher told us	people are addicted to
we should take measures to	people are of the opinion that
college students should learn	we should limit
parents should give their children	more and more people start to
college students face	some people think we should
many people think	we can clearly see
different people have diverse	we are talking about
so that we can make	only in this way can we make
people believe that	we help each other
we can not live without	majority of people believe that
students are addicted to	we should communicate with
we can improve our	we are pleased to
students have their own	we should take care of
if we try our best	we can try our best to
many students think that	we can safely draw the conclusion that
people hold different opinions	a lot of people worry
different people have quite different views on	we should continue to
everyone is eager	students should pay more attention
we can learn a lot from	we have less time to
some people suggest that	we usually require
we must learn how to	students can learn
people are accustomed to	people think that we should
we can not ignore	in this way can we get
people are likely to	we should make the best of
we can't live without	we can not emphasize the importance
everyone should try	people stand on
people would like to	students do not pay attention to
people are willing to	people will try their best to
others hold the opposite	people will be accustomed to
we are not able to	we should be aware of
people hold the idea that	people worry that credit cards may
we can use it to	more students think
everyone has a different	people use the internet
people have realized	students should read
how should we deal with	everybody wants to
in my opinion we should read	students take part in
we are required to	students pay less attention
we should spend more time	people pay attention to
we must admit	children should be allowed to
we should make good use of	people are beginning to
people are fond of	we can not afford
different people have quite different	we can take part in
we are glad to	people said that
(4). Impersonal (general subject) CSSs:
reason is that	reasons lead to
number is #	experience is the best
case in point is that	opinions vary from person to
problem can be solved	attention should be paid
nothing is more important than	reasons contribute to
advantage is that	number of people hold
research shows that	efforts should be made
great changes have taken place	the phenomenon is that
phenomenon has aroused	story is about
disadvantage is that	the fact is that
years have witnessed	view is that
experience is more important	the reason is that they
ability is more important than	factors contribute to
topic is about	from this nothing will turn
reasons can account for	the problem will be solved
(5). Impersonal (specific subject) CSSs:
measures should be taken	life is full of
love is the greatest	appearance is more important than
earth is becoming warmer and	love is a product of
government should take	fatigue is one of the most common
soho lifestyle is becoming	low-carbon lifestyle means
English is an international	how college has affected my life
university is a place	college has affected
the Olympic games will be held	this report is to
reading can broaden our	companies should encourage
online shopping has become	air pollution has become
life is filled with	sales confirmation have been shipped
life will become	success belongs to
nowadays online shopping is becoming	the world will become
no. # is this available in white	social practice is playing
social practice can offer	love is based on
library is a place	practice is more important
college is a place	the world is becoming
olympic games will be held in	online shopping is becoming more and
part-time job can help	social practice may bring
shopping on the internet also has its	school is located in
the dragon boat festival is one of	the earth is becoming
measures must be taken	reading is more important
the government should strengthen	the internet has become
government should take measures to	with the time goes by
reading like other activities brings unique	honesty is the best
technology has brought	government should establish
books can make us	government needs to
courses will start at	reading can enrich
life is different from	internet can provide
frustration is a part of	some waste can be degraded while others
online shopping has made	love makes the world
low-carbon lifestyle has become	saving money is a good
life is bound up with three	my opinion is that
study is the most important	education plays an important role in
measures have been taken	competition is a common
(6). Impersonal (demonstrative pronoun subject) CSSs:
when it comes to	this means that
that is why	it turned out
it also brings	it does not mean
this is because	it will lead to
but it doesn't mean	it can teach us
it depends on	it will affect
that is a question	this is why
that is the reason why	it has aroused
this is the first	it is called
it (will result/results) in	it does harm to

References

1. Pawley A., Syder H. In: Language and Communication. Richard J.C., Schmidt R.W., editors. Longman; New York: 1983. Two puzzles for linguistic theory: nativelike selection and nativelike fluency; pp. 191–225.
2. Granger S., Paquot M. In: Phraseology: an Interdisciplinary Perspective. Granger S., Meunier F., editors. John Benjamins; Amsterdam/Philadelphia: 2008. Disentangling the phraseological web; pp. 27–49.
3. Flowerdew J., Li Y. Language Re-use among Chinese apprentice scientists writing for publication. Appl. Ling. 2007;28(3):440–465.
4. Simpson-Vlach R., Ellis N.C. An academic formulas list: new methods in phraseology research. Applied linguistics. 2010;31(4):487–512.
5. Hammond K. “I need it now!” Developing a formulaic frame phrasebank for a specific writing assessment: student perceptions and recommendations. J. Engl. Acad. Purp. 2017:1–8. Available online 15 December 2017.
6. Li J., Pang Y. Characteristic sentence stems in academic texts: distributions of their patterns and functions. Foreign Language Learning Theory and Practice. 2021;(1):25–36.
7. Alvarez L., Capitelli S., Valdés G. Beyond sentence frames: scaffolding emergent multilingual students' participation in science discourse. TESOL J. 2023;14(3):1–19.
8. Hyland K. As can be seen: lexical bundles and disciplinary variation. Engl. Specif. Purp. 2008;27(1):4–21.
9. Zhang L., Su H. Applying local grammars in EAP teaching. J. Engl. Acad. Purp. 2021;51
10. Gisle A. Phraseology in a cross-linguistic perspective: a diachronic and corpus-based account. Corpus Linguist. Linguistic Theory. 2022;18(2):365–389.
11. Wang Z., Wu X. A corpus-based study on chunk-explicitation in interpreting: a case study of Chinese leaders' speeches under the COVID-19 pandemic. International Journal of English Language Studies. 2023;5(4):45–59.
12. Rodriguez-Mojica C., Rutherford-Quach S. In: Equity in Multilingual Schools and Communities: Celebrating the Contributions of Guadalupe Valdés. Kibler A., Walqui A., Bunch G., Faltis C., editors. Multilingual Matters; Bristol, Blue Ridge Summit: 2024. Curricularizing Language: examining underlying assumptions in classroom practice; pp. 148–159.
13. Altenberg B. In: Phraseology: Theory, Analysis, and Applications. Cowie A.P., editor. Clarendon Press; Oxford: 1998. On the phraseology of spoken English: the evidence of recurrent word-combinations; pp. 101–122.
14. Moon R. Oxford University Press; Oxford & New York: 1998. Fixed Expressions and Idioms in English.
15. Su Q., Gu C., Liu P. Association measures for collocation extraction: automatic evaluation on a large-scale corpus. Int. J. Corpus Linguist. 2024;29(1):59–86.
16. Pecina P. Lexical association measures and collocation extraction. Comput. Humanit. 2010;44(1–2):137–158.
17. Church K., Hanks P. Word association norms, mutual information, and lexicography. Computational linguistics. 1990;16(1):22–29.
18. Dice L.R. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.
19. Smadja F., McKeown K.R., Hatzivassiloglou V. Translating collocations for bilingual lexicons: a statistical approach. Computational linguistics. 1996;22:1–38.
20. Gries S.T. 50-something years of work on collocations: what is or should be next…. Int. J. Corpus Linguist. 2013;18(1):137–166.
21. Dunn J. Multi-unit association measures: moving beyond pairs of words. Int. J. Corpus Linguist. 2018;23(2):183–215.
22. Schmid H.J. From Corpus to Cognition. Mouton de Gruyter; Berlin/New York: 2000. English abstract nouns as conceptual shells.
23. Schmid H.J., Küchenhoff H. Collostructional analysis and other ways of measuring lexicogrammatical attraction: theoretical premises, practical problems and cognitive underpinnings. Cognit. Ling. 2013;24(3):531–577.
24. Ellis N.C., Ferreira–Junior F. Construction learning as a function of frequency, frequency distribution, and function. Mod. Lang. J. 2009;93(3):370–385.
25. Wei N., Li J. A new computing method for extracting contiguous phraseological sequences from academic text corpora. Int. J. Corpus Linguist. 2013;18(4):506–535.
26. Gries S.T. What do (some of) our association measures measure (most)? Association? Journal of Second Language Studies. 2022;5(1):1–33.
27. Gries S.T. John Benjamins; 2024. Frequency, Dispersion, Association, and Keyness: Revising and Tupleizing Corpus-Linguistic Measures.
28. Lai R.K.Y. Why we need asymmetric measures to classify multi-word expressions: the case of Tibetan light verb constructions. Proceedings of the Society for Computation in Linguistics (SCiL) 2024:302–306.
29. Yi W., Man K., Maie R. Investigating first and second language speaker intuitions of phrasal frequency and association strength of multiword sequences. Lang. Learn. 2023;73(1):266–300.
30. Li J., Wei N. A study of functional sentence stems in academic English texts: their extraction method and frequency distributions. Foreign Lang. Teach. Res. 2017;49(2):202–214.
31. da Silva J., Lopes G. In: Proceedings of the 6th Meeting on the Mathematics of Language. Rogers J., Moss L., editors. Kluwer; Dordrecht: 1999. A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora; pp. 369–381.
32. Shimohata S., Sugio T., Nagata J. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics. Cohen P., Wahlster W., editors. Association for Computational Linguistics; Stroudsburg, PA: 1997. Retrieving collocations by co-occurrences and word order constraints; pp. 476–481.
33. Jiang M., Zhang Q., Chen Y., Chang B. Chinese multi-word chunks extraction for computer aided translation. J. Chin. Inf. Process. 2007;21(1):9–16.
34. McEnery A., Xiao R., Tono Y. Routledge; London: 2006. Corpus-based Language Studies: an Advanced Resource Book.
35. Jiang M., Myaeng S., Park S. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics. 2007. Using mutual information to resolve query translation ambiguities and query term weighting; pp. 223–229.
36. Andrews R. Cassell; London, NY: 1995. Teaching and Learning Argument.
37. Wingate U. ‘Argument!’ helping students understand what essay writing is about. J. Engl. Acad. Purp. 2012;11(2):145–154.
38. Groom N. In: Learning to Argue in Higher Education. Mitchell S., Andrews R., editors. Portsmouth: Boynton/Cook Heinemann; 2000. A workable balance: self and source in argumentative writing; pp. 65–73.
39. Millar N. The processing of malformed formulaic language. Appl. Ling. 2011;32(2):129–148.
40. Ellis N.C. In: Cognition and Second Language Instruction. Robinson P., editor. Cambridge University Press; Cambridge: 2001. Memory for language; pp. 33–68.
41. Lee D., Swales J. A corpus-based EAP course for NNS doctoral students: moving from available specialized corpora to self-compiled corpora. Engl. Specif. Purp. 2006;25(1):56–75.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.
