Heliyon. 2024 Aug 30;11(3):e37166. doi: 10.1016/j.heliyon.2024.e37166

Identification of sentence stems characteristic of Chinese learner English writing

Jingjie Li a, Wenjie Hu b
PMCID: PMC11947701  PMID: 40196792

Abstract

Phraseological units in academic English texts have been a central focus in recent corpus linguistic research. This paper describes a special category of clause-level phraseological units, namely, Characteristic Sentence Stems (CSSs), with a view to describing their identifying criteria and their extraction method. CSSs are contiguous lexico-grammatical sequences which contain a subject-predicate structure and which are frame expressions characteristic of academic writing. The extraction method of a CSS consists of six steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range calculation, and overlapping sequence reduction. The significance of occurrence calculation is the crux of this method. It includes the computing of both the internal association and the boundary independence of a CSS, and it tests the occurring significance of the CSS from both the inside and the outside perspectives. Our methods and results suggest that CSSs can be statistically defined and extracted from corpora and can be employed in large-scale studies to more fully account for the phraseological features of non-native English academic writing.

Keywords: Sentence stem, Phraseological unit, Identification measures, Chinese EFL learners

1. Introduction

The sentence stem is commonly known as a phraseological unit that is of clause length or longer, that contains a subject-predicate structure, and that serves as the frame of a sentence, such as the final point is …, another thing is …, it will be shown that …. As clause-level phraseological units, sentence stems are indispensable elements with which our utterances are largely made [1]. They tend to appear in recurrent clusters, representing the “preferred ways of saying things” of a speech community ([2]: 35). Sentence stems reflect the idiomaticity of language use.

Studies in diverse fields have referenced the importance of sentence stems in language use. Spoken fluency research has found that “fluent and idiomatic control of a language rests to a considerable extent on knowledge of a body of sentence stems which are ‘institutionalized’ or ‘lexicalized’” ([1]: 191). These sentence stems carry the authority of regular and accepted use by members of the speech community (ibid: 209). English for academic purposes (EAP) studies have shown that sentence stems can serve as key disciplinary markers, being commonly used in certain disciplines while being less prevalent in others. As noted by Hyland (2008: 5), certain sentence stems help to “shape text meanings and contributing to our sense of distinctiveness in a register”, and can act as crucial indicators for exploring typicality in language use across disciplines.

In the field of EFL teaching, research has noticed a lack of phraseological competence in the use of sentence stems among non-native learners. Flowerdew and Li [3] interviewed nine Chinese L1 writers about how they utilized specialist literature in composing their own articles in English. The participants in their interviews stated that, to prepare for their own writing, they refer to a selection of published articles and keep notes of “good” and “potentially useful” expressions therein and that the expressions include a number of sentence stems, such as In this article/In this paper/Here we report …, This figure reveals that …, and The fit shown in Fig. 1 corresponds to … (ibid: 449–457). Their study suggests that the shortage of sentence stems stored in non-native learners’ minds has restricted their achievement in fluent and independent writing.

Fig. 1. Distribution of border entropy values.

This underscores the importance and the potential value of sentence stems in application, particularly in non-native academic teaching and learning. Supporting this view, Simpson-Vlach and Ellis [4] created a “pedagogically useful” list of formulaic sequences for academic speech and writing; it includes a large number of sentence stems, such as I'm talking about, it is obvious that, and if you look (at) (the) (ibid: 500). Hammond [5] also designed a formulaic frame phrasebank to facilitate first-year students' skill development in writing. A number of example phrases in the phrasebank also turned out to be sentence stems, such as This theory describes …, This stage is followed by …, and An important concept in this stage is … (ibid: 4). It can be seen that some sentence stems are practically useful for non-native learners' writing and, to some extent, they have become essential language resources for their writing (e.g., Refs. [5], [6], [7]).

However, unlike other types of phraseological units such as chunks, lexical bundles, semantic sequences, etc., sentence stems have not been sufficiently studied in previous research of EFL learners' writing. Some research (e.g., Refs. [3], [8], [9], [10], [11]) touched upon sentence stems when examining the collocational features of students' writing, but few studies have focused on and systematically investigated sentence stems in EFL learners’ writing. As far as we know, no measures or tools have yet been developed to extract sentence stems from large English corpora.

In this study, we use the term Characteristic Sentence Stem (CSS) to refer to the sentence stem notable for its phraseological salience and characteristic of EFL learners' writing. Phraseological salience implies that a sentence stem extends beyond being a mere random creation by an individual learner. Instead, it is a sequence characterized by a prominent co-occurrence of lexical and grammatical elements that consistently establish form-meaning pairings. As phraseologically salient sequences, CSSs cohere significantly more than would be expected by chance, effectively “glued together” to reveal a strong and recurring pattern of clause-level phraseological use. In keeping with this perspective, we have conducted the present study to develop a feasible method for automatic extraction of CSSs to facilitate large-scale studies of sentence stems, enrich the contents of learner phraseology, and broaden our research perspective to analyze EFL learners’ writing performance. To delve further into these objectives, this article will address the following research questions:

  • (a) How can we identify and extract CSSs automatically from Chinese learner corpora if we are unable to foresee and predetermine any specific CSS at the start of our extraction?

  • (b) Do the extraction results of CSSs differ when using different word association measures and, if so, in what ways?

  • (c) Using the described method, what CSSs have been extracted from the corpus, and what pedagogical implications could these CSSs have for Chinese EFL teaching?

2. Literature review

2.1. Previous studies of the sentence stem in phraseology

Sentence stems have long received attention in the field of phraseology. They consist of at least a subject and a verb, and may optionally include other thematic elements, such as discourse item, linking word, etc., and a rhematic post-verbal element, such as an object or complement [12]. Sentence stems can manifest as full clause-like constructions like NP be-TENSE sorry to keep-TENSE you waiting (I am sorry to keep you waiting) and “Who (the EXPLET) do-PRES NPi think PROi be-PRES!” (Who the hell do you think you are!). They can also constitute multiple clause constituents, such as my name is and It seems to me (that). In this case, sentence stems act as extended onsets, forming the springboard of utterances leading up to the lexically most variable element (Altenberg 1998: 113).

Phraseological analysis reveals that only a minority of sentence stems are entirely novel, featuring new combinations of words that follow regular syntactic rules. More commonly, sentence stems are recurrent, familiar sequences, retrieved as more or less prefabricated or routinized strings readily available for the production of discourse. Such sentence stems are the essential “building blocks” of an utterance, shaping regular expression patterns and embodying typical ways of meaning-making. Sentence stems of this kind are typically categorized under two terms, each highlighting different characteristics: “lexicalized sentence stems” [1], defined by the fixedness of the sequence, and “textual sentence stems” [2], defined by the functionality of the sequence.

Pawley and Syder [1] first proposed the concept of the lexicalized sentence stem, or regular form-meaning pairing in language. By “lexicalized”, they mean that the “grammatical form and lexical content” of a sentence stem are “wholly or largely fixed” (ibid: 191–192). An example of lexicalized sentence stems is NP be-TENSE sorry to keep-TENSE you waiting, as in I am sorry to keep you waiting. Pawley and Syder (1983: 202) held that fluent communication largely relies on native speakers’ adoption of a clause-chaining style to string lexicalized sentence stems together. Their research was the first to identify the sentence stem as a phraseological unit. However, “lexicalization is a matter of degree” (ibid: 212). Fully lexicalized sentence stems take up only a very small proportion, and most of them are only partially lexicalized or are lexicalized to a low degree ([14]: 37). Moreover, lexicalization is not particularly operationalizable in technical terms. It is hard to measure whether a sentence stem is lexicalized and to what degree it is lexicalized. All of these lower the feasibility of the empirical studies of lexicalized sentence stems.

Further developments in phraseology have highlighted that a large number of recurrent sentence stems are structurally incomplete yet notably functional; they are routinized expressions tied to specific discourse-pragmatic functions. This usage is so pervasive that these sentence stems, to a large extent, establish regular form-meaning pairings with the functions. Expanding on this functional emphasis, Granger and Paquot [2] put forward the concept of the textual sentence stem, defining it as a “routinized sentence fragment that consists of a subject and a verb and serves textual functions” (ibid: 44). Examples include I will discuss … and it will be shown that … (ibid: 44). This concept accentuates the functionality of a sentence stem rather than its lexicalization. Regrettably, Granger and Paquot [2] only offered a concept, with no specific research into it. Furthermore, they defined textual sentence stems as primarily “serving textual functions” (ibid: 44), but we discovered that many sentence stems perform other functions besides textual functions, such as interpersonal functions (e.g., it would be interesting to).

The two types of sentence stems discussed, though representing different perspectives, both emphasize the phraseological salience of a sentence stem. Such sentence stems signify the conventional and typical language use within the speech community. Building on these insights, we try to introduce the notion of CSS, aiming to identify phraseologically salient sentence stems that represent an important clause-level phraseological resource in Chinese EFL learners' writing. In phraseology, the salience of a multi-word sequence can be statistically measured through the significance of the co-occurrence of its constituent words. This statistical approach enhances the feasibility of the automatic extraction of CSSs from corpora, thus facilitating large-scale, corpus-driven studies of CSSs to account for clause-level phraseological features in Chinese EFL learners’ writing.

2.2. Statistics-based extraction of phraseological units

Most current statistical approaches for extracting phraseological units rely on word association measures (e.g., Mutual Information, Dice coefficient, Odds Ratio, Fisher Exact p-value, Log-Likelihood Ratio). These measures calculate the strength of association between the components of a sequence based on their occurrence and co-occurrence in a corpus [15]. The association strength indicates the chance of a candidate to be a phraseological unit [16]. Notably, current word association measures are primarily designed to quantify the association within two-word sequences. However, a CSS is a multi-word sequence, varying in length. When dealing with such multi-word sequences, a common practice involves the transformation of a multi-word sequence into a set of two-word sequences based on its constituent words. Subsequently, current word association measures are applied to calculate the association within each two-word sequence. Specific algorithms are then used to integrate these individual association values, yielding a composite value determining the overall internal association for the multi-word sequence. This research follows the mentioned methodology to calculate the internal association of a sentence stem.
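The pairwise decomposition described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure detailed in Section 4.4: splitting the sequence into adjacent word pairs and combining the pairwise scores by their arithmetic mean are choices made here for concreteness, `pair_score` stands in for any two-word association measure, and the toy scores are invented.

```python
from statistics import mean
from typing import Callable, Sequence


def adjacent_pairs(words: Sequence[str]) -> list[tuple[str, str]]:
    """Turn a multi-word sequence into its adjacent two-word sequences."""
    return list(zip(words, words[1:]))


def sequence_association(
    words: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> float:
    """Combine pairwise scores into one composite value.

    The arithmetic mean is an assumption made for this sketch.
    """
    pairs = adjacent_pairs(words)
    if not pairs:
        raise ValueError("a sequence needs at least two words")
    return mean(pair_score(a, b) for a, b in pairs)


# Invented pairwise scores, used only to exercise the functions.
toy_scores = {("it", "is"): 2.0, ("is", "obvious"): 4.0, ("obvious", "that"): 6.0}
stem = ["it", "is", "obvious", "that"]
composite = sequence_association(stem, lambda a, b: toy_scores[(a, b)])
print(adjacent_pairs(stem))
print(composite)
```

Other integration rules (for example, taking the weakest pair instead of the mean) would slot into `sequence_association` without changing the decomposition step.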

The current word association measures can be broadly categorized into two types: non-directional word association calculation and directional word association calculation. The former posits a mutual association between words “a” and “b”, quantifying the overall degree of their association. Consequently, the association degree between “a” and “b” is computed as a single value. Notably, this calculation does not examine whether the association degree of “a→b” is equivalent to the association degree of “b→a.” The latter, on the other hand, assumes that the association between words “a” and “b” has directionality. The association from “a→b” should be distinguished from the association from “b→a.” Therefore, when calculating the association degree between “a” and “b”, it is imperative to differentiate the direction and obtain two distinct values: the association degree of “a→b” and the association degree of “b→a.”

  • (a) Non-directional word association measures

Church and Hanks [17] were the first to introduce the pointwise mutual information (MI) algorithm to measure the significance of mutual associations between words, enabling the automatic identification of typical collocations. This algorithm remains one of the most widely used word association measures in linguistic studies. The MI is an algorithm for quantifying shared information between two known words “a” and “b” in a text, that is, the mutual interaction force arising from the co-occurrence of “a” and “b”; in other words, it reflects how much the occurrence of “a” tells us about the occurrence of “b”, and vice versa. The measure is commonly denoted as I(a,b). In the pointwise form given in Expression 1, the value is positive when “a” and “b” co-occur more often than chance would predict, zero when they are independent, and negative when they co-occur less often than expected. Here, p(a) and p(b) represent the probabilities of the individual occurrences of words “a” and “b” in the corpus, and p(a,b) represents the probability of the co-occurrence of “a” and “b.”

$$I(a,b)=\log_2\frac{p(a,b)}{p(a)\,p(b)} \tag{1}$$

After the introduction of the MI algorithm to linguistics studies, various statistics-based non-directional word association algorithms emerged. The Dice coefficient (Dice) algorithm, proposed by Dice [18], was introduced to identify collocations by Smadja et al. [19], and the formula is expressed as follows (Expression 2). Here, f(a) represents the frequency of occurrence of word “a” in the corpus, f(b) is the frequency of occurrence of word “b” in the corpus, and f(a,b) represents the co-occurrence frequency of “a” and “b.”

$$\mathrm{Dice}(a,b)=\frac{2\cdot f(a,b)}{f(a)+f(b)} \tag{2}$$
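As a quick check on Expressions (1) and (2), the two measures can be computed directly from corpus counts. The sketch below assumes that probabilities are estimated as relative frequencies over `n` positions in the corpus; that estimation choice, the function names, and the toy counts are illustrative assumptions rather than details taken from the study.

```python
import math


def pmi(f_a: int, f_b: int, f_ab: int, n: int) -> float:
    """Expression (1), with each probability estimated as frequency / n."""
    p_a, p_b, p_ab = f_a / n, f_b / n, f_ab / n
    return math.log2(p_ab / (p_a * p_b))


def dice(f_a: int, f_b: int, f_ab: int) -> float:
    """Expression (2): twice the co-occurrence frequency over the summed frequencies."""
    return 2 * f_ab / (f_a + f_b)


# Invented counts: "a" and "b" each occur 10 times and co-occur 5 times
# in a corpus of 1,000 positions.
mi_value = pmi(f_a=10, f_b=10, f_ab=5, n=1000)
dice_value = dice(f_a=10, f_b=10, f_ab=5)
print(round(mi_value, 3), dice_value)
```

Note that `pmi` needs the corpus size `n` while `dice` does not, which is one practical difference between the two measures.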

As noted above, the formulas for MI and Dice are relatively simple. The variables involved in Expressions (1) and (2) are only associated with the occurrence (rather than non-occurrence) of words. These variables include the probabilities (or frequencies) of the individual occurrences of the word “a” or “b” (i.e., p(a), p(b), f(a), f(b)), as well as the co-occurrence probability (or frequency) of “a” and “b” (i.e., p(a,b), f(a,b)). However, many other word association algorithms involve the computation of variables associated with both the occurrence of words (referred to as “occurrence variables”) and the non-occurrence of words (referred to as “non-occurrence variables”), and often require the consideration of multiple combinations of occurrence variables and non-occurrence variables.

For ease of exposition in the following text, we employ a contingency or two-by-two table (Gries, 2013; [21]), which represents the frequency distribution resulting from the cross-classification of two or more variables, to illustrate the variable combinations typically involved in word association algorithms (as shown in Table 1). The cells (cell1, cell2, cell3, cell4) in Table 1 respectively denote different variable combinations. Cell1 corresponds to f(a, b), representing the frequency of co-occurrence of words “a” and “b”; cell2 corresponds to f(a, ¬b), representing the frequency of occurrence of “a” while “b” does not occur; cell3 represents f(¬a, b), indicating the frequency of occurrence of “b” while “a” does not occur; and cell4 represents f(¬a, ¬b), signifying the frequency of non-occurrence of both “a” and “b.” Of the four cells, cell1 corresponds to an exclusive “occurrence variable” combination and cell4 to an exclusive “non-occurrence variable” combination, while cell2 and cell3 represent mixed combinations involving both “occurrence” and “non-occurrence” variables.

Table 1.

Contingency table for the association calculation between words “a” and “b”.

| | Frequency of occurrence of a: f(a) | Frequency of non-occurrence of a: f(¬a) |
|---|---|---|
| Frequency of occurrence of b: f(b) | cell1: f(a, b) | cell3: f(¬a, b) |
| Frequency of non-occurrence of b: f(¬b) | cell2: f(a, ¬b) | cell4: f(¬a, ¬b) |

The Odds Ratio is a relatively simple statistical measure that considers both “occurrence” and “non-occurrence” variables; it is commonly used to determine the degree of association between the occurrence or non-occurrence of “a” and “b.” The formula is expressed as follows (Expression 3). It takes the ratio of the frequency of occurrence of “a” given the occurrence of “b” (cell1) to the frequency of occurrence of “a” given the non-occurrence of “b” (cell2), and divides it by the ratio of the frequency of occurrence of “b” given the non-occurrence of “a” (cell3) to the frequency of non-occurrence of “b” given the non-occurrence of “a” (cell4). If the final calculated Odds Ratio value is greater than 1, it indicates that “a” and “b” are positively associated, meaning that the presence of one word increases the probability of the other word's occurrence.

$$\mathrm{OddsRatio}(a,b)=\frac{\mathrm{cell1}}{\mathrm{cell2}}:\frac{\mathrm{cell3}}{\mathrm{cell4}} \tag{3}$$

The Fisher Exact p-value (p) and the Log-Likelihood Ratio (LLR) are also widely used measures of word associations involving multiple sets of “occurrence” and “non-occurrence” variables. The formulas for the two algorithms (Expression 4 and Expression 5) are more complex and are expressed as follows. Here, cell1, cell2, cell3, and cell4 have the same meanings as described above, and n represents the sum of cell1, cell2, cell3, and cell4. In the context of the Fisher Exact p-value, a lower p-value indicates a stronger association between words. Typically, when the p-value is less than 5 %, it signifies a significant association between words; when it is less than 1 %, the association is considered highly significant. In contrast, there is currently no consensus on the optimal threshold for LLR to determine the significance of word associations.

Fisher Exact p-value Formula:

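$$p(a,b)=\frac{(\mathrm{cell1}+\mathrm{cell3})!\,(\mathrm{cell2}+\mathrm{cell4})!\,(\mathrm{cell1}+\mathrm{cell2})!\,(\mathrm{cell3}+\mathrm{cell4})!}{n!\,\mathrm{cell1}!\,\mathrm{cell2}!\,\mathrm{cell3}!\,\mathrm{cell4}!} \tag{4}$$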

Log-Likelihood Ratio (LLR) Formula:

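$$\begin{aligned}
2\log \mathrm{LLR}(a,b)={}&\mathrm{cell1}\cdot\log\left(\frac{\mathrm{cell1}\cdot n}{(\mathrm{cell1}+\mathrm{cell2})\cdot(\mathrm{cell1}+\mathrm{cell3})}\right)
+\mathrm{cell2}\cdot\log\left(\frac{\mathrm{cell2}\cdot n}{(\mathrm{cell2}+\mathrm{cell1})\cdot(\mathrm{cell2}+\mathrm{cell4})}\right)\\
&+\mathrm{cell3}\cdot\log\left(\frac{\mathrm{cell3}\cdot n}{(\mathrm{cell3}+\mathrm{cell1})\cdot(\mathrm{cell3}+\mathrm{cell4})}\right)
+\mathrm{cell4}\cdot\log\left(\frac{\mathrm{cell4}\cdot n}{(\mathrm{cell4}+\mathrm{cell2})\cdot(\mathrm{cell4}+\mathrm{cell3})}\right)
\end{aligned} \tag{5}$$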
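Both expressions can be evaluated from the four cells alone. The sketch below is a minimal illustration under stated assumptions: it computes only the quantity written in Expression (4), whereas a Fisher exact test as usually run sums such probabilities over all tables at least as extreme; it evaluates the right-hand side of Expression (5) with natural logarithms, since the base is not specified in the text; and it treats empty cells as contributing zero. Function names are hypothetical.

```python
import math


def log_factorial(k: int) -> float:
    """log(k!) via the log-gamma function, which avoids overflow for large counts."""
    return math.lgamma(k + 1)


def fisher_point_probability(c1: int, c2: int, c3: int, c4: int) -> float:
    """Expression (4): probability of this exact table given its margins."""
    n = c1 + c2 + c3 + c4
    log_p = (
        log_factorial(c1 + c3) + log_factorial(c2 + c4)
        + log_factorial(c1 + c2) + log_factorial(c3 + c4)
        - log_factorial(n)
        - log_factorial(c1) - log_factorial(c2)
        - log_factorial(c3) - log_factorial(c4)
    )
    return math.exp(log_p)


def llr_sum(c1: int, c2: int, c3: int, c4: int) -> float:
    """Right-hand side of Expression (5); cells equal to 0 are skipped (assumption)."""
    n = c1 + c2 + c3 + c4
    terms = [
        (c1, (c1 + c2) * (c1 + c3)),
        (c2, (c2 + c1) * (c2 + c4)),
        (c3, (c3 + c1) * (c3 + c4)),
        (c4, (c4 + c2) * (c4 + c3)),
    ]
    return sum(obs * math.log(obs * n / margins) for obs, margins in terms if obs > 0)


print(fisher_point_probability(2, 0, 0, 2), llr_sum(2, 0, 0, 2))
```

Working in log space matters in practice: with corpus-sized counts, the factorials in Expression (4) are far too large to compute directly.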

In comparison to measures considering only “occurrence” variables (e.g., MI, Dice), measures including both “occurrence” and “non-occurrence” variables (e.g., Odds Ratio, Fisher Exact p-value, LLR) provide a more comprehensive perspective on the assessment of association between words “a” and “b.” Consequently, the resulting statistical values seem to be more comprehensive and accurate. However, in practical applications, the formulas for measures including both “occurrence” and “non-occurrence” variables are conspicuously intricate, demanding additional computational resources. Additionally, these measures are sensitive to factors such as corpus size, word classes, data distribution, and parameter choices.

Conversely, measures considering only “occurrence” variables demonstrate general computational simplicity and efficiency, rendering them suitable for large-scale text data. Thus, whether measures including both “occurrence” and “non-occurrence” variables perform significantly better than those considering only “occurrence” variables remains an unsettled matter. It is widely acknowledged that the choice of measures hinges upon considerations of data size, computational efficiency, dataset characteristics, and research objectives.

  • (b) Directional word association measures

Non-directional word association measures, such as the aforementioned MI, Dice, Odds Ratio, etc., quantify the overall strength of mutual attraction between words “a” and “b.” Those measures do not distinguish the directionality of attraction, i.e., whether the attraction from “a” to “b” differs from that of “b” to “a.” In other words, non-directional word association measures compute the overall probability of co-occurrence between “a” and “b” in the form of (a, b) without considering which word (“a” or “b”) serves as the predominant factor in their co-occurrence. To further investigate the directionality issue of attractions between “a” and “b,” it is necessary to employ directional word association measures, among which the series of pairwise measures such as Attraction and Reliance, ΔP Attraction and ΔP Reliance, are the most widely used.

The pairwise measures of Attraction and Reliance were proposed by Schmid ([22]: 54–55). The former (Attraction) signifies the strength of attraction from “a” to “b” (i.e., the association strength of “a→b”), calculated by dividing the frequency of co-occurrence of “a” and “b” by the frequency of occurrence of “b” (Expression 6a in Table 2 below). The latter (Reliance) represents the degree to which “a” is attracted by “b”, or in other words, the strength of attraction from “b” to “a” (i.e., the association strength of “b→a”), calculated by dividing the frequency of co-occurrence of “a” and “b” by the frequency of occurrence of “a” (Expression 6b). To be able to render the scores as percentages, the dividend is multiplied by 100 in both divisions.

Table 2.

Calculating Attraction and Reliance scores.

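$$\mathrm{Attraction}=\frac{\mathrm{cell1}\times 100}{\mathrm{cell1}+\mathrm{cell3}} \tag{6a}$$

$$\mathrm{Reliance}=\frac{\mathrm{cell1}\times 100}{\mathrm{cell1}+\mathrm{cell2}} \tag{6b}$$

$$\Delta P\,\mathrm{Attraction}=\frac{\mathrm{cell1}}{\mathrm{cell1}+\mathrm{cell3}}-\frac{\mathrm{cell2}}{\mathrm{cell2}+\mathrm{cell4}} \tag{7a}$$

$$\Delta P\,\mathrm{Reliance}=\frac{\mathrm{cell1}}{\mathrm{cell1}+\mathrm{cell2}}-\frac{\mathrm{cell3}}{\mathrm{cell3}+\mathrm{cell4}} \tag{7b}$$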

Note: cell1+cell3 (in Expressions 6a and 7a) yields the frequency of occurrence of “b”, and cell1+cell2 (in Expression 6b and 7b) yields the frequency of occurrence of “a.”
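The directional measures in Table 2 can be sketched as small functions over the cells of Table 1. The ΔP variants are written here as a difference between two conditional proportions, which is how ΔP is usually defined; the function names and the toy counts are assumptions made for illustration.

```python
def attraction(cell1: int, cell3: int) -> float:
    """Expression (6a): share of the occurrences of "b" that co-occur with "a", in percent."""
    return cell1 * 100 / (cell1 + cell3)


def reliance(cell1: int, cell2: int) -> float:
    """Expression (6b): share of the occurrences of "a" that co-occur with "b", in percent."""
    return cell1 * 100 / (cell1 + cell2)


def delta_p_attraction(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Expression (7a): proportion of "a" where "b" occurs minus proportion where it does not."""
    return cell1 / (cell1 + cell3) - cell2 / (cell2 + cell4)


def delta_p_reliance(cell1: int, cell2: int, cell3: int, cell4: int) -> float:
    """Expression (7b): proportion of "b" where "a" occurs minus proportion where it does not."""
    return cell1 / (cell1 + cell2) - cell3 / (cell3 + cell4)


# Invented cells: f(a, b) = 10, f(a, ¬b) = 20, f(¬a, b) = 10, f(¬a, ¬b) = 970.
toy = (10, 20, 10, 970)
print(attraction(toy[0], toy[2]), reliance(toy[0], toy[1]))
print(delta_p_attraction(*toy), delta_p_reliance(*toy))
```

With these toy counts the two directions diverge (Attraction is 50 %, Reliance roughly 33 %), which is exactly the asymmetry that a single non-directional score cannot show.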

Schmid and Küchenhoff [23] posited that the pairwise measures of Attraction and Reliance lack consideration of the variable cell4 (see Table 2), thus failing to establish a connection between the frequency distribution of words “a” and/or “b” and the frequency of non-occurrence of both “a” and “b.” Therefore, they refined the measures of Attraction and Reliance using the Delta P algorithm1 (ΔP, [24]) and introduced the pairwise measures of ΔP Attraction (Expression 7a) and ΔP Reliance (Expression 7b), with the formulas provided in Table 2.

As discussed above, both non-directional and directional word association measures have distinct characteristics. Non-directional measures are better suited for calculating the overall strength of association between words, while directional measures can further explore the directionality issues of word associations. Both non-directional and directional measures have been extensively used in phraseological unit extraction (e.g., Refs. [16], [25], [26], [27], [28]). Different association measures may excel at identifying varying types of phraseological units, but their overall performance and effectiveness have been repeatedly validated across different data types, making them widely accepted in phraseological studies (e.g., Refs. [15], [23], [29]). As Su et al. (2024: 62) noted, most current approaches are statistically based on word association measures.

In this study, we will use the aforementioned association measures to individually calculate the co-occurring salience of the constituent words in a sentence stem (internal association calculation) and then compare the CSSs extracted using these measures. Specifically, in Section 4.4, we will illustrate the process of internal association calculation using MI as an example. In Section 5.2, we will compare and evaluate the results derived from the combined use of MI and border entropy with those obtained through the use of alternative association measures. In addition, as mentioned earlier, most association measures are designed for evaluating the association between two words; it is therefore necessary to adjust existing two-word association algorithms to measure longer multi-word sequences, with a specific focus on sentence stems. The adjusted approach will be further detailed in Section 4.4.

3. The concept of CSS and the corpus

3.1. Identifying criteria of CSS

In this research, a CSS is provisionally defined as a recurrent contiguous lexico-grammatical sequence which contains a subject-predicate structure and which has the phraseological salience characteristic of Chinese EFL learners’ writing. This definition indicates two identifying criteria.

First, a CSS must have a subject-predicate structure. Three points are worth noting regarding this criterion. (a) A CSS does not necessarily contain all of the subject and predicate elements. Following Granger and Paquot's [2] identifying criteria, our identification requires that a CSS have at least a subject and a predicate verb, such as years have witnessed and we must admit. (b) A CSS may contain additional syntactic functional elements such as objects (e.g., we should spare no effort to) and adverbials (e.g., from the picture we can see); in fact, this is mostly the case. In sum, a CSS takes the “subject + predicate verb(s)” as the core components but varies greatly in length and structure. It can be either a clause constituent such as we can draw the conclusion that … and it is convenient for …, or a full clause (refer to Ref. [1]) such as practice makes perfect and reasons are as follows. (c) Subject omission structures (e.g., as noted above) and predicate omission structures (e.g., if possible) require individual treatment. As-introduced sequences, whether with a subject (e.g., as we have seen) or without a subject (e.g., as can be seen), are both counted as sentence stems in our identification.

Second, not all of the sequences with a subject-predicate structure are CSSs. A CSS should be phraseologically salient, marked by a higher-than-expected co-occurrence of its constituent words and demonstrating a strong and recurring pattern of clause-level phraseological use in Chinese learners' writing. This standard emphasizes the salience or the typicality of a CSS and filters out such sentence stems as she is also, we have done. In our study, the salience of each CSS will be measured in terms of two statistical parameters: internal association (i.e., the co-occurring salience of the constituent words of a CSS) and boundary independence (i.e., the clarity of the left and right borders of a CSS). Please refer to Section 4.4 for detailed discussion. Only the sentence stems that reach high statistical standards will be considered characteristic of Chinese EFL learners’ writing and will become candidate CSSs.

3.2. The corpus

We based our study on the TECCL corpus; it contains approximately 10,000 English essays written by various groups of Chinese EFL learners, among which university students constitute the overwhelming majority of the writers. The corpus consists of texts written in class, in testing, and after class; the writing samples included were produced between 2011 and 2015. To the best of our knowledge, the TECCL corpus, totaling 1,817,335 word tokens, is one of the largest Chinese learner corpora publicly available, and it has been widely used for studies of Chinese EFL learners' English. The corpus figures prominently for its representativeness in two respects. (a) The geographical spread of the writers in the corpus is by far the widest of all Chinese EFL learners’ English corpora. (b) The proportion of the essays written by top-notch university students and by non-top-notch university students corresponds well to the actual distribution of top-notch universities in China.

In this study, we selected the essays written by university undergraduate students from TECCL and formed a 1.3-million-word corpus, TECCL-Sample, to use as the data source for illustrating the extraction procedures of CSSs and for analyzing the idiomatic usages of sentence stems in Chinese EFL learners’ writing. The TECCL-Sample contains 6886 essays with the size of 1,387,716 tokens (running words), 27,851 types (distinct words), and 91,747 sentences. Its standardized TTR amounts to 41.58 %, and its mean sentence length is 15.13 (in words).

4. CSS extraction method

This section will take TECCL-Sample as an example and will enlarge on our CSS extraction method that consists of six main steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range setting, and overlapping sequence reduction. POS tagging is used to label each word in the corpus with its Part of Speech (POS); n-gram segmentation is used to segment each running text in the corpus into different groups of linear sequences; structure identification is used to identify sentence stems out of all the linear sequences; significance of occurrence calculation is used to measure the salience of each sentence stem in learners’ writing and its possibility to become a CSS; text range setting is used to examine the inter-textual frequency distribution of each sentence stem; and overlapping sequence reduction is used to ensure that each of the extracted CSSs does not substantially overlap the others. In what follows, we will discuss these six steps in more detail.

4.1. POS tagging

We POS tag all the texts in TECCL-Sample with the tagging software CLAWS7 and its C7 tagging set. It should be pointed out that the selection of tagging software is rather flexible. In addition to CLAWS7, other common tagging software, such as NLTK tagging packages,2 can also be used. We choose to use CLAWS7 based on two considerations. First, CLAWS7 provides high tagging accuracy. According to the official website, CLAWS has consistently achieved 96–97 % accuracy. Second, the C7 tagging set has very detailed tag categories, which are crucial for the structure identification in Section 4.3.

4.2. N-gram segmentation

All the tagged texts in TECCL-Sample are segmented into linear sequences, whose lengths vary from two to eight words. That is, every single text in TECCL-Sample is repeatedly segmented into different groups of n-grams, such as 2-grams, 3-grams, and 4-grams. Then we process all the n-grams by (a) deleting sequences that span two paragraphs or sentences, or that include punctuation marks such as semicolons, colons, unpaired brackets, and unpaired quotation marks, and by (b) calculating the frequency of each n-gram. It is worth noting that, at this stage, we do not take the POS tags into account, and thus, the obtained n-grams may include both sentence stems and sequences with other structures.
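The segmentation and filtering described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the corpus has already been split into sentences and tokens, it enforces the sentence and paragraph boundary rule simply by generating n-grams within one sentence at a time, and it filters only semicolons and colons (the checks for unpaired brackets and quotation marks are omitted). All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence

BLOCKED_TOKENS = frozenset({";", ":"})  # subset of the punctuation filter


def ngrams_in_sentence(
    tokens: Sequence[str], n_min: int = 2, n_max: int = 8
) -> Iterator[tuple[str, ...]]:
    """Yield every contiguous sequence of n_min to n_max tokens within one sentence."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


def count_ngrams(sentences: Iterable[Sequence[str]]) -> Counter:
    """Count n-grams across sentences, skipping those that contain blocked punctuation."""
    counts: Counter = Counter()
    for tokens in sentences:
        for gram in ngrams_in_sentence(tokens):
            if BLOCKED_TOKENS.isdisjoint(gram):
                counts[gram] += 1
    return counts


# Invented mini-corpus of tokenized sentences.
sample = [
    ["we", "can", "see", "that"],
    ["we", "can", "see"],
    ["first", ";", "we", "can"],
]
frequencies = count_ngrams(sample)
print(frequencies[("we", "can")], frequencies[("we", "can", "see")])
```

Because n-grams are generated per sentence, no sequence can straddle a sentence boundary, so no separate deletion pass is needed for that rule.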

4.3. Structure identification

Based on POS tags, we search for sequences with a subject-predicate structure in the obtained n-grams. This step, though seemingly simple, is quite complex in practice because many kinds of parts of speech can serve as subjects, and sometimes there is no subject. After repeated discussions and examinations, we identify five broad categories and, altogether, 36 subcategories of POS tags that can function as subjects: nouns (22 types), pronouns (9 types), determiners3 (3 types), as-introduced structure (1 type), and existential-there (1 type). We also identify four broad categories (16 subcategories) of verbs that can function as predicates: i.e., VB (VBDR, VBDZ, VBI, VBM, VBR, VBZ), VD (VD0, VDD, VDZ), VH (VH0, VHD, VHZ), VV (VV0, VVD, VVI, VVZ) in the C7 tagging set.
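As a rough illustration of this step, the sketch below checks whether a tagged sequence contains a subject tag followed later by a predicate verb tag. The 16 predicate tags are those listed above; `SUBJECT_TAGS` is only a small, hypothetical subset of the 36 subject subcategories (which are not enumerated here), and the "subject somewhere before predicate" rule is an assumption rather than the authors' actual matching procedure.

```python
# Predicate verb tags named in the text (C7 tagging set).
PREDICATE_TAGS = {
    "VBDR", "VBDZ", "VBI", "VBM", "VBR", "VBZ",
    "VD0", "VDD", "VDZ",
    "VH0", "VHD", "VHZ",
    "VV0", "VVD", "VVI", "VVZ",
}

# Hypothetical, incomplete subset of subject tags, for illustration only.
SUBJECT_TAGS = {"NN1", "NN2", "PPH1", "PPIS1", "PPIS2", "PPHS1", "PPHS2", "DD1", "EX"}

def has_subject_predicate(tagged):
    """Return True if a subject tag is followed (not necessarily directly) by a predicate tag."""
    seen_subject = False
    for _word, tag in tagged:
        if seen_subject and tag in PREDICATE_TAGS:
            return True
        if tag in SUBJECT_TAGS:
            seen_subject = True
    return False

stem = [("it", "PPH1"), ("is", "VBZ"), ("convenient", "JJ"), ("for", "IF")]
fragment = [("is", "VBZ"), ("convenient", "JJ"), ("for", "IF")]
print(has_subject_predicate(stem), has_subject_predicate(fragment))
```

Under these assumptions, it is convenient for is accepted while is convenient for, which lacks a subject, is rejected.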

4.4. Significance of occurrence calculation

The most challenging part of the extraction process is to rule out non-salient sequences using statistical measures. This research employs two frequency-based algorithms to calculate the salience of each sequence in order to limit the potential interference from absolute frequency. The two algorithms measure the internal association and the boundary independence of each sequence, respectively: the former focuses on the inside of the sequence and looks at the attraction among its constituent words, while the latter takes the sequence as a whole and measures the variability of its outside neighboring words. Only when the internal association and the boundary independence of the sequence are both larger than their respective thresholds is the sequence identified as a candidate CSS. By this means, each sequence is examined for significance from both the inside and the outside perspective. We draw this idea from Jiang et al. (2007a: 9–16), who used the hybrid method to extract Chinese chunks. We modify their integrated algorithm and propose a normalization algorithm for overlapping sequence reduction, in order to optimize the extraction of English sentence stems.

4.4.1. Calculation of internal association

Many association measures have been proposed to date, such as MI, Log Likelihood Ratio, Dice, Fisher Exact p-value, and Odds Ratio, all of which are confined to measuring the association or attraction between two words (2-grams). In order to measure that of a longer n-gram (where n > 2), we use pseudo-bigram transformation and the probability-weighted average algorithm to calculate the internal association of CSSs, as exemplified by MI [25,30]. The principles are:

  • (a)

    We adopt the idea of pseudo-bigram transformation to turn every single n-gram (n ≥ 2) into n-1 pseudo-bigrams [31]. The concept of pseudo-bigram transformation assumes that every n-gram (w1, w2, w3, …, wn) has n-1 dispersion points, i.e., the spaces located between the positions of the constituent words of the n-gram. Each dispersion point transforms the n-gram into a pseudo-bigram consisting of two parts: a left part (w1…wi) and a right part (w(i+1)…wn) (1 ≤ i ≤ n-1). Therefore, n-1 dispersion points can transform the n-gram into n-1 pseudo-bigrams. For example, the tri-gram “practice makes perfect” can be transformed into two pseudo-bigrams: “practice ∗ makes perfect” and “practice makes ∗ perfect”. Once n-grams have undergone this transformation, current association measures can be applied to calculate the internal associations for these pseudo-bigrams.

  • (b)

    We calculate the expected value of joint probability for each pseudo-bigram, and we get n-1 expected values of joint probability for the n-gram. We then compute the probability-weighted average of all the expected values and obtain the weighted expected value of joint probability for the whole n-gram, denoted by WAP. Finally, we take the logarithm of the empirical joint probability divided by WAP, which yields the MI for the whole n-gram as its internal association value. The formula is shown below (Expression 8), where Sn represents a multi-word sequence consisting of n words, Sn = {w1, w2, w3, …, wn}, and i represents the dispersion point which transforms the sequence Sn into a pseudo-bigram and is “located” between the left and the right part of the pseudo-bigram: w1, w2, …, wi and w(i+1), …, wn (1 ≤ i ≤ n-1, n ≥ 2).

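$$
\begin{cases}
MI(S_n) = \log_2 \dfrac{P(S_n)}{WAP} \\[2ex]
WAP = \displaystyle\sum_{i=1}^{n-1} P\big[P(w_1,\ldots,w_i)\cdot P(w_{i+1},\ldots,w_n)\big]\,\big[P(w_1,\ldots,w_i)\cdot P(w_{i+1},\ldots,w_n)\big]
\end{cases}
\tag{8}
$$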
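The pseudo-bigram transformation described in principle (a) can be sketched in a few lines of Python. The function name `pseudo_bigrams` is hypothetical; each returned pair corresponds to one dispersion point, with the left part before it and the right part after it.

```python
def pseudo_bigrams(ngram):
    """Split an n-gram at each of its n-1 dispersion points into (left, right) parts."""
    words = tuple(ngram)
    if len(words) < 2:
        raise ValueError("an n-gram needs at least two words")
    return [(words[:i], words[i:]) for i in range(1, len(words))]

splits = pseudo_bigrams(("practice", "makes", "perfect"))
for left, right in splits:
    print(" ".join(left), "*", " ".join(right))
```

For the tri-gram used in the text, this yields the two pseudo-bigrams "practice ∗ makes perfect" and "practice makes ∗ perfect".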

Once again using the tri-gram “practice makes perfect” as an example, the data related to MI calculation is as follows:

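  • (i)

    $P_{\text{practice}} = \frac{893}{1361171} = 6.5605 \times 10^{-4}$

  • (ii)

    $P_{\text{perfect}} = \frac{238}{1361171} = 1.7485 \times 10^{-4}$

  • (iii)

    $P_{\text{practice makes}} = \frac{17}{1292954} = 1.3148 \times 10^{-5}$

  • (iv)

    $P_{\text{makes perfect}} = \frac{15}{1292954} = 1.1601 \times 10^{-5}$

  • (v)

    $P_{\text{practice makes perfect}} = \frac{11}{1225132} = 8.9786 \times 10^{-6}$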

To calculate the WAP for the trigram, we will have the expected joint probability E1 for the pseudo-bigram “practice ∗ makes perfect” and E2 for the pseudo-bigram “practice makes ∗ perfect” respectively as follows:

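  • (vi)

    $E_1 = E(\text{practice} \ast \text{makes perfect}) = P_{\text{practice}} \times P_{\text{makes perfect}} = 6.5605 \times 10^{-4} \times 1.1601 \times 10^{-5} \approx 7.6108 \times 10^{-9}$

  • (vii)

    $E_2 = E(\text{practice makes} \ast \text{perfect}) = P_{\text{practice makes}} \times P_{\text{perfect}} = 1.3148 \times 10^{-5} \times 1.7485 \times 10^{-4} \approx 2.2989 \times 10^{-9}$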

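Applying the formula of probability-weighted average (Expression 8), we have:

  • (viii)

    $WAP(\text{practice makes perfect}) = \sum_{i=1}^{2} P(E_i) \cdot E_i = P(E_1) \cdot E_1 + P(E_2) \cdot E_2 = \frac{7.6108 \times 10^{-9}}{7.6108 \times 10^{-9} + 2.2989 \times 10^{-9}} \times 7.6108 \times 10^{-9} + \frac{2.2989 \times 10^{-9}}{7.6108 \times 10^{-9} + 2.2989 \times 10^{-9}} \times 2.2989 \times 10^{-9} \approx 6.3785 \times 10^{-9}$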

According to the refined MI algorithm (Expression 8), the internal association of the trigram “practice makes perfect” is finally calculated as follows:

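  • (ix)

    $MI(\text{practice makes perfect}) = \log_2\left(\frac{P_{\text{practice makes perfect}}}{WAP}\right) = \log_2\left(\frac{8.9786 \times 10^{-6}}{6.3785 \times 10^{-9}}\right) \approx 10.4590$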
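The worked example above can be reproduced with a short Python sketch. It assumes, as the calculation in (viii) indicates, that the weight of each expected value is its share of the sum of all expected values; the function name `weighted_mi` is hypothetical, and the counts and totals are the ones given in (i) to (v).

```python
import math

def weighted_mi(p_sequence, split_probs):
    """MI of an n-gram from its observed probability and the (left, right)
    probabilities of each pseudo-bigram. Returns (mi, wap)."""
    expected = [p_left * p_right for p_left, p_right in split_probs]
    total = sum(expected)
    # Probability-weighted average: each expected value weighted by its share of the total.
    wap = sum((e / total) * e for e in expected)
    return math.log2(p_sequence / wap), wap

# Probabilities from items (i)-(v).
p_practice = 893 / 1361171
p_perfect = 238 / 1361171
p_practice_makes = 17 / 1292954
p_makes_perfect = 15 / 1292954
p_trigram = 11 / 1225132

mi, wap = weighted_mi(
    p_trigram,
    [(p_practice, p_makes_perfect),   # practice * makes perfect
     (p_practice_makes, p_perfect)],  # practice makes * perfect
)
print(round(mi, 3), wap)
```

Because the sketch uses the unrounded fractions, the result may differ from the hand calculation in the last decimal places, but it agrees with the reported value of about 10.459.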

4.4.2. Calculation of boundary independence

Boundary independence is another approach used to calculate the significance of occurrence of a multi-word sequence ([32]: 476–481; [33]: 9–16). It measures the clarity of the outside borders (i.e., the left and right sides) of the sequence by considering the uncertainty of its adjacent collocates. Its logic is that the borders of a sequence are clearer if its collocates are more varied and more evenly distributed; in other words, the more different words a sequence can collocate with, the more independent the borders of that sequence. Here, we employ the concept of “border entropy” to measure the degree of boundary independence of a sentence stem. If we consider the adjacent collocates of a sequence as random distributions, the larger the value of border entropy, the more uncertain the sequence's collocates will be; hence, the more independent the sequence is, and the more likely it is to be a CSS. The procedures are described as follows.

  • (a)

    For every sentence stem, the respective sets of its left and right adjacent collocates are automatically generated. Each set contains information about collocates, such as how many different words the sequence can collocate with, what those words are, and how many times each word co-occurs with the sequence. In order to improve processing efficiency, we use nested Python dictionaries to store the relevant data of each sequence's left and right adjacent words. It is worth noting that two possible situations can lead to an increase in the border entropy value. One is the normal case in which the left and right adjacent words of a sequence are of various types and are evenly distributed, which indicates an unstable collocation between the sequence and its adjacent words. The other is that punctuation marks, such as colons, semicolons, periods, parentheses, and quotation marks, appear right before or after a sequence, which also indicates a very small possibility that the sequence will cross those punctuation boundaries to form a CSS with other words. For this reason, we create a special category of “empty border” to annotate the punctuation marks which occur on either the left or the right side of a sequence. In order to maximize the border entropy of a sequence with empty borders, we assign a different key name to each occurrence of an empty border, such as { ‘none_1’: 1, ‘none_2’: 1, ‘none_3’: 1, … }, even though the same punctuation mark may occur repeatedly on the sequence's borders.

  • (b)

    With reference to the statistics generated in step (a), we calculate the respective left and right border entropy of the sequence with Expression 9 below. Let S be the candidate sequence. A represents the set of words that occur to the left side of S, a is an element in the set A, and P(aS|S) refers to the probability of co-occurrence of word a and sequence S under the condition that S has occurred. B refers to the set of words that occur to the right side of S, b is an element in B, and P(Sb|S) means the conditional probability that word b occurs with S, given S. We also derive an algorithm to integrate the left border entropy (H(S)left) and the right border entropy (H(S)right) of the sequence S in order to determine the overall value of boundary independence for S (H(S)). The integrated algorithm is shown in Expression 10 below.

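$$
\begin{cases}
H(S)_{left} = -\displaystyle\sum_{a \in A} P(aS \mid S) \cdot \log_2 P(aS \mid S) \\[2ex]
H(S)_{right} = -\displaystyle\sum_{b \in B} P(Sb \mid S) \cdot \log_2 P(Sb \mid S)
\end{cases}
\tag{9}
$$

$$
H(S) = \sqrt{H(S)_{left} \cdot H(S)_{right}} \tag{10}
$$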

Here, we take the sequence there is a widespread concern over, which has occurred 5 times in TECCL-Sample, as an example to illustrate the calculation of boundary independence. Table 3 shows the concordances of there is a widespread concern over, with its left and right adjacent words or its punctuation marks highlighted in bold and shade.

Table 3.

Concordances of the sentence stem there is a widespread concern over.

1. There is a widespread concern over the issue that whether you prefer to study
2. There is a widespread concern over the topic about the formal examination, it
3. There is a widespread concern over whether famous people shoulder more res
4. Currently, there is a widespread concern over hunting wild animals for meals. A recent s
5. s by reading literature. There is a widespread concern over the issue the importance of Reading Litera

As shown in Table 3, the left side of there is a widespread concern over consists of a comma (1 time), a period (1 time), and “null” (the first sentence of an essay, 3 times); the right adjacent words and their respective frequencies are: the (3 times), whether (1 time), and hunting (1 time). Note that “null” has occurred 3 times on the left border of there is a widespread concern over, but in our calculation, we regard it as three different “empty borders”, each of which occurs once, because three different “empty borders” will yield a larger value of the left border entropy than one “empty border” occurring 3 times, and a larger entropy value indicates a greater clarity on the left border for the sequence (refer to step (a) for more details). The comma and the period are likewise annotated as empty borders, so the left side is recorded as five distinct empty borders. Based on this consideration, the sequence there is a widespread concern over is stored as follows in our program:

'there is a widespread concern over': { 'Freq': 5,

'leftWords': { 'none_1': 1, 'none_2': 1, 'none_3': 1, 'none_4': 1, 'none_5': 1 },

'rightWords': { 'the': 3, 'whether': 1, 'hunting': 1 } }
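A record of this shape could be built along the following lines. This is a sketch under stated assumptions: texts are lowercased token lists in which punctuation marks are separate tokens, the set `PUNCTUATION` and the function name `collect_borders` are hypothetical, and both a punctuation neighbour and the start or end of a text are treated as an "empty border" with its own key.

```python
# Assumed set of punctuation tokens that count as empty borders.
PUNCTUATION = {".", ",", ";", ":", "(", ")", '"', "!", "?"}

def collect_borders(texts, sequence):
    """Collect left and right neighbours of `sequence` across tokenized texts."""
    target = list(sequence)
    n = len(target)
    record = {"Freq": 0, "leftWords": {}, "rightWords": {}}
    empty_seen = {"leftWords": 0, "rightWords": 0}
    for tokens in texts:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != target:
                continue
            record["Freq"] += 1
            for side, pos in (("leftWords", start - 1), ("rightWords", start + n)):
                neighbour = tokens[pos] if 0 <= pos < len(tokens) else None
                if neighbour is None or neighbour in PUNCTUATION:
                    # Each empty border gets a unique key so that it counts as a distinct type.
                    empty_seen[side] += 1
                    record[side][f"none_{empty_seen[side]}"] = 1
                else:
                    record[side][neighbour] = record[side].get(neighbour, 0) + 1
    return record

# The five concordance lines of Table 3, reduced to their immediate context.
stem = "there is a widespread concern over".split()
texts = [
    stem + ["the", "issue"],
    stem + ["the", "topic"],
    stem + ["whether", "famous"],
    ["currently", ","] + stem + ["hunting", "wild"],
    ["literature", "."] + stem + ["the", "issue"],
]
record = collect_borders(texts, stem)
print(record)
```

Keeping the neighbour counts in a nested dictionary means the border entropy can later be computed directly from `record["leftWords"]` and `record["rightWords"]`.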

According to Expressions 9 and 10, the boundary independence of there is a widespread concern over is calculated as follows:

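$$
H(\text{there is a widespread concern over})_{left} = -\left(\tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 2.321928
$$

$$
H(\text{there is a widespread concern over})_{right} = -\left(\tfrac{3}{5}\log_2\tfrac{3}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 1.3709506
$$

$$
H(\text{there is a widespread concern over}) = \sqrt{H(\cdot)_{left} \times H(\cdot)_{right}} = \sqrt{2.321928 \times 1.3709506} \approx 1.784166
$$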
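The calculation above can be checked with a minimal Python sketch. It assumes the neighbour counts are stored in dictionaries as in the program record shown earlier, and it takes the integrated value to be the square root of the product of the two border entropies, which is what the reported figures imply (√(2.321928 × 1.3709506) ≈ 1.784166). The function names are hypothetical.

```python
import math

def border_entropy(neighbour_counts):
    """Shannon entropy (base 2) of the neighbour distribution on one border."""
    total = sum(neighbour_counts.values())
    # p * log2(1/p) equals -p * log2(p) and avoids returning a negative zero.
    return sum((c / total) * math.log2(total / c) for c in neighbour_counts.values())

def boundary_independence(left_counts, right_counts):
    """Integrated border entropy: geometric mean of left and right entropies (assumed)."""
    return math.sqrt(border_entropy(left_counts) * border_entropy(right_counts))

left = {"none_1": 1, "none_2": 1, "none_3": 1, "none_4": 1, "none_5": 1}
right = {"the": 3, "whether": 1, "hunting": 1}

h_left = border_entropy(left)
h_right = border_entropy(right)
h = boundary_independence(left, right)
print(h_left, h_right, h)
```

Note that if a sequence always takes the same neighbour on one side, that side's entropy is 0 and the integrated value is 0 as well, whatever the other side looks like.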

4.4.3. Threshold setting

The combined use of internal association calculation (i.e., MI) and boundary independence calculation (i.e., border entropy) can effectively reduce the redundancy of candidate CSSs, but this approach also increases the complexity of setting threshold values. In our case, the MI threshold setting is relatively simple, as a threshold value of 3 has been determined empirically and widely used in linguistic research for choosing the best candidate and for assigning a fairly high weight (see [34]: 217; [35]: 227). We follow this tradition and use 3 as the cut-off value for the internal association MI.

On the other hand, no specific threshold value has been established for border entropy, leaving us without a reference for setting our threshold. To address this, we first plotted the distribution of border entropy values for all sentence stems, as shown in Fig. 1.

The distribution of border entropy values shows a complex pattern. For nearly four-fifths of the plot, the values form staircase-like lines with a noticeable gap, followed by a small-scale fluctuation and a sharp rise near the end. Upon examining the data, we found that among the 29,914 different candidate sequences, 15,182 have a border entropy value of 0, forming the lower horizontal line in the plot, and 7687 have a value of 1, forming the higher, shorter line. Only 518 sequences have a value between 0 and 1, creating the small gap in the plot. We now consider the cases with border entropy values of 0 and 1 for threshold setting.4

  • (a)

    When the border entropy value of a sequence is 0. According to Expressions 9 and 10, a border entropy value of 0 arises in only one situation: the sequence co-occurs with one and the same word on its left or right border. In other words, the sequence has a very strong tendency to collocate with one word and, thus, its border is not clear at all. For example, it may be true occurs 4 times in TECCL-Sample, each time followed by the word that on its right side. This leads to a border entropy value of 0. For this reason, we decide to delete the sentence stems whose border entropy value is 0.

  • (b)

    When the border entropy value of a sequence is 1. Based on our observations of the data, almost all the sequences with a border entropy value of 1 occur twice in the corpus and collocate with different words on both the left and right sides. For example, it is widely acknowledged that the occurs twice, with two “empty borders” on its left side and two words (world and main) on the right side, which results in a border entropy value of 1. Considering that the text range parameter has a threshold value of “more than four different essays” (see 4.5 for a detailed account), we exclude the sequences whose frequency of occurrence is 2 and whose border entropy is 1.

Based on the above considerations, we empirically set the threshold value of MI to 3 and the threshold of border entropy to 1.

4.5. Text range calculation

We also include “text range” in our extraction of CSSs, in that only when the occurrences of a sequence are statistically significant, and their inter-textual distributions are dispersive to a certain degree, do we have the reason to treat the sequence as an expression characteristic of Chinese learner English. In other words, the parameter of text range is employed to ensure that the use of a CSS is not the idiosyncrasy of an individual student but, rather, an expression recognized by other peers. The threshold value of text range (R) is set at a fairly low level in our extraction: R > 4, which means as long as a sequence appears in more than four essays, it meets our text range requirement.

In total, we have set three parameters to delimit CSSs: internal association (MI > 3), boundary independence (H > 1), and text range (R > 4). Only sentence stems that satisfy all of the three requirements are identified as candidate sequences for the next procedure. After taking this step, 2408 varied sequences have been extracted.
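The three-parameter filter can be sketched as follows. The function names `text_range` and `is_candidate` are hypothetical, essays are assumed to be token lists, and the toy essays are for illustration only; the thresholds are the ones stated above (MI > 3, H > 1, R > 4).

```python
def text_range(sequence, essays):
    """Number of essays (token lists) in which the sequence occurs at least once."""
    target = list(sequence)
    n = len(target)
    return sum(
        any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
        for tokens in essays
    )

def is_candidate(mi, h, r, mi_min=3.0, h_min=1.0, r_min=4):
    """A sequence is kept only if all three values strictly exceed their thresholds."""
    return mi > mi_min and h > h_min and r > r_min

# Toy essays, for illustration only.
essays = [
    ["it", "is", "convenient", "for", "us"],
    ["i", "think", "it", "is", "convenient"],
    ["we", "should", "go"],
]
print(text_range(("it", "is", "convenient"), essays))

# Values reported in this paper for "it is convenient for people".
print(is_candidate(mi=5.28, h=1.68, r=6))
```

Because the comparisons are strict, a sequence with a border entropy of exactly 1 or a text range of exactly 4 is rejected, which matches the exclusion of the frequency-2 cases discussed in 4.4.3.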

4.6. Overlapping sequence reduction

The cut-off scores of the above three parameters (i.e., MI, border entropy, and text range) have excluded a large number of noise sequences, but we still extract sequences like it is convenient (MI = 5.48, H = 3.21, R = 55), it is convenient for (MI = 6.31, H = 3.53, R = 32), it is convenient for us (MI = 6.53, H = 2.34, R = 14), it is convenient for people (MI = 5.28, H = 1.68, R = 6), it is convenient for us to (MI = 5.6, H = 3.01, R = 10), since their MI, border entropy, and text range scores are all above the threshold values. A noticeable feature shared by such sequences of different lengths is that the shorter sequence is part of, or is included in, the longer one; that is, they are overlapping sequences. In the present context, the shorter sequence is called a “sub-string” and the longer sequence a “super-string.” The problem now is how to choose an appropriate sequence as the CSS among all of the sub-strings and super-strings.

In this study, we use the LocalMax algorithm to remove the overlapping sequences [31]. Let Sn be the candidate sequence that consists of n words. Sn-1 represents any substring of Sn that has the size of (n-1), and Sn+1 is any super-string of Sn that has the size of (n+1). After we have extracted the sequences whose MI, border entropy, and text range scores are all larger than their cut-off values, we calculate the product of normalized MI, N(MI), and normalized border entropy, N(H), for each sequence Sn, and assign the obtained value to a new variable GI (Global Index). The formula for LocalMax is as follows:

$$
\begin{cases}
GI = N(MI) \times N(H) \\[1ex]
GI(S_n) \geq GI(S_{n-1}) \;\wedge\; GI(S_n) > GI(S_{n+1})
\end{cases}
\tag{11}
$$

However, it should be noted that, in Expression 11 above, we use the normalized values of MI and border entropy instead of their absolute values in the calculation of GI, i.e., GI = N(MI) × N(H). This is because the ranges of the absolute values of MI and border entropy differ greatly: MI values are distributed within the range of [-3.56, 19.50], while border entropy values are in the range of [0, 7.08]. By our count, 97.56 % of the sequences obtained in Step 4.4 have a higher, or even a much higher, value of MI than of border entropy. If we calculated the GI of each sequence simply by multiplying the absolute values of MI and border entropy, MI would tend to have a larger impact on the final value of GI than border entropy. As a consequence, the choice to keep or to delete a sequence would depend more on its internal association than on its boundary independence. However, it is argued that the inside measurement of a sequence (i.e., the MI in this study) and the outside measurement (i.e., the border entropy) should carry equal weight when judging the significance of occurrence of the sequence. Therefore, we introduce the Min-Max algorithm from statistics to normalize the values of MI and border entropy, respectively, into the range [0, 1] by way of a linear transformation. This normalization serves, for one thing, to offset the unbalanced impact of MI and border entropy on the final result of our extraction and, for another, to retain as far as possible the respective inner distributions of MI values and border entropy values. Suppose that MI = {MIi, i = 1, 2, …, n} is the set of MI absolute values, MImin and MImax are the respective minimum and maximum values of the set, and N(MIi) is the normalized value of any MIi in the set; H = {Hi, i = 1, 2, …, n} is the set of border entropy values with Hmin and Hmax as its minimum and maximum values, and N(Hi) is the normalized value of any Hi. The Min-Max normalization algorithm is shown in Expression 12 below.

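$$
\begin{cases}
N(MI_i) = \dfrac{MI_i - MI_{min}}{MI_{max} - MI_{min}} \\[2ex]
N(H_i) = \dfrac{H_i - H_{min}}{H_{max} - H_{min}}
\end{cases}
\tag{12}
$$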
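As a quick illustration of the normalization, the sketch below computes GI from an MI value and a border entropy value. It assumes that the minimum and maximum are the range endpoints reported above ([-3.56, 19.50] for MI and [0, 7.08] for border entropy); the function names are hypothetical, and because the reported inputs are rounded, results may differ from published GI values in the last digit.

```python
def min_max(value, lowest, highest):
    """Linear Min-Max normalization into [0, 1]."""
    return (value - lowest) / (highest - lowest)

def global_index(mi, h, mi_range=(-3.56, 19.50), h_range=(0.0, 7.08)):
    """GI = N(MI) x N(H), using the value ranges reported in the text (assumed)."""
    return min_max(mi, *mi_range) * min_max(h, *h_range)

# "it is convenient for": MI = 6.31, H = 3.53
gi_four = global_index(6.31, 3.53)
# "it is convenient": MI = 5.48, H = 3.21
gi_three = global_index(5.48, 3.21)
print(round(gi_four, 2), round(gi_three, 2))
```

After normalization both factors lie in [0, 1], so neither measure can dominate the product merely because of its scale.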

Here, we take the above example it is convenient for, which contains four words, into consideration. Table 4 shows all the 3-word sub-strings and 5-word super-strings of the sequence with their respective GI values.

Table 4.

3-word sub-strings and 5-word super-strings of it is convenient for (with GI value).

| 3-word sub-strings | Candidate CSS | 5-word super-strings |
| --- | --- | --- |
| it is convenient (GI = 0.18) | it is convenient for (GI = 0.21) | it is convenient for us (GI = 0.15) |
| | | it is convenient for people (GI = 0.09) |
| | | more it is convenient for (GI = 0.05) |
| | | firstly it is convenient for (GI = 0.11) |

Note that we only compare the sub-strings and super-strings which are sentence stems and whose MI and border entropy values are both larger than the cut-off scores. Altogether, one 3-word sub-string and four different 5-word super-strings of it is convenient for are selected for the calculation of LocalMax; other sub-strings and super-strings are discarded because either they are not sentence stems (e.g., is convenient for) or they do not satisfy the threshold requirements of MI, border entropy, or text range (e.g., it is convenient for students, if it is convenient for). By performing the LocalMax algorithm, the 4-word sentence stem it is convenient for is finally identified as a CSS because its GI value, which stands at 0.21, is higher than that of any of its 3-word sub-strings and 5-word super-strings.
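The selection just described can be expressed as a short Python check. This is a sketch, not the authors' program: the function name `is_local_max` is hypothetical, and it assumes the usual LocalMax convention that a candidate must be at least as high as every sub-string and strictly higher than every super-string. The GI values are those of Table 4.

```python
def is_local_max(gi, sub_gis, super_gis):
    """True if gi >= every sub-string GI and gi > every super-string GI."""
    return all(gi >= s for s in sub_gis) and all(gi > s for s in super_gis)

# GI values from Table 4.
candidate = 0.21                      # it is convenient for
sub_strings = [0.18]                  # it is convenient
super_strings = [0.15, 0.09, 0.05, 0.11]

print(is_local_max(candidate, sub_strings, super_strings))

# The 3-word sub-string loses to its 4-word super-string (no sub-strings listed here).
print(is_local_max(0.18, [], [0.21]))
```

Only it is convenient for survives the comparison; the shorter and longer overlapping sequences are dropped.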

5. Results and discussions

5.1. Overall data profile of the extracted CSSs

With the aforementioned steps, 1293 different CSSs (types), which occur 16,324 times in total (tokens), were automatically extracted, with their lengths varying from three to seven words. We then manually checked through each extracted CSS for precision and filtered out 320 varied dubious sequences that did not fit our intuition. In the end, we identified 973 different CSSs (types) with a total of 12,249 instances (tokens) from the corpus (See Appendix A for a list of 500 examples of finally-identified CSSs). In what follows, we will demonstrate the structural and functional distribution of the extracted CSSs to offer a more refined profile of the overall CSS data.

  • (1)

    Structural distribution of CSSs

Drawing on Altenberg's [13] taxonomy, we classify CSSs into two broad structural categories: full clauses and clause constituents (see Table 5).

Table 5.

Distribution of CSSs of different structural categories.

| Structural categories | Types | Percentage (%) | Tokens | Percentage (%) |
| --- | --- | --- | --- | --- |
| Full clauses | 122 | 12.538 | 1725 | 14.083 |
| (a). Independent clause: maxims, proverbs, etc. | 39 | 4.008 | 447 | 3.649 |
| (b). Independent clause: others | 58 | 5.961 | 457 | 3.731 |
| (c). Dependent clause: as-introduced CSSs | 25 | 2.569 | 821 | 6.703 |
| Clause constituents | 851 | 87.460 | 10524 | 85.917 |
| (a). Personal subject | 429 | 44.090 | 5676 | 46.338 |
| (b). Impersonal subject: general | 49 | 5.036 | 495 | 4.041 |
| (b). Impersonal subject: specific | 100 | 10.277 | 797 | 6.507 |
| (b). Impersonal subject: demonstrative pronoun | 58 | 5.961 | 761 | 6.213 |
| (c). Dummy-it construction | 128 | 13.155 | 1566 | 12.785 |
| (d). Existential construction | 87 | 8.941 | 1229 | 10.033 |
| Total | 973 | 100 | 12249 | 100 |

As shown in Table 5, 122 types and 1725 tokens of CSSs are identified to be full clauses; they are classified into three sub-categories: (a) Independent clauses that are maxims, proverbs, and other fragments of rhetoric (e.g., every coin has two sides). (b) Other independent clauses that are mainly expressed by two groups of subject: the general subject that consists of the expressions commonly used in developing an argument (e.g., advantages outweigh the disadvantages), and the specific subject that is linked to the semantic content of the topic of an essay (e.g., online shopping has many advantages). (c) As-introduced CSSs (e.g., as can be seen), which are identified as dependent clauses, a main subcategory of full clauses according to Altenberg (1998: 109).

The clause constituent CSSs, with 851 types and 10,524 tokens, outnumber the full clause CSSs by nearly seven to one (in types) and by about six to one (in tokens). CSSs in this category are divided into four structural sub-categories: (a) Personal subject CSSs, which are expressed by personal pronouns or nouns as subject (e.g., we should pay attention to). (b) Impersonal subject CSSs, whose subject position is occupied by impersonal nouns or pronouns in three groups: the topic-specific subject (e.g., appearance is more important than), the general subject (e.g., the main reason is that), and the demonstrative pronoun (e.g., that is the reason why). (c) Dummy-it CSSs, which are introduced by the anticipatory it (e.g., it is obvious that). (d) Existential CSSs, which are introduced by existential-there (e.g., there is no doubt that).

  • (2)

    Functional distribution of argumentation-related CSSs

Scrutiny of the list of sentence stems (Appendix A) shows that most of the extracted CSSs are argumentation-related. It is found that the argumentation-related CSSs, with 931 types and 11,882 tokens, consist mostly of the expressions that are commonly used in the two components of developing an argument [36]: “analyzing and evaluating content knowledge” and “developing the writer's own position” (see Table 6).

Table 6.

Distribution of argumentation-related CSSs of different functional types.

| Functional categories | Types | Percentage (%) | Tokens | Percentage (%) |
| --- | --- | --- | --- | --- |
| Analyzing and evaluating content knowledge | 313 | 33.62 | 3846 | 32.37 |
| (a). describing the current situation or background | 214 | 22.99 | 2116 | 17.81 |
| (b). stating others' views | 39 | 4.19 | 612 | 5.15 |
| (c). stating popular assumptions | 26 | 2.79 | 696 | 5.86 |
| (d). indicating source of the opinion | 23 | 2.47 | 330 | 2.78 |
| (e). identifying conflicting points of view | 11 | 1.18 | 92 | 0.77 |
| Developing the writer's own position | 618 | 66.38 | 8036 | 67.63 |
| (a). stating an opinion or expressing a stance | 443 | 47.58 | 5935 | 49.95 |
| (b). giving reasons or explanations | 97 | 10.42 | 1116 | 9.39 |
| (c). supporting a claim with maxims, proverbs, etc. | 39 | 4.19 | 447 | 3.76 |
| (d). indicating conditions | 24 | 2.58 | 429 | 3.61 |
| (e). raising a question | 7 | 0.75 | 43 | 0.36 |
| (f). concluding or summarizing | 8 | 0.86 | 66 | 0.56 |
| Total | 931 | 100.00 | 11882 | 100.00 |

The first element of argumentation “analyzing and evaluating content knowledge” requires that students possess adequate subject knowledge and are capable of distinguishing relevant from irrelevant information in the literature ([37]: 147). It is shown that 313 types and 3846 tokens of the extracted CSSs are used in relation to this element; specifically, they are used to realize five discourse-pragmatic functions: (a) “describing the current situation or background” (e.g., people suffer from), (b) “stating others’ views” (e.g., some people hold the opinion that), (c) “stating popular assumptions” (e.g., there is a widespread concern over), (d) “indicating source of the opinion” (e.g., as an old saying goes), (e) “identifying conflicting points of view” (e.g., there are different opinions among people).

The second element of argumentation “developing the writer's own position” requires that students express their opinion or establish a position based on their subject knowledge and be able to show a ‘workable balance between self and sources’ ([38]: 65). 618 types and 8036 tokens of CSSs are identified in relation to this element, about twice as many as those for “analyzing and evaluating content knowledge” either in types (618/313) or in tokens (8036/3846). Specifically, the CSSs of this type are found to realize six discourse-pragmatic functions: (a) stating personal opinion or expressing a stance (e.g., there is no doubt that), (b) giving reasons or explanations (e.g., that is the reason why), (c) quoting maxims, proverbs or fragments of rhetoric, typically, to support a claim, view, etc. (e.g., practice makes perfect), (d) indicating conditions (e.g., when it comes to), (e) raising a question (e.g., how should we deal with), (f) concluding or summarizing (e.g., we can safely draw the conclusion that).

5.2. Comparison of the extracted CSSs using different association measures

We employed six association measures, including four non-directional measures (Dice, Odds Ratio, Fisher Exact p-value, and LLR) and two directional measures (ΔP Attraction and ΔP Reliance), as substitutes for MI, to individually calculate the internal association of a sentence stem. These measures were applied in conjunction with boundary independence calculation and text range calculation to extract CSSs from the corpus. We selected the top 500 CSSs (types) from the results of each association measure (including MI) for comparison. Fig. 2 shows the overall distribution of CSS types in relation to sequence length. The figure consists of seven clustered bar charts, each representing the distribution of CSSs of different lengths extracted using different association measures.

Fig. 2.

Sequence length distributions of the top 500 CSSs.

The data analysis of Fig. 2 reveals distinctions in the sequence length distribution based on the choice of association measures. It is shown that the distributions derived from Dice, LLR, and Fisher Exact p-value are remarkably similar. Specifically, the highest number of CSSs corresponds to three-word sequences, as indicated by the light blue bars in Fig. 2. This prevalence of three-word sequences is significantly higher than sequences in other lengths. On the other hand, MI, ΔP Attraction, ΔP Reliance, and Odds Ratio exhibit a relatively similar distribution, with the highest number of CSSs being four-word sequences, represented by the orange bars. This dominance of four-word sequences is noticeably greater than sequences in other lengths. However, subtle variations exist among the four algorithms in terms of their performance on three-, five-, and six-word sequences.

We then conducted a comparative analysis by juxtaposing two sets of CSSs. The first set encompasses the top 500 CSSs extracted using MI, while the second set comprises the top 500 CSSs extracted individually using each of the alternative association measures (Odds Ratio, ΔP Reliance, ΔP Attraction, Dice, LLR, and Fisher Exact p-value). Through the application of set intersection operations to these sets, we aim to delineate the common or shared sentence stems that are extracted by MI and the association measures. The overall intersection of CSSs between each of the two sets is graphically presented in Fig. 3.

Fig. 3.

Shared sentence stems between MI and alternative association measures.

Fig. 3 provides an overarching perspective on the degree of overlap between the sentence stems obtained through six alternative association measures and those obtained through MI. Judging by the number of sentence stems (types) in the intersections, ΔP Attraction demonstrates the highest level of overlap with MI, with 280 shared sentence stems. This is followed by ΔP Reliance with 266, Dice with 223, LLR with 219, and Fisher Exact p-value with 206. Odds Ratio exhibits the lowest level of overlap, with 187 shared sentence stems. Turning to sequence length, a notable observation emerges in terms of the shared sentence stems; that is, four-word sequences consistently manifest a substantial overlap across all six “association measure-MI” pairings. This overlap also maintains a relatively stable count across the pairings, ranging from 82 (MI-Odds Ratio) to 135 (MI-ΔP Attraction).

To offer a more detailed examination of each comparative pairing, Fig. 4 is introduced to provide a granular breakdown of the intersection results of CSSs in terms of shared and distinct sentence stems for each “association measure-MI” pairing. The figure comprises six bar charts, each illustrating the comparative results of MI and the other association measure in a pairing. The grey bars represent the number of shared sentence stems extracted by the two measures (i.e., MI ∩ association measure). The blue bars denote the number of distinct sentence stems extracted exclusively by MI but not by the other association measure in a pairing (i.e., MI - association measure), while the orange bars represent the number of sentence stems extracted by the other association measure but not by MI (i.e., association measure - MI). The blue and orange bars collectively depict the difference in extraction results between the two measures in a pairing.

Fig. 4.

Overlap and disparity between MI and alternative association measures in the extraction of CSSs. A: MI vs. Dice. B: MI vs. Odds Ratio. C: MI vs. LLR. D: MI vs. ΔP Attraction. E: MI vs. Fisher Exact p-value. F: MI vs. ΔP Reliance.
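The three quantities plotted for each pairing (shared stems, stems extracted only by MI, and stems extracted only by the other measure) correspond to ordinary set operations. The sketch below uses two tiny, hypothetical result sets purely for illustration; they are not the actual top-500 lists.

```python
# Hypothetical miniature result sets, for illustration only.
mi_top = {"as is known to all", "it is obvious that", "there is no doubt that"}
other_top = {"as is known", "it is obvious that", "there is no doubt that"}

shared = mi_top & other_top        # MI ∩ association measure (grey bars)
mi_only = mi_top - other_top       # MI - association measure (blue bars)
other_only = other_top - mi_top    # association measure - MI (orange bars)

print(len(shared), sorted(mi_only), sorted(other_only))
```

Grouping each of the three sets by sequence length (for example, by counting the words in each stem) would give the per-length bars described for each panel.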

The analysis of Fig. 4 reveals that MI tends to favor the extraction of longer sequences, compared to Dice, LLR, and Fisher Exact p-value. As shown in Fig. 4, in the pairings of “MI-Dice” (Fig. 4A), “MI-LLR” (Fig. 4C), and “MI-Fisher Exact p-value” (Fig. 4E), the orange bars for three-word sequences are significantly higher than their corresponding blue bars. This indicates that the Dice, LLR, and Fisher Exact p-value algorithms extract far more three-word sequences than MI. Conversely, for four- to seven-word sequences, the blue bars are notably higher than their orange counterparts. This suggests that MI consistently extracts more four- to seven-word sequences than Dice, LLR, and Fisher Exact p-value. However, in the pairings of “MI-Odds Ratio” (Fig. 4B), “MI-ΔP Attraction” (Fig. 4D), and “MI-ΔP Reliance” (Fig. 4F), MI does not exhibit the pronounced tendency to favor the extraction of longer sequences.

In summary, based on the above comparative analysis of association measures for sentence stem extraction, we can outline two main findings as follows.

  • (a)

    In the comparison between MI and the six alternative association measures, MI and ΔP Attraction stand out by extracting the highest number of shared sentence stems (as illustrated in Fig. 3). This suggests a higher degree of similarity in the results extracted using the two measures. Among the shared sentence stems across the six “MI-association measure” pairings, four-word sequences are the most prevalent.

  • (b)

    Three association measures (Dice, LLR, and Fisher Exact p-value) show a notable inclination toward shorter sequences, with three-word sequences being the most frequently extracted. In contrast, MI, ΔP Attraction, Odds Ratio and ΔP Reliance exhibit a preference for extracting longer sequences, which, while potentially less frequent than their shorter counterparts, are likely to be more informative. For example, the five-word sequence as is known to all (extracted by MI) yields more specific information than the four-word sequence as is known to (extracted by Fisher Exact p-value), which in turn provides more information than the corresponding three-word sequence as is known (extracted by LLR). The sequence as is known to all, as opposed to the simpler sequence as is known, manifests a higher level of information richness and collocational specificity and is thus more beneficial for non-native students' writing.
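The nesting relation among the three example sequences in finding (b) can be made explicit with a short check. This is only an illustrative sketch: the helper `is_contiguous_subsequence` is a hypothetical name, and keeping only the longest member of a nested chain is a simplification, not the overlapping sequence reduction step used in this study.

```python
def is_contiguous_subsequence(short, long):
    """True if the words of `short` occur as one contiguous run inside `long`."""
    s, l = short.split(), long.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

# The three sequences discussed in finding (b).
stems = ["as is known", "as is known to", "as is known to all"]

# Keep a stem only if it is not nested inside another, longer stem.
longest = [s for s in stems
           if not any(s != t and is_contiguous_subsequence(s, t) for t in stems)]
```

Here `longest` retains only the five-word sequence, since the two shorter sequences are both contained in it.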

5.3. Pedagogical implications

The phraseological deficits experienced by learners have long been noted. Errors stemming from a lack of phraseological competence, although not always major and varying in their impact on intelligibility, have an appreciable impact on the effectiveness of student writing. To enhance EFL learners' phraseological competence, it has been widely argued that approaches to second-language instruction should ensure that learners develop a rich repertoire of formulaic sequences ([39]: 142). Our investigation into CSSs underscores the expansive nature of this repertoire, extending beyond phrase-level sequences to encompass clause-level sequences. The identified CSSs demonstrate the formulaic patterns at the clause level in Chinese EFL learners’ essays and reveal the typical way in which Chinese learners write essays. Additionally, categorizing CSSs into lists based on their sentence patterns and discourse-pragmatic functions enables teachers to target either the structural or functional aspects of clause-level idiomatic expressions in Chinese learner English. Thus, from a pedagogical perspective, CSSs could have potential implications for Chinese EAP teaching and learning.

Next, we will present a specific example illustrating the pedagogical implication of CSSs.

It has been argued that the element of argumentation “developing the writer's own position” poses considerable difficulties for the novice writer ([37]: 147). Our extraction captures 39 different maxim-like expressions to realize this element in Chinese EFL learners' essays. The use of those expressions, which consist of maxims, proverbs, and other fragments of rhetoric, is deemed a good indicator to evaluate the sophistication of lexical use of an EFL learner in essay writing. Our extraction shows that Chinese EFL learners are aware of using maxims, proverbs, etc., in developing arguments in their essays. However, their usages of maxim-like expressions reveal two notable issues.

  • (a)

    Some maxims have been used too often to be considered striking or interesting; a few of them have been overused to the point of being trite and clichéd. For example, among the 73 essays on the topic of failure or success, the expression failure is the mother of success occurred 13 times, which means that more than one in six (17.81%, 13/73) essays on the topic used the expression. The repetitive occurrence of some maxims across Chinese learners' essays undermines the effectiveness of those maxims and makes the argument less interesting.

  • (b)

    Some idiomatic expressions can be traced back to their origins in Chinese. For example, the expression long time no see is derived from the Chinese greeting “好久不见.” A few expressions are also found to be direct translations of Chinese idioms or maxims, such as practice is the sole criterion for testing truth (实践是检验真理的唯一标准) and knowledge is power (知识就是力量). Those English expressions may sound unidiomatic or strange to the native ear, but their corresponding Chinese expressions are highly familiar to native Chinese speakers.

From the analysis above, we can see that Chinese EFL learners are willing to express themselves with maxim-like expressions in developing an argument, but they seem to lack the ability to quote widely varying maxims, especially under (time) pressure. Explicit instruction on subject-related idiomatic expressions, especially those suitable for argumentative contexts, could potentially empower Chinese EFL learners to quote a diverse range of maxims more effectively. To enhance writing quality, instructors could also foster students' awareness of the importance of carefully selecting and using a broader range of nuanced maxim-like expressions to avoid the overuse of clichéd maxims. Moreover, instructors could foster students' cultural awareness regarding the use of maxim-like expressions. This involves guiding students to develop sensitivity and originality by expanding their repertoire of culturally-loaded expressions, ensuring a nuanced and culturally-aware use of maxim-like expressions in Chinese EFL learners’ writing.

6. Conclusion and limitations

This article explored the feasibility of automatic extraction of CSSs, a special category of clause-level phraseological units, from Chinese learner corpora. It also compared the extraction results of CSSs by using different association measures and discussed potential implications that the extracted CSSs could have for Chinese EFL teaching and learning.

The extraction method for CSSs is the focal point of this article. It involves six steps: POS tagging, n-gram segmentation, structure identification, significance of occurrence calculation, text range calculation, and overlapping sequence reduction. The procedure starts with the preliminary extraction of formally qualified sequences with subject-predicate structures. Then, three parameters (internal association, boundary independence, and text range) are used to measure the typicality of each sequence in academic texts. Internal association measures the strength of adhesion inside a CSS, boundary independence measures the clarity of a CSS's outside borders, and text range calculates the inter-textual dispersion of a CSS. Finally, the Min-Max normalization algorithm is applied to remove overlapping sequences. Using this method, the study extracted 973 different CSSs from the corpus. This paper also compared the CSSs extracted using different association measures for internal association calculation.
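To make the three parameters concrete, the sketch below computes one plausible version of each on a toy corpus. It rests on several assumptions that are not taken from this paper: internal association is approximated by pointwise mutual information against an independence baseline over all words of the sequence, boundary independence is approximated by the Shannon entropy of the words immediately to the right of the sequence, and text range is the number of texts containing the sequence. All function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pmi(seq, texts):
    """Pointwise MI (bits) of seq versus its words occurring independently."""
    tokens = [t for text in texts for t in text]
    total = len(tokens)
    unigram = Counter(tokens)
    seq_count = sum(ngrams(text, len(seq)).count(seq) for text in texts)
    if seq_count == 0:
        return float("-inf")
    p_seq = seq_count / total
    p_indep = math.prod(unigram[w] / total for w in seq)
    return math.log2(p_seq / p_indep)

def right_boundary_entropy(seq, texts):
    """Shannon entropy (bits) of the words that immediately follow seq."""
    n = len(seq)
    followers = Counter()
    for text in texts:
        for i in range(len(text) - n):  # stop early: a follower must exist
            if tuple(text[i:i + n]) == seq:
                followers[text[i + n]] += 1
    total = sum(followers.values())
    return sum(-(c / total) * math.log2(c / total) for c in followers.values())

def text_range(seq, texts):
    """Number of texts in which seq occurs at least once."""
    return sum(1 for text in texts if seq in ngrams(text, len(seq)))

# Toy corpus of three tokenized "essays" (illustrative only).
texts = [
    "it is said that he left".split(),
    "it is said that she left".split(),
    "he said it is true".split(),
]
seq = ("it", "is", "said", "that")
scores = (pmi(seq, texts), right_boundary_entropy(seq, texts), text_range(seq, texts))
```

A high association score indicates that the words co-occur more often than chance would predict, a high boundary entropy indicates that many different words can follow the sequence (a clear right border), and a high text range indicates that the sequence is not confined to a few texts.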

Our method for the automatic extraction of CSSs from corpus data offers significant potential for advancing the study of EFL learners' phraseology. As Ellis ([40]: 41) stated, “language acquisition is essentially a sequence learning problem.” Our method and results suggest that CSSs (a specific type of sequence) can be statistically measured and automatically extracted from corpora, enabling a detailed examination of the clause-level formulaic patterns characteristic of EFL learners’ writing. By categorizing CSSs according to their primary structures and functions, we could enhance our understanding of learner phraseology and support a targeted analysis of clause-level phraseological use within this demographic. Moreover, the adaptability of our method to different corpora enables customized analyses of phraseology in EFL.

However, we have to admit that this article is only a preliminary exploration of CSSs, focusing on methodological issues, and that it has not yet investigated the patterns of co-selection of a CSS when realizing a specific function. For example, the CSS we should pay attention to frequently co-occurs with result/inference adverbials (e.g., so, as a result) or hedging expressions (e.g., I think, as far as I'm concerned) to emphasize a cautious recommendation based on previous information or reasoning. As noted by Lee and Swales ([41]: 57), “what apprentice writers may be mostly missing is fine tuning of lexical and syntactic subtleties, particularly in terms of their strategic and rhetorical implications.” Therefore, careful scrutiny of the combinatory behavior of CSSs with their patterns and functions will facilitate our understanding of how Chinese learners apply their clause-level phraseological competence in essay writing. In our follow-up study, we will examine the co-selection patterns of each of the extracted CSSs. It is believed that the CSSs, combined with their co-selection patterns for realizing specific functions, would have greater potential value in the application to non-native EAP teaching and learning. Another limitation of this study is that the extracted CSSs are constrained to a length of three to seven words. Some idiomatic sequences may exceed this limit. Expanding the range of sequence length for extracting CSSs could capture a broader range of idiomatic expressions. Nevertheless, the inclusion of longer sequences may also increase the computational complexity of statistical analyses.

CRediT authorship contribution statement

Jingjie Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Wenjie Hu: Writing – review & editing.

Data availability statement

Data are available from the corresponding author upon reasonable request.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jingjie Li reports financial support was provided by Shanghai Planning Office of Philosophy and Social Science.

Acknowledgements

This work has been supported by the Shanghai Planning Office of Philosophy and Social Science (grant ref. 2021BYY001). The authors are grateful to the anonymous reviewers for their detailed and helpful comments on earlier drafts of this paper.

Footnotes

1

It is noteworthy that Gries [20] also concurrently introduced the ΔP measure for calculating directional word associations and proposed the pairwise measure of ΔPleft-to-right and ΔPright-to-left, the formula of which is consistent with that of ΔP Attraction and ΔP Reliance.

2

The Natural Language Toolkit (NLTK) is a suite of open-source Python programs and data for Natural Language Processing, providing “a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning” (http://nltk.org/).

3

In the C7 tagging system, words that can serve either as pronouns or as determiners, such as any, some, this, and that, are all tagged as determiners. To avoid omissions, we deliberately incorporated five out of thirteen subcategories of determiners into subject recognition.

4

We exclude the sequences with the border entropy value between 0 and 1 from our consideration, as their number is very small, only accounting for 1.73 % of the total data under processing.
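As a brief illustration of the directional measures mentioned in footnote 1, the sketch below computes ΔP in both directions for a two-word pair from a 2×2 contingency table, using the standard definition ΔP = P(outcome | cue) - P(outcome | no cue). The function name and the cell counts are hypothetical, and the sketch does not state which direction corresponds to ΔP Attraction or ΔP Reliance as used in this paper.

```python
def delta_p(a, b, c, d):
    """Return (ΔP left-to-right, ΔP right-to-left) for a 2x2 bigram table.

    a: w1 followed by w2          b: w1 followed by another word
    c: another word before w2     d: neither w1 nor w2 in the pair
    """
    left_to_right = a / (a + b) - c / (c + d)  # P(w2 | w1) - P(w2 | not w1)
    right_to_left = a / (a + c) - b / (b + d)  # P(w1 | w2) - P(w1 | not w2)
    return left_to_right, right_to_left

# Hypothetical counts for illustration only.
ltr, rtl = delta_p(a=80, b=20, c=120, d=9780)
```

With these toy counts the first word predicts the second more strongly than the reverse, which is the kind of asymmetry a symmetric measure cannot express.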

Appendix A. 500 examples of CSSs categorized by structure

1. Full Clause
(1). Independent clause: maxims, proverbs, fragments of rhetoric, etc.:
(every/each) coin has (its) two sides | we are what we read
nothing succeeds without a strong will | water is the source of life
everything has two sides | book is the ladder of human progress
failure is the mother of success | dream will come true
knowledge is power | time is money
practice makes perfect | practice is the sole criterion [for testing truth]
books are the ladder of human progress | the early bird catches the worm
interest is the best teacher | [a] friend in need is a friend indeed
nothing is impossible | histories make men wise
long time no see | [the years were a mirage and] there had been no years
actions speak louder than words | [if you try your] best everything can be done
everyone is equal | everyone has a dream
life is the greatest teacher | classics represent the wisdom of the past
life is short | everything is possible
nothing in the world is difficult [for one who sets his mind to it] | smoking is harmful
(2). Independent clause: others
there are different opinions among people | we should limit the development of tourism
reasons are as follows | students lack social practice
advantages outweigh the disadvantages | existing trade agreements should be repaired
skills and creativity are both worthwhile goals | it has both advantages and disadvantages
one thing is certain | life (will be/is) meaningful
online shopping has many advantages | the government should establish free libraries
online shopping has become a fashion | life will be colorful
generation gap is very common at present | company has won a large export order
water shortage is becoming an urgent problem | students are encouraged to make comments
college students should participate in social practice | parents love their children
we should read more books | college life is wonderful
my mother is a housewife | online shopping is convenient
life will be better | the death penalty is a step back
air is fresh | we should help each other
answer is yes | students choose to take part-time jobs
winter is coming | reading is very important
spring festival is a traditional festival | newspaper is a better source of news
families have only one child | English is very important
life is boring | life is not easy
going to classes should be optional | college life is different
(3). Dependent clause: as-introduced CSSs
as we (all) know | as the proverb goes
as is known to (all/us) | as we can see (in/from) the picture
as (the/a) saying goes | as the proverb says
(just) as (the/an) old saying goes | as time went by
as everyone knows | as is shown (in the picture)
as time goes by | as can be seen
as we all known | as is vividly depicted
as time goes on | as mentioned above

2. Clause Constituent
(1). Dummy-it CSSs:
in my opinion it is necessary | it is never too late to
it is obvious that | it is evident that
it is easy to | it is not difficult to find
it is said that | it is worthwhile
it is necessary for | it seems to me that
it is true that | it is well-known to us
it is important for | it is advisable
it is necessary for us to | it is known to us
it is reported that | it is very difficult
it is convenient for | it can be said
it is universally acknowledged that | it would be better
it is very hard to | it is not useful
it is significant to | it was not until
it is high time that | it is our responsibility
it goes without saying that | it is convenient for us to
it is undeniable that | it is not wise
it is clear that | it is no denying that
it is well known that | it is suitable for
it is unnecessary for | it is imperative for
it is very convenient | it is beneficial to
it is impossible for | it is our duty to
it is difficult for us to | it is different from
it is known to all that | it is time to
it is widely acknowledged that | it can be seen
it was the first time | it is harmful to
it is helpful for us to | it is impossible for us to
it is a pity | it is the best way to
it is likely that | it is necessary for me to
it is essential for | it is much easier to
it is time for us to | it is better to
it is believed that | it is wise to
it (doesn't/does not) matter | it is time for
it is the same with | it is useful for
it is important for me to | it is likely to
it is not fair | it is not easy for
(2). Existential CSSs:
there is no doubt that | there are some disadvantages
there are (many/several/some/two/three/numerous) reasons | there are still (many/some)
there is no denying that | there are many interesting
there is a saying | there will be a lot of
there is an old saying | there is a widespread concern over
there are a large number of | there is a phenomenon that
there are plenty of | there are many places
there are a variety of | there is something wrong
there are many kinds of | there are many factors
there is no doubt | there are thousands of
there are all kinds of | there are many differences between
there are (many/some) advantages | there is some truth in
there is no denying the fact that | there is no better
there is only one | there is no need
there are many problems | there is no one
there are more and more people | there is no real
there are so (many/much) | there is no way
there is an increasing | there is one thing
there are many benefits | there may be some
there are a number of | there were so many
there are many disadvantages | there will be many
on the other hand there are some | there will be some
(3). Personal CSSs:
different people have different | students should learn how to
everyone has their own | we can make friends with
some people think | some people don't think
everyone has his own | we should be grateful
we can see | people are pursuing
different people hold different | students think that
some people say | in this way can we keep
we should pay attention to | we will be able to
when they grow up | people are aware of
we are supposed to | we look forward to your
we should try our best to | in the picture we can see
we should make full use of | people insist that
we should cherish | we should balance
some people hold the opinion that | we shall fight him by
some people believe that | students pay attention to
we are able to | we must be careful
from the picture we can see | other people can reach them
when they graduate | students can apply
we can not afford to lose | people are concerned about
we can communicate with | only in this way can we live
some people suppose | many people argue
we can draw a conclusion that | students spend too much time
some people hold the belief that | we can benefit a lot from
first of all we have to | people are afraid of
we should take part in | we are no longer
some students think | we should bear in mind
so that we can get | people suggest that
some people hold the view that | students can learn how to
others believe that | some people agree
we should pay more attention to | can we solve the problem
we should cultivate | we can make full use of
we can draw the conclusion that | we all recognize
when they were young | we are enclosing
some people consider that | more and more people prefer to
some people argue that | students do not pay
if they want to | can we improve our
we should learn how to | we have enough time to
some people claim that | how can we harness
we often see | some people are in favor of
if we insist | we must try our best to
other people think | we should focus on
others argue that | we don't know how to
we are faced with | we should attach importance to
some people support | we should spare no effort to
we should take some measures to | we can't imagine
people pay more attention to | many people want to
we can not deny | if one wants to
teacher told us | people are addicted to
we should take measures to | people are of the opinion that
college students should learn | we should limit
parents should give their children | more and more people start to
college students face | some people think we should
many people think | we can clearly see
different people have diverse | we are talking about
so that we can make | only in this way can we make
people believe that | we help each other
we can not live without | majority of people believe that
students are addicted to | we should communicate with
we can improve our | we are pleased to
students have their own | we should take care of
if we try our best | we can try our best to
many students think that | we can safely draw the conclusion that
people hold different opinions | a lot of people worry
different people have quite different views on | we should continue to
everyone is eager | students should pay more attention
we can learn a lot from | we have less time to
some people suggest that | we usually require
we must learn how to | students can learn
people are accustomed to | people think that we should
we can not ignore | in this way can we get
people are likely to | we should make the best of
we can't live without | we can not emphasize the importance
everyone should try | people stand on
people would like to | students do not pay attention to
people are willing to | people will try their best to
others hold the opposite | people will be accustomed to
we are not able to | we should be aware of
people hold the idea that | people worry that credit cards may
we can use it to | more students think
everyone has a different | people use the internet
people have realized | students should read
how should we deal with | everybody wants to
in my opinion we should read | students take part in
we are required to | students pay less attention
we should spend more time | people pay attention to
we must admit | children should be allowed to
we should make good use of | people are beginning to
people are fond of | we can not afford
different people have quite different | we can take part in
we are glad to | people said that
(4). Impersonal (general subject) CSSs:
reason is that | reasons lead to
number is # | experience is the best
case in point is that | opinions vary from person to
problem can be solved | attention should be paid
nothing is more important than | reasons contribute to
advantage is that | number of people hold
research shows that | efforts should be made
great changes have taken place | the phenomenon is that
phenomenon has aroused | story is about
disadvantage is that | the fact is that
years have witnessed | view is that
experience is more important | the reason is that they
ability is more important than | factors contribute to
topic is about | from this nothing will turn
reasons can account for | the problem will be solved
(5). Impersonal (specific subject) CSSs:
measures should be taken | life is full of
love is the greatest | appearance is more important than
earth is becoming warmer and | love is a product of
government should take | fatigue is one of the most common
soho lifestyle is becoming | low-carbon lifestyle means
English is an international | how college has affected my life
university is a place | college has affected
the Olympic games will be held | this report is to
reading can broaden our | companies should encourage
online shopping has become | air pollution has become
life is filled with | sales confirmation have been shipped
life will become | success belongs to
nowadays online shopping is becoming | the world will become
no. # is this available in white | social practice is playing
social practice can offer | love is based on
library is a place | practice is more important
college is a place | the world is becoming
olympic games will be held in | online shopping is becoming more and
part-time job can help | social practice may bring
shopping on the internet also has its | school is located in
the dragon boat festival is one of | the earth is becoming
measures must be taken | reading is more important
the government should strengthen | the internet has become
government should take measures to | with the time goes by
reading like other activities brings unique | honesty is the best
technology has brought | government should establish
books can make us | government needs to
courses will start at | reading can enrich
life is different from | internet can provide
frustration is a part of | some waste can be degraded while others
online shopping has made | love makes the world
low-carbon lifestyle has become | saving money is a good
life is bound up with three | my opinion is that
study is the most important | education plays an important role in
measures have been taken | competition is a common
(6). Impersonal (demonstrative pronoun subject) CSSs:
when it comes to | this means that
that is why | it turned out
it also brings | it does not mean
this is because | it will lead to
but it doesn't mean | it can teach us
it depends on | it will affect
that is a question | this is why
that is the reason why | it has aroused
this is the first | it is called
it (will result/results) in | it does harm to

References

  • 1. Pawley A., Syder H. In: Language and Communication. Richard J.C., Schmidt R.W., editors. Longman; New York: 1983. Two puzzles for linguistic theory: nativelike selection and nativelike fluency; pp. 191–225.
  • 2. Granger S., Paquot M. In: Phraseology: an Interdisciplinary Perspective. Granger S., Meunier F., editors. John Benjamins; Amsterdam/Philadelphia: 2008. Disentangling the phraseological web; pp. 27–49.
  • 3. Flowerdew J., Li Y. Language Re-use among Chinese apprentice scientists writing for publication. Appl. Ling. 2007;28(3):440–465.
  • 4. Simpson-Vlach R., Ellis N.C. An academic formulas list: new methods in phraseology research. Applied linguistics. 2010;31(4):487–512.
  • 5. Hammond K. “I need it now!” Developing a formulaic frame phrasebank for a specific writing assessment: student perceptions and recommendations. J. Engl. Acad. Purp. 2017:1–8. Available online 15 December 2017.
  • 6. Li J., Pang Y. Characteristic sentence stems in academic texts: distributions of their patterns and functions. Foreign Language Learning Theory and Practice. 2021;(1):25–36.
  • 7. Alvarez L., Capitelli S., Valdés G. Beyond sentence frames: scaffolding emergent multilingual students' participation in science discourse. TESOL J. 2023;14(3):1–19.
  • 8. Hyland K. As can be seen: lexical bundles and disciplinary variation. Engl. Specif. Purp. 2008;27(1):4–21.
  • 9. Zhang L., Su H. Applying local grammars in EAP teaching. J. Engl. Acad. Purp. 2021;51
  • 10. Gisle A. Phraseology in a cross-linguistic perspective: a diachronic and corpus-based account. Corpus Linguist. Linguistic Theory. 2022;18(2):365–389.
  • 11. Wang Z., Wu X. A corpus-based study on chunk-explicitation in interpreting: a case study of Chinese leaders' speeches under the COVID-19 pandemic. International Journal of English Language Studies. 2023;5(4):45–59.
  • 12. Rodriguez-Mojica C., Rutherford-Quach S. In: Equity in Multilingual Schools and Communities: Celebrating the Contributions of Guadalupe Valdés. Kibler A., Walqui A., Bunch G., Faltis C., editors. Multilingual Matters; Bristol, Blue Ridge Summit: 2024. Curricularizing Language: examining underlying assumptions in classroom practice; pp. 148–159.
  • 13. Altenberg B. In: Phraseology: Theory, Analysis, and Applications. Cowie A.P., editor. Clarendon Press; Oxford: 1998. On the phraseology of spoken English: the evidence of recurrent word-combinations; pp. 101–122.
  • 14. Moon R. Oxford University Press; Oxford & New York: 1998. Fixed Expressions and Idioms in English.
  • 15. Su Q., Gu C., Liu P. Association measures for collocation extraction: automatic evaluation on a large-scale corpus. Int. J. Corpus Linguist. 2024;29(1):59–86.
  • 16. Pecina P. Lexical association measures and collocation extraction. Comput. Humanit. 2010;44(1–2):137–158.
  • 17. Church K., Hanks P. Word association norms, mutual information, and lexicography. Computational linguistics. 1990;16(1):22–29.
  • 18. Dice L.R. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.
  • 19. Smadja F., McKeown K.R., Hatzivassiloglou V. Translating collocations for bilingual lexicons: a statistical approach. Computational linguistics. 1996;22:1–38.
  • 20. Gries S.T. 50-something years of work on collocations: what is or should be next…. Int. J. Corpus Linguist. 2013;18(1):137–166.
  • 21. Dunn J. Multi-unit association measures: moving beyond pairs of words. Int. J. Corpus Linguist. 2018;23(2):183–215.
  • 22. Schmid H.J. From Corpus to Cognition. Mouton de Gruyter; Berlin/New York: 2000. English abstract nouns as conceptual shells.
  • 23. Schmid H.J., Küchenhoff H. Collostructional analysis and other ways of measuring lexicogrammatical attraction: theoretical premises, practical problems and cognitive underpinnings. Cognit. Ling. 2013;24(3):531–577.
  • 24. Ellis N.C., Ferreira-Junior F. Construction learning as a function of frequency, frequency distribution, and function. Mod. Lang. J. 2009;93(3):370–385.
  • 25. Wei N., Li J. A new computing method for extracting contiguous phraseological sequences from academic text corpora. Int. J. Corpus Linguist. 2013;18(4):506–535.
  • 26. Gries S.T. What do (some of) our association measures measure (most)? Association? Journal of Second Language Studies. 2022;5(1):1–33.
  • 27. Gries S.T. John Benjamins; 2024. Frequency, Dispersion, Association, and Keyness: Revising and Tupleizing Corpus-Linguistic Measures.
  • 28. Lai R.K.Y. Why we need asymmetric measures to classify multi-word expressions: the case of Tibetan light verb constructions. Proceedings of the Society for Computation in Linguistics (SCiL) 2024:302–306.
  • 29. Yi W., Man K., Maie R. Investigating first and second language speaker intuitions of phrasal frequency and association strength of multiword sequences. Lang. Learn. 2023;73(1):266–300.
  • 30. Li J., Wei N. A study of functional sentence stems in academic English texts: their extraction method and frequency distributions. Foreign Lang. Teach. Res. 2017;49(2):202–214.
  • 31. da Silva J., Lopes G. In: Proceedings of the 6th Meeting on the Mathematics of Language. Rogers J., Moss L., editors. Kluwer; Dordrecht: 1999. A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora; pp. 369–381.
  • 32. Shimohata S., Sugio T., Nagata J. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics. Cohen P., Wahlster W., editors. Association for Computational Linguistics; Stroudsburg, PA: 1997. Retrieving collocations by co-occurrences and word order constraints; pp. 476–481.
  • 33. Jiang M., Zhang Q., Chen Y., Chang B. Chinese multi-word chunks extraction for computer aided translation. J. Chin. Inf. Process. 2007;21(1):9–16.
  • 34. McEnery A., Xiao R., Tono Y. Routledge; London: 2006. Corpus-based Language Studies: an Advanced Resource Book.
  • 35. Jiang M., Myaeng S., Park S. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics. 2007. Using mutual information to resolve query translation ambiguities and query term weighting; pp. 223–229.
  • 36. Andrews R. Cassell; London, NY: 1995. Teaching and Learning Argument.
  • 37. Wingate U. ‘Argument!’ helping students understand what essay writing is about. J. Engl. Acad. Purp. 2012;11(2):145–154.
  • 38. Groom N. In: Learning to Argue in Higher Education. Mitchell S., Andrews R., editors. Portsmouth: Boynton/Cook Heinemann; 2000. A workable balance: self and source in argumentative writing; pp. 65–73.
  • 39. Millar N. The processing of malformed formulaic language. Appl. Ling. 2011;32(2):129–148.
  • 40. Ellis N.C. In: Cognition and Second Language Instruction. Robinson P., editor. Cambridge University Press; Cambridge: 2001. Memory for language; pp. 33–68.
  • 41. Lee D., Swales J. A corpus-based EAP course for NNS doctoral students: moving from available specialized corpora to self-compiled corpora. Engl. Specif. Purp. 2006;25(1):56–75.



Articles from Heliyon are provided here courtesy of Elsevier
