A diachronic study determining syntactic and semantic features of Urdu-English neural machine translation

Tamkeen Zehra Shah; Muhammad Imran; Sayed M Ismail

2023 Nov 29;10(1):e22883. doi: 10.1016/j.heliyon.2023.e22883

PMCID: PMC10754703 PMID: 38163205

Abstract

Machine translation produces marginal accuracy rates for low-resource languages, but its deep learning model is expected to yield improved accuracy with time. This longitudinal study investigates how Google Translate's Urdu-to-English translated output has evolved between 2018 and 2021. Accuracy and acceptability of the translations have been determined by (a) an interlinear gloss that identifies core semantic units and grammatical functions to be translated and (b) a descriptive comparison of the translated text's syntactic and semantic properties with those of the source text. Overall, despite a 50% error rate that persists over the three-year interval, the research reports significant improvement in the overall intelligibility of the translations, in contrast to initial results from 2018, which exhibited rampant non-localized errors. Working backwards from instances of errors to morphosyntactic and semantic patterns underlying them, the study concludes that the pro-drop feature of Urdu, Urdu's case-marking system, identification of clause boundaries, polysemous terms, and orthographically similar words pose the greatest difficulty in neural machine translation. These results point to the need for incorporating syntactic information in training data.

Keywords: Neural machine translation, Urdu, Low-resource language, Google translate, Interlinear gloss, Comparative syntax

1. Introduction

Translation between Urdu and English has significant utility for bilingual users in the South Asian region encompassing Pakistan, India, Nepal, and Bangladesh [1], as well as the language diaspora comprising heritage speakers in other parts of the world. Despite the functional proficiency of most bilinguals, actual linguistic competence tends to vary in the literary or written genres of the two languages, with few users being able to read and write at an advanced level in both English and Urdu [2]. This can become a hindrance in certain domains of work, such as the legal [3], medical [4,5], educational [6,7], mass media, and development sectors, which require skills of being able to synthesize resources in English as well as Urdu. Users must resort to translation services where language skills prove inadequate in dealing with literary language or highly specialized registers. An emerging option is to generate translations online through Google Translate.

Although the use of Google Translate is common in work spheres, the convenience it offers may come at the cost of accuracy. A recent study has raised concern over American doctors and medical staff using Google Translate in emergencies, in view of the fact that translations into minor languages such as Farsi have accuracy rates as low as 67% [8]. This finding is corroborated by another study in which Urdu ranked among 11 languages that resulted in ‘failed’ translations [9]. Urdu classifies as a low-resource language that does not have extensive source texts and parallel translations available as training data. A “performance degradation” is therefore observed when translating into “low resource languages such as Urdu and Hindi” [10]. For low-resource languages, in general, the quality of translation is found to be compromised [11]. Google Translate is more effective in translating standalone lexical items and is known to yield gross errors where more extensive text sequences are involved.

However, there is promise of improvement. In 2016, Google adopted a new Neural Machine Translation system that runs on deep learning technology, yielding predictive results based on correspondences mapped between source texts and sample translations. The deep learning algorithms allow the translation engine to evolve incrementally by training on an ever-expanding repository of data sets. As such, translations are supposed to exhibit improved accuracy over time. Anticipating a fully functional multilingual repertoire in the near future, Google has hailed the new NMT system as one that approaches human-level quality in translation [12].

In theory, such high-quality translation is possible for resource-rich languages. The claim needs to be tested out in the case of low-resource languages to determine whether the translated output is indeed viable for practical purposes. Six years after the NMT breakthrough, there are still reports of a high incidence of translations that select incorrect meaning representations for polysemantic words, showing that “NMT engines are currently unable to take into account semantic contextual information” [13]. Other problem areas identified in NMT are biased associations based on gender, non-recognition of a dialect where some standard variety dominates, distortion of tone, politeness/formality, and culturally appropriate forms, inconsistent paraphrasing of standard terminologies, omission of information, addition of extraneous content, and the system's superficial grammatical fluency which serves to mask erroneous translations [13]. Moreover, even for adequately resourced languages, certain domains and language registers might not lend themselves to adequate translation owing to the scarcity of domain-specific training data [14]. Thus, the prospect of human-like translation is, at minimum, an ambitious one, given the limitations of training data on one hand and the complexity that translation tasks must process at the various layers of language structure on the other.

2. Challenges for translation

Translation seeks to match the syntactical and semantic elements of a source language with equivalent structures in the target language so that meaning is preserved at the pragmatic level and the overall sense constructed by the message is roughly parallel in the two languages. The translation is, at best, a close approximation of the intended meaning – human translators have long recognized that exact translation equivalents between two languages do not exist. There can be considerable change in the form of the text after translation, and meaning is also subject to distortion. It is, therefore, impossible to produce one standard version as the prototypical translation for a given text. Out of the multiple translations that could be generated, some would be rated better than others based on how well they optimize crucial aspects of meaning construction.

The semantic and structural problems in obtaining an optimal translation are manifold. Generally, sentences which can be resolved into compositional meaning are relatively easier to translate, while formulaic sequences such as idioms constitute the greatest challenge to translation [10,15,16,17]. Formulaic expressions in a given language have roots in the culture and traditions that have shaped the language over time. They usually do not find any accurate comparison in an alternate language - the concept they express is culturally specific and usually comprehensible only to those familiar with the speech community's norms of interpretation. Therefore, they fail to convey meaning when translated word for word. A problem similarly surfaces for polysemous words; it is only by establishing meaning at the discourse level that proper meaning can be assigned to the individual word [18,19]. Adding to these complications, discourse structure itself is a vital component of translation that needs to be processed. Decoding and properly representing discourse structure is vital to interpreting other complex meaning structures that it might scaffold, such as arguments, explanations, causal relations, etc.

Similarly, figures of speech or stylistic devices found in literary texts may also prove intractable. The translator's task is to adequately capture the metaphorical signification of tropes, as well as meaning created through the manipulation and juxtaposition of the structural elements of the sentence – i.e., schemes. Since translation aims to achieve a similarity between the source text and the target text, where possible, effort is made to maintain the stylistic effect of the source text. For the same reason, more marked words are ideally replaced with similar jargon instead of common words, and the transformation of clause structures may take into consideration their length as well as placement.

2.1. Where (exactly) does NMT stand?

A translation task is complex enough when undertaken by expert human translators. It is easy to see why it may prove problematic for machine translation, especially in the low-resource scenario. Firstly, the quality of translation in neural configurations depends on the quality of training data available and the degree of correspondence between the source texts and translated texts. Secondly, there is the added limitation that such paired texts are sparse, to begin with, for under-resourced languages. This makes neural machine translation a discipline with immense research potential. There is a need for ongoing developments in the field to be accompanied by empirical research focusing on the accuracy of low-resource translation, so as to determine the true efficacy of new algorithms and techniques.

Based on the rationale above, this study examines the progress of NMT in the self-rectification of errors when translating out of a low-resource language such as Urdu. It specifically analyzes Urdu-English translation queries executed in 2018 and 2021 to compare them for their syntactic and semantic appropriacy, while also abstracting the language structure patterns which underlie the various instances of errors. Such an analysis aims to gain a window into potential areas where NMT could benefit from alternative, more sensitized natural language processing models.

3. Literature review

The Google Translate service available online can translate written text from 133 languages into English and vice versa, as well as translate to and from any of these languages. Between 2018 and 2022, twenty-four new languages were added to Google's repertoire. According to the Google AI Blog, the web application works not just with individual sentences but entire sections and webpages, in theory being able to translate at the discourse level [20].

At the time of its launch in 2006, the service deployed statistical machine translation (SMT), a procedure using a bilingual corpus of translations by human interpreters to draw up statistical probabilities of two text features being translation equivalents. After 2017, Google adopted a new Neural Machine Translation (NMT) system, which ‘learns’ the structural rules of the target language by training on millions of source documents instead of trying to map the structure of the source text into the target language. Google Translate's learning function is supposed to develop over time to generate more natural translations. A recent feature is the development of a Transformer algorithm that deploys a separate encoder and decoder module. Such a configuration is particularly adept at predicting a target word from the context. The algorithm hinges on “a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position” [21].
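The self-attention idea quoted above can be illustrated with a minimal sketch. It is not Google's implementation: the three two-dimensional word vectors are hypothetical, and the learned query, key, and value projections of a real Transformer are replaced here by the identity, purely to show that each position weighs every other position irrespective of distance.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (an assumption).

    Each output vector is a weighted average of ALL input vectors, so every
    position can draw on every other position regardless of how far apart they are.
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Similarity of this word to every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Hypothetical 2-dimensional embeddings for a three-word sentence.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_weights, contextual_vectors = self_attention(toy_embeddings)
```

In this toy case the first word attends equally to itself and to the third word, and less to the second, even though the third word is the farthest away; position plays no role in the weights.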

Nevertheless, there are concerns that Google Translate is operating at inadequate accuracy levels for low-resource languages such as Urdu, which offer limited data repositories for the machine-learning system to train on. The Google AI Blog reports that “especially for low-resource languages, automatic translation quality is far from perfect” [20].

Coincidentally, research on the efficacy of Google Translate for low-resource languages has been scarce, with much of the extant literature focusing on high-resource languages [12]. One recent work on Urdu-English neural machine translation selects Urdu idioms and colloquial words and phrases as input. It reports errors in the output, which are explained by Urdu being a comparatively under-resourced and under-represented language in the electronic medium [22]. The present study differs from the research mentioned above in its longitudinal perspective on accuracy and in working with formal registers rather than colloquial ones.

The correlation between the scarcity of parallel corpora for a given language and the lack of NMT research for the specific language is well-established [11] (see graph, p.23). Moreover, empirical studies pertaining to low-resource languages and translation accuracy have mostly been conducted on languages other than Urdu, and the majority are dated prior to the introduction of NMT technology. A selective coverage of these empirical works follows:

When Google still used SMT procedures, an investigation into errors produced in translating between Persian and English concluded that the highest number of errors occurred in the lexicosemantic category [23]. Also, there were more errors when translating from English to Persian than the other way around. Another inquiry focusing on errors in English-to-Spanish machine translation discovered that the complex system of tense and gender marking in Spanish contributed to the high error rate [24]. The errors were attributed to “the relatively long sentences” in the source text, which created a hindrance in matching the verb with the corresponding subject information, resulting in an incorrect form of the verb [24].

A later work analyzed errors produced by Google Translate when translating from English to Portuguese [25]. The study was similar in methodology to the present research, as it also aimed at uncovering language-specific loopholes and weaknesses in the translation algorithm. It concluded that misinterpretation of lexical items or their complete omission, along with syntactical errors, were the predominant faults of the Google system. A similar effort documented errors at the lexical level while translating an airline's “terms and conditions” website from English to Thai [26]. It attributed them to “non-equivalence between the source (English) and the target language (Thai) leading to choosing the wrong alternate meaning of a word or using the wrong part of speech.” Notably, this research also found that “the same phrase could be mistranslated twice” to cater to both possible meanings. Both of these studies were from Google's pre-NMT era.

Issues such as the above are precisely what neural configurations sought to resolve through the deep learning model. Although it has been found that NMT results for Urdu are an improvement over the earlier Phrase-based Statistical Machine Translation (PBSMT) [27], problems persist at two levels: semantics and syntax. Lexicosemantic errors remain a challenge in Urdu-English NMT, for which transliteration of uncommon/unrecognized words is proposed as a technique to improve translation quality [28]. The other cause of low accuracy has been identified as the non-linearity of the attention mechanism in neural networks, which arises for languages written from right to left, such as Urdu [29,30]. The attention mechanism is supposed to predict which word in the parallel text maps to a given word in the source text, but a reversed text direction in the parallel text, coupled with a different word order (SOV for Urdu), violates linearity and creates incorrect mappings during the learning phase of the algorithm, leading to faulty translation [31,32].

For Urdu-English interlingual translations, Google Translate would have to consider Urdu's language-specific features. Although Urdu is an SOV language, its word order is fluid and moveable, as grammatical relations such as subject and object are determined by case markers rather than by the relative positions of nouns in the sentence [32]. Furthermore, Urdu is a head-last language, and as such, is structurally opposite to English. It also differs from English in its pronominal system. Considering these structural differences in addition to the obvious lexico-semantic dissimilarities between Urdu and English, accurate and acceptable Urdu-to-English translation outputs, if achieved, would be a significant accomplishment in the field of machine translation.

4. Methodology

As a descriptive effort designed to investigate the errors produced by Google Translate, the research is predominantly qualitative. It analyzes the types of errors yielded and seeks to offer explanations for them in terms of the syntactical and semantic differences between the two languages that the translation algorithm might have failed to resolve.

The study uses a longitudinal research design to compare translation accuracy between 2018 and 2021.

4.1. Research questions

1. What kinds of errors arise when different samples of Urdu text are translated into English using Google Translate?
2. How do these errors relate to syntactical differences between Urdu and English?
3. What errors can be attributed to semantic complexity, such as polysemy and formulaicity?
4. How has Google Translate's accuracy improved over three years for Urdu-to-English translations?

4.2. Data sources

The source language in this research is Urdu, whereas the target language is English.

The source text in Urdu was obtained from an online news portal, Geo News in Urdu (urdu.geo.tv). The language of news reporting was thought to constitute an optimal input to test machine translation - for two reasons. Firstly, the language of news employs a range of syntactical structures to condense large amounts of information into a single sentence. A wide range of structures would give the engine different test-cases to work with. Secondly, it can be expected that the language of news reports would be less idiomatic than the language of literary passages, as it needs to be concise and clear to be informative. Since machine translation runs into difficulties with idioms, it was thought best to avoid these in the input.

4.3. Procedure

The Urdu source text was pasted into the search field available on Google Translate (translate.google.com) after choosing the target language as English. An interlinear gloss was also prepared for each Urdu sentence, accompanied by a faithful translation (in italics) meant to capture the structural essence of the original text.

Twelve excerpts were translated from Urdu into English for the analysis. Of these, six translations pertained to 2018; a further six were later translated and analyzed in 2021. Some excerpts chosen were syntactically simpler and shorter than others, consisting of one main clause. Such a variety was included to check whether syntactic complexity affected translation accuracy.

The screenshots of the results for each translation query were taken and are pasted below for analysis.

5. Analysis and discussion

5.1. Analysis of source text (Fig. 1)

Thailand = ke wazir-azam = ne sahafion = ke tund-o-tez sawalat = se

Thailand = GEN minister-prime = ERG journalist:PL = GEN sharp-CONJ-quick question:PL = ABL

bach-nay = ka anokha tareeqa apnaa-ya jisey

avoid-INF = GEN peculiar.M.SG method.M.SG adopt-PRF.M.SG which-ACC

insaani haqooq = ki tanzeem shadeed tanqeed = ka

human:ADJ right:PL = GEN organization.F.SG extreme criticism = GEN

nishana banaa-rahi hai.

target.M make.PRS.PROG.F.SG be.PRS.3SG

Thailand's Prime Minister adopted a peculiar means of avoiding the journalists' provocative questions, which a human rights organization is criticizing severely.

The Urdu source text consists of a main clause followed by a relative clause (beginning with ‘jisey’). This relative clause is in active form, with “insaani haqooq = ki tanzeem” as the grammatical subject of the clause. The sentence begins with a series of possessive phrases indicated by the case marker “ke/ka.”
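The pairing of transliterated words with their gloss labels can be sketched in a few lines. This is only an illustration of how the gloss above is read, not a tool used in the study; the function name `align_gloss` is hypothetical, and the sketch assumes that a standalone "=" marks a clitic boundary and that source and gloss lines contain the same number of units.

```python
def align_gloss(source_line, gloss_line):
    """Pair each transliterated token with its gloss label.

    A standalone "=" is treated as a clitic boundary, so "Thailand = ke"
    yields two aligned units: the host word and its case clitic.
    """
    source_tokens = [t for t in source_line.split() if t != "="]
    gloss_tokens = [t for t in gloss_line.split() if t != "="]
    if len(source_tokens) != len(gloss_tokens):
        raise ValueError("source and gloss lines have different lengths")
    return list(zip(source_tokens, gloss_tokens))

# First line of the gloss in Section 5.1, copied as printed.
source = "Thailand = ke wazir-azam = ne sahafion = ke tund-o-tez sawalat = se"
gloss = ("Thailand = GEN minister-prime = ERG journalist:PL = GEN "
         "sharp-CONJ-quick question:PL = ABL")
pairs = align_gloss(source, gloss)

for word, label in pairs:
    print(f"{word:12} {label}")
```

Reading the pairs in order makes the chain of genitive markers ("ke" glossed as GEN) and the ergative subject marker ("ne" glossed as ERG) easy to pick out.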

Analysis of translated text

Semantics: The overall meaning of the translation is the same as that of the source text. However, “tund-o-tez” would have more appropriately translated into “provocative questions” instead of “sharp questions.” We see here an instance of literal translation, which the engine must resort to since “tez” is a polysemous word of Urdu that can have different meanings in different contexts. Apparently, machine translation cannot determine the intended meaning of a polysemous word.

Syntax: The translated text is syntactically accurate. The genitive phrase “Thailand ke wazir-e-azam” has been appropriately converted into “The Thai Prime Minister.” The relative clause was in active form in Urdu, but this has been converted into the passive construction (“which is being criticized by a human rights organization”). Despite this, both texts achieve the same overall effect.

5.2. Analysis of source text (Fig. 2)

Unhon = ne kaha ke yeh-i wajah hai ke

PRN:3.PL = ERG say:PRF that this-FOC reason be.PRS.SG that

bharat riyasati dehshat_gardi = ke zariyay hamein nishaana bana-raha

India state.ADJ terrorism = GEN means[INS] we.PL.ACC target.M make-PRS.PROG.M

hai laikin wo riyasati dehshat-gardi = mein bhi

be.PRS.SG but 3DIST state.ADJ terrorism-in = LOC also

na-kaam ho-chuka hai.

fail.ADJ be-PRF be.PRS.

They/He/She said that this is the reason why India is making us a target of state terrorism, but it has failed in state terrorism also.

The Urdu sentence consists of a main clause containing a complement clause introduced by the complementizer “ke.” The complement clause consists of two independent clauses coordinated with the conjunction “laikin.”

Analysis of translated text

Semantics: Although it is not clear whether the subject of the Urdu sentence is a plural entity or whether the plural pronoun “unhon” has been used to signal deference, the translation construes it as a singular masculine pronoun “He,” which is the more probable case. Such indeterminacy is inevitable as Urdu does not mark gender in the third-person pronoun, and the differentiation between plurality or deference can only be made on account of the context, which is absent for this decontextualized sentence.

This implies that the way Urdu pronouns encapsulate information about number/proximity/deference/absence of gender causes difficulty in translation, as the same information is not contained in English pronouns.

Besides this issue, the rest of the sentence has been translated appropriately into English.

Syntax: Both the source text and target text follow the same syntactical placement of clauses, with the complement clause coming first, followed by the first embedded independent clause, and then the second.

5.3. Analysis of source text (Fig. 3)

Khayaal rahay ke 2014 = mein Thailand = mein fauj = ne

Consideration keep:PRS.IMP that 2014 = LOC Thailand = LOC army = ERG

muntakhib hakumat = ka takhta ulat kar iktedar = par qabza

elected government = GEN rule overturn.PRF do.PFV power = LOC control

kar-liya tha jiss = ke baad = se wahan media samet

do-PRF be.PST which = DAT after = ABL over-there media including

mukhtalif shobon = mein paabandiyan barha di-gai

various domain:PL = in.LOC restriction:PL increase-CAUS do.PASS.PRF.F

hain.

be.PRS.PL.

Keep in consideration that in Thailand, in 2014, the army overthrew the elected government to take hold of power, after which restrictions on various domains, including the media, have been increased.

The sentence starts with a complement clause, which contains a relative clause starting with “jiss ke baad.” There are three locative phrases indicated by the locative case marker “mein” (2014 mein, Thailand mein and shaubon mein), one locative phrase marked by “par” (iktedaar par), and one temporal locative phrase marked by “ke” (jiss ke baad). There is also an ergative case (fauj ne) and a possessive case (hukumat ka takhta).
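The inventory of case markers described above can be checked mechanically against the transliteration. The following is a minimal sketch, assuming (as in the gloss) that every case clitic is the token printed immediately after a standalone "="; the function name `count_clitics` is hypothetical.

```python
from collections import Counter

def count_clitics(transliteration):
    """Count the tokens that directly follow a standalone "=" sign."""
    tokens = transliteration.split()
    clitics = [
        tokens[i + 1].strip(".,")
        for i, tok in enumerate(tokens[:-1])
        if tok == "="
    ]
    return Counter(clitics)

# Transliteration lines of Section 5.3 joined into one string, as printed.
sentence = (
    "Khayaal rahay ke 2014 = mein Thailand = mein fauj = ne "
    "muntakhib hakumat = ka takhta ulat kar iktedar = par qabza "
    "kar-liya tha jiss = ke baad = se wahan media samet "
    "mukhtalif shobon = mein paabandiyan barha di-gai hain."
)
clitic_counts = count_clitics(sentence)
print(clitic_counts)
```

The count gives three instances of "mein" and one each of "ne," "ka," "par," "ke," and "se." The complementizer "ke" after "rahay" is not counted because it is not written as a clitic, and the ablative "se" following "baad" is picked up in addition to the markers listed in the paragraph.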

Analysis of translated text

Semantics: The translated sentence fails to make sense, primarily due to the lack of coherent structure. The engine is also unable to process the idiom “hakumat ka takhta ulat kar …” and this has been merely simplified to “occupied the government” instead of stating something more specific such as “overthrew the government.”

Syntax: The syntactical complexity of the Urdu sentence resulting from case markings has served to confuse the parser of the translation engine, resulting in an incorrect analysis of syntactic structures. The complications seem to arise due to the locative case marker “mein,” which comes before and also immediately after “Thailand.” Initially, the first grammatical relation, “2014 mein,” is correctly translated as “in 2014.” However, noticing the same locative again after “Thailand,” the parser has tried decoding the nested phrase “Thailand mein fauj” first. It has appended its translation (the army in Thailand) to the beginning of the sentence by rereading “mein” in “2014 mein” and translating it again. Hence the faulty structure we see (… in 2014 in the army in Thailand …).

After dealing with the nested phrase “Thailand mein fauj,” the parser notices that “fauj” is also the subject of the ergative case (fauj ne). Therefore, this relation is separately constructed to read “… in 2014 in the army in Thailand, the army occupied …” creating a repetition for “army” when not required.

Another interesting error is that the parser has not been able to express “jiss ke baad se” as a temporal/time connective. It has construed it as a causal relation expressed by “since” (since there have been restrictions …), whereas the correct translation should have been “and since then” or “since which.”

Also, the translation algorithm has no capacity to deal with the marker “kar” used in the expression “takhta ulat kar iktedaar pe …”. The word “kar” used in this sense merely indicates the progression of action, but this has been translated into “by” to wrongly show instrumentality (the army occupied the government by occupying power).

5.4. Analysis of source text (Fig. 4)

Amreeki saddar Donald Trump = ne yun_tau saal-bhar bohot-si

American president Donald Trump = ERG though_generally year-full many-PART

aisi harkatein keen jin = par unhein

such acts do.PRF which.PL.DAT = upon.LOC 3.HON.OBL

shadeed tanqeed = ka nishana banaya-gya, taham yahan un = ke

severe criticism = GEN target make.PRS-PASS.PRF yet here his = GEN.PL

chand cheda cheda mutanazay faisalon, bayanaat aur takrao = ka

a_few selected controversial.PL decision:PL.DAT statement.PL and clash = DAT

zikr kiya-jai-ga.

mention do PASS-FUT.

American President Donald Trump has, although generally, committed many acts throughout the year that have made him the target of severe criticism, a few of his controversial decisions will be mentioned here.

The Urdu text opens with a subordinate clause followed by a relative clause in the passive form. This subordinate construction is linked with the independent clause through the conjunction “taham.”

Analysis of translated text

Semantics: Again, the translation yields gross errors. There are repetitions that prevent the sentence from making sense. First of all, the engine seems unable to process the word “harkatein”, so this has been completely ignored. With “harkatein” cast aside, the parser had to interpret “boht si aisi harkatein” in some way; it has taken the prenominal determiner “aisi” to be a pronoun referring back to “saal.” Hence the resultant structure “so many years” in place of “boht si aisi.” As a result, the translation for “saal” erroneously appears twice (… Donald Trump has done so many years throughout the year).

In the relative clause (jin par unhein shadeed tanqeed ka nishana banaya gya), the parser has processed “tanqeed” in two ways and provided both results, whereas only one was required. It has converted it into the verb form “was criticized for” as well as the noun form “criticism.” Thus we have the redundancy “which he was criticized for severely criticism.”

Then, the expression “cheda cheda” also has not been translated correctly, perhaps because the lexical database does not have this expression as an entry. It is a reduplicated word which the translation engine does not cater to. So instead, the engine has retrieved the closest possible match in its Urdu lexical store, “chela,” which it has translated as “disciple.” Such a construal has mistakenly made the supposed “disciples” the possessor of the genitive case instead of the President.
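The suspected substitution of “chela” for “cheda” can be illustrated with a standard edit-distance lookup. This is a sketch of the hypothesis stated above, not a description of how Google Translate actually retrieves words: the romanized spellings stand in for Urdu script, and the mini-lexicon is hypothetical, built from words that appear in this section.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def nearest_entry(word, lexicon):
    """Return the lexicon entry with the smallest edit distance to word."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))

# Hypothetical mini-lexicon of romanized forms; it deliberately lacks
# "cheda", mirroring the gap suspected in the engine's lexical store.
toy_lexicon = ["chela", "chand", "zikr", "faisalon"]
closest = nearest_entry("cheda", toy_lexicon)
print(closest, edit_distance("cheda", closest))
```

With “cheda” absent from the lexicon, the nearest entry is “chela,” one substitution away, which is consistent with the “disciple” rendering described above.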

Syntax: Errors in syntax result from the semantic deficiency of the engine. Since several words were not ‘understood’ by the engine, it resorted to other near possibilities in the language or searched within the sentence to assign existing nouns in the sentence to other pronouns. This caused a complete distortion in meaning.

5.5. Analysis of source text (Fig. 5)

Supreme court Karachi registry = mein Chief Justice Pakistan = ki

Supreme court Karachi registry = in.LOC Chief Justice Pakistan = GEN

sarbarahi = mein teen-rukani bench = ne Shahzeb qatal case = par

headship = in.LOC three-member[ADJ] bench = ERG Shahzeb murder case = on.LOC

Sindh High Court = ke faislay = ke khilaaf civil society = ki appeal suni.

Sindh High Court = GEN decision = DAT against civil society = GEN appeal hear.PST.SG.F.

In the Supreme Court Karachi Registry, the three-member bench headed by the Chief Justice of Pakistan heard the civil society appeal against the Sindh High Court's decision on the Shahzeb murder case.

The source text consists of one lengthy main clause, which includes genitive, ergative, and locative cases.

Analysis of translated text

Semantics: The parser has not been able to identify or process “Sindh High Court” and, ignoring it completely, has substituted it with a random noun “Shahbaz” to stand in as a possessor for the genitive phrase Sindh High Court ke faislay (Sindh High Court's decision), resulting in the fabrication “Shahbaz's decision.” Nowhere is the word “Shahbaz” to be seen in the source text. Other than this, the rest of the sentence has been translated correctly and effectively.

Syntax: There are no syntactical errors in the target sentence. All the case phrases have been processed correctly and given an accurate placement in the output.

5.6. Analysis of source text (Fig. 6)

Richard Olsen = ne amreeka = ki janib = se Pakistan = ki

Richard Olsen = ERG America = GEN behalf = ABL Pakistan = GEN

askari imdaad band karnay aur nai policy = par new york times = mein

military assistance stop do.INF and new policy = on.LOC New York Times = in.LOC

aik article likha jiss = mein un = ka kehna hai ke trump hakumat = ki

an article write.PRF which = in.LOC he = GEN say.ACNNR be.PRS that Trump administration = GEN

janib = se Pakistan = ko be-izzat karnay = ki policy ziada arsa nahin chalay-gi.

behalf = ABL Pakistan = DAT dis-honour do.INF = GEN policy long period no go_on-F.FUT.

Richard Olsen has written an article in the New York Times on America's cessation of military assistance to Pakistan and the new policy, in which he has stated that the Trump administration's policy to discredit Pakistan will not work for long.

The Urdu text contains a main clause, a relative clause signaled by “jiss mein” and a complement clause nested inside the relative clause.

Analysis of translated text

Semantics: The translated sentence does not convey meaning as it carries syntactical errors. The Urdu word “imdaad” has been omitted from the translation as the parser failed to recognize it. Therefore, the expression “askari imdaad band karnay” has been mistakenly translated as “military closure” instead of “closure of military assistance.”

The Urdu form of “New York Times” has also not been recognized by the parser and has resultantly been ignored.

Furthermore, the formulaic expression “nahin chalay gi” could also not be interpreted by the engine. This phrase has also been omitted from the translation process.

Syntax: Failing to translate the formulaic sequence “nahin chalay gi” has resulted in a syntactically inaccurate rendition of the relative clause, as the verb “he termed” is lacking a complement (unsustainable? untenable?).

Also, Urdu case phrases have thrown the parser off track once again. In the phrase askari imdaad band karnay aur nai policy par, the engine first construed the grammatical relation par (on) as having scope over both imdaad band karnay and nai policy. It then also translated par separately with nai policy alone (nai policy par). This results in the redundant and inaccurate translation “Richard Olsen wrote an article on the US military closure of Pakistan and new policy on the new policy.”
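The scope problem described above can be illustrated with a minimal sketch. It is a toy rendering of the two scope readings of par, not a description of how Google Translate works internally; the function name `attach` and the English phrasings are assumptions made for illustration.

```python
def attach(marker, phrases):
    """Render one scope reading: the marker governs every phrase in `phrases`."""
    return f"{marker} " + " and ".join(phrases)

# Hypothetical English renderings of the two conjuncts governed by "par" (on)
conjuncts = ["the cessation of military assistance", "the new policy"]

wide = attach("on", conjuncts)         # par scopes over both conjuncts
narrow = attach("on", conjuncts[-1:])  # par scopes over the last conjunct only

# Emitting a single reading gives a well-formed phrase
one_reading = f"wrote an article {wide}"
# Emitting both readings reproduces the kind of redundancy seen in the output
both_readings = f"wrote an article {wide} {narrow}"

print(one_reading)
print(both_readings)
```

Choosing one reading per case marker avoids the repeated “on the new policy” that appears when both readings are emitted.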

5.7. Analysis of source text (Fig. 7)

Zarai = ke mutabiq yeh khorakein namunasib darja-hararat

Sources-DAT according these doses unsuitable degree-heat

aur muqarara waqt = par na lagnay = ke baais zaya huein,

and appointed time = LOC not apply.PASS.INF = DAT cause waste be.PRS.PL.F.DAT

iss-mein har brand = ki vaccine shamil hai.

this = in.LOC every brand = GEN vaccine include.ADJ be.PRS.SG.

According to sources, these doses were wasted because of not being administered at the right temperature and within the specified time; this includes vaccines of every brand.

The source text consists of two independent clauses conjoined with a comma. The subject of the first clause (khorakain) refers to vaccines, and this referent is more evidently stated in the second clause.

Analysis of translated text

Semantics: The translation is unable to work out from the general context that the Urdu word khorakain refers to ‘doses’ of vaccines rather than ‘food.’ This is another instance of a polysemous word proving to be problematic. The contextually relevant verb for vaccines would be ‘administered’ instead of the translation ‘applied.’

Syntax: The second independent clause has been entirely omitted from the translation. Possibly, the translation engine could not decode the words ‘vaccine’ and ‘brand,’ which are transliterations from English. This clause is a typical case of a failed translation.

Overall, the translated sentence is syntactically correct but erroneous and unacceptable on account of omission and semantic distortion, which have altered the sense of the reference text completely.

5.8. Analysis of source text (Fig. 8)

Wafaaqi wazir_e_dakhla sheikh rashid ahmed = ne parliament

Federal minister-PTL-interior Sheikh Rashid Ahmed = ERG parliament

House = mein media = se guftugu kartay_huay kaha ke

House = in.LOC media = ABL conversation do_while.PRG say.PRF that

qaumi salamati = ke maslay = par saari opposition aur saari hukamran

national security = GEN issue:DAT = in.LOC entire.F opposition and entire.F ruling

party aik hai, tamam jamaton = ne jo tajaweez

party united.ADJ be.PRS.SG all party.PL = ERG whatever recommendation.PL

deen woh mulk = ke liyay di_hain

give.F.PRF that/those country = DAT for give:PRS.PRF.PL.

Federal Interior Minister Sheikh Rashid Ahmed, talking to the media at Parliament House, said that the whole opposition and the entire ruling party were united on the issue of national security; whatever recommendations the parties gave were in the interest of the country.

The Urdu text sequence is syntactically complex, consisting of a main clause, an embedded complement clause, and a second independent clause appended at the end of this primary structure with a comma. The source text also contains transliterations from English, such as ‘Parliament House.’

Analysis of translated text

At the syntactic level, the translation has an acceptable degree of accuracy despite the structural complexity of the source text. There are no significant omissions or errors. Semantically, however, the translation is incongruent with the original text, as the last clause, “all the parties have given suggestions for the country,” is a literal translation that falls short of conveying the intended idiomatic meaning of “whatever recommendations all the parties gave were in the interest of the country.”

5.9. Analysis of source text (Fig. 9)

Tarjuman Counter Terrorism Department (CTD) Baluchistan = ke mutabiq

Spokesperson Counter Terrorism Department (CTD) Baluchistan = GEN according.

Quetta dhamakay = mein 6 afrad zakhmi huay jin-mein = se

Quetta blast = in.LOC six people injure become.PRF whom/which = in.LOC = ABL

Doe zakhmiyon = ki halat tashweesh_naq hai,

Two injure.ADJ.PL = GEN.F condition critical be.PRS

dhamakay = ke zakhmiyon = ko haspataal muntaqil kar_diya gya hai.

Blast = GEN injure.ADJ.PL = ACC hospital shift do_PRF PASS be.PRS.

According to the spokesperson of Counter Terrorism Department (CTD), six people were injured in the Quetta blast, of which two are in critical condition; those injured from the blast have been shifted to hospital.

The Urdu extract begins with an adverbial phrase, followed by a main clause, a relative clause, and a second independent clause conjoined to the first with a comma. The extract has a high degree of syntactic complexity.

Analysis of translated text

The translation is accurate both in terms of syntax and meaning. The adverbial phrase, the main clause, and the relative clause all occupy correct syntactic positions. Although the Urdu text fronts the adverbial (locative) phrase “Quetta dhamakay mein” (“in the Quetta blast”), the translation has successfully placed the main clause containing the subject (“six people were injured”) in the fronted position, following the SVO word order for English. Urdu uses a case-marking system to indicate grammatical function, thus allowing the flexibility to move grammatical roles of subject, object, indirect object, and adverbials such as “Quetta Dhamakay mein” to different positions within a sentence. In English, grammatical roles are determined by syntactic position, leading to a more fixed syntactic structure. Google Translate has transposed the fluid word order of Urdu correctly onto the fixed order of English. The engine has also correctly recognized the Urdu journalistic convention of joining independent clauses with commas and has converted the conjoined clause into another sentence separated by a full stop. However, the longer sentence structure of the original text could have been retained with the use of a semi-colon. The appended sentence of the translation omits information (“from the blast”) that is repeated from the previous sentence and follows a construction that is more idiomatic in the target language (“the injured” vs. “those injured from the blast”).
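The contrast between case-marked and position-based grammatical roles can be sketched in a few lines. This is only an illustration under simplifying assumptions: the mapping `CASE_TO_ROLE` is hypothetical and far coarser than Urdu grammar, each role is assumed to occur once per clause, and the constituents are given already translated into English.

```python
# Simplified, assumed mapping from gloss case labels to clause roles
CASE_TO_ROLE = {"ERG": "subject", "NOM": "subject", "ACC": "object",
                "LOC": "adverbial", "VERB": "verb"}
ENGLISH_ORDER = ["subject", "verb", "object", "adverbial"]

def linearize(constituents):
    """Place case-labelled constituents in English subject-verb-object-adverbial order."""
    by_role = {CASE_TO_ROLE[case]: text for text, case in constituents}
    return " ".join(by_role[role] for role in ENGLISH_ORDER if role in by_role)

# Order found in the source text: the locative phrase is fronted
fronted_adverbial = [("in the Quetta blast", "LOC"), ("six people", "NOM"),
                     ("were injured", "VERB")]
# An alternative ordering, assumed here for illustration
subject_first = [("six people", "NOM"), ("in the Quetta blast", "LOC"),
                 ("were injured", "VERB")]

print(linearize(fronted_adverbial))
print(linearize(subject_first))
```

Because the role is read from the case label rather than from position, both input orders yield the same English clause.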

At the semantic level, individual words have all been translated into appropriate equivalents in English, making the translation easily readable.

5.10. Analysis of source text (Fig. 10)

Hareem Shah = ne daawa kiya hai ke police = ki wardi = mein

Hareem Shah = ERG claim do-PRF be.PRS that police = GEN uniform = in.LOC

malboos afraad = ki janib = se salon intezaamia = se poocha gya

clad persons = GEN behalf = GEN salon management = ABL ask PASS

ke Hareem kon_si gaari = mein aai thi, kisi minister = ki

that Hareem which_PTL.F car = in.LOC come.F be.PST.F some minister = GEN

gaari thi, kis rang = ki thi aur kon driver tha? Hataa_ ke

car be.PST which color = GEN be.PST and who driver be.PST Even that

intezaamia = se uss waqt = ki CCTV footage = ka bhi sawaal kiya_gya.

management = ABL that time = GEN.F CCTV footage = GEN also question do.PRF_PASS

Hareem Shah has claimed that people dressed in police uniforms have asked the salon management which car Hareem came in, whether it was some minister's car, what color it was, and who was the driver. The management was even asked for the CCTV footage of that time.

The source text contains a main clause, a complement clause, and three embedded questions. This structure is followed by another sentence consisting of an independent clause.

Analysis of translated text

Syntax: The second embedded question, “kissi minister ki gaari thi,” has been translated incorrectly. In the Urdu text, the pronoun ‘kissi’ is a determiner that translates into ‘a’ or ‘some’ in the given context. Google Translate has under-analyzed the determiner ‘kissi’ into the related lexical form “kiss” which is an interrogative word, and subsequently translated it as the equivalent “which.” This error has upturned the syntax of the entire embedded question. Interestingly, “which” is interpreted as pronominal instead of prenominal and is thus placed in the subject position, with the predictable effect that “minister's car” is erroneously shifted to the predicate position.

The error has occurred due to the pro-drop convention of Urdu. The English translation needs to add the pronoun “it” in the subject position to retain sense, and to introduce the structure not with a Wh-word but with a complementizer such as ‘whether’ or ‘if.’ This would yield the correct form “whether it was some minister's car.”

Semantics: Google Translate predicts the correct translation equivalent from intra-sentential cues. The predictive function does not span sentences. Although the Urdu extract uses the same word “intezaamia” in both instances, the first instance, which is preceded by the word “salon,” is correctly translated as “salon management”, while the second instance, which occurs without a qualifier, has been translated into “administration” instead, which is semantically odd when used to talk about the salon's management.
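The effect of sentence-bound context on a polysemous word can be shown with a toy lookup. This sketch is not a model of the neural system; the sense table `SENSES` and the function `translate_word` are hypothetical and only illustrate why a cue in one sentence does not help in the next when context does not cross sentence boundaries.

```python
# Hypothetical sense table: a context cue selects the sense, else a default is used
SENSES = {"intezaamia": {"default": "administration", "salon": "management"}}

def translate_word(word, context_words):
    """Pick the sense triggered by the first cue found in the context, else the default."""
    senses = SENSES[word]
    for cue in context_words:
        if cue in senses:
            return senses[cue]
    return senses["default"]

sentence_1 = ["salon", "intezaamia"]
sentence_2 = ["intezaamia", "CCTV", "footage"]

# Context limited to each sentence: the second sentence carries no cue
per_sentence = [translate_word("intezaamia", s) for s in (sentence_1, sentence_2)]

# Context spanning both sentences: the cue from the first sentence carries over
document = sentence_1 + sentence_2
cross_sentence = [translate_word("intezaamia", document) for _ in (sentence_1, sentence_2)]

print(per_sentence)
print(cross_sentence)
```

With sentence-bound context the second occurrence falls back to “administration”, which matches the inconsistency observed in the translation.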

Furthermore, the translation “at the time” seems like an incomplete expression, requiring a complement such as “at the time of the incident.” A choice more aligned with the input text would have been “CCTV footage of that time.”

5.11. Analysis of source text (Fig. 11)

Zarai = ke mutabiq Asif Ali Zardari = ki tabiyat na_saaz honay = ke baais

Sources = GEN according Asif Ali Zardari = GEN health unwell happen = GEN reason

unhein daktaron = ki hadayat = par haspataal = mein dakhil karwaya gya.

3HON.ACC doctor.PL = GEN advice = LOC hospital = in.LOC admit do.PRF.CAUS PASS.

According to sources, Asif Ali Zardari was admitted to hospital on the advice of doctors because of ill health.

The Urdu sequence consists of a subordinate clause, a main clause, and two adverbial phrases, one opening the sentence, and one nested in the middle.

Analysis of translated text

The translation is correct syntactically as well as semantically.
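The interlinear gloss for this example can also be read as a list of aligned word and gloss pairs. The following is a minimal sketch, assuming the notation used in this section, in which a clitic is written as “host = clitic” and words are separated by spaces; the helper names `tokens` and `align` are illustrative and not part of the study's method.

```python
import re

def tokens(line):
    """Split a gloss line into words, attaching clitics written as 'host = clitic'."""
    return re.sub(r"\s*=\s*", "=", line.strip()).split()

def align(source_line, gloss_line):
    """Pair each transliterated word with its gloss; fail loudly on a length mismatch."""
    src, gl = tokens(source_line), tokens(gloss_line)
    if len(src) != len(gl):
        raise ValueError(f"{len(src)} source tokens vs {len(gl)} gloss tokens")
    return list(zip(src, gl))

# First line of the gloss in this example
pairs = align(
    "Zarai = ke mutabiq Asif Ali Zardari = ki tabiyat na_saaz honay = ke baais",
    "Sources = GEN according Asif Ali Zardari = GEN health unwell happen = GEN reason",
)
for word, gloss in pairs:
    print(f"{word:12} {gloss}")
```

A length check of this kind makes it easy to spot lines where the transliteration and the gloss have drifted out of alignment.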

5.12. Analysis of source text (Fig. 12)

Alim-i mosamyat-i tabdeelion = ke manfi asaraat Quetta = mein

global-ADJ climate-ADJ change:PL = GEN negative effects Quetta = in.LOC

bhi nazar_aana shuroo ho_chukay hain, shehr = mein mamool = se

also visible_come.INF start happen.PL.PRF be.PL.PRS city = in.LOC normal = ABL

kam barishon aur kam barafbari = ke baais zer-e-zameen paani = ki

less rain.PL and less snowfall = GEN reason under-CONJ-ground water = GEN.F

satah musalsal gir_ rahi_hai, iss tabdeeli = se aabi wasail = ki

level continuous.ADV fall PROG_be.F.PRS this change = ABL water.ADJ resource.PL = GEN

kami, zaraat aur mosami shidat jaisay sangeen masail

scarcity agriculture and weather extremity such/like serious problem:PL

janam lenay lagay_hain.

birth take begin_to_happen.PRS.

The negative effects of global climatic changes have also started becoming evident in Quetta; due to below average rainfall and less snowfall in the city, groundwater levels are steadily falling; because of this change, serious problems such as lack of water resources, agriculture, and weather extremities have begun to take birth.

The Urdu extract is a run-on sequence of three independent clauses joined with commas.

Analysis of translated text

Syntax: Although the translation contains no error in the first two clauses, the last one is syntactically incorrect and fails to convey meaning. The problem is a result of omission: the translation has failed to identify that the last main clause includes the subordinate clause “iss tabdeeli se” (because of this change). In fact, the engine has not treated the last clause as a clause at all. Removing the subject position, the translation has linked the objects of this clause to the already generated structure by adding the words “resulting in.” With the English syntactical positions filled, the remaining words “jaise sangeen masail janam lene lagay hain” have been pushed off into a separate sentence (“As serious problems begin to arise.”). This sentence is, of course, a fragment, as it requires noun complements for the linking word “jaise” (“such as …”), which have incidentally been allocated to the previous construction.

Had the translation engine recognized the whole Urdu clause, and the subordinate clause “iss tabdeeli se” appropriately analyzed, the subject “sangeen masail” (“serious problems”) would have been fronted to form the translation “Because of this change, serious problems such as scarcity of water resources, agriculture, and extreme weather conditions have begun to arise.”

Thus, the error in this example owes to the incorrect analysis of the subject.

Semantics: “Jaise” is a polysemous word; it is a substitute for “as” in expressions like “as you know,” but is also used to signal an illustration or example, taking on the meaning of “such as.” The application incorrectly used “jaise” in the former sense instead of the latter.

However, Google Translate has perceptively treated the expression “masail janam lene lagain hain” idiomatically rather than literally, converting it into the appropriate equivalent “problems begin to arise” rather than the literal alternative “problems begin to take birth.”

6. Conclusion

This research has analyzed the patterns of errors yielded in Google Translate's Urdu-to-English translations and has attempted to trace them to syntactic operations at the back-end of the system. Through a description of the errors, we are able to reverse-engineer the parsing processes at work as a first step toward understanding the limitations of text processing in neural machine translation. It is found that Google Translate has shown significant improvements in the quality of translations from 2018 to 2021, despite there being no change during this interval in the proportion of accurate translations obtained for the total number of queries.

A comparison of the outputs for the two years reveals that the initial Urdu-to-English translations from 2018 were too inaccurate to be of any practical value. Out of six news extracts entered into the Google Translate interface, only Queries 1 and 2 resulted in translations that conveyed the same meaning as the source text. Query 5 was syntactically correct but contained erroneous extra information inserted automatically by the engine. Queries 3, 4, and 6 failed to give meaningful results due to syntactic errors and redundant structures in the machine translation.

This implies that in 2018, half of the results were inaccurate. The error rate of Google Translate was 50 %, too high to be reliable.

The reasons identified for errors witnessed in 2018 were as follows:

1. Google Translate contained an insufficient lexical store for Urdu words, thus being unable to identify and process many words. The effect was that any prenominal determiners in the sentence were read as pronouns and then randomly substituted with other nouns that existed in the sentence to maintain subject and object relations, creating repetitions.
2. The engine did not have a database for formulaic expressions of Urdu. These were omitted from the translation.
3. The translation algorithm did not accurately construct semantic relations determined by the case markers in Urdu. It was unable to determine which structures were included in the scope of the case marker.
4. It generated redundant structures by trying to work out all the possible grammatical relations with the case marker. In the test cases analyzed above, just translating one relation per case marker would have produced more meaningful sentences. This finding is supported by Vidhayasai [26], cited in the literature review, who discovered that Google Translate generates translations twice to cater to all possible meanings/relations.

Results have been significantly better in 2021. Of the six queries processed, Query 8, Query 9, and Query 11 were completely accurate. Although this still amounts to 3 correct outputs out of 6, with a consistent error rate of 50 %, the incorrect translations were largely meaningful and had localized errors in a specific structure. Also, in the 2021 test run, the application fared better in processing formulaic expressions, converting them to acceptable and idiomatic equivalents in English.
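The 2021 error rate follows directly from the outcomes reported above. As a small worked example, the tally can be written as follows; the dictionary simply encodes which of Queries 7 to 12 were reported as completely accurate, and the function name `error_rate` is an illustrative choice.

```python
# Outcomes for the 2021 queries as reported in this section (True = completely accurate)
outcomes_2021 = {7: False, 8: True, 9: True, 10: False, 11: True, 12: False}

def error_rate(outcomes):
    """Share of queries whose translation was not completely accurate."""
    errors = sum(1 for accurate in outcomes.values() if not accurate)
    return errors / len(outcomes)

print(f"{error_rate(outcomes_2021):.0%}")  # prints 50%
```

The figure is a proportion of only six queries, so it says nothing about how severe or how localized each error was, which is where the 2021 results differ from those of 2018.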

The errors obtained seemed to occur on account of the following weaknesses in the machine learning algorithm:

1. The algorithm could not process the pro-drop feature of Urdu. The engine needs to supply a relevant subject pronoun for the English translation to have structural validity.
2. In some longer sequences of Urdu, specific syntactic structures, such as subordinate clauses, were omitted from consideration during translation, creating fragments and incomplete constructions.
3. The meanings of polysemous words could not be decoded from context.
4. The engine collapsed similar lexical items into the same lexeme, which can be erroneous. Urdu words such as “kissi” and “kiss” may appear to be inflections of one another but are quite different words – the first being a determiner (“some”), the second a Wh-pronoun (“who/which”).
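The last weakness in the list above can be made concrete with a toy example. The sketch below is not a claim about Google Translate's internals; the normalizer `naive_lemma` is a hypothetical, deliberately over-eager rule that treats a final “i” as an inflection, and the two-entry `LEXICON` is assumed for illustration.

```python
# Two distinct lexical entries, following the description in this section
LEXICON = {"kissi": ("determiner", "some"), "kiss": ("wh-word", "which")}

def naive_lemma(form):
    """Hypothetical over-eager normalizer: strips a final 'i' as if it were an inflection."""
    return form[:-1] if form.endswith("i") else form

def gloss_naive(form):
    """Look the word up after normalization, which merges 'kissi' into 'kiss'."""
    return LEXICON[naive_lemma(form)]

def gloss_exact(form):
    """Look the word up as written, keeping the two entries apart."""
    return LEXICON[form]

print(gloss_naive("kissi"))
print(gloss_exact("kissi"))
```

Once the determiner is merged with the Wh-word, every later step inherits the wrong category, which is consistent with the way the error in Query 10 spread through the embedded question.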

Google Translate needs to target the problem areas mentioned above to produce viable and meaningful translations from Urdu to English [31]. The types of errors analyzed in this study are still too fundamental to be ignored, as their effect spills onto nearby syntactic structures and obfuscates the meaning of the rest of the sentence.

To translate out of a minor language, the NMT technology would have to implement an improved parser and encoder that can accurately represent the syntax of the input language and exhibit sensitivity to specific grammatical features of the language that encode indispensable information [32]. Rather than building language models from bilingual corpora of paired texts, which might not be exhaustive or comprehensive enough, the NMT system could benefit from initial modeling of the minor language with respect to its key grammatical features. The need to incorporate syntactic information in neural models also seems to be the direction indicated by much of the concurrent research on low-resource languages [33,34].

7. Limitations of the study

The present study has only analyzed isolated, decontextualized sentences. Its results cannot be generalized to predict the behavior of the translation engine when longer strings of text spanning many sentences are entered. Future research should, therefore, use entire paragraphs. Such a study would cover further discourse-level issues that could be encountered in machine translation.

Ethics declarations

Authors didn't use any material for which prior approval/consent is required. All consulted works are duly cited and referenced in this study according to the journal reference style. Moreover, no human or animal is used in this study.

Data availability statement

No data was used for the research described in the article.

CRediT authorship contribution statement

Tamkeen Zehra Shah: Writing – original draft, Investigation, Formal analysis, Data curation, Conceptualization. Muhammad Imran: Writing – review & editing, Supervision, Resources, Project administration, Methodology, Funding acquisition, Formal analysis, Data curation. Sayed M. Ismail: Writing – review & editing, Validation, Supervision, Funding acquisition, Formal analysis, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

a) The authors thank Prince Sultan University for technical support.

b) This study is supported via funding from Prince Sattam Bin Abdulaziz University Project Number (PSAU 2023/R/1444).

Contributor Information

Tamkeen Zehra Shah, Email: tzehras@gmail.com.

Muhammad Imran, Email: mimran@psu.edu.sa, imranjoyia76@gmail.com.

Sayed M. Ismail, Email: a.ismail@psau.edu.sa.

References

1. Migiro G. 2019, February 26. Where Is Urdu Spoken? WorldAtlas. Retrieved October 18, 2022, from https://www.worldatlas.com/articles/where-is-urdu-spoken.html
2. Pangarkar N.A. 2015. Language Dominance in Urdu-English Bilinguals: a Comparison of Subjective and Objective Measures. https://repositories.lib.utexas.edu/handle/2152/31818
3. Kanwal N., Iqbal M.J., Mushtaq M. Minimalist perspective on legal communication: a case study of English to Urdu translation of Punjab laws. Register Journal. 2022;15(1):64–90.
4. Mental Health Information in Urdu. Royal College of Psychiatrists; 2023. https://www.rcpsych.ac.uk/mental-health/translations/urdu
5. Mustafa K., Waqas S., Dar R.K., Dar R.K., Sherazi Q.U.A., Tariq M., Asim H.M. Translation and validation of barthel index in Urdu language for stroke patients. Pakistan Journal of Medical & Health Sciences. 2022;16(3):163.
6. Shaghaghi N., Ghosh S., Ali F., Ali A.B. In: Services – SERVICES 2021. Serhani M.A., Zhang L.J., editors. vol. 12996. Springer; Cham: 2022. An English to Urdu educational video translation pipeline to reinforce mother-tongue based learning. (Lecture Notes in Computer Science).
7. Afzal M.I., Asif S., Mohsin L.A. Urdu-English texts translation practices: qualities and hindrances at intermediate level in Pakistan. Webology (ISSN: 1735-188X). 2022;19(2).
8. Wetsman N. 2021. Google Translate still isn’t good enough for medical instructions. Theverge.com, Mar. 9. [Online]. Available: https://www.theverge.com/2021/3/9/22319225/google-translate-medical-instructions-unreliable
9. Benjamin M. 2021. Empirical evaluation of Google Translate across 107 languages. Teach You Backwards, Apr. 22. [Online]. Available: https://www.teachyoubackwards.com/empirical-evaluation/
10. Ghafoor A., Shariq I.A., Daudpota M.S., Kastrati Z. The impact of translating resource-rich datasets to low-resource languages through multilingual text processing. IEEE Access. 2021;9:124478–124490. https://ieeexplore.ieee.org/document/9529190
11. Ranathunga S., Lee E.S.A., Skenduli M.P., Shekhar R., Mehreen A., Kaur R. 2021. Neural Machine Translation for Low-Resource Languages: a survey. Computer Science: Computation And Language. https://arxiv.org/abs/2106.15115
12. Aiken M. An updated evaluation of Google Translate accuracy. Studies in Linguistics and Literature. 2019;3(3):253–260. doi: 10.22158/sll.v3n3p25.
13. Klimova B., Pikhart M., Benites A.D., Lehr C., Sanchez-Stockhammer C. Neural machine translation in foreign language teaching and learning: a systematic review. Educ. Inf. Technol. 2023;28(1):663–682.
14. Hedderich Michael A., Lange Lukas, Adel Heike, Strötgen Jannik, Dietrich Klakow. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. A survey on recent approaches for natural language processing in low-resource scenarios; pp. 2545–2568.
15. Wray Alison. Cambridge University Press; Cambridge: 2002. Formulaic Language and the Lexicon.
16. Kecskes I. Formulaic language in English lingua franca. Explorations in pragmatics: Linguistic, cognitive and intercultural aspects. 2007;1:191–218.
17. Nazzal A. A preliminary study of the translation of English idiomatic/formulaic expressions by ESL/EFL students: as marked and non-canonical forms. International Journal on Studies in English Language and Literature. 2017;5(1):1–12.
18. Liu F., Lu H., Neubig G. 2017. Handling Homographs in Neural Machine Translation. arXiv preprint arXiv:1708.06510.
19. Hauer B., Kondrak G. One homonym per translation. Proc. AAAI Conf. Artif. Intell. 2020;34(5):7895–7902. doi: 10.1609/aaai.v34i05.6296.
20. Caswell I., Liang B. 2020, Jun 8. Recent Advances in Google Translate. Google AI Blog. https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html
21. Uszkoreit J. Transformer: a novel neural network architecture for language understanding. 2017. Google AI Blog. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
22. Ullah Z., Shah F.I. Investigating the accuracy of Google translate in translating Urdu linguistic elements. Journal of English Language, Literature and Education. 2020;2(1):37–51. http://jelle.lgu.edu.pk/index.php/jelle/article/view/79/66
23. Ghasemi H., Hashemian M. A comparative study of “Google translate” translations: an error analysis of English-to-Persian and Persian-to-English translations. Engl. Lang. Teach. 2016;9(3):13–17.
24. Vilar D., Xu J., d'Haro L.F., Ney H. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). 2006, May. Error analysis of statistical machine translation output. http://www.lrec-conf.org/proceedings/lrec2006/pdf/413_pdf.pdf
25. Costa A., Luís T., Coheur L. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland. 2014. Translation errors from English to Portuguese: an annotated corpus; pp. 1231–1234. http://www.lrec-conf.org/proceedings/lrec2014/pdf/199_Paper.pdf
26. Vidhayasai T., Keyuravong S., Bunsom T. Investigating the use of Google translate in “terms and conditions” in an airline's official website: errors and implications. PASAA A J. Lang. Teach. Learn. Thail. 2015;49:137–169.
27. Muzaffar S., Behera P. A qualitative evaluation of Google's translate: a comparative analysis of English-Urdu phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) systems. Language in India. 2018;18(10):154–164.
28. Din U.M. Urdu-English machine transliteration using neural networks. ArXiv. 2020. Accessed online on January 10, 2023. https://doi.org/10.48550/arXiv.2001.05296
29. Afzaal M., Imran M., Du X., Almusharraf N. Automated and human interaction in written discourse: a contrastive parallel corpus-based investigation of metadiscourse features in machine-human translations. Sage Open. 2022;12(4).
30. Rai S.I., Khan M.U.S., Waqas Anwar M. English to Urdu: optimizing sequence learning in neural machine translation. 2020 3rd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET). 2020. doi: 10.1109/icomet48670.2020.9074.
31. Imran M., Chen Y., Wei X.M., Akhtar S. A critical study of coordinated management of meaning theory: a theory in practitioners' hands. Int. J. Engl. Ling. 2019;9(5):301–306.
32. Durrani N. Proceedings 12th Himalayan Language Symposium 27th Annual Conference of Linguistic Society of Nepal. Kathmandu, Nepal; 2006. System of grammatical relations in Urdu. https://alt.qcri.org/∼ndurrani/pubs/system_grammatical_relations.pdf
33. Imran M., Almusharraf N. Analyzing the role of ChatGPT as a writing assistant at higher education level: a systematic review of the literature. Contemporary Educational Technology. 2023;15(4).
34. Mohammed Elaffendi, Khawlah Elrajhi. Sep-2022. Beyond the Transformer: A Novel Polynomial Inherent Attention (PIA) Model and its Great Impact on Neural Machine Translation. Hindawi.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

No data was used for the research described in the article.

[bib1] 1.Migiro G. 2019, February 26. Where Is Urdu Spoken? WorldAtlas.https://www.worldatlas.com/articles/where-is-urdu-spoken.html Retrieved October 18, 2022, from. [Google Scholar]

[bib2] 2.Pangarkar N.A. 2015. Language Dominance in Urdu-English Bilinguals : a Comparison of Subjective and Objective Measures.https://repositories.lib.utexas.edu/handle/2152/31818 [Google Scholar]

[bib3] 3.Kanwal N., Iqbal M.J., Mushtaq M. Minimalist perspective on legal communication: a case study of English to Urdu translation of Punjab laws. Register Journal. 2022;15(1):64–90. [Google Scholar]

[bib4] 4.Mental Health Information in Urdu. Royal College of Psychiatrists; 2023. www.rcpsych.ac.uk https://www.rcpsych.ac.uk/mental-health/translations/urdu [Google Scholar]

[bib5] 5.Mustafa K., Waqas S., Dar R.K., Dar R.K., Sherazi Q.U.A., Tariq M., Asim H.M. Translation and validation of barthel index in Urdu language for stroke patients. Pakistan Journal of Medical & Health Sciences. 2022;16(3):163. 163. [Google Scholar]

[bib6] 6.Shaghaghi N., Ghosh S., Ali F., Ali A.B. In: Services – SERVICES 2021. SERVICES 2021. Serhani M.A., Zhang L.J., editors. vol. 12996. Springer; Cham: 2022. An English to Urdu educational video translation pipeline to reinforce mother-tongue based learning. (Lecture Notes in Computer Science). [DOI] [Google Scholar]

[bib7] 7.Afzal M.I., Asif S., Mohsin L.A. Urdu-English texts translation practices: qualities and hindrances at intermediate level in Pakistan. Webology (ISSN. 2022;19(2) : 1735-188X) [Google Scholar]

[bib8] 8.Wetsman N. 2021. Google Translate still isn’t good enough for medical instructions," Theverge.com, Mar. 9.https://www.theverge.com/2021/3/9/22319225/google-translate-medical-instructions-unreliable [Online]. Available: [Google Scholar]

[bib9] 9.Benjamin M. 2021. Empirical evaluation of Google Translate across 107 languages, Teach You Backwards, Apr. 22.https://www.teachyoubackwards.com/empirical-evaluation/ [Online]. Available: [Google Scholar]

[bib10] 10.Ghafoor A., Shariq I.A., Daudpota M.S., Kastrati Z. The impact of translating resource-rich datasets to low-resource languages through multilingual text processing. IEEE Access. 2021;9:124478–124490. https://ieeexplore.ieee.org/document/9529190 [Google Scholar]

[bib11] 11.Ranathunga S., Lee E.S.A., Skenduli M.P., Shekhar R., Mehreen A., Kaur R. 2021. Neural Machine Translation for Low-Resource Languages: a survey.Computer Science: Computation And Language.https://arxiv.org/abs/2106.15115 [Google Scholar]

[bib12] 12.Aiken M. An updated evaluation of Google Translate accuracy. Studies in Linguistics and Literature. 2019;3(3):253–260. doi: 10.22158/sll.v3n3p25. [DOI] [Google Scholar]

[bib13] 13.Klimova B., Pikhart M., Benites A.D., Lehr C., Sanchez-Stockhammer C. Neural machine translation in foreign language teaching and learning: a systematic review. Educ. Inf. Technol. 2023;28(1):663–682. [Google Scholar]

[bib14] 14.Hedderich Michael A., Lange Lukas, Adel Heike, Strötgen Jannik, Dietrich Klakow. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. A survey on recent approaches for natural language processing in low-resource scenarios; pp. 2545–2568. [Google Scholar]

[bib15] 15.Wray Alison. Cambridge University press; Cambridge: 2002. Formulaic Language and the Lexicon. [Google Scholar]

[bib16] 16.Kecskes I. Formulaic language in English lingua franca. Explorations in pragmatics:Linguistic, cognitive and intercultural aspects. 2007;1:191–218. [Google Scholar]

[bib17] 17.Nazzal A. A preliminary study of the translation of English idiomatic/formulaic expressions by ESL/EFL students: as marked and non-canonical forms. International Journal on Studies in English Language and Literature. 2017;5(1):1–12. [Google Scholar]

[bib18] 18. Liu F., Lu H., Neubig G. Handling homographs in neural machine translation. arXiv preprint arXiv:1708.06510. 2017.

[bib19] 19. Hauer B., Kondrak G. One homonym per translation. Proc. AAAI Conf. Artif. Intell. 2020;34(5):7895–7902. doi: 10.1609/aaai.v34i05.6296.

[bib20] 20. Caswell I., Liang B. Recent advances in Google Translate. Google AI Blog. Jun 8, 2020. https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html

[bib21] 21. Uszkoreit J. Transformer: a novel neural network architecture for language understanding. Google AI Blog. 2017. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

[bib22] 22. Ullah Z., Shah F.I. Investigating the accuracy of Google Translate in translating Urdu linguistic elements. Journal of English Language, Literature and Education. 2020;2(1):37–51. http://jelle.lgu.edu.pk/index.php/jelle/article/view/79/66

[bib23] 23. Ghasemi H., Hashemian M. A comparative study of “Google Translate” translations: an error analysis of English-to-Persian and Persian-to-English translations. Engl. Lang. Teach. 2016;9(3):13–17.

[bib24] 24. Vilar D., Xu J., d'Haro L.F., Ney H. Error analysis of statistical machine translation output. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). May 2006. http://www.lrec-conf.org/proceedings/lrec2006/pdf/413_pdf.pdf

[bib25] 25. Costa A., Luís T., Coheur L. Translation errors from English to Portuguese: an annotated corpus. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland; 2014. pp. 1231–1234. http://www.lrec-conf.org/proceedings/lrec2014/pdf/199_Paper.pdf

[bib26] 26. Vidhayasai T., Keyuravong S., Bunsom T. Investigating the use of Google Translate in “terms and conditions” in an airline's official website: errors and implications. PASAA A J. Lang. Teach. Learn. Thail. 2015;49:137–169.

[bib27] 27. Muzaffar S., Behera P. A qualitative evaluation of Google's translate: a comparative analysis of English-Urdu phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) systems. Language in India. 2018;18(10):154–164.

[bib28] 28. Din U.M. Urdu-English machine transliteration using neural networks. arXiv. 2020. https://doi.org/10.48550/arXiv.2001.05296 (accessed online January 10, 2023).

[bib29] 29. Afzaal M., Imran M., Du X., Almusharraf N. Automated and human interaction in written discourse: a contrastive parallel corpus-based investigation of metadiscourse features in machine-human translations. Sage Open. 2022;12(4).

[bib30] 30. Rai S.I., Khan M.U.S., Waqas Anwar M. English to Urdu: optimizing sequence learning in neural machine translation. 2020 3rd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET). 2020. doi: 10.1109/icomet48670.2020.9074.

[bib31] 31. Imran M., Chen Y., Wei X.M., Akhtar S. A critical study of coordinated management of meaning theory: a theory in practitioners' hands. Int. J. Engl. Ling. 2019;9(5):301–306.

[bib32] 32. Durrani N. System of grammatical relations in Urdu. Proceedings of the 12th Himalayan Language Symposium and 27th Annual Conference of the Linguistic Society of Nepal. Kathmandu, Nepal; 2006. https://alt.qcri.org/~ndurrani/pubs/system_grammatical_relations.pdf

[bib33] 33. Imran M., Almusharraf N. Analyzing the role of ChatGPT as a writing assistant at higher education level: a systematic review of the literature. Contemporary Educational Technology. 2023;15(4).

[bib34] 34. Elaffendi M., Elrajhi K. Beyond the Transformer: a novel Polynomial Inherent Attention (PIA) model and its great impact on neural machine translation. Hindawi. Sep. 2022.
