Evaluating the Impact of Integrating Similar Translations into Neural Machine Translation

Abstract: Previous research has shown that simple methods of augmenting machine translation training data and input sentences with translations of similar sentences (or fuzzy matches), retrieved from a translation memory or bilingual corpus, lead to considerable improvements in translation quality, as assessed by a limited set of automatic evaluation metrics. In this study, we extend this evaluation by calculating a wider range of automated quality metrics that tap into different aspects of translation quality and by performing manual MT error analysis. Moreover, we investigate in more detail how fuzzy matches influence translations and where potential quality improvements could still be made by carrying out a series of quantitative analyses that focus on different characteristics of the retrieved fuzzy matches. The automated evaluation shows that the quality of neural fuzzy repair (NFR) translations is higher than that of the NMT baseline in terms of all metrics. However, the manual error analysis did not reveal a difference between the two systems in terms of the total number of translation errors; yet, different profiles emerged when considering the types of errors made. Finally, in our analysis of how fuzzy matches influence NFR translations, we identified a number of features that could be used to improve the selection of fuzzy matches for NFR data augmentation.


Introduction
Machine translation (MT) systems are routinely evaluated using a restricted set of automated quality metrics, especially at early stages of development [1,2]. This was not different for neural fuzzy repair (NFR) [3][4][5], an MT data augmentation method that relies on the retrieval of translations of similar sentences, called fuzzy matches (FMs), from a translation memory (TM) or bilingual corpus. Using mainly BLEU [6], a metric quantifying the degree of exact overlap between MT output and a reference translation, substantial quality improvements were demonstrated between NFR systems and strong neural machine translation (NMT) baselines. This difference in terms of BLEU score was, arguably, consistent (across language pairs and data sets) and large enough to be interpreted as a strong indication that NFR can lead to translations of better quality. However, considering that BLEU scores only target one specific component of translation quality, this study intends to provide a more detailed and varied analysis of how the NFR output compares to the output of a baseline NMT system. Our aim is two-fold: not only do we want to obtain a better picture of the quality of translations produced with NFR, we also hope to gain more insight into how NFR leads to better translation quality and to identify patterns that can be exploited to further improve the system. It can be argued that different translation tasks and contexts require different definitions of translation quality, as well as distinct evaluation techniques [7]. In this study, we focus on a specific translation context, namely, the European institutions, and the Commission in particular. At the Commission's Directorate-General for Translation (DGT), the translation quality requirements are very high, as the translated texts are often legally binding, politically sensitive, confidential and/or important for the image of the institutions [8].
This also means that consistency and exact lexical choices are often important factors contributing to translation quality next to, for example, meaning preservation and morpho-syntactic well-formedness.
Different methods were applied to extend the evaluation of the NFR output: (a) we calculated a wider range of automated quality estimation metrics targeting different aspects of translation quality, (b) we analysed required edit operations (to bring the MT output in line with the reference translation) for different fuzzy match ranges and different word classes and (c) we performed a fine-grained error analysis to establish the error profiles of the MT systems. Additionally, we zoomed in on the role that the retrieved FMs play in the NFR input and output and tried to identify FM-related features that can explain differences in quality between the NFR system and the NMT baseline, and that thus potentially can be used to further improve the NFR system.
We present the background to the study in Section 2, before introducing the study itself with the research questions (Section 3). The methodology is described in Section 4, followed by the results (Section 5) and their discussion (Section 6). In the final part, we formulate the conclusions (Section 7).

Research Background
In this section, we provide background related to the integration of similar translations into MT (Section 2.1) and the evaluation of MT quality and translation errors (Section 2.2).

TM-MT Integration
TMs are widely used in translation workflows [9], since they aid translators with finding existing translations of similar sentences. Easy access to existing translations is not only useful for speeding up the translation process, but also, for example, to ensure (terminological) consistency. In order to retrieve FMs, a wide range of automated metrics has been used, such as (token) edit distance [10,11], percentage overlap [12], vector similarity [13] and MT evaluation metrics (see Section 2.2) [14]. In recent years, MT has been claiming an increasingly prominent place in computer-assisted translation (CAT) workflows [15][16][17], alongside TMs. MT output is typically presented to translators in case no sufficiently similar translation is retrieved from the TM [18]. In spite of recent advances in the overall quality of MT, professional translators still have more confidence in translations retrieved from a TM than in MT output, for example, due to the unpredictability of MT errors [19,20].
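As an illustration of the edit-distance-based retrieval metrics mentioned above, the following sketch scores a TM candidate against an input sentence using token-level Levenshtein distance. The normalisation used here (by the length of the longer sentence) is one common convention, not necessarily the one used in the cited tools, and the function names are illustrative.

```python
def token_edit_distance(a, b):
    """Levenshtein distance over token sequences (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def fuzzy_match_score(src, candidate):
    """Similarity in [0, 1]: 1 - normalised token edit distance."""
    a, b = src.split(), candidate.split()
    if not a and not b:
        return 1.0
    return 1.0 - token_edit_distance(a, b) / max(len(a), len(b))
```

For example, two eight-token sentences differing in a single word receive a score of 0.875, which would fall in the lower FM ranges used later in this study.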
For over twenty years, attempts have been made to combine the advantages of TMs and MT. Different integration methods were proposed in the context of various MT paradigms [21][22][23][24]. Recent approaches have focused on integrating TM matches, or similar translations in general, in NMT models [25][26][27][28]. In this study, we focus on a simple approach to TM-NMT integration, neural fuzzy repair (NFR), that relies on source sentence augmentation through the concatenation of translations of similar source sentences retrieved from a TM [3]. This method has been shown to work well with the Transformer architecture [29], with the FM retrieval being based on the cosine similarity of sentence embeddings [4,5]. In this paper, we do not focus on comparing different TM-MT integration methods, but rather on evaluating one NFR configuration that was shown to perform well in a previous study, using BLEU as evaluation metric [4]. The NFR system evaluated in this study is presented in more detail in Section 4.2.
The data augmentation method used in NFR, which was inspired by work conducted in automated post-editing [30] and multi-source translation [31], has been shown to also yield promising results in other fields, such as text generation [32] and code summarization [33]. In the context of MT, it has also proven to be helpful in domain adaptation [5], increasing data robustness [34] and specialised translation tasks [35]. Another interesting line of related research focuses on combining NMT with similar sentences retrieved from a monolingual corpus [36].
Thus far, the evaluation of the NFR method has relied exclusively on a restricted set of automated quality metrics. In this study, our aim is to also target other aspects of translation quality. Considering that this data augmentation method is gaining some ground in the field of MT, as well as in different NLP domains, a more extensive evaluation seems warranted. In the next section, we discuss the notion of MT quality in more general terms.

Quality Assessment of MT
Translation quality remains, to a certain extent, an elusive concept [1,37]. There are many possible answers to the question of what makes something a "good" translation. Theoretically speaking, translation quality is clearly a multidimensional and multicomponent construct that can be approached from different theoretical and practical angles [38,39]. Moreover, different types of translation tasks and contexts potentially require different definitions of quality, rendering its evaluation even more complex [7]. In an attempt to increase the objectivity and feasibility of quality evaluation, a basic distinction is often made between the accuracy (or adequacy) and fluency (or acceptability) of translations [40,41]. Broadly speaking, accuracy is concerned with the extent to which the source content and meaning are retained in the translation. Fluency, on the other hand, refers to whether a translation is well-formed, regardless of its meaning. Other researchers have operationalised MT quality in terms of concepts such as readability, comprehensibility and usability, as well as style and register [1,37].
In research practice, MT quality has been assessed in two broad ways, i.e., (a) by relying on automated metrics and (b) by performing human evaluation. These two approaches are described in the following sections.

Automatic Quality Assessment
A look at almost any MT-related research paper shows that automated metrics for quality estimation are very much at the core of MT research. Not least because they are extremely practical to use, they play a central role in system development and selection, as well as overall evaluation. A key characteristic of almost all of these metrics is that they rely on a (human-produced) reference translation, most often by calculating the similarity between MT output and this gold standard [1].
The various metrics that exist differ with regard to the approach they take to measuring the similarity between the MT output and the reference translation. Whereas certain metrics are based on n-gram precision (e.g., BLEU, METEOR [42] and NIST [43]) or n-gram F-score (e.g., chrF [44]), others compute edit distance (e.g., TER [45]). More recently developed measures, on the other hand, are often based on calculating the similarity of vector representations (e.g., BERTScore [46] and COMET [47]). The metrics also differ with regard to their unit of measurement: some consider token strings (e.g., BLEU, METEOR and TER), others character strings (e.g., chrF), while still others use token or sentence embeddings (e.g., BERTScore and COMET). Some metrics have also been optimised for certain evaluation tasks. COMET, for example, was trained on predicting different types of human judgements in the form of post-editing effort, direct assessment or translation error analysis [47].
It can be argued that these measures capture a combination of translation accuracy and fluency, with some metrics being more oriented towards the former (e.g., BLEU) and others towards the latter (e.g., BERTScore) [48]. Because of their different design, the measures thus target different aspects of translation quality. For example, whereas exact token-based metrics (such as BLEU) measure the amount of lexical overlap (i.e., presence of identical sequences of tokens) with the reference translation, some accept grammatical variation by either looking for overlap in characters (chrF), or by evaluating word lemmas instead of tokens (METEOR). Semantic variability can also be taken into account to a certain extent by accepting synonyms (METEOR). Measures based on vector representations can be argued to also measure semantic similarity, as they do not compare strings of tokens or characters, but rather multidimensional vector representations that are claimed to encode semantic information [49]. Metrics that focus on edit distance, on the other hand, explicitly target the edit operations required to bring the MT output in line with a reference translation. Related to this, it has also been proposed to compare the syntactic structure of MT output and reference translations by calculating the edit distance between syntactic labels. One such approach, dependency-tree edit distance (DTED), uses labels derived from dependency parsing [50]. By targeting syntactic labels rather than word tokens, this approach is concerned with syntactic rather than lexical or semantic accuracy.
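To make the contrast between token- and character-based metrics concrete, the following is a minimal chrF-style sketch: a character n-gram F-score averaged over n = 1..6, recall-weighted with β = 2. It omits refinements of the official implementation (e.g., the word n-grams of chrF++ and configurable whitespace handling) and is intended only to show why sub-word overlap penalises morphological variation less strongly than exact token matching.

```python
from collections import Counter

def ngram_fscore(hyp, ref, n, beta=2.0):
    """F-score over character n-grams; beta > 1 weights recall more heavily."""
    hyp_counts = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    ref_counts = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((hyp_counts & ref_counts).values())
    if not hyp_counts or not ref_counts or overlap == 0:
        return 0.0
    prec = overlap / sum(hyp_counts.values())
    rec = overlap / sum(ref_counts.values())
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Average character n-gram F-score; whitespace removed, a common default."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    return sum(ngram_fscore(hyp, ref, n, beta) for n in range(1, max_n + 1)) / max_n
```

A hypothesis that differs from the reference only in an inflectional ending still shares most character n-grams, whereas an exact token-based metric would count the whole word as a mismatch.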
The metrics that are used in this study all rely on a comparison with a reference translation (see Section 4.3.1 for more details). Next to such reference-based evaluation metrics, some studies have also employed measures that target other aspects of MT quality. In fact, the automated evaluation of MT quality without using reference translations could be seen as a goal in itself. A detailed discussion of this task, which is commonly referred to as MT quality estimation (QE) [51], falls outside the scope of the current study.
Even though, practically speaking, automated evaluation is faster and less costly than human evaluation, theoretically speaking, it is almost universally acknowledged as being inferior to human evaluation. As a result, the main goal of automated quality metrics is to resemble human evaluation as closely as possible. Therefore, a "good" quality metric is often defined as one that correlates well with human ratings and is able to mimic human judgements [52][53][54].
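Such meta-evaluation is typically operationalised as a correlation between metric scores and human ratings over a set of segments. A minimal sketch (segment-level Pearson correlation; it assumes non-constant score lists):

```python
import math

def pearson(metric_scores, human_ratings):
    """Pearson correlation between automatic metric scores and human ratings
    for the same segments. Assumes both lists vary (non-zero variance)."""
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_ratings) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_ratings))
    sx = math.sqrt(sum((x - mx) ** 2 for x in metric_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in human_ratings))
    return cov / (sx * sy)
```

In practice, shared tasks on metric evaluation also use rank-based correlations (e.g., Kendall's tau), but the principle is the same: a metric is judged by how well it tracks human judgements.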

Manual Quality Assessment
Among the various human evaluation methods that have been developed, three approaches have become well-established standards in the field, namely, (a) direct assessment (DA) and/or ranking of MT output, (b) measurement of technical and temporal post-editing effort (PEE) and (c) translation error analysis. In ranking, multiple MT outputs are simply ordered by human assessors based on their quality [55]. DA consists of collecting human assessments of a given translation, in terms of how adequately it expresses the meaning of the corresponding source text or a reference translation, on an analogue rating scale [54,56]. PEE is often measured in terms of technical PEE (i.e., the amount of editing work involved in the post-editing process) [15,57] or temporal PEE (i.e., the time it takes to amend an MT suggestion to turn it into a high-quality translation) [58,59].
Even though all above-mentioned methods are useful for assessing MT quality for different purposes, they all fail to capture the exact nature of the errors made by an MT system and to provide reasons for a given quality assessment. For this reason, error analysis, which consists of detecting, categorizing and marking errors in machine-translated text, has emerged as a crucial human quality assessment technique, especially when it comes to identifying specific translation issues and carrying out diagnostic and comparative evaluations of translations [60,61]. A variety of MT error taxonomies has been proposed in the literature. They can be grouped as follows: (a) taxonomies that use common error categories as a basis, such as omissions, word order errors, incorrect words, unknown words and punctuation errors [61,62]; (b) linguistically-motivated taxonomies, which classify the MT errors into different linguistic levels, such as the orthographic, lexical and semantic levels [63,64]; and (c) fine-grained, hierarchical error taxonomies, that take the distinction between accuracy (or adequacy) and fluency as a basis for translation error classification [65,66].
While fine-grained error analysis is at the core of understanding the relationship between translation errors and quality, it is also labour- and time-intensive, and the classification of errors in MT output is by no means unambiguous [66,67]. In order for the findings to be reliable, it should be possible to apply an error analysis task consistently by multiple assessors, yielding a high inter-annotator agreement (IAA) [68]. Fine-grained error analysis has been applied to different MT architectures and it is still a common MT quality assessment technique that allows us to identify the strengths and weaknesses of NMT systems, for example, for different domains and language pairs [67,[69][70][71]. To our knowledge, NFR output, or that of similar TM-NMT integration systems, has not been analysed yet in terms of the types of errors it contains. However, such an exercise could shed more light on how the error profile of an NMT system changes when similar translations are integrated in its input. In this study, we used the hierarchical SCATE MT error taxonomy, described in more detail in Section 4.4.1, to evaluate and compare the translations produced by the NFR and a baseline NMT system.

The Current Study
The aim of this study is to provide a more thorough evaluation of the best-performing NFR system according to previous evaluations [4] by comparing its output to that of a baseline NMT system. In doing so, we hope to identify the strengths and weaknesses of the NFR system and the types of changes in translation quality relative to the baseline NMT system, as well as potential areas for further improvement of this data augmentation method for NMT. Our research questions are as follows:
• RQ1: How does the quality of NFR output compare to that of a baseline NMT system in terms of (a) semantic, syntactic and lexical similarity to a reference translation, as measured by automated quality metrics; and (b) the types and number of edit operations required to transform the MT output into the reference translation?
• RQ2: What types of and how many translation errors do the NFR and the baseline NMT system make? To what extent are the error profiles of the two systems different?
• RQ3: How do fuzzy matches in NFR influence the MT output, i.e., how often do tokens in fuzzy matches appear in NFR output and in the reference translations, and to what extent are these tokens aligned to tokens in the source sentence?
• RQ4: Which factors influence NFR quality and to what extent can these factors explain the variation in quality differences between the NFR and the baseline NMT system?
The first two research questions explicitly target MT quality evaluation, while questions 3 and 4 deal with the internal workings of NFR. As will become clear in the next section, we evaluated quality in two fundamentally distinct ways, i.e., (a) automatically, by applying metrics that calculate different types of similarity between the MT output and a reference translation; and (b) manually, by annotating translation errors on the basis of the MT output and the input sentence (without access to the reference translation). We consider both approaches to be complementary here, as they target distinct, though related aspects of MT quality and rely on different types of information.
Crucially, for practical reasons, both evaluations were performed out-of-context and at the sentence level in this study. Even though this is common research practice, it is clearly not ideal [2,72], especially in the specific translation context that is investigated in this study. Yet, in spite of inherent shortcomings, we believe that a combination of both automated metrics (that rely on comparisons with reference translations) and fine-grained human error annotation (based on the source text and the MT output) can provide insights into the quality of both NMT systems, including the types and amounts of errors they make. This is also one of the reasons why we opted for human error annotation instead of ranking or scoring. Most of the quality metrics have also been shown to correlate reasonably well with human judgements [52][53][54] and some were specifically trained for this purpose [47]. With regard to measuring post-editing effort, properly conducting an experimental study is difficult and costly in the specific context under investigation here, since we would need to rely on expert translators and set up (semi-)controlled experiments involving realistic translation tasks [73]. We come back to this issue in the discussion.

Methodology
In this section, we describe the data sets (Section 4.1) and the NMT systems (Section 4.2) used in the study. We then provide more details on the automated quality assessment (Section 4.3) and human error annotation (Section 4.4). The subsequent sections focus on the analysis of the impact of FMs on NFR translations (Section 4.5) and the methods used for identifying features that influence NFR quality (Section 4.6).

Data
To train our NMT models, we used the TM of the European Commission's translation service (DGT-TM), which consists of texts and their translations, written mostly for legal purposes, such as contracts, reports, regulations, directives, policies and plans within the Commission [74]. The translations in this data set are verified at multiple levels to ensure that they are of high quality, with consistent use of style and terminology. We focus on translation from English into Dutch.
The training set consisted of 2.389 million sentence pairs. A total of 3000 sentences were set aside for validation and 6207 sentences for testing. The validation and test sets did not contain any sentences that also occurred in the training set (i.e., 100% matches were removed). All sentence pairs were truecased and tokenised using the Moses toolkit [75].
We extracted a subset of 300 sentences from the test set for manual error analysis, applying stratified random sampling and filtering. With the aim of ensuring that the subset contained different types of segments while being representative of the original test set, we applied the following criteria:
• The distribution of the number of sentences in different FM ranges (based on the similarity between the input sentence and the sentence retrieved from the TM; see Section 4.2) was similar in both data sets;
• The subset contained an equal number of sentences with different source lengths (i.e., short sentences of 1-10 tokens, medium sentences of 11-25 tokens and long sentences of over 25 tokens) per FM range;
• Segments consisting (almost exclusively) of chemical formulas, numbers or abbreviations were excluded.

Table 1 describes the test set as well as the subset used for manual analysis. We subdivided the data set into different FM ranges, since the FM score was found to have a strong impact on the quality of the resulting MT output [3]. Note that we used the score of the "best" retrieved FM per input sentence (i.e., the FM with the highest similarity score) for the purpose of subdividing the data set into match ranges. A small number of sentences in the test set (18, or 0.3%) did not have any FMs with a similarity score above 50%. Such sentences were not included in the subset. As the table shows, we somewhat overrepresented the lower match ranges (i.e., 50-89%) in the subset in order to better balance the data set. The proportion of sentences in the highest match range is 40.5% in the full test set and only 25% in the subset.
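The stratified sampling described above can be sketched as follows. The cell structure (FM range × length bin) follows the criteria listed; the length-bin boundaries come from the text, while the function names and the per-cell quota are illustrative.

```python
import random

def length_bin(n_tokens):
    """Source-length strata used for the manual-analysis subset."""
    if n_tokens <= 10:
        return "short"
    if n_tokens <= 25:
        return "medium"
    return "long"

def stratified_sample(sentences, per_cell, seed=42):
    """sentences: list of (tokens, fm_range) pairs. Draws up to per_cell
    randomly chosen sentences from every (fm_range, length_bin) cell."""
    rng = random.Random(seed)
    cells = {}
    for toks, fm in sentences:
        cells.setdefault((fm, length_bin(len(toks))), []).append(toks)
    sample = []
    for key in sorted(cells):
        members = list(cells[key])
        rng.shuffle(members)
        sample.extend(members[:per_cell])
    return sample
```

Filtering (e.g., removing segments consisting of formulas or abbreviations) would be applied to the candidate pool before sampling.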

NMT Systems
For the purpose of our evaluations, we compared the best NFR configuration and the baseline NMT system reported in a previous study [4]. Both systems use the Transformer architecture [29] and were trained using OpenNMT [76]. More details on hyperparameters and training options are provided in the original study [4]. To train the NFR system, input sentences are augmented with similar translations retrieved from a TM (which is also used for training the baseline NMT system). In addition, alignment-based features are added to these augmented inputs. In what follows, we provide a brief overview of the different steps in the procedure.
First, for a given data set consisting of source/target sentence pairs (S, T), for each source sentence s_i ∈ S, n similar sentences {s_1, ..., s_n} ∈ S are retrieved from the same data set, where s_i ∉ {s_1, ..., s_n}, given that the similarity score is above a given threshold λ > 0.5. The sentence similarity score SE(s_i, s_j) between two sentences s_i and s_j is defined as the cosine similarity of their sentence embeddings e_i and e_j, that is, SE(s_i, s_j) = (e_i · e_j) / (‖e_i‖ ‖e_j‖), where ‖e‖ is the magnitude of vector e. Similar to [5], we obtained sentence embeddings by training sent2vec [77] models on in-domain data and, for efficient similarity search, we built a FAISS index [78] containing the vector representation of each sentence. Byte-pair encoding (BPE) was applied to the data prior to calculating sentence similarity and was used in all subsequent steps [79] (https://github.com/rsennrich/subword-nmt (accessed on 15 July 2021)). To this end, we used a 32 K vocabulary for source and target language combined. In a second step, for each s_i, once the n most similar source sentences (i.e., fuzzy sources) are retrieved using cosine similarity, tokens in each fuzzy target {t_1, ..., t_n} ∈ T are augmented with word alignment features that indicate which source tokens they are aligned with in s_i. This alignment process is conducted in two steps: firstly, the source tokens in s_i are aligned with the fuzzy source tokens in s_j by back-tracing the optimal path found during edit distance calculation between the two segments; secondly, the fuzzy source tokens in s_j are aligned with the fuzzy target tokens in t_j by referring to the automatically generated word alignments, which are obtained with GIZA++ [80]. After obtaining the alignments, fuzzy target tokens that are aligned with source tokens are marked as m (match) or nm (no-match). All tokens in the original input sentence are also marked with the feature S (source) for correct formatting.
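The retrieval step can be sketched as follows, with plain-Python stand-ins for the sent2vec embeddings and the FAISS index; in practice, the embeddings are learned and the query sentence itself is excluded from the candidate set.

```python
import math

def cosine_similarity(e_i, e_j):
    """SE(s_i, s_j): cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(e_i, e_j))
    norm_i = math.sqrt(sum(a * a for a in e_i))
    norm_j = math.sqrt(sum(b * b for b in e_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

def retrieve_fuzzy_matches(query_vec, tm_vecs, threshold=0.5, n=2):
    """Return (index, score) pairs of the n most similar TM sentences with a
    similarity score above the threshold lambda. Brute-force stand-in for a
    FAISS nearest-neighbour search."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(tm_vecs)]
    scored = [(i, s) for i, s in scored if s > threshold]
    scored.sort(key=lambda x: -x[1])
    return scored[:n]
```

A FAISS index over normalised vectors yields the same ranking via inner-product search, but in sub-linear time for large TMs.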
In the next step, the best-scoring FM target t_1 is combined with another FM target t_j, which maximises the number of tokens covered in the input sentence s_i, provided that a second FM with additional matching input tokens can be found. If this is not the case, this method falls back to using the second-best fuzzy target t_2. For each input sentence s_i in the bilingual data set, an augmented sentence s_i′ is generated by concatenating the combined fuzzy target sentences to s_i. We used "@@@" as the boundary token between each sentence in the augmented input sentence.
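A minimal sketch of the input augmentation described above; the token|feature syntax is illustrative (not the exact format of the original implementation), and the alignment sets are assumed to have been computed in the preceding alignment step.

```python
def augment_source(src_tokens, fuzzy_targets, aligned_sets, boundary="@@@"):
    """Build an NFR-style augmented input sentence. Source tokens are tagged S;
    each fuzzy target token is tagged m (aligned to a source token) or nm
    (not aligned). aligned_sets gives, per fuzzy target, the set of its token
    positions that are aligned to the source."""
    parts = [" ".join(f"{tok}|S" for tok in src_tokens)]
    for tgt_tokens, aligned in zip(fuzzy_targets, aligned_sets):
        parts.append(" ".join(
            f"{tok}|{'m' if i in aligned else 'nm'}"
            for i, tok in enumerate(tgt_tokens)))
    return f" {boundary} ".join(parts)
```

For instance, a fuzzy target whose second token is not aligned to any source token would have that token tagged nm, signalling to the model that it should not be copied.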
Finally, the NMT model is trained using the combination of the original TM, which consists of the original source/target sentence pairs (S, T), and the augmented TM, consisting of augmented-source/target sentence pairs (S′, T). At inference, each source sentence is augmented using the same method. If no FMs are found with a match score above λ, the non-augmented (i.e., original) source sentence is used as input. Figure 1, which is modified from Tezcan, Bulté and Vanroy [4], illustrates the NFR method for training and inference.

Automated Quality Estimation
In this section, we describe the procedures used for automated quality estimation. We list and briefly discuss the different metrics that were applied (Section 4.3.1) and provide more details on the identification of necessary edit operations using TER (Section 4.3.2). The human evaluation based on error analysis is described in the subsequent Section 4.4.

Metrics
We calculated seven automated metrics for translation quality estimation. The goal of using a variety of metrics is to cover different aspects of translation quality and to analyse whether all metrics, which use different methods to measure quality, agree on the potential difference in quality between the NFR and the baseline NMT systems. Table 2 provides an overview of these metrics, specifying which method they use, the unit they target and the translation quality dimension they assess. More details concerning their concrete implementation in this study are provided in Appendix A.

BLEU and METEOR are similar metrics in that they measure n-gram precision at the level of token strings. Whereas BLEU targets exact lexical overlap, METEOR takes into account morphological and lexical variation (i.e., by also comparing lemmas and synonyms). Similarly to BLEU and METEOR, TER also targets token strings but estimates edit operations instead of n-gram precision. In this sense, it can be said to tap into the technical post-editing effort required to bring the MT output in line with the reference translation. The version of TER we used here, similar to BLEU, does not accept morphological or lexical variation. Unlike the three previous metrics, chrF targets character strings rather than token strings. It measures exact overlap, as BLEU does, but, since it also targets sub-word units, morphological variation is penalised less strongly.
BERTScore and COMET measure semantic similarity by calculating the distance between vector representations of tokens and sentences, respectively. Thus, they are fundamentally different from the previous metrics, which operate on (non-transformed) token or character strings. Moreover, COMET scores have been tuned towards predicting human translation quality ratings. Finally, DTED measures syntactic similarity by calculating edit distance for syntactic dependency labels [81].

TER Edit Operations
TER computation is based on the identification of different types of token-level edit operations required to bring the MT output in line with a reference translation (i.e., insertions, deletions, substitutions and token or group shifts). By using TER, we aim to find out what types and amounts of edits are required to transform the output of both the NFR and the baseline NMT systems into the reference translation. An additional reason for using TER edits is that the edit types are identified for each token separately, which makes it possible to analyse whether the edits affect different classes of words. This is important because content words, which possess semantic content and contribute significantly to the meaning of the sentence in which they occur, arguably matter more than function words when it comes to translation quality, as they have been associated with increased post-editing effort [82,83]. In the context of automatic evaluation, attaching more weight to content words than to function words has been shown to lead to higher correlations with human quality judgements [84].
We compared the required edit operations for the NFR and baseline translations to determine whether they were distributed differently. To obtain a more detailed picture, we distinguished between edit operations affecting content words (i.e., nouns, proper nouns, adjectives, verbs and adverbs), function words (i.e., articles, determiners, prepositions, etc.) and other words/tokens (such as punctuation, symbols and other tokens that cannot be assigned a part-of-speech tag), following the classification used in the context of Universal Dependencies [85]. To automate the detection of part-of-speech (POS) tags, we rely on the state-of-the-art stanza parser developed by the Stanford NLP group [86].
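A possible implementation of the word-class grouping used in this analysis. The assignment of UPOS tags to the content class follows the text above; the exact set of tags placed in the function class is an assumption for illustration, based on the Universal Dependencies open/closed-class distinction.

```python
from collections import Counter

# Content words per the text: nouns, proper nouns, adjectives, verbs, adverbs.
CONTENT_TAGS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}
# Function words (assumed grouping of UD closed-class tags).
FUNCTION_TAGS = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART", "NUM"}

def word_class(upos):
    """Map a UPOS tag to one of the three classes used in the analysis."""
    if upos in CONTENT_TAGS:
        return "content"
    if upos in FUNCTION_TAGS:
        return "function"
    return "other"  # PUNCT, SYM, X, INTJ, ...

def edit_profile(token_edits):
    """token_edits: (edit_type, upos) pairs, e.g., derived from a TER
    alignment. Returns counts per (edit_type, word_class) combination."""
    return Counter((op, word_class(pos)) for op, pos in token_edits)
```

Running a POS tagger (such as stanza) over the MT output and pairing each token's tag with its TER edit type yields the per-class edit distributions compared in the results.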

Manual Error Analysis
In this section, we first describe the error taxonomy that was used for the error analysis (Section 4.4.1). This is followed by an overview of the annotation procedure (Section 4.4.2). We also report and discuss inter-annotator agreement (Section 4.4.3).

Error Taxonomy
To identify errors in the MT outputs, we used the SCATE MT error taxonomy [66,69]. As shown in Figure 2, this taxonomy is hierarchical, consisting of three levels. At the highest level, a distinction is made between accuracy and fluency errors (see also Section 2.2). Any error that can be detected by analysing the MT output alone is defined as a fluency error. If an error can only be detected by analysing both the source text and the MT output, it is classified as an accuracy error. Both of these error types have two additional levels of subcategories (e.g., accuracy → mistranslation → word sense). In this study, we used a slightly adapted version of the taxonomy; the category "fluency → orthography" was added and the sub-category "non-existing word" was moved from the fluency error category "coherence" to "lexicon". These changes were implemented based on annotator feedback in the context of previous research.

Procedure
Two annotators performed the error annotation task. Both were native speakers of Dutch with advanced English proficiency (level C2 of the CEFR scale). Both annotators had a bachelor's degree in applied linguistics, with courses taken in translation studies, as well as a master's degree (in interpreting, in the case of annotator 1, and in translation, for the second annotator). Prior to the annotation task, they were briefed about the task, read the annotation guidelines and performed a test task with 30 sentences for which they received feedback. The sentences in the test task were not included in the final annotation set. It has to be noted that the annotators were not experienced translators working at the DGT and did not have expert knowledge of the terminology used in this domain. They also lacked document-level contextual information, which should be taken into account when interpreting the results.
The annotation task was performed using the online annotation tool WebAnno (https://webanno.github.io/webanno/, accessed on 21 September 2021). MT errors were annotated in two steps. First, fluency errors were annotated by analysing the MT output alone, without having access to the corresponding source text (monolingual annotation). Second, the accuracy errors were annotated by analysing the MT output and the source text together (bilingual annotation). During the second step, the fluency errors annotated in the first step were visible to the annotators. The annotators were allowed to annotate multiple error categories on the same text span (i.e., errors can overlap). To ensure consistency in the error annotations, the translations of the NFR and baseline NMT systems were presented side by side to the annotators. The system details were masked during the annotations. The error definitions used in the SCATE taxonomy and the detailed annotation guidelines that were provided to the annotators can be found at https://github.com/lt3/nfr/blob/main/webanno/SCATE_annotation_guidelines.pdf (accessed on 21 September 2021). Both annotators completed the annotation task (for NFR and baseline NMT output) in approximately 34 h. After the annotation task, the annotators worked together to resolve disagreements and to create a final, consolidated version of the error annotations for both data sets, which can be found at https://github.com/lt3/nfr/tree/main/webanno (accessed on 21 September 2021).

Inter-Annotator Agreement
To assess the level of inter-annotator agreement (IAA), we evaluated the two tasks the error annotation process was based on, namely, error detection and error categorization, using the methodology proposed by Tezcan, Hoste and Macken [66]. Error detection is seen as a binary decision task, which consists of deciding whether a token corresponds to an error or not. IAA is assessed by calculating Cohen's kappa at token level (i.e., tokens marked as error or not) and at sentence level (i.e., sentences marked as containing an error or not). To assess error categorization, we used alignment-based IAA; Cohen's kappa was calculated for both annotators' error annotations that overlapped in terms of the tokens they spanned. In this analysis, isolated annotations (i.e., when only one of the annotators detected an error on a given text span) were not included. Similarly, when multiple annotations overlapped between the two annotators, only aligned errors were analysed. Agreement on error categorization was analysed for the three hierarchical levels of the error taxonomy. Figure 3 provides an example of error annotations by both annotators for the baseline NMT translation of the input sentence "The top part is screwed onto the nut"; "notenboom" would be the Dutch translation of "walnut tree", whereas "nut" was correctly translated by the NFR system as "moer".
In the annotation examples provided in Figure 3, there are 9 tokens in the MT output. Both annotators agreed that this output contained errors (agreement on sentence-level error detection). On the other hand, while annotator 1 marked 1 token as erroneous (and 8 as correct), annotator 2 identified 4 erroneous tokens (low agreement on token-level error detection). By aligning the "mistranslation" and "logical problem" error annotations between the annotators, we can see that both annotators agreed on categorizing both errors at the first level (accuracy and fluency), as well as at the second level (mistranslation and logical problem) of the taxonomy, whereas, at level 3, they only agreed on categorizing the "logical problem" error as "coherence". The annotators disagreed when categorizing the "mistranslation" error (word sense vs. semantically unrelated).

Table 3 shows IAA on error detection at token and sentence level, for both the NFR and baseline NMT translations. The confusion matrix on which the calculation of Cohen's kappa was based is provided in the appendices (Table A3). IAA can be characterised as low at token level and moderate at sentence level. However, it should be noted that error detection at token level is a very unbalanced task (in the case of high-quality translations), with the vast majority of tokens not corresponding to errors (i.e., the probability of overall chance agreement is very high). There are two further reasons why IAA at token level seems low: (a) even though both annotators often detected the same error, their annotations covered different text spans; and (b) annotator 2 was more critical than annotator 1 overall, not only identifying a higher number of errors, but also covering more tokens (see Table A3).

Table 4 summarises the results for IAA on error classification at the three levels of the error taxonomy. At Level 1, agreement at the top level of the hierarchy was analysed (accuracy vs. fluency).
For Levels 2 and 3, the same analyses were performed for the sub-categories in the hierarchy (if the annotators agreed on the higher level). At the lowest level (3), we only evaluated the two categories with the largest number of identified errors (i.e., accuracy → mistranslation and fluency → style and register). Cohen's kappa was very high for error classification at the highest level of the taxonomy, as well as at Level 2, when all accuracy errors were considered. For fluency errors, IAA was slightly lower. At the lowest level, agreement was perfect for fluency errors related to style and register, whereas, for accuracy errors related to mistranslations, IAA was found to be moderately high. It can also be seen that the level of agreement was higher for the NFR system for all levels of error detection and categorisation.
Even though some of the kappa scores reported for this study may seem low, they are on par with those reported in similar studies on MT error analysis [67,68,87]. However, it also has to be noted that there is no consensus on how IAA should be analysed for this task, which makes it difficult to compare the IAA rates reported in different studies.
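To make the token-level kappa computation concrete, the Figure 3 example can be encoded as two binary vectors (1 = token marked as erroneous). This is a minimal stdlib-only sketch; the function name and the assumption that annotator 1's single annotated token lies inside annotator 2's four-token span are ours, not taken from the study.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length binary annotation sequences."""
    assert len(a) == len(b) and a, "sequences must be non-empty and aligned"
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p1, p2 = sum(a) / n, sum(b) / n                 # marginal "error" rates
    p_exp = p1 * p2 + (1 - p1) * (1 - p2)           # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Figure 3 example: 9 tokens; annotator 1 marks 1 token, annotator 2 marks 4.
ann1 = [1, 0, 0, 0, 0, 0, 0, 0, 0]
ann2 = [1, 1, 1, 1, 0, 0, 0, 0, 0]
print(round(cohen_kappa(ann1, ann2), 3))  # ~0.270: low token-level agreement
```

Despite two thirds of the tokens being labelled identically, chance-corrected agreement is low, which illustrates why token-level kappa is depressed for unbalanced, high-quality output.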

Fuzzy Match Analysis in NFR
NFR works by augmenting input sentences with the target side of an FM retrieved from a TM. In the NFR system evaluated here, two FMs were concatenated to the original input sentence. To analyse the impact these FMs had on the MT output, we counted the number of tokens in the FMs that also appeared in the MT output and compared this to the tokens that appeared in the baseline NMT translation. Moreover, we distinguished between FM tokens that were either aligned or not to tokens in the input sentence (i.e., the match and no-match alignment features described in Section 4.2). In a final step, we verified how many of the match/no-match tokens that appeared in the MT outputs were also present in the reference translation.
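The token counting described above can be sketched as a simple multiset overlap. This is an illustration under our own naming (`pass_through_rate`), not the exact alignment-based procedure used in the study:

```python
from collections import Counter

def pass_through_rate(fm_tokens, output_tokens):
    """Fraction of FM tokens that also appear in the MT output
    (multiset overlap, so repeated tokens are counted at most as
    often as they occur in the output)."""
    fm, out = Counter(fm_tokens), Counter(output_tokens)
    overlap = sum(min(count, out[tok]) for tok, count in fm.items())
    return overlap / max(sum(fm.values()), 1)

# Hypothetical Dutch tokens: two of the three FM match tokens reappear.
match_tokens = ["de", "moer", "wordt"]
mt_output = ["de", "moer", "is", "vastgeschroefd"]
print(pass_through_rate(match_tokens, mt_output))  # 2 of 3 tokens pass through
```

Applying the same overlap to the reference translation instead of the MT output yields figures analogous to the % m-pass-REF values reported later.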

Features and Statistical Models
The aim of our final analysis is to identify features that can be used to further improve the quality of NFR translations. To this end, we investigated the extent to which certain characteristics of the source sentence and the selected FMs influenced the difference in quality between the NFR and baseline NMT systems or, in other words, we looked for features that made NFR translations "better" (or "worse") than baseline translations. This identification of features should allow us to improve the performance of the NFR system in three possible ways: (a) by better selecting FMs used for data augmentation, (b) by optimising FM combination strategies and (c) by possibly selecting which input sentences not to augment.
For the purpose of this analysis, we calculated the difference between the sentence-level TER scores obtained by the NFR and the baseline NMT system. Table 5 gives an overview of the candidate features that we investigated. First, we considered the vocabulary of the source sentence in terms of the frequency of tokens in the data set as a whole, by calculating the percentage of tokens in the sentence belonging to different frequency bands. We also looked at the length of the source sentence, as well as the length of both FMs relative to the length of the source sentence and to one another. Next, we considered the similarity of the FMs to the source sentence, as well as the ratio of match/no-match tokens per FM. Finally, we included the mutual dependency-tree edit distances between the source sentence, FM1 and FM2.

As a first step in our analysis, we selected a subset of the test set based on the difference in TER scores between the baseline NMT and the NFR systems, keeping only those sentences for which a considerable difference in TER scores (i.e., >0.1) was observed. We then split this subset into sentences for which NFR obtained a better TER score and sentences for which the baseline translation scored better. We compared the mean scores for each feature for both subsets and estimated the significance and magnitude of the difference between them using independent-samples t-tests and Cohen's d effect size.
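The subset construction and effect-size estimate can be sketched as follows (stdlib only; the key names `base_ter`/`nfr_ter` and the helper names are ours, and the study additionally reported independent-samples t-tests):

```python
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d: standardised mean difference with pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = (((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
                 / (nx + ny - 2)) ** 0.5
    return (mean(x) - mean(y)) / pooled_sd

def split_by_ter_gap(sentences, threshold=0.1):
    """Keep only sentences with a considerable TER gap (> threshold) and
    split them by which system scored better (lower TER = better)."""
    nfr_better = [s for s in sentences if s["base_ter"] - s["nfr_ter"] > threshold]
    base_better = [s for s in sentences if s["nfr_ter"] - s["base_ter"] > threshold]
    return nfr_better, base_better

sentences = [
    {"base_ter": 0.50, "nfr_ter": 0.30, "src_len": 25},
    {"base_ter": 0.30, "nfr_ter": 0.55, "src_len": 9},
    {"base_ter": 0.40, "nfr_ter": 0.42, "src_len": 14},  # gap too small: dropped
]
nfr_better, base_better = split_by_ter_gap(sentences)
print(len(nfr_better), len(base_better))  # 1 1
```

A feature such as `src_len` can then be compared across the two subsets with `cohens_d` (each subset needs at least two observations) to flag candidate predictors.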
The features that showed the largest difference (in terms of Cohen's d) between the two subsets were then entered as independent variables into a linear model, with the difference in TER scores between the NFR and baseline translations as the dependent variable. We used R's [88] lm function to fit the model and included all sentences in the test set as observations. The aim of this step was to jointly model the effects of the different features, focusing on the interpretability of results. We performed forward and backward model selection using Akaike's Information Criterion (AIC) to arrive at the most parsimonious model [89,90].
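The study fitted this model with R's lm and stepwise AIC selection; purely to illustrate the criterion, the sketch below implements an OLS fit, the Gaussian AIC and one backward-elimination step in stdlib Python (all function names are our own):

```python
import math

def ols_fit(X, y):
    """Least-squares coefficients via the normal equations.
    X is a list of rows; the first column should be the intercept (1.0)."""
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                                for c in range(r + 1, k))) / xtx[r][r]
    return beta

def aic(X, y, beta):
    """Gaussian AIC: n * ln(RSS / n) + 2p, with p = coefficients + error variance."""
    n = len(y)
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    return n * math.log(rss / n) + 2 * (len(beta) + 1)

def backward_step(X, y):
    """One backward step: find the predictor (if any) whose removal lowers AIC."""
    best_aic, drop = aic(X, y, ols_fit(X, y)), None
    for j in range(1, len(X[0])):                  # never drop the intercept
        Xr = [row[:j] + row[j + 1:] for row in X]
        a = aic(Xr, y, ols_fit(Xr, y))
        if a < best_aic:
            best_aic, drop = a, j
    return best_aic, drop
```

Repeating `backward_step` until `drop` is `None` (together with a mirror-image forward step) yields the most parsimonious model in the AIC sense, analogous to what stepwise selection in R does.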

Automated Quality Assessment
In this section, we present the results related to the first research question. We first look at the quality assessment using automated metrics (Section 5.1.1), before turning to the analysis of required edit operations (Section 5.1.2).

Semantic, Syntactic and Lexical Similarity
We used seven automated evaluation metrics to estimate the quality of the NFR system and compare it to the baseline Transformer, as well as to the most similar translation retrieved from the TM (using cosine similarity). The results are presented in Table 6. We report the overall score per metric for the baseline and NFR systems, as well as the TM; further, we indicate the absolute and relative difference between the baseline and the NFR systems. To allow a comparison across metrics, we also report the difference in terms of standardised (or z) scores for each metric. Z-scores represent deviations from the mean in terms of standard deviations.

Table 6. Automatic evaluation results for TM, baseline NMT and NFR. Arrows next to each metric indicate whether higher scores (↑) or lower scores (↓) are better. Scores are reported per system, together with the absolute and relative difference between baseline and NFR and the difference in terms of standard (z) scores.

According to all evaluation metrics, the quality of the NFR translations was estimated to be higher than that of those produced by the baseline NMT system. All improvements were statistically significant according to the Mann-Whitney U test (p < 0.001). The largest standardised difference between the two systems was recorded for BLEU (+0.157), followed by chrF (+0.155) and BERTScore (+0.134). COMET showed the smallest difference (+0.077). Both the baseline NMT and the NFR systems outperformed the TM in this scenario, according to all metrics. For reference, the scores for all evaluation metrics obtained on the subset of the test set used for manual evaluation are provided in the appendices (Table A1).
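One way to obtain such scale-free comparisons is to standardise each metric over the pooled sentence-level scores of both systems and take the difference of mean z-scores. The sketch below illustrates this idea (our own helper name; not necessarily the exact computation used in the study):

```python
from statistics import mean, stdev

def z_difference(baseline_scores, nfr_scores):
    """Mean z-score of the NFR system minus that of the baseline,
    standardised over the pooled sentence-level scores of both systems."""
    pooled = baseline_scores + nfr_scores
    m, s = mean(pooled), stdev(pooled)
    z_mean = lambda scores: mean((x - m) / s for x in scores)
    return z_mean(nfr_scores) - z_mean(baseline_scores)

# Hypothetical sentence-level chrF scores for three sentences per system.
baseline = [0.40, 0.50, 0.60]
nfr = [0.60, 0.70, 0.80]
print(round(z_difference(baseline, nfr), 3))  # 1.414 standard deviations
```

Because the standardisation happens per metric, differences expressed this way can be compared across metrics with very different native scales (e.g., BLEU vs. COMET).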
To allow a more detailed analysis to be performed, the BLEU scores per FM range are reported in Table 7. These scores confirm previous findings [4]. With the increase in FM scores, (a) the estimated translation quality increased for both systems and (b) the difference between the baseline NMT and NFR systems became larger. Table 7 also shows that, in the highest FM range (i.e., 90-99%), FMs retrieved with cosine similarity achieved a higher BLEU score when used as the final output than the baseline NMT system (75.25 vs. 74.64). On the other hand, the scores obtained by the NFR system were much higher (85.25). The BLEU scores for the different FM ranges obtained on the subset used for manual evaluation are provided in the appendices (Table A2).
Finally, the correlations between the different evaluation metrics are shown in Table 8. We did not include BLEU in this analysis, since BLEU scores are not reliable at sentence level [42]. Generally speaking, the correlations between the different evaluation metrics could be described as strong to very strong, even though most correlations did not exceed 0.85. The strongest correlations were found between chrF and BERTScore (0.934 and 0.929) and between chrF and METEOR (0.923 and 0.916). The reported correlations confirm that DTED was the most distinct of the metrics, targeting syntactic structure rather than token strings or (vectorised) semantics. It showed the weakest correlation with four out of five of the other metrics and its strongest correlation was with TER (0.838). In addition, COMET appeared to capture different information. The correlation between COMET and other metrics did not exceed 0.824. It is also worth noting that the correlations between the metrics were highly comparable for the NFR and baseline NMT translations. The largest difference was found for the correlation between TER and METEOR (−0.811 for NFR and −0.852 for the baseline NMT system).
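Such metric-to-metric correlations can be computed with a plain Pearson coefficient over paired sentence-level scores; a minimal sketch with our own helper name:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient for two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Perfectly linear scores correlate at 1.0; inverted scores at -1.0
# (error-based metrics such as TER behave like the latter relative to quality,
# hence the negative correlation entries above).
print(pearson_r([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))   # ≈ 1.0
print(pearson_r([0.1, 0.2, 0.3], [0.9, 0.8, 0.7]))   # ≈ -1.0
```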

Edit Operations
We analysed the number and types of TER edit operations to obtain a more detailed picture of the formal editing required to transform the MT output into the reference translation. We first looked at the number of different types of edit operations for the complete test set for both MT systems, before turning to an analysis per match range. The results of the first analysis are presented in Table 9. For the purpose of this analysis, we made a distinction between edits involving content words, function words and other words or tokens. The results show that NFR produced translations that required fewer edit operations overall (−15.10% edits and −1.18 edits per sentence), as well as for all individual edit types. The reduction in the number of edits seemed to be consistent and balanced across content, function and other words, for all types of edits. When we compared the relative frequencies of edit types, we could see that substitutions made up the largest group of edits, with an average of 3.49 and 2.90 edits required per sentence in the baseline NMT and NFR outputs, respectively. The difference between the NFR and baseline systems in terms of the number of required substitutions was also substantial (−16.97%), especially when compared to the difference in terms of insertions (−9.61%) and, to a lesser extent, deletions (−13.11%). Two other edit types showed an even larger reduction in the total number of required edits: shift tokens (−19.55%) and shift groups (−18.76%).
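The edit-operation counting can be illustrated with a word-level alignment that tracks insertions, deletions and substitutions separately. Note that this is a Levenshtein sketch of our own: full TER additionally allows block shifts (the shift-token and shift-group counts above), which are omitted here.

```python
def edit_operations(hyp, ref):
    """Minimal (cost, insertions, deletions, substitutions) needed to turn
    the hypothesis token list into the reference token list."""
    m, n = len(hyp), len(ref)
    # dp[i][j] holds the best (cost, ins, del, sub) for hyp[:i] -> ref[:j]
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])            # delete
    for j in range(1, n + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1] + 1, c[2], c[3])            # insert
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = dp[i - 1][j - 1]
            if hyp[i - 1] == ref[j - 1]:
                cands = [d]                                    # match, free
            else:
                cands = [(d[0] + 1, d[1], d[2], d[3] + 1)]     # substitute
            c = dp[i - 1][j]
            cands.append((c[0] + 1, c[1], c[2] + 1, c[3]))     # delete
            c = dp[i][j - 1]
            cands.append((c[0] + 1, c[1] + 1, c[2], c[3]))     # insert
            dp[i][j] = min(cands)
    return dp[m][n]

print(edit_operations("de bovenkant is vast".split(),
                      "de bovenkant wordt vast geschroefd".split()))
# (2, 1, 0, 1): one substitution and one insertion
```

Dividing the total cost by the reference length gives a shift-free approximation of the TER score.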
For the second analysis, we divided the test set according to the match score of the first FM that was retrieved for the NFR system. A detailed overview of the frequency of edit operations per match range is provided in the appendices (Table A4). A summary of the results is visualised in Figure 4. In this graph, for the sake of clarity, we only distinguish between edit types (not token types) and plot the percentage difference between the baseline NMT and NFR systems. Figure 4 shows that the estimated translation quality difference between the NFR and the baseline NMT systems, measured in terms of the number and types of edits required, was larger for higher FM scores. For source sentences that were augmented with FMs with a similarity score of up to 70%, there was little difference between the NFR and the baseline systems. In fact, compared to the NFR system, the baseline system required fewer substitutions and shifts in the FM range 50-59% and fewer insertions and deletions in the range 60-69%. With FM similarity scores of 70% and higher, NFR required fewer edits for all edit types. The reduction in required edits increased dramatically for higher FM ranges, reaching a difference of between −26.88% (for insertions) and −55.77% (for shift groups) in the FM range 90-99%.

Fine-Grained Error Analysis
Our second research question aimed to compare the types and number of errors that were made by the NFR and the baseline NMT systems, classified according to the SCATE error taxonomy. The results of the consolidated human error annotation are summarised in Table 10. Overall, the annotators identified the same number of errors (199) for both translation systems. Likewise, the number of sentences that contained at least one error was highly similar for both systems (120 for baseline NMT and 123 for NFR). However, the errors in the NFR output spanned more tokens (442 compared to 395, or 7.9% and 7.0%, respectively). It is worth noting that, overall, around 60% of the sentences in the test set did not contain any errors, according to the annotations.

Table 11 shows which types of errors were made by both MT systems according to the SCATE error taxonomy. As can be seen in Table 11, the majority of errors made by both the baseline NMT (110) and the NFR system (128) were accuracy errors. For both systems, the two error categories with the largest number of errors were mistranslation and omission errors. Taken together, the errors in these two categories made up 85% and 76% of all accuracy errors for the baseline and NFR systems, respectively. Looking more closely at the mistranslation errors that both systems made, different error profiles emerged; while the NFR system made more errors in the sub-category "semantically unrelated" (27), the baseline system seemed to perform better in this respect and produced translations with 56% fewer errors in this category (12). On the other hand, the baseline system produced more errors in all other sub-categories of mistranslation, with, especially, the "word sense" and "multi-word expressions" categories standing out when compared to the NFR system (even though, in absolute terms, these types of errors were less frequent). When we consider the other types of accuracy errors, the NFR system clearly made more addition errors (8 vs. 25).
While the NFR system made more accuracy errors, producing more fluent output seemed to be its strength compared to the baseline NMT system, as it produced translations with 19% fewer fluency errors (89 vs. 71). Within the main category of fluency, errors were distributed differently across the two systems. The most striking difference could be observed for the "lexicon" category; the baseline system clearly made worse "lexical choices" while producing translations (17 vs. 9) and used words outside the Dutch vocabulary (7 errors in the category "non-existing/foreign"), which did not seem to be a big issue for the NFR system (1 error). On the other hand, the baseline system generated fewer errors of "grammar and syntax" than the NFR system (8 vs. 15).

Impact of Data Augmentation
Our third research question targeted the impact FMs had on the MT output. Table 12 shows the percentage of matched tokens per selected FM (% m/FM1 and % m/FM2) and how many of these tokens appeared in the MT output relative to the total number of matched tokens (% m-pass). We report these values for the full test set for both the baseline NMT and the NFR systems, even though the FM tokens, of course, did not actually appear in the baseline input. The table also shows what percentage of the match/no-match tokens appearing in the MT output also formed part of the reference translation (% m-pass-REF).
In addition to providing the values for the full test set, we provide them for the lowest (50-59%) and the highest (90-99%) match ranges for the NFR system. In the full test set, 88.4% of the matching tokens were transferred to the NFR output. For comparison, 82.6% of these tokens could also be found in the baseline NMT output. Of all the matching tokens that appeared in the NFR output, 69.5% could also be found in the reference translation. For the baseline system, this was only 66.6%. Another interesting observation is that fewer matching tokens passed to the NFR output in the lowest FM range (74.5%), especially when compared to the highest range (95.8%).
Looking at how no-match tokens were utilised by both systems, on average, a higher ratio of such tokens appeared in the NFR translations (36.5%) than the baseline NMT output (31.9%). It is worth noting that a higher proportion of no-match tokens in NFR translations also formed part of the reference translation (29.6%) than is the case for the baseline translations (27.3%). On the other hand, the majority of no-match tokens that appeared in the NFR and baseline outputs (70.4% and 72.7%, respectively) did not appear in the reference translations. The NFR system also seemed to carry a higher percentage of no-match tokens to its output in higher FM ranges (45.8% in 90-99% vs. 23.5% in 50-59%) and more of these tokens appeared in the reference translations in the higher match range.
As detailed in Section 4.2, for the NFR configuration used in this study, the highest-scoring FM was combined with a second FM, which did not necessarily achieve a high similarity score but which maximised the total number of source words that were aligned. As a result, there were, on average, fewer matching tokens in FM2 than in FM1 (46.8% vs. 62.2%). However, an overall pattern similar to the one observed for FM1 emerged when comparing how matching/non-matching tokens were utilised by the NFR and baseline NMT systems. The main difference seemed to be that fewer FM2 tokens (both match and no-match) were transferred to the NFR output than FM1 tokens. Similar patterns could also be observed for FM1 and FM2 when comparing the lowest and highest FM ranges in NFR.

Variables Influencing MT Quality
The final research question aimed to identify factors that influence the quality of NFR translations, with the ultimate aim of further improving NFR quality. We first analysed the effect of separate features by comparing their average values for sentences where NFR scored substantially better than the baseline NMT (in terms of TER) and those where the baseline scored better. As outlined in Section 4.6, the selected features were extracted per input sentence and targeted four different characteristics:

• Vocabulary frequency: percentages of tokens in the input sentence that belonged to the 99%, 90% and 75% least frequent tokens in the source side of the training set (% Q99, % Q90 and % Q75).
• Length: the length of the source sentence, as well as the length of both FMs relative to the source sentence and to one another.
• FM similarity: the similarity of each FM to the source sentence and the ratio of match/no-match tokens per FM.
• Syntactic distance: the dependency-tree edit distances between the source sentence, FM1 and FM2.

The results of this analysis are shown in Table 13. There were 1310 sentences for which NFR translations obtained a TER score that was at least 0.10 better (i.e., lower) than that of the baseline NMT translation. In contrast, for only 579 sentences, the baseline translations scored better. A more detailed analysis of the distribution of sentences that obtained better TER scores per system and per FM range is provided in the appendices (Table A5). This table shows that, in each FM range, there were sentences that were translated better by the baseline system, although the proportion of sentences for which the NFR system performed better was clearly larger for higher FM ranges.
For 7 out of 15 features, a substantial difference, with a Cohen's d effect size of over 0.20, was found between the two subsets. Most of these had to do with the similarity between the FMs and the source sentence and the percentage of matched tokens they contained (FM1 score, FM2 score, % m/FM1 and total_m/source). In addition, the length of the source sentence, as well as the length difference between the two FMs, differed substantially between the subsets. The final qualifying feature was the dependency-tree edit distance between both FMs.

After performing model selection, six out of seven predictors were retained in our final linear model. The outcome variable for this model was the difference in TER scores between the baseline NMT and NFR translations of the sentences in the test set. The parameter estimates of the model are presented in Table 14. Since we are dealing with a linear and additive model without interaction terms, the coefficients have to be interpreted as changes to the outcome variable when the other terms in the model are kept constant.

Table 14. Parameter estimates (b), standard errors (S.E.), standardised estimates (β), t- and p-values for the linear model estimating TER difference. Adjusted R² = 0.064. (* p < 0.05; ** p < 0.01; *** p < 0.001).

All six variables contributed significantly to the model. The strongest effects, as shown by the standardised betas, were observed for the FM1 score (positive) and the FM2 score (negative). However, note that the overall explanatory power of this model is limited, with only 6.4% of the variance explained. Figure 5 visualises the strength and direction of the effect of the six features. Note that the scale of the Y-axis is not constant across these plots. Plot A shows how NFR translations appeared to outperform baseline NMT translations when source sentences were longer and plot B confirms that high FM scores were associated with better NFR scores than the baseline.
The model also predicted low-scoring matches to lead to worse NFR translations than the baseline system. However, this was true only for the first FM: according to the model, a higher-scoring second FM led to worse NFR performance (plot C). There were three more positive effects: the difference in TER scores was more in favour of the NFR system when the lengths of both FMs were more alike (D), when there were more matched tokens in both FMs combined (E) and when the dependency-tree edit distance between both FMs was larger (F). The relevance of these findings is discussed in the next section.

Discussion
In the first part of the discussion, we summarise the findings of our study and point to potential implications (Section 6.1). The second part discusses the limitations (Section 6.2).

RQ1: Automated Quality Evaluation and Required Edits
Our first research question aimed to compare the quality of NFR and baseline NMT translations. To this end, we calculated scores on seven automated quality metrics and analysed the number and types of edit operations required to bring the MT output in line with a reference translation. The results are unequivocal. NFR translations were more similar to reference translations according to all quality metrics, regardless of whether they targeted lexical overlap, semantic similarity, or syntactic equivalence. The largest difference was found in terms of exact metrics (i.e., BLEU, TER and, to a lesser extent, chrF) and the smallest for COMET, which did not require exact lexical overlap. This seems to indicate that NFR was especially strong at staying close to reference translations in terms of lexical equivalence (i.e., producing words and/or tokens that were identical to the reference), which is one of the main aims of TM-MT integration methods. In terms of semantic equivalence (so accepting lexical variation), the difference with the baseline NMT system was slightly less pronounced, but still significant (as measured by BERTScore and METEOR). This was also the case for COMET, which has been shown to correlate well with human evaluations [53]. It is also interesting to note that we observed a difference in terms of syntactic similarity between the MT output and the reference translations. To our knowledge, dependency tree edit distance is not widely used as an MT evaluation metric, as it does not capture any semantic or lexical aspects, but it could be a useful addition to the repertoire of automatic measures of quality, since it targets a different (linguistically motivated) type of similarity between the MT output and the reference translation. In this regard, we noted that the correlations between the different evaluation metrics were moderate to very high, yet not high enough to discard any measures as being redundant. 
Our study also shows that, considering the clearly different focus of these measures, it can be worthwhile to include a wide selection of them when performing MT evaluations.
In addition, the current study confirms that the NFR system outperformed the FMs retrieved from the TM in all similarity ranges, also when using cosine similarity as the similarity measure instead of edit distance [3]. Even though the baseline NMT system used in this study also outperformed the TM output when all sentences in the test set were evaluated, this was not the case when we focused on the highest FM range, where TM matches obtained higher BLEU scores than the baseline NMT output (75.25 vs. 74.64). This observation illustrates why a baseline NMT system is often used as a back-off to a TM in CAT workflows [16,91] and why NFR, as a single system, could be a better alternative to this traditional TM-MT combination approach; even in the highest FM range, the NFR system outperformed the TM output by a large margin, at least in terms of BLEU scores (85.25 vs. 75.25).
Looking at the number of required edit operations, our analyses show that the NFR output not only necessitated fewer edits overall, but did so across all types of edits (insertions, deletions, substitutions and shifts) and words (content, function and other). This shows, among other things, that the differences between NFR output and baseline NMT output concerned all types of words, including words with high semantic substance and not only, for example, punctuation and/or function words.
Zooming in on the different types of edits, the two edit types with the largest reduction in the total number of required edits when comparing the NFR and the baseline NMT systems were shifts (−19.55% token shifts and −18.76% group shifts) and substitutions (−16.97%). Substitutions were the edit type with the largest absolute reduction (−3680) and, since these substitutions also involved a large proportion of content words, one of the strengths of NFR seemed to be its ability to make better lexical choices than the baseline NMT system. Whether this also means that the NFR system is better at making terminological choices needs to be investigated in a follow-up study explicitly targeting the translation of terms. On the other hand, the large reduction in terms of shifts (tokens and groups) for NFR output indicates that NFR was able to produce an output that was more similar to the reference translation in terms of word order.
Our analysis of edit operations per match range shows that the difference between the NFR and the baseline NMT system in terms of required edit operations was larger for higher FM ranges, affecting all types of edit operations. The reduction in edit operations was especially pronounced for substitutions and shifts in the highest FM range (i.e., >90%), which corresponded to the largest group of sentences in the test set; in total, 40.5% of all source sentences in the test set were augmented with an FM with a similarity score of 90% or higher. These findings expand on evidence indicating that the similarity between the input sentence and the sentence retrieved from the TM is crucial for NFR quality [3,4]. In terms of TER edits, it seems that the difference between NFR and baseline NMT translations became substantial only with matches scoring 70% or higher. While it has been shown that including low FMs is useful to boost NFR quality [3,4], it might be worthwhile to be more selective about which input sentences in the low FM ranges are augmented at inference.

RQ2: Error Profiles
Our second research question aimed to investigate the amount and types of translation errors made by the NFR and the baseline NMT systems and to analyse to what extent the error profiles of both systems differed. A first observation is that the overall number of errors made by both systems was relatively low, with around 60% of sentences in the test set not containing any errors.
When comparing the NFR and baseline systems, the manual error analysis did not confirm the results of the automated analyses. Whereas NFR clearly outperformed the baseline system according to the automated metrics, the manual error analysis showed that the NFR system made as many translation errors as the baseline; moreover, it made more accuracy errors, produced translations with more tokens corresponding to errors and yielded slightly more sentences containing errors than the baseline system. The detailed error classification also revealed different error profiles: while the NFR system produced more fluent translations, with better lexical choices and fewer coherence errors than the baseline system, it also made more grammatical errors. In addition, the NFR system fell short in terms of accuracy, making more errors in the categories "addition" and "semantically unrelated".
There are a number of potential explanations for this apparent discrepancy. First, the automatic and manual evaluation methods used in this study relied on different types of information to assess translation quality: the automatic evaluation compared the MT output to a reference translation, whereas the manual error annotation was performed by analysing the MT output and source sentence only. It can thus be argued that the two evaluation methods provide complementary information, yielding a more nuanced picture of the differences in translation quality. A second potential reason for the difference in results is that the sentence-level manual error analysis performed in this study, by definition, considered all deviations from the source text to be accuracy errors. This may be problematic, since sentence-level translations can contain apparent deviations from the source text (e.g., related to the use of cohesive devices or translation decisions affecting sentence boundaries) that do not constitute actual errors when analysed in context [2,72]. At the same time, the automatic evaluation metrics compared the MT output to reference translations taken from the TM. As these reference translations are taken from larger documents, they can also contain apparent deviations from the source text, potentially distorting the evaluations.
To investigate this issue further, we analysed the percentage of errors annotated in the NFR and baseline NMT output that were also found in the reference translations. The results, presented in Table 15, indicate that the errors annotated in both MT outputs, to a certain extent, also appeared in the reference translations. This applied to a much larger percentage of NFR errors than of baseline NMT errors: a total of 18.1% of all errors annotated for NFR also appeared in the reference translations, compared to only 8.5% for the baseline system. Most of these errors were accuracy errors (27), representing 21.1% of all accuracy errors annotated in this data set. These findings confirm that certain deviations from the source content and meaning, which can be considered translation errors according to a strict assessment method, were also present in reference translations. Table 16 shows an example of an addition that appeared both in the NFR output and the reference translation.

Table 16. Example of an addition found both in the NFR output and the reference translation.

Source (EN): Furthermore, the importers and the retailers will not be substantially affected
Baseline NMT (NL): Bovendien zullen de importeurs en detailhandelaren geen grote gevolgen ondervinden
NFR (NL): Bovendien zullen de importeurs en detailhandelaren geen ernstige gevolgen van de maatregelen ondervinden
Reference (NL): Bovendien zullen de importeurs en kleinhandelaren geen ernstige gevolgen van de maatregelen ondervinden

The phrase "van de maatregelen" ("by the measures") was not present in the source text, but was added to the NFR translation, and it also appeared in the reference translation. This addition most likely made sense in the context of the complete paragraph the sentence was taken from, but it was annotated as an error (accuracy → addition) in this study. No errors were annotated in the baseline NMT output, since detailhandelaren and kleinhandelaren are both correct translations of the English word retailers. Similarly, grote and ernstige are equally correct translations of the word substantially. When using automated evaluation metrics that rely on a comparison with the reference translation, and especially metrics that evaluate exact lexical overlap (such as BLEU and TER), the different lexical choices in the baseline translation cause the estimated translation quality to decrease. Whether such deviations should be considered real errors is open to debate and potentially also depends on the translation context. We argue that, in the context under investigation in this study, exact lexical choices are important, which would be an argument for counting different lexical choices as errors. Whatever the case may be, it is clear that this constitutes an important, potentially confounding factor in MT evaluation. We believe that ideal evaluation set-ups should include methods that use both the source text and reference translations for the purpose of assessing MT quality.
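The effect on exact-overlap metrics can be made concrete with a word-level edit distance computed for the Table 16 sentences. This is a simplification: TER additionally allows block shifts and normalises by reference length, but for this example (which involves no reordering) plain word-level Levenshtein distance gives the same edit counts.

```python
def word_edit_distance(hyp: str, ref: str) -> int:
    """Minimum number of word-level insertions, deletions and
    substitutions needed to turn hyp into ref (Levenshtein distance
    over tokens; TER's shift operation is omitted for simplicity)."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match/substitution
    return d[len(h)][len(r)]

ref = ("Bovendien zullen de importeurs en kleinhandelaren geen "
       "ernstige gevolgen van de maatregelen ondervinden")
baseline = ("Bovendien zullen de importeurs en detailhandelaren geen "
            "grote gevolgen ondervinden")
nfr = ("Bovendien zullen de importeurs en detailhandelaren geen "
       "ernstige gevolgen van de maatregelen ondervinden")

print(word_edit_distance(baseline, ref))  # 5 edits (2 substitutions, 3 insertions)
print(word_edit_distance(nfr, ref))       # 1 edit (1 substitution)
```

The baseline is charged five edits, even though its two "substitutions" are equally correct synonyms, while the NFR output is charged only one: exactly the reference-overlap penalty discussed above.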
Moreover, the analysed example demonstrates that sentence-level evaluations are inherently flawed in most translation contexts and that document-level, or more generally context-aware, evaluations are the preferred option [72].
Even though the fine-grained error analysis allowed us to reveal the different error profiles of both systems, it did not inform us about the severity of the errors, a potentially subjective notion in itself [92]. Error severity has been studied extensively in the context of post-editing effort, and there is some consensus in the literature as to which types of errors are the most challenging to post-edit and which take the least effort. Reordering and lexical-choice errors, as well as errors that lead to shifts in meaning compared to the source text, are among the most challenging error types [7,83,[93][94][95], whereas incorrect word forms [83], omissions and additions [7,93] and orthography errors [93] are reported to have the least impact on post-editing effort. Combining these findings with the error profiles observed in this study, it appears that most of the frequent NFR error types (i.e., addition, omission and word form) had a limited impact on post-editing effort. Moreover, the NFR system made fewer lexical-choice errors than the baseline NMT system, one of the error categories commonly reported to have a high impact on post-editing effort. However, this issue needs to be investigated further, for example, by carrying out an empirical study of the actual post-editing effort involved in correcting the errors made by the NFR and a baseline system.

RQ3: Impact of Fuzzy Matches
We analysed the impact FMs had on NFR output by comparing the proportion of match and no-match tokens that appeared in the NFR and baseline NMT outputs, as well as across different match ranges of the NFR output. More matched tokens from both the first and the second FM appeared in NFR output than in baseline output. While this difference was modest, it shows that the data augmentation method successfully allowed the NFR system to pick up a higher number of "relevant" tokens and thereby produce translations more similar to the references than those of the baseline system. This was confirmed by the higher proportion of matched tokens in NFR translations that also appeared in the reference translations, compared to the baseline system. The analysis also showed that higher-scoring FMs not only contained a higher percentage of matched tokens, but also that a higher proportion of these tokens appeared in the NFR output. In other words, the NFR system became more confident about using matching tokens in its translations when the FM was highly similar to the source sentence. Another observation is that NFR seemed to be more influenced by the first FM than by the second. Even though, in the current NFR configuration, the first FM always had a higher similarity score to the source sentence, it may be interesting to analyse whether this effect was intensified by the order in which FMs were concatenated to the source sentences.
At the same time, the NFR output also contained more no-match tokens than the baseline NMT output, and most of these tokens did not appear in the reference translation. This demonstrates that the data augmentation method, to a certain extent, also introduced FM-related noise into the translations. This phenomenon was observed in spite of the fact that matched and non-matched tokens were labelled with different features (see Section 4.2). However, our analyses also showed that a considerable proportion of FM tokens labelled as no-match did appear in the reference translations. In this context, it should be noted that the automatic alignment method used in the NFR approach is not fail-safe and that certain tokens may be relevant despite not being aligned to the source sentence. This would need to be analysed in a follow-up study in which the quality of alignments is explicitly evaluated and linked to the quality of (and/or errors in) NFR output.
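The token-level comparison described above can be sketched as follows. The match/no-match labels here are assigned by hand for illustration; in the actual NFR approach they come from automatic word alignment, and the example tokens are invented.

```python
def overlap_proportions(fm_tokens: list[tuple[str, str]],
                        output_tokens: set[str]) -> dict[str, float]:
    """For FM target tokens labelled 'match' (aligned to the source
    sentence) or 'no-match', compute the proportion of each group
    that also appears in the MT output."""
    counts = {"match": [0, 0], "no-match": [0, 0]}  # [in output, total]
    for token, label in fm_tokens:
        counts[label][1] += 1
        if token in output_tokens:
            counts[label][0] += 1
    return {label: in_out / total if total else 0.0
            for label, (in_out, total) in counts.items()}

# Illustrative fuzzy-match tokens with (invented) alignment labels
fm = [("ernstige", "match"), ("gevolgen", "match"),
      ("maatregelen", "no-match"), ("hierdoor", "no-match")]
output = {"ernstige", "gevolgen", "maatregelen"}
print(overlap_proportions(fm, output))  # {'match': 1.0, 'no-match': 0.5}
```

Running the same computation against the reference translation instead of the MT output gives the companion statistic discussed above: the share of (no-)match tokens that were actually "relevant".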

RQ4: Factors Influencing NFR Quality
With a view to further improving the NFR system, we tried to identify which features influenced NFR quality in comparison to the quality of baseline NMT translations. We identified seven potential features, six of which were retained in our final model. Not surprisingly, the similarity score of the first FM was confirmed to have a strong positive impact on NFR quality [3]. However, our model also predicted the similarity score of the second FM to influence NFR quality negatively. This could mean that it is indeed a good strategy not to always select the match with the second-highest score as the second FM, which is exactly what the maximum coverage mechanism, explained in Section 4.2, aims to achieve. The analysis also confirmed that a higher total number of matched tokens across both FMs was associated with a larger TER difference between NFR and baseline. Moreover, it seemed to be better to select two FMs that did not differ too much in length but that did differ in syntactic structure. Taken together, these results suggest that it could be worthwhile to model the joint selection of both FMs in NFR more explicitly, for example, by imposing additional restrictions on the syntactic similarity and length ratios between the FMs.
We are aware that the explanatory power of our linear model is limited, with only a small percentage of variance explained, but it is important to note that we modelled the difference in TER scores between NFR and baseline NMT translations, not the TER scores as such. The overall absolute TER difference between the two systems on the test set was only 4.41 points. Moreover, we do not expect to explain all of the variance in TER scores with the features included in the model, considering the inherent variation in single-sentence translations. In spite of this, we believe that the TER difference, while showing less variation, was a more meaningful variable to model than absolute TER scores for this particular analysis, since we were interested in the difference in quality between the two systems. At the same time, while we were aiming for explainability with this linear and additive model, we may have oversimplified the complex interrelationships between the features themselves, as well as between the features and the TER difference scores (which were non-linear in certain cases). Whatever the case may be, the potential merit of this analysis should be tested by integrating these features into the FM-selection component of the NFR system and evaluating the impact on translation quality.
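A single-feature sketch of this kind of linear model is given below, relating first-FM similarity score to the per-sentence TER difference between baseline and NFR output. The data points are fabricated for illustration (the actual model included six features and was fitted on the full test set); only the direction of the slope mirrors the finding above.

```python
def fit_simple_ols(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = a + b*x: the single-feature
    analogue of the linear model relating FM features to the TER
    difference between the two systems. Returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Fabricated data: first-FM similarity score vs. per-sentence
# TER difference (baseline TER minus NFR TER, in points)
scores = [0.55, 0.65, 0.75, 0.85, 0.95]
ter_diff = [0.5, 1.5, 3.0, 5.5, 8.0]
intercept, slope = fit_simple_ols(scores, ter_diff)
print(slope > 0)  # a positive slope: higher FM similarity, larger NFR gain
```

The full model simply extends this to a multivariate fit; the positive first-FM coefficient corresponds to `slope > 0` here.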

Limitations
In our discussion of this study's findings, we touched upon a number of potential limitations. In this section, we focus on some additional issues that should be made explicit. First, no MT evaluation method is without controversy and, potentially, inherent problems [1], and the methods used in this study are no different. To date, there is also no universally agreed-upon methodology for either of these approaches [54]. This being said, we feel that an empirical study involving a task-specific evaluation of post-editing productivity (i.e., temporal post-editing effort) would give us more insight into the potential practical benefits of NFR compared to a baseline NMT system. It should be noted that measuring temporal post-editing effort is argued to be most (if not only) meaningful when the measurements are carried out with highly qualified translators, using the tools of their preference and carrying out real post-editing tasks under realistic time constraints, all of which should be taken into account when setting up such a study [59,73].
Second, and related to this, the annotators in this study did not work for the European institutions, which may have led to problems with error identification, especially with respect to lexical choices. Third, we did not manually verify the reference translations in the test set; even though DGT-TM is well maintained, it cannot be excluded that it contains, for example, misaligned translation pairs or translations that are not entirely accurate [2]. Fourth, it is well established that relying on a single reference translation is problematic [1], although, in practice, it is often not possible to obtain alternative translations, especially when the data sets originate from TMs. Finally, the evaluations in this study concerned a single translation context and data set, as well as a single language pair and translation direction.

Conclusions
This study set out to provide a more thorough and multifaceted evaluation of the NFR approach to NMT data augmentation, which relies on the retrieval of fuzzy matches from a TM. The evaluations were carried out both automatically, by relying on a broad spectrum of automated quality metrics, and manually, by performing fine-grained human error annotation. All automated metrics, which compared the MT output to a reference translation, indicated higher scores for the NFR system than for the NMT baseline, confirming the significant improvements in estimated translation quality reported previously [4,5]. The detailed TER analysis showed that the strength of the NFR system lies in producing translations whose lexical choices and word order are more similar to the reference translations. It also showed that the reduction in the number of edits was balanced between content and function words.
The error analysis, which did not rely on reference translations, did not yield the same results, as both systems made a comparable number of errors. We did, however, find different error profiles for the two systems. While the NFR system produced more fluent translations, with a significant reduction in lexicon and coherence errors, it also diverged from the source content and meaning (i.e., reduced accuracy) more often than the baseline NMT system, making more additions and more mistranslations that were semantically unrelated to the source content. On the other hand, we observed that 21% of the annotated accuracy errors also appeared in the reference translations, which, at least in part, can be attributed to translation decisions taken in the context of document-level translation tasks. Taken together, the analyses indicate that NFR can improve translation quality compared to a baseline NMT system with regard to lexical choices and, more generally, in terms of what we label "exactness", that is, the ability to be consistent and to closely resemble (in terms of semantics, lexicon and syntax) a reference translation.
An additional aim of this study was to more closely analyse the impact the retrieved and added fuzzy matches had on the MT output, with a view to identifying features that could be used to further improve the NFR system. We found that the tokens in the fuzzy translations appeared more frequently in the NFR output than in baseline NMT translations. This was the case for both tokens that were aligned to tokens in the input sentence and those that were not. In both cases, we observed that more of these tokens were also present in the reference translations. We were also able to identify a number of features related to the length, number of aligned tokens and similarity of the fuzzy matches, including syntactic similarity, that could potentially be used to improve the selection of fuzzy matches in the context of NFR.
In future work, we hope to carry out an empirical study focused on measuring translators' post-editing effort when using NFR compared to a baseline NMT system in a CAT workflow. One potential added advantage of NFR in such a setting is the possibility to automatically annotate the MT output with information that is potentially useful for translators, such as indications of which tokens also appear in the best-scoring fuzzy match retrieved from the TM. Next to this empirical study, experiments should also be carried out aimed at exploring whether the features identified in this study can be employed to further improve the quality of NFR translations.

Funding: This research study was funded by Research Foundation-Flanders (FWO).
Institutional Review Board Statement: The study was approved by the Ethics Committee, Faculty of Arts and Philosophy, Ghent University (11-02-2021).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Publicly available data sets were analysed in this study. These data can be found at https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory (accessed on 15 August 2020).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

CAT   computer-assisted translation
FM    fuzzy match
NFR   neural fuzzy repair
NMT   neural machine translation
MT    machine translation
TM    translation memory