An Evaluation of Persian-English Machine Translation Datasets with Transformers

Amir Sartipi, University of Isfahan, amirsartipi.msc@eng.ui.ac.ir
Meghdad Dehghan, University of Isfahan, meghdadd78@gmail.com
Afsaneh Fatemi, University of Isfahan, a_fatemi@eng.ui.ac.ir

arXiv:2302.00321v1 [cs.CL] 1 Feb 2023

Abstract

Machine translation (MT) currently attracts a great deal of research attention. However, Persian machine translation has remained largely unexplored, despite the vast amount of research conducted on high-resource languages such as English. Moreover, while a substantial amount of research has been undertaken in statistical machine translation for some Persian datasets, there is currently no standard baseline for transformer-based text2text models on each corpus. This study collected and analysed the most popular and valuable parallel corpora used for Persian-English translation. Furthermore, we fine-tuned and evaluated two state-of-the-art attention-based seq2seq models on each dataset separately (48 results). We hope this paper will help researchers compare their Persian-to-English and English-to-Persian machine translation results against a standard baseline.

1 Introduction

The primary purpose of machine translation is to translate text from one language to another. Previously, statistical models were considered the frontier of this task (Brown et al., 1993; Koehn, 2009; Lopez, 2008). However, because of the vast amount of data currently available, neural machine translation (Bahdanau et al., 2015; Kalchbrenner and Blunsom, 2013; Wu et al., 2016; Cho et al., 2014) is now surpassing statistical approaches. Vaswani et al. (2017) then proposed a new, simple network architecture based solely on attention as an alternative to the dominant sequence transduction models based on recurrent and convolutional neural networks.
The encoder part of the transformer architecture has been widely used, for example by Devlin et al. (2019) and Liu et al. (2020b), whose models were pre-trained on large amounts of unlabeled text. Raffel et al. (2020) examined the landscape of transfer learning strategies for NLP, which contributed to the emergence of transfer learning as a potent technique in NLP. Their work presents T5, a system that casts all language tasks into a text-to-text format. mT5 is a multilingual variant of the T5 model that has been pre-trained on a new Common Crawl-based dataset covering 101 languages (Xue et al., 2021). In order to combat overfitting while training on thousands of tasks, Costa-jussà et al. (2022) proposed multiple architectural and training improvements. They used a human-translated benchmark, Flores-200, to evaluate the performance of over 40,000 different translation directions. Compared to the previous state-of-the-art seq2seq models, their model achieved a 44% improvement in BLEU score. Both models (Google T5 and Meta NLLB) use the transformer architecture with some changes and improvements in the encoder or the decoder part. The transformer architecture is shown in Figure 1.

Figure 1: Transformer model architecture

The purpose of this paper can be summarized as follows:

1. We review statistical and neural machine translation systems and related datasets.
2. We release all experimental results, including the last model checkpoint, the best model checkpoint, model predictions, the history of the training and development phases, and execution times. They are publicly available on Hugging Face¹, and the code is available in the GitHub² repository.
3. We establish baselines for the Persian-English machine translation task against which future research can be compared.
4. We investigate the influence of the number of instances on the BLEU score.

The rest of the article is structured as follows. In Section 2 we summarize prior approaches to Persian-English machine translation.
Section 3 describes the most popular corpora used for the experiments and provides a detailed analysis of their statistics. Section 4 presents an extensive set of experiments with language models for each dataset, conducted in both directions. Section 5 discusses the challenges of the study and analyses the models' predictions. Finally, Section 6 presents the conclusions of the study.

¹ https://huggingface.co/
² https://github.com/

2 Related Work

Several previous studies on translation between English and Arabic (Nagoudi et al., 2022), French (Tian et al., 2022; Dione et al., 2022; Liu et al., 2020a), and Russian (Yu, 2019; Littell et al., 2019) use transformers as the basic architecture and report results. However, there are some Persian-English datasets without any results on language models. In this section we review previous work on Persian-English machine translation. First we consider the statistical and neural approaches to machine translation and introduce recent work in these domains. Then we review the efforts that introduced parallel corpora for Persian-English machine translation.

Baselines on SMT systems. Results for a Persian-English SMT system were first obtained in PersianSMT (Pilevar and Faili, 2010). The authors used a phrase-based SMT system and obtained results with the movie subtitle domain as the main resource of their parallel corpus. In addition, Bakhshaei et al. (2010) obtained results for a phrase-based Persian-English SMT system. They tested different values of the SMT system parameters and compared the results for each parameter value. Mohaghegh and Sarrafzadeh (2010) and Mohaghegh et al. (2010) reported results for an SMT system with language model corpora of different sizes. They concluded that training SMT systems with larger corpora led to better results. Mohaghegh et al. (2011) created a combined parallel corpus called NSPEC and obtained better results for their SMT system than in their previous work. Pilevar (2011) created an RBSMT system followed by statistical editing and reported results for it. The new approach outperformed the existing RBSMT systems, yet SMT systems were still more effective. Mohaghegh (2012) compared a hierarchical SMT system (Joshua) with a classical one (Moses). Results were obtained for both directions; however, the hierarchical system produced better results only in the English-to-Persian direction. Jabbari et al. (2012) created a new corpus on which SMT systems outperformed previous results. Mansouri and Faili (2012) compared several SMT systems and also used a maxent classifier to refine the existing state-of-the-art SMT system. Rasooli et al. (2013) showed that segmenting Persian verbs is effective and improves the BLEU score. Passban et al. (2015) improved the existing TEP corpus and created TEP++. They also obtained results on their new corpus and compared them to other corpora such as TEP and Mizan. The findings of their study surpassed previous results on both the TEP and Mizan corpora.
In their study, Kashefi (2018) calculated BLEU scores for an SMT system on the corpus they introduced (Mizan), reporting results for both in-domain and out-of-domain test sets.

Baselines on NMT systems. Several attempts have been made to propose baselines for Persian-English machine translation using neural machine translation systems. Bastan et al. (2017) studied the two tasks of translation and transliteration with a neural machine translation (NMT) system. They used RNNs with different numbers of layers in the NMT architecture, and they improved the results by changing the cost function and preprocessing the Persian corpus. Zaremoodi et al. (2018) and Zaremoodi and Haffari (2018) demonstrated that, compared with existing NMT systems, a multi-task learning approach improves machine translation results for low-resource languages like Persian. ParsiNLU used a neural language model for the first time for machine translation between Persian and English (Khashabi et al., 2021). The authors fine-tuned four variations of the Google mT5 text2text model on a part of a benchmark that they created. The training dataset used for fine-tuning was integrated from four corpora for generalisation purposes.

3 Datasets

The vast majority of research and benchmarks on the machine translation task have used the WMT dataset (Bojar et al., 2014). There are also datasets such as OPUS-100 (Zhang et al., 2020) and OpenSubtitles (Lison and Tiedemann, 2016), which contain 100 and 60 languages respectively and are used for machine translation in other languages. For the Persian-English language pair, we collected nine datasets to fine-tune neural seq2seq models on and to obtain results for each of them. Moreover, ParsiNLU is a set of language understanding tasks, including machine translation, for the Persian language (Khashabi et al., 2021). In the machine translation part of their work, the authors created a large parallel corpus integrated from several corpora. The training dataset includes four domains: the questions from their question paraphrasing task, the Mizan corpus, the TEP corpus and the Global Voices corpus. The training dataset contains almost 1.6M entries. The evaluation set consists of the Quran, Mizan, Bible and QPP datasets and contains about 47k sentences. Each collected dataset and its main attributes are introduced below.

Figure 2: Examples of English (top) and Persian (bottom) side instances for each dataset

Table 1: General statistics for datasets (tokens per sentence: avg, min, max and the 92% coverage length; all = total tokens; unique = unique tokens)

Persian side:

| Dataset | avg | min | max | 92% | all | unique |
|---|---|---|---|---|---|---|
| Mizan | 13 | 1 | 232 | 26 | 13,464,236 | 131,751 |
| Bible | 28 | 3 | 124 | 48 | 1,796,084 | 18,166 |
| Quran | 29 | 1 | 373 | 61 | 30,235,077 | 28,380 |
| PEPC Bidirectional | 20 | 7 | 178 | 35 | 4,163,011 | 169,637 |
| PEPC One Directional | 22 | 7 | 178 | 37 | 3,539,183 | 158,707 |
| TEP | 8 | 1 | 37 | 14 | 716,113 | 22,710 |
| TEP++ | 7 | 1 | 34 | 13 | 4,445,543 | 92,037 |
| OPUS-100 | 10 | 1 | 1,487 | 21 | 10,284,744 | 155,874 |

English side:

| Dataset | avg | min | max | 92% | all | unique |
|---|---|---|---|---|---|---|
| Mizan | 13 | 0 | 226 | 26 | 13,360,397 | 259,182 |
| Bible | 23 | 2 | 100 | 38 | 1,428,716 | 40,202 |
| Quran | 33 | 1 | 772 | 74 | 34,227,828 | 92,976 |
| PEPC Bidirectional | 21 | 7 | 153 | 36 | 4,354,619 | 142,792 |
| PEPC One Directional | 21 | 7 | 153 | 36 | 3,359,635 | 138,489 |
| TEP | 7 | 1 | 33 | 14 | 684,242 | 36,634 |
| TEP++ | 8 | 0 | 32 | 14 | 4,720,821 | 57,753 |
| OPUS-100 | 9 | 1 | 839 | 20 | 9,524,220 | 342,979 |

Table 2: The number of instances in train/dev/test

| Dataset | train | dev | test | all |
|---|---|---|---|---|
| Mizan | 1,006,430 | 5,000 | 10,166 | 1,021,596 |
| Bible | 51,329 | 5,000 | 5,704 | 62,033 |
| Quran | 1,013,756 | 5,000 | 10,240 | 1,028,996 |
| PEPC Bidirectional | 175,442 | 5,000 | 19,494 | 199,936 |
| PEPC One Directional | 138,005 | 5,000 | 15,334 | 158,339 |
| TEP | 72,748 | 5,000 | 8,084 | 85,832 |
| TEP++ | 515,925 | 5,000 | 57,326 | 578,251 |
| OPUS-100 | 1,000,000 | 2,000 | 2,000 | 1,004,000 |

Quran. The Quran is primarily an Arabic book which has been translated into many languages. Tiedemann (2012) proposed the Tanzil dataset from the Tanzil project as a part of the OPUS project. This dataset contains 42 languages.
The Persian-English language pair of this dataset contains almost 1M sentence pairs and 57.02M words.

Bible. The Bible is another religious book which has been translated into many languages. As a part of the OPUS project, the Bible dataset was released in 100 languages (Tiedemann, 2012). The Persian-English language pair of this dataset contains almost 62,000 sentence pairs and 2.89M words.

PEPC. PEPC is another Persian-English parallel corpus, obtained from Wikipedia documents (Karimi et al., 2018). The authors used bidirectional and one-directional methods to extract documents from Wikipedia, so they proposed two versions of the dataset based on the extraction method. The bidirectional PEPC dataset contains nearly 200,000 sentence pairs, and the one-directional PEPC dataset contains nearly 160,000 sentence pairs.

TEP. TEP (Tehran English Persian) is another parallel corpus, made from movie subtitles. Almost 21,000 subtitle files were collected from OpenSubtitles, and only 1,200 subtitle file pairs remained after removing duplicate files. The final dataset contains over 550,000 lines of text (Pilevar et al., 2011).

TEP++. A refined version of the TEP corpus named TEP++ was introduced by Passban et al. (2015). They reported that the TEP corpus was noisy, and they tried to fix this problem in the new corpus. They also obtained better results for an SMT system by using the TEP++ corpus. This corpus has nearly 570,000 aligned sentences and nearly 5M tokens for each of Persian and English.

Mizan. Mizan was the largest Persian corpus at the time it was released. It was created from literary masterpieces. It contains more than one million sentence pairs and over 23M words for both Persian and English (Kashefi, 2018).

OPUS-100. OPUS-100 is a concatenation of movie subtitles, GNOME documentation, and Bible datasets that contains 100 languages and 99 language pairs, all of which use English as a source or target language (Zhang et al., 2020).

We randomly selected an instance from each corpus; these are illustrated in Figure 2. In these examples, the OPUS-100 instance places a capitalized "We" and "Us" in the middle of a sentence, the Persian subtitle contains a spelling mistake, and the word-by-word translation is not perfectly aligned with the meaning. Some sentences are enclosed in quotation marks or start with lowercase letters in English. These features of the datasets could affect the evaluation results.

We used Spark NLP (Kocaman and Talby, 2021) to compute general statistics about the datasets. With this information, parameters such as sequence lengths can be selected more precisely. The max column in Table 1 gives the maximum number of tokens found in a sentence. Because each dataset contains a few long sequences that can be treated as outliers and simply truncated to a tighter length, this number may not be a good choice. Therefore, for each dataset, we calculated a length that covers 92 percent of the dataset. In other words, 92% of sentences have at most this number of tokens. This number is much lower than the maximum number of tokens per sentence. In addition, the table contains the average and the minimum number of tokens per sentence, as well as the total number of tokens and the total number of unique tokens, for both the Persian and English sides.

Figure 3: Token distribution per sentence for the Bible dataset: (a) English side, (b) Persian side

4 Experiments

To build our network, we used PyTorch (Paszke et al., 2019) and the Transformers library from Hugging Face (Wolf et al., 2020) as implementation tools.
Table 3: Evaluation of English to Persian (EN-FA) and Persian to English (FA-EN) on the language models

| Direction | Model | Mizan | Bible | Quran | PEPC Bidirectional | PEPC One Directional | TEP | TEP++ | OPUS-100 |
|---|---|---|---|---|---|---|---|---|---|
| EN-FA | mt5-small | 12.22 | 13.93 | 4.79 | 7.10 | 5.37 | 11.70 | 21.02 | 10.81 |
| EN-FA | mt5-base | 12.69 | 22.06 | 4.97 | 7.21 | 5.71 | 14.11 | 23.09 | 10.46 |
| EN-FA | nllb-distilled | 15.00 | 69.78 | 18.10 | 13.13 | 13.20 | 16.06 | 26.44 | 11.62 |
| FA-EN | mt5-small | 16.29 | 16.28 | 10.39 | 10.28 | 8.82 | 13.63 | 30.14 | 20.66 |
| FA-EN | mt5-base | 16.70 | 18.83 | 10.04 | 10.22 | 9.85 | 23.64 | 31.63 | 20.91 |
| FA-EN | nllb-distilled | 18.05 | 49.93 | 27.65 | 17.01 | 16.84 | 26.74 | 35.98 | 24.16 |

Datasets' splits. Table 2 gives the total number of instances and the train/dev/test splits of each dataset. We used the predefined data splits for the OPUS-100 dataset. We split the other datasets into train/dev/test ourselves. First we shuffled all instances of each dataset to randomize their order. Then, for the datasets with more than one million instances, we chose 1% of all instances for the test split, 5,000 instances for the dev split, and the remaining instances for the train split.

Hyper-parameters. Khashabi et al. (2021) use a learning rate (lr) of 1e-3 for the fine-tuning phase. We used the same learning rate and fine-tuned the models for 7 epochs with the AdamW optimizer (Loshchilov and Hutter, 2019). To select the sequence length for the training phase, we considered the sequence length that covers 92% of each dataset. In addition, we plotted the number of sentences against the number of tokens per sentence, which allowed us to select a reasonable sequence length. Figure 3 shows an example of such a plot for the Bible dataset.

Models. One of the seq2seq models we used is mT5, which has embeddings for the Persian language. The other text2text model is NLLB, which beats previous cutting-edge models. Because of the huge number of parameters and the amount of computation power such models need, we fine-tuned only two Google mT5 variants (mT5-small and mT5-base) and one Facebook NLLB model (distilled NLLB) on the datasets.
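The split procedure and the 92% length rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the function names `split_dataset` and `coverage_length`, the default seed, and the order of operations (dev split first, then a rounded-up fraction of the remainder as test split) are all hypothetical. That ordering was chosen only because it reproduces the Mizan and Quran rows of Table 2.

```python
import math
import random


def split_dataset(pairs, test_fraction=0.01, dev_size=5000, seed=0):
    """Shuffle sentence pairs, then carve out dev, test and train splits.

    Assumption: the dev split is taken first and the test split is a
    rounded-up fraction of what remains; the paper does not spell this out.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # randomize instance order
    dev = shuffled[:dev_size]
    remainder = shuffled[dev_size:]
    n_test = math.ceil(len(remainder) * test_fraction)
    test = remainder[:n_test]
    train = remainder[n_test:]
    return train, dev, test


def coverage_length(token_counts, coverage=0.92):
    """Smallest length L such that at least `coverage` of sentences have <= L tokens."""
    ordered = sorted(token_counts)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]
```

With `test_fraction=0.01` and `dev_size=5000`, a corpus of 1,021,596 pairs is divided into 1,006,430 / 5,000 / 10,166 train/dev/test instances, which matches the Mizan row of Table 2. The value returned by `coverage_length` would then serve as the maximum sequence length in place of the outlier-dominated maximum.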
The main attributes of the mT5 and NLLB models are summarized below.

• Google mT5: The Google T5 model is a text-to-text transformer-based language model, meaning that both the input and the output of the model are text. It can be used for different tasks such as question answering, machine translation, and text classification. The mT5 version of this model is pre-trained on the multilingual mC4 data, which contains 101 languages including Persian. mT5-small is the smallest version, with only 300 million parameters. mT5-base is the second smallest, with 580 million parameters. The largest version of this model has about 13 billion parameters.

• Meta NLLB: The NLLB model, the state-of-the-art text2text model at the time, was proposed with the aim of improving machine translation performance for low-resource languages. It supports embeddings for almost 200 languages. This model also uses a transformer-based architecture and comes in two types: Dense and MoE. The Dense type activates all model parameters for each input sequence, while the MoE type activates only a subset of parameters for each input. The NLLB model has 5 variants with regard to the number of parameters. The smallest model has only 600 million parameters and is a Dense model, while the largest model, which is an MoE model, has about 54.5 billion parameters.

Evaluation metric. The BLEU score (Papineni et al., 2002) has been the most common metric for evaluating machine translation results for many years. This metric combines N-gram precision for different N-gram sizes with a sentence brevity penalty. Because BLEU can be configured in many ways, the results of baselines reported by different researchers cannot be compared reliably. For example, many studies do not report the maximum N-gram size or the tokenization method.
The sacreBLEU metric was proposed by Post (2018) to tackle some of these problems and establish a standard metric that is comparable across studies.

Training process. We fine-tuned one translation direction per experiment, although a model can in principle be fine-tuned on both directions simultaneously. The model was evaluated at the end of each epoch during the training phase. The best models were selected based on the value of the evaluation metric on the development dataset. It is important to preprocess data before training the models, but we did not do so since we wanted to establish baselines for these datasets. MT systems can be improved by applying data-cleaning approaches to a dataset.

Hardware. Our Google models were fine-tuned with float32 using TITAN RTX and RTX 3090 Ti GPUs. We used an NVIDIA V100 GPU for the Meta model since it requires a higher level of computation power. The latter was fine-tuned using a PyTorch feature known as automatic mixed precision, which uses float16 rather than float32 where possible and thereby reduced GPU consumption and execution time.

Figure 4: The highest values of BLEU scores according to the datasets' size: (a) English to Persian direction, (b) Persian to English direction

Results. Our fine-tuned models were evaluated using SacreBLEU as the evaluation metric. Because of limited computation power, the maximum sequence length of the predicted sentences was smaller than that of the test data, so the real test data could not be compared precisely with the predicted instances. To resolve this issue, we truncated test instances that exceeded the maximum sequence length of the predicted sentences before calculating the score. Table 3 shows the value of SacreBLEU with N-gram = 3. The value of N is an important factor in determining the final BLEU score. The metric treats N-grams as contiguous sequences of N items from a given text sample.
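The effect of the maximum N-gram order on the score can be illustrated with a simplified corpus-level BLEU. The sketch below is not the SacreBLEU implementation used in the experiments: it assumes whitespace tokenization, a single reference per sentence and no smoothing, and the names `ngram_counts`, `corpus_bleu` and `max_order` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_order=4):
    """Simplified corpus-level BLEU (0-100) with a configurable maximum n-gram order.

    Assumptions: whitespace tokenization, one reference per hypothesis,
    no smoothing (any order with zero matches gives a score of 0).
    """
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions for orders 1..max_order.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Calling `corpus_bleu` on the same predictions with `max_order=3` and `max_order=4` typically gives a lower score for the larger order, because longer N-grams match less often. This is why a score is hard to interpret unless the order is reported with it.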
To avoid ambiguity and make the results comparable with future research, we report the BLEU measure for N in {3, 4, 5, 6, 7}. Figure 5 illustrates the relationship between N-grams and scores for the three models, in order to compare their performance and determine the impact of N-grams on it. As expected, the results for larger N-grams are lower than those for smaller ones. On all of the datasets, the Meta NLLB model outperformed both variants of the Google mT5 model.

Model evaluation. Figure 7 shows detailed information about the experiments: training and validation perplexities, and development BLEU scores during training. Training perplexities decreased dramatically from epoch one to epoch two and then declined gradually until epoch seven. Validation perplexities decreased even more rapidly from epoch one to epoch two and declined gradually after that. In some models, this value starts to rise, and the models overfit. Perplexity values in this phase are very large at the beginning, but they drop after one epoch. To show how BLEU scores change during the training phase and to understand the models' performance on each dataset separately, we calculated this value on the development sets per epoch. Most models experience a steady increase, and then tend to decrease or remain flat at this value.
However, in three experiments (PEPC Bidirectional with mt5-small-fa-en and mt5-base-fa-en, and PEPC One Directional with mt5-base-en-fa), the evaluation metric dipped at epoch 2 and recovered quickly.

Figure 5: BLEU score results for different N-grams, separated by translation direction (left side English to Persian, right side Persian to English) and model. Panels: (a) Mizan, (b) Bible, (c) Quran, (d) PEPC Bidirectional, (e) PEPC One Directional, (f) TEP, (g) TEP++, (h) OPUS-100

Figure 6: The impact of the number of training instances on the evaluation dataset for translating Persian to English on the mT5 small model

5 Discussion

This section provides some insights into the outcomes of the experiments. Additionally, we discuss the quality of the evaluated datasets in terms of the number of instances. Figure 4 shows the maximum BLEU score for each dataset as a function of the dataset's size in both directions, which allows a better comparison of the results. Generally, datasets with more than one million instances, such as Quran, OPUS-100, and Mizan, received lower or almost the same BLEU scores compared to smaller datasets such as Bible, TEP, TEP++, and the PEPC variants. The TEP++ dataset achieved a higher score than TEP, suggesting that refining noisy instances and increasing the number of instances had a positive impact on the results. In contrast, the PEPC dataset variants did not show significant differences between their scores. Although the Bible is the smallest dataset in terms of the total number of instances, it achieved the highest score of all in both translation directions.
Another point worth mentioning is that the average sequence length of instances in this dataset is the second largest after the Quran's, yet the scores for the Quran are much lower.

Quality or Quantity? According to the traditional view of how to improve machine translation results, increasing the size of the training data is expected to increase the BLEU score. However, this study indicates that datasets with a higher number of instances tend to achieve lower BLEU scores than datasets with a lower number of instances. Consequently, the quality of the data used for the fine-tuning phase could be more critical than the number of instances. Regarding quality, spelling mistakes, misaligned translations, punctuation errors, and incorrect word order on the source or target side could change the meaning and have a negative effect on the final evaluation value. Three datasets with more than one million instances were tested to demonstrate how the number of training samples affects the evaluation metric. From those datasets, we sampled 40k and 80k instances and fine-tuned the Google mT5 small model. Based on this experiment, Figure 6 shows that, within a given dataset, increasing the number of instances yields better results.

Translation Direction. The general order of the obtained BLEU scores is almost identical in both directions. There are a few factors we should take into account. The Bible dataset produced the highest BLEU score in both directions. However, the OPUS-100 dataset had the lowest BLEU score in the English-to-Persian direction, and the one-directional PEPC dataset had the lowest BLEU score in the Persian-to-English direction. Although almost all datasets performed better in Persian-to-English translation, the Bible dataset performed significantly better in English-to-Persian translation, with a BLEU score nearly 20 points higher (69.78 versus 49.93 for NLLB in Table 3).
In Persian-to-English translation, the OPUS-100 dataset performs significantly better than the Mizan dataset, while in the opposite direction, the Mizan dataset shows better performance.

Figure 7: BLEU scores, training perplexities, and validation perplexities for each dataset. Panels: (a) Mizan, (b) Bible, (c) Quran, (d) PEPC Bidirectional, (e) PEPC One Directional, (f) TEP, (g) TEP++, (h) OPUS-100

6 Conclusion

In this study, we reviewed the majority of Persian-English parallel corpora and established standard baselines for eight datasets. The datasets were evaluated using two multilingual seq2seq models based on the transformer architecture. Our analysis of 48 experiments indicates that the Bible and PEPC datasets have the highest and lowest BLEU scores, respectively. Additionally, we conclude that Meta's basic variant outperforms previous transformer-based approaches by a significant margin. The findings also indicate that in most experiments, the evaluation metric for translation from Persian to English is higher than the evaluation metric for translation from English to Persian. To the best of our knowledge, this is the first study that presents baselines for each dataset separately using seq2seq models. We hope that this research will help researchers compare their methods with the baselines and evaluate them specifically for the Persian language.

Acknowledgements

This work has been supported by the Simorgh Supercomputer - Amirkabir University of Technology under Contract No ISI-DCE-DOD-Cloud-9008081700.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Somayeh Bakhshaei, Shahram Khadivi, Noushin Riahi, and Hossein Sameti. 2010.
A study to find influential parameters on a farsi-english statistical machine translation system. In 2010 5th International Symposium on Telecommunications, pages 985–991. Mohaddeseh Bastan, Shahram Khadivi, and Mohammad Mehdi Homayounpour. 2017. Neural machine translation on scarce-resource condition: A casestudy on persian-english. In 2017 Iranian Conference on Electrical Engineering (ICEE), pages 1485– 1490. Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263– 311. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics. Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv eprints:2207.04672v3, pages arXiv–2207. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. Cheikh M. Bamba Dione, Alla Lo, Elhadji Mamadou Nguer, and Sileye Ba. 2022. Low-resource neural machine translation: Benchmarking state-of-the-art transformer for Wolof<->French. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6654–6661, Marseille, France. European Language Resources Association. Fattaneh Jabbari, Somayeh Bakshaei, Seyyed Mohammad Mohammadzadeh Ziabary, and Shahram Developing an open-domain Khadivi. 2012. English-Farsi translation system using AFEC: Amirkabir bilingual Farsi-English corpus. In Fourth Workshop on Computational Approaches to ArabicScript-based Languages, pages 17–23, San Diego, California, USA. Association for Machine Translation in the Americas. Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics. Akbar Karimi, Ebrahim Ansari, and Bahram Sadeghi Bigham. 2018. Extracting an EnglishPersian parallel corpus from comparable corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA). Mizan: Omid Kashefi. 2018. english parallel corpus. arXiv:1801.02107v3. 
A large persianarXiv preprint Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabagdi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, and Yadollah Yaghoobzadeh. 2021. ParsiNLU: A suite of language understanding challenges for Persian. Transactions of the Association for Computational Linguistics, 9:1147–1162. Veysel Kocaman and David Talby. 2021. Spark nlp: Natural language understanding at scale. Software Impacts, page 100058. Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press. Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and Tom Moir. 2011. Improving Persian-English statistical machine translation:experiments in domain adaptation. In Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP), pages 9–15, Chiang Mai, Thailand. Asian Federation of Natural Language Processing. Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA). El Moatez Billah Nagoudi, AbdelRahim Elmadany, and Muhammad Abdul-Mageed. 2022. TURJUMAN: A public toolkit for neural Arabic machine translation. In Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and FineGrained Hate Speech Detection, pages 1–11, Marseille, France. European Language Resources Association. Patrick Littell, Chi-kiu Lo, Samuel Larkin, and Darlene Stewart. 2019. Multi-source transformer for Kazakh-Russian-English neural machine translation. 
In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 267–274, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng Gao. 2020a. Very deep transformers for neural machine translation. ArXiv, abs/2008.07772.

Peyman Passban, Andy Way, and Qun Liu. 2015. Benchmarking SMT performance for Farsi using the TEP++ corpus. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey. European Association for Machine Translation.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020b. RoBERTa: A robustly optimized BERT pretraining approach.

Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys (CSUR), 40(3):1–49.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Amin Mansouri and Heshaam Faili. 2012. State-of-the-art english to persian statistical machine translation system. In The 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP 2012), pages 174–179.

Mahsa Mohaghegh. 2012. Advancements in english-persian hierarchical statistical machine translation. In NZCSRSC New Zealand Computer Science Research Student Conference.

Mahsa Mohaghegh and Abdolhossein Sarrafzadeh. 2010. Performance evaluation of various training data in english-persian statistical machine translation. In 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy.
Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and Tom Moir. 2010. Improved language modeling for english-persian statistical machine translation. In Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation, pages 75–82.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Abdol Hamid Pilevar. 2011. Using statistical post-editing to improve the output of rule-based machine translation system. International Journal of Computer Science and Communication, 330:330–000.

Mohammad Taher Pilevar and Heshaam Faili. 2010. Persiansmt: A first attempt to english-persian statistical machine translation. In JADT, volume 2010, page 10th.

Mohammad Taher Pilevar, Heshaam Faili, and Abdol Hamid Pilevar. 2011. Tep: Tehran english-persian parallel corpus. In Computational Linguistics and Intelligent Text Processing, pages 68–79, Berlin, Heidelberg. Springer Berlin Heidelberg.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Mohammad Sadegh Rasooli, Ahmed El Kholy, and Nizar Habash. 2013.
Orthographic and morphological processing for Persian-to-English statistical machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1047–1051, Nagoya, Japan. Asian Federation of Natural Language Processing.

Taoling Tian, Chai Song, Jin Ting, and Hongyang Huang. 2022. A french-to-english machine translation model using transformer network. Procedia Computer Science, 199:1438–1443. The 8th International Conference on Information Technology and Quantitative Management (ITQM 2020 2021): Developing Global Digital Economy after COVID-19.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Doron Yu. 2019. The en-Ru two-way integrated machine translation system based on transformer. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 434–439, Florence, Italy. Association for Computational Linguistics.

Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multitask learning: Improving low-resource neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 656–661, Melbourne, Australia. Association for Computational Linguistics.

Poorya Zaremoodi and Gholamreza Haffari. 2018. Neural machine translation for bilingually scarce scenarios: a deep multi-task learning approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1356–1365, New Orleans, Louisiana. Association for Computational Linguistics.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online. Association for Computational Linguistics.