An Evaluation of Persian-English Machine Translation Datasets with
Transformers
Amir Sartipi
University of Isfahan
amirsartipi.msc@eng.ui.ac.ir
Meghdad Dehghan
University of Isfahan
meghdadd78@gmail.com
arXiv:2302.00321v1 [cs.CL] 1 Feb 2023
Afsaneh Fatemi
University of Isfahan
a_fatemi@eng.ui.ac.ir
Abstract
Nowadays, many researchers are focusing their attention on machine translation (MT).
However, Persian machine translation has remained largely unexplored, despite the vast
amount of research conducted on high-resource languages such as English.
Moreover, while a substantial amount of research has been undertaken on statistical
machine translation for some Persian datasets, there is currently no standard baseline
for transformer-based text2text models on each corpus. This study collects and analyses
the most popular and valuable parallel corpora used for Persian-English translation.
Furthermore, we fine-tune and evaluate two state-of-the-art attention-based seq2seq
models on each dataset separately (48 results). We hope this paper will help researchers
compare their Persian-to-English and English-to-Persian machine translation results
against a standard baseline.
1 Introduction
The primary purpose of machine translation is to translate text from one language to
another. Statistical models were previously considered the frontier of this task
(Brown et al., 1993; Koehn, 2009; Lopez, 2008). However, because of the vast amount of
data currently available, neural machine translation (Bahdanau et al., 2015;
Kalchbrenner and Blunsom, 2013; Wu et al., 2016; Cho et al., 2014) now surpasses
statistical approaches. Subsequently, Vaswani et al. (2017) proposed a new, simple
network architecture based solely on attention as an alternative to the dominant
sequence transduction models based on recurrent and convolutional neural networks.
The encoder part of the transformer architecture has been widely used, for example by
Devlin et al. (2019) and Liu et al. (2020b), whose models are pre-trained on large
amounts of unlabeled text. Raffel et al. (2020) examined the landscape of transfer
learning strategies for NLP, contributing to the emergence of transfer learning as a
potent technique in NLP. Their work presents T5, a system that casts all language tasks
into a text-to-text format. mT5 is a multilingual variant of the T5 model that has been
pre-trained on a new Common Crawl-based dataset covering 101 languages (Xue et al.,
2021). In order to combat overfitting while training on thousands of tasks,
Costa-jussà et al. (2022) proposed multiple architectural and training improvements.
They used a human-translated benchmark, Flores-200, to evaluate performance on over
40,000 translation directions. Compared to the previous state-of-the-art seq2seq
models, their model achieved a 44% improvement in BLEU score. Both of these models
(Google T5 and Meta NLLB) use the transformer architecture with some changes and
improvements in the encoder or the decoder.
The transformer architecture is shown in Figure 1.
Figure 1: Transformer model architecture
The contributions of this paper can be summarized as follows:
1. We review statistical and neural machine translation systems and related datasets.
2. We release all experimental results, including the last model checkpoint, the best
model checkpoint, model predictions, the training and development history, and
execution times, on Hugging Face1; the code is available in a GitHub2 repository.
3. We establish baselines for the Persian-English machine translation task against
which future research can be compared.
4. We investigate the influence of the number of instances on the BLEU score.
The rest of the article is structured as follows. In Section 2 we summarize prior
approaches to Persian-English machine translation. Section 3 describes the most popular
corpora, which are used in our experiments, and provides a detailed analysis of their
statistics. Section 4 presents an extensive set of experiments with language models on
each dataset, conducted in both directions. In Section 5, the challenges of the study
are discussed and an analysis of the models’ predictions is provided. Finally,
Section 6 presents the conclusions of the study.
2 Related Work
As far as previous research is concerned, several studies have been conducted for
English to Arabic (Nagoudi et al., 2022), French (Tian et al., 2022; Dione et al., 2022;
Liu et al., 2020a), and Russian (Yu, 2019; Littell et al., 2019), which use the
transformer as the basic architecture and report results. However, there are some
Persian-English datasets without any results on language models.
In this section we review previous work on Persian-English machine translation. First
we consider the statistical and neural approaches to machine translation and introduce
recent work in each. Then we review the efforts in which parallel corpora for
Persian-English machine translation were introduced.
1 https://huggingface.co/
2 https://github.com/
Baselines on SMT systems. Results for a Persian-English SMT system were first reported
for PersianSMT (Pilevar and Faili, 2010). The authors used a phrase-based SMT system and
obtained results with movie subtitles as the main resource of their parallel corpus.
In addition, Bakhshaei et al. (2010) obtained results for a phrase-based
Persian-English SMT system. They tested different values of the SMT system parameters
and compared the results for each value. Mohaghegh and Sarrafzadeh (2010) and
Mohaghegh et al. (2010) reported results for an SMT system with different sizes of
language model corpora. They concluded that training SMT systems with larger corpora
led to better results. Mohaghegh et al. (2011) created a combined parallel corpus called
NSPEC and obtained better results for their SMT system than in their previous work.
Pilevar (2011) created an RBSMT system followed by statistical editing and reported
results for it. This approach outperformed the existing RBSMT systems, yet SMT systems
were still more effective. Mohaghegh (2012) compared a hierarchical SMT system (Joshua)
with a classical one (Moses). Results were obtained for both directions; however, the
hierarchical system produced better results only in the English-to-Persian direction.
Jabbari et al. (2012) created a new corpus on which SMT systems outperformed the
previous results. Mansouri and Faili (2012) compared several SMT systems and also used
a maxent classifier to refine the existing state-of-the-art SMT system. Rasooli et al.
(2013) showed that segmenting Persian verbs is effective and improves the BLEU score.
Passban et al. (2015) improved the existing TEP corpus and created TEP++. They also
obtained results on their new corpus and compared them to other corpora like TEP and
Mizan. The findings of their study surpassed previous results
on both TEP and Mizan corpora. Kashefi (2018) calculated the BLEU score for an SMT
system on the corpus presented in that study (Mizan), reporting results for both
in-domain and out-of-domain test sets.

Figure 2: Examples of English (top) and Persian (bottom) side instances for each dataset

Persian
Dataset                avg  min    max  92%         all   unique
Mizan                   13    1    232   26  13,464,236  131,751
Bible                   28    3    124   48   1,796,084   18,166
Quran                   29    1    373   61  30,235,077   28,380
PEPC Bidirectional      20    7    178   35   4,163,011  169,637
PEPC One Directional    22    7    178   37   3,539,183  158,707
TEP                      8    1     37   14     716,113   22,710
TEP++                    7    1     34   13   4,445,543   92,037
OPUS-100                10    1  1,487   21  10,284,744  155,874

English
Dataset                avg  min    max  92%         all   unique
Mizan                   13    0    226   26  13,360,397  259,182
Bible                   23    2    100   38   1,428,716   40,202
Quran                   33    1    772   74  34,227,828   92,976
PEPC Bidirectional      21    7    153   36   4,354,619  142,792
PEPC One Directional    21    7    153   36   3,359,635  138,489
TEP                      7    1     33   14     684,242   36,634
TEP++                    8    0     32   14   4,720,821   57,753
OPUS-100                 9    1    839   20   9,524,220  342,979

Table 1: General statistics for datasets

Dataset                   train    dev    test        all
Mizan                 1,006,430  5,000  10,166  1,021,596
Bible                    51,329  5,000   5,704     62,033
Quran                 1,013,756  5,000  10,240  1,028,996
PEPC Bidirectional      175,442  5,000  19,494    199,936
PEPC One Directional    138,005  5,000  15,334    158,339
TEP                      72,748  5,000   8,084     85,832
TEP++                   515,925  5,000  57,326    578,251
OPUS-100              1,000,000  2,000   2,000  1,004,000

Table 2: The number of instances in train/dev/test
Baselines on NMT systems. Several attempts have been made to propose baselines for
Persian-English machine translation using neural machine translation (NMT) systems.
Bastan et al. (2017) studied the two tasks of translation and transliteration with an
NMT system. They used RNNs with different numbers of layers in the NMT architecture and
further improved the results by changing the cost function and preprocessing the
Persian corpus. Zaremoodi et al. (2018) and Zaremoodi and Haffari (2018) demonstrated
that, compared with existing NMT systems, a multi-task learning approach improves
machine translation results for low-resource languages like Persian. ParsiNLU was the
first work to use a neural language model for machine translation between Persian and
English (Khashabi et al., 2021). The authors fine-tuned four variants of the Google mT5
text2text model on a part of a benchmark that they created. The training dataset used
for fine-tuning was combined from four corpora for generalisation purposes.
3 Datasets
The vast majority of research and benchmarks on the machine translation task have used
the WMT dataset (Bojar et al., 2014). There are also datasets such as OPUS-100
(Zhang et al., 2020) and OpenSubtitles (Lison and Tiedemann, 2016), which contain 100
and 60 languages respectively and are used for machine translation in other languages.
For the Persian-English language pair, we have collected nine datasets on which to
fine-tune neural seq2seq models and obtain results for each of them.
Moreover, ParsiNLU is a set of language understanding tasks, including machine translation, for
the Persian language (Khashabi et al., 2021). In the
machine translation part of their work, they created
a large parallel corpus integrated from several corpora. The training dataset includes four domains:
the questions from their question paraphrasing task,
the Mizan corpus, the TEP corpus and the Global
Voice corpus. The training dataset contains almost
1.6M entries. The evaluation set consists of Quran,
Mizan, Bible and QPP datasets and contains about
47k sentences.
Each collected dataset is introduced and their
main attributes are investigated as follows.
Quran. Quran is primarily an Arabic book which
has been translated into many languages. Tiedemann (2012) proposed the Tanzil dataset from the
Tanzil project as a part of the OPUS project. This
dataset contains 42 languages. The Persian-English
language pair of this dataset contains almost 1M
sentence pairs and 57.02M words.
Bible. The Bible is another religious book which has been translated into many
languages. As a part of the OPUS project, the Bible dataset was released in 100
languages (Tiedemann, 2012). The Persian-English language pair of this dataset contains
almost 62,000 sentence pairs and 2.89M words.
PEPC. PEPC is another parallel corpus for the Persian-English language pair, obtained
from Wikipedia documents (Karimi et al., 2018). The authors used bidirectional and
one-directional methods to extract documents from Wikipedia and therefore proposed two
versions of the dataset, one per extraction method. The bidirectional PEPC dataset
contains nearly 200,000 sentence pairs, and the one-directional PEPC dataset contains
nearly 160,000 sentence pairs.
Figure 3: Token distribution per sentence for Bible: (a) English side, (b) Persian side
TEP. TEP (Tehran English Persian) is another parallel corpus made from movie subtitles.
Almost 21,000 subtitle files were collected from OpenSubtitles, and only 1,200 subtitle
file pairs remained after removing duplicate files. The final dataset contains over
550,000 lines of text (Pilevar et al., 2011).
TEP++. A refined version of the TEP corpus named TEP++ was introduced by Passban et al.
(2015). They reported that the TEP corpus was noisy and tried to fix this problem in
the new corpus. They also obtained better results for an SMT system by using the TEP++
corpus. This corpus has nearly 570,000 aligned sentences and nearly 5M tokens for each
of Persian and English.
Mizan. Mizan was the largest Persian corpus at
the time it was released. It was created from literature masterpieces. It contains more than one
million sentence pairs and over 23M words for
both Persian and English (Kashefi, 2018).
We randomly selected one instance from each corpus; these are illustrated in Figure 2.
The samples reveal several issues: the OPUS-100 instance places a capitalized "We" and
"Us" in the middle of a sentence, a Persian subtitle contains a dictation (spelling)
mistake, and a word-by-word translation is not perfectly aligned with the meaning of its
source. Some sentences are enclosed in quotation marks or start with lowercase letters
in English. These features of the datasets could affect the evaluation results.
We used Spark NLP (Kocaman and Talby, 2021) to compute general statistics for the
datasets. This information allows parameters such as the sequence length to be selected
more precisely. The max column in Table 1 gives the maximum number of tokens found in a
sentence. Because each dataset contains a few very long sequences that can be treated
as outliers and simply truncated, this number may not be a good choice for the sequence
length. Therefore, for each dataset, we calculated the length that covers 92 percent of
the dataset; in other words, 92% of the sentences have at most this number of tokens.
This number is much lower than the maximum. In addition, the table contains the average
and the minimum number of tokens per sentence, as well as the total number of tokens
and the number of unique tokens, for both the Persian and English sides.
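The 92% coverage length described above can be sketched as follows. This is a minimal illustration, not the authors' Spark NLP pipeline: the function name `coverage_length`, the whitespace-free toy length list, and the ceiling-based rank are assumptions made for the example.

```python
def coverage_length(token_counts, coverage_percent=92):
    """Smallest length L such that at least coverage_percent of the
    sentences contain L tokens or fewer (nearest-rank percentile)."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    ordered = sorted(token_counts)
    # Integer ceiling of coverage_percent% of the number of sentences.
    rank = (coverage_percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical corpus: mostly short sentences plus a few very long outliers.
lengths = [5] * 50 + [10] * 45 + [200] * 5
print("max length:", max(lengths))
print("92% coverage length:", coverage_length(lengths))
```

In this toy corpus the maximum is dominated by five outliers, while the coverage length stays at the length of typical sentences, which is the motivation given in the text for preferring it when choosing the sequence length.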
OPUS-100. OPUS-100 is a concatenation of movie subtitles, GNOME documentation, and
Bible datasets that contains 100 languages and 99 language pairs, all of which use
English as a source or target language (Zhang et al., 2020).
4 Experiments
In order to build our network, we used PyTorch
(Paszke et al., 2019) and Transformers library from
Hugging Face (Wolf et al., 2020) as implementation tools.
                      EN-FA                                 FA-EN
Dataset               mt5-small  mt5-base  nllb-distilled   mt5-small  mt5-base  nllb-distilled
Mizan                     12.22     12.69           15.00       16.29     16.70           18.05
Bible                     13.93     22.06           69.78       16.28     18.83           49.93
Quran                      4.79      4.97           18.10       10.39     10.04           27.65
PEPC Bidirectional         7.10      7.21           13.13       10.28     10.22           17.01
PEPC One Directional       5.37      5.71           13.20        8.82      9.85           16.84
TEP                       11.70     14.11           16.06       13.63     23.64           26.74
TEP++                     21.02     23.09           26.44       30.14     31.63           35.98
OPUS-100                  10.81     10.46           11.62       20.66     20.91           24.16

Table 3: Evaluation of English to Persian (EN-FA) and Persian to English (FA-EN) on the language models
Datasets’ splits. Table 2 provides the total number of instances and the train/dev/test
splits of each dataset. We used the predefined data splits for the OPUS-100 dataset.
For the other datasets, we created the train/dev/test splits ourselves. First we
shuffled all instances of each dataset to randomize their order. Then, for the datasets
with more than one million instances, we chose 1% of all instances for the test split,
5,000 instances for the dev split, and the remaining instances for the train split.
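The splitting procedure can be sketched as below. The text only states the 1% rule for datasets above one million instances; the 10% test fraction for smaller datasets, the choice to take the fraction after removing the dev split, and the ceiling rounding are assumptions inferred from the numbers in Table 2, and the function names are hypothetical.

```python
import random


def held_out_test_size(n_total, dev_size=5000, large_threshold=1_000_000):
    """Test-split size: 1% for large datasets, otherwise 10% (assumed),
    taken from the instances left after removing the dev split."""
    percent = 1 if n_total > large_threshold else 10
    remaining = n_total - dev_size
    return (remaining * percent + 99) // 100  # integer ceiling


def split_dataset(pairs, dev_size=5000, large_threshold=1_000_000, seed=0):
    """Shuffle sentence pairs, then cut them into train/dev/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = held_out_test_size(len(shuffled), dev_size, large_threshold)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + n_test]
    train = shuffled[dev_size + n_test:]
    return train, dev, test


# Sizes reported in Table 2 for Mizan and Bible are reproduced by this rule.
print(held_out_test_size(1_021_596), held_out_test_size(62_033))

# Toy usage with integer stand-ins for sentence pairs.
train, dev, test = split_dataset(range(100), dev_size=10)
print(len(train), len(dev), len(test))
```

Under these assumptions the rule reproduces the test sizes listed in Table 2 for the seven manually split datasets; OPUS-100 is excluded because its predefined splits were used.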
Hyper-Parameters: Khashabi et al. (2021) use a learning rate (lr) of 1e-3 for the
fine-tuning phase. We used the same lr and fine-tuned the models for 7 epochs with the
AdamW optimizer (Loshchilov and Hutter, 2019). To select the sequence length for
training, we considered the sequence length that covers 92% of each dataset. In
addition, we plotted the number of sentences against the number of tokens per sentence,
which allowed us to select a reasonable sequence length. Figure 3 shows an example of
this plot for the Bible dataset.
Models: One of the seq2seq models we used is mT5, which has embeddings for the Persian
language. The other text2text model is NLLB, which outperforms previous cutting-edge
models. Because of the huge number of parameters and the amount of computation power
needed for such models, we fine-tuned only two Google mT5 variants (mT5-small and
mT5-base) and one Meta NLLB variant (distilled NLLB) on the datasets. Below we
summarize the main attributes of these models.
• Google mT5: The Google T5 model is a text-to-text transformer-based language model,
meaning that both the input and the output of the model are text. It can be used for
different tasks such as question answering, machine translation, and text
classification. The mT5 version of this model is pre-trained on the multilingual mC4
data, which contains 101 languages including Persian. The mT5-small version is the
smallest, with only 300 million parameters. The mT5-base is the second smallest, with
580 million parameters. The largest version of this model has about 13 billion
parameters.
• Meta NLLB: The NLLB model, which was the state-of-the-art text2text model at the
time, was proposed with the aim of improving machine translation performance for
low-resource languages. It supports embeddings for almost 200 languages. This model
also uses a transformer-based architecture and has two types: Dense and MoE. The Dense
type activates all model parameters for each input sequence, while the MoE type
activates only a subset of parameters for each input. The NLLB model has 5 variants
regarding the number of parameters. The smallest model has only 600 million parameters
and is a Dense model, while the largest model, which is a MoE model, has about 54.5
billion parameters.
Evaluation metric: The BLEU score (Papineni et al., 2002) is the most common metric and
has been used for evaluating machine translation results for many years. This metric
combines N-gram precisions for different N-gram sizes with a sentence brevity penalty.
Because BLEU can be configured in many different ways, the results reported by
different researchers cannot be compared reliably. For example, many studies do not
report the maximum N-gram size or the tokenization method. The sacreBLEU metric was
proposed by Post (2018) to tackle some of these problems and to establish a standard
metric that is comparable across studies.
Figure 4: The highest values of BLEU scores according to the datasets’ size: (a) English to Persian direction, (b) Persian to English direction
Training process: We considered a single direction for each experiment, although a
model can be fine-tuned on both directions simultaneously. The model was evaluated at
the end of each epoch during the training phase. The best models were selected based on
the value of the evaluation metric on the development dataset. It is important to
preprocess data before training the models, but we did not do so because we wanted to
establish baselines for these datasets. MT systems can be improved by applying
data-cleaning approaches to a dataset.
Hardware: Our Google models were fine-tuned with float32 using TITAN RTX and RTX 3090
Ti GPUs. We used an NVIDIA V100 GPU for the Meta model since it requires a higher level
of computation power. The latter was fine-tuned using a PyTorch feature known as
automatic mixed precision, which uses float16 rather than float32 where possible and
thereby reduced GPU consumption and execution time.
Results: Our fine-tuned models were evaluated using SacreBLEU as the evaluation metric.
Because of limited computation power, the maximum sequence length of the predicted
sentences was smaller than the maximum length of the test references, so the references
cannot be compared fairly with the predictions as they stand. To resolve this issue, we
truncated test instances that exceeded the maximum sequence length of the predicted
sentences before calculating the score. Table 3 shows the value of SacreBLEU with
N-gram = 3.
The N-gram order is an important factor in determining the final BLEU score. The metric
treats N-grams as contiguous sequences of N items from a given text sample. To avoid
ambiguity and make the results comparable with future research, we report the BLEU
measure for N in {3, 4, 5, 6, 7}. Figure 5 illustrates the relationship between the
N-gram order and the scores of the three models, so that their performance and the
impact of the N-gram order can be compared. As expected, the results for larger N-grams
are lower than those for smaller ones. On all of the datasets, the Meta NLLB model
outperformed both variants of the Google mT5 model.
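To make the effect of the maximum N-gram order concrete, the following is a minimal corpus-level BLEU sketch with a configurable order. It is not sacreBLEU: it assumes whitespace tokenization, one reference per hypothesis, and no smoothing, and the function names and example sentences are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_order=4):
    """Unsmoothed corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_order + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any empty order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = ["the cat sat on the mat today"]
ref = ["the cat sat on the red mat today"]
for order in (3, 4, 5, 6):
    print(order, round(corpus_bleu(hyp, ref, max_order=order), 2))
```

On this example the score falls as the maximum order grows and reaches zero once no N-gram of that order matches, which mirrors the trend the text reports for Figure 5.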
Model Evaluation: Figure 7 shows detailed information on the experiments, namely
training and validation perplexities and development BLEU scores during training.
Training perplexities decreased dramatically from epoch one to two and then declined
gradually until epoch seven. Validation perplexities decreased even more rapidly from
epoch one to two, after which they also declined gradually. In some models, the
validation perplexity starts to rise and the models begin to overfit. Perplexity values
are very large at the beginning of this phase, but they drop after one epoch.
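As a reminder of what the curves measure, perplexity is commonly computed as the exponential of the mean per-token negative log-likelihood; the paper does not spell out its formula, so this definition is an assumption. The sketch below also flags the first epoch at which validation perplexity rises, a simple overfitting signal. The function names and the per-epoch values are hypothetical.

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token (in nats)."""
    if not token_nlls:
        raise ValueError("token_nlls must not be empty")
    return math.exp(sum(token_nlls) / len(token_nlls))


def first_rising_epoch(validation_ppl):
    """1-based epoch at which validation perplexity first exceeds the
    previous epoch's value, or None if it never rises."""
    for i in range(1, len(validation_ppl)):
        if validation_ppl[i] > validation_ppl[i - 1]:
            return i + 1
    return None


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(4)] * 10))

# Hypothetical validation curve: sharp drop after epoch 1, later increase.
curve = [310.0, 42.0, 35.5, 33.1, 33.8, 34.9, 36.2]
print(first_rising_epoch(curve))
```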
Figure 5: BLEU score results for different n-grams, separated by translation direction (left side English to Persian, right side Persian to English) and model. Panels: (a) Mizan, (b) Bible, (c) Quran, (d) PEPC Bidirectional, (e) PEPC One Directional, (f) TEP, (g) TEP++, (h) OPUS-100
Figure 6: The impact of the number of training instances on the evaluation dataset for translating Persian to English on the mT5 small model.
To demonstrate how BLEU scores change during the training phase and to understand the
models’ performance on each dataset separately, we calculated this value on the
development sets after every epoch. In most experiments the score increases steadily
and then declines slightly or remains flat. However, in three experiments (PEPC
bidirectional for mt5-small-fa-en and mt5-base-fa-en, and PEPC one directional for
mt5-base-en-fa), the evaluation metric dipped at epoch 2 and recovered quickly.
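Selecting the best checkpoint from per-epoch development BLEU scores, as described in the training process, can be sketched as follows. The helper name `best_epoch` and the score curve are hypothetical; the curve merely imitates a dip at epoch 2 followed by a recovery.

```python
def best_epoch(dev_bleu):
    """1-based epoch with the highest development BLEU; ties go to the earliest epoch."""
    if not dev_bleu:
        raise ValueError("dev_bleu must not be empty")
    best_index = max(range(len(dev_bleu)), key=lambda i: dev_bleu[i])
    return best_index + 1


# Hypothetical development BLEU for seven epochs.
dev_bleu = [8.1, 7.4, 9.0, 9.6, 9.9, 9.8, 9.7]
print("best epoch:", best_epoch(dev_bleu), "last epoch:", len(dev_bleu))
```

Because the score can flatten or decline late in training, the best checkpoint need not be the last one, which is why both are kept separately.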
5 Discussion
In this section, we provide some insights into the outcomes of the experiments.
Additionally, we discuss the quality of the datasets in terms of the number of
instances. Figure 4 shows the maximum BLEU score for each dataset as a function of the
dataset’s size in both directions, which makes the results easier to compare.
Generally, datasets like Quran, OPUS-100, and
Mizan, with more than one million instances, have
received lower or almost the same BLEU score
compared to smaller datasets, such as Bible, TEP,
TEP++, and PEPC variants.
In comparison to the TEP, the TEP++ dataset
achieved a higher score, suggesting that refining
noisy instances and increasing the number of instances had a positive impact on the dataset results.
In contrast, PEPC dataset variations did not show
significant differences between their scores.
Although the Bible is the smallest dataset in terms of the total number of instances,
it achieved the highest score of all in both translation directions. Note also that the
average sequence length of instances in this dataset is the second largest after that
of the Quran, yet the scores for the Quran are much lower.
Quality or Quantity? According to the traditional method of improving machine
translation results, increasing the size of the training data is expected to increase
the BLEU score. However, this study indicates that datasets with a higher number of
instances tend to achieve lower BLEU scores than datasets with a lower number of
instances. Consequently, the quality of the data used for the fine-tuning phase could
be more critical than the number of instances. Regarding quality, dictation mistakes,
misaligned translations, punctuation errors, and incorrect word order on the source or
target side could change the meaning and have a negative effect on the final evaluation
value.
To examine how the number of training samples affects the value of the evaluation
metric, we tested the three datasets with more than one million instances. From each of
those datasets, we sampled 40k and 80k instances and fine-tuned the Google mT5 small
model. Figure 6 shows that the model achieves better results as the number of instances
increases.
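The subsampling step can be sketched as below. The paper does not say how the 40k and 80k samples were drawn; this sketch assumes uniform sampling with nested subsets (the smaller sample is contained in the larger one), so that any score difference reflects added data rather than a different draw. The function name `nested_samples` is hypothetical.

```python
import random


def nested_samples(pairs, sizes=(40_000, 80_000), seed=0):
    """Draw nested random subsets of the given sizes from one shuffled copy."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


# Integer stand-ins for the sentence pairs of a large corpus.
subsets = nested_samples(range(100_000))
print({size: len(items) for size, items in subsets.items()})
```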
Translation Direction: The general order of the BLEU scores obtained is almost
identical in both directions, but a few points should be noted. The Bible dataset
obtained the highest BLEU score in both directions. However, the OPUS-100 dataset had
the lowest BLEU score in the English-to-Persian direction, whereas the one-directional
PEPC dataset had the lowest BLEU score in the Persian-to-English direction. Although
almost all datasets performed better in Persian-to-English translation, the Bible
dataset performed significantly better in English-to-Persian translation, by nearly 20
BLEU points (69.78 versus 49.93 in Table 3). In Persian-to-English translation, the
OPUS-100 dataset performs significantly better than the Mizan dataset, while in the
opposite direction the Mizan dataset performs better.
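These observations can be checked directly against the scores in Table 3. The sketch below copies the table values into a dictionary and compares, per dataset, the best score over the three models in each direction; the variable and function names are illustrative.

```python
DATASETS = ["Mizan", "Bible", "Quran", "PEPC Bidirectional",
            "PEPC One Directional", "TEP", "TEP++", "OPUS-100"]

# BLEU scores from Table 3, one list per model in DATASETS order.
BLEU = {
    "EN-FA": {
        "mt5-small": [12.22, 13.93, 4.79, 7.10, 5.37, 11.70, 21.02, 10.81],
        "mt5-base": [12.69, 22.06, 4.97, 7.21, 5.71, 14.11, 23.09, 10.46],
        "nllb-distilled": [15.00, 69.78, 18.10, 13.13, 13.20, 16.06, 26.44, 11.62],
    },
    "FA-EN": {
        "mt5-small": [16.29, 16.28, 10.39, 10.28, 8.82, 13.63, 30.14, 20.66],
        "mt5-base": [16.70, 18.83, 10.04, 10.22, 9.85, 23.64, 31.63, 20.91],
        "nllb-distilled": [18.05, 49.93, 27.65, 17.01, 16.84, 26.74, 35.98, 24.16],
    },
}


def best_scores(direction):
    """Highest BLEU over the three models for each dataset."""
    columns = zip(*BLEU[direction].values())
    return {name: max(scores) for name, scores in zip(DATASETS, columns)}


en_fa, fa_en = best_scores("EN-FA"), best_scores("FA-EN")
print("lowest EN-FA:", min(en_fa, key=en_fa.get))
print("lowest FA-EN:", min(fa_en, key=fa_en.get))
print("better in EN-FA:", [d for d in DATASETS if en_fa[d] > fa_en[d]])
print("Bible gap:", round(en_fa["Bible"] - fa_en["Bible"], 2))
```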
6 Conclusion
In this study, we reviewed the majority of Persian-English parallel corpora and
established standard baselines for eight datasets. The datasets were evaluated using
two multilingual seq2seq models based on the transformer architecture. Our analysis of
48 experiments indicates that the Bible and PEPC datasets have the highest and lowest
BLEU scores, respectively. Additionally, we conclude that Meta’s basic variant
outperforms previous transformer-based approaches by a significant margin. The findings
also indicate that in most experiments, the evaluation metric for translation from
Persian to English is higher than the evaluation metric for translation from English to
Persian. To the best of our knowledge, this is the first study that presents baselines
for each dataset separately using seq2seq models. We hope that this research will help
researchers compare their methods with the baselines and evaluate them specifically for
the Persian language.
Figure 7: BLEU scores, training perplexities, and validation perplexities for each dataset. Panels: (a) Mizan, (b) Bible, (c) Quran, (d) PEPC Bidirectional, (e) PEPC One Directional, (f) TEP, (g) TEP++, (h) OPUS-100
Acknowledgements
This work has been supported by the Simorgh Supercomputer - Amirkabir University of Technology
under Contract No ISI-DCE-DOD-Cloud-9008081700.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly
learning to align and translate.
In 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings.
Somayeh Bakhshaei, Shahram Khadivi, Noushin Riahi,
and Hossein Sameti. 2010. A study to find influential parameters on a farsi-english statistical machine
translation system. In 2010 5th International Symposium on Telecommunications, pages 985–991.
Mohaddeseh Bastan, Shahram Khadivi, and Mohammad Mehdi Homayounpour. 2017. Neural machine
translation on scarce-resource condition: A case-study on persian-english. In 2017 Iranian
Conference on Electrical Engineering (ICEE), pages 1485–1490.
Ondřej Bojar, Christian Buck, Christian Federmann,
Barry Haddow, Philipp Koehn, Johannes Leveling,
Christof Monz, Pavel Pecina, Matt Post, Herve
Saint-Amand, Radu Soricut, Lucia Specia, and Aleš
Tamchyna. 2014. Findings of the 2014 workshop on
statistical machine translation. In Proceedings of the
Ninth Workshop on Statistical Machine Translation,
pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.
Peter F. Brown, Stephen A. Della Pietra, Vincent J.
Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter
estimation. Computational Linguistics, 19(2):263–
311.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties
of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.
Marta R Costa-jussà, James Cross, Onur Çelebi, Maha
Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe
Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672v3.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Cheikh M. Bamba Dione, Alla Lo, Elhadji Mamadou
Nguer, and Sileye Ba. 2022. Low-resource neural
machine translation: Benchmarking state-of-the-art
transformer for Wolof<->French. In Proceedings of
the Thirteenth Language Resources and Evaluation
Conference, pages 6654–6661, Marseille, France.
European Language Resources Association.
Fattaneh Jabbari, Somayeh Bakshaei, Seyyed Mohammad Mohammadzadeh Ziabary, and Shahram
Khadivi. 2012. Developing an open-domain
English-Farsi translation system using AFEC:
Amirkabir bilingual Farsi-English corpus. In Fourth
Workshop on Computational Approaches to Arabic-Script-based Languages, pages 17–23, San Diego,
California, USA. Association for Machine Translation in the Americas.
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent
continuous translation models. In Proceedings of
the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle,
Washington, USA. Association for Computational
Linguistics.
Akbar Karimi, Ebrahim Ansari, and Bahram
Sadeghi Bigham. 2018. Extracting an English-Persian parallel corpus from comparable corpora.
In Proceedings of the Eleventh International Conference on Language Resources and Evaluation
(LREC 2018), Miyazaki, Japan. European Language
Resources Association (ELRA).
Omid Kashefi. 2018. Mizan: A large persian-english parallel corpus.
arXiv preprint arXiv:1801.02107v3.
Daniel Khashabi, Arman Cohan, Siamak Shakeri,
Pedram Hosseini, Pouya Pezeshkpour, Malihe
Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze
Brahman, Sarik Ghazarian, Mozhdeh Gheini,
Arman Kabiri, Rabeeh Karimi Mahabagdi, Omid
Memarrast, Ahmadreza Mosallanezhad, Erfan
Noury, Shahab Raji, Mohammad Sadegh Rasooli,
Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi
Samghabadi, Mahsa Shafaei, Saber Sheybani,
Ali Tazarv, and Yadollah Yaghoobzadeh. 2021.
ParsiNLU: A suite of language understanding challenges for Persian. Transactions of the Association
for Computational Linguistics, 9:1147–1162.
Veysel Kocaman and David Talby. 2021. Spark nlp:
Natural language understanding at scale. Software
Impacts, page 100058.
Philipp Koehn. 2009. Statistical machine translation.
Cambridge University Press.
Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and
Tom Moir. 2011. Improving Persian-English statistical machine translation:experiments in domain adaptation. In Proceedings of the 2nd Workshop on South
Southeast Asian Natural Language Processing (WSSANLP), pages 9–15, Chiang Mai, Thailand. Asian
Federation of Natural Language Processing.
Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from
movie and TV subtitles. In Proceedings of the Tenth
International Conference on Language Resources
and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).
El Moatez Billah Nagoudi, AbdelRahim Elmadany,
and Muhammad Abdul-Mageed. 2022. TURJUMAN: A public toolkit for neural Arabic machine
translation. In Proceedings of the 5th Workshop
on Open-Source Arabic Corpora and Processing
Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection, pages 1–11, Marseille, France. European Language Resources Association.
Patrick Littell, Chi-kiu Lo, Samuel Larkin, and Darlene Stewart. 2019. Multi-source transformer for
Kazakh-Russian-English neural machine translation.
In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers,
Day 1), pages 267–274, Florence, Italy. Association
for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia,
Pennsylvania, USA. Association for Computational
Linguistics.
Xiaodong Liu, Kevin Duh, Liyuan Liu, and Jianfeng
Gao. 2020a. Very deep transformers for neural machine translation. ArXiv, abs/2008.07772.
Peyman Passban, Andy Way, and Qun Liu. 2015.
Benchmarking SMT performance for Farsi using
the TEP++ corpus. In Proceedings of the 18th Annual Conference of the European Association for
Machine Translation, Antalya, Turkey. European Association for Machine Translation.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2020b.
RoBERTa: A robustly optimized BERT pretraining
approach.
Adam Lopez. 2008. Statistical machine translation.
ACM Computing Surveys (CSUR), 40(3):1–49.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled
weight decay regularization. In International Conference on Learning Representations.
Amin Mansouri and Heshaam Faili. 2012. State-of-the-art english to persian statistical machine translation system. In The 16th CSI International Symposium on Artificial Intelligence and Signal Processing
(AISP 2012), pages 174–179.
Mahsa Mohaghegh. 2012. Advancements in english-persian hierarchical statistical machine translation.
In NZCSRSC New Zealand Computer Science Research Student Conference.
Mahsa Mohaghegh and Abdolhossein Sarrafzadeh.
2010. Performance evaluation of various training
data in english-persian statistical machine translation. In 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome,
Italy.
Mahsa Mohaghegh, Abdolhossein Sarrafzadeh, and
Tom Moir. 2010. Improved language modeling for
english-persian statistical machine translation. In
Proceedings of the 4th Workshop on Syntax and
Structure in Statistical Translation, pages 75–82.
Adam Paszke, Sam Gross, Francisco Massa, Adam
Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Kopf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,
Junjie Bai, and Soumith Chintala. 2019.
PyTorch: An imperative style, high-performance deep
learning library. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Abdol Hamid Pilevar. 2011. Using statistical post-editing to improve the output of rule-based machine
translation system. International Journal of Computer Science and Communication, 330:330–000.
Mohammad Taher Pilevar and Heshaam Faili. 2010.
Persiansmt: A first attempt to english-persian statistical machine translation. In JADT, volume 2010,
page 10th.
Mohammad Taher Pilevar, Heshaam Faili, and Abdol Hamid Pilevar. 2011. Tep: Tehran english-persian parallel corpus. In Computational Linguistics and Intelligent Text Processing, pages 68–79,
Berlin, Heidelberg. Springer Berlin Heidelberg.
Matt Post. 2018. A call for clarity in reporting BLEU
scores. In Proceedings of the Third Conference on
Machine Translation: Research Papers, pages 186–191,
Brussels, Belgium. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Mohammad Sadegh Rasooli, Ahmed El Kholy, and
Nizar Habash. 2013. Orthographic and morphological processing for Persian-to-English statistical machine translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1047–1051, Nagoya, Japan. Asian
Federation of Natural Language Processing.
Taoling Tian, Chai Song, Jin Ting, and Hongyang
Huang. 2022. A french-to-english machine translation model using transformer network. Procedia
Computer Science, 199:1438–1443. The 8th International Conference on Information Technology and
Quantitative Management (ITQM 2020 2021): Developing Global Digital Economy after COVID-19.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and
Evaluation (LREC’12), pages 2214–2218, Istanbul,
Turkey. European Language Resources Association
(ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander M. Rush. 2020.
Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing:
System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws,
Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith
Stevens, George Kurian, Nishant Patil, Wei Wang,
Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2016. Google’s neural machine
translation system: Bridging the gap between human
and machine translation. CoRR, abs/1609.08144.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya
Barua, and Colin Raffel. 2021. mT5: A massively
multilingual pre-trained text-to-text transformer. In
Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
pages 483–498, Online. Association for Computational Linguistics.
Doron Yu. 2019. The en-Ru two-way integrated machine translation system based on transformer. In
Proceedings of the Fourth Conference on Machine
Translation (Volume 2: Shared Task Papers, Day
1), pages 434–439, Florence, Italy. Association for
Computational Linguistics.
Poorya Zaremoodi, Wray Buntine, and Gholamreza
Haffari. 2018. Adaptive knowledge sharing in multitask learning: Improving low-resource neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 656–661,
Melbourne, Australia. Association for Computational Linguistics.
Poorya Zaremoodi and Gholamreza Haffari. 2018.
Neural machine translation for bilingually scarce
scenarios: a deep multi-task learning approach. In
Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1356–1365, New
Orleans, Louisiana. Association for Computational
Linguistics.
Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639,
Online. Association for Computational Linguistics.