A DQF-MQM Evaluation of Machine Translation in English–Arabic Customs Law
DOI:
https://doi.org/10.17507/jltr.1703.26
Keywords:
compliance hazard, Google Translate, metric discordance, neural machine translation, Revised Kyoto Convention
Abstract
This study presents a corpus-based diagnostic evaluation of Google Translate’s performance in translating the Revised Kyoto Convention (RKC) from English into Arabic. Building on the "Adequacy-Fluency Tradeoff" observed in recent literature, the researchers employ an explanatory sequential triangulation design to assess whether high-resource NMT models can satisfy the strict compliance requirements of Customs law. The methodology applies the COMET metric to a stratified test suite of the RKC corpus (N = 1,000 segments) to establish a quantitative baseline. To diagnose potential metric discordance, the researchers then conduct a targeted human evaluation in two stages: a Likert-scale assessment of the highest-scoring segments (n = 100) to verify the "performance ceiling," followed by a granular DQF-MQM error analysis of the lowest-scoring segments (n = 200) carried out by a panel of experts. The analysis identifies a profound discordance within the RKC corpus: while the system achieved a high mean COMET score of 87.70, the error analysis of the "performance floor" revealed 1,579 discrete errors. The resulting penalty score (1,292.5) exceeded the professional acceptance threshold for this text type by a factor of 5.7. These findings indicate that, within this specific legal corpus, Google Translate exhibits "specification-blindness," systematically failing to disambiguate the polysemous "Terms of Art" required for international trade compliance. The research concludes that, for the RKC, automated metrics do not yet serve as a reliable proxy for legal review, and that a Human-in-the-Loop framework remains an absolute prerequisite for the deployment of Google Translate in Customs administration.
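The figures reported in the abstract can be related to one another with a few lines of arithmetic. The following is a minimal sketch, not the authors' scoring procedure: it uses only the numbers stated above (1,579 errors, a penalty of 1,292.5, 200 floor segments, a factor of 5.7), and the constant and function names are hypothetical. The acceptance threshold itself is not stated in the abstract, so the value computed here is only an approximation back-calculated from the rounded factor.

```python
# Figures as reported in the abstract (performance-floor sample).
ERRORS = 1579             # discrete errors found by the expert panel
PENALTY = 1292.5          # total DQF-MQM penalty score
FLOOR_SEGMENTS = 200      # lowest-scoring segments analysed
THRESHOLD_FACTOR = 5.7    # penalty / acceptance threshold (rounded in the abstract)


def derived_figures() -> dict[str, float]:
    """Derive per-segment and per-error rates from the reported totals.

    'implied_threshold' is an assumption: it is back-calculated from the
    rounded factor of 5.7 and is not a value given in the abstract.
    """
    return {
        "errors_per_segment": ERRORS / FLOOR_SEGMENTS,
        "penalty_per_segment": PENALTY / FLOOR_SEGMENTS,
        "mean_penalty_per_error": PENALTY / ERRORS,
        "implied_threshold": PENALTY / THRESHOLD_FACTOR,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.3f}")
```

On these figures the floor sample averages roughly eight errors per segment, and the mean penalty per error is below 1, which suggests that many of the errors carried a low severity weight; the abstract does not report the weighting scheme, so this reading is tentative.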