Sentence Classification Using Machine Learning and Word Embedding: An Innovation in Indonesian Language Learning

Authors

  • Sri Kusuma Winahyu, National Research and Innovation Agency (BRIN)
  • Fawwaz Zaini Ahmad, Bank Rakyat Indonesia
  • Achril Zalmansyah, National Research and Innovation Agency (BRIN)
  • Exti Budihastuti, National Research and Innovation Agency (BRIN)
  • Pradicta Nurhuda, National Research and Innovation Agency (BRIN)
  • Fairul Zabadi, National Research and Innovation Agency (BRIN)
  • Zainal Abidin, National Research and Innovation Agency (BRIN)
  • Suyadi, National Research and Innovation Agency (BRIN)
  • Sri Yono, National Research and Innovation Agency (BRIN)
  • Evi Maha Kastri, National Research and Innovation Agency (BRIN)

DOI:

https://doi.org/10.17507/jltr.1604.17

Keywords:

Bag of Words, Continuous Bag of Words, sentence structure, syntactic assessment, Term Frequency-Inverse Document Frequency, vectorization techniques

Abstract

In applied linguistics, writing assessment is used to evaluate language learning. Writing spans various genres, but its evaluation always includes a syntactic component, namely sentence structure. This research focuses on classifying sentence structure in the Indonesian language using the Random Forest Classifier algorithm in five experimental models, each trained with a different vectorization technique: bag of words (BoW), hashing, Term Frequency-Inverse Document Frequency (TF-IDF), continuous bag of words (CBoW), and skip-gram. The results showed that accuracy varied considerably across the models. The highest accuracy, 76%, was achieved by the model trained with the CBoW vectorizer, while the models trained with the BoW and skip-gram vectorizers had the lowest accuracy, 65%. These results suggest that the choice of vectorization technique substantially affects model accuracy and that CBoW was the most effective of the techniques tested. Although the skip-gram model was trained on the dataset itself before being used to vectorize it, it did not show a significant improvement in accuracy. Classifying sentence structures with various models is important and may continue to support the syntactic assessment of computer-based Indonesian language writing skills.
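To make the count-based vectorizers named in the abstract concrete, the following is a minimal sketch of BoW, hashing, and TF-IDF in plain Python. It is not the authors' implementation: the three toy sentences, the function names, the bucket count, and the TF-IDF weighting (term count times ln(N / df)) are all assumptions chosen for illustration. The CBoW and skip-gram embeddings and the Random Forest classifier are left out because they require a trained model.

```python
import math
import zlib
from collections import Counter

# Toy sentences invented for illustration; they are NOT from the study's dataset.
sentences = [
    "saya makan nasi",
    "adik makan roti",
    "saya membaca buku",
]
tokenized = [s.split() for s in sentences]
vocab = sorted({w for toks in tokenized for w in toks})


def bow_vector(tokens, vocab):
    """Bag of words: one raw count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def hashing_vector(tokens, n_buckets=8):
    """Hashing: map each token to a bucket via a stable hash (no vocabulary).

    n_buckets=8 is an arbitrary choice; different tokens may collide.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


def tfidf_vectors(tokenized, vocab):
    """TF-IDF using tf * ln(N / df), an assumed textbook variant."""
    n_docs = len(tokenized)
    # Document frequency: number of sentences containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vectors


bow = [bow_vector(t, vocab) for t in tokenized]
hashed = [hashing_vector(t) for t in tokenized]
tfidf = tfidf_vectors(tokenized, vocab)
```

In this sketch, a word that occurs in only one sentence (such as "nasi") receives a larger TF-IDF weight than one shared by two sentences (such as "makan"), whereas BoW gives both a count of 1. Each resulting vector could then serve as one input row for a classifier.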

Author Biographies

Sri Kusuma Winahyu, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Achril Zalmansyah, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Exti Budihastuti, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Pradicta Nurhuda, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Fairul Zabadi, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Zainal Abidin, National Research and Innovation Agency (BRIN)

Research Center for Preservation Language and Literature

Suyadi, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Sri Yono, National Research and Innovation Agency (BRIN)

Research Center for Language, Literature, and Community

Evi Maha Kastri, National Research and Innovation Agency (BRIN)

Research Center for Preservation Language and Literature

Published

2025-07-01

Section

Articles