Sentence Classification Using Machine Learning and Word Embedding: An Innovation in Indonesian Language Learning
DOI:
https://doi.org/10.17507/jltr.1604.17
Keywords:
Bag of Word, Continuous Bag of Word, sentence structure, syntactic assessment, Term Frequency-Inverse Document Frequency, vectorization techniques
Abstract
In applied linguistics, writing assessment is used to examine language learning. Writing spans various genres, but its evaluation always includes a syntactic component, that is, sentence structure. This research focuses on classifying sentence structure in the Indonesian language using the Random Forest Classifier algorithm in five experimental models, each trained with a different vectorization technique: Bag of Word (BoW), hashing, Term Frequency-Inverse Document Frequency (TF-IDF), Continuous Bag of Word (CBoW), and skip-gram vectorizers. The results showed that accuracy varied considerably across the models, with the highest accuracy, 76%, achieved by the model trained using the CBoW vectorizer. The models trained using the BoW and skip-gram vectorizers had the lowest accuracy, 65%. These results suggest that the choice of vectorization technique significantly affects model accuracy and that CBoW was the most effective technique in these experiments. Although the skip-gram model was trained on the dataset itself before being used to vectorize that dataset, it did not show a significant improvement in accuracy. Classifying sentence structures with various models is important and may continue to support the syntactic assessment of computer-based Indonesian language writing skills.
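To make the difference between the count-based vectorizers concrete, the following is a minimal sketch of BoW, hashing, and TF-IDF vectorization in plain Python. It is an illustration under stated assumptions, not the study's implementation: the toy sentences, the function names, the hash width of 8, and the unsmoothed weighting tf × log(N/df) are all hypothetical choices (libraries commonly use smoothed variants). The Random Forest classifier and the CBoW and skip-gram embeddings are outside the scope of this sketch.

```python
import math
import zlib
from collections import Counter

# Hypothetical toy sentences; the study's dataset is not reproduced here.
sentences = [
    "saya makan nasi",
    "dia membaca buku",
    "saya membaca buku baru",
]


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return sentence.lower().split()


def bow_vectors(docs):
    """Bag of words: one column per vocabulary term, holding raw counts."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def hashing_vectors(docs, n_features=8):
    """Hashing: map each token to a fixed column index; no vocabulary is stored."""
    vectors = []
    for doc in docs:
        vec = [0] * n_features
        for tok in tokenize(doc):
            # crc32 is deterministic across runs, unlike the built-in hash().
            vec[zlib.crc32(tok.encode("utf-8")) % n_features] += 1
        vectors.append(vec)
    return vectors


def tfidf_vectors(docs):
    """TF-IDF: raw counts scaled by log(N / document frequency), unsmoothed."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    vectors = [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]
    return vocab, vectors


vocab, bow = bow_vectors(sentences)
hashed = hashing_vectors(sentences)
_, tfidf = tfidf_vectors(sentences)

for sentence, row in zip(sentences, bow):
    print(sentence, "->", row)
```

In this sketch, BoW and TF-IDF share the same columns and differ only in weighting, so terms that occur in fewer sentences receive larger TF-IDF values. The hashing variant trades possible column collisions for a fixed vector width. Any of these matrices could then be passed to a classifier such as a random forest.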