Volume 5, Issue 2, 2026

Abstract

Automated grading has become an important component of digital transformation in K-12 education, yet the structured recognition of handwritten responses on answer sheets remains a practical challenge. General-purpose vision-language models often show limited robustness when applied directly to school assessment materials, particularly in the presence of fixed answer regions, mixed Chinese-English content, and diverse handwriting styles. To address this issue, this study develops a task-oriented fine-tuning framework for automated recognition of handwritten answer sheets in K-12 educational settings. A multimodal dataset was constructed from Chinese and English answer sheets, with region-level annotations designed to support structured text extraction. Based on this dataset, the Qwen2.5-VL-7B-Instruct model was adapted through LoRA-based fine-tuning under a dual-A16 GPU environment to reduce computational cost while preserving practical deployment feasibility. An end-to-end workflow covering data preparation, model training, weight merging, and inference was then established for structured JSON output. Experimental results show that the fine-tuned model achieved stable convergence in both small-sample and medium-sample settings and improved the extraction quality of handwritten responses within predefined answer regions. The proposed framework provides a practical and reproducible solution for deploying vision-language models in school grading scenarios with limited computing resources. The study also offers an application-oriented reference for the integration of multimodal large models into educational assessment systems.
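As a minimal illustration of the structured JSON output described above, the sketch below parses a region-level recognition result. The field names (`sheet_id`, `regions`, `region_id`, `text`) are assumptions for demonstration only, not the schema actually used in the study.

```python
import json

# Hypothetical model output: each predefined answer region on the sheet
# maps to the recognized handwritten text. Field names are illustrative
# assumptions, not the paper's actual schema.
model_output = """
{
  "sheet_id": "demo-001",
  "regions": [
    {"region_id": "Q1", "language": "zh", "text": "示例答案"},
    {"region_id": "Q2", "language": "en", "text": "An example answer."}
  ]
}
"""

def extract_regions(raw: str) -> dict:
    """Parse structured JSON output and index recognized text by region id."""
    data = json.loads(raw)
    return {r["region_id"]: r["text"] for r in data["regions"]}

answers = extract_regions(model_output)
print(answers["Q2"])  # prints: An example answer.
```

Downstream grading logic could then compare each region's recognized text against an answer key keyed by the same region identifiers.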