Volume 3, Issue 2, 2025

Abstract


The psychometric validity of multiple-choice questions (MCQs) generated by an advanced Artificial Intelligence (AI) language model (ChatGPT) was evaluated in comparison with those developed by experienced human instructors, with a focus on mathematics teacher education. Two parallel 30-item MCQ tests—one human-designed and one AI-generated—were administered to 30 mathematics teacher trainees. A comprehensive psychometric analysis was conducted using six metrics: item difficulty index (Pi), discrimination index (D), point-biserial correlation, item-test correlation (Rit), Cronbach’s alpha (α) for internal consistency, and score variance. The analysis was facilitated by the Analysis of Didactic Items with Excel (AnDIE) tool. Results indicated that the human-authored MCQs exhibited acceptable difficulty (mean Pi = 0.55), moderate discrimination power (mean D = 0.31), and strong internal consistency (Cronbach’s α = 0.752). In contrast, the AI-generated MCQs were found to be substantially more difficult (mean Pi = 0.22), demonstrated weak discrimination (mean D = 0.16), and yielded negative internal consistency reliability (Cronbach’s α = −0.1), raising concerns about their psychometric quality. While AI-generated assessments offer advantages in terms of scalability and speed, the findings underscore the necessity of expert human review to ensure content validity, construct alignment, and pedagogical appropriateness. These results suggest that AI, in its current form, is not yet equipped to autonomously generate assessment instruments of sufficient quality for high-stakes educational settings. A hybrid test design model is therefore advocated, wherein AI is leveraged for initial item drafting, followed by rigorous human refinement. This approach may enhance both efficiency and quality in the development of educational assessments. The implications extend to educators, assessment designers, and developers of educational AI systems, highlighting the need for collaborative human-AI frameworks to achieve reliable, valid, and pedagogically sound testing instruments.
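The abstract names several classical item statistics (item difficulty Pi, discrimination index D, point-biserial correlation, and Cronbach's α). As a rough illustration of how these indices are conventionally computed from a binary response matrix — not a reproduction of the AnDIE tool itself, and with a hypothetical function name and simulated data — a minimal Python sketch follows:

```python
import numpy as np

def item_analysis(X):
    """Standard item statistics for a 0/1 response matrix X
    (rows = examinees, columns = items). Illustrative only."""
    n_persons, n_items = X.shape
    total = X.sum(axis=1)                     # raw test score per examinee

    # Item difficulty index Pi: proportion answering each item correctly.
    p = X.mean(axis=0)

    # Discrimination index D: difficulty gap between the top 27% and
    # bottom 27% of examinees ranked by total score.
    order = np.argsort(total)
    k = max(1, int(round(0.27 * n_persons)))
    d = X[order[-k:]].mean(axis=0) - X[order[:k]].mean(axis=0)

    # Point-biserial (item-total) correlation, uncorrected: the item is
    # still included in the total score.
    r_pbis = np.array([np.corrcoef(X[:, j], total)[0, 1]
                       for j in range(n_items)])

    # Cronbach's alpha for internal consistency.
    alpha = (n_items / (n_items - 1)) * (
        1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

    return p, d, r_pbis, alpha

# Example: 30 examinees x 30 items of simulated binary responses.
rng = np.random.default_rng(0)
X = (rng.random((30, 30)) < 0.55).astype(int)
p, d, r_pbis, alpha = item_analysis(X)
print(p.mean(), d.mean(), alpha)
```

A corrected item-test correlation (Rit) would additionally remove each item from the total score before correlating; the uncorrected version above is kept only for brevity.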

Abstract


The reduction of excessive academic burden in China’s basic education system has been established as a central objective of national education reform and has become a subject of intense policy debate. To elucidate the complex strategic interactions that shape the implementation of the “Double Reduction” policy, a multi-agent evolutionary game model was constructed incorporating three principal stakeholder groups: government authorities, schools and teachers, and students and parents. Replicator dynamic equations were employed to examine the evolutionary stability of stakeholder strategies and the conditions under which equilibrium outcomes emerge. Through numerical simulations, the influence of regulatory enforcement intensity on behavioral trajectories and convergence patterns was evaluated. The results reveal that asymptotically stable equilibria exist, with optimal system performance achieved when government bodies maintain active and credible regulatory oversight, educational institutions engage in substantive and sustained burden-reduction efforts, and families adopt cooperative and adaptive responses. By clarifying the mechanisms through which stakeholder interactions determine collective outcomes, this study provides theoretical support for the refinement of policy coordination and the long-term enhancement of education governance capacity. These findings contribute not only to the understanding of the “Double Reduction” policy’s systemic impact but also to broader discussions on the role of evolutionary game theory in evaluating multi-agent policy interventions in education systems.
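The abstract refers to replicator dynamic equations for three two-strategy populations (government, schools and teachers, students and parents). The sketch below shows the generic form of such a system, ẋ = x(1 − x)(payoff advantage of the cooperative strategy); the payoff functions and parameter values are placeholders for illustration, not the paper's calibrated model:

```python
import numpy as np

# Placeholder payoff advantages of each population's "cooperative" strategy
# (active regulation, genuine burden reduction, adaptive family response).
# The functional forms and constants are assumptions, not the study's model.
def payoff_gap_gov(y, z, reward=2.0, cost=1.0):
    return reward * (y + z) / 2 - cost

def payoff_gap_school(x, z, benefit=1.5, effort=1.0):
    return benefit * x - effort * (1 - z)

def payoff_gap_family(x, y, gain=1.2, pressure=0.8):
    return gain * y - pressure * (1 - x)

def replicator_step(x, y, z, dt=0.01):
    """One explicit-Euler step of the three-population replicator dynamics:
    each strategy share grows in proportion to its payoff advantage."""
    dx = x * (1 - x) * payoff_gap_gov(y, z)
    dy = y * (1 - y) * payoff_gap_school(x, z)
    dz = z * (1 - z) * payoff_gap_family(x, y)
    return x + dt * dx, y + dt * dy, z + dt * dz

# Simulate a trajectory from an interior starting point and inspect
# which corner of the strategy cube it converges toward.
x, y, z = 0.4, 0.3, 0.5
for _ in range(20000):
    x, y, z = replicator_step(x, y, z)
print(round(x, 3), round(y, 3), round(z, 3))
```

In a study of this kind, the enforcement-intensity parameters would be varied across runs to trace how convergence toward the cooperative equilibrium depends on regulatory oversight.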
