ACADlore - ESM, Volume 3, Issue 2, 2025

Education Science and Management

Volume 3, Issue 2, 2025

Open Access

Research article

Psychometric Evaluation of Human-Crafted and AI-Generated Multiple-Choice Questions for Mathematics Instruction

Rachid El Chaal

Rabei Ben Seghir

Moulay Othman Aboutafail

Education Science and Management

Volume 3, Issue 2, 2025

Pages 78-92

https://doi.org/10.56578/esm030201

Available online: 04-10-2025

Abstract

The psychometric validity of multiple-choice questions (MCQs) generated by an advanced Artificial Intelligence (AI) language model (ChatGPT) was evaluated in comparison with those developed by experienced human instructors, with a focus on mathematics teacher education. Two parallel 30-item MCQ tests (one human-designed and one AI-generated) were administered to 30 mathematics teacher trainees. A comprehensive psychometric analysis was conducted using six metrics: item difficulty index (P_i), discrimination index (D), point-biserial correlation, item-test correlation (R_it), Cronbach’s alpha (α) for internal consistency, and score variance. The analysis was facilitated by the Analysis of Didactic Items with Excel (AnDIE) tool. Results indicated that the human-authored MCQs exhibited acceptable difficulty (mean P_i = 0.55), moderate discrimination power (mean D = 0.31), and strong internal consistency (Cronbach’s α = 0.752). In contrast, the AI-generated MCQs were found to be substantially more difficult (mean P_i = 0.22), demonstrated weak discrimination (mean D = 0.16), and yielded negative internal consistency reliability (Cronbach’s α = −0.1), raising concerns about their psychometric quality. While AI-generated assessments offer advantages in terms of scalability and speed, the findings underscore the necessity of expert human review to ensure content validity, construct alignment, and pedagogical appropriateness. These results suggest that AI, in its current form, is not yet equipped to autonomously generate assessment instruments of sufficient quality for high-stakes educational settings. A hybrid test design model is therefore advocated, wherein AI is leveraged for initial item drafting, followed by rigorous human refinement. This approach may enhance both efficiency and quality in the development of educational assessments. The implications extend to educators, assessment designers, and developers of educational AI systems, highlighting the need for collaborative human-AI frameworks to achieve reliable, valid, and pedagogically sound testing instruments.
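
For readers unfamiliar with the metrics named in the abstract, the following is a minimal Python sketch of three of them (item difficulty, discrimination index, and Cronbach's alpha) computed from a 0/1 scored response matrix. It is not the AnDIE tool's implementation, which this listing does not describe. The function names, the 27% upper/lower grouping rule, the use of sample variance, and the toy data are all assumptions made for illustration.

```python
from statistics import variance


def item_difficulty(responses):
    """Proportion of examinees answering each item correctly (P_i)."""
    n = len(responses)
    k = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]


def discrimination_index(responses, fraction=0.27):
    """Upper-group minus lower-group proportion correct (D) per item.

    Assumption: groups are the top and bottom `fraction` of examinees
    ranked by total score (27% is a common convention).
    """
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
    g = max(1, round(fraction * len(responses)))
    upper = [responses[i] for i in order[:g]]
    lower = [responses[i] for i in order[-g:]]
    k = len(responses[0])
    return [
        (sum(r[j] for r in upper) - sum(r[j] for r in lower)) / g
        for j in range(k)
    ]


def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance)."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: rows are examinees, columns are items (1 = correct).
consistent = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Items that tend to disagree with each other across examinees.
inconsistent = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1]]

difficulty = item_difficulty(consistent)
discrimination = discrimination_index(consistent)
alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)
```

Alpha becomes negative whenever the summed item variances exceed the variance of the total scores, which happens when items covary negatively on average. The second toy matrix is built to show that case, which is the pattern a negative reliability value such as the one reported for the AI-generated test points to.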

Open Access

Research article

Evolutionary Game Analysis of Stakeholder Strategies in China’s Basic Education under the “Double Reduction” Policy

Wang Dou

Education Science and Management

Volume 3, Issue 2, 2025

Pages 93-105

https://doi.org/10.56578/esm030202

Available online: 04-14-2025
