The psychometric validity of multiple-choice questions (MCQs) generated by ChatGPT, an advanced artificial intelligence (AI) language model, was evaluated in comparison with those developed by experienced human instructors, with a focus on mathematics teacher education. Two parallel 30-item MCQ tests, one human-designed and one AI-generated, were administered to 30 mathematics teacher trainees. A comprehensive psychometric analysis was conducted using six metrics: item difficulty index (Pi), discrimination index (D), point-biserial correlation, item-test correlation (Rit), Cronbach’s alpha (α) for internal consistency, and score variance. The analysis was carried out with the Analysis of Didactic Items with Excel (AnDIE) tool. Results indicated that the human-authored MCQs exhibited acceptable difficulty (mean Pi = 0.55), moderate discrimination power (mean D = 0.31), and adequate internal consistency (Cronbach’s α = 0.752). In contrast, the AI-generated MCQs were substantially more difficult (mean Pi = 0.22), discriminated weakly (mean D = 0.16), and yielded a negative internal consistency estimate (Cronbach’s α = −0.1), raising concerns about their psychometric quality. While AI-generated assessments offer advantages in scalability and speed, the findings underscore the necessity of expert human review to ensure content validity, construct alignment, and pedagogical appropriateness. These results suggest that AI, in its current form, is not yet equipped to autonomously generate assessment instruments of sufficient quality for high-stakes educational settings. A hybrid test design model is therefore advocated, in which AI is used for initial item drafting, followed by rigorous human refinement. This approach may enhance both efficiency and quality in the development of educational assessments.
The implications extend to educators, assessment designers, and developers of educational AI systems, highlighting the need for collaborative human-AI frameworks to achieve reliable, valid, and pedagogically sound testing instruments.
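For readers less familiar with the item statistics named above, the following is a minimal sketch of how three of them (Pi, D, and Cronbach's α) are conventionally computed from a 0/1 scored response matrix. It is not the AnDIE implementation, whose exact formulas are not described here. The function names, the toy data, and the 27% upper/lower grouping used for D are illustrative assumptions.

```python
from statistics import mean, pvariance

# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
# Toy data for illustration only, not results from the study.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

def item_difficulty(matrix):
    """Difficulty index Pi: proportion of examinees answering each item correctly."""
    n_items = len(matrix[0])
    return [mean(row[j] for row in matrix) for j in range(n_items)]

def discrimination_index(matrix, fraction=0.27):
    """Discrimination index D: proportion correct in the top-scoring group
    minus proportion correct in the bottom-scoring group.
    The 27% group size is a common convention, assumed here."""
    totals = [sum(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: totals[i])
    k = max(1, round(fraction * len(matrix)))
    low, high = order[:k], order[-k:]
    n_items = len(matrix[0])
    return [
        mean(matrix[i][j] for i in high) - mean(matrix[i][j] for i in low)
        for j in range(n_items)
    ]

def cronbach_alpha(matrix):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(matrix[0])
    item_vars = [pvariance([row[j] for row in matrix]) for j in range(k)]
    total_var = pvariance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

print(item_difficulty(responses))
print(discrimination_index(responses, fraction=0.5))
print(cronbach_alpha(responses))
```

Under this formula, α becomes negative whenever the sum of the item variances exceeds the variance of the total scores, which happens when items covary negatively on average. This is the situation a value such as α = −0.1 points to.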