OpenAI’s ChatGPT chat software, if unleashed into the world, would score between a B and a B- on the Wharton Business School’s operations management exam, and would achieve a score at or near the threshold required to pass the US Medical Licensing Exam (USMLE).
While this says more about the static, document-centric nature of testing material than about the intellectual prowess of the software, it is nonetheless a matter of concern and interest for educators, and for almost everyone else living in the age of automation.
Academics fear that because assistive technologies powered by machine learning have become so powerful, systems like OpenAI’s ChatGPT and GitHub’s Copilot (based on an OpenAI model called Codex) will require teachers to rethink how they teach and grade exams.
In education, consulting AI is becoming commonplace: The Stanford Daily recently reported that “a large number of students have already used ChatGPT for their final exams.” An estimated 17 percent of students, based on an anonymous survey of 4,497 respondents, said they used ChatGPT to help with assignments and exams during the fall quarter, with 5 percent reporting they had submitted material straight from ChatGPT with little or no editing, which is believed to be a violation of the university’s honor code.
Separately, University of Pennsylvania Wharton School professor Christian Terwiesch and a group of medical researchers largely affiliated with Ansible Health decided to put ChatGPT, an arguably amoral automated advisor and fact-challenged expert system, to the test.
Both Terwiesch and the experts at Ansible Health made it clear that ChatGPT has limitations and gets things wrong. Overall, they handed out mediocre grades, but they also made it clear they expect AI assistance systems to find a place in teaching and in other areas.
After all, the model has been trained on countless human-written texts, and so its ability to guess a satisfactory answer to a question from all the knowledge and facts it has inhaled is not unexpected.
“First, it does an excellent job on basic operations management and process analysis issues, including those based on case studies,” Terwiesch said in his article. “Not only are the answers correct, but the explanations are excellent.”
However, he observed ChatGPT making simple math errors and fumbling advanced process analysis questions. The model does respond to human prompting for improvement: it can successfully correct itself when given hints from a human expert.
Human guidance has also served as a source of malicious input, as demonstrated by Microsoft’s Tay chatbot and subsequent research.
The medical research group that wrote “Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models” listed ChatGPT itself as a co-author.
“ChatGPT contributed to the writing of several sections of this manuscript,” the human authors explain in their article.
The authors’ other organizational affiliations are: Massachusetts General Hospital and Harvard Medical School in Boston, Massachusetts; the Warren Alpert Medical School of Brown University in Providence, Rhode Island; and the Department of Medical Education at UWorld, LLC, a healthcare e-learning company based in Dallas, Texas.
The authors – Tiffany Kung, Morgan Cheatham, ChatGPT, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo and Victor Tseng – came to a similar conclusion as Wharton’s Terwiesch. In particular, they found that ChatGPT scores passably on the USMLE exam – above the variable pass threshold of about 60 percent – when given the benefit of the doubt on indeterminate answers. And they expect large language models (LLMs) to play an increasing role in medical education and clinical decision-making.
“ChatGPT delivers moderate accuracy, approaching pass performance on USMLE,” the authors state in their article. “Examination tasks were first coded as open-ended questions with variable entry prompts. This input format simulates a free natural user query pattern. With indeterminate responses censored/included, ChatGPT accuracy for USMLE Steps 1, 2CK, and 3 was 68.0 percent/42.9 percent, 58.3 percent/51.4 percent, and 62.4 percent/55.7 percent, respectively.”
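The paired figures per step reflect two scoring conventions: dropping indeterminate responses from the denominator (censored) versus keeping them and counting them as wrong (included). A minimal sketch of the difference, using hypothetical response labels chosen for illustration (not the paper’s actual data):

```python
# Two accuracy conventions, as described in the USMLE paper:
# "censored"  - indeterminate responses are removed before scoring
# "included"  - indeterminate responses stay in and count as incorrect
# The sample data below is hypothetical, for illustration only.

def accuracy(responses, censor_indeterminate):
    """responses: list of 'correct', 'incorrect', or 'indeterminate'."""
    if censor_indeterminate:
        graded = [r for r in responses if r != "indeterminate"]
    else:
        graded = responses
    if not graded:
        return 0.0
    return sum(r == "correct" for r in graded) / len(graded)

# 40 hypothetical answers: 17 correct, 8 incorrect, 15 indeterminate
sample = ["correct"] * 17 + ["incorrect"] * 8 + ["indeterminate"] * 15

print(accuracy(sample, censor_indeterminate=True))   # 17/25 = 0.68
print(accuracy(sample, censor_indeterminate=False))  # 17/40 = 0.425
```

The gap between the two numbers shows why excusing indeterminate answers flatters the result: the censored score only grades the questions the model was willing to answer.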
Describing ChatGPT’s performance as “approaching” a pass is a generous way of phrasing it, especially when indeterminate responses are excused. Walking into a doctor’s office and seeing a diploma earned with a D grade might cause patients rather more concern.
However, the researchers claim that the things ChatGPT got right closely matched the accepted answers, and that the AI model has improved remarkably, having only achieved a success rate of about 36.7 percent months earlier.
Interestingly, they found that ChatGPT performs better than PubMedGPT, an LLM trained solely on biomedical literature, which achieved an accuracy of only about 50.8 percent (based on unpublished data).
“We speculate that domain-specific training may have created greater ambivalence in the PubMedGPT model as it absorbs real-world texts from ongoing academic discourse, which in its language tends to be inconclusive, contradictory, or very conservative or non-binding,” the authors note.
In essence, the less scholarly, more general-audience material that went into ChatGPT’s training, such as pamphlets explaining diseases to patients, appears to have made ChatGPT more decisive.
“As AI becomes more powerful, it will soon become ubiquitous and transform clinical medicine across all healthcare settings,” the authors conclude, adding that clinicians affiliated with Ansible Health are using ChatGPT in their workflows and reported a 33 percent reduction in the time required to complete documentation and indirect patient care tasks.
This may explain Microsoft’s decision to invest billions in OpenAI for its future software.
The usefulness of ChatGPT in education — despite the fact that it’s often wrong — was underscored in a blog post published Sunday by Thomas Rid, professor of strategic studies and founding director of the Alperovitch Institute for Cybersecurity Studies.
Rid describes a recent five-day malware analysis and reverse engineering course taught by Juan Andres Guerrero-Saade.
“Five days later, I had no doubts: this thing is going to transform higher education,” said Rid. “I was one of the students. And I was blown away by what real-time machine learning could do for us. And I say that as someone who has been a die-hard skeptic of the artificial intelligence hype for many years. Note that I didn’t say ‘likely’ transformation. It will transform higher education.”
Guerrero-Saade, in a Twitter thread, acknowledged that ChatGPT got some things wrong, but insisted the tool helped students reach better answers. He suggested it works like a personal teaching assistant for each student.
“The fear mongering surrounding AI (or overblown expectations of perfect outcomes) clouds recognition of this LLM’s prodigious utility: as an assistant, able to quickly pull together information (true or false) of extreme relevance so that a more sophisticated intelligence (the user) can work with it,” he wrote.
Rid argues that while concerns about AI as a mechanism for plagiarism and fraud in education need to be addressed, the more important discussion has to do with how AI tools can improve educational outcomes. ®