Has ChatGPT made the US education report card irrelevant?



ChatGPT and GPT-4 scored above most students in grades 4, 8 and 12 on a standardized test of science questions.

Skynesher/Getty Images

The Nation’s Report Card, also known as the National Assessment of Educational Progress (NAEP), is a standardized examination of student achievement in the United States. It has been administered by the US Department of Education since 1969. The test is widely regarded as the benchmark for where students stand in their ability to read, write, do math, understand scientific experiments, and many other skill areas.

Last year, the test delivered a grim message for teachers, administrators, and parents: teens’ math scores showed the biggest decline on record, continuing a general long-term slide in math and reading scores since the assessment began.

Also: How tech professionals can survive and thrive in the workplace in the age of AI

The decline coincides with the rise of generative artificial intelligence (AI) such as OpenAI’s ChatGPT, and naturally, many people are asking whether there’s a connection.

“ChatGPT and GPT-4 consistently outperformed most students answering each individual item on the NAEP science assessment,” write Xiaoming Zhai of the University of Georgia and colleagues from the university’s AI4STEM Education Center and the University of Alabama’s College of Education, in a paper published this week on the arXiv pre-print server, “Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?”

Also: AI in 2023: A year of breakthroughs that leave nothing human unchanged

The report is “the first study to focus on comparing state-of-the-art GAI and K-12 students’ problem solving in science,” state Zhai and team.

Several studies in the past year have shown that ChatGPT can match human performance on reasoning problems in practice, producing the outcomes expected from human samples. This, the authors write, underscores “ChatGPT’s ability to reflect the average subject’s success rate, thereby showing its proficiency in cognitive tasks.”

The authors created a NAEP test for ChatGPT and GPT-4 by selecting 33 science problem-solving questions of several types: multiple-choice questions; four questions designated as “selected response,” in which the examinee picks an appropriate answer from a list after reading a passage; three questions that present a scenario followed by a sequence of linked questions; and 11 “constructed response” questions and three “extended constructed response” questions, where the examinee must write an answer instead of choosing from a list.

An example science question shows an illustration of a rubber band stretched between two fingers, asking the student to explain why the band makes a sound when plucked and what would make the sound reach a higher pitch. The question requires the student to write about how the rubber band vibrates the air, and how increasing the tension raises the pitch of the vibration.


Example of a constructed-response question that tests reasoning in science.

University of Georgia

The questions were drawn from the grade 4, 8, and 12 assessments. Output from ChatGPT and GPT-4 was compared to the average anonymized responses of human examinees, as provided to the authors by the Department of Education.

ChatGPT and GPT-4 answered questions with above-average accuracy, and in fact, the two programs outscored many human students on numerous questions. ChatGPT scored better than 83%, 70%, and 81% of students on grade 4, 8, and 12 questions, respectively, and GPT-4 was similar, beating 74%, 71%, and 81%.

The authors have a theory for what’s going on, and it aptly suggests the kind of grind that standardized tests create. Human students end up something like John Henry of folklore, trying to compete against the steam-powered rock drill.

The authors draw on a framework from psychology called “cognitive load,” which measures how intensely a task challenges the human brain’s working memory, the place where resources are held for short periods of time. Like computer DRAM, short-term memory has limited capacity, and items get flushed from short-term memory as new information must be accommodated.

Also: I fact-checked ChatGPT with Bard, Claude, and Copilot – and it got weird

“Cognitive load in science education deals with the mental effort required of students to process and understand scientific knowledge and concepts,” the authors write. Specifically, working memory can be taxed by the various facets of a test question, which “all compete for these limited working memory resources,” such as trying to hold all the variables of a question in mind at the same time.

Machines have a far greater capacity to maintain variables in DRAM, and ChatGPT and GPT-4 can store much more input, both in their neural weights and in the explicit context typed into the prompt, the authors emphasize.
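The capacity-limited nature of working memory described above can be sketched as a toy model (a hypothetical illustration, not code from the paper): a fixed-size store that flushes its oldest item whenever new information must be held, much like the eviction the authors describe.

```python
from collections import OrderedDict

class WorkingMemory:
    """Toy model of capacity-limited short-term memory:
    when full, the oldest item is flushed to make room."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def hold(self, name, value):
        if name in self.items:
            self.items.move_to_end(name)  # refreshing an item keeps it alive
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the oldest item
        self.items[name] = value

    def recall(self, name):
        return self.items.get(name)  # None if the item was flushed

# A student juggling the variables of a physics question,
# but able to hold only four at a time:
wm = WorkingMemory(capacity=4)
for var in ["tension", "length", "mass", "pitch", "frequency"]:
    wm.hold(var, var.upper())

print(wm.recall("tension"))    # flushed when "frequency" arrived
print(wm.recall("frequency"))  # still held
```

A language model, by contrast, keeps its entire prompt context available at once, which is the asymmetry the authors point to.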

The issue comes to a head when the authors compare student ability against the complexity of each question. Average students falter as the questions grow more cognitively demanding, but ChatGPT and GPT-4 do not.

“For each of the three grade levels, higher average student ability scores on NAEP science assessments are required as cognitive demands increase; however, the performance of both ChatGPT and GPT-4 is not significantly affected under the same condition, except for the lowest-demand grade 4 items.”

Also: How to use Bing Image Creator (and why it’s better than ever)

In other words: “Their lack of sensitivity to cognitive demands demonstrates the potential of the GAI to overcome the working memory deficits that people experience when using higher-order thinking to solve problems.”
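One way to picture that insensitivity (using invented numbers, purely for illustration, not data from the study) is to treat "sensitivity to cognitive demand" as the slope of score versus item demand level: students show a steep negative slope, while the models stay roughly flat.

```python
# Hypothetical illustration: all scores below are invented, not from the paper.
# "Sensitivity to cognitive demand" modeled as the least-squares slope
# of accuracy against an item's demand level.

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

demand  = [1, 2, 3, 4, 5]                   # low -> high cognitive demand
student = [0.80, 0.68, 0.55, 0.44, 0.30]    # accuracy falls as demand rises
model   = [0.82, 0.80, 0.81, 0.79, 0.80]    # accuracy stays roughly flat

print(f"student slope: {slope(demand, student):+.3f}")  # clearly negative
print(f"model slope:   {slope(demand, model):+.3f}")    # near zero
```

A near-zero slope for the model is the pattern the authors report: the programs' accuracy does not degrade as the questions demand more working memory.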

The authors argue that the ability of generative AI to exceed the limits of human working memory “holds significant implications for the evolution of assessment practices within the educational paradigm” and creates “an imperative for educators to revise traditional assessment practices.”

Generative AI is “ubiquitous” in students’ lives, they note, and so human students will continue to use the tools, and will continue to be outscored by those tools on standardized tests such as NAEP.

“Given GAI’s noted insensitivity to cognitive load and its potential role as a tool in students’ future professional endeavors, reframing educational assessment becomes important,” write Zhai and team.


Average performance of human students was below that of GPT-4 and ChatGPT on most grade 12 questions.

University of Georgia

“The focus of these assessments should move away from solely measuring cognitive intensity to a greater emphasis on creativity and application of knowledge in novel contexts,” they suggest.

“This reflects the growing importance of innovative thinking and problem-solving skills in a landscape increasingly dominated by advanced GAI technologies.”

Also: These jobs are most likely to be taken over by AI

Teachers, they note, are “currently unprepared” for what promises to be a “significant change” in pedagogy. That transition means it falls to educational institutions to focus on teachers’ professional development.

An interesting footnote to the study concerns the limitations of the two programs. In some cases, one program or the other asked for additional information about a science question. When one program asked but the other did not, “the model that did not request additional information often gave unsatisfactory answers.” This suggests, the authors conclude, that “these models rely heavily on the information provided to produce correct responses.”

The machines rely on either what is typed into the prompt or what is stored in their learned parameters. That gap opens a path, perhaps, for humans to excel where neither source holds the insight necessary for problem solving.
