More states are using robo-graders to grade essay tests, reports Tovia Smith on NPR.
Pearson’s “automated scoring program graded some 34 million student essays on state and national high-stakes tests last year,” she writes, and there are competitors.
Computers analyze essays graded by humans to “learn” how to grade, says Peter Foltz, a Pearson vice president and a University of Colorado professor.
“We have artificial intelligence techniques which can judge anywhere from 50 to 100 features,” Foltz says. That includes not only basics like spelling and grammar, but also whether a student is on topic, the coherence or the flow of an argument, and the complexity of word choice and sentence structure.
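For readers curious what "judging features" can mean, here is a minimal sketch of the general idea (an illustration only, not Pearson's proprietary system): pull a few surface features from each essay, then fit a model to scores that human readers have already assigned, so the program can mimic those judgments on new essays.

```python
# Illustration only, not Pearson's system: extract a few surface features
# from each essay, then fit a regression model to human-assigned scores.
import re
from sklearn.linear_model import LinearRegression

TRANSITIONS = ["however", "therefore", "moreover", "in conclusion", "furthermore"]

def features(essay: str) -> list[float]:
    words = re.findall(r"[a-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if not words or not sentences:
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(len(words)),                          # essay length
        sum(len(w) for w in words) / len(words),    # word-choice complexity proxy
        len(words) / len(sentences),                # sentence-structure proxy
        float(sum(essay.lower().count(t) for t in TRANSITIONS)),  # flow/discourse markers
    ]

# Hypothetical training data: essays already scored by human readers.
train_essays = [
    "Short and plain. It says little.",
    "However, a considerably more elaborate argument unfolds here; moreover, it concludes.",
]
train_scores = [2.0, 5.0]

model = LinearRegression()
model.fit([features(e) for e in train_essays], train_scores)

# The trained model now "grades" a new essay it has never seen.
print(model.predict([features("In conclusion, the argument is therefore compelling.")]))
```

A real system uses far more features and far more training essays, but the logic is the same: the computer learns to predict what a human grader would have given.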
Robo-grading has proven its accuracy in Utah, where only 20 percent of essays are reviewed by humans. The state saves money and “teachers get test results back in minutes rather than months,” writes Smith.
In Ohio, 25 percent of robo-graded exams are reviewed by humans. There have been glitches, reports Shannon Gilchrist in the Columbus Dispatch.
“The first time that artificial intelligence graded Ohio student essays was this past fall, for the English language arts test for third-graders,” she writes. The number of students who earned a zero on the writing portion soared.
Machelle Kline, Columbus schools’ chief accountability officer, . . . learned that if students copy the wording of the question into their answers, the computer interprets that as plagiarism, earning a zero. That copying is something they’re often taught to do, she said.
Students often copy much of the reading passage, leaving little original writing for the computer to grade, said Jon Cohen, vice president of assessment for the American Institutes for Research. Ohio and five other states use AIR’s computer essay scoring.
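One plausible way such an overlap check could work (a guess on my part, not AIR’s actual method) is to compare word sequences in the answer against the question or passage and measure how much of the response is copied wording:

```python
# A guess at how an overlap check might work, not AIR's actual method:
# compare word trigrams in the answer against the question/passage and
# measure what fraction of the answer is copied wording.
import re

def trigrams(text: str) -> set:
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def copied_fraction(answer: str, prompt: str) -> float:
    answer_tris = trigrams(answer)
    if not answer_tris:
        return 0.0
    return len(answer_tris & trigrams(prompt)) / len(answer_tris)

question = "Explain how the main character changes over the course of the story."
answer = "The main character changes over the course of the story and learns to be brave."

# A high fraction means most of the answer repeats the prompt's own wording,
# leaving little original writing for the program to score.
print(round(copied_fraction(answer, question), 2))
```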
Massachusetts is considering using robo-grading for statewide tests, reports Smith.
“What is the computer program going to reward?” asked Kelly Henderson, a high school English teacher. “Is it going to reward some vapid drivel that happens to be structurally sound?”
Turns out that’s an easy question to answer, thanks to Les Perelman, an MIT research affiliate and longtime critic of automated scoring. He’s designed what you might think of as robo-graders’ kryptonite to expose what he sees as the weakness and absurdity of automated scoring. Called the Babel (“Basic Automatic B.S. Essay Language”) Generator, it works like a computerized Mad Libs, creating essays that make zero sense but earn top scores from robo-graders.
With three words from the prompt for a practice GRE question, his Babel Generator earned a perfect score for this:
“History by mimic has not, and presumably never will be precipitously but blithely ensconced. Society will always encompass imaginativeness; many of scrutinizations but a few for an amanuensis. The perjured imaginativeness lies in the area of theory of knowledge but also the field of literature. Instead of enthralling the analysis, grounds constitutes both a disparaging quip and a diligent explanation.”
The scoring algorithm rewards big words, complex sentences and phrases such as “in conclusion,” says Perelman.
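Here’s a crude sketch in that spirit (a toy of my own, not Perelman’s actual Babel Generator): grammatical templates stuffed with prompt words, obscure vocabulary and stock transitions read as “strong” writing to a feature counter, even though they say nothing.

```python
# A toy in the spirit of Perelman's critique, not his actual Babel Generator:
# templates stuffed with obscure vocabulary, prompt words and stock transitions
# look like fluent, complex prose to a feature-counting scorer.
import random

PROMPT_WORDS = ["history", "knowledge", "theory"]   # lifted from the essay prompt
NOUNS = ["imaginativeness", "amanuensis", "scrutinization", "quandary"]
ADJS = ["precipitous", "disparaging", "perjured", "diligent"]
VERBS = ["encompass", "enthrall", "promulgate", "obfuscate"]
VERBS_PAST = ["ensconced", "encompassed", "promulgated", "obfuscated"]

TEMPLATES = [
    "Presumably, {topic} by {noun} has not, and never will be, blithely {verb_past}.",
    "Society will always {verb} {noun}; the {adj} {topic} lies in the field of literature.",
    "Instead of enthralling the analysis, {topic} constitutes both a {adj} quip and a {adj} explanation.",
    "In conclusion, the {adj} {noun} will {verb} the theory of knowledge.",
]

def babble(sentences: int = 4) -> str:
    return " ".join(
        random.choice(TEMPLATES).format(
            topic=random.choice(PROMPT_WORDS),
            noun=random.choice(NOUNS),
            adj=random.choice(ADJS),
            verb=random.choice(VERBS),
            verb_past=random.choice(VERBS_PAST),
        )
        for _ in range(sentences)
    )

print(babble())
```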
Robo-graders won’t pick up invented facts, admits Nitin Madnani, senior research scientist at Educational Testing Service (ETS). Human readers don’t have time to fact-check either, he says.
My high school’s English curriculum required us to write expository essays — and nothing else — for four straight years. My technique for the “3-3-3” paragraph was to write about imaginary people. That way I could make up all the supporting details.