Criterion-referenced tests are designed to label each student as a “success” or a “failure” based on criteria created by “education experts.”
A student can outperform 60% of the other students and still be labeled a failure by a criterion-referenced test. Or answer only 20% of the questions correctly and be labeled Excellent. The purpose of a criterion-referenced test is not diagnostic but judgmental.
Norm-referenced tests don't pass or fail students; instead, they provide meaningful diagnostic assessment.
Norm-referenced test results are straightforward, transparent, numerical scores. Teachers and parents can interpret the scores based on national and historical data.
Criterion-referenced questions are evaluated with “rubrics.”
For example, a rubric might say: Award a score of 3 to a “mostly appropriate” essay and a score of 2 to a “somewhat appropriate” essay.
This means that whether the student passes (75%) or fails (50%) the test depends entirely on the grader’s interpretation of “mostly” vs. “somewhat.”
The same experts who use rubrics to label children as failures never admit the inherent flaw of their rubrics.
They fumble when confronted with the simple question, “How can anyone accurately and consistently distinguish a student answer that is ‘mostly appropriate’ from one that is ‘somewhat appropriate’?”
Asked this question directly, one older, white male, highly influential education expert responded, “It’s like pornography. You know it when you see it.”
He candidly admitted that his rubrics are subjective and success or failure is in the eye of the beholder.
Maybe his criterion-referenced tests are the reason some groups of students don’t “test well.” Many smart students might fail the tests simply because they don't express themselves in a way that appeals to sophisticated older white gentlemen.
Norm-referenced tests only use well-vetted multiple-choice questions that are not subject to the interpretation of a grader.
Notice the subtle, but profound difference between criterion-referenced tests and norm-referenced tests.
Criterion-referenced tests use arbitrary and subjective grading to give "experts" the final say in defining success and failure. Their primary purpose is to label children and schools.
Norm-referenced tests serve as objective diagnostic tools designed to help parents and teachers improve classroom instruction for students.
Norm-referenced tests are not trendy. They don't cost much, they don't require years to pilot, and they don't have to be computerized. The education elite pooh-pooh them as unsophisticated relics.
Yet look at what these sophisticated education elite spent billions of dollars replacing them with!
State Departments of Education hired their sole-source contractors to produce criterion-referenced tests that are like Soviet-era Communist cars: low-quality, high-cost products produced by a bureaucracy and forced on a captive consumer.
No one who has a choice buys a Lada ("What It's Like to Drive the Greatest Soviet Car of All Time").
Private schools, which are accountable to parents rather than to the whims of the educational bureaucracy, trust norm-referenced multiple-choice tests like the TerraNova 3 for annual diagnostic evaluations of their students.
Our public school students and teachers deserve the same respect.
State tests must be limited to multiple choice questions if schools are to have any hope of getting test results back in a reasonable amount of time.
I would not want to be a politician whose continued incompetence allowed the State Department of Education to make us wait, yet again, until Fall for the results of a test taken in the Spring.
The SAT and ACT maintain their reputation by exclusively testing with norm-referenced multiple-choice questions. A few years ago the SAT experimented with essays, but the results were so questionable that it quickly reversed course and abandoned the essay ("College Board Will No Longer Offer SAT Subject Tests or SAT with Essay," 1-19-2021).
If there are short answer or essay questions on a state examination, who is going to grade them?
Short answer and essay test questions would inevitably be scored either by part-time graders during the summer or by machines.
Pearson had no qualms about using Craigslist ads to find minimum wage temp workers to grade PARCC essays. The graders could earn small bonuses if they hit daily volume targets, so they quickly clicked scores of 0, 1, 2, or 3 to pass or fail children for comprehension and written expression based on vague, subjective rubrics (read the notes about the inherent, obvious flaw with rubrics).
However, even with these exceptionally low quality-control standards, the tests took months to grade.
Machine-graded essays are an insult to our intelligence
Human graders with any recent teaching experience are not available to score state tests during the school year. That is one reason why students often don't get their spring test scores back until well into the next school year.
In an attempt to address the ridiculously long time it takes to grade exams, the testing corporations have been selling "computerized grading" to our State Departments of Education.
The experts and elite guarantee that computers are capable of doing this job.
Really? Don't these experts understand the significance of the CAPTCHA test?
Websites use the CAPTCHA pop-up to prove you are a human and not a machine by requiring you to perform a simple task like picking out "the blocks that contain pictures of cars."
Sites are kept secure because machines can't do this.
Yet the experts have convinced many Departments of Education that their machines can reliably apply a vague subjective rubric and identify whether a Grade 3 student's prose provides: "effective development" (3/3 pts. = 100%), "some development" (2/3 pts. = 66%), "minimal development" (1/3 pts. = 33%) or is "underdeveloped" (0/3 pts. = 0%).
These same machines are stopped in their bot-tracks because they can't tell a car from a flagpole.
It is unconscionable that our State Departments of Education are spending taxpayer money to use these machines. And it is depraved that corporations would use these machines to label children as successes or failures, often setting the course of the rest of their lives.
Machine grading programs can be fooled.
Here is an example of how easily humans can game an artificial intelligence system like a computer grading machine.
The "ScoreItNow" AI program for the GRE, the test used to determine admission to graduate school, awarded a perfect score to a nonsense essay simply because it was filled with long, pretentious strings of words.
The essay contained sentences like:
"Even though the brain counteracts a gamma ray to veracity, the same pendulum may catalyze two different neutrinos with the promptly erroneous contentment."
The computer algorithm judged that the essay "presents a cogent, well-articulated examination of the argument and conveys meaning skillfully."
Click the pdf to read the full essay. It may make you laugh, but it is unlikely to impress you as much as it impressed the grading machine.
The newest snake oil is the "Computer Adaptive Test."
Computer Adaptive tests change as a student takes them, depending on how the student answers each question, much as a role-playing game like Dungeons and Dragons (TM) gives every player a different path every game.
A student who answers questions correctly gets presented more difficult questions. Easier questions are given to a student who has trouble.
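The up-and-down mechanic described above can be sketched in a few lines of code. This is a hypothetical illustration of the general adaptive rule, not any vendor's actual algorithm; the item bank, question names, and five-level difficulty scale are all invented for the example.

```python
import random

# Hypothetical item bank: questions tagged with a difficulty from 1 (easy) to 5 (hard).
ITEM_BANK = {d: [f"Q{d}.{i}" for i in range(1, 6)] for d in range(1, 6)}

def adaptive_test(answer_correctly, num_questions=5, start_difficulty=3):
    """Present questions one at a time, stepping difficulty up after a
    correct answer and down after a wrong one.
    `answer_correctly(question)` returns True or False."""
    difficulty = start_difficulty
    path = []
    for _ in range(num_questions):
        question = random.choice(ITEM_BANK[difficulty])
        correct = answer_correctly(question)
        path.append((question, difficulty, correct))
        # The core adaptive rule: harder after a right answer, easier after a wrong one.
        if correct:
            difficulty = min(5, difficulty + 1)
        else:
            difficulty = max(1, difficulty - 1)
    return path

# Two students see entirely different question sequences:
strong = adaptive_test(lambda q: True)   # always right: difficulty climbs 3, 4, 5, 5, 5
weak = adaptive_test(lambda q: False)    # always wrong: difficulty falls 3, 2, 1, 1, 1
```

Note how quickly the two students' tests diverge: after just two questions they are no longer answering comparable items, which is exactly the evaluation problem discussed below.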
Paper and pencil exams cannot test this way, so if technology companies can get states to adopt computer adaptive tests then they have, by default, required computerized testing.
Computer adaptive tests ensure that state budgets include money to fund the technology purchases necessary to fulfill testing mandates no matter how high the price tag. Computerized and computer adaptive testing are a horn of plenty for the technology companies and a buffet for the PhDs.
The experts and elite expect us to accept computer adaptive testing because it is the "newest trend." That is enough of a rationale for them but the practitioners and stakeholders see obvious inherent fatal flaws with computer adaptive state tests.
Only the experts and the elite could be ivory-tower dumb enough to believe that such a test could be used to fairly assess students on a state assessment.
Imagine that a state's 100,000 public school seniors each played a different game of Dungeons and Dragons. Then imagine the Legislature had to approve a State Department of Education cut-off score that determines which games pass and which games fail the graduation standard. This is the impossible position any state is put into if its Governor and Legislature do not stop the experts' demand for computer adaptive state tests.
Evaluation of performance is impossible when everyone takes a different test.
Computer adaptive tests are not only impossible to evaluate for any large-scale assessment program, but real psychometric experts know that a good computer-adaptive test is also very, very difficult to construct:
“The development of an adaptive test is no small feat, and requires five steps integrating the expertise of test content developers, software engineers, and psychometricians. The development of a quality adaptive test is not possible without a Ph.D. psychometrician experienced in both item response theory (IRT) calibration and computerized adaptive test (CAT) simulation research.
Step 1: Feasibility, applicability, and planning studies. First, extensive Monte Carlo simulation research must occur, and the results formulated as business cases, to evaluate whether adaptive testing is feasible, applicable, or even possible.
Step 2: Develop item bank. An item bank must be developed to meet the specifications recommended by Step 1.
Step 3: Pretest and calibrate item bank. Items must be pilot tested on 200-1000 examinees (depending on the IRT model) and analyzed by Ph.D. psychometricians.
Step 4: Determine specifications for final CAT. Data from Step 3 is analyzed to evaluate CAT specifications and determine the most efficient algorithms using CAT simulation software such as CATSim.
Step 5: Publish live CAT. The adaptive test is published in a testing engine capable of fully adaptive tests based on IRT."
Again, it is very, very difficult to construct a good computer adaptive test.
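To see why the IRT calibration in Step 3 matters, here is a toy sketch of the Rasch model (the simplest one-parameter IRT model) and the item-selection idea behind a CAT. This is an illustrative simplification, not the calibration or simulation work a real psychometrician would do; the difficulty values are invented.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch (one-parameter IRT) model: the probability that a student at
    `ability` answers an item of `difficulty` correctly. Both values live
    on the same logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def most_informative_item(ability_estimate, item_difficulties):
    """Under the Rasch model an item is most informative when its difficulty
    matches the examinee's ability (success probability near 0.5), so a CAT
    picks the item whose difficulty is closest to the current estimate."""
    return min(item_difficulties, key=lambda d: abs(d - ability_estimate))

# A student at ability 0.0 has a 50% chance on a matched item (difficulty 0.0)
# and about a 73% chance on an easier item (difficulty -1.0).
p_matched = rasch_probability(0.0, 0.0)   # 0.5
p_easy = rasch_probability(0.0, -1.0)     # ~0.731
next_item = most_informative_item(0.4, [-2.0, -1.0, 0.0, 1.0, 2.0])  # 0.0
```

Even this toy version makes the point: the whole scheme depends on every item in the bank having an accurately calibrated difficulty, which is precisely the expensive, Ph.D.-level work the five steps describe.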
The experts hired by the State Departments of Education have spent decades producing such poorly-designed regular tests that they've needed to be replaced every few years.
Yet, the lobbyists are pushing Governors and Legislatures to trust these same "experts" to create computer adaptive tests.
Remember the government officials in the Emperor's New Clothes fairy tale? None of them were willing to question the dishonest tailors and risk looking unsophisticated and unsuited for their jobs. That is the foolish and vain way our Governors and Legislators have been acting for the past two decades.
As a result we are stuck with an embarrassing, costly mess while the dishonest tailors continue to be awarded hundreds of millions of dollars to create pretentious edu-babble like "task generation models," "vocabulary items exclusive to extended literature text," and "informational complexity analysis worksheets."
Some school districts have already tried computer adaptive tests and they have been expensive disasters. But these tests are a cornucopia for greedy corporations, educational elite and the bureaucracy, so expect a serious fight to block them.
The elite are barreling computer adaptive tests toward our schools.
"A new, Austin, Texas-based nonprofit, called New Meridian Corporation, now owns the PARCC questions, and works with states on designing state tests and other work."
"New Meridian has a contract with ISBE, for up to $19.6 million, to work on test development among other tasks, state records show."
"Arthur VanderVeen, the head of New Meridian, said the nonprofit 'is just one player' in ISBE’s vision for a comprehensive assessment system."
"The local exams are customized for individual kids, with the test adjusting the difficulty of questions to fit each student. If a student answers correctly, he or she gets harder questions. If the student gives the wrong answer, he or she gets an easier question."
"G. Gage Kingsbury is a psychometric consultant who has been involved in adaptive testing for decades. He consults for the Smarter Balanced consortium as well as providers of adaptive testing."
"Kingsbury said local adaptive testing is different than a large-scale assessment so some changes would be needed to create a statewide, adaptive exam for kids across Illinois. But it can be done, he said."
"The test can be gamed. Students can (and have) figured out that wrong answers will lead to easier questions and visa versa. This can lead to inaccurate and arguably useless results."