Neither Fair Nor Accurate: Research-Based Reasons Why High-Stakes Tests Should Not Be Used to Evaluate Teachers
A pitched battle raged in my hometown of Seattle this fall. Superintendent Maria Goodloe-Johnson and the Seattle Public Schools district fought with the Seattle Education Association over their most recent teachers’ union contract. At the heart of the dispute: Should teacher evaluations be based in part on student scores on standardized tests?
Seattle is not unique in this struggle, and it is clear that Superintendent Goodloe-Johnson takes her cue from what is happening nationally.
In August, for instance, the Los Angeles Times printed a massive study in which LA student test scores were used to rate individual teacher effectiveness. The study was based on a statistical model referred to as value-added measurement (VAM). As part of the story, the Times published the names of roughly 6,000 teachers and their VAM ratings (see sidebar, p. 37).
In October the New York City Department of Education followed suit, publicizing plans to release the VAM scores for nearly 12,000 public school teachers. U.S. Secretary of Education Arne Duncan lauded both the Times study and the NYC Department of Education plans, a stance consistent with Race to the Top guidelines and President Obama’s support for using test scores to evaluate teachers and determine merit pay.
Current and former leaders of many major urban school districts, including Washington, D.C.’s Michelle Rhee and New Orleans’ Paul Vallas, have sought to use tests to evaluate teachers. In fact, the use of high-stakes standardized tests to evaluate teacher performance à la VAM has become one of the cornerstones of current efforts to reshape public education along the lines of the free market.
On the surface, the logic of VAM and using student scores to evaluate teachers seems like common sense: The more effective a teacher, the better his or her students should do on standardized tests.
However, although research tells us that teacher quality has an effect on test scores, this does not mean that a specific teacher is responsible for how a specific student performs on a standardized test. Nor does it mean we can equate effective teaching (or actual learning) with higher test scores.
Given the current attacks on teachers, teachers’ unions, and public education through the use of educational accountability schemes based wholly or partly on high-stakes standardized test scores and VAM, it is important that educators, students, and parents understand why, based on educational research, such tests should not be used to evaluate teachers.
Although there are many well-documented problems with using VAM to evaluate teachers, I’ve chosen to highlight six critical issues with VAM that are so problematic they alone should be enough to stop the use of high-stakes standardized tests for such evaluations. I hope these will be helpful as talking points for op-ed pieces, blogs, and discussions at school board meetings, PTA meetings, and in the bleachers at basketball games.
Statistical Error Rates
There is a statistical error rate of 35 percent when using one year’s worth of test data to measure a teacher’s effectiveness, and an error rate of 25 percent when using data from three years, researchers Peter Schochet and Hanley Chiang find in their 2010 report “Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains,” released by the U.S. Department of Education’s National Center for Education Evaluation and Regional Assistance.
Bruce Baker, finance expert at Rutgers University, explains that using high-stakes test scores to evaluate teachers in this manner means there is a one-in-four chance that a teacher who is actually “average” could be incorrectly rated as “below average” and face disciplinary measures. Because of these error rates, a teacher’s performance evaluation may pivot on what amounts to a statistical roll of the dice.
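For readers who want to see why adding years of data shrinks this kind of error without eliminating it, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the model Schochet and Chiang used: the function name, the noise level (`noise_sd`), and the cutoff for a "below average" rating (`cutoff`) are all hypothetical, and the output is not meant to reproduce the 35 and 25 percent figures.

```python
from math import sqrt
from statistics import NormalDist


def chance_average_teacher_flagged(years, noise_sd=1.0, cutoff=-0.5):
    """Chance that a truly average teacher is rated below `cutoff`.

    Assumptions (hypothetical, for illustration only):
    - the teacher's true effect is exactly 0 (dead average);
    - each year's measured score is the true effect plus independent
      normal noise with standard deviation `noise_sd`;
    - the rating is the plain average of `years` yearly scores.
    """
    if years < 1:
        raise ValueError("years must be at least 1")
    # Averaging k independent noisy years shrinks the noise by sqrt(k).
    sd_of_average = noise_sd / sqrt(years)
    return NormalDist(mu=0.0, sigma=sd_of_average).cdf(cutoff)


if __name__ == "__main__":
    for years in (1, 3):
        share = chance_average_teacher_flagged(years)
        print(f"{years} year(s) of data: {share:.1%} chance of a wrong 'below average' rating")
```

Under these made-up numbers, the chance of wrongly flagging an average teacher falls from roughly 31 percent with one year of data to roughly 19 percent with three. The error shrinks, but it does not go away.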
Year-to-Year Test Score Instability
As Tim Sass, economics professor at Florida State University, points out in “The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy,” test scores of students taught by the same teacher fluctuate wildly from year to year. In one study comparing two years of test scores across five urban districts, more than two-thirds of the bottom-ranked teachers one year had moved out of the bottom ranks the next year. Of this group, a full third went from the bottom 20 percent one year to the top 40 percent the next. Similarly, only one-third of the teachers who ranked highest one year kept their top ranking the next, and almost a third of the formerly top-ranked teachers landed in the bottom 40 percent in year two.
If test scores were an accurate measurement of teacher effectiveness, “effective” teachers would rate high consistently from year to year because they are good teachers; and one would expect “ineffective” teachers to rate low in terms of test scores just as consistently. Instead, the year-to-year instability that Sass highlights shows that test scores have very little to do with the effectiveness of a single teacher and have more to do with the change of students from year to year (unless, of course, one believes that one-third of the highest ranked teachers in the first year of the study simply decided to teach poorly in the second).
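The same point can be shown with a small simulation, sketched here in Python. It is not a reconstruction of the study Sass describes. It assumes, hypothetically, that each teacher has a fixed "true" effect and that each year's score adds random noise from things outside the teacher's control, such as a new group of students. The names `bottom_quintile` and `share_staying_in_bottom` and the default noise level are illustrative choices, not anything taken from the research.

```python
import random


def bottom_quintile(scores):
    """Return the set of teacher indices in the lowest 20 percent of scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(ranked[: len(scores) // 5])


def share_staying_in_bottom(n_teachers=5000, noise_sd=2.0, seed=0):
    """Share of year-one bottom-quintile teachers still there in year two.

    Assumptions (hypothetical): true teacher effects are normal with
    standard deviation 1 and never change; each year's measured score is
    the true effect plus fresh normal noise with standard deviation
    `noise_sd`.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    year_one = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    year_two = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    first = bottom_quintile(year_one)
    second = bottom_quintile(year_two)
    return len(first & second) / len(first)


if __name__ == "__main__":
    print("no noise:   ", share_staying_in_bottom(noise_sd=0.0))
    print("heavy noise:", share_staying_in_bottom(noise_sd=2.0))
```

With no noise, every bottom-ranked teacher stays at the bottom, which is what one would expect if scores measured teaching alone. With noise twice the size of the real differences between teachers, well under half of them stay there, even though no teacher in the simulation changed at all.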
Day-to-Day Score Instability
Fifty to 80 percent of any improvement or decline in a student’s standardized test scores can be attributed to one-time, randomly occurring factors, according to Thomas Kane of Harvard University and Douglas Staiger of Dartmouth College in their research report “Volatility in Test Scores.”
This means that factors such as whether or not a child ate breakfast on test day, whether or not a child got into an argument with parents or peers on the way to school, which other students happened to be in attendance while taking the test, and the child’s feelings about the test administrator account for at least half of any given student’s standardized test score gains or losses. Some factors, such as a dog barking outside an open window, can affect an entire class.
Kane and Staiger’s findings illustrate that using tests to evaluate teachers ignores the reality that a host of individual daily factors that are completely out of a teacher’s control contribute to how a student performs on any given test. To reward or punish a teacher based on such scores could literally mean rewarding or punishing a teacher based on how well or poorly a student’s morning went.
Nonrandom Student Assignments
The grouping of students—either within schools through formal and informal tracking or across schools through race, socioeconomic class, and linguistic (ELL) segregation—greatly influences VAM test results, as 10 leading researchers in teacher quality and educational assessment highlight in their policy brief “Problems with the Use of Student Test Scores to Evaluate Teachers,” published by the Economic Policy Institute.
These researchers note that “teachers who have chosen to teach in schools serving more affluent students may appear to be more effective simply because they have students with more home and school supports for their prior and current learning, and not because they are better teachers.”
Even when VAM models attempt to take into account a student’s prior achievement or demographic characteristics, the models assume that all students will show test gains at an equal rate. This assumption, however, does not necessarily hold true for groups of students who historically have performed poorly on tests, for English language learners who are asked to become proficient in both a new language and a tested subject area, or for students with disabilities whose test-based rates of progress may not be comparable to those of any other student.
Nonrandom student assignment means that a teacher could be punished, dismissed, or lose tenure purely because the course they teach or the school they teach in has a significant population of traditionally low-scoring students who may show variable or slower test score gains.
Imprecise Measurement
High-stakes, standardized tests are also unable to account for the complexities of learning (and, by extension, teaching). For instance, we know from the linguistic research of Steven Pinker and others that learning often happens in a U-shape—that making mistakes is an integral part of the learning process. When children are tested, we never quite know where on the U-shaped learning curve they might be, nor do we realize that their mistakes could be a vital part of a natural learning process. When tests are used to evaluate teachers, it is possible that highly effective teachers who push students out of their cognitive comfort zones are penalized for provoking the deep learning that requires students to make mistakes on the way to greater understanding.
Standardized tests are also too crude to account for the possibility of cognitive transfer of skills that students learn across different subjects. Using VAM, as the researchers in the above-mentioned Economic Policy Institute policy brief explain, means that “the essay writing a student learns from his history teacher may be credited to his English teacher, even if the English teacher assigns no writing; the mathematics a student learns in her physics class may be credited to her math teacher.” In other words, we can never be certain which class and which teacher contributed to a given student’s test performance in any given subject.
Out-of-School Factors
Out-of-school factors such as inadequate access to health care, food insecurity, and poverty-related stress, among others, negatively impact the in-school achievement of students so profoundly that they severely limit what schools and teachers can do on their own, explains David Berliner, Regents Professor of Education at Arizona State University, in his report “Poverty and Potential.”
Although it is clear from the research of Stanford University’s Linda Darling-Hammond and others that teachers play an absolutely pivotal role in student success, when we use high-stakes tests to evaluate teachers, we incorrectly assume that teachers have the ability to overcome any obstacle in students’ lives to improve learning. Although good teachers are critically necessary, they are not always sufficient.
To assume otherwise is to think that teachers (and schools) can somehow make up for the lack of housing, food, safety, and living wage employment, among other factors, all on their own. The social safety net is the responsibility of a much broader socioeconomic network—not the sole responsibility of the teacher.
Politics, Not Reality
The reality of standardized tests is that they are too imprecise and inaccurate to measure the effectiveness of individual teachers. The sad thing is that testing experts, researchers, and psychometricians have known this for quite some time. In 1999, for instance, the expert panel that made up the Committee on Appropriate Test Use of the National Research Council cautioned that “an educational decision that will have a major impact on a test-taker should not be made solely or automatically on the basis of a single test score.”
Yet two short years later, a bipartisan Congress and the presidential administration of George W. Bush passed No Child Left Behind and its test-and-punish approach to school reform into law.
Although the Bush administration seemed to ignore educational research as a matter of policy (as illustrated through NCLB’s Reading First program and the advocacy of using phonics-only teaching methods that had little basis in research), many hoped for something different with the election of President Obama.
Unfortunately, the Obama administration has sent a clear message: When it comes to high-stakes standardized testing, the research doesn’t matter.
It hasn’t mattered that, according to the above-cited U.S. Department of Education report, “More than 90 percent of the variation in student gain scores is due to the variation in student-level factors that are not under control of the teacher.”
It hasn’t mattered that the National Research Council of the National Academy of Sciences has stated that “VAM estimates of teacher effectiveness should not be used to make operational decisions because such estimates are far too unstable to be considered fair or reliable.”
It hasn’t mattered that even the researchers who completed the Los Angeles Times study acknowledged that VAM data were too unreliable to use as the sole measure of teacher performance (a point that the Times neglected to clearly articulate in its article).
Sadly, as with Bush, so now with Obama: politics and ideology trump educational research.
One would think that all of the policy makers, politicians, pundits, superintendents, talk show hosts, documentary movie makers, business leaders, and philanthropic foundations so in love with the idea of using test score data to evaluate teachers would be equally passionate about accuracy. People’s lives are at stake, and yet the “data” underlying important decisions about teacher performance couldn’t be shakier.
The shakiness of test-based VAM data illustrates that the current fight over teacher “accountability” isn’t really about effectiveness. The more substantial public conversation we should be having about rising poverty, the racial resegregation of our schools, increasing unemployment, lack of health care, and the steady defunding of the public sector—all factors that have an overwhelming impact on students’ educational achievement—has been buried. Instead, teachers and their unions have become convenient scapegoats for our social, educational, and economic woes.
Yes, teachers’ performance needs to be evaluated, but in a manner that is fair and accurate. Using high-stakes standardized tests and VAM to make such evaluations is neither.
© 2011 Rethinking Schools