As early childhood assessments and quality rating systems become more sophisticated, it is interesting to see the testing and accountability issues percolating through K-12 start to filter down to the preschool world.

The same essential – and largely false – dichotomy seems to be emerging.  Either one is for tests or against them.  Perhaps standardized tests are a panacea, to be used to determine everything from an individual teacher's salary to whether to allocate billions for public preschool.  Or perhaps tests are the mortal enemy, forcing our innocent little ones to fill in bubbles rather than cavort on the playground and fingerpaint as they ought to.

In the early childhood realm, the discussion is intensifying as states attempt to align pre-K and K-1 standards, and as quality rating systems become more sophisticated and data-driven.  Part of the disconnect stems from the reality that assessment data serves two purposes.  One purpose is to provide information at the individual level – to let parents, teachers, and kids themselves know how they are doing, and, perhaps, to better understand how teachers are doing.  A second purpose is to drive policy change based on aggregate results.  This approach allows us to ask questions like:  Are kids in poverty lagging behind wealthier kids in math or reading?  Is Head Start effective?  Are African-American children less likely than white children to be ready for kindergarten after completing preschool?

Balancing these two purposes creates inherent conflicts.  We want detailed information on our own children, but we don't want them overtested.  We want to know how teachers are doing, but we don't want that judgment to rest on a single test.  With limited time and resources, tests must accomplish both aims:  they must tell us something useful about individual kids, but they must also be standardized enough to let us draw conclusions about groups of children in various situations.

A delicate balance must be struck, one that digs deeper into the established validity of particular assessments for drawing various types of conclusions.  Here's an example:  as Tim Bartik notes this week, early test scores are a decent predictor of future earnings, and thus can be used to examine the relative effectiveness of various types of programs on one important future outcome.  But if test scores are used for accountability purposes for individual programs, centers, or teachers, their value may be undermined – essentially an unintended version of the Hawthorne effect, in which the act of measurement changes behavior.

At the heart of the matter is the notion of validity: the appropriateness of a measure for answering a particular question.  There are no (or few) good or bad tests, just good or bad applications.  (OK, some tests really are inherently rotten, but their ranks have shrunk with improved test-creation and bias-flagging techniques.)  Tests that are appropriate for one application – such as helping identify the math skills an individual child needs to work on – are not necessarily appropriate for another – such as making pronouncements about the quality of a school.

Fortunately, there are preschool assessments, such as the CLASS, that seem to tell us something meaningful both about the experience kids have in preschool and about their future prospects.  We ought to continue to examine these assessments and build the integrated data systems that allow us to make the best use of their results.