
Friday, January 22, 2010

Can We Really Measure "21st Century" Skills?

The members of the 21st Century Assessment Project were asked a while ago to respond to four pressing questions regarding assessment of “21st Century Skills.” These questions had come via program officers at leading foundations, including Connie Yowell at MacArthur’s Digital Media and Learning Initiative, which funds our Project. I am going to launch my efforts to blog more during my much-needed sabbatical by answering the first question, with some help from my doctoral student Jenna McWilliams.

Question One: Can critical thinking, problem solving, collaboration, communication and "learning to learn" be reliably and validly measured?

As Dan Koretz nicely illustrated in the introduction to his 2008 book, Measuring Up: What Educational Testing Really Tells Us, the answers to questions about educational testing are never simple. We embrace a strongly situative and participatory view of knowing and learning, which is complicated to explain to those who do not embrace it. But I have training in psychometrics (and completed a postdoc at ETS) and have spent most of my career refining a more pragmatic stance that treats educational accountability as inevitable. When it comes to assessment, I am sort of a born-again situativity theorist. Like folks who have newly found religion and want to tell everybody how Jesus helped them solve all of the problems they used to struggle with, I am on a mission to tell everyone how situative approaches to measurement can solve some nagging problems that they have long struggled with.

In short, no, we don’t believe we can measure these things in ways that are reliable and yield scores that are valid evidence of what individuals are capable of in this regard. These are actually “practices” that can most accurately be interpreted using methods that account for the social and technological contexts in which they occur. In this sense, we agree with skeptics like Jim Greeno and Melissa Gresalfi, who argued that we can never really know what students know. This point riffs on the title of the widely cited National Research Council report of the same name that Jim Pellegrino (my doctoral advisor) led. And as Val Shute just reminded me, Messick has reminded us forever that measurement never really gets directly at what somebody knows, but instead provides evidence about what they seem to know. My larger point here is my concern about what happens with these new proficiencies in schools and in tests when we treat them as individual skills rather than social practices. In particular, I worry what happens to both education and evidence when students, teachers, and schools are judged according to tests of these new skills.

However, there are lots of really smart folks, with a lot of resources at their disposal, who think you can measure them. This includes most of my colleagues in the 21st Century Assessment Project. For example, check out Val Shute’s great article in the International Journal of Learning and Media. Shute also has an edited volume on 21st Century Assessment coming out shortly. Likewise Dan Schwartz has a tremendous program of research building on his earlier work with John Bransford on assessments as preparation for future learning. Perhaps the most far-reaching is Bob Mislevy’s work on evidence-centered design. And of course there is the new Intel-Microsoft-Cisco partnership which is out to change the face of national assessments and therefore the basis of international comparisons. I will elaborate on these examples in my next post, as that is actually the second question we were asked to answer. But first let me elaborate on why I believe that the assessment (of what individuals understand) and the measurement (of what groups of individuals have achieved) of 21st Century skills is improved if we assume that we can never really know what students know.

To reiterate, from the perspective of contemporary situated views of cognition, all knowledge and skills are primarily located in the social context. This is easy to ignore when focusing on traditional skills like reading and math that can be more meaningfully represented as skills that individuals carry from context to context. This assumption is harder to ignore with the newer proficiencies that everyone is so concerned with. This is especially the case with explicitly social practices like collaborating and communicating, since these can't even be practiced in isolated contexts. As we argued in our chapter in Val’s book, we believe it is dangerously misleading to even use the term skills in this regard. We elected to use the term proficiencies because that term is broad enough to capture the different ways that we think about them. As 21st Century Assessment project leader Jim Gee once put it:
Abstract representations of knowledge, if they exist at all, reside at the end of long chains of situated activity.
However, we are also confident that some of the mental “residue” that gets left behind when people engage meaningfully in socially situated practices can certainly be assessed reliably and used to make valid interpretations about what individuals know. While we think these proficiencies are primarily social practices, this does not exclude recognizing the secondary “echoes” of participating in those practices. This can be done with performance assessments and other extended activities that provide some of that context and then ask individuals or groups to reason, collaborate, communicate, and learn. If such assessments are created carefully, and individuals have not been directly trained to solve the problems on the assessments, it is possible to obtain reliable scores that are valid predictions of how well individuals can solve, communicate, collaborate, and learn in new social and technological contexts. But this continues to be difficult, and the actual use of such measures raises serious validity issues. Because of these issues (as elaborated below), we think this work might best be characterized as “guessing what students know.”

More to the point of the question, we believe that only a tiny fraction of the residue from these practices can be measured using conventional standardized multiple-choice tests that provide little or no context. For reasons of economy and reliability, such tests are likely to remain the mainstay of educational accountability for years to come. Of course, when coupled with modern psychometrics, such tests can be extremely reliable, with little score variation across testing time or version. But there are serious limitations in what sorts of interpretations can be validly drawn from the resulting scores. In our opinion, scores on any standardized test of these new skills are only valid evidence of proficiency when they are
a) used to make claims about aggregated proficiencies across groups of individuals;
b) used to make claims about changes over longer times scales, such as comparing the consequences of large scale policy decisions over years; and
c) isolated from the educational environment which they are being used to evaluate.
Hence, we are pretty sure that national and international assessments like NAEP and PISA should start incorporating such proficiencies. But we have serious concerns about using these measures to evaluate individual proficiencies in high-stakes ways. If such tests are going to continue to be used for any high-stakes decisions, they may well best be left to more conventional literacies, numeracies, and knowledge of conventional content domains, which are less likely to be compromised.
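As an aside on the reliability half of that argument: the "extremely reliable" scores mentioned above typically rest on internal-consistency statistics such as Cronbach's alpha. Here is a minimal sketch of that computation; the response matrix and the resulting value are entirely hypothetical, and real testing programs use far larger samples and more elaborate psychometric models.

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal-consistency reliability for a persons-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 test takers answering 4 right/wrong items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 2))  # → 0.74
```

The point of the sketch is simply that reliability is a property of score consistency; a high alpha says nothing about whether the scores support valid interpretations.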

I will say that I am less skeptical about standardized measures of writing. But they are about the only standardized assessments left in wide use that actually require students to produce something. Such tests will continue to be expensive, and standardized scoring (by humans or machines) requires very peculiar writing formats. But I think the scores that result are valid for making inferences about individual proficiency in written communication more broadly, as was implied by the original question. They are actually performance assessments, and as such can bring in elements of different contexts. This is particularly true if we can relax some of the demands for reliability (which require very narrowly defined prompts and typically get compromised when writers get creative and original). Given that I think my response to the fourth question will elaborate on my belief that written communication is probably the single most important “new proficiency” needed for economic, civic, and intellectual engagement, I think that improved testing of written communication will be the one focus of assessment research that yields the most impact on learning and equity.

To elaborate on the issue of validity, it is worth reiterating that validity is a property of the way scores are interpreted. Unlike reliability, validity is never a property of the measure itself. In other words, validity always references the claims that are being supported by the evidence. As Messick argued in the 90s, the validity of any interpretation of scores also depends on the similarity between prior education and training contexts and the assessment/measurement context. This is where things get messy very quickly. As Kate Anderson and I argued in a chapter in an NSSE Yearbook on Evidence and Decision Making edited by Pam Moss, once we attach serious consequences to assessments or tests for teachers or students, the validity of the resulting scores will get compromised very quickly. This is actually less of a risk with traditional proficiencies and traditional multiple-choice tests, because these tests can draw from massive pools of items that are aligned to targeted standards. In these cases, the test can be isolated from any preparation empirically, by randomly sampling from a huge pool of items. As we move to newer performance measures of more extended problem solving and collaboration, there are necessarily fewer and fewer items, and the items become more and more expensive to develop and validate. If teachers are directly teaching students to solve the problems, then it becomes harder and harder to determine how much of an individual score reflects real proficiency and how much reflects familiarity with the assessment format (what Messick called construct-irrelevant variance). The problem is that it is impossible to ever know how much of the proficiency is “real.” Even in closely studied contexts, different observers are sure to differ in the validity they ascribe to a given interpretation—a point made most cogently in Michael Kane’s discussions of validity as interpretive argument.
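The "isolated empirically" point is easy to see in miniature: if every test form is a fresh random draw from a huge item pool, coaching on one form barely transfers to the next. A sketch, with an entirely hypothetical pool (the item IDs, pool size, and standards labels are made up):

```python
import random

# Hypothetical pool: 2,000 items, each aligned to one of 20 standards.
item_pool = {f"ITEM-{n:04d}": f"standard-{n % 20}" for n in range(2000)}

def draw_form(pool, n_items, seed=None):
    """Build one test form by simple random sampling from the pool."""
    rng = random.Random(seed)
    return rng.sample(sorted(pool), n_items)

form_a = draw_form(item_pool, 40, seed=1)
form_b = draw_form(item_pool, 40, seed=2)

# Two independently drawn 40-item forms share almost no items, so
# drilling the items on one form cannot inflate scores on another.
print(len(set(form_a) & set(form_b)))
```

Performance tasks break this logic: when a pool holds only a handful of expensive extended problems, every form reuses them, and preparation can no longer be separated from measurement.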

Because of these validity concerns, we are terrified that the publishers of these tests of “21st Century Skills” are starting to market curricula and test preparation materials for those same proficiencies. Because of the nature of these new proficiencies, these new “integrated” systems raise even more validity issues than the ones that emerged under NCLB for traditional skills. Another big validity issue we raised in our chapter concerns the emergence of socially networked cheating. Once these new tests are used for high-stakes decisions (especially for college entrance), social networks will emerge to tell students how to solve the kinds of problems that are included on the tests. (This has already begun to happen, as in the "This is SPARTA!" prank on the English Advanced Placement test that we wrote about in our chapter and in a more recent "topic breach" wherein students in Winnipeg leaked the essay topic for the school's 12th grade English exam.)

Of course, proponents of these new tests will argue that learning how to solve the kinds of problems that appear on their tests is precisely what they want students to be doing. And as long as you adopt a relatively narrow view of cognition and learning, there is some truth to that assumption. Our real concern is that this unbalanced focus, in addition to new standards and new tests, will distract from the more important challenge of fostering equitable, ethical, and consequential participation in these new skills in schools.

That is it for now. We will be posting my responses to the three remaining questions over the next week or so. We would love to hear back from folks about their responses to the first question.


Questions remaining:
2) Which are the most promising research initiatives?
3) Is it or will it be possible to measure these things in ways that they can be scored by computer? If so, how long would it take and what sorts of resources would be needed?
4) If we had to narrow our focus to the proficiencies most associated with economic opportunity and civic participation, which ones do we recommend? Is there any evidence/research specifically linking these proficiencies to these two outcomes? If we further narrowed our focus to only students from underserved communities, would this be the same list?

Thursday, October 1, 2009

Positioning Portfolios for Participation

Much of our work in our 21st Century Assessment project this year has focused on communicating participatory assessment to broader audiences whose practices we are trying to inform. This includes:

  • classroom teachers whose practices we are helping reshape to include more participation (like those we are working with in Monroe County right now);

  • other assessment researchers who seem to dismiss participatory accounts of learning as “anecdotal” (like my doctoral mentor Jim Pellegrino who chaired the NRC panel on student assessment);

  • instructional innovators who are trying to support participation while also providing broadly convincing accounts of learning (like my colleagues Sasha Barab and Melissa Gresalfi, whose Quest Atlantis immersive environment has been a testbed for many of our ideas about assessment);

  • faculty in teacher education who are struggling to help pre-service teachers build professional portfolios while knowing that their score on the Praxis will count for much more (and whose jobs are being threatened by efforts in Indiana to phase out teacher education programs and replace them with more discipline-based instruction);

  • teachers in my graduate-level classroom assessment course who are learning how to do a better job assessing students in their classrooms, as part of their MA degree in educational leadership.


It turns out that participatory approaches to assessment are quite complicated, because they must bridge the void between the socially-defined views of knowing and learning that define participation, and the individually-defined models of knowing and learning that have traditionally been taken for granted by the assessment and measurement communities. As our project sponsor Jim Gee has quite succinctly put it: Your challenge is clarity.

As I have come to see most recently, clarity is about entry. Where do we start introducing this comprehensive new approach? Our approach itself is not that complicated, really. We have it boiled down to a more participatory version of Wiggins' well-known Understanding by Design. In fact we have taken to calling our approach Participation by Design (or, if he sues us, Designing for Participation). But the theory behind our approach is maddeningly complex, because it has to span the entire range of activity timescales (from moment-to-moment classroom activity to long-term policy change) and characterizations of learning (from communal discourse to individual understanding to aggregated achievement).

Portfolios and Positioning
Now it is clear to me that the best entry point is the familiar notion of the portfolio. Portfolios consist of any artifacts that learners create. Thanks to Melissa Gresalfi, I have come to realize that the portfolio, and the artifacts that they contain, are ideal for explaining participatory assessment. This is because portfolios position (where position is used as a verb). Before I get to the clarity part, let me first elaborate on what this means.

It turns out that portfolios can be used to position learners and domain content in ways that bridge this void between communal activity and aggregated attainment. In a paper with Caro Williams about the math project that Melissa and I worked on together, Melissa wrote that

“positioning, as a mechanism, helps bridge the space between the opportunities that are available for participation in particular ways and what individual participants do”

Building on the ideas of her doctoral advisor Jim Greeno (e.g., Greeno and Hull, 2002), Melissa explained that positioning refers to how students are positioned relative to content (called disciplinary positioning) and how they are positioned relative to others (called interpersonal positioning). As I will add below, positioning also refers to how instructors are positioned relative to the students and the content (perhaps called professorial positioning). This post will explore how portfolios can support all three types of positioning in more effective and in less effective ways.

Melissa further explained that positioning occurs at two levels. At the more immediate level, positioning concerns the moment-to-moment process in which students take up opportunities that they are presented with. Over the longer term, students become associated with particular ways of participating in classroom settings (these ideas are elaborated by scholars like Dorothy Holland and Stanton Wortham). This post will focus on two complementary functions that help portfolios support both types of positioning.

Portfolios and Artifacts
Portfolios are collections of artifacts that students create. Artifacts support participation because they are where students apply what they are learning in class to something personally meaningful. In this way they make new meanings. In our various participatory assessment projects, artifacts have included

  • the “Quests” that students complete and revise in Quest Atlantis’ Taiga world, where they explain, for example, their hypothesis for why the fish in the Taiga river are in decline;
  • the remixes of Moby Dick and Huck Finn that students in Becky Rupert’s class at Aurora Alternative High School create in their work with the participatory reading curricula that Jenna McWilliams is creating and refining;
  • the various writing assignments that the English teachers in Monroe and Greene County have their students complete in both their introductory and advanced writing classes;
  • the wikifolio entries that my students in my graduate classroom assessment course complete, where they draft examples of different assessment items for a lesson in their own classrooms and state which of the several item-writing guidelines in the textbook they found most useful.

In each case, various activities scaffold student learning as they create their artifacts and make new meanings in the process. As a caveat, this means that participatory assessment is not really of much use in classrooms where students are not asked to create anything. More specifically, if your students are merely being asked to memorize associations and understand concepts in order to pass a test, stop reading now. Participatory assessment won’t help you. [I learned this the hard way trying to do participatory assessment with the Everyday Mathematics curriculum. Just do drill and practice. It works.]


Problematically Positioned Portfolios
Probably the most important aspect of participatory assessment has to do with the way portfolios are positioned in the classroom. We position them so they serve as a bridge between the communal activities of a participatory classroom and the individual accountability associated with compulsory schooling. If portfolios are to serve as a bridge, they must be firmly anchored. On one side they must be anchored to the enactment of classroom activities that support students’ creation of worthwhile portfolios. On the other side they must be anchored to the broader accountability associated with any formal schooling.



To keep portfolio practices from falling apart (as they often do), it is crucial that they rest on these two anchors. If accountability is placed directly on the portfolio itself, the portfolio practice will collapse. In other words, don’t use the quality of the actual portfolio artifacts for accountability. Attaching consequences to the actual artifacts means that learners will expect precise specifications regarding those artifacts, and then demand exhausting feedback on whether the artifacts meet particular criteria. And if an instructor’s success is based on the quality of the artifacts, that instructor will comply. Such classrooms are defined by an incessant clamor from learners asking “Is this what you want???”

When portfolios are positioned this way (and they often are), they may or may not represent what students actually learned and are capable of. Rather, the portfolio is more representative of (a) the specificity of the guidelines, (b) students’ ability to follow those guidelines, and (c) the amount of feedback they get from the instructor. Accountability-oriented portfolios position disciplinary knowledge as something to be competitively displayed rather than something to be learned and shared, and they position students as competitors rather than supporters. Perhaps most tragically, attaching consequences to artifacts positions instructors (awkwardly) as both piano tuners and gatekeepers. As many instructors (and ex-instructors) know, doing so generates massive amounts of work. This is why it seems that many portfolio-based teacher education programs rely so heavily on doctoral students and adjuncts who may or may not be qualified to teach the courses. The more knowledgeable faculty members simply don’t have the time to help students with revision after revision of their artifacts as students struggle to create the perfect portfolio. This is the result of positioning portfolios for production.

Productive Positioning Within Portfolios
Portfolios are more useful when they are positioned to support reflection. Instead of grading the actual artifacts that students create, any accountability should be associated with student reflection on those artifacts. Rather than giving students guidelines for producing their artifact, students need guidelines for reflecting on how that artifact illustrates their use of the “big ideas” of the course. We call these relevant big ideas, or RBIs. The rubrics we provide students for their artifacts essentially ask them to explain how their artifact illustrates (a) the concept behind the RBI, (b) the consequences of the RBI for practice, and (c) what critiques others might have of this characterization of the RBI. For example:

  • Students in my classroom assessment course never actually “submit” their wikifolios of example assessments. Rather, three times a semester they submit a reflection that asks them to explain how they applied the RBIs of the corresponding chapter.
  • Students in Taiga world in Quest Atlantis submit their quests for review by the Park Ranger (actually their teacher, but they don’t know that). But the quest instructions (the artifact guidelines) also include a separate reflection section that asks students to reflect on their artifact. The reflection prompts are designed to indirectly cue them to what their quest was supposed to address.
  • Students in Becky Rupert’s English class are provided a rubric for their remixes that asks them to explain how that artifact illustrates how an understanding of genre allows a remix to be more meaningful to particular audiences.
Assessing the resulting reflections positions portfolios, students, and teachers in ways that strongly support participation. For example, if the particular student’s artifact actually does not lend itself to applying the RBIs, my classroom assessment students can simply indicate that in their assignment. This is important for at least three reasons:

  1. it allows full individualization for students and avoids a single ersatz assignment that is only half-meaningful to some students and mostly meaningless to the rest;
  2. understanding if and how ideas from a course do not apply is a crucially important part of developing expertise; and
  3. the reflection itself provides more valid evidence of learning, precisely because it can include very specific guidelines. We give students very specific guidelines asking them to reflect on the RBIs conceptually, consequentially, and critically.

For example, the mathematics teachers in the classroom assessment course are going to discover that it is very difficult to create portfolio assessments for their existing mathematical practices. Rather than forcing them to do so anyway (and giving them a good grade for an absurd example), they can instead reflect on what it is about mathematics that makes it so difficult, and gain some insights into how they might more readily incorporate project-based instruction into their classes. The actual guidelines for creating good portfolios are in the book when they need them; reflecting on those guidelines more generally will set them up to use them more effectively and meaningfully in the future.

Another huge advantage of this way of positioning portfolios is that it eliminates a lot of the grading busywork and allows more broadly useful feedback. In the Quest Atlantis example, our research teacher Jake Summers of Binford Elementary discovered that whenever the reflections were well written and complete, the actual quest submission would also be well done. In the inevitable press for time, he just stopped looking at the artifacts. Similarly, in my classroom assessment course, I only need to go back and look at the actual wikifolio entries when a reflection is incomplete or confusing. Given that the 30 students each have 8 entries, it is impossible to carefully review all 240 entries and provide meaningful feedback. Rather, throughout the semester, each of the students has been getting feedback from their group members and from me (as they specifically request and as time permits). Because the artifacts are not graded, students understand the feedback they get as more formative than summative, and not as instructions for revision. While some of the groups in class are still getting the hang of it, many of the entries are getting eight or nine comments, along with comments on comments. Because the entries are wikis, it is simple for the originator to go in and revise as appropriate. These students are starting to send me messages that, for me, suggest that the portfolio has indeed been positioned for participation: “Is this what you meant?” (emphasis added). This focus on meaning gets at the essence of participatory culture.

In a subsequent post, I will elaborate on how carefully positioning portfolios relative to (a) the enactment of classroom activities and (b) external accountability can further foster participation.

Wednesday, July 22, 2009

I'm bringing sexyback: some thoughts on formative assessment

Immersed as I am lately in the world of participatory assessment, I go through cycles of forgetting and then remembering and then forgetting again that not everybody in educational research thinks assessment is sexy.

I was reminded of this again recently while reading Lorrie Shepard's excellent 2005 paper, "Formative Assessment: Caveat Emptor." The piece argues that the notion of "formative assessment" has been twisted in unfortunate ways as a result of the excessive hammering kids get from high-stakes standardized tests.

I helpfully plugged the entire paper into the wordle machine for you and got this:


In theory, then, assessment should be easy to understand: All of the most frequently used words in Shepard's paper are fairly common and comprehensible. In practice, though, assessment research is complicated by the impulse to put a fine point on things. Here's a sample paragraph from Shepard's piece, which starts out okay but descends into chaos before the end:
“Everyone knows that formative assessment improves learning,” said one anonymous test maker, hence the rush to provide and advertise “formative assessment” products. But are these claims genuine? Dylan Wiliam (personal communication, 2005) has suggested that prevalent interim and benchmark assessments are better thought of as “early-warning summative” assessments rather than as true formative assessments. Commercial item banks may come closer to meeting the timing requirements for effective formative assessment, but they typically lack sufficient ties to curriculum and instruction to make it possible to provide feedback that leads to improvement.


I'm not saying the language is unnecessary; I'm not saying that assessment types are putting too fine a point on things. What I will argue here is that assessment research has, for lots of good and not-so-good reasons, been divorced so thoroughly from other aspects of educational research that it's decontextualized itself right into asexuality. It's like that guy in the corner booth at the bar on Friday night who wants to talk about Marxism when everybody else just wants to make sure everybody gets the same amount of beer before closing time.

Think about that guy for a second. Let's call him Jeff. Jeff has been single for a long time now, and he's spent a lot of that time reading. Maybe he's grown nostalgic for the early days before his girlfriend cheated on him and then moved in with some guy she met in her Econ class. His friends miss those days, too, mainly because he was so much goddamn fun back then. They're nice enough; they want to take him out and help him snap out of it. But the minute the beers come he's back on the Marxism soapbox again and NOBODY. FREAKING. CARES. It's Friday night, late July, and everybody just wants to get stupid drunk. They drop him some hints. Sully slaps him on the back and asks him to tell that one joke he told last week.

"In a minute," Jeff says. "I'm explaining where Marxism went wrong."

Eventually his friends will tell him to either cut it out or go home. If he wants to keep hanging out with these guys, he'll shut up. Or maybe he'll tell that one joke he executes so well. If the girls around him laugh, he might tell another one. Girls like funny guys, he'll suddenly remember. They don't necessarily like Marxists.

All of this is what we might call "formative assessment." This guy wants to be accepted by his friends, which means he needs to pay attention to his behavior. He learns (or re-learns) how to act at the bar on Friday night by paying attention to the feedback he gets from his friends, from other people at the bar, from his memories of having a social life all those years ago.

If we wanted to, we could spend some time talking about better ways to help Jeff learn the social skills he needs. For example, his friends could have sat him down before they went out and explained that his primary goal was to be the funniest guy in the room. "Because girls like funny guys," his buddy Rufus might remind him. They might also set deadlines: By 11:30 you better have told at least three jokes. Then, over the course of the evening, they could check in with him and get a joke-count.

The point is that everybody's on board with the evening's goals. Everybody--Jeff, his friends--wants Jeff to have a good time, and they want to have a good time with him.

Haha! I tricked you into caring about formative assessment.

This is what assessment is, even if it doesn't always feel that way to students, teachers, or researchers. There is an end goal, an objective, and formative assessment is a way of getting everyone on board with this goal and keeping them on board. When it works right, everybody involved actually wants to achieve the objective and the assessment is valuable because it helps them get where they want to go.

But as Shepard's piece points out, too often the insanity of NCLB substitutes test scores for real, intrinsic motivation. Too often and too easily, students learn the skills it takes to attain high test scores without actually learning anything. Though "(the) idea of being able to do well on a test without really understanding the concepts is difficult to grasp," Shepard writes, she gives as evidence a 1984 study performed by M.L. Koczor, which focused on two groups of children learning about Roman numerals:
One group learned and practiced translating Roman to Arabic numerals. The other group learned and practiced Arabic to Roman translations. At the end of the study each group was randomly subdivided again (now there were four groups). Half of the subjects in each original group got assessments in the same format as they had practiced. The other half got the reverse. Within each instructional group, the drop off in performance, when participants got the assessment that was not what they had practiced, was dramatic. Moreover, the amount of drop-off depended on whether participants were low, middle, or high achieving. For low-achieving students, the loss was more than a standard deviation. Students who were drilled on one way of translation appeared to know the material, but only so long as they were not asked to translate in the other direction.
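That "more than a standard deviation" loss is an effect size: the same-format mean minus the reversed-format mean, divided by the spread of scores. A quick sketch of the arithmetic (the group means and standard deviation below are hypothetical illustrations, not Koczor's actual data):

```python
def dropoff_in_sd_units(mean_practiced, mean_reversed, sd):
    """Standardized performance loss when test format differs from practice."""
    return (mean_practiced - mean_reversed) / sd

# Hypothetical percent-correct means echoing the pattern Shepard describes:
# the lower-achieving the group, the bigger the format-change penalty.
sd = 10.0
groups = {"low": (70.0, 55.0), "middle": (80.0, 73.0), "high": (88.0, 84.0)}
for label, (practiced, reversed_mean) in groups.items():
    print(label, dropoff_in_sd_units(practiced, reversed_mean, sd))
```

In these made-up numbers only the low group clears the one-standard-deviation threshold, which is the differential impact on low achievers that Shepard flags.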

Because NCLB and other insane policies that mandate high-stakes testing for accountability have pushed assessment out of its natural home--as Jim Gee explains it, "in human action"--assessment researchers have themselves been backed into a separate corner of the room.

This is not okay. It doesn't help anybody to take the sexy out of assessment by tossing it into a corner. What we need, more than anything, is to push assessment back where it belongs: inside of the participation structures that support authentic learning.

Participatory assessment is, at its core, about social justice, about narrowing the participation gap that keeps our society stratified by race and class, about motivating learners to achieve real goals and overcome real obstacles to their own learning. Participatory assessment, if we do it right, can make almost anything possible for almost anyone.