Last Wednesday Courtney and I led a half-day session on inter-rater reliability training for a group of 34 principals, superintendents, and teachers. The administrators came from various regional districts, but the eight teachers were all from the same school system, Galway Central School (GCS), and they came as future peer evaluators! I've written about teacher leadership in the past, and kudos to GCS for promoting such leadership within their school culture. Honestly, how can we grow our profession and student achievement without such leadership?
Back to our inter-rater reliability session: the focus was on using the Danielson Framework for Teaching to observe and rate classroom instruction. Our goal was to make everyone aware of the necessity of collecting objective, detailed evidence to support teacher performance ratings (Highly Effective, Effective, Developing, and Ineffective). We used videos from the New Teacher Center to practice collecting and tagging evidence in order to rate the various instructional components within the 2011 revised Danielson rubric. (At the New York State Education Department Network Team Institutes, we are using the 2007 Danielson rubric, which goes into greater specificity than the 2011 version, in evaluator calibration sessions.)
Over the course of three hours, we watched two videos, each 20 minutes in length. Audience members were directed to collect evidence, tag the appropriate Danielson component, and then rate the level of performance within each component. We then had people work in small groups to discuss their observations and ratings. All data were collected and entered into an Excel document to show the level of consistency among the 34 raters. Below is a chart of our second video's results. Vertical columns rank the level of performance from Ineffective (1) to Highly Effective (4), and horizontal rows represent components within Domain 2 (Classroom Environment) and Domain 3 (Instruction).
As you can see from the data, there was much greater variability within Domain 3. When the data were shared with the group, one of the Galway teachers asked, "Why is there so much spread in the ratings, from Ineffective to Highly Effective, in some components?" Great question. We explained that this is new work that Race to the Top is asking of New York State educators. I asked the audience, "How many of you have had inter-rater reliability training or reviewed classroom observation videos with rubrics in your districts or leadership programs?" No one raised a hand. Courtney and I explained that before Race to the Top, such attention to teacher observation and inter-rater consistency was sorely lacking in this state and across the nation, and that the work we are now doing in this area is exciting and promising for the field and for students. We also assured our hard-working group that consistency and efficiency grow over time as one becomes more familiar with the process.
We ended our day with some key thoughts and next steps. I suggested that people return to their districts and conduct similar sessions with colleagues. What a great thing to do at a faculty, department, or grade-level meeting! If we hope to diminish the fear factor regarding teacher observation that legitimately resides within many schools, then we must develop people's knowledge of, and comfort with, the teacher observation process. We also have to remind the press and the public that publishing teacher performance records greatly compromises the Race to the Top reform effort, sowing seeds of distrust, fear, and resentment. Annual Professional Performance Review data, drawn from student achievement and teacher observation, are best used to grow teachers' capacity to enhance student achievement, not to politicize and badger public education.