The Classroom Experiment: A Progress Report

Maya Bialik
14 min read · Dec 20, 2021

Well, it’s been three months since I gave up the life of a work-from-home education researcher and began the life of a 7th grade science teacher. This post is an update on how it’s going.

The best part about being a teacher compared to being a researcher is the turnaround time between having an idea and testing it. To paraphrase Lewis Carroll, why, sometimes I’ll test six educational theories before lunch!

This has let me iterate on every aspect of my teaching at lightning speed. I keep the things that work, figure out where they work best, and continue to experiment with everything else.

So, have I answered any questions I had coming in? Are there still some I am trying to answer? Are there new questions on my radar now that I hadn’t considered before? Yes, yes, and yes.

There’s too much ground to cover in a single blog post, so for this one I’ll limit myself to one theme that emerged as possibly my biggest surprise: grading.

When I first spoke to my eighth grade science teacher in August to get advice, he immediately began by talking about grading. I was confused, since he and I had spoken about education many times before, and grading hadn’t even come up. But now that I was asking for advice as a teacher, it seemed to be the most important topic. That’s when I first realized that I may not have given enough thought to grading as a researcher.

Here, I will discuss my thinking about:

  • why tests should be harder
  • the framing of assignments as high or low stakes
  • standards-based grading and the debate around how to make day-to-day grading effective and motivating
  • tracking social-emotional growth

I will share my questions, realizations, and the solutions I’m excited about. I am sure this is just scratching the surface of what I will discover, but this is, after all, just a progress report.

Summative Assessment

In my previous work on assessment, it seemed like everyone in progressive circles was pretty firmly against summative assessment (usually a high-stakes test at the end of the unit that shows how well the students learned) and excited about formative assessment (low-stakes collection of evidence of learning to inform further teaching). My main question coming in was:

Why not just constantly do formative assessment? Wouldn’t this do more for learning and to convey the idea that we are always improving?

I figured out the reason, which is simply that it is either impossible or extremely difficult to do. My new question quickly became:

How can I keep track of 79 people’s constant, slow, nonlinear improvement along multiple dimensions?

This hurdle kept coming back at me from different directions, until after many twists and turns I finally came up with a system (which I discuss in the last section).

But this experience did show me a rationale for summative assessment. Quizzes are almost indispensably convenient because they provide a concrete snapshot of what your students have learned before you move on to another topic, which is going to build on this knowledge.

But if that’s the case, it almost feels like cheating to give the test right after a review of the unit, right after finishing the unit. It even almost feels like cheating to let them study!

If learning really is slow and layered and complex (and it is), then summative assessments should be testing how consolidated the knowledge is, not how they’re doing on a given day (that’s …formative assessment, no?) and certainly not how they’re doing on a given day when they know they will be assessed on how they’re doing.

Shouldn’t we optimize for data that would predict future learning, rather than data that makes us feel good about how well we taught?

This testing situation is simply not similar enough to the conditions under which students will be reaching for this knowledge in the future, so it’s not a good predictor of how successful they will be at applying it when they need to. Specifically, in the future, students will have 1) cognitive load from the new material that’s building on this knowledge and 2) a lack of context around where exactly to find this knowledge in their mental models. (On the bright side, any material that builds on itself constitutes a natural spacing/interleaving/spiral of the curriculum).

And yet, even with a review day, even without spacing between learning the material and testing, students still struggle. That’s what makes it hard to push yourself to make it even harder. But I think I can figure out a system (and if I can’t I’m not sure summative assessment is worth doing at all). For example, maybe they can do infinite retakes but only one per test per week. Maybe the retakes have open response questions, to get more qualitative data after identifying where you need to pay more attention via quantitative (multiple choice, mostly) data? Maybe you have other ideas?

I am also actively considering the psychological effects of summative assessments. You don’t want to induce test anxiety and entrench an achievement orientation, so you may think it’s a good idea to let students retake tests infinitely and take their best score. This, at least, appears to be a popular opinion.

However, with the way we use standards-based grading, I am not actually “taking” their best, or their average, or making any specific mathematical calculation of their scores. I am looking at their learning qualitatively over time.

Do retakes that “all count” but only qualitatively cause more or less test anxiety?

After all, some kids do worse the second and third time, which is great for showing me a better sense of their “true” understanding, but stressful if you are a student who is already stressed about assessments.

I could tell them I’m only taking their best score and that they can redo the test until they get a perfect score. But if I really did it that way, I wouldn’t actually know which kids were struggling and with what. I would be less able to do my job: to help all the students actually understand the material.

One thing I realized is that a big part of giving grades is telling students what they need to hear in order to keep pushing themselves. However, different students need to hear different things, even when their work is objectively the same quality. If one cares about the role of grades in student motivation (and one should), it makes it extra challenging to “grade objectively”.


It has also been interesting to note how framing an activity as a quiz or as classwork drastically changes how the students approach it. I had, several times, given them a series of questions to work on together. They got distracted, waited for me to explain the answers, and still handed in incomplete work.

On one occasion, I gave a test that was too challenging, so after their first try, I allowed them to work with a partner. Instead of blowing it off, they were thrilled! They collaborated, stayed focused the whole time, and handed in high-quality work. One student even shouted that I was the “best teacher ever!”

Both are sets of questions about the material that they were working on in small groups. The weird part is, I don’t average the grades, so it actually makes no difference at all on my end whether I call something a quiz or classwork.

A little bit of pressure can go a long way in motivating students to push themselves. Too much can cause anxiety. But too much for one student is not enough for another.

How do you balance adding just the right amount of pressure for every student in a class?

The prevailing progressive wisdom seems to be that school is stressful enough, so it’s best to just create a non-stressful environment with activities and projects that are motivating in themselves, where students push themselves because they want to learn and create things they’re proud of.

This is a worthwhile goal, but as with any pedagogy that is student-led, there is the question: what do you do when kids inevitably do not choose to learn the things that are keys to learning more things down the line? Or what if they would like to have chosen them, but all of their other classes, which do give grades and average them and thus have lifelong consequences, keep taking priority?

Formative Assessment

When it comes to formative assessment, I’ve found my stride in how I ask questions during class. This tells me what I need to re-teach or whether I can move on at any given moment, multiple times per class period (more on that in another post).

However, the logistics of grading is something I didn’t really place enough emphasis on before becoming a teacher.

How can anyone sustainably grade 79 of anything in a timely manner?

At first, I didn’t assign anything. Each lesson took place on Nearpod, where I could see students’ thinking in real time, so there was no need for much asynchronous formative assessment. They loved the gamified quiz part at the end, but since they got more points for speed (and there is no way to turn this off on any of the platforms), they rushed, leaving me with data so noisy as to be entirely unhelpful.

By the time the Chapter 1 test rolled around, I realized I had no idea how they were supposed to study for it without any concrete notes left over from the classes! I created a study guide for them, and switched from Nearpod to Google Slides. (Also, Nearpod only lets you organize by first name and Canvas only lets you organize by last name, so they are actually incompatible. Incredible, but true. It would cost something like $5K for the package that integrates them.)

Google Slides worked better because I was able to make slides with built-in checks for understanding during the class that would also serve as materials to study from. However, during class, when I asked them to turn their iPads around like mini-whiteboards or walked around to look at their answers individually, I only got a general gist of how the group as a whole was doing.

This information is still very valuable! By seeing the mistakes students make, I can boil down open-ended questions into diagnostic multiple choice questions I can use as Do Nows or on quizzes.

But I can’t use this system to keep track of the progress of each of the 79 students. Remembering each individual’s answer in the moment is probably impossible without a photographic memory, and checking each person’s notes (even just for completion) every night was way more than I could actually grade.


It became very clear very quickly that most of my colleagues felt strongly about not giving students grades out of 100, but simply grading them complete or incomplete, and when incomplete, offering students feedback with the expectation that they will revise and resubmit.

It turns out this is part of a larger debate about standards-based grading. This tends to refer to grading schemes where content is broken down into bite-size learning goals and students are given a 1–4 grade on each learning goal. This makes intuitive sense to me, because I believe that grading out of 100 isn’t right unless you can reliably distinguish between 78 and 79 quality work. (And I think this is only possible if you have a large assignment broken down into smaller questions, each of which you are grading roughly on a 1–4 scale).

At my school this means that students do not receive a grade of A, B, C, D, or F in Science, English, and Social Studies. They are graded on four academic standards and two social-emotional standards. For each they effectively receive one of three ratings: Meeting Expectations (M), Progressing Toward Expectations (P), or Warning (W). The academic standards are Content Knowledge, Use of Information, Oral Communication, and Writing Skills; however, they may mean different things in different classes and on different teams. The social-emotional standards are Academic Habits and Citizenship (more on that later).

So I took these ratings and used them in my class. I also added one more rating: Exceeding Expectations.

My thinking was that this extra level of achievement would motivate the students who could meet expectations easily to push themselves further. Some teachers warned me this was not a good idea, but I couldn’t figure out why:

How would providing high-achieving students with motivation to push themselves backfire?

I only gave about four students Exceeding Expectations, since… my expectations were high and well-calibrated. And yet, probably half of my students asked me why they didn’t receive Exceeding Expectations. They kept asking which part of the assignment they did not do correctly, and I kept explaining that they did it all fine, and that’s why their work met my expectations. These were not fruitful discussions about their learning so I quickly pivoted away from that strategy.

It really reminded me of the quote by Dylan Wiliam:

When students receive both scores and comments, the first thing they look at is their score, and the second thing they look at is… someone else’s score.

I explained this to the students when I changed the grading scheme yet again, saying, “I care about your learning and sometimes grades can distract from learning”. They all thought about it for a few seconds, and then completely agreed.

What I’ve settled on is to score everything out of 2 points. So a 2 is Meeting Expectations or Complete, a 1 is Progressing Toward Expectations or Incomplete, and a 0 is Warning/Missing. This was pretty intuitive, and Canvas clearly shows them they earned a 2/2 when they meet expectations.

I then explained that sometimes, not often, if I am really impressed by something, I might give it a 3/2 and gave them a few examples. This all made sense and no one complained about not exceeding expectations again.

I thought it was interesting that numbers, which a few other teachers seemed averse to using, actually made students less stressed, rather than more. I was able to convey the idea of truly exceeding expectations, rather than this qualitative phrase being mentally converted to the A B C D F paradigm. Numbers also let me roll up students’ work as a whole and see at a glance who needs the most help, to make sure no one falls through the cracks. I came to the conclusion that numbers can be extremely useful, as long as they are used thoughtfully.
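To make the "roll up" idea concrete, here is a minimal sketch of one way the 0 to 3 scores could be summarized without averaging them. It only illustrates the idea and is not the Canvas gradebook: the student names, the sample scores, and the function names are hypothetical.

```python
# Labels for the 2-point scale; 3 is the rare "really impressed" score.
SCORE_LABELS = {
    0: "Warning / Missing",
    1: "Progressing Toward Expectations / Incomplete",
    2: "Meeting Expectations / Complete",
    3: "Exceeding Expectations",
}

def below_expectations(scores):
    """Count how many assignments scored under 2 (not yet meeting expectations)."""
    return sum(1 for score in scores if score < 2)

def needs_help_first(scores_by_student):
    """Order students so those with the most unmet assignments come first."""
    return sorted(
        scores_by_student,
        key=lambda name: below_expectations(scores_by_student[name]),
        reverse=True,
    )

# Hypothetical sample data, one list of scores per student.
scores = {
    "Student A": [2, 2, 3],
    "Student B": [1, 0, 2],
    "Student C": [2, 1, 2],
}

ranking = needs_help_first(scores)
print(ranking)
```

Counting unmet assignments rather than computing a mean keeps the roll-up in the spirit of looking at learning qualitatively: it flags who to check in with, not what grade anyone "has".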

But I ran into another problem:

What is the best way to respond to a score of 1 / Incomplete / Progressing Toward Expectations?

I’m still very torn about the idea of revising and resubmitting until they earn a Complete. I like this idea a lot in theory. It successfully diverts attention from “the grade” and onto the learning. It even corresponds to what real scientists do when they apply to journals!

However, if I make every set of class notes an assignment, I now have to theoretically ask kids to revise and resubmit these notes. If I assign two packets of notes per week and each time I tell 10% of my students they have to redo them, I have to grade 16 extra notes every week, on top of the 158 I would already have to grade (I teach 79 kids). At two minutes per assignment, that’s nearly six hours of grading per week.
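The estimate is easy to re-run with different assumptions. The sketch below simply restates the arithmetic from the paragraph above; the function name and parameter names are made up for illustration, and it counts only one round of resubmissions.

```python
import math

def weekly_grading_load(students, assignments_per_week, redo_rate, minutes_per_item):
    """Return (items to grade, hours) for one week, with one round of redos."""
    first_pass = students * assignments_per_week
    # Round up: a partial resubmission still has to be graded as a whole one.
    resubmissions = math.ceil(first_pass * redo_rate)
    total_items = first_pass + resubmissions
    hours = total_items * minutes_per_item / 60
    return total_items, hours

# The numbers from the text: 79 students, 2 packets a week, 10% redo, 2 minutes each.
items, hours = weekly_grading_load(79, 2, 0.10, 2)
print(items, round(hours, 1))
```

Raising the redo rate or allowing a second round of revisions pushes the total up quickly, which is the sustainability problem in a nutshell.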

This of course does not include grading of any assignment other than class notes, and it has to be done immediately, or else we would have moved on and the formative feedback is useless. Even when everything is done very promptly, if students have to revise and resubmit their work multiple times, we will have moved on to the next assignment and their outstanding assignments will begin to pile up, which is counterproductive when it comes to alleviating their stress!

Tracking social-emotional learning

Supposing I fine-tune the classwork protocol, I still need a robust system for giving feedback on progress on the two standards that are not content related: Academic Habits and Citizenship.

How can I set up a system to take consistent notes on the progress of all of the students on behavioral, social, emotional dimensions in a way that also empowers them to reflect on their progress in these domains?

Once I realized I’d need to write a paragraph for each student for progress reports, I started to keep notes using Airtable, so that I could sort by student, by grade, by standard, etc.

However, those were my own notes, not student-facing, which left a big learning opportunity on the table, and also just generally kept students in the dark about their grades, which isn’t great. I experimented with having them sign off on slips of paper, but this had a more high-stakes vibe than I meant for it to and took a lot of effort overall. I had them self-reflect but it was not very successful in helping them reach greater depths of understanding.

I looked around for some online tools but they were either clunky, sleek but too gamified (e.g. “points” for good behavior), or reserved for the school-level only (no teacher-level free trial).

Last week, I created a zap so that whenever I make a new note about a student or students, it automatically creates a draft of an email bcc’d to them, telling them what I noticed, and how it’s evidence of exceeding/meeting/progressing/below expectations and which standard we are talking about.
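The zap is a no-code automation, but the step it performs can be sketched in Python for anyone who wants to build something similar. This is an illustration under stated assumptions, not the actual setup: the note fields (`student_emails`, `standard`, `level`, `observation`), the example addresses, and the function name are all hypothetical.

```python
from email.message import EmailMessage

def draft_feedback_email(note, teacher_address):
    """Build a draft email for one note, with the students on the bcc line."""
    msg = EmailMessage()
    msg["From"] = teacher_address
    msg["To"] = teacher_address  # students are only bcc'd, so they don't see each other
    msg["Bcc"] = ", ".join(note["student_emails"])
    msg["Subject"] = f"Feedback: {note['standard']}"
    msg.set_content(
        f"Here is something I noticed: {note['observation']}\n\n"
        f"This is evidence of {note['level']} on the "
        f"{note['standard']} standard.\n"
    )
    return msg

# Hypothetical note, shaped like one row of a notes table.
note = {
    "student_emails": ["student1@example.com", "student2@example.com"],
    "standard": "Academic Habits",
    "level": "meeting expectations",
    "observation": "you both came prepared and helped your group stay on task.",
}

draft = draft_feedback_email(note, "teacher@example.com")
print(draft)
```

Keeping the result as a draft rather than sending it automatically leaves room to reread and adjust the tone before anything reaches a student.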

This has worked really well so far, in that the students seem very responsive to personal emails. It comes off as serious, but not too serious.

For the positive notes, they have said things like “this made my day!” and for the negative ones, they have sent me emails back acknowledging the feedback and promising to improve. In both cases, I have already seen noticeable improvement from students. It also opens a dedicated line of communication with each student that we can use asynchronously.

This means I now can keep track of 79 students improving along six dimensions at the same time!

[Figure: The x axis represents each of the standards and the y axis represents the number of notes I have written in the second half of the first semester. Color represents performance level.]

I can even make graphs of my notes to make sure I am writing enough different kinds of notes for each student and each standard!

(I just figured out how to do this, so I haven’t actually evened them out yet).

[Figure: The x axis represents students and the y axis represents the number of notes I have written in the second half of the first semester. Color represents performance level.]
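For anyone curious what sits behind graphs like these, the tallies could be computed from an exported CSV along the following lines. This is a sketch using only the standard library; the column names (`student`, `standard`, `level`) and the sample rows are hypothetical, not the real Airtable schema.

```python
import csv
import io
from collections import Counter

# Hypothetical export: one row per note.
SAMPLE_CSV = """student,standard,level
Student A,Citizenship,meeting
Student A,Academic Habits,progressing
Student B,Citizenship,meeting
Student B,Content Knowledge,exceeding
"""

def count_notes(csv_text, group_by):
    """Count notes per (group, performance level) pair.

    group_by is the column for the x axis, e.g. "standard" or "student".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[group_by], row["level"]) for row in reader)

by_standard = count_notes(SAMPLE_CSV, "standard")
by_student = count_notes(SAMPLE_CSV, "student")

for (standard, level), n in sorted(by_standard.items()):
    print(standard, level, n)
```

Each count corresponds to one colored segment of a bar, so gaps are easy to spot: a student or standard with no entries simply never shows up in the tally.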

There is one more layer to this system I hope to try out after Christmas break. I’ve created a form so that students can write their own notes and they will be added right into the Airtable. This will let me know of things I didn’t notice myself, and encourage students to self-reflect on their own improvement.

By the way, I can now make the same types of graphs for learning objectives, student misconceptions, etc. My dream would be to hook it up to an automatic assessment grader, so I can just write and tag questions and answers, and it would grade them all and show me graphs of students’ achievement and misconceptions. Anyone know of such a tool? All it has to do is spit out a csv and I’ve got it from there!
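To be clear about what such a tool would need to do, here is a toy sketch of the core logic for multiple choice questions. It is not an existing product: the question IDs, answer keys, misconception tags, and student responses are all invented for illustration.

```python
from collections import Counter

# Hypothetical question bank: each wrong option is tagged with the
# misconception it is meant to diagnose.
QUESTIONS = {
    "q1": {"correct": "B", "misconceptions": {"A": "tag-1", "C": "tag-2"}},
    "q2": {"correct": "A", "misconceptions": {"B": "tag-1", "C": "tag-3"}},
}

def grade(responses, questions):
    """Return (correct answers per student, tally of tagged misconceptions)."""
    correct = Counter()
    misconceptions = Counter()
    for student, answers in responses.items():
        for question_id, choice in answers.items():
            question = questions[question_id]
            if choice == question["correct"]:
                correct[student] += 1
            elif choice in question["misconceptions"]:
                misconceptions[question["misconceptions"][choice]] += 1
    return correct, misconceptions

# Hypothetical responses, e.g. parsed from a form export.
responses = {
    "Student A": {"q1": "B", "q2": "A"},
    "Student B": {"q1": "A", "q2": "B"},
    "Student C": {"q1": "C", "q2": "A"},
}

correct, misconceptions = grade(responses, QUESTIONS)
print(correct.most_common())
print(misconceptions.most_common())
```

The two tallies could then be written out as a CSV and graphed the same way as the notes.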

Next Steps

I’ve joined a book club around the book Grading for Equity after we had a whole professional development day devoted to the topic. It seems like an extremely important and completely unresolved question.

If the same student doing the same work can get two totally different grades depending on which teacher they have, doesn’t this really mess with how we think of GPAs? It also messes with the student’s head, because they have to keep in mind seven different paradigms and somehow succeed in all of them.

In my past work, I’ve thought a lot about the pressures of getting into college and how they can impact and shape K-12 education. However, I’ve almost exclusively focused on testing.

Grading is the silent wild card in the equation.

It seems to be very challenging to talk about. Everyone has very strong opinions, and they are all difficult to articulate, because they get at the core of how each teacher organizes their class and what they value.

To me, that is all the more reason to examine it closely.


