These boosts to performance, and their benefits for longer-term learning, are examples of the testing effect—an effect that, though well established in cognitive psychology today, was less appreciated in the 1980s. Students learn from testing and retesting, especially if they receive corrective feedback that focuses on processes and concepts instead of simply being told whether they are right or wrong. Burke’s and Anania’s tutors were trained to provide effective feedback. Indeed, Burke wrote, “perhaps the most important part of the tutors’ training was learning to manage feedback and correction effectively.” The feedback and retesting also gave tutored students more instructional time than the students receiving whole-class instruction—about an hour more per week, according to Burke.

How much of the two-sigma effect did the extra testing and feedback explain? About half. You can tell because, in addition to the tutored and whole-class groups, there was a third group of students who engaged in “mastery learning,” which did not include tutoring but did include feedback and testing after whole-class instruction. On a post-test given at the end of the three-week experiment, the mastery-learning students scored about 1.1 standard deviations higher than the students who received whole-class instruction. That’s just a bit larger than the 0.73 to 0.96 standard deviations that meta-analyses have reported for the effects of testing and feedback on narrow tests.

If feedback and retesting accounted for 1.1 of Bloom’s two sigmas, that leaves 0.9 sigmas that we can chalk up to tutoring. That’s not too far from the 0.84 sigmas that the Cohen, Kulik, and Kulik meta-analysis reports for tutoring’s effect on narrow tests.

Tutors received extra training. Extra testing and feedback may have been the most important addition to Anania’s and Burke’s tutoring intervention, but it wasn’t the only one.

Anania’s and Burke’s tutors also received training, coaching, and practice that other instructors in their experiments did not receive. Burke mentioned training tutors to provide effective feedback, but tutors were also trained “to develop skill in providing instructional cues . . . to summarize frequently, to take a step-by-step approach, and to provide sufficient examples for each new concept. . . . To encourage each student’s active participation, tutors were trained to ask leading questions, to elicit additional responses from the students, and to ask students for alternative examples or answers”—all examples of active, inquiry-based learning and retrieval practice. Finally, “tutors were urged to be appropriately generous with praise and encouragement whenever a student made progress. The purpose of this training was to help the tutor make learning a rewarding experience for each student.”

Although previous tutoring studies had not found larger effects if tutors were trained, the training these tutors received may have been exceptional. Anania and Burke could have isolated the effect of training if they had offered it to some of the instructors in the whole-class or mastery-learning group. Unfortunately, they didn’t do that, so we can’t tell how much of their tutoring effect was due to tutor training.

Tutoring was comprehensive. Many public and private programs offer tutoring as a supplement to classroom instruction. Students attend class with everyone else and then follow up with a tutor afterwards. But the tutoring in Burke’s and Anania’s experiments wasn’t like that. Tutoring didn’t supplement classroom instruction; tutoring replaced classroom instruction. Tutored students received all instruction from their tutors; they didn’t attend class at all. That’s important because, according to Cohen, Kulik, and Kulik’s meta-analysis, tutoring is about 50 percent more effective when it replaces rather than merely supplements classroom instruction.

It’s great, of course, that Burke’s and Anania’s students received the most effective form of tutoring. But it also means that it wasn’t the kind of tutoring that students commonly receive in an after-school or pull-out program.

All That Glitters

My father may have had a two-sigma tutor in 1945. His tutor couldn’t foresee Anania’s and Burke’s experiments, 40 years in the future, but her approach had several components in common with theirs. She met with her student frequently. She was goal-oriented, striving to ensure that my father mastered the 2nd- and 3rd-grade curricula rather than just putting in time. She didn’t yoke herself to the pace of classroom instruction but moved ahead as quickly as she thought my father could handle. And she checked his comprehension regularly—not with quizzes but with short homework assignments, which she marked and corrected, explaining his mistakes.

But not all tutoring is like that, and some of what passes for tutoring today is much worse than what my father received in 1945.

In the fall of 2020, I learned that my 5th grader’s math scores had declined during the pandemic. I knew that they hadn’t been learning much math, but the fact that their skills had gone backward was a bit of a shock.

To prepare them for what would come next, I told them the story about my father’s 2nd-grade tutor.

“Grandpa got tutored every day for seven weeks?” they asked me. “That seems excessive.”

“You think so?” I asked.

“Yeah—it’s 47 hours!”

“Come again?” I asked.

They reached for a calculator.

Once a week I drove them to a for-profit tutoring center at a nearby strip mall. It was a great time to be in the tutoring business, but this center wasn’t doing great things with the opportunity. My child sat with four other children, filling out worksheets while a lone tutor sat nearby—available for questions, but mostly doing her own college homework and exchanging text messages with her friends. One day my child told me that they had spent the whole hour just multiplying different numbers by eight. They received no homework. From a cognitive-science perspective, I was pretty sure that practicing a single micro-skill for an hour once a week was not optimal. The whole system seemed designed not to catch kids up, but to keep parents coming back and paying for sessions.

Unfortunately, overpriced and perfunctory tutoring is common. In an evaluation of private tutoring services purchased for disadvantaged students by four large school districts in 2008–2012, Carolyn Heinrich and her colleagues found that, even though districts paid $1,100 to $2,000 per eligible student (40 percent more in current dollars), students got only half an hour each week with a tutor, on average. Because districts were paying per student instead of per tutor, most tutors worked with several children at once, providing little individualized instruction, even for children with special needs or limited English. Students met with tutors outside of regular school hours, and student engagement and attendance were patchy.

Only one district—Chicago—saw positive impacts of tutoring, and those impacts averaged just 0.06 standard deviations, or 2 percentile points.

My grandmother would never have stood for that.
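
Effect sizes expressed in standard deviations can be hard to picture. A short sketch, assuming normally distributed test scores and using only Python’s standard library, shows how they translate into percentile points for a student who starts at the median; the function name is mine, for illustration, and isn’t drawn from any of the studies above.

```python
from statistics import NormalDist

def sd_to_percentile_gain(effect_size_sd: float) -> float:
    """How many percentile points an average (50th-percentile) student
    would move up, assuming normally distributed test scores."""
    return 100 * NormalDist().cdf(effect_size_sd) - 50

print(round(sd_to_percentile_gain(0.06)))  # Chicago's tutoring effect: about 2 points
print(round(sd_to_percentile_gain(0.33)))  # a well-designed program: about 13 points
print(round(sd_to_percentile_gain(2.00)))  # Bloom's two sigmas: about 48 points (50th to 98th)
```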

After these results were published, some of Chicago’s most disadvantaged high schools started working with a new provider, Saga Education. Compared to the tutoring services that Heinrich and her colleagues evaluated, Saga’s approach was much more structured and intense. Tutors were trained for 100 hours before starting the school year. They worked with just two students at a time. Tutoring was scheduled like a regular class, so that students met with their tutor for 45 minutes a day, and the way the tutor handled that time was highly regimented. Each tutoring session began with warmup problems, continued with tutoring tailored to each student’s needs, and ended with a short quiz.

The cost of Saga tutoring—$3,500 to $4,300 per student per year—was higher than the programs that Heinrich and her colleagues had evaluated, but the results were much better. According to a 2021 evaluation by Jonathan Guryan and his colleagues, Saga tutoring raised math scores by 0.16 to 0.37 standard deviations. The effect was “sizable,” the authors concluded—it wasn’t two sigmas, but it doubled or even tripled students’ annual gains in math.

Is Two-Sigma Tutoring Real?

The idea that tutoring consistently raises achievement by two standard deviations is exaggerated and oversimplified. The benefits of tutoring depend on how much individualized instruction and feedback students get, how much they practice the tutored skills, and what type of test is used to measure tutoring’s effects. Tutoring effects, as estimated by rigorous evaluations, have ranged from two full standard deviations down to zero or worse. About one-third of a standard deviation seems to be the typical effect of an intense, well-designed program evaluated against broad tests.

The two-sigma effects obtained in the 1980s by Anania and Burke were real and remarkable, but they were obtained on a narrow, specialized test, and they weren’t obtained by tutoring alone. Instead, Anania and Burke mixed a potent cocktail of interventions that included tutoring; training and coaching in effective instructional practices; extra time; and frequent testing, feedback, and retesting.

In short, Bloom’s two-sigma claim had some basis in fact, but it also contained elements of fiction.

Like some science fiction, though, Bloom’s claim has inspired a great deal of real progress in research and technology. Modern cognitive tutoring software, such as ASSISTments or MATHia, was inspired in part by Bloom’s challenge, although what tutoring software exploits even more is the feedback and retesting required for mastery learning. Video tutoring makes human tutors more accessible, and new chatbots have the potential to make AI tutoring almost as personal, engaging, and responsive as human tutoring. Chatbots are also far more available and less expensive than human tutors. Khanmigo, for example, costs $9 a month, or $99 per year.

My own experience suggests that the large language models that undergird AI tutoring, by themselves, quickly get lost when trying to teach common math concepts like the Pythagorean theorem. But combining chatbots’ natural language capabilities with a reliable formal knowledge base—such as a cognitive tutor, a math engine, or an open-source textbook—offers substantial promise.
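
As a rough illustration of that pairing, here is a minimal sketch in which the chatbot handles the conversation while a formal engine handles the mathematics. It assumes a hypothetical ask_chatbot function standing in for a language-model call and uses SymPy as the math engine; none of this reflects any particular product’s design.

```python
# A minimal sketch of pairing a conversational model with a formal math engine.
# The chatbot supplies the explanation; SymPy checks the mathematics.
# ask_chatbot() is a placeholder, not a real vendor API.

from sympy import symbols, sqrt, simplify

def ask_chatbot(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    return f"[chatbot explanation for: {prompt}]"

def check_pythagorean_answer(a, b, student_answer) -> bool:
    """Verify the student's hypotenuse with exact symbolic arithmetic,
    rather than trusting the language model to do the math."""
    x, y = symbols("x y", positive=True)
    hypotenuse = sqrt(x**2 + y**2)
    correct = hypotenuse.subs({x: a, y: b})
    return simplify(correct - student_answer) == 0

def tutor_turn(a, b, student_answer) -> str:
    """Route the math check to SymPy, then ask the chatbot for the right kind of feedback."""
    if check_pythagorean_answer(a, b, student_answer):
        return ask_chatbot(f"Praise the student for correctly finding the hypotenuse of a {a}-{b} right triangle.")
    return ask_chatbot(f"The student answered {student_answer}; walk them through a^2 + b^2 = c^2 for a={a}, b={b}.")

print(tutor_turn(3, 4, 5))   # correct answer -> praise
print(tutor_turn(3, 4, 7))   # wrong answer -> corrective explanation
```

The design choice is the point of the sketch: the language model never decides whether the student’s answer is right; the symbolic engine does.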

There is also the question of how well students will engage with a chatbot. Since chatbots aren’t human, it is easy to imagine that students won’t take them seriously—that they won’t feel as accountable to them as my father felt to his tutor and his mother. Yet students do engage and even open up to chatbots, perhaps because they know they won’t be judged. The most popular chatbots among young people are ones that simulate psychotherapy. How different is tutoring, really?

It seems rash, though, to promise two-sigma effects from AI when human tutoring has rarely produced such large effects, and no evidence on the effects of chatbot tutoring has yet been published. Over-promising can lead to disappointment, and reaching for impossible goals can breed questionable educational practices. There are already both human and AI services that will do students’ homework for them, as well as better-intentioned but still “overly helpful” tutors who help students complete assignments they don’t fully understand. Such tutors may raise students’ grades in the short term, but in the long run they cheat students of the benefits of learning for themselves.

In the early going, it would be sensible simply to aim for effects that approximate the benefits of well-designed human tutoring. Producing benefits of one-third of a standard deviation would be a huge triumph if it could be done at low cost, on a large scale, and on a broad test—all without requiring an army of human tutors, some of whom may not be that invested in the job. Effects of one-third of a standard deviation probably won’t be achieved just by setting chatbots loose in the classroom but might be within reach if we skillfully integrate the new chatbots with resources and strategies from the science of learning. Once effects of one-third of a standard deviation have been produced and verified, we should be able to improve on them through continuous, incremental A/B testing—slowly turning science fiction into science fact.
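
Telling whether an incremental improvement is real takes sizable samples. A rough sketch, using the standard two-group power approximation and Python’s standard library (the numbers here are illustrative, not taken from any planned study):

```python
from statistics import NormalDist

def students_per_group(effect_size_sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group needed to detect a given effect size
    in a two-group comparison of test scores (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size_sd ** 2
    return int(n) + 1

print(students_per_group(0.33))  # about 145 students per group for a one-third-sigma effect
print(students_per_group(0.06))  # several thousand per group for a Chicago-sized effect
```

Detecting a one-third-of-a-standard-deviation effect takes a couple of hundred students; detecting something as small as 0.06 takes thousands, which is one reason A/B testing at scale matters.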
