Blog

How Coursera builds next-generation learning tools

Ornella Altunyan, Winnie Tam, Sophie Gao
12 May 2025

Coursera is a global online learning platform serving millions of learners and enterprise customers. As they began adopting large language models (LLMs) to enhance their user experience, particularly with their Coursera Coach chatbot and AI-assisted grading tools, they quickly realized the need for a better evaluation workflow. In this case study, we'll share how Coursera built a structured evaluation process to quickly ship reliable AI features that customers love.

Scaling AI evaluation

Before establishing a formal evaluation framework, Coursera relied on fragmented offline jobs, spreadsheets, and manual human labeling. Error detection depended on hand reviews of data, and collaboration across teams was hard because each group wrote its own evaluation scripts. As a result, it was difficult to validate AI features quickly and push them to production with confidence.

The business impact of AI features

To show just how important it was for Coursera to get these AI features right, let's dig into the business impact and the metrics that demonstrate their value. These aren't just experimental technologies; they're core features delivering measurable results for learners and the company.

The Coursera Coach serves as a 24/7 learning assistant and psychological support system for students, maintaining an impressive 90% learner satisfaction rating¹. The impact extends beyond satisfaction metrics: users engaging with Coach complete courses faster and finish more courses overall. By providing judgment-free assistance at any hour, Coach has become an integral part of the learning experience.

Coursera Coach

Automated grading addresses a critical scaling challenge in Coursera's educational model. Before AI, grading was done manually by teaching assistants and peers, and neither approach scaled well: teaching assistants could evaluate learners' skills thoroughly but at high cost, while peer grading scaled better but often produced feedback of variable quality. The automated system now provides consistent, fair assessment with actionable feedback, significantly reducing grading time while maintaining educational quality. Learners receive grades within 1 minute of submission and benefit from approximately 45× more feedback, driving a 16.7% increase in course completions within a day of peer review².

AI grader

Evaluating AI features with Braintrust

The teams at Coursera use a four-step approach to evaluating their AI features.

Four-step approach

1. Define clear evaluation criteria upfront

They start by establishing exactly what "good enough" looks like before development begins. For each AI feature, they identify the specific output characteristics that matter most to their users and business goals.

For the Coach chatbot, they evaluate various quality metrics including response appropriateness, formatting consistency, content relevance, and performance standards for natural conversation flow. Their automated grading system is measured on alignment with human evaluation benchmarks, feedback effectiveness, clarity of assessment criteria, and equitable evaluation across diverse submissions.
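
To make this concrete, here's one way such criteria could be codified so they're reviewable and enforceable before any model work starts. This is an illustrative sketch, not Coursera's actual rubric; the metric names and thresholds are assumptions.

```python
# Illustrative sketch only: codifying "good enough" as explicit, reviewable
# criteria before development starts. Metric names and thresholds are
# assumptions, not Coursera's actual values.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str          # what is measured
    description: str   # how a reviewer or scorer should interpret it
    threshold: float   # minimum average score (0-1) required to ship

COACH_CRITERIA = [
    Criterion("appropriateness", "Response is on-topic and suitable for a learner", 0.90),
    Criterion("formatting", "Structure and length follow the style guide", 0.95),
    Criterion("relevance", "Answer addresses the learner's actual question", 0.90),
]

GRADER_CRITERIA = [
    Criterion("human_alignment", "Grade falls within one band of a human grader", 0.85),
    Criterion("feedback_quality", "Feedback is specific and actionable", 0.80),
]

def meets_bar(results: dict[str, float], criteria: list[Criterion]) -> bool:
    """Gate a release on every criterion clearing its threshold."""
    return all(results.get(c.name, 0.0) >= c.threshold for c in criteria)
```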

Key practice: Define what success looks like before building, not after.

2. Curate targeted datasets

Dataset quality drives evaluation quality, which is why Coursera invests in creating comprehensive test data. Their team manually reviews anonymized chatbot transcripts and human-graded assignments, paying special attention to interactions with explicit user feedback (like thumbs up/down ratings) and pulling out challenging real-world examples that might expose weaknesses. They supplement this data with LLM-generated synthetic datasets that target edge cases.

This balanced approach ensures their evaluation covers both typical use cases and the edge scenarios where AI typically struggles, giving them confidence that new features will perform well across all situations.
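
As a rough illustration of this curation step, the sketch below mixes human-reviewed production examples (filtered by explicit feedback) with LLM-generated edge cases. The file format, field names, and use of the OpenAI client are assumptions for the example, not Coursera's pipeline.

```python
# Illustrative sketch of dataset curation: combine anonymized, human-reviewed
# production examples (filtered by explicit feedback) with LLM-generated
# synthetic edge cases. File format, field names, and the model used are
# assumptions for the example.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def synthesize_edge_cases(n: int = 20) -> list[dict]:
    """Ask an LLM for tricky learner questions a course assistant might mishandle."""
    prompt = (
        f"Generate a JSON array of {n} short, challenging questions a learner might "
        "ask an online-course assistant (ambiguous, off-topic, or adversarial). "
        "Return only the JSON array of strings."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    questions = json.loads(resp.choices[0].message.content)
    return [{"input": q, "tags": ["synthetic", "edge_case"]} for q in questions]

def build_dataset(reviewed_path: str) -> list[dict]:
    """Mix flagged real transcripts with synthetic edge cases."""
    with open(reviewed_path) as f:
        real = [json.loads(line) for line in f]  # one JSON record per line
    # Keep interactions where learners left explicit thumbs up/down feedback.
    flagged = [r for r in real if r.get("feedback") in ("thumbs_up", "thumbs_down")]
    return flagged + synthesize_edge_cases()
```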

Key practice: Balance real-world examples with synthetic data to test both common scenarios and edge cases.

3. Implement both heuristic and model-based scorers

Coursera's evaluation approach combines the precision of code-based checks with the nuance of AI-based judgments. Their heuristic checks provide deterministic evaluation of objective criteria, like format and response structure. For more subjective assessment, they employ LLM-as-a-judge evaluations to assess quality across multiple dimensions, including response accuracy and alignment with core teaching principles.

They round out evaluation with performance metrics that monitor latency, response time, and resource utilization to make sure AI features maintain operational excellence in addition to output quality.
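
The sketch below shows the two scorer styles side by side: a deterministic format check and an LLM-as-a-judge scorer that grades against a rubric. The rubric wording, rating scale, and model choice are illustrative assumptions, not Coursera's actual scorers.

```python
# Sketch of the two scorer styles: a deterministic format check and an
# LLM-as-a-judge grader. Rubric wording, scale, and model are illustrative.
import re
from openai import OpenAI

client = OpenAI()

def format_scorer(output: str) -> float:
    """Heuristic check: non-empty, within a length budget, no raw HTML tags."""
    if not output.strip():
        return 0.0
    too_long = len(output) > 4000
    has_html = bool(re.search(r"</?\w+>", output))
    return 0.0 if (too_long or has_html) else 1.0

def judge_scorer(question: str, output: str) -> float:
    """LLM-as-a-judge: rate accuracy and teaching tone on 1-5, normalized to 0-1."""
    rubric = (
        "You are evaluating an online-learning assistant.\n"
        f"Question: {question}\nResponse: {output}\n"
        "Rate the response from 1 (poor) to 5 (excellent) for factual accuracy and "
        "alignment with good teaching practice. Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": rubric}],
    )
    return (int(resp.choices[0].message.content.strip()) - 1) / 4
```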

Key practice: Create a mix of deterministic checks and AI-based evaluations to balance strict requirements with nuanced quality assessment.

4. Run evaluations and iterate rapidly

With evaluation infrastructure in place through Braintrust, Coursera maintains continuous quality awareness through three tracks. Their online monitoring logs production traffic through evaluation scorers, tracking real-time performance against established metrics and alerting on significant deviations. Offline testing runs comprehensive evaluations on curated datasets, comparing performance across different model parameters and detecting potential regressions before deployment.

For new features, their rapid prototyping process creates sample use cases in Braintrust's playground, comparing different models and testing feasibility before committing to full development. This approach allows them to catch issues early, communicate findings clearly across teams, and iterate quickly based on concrete data.
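
Here's a hand-written sketch of the offline-testing track: score a curated dataset with a candidate configuration, aggregate per-metric results, and flag regressions against a baseline. In Coursera's setup this runs through Braintrust rather than a custom loop; the function names and tolerance here are assumptions made for the example.

```python
# Hand-written sketch of the offline-testing track: score a curated dataset
# with a candidate configuration and flag regressions against a baseline.
# Names and the tolerance value are assumptions made for the example.
from statistics import mean
from typing import Callable

def run_offline_eval(
    dataset: list[dict],
    task: Callable[[str], str],
    scorers: dict[str, Callable[[str, str], float]],
) -> dict[str, float]:
    """Return the mean score per scorer across the whole dataset."""
    totals: dict[str, list[float]] = {name: [] for name in scorers}
    for example in dataset:
        output = task(example["input"])
        for name, scorer in scorers.items():
            totals[name].append(scorer(example["input"], output))
    return {name: mean(vals) for name, vals in totals.items()}

def detect_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, float]]:
    """Flag any metric whose mean dropped by more than `tolerance`."""
    return {
        m: (baseline[m], candidate[m])
        for m in baseline
        if candidate.get(m, 0.0) < baseline[m] - tolerance
    }
```

Reducing regression detection to a simple threshold on mean scores also keeps go/no-go decisions easy to explain across teams.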

Key practice: Establish both real-time monitoring and batch testing processes to continuously validate AI performance.

Results: Better AI features, faster development

Coursera's structured evaluation framework has transformed their AI development process with benefits across their organization. Teams now validate changes with objective measures, significantly increasing development confidence. The data-driven approach moves ideas from concept to release faster, with clear metrics supporting go/no-go decisions. Perhaps most importantly, standardized evaluation metrics have created a common language for discussing AI quality across teams and roles, while enabling more comprehensive and thorough testing than was previously possible.

As a more concrete example, early automated grading prototypes focused on valid submissions. Through the structured evaluation process, the team discovered that vague or low-effort answers could still receive high scores. They went back and added more negative test cases to their evaluation datasets, which improved overall grading quality.
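
For illustration, a negative test case of the kind described here might look like the following; the fields, score cap, and check are hypothetical.

```python
# Hypothetical example of a negative test case added after this finding:
# a vague submission that must not receive a high grade. Fields, the score
# cap, and the check are assumptions for illustration.
negative_cases = [
    {
        "input": {
            "assignment": "Explain how gradient descent updates model weights.",
            "submission": "It just makes the model better over time.",
        },
        "expected": {"max_score": 0.3},  # vague answers must stay below this cap
        "tags": ["negative", "vague_answer"],
    },
]

def vague_answer_check(graded_score: float, expected: dict) -> bool:
    """Pass only if the grader keeps the vague submission at or below the cap."""
    return graded_score <= expected["max_score"]
```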

Practical lessons for organizations adopting AI evaluation

Based on Coursera's experience, here are the key takeaways for implementing your own AI evaluation system:

  • Start with clear success criteria: Define what "good" looks like before building, not after.
  • Balance evaluation methods: Use both deterministic checks for non-negotiable requirements and AI-based evaluation for more subjective quality aspects.
  • Build realistic test data: Invest in dataset curation that reflects actual use cases, including edge cases where AI typically struggles.
  • Consider the full spectrum of metrics: Evaluate not just output quality but also operational aspects like latency and resource usage.
  • Integrate evaluation throughout development: Make testing a continuous process, not just a final validation step.

By establishing a robust evaluation foundation, Coursera has positioned itself to confidently expand AI features while maintaining quality and user trust. If you’re looking to do the same, get in touch.

Learn more about Coursera and Braintrust.

Thank you to Winnie and Sophie for sharing these insights!

¹ Coursera Coach Learner Survey, Q1 2025

² https://blog.coursera.org/ai-grading-in-peer-reviews-enhancing-courseras-learning-experience-with-faster-high-quality-feedback/
