Motivation & Problem Statement
Unit testing is crucial for software quality but often tedious and repetitive to implement manually. In this seminar, we wanted to explore how large language models (LLMs) could help automate this process, not just by generating tests, but also by learning to generate better ones through iterative feedback.
The idea was to push AI-based coding assistance further by combining LLMs with reinforcement-learning-style feedback, inspired by recent trends in RLHF, where models improve by incorporating preference signals.
Approach & Implementation
My team designed a system that:
- Used pretrained LLMs (such as Microsoft’s Phi-3 or smaller Llama variants) to generate candidate unit tests for a given piece of source code.
- Created two unit test candidates for each code snippet to form “preference pairs.”
- Ran these tests to check compilation, execution success, and coverage, yielding verifiable signals of test quality.
- Passed this information to an LLM acting as a judge, which used chain-of-thought prompting to evaluate the quality of the tests and produce a preference signal.
- Fed this preference signal back into the pipeline via direct preference optimization (DPO). The sketches below illustrate each of these steps.
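To make the generation step concrete, here is a minimal sketch, assuming a Hugging Face checkpoint and the standard `transformers` API; the exact model id, prompt wording, and sampling parameters are placeholders rather than the project's actual configuration.

```python
# Sketch: sample two candidate unit tests per source snippet to form a preference pair.
# The model id and prompt template are placeholders, not the project's exact choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"  # any small instruct model works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_test_pair(source_code: str) -> tuple[str, str]:
    prompt = (
        "Write a pytest unit test for the following Python function. "
        "Return only the test code.\n\n" + source_code
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sampling two sequences from the same prompt yields the two candidates of a pair.
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=2,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidate_a, candidate_b = (
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
    )
    return candidate_a, candidate_b
```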
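The verifiable signals came from actually executing each candidate. A minimal sketch, assuming pytest and coverage.py are available, with the snippet and the candidate test written into a temporary directory (file names are placeholders):

```python
# Sketch: run one candidate test and collect compile / pass / coverage signals.
# Uses pytest and coverage.py; the file layout and names are placeholders.
import json
import subprocess
import tempfile
from pathlib import Path

def evaluate_candidate(source_code: str, test_code: str) -> dict:
    with tempfile.TemporaryDirectory() as tmp_str:
        tmp = Path(tmp_str)
        (tmp / "module_under_test.py").write_text(source_code)
        (tmp / "test_candidate.py").write_text(test_code)

        # 1) Does the test file even compile?
        compiles = subprocess.run(
            ["python", "-m", "py_compile", "test_candidate.py"],
            cwd=tmp, capture_output=True,
        ).returncode == 0

        # 2) Does it pass under pytest, and how much coverage does it reach?
        passed, coverage_pct = False, 0.0
        if compiles:
            run = subprocess.run(
                ["coverage", "run", "-m", "pytest", "-q", "test_candidate.py"],
                cwd=tmp, capture_output=True,
            )
            passed = run.returncode == 0
            subprocess.run(["coverage", "json", "-o", "cov.json"],
                           cwd=tmp, capture_output=True)
            report = tmp / "cov.json"
            if report.exists():
                coverage_pct = json.loads(report.read_text())["totals"]["percent_covered"]

        return {"compiles": compiles, "passed": passed, "coverage": coverage_pct}
```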
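From the two signal dictionaries, the judge produces a preference, and the resulting pairs feed DPO. The sketch below assumes TRL's `DPOTrainer` with its `prompt`/`chosen`/`rejected` dataset format; the judging prompt and the `ask_judge` helper are hypothetical placeholders, and keyword argument names vary slightly across TRL versions.

```python
# Sketch: turn execution signals into a preference via an LLM judge, then run DPO.
# `ask_judge` is a hypothetical helper that queries the judge model and returns "A" or "B".
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

JUDGE_PROMPT = """You are reviewing two candidate unit tests for the same function.
Candidate A: compiles={a_compiles}, passed={a_passed}, coverage={a_cov}%
Candidate B: compiles={b_compiles}, passed={b_passed}, coverage={b_cov}%
Think step by step about which candidate is the better test, then answer "A" or "B"."""

def build_preference_record(prompt, cand_a, cand_b, sig_a, sig_b, ask_judge):
    verdict = ask_judge(JUDGE_PROMPT.format(
        a_compiles=sig_a["compiles"], a_passed=sig_a["passed"], a_cov=sig_a["coverage"],
        b_compiles=sig_b["compiles"], b_passed=sig_b["passed"], b_cov=sig_b["coverage"],
    ))
    chosen, rejected = (cand_a, cand_b) if verdict == "A" else (cand_b, cand_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def train_dpo(model, tokenizer, records):
    # records: a list of dicts produced by build_preference_record
    dataset = Dataset.from_list(records)
    config = DPOConfig(output_dir="dpo-test-generator", beta=0.1)
    trainer = DPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    )
    trainer.train()
```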
The technical stack included:
- Python-based orchestration to automate generation, compilation, and feedback collection.
- Docker containers to ensure reproducible environments for test execution and LLM evaluation (see the sketch below).
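As a rough illustration of how the orchestration and Docker pieces fit together, the sketch below runs a candidate's tests inside a disposable container. The image name `unit-test-runner` is a placeholder, not the project's actual image; it is assumed to have Python and pytest preinstalled.

```python
# Sketch: run a candidate's tests inside a disposable Docker container for isolation.
# "unit-test-runner" is a placeholder image name with Python and pytest preinstalled.
import subprocess

def run_in_container(workdir: str, timeout: int = 60) -> bool:
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",           # no network access for generated code
            "-v", f"{workdir}:/work:ro",   # mount candidate files read-only
            "-w", "/work",
            "unit-test-runner",
            "python", "-m", "pytest", "-q", "-p", "no:cacheprovider",
        ],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

Isolating execution this way also guards against generated tests with unwanted side effects, which is why the container runs without network access and with a read-only mount.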
Results & Achievements
- The system automatically generated unit tests that compiled and passed basic execution checks, significantly reducing manual effort.
- Evaluations revealed that the preference dataset did contain a clear signal for the LLM to learn from; however, the improvement did not generalize to newly generated tests, most likely because a larger dataset would have been needed, which was not feasible with the available hardware resources.
Learnings & Reflections
This project was a crash course in:
- LLM-driven software generation and the practical challenges of getting it to work reliably in dynamic code environments.
- Designing reward and preference signals for iterative improvement, balancing purely objective metrics (like coverage) with more nuanced quality judgments.
- The limitations of small datasets, and why robust data pipelines and careful reward engineering are key to production-ready systems.
Overall, the project greatly improved my understanding of, and gave me practical experience with, the cutting edge of reinforcement learning with LLMs. While not perfectly designed, the idea was quite close to what major AI labs now use to train LLMs to reason at test time with GRPO. We even suggested this possibility in the Future Work section of our report.