Teaching a 0.5B Model to Be an Executive Assistant

36 hours, 90 minutes of training, three model collapses, and one bar chart that nearly killed me.

Let me start with the moment I thought I'd lost the hackathon.

It's around 3pm on day two. I've been training for an hour and a half on a free Colab T4, and the cell finally finishes. I run the eval. The plot pops up. Baseline bars in red, trained bars in green. Easy task: 0.493 to 0.200. Medium: 0.348 to 0.200. Hard: 0.331 to 0.186.

The trained model was worse. Identically worse on every task, by exactly the same number every time. I sat there for a full minute trying to figure out if I was reading the chart wrong.

Turns out I wasn't. This is what GRPO collapse looks like, and now I had a textbook example of it. The model had given up exploring, found a single response that scored exactly 0.2 against my reward function no matter what input you gave it, and locked itself in. All my training was technically optimization. It was just optimization toward "say the dumbest possible thing every time and never deviate."

Here's how I dug out of it.

The environment

I built ExecAssist, an OpenEnv environment that simulates an executive's morning inbox. The agent gets emails (sender, subject, body, priority), a calendar (existing meetings, working hours), and contacts. It has to spit out a JSON action: an email reply, a calendar action like book or propose_alternatives, and meeting details if it's booking something.

Three difficulty tiers:

Easy. One email, calendar is wide open. Don't blow the basics.
Medium. The requested time conflicts with an existing meeting. Spot it, propose alternatives.
Hard. Three emails, multi-party coordination, priority conflicts. Actually plan.

Reward is a weighted blend of three independent graders. Email quality scores politeness markers, greeting and closing, sufficient detail. Scheduling correctness checks for double-booking, working hours, sensible duration. Conflict resolution looks for whether the agent recognized a clash and proposed real alternatives. Then four anti-hacking penalties on top: short emails, missing meeting details, generic phrasing, overly long responses.

I went with multiple independent graders because I'd read this line in the hackathon guide that stuck with me: "if you only have a single reward signal, it is easier for the model to hack it." Building it the right way from day one felt like the obvious move. I would later be very glad I did.

What collapsed, and why

The first config looked reasonable on paper:

GRPOConfig(
    learning_rate=5e-6,
    num_train_epochs=1,
    num_generations=4,
    # no beta term
)

Looking at the training log afterward, I could see exactly what happened. Reward bounced between 0.0 and 0.4 for the first seven steps, peaked at 0.397, then crashed and pinned itself at 0.14 for the next 38 steps. The model had found a safe floor and stopped trying anything that might score higher because variance was punishing it.

The diagnosis came down to three problems. No KL penalty meant nothing was anchoring the trained policy to the base model, so it was free to drift to any degenerate local minimum. The learning rate was too aggressive, which compounded the drift. And one epoch wasn't enough runway to recover even if it wanted to.

The fix was three changes:

GRPOConfig(
    learning_rate=1e-6,           # 5x slower
    num_train_epochs=3,           # 3x longer
    num_generations=8,            # more variety per step
    beta=0.1,                     # KL penalty, the critical one
)

I also dropped in a "reload clean model" cell before training started. The previous run had corrupted the weights, and I didn't want to start the next attempt from a broken policy. Then I hit Run All and walked away for 90 minutes.

The second run

Came back to find the cell still chugging along. 218 of 270 steps done. Trained weights were already in memory, so I ran the eval cells anyway and held my breath while the bars rendered.

Easy task: 0.345 to 0.995.
Medium: 0.227 to 0.745.
Hard: 0.249 to 0.737.

I made a noise. I'm not going to describe what kind.

Nine out of ten samples on the easy task hit a perfect 1.0. The model wasn't just lucky on those runs, it had learned the structure of the task. You could see it in the variance. Baseline scores ranged from 0.0 to 0.65 on the same prompt depending on how the dice rolled. Trained scores were tight: 0.95 to 1.0 on easy, 0.68 to 0.82 on medium, 0.65 to 0.80 on hard.

The training curve told the same story. First quartile mean reward: 0.390. Last quartile: 0.648. A 66% lift during training itself, on top of the much bigger gap between trained and untrained at evaluation time.

Training results showing reward and loss curves with baseline vs trained comparison

The interesting part: watching the model try to cheat

Because the reward had multiple independent components instead of one big scalar, I could see exactly how the model tried to game each one during training. Going through the early-step rollouts in order:

Steps 8 to 15: the model started outputting just the JSON action and skipping the email body entirely. The short-email penalty (-0.30 if under 20 words) caught this. Within about 30 steps, every output had a real email attached.
Around step 25: it started scheduling meetings at 8am or 6pm. Outside working hours. The scheduling-correctness grader returned 0 for the working-hours check, and within 50 steps the model had figured that out.
Around step 50: generic templated phrasing showed up. "Thank you for your email. I'll check the calendar and get back to you." Polite, vague, useless. The generic-phrasing detector docked these. Specific responses came back.
Throughout: the model would occasionally just forget the meeting_details block. The -0.40 penalty made that a non-strategy fast.

Most submissions claim they have anti-hacking penalties. Showing the penalties firing during real training, on a real curve, is the rare part. It's the difference between saying a multi-grader rubric works and demonstrating that it does.

The bigger claim

One sanity check from earlier in the day stuck with me. Same three tasks, same scoring, but evaluated against an untuned Nemotron 120B running through OpenRouter via the standard inference.py baseline. It averaged 0.337 across the three tasks.

After 90 minutes of GRPO, a model 240 times smaller is averaging 0.83 on the same environment. Free Colab T4. Zero-dollar cloud bill.

That's the point of training-environment design. A reward signal that's hard to game, a model that's allowed to actually train against it, and you get a result that beats a frontier model on its native task.

I think there's a research-shaped argument hiding in here. Frontier LLMs are notoriously bad at structured calendar reasoning. Try asking any production agent to find a 30-minute slot that doesn't conflict with your standups. ExecAssist isolates that specific failure mode into a tractable RL target. The result suggests that for a class of structured personal-task workflows, task-specific RL on a small model is a real alternative to scaling up. That feels like a workshop paper, maybe.

Try it yourself

Live environment: devanshudon-exec-assist.hf.space/docs. Hit POST /reset?task=easy, then POST /step with your action. Swagger does most of the work.
Baseline: python inference.py reproduces the untrained scores around 0.32 average.
Training: the Colab notebook is in the repo. Set runtime to T4, Run All, about 50 minutes including evaluation.
Repo: all the code, the working hyperparameters, the broken hyperparameters that caused the collapse, the results JSON, and this writeup.

The environment is hard in interesting ways. Try to break it. I'd be curious what the model learns to game next.