Towards Practical Benchmarks for Mechanistic Interpretability
Poster presented at the New England Mechanistic Interpretability (NEMI) workshop, August 2025
This research presents Noblis’ work on using feature steering to make Large Language Models (LLMs) more efficient. The approach works by:
- Adding small learned steering vectors to an LLM’s internal activations that nudge it to write shorter code without altering the original weights
- Testing with unit-tested coding tasks for clear pass/fail evaluation
- Exposing a “dial” that controls which layers are modified and by how much
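The core intervention can be sketched in a few lines. The function and variable names below are illustrative, not the actual implementation: a learned vector is added to a layer’s hidden states, scaled by a coefficient that acts as the “dial.”

```python
import numpy as np

def apply_steering(hidden, steering_vec, alpha):
    """Add a scaled steering vector to one layer's hidden states.

    The base model weights are untouched; only the activation
    passing through the chosen layer is nudged.
    """
    return hidden + alpha * steering_vec

# Toy example: 2 token positions, hidden size 4.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 4))
steering_vec = rng.standard_normal(4)  # broadcast across positions

# The "dial": alpha = 0 leaves the model unchanged; larger alpha
# steers harder. Which layers receive the vector is a second dial.
for alpha in (0.0, 0.5, 1.0):
    steered = apply_steering(hidden, steering_vec, alpha)
    print(alpha, float(np.linalg.norm(steered - hidden)))
```

Because the vector is additive and scaled, setting the coefficient to zero recovers the unmodified model exactly, which is what makes the intervention auditable.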
Why this matters:
- Lower compute costs and faster turnaround than full fine-tuning
- Clear, auditable control over model behavior
- Greater trust through explainable AI interventions
- Useful insights for governance and safety guardrails
Our benchmark provides an objective, quantitative yardstick for measuring both correctness and token savings.
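A minimal sketch of how such a yardstick can be scored, assuming a hypothetical per-task record with a unit-test pass flag and token counts for the steered and unsteered models (the field names are illustrative, not the benchmark’s actual schema):

```python
def evaluate(results):
    """Score a benchmark run on correctness and token savings.

    `results` is a list of dicts with illustrative keys:
      passed          - did the generated code pass its unit tests?
      tokens          - tokens used by the steered model
      baseline_tokens - tokens used by the unsteered model
    """
    pass_rate = sum(r["passed"] for r in results) / len(results)
    saved = sum(r["baseline_tokens"] - r["tokens"] for r in results)
    total = sum(r["baseline_tokens"] for r in results)
    return pass_rate, saved / total

# Hypothetical run over three unit-tested coding tasks.
runs = [
    {"passed": True,  "tokens": 80, "baseline_tokens": 100},
    {"passed": True,  "tokens": 60, "baseline_tokens": 100},
    {"passed": False, "tokens": 90, "baseline_tokens": 100},
]
pass_rate, token_savings = evaluate(runs)
print(pass_rate, token_savings)  # 2/3 pass rate, ~23% fewer tokens
```

Reporting the two numbers separately makes the trade-off explicit: a stronger steering coefficient may save more tokens at the cost of a lower pass rate.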