Towards Practical Benchmarks for Mechanistic Interpretability
Poster presented at the New England Mechanistic Interpretability (NEMI) workshop, August 2025
This research presents Noblis’ work on using feature steering to make Large Language Models (LLMs) more efficient. The approach works by:
- Adding small learned steering vectors to an LLM’s internal activations that nudge it to write shorter code without altering the original weights
- Testing with unit-tested coding tasks for clear pass/fail evaluation
- Exposing a “dial” that controls which layers are modified and by how much
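The core intervention can be sketched in a few lines. The function and variable names below are illustrative, not the actual implementation: a learned vector is added to a layer’s hidden states, scaled by a coefficient that acts as the “dial.”

```python
import numpy as np

def apply_steering(hidden, steering_vec, alpha):
    """Add a scaled steering vector to one layer's hidden states.

    The base model weights are untouched; only the activation
    passing through the chosen layer is nudged.
    """
    return hidden + alpha * steering_vec

# Toy example: 2 token positions, hidden size 4.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 4))
steering_vec = rng.standard_normal(4)  # broadcast across positions

# The "dial": alpha = 0 leaves the model unchanged; larger alpha
# steers harder. Which layers receive the vector is a second dial.
for alpha in (0.0, 0.5, 1.0):
    steered = apply_steering(hidden, steering_vec, alpha)
    print(alpha, float(np.linalg.norm(steered - hidden)))
```

Because the vector is additive and scaled, setting the coefficient to zero recovers the unmodified model exactly, which is what makes the intervention auditable.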
Why this matters:
- Lower compute costs and faster turnaround than full fine-tuning
- Clear, auditable control over model behavior
- Greater trust through explainable AI interventions
- Useful insights for governance and safety guardrails
Our benchmark provides an objective, quantitative yardstick for measuring both correctness and token savings.
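A minimal sketch of how such a yardstick can be scored, assuming a hypothetical per-task record with a unit-test pass flag and token counts for the steered and unsteered models (the field names are illustrative, not the benchmark’s actual schema):

```python
def evaluate(results):
    """Score a benchmark run on correctness and token savings.

    `results` is a list of dicts with illustrative keys:
      passed          - did the generated code pass its unit tests?
      tokens          - tokens used by the steered model
      baseline_tokens - tokens used by the unsteered model
    """
    pass_rate = sum(r["passed"] for r in results) / len(results)
    saved = sum(r["baseline_tokens"] - r["tokens"] for r in results)
    total = sum(r["baseline_tokens"] for r in results)
    return pass_rate, saved / total

# Hypothetical run over three unit-tested coding tasks.
runs = [
    {"passed": True,  "tokens": 80, "baseline_tokens": 100},
    {"passed": True,  "tokens": 60, "baseline_tokens": 100},
    {"passed": False, "tokens": 90, "baseline_tokens": 100},
]
pass_rate, token_savings = evaluate(runs)
print(pass_rate, token_savings)  # 2/3 pass rate, ~23% fewer tokens
```

Reporting the two numbers separately makes the trade-off explicit: a stronger steering coefficient may save more tokens at the cost of a lower pass rate.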