Technical Poster
August 27, 2025

Interventional Feature Steering on Deterministic Code Tasks

by Nathan Clark

Towards Practical Benchmarks for Mechanistic Interpretability

Poster presented at the New England Mechanistic Interpretability (NEMI) workshop, August 2025

This research represents Noblis’ work on steering Large Language Models (LLMs) toward more concise output through feature steering. The approach works by:

  • Adding small learned vectors inside an LLM that nudge it to write shorter code, without altering the original weights
  • Testing with unit-tested coding tasks for clear pass/fail evaluation
  • Creating a “dial” effect to control which layers to modify and by how much
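The steering idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (toy dimensions, made-up layer indices and strengths, a plain NumPy stand-in for a transformer's residual stream), not the implementation used in the poster: a learned direction is added to a layer's activation, scaled by a per-layer "dial", while the model's weights are never modified.

```python
import numpy as np

def apply_steering(hidden, steering_vec, scale):
    # Shift the activation by a scaled steering direction;
    # the model's weights stay untouched.
    return hidden + scale * steering_vec

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))        # toy residual stream: 4 tokens, width 8
steering_vec = rng.normal(size=(8,))    # hypothetical learned "concise code" direction

baseline = hidden.copy()
layers_to_steer = {5: 0.5, 6: 1.0}      # the "dial": layer index -> strength (illustrative)

for layer_idx in range(12):             # pretend 12-layer forward pass
    if layer_idx in layers_to_steer:
        hidden = apply_steering(hidden, steering_vec, layers_to_steer[layer_idx])
```

In a real model the same addition would typically be applied with a forward hook on the chosen layers, so the intervention can be turned on, scaled, or removed without retraining.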

Why this matters:

  • Lower compute costs and faster results compared to full fine-tuning
  • Clear, auditable control of model behavior
  • Builds trust through explainable AI interventions
  • Provides valuable insights for governance and safety guardrails

    Our benchmark provides an objective yardstick, quantitatively measuring both correctness and token savings.