Nous

Metacognitive evaluation module — real-time quality judgment of agent behavior with homeostatic feedback.

Nous is the metacognitive evaluation module of the Agent OS -- the "Pepe Grillo" (Jiminy Cricket) that judges agent behavior as it runs. It provides real-time quality scoring across five evaluation layers, feeding signals back into Autonomic's homeostatic control loops.

The name comes from the Greek nous ("mind" or "intellect") -- the faculty of rational thought and self-reflection.

Design principles

Evaluation is layered. Agent quality is not a single number. Nous evaluates across five distinct layers -- reasoning, action, execution, safety, and cost -- so that specific failures can be diagnosed and addressed. A high-cost but correct result is a different problem than a cheap but incoherent one.

Inline evaluation must be fast. Heuristic evaluators run as middleware hooks in the Arcan agent loop with a strict budget of less than 2ms per evaluation. No I/O, no allocations in the hot path. If you need an LLM to judge quality, that runs asynchronously via the nous-judge crate.

Quality scores feed homeostasis. Nous does not merely report scores -- it feeds them into Autonomic's regulation loops. Low quality scores can trigger operating mode changes, budget adjustments, or strategy shifts. This closes the feedback loop between evaluation and behavior.

The five evaluator layers

Every evaluator belongs to exactly one layer. Layers enable aggregation, filtering, and targeted intervention.

| Layer | What it measures | Example evaluators |
| --- | --- | --- |
| Reasoning | Coherence, completeness, logical soundness | Plan quality, plan adherence (LLM-as-judge) |
| Action | Tool usage correctness, argument validity | Tool argument validation, correct tool selection |
| Execution | Efficiency, iteration count, token usage | Token efficiency, step count optimization |
| Safety | Policy compliance, blocklist checks, capability enforcement | Safety compliance checks, policy violation detection |
| Cost | Budget adherence, spend velocity, resource efficiency | Budget adherence scoring, cost-per-quality ratio |
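
The layer taxonomy and score type can be sketched in Rust. The names `EvalLayer` and `EvalScore` come from the nous-core crate description above; the fields and the clamping constructor are illustrative assumptions, not the actual definitions.

```rust
// Hypothetical sketch of the nous-core taxonomy; fields are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum EvalLayer {
    Reasoning,
    Action,
    Execution,
    Safety,
    Cost,
}

/// A single evaluation result, normalized to [0.0, 1.0].
#[derive(Debug, Clone)]
pub struct EvalScore {
    pub layer: EvalLayer,
    pub evaluator: &'static str,
    pub value: f64,
}

impl EvalScore {
    /// Clamp into [0.0, 1.0] so downstream aggregation can assume a fixed range.
    pub fn new(layer: EvalLayer, evaluator: &'static str, value: f64) -> Self {
        Self { layer, evaluator, value: value.clamp(0.0, 1.0) }
    }
}
```

Keeping the layer on every score is what makes per-layer aggregation and targeted intervention possible downstream.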

Hybrid architecture

Nous uses a hybrid deployment model: fast inline heuristics embedded in the agent loop, plus optional async LLM-as-judge evaluators running in a separate daemon.

Inline evaluators (less than 2ms)

Inline evaluators run as Arcan middleware hooks. They inspect events in real time and produce scores without any I/O:

  • Token efficiency -- are we using tokens wisely relative to task complexity?
  • Budget adherence -- are we tracking within budget constraints?
  • Tool correctness -- did the tool call use valid arguments?
  • Argument validity -- are tool arguments well-formed and reasonable?
  • Safety compliance -- does this action comply with the agent's policy manifest?
  • Step efficiency -- are we making progress or spinning?

Inline scores flow directly into Autonomic's regulation fold.
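
As a concrete illustration, here is what a step-efficiency heuristic might look like as a pure, allocation-free function. The names `StepEfficiency` and `LoopEvent`, and the scoring formula, are assumptions for illustration; the real evaluators live in nous-heuristics behind the `NousEvaluator` trait.

```rust
// Hypothetical inline evaluator; names and formula are illustrative.

/// Minimal view of a loop tick the evaluator inspects.
pub struct LoopEvent {
    pub step: u32,
    pub tool_calls_made: u32,
}

/// Scores whether the agent is making progress or spinning: penalizes
/// ticks that issue no tool calls after the first couple of steps.
pub struct StepEfficiency {
    pub max_steps: u32,
}

impl StepEfficiency {
    /// Pure function, no I/O or heap allocation: compatible with a
    /// strict <2ms middleware budget.
    pub fn evaluate(&self, event: &LoopEvent) -> f64 {
        let spin_penalty =
            if event.tool_calls_made == 0 && event.step > 2 { 0.5 } else { 0.0 };
        let budget_used = f64::from(event.step) / f64::from(self.max_steps);
        (1.0 - budget_used.min(1.0) * 0.5 - spin_penalty).max(0.0)
    }
}
```

The key design constraint is that `evaluate` touches only the data already in the event: anything requiring I/O or an LLM call belongs in the async path instead.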

Async evaluators (LLM-as-judge)

Async evaluators use an LLM to judge higher-level quality. They run in the nousd daemon process and evaluate after the fact:

  • Plan quality -- is the agent's plan well-structured and likely to succeed?
  • Plan adherence -- is the agent actually following its stated plan?
  • Task completion -- did the agent achieve the user's objective?

Async scores are persisted to Lago as eval.* events and flow into Autonomic via journal subscription.

Architecture

aios-protocol (canonical contract)
    |
nous-core (types + traits, zero I/O)
    |           \              \
nous-heuristics  nous-judge    nous-lago (+ lago-core, lago-journal)
    |           /
nous-middleware (+ arcan-core)
    |
nous-api (axum)
    |
nousd (binary)

| Crate | Role |
| --- | --- |
| nous-core | Pure types: NousEvaluator trait, EvalScore, EvalResult, EvalLayer taxonomy |
| nous-heuristics | Inline evaluators: token efficiency, budget adherence, tool correctness, safety compliance |
| nous-middleware | Arcan middleware integration -- wires evaluators into the agent loop |
| nous-judge | Async LLM-as-judge: plan quality, plan adherence, task completion |
| nous-lago | Lago persistence bridge for evaluation events |
| nous-api | HTTP API (axum): /eval/{session}, /eval/run |
| nousd | Daemon binary |

Event namespace

All Nous events use EventKind::Custom with the prefix "eval.":

  • eval.score -- individual evaluation result with layer, score, and metadata
  • eval.batch -- batch of scores from a single evaluation pass
  • eval.judge_result -- async LLM-as-judge evaluation result

EvalScore is OTel-aligned: it emits gen_ai.evaluation.result span events via Vigil.
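
Since Autonomic picks up async scores by subscribing to the journal, namespace handling reduces to prefix matching. A minimal sketch, assuming string-typed custom event kinds (the helper names below are illustrative, not part of the actual API):

```rust
// Illustrative helpers for the "eval." namespace; only the prefix and the
// event kind names (eval.score, eval.batch, eval.judge_result) come from
// the docs above.

pub const EVAL_PREFIX: &str = "eval.";

/// Build a namespaced custom event kind, e.g. "eval.score".
pub fn eval_kind(name: &str) -> String {
    format!("{}{}", EVAL_PREFIX, name)
}

/// True if an event kind belongs to the Nous evaluation namespace --
/// the predicate a journal subscription would filter on.
pub fn is_eval_event(kind: &str) -> bool {
    kind.starts_with(EVAL_PREFIX)
}
```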

Score flow

Nous produces EvalScore values that flow through two paths:

  1. Inline path: Heuristic evaluator produces score in middleware hook. Score is sent directly to Autonomic's regulation fold. Latency: less than 2ms.

  2. Async path: LLM-as-judge evaluator produces score in nousd. Score is persisted as a Lago event. Autonomic picks it up via journal subscription. Latency: seconds (LLM call).

Both paths ultimately feed Autonomic, which adjusts the agent's operating mode and gating profile based on quality signals. Low reasoning scores might trigger a shift to Verify mode; high spend paired with low quality might trigger Conserving mode.
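
The mode transitions described above can be sketched as a pure decision function. The mode names come from the text; the thresholds, the `regulate` function, and the score semantics (a high cost score meaning good budget adherence) are assumptions, not Autonomic's actual regulation fold.

```rust
// Illustrative sketch of quality-driven mode selection; thresholds are made up.

#[derive(Debug, PartialEq)]
pub enum OperatingMode {
    Normal,
    Verify,     // double-check outputs when reasoning quality drops
    Conserving, // cut spend when budget adherence and quality are both poor
}

/// Pick a mode from aggregated per-layer scores in [0.0, 1.0].
/// `cost_score` is high when the agent is tracking within budget.
pub fn regulate(reasoning_score: f64, cost_score: f64) -> OperatingMode {
    if cost_score < 0.4 && reasoning_score < 0.6 {
        OperatingMode::Conserving
    } else if reasoning_score < 0.6 {
        OperatingMode::Verify
    } else {
        OperatingMode::Normal
    }
}
```

Keeping the decision a pure function of the aggregated scores is what lets the same logic consume both inline scores and journal-delivered async scores.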

Quality scores surface in Arcan. The Arcan console and TUI display real-time quality scores from Nous evaluators, giving operators visibility into agent performance across all five layers.

Integration points

| Subsystem | How Nous integrates |
| --- | --- |
| Arcan | NousMiddleware runs evaluators on every agent loop tick |
| Autonomic | Inline scores feed directly into the homeostatic fold; async scores arrive via Lago |
| Lago | Evaluation events persisted for historical analysis and replay |
| Vigil | EvalScore emits OTel span events (gen_ai.evaluation.result) |
| Anima | Evaluates against the agent's PolicyManifest for safety layer checks |
