Scoring Algorithm

The KnowledgePulse scoring engine evaluates reasoning traces using a composite formula that combines four independent quality dimensions. In Phase 2, the engine introduces domain-specific weight profiles that tailor the scoring emphasis to different task domains, and enforces a 100ms performance budget per evaluation.

Composite Formula

The overall quality score is computed as a weighted sum of four normalized dimensions:

score = C × wC + N × wN + D × wD + O × wO

Where:

Symbol   Dimension            Range
C        Complexity           0.0 – 1.0
N        Novelty              0.0 – 1.0
D        Tool Diversity       0.0 – 1.0
O        Outcome Confidence   0.0 – 1.0

The weights (wC, wN, wD, wO) vary by domain and always sum to 1.0.
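
For reference, the composite is just this weighted sum in code. A minimal sketch (the standalone compositeScore helper is illustrative, not an SDK export; the weight field names match the ScoringWeights interface shown under Scoring Interface below):

// Matches the ScoringWeights interface shown under "Scoring Interface" below.
interface ScoringWeights {
  complexity: number;        // wC
  novelty: number;           // wN
  toolDiversity: number;     // wD
  outcomeConfidence: number; // wO
}

// Weighted sum of the four normalized dimensions.
function compositeScore(
  dims: { C: number; N: number; D: number; O: number },
  w: ScoringWeights,
): number {
  return (
    dims.C * w.complexity +
    dims.N * w.novelty +
    dims.D * w.toolDiversity +
    dims.O * w.outcomeConfidence
  );
}

// Worked example with the finance profile (wC=0.20, wN=0.25, wD=0.10, wO=0.45):
// 0.6*0.20 + 0.5*0.25 + 0.4*0.10 + 0.92*0.45 = 0.12 + 0.125 + 0.04 + 0.414 = 0.699
const s = compositeScore(
  { C: 0.6, N: 0.5, D: 0.4, O: 0.92 },
  { complexity: 0.2, novelty: 0.25, toolDiversity: 0.1, outcomeConfidence: 0.45 },
);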

Domain-Specific Weight Profiles

Different task domains prioritize different quality signals. A finance trace benefits most from high outcome confidence, while a coding trace benefits from diverse tool usage. The scoring engine selects the weight profile automatically based on metadata.task_domain.

Available Profiles

Domain             wC (Complexity)   wN (Novelty)   wD (Tool Diversity)   wO (Outcome)
default            0.25              0.35           0.15                  0.25
finance            0.20              0.25           0.10                  0.45
code               0.20              0.30           0.30                  0.20
medical            0.15              0.20           0.10                  0.55
customer_service   0.20              0.30           0.20                  0.30
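
Internally, this amounts to a lookup table keyed by domain with a default fallback. A minimal sketch of that idea (WEIGHT_PROFILES and resolveWeights are illustrative names, not confirmed SDK internals; ScoringWeights is the shape sketched above):

// Illustrative representation of the profile table; not the SDK's actual internals.
const WEIGHT_PROFILES: Record<string, ScoringWeights> = {
  default:          { complexity: 0.25, novelty: 0.35, toolDiversity: 0.15, outcomeConfidence: 0.25 },
  finance:          { complexity: 0.20, novelty: 0.25, toolDiversity: 0.10, outcomeConfidence: 0.45 },
  code:             { complexity: 0.20, novelty: 0.30, toolDiversity: 0.30, outcomeConfidence: 0.20 },
  medical:          { complexity: 0.15, novelty: 0.20, toolDiversity: 0.10, outcomeConfidence: 0.55 },
  customer_service: { complexity: 0.20, novelty: 0.30, toolDiversity: 0.20, outcomeConfidence: 0.30 },
};

// Unknown or missing domains fall back to default; no error is thrown.
function resolveWeights(taskDomain?: string): ScoringWeights {
  return WEIGHT_PROFILES[taskDomain ?? "default"] ?? WEIGHT_PROFILES.default;
}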

Design Rationale

  • Finance heavily weights outcome confidence because financial analysis demands accurate, verifiable conclusions.
  • Code heavily weights tool diversity because effective coding agents leverage multiple tools (linters, type checkers, test runners).
  • Medical has the highest outcome confidence weight (0.55) because correctness is critical in medical reasoning.
  • Customer service balances novelty and outcome confidence, rewarding creative yet effective problem resolution.

Using Domain Profiles

Domain selection happens automatically through the trace metadata:

import { evaluateValue } from "@knowledgepulse/sdk";
import type { ReasoningTrace } from "@knowledgepulse/sdk";

const trace: ReasoningTrace = {
  "@context": "https://knowledgepulse.dev/schema/v1",
  "@type": "ReasoningTrace",
  id: "kp:trace:finance-demo-001",
  metadata: {
    created_at: new Date().toISOString(),
    task_domain: "finance", // ← selects finance weight profile
    success: true,
    quality_score: 0,
    visibility: "network",
    privacy_level: "aggregated",
  },
  task: { objective: "Analyze TSMC Q4 earnings report" },
  steps: [
    { step_id: 0, type: "thought", content: "Extracting revenue and margin data" },
    { step_id: 1, type: "tool_call", tool: { name: "financial_data_api" }, input: { ticker: "TSM" } },
    { step_id: 2, type: "observation", content: "Revenue: $26.3B, up 14.3% YoY" },
    { step_id: 3, type: "tool_call", tool: { name: "comparison_tool" }, input: { metric: "gross_margin" } },
    { step_id: 4, type: "observation", content: "Gross margin 57.9%, above industry average" },
  ],
  outcome: {
    result_summary: "Strong quarterly performance driven by AI chip demand",
    confidence: 0.92,
  },
};

const score = await evaluateValue(trace);
// With finance weights, high outcome confidence (0.92) contributes more
console.log(score); // e.g. 0.78

If the domain does not match any registered profile, the engine falls back to the default weights; unknown domains are handled silently, and no error is thrown.

Rule-Based Overrides

After computing the weighted composite score, three deterministic overrides are applied in order:

1. Single-Step Penalty

if (steps.length === 1 && steps[0].type === "thought") score = 0.1;

A trace with a single thought-only step has minimal knowledge value. The score is forced to 0.1 regardless of other factors.

2. Error-Recovery Bonus

if (errorRecovery > 2 && metadata.success) score = Math.min(1.0, score + 0.1);

Traces that recover from more than 2 errors and still succeed demonstrate valuable resilience. A +0.1 bonus is added, capped at 1.0.

3. Zero-Diversity Penalty

if (uniqueTools <= 1 && steps.some(s => s.tool)) score = Math.max(0.0, score - 0.1);

If a trace uses tools but only one unique tool, a -0.1 penalty is applied, floored at 0.0. This encourages diverse tool usage.

note

The single-step penalty takes precedence. If a trace has exactly one thought step, the score is set to 0.1 first. The error-recovery bonus and zero-diversity penalty then apply on top of that value if their conditions are also met.
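
Putting the three overrides together in their application order looks roughly like this (a sketch: countErrorRecoveries is a hypothetical helper, and the field accesses assume the ReasoningTrace shape from the example above):

function applyOverrides(score: number, trace: ReasoningTrace): number {
  const steps = trace.steps;

  // 1. Single-step penalty: force the score to 0.1 before the other rules run.
  if (steps.length === 1 && steps[0].type === "thought") {
    score = 0.1;
  }

  // 2. Error-recovery bonus: +0.1 for resilient, successful traces, capped at 1.0.
  //    countErrorRecoveries is a hypothetical helper counting recovered errors.
  const errorRecovery = countErrorRecoveries(steps);
  if (errorRecovery > 2 && trace.metadata.success) {
    score = Math.min(1.0, score + 0.1);
  }

  // 3. Zero-diversity penalty: -0.1 when tools are used but only one unique tool appears,
  //    floored at 0.0.
  const uniqueTools = new Set(steps.filter((s) => s.tool).map((s) => s.tool!.name)).size;
  if (uniqueTools <= 1 && steps.some((s) => s.tool)) {
    score = Math.max(0.0, score - 0.1);
  }

  return score;
}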

Temporal Decay for Novelty

The novelty dimension uses embedding-based similarity against a local vector cache. As the cache accumulates traces over time, the novelty score for semantically similar traces naturally decreases. This creates an implicit temporal decay effect:

  1. Fresh trace in an empty cache: novelty defaults to 0.5.
  2. New unique trace: novelty approaches 1.0 (low similarity to existing vectors).
  3. Repeated trace pattern: novelty approaches 0.0 (high similarity to cached vectors).

The vector cache supports TTL-based eviction (introduced in Phase 2), so cached entries expire after a configurable time window. This ensures that a topic revisited after the TTL period regains a higher novelty score.

import { VectorCache } from "@knowledgepulse/sdk";

const cache = new VectorCache({
  maxElements: 1000,
  dimensions: 384,
  ttlMs: 3600000, // 1 hour: entries expire after this
});
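
The novelty value itself is then one minus the highest similarity found in the cache. A conceptual sketch of that relationship (the size, querySimilarity, and insert members are assumptions for illustration, not confirmed VectorCache API):

// Conceptual only: the VectorCache method names used here are assumed, not documented API.
function noveltyFor(embedding: number[], cache: VectorCache): number {
  if (cache.size === 0) return 0.5;                // empty cache: neutral default
  const maxSim = cache.querySimilarity(embedding); // highest similarity to any cached vector
  cache.insert(embedding);                         // remember this trace for future scoring
  return 1.0 - maxSim;                             // unique → near 1.0, repeated → near 0.0
}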

Performance Budget

The scoring function is designed to complete within 100ms for typical traces. Key implementation choices that support this constraint:

Component               Strategy                                      Latency
Vector cache            Brute-force linear scan over 1,000 vectors    < 1ms
Embedder                Lazy-loaded, cached after first invocation    ~50ms first call, ~5ms subsequent
Composite calculation   Pure arithmetic, no I/O                       < 0.1ms
Rule overrides          Three conditional checks                      < 0.01ms

If the optional embedder (@huggingface/transformers) is not installed, novelty defaults to 0.5 and the entire evaluation runs in under 1ms.
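
You can check the budget empirically with a simple timing harness (performance.now() is standard in Node and browsers; `trace` is any ReasoningTrace, such as the finance example above):

import { performance } from "node:perf_hooks";
import { evaluateValue } from "@knowledgepulse/sdk";

const t0 = performance.now();
const score = await evaluateValue(trace); // `trace` as constructed earlier
const elapsedMs = performance.now() - t0;

console.log(`score=${score.toFixed(3)} in ${elapsedMs.toFixed(1)}ms`);
// Expect well under 100ms; under 1ms when the optional embedder is not installed.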

Scoring Interface

interface ScoringWeights {
  complexity: number;
  novelty: number;
  toolDiversity: number;
  outcomeConfidence: number;
}

function evaluateValue(trace: ReasoningTrace): Promise<number>;

The function returns a Promise<number> between 0.0 and 1.0. It is stateful across invocations within the same process because the local vector cache persists for novelty calculations.
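
Because the in-process vector cache persists across calls, re-evaluating a semantically identical trace generally yields a lower score the second time, as its novelty drops:

const first = await evaluateValue(trace);  // cache sparse: novelty relatively high
const second = await evaluateValue(trace); // near-duplicate of a cached vector: novelty falls
console.log(first >= second); // typically true; exact values depend on weights and content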

Example: Comparing Domain Profiles

The same trace evaluated under different domains produces different scores due to weight differences:

// Same trace structure, different task_domain values
const domains = ["default", "finance", "code", "medical", "customer_service"];

for (const domain of domains) {
  const trace = createTrace({ task_domain: domain });
  const score = await evaluateValue(trace);
  console.log(`${domain}: ${score.toFixed(3)}`);
}

// Example output (varies by trace content):
// default: 0.623
// finance: 0.714 (high confidence rewarded)
// code: 0.598 (tool diversity emphasized)
// medical: 0.751 (confidence dominates)
// customer_service: 0.645 (balanced)