Scoring

The SDK includes a value-scoring function that evaluates how useful a ReasoningTrace is before it is contributed to the network. The resulting score determines whether the trace meets the quality threshold for sharing.

evaluateValue(trace)

function evaluateValue(trace: ReasoningTrace): Promise<number>

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| trace | ReasoningTrace | A complete reasoning trace to evaluate |

Returns: Promise<number> -- a quality score between 0.0 and 1.0.

Example:

import { evaluateValue } from "@knowledgepulse/sdk";
import type { ReasoningTrace } from "@knowledgepulse/sdk";

const trace: ReasoningTrace = {
  "@context": "https://knowledgepulse.dev/schema/v1",
  "@type": "ReasoningTrace",
  id: "kp:trace:550e8400-e29b-41d4-a716-446655440000",
  metadata: {
    created_at: new Date().toISOString(),
    task_domain: "code-review",
    success: true,
    quality_score: 0,
    visibility: "network",
    privacy_level: "aggregated",
  },
  task: { objective: "Review PR #42 for security issues" },
  steps: [
    { step_id: 0, type: "thought", content: "Analyzing diff for injection vectors" },
    { step_id: 1, type: "tool_call", tool: { name: "github_pr_read" }, input: { pr: 42 } },
    { step_id: 2, type: "observation", content: "Found unsanitized SQL in handler.ts" },
    { step_id: 3, type: "tool_call", tool: { name: "static_analysis" }, input: { file: "handler.ts" } },
    { step_id: 4, type: "observation", content: "Confirmed SQL injection vulnerability" },
  ],
  outcome: {
    result_summary: "Identified 1 critical SQL injection vulnerability",
    confidence: 0.95,
  },
};

const score = await evaluateValue(trace);
console.log(score); // e.g. 0.72

Scoring Dimensions

The composite score is a weighted average of four independent dimensions:

| Dimension | Weight | Range | Description |
| --- | --- | --- | --- |
| Complexity (C) | 25% | 0.0 - 1.0 | How structurally rich the trace is |
| Novelty (N) | 35% | 0.0 - 1.0 | How different the trace is from previously seen traces |
| Tool Diversity (D) | 15% | 0.0 - 1.0 | Variety of tools used relative to step count |
| Outcome Confidence (O) | 25% | 0.0 - 1.0 | Confidence in the result, adjusted for success |

score = C * 0.25 + N * 0.35 + D * 0.15 + O * 0.25
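
As a worked illustration of the weighted average, the following sketch applies the weights to dimension values computed by hand for the example trace above. The `DimensionScores` interface and `compositeScore` function are hypothetical names introduced here, not part of the SDK, and the novelty value assumes the 0.5 fallback rather than a real embedding comparison.

```typescript
// Hypothetical helper for illustration; the SDK's internals may differ.
interface DimensionScores {
  complexity: number;        // C
  novelty: number;           // N
  toolDiversity: number;     // D
  outcomeConfidence: number; // O
}

// Weighted average using the documented weights (they sum to 1.0).
function compositeScore(d: DimensionScores): number {
  return (
    d.complexity * 0.25 +
    d.novelty * 0.35 +
    d.toolDiversity * 0.15 +
    d.outcomeConfidence * 0.25
  );
}

// Hand-computed dimensions for the example trace (5 steps, 3 step types,
// 2 distinct tools, confidence 0.95, success: true):
//   C = (3 / 4) * 0.5 + 0 + (5 / 20) * 0.2 = 0.425
//   N = 0.5 (assumed fallback value)
//   D = min(1.0, (2 / 5) * 3)              = 1.0
//   O = 0.95 * 1.0                         = 0.95
const example = compositeScore({
  complexity: 0.425,
  novelty: 0.5,
  toolDiversity: 1.0,
  outcomeConfidence: 0.95,
});
console.log(example.toFixed(2)); // "0.67"
```

With a populated cache the novelty term, and therefore the final score, would differ from this figure.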

Complexity (C)

Measures the structural richness of the reasoning trace based on step type variety, error recovery, and trace length.

C = min(1.0, (uniqueTypes / 4) * 0.5 + (errorRecovery > 0 ? 0.3 : 0) + (steps.length / 20) * 0.2)

| Factor | Contribution | Description |
| --- | --- | --- |
| Unique step types | up to 0.50 | Number of distinct step types (thought, tool_call, observation, error_recovery) divided by 4 |
| Error recovery | 0.00 or 0.30 | Bonus if the trace contains at least one error_recovery step |
| Step count | up to 0.20 | Number of steps divided by 20 (longer traces score higher, capped at 20) |
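
The formula can be sketched as a small standalone function. This is a minimal illustration that follows the formula as written; the `Step` type and the `complexity` function name are assumptions made for the example and are not SDK exports.

```typescript
// Hypothetical minimal step shape; the SDK's step type carries more fields.
type StepType = "thought" | "tool_call" | "observation" | "error_recovery";
interface Step {
  type: StepType;
}

function complexity(steps: Step[]): number {
  // Count distinct step types present in the trace.
  const uniqueTypes = new Set(steps.map((s) => s.type)).size;
  // Count error_recovery steps; any non-zero count earns the 0.3 bonus.
  const errorRecovery = steps.filter((s) => s.type === "error_recovery").length;
  const raw =
    (uniqueTypes / 4) * 0.5 +
    (errorRecovery > 0 ? 0.3 : 0) +
    (steps.length / 20) * 0.2;
  return Math.min(1.0, raw);
}

// Five steps with three distinct types and no error recovery:
// 0.375 + 0 + 0.05, which is approximately 0.425.
const c = complexity([
  { type: "thought" },
  { type: "tool_call" },
  { type: "observation" },
  { type: "tool_call" },
  { type: "observation" },
]);
console.log(c);
```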

Novelty (N)

Measures how different a trace is from previously scored traces using embedding-based similarity.

  • Embedding model: Xenova/all-MiniLM-L6-v2 (384 dimensions)
  • Input text: task objective concatenated with all step contents
  • Comparison: cosine similarity against all vectors in the local cache
  • Formula: N = 1.0 - maxCosineSimilarity(embedding, cache)

If the @huggingface/transformers package is not installed, the novelty dimension falls back to 0.5 (the midpoint). This ensures scoring still works without the optional dependency, albeit with reduced discrimination on novelty.

When the local cache is empty (first trace scored in a session), novelty also defaults to 0.5.
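
The novelty computation and its two fallbacks can be sketched as follows. This is an illustration under stated assumptions: the cache is modeled as a plain array, a `null` embedding stands in for a missing embedder, and the clamp to the 0.0 - 1.0 range is an assumption added here because cosine similarity can be negative. The function names are hypothetical.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero-length vector as having no similarity (assumption).
  return denom === 0 ? 0 : dot / denom;
}

// N = 1.0 - maxCosineSimilarity(embedding, cache), with a 0.5 fallback when
// no embedding is available or the cache is empty.
function novelty(embedding: Float32Array | null, cache: Float32Array[]): number {
  if (embedding === null || cache.length === 0) return 0.5;
  let maxSim = -1;
  for (const v of cache) {
    maxSim = Math.max(maxSim, cosineSimilarity(embedding, v));
  }
  // Clamp to the documented range (assumption, not confirmed by the source).
  return Math.min(1.0, Math.max(0.0, 1.0 - maxSim));
}

const seen = [new Float32Array([1, 0])];
console.log(novelty(new Float32Array([1, 0]), seen)); // 0 (identical to a cached vector)
console.log(novelty(new Float32Array([0, 1]), seen)); // 1 (orthogonal to everything cached)
console.log(novelty(new Float32Array([0, 1]), []));   // 0.5 (empty cache)
```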

Tool Diversity (D)

Measures the variety of distinct tools used in the trace.

D = min(1.0, (uniqueTools / max(1, steps.length)) * 3)

The multiplier of 3 means that a trace reaches the maximum score once its number of distinct tools is at least one-third of its step count. This rewards traces that draw on several tools without requiring every step to call a different one.
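
A minimal sketch of this formula, assuming tools are distinguished by their `name` field (the source does not state how uniqueness is determined). The `Step` shape and `toolDiversity` name are hypothetical.

```typescript
// Hypothetical minimal step shape; only the fields needed here.
interface Step {
  type: string;
  tool?: { name: string };
}

function toolDiversity(steps: Step[]): number {
  // Collect distinct tool names (assumption: uniqueness is by name).
  const names = new Set<string>();
  for (const s of steps) {
    if (s.tool) names.add(s.tool.name);
  }
  // max(1, steps.length) avoids division by zero for an empty trace.
  return Math.min(1.0, (names.size / Math.max(1, steps.length)) * 3);
}

// Two distinct tools across five steps: (2 / 5) * 3 = 1.2, capped at 1.0.
const short: Step[] = [
  { type: "thought" },
  { type: "tool_call", tool: { name: "github_pr_read" } },
  { type: "observation" },
  { type: "tool_call", tool: { name: "static_analysis" } },
  { type: "observation" },
];
console.log(toolDiversity(short)); // 1

// The same two tools spread across twelve steps: (2 / 12) * 3, about 0.5.
const long: Step[] = [...short, ...short, { type: "thought" }, { type: "thought" }];
console.log(toolDiversity(long));
```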

Outcome Confidence (O)

Reflects the agent's self-reported confidence, adjusted by whether the task actually succeeded.

O = outcome.confidence * (metadata.success ? 1.0 : 0.3)

Failed tasks have their confidence multiplied by 0.3, significantly reducing the outcome dimension score.
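
The effect of the failure multiplier is easiest to see with the same confidence value on both branches. The function name below is a hypothetical stand-in for illustration.

```typescript
// O = outcome.confidence * (metadata.success ? 1.0 : 0.3)
function outcomeConfidence(confidence: number, success: boolean): number {
  return confidence * (success ? 1.0 : 0.3);
}

const succeeded = outcomeConfidence(0.95, true);  // 0.95
const failed = outcomeConfidence(0.95, false);    // approximately 0.285
console.log(succeeded, failed);
```

Because this dimension carries a 25% weight, the failed case in this example contributes roughly 0.07 to the composite score instead of roughly 0.24.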

Rule-Based Overrides

After computing the weighted composite score, three rule-based adjustments are applied in order:

| Condition | Effect | Rationale |
| --- | --- | --- |
| Single thought-only step | Score set to 0.1 | A trace with one thought step has minimal value |
| More than 2 error recoveries and success: true | Score increased by 0.1 (capped at 1.0) | Successful recovery from multiple errors is highly valuable |
| 1 or fewer unique tools (when tools are used) | Score decreased by 0.1 (floored at 0.0) | Low tool diversity in tool-using traces is penalized |

// Single thought-only step
if (steps.length === 1 && steps[0].type === "thought") score = 0.1;

// Successful multi-error recovery
if (errorRecovery > 2 && metadata.success) score = Math.min(1.0, score + 0.1);

// Low tool diversity
if (uniqueTools <= 1 && steps.some(s => s.tool)) score = Math.max(0.0, score - 0.1);
Note: The overrides are applied sequentially. If a trace has exactly one thought step, the single-thought override replaces the composite score with 0.1 regardless of the dimension values. The two later overrides then adjust that value if their conditions are also met.
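
The ordering can be made concrete with a standalone sketch that applies the three rules to a precomputed composite score. The `applyOverrides` function and `Step` shape are hypothetical, and counting unique tools by `name` is an assumption.

```typescript
// Hypothetical minimal step shape; only the fields the rules inspect.
interface Step {
  type: string;
  tool?: { name: string };
}

function applyOverrides(score: number, steps: Step[], success: boolean): number {
  const errorRecovery = steps.filter((s) => s.type === "error_recovery").length;
  const toolNames = new Set<string>();
  for (const s of steps) {
    if (s.tool) toolNames.add(s.tool.name); // assumption: unique by name
  }

  let adjusted = score;
  // 1. Single thought-only step: replace the score outright.
  if (steps.length === 1 && steps[0].type === "thought") adjusted = 0.1;
  // 2. Successful multi-error recovery: bonus, capped at 1.0.
  if (errorRecovery > 2 && success) adjusted = Math.min(1.0, adjusted + 0.1);
  // 3. Low tool diversity in a tool-using trace: penalty, floored at 0.0.
  if (toolNames.size <= 1 && steps.some((s) => s.tool)) {
    adjusted = Math.max(0.0, adjusted - 0.1);
  }
  return adjusted;
}

// A lone thought step is pinned to 0.1 whatever the composite score was.
const lone = applyOverrides(0.9, [{ type: "thought" }], true);

// Three recoveries with a single tool: the bonus and the penalty both fire,
// so the result ends up close to the starting score.
const mixed = applyOverrides(
  0.6,
  [
    { type: "tool_call", tool: { name: "static_analysis" } },
    { type: "error_recovery" },
    { type: "error_recovery" },
    { type: "error_recovery" },
  ],
  true,
);
console.log(lone, mixed);
```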

Internal Vector Cache

The scoring module maintains an internal VectorCache instance for computing novelty across invocations within the same process.

| Property | Value |
| --- | --- |
| Max elements | 1,000 |
| Dimensions | 384 |
| Algorithm | Brute-force linear scan |
| Eviction | Oldest-first when over capacity |

The cache is designed for the common case of scoring traces in a single agent session. At 1,000 vectors of 384 dimensions each, the memory footprint is approximately 1.5 MB and a full scan completes in under 1 ms.

The VectorCache class is also exported from the SDK for advanced use cases:

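import { VectorCache } from "@knowledgepulse/sdk";

const cache = new VectorCache({ maxElements: 500, dimensions: 384 });

// Use non-zero vectors: cosine similarity is undefined for an all-zero vector.
cache.add(new Float32Array(384).fill(0.1)); // Add a vector
const q = new Float32Array(384).fill(0.2); // Query vector (placeholder values)
const sim = cache.maxCosineSimilarity(q); // Query max similarity
console.log(sim);
console.log(cache.size); // Number of stored vectors
cache.clear(); // Reset the cache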
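
To show how a brute-force cache with oldest-first eviction might behave, here is a standalone sketch. `SketchVectorCache` is a hypothetical class written for this example; it mirrors the method names shown above but is not the SDK's implementation, and returning 0 for an empty cache is an assumption.

```typescript
class SketchVectorCache {
  private vectors: Float32Array[] = [];
  private readonly maxElements: number;
  private readonly dimensions: number;

  constructor(options: { maxElements: number; dimensions: number }) {
    this.maxElements = options.maxElements;
    this.dimensions = options.dimensions;
  }

  get size(): number {
    return this.vectors.length;
  }

  add(v: Float32Array): void {
    if (v.length !== this.dimensions) {
      throw new Error(`expected ${this.dimensions} dimensions, got ${v.length}`);
    }
    this.vectors.push(v);
    // Oldest-first eviction once the capacity is exceeded.
    if (this.vectors.length > this.maxElements) this.vectors.shift();
  }

  // Linear scan over every stored vector.
  maxCosineSimilarity(q: Float32Array): number {
    let best = 0; // assumption: an empty cache reports 0
    let first = true;
    for (const v of this.vectors) {
      let dot = 0;
      let nq = 0;
      let nv = 0;
      for (let i = 0; i < q.length; i++) {
        dot += q[i] * v[i];
        nq += q[i] * q[i];
        nv += v[i] * v[i];
      }
      const denom = Math.sqrt(nq) * Math.sqrt(nv);
      const sim = denom === 0 ? 0 : dot / denom;
      if (first || sim > best) best = sim;
      first = false;
    }
    return best;
  }

  clear(): void {
    this.vectors = [];
  }
}

// Capacity of 2: adding a third vector evicts the first one.
const small = new SketchVectorCache({ maxElements: 2, dimensions: 3 });
small.add(new Float32Array([1, 0, 0]));
small.add(new Float32Array([0, 1, 0]));
small.add(new Float32Array([0, 0, 1]));
console.log(small.size); // 2
console.log(small.maxCosineSimilarity(new Float32Array([1, 0, 0]))); // 0 (the match was evicted)

// Footprint of the documented configuration with 4-byte floats.
const bytes = 1000 * 384 * Float32Array.BYTES_PER_ELEMENT;
console.log(bytes); // 1536000 bytes, roughly 1.5 MB
```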

Scoring Without the Embedder

If you do not install @huggingface/transformers, the scoring function still works. The novelty dimension defaults to 0.5, and the final score is computed from the remaining three dimensions plus the fixed novelty midpoint:

score = C * 0.25 + 0.5 * 0.35 + D * 0.15 + O * 0.25
      = C * 0.25 + 0.175 + D * 0.15 + O * 0.25
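
One consequence of the fixed novelty term is a narrower range of reachable scores before the rule-based overrides are applied. The sketch below evaluates the formula at its extremes; `scoreWithoutEmbedder` is a hypothetical name used only for this illustration.

```typescript
// Composite score with the novelty term fixed at 0.5 * 0.35 = 0.175.
function scoreWithoutEmbedder(c: number, d: number, o: number): number {
  return c * 0.25 + 0.175 + d * 0.15 + o * 0.25;
}

const lowest = scoreWithoutEmbedder(0, 0, 0);  // 0.175
const highest = scoreWithoutEmbedder(1, 1, 1); // approximately 0.825
console.log(lowest, highest);
```

Before overrides, scores therefore fall between about 0.175 and 0.825 rather than spanning the full 0.0 - 1.0 range.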

This is suitable for development and testing but provides less discriminating scores in production. For best results, install the optional dependency:

bun add @huggingface/transformers