Scoring

The SDK includes a value scoring function that evaluates how useful a ReasoningTrace is before it is contributed to the network. The resulting score determines whether the trace meets the quality threshold for sharing.

`evaluateValue(trace)`

function evaluateValue(trace: ReasoningTrace): Promise<number>

Parameters:

Parameter	Type	Description
`trace`	`ReasoningTrace`	A complete reasoning trace to evaluate

Returns: Promise<number> -- a quality score between 0.0 and 1.0.

Example:

import { evaluateValue } from "@knowledgepulse/sdk";
import type { ReasoningTrace } from "@knowledgepulse/sdk";

const trace: ReasoningTrace = {
  "@context": "https://knowledgepulse.dev/schema/v1",
  "@type": "ReasoningTrace",
  id: "kp:trace:550e8400-e29b-41d4-a716-446655440000",
  metadata: {
    created_at: new Date().toISOString(),
    task_domain: "code-review",
    success: true,
    quality_score: 0,
    visibility: "network",
    privacy_level: "aggregated",
  },
  task: { objective: "Review PR #42 for security issues" },
  steps: [
    { step_id: 0, type: "thought", content: "Analyzing diff for injection vectors" },
    { step_id: 1, type: "tool_call", tool: { name: "github_pr_read" }, input: { pr: 42 } },
    { step_id: 2, type: "observation", content: "Found unsanitized SQL in handler.ts" },
    { step_id: 3, type: "tool_call", tool: { name: "static_analysis" }, input: { file: "handler.ts" } },
    { step_id: 4, type: "observation", content: "Confirmed SQL injection vulnerability" },
  ],
  outcome: {
    result_summary: "Identified 1 critical SQL injection vulnerability",
    confidence: 0.95,
  },
};

const score = await evaluateValue(trace);
console.log(score); // e.g. 0.72

Scoring Dimensions

The composite score is a weighted average of four independent dimensions:

Dimension	Weight	Range	Description
Complexity (C)	25%	0.0 - 1.0	How structurally rich the trace is
Novelty (N)	35%	0.0 - 1.0	How different the trace is from previously seen traces
Tool Diversity (D)	15%	0.0 - 1.0	Variety of tools used relative to step count
Outcome Confidence (O)	25%	0.0 - 1.0	Confidence in the result, adjusted for success

score = C * 0.25 + N * 0.35 + D * 0.15 + O * 0.25
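As a rough illustration of the weighted sum, the sketch below applies the documented weights to dimension values derived by hand for the example trace shown earlier (assuming an empty cache, so novelty is 0.5, and no rule-based overrides). `DimensionScores` and `compositeScore` are hypothetical names used only here; they are not SDK exports.

```typescript
// Hypothetical container for the four dimension scores, each in [0, 1].
interface DimensionScores {
  complexity: number;
  novelty: number;
  toolDiversity: number;
  outcomeConfidence: number;
}

// Weighted sum using the documented weights (0.25 + 0.35 + 0.15 + 0.25 = 1.0).
function compositeScore(d: DimensionScores): number {
  return (
    d.complexity * 0.25 +
    d.novelty * 0.35 +
    d.toolDiversity * 0.15 +
    d.outcomeConfidence * 0.25
  );
}

// Hand-derived inputs for the example trace: 3 step types over 5 steps,
// empty cache, 2 distinct tools, confidence 0.95 with success: true.
const exampleScore = compositeScore({
  complexity: 0.425,
  novelty: 0.5,
  toolDiversity: 1.0,
  outcomeConfidence: 0.95,
});
console.log(exampleScore.toFixed(3)); // "0.669"
```

The value actually returned by `evaluateValue` may differ, since novelty depends on what is already in the cache.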

Complexity (C)

Measures the structural richness of the reasoning trace based on step type variety, error recovery, and trace length.

C = min(1.0, (uniqueTypes / 4) * 0.5 + (errorRecovery > 0 ? 0.3 : 0) + (steps.length / 20) * 0.2)

Factor	Contribution	Description
Unique step types	up to 0.50	Number of distinct step types (`thought`, `tool_call`, `observation`, `error_recovery`) divided by 4
Error recovery	0.00 or 0.30	Bonus if the trace contains at least one `error_recovery` step
Step count	up to 0.20	Number of steps divided by 20, scaled by 0.2 (longer traces score higher; the term reaches 0.20 at 20 steps)
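The formula and factor table above can be sketched as a small function. This is a minimal illustration that assumes each step carries a `type` field with one of the four listed values; the `Step` interface and `complexity` function are hypothetical and not part of the SDK.

```typescript
// Hypothetical minimal step shape; only the type matters for complexity.
type StepType = "thought" | "tool_call" | "observation" | "error_recovery";
interface Step {
  type: StepType;
}

function complexity(steps: Step[]): number {
  // Count distinct step types and the number of error_recovery steps.
  const uniqueTypes = new Set(steps.map((s) => s.type)).size;
  const errorRecovery = steps.filter((s) => s.type === "error_recovery").length;
  return Math.min(
    1.0,
    (uniqueTypes / 4) * 0.5 + (errorRecovery > 0 ? 0.3 : 0) + (steps.length / 20) * 0.2,
  );
}

// Same step types as the example trace: 3 distinct types across 5 steps.
const exampleSteps: Step[] = [
  { type: "thought" },
  { type: "tool_call" },
  { type: "observation" },
  { type: "tool_call" },
  { type: "observation" },
];
console.log(complexity(exampleSteps)); // approximately 0.425 (0.375 + 0 + 0.05)
```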

Novelty (N)

Measures how different a trace is from previously scored traces using embedding-based similarity.

Embedding model: Xenova/all-MiniLM-L6-v2 (384 dimensions)
Input text: task objective concatenated with all step contents
Comparison: cosine similarity against all vectors in the local cache
Formula: N = 1.0 - maxCosineSimilarity(embedding, cache)

If the @huggingface/transformers package is not installed, the novelty dimension falls back to 0.5 (the midpoint). This ensures scoring still works without the optional dependency, albeit with reduced discrimination on novelty.

When the local cache is empty (first trace scored in a session), novelty also defaults to 0.5.
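To make the novelty formula concrete, here is a minimal sketch of `1.0 - maxCosineSimilarity` over a plain array of cached vectors, including the 0.5 fallback for an empty cache. It leaves out the embedding step, and `cosineSimilarity` and `novelty` are hypothetical helper names rather than SDK functions. The sketch follows the formula literally and does not clamp the result.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either has zero norm.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// N = 1.0 - max cosine similarity against the cache; 0.5 when the cache is empty.
function novelty(embedding: Float32Array, cached: Float32Array[]): number {
  if (cached.length === 0) return 0.5;
  let maxSim = -Infinity;
  for (const v of cached) {
    maxSim = Math.max(maxSim, cosineSimilarity(embedding, v));
  }
  return 1.0 - maxSim;
}

const seen = [new Float32Array([1, 0])];
console.log(novelty(new Float32Array([1, 0]), seen)); // identical direction: 0
console.log(novelty(new Float32Array([0, 1]), seen)); // orthogonal: 1
console.log(novelty(new Float32Array([0, 1]), []));   // empty cache: 0.5
```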

Tool Diversity (D)

Measures the variety of distinct tools used in the trace.

D = min(1.0, (uniqueTools / max(1, steps.length)) * 3)

The multiplier of 3 means that a trace reaches the maximum score once its number of distinct tools is at least one-third of its step count. This rewards traces that use multiple tools without requiring a different tool at every step.
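A minimal sketch of the tool diversity formula follows. It assumes tools are identified by `tool.name`, as in the example trace; the `Step` interface and `toolDiversity` function are hypothetical names for illustration.

```typescript
// Hypothetical minimal step shape; tool is present only on tool-using steps.
interface Step {
  type: string;
  tool?: { name: string };
}

function toolDiversity(steps: Step[]): number {
  // Collect distinct tool names across all steps.
  const names = new Set<string>();
  for (const s of steps) {
    if (s.tool) names.add(s.tool.name);
  }
  return Math.min(1.0, (names.size / Math.max(1, steps.length)) * 3);
}

// Two distinct tools over five steps: 2 / 5 * 3 = 1.2, capped at 1.0.
const mixed: Step[] = [
  { type: "thought" },
  { type: "tool_call", tool: { name: "github_pr_read" } },
  { type: "observation" },
  { type: "tool_call", tool: { name: "static_analysis" } },
  { type: "observation" },
];
console.log(toolDiversity(mixed)); // 1

// One tool repeated over ten steps: 1 / 10 * 3, roughly 0.3.
const repetitive: Step[] = Array.from({ length: 10 }, () => ({
  type: "tool_call",
  tool: { name: "static_analysis" },
}));
console.log(toolDiversity(repetitive).toFixed(2)); // "0.30"
```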

Outcome Confidence (O)

Reflects the agent's self-reported confidence, adjusted by whether the task actually succeeded.

O = outcome.confidence * (metadata.success ? 1.0 : 0.3)

Failed tasks have their confidence multiplied by 0.3, significantly reducing the outcome dimension score.
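The effect of the failure multiplier is easiest to see with the same confidence value on both branches. The function name below is hypothetical; only the formula comes from this page.

```typescript
// O = confidence * (success ? 1.0 : 0.3), as in the formula above.
function outcomeConfidence(confidence: number, success: boolean): number {
  return confidence * (success ? 1.0 : 0.3);
}

console.log(outcomeConfidence(0.95, true));             // 0.95
console.log(outcomeConfidence(0.95, false).toFixed(3)); // "0.285"
```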

Rule-Based Overrides

After computing the weighted composite score, three rule-based adjustments are applied in order:

Condition	Effect	Rationale
Single thought-only step	Score set to `0.1`	A trace with one thought step has minimal value
More than 2 error recoveries and `success: true`	Score increased by `+0.1` (capped at 1.0)	Successful recovery from multiple errors is highly valuable
1 or fewer unique tools (when tools are used)	Score decreased by `-0.1` (floored at 0.0)	Low tool diversity in tool-using traces is penalized

// Single thought-only step
if (steps.length === 1 && steps[0].type === "thought") score = 0.1;

// Successful multi-error recovery
if (errorRecovery > 2 && metadata.success) score = Math.min(1.0, score + 0.1);

// Low tool diversity
if (uniqueTools <= 1 && steps.some(s => s.tool)) score = Math.max(0.0, score - 0.1);

Note: The single-thought override runs first and replaces the composite score outright: if a trace consists of exactly one thought step, the score is set to 0.1 regardless of the weighted dimensions. The remaining overrides are then evaluated against that value and adjust it if their conditions are also met.
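For readers who want to experiment with the ordering, the three rules can be wrapped in a self-contained function. This is a sketch under stated assumptions: `applyOverrides` and the `Step` shape are hypothetical, error recoveries are counted as steps of type `error_recovery`, and tools are identified by `tool.name`.

```typescript
// Hypothetical minimal step shape for illustration.
interface Step {
  type: string;
  tool?: { name: string };
}

function applyOverrides(score: number, steps: Step[], success: boolean): number {
  const errorRecovery = steps.filter((s) => s.type === "error_recovery").length;
  const toolNames = new Set<string>();
  for (const s of steps) {
    if (s.tool) toolNames.add(s.tool.name);
  }

  let adjusted = score;
  // 1. Single thought-only step replaces the score.
  if (steps.length === 1 && steps[0]?.type === "thought") adjusted = 0.1;
  // 2. Bonus for successful recovery from more than two errors.
  if (errorRecovery > 2 && success) adjusted = Math.min(1.0, adjusted + 0.1);
  // 3. Penalty when tools are used but at most one distinct tool appears.
  if (toolNames.size <= 1 && steps.some((s) => s.tool)) adjusted = Math.max(0.0, adjusted - 0.1);
  return adjusted;
}

// A lone thought step always ends at 0.1, whatever the composite score was.
console.log(applyOverrides(0.9, [{ type: "thought" }], true)); // 0.1

// A trace that calls the same tool twice loses 0.1.
const singleTool: Step[] = [
  { type: "tool_call", tool: { name: "static_analysis" } },
  { type: "tool_call", tool: { name: "static_analysis" } },
];
console.log(applyOverrides(0.6, singleTool, true).toFixed(2)); // "0.50"
```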

Internal Vector Cache

The scoring module maintains an internal VectorCache instance for computing novelty across invocations within the same process.

Property	Value
Max elements	1,000
Dimensions	384
Algorithm	Brute-force linear scan
Eviction	Oldest-first when over capacity

The cache is designed for the common case of scoring traces in a single agent session. At 1,000 vectors of 384 dimensions each, the memory footprint is approximately 1.5 MB and a full scan completes in under 1 ms.

The VectorCache class is also exported from the SDK for advanced use cases:

import { VectorCache } from "@knowledgepulse/sdk";

const cache = new VectorCache({ maxElements: 500, dimensions: 384 });

// Placeholder vectors; in practice these would be 384-dimensional embeddings.
const stored = new Float32Array(384).fill(0.1);
const q = new Float32Array(384).fill(0.2);

cache.add(stored);                          // Add a vector
const sim = cache.maxCosineSimilarity(q);   // Query max similarity
console.log(cache.size);                    // Number of stored vectors
cache.clear();                              // Reset the cache
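The properties in the table above (fixed capacity, brute-force scan, oldest-first eviction) can be illustrated with a small stand-in class. `SimpleVectorCache` is a hypothetical sketch, not the SDK's `VectorCache` implementation; in particular, returning 0 for an empty cache is an assumption of this sketch.

```typescript
// Hypothetical stand-in for illustration; not the SDK's VectorCache.
class SimpleVectorCache {
  private vectors: Float32Array[] = [];
  private readonly maxElements: number;
  private readonly dimensions: number;

  constructor(maxElements: number, dimensions: number) {
    this.maxElements = maxElements;
    this.dimensions = dimensions;
  }

  get size(): number {
    return this.vectors.length;
  }

  add(v: Float32Array): void {
    if (v.length !== this.dimensions) {
      throw new Error(`expected ${this.dimensions} dimensions, got ${v.length}`);
    }
    this.vectors.push(v);
    // Oldest-first eviction: drop from the front once over capacity.
    while (this.vectors.length > this.maxElements) this.vectors.shift();
  }

  // Brute-force linear scan over every stored vector.
  // Returns 0 for an empty cache (an assumption of this sketch).
  maxCosineSimilarity(q: Float32Array): number {
    if (this.vectors.length === 0) return 0;
    let best = -Infinity;
    for (const v of this.vectors) {
      let dot = 0;
      let nq = 0;
      let nv = 0;
      for (let i = 0; i < q.length; i++) {
        dot += q[i] * v[i];
        nq += q[i] * q[i];
        nv += v[i] * v[i];
      }
      const denom = Math.sqrt(nq) * Math.sqrt(nv);
      best = Math.max(best, denom === 0 ? 0 : dot / denom);
    }
    return best;
  }

  clear(): void {
    this.vectors = [];
  }
}

// With capacity 2, adding a third vector evicts the first one.
const small = new SimpleVectorCache(2, 2);
small.add(new Float32Array([1, 0]));
small.add(new Float32Array([0, 1]));
small.add(new Float32Array([-1, 0]));
console.log(small.size);                                       // 2
console.log(small.maxCosineSimilarity(new Float32Array([1, 0]))); // 0
```

At 1,000 vectors of 384 32-bit floats, such a structure holds 1,000 × 384 × 4 = 1,536,000 bytes of vector data, which matches the approximately 1.5 MB figure above.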

Scoring Without the Embedder

If you do not install @huggingface/transformers, the scoring function still works. The novelty dimension defaults to 0.5, and the final score is computed from the remaining three dimensions plus the fixed novelty midpoint:

score = C * 0.25 + 0.5 * 0.35 + D * 0.15 + O * 0.25
       = C * 0.25 + 0.175 + D * 0.15 + O * 0.25

This is suitable for development and testing but provides less discriminating scores in production. For best results, install the optional dependency:

bun add @huggingface/transformers
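One consequence of the fixed novelty term is that the composite score, before rule-based overrides, can only fall within a narrower band. The sketch below evaluates the fallback formula at its extremes; `scoreWithoutEmbedder` is a hypothetical helper name.

```typescript
// Composite score with novelty fixed at the 0.5 fallback.
function scoreWithoutEmbedder(c: number, d: number, o: number): number {
  return c * 0.25 + 0.5 * 0.35 + d * 0.15 + o * 0.25;
}

const lowest = scoreWithoutEmbedder(0, 0, 0);  // only the novelty term remains
const highest = scoreWithoutEmbedder(1, 1, 1); // all other dimensions at maximum
console.log(lowest.toFixed(3), highest.toFixed(3)); // "0.175 0.825"
```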
