The problem, specifically
I shipped a personal knowledge base with nightly AI synthesis — 16-step deep analysis per note via Claude Opus, embedding-based connection discovery, cross-domain bridge scanning, living thesis tracking. The v1 architecture worked. Then I ingested 192 notes in a week (including batch imports of 40 and 71 articles from two authors), and three distinct failure modes emerged:
- AI-generated synthesis drifted in format across the corpus, silently breaking downstream pipelines
- Connection discovery produced 432 results, most of them noise
- Cross-domain connections — the whole point — barely worked because embeddings can’t bridge vocabulary gaps
None of these were bugs. The system was functioning correctly at the individual-note level. Every failure was a system-level property invisible from any single execution.
Architecture and design decisions
Each note runs through a 16-step synthesis: logic reconstruction, bias detection, falsifiability analysis, second-order effects, and so on. The output is structured markdown with YAML frontmatter, required sections, specific header formatting.
Over 196 notes, the synthesis output drifted. “Core Logic Reconstruction” became “Core Logic.” “Stronger Argument” became “Steel-Man” or “Steel Man” or “Steelman.” Sub-section headers that should be bold labels got promoted to full headers. Body text merged onto header lines. Eight formatting variants across the corpus.
Each note looked fine in isolation. The AI never produced garbage. But the embedding pipeline expects consistent section names. The connection scorer compares sections across notes. When “Stronger Argument” in one note is “Steel-Man” in another, section-level matching breaks.
The fix: pkb lint. Validates every note against the schema — YAML frontmatter fields, all 12 synthesis sections in correct order, formatting rules (no merged headers, proper spacing, consistent bold labels). pkb lint --fix auto-corrects what it can. 47 tests covering the lint rules. Runs as a gate before embedding — malformed notes get rejected before they enter the graph.
Design decision: I built this as a separate process, not a post-processing step inside the synthesis agent. The agent doing the work can’t reliably assess its own quality over time. This is the same principle behind code review — the incentives and perspective need to be different. The synthesis agent optimizes for depth of analysis. The linter optimizes for corpus consistency. Different objectives, different processes.
What I rejected: Prompt engineering the drift away. You can add “ALWAYS use exactly these section headers” to the system prompt, and it helps for a while, but it’s not enforceable. LLMs are stochastic. Over hundreds of runs, they’ll drift. You need a deterministic gate.
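A minimal sketch of what such a deterministic gate could look like. This is not the actual pkb lint implementation. It assumes sections are `## ` headers, and it uses a shortened three-section schema in place of the real twelve. Only the first two section names appear in this post; the third name, `lint_note`, and `gate` are hypothetical.

```python
import re

# Hypothetical, shortened schema. The real one has 12 sections; only the first
# two names below come from the post, the third is a placeholder.
REQUIRED_SECTIONS = [
    "Core Logic Reconstruction",
    "Stronger Argument",
    "Second-Order Effects",
]

# Assumption: synthesis sections are level-2 markdown headers.
HEADER_RE = re.compile(r"^## (.+?)\s*$", re.MULTILINE)


def lint_note(text: str) -> list[str]:
    """Return human-readable violations; an empty list means the note passes."""
    errors = []
    headers = HEADER_RE.findall(text)
    for name in REQUIRED_SECTIONS:
        if name not in headers:
            errors.append(f"missing section: {name}")
    for name in headers:
        if name not in REQUIRED_SECTIONS:
            errors.append(f"unknown section: {name}")
    # Compare the order of the known sections against the schema order.
    present = [h for h in headers if h in REQUIRED_SECTIONS]
    expected = [s for s in REQUIRED_SECTIONS if s in present]
    if present != expected:
        errors.append("sections out of order or duplicated")
    return errors


def gate(notes: dict[str, str]) -> tuple[list[str], dict[str, list[str]]]:
    """Split notes into ids safe to embed and ids rejected with their reasons."""
    accepted, rejected = [], {}
    for note_id, text in notes.items():
        errors = lint_note(text)
        if errors:
            rejected[note_id] = errors
        else:
            accepted.append(note_id)
    return accepted, rejected
```

Because the check is plain string matching, the same note always gets the same verdict, which is the property a prompt instruction cannot provide. A header with body text merged onto it fails too: it shows up as an unknown section plus a missing one.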
The 80% connection drop
After fixing all 196 notes and re-running the full pipeline (embed, discover connections, scan bridges), connections dropped from 432 to 88, a reduction of roughly 80%.
First instinct: something broke. The numbers said otherwise.
- Max cross-domain similarity score increased from 0.713 to 0.7515
- Bridge scanner hit 100% (30/30 bridges found)
- The best connections got stronger while hundreds of weak ones disappeared
What happened: before synthesis, many notes had vague, underspecified embeddings. An article about position sizing in investing and an article about error budgets in engineering looked “kinda similar” because both were underspecified blobs of text. After synthesis gave each note a sharp, detailed analysis, the embeddings became more distinctive. Superficially related notes stopped matching. Genuinely connected notes matched harder.
I didn’t tune any parameters to get fewer connections. I improved the quality of understanding, and the noise fell away on its own. The score distribution shifted enough that the default connection threshold needed recalibration (from 0.7 down to 0.55), but the underlying signal-to-noise ratio improved dramatically.
The counterintuitive lesson: better understanding produces fewer connections, not more. Analysis disambiguates, which eliminates false positives. If you improve your pipeline and see a drop in results, check whether you lost noise or signal before panicking.
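One way to run that noise-versus-signal check, sketched under assumptions: each pipeline run is available as a mapping from a note pair to its score, and the previously top-ranked pairs are a reasonable proxy for signal. The function name and data shapes are hypothetical.

```python
def survival_report(before: dict, after: dict, top_k: int = 20) -> dict:
    """Compare two connection-discovery runs.

    `before` and `after` map a note-pair key (for example a tuple of two note
    ids) to a score. Both shapes are assumptions made for this sketch.
    """
    if not before:
        raise ValueError("`before` must contain at least one connection")
    # Rank the old connections from strongest to weakest.
    ranked = sorted(before, key=before.get, reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    top_kept = sum(1 for pair in top if pair in after)
    rest_kept = sum(1 for pair in rest if pair in after)
    return {
        "total_drop": 1 - len(after) / len(before),
        "top_survival": top_kept / len(top),
        "rest_survival": rest_kept / len(rest) if rest else 0.0,
    }
```

A high `top_survival` alongside a low `rest_survival` suggests the drop removed mostly weak matches. If the strongest old pairs vanished as well, something more likely broke. With the counts in this post, `total_drop` would be 1 - 88/432, or about 0.80.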
Cross-domain discovery: the hard problem
This was the whole point of the system — connecting ideas from investing to engineering to parenting to AI. It was also the hardest thing to make work.
Embeddings handle within-domain similarity well. Two articles about position sizing will score high. But “budget your maximum loss” (investing) and “define your blast radius” (agent safety) are structurally identical insights expressed in completely different vocabulary. Cosine similarity gives them a low score because the two passages share almost no wording and the embeddings sit close to that surface vocabulary, even though the underlying principle is the same.
Single-signal systems fail at boundaries. I needed signal diversity.
What I built: A multi-signal scoring pipeline that stacks four signals, each catching what the others miss.
The scoring formula went from:
0.6 * embedding + 0.2 * tags + 0.2 * wikilinks
to:
0.4 * embedding + 0.3 * semantic_tags + 0.2 * domain_bridge + 0.1 * wikilinks
The key changes:
Semantic tag overlap. Instead of comparing tags as strings (where “decision-making” and “decision-hygiene” score zero), I embed each tag as a vector and compare meaning. Conceptually similar tags boost the connection score even when the strings differ.
Domain bridge bonus. When two notes are from different domains AND share semantic tags, they get an explicit score boost. This compensates for the embedding weakness on cross-domain pairs. It’s a deliberate thumb on the scale — I’m telling the system “cross-domain connections with shared concepts are more interesting than the raw similarity score suggests.”
LLM bridge scoring. For the top cross-domain candidates, I run a lightweight model that reads both notes and judges whether they share a structural pattern or principle. This catches connections that pure math misses. Cost is negligible — roughly a penny per 50 pairs.
What I rejected: Fine-tuning embeddings for cross-domain similarity. Too expensive, too fragile, and the training data doesn’t exist for a personal knowledge base. The multi-signal approach is more robust because each signal fails independently — they don’t share failure modes.
Tradeoff: The weights are hand-tuned. I picked 0.4/0.3/0.2/0.1 based on eyeballing results across a few dozen connections. There’s no principled optimization here. It works well enough, but the weights are almost certainly suboptimal. A proper evaluation would require labeled connection pairs, which I don’t have. This is a known limitation I’m living with.
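The weighted formula above can be sketched as a small scoring function. The weights are the ones quoted in the post; everything else is an assumption for illustration: the `PairSignals` fields, the binary form of the bridge bonus, and the `tag_floor` cutoff for "shares semantic tags".

```python
from dataclasses import dataclass

# Weights as quoted in the post (hand-tuned, not optimized).
WEIGHTS = {
    "embedding": 0.4,
    "semantic_tags": 0.3,
    "domain_bridge": 0.2,
    "wikilinks": 0.1,
}


@dataclass
class PairSignals:
    embedding: float      # cosine similarity of the two note embeddings
    semantic_tags: float  # semantic tag overlap, assumed to lie in [0, 1]
    wikilinks: float      # assumed 1.0 if the notes link to each other, else 0.0
    cross_domain: bool    # True when the notes come from different domains


def domain_bridge(signals: PairSignals, tag_floor: float = 0.5) -> float:
    """Assumed binary bonus: cross-domain AND enough shared tag meaning."""
    if signals.cross_domain and signals.semantic_tags >= tag_floor:
        return 1.0
    return 0.0


def connection_score(signals: PairSignals, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the four signals."""
    return (
        weights["embedding"] * signals.embedding
        + weights["semantic_tags"] * signals.semantic_tags
        + weights["domain_bridge"] * domain_bridge(signals)
        + weights["wikilinks"] * signals.wikilinks
    )
```

As a worked case, take a pair with embedding similarity 0.5, tag overlap 0.8, and no wikilinks. It scores about 0.44 when both notes share a domain and about 0.64 when they do not. Against the recalibrated 0.55 threshold, only the cross-domain version surfaces, which is the thumb on the scale at work.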
Tradeoffs explicitly called out
Linter strictness vs. synthesis flexibility. The linter enforces exact section names and formatting. This means the synthesis agent can’t adapt its output structure to notes that might benefit from different analysis frameworks. A note about a personal experience gets the same 12-section treatment as a dense technical paper. I chose consistency over adaptability because the downstream pipeline requires it. The cost is that some synthesis sections feel forced on certain note types.
Fewer connections vs. discovery surface area. Going from 432 to 88 connections means I might miss some genuinely interesting but weak connections. I’m optimizing for precision over recall. For a personal knowledge system where you actually read and act on connections, this is the right call — 88 high-quality connections are more useful than 432 where most are noise. But I’ve accepted that some serendipitous discoveries won’t surface.
LLM-as-judge for bridge scoring. Using an AI model to judge whether two notes share a structural pattern introduces a dependency on the model’s judgment. It’s not deterministic — run it twice, you might get different results. I mitigate this by only using it as one signal among four, and only for top candidates. But it means the connection graph isn’t perfectly reproducible.
Maintenance cost. Building v1 took about two intense days. The quality infrastructure — linter, threshold recalibration, multi-signal scoring, semantic tag embeddings, audit gates — took the rest of the week and produced more code than v1 itself. The maintenance infrastructure is the actual product. The capture-and-store part is easy. The drift-detection, quality-enforcement, recalibration machinery is what separates a knowledge system from a digital junk drawer. But it’s a real ongoing cost.
Implementation details that matter
The linter architecture. 47 tests might sound like overkill for a personal project. It’s not. Each test covers a specific drift pattern I observed in the wild — merged headers, inconsistent bold labels, missing sections, wrong section order, malformed frontmatter. The tests are the documentation of every way the AI has drifted. When I see a new drift pattern, I add a test and a fix rule. The linter grows with the failure modes.
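To illustrate the "one test per observed drift pattern" idea, here is a hedged sketch of a single fix rule and its test. The header variants are the ones listed earlier in the post. The function name, the alias table layout, and the markdown header syntax are assumptions, not the actual pkb code.

```python
import re

# Canonical section name -> variants observed in the corpus (from the post).
HEADER_ALIASES = {
    "Core Logic Reconstruction": ["Core Logic"],
    "Stronger Argument": ["Steel-Man", "Steel Man", "Steelman"],
}

# Reverse lookup, case-insensitive on the variant.
_VARIANT_TO_CANONICAL = {
    variant.lower(): canonical
    for canonical, variants in HEADER_ALIASES.items()
    for variant in variants
}

# Assumption: headers are markdown lines of the form "## Title".
HEADER_LINE = re.compile(r"^(#+) (.+?)[ \t]*$", re.MULTILINE)


def fix_headers(text: str) -> str:
    """Rewrite known header variants to their canonical names.

    Unknown titles are left as they are (apart from trailing whitespace being
    trimmed) so that the validation step can still report them.
    """
    def canonicalize(match: re.Match) -> str:
        level, title = match.groups()
        canonical = _VARIANT_TO_CANONICAL.get(title.lower(), title)
        return f"{level} {canonical}"

    return HEADER_LINE.sub(canonicalize, text)


def test_steelman_variants_are_canonicalized():
    for variant in ["Steel-Man", "Steel Man", "Steelman"]:
        assert fix_headers(f"## {variant}\nbody\n") == "## Stronger Argument\nbody\n"
```

When a new variant appears, it becomes one more entry in the alias table and one more case in the test, so the test file doubles as a record of observed drift.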
Semantic tag embedding. Tags are embedded once and cached. When a new note comes in, its tags get embedded and compared against the cache. This is cheap — tags are short strings, embedding them is fast. The insight is that tag similarity should be semantic, not string-based. “decision-making” and “decision-hygiene” are obviously related to a human but invisible to string comparison. Embedding them as vectors makes the obvious relationship computable.
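A sketch of the embed-once-and-cache idea. The `embed` callable stands in for whatever embedding model is used, which the post does not name. The symmetric best-match average is one plausible way to turn tag vectors into an overlap score, not necessarily the one used here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class TagEmbeddingCache:
    """Embeds each tag at most once. `embed` is any callable str -> vector."""

    def __init__(self, embed):
        self._embed = embed
        self._vectors = {}

    def vector(self, tag):
        if tag not in self._vectors:
            self._vectors[tag] = self._embed(tag)
        return self._vectors[tag]


def _directed_overlap(tags_a, tags_b, cache):
    # For each tag in A, take its best match in B, then average.
    best = [
        max(cosine(cache.vector(a), cache.vector(b)) for b in tags_b)
        for a in tags_a
    ]
    return sum(best) / len(best)


def semantic_tag_overlap(tags_a, tags_b, cache):
    """Symmetric overlap: mean of the two directed best-match averages."""
    if not tags_a or not tags_b:
        return 0.0
    forward = _directed_overlap(tags_a, tags_b, cache)
    backward = _directed_overlap(tags_b, tags_a, cache)
    return 0.5 * (forward + backward)
```

With a real embedding model plugged in as `embed`, “decision-making” and “decision-hygiene” would be expected to score well above an unrelated tag pair, whereas exact string comparison gives both pairs zero.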
The out-of-band audit pattern. The synthesis agent and the linter never share context. The agent doesn’t know about the linter’s rules. The linter doesn’t know about the agent’s reasoning. This is intentional — if the agent knew the lint rules, it might optimize for passing the linter rather than producing good analysis. Keeping them separate preserves the independence of the quality check. Same reason you don’t let developers approve their own PRs.
Results with numbers
| Metric | Before (v1, pre-linter) | After (v1 + quality infra) |
|---|---|---|
| Notes in corpus | 192 | 192 |
| Connections discovered | 432 | 88 |
| Max cross-domain similarity | 0.713 | 0.7515 |
| Bridge detection rate | Not measured | 100% (30/30) |
| Format variants in corpus | 8 | 1 |
| Lint tests | 0 | 47 |
| Living theses | 16 | 16 |
The headline number — 80% fewer connections — looks like a regression. It’s the opposite. The system went from producing a haystack with some needles to producing mostly needles.
What’s next
Weight optimization. The multi-signal scoring weights are hand-tuned. I need a labeled evaluation set — pairs of notes with human-judged connection quality — to optimize properly. This is the kind of thing that’s easy to defer and hard to do well.
Thesis evolution tracking. The 16 living theses update when new evidence comes in, but I don’t track how they’ve changed over time. A thesis that’s been revised five times is more interesting than one that’s been stable. I need versioning and a diff view.
Linter coverage gaps. The linter catches structural drift but not semantic drift. If the synthesis agent starts producing shallower analysis while maintaining correct formatting, the linter won’t catch it. I need a quality signal beyond schema compliance — possibly an LLM-as-judge scoring synthesis depth, though that brings its own problems.
Scale questions. 192 notes is small. At 1,000 notes, the all-pairs connection discovery becomes expensive. At 10,000, it’s intractable. I’ll need approximate nearest neighbor search or a pre-filtering step. Not a problem yet, but it’s on the horizon.
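The growth behind that concern is simple arithmetic: an all-pairs comparison over n notes scores n(n-1)/2 unordered pairs. A small sketch (the function name is illustrative):

```python
def pair_count(n: int) -> int:
    """Number of unordered note pairs an all-pairs comparison must score."""
    return n * (n - 1) // 2


for n in (192, 1_000, 10_000):
    print(f"{n:>6} notes -> {pair_count(n):>12,} pairs")
```

That is 18,336 pairs at 192 notes, 499,500 at 1,000, and 49,995,000 at 10,000. Growing the corpus by 10x grows the work by roughly 100x.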
The system is arguing with me better than it was a week ago. Not because it’s smarter, but because I built the infrastructure to keep it honest. The interesting part isn’t the AI — it’s the quality engineering around the AI.