project Mar 5, 2026

What Broke When I Let an AI Run My Notes for a Week

pkb ai knowledge-management lessons-learned

TLDR

I built a personal knowledge system that uses AI to analyze my notes and find connections between ideas from totally different areas of my life. After one week of real use, three things broke in interesting ways, and fixing them taught me that the boring maintenance work is the actual product.


So remember how I told you about that “second brain” project? The system that reads everything I save, analyzes it, and then argues with me about what I believe? Well, I shipped the first version and then actually used it. For a full week. At scale.

Here’s the scorecard: 192 notes across six topics. Investing, engineering, parenting, AI, a couple others. The system ran every night, automatically analyzing new notes and looking for surprising connections between them. It maintained 16 “living theses,” which are basically my evolving beliefs about the world that the system tracks and challenges.

Things broke. But they broke in ways I didn’t expect, and the breaking was more interesting than the building.

The AI Got Sloppy (And I Didn’t Notice)

Every note goes through a deep analysis. The AI reads it and writes up things like “here’s the core argument,” “here’s the strongest counterargument,” “here’s what would prove this wrong.” Think of it like a really thorough book report, but for every article or idea I save.

The problem? Over 192 notes, the AI started getting inconsistent. Not wrong. Each individual analysis looked fine. But it would label a section “Stronger Argument” in one note, then “Steel-Man” in the next, then “Steelman” in another. Small stuff. The kind of thing you’d never notice reading one note at a time.

But my system needs those labels to match. It’s like if you organized your spice rack alphabetically but sometimes filed “cayenne” under C and sometimes under P for “pepper, cayenne.” Each choice makes sense on its own, but together they make the spice rack useless.

The fix was basically a spell-checker for structure. Before any note enters the system, it gets checked against a strict template. If the labels don’t match, it gets rejected or auto-corrected.

The bigger lesson hit me: AI output needs the same kind of quality control I use for anything else produced at volume. It’s not enough that each piece looks good. You have to check that they all play nicely together.

When “Better” Meant “Less”

After fixing all the inconsistencies and re-running everything, the number of connections between notes dropped from 432 to 88. An 80% reduction.

My stomach dropped. I thought I’d broken something.

But then I looked closer. The best connections actually got stronger. The system was finding fewer links, but the ones it found were real. Before, a note about budgeting risk in investing and a note about managing errors in software looked “kinda related” because both were vague. After the deeper analysis made each note more specific and distinct, the fake similarities disappeared. Only the genuine connections survived.

This was counterintuitive. I expected better analysis to find more connections. Instead, it found fewer, because understanding something well means knowing what it’s not, too. It’s like how a wine expert can distinguish between two similar reds that I’d call “both good.” Their refined palate doesn’t find more similarities. It finds more differences. And the similarities it does find are the ones that actually matter.

Finding Connections Across Different Worlds

The original dream was connecting ideas across totally different domains. “This investing principle is actually the same as this parenting insight.” That turned out to be the hardest part.

The math that powers the system is great at matching similar things within the same topic. Two articles about investing? Easy to compare. But “budget your maximum loss” from investing and “define your blast radius” from engineering are the same idea wearing different clothes. The math scores them as unrelated because the words are different.

The fix was to stop relying on one signal and start combining several. Think of it like identifying a song. Melody alone might not be enough. Rhythm alone might not be enough. But melody plus rhythm plus key plus tempo? Now you’ve got it. I stacked multiple ways of measuring similarity so the system could catch connections that any single approach would miss.

What Surprised Me

Here’s the thing nobody warns you about: building the first version took two intense days. Making it actually work (the quality checks, the recalibration, the multi-signal matching) took the rest of the week and produced more code than the original system.

I’ve started calling this the “maintenance tax.” The initial build is the fun part, the proof of concept, the “look what I made!” moment. But the infrastructure that keeps it reliable and honest? That’s the actual product. Without it, you don’t have a knowledge system. You have a digital junk drawer that gets messier every day.

Every system has this tax. The question is whether you pay it on purpose or pretend it doesn’t exist until things fall apart.

Where This Is Going

One week in, the system is arguing with me better than before. Not because the AI got smarter, but because I built the scaffolding to keep it honest. It’s the difference between a brilliant friend who sometimes rambles and that same friend with a good editor.

The next question I’m sitting with: what happens at 1,000 notes? At 5,000? The patterns that broke at 192 were manageable. I suspect there are entirely new categories of breakage waiting at the next order of magnitude. And I suspect fixing those will, once again, be more interesting than whatever I originally planned to build.

The problem, specifically

I shipped a personal knowledge base with nightly AI synthesis — 16-step deep analysis per note via Claude Opus, embedding-based connection discovery, cross-domain bridge scanning, living thesis tracking. The v1 architecture worked. Then I ingested 192 notes in a week (including batch imports of 40 and 71 articles from two authors), and three distinct failure modes emerged:

  1. AI-generated synthesis drifted in format across the corpus, silently breaking downstream pipelines
  2. Connection discovery produced 432 results, most of them noise
  3. Cross-domain connections — the whole point — barely worked because embeddings can’t bridge vocabulary gaps

None of these were bugs. The system was functioning correctly at the individual-note level. Every failure was a system-level property invisible from any single execution.

Architecture and design decisions

Format drift and the case for a note linter

Each note runs through a 16-step synthesis: logic reconstruction, bias detection, falsifiability analysis, second-order effects, and so on. The output is structured markdown with YAML frontmatter, required sections, specific header formatting.

Over 192 notes, the synthesis output drifted. “Core Logic Reconstruction” became “Core Logic.” “Stronger Argument” became “Steel-Man” or “Steel Man” or “Steelman.” Sub-section headers that should be bold labels got promoted to full headers. Body text merged onto header lines. Eight formatting variants across the corpus.

Each note looked fine in isolation. The AI never produced garbage. But the embedding pipeline expects consistent section names. The connection scorer compares sections across notes. When “Stronger Argument” in one note is “Steel-Man” in another, section-level matching breaks.

The fix: pkb lint. Validates every note against the schema — YAML frontmatter fields, all 12 synthesis sections in correct order, formatting rules (no merged headers, proper spacing, consistent bold labels). pkb lint --fix auto-corrects what it can. 47 tests covering the lint rules. Runs as a gate before embedding — malformed notes get rejected before they enter the graph.
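A minimal sketch of how a gate like this could work, not the actual pkb lint code. It assumes sections are level-2 markdown headers and uses a hypothetical three-section schema (the real one has 12; "Falsifiability" is a made-up section name). The alias table is built from the drift variants listed above.

```python
import re

# Hypothetical schema: the real one has 12 sections. "Falsifiability" is an
# invented name used only for illustration.
REQUIRED_SECTIONS = ["Core Logic Reconstruction", "Stronger Argument", "Falsifiability"]

# Drift variants mentioned in the post, mapped back to canonical names.
ALIASES = {
    "core logic": "Core Logic Reconstruction",
    "steel-man": "Stronger Argument",
    "steel man": "Stronger Argument",
    "steelman": "Stronger Argument",
}

# Assumption: sections are written as "## Name" headers.
HEADER = re.compile(r"^##\s+(.*\S)\s*$")


def fix_headers(text: str) -> str:
    """Rewrite known header variants to their canonical names (the --fix path)."""
    out = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            name = ALIASES.get(m.group(1).lower(), m.group(1))
            line = f"## {name}"
        out.append(line)
    return "\n".join(out)


def lint(text: str) -> list[str]:
    """Return a list of problems; an empty list means the note can be embedded."""
    found = [m.group(1) for line in text.splitlines() if (m := HEADER.match(line))]
    problems = [f"unknown section: {h}" for h in found if h not in REQUIRED_SECTIONS]
    problems += [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in found]
    known = [h for h in found if h in REQUIRED_SECTIONS]
    if known != [s for s in REQUIRED_SECTIONS if s in known]:
        problems.append("sections out of order or duplicated")
    return problems


note = "## Core Logic\ntext\n## Steelman\nmore\n## Falsifiability\nend"
print(lint(note))               # four problems: two unknown, two missing
print(lint(fix_headers(note)))  # []
```

The point of the sketch is that both functions are deterministic: the same note always produces the same verdict, which is the property a prompt instruction cannot guarantee.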

Design decision: I built this as a separate process, not a post-processing step inside the synthesis agent. The agent doing the work can’t reliably assess its own quality over time. This is the same principle behind code review — the incentives and perspective need to be different. The synthesis agent optimizes for depth of analysis. The linter optimizes for corpus consistency. Different objectives, different processes.

What I rejected: Prompt engineering the drift away. You can add “ALWAYS use exactly these section headers” to the system prompt, and it helps for a while, but it’s not enforceable. LLMs are stochastic. Over hundreds of runs, they’ll drift. You need a deterministic gate.

The 80% connection drop

After fixing all 192 notes and re-running the full pipeline (embed, discover connections, scan bridges), connections dropped from 432 to 88. An 80% reduction.

First instinct: something broke. The numbers said otherwise.

  • Max cross-domain similarity score increased from 0.713 to 0.7515
  • Bridge scanner hit 100% (30/30 bridges found)
  • The best connections got stronger while hundreds of weak ones disappeared

What happened: before synthesis, many notes had vague, underspecified embeddings. An article about position sizing in investing and an article about error budgets in engineering looked “kinda similar” because both were underspecified blobs of text. After synthesis gave each note a sharp, detailed analysis, the embeddings became more distinctive. Superficially related notes stopped matching. Genuinely connected notes matched harder.

I didn’t tune any parameters to get fewer connections. I improved the quality of understanding, and the noise fell away on its own. The score distribution shifted enough that the default connection threshold needed recalibration (from 0.7 down to 0.55), but the underlying signal-to-noise ratio improved dramatically.

The counterintuitive lesson: better understanding produces fewer connections, not more. Analysis disambiguates, which eliminates false positives. If you improve your pipeline and see a drop in results, check whether you lost noise or signal before panicking.
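One way to run that check, sketched under assumptions: each pipeline run is stored as a dict mapping a (note_a, note_b) pair to its score, with pairs keyed in a consistent order. The function name and the sample scores are hypothetical, not taken from the real corpus.

```python
def survival_report(before: dict, after: dict, top_k: int = 10) -> dict:
    """Summarize what a drop in connection count actually removed.

    `before` and `after` map (note_a, note_b) pairs to similarity scores.
    """
    if not before:
        raise ValueError("need at least one connection in the earlier run")
    # The strongest connections from the earlier run are the ones worth protecting.
    top_before = sorted(before, key=before.get, reverse=True)[:top_k]
    top_kept = [pair for pair in top_before if pair in after]
    return {
        "dropped_fraction": 1 - len(after) / len(before),
        "top_kept": len(top_kept),
        "top_total": len(top_before),
    }


# Illustrative scores only, not data from the real corpus.
before = {("a", "b"): 0.71, ("a", "c"): 0.66, ("b", "c"): 0.41,
          ("c", "d"): 0.38, ("a", "d"): 0.36}
after = {("a", "b"): 0.75, ("a", "c"): 0.69}

print(survival_report(before, after, top_k=2))
```

If top_kept stays close to top_total while dropped_fraction is high, the drop mostly removed weak links. For the numbers in this post, 1 - 88/432 comes to about 0.796.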

Cross-domain discovery: the hard problem

This was the whole point of the system — connecting ideas from investing to engineering to parenting to AI. It was also the hardest thing to make work.

Embeddings handle within-domain similarity well. Two articles about position sizing will score high. But “budget your maximum loss” (investing) and “define your blast radius” (agent safety) are structurally identical insights expressed in completely different vocabulary. Cosine similarity gives them a low score because the token distributions don’t overlap, even though the underlying principle is the same.
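A toy illustration of the vocabulary gap. It uses word counts as a crude stand-in for embeddings, which is an assumption made for clarity: real embedding models capture far more than word overlap, so their scores will differ. Only the cosine calculation itself carries over.

```python
import math
from collections import Counter


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def bag_of_words(text: str) -> Counter:
    """Crude stand-in for an embedding: one dimension per word."""
    return Counter(text.lower().split())


investing = bag_of_words("budget your maximum loss")
safety = bag_of_words("define your blast radius")

# Only "your" is shared, so the score is 1 / (2 * 2) = 0.25.
print(cosine_similarity(investing, safety))
```

The two phrases express the same principle, yet nothing in the arithmetic can see that, because the score is driven entirely by which dimensions overlap.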

Single-signal systems fail at boundaries. I needed signal diversity.

What I built: A multi-signal scoring pipeline that stacks four signals, each catching what the others miss.

The scoring formula went from:

0.6 * embedding + 0.2 * tags + 0.2 * wikilinks

to:

0.4 * embedding + 0.3 * semantic_tags + 0.2 * domain_bridge + 0.1 * wikilinks
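Both formulas as a small sketch. The weights are the ones above; everything else is an assumption, including the function name, the idea that each signal is normalized to the range 0 to 1, and the sample signal values.

```python
# Weights copied from the two formulas above.
V1_WEIGHTS = {"embedding": 0.6, "tags": 0.2, "wikilinks": 0.2}
V2_WEIGHTS = {"embedding": 0.4, "semantic_tags": 0.3,
              "domain_bridge": 0.2, "wikilinks": 0.1}


def combined_score(signals: dict, weights: dict) -> float:
    """Weighted sum of signals; a signal missing from `signals` counts as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


# Hypothetical cross-domain pair: low embedding similarity, no shared tag
# strings or wikilinks, but conceptually close tags. Values are illustrative.
pair = {"embedding": 0.3, "tags": 0.0, "wikilinks": 0.0,
        "semantic_tags": 0.8, "domain_bridge": 1.0}

print(round(combined_score(pair, V1_WEIGHTS), 2))  # 0.18
print(round(combined_score(pair, V2_WEIGHTS), 2))  # 0.56
```

With these made-up inputs, the same pair moves from 0.18 to 0.56 purely because the newer formula listens to more than one signal.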

The key changes:

Semantic tag overlap. Instead of comparing tags as strings (where “decision-making” and “decision-hygiene” score zero), I embed each tag as a vector and compare meaning. Conceptually similar tags boost the connection score even when the strings differ.

Domain bridge bonus. When two notes are from different domains AND share semantic tags, they get an explicit score boost. This compensates for the embedding weakness on cross-domain pairs. It’s a deliberate thumb on the scale — I’m telling the system “cross-domain connections with shared concepts are more interesting than the raw similarity score suggests.”

LLM bridge scoring. For the top cross-domain candidates, I run a lightweight model that reads both notes and judges whether they share a structural pattern or principle. This catches connections that pure math misses. Cost is negligible — roughly a penny per 50 pairs.
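The domain bridge bonus described above could be computed along these lines. This is a sketch under assumptions: the post does not say how "share semantic tags" is decided, so the min_overlap cutoff and the all-or-nothing 1.0 value are hypothetical choices.

```python
def domain_bridge_signal(domain_a: str, domain_b: str,
                         tag_overlap: float, min_overlap: float = 0.5) -> float:
    """Return 1.0 only for cross-domain pairs whose semantic tags overlap.

    `tag_overlap` is a semantic tag similarity in [0, 1]; `min_overlap` is a
    hypothetical cutoff for counting the tags as shared.
    """
    if domain_a == domain_b:
        return 0.0  # same-domain pairs never get the bonus
    return 1.0 if tag_overlap >= min_overlap else 0.0


print(domain_bridge_signal("investing", "engineering", 0.8))  # 1.0
print(domain_bridge_signal("investing", "investing", 0.8))    # 0.0
print(domain_bridge_signal("investing", "parenting", 0.1))    # 0.0
```

Requiring both conditions matters: a bonus for any cross-domain pair would reward unrelated notes simply for being far apart.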

What I rejected: Fine-tuning embeddings for cross-domain similarity. Too expensive, too fragile, and the training data doesn’t exist for a personal knowledge base. The multi-signal approach is more robust because each signal fails independently — they don’t share failure modes.

Tradeoff: The weights are hand-tuned. I picked 0.4/0.3/0.2/0.1 based on eyeballing results across a few dozen connections. There’s no principled optimization here. It works well enough, but the weights are almost certainly suboptimal. A proper evaluation would require labeled connection pairs, which I don’t have. This is a known limitation I’m living with.

Tradeoffs explicitly called out

Linter strictness vs. synthesis flexibility. The linter enforces exact section names and formatting. This means the synthesis agent can’t adapt its output structure to notes that might benefit from different analysis frameworks. A note about a personal experience gets the same 12-section treatment as a dense technical paper. I chose consistency over adaptability because the downstream pipeline requires it. The cost is that some synthesis sections feel forced on certain note types.

Fewer connections vs. discovery surface area. Going from 432 to 88 connections means I might miss some genuinely interesting but weak connections. I’m optimizing for precision over recall. For a personal knowledge system where you actually read and act on connections, this is the right call — 88 high-quality connections are more useful than 432 where most are noise. But I’ve accepted that some serendipitous discoveries won’t surface.

LLM-as-judge for bridge scoring. Using an AI model to judge whether two notes share a structural pattern introduces a dependency on the model’s judgment. It’s not deterministic — run it twice, you might get different results. I mitigate this by only using it as one signal among four, and only for top candidates. But it means the connection graph isn’t perfectly reproducible.

Maintenance cost. Building v1 took about two intense days. The quality infrastructure — linter, threshold recalibration, multi-signal scoring, semantic tag embeddings, audit gates — took the rest of the week and produced more code than v1 itself. The maintenance infrastructure is the actual product. The capture-and-store part is easy. The drift-detection, quality-enforcement, recalibration machinery is what separates a knowledge system from a digital junk drawer. But it’s a real ongoing cost.

Implementation details that matter

The linter architecture. 47 tests might sound like overkill for a personal project. It’s not. Each test covers a specific drift pattern I observed in the wild — merged headers, inconsistent bold labels, missing sections, wrong section order, malformed frontmatter. The tests are the documentation of every way the AI has drifted. When I see a new drift pattern, I add a test and a fix rule. The linter grows with the failure modes.
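As an illustration of "one test per drift pattern", here is a hypothetical fix rule for the merged-header case with its tests. The function name, the level-2 header format, and the sample text are assumptions, not code from the actual linter.

```python
import unittest

# Assumption: sections are "## Name" headers; the name comes from the post.
SECTION_HEADER = "## Stronger Argument"


def split_merged_header(line: str) -> list[str]:
    """Split body text that was merged onto a known header line."""
    if line.startswith(SECTION_HEADER + " "):
        body = line[len(SECTION_HEADER):].strip()
        if body:
            return [SECTION_HEADER, body]
    return [line]


class TestMergedHeader(unittest.TestCase):
    def test_body_text_moves_to_its_own_line(self):
        fixed = split_merged_header("## Stronger Argument The best case is that")
        self.assertEqual(fixed, ["## Stronger Argument", "The best case is that"])

    def test_clean_header_is_untouched(self):
        self.assertEqual(split_merged_header("## Stronger Argument"),
                         ["## Stronger Argument"])
```

Saved to a file, this runs with python -m unittest. Each test name doubles as a record of a drift pattern that was actually observed.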

Semantic tag embedding. Tags are embedded once and cached. When a new note comes in, its tags get embedded and compared against the cache. This is cheap — tags are short strings, embedding them is fast. The insight is that tag similarity should be semantic, not string-based. “decision-making” and “decision-hygiene” are obviously related to a human but invisible to string comparison. Embedding them as vectors makes the obvious relationship computable.
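A sketch of the embed-once-and-cache idea. The embedding function is injected, and the toy two-dimensional vectors below are invented stand-ins for a real embedding model. The way tag similarities are aggregated (average of each tag's best match) is also an assumption; the post does not specify it.

```python
import math
from typing import Callable

Vector = list[float]


class TagEmbeddingCache:
    """Embeds each distinct tag once and reuses the vector afterwards."""

    def __init__(self, embed: Callable[[str], Vector]):
        self._embed = embed
        self._vectors: dict[str, Vector] = {}
        self.embed_calls = 0  # counts real embedding calls, for inspection

    def get(self, tag: str) -> Vector:
        if tag not in self._vectors:
            self._vectors[tag] = self._embed(tag)
            self.embed_calls += 1
        return self._vectors[tag]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_tag_overlap(tags_a: list[str], tags_b: list[str],
                         cache: TagEmbeddingCache) -> float:
    """Average, over tags in A, of the best cosine match among tags in B."""
    if not tags_a or not tags_b:
        return 0.0
    best = [max(cosine(cache.get(a), cache.get(b)) for b in tags_b) for a in tags_a]
    return sum(best) / len(best)


# Invented vectors standing in for a real embedding model.
TOY_VECTORS = {
    "decision-making": [1.0, 0.1],
    "decision-hygiene": [0.9, 0.2],
    "sourdough": [0.0, 1.0],
}
cache = TagEmbeddingCache(TOY_VECTORS.__getitem__)

related = semantic_tag_overlap(["decision-making"], ["decision-hygiene"], cache)
unrelated = semantic_tag_overlap(["decision-making"], ["sourdough"], cache)
print(related > unrelated, cache.embed_calls)
```

Three distinct tags produce three embedding calls across both comparisons, because "decision-making" is served from the cache the second time.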

The out-of-band audit pattern. The synthesis agent and the linter never share context. The agent doesn’t know about the linter’s rules. The linter doesn’t know about the agent’s reasoning. This is intentional — if the agent knew the lint rules, it might optimize for passing the linter rather than producing good analysis. Keeping them separate preserves the independence of the quality check. Same reason you don’t let developers approve their own PRs.

Results with numbers

| Metric | Before (v1, pre-linter) | After (v1 + quality infra) |
| --- | --- | --- |
| Notes in corpus | 192 | 192 |
| Connections discovered | 432 | 88 |
| Max cross-domain similarity | 0.713 | 0.7515 |
| Bridge detection rate | Not measured | 100% (30/30) |
| Format variants in corpus | 8 | 1 |
| Lint rules | 0 | 47 |
| Living theses | 16 | 16 |

The headline number — 80% fewer connections — looks like a regression. It’s the opposite. The system went from producing a haystack with some needles to producing mostly needles.

What’s next

Weight optimization. The multi-signal scoring weights are hand-tuned. I need a labeled evaluation set — pairs of notes with human-judged connection quality — to optimize properly. This is the kind of thing that’s easy to defer and hard to do well.

Thesis evolution tracking. The 16 living theses update when new evidence comes in, but I don’t track how they’ve changed over time. A thesis that’s been revised five times is more interesting than one that’s been stable. I need versioning and a diff view.

Linter coverage gaps. The linter catches structural drift but not semantic drift. If the synthesis agent starts producing shallower analysis while maintaining correct formatting, the linter won’t catch it. I need a quality signal beyond schema compliance — possibly an LLM-as-judge scoring synthesis depth, though that brings its own problems.

Scale questions. 192 notes is small. At 1,000 notes, the all-pairs connection discovery becomes expensive. At 10,000, it’s intractable. I’ll need approximate nearest neighbor search or a pre-filtering step. Not a problem yet, but it’s on the horizon.
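The growth is easy to quantify: comparing every note with every other note means n(n-1)/2 pairs. A quick calculation, assuming each unordered pair is scored exactly once:

```python
from math import comb


def pair_count(n_notes: int) -> int:
    """Number of unordered note pairs an all-pairs comparison must score."""
    return comb(n_notes, 2)


for n in (192, 1_000, 10_000):
    print(n, pair_count(n))
# 192 notes give 18,336 pairs; 1,000 give 499,500; 10,000 give 49,995,000.
```

Going from 192 to 10,000 notes multiplies the note count by about 52 but the pair count by more than 2,700, which is why a pre-filtering step becomes attractive well before the corpus feels large.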

The system is arguing with me better than it was a week ago. Not because it’s smarter, but because I built the infrastructure to keep it honest. The interesting part isn’t the AI — it’s the quality engineering around the AI.

Have you ever had a notebook where you wrote down cool facts from all over the place? Maybe stuff about dinosaurs, and also stuff about your favorite video game, and also something funny your friend said at lunch?

Now imagine your notebook could think. Not just hold your notes — actually read them, find patterns, and say, “Hey, did you notice that the way ants build colonies is kind of like how your Minecraft village works?”

Someone built a notebook like that. A digital one, powered by a really smart computer brain. It could swallow hundreds of notes, read them overnight while the person slept, and wake up in the morning with new ideas about how things connect.

Sounds amazing, right?

It was. For about a week. Then things got weird.

The Messy Handwriting Problem

You know how teachers sometimes say, “I can’t grade this if I can’t read it”? Something like that happened here.

The computer brain wrote notes every night — almost 200 of them. Each one looked fine on its own. But here’s the sneaky part: it kept changing how it labeled things. One night it would call a section “Stronger Argument.” The next night, the same kind of section got called “Steel-Man.” Then “Steel Man.” Then “Steelman.”

[Illustration: A child looks confused beside a glowing notebook covered in mismatched labels and scattered trading cards on a nighttime desk.]

To a human reading one note, no big deal. But the notebook needed those labels to match so it could compare ideas across notes. It’s like if you organized your trading cards by type, but sometimes you wrote “Water” and sometimes “Aqua” and sometimes “Blue-ish.” You’d never find anything!

The fix? A spell-checker, but for the whole notebook. Before any note could go in, it had to pass a test: Are all the sections named the right way? Is everything in the right order? If not — rejected. Fix it first.

Here’s the cool lesson: even a super-smart brain needs rules to follow. Just like your classroom has rules so everyone can work together, a thinking notebook needs rules so all the notes can work together.

When Cleaning Up Means Throwing Away

After fixing all the messy labels, something surprising happened.

The notebook had found 432 connections between notes. After the cleanup? Only 88.

Wait — that sounds terrible! Like losing most of your work!

But think about it this way. Imagine you dumped out a giant box of puzzle pieces. At first, you might grab two pieces and think, “These are both blue-ish, maybe they go together!” You’d find hundreds of “matches.” But most of them would be wrong.

Now imagine you put on really good glasses and looked more carefully. You’d find fewer matches — but the ones you found would actually fit.

[Illustration: A child wearing big glasses happily finds two perfectly matching puzzle pieces beside a messy pile of mismatched ones.]

That’s exactly what happened. The 88 connections that survived were stronger and more real than the 432 fuzzy ones. The notebook got smarter by being pickier.

Better understanding doesn’t mean finding more answers. It means finding the right ones. That’s true for notebooks, and honestly, it’s true for people too.

Connecting Things That Don’t Look Connected

Here’s the hardest part. The whole point of this notebook was to spot ideas that connect across totally different subjects. Like noticing that “don’t bet everything on one move” in a card game is the same idea as “don’t put all your eggs in one basket” in real life.

But the computer brain compared words, not ideas. And different subjects use different words! So it kept missing the good stuff.

The solution was to look at ideas from multiple angles — not just one. Kind of like how you recognize your friend from their voice, their walk, AND their face, not just one thing. The notebook started checking: Do these notes use similar ideas? Are they from different subjects but share a theme? And sometimes, it would ask a second computer brain, “Hey, do you think these two notes are secretly about the same thing?”

Using all those clues together, it finally started catching the connections that really mattered.

Why You Can’t Check Your Own Homework

Here’s something that might surprise you: the computer brain that wrote all those notes? It couldn’t tell that it was being messy. Each note looked fine to it. It’s the same reason you can read your own essay five times and miss a spelling mistake that your friend spots in two seconds.

When you make something, your brain fills in what you meant to write. You literally can’t see your own errors, not because you’re careless, but because your brain is too close to the work. That’s why the notebook’s builder made a separate checker that had never seen the notes before. Fresh eyes, no assumptions.

[Illustration: Two children swap their school essays at a desk. One points at a mistake on the other's paper with a big smile, while the author looks surprised.]

This isn’t just about computers. Scientists have this rule: you can’t design an experiment and also be the one who decides if it worked. A doctor can’t diagnose themselves. A referee can’t play in the game. The person doing the work should never be the only person checking the work. Next time you write something for school, try swapping with a friend before handing it in. You’ll be amazed at what they catch.

The Real Lesson (It’s Not About Notebooks)

Building the smart notebook took two days. Fixing everything that went wrong? The rest of the week — and it was harder than building it.

That ratio matters. The exciting part of any project — the building, the inventing, the “look what I made!” moment — is actually the easy part. The hard part is what comes after: noticing what’s broken, being honest about it, and doing the boring work to make it actually reliable.

Most people quit after the exciting part. The ones who make things that actually last are the ones who stick around for the fixing.

A Challenge

Here are three ideas from totally different places:

  • In chess, the best move is often not attacking. It’s making your position stronger first.
  • In cooking, professional chefs spend more time preparing ingredients than actually cooking.
  • In soccer, the best players spend more time without the ball than with it.

These seem like they’re about completely different things. But there’s a hidden pattern connecting all three. Can you find it?

That’s exactly what this notebook does — except with hundreds of ideas instead of three. And here’s the thing: you don’t need a computer to start thinking this way. The next time you learn something new, ask yourself: “Where have I seen this pattern before, in a totally different place?” The connections are there. Most people just never look.