Beetroot

I Mass-Replaced Fuse.js in My Clipboard Manager. It Took 8 Iterations.

Fuse.js returned 98 results for 'local v'. Two days and 8 iterations later — from same-field matching to Unicode word boundaries to scored merge — I finally got search right. Here's every wrong turn.

Users kept complaining: search in Beetroot returned garbage. "timeout" matched delivery_time_options. "local v" returned 98 results out of 1,106 entries.

Fuse.js wasn't the problem. Clipboard data was. Code snippets, stack traces, long URLs — on 200+ character strings, the letters of your query are statistically guaranteed to appear somewhere. Fuse.js calls that a match.

Two days. Eight iterations. Here's every wrong turn.

TL;DR:

  • Fuse.js produces garbage on long strings — the letters scatter-match across 200+ characters
  • Same-field matching didn't help. Word boundaries helped partially. Field priority was the real breakthrough
  • Early-exit broke monotonic narrowing (more characters typed → results jump instead of narrowing)
  • Final: scored merge — all phases run unconditionally, dedup by best score, fuzzy always present

The starting point

```typescript
const fuse = new Fuse(items, {
  keys: ['content', 'note', 'source_app', 'source_title'],
  threshold: 0.4,
  minMatchCharLength: 3,
})
return fuse.search(query) // 🔥 that's it. one phase.
```
| Query | Got | Expected | Why |
|---|---|---|---|
| timeout | ~40 | 2–5 | Letters t-i-m-e-o-u-t scattered in delivery_time_options |
| local v | 98 | 2–5 | "local" in window titles, "v" in everything |
| lm st | 71 | 1–3 | "st" matches "install", "const", "first"... |
| port 80 | 13 | 1–3 | "port" found inside "import" |
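The "timeout" row is reproducible without Fuse.js at all. Fuse uses a Bitap-style approximate match; a rough stand-in (a sketch, not Fuse's actual algorithm) is the minimum edit distance between the query and any window of the text. On long strings, some window is nearly always within the error allowance:

```typescript
// Sketch: why fuzzy matchers "find" a query inside long strings.
// Classic Levenshtein distance with a single rolling row.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i)
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]
    dp[0] = j
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i]
      dp[i] = Math.min(
        dp[i] + 1,      // deletion
        dp[i - 1] + 1,  // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      )
      prev = tmp
    }
  }
  return dp[a.length]
}

// Lowest edit distance of the query against any substring window of the text
function bestWindowDistance(query: string, text: string): number {
  let best = Infinity
  for (let len = query.length - 2; len <= query.length + 2; len++) {
    for (let i = 0; i + len <= text.length; i++) {
      best = Math.min(best, editDistance(query, text.slice(i, i + len)))
    }
  }
  return best
}

// "timeout" is only 2 edits away from the "time_opt" window
// inside delivery_time_options
console.log(bestWindowDistance('timeout', 'delivery_time_options'))
```

Under Fuse's default threshold of 0.4, a 7-character pattern tolerates roughly 0.4 × 7 ≈ 2.8 errors, so a distance-2 window like "time_opt" counts as a match.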

How Ditto and CopyQ do it

Before building anything, I checked the two most popular open-source alternatives.

Ditto — SQLite WHERE mText LIKE '%term%'. Full table scan on every keystroke. No fuzzy, no scoring, no FTS index.

CopyQ — in-memory QString::contains() per item, AND logic across whitespace-split tokens. Batched in 20ms slices. No fuzzy, no scoring.

Both share the same problems: no relevance ranking (everything sorted by recency), no awareness of code structure (LIKE '%port%' matches "import"), and linear scans that slow down at scale.

Fuse.js was already better for typo tolerance. The problem wasn't the baseline — it was that fuzzy matching is fundamentally wrong for long strings.

The 8 iterations

1. Same-field matching → didn't help

Idea: require all tokens in the same field.

```typescript
tokens.every(t => field.lower.includes(t))
```

"local v" → still 98. Why? source_title had "So the local path needed loopback-only validati..." — both "local" and "v" (in "validation") in ONE field. Long metadata strings are the enemy.

2. Word-boundary matching → partially helped

Idea: "v" shouldn't match the middle of "validation."

JavaScript \b doesn't work for Unicode — Cyrillic, CJK all broken:

```javascript
/\bстд/.test('лм стд нельзя')  // false! \b doesn't know Cyrillic
```
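(For completeness: modern engines do offer a regex-level workaround, a negative lookbehind with Unicode property escapes, though a hand-rolled check is easier to extend with camelCase rules:)

```typescript
// \b only knows ASCII [A-Za-z0-9_], so Cyrillic never forms a boundary
const ascii = /\bстд/.test('лм стд нельзя')                    // false

// Lookbehind + Unicode property escapes: "not preceded by a letter/digit"
const unicode = /(?<![\p{L}\p{N}])стд/u.test('лм стд нельзя')  // true
console.log(ascii, unicode)
```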

Built a Unicode-aware isWordStart():

```typescript
function isWordStart(text: string, pos: number): boolean {
  if (pos === 0) return true
  const prev = text[pos - 1]
  if (!/[\p{L}\p{N}]/u.test(prev)) return true          // after space, _, -
  if (/\p{Ll}/u.test(prev) && /\p{Lu}/u.test(text[pos])) return true // camelCase
  return false
}
```
| Position | Example | Boundary? |
|---|---|---|
| After space | import ▌React | ✓ |
| After _ | delivery_▌time | ✓ |
| camelCase | local▌Variable | ✓ |
| Mid-word | im▌port | ✗ |
| Cyrillic | лм ▌стд | ✓ |

"local v" → 78. Better, but source_title still contained both words at legitimate boundaries.

3. Clean refactor → same results, better code

Code was patch-on-patch — 7 commits, each patching the previous. Rewrote into three clean phases: contiguous substring → word-start tokens → Fuse.js fuzzy. But still treating all 4 fields equally.

4. Field priority → the real breakthrough

The problem was never the algorithm. It was treating window metadata as equal to clipboard content.

```typescript
const PRIMARY   = ['content', 'note']            // what you copied
const SECONDARY = ['source_app', 'source_title'] // window metadata
```

Search primary first. Only fall back to metadata if nothing found — strict early-exit:

```typescript
const p1 = contiguousSearch(items, query, PRIMARY)
if (p1.length > 0) return p1

const p2 = wordStartTokenSearch(items, query, PRIMARY)
if (p2.length > 0) return p2

// Only if content has nothing — check metadata
const s1 = contiguousSearch(items, query, SECONDARY)
if (s1.length > 0) return s1
```

"local v" → 9 results. Down from 98. The source_title noise was eliminated because content matches were found first.

Seven iterations improving matching logic. The real fix? One if statement. (This early-exit logic would later cause its own problems — see iteration 6.)
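The contiguousSearch phase itself is never shown in the post; under the field names above it can be little more than a case-insensitive substring filter (illustrative sketch, toy Item type):

```typescript
interface Item {
  id: number
  content: string
  note: string
  source_app: string
  source_title: string
}

const PRIMARY: (keyof Item)[] = ['content', 'note']
const SECONDARY: (keyof Item)[] = ['source_app', 'source_title']

// Minimal sketch of the contiguous-substring phase: keep items where the
// whole query appears verbatim (case-insensitively) in one of the fields
function contiguousSearch(items: Item[], query: string, fields: (keyof Item)[]): Item[] {
  const q = query.toLowerCase().trim()
  if (!q) return []
  return items.filter(item =>
    fields.some(f => String(item[f]).toLowerCase().includes(q)),
  )
}
```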

5. Fragment highlight → invisible matches made visible

For long clips, highlights were invisible. truncate() showed the first 100 chars, but the match could be at position 250:

```typescript
// Before: always first 100 chars → match not visible
// After: window shifts to the match
if (indices[0][0] >= maxLen) {
  offset = matchStart - contextBefore
}
// "...needed a local variable for the..." ← highlighted
```

6–7. Early-exit broke → merged scoring

Early-exit seemed elegant. But it broke monotonic narrowing — the expectation that typing more characters narrows results:

"thai"     → Phase 1 → 6 results
"thai mon" → Phase 1 → 0 → Phase 2 → 1 completely different result 😱

And a paradox: better spelling → fewer results:

"Antropc"   → fuzzy → finds "Anthropic" → 3 results
"antrophic" → Phase 1: 1 exact match → fuzzy skipped → 1 result 🤔

8. Scored merge → the final architecture

The fix wasn't choosing between early-exit and always-run. It was running everything, but scoring each phase differently:

```typescript
// All phases run unconditionally, each at a different score
const hits: ScoredHit[] = [
  ...collectScored(contiguousSearch(items, q, PRIMARY), 1.0),
  ...collectScored(wordStartTokenSearch(items, q, PRIMARY), 0.75),
  ...collectScored(contiguousSearch(items, q, SECONDARY), 0.5),
  ...collectScored(wordStartTokenSearch(items, q, SECONDARY), 0.25),
]

// Dedup: keep best score per item
const best = new Map<number, ScoredHit>()
for (const hit of hits) {
  const existing = best.get(hit.item.id)
  if (!existing || hit.score > existing.score)
    best.set(hit.item.id, hit)
}

// Fuzzy always runs — but only adds NEW items not found above
for (const r of ensureIndex(items).search(query)) {
  if (!best.has(r.item.id))
    best.set(r.item.id, { item: r.item,
      score: Math.max(0.05, 0.15 * (1 - (r.score ?? 1))), ... })
}
```

Why this works:

  • No early-exit problems. Every phase always runs → monotonic narrowing guaranteed
  • No noise. Scores control ranking, not filtering. "local" from source_title scores 0.5, the same match in content scores 1.0 — content always wins in sort order
  • Fuzzy always present. "Antropc" finds "Anthropic" whether or not exact matches exist — just ranked lower
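collectScored and the dedup step together are only a few lines. A self-contained sketch (toy Item type; only collectScored's name comes from the post):

```typescript
interface Item { id: number }
interface ScoredHit { item: Item; score: number }

// Wrap one phase's raw matches in that phase's score
function collectScored(matches: Item[], score: number): ScoredHit[] {
  return matches.map(item => ({ item, score }))
}

// Keep the best score per item id, highest score first
function dedupBest(hits: ScoredHit[]): ScoredHit[] {
  const best = new Map<number, ScoredHit>()
  for (const hit of hits) {
    const existing = best.get(hit.item.id)
    if (!existing || hit.score > existing.score) best.set(hit.item.id, hit)
  }
  return [...best.values()].sort((a, b) => b.score - a.score)
}

// An item found by both a title phase (0.5) and a content phase (1.0)
// surfaces once, at its best score
dedupBest([
  ...collectScored([{ id: 7 }], 0.5),
  ...collectScored([{ id: 7 }], 1.0),
  ...collectScored([{ id: 9 }], 0.25),
])
// → [{ item: { id: 7 }, score: 1.0 }, { item: { id: 9 }, score: 0.25 }]
```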

How scored merge works

"local v"  → Phase 1: 9 exact hits (1.0), Phase 2-4: some at 0.75/0.5/0.25
           → dedup keeps score 1.0 for content matches → 9 top results ✓

"lm st"    → Phase 1: 0 → Phase 2: 1 word-start (0.75)
           → Phase 5 (fuzzy): adds typo hits at 0.05–0.15

"Antropc"  → Phases 1-4: 0 → Phase 5: Fuse.js → "Anthropic" at 0.05–0.15 ✓

"timeout"  → Phase 1: 20+ substring hits (1.0) → top of results
           → lower phases still run but can't outrank content matches

The score table

| Score | What matches | Fields |
|---|---|---|
| 1.0 | Exact phrase substring | content, note |
| 0.75 | Word-start tokens | content, note |
| 0.5 | Exact phrase substring | window title, app |
| 0.25 | Word-start tokens | window title, app |
| 0.05–0.15 | Fuzzy (typos) | all fields |

All phases run on every query. Scores control ranking, not filtering.

Before → after

On 1,120 real clipboard entries:

| Query | Before | After | What changed |
|---|---|---|---|
| local v | 98 | 9 | Metadata can't pollute |
| lm st | 71 | 1–2 | "st" won't match "install" |
| port 80 | 13 | 1–2 | "port" won't match "import" |
| timeout | ~40 | ~8 | Exact substring only |
| thai → thai mon | Jump | Narrow | Monotonic ✓ |
| Antropc (typo) | 3 | 3 | Typo tolerance kept |

Benchmark (1,120 items, 5,000 query runs): 0.76ms per query. Running all phases adds negligible overhead — the deterministic phases are simple string operations, and the Fuse.js index is cached.

In action

Exact match — "microsoft ope" finds 5 relevant entries with highlighted tokens:

[Screenshot: Beetroot search, exact match for "microsoft ope" showing 5 results with highlighted tokens]

Fuzzy typo — "necrosoft" (misspelled) still finds Microsoft entries via the fuzzy phase:

[Screenshot: Beetroot search, fuzzy match for the "necrosoft" typo showing 2 results]

The highlight tax

Finding matches is 3 lines. Showing why something matched — 40% of the code.

The worst bug: query port import on import React from 'react' — "import" matches [0,5], "port" matches [2,5] (it's inside "import"). Two overlapping <mark> elements → visual glitch. Classic interval merge fixes it:

```typescript
indices.sort((a, b) => a[0] - b[0])
const merged: [number, number][] = []
for (const pair of indices) {
  const last = merged[merged.length - 1]
  if (last && pair[0] <= last[1] + 1)
    last[1] = Math.max(last[1], pair[1])  // merge overlapping
  else
    merged.push([pair[0], pair[1]])
}
// [0,5] + [2,5] → [0,5]. +1 merges adjacent ranges too.
```
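Once intervals are merged, rendering is a plain split into highlighted and non-highlighted segments. A sketch of feeding merged (inclusive) ranges to the mark layer; the function name is mine:

```typescript
type Span = [number, number] // inclusive indices, as in the merge above

// Split text into alternating plain/highlighted segments from merged spans
function toSegments(text: string, merged: Span[]): { text: string; mark: boolean }[] {
  const out: { text: string; mark: boolean }[] = []
  let pos = 0
  for (const [start, end] of merged) {
    if (start > pos) out.push({ text: text.slice(pos, start), mark: false })
    out.push({ text: text.slice(start, end + 1), mark: true })
    pos = end + 1
  }
  if (pos < text.length) out.push({ text: text.slice(pos), mark: false })
  return out
}

toSegments("import React from 'react'", [[0, 5]])
// → [{ text: 'import', mark: true }, { text: " React from 'react'", mark: false }]
```

Each `mark: true` segment then becomes one `<mark>` element, so overlapping highlights can no longer glitch.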

Also: fieldIndices.push(...findAllIndices()) hit RangeError: Maximum call stack size exceeded on 10K-char strings. Spread operator expands into function arguments — V8 has a limit. Replaced with explicit loop.
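The replacement is mechanical; a generic append loop keeps the array off the call stack (sketch):

```typescript
// push(...arr) expands arr into individual call arguments, and V8 caps how
// many arguments one call may take. An explicit loop has no such limit.
function appendAll<T>(target: T[], source: T[]): void {
  for (const x of source) target.push(x)
}

// Instead of: fieldIndices.push(...findAllIndices())
// use:        appendAll(fieldIndices, findAllIndices())
```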

What I learned

The problem was the data, not the algorithm. Seven iterations improved matching logic. The real fix was field priority — stop treating window titles as equal to content.

Monotonic narrowing is not optional. "Typed more → results jump" is deeply disorienting. Test with query sequences, not individual queries.
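One way to encode that lesson is a property check over query prefixes: each longer query's result ids must be a subset of the previous query's. A sketch with a toy substring search (the deterministic phases satisfy this; a fuzzy phase can legitimately relax it):

```typescript
// Property check: typing more characters should only narrow results
function isMonotonic(search: (q: string) => { id: number }[], steps: string[]): boolean {
  let prev: Set<number> | null = null
  for (const q of steps) {
    const ids = new Set(search(q).map(r => r.id))
    if (prev !== null) {
      for (const id of ids) if (!prev.has(id)) return false
    }
    prev = ids
  }
  return true
}

// Toy search; real usage would wrap the scored-merge pipeline
const data = [
  { id: 1, content: 'thai food' },
  { id: 2, content: 'thai monastery' },
]
const search = (q: string) => data.filter(d => d.content.includes(q))

isMonotonic(search, ['thai', 'thai m', 'thai mon']) // true: each step narrows
```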

Fuzzy is not a fallback. Typo-tolerant results should always be present, just ranked low. "Fuzzy OR deterministic" creates paradoxes where better spelling → fewer results.

Patch-on-patch is debt in 48 hours. Seven commits each patching the previous. A clean refactor at iteration 3 would have saved four commits. But I probably needed those wrong turns to understand the problem space.

Discussion

No comment section here — all discussions happen on X.

Max Nardit

@mnardit
