A heavily annotated Rust implementation of fuzzy search. Understand every line — the Levenshtein algorithm, the optimisations, and the intuition behind each. Beginner friendly.
Normal search requires an exact match. Fuzzy search finds results even when the input has typos, missing letters, or small differences.
The core idea: count the minimum number of single-character changes (insert, delete, substitute) to transform one string into another. Fewer changes = better match.
We turn the raw distance into a 0–100% score. Identical strings score 100%. Completely different strings score 0%. A threshold (e.g. 60%) filters out poor matches.
Multiple candidates often meet the threshold. We sort by score (best first) so the most relevant result always appears at the top.
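The distance → score → threshold → sort pipeline described above can be sketched in a few lines. This is an illustrative standalone sketch, not the library's API; the distance values are supplied by hand, and `score` assumes the formula 1 − distance / max(len):

```rust
// Hypothetical sketch of the scoring pipeline (not the library's API).
// score = 1 - distance / max(len_a, len_b), as described above.
fn score(distance: usize, a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - distance as f64 / max_len as f64
}

fn main() {
    // "apple" vs "aple" is one deletion; max length 5, so 1 - 1/5 = 80%
    assert!((score(1, "apple", "aple") - 0.8).abs() < 1e-9);

    // Distances here are precomputed by hand: aple/banana is 5 edits.
    let mut matches = vec![
        ("apple", score(1, "apple", "aple")),
        ("banana", score(5, "aple", "banana")),
    ];
    // Filter by a 60% threshold, then sort best-first
    matches.retain(|&(_, s)| s >= 0.6);
    matches.sort_by(|x, y| y.1.partial_cmp(&x.1).unwrap());
    assert_eq!(matches[0].0, "apple");
    println!("best match: {} ({:.0}%)", matches[0].0, matches[0].1 * 100.0);
}
```

Sorting descending by score is what guarantees the most relevant result appears first.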
Every fuzzy search is built on this single idea: how many single-character edits separate two strings? Here are the three allowed operations.
Turn cat → cart by inserting r. Cost: 1.
cat → crat → cart
Turn cart → cat by deleting r. Cost: 1.
cart → cat
Turn cat → bat by replacing c → b. Cost: 1.
cat → bat
When two characters at the same position match, no edit is needed. Cost: 0.
cat ↔ bat
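These three operations translate directly into a naive recursive definition of edit distance. The helper below is written for illustration only (it is not part of the library, and it runs in exponential time); the matrix-based `compute` further down memoises exactly this recurrence:

```rust
// Naive recursive edit distance: at every mismatch, try all three
// operations and keep the cheapest. Exponential time, for intuition only.
fn edit_distance(src: &[char], tgt: &[char]) -> usize {
    match (src, tgt) {
        ([], t) => t.len(), // source empty: insert the rest of target
        (s, []) => s.len(), // target empty: delete the rest of source
        ([s_rest @ .., s_last], [t_rest @ .., t_last]) => {
            if s_last == t_last {
                edit_distance(s_rest, t_rest) // match: cost 0
            } else {
                1 + edit_distance(s_rest, tgt)              // delete
                        .min(edit_distance(src, t_rest))    // insert
                        .min(edit_distance(s_rest, t_rest)) // substitute
            }
        }
    }
}

fn main() {
    let to_chars = |s: &str| s.chars().collect::<Vec<char>>();
    assert_eq!(edit_distance(&to_chars("cat"), &to_chars("cart")), 1); // insert
    assert_eq!(edit_distance(&to_chars("cart"), &to_chars("cat")), 1); // delete
    assert_eq!(edit_distance(&to_chars("cat"), &to_chars("bat")), 1);  // substitute
    assert_eq!(edit_distance(&to_chars("kitten"), &to_chars("sitting")), 3);
}
```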
Each variant produces identical results; the faster variants trade readability for memory and speed.
Start with compute, understand it fully, then read the optimised versions.
The classic Wagner-Fischer algorithm. Builds a complete (m+1)×(n+1) grid where every cell represents the edit distance for a pair of prefixes. Start here — every cell, every loop, every step is annotated in the source.
See LevenshteinWithOperations for a variant that also reports which edits were applied.

```rust
use std::cmp::min;

pub fn compute(source: &str, target: &str) -> usize {
    let src: Vec<char> = source.chars().collect();
    let tgt: Vec<char> = target.chars().collect();
    let (m, n) = (src.len(), tgt.len());

    // Edge cases
    if m == 0 { return n; }
    if n == 0 { return m; }

    // Full (m+1)×(n+1) matrix
    let mut matrix = vec![vec![0usize; n + 1]; m + 1];

    // Base cases: empty prefix conversions
    for i in 0..=m { matrix[i][0] = i; }
    for j in 0..=n { matrix[0][j] = j; }

    for i in 1..=m {
        for j in 1..=n {
            // 0 if chars match, 1 if they differ
            let cost = if src[i - 1] == tgt[j - 1] { 0 } else { 1 };
            matrix[i][j] = min(
                matrix[i - 1][j] + 1, // deletion
                min(
                    matrix[i][j - 1] + 1,        // insertion
                    matrix[i - 1][j - 1] + cost, // substitution
                ),
            );
        }
    }
    matrix[m][n]
}
```
Key insight: each cell only reads from the row directly above. Once row i is complete we never need rows 0..i−1 again. So we keep just two rows and swap them each iteration.
std::mem::swap is free — just a pointer swap, no copy — and the cells keep the same usize size as compute.

```rust
use std::cmp::min;

pub fn compute_optimized(source: &str, target: &str) -> usize {
    let src: Vec<char> = source.chars().collect();
    let tgt: Vec<char> = target.chars().collect();
    let (m, n) = (src.len(), tgt.len());
    if m == 0 { return n; }
    if n == 0 { return m; }

    // Put shorter string in column dim → smaller row allocation
    let (src, tgt, m, n) = if m < n { (tgt, src, n, m) } else { (src, tgt, m, n) };

    let mut prev: Vec<usize> = (0..=n).collect();
    let mut curr: Vec<usize> = vec![0; n + 1];

    for i in 1..=m {
        curr[0] = i;
        for j in 1..=n {
            let cost = usize::from(src[i - 1] != tgt[j - 1]);
            curr[j] = min(curr[j - 1] + 1, min(prev[j] + 1, prev[j - 1] + cost));
        }
        // Swap — O(1), just pointer exchange
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[n]
}
```
Two optimisations stack on top of compute_optimized:
ASCII strings skip the Vec<char> allocation entirely,
and cells shrink to the narrowest integer that can't overflow —
u8 for strings under 256 chars, u16 under 65 536,
otherwise u32. Used internally by similarity().
- ASCII inputs go through as_bytes() — no heap alloc for Vec<char>
- Cells never exceed max(m, n), so we pick u8/u16/u32 accordingly (up to 8× cache density)
- The u8 inner loop on short inputs runs at near-streaming memory throughput

```rust
pub fn compute_fast(source: &str, target: &str) -> usize {
    // ASCII: every char = 1 byte, skip Vec<char> allocation
    if source.is_ascii() && target.is_ascii() {
        return fast_bytes(source.as_bytes(), target.as_bytes());
    }
    let src: Vec<char> = source.chars().collect();
    let tgt: Vec<char> = target.chars().collect();
    fast_chars(&src, &tgt)
}

fn fast_bytes(src: &[u8], tgt: &[u8]) -> usize {
    let (m, n) = (src.len(), tgt.len());
    let (src, tgt) = if m < n { (tgt, src) } else { (src, tgt) };
    // Gap cost = 1 ⇒ no cell exceeds max(m, n). Pick the narrowest
    // integer type that can hold it — u8 fits 99% of fuzzy-search inputs.
    let max_dist = src.len();
    if max_dist <= u8::MAX as usize {
        fast_bytes_dp::<u8>(src, tgt)  // 1-byte cells: entire DP fits in L1
    } else if max_dist <= u16::MAX as usize {
        fast_bytes_dp::<u16>(src, tgt) // 2-byte cells
    } else {
        fast_bytes_dp::<u32>(src, tgt) // 4-byte cells (4 B strings…)
    }
}

// Monomorphised per cell type — three tight specialised loops.
// DpCell is a small internal trait (from_usize, to_usize, ONE, add, min).
fn fast_bytes_dp<T: DpCell>(src: &[u8], tgt: &[u8]) -> usize {
    let (m, n) = (src.len(), tgt.len());
    let mut prev: Vec<T> = (0..=n).map(T::from_usize).collect();
    let mut curr: Vec<T> = vec![T::from_usize(0); n + 1];
    for i in 1..=m {
        curr[0] = T::from_usize(i);
        let s = src[i - 1];
        for j in 1..=n {
            let cost = if s == tgt[j - 1] { T::from_usize(0) } else { T::ONE };
            curr[j] = curr[j - 1].add(T::ONE)
                .min(prev[j].add(T::ONE))
                .min(prev[j - 1].add(cost));
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[n].to_usize()
}
```
Gene Myers (1999) observed that the deltas between consecutive cells
in a DP column are each just −1, 0, or +1. That means an entire column
can be encoded in two bit-vectors Pv (positive deltas)
and Mv (negative deltas), packed into one 64-bit word
per vector. Advancing an entire column takes a constant handful of bitwise ops —
no inner loop over cells. The running score of the bottom cell is
kept as a plain integer, updated by peeking at the high bit of the
horizontal deltas.
- The whole DP column packs into one u64 — no per-cell loop
- Stack-allocated Peq table for ASCII; no heap
- Ties the u8 DP on ASCII short inputs (both are near the memory ceiling), but with fundamentally different asymptotic behaviour
- Falls back to compute_fast for patterns > 64 chars or Unicode

```rust
pub fn compute_myers(source: &str, target: &str) -> usize {
    if source.is_ascii() && target.is_ascii() {
        let (a, b) = (source.as_bytes(), target.as_bytes());
        if a.len().min(b.len()) <= 64 {
            return myers_bytes(a, b);
        }
    }
    compute_fast(source, target)
}

fn myers_bytes(a: &[u8], b: &[u8]) -> usize {
    // Shorter string becomes the pattern `p` (fits in one u64)
    let (p, t) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    let m = p.len();
    if m == 0 { return t.len(); }

    // Peq[c] = bitmask of positions where byte c occurs in p
    let mut peq = [0u64; 128]; // ASCII only — 128 slots are enough
    for (i, &c) in p.iter().enumerate() {
        peq[(c & 0x7F) as usize] |= 1u64 << i;
    }

    let mask = if m == 64 { u64::MAX } else { (1u64 << m) - 1 };
    let top = 1u64 << (m - 1);
    let mut pv = mask; // all deltas +1 initially (column = 0,1,…,m)
    let mut mv = 0u64;
    let mut score = m;

    for &c in t {
        let eq = peq[(c & 0x7F) as usize];
        let xv = eq | mv;
        // The carry trick: propagates match runs in one addition
        let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
        let mut ph = mv | !(xh | pv);
        let mut mh = pv & xh;
        if ph & top != 0 { score += 1; }
        if mh & top != 0 { score -= 1; }
        ph = (ph << 1) | 1;
        mh <<= 1;
        pv = (mh & mask) | !((xv | ph) & mask);
        pv &= mask;
        mv = ph & xv & mask;
    }
    score
}
```
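As a sanity check that the bit-parallel word dance really computes edit distance, myers_bytes can be compared against a plain two-row DP. The snippet below is self-contained: the reference `dp_distance` is written just for this test, and `myers_bytes` is repeated verbatim so it compiles standalone:

```rust
// Reference implementation written only for this comparison: the
// classic two-row Levenshtein DP over byte slices.
fn dp_distance(a: &[u8], b: &[u8]) -> usize {
    let (m, n) = (a.len(), b.len());
    let mut prev: Vec<usize> = (0..=n).collect();
    for i in 1..=m {
        let mut curr = vec![i; n + 1];
        for j in 1..=n {
            let cost = (a[i - 1] != b[j - 1]) as usize;
            curr[j] = (curr[j - 1] + 1).min(prev[j] + 1).min(prev[j - 1] + cost);
        }
        prev = curr;
    }
    prev[n]
}

// myers_bytes as shown above, repeated so this snippet is standalone.
fn myers_bytes(a: &[u8], b: &[u8]) -> usize {
    let (p, t) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    let m = p.len();
    if m == 0 { return t.len(); }
    let mut peq = [0u64; 128];
    for (i, &c) in p.iter().enumerate() {
        peq[(c & 0x7F) as usize] |= 1u64 << i;
    }
    let mask = if m == 64 { u64::MAX } else { (1u64 << m) - 1 };
    let top = 1u64 << (m - 1);
    let (mut pv, mut mv, mut score) = (mask, 0u64, m);
    for &c in t {
        let eq = peq[(c & 0x7F) as usize];
        let xv = eq | mv;
        let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
        let mut ph = mv | !(xh | pv);
        let mut mh = pv & xh;
        if ph & top != 0 { score += 1; }
        if mh & top != 0 { score -= 1; }
        ph = (ph << 1) | 1;
        mh <<= 1;
        pv = ((mh & mask) | !((xv | ph) & mask)) & mask;
        mv = ph & xv & mask;
    }
    score
}

fn main() {
    let pairs = [("kitten", "sitting", 3), ("cat", "cart", 1), ("", "hello", 5)];
    for (a, b, expected) in pairs {
        let fast = myers_bytes(a.as_bytes(), b.as_bytes());
        let slow = dp_distance(a.as_bytes(), b.as_bytes());
        assert_eq!(fast, slow, "mismatch on {a:?} vs {b:?}");
        assert_eq!(fast, expected);
    }
    println!("bit-parallel and DP results agree");
}
```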
All four implementations run head-to-head on five stress tiers, measured with Criterion.rs (40 samples, release build, x86-64 Linux). Reproduce with `cargo bench`.
| Tier | String lengths | compute | compute_optimized | compute_fast | compute_myers | Speedup (fast vs compute) |
|---|---|---|---|---|---|---|
| tiny | 6 vs 7 chars | 332 ns | 286 ns | 39 ns | 40 ns | 8.5× faster |
| short | 22 vs 24 chars | 2,478 ns | 1,342 ns | 105 ns | 106 ns | 23× faster |
| medium | 66 vs 68 chars | 14,003 ns | 9,497 ns | 10,085 ns | 10,011 ns | 1.4× faster |
| long | 182 vs 191 chars | 99,178 ns | 104,900 ns | 79,650 ns | 79,973 ns | 1.2× faster |
| stress | 270 vs 268 chars | 119,310 ns | 79,560 ns | 96,125 ns | 94,962 ns | 1.2× faster |
Tiny & short tiers: compute_fast's adaptive u8 DP cells fit the entire working set in a few cache lines, and LLVM auto-vectorises the inner loop — delivering 8–23× speedups over the baseline.
Stress tier: compute_optimized edges ahead because the full matrix in compute exceeds L2 cache, while the rolling two rows stay warm.
compute_myers ties compute_fast on ASCII inputs: the auto-vectorised u8 DP is already memory-bandwidth-bound at these sizes. Its value is algorithmic — O(n) 64-bit words of work per candidate instead of O(m·n) cells — and it is the foundation for future block-Myers and SIMD work. It falls back to compute_fast for patterns longer than 64 chars.
Full HTML reports are generated at target/criterion/levenshtein/report/index.html after running cargo bench.
Clone the repo to read the annotated source, or add it as a library dependency.
```toml
# Add to your Cargo.toml
[dependencies]
fuzzly = { git = "https://github.com/Fanaperana/fuzzy-search-rs" }
```
```rust
use fuzzly::{FuzzySearcher, LevenshteinDistance};

fn main() {
    // Raw edit distance
    let d = LevenshteinDistance::compute("kitten", "sitting");
    println!("distance: {d}"); // 3

    // Similarity score (0.0 – 1.0)
    let sim = LevenshteinDistance::similarity("apple", "aple");
    println!("{:.0}% similar", sim * 100.0); // 80%

    // Fuzzy search with threshold
    let searcher = FuzzySearcher::new(0.6);
    let candidates = vec!["apple", "application", "banana"];
    let results = searcher.search("aple", &candidates);
    for r in results {
        println!("{}: {:.0}%", r.text, r.score * 100.0);
    }
    // apple: 80%
    // application: 67%
}
```
```sh
git clone https://github.com/Fanaperana/fuzzy-search-rs
cd fuzzy-search-rs
cargo run          # interactive demo
cargo test         # run all tests
cargo bench        # run benchmarks
cargo doc --open   # open annotated docs
```
```rust
use fuzzly::{FuzzySearcher, LevenshteinWithOperations};

// See exactly what edits are needed
let result = LevenshteinWithOperations::compute("cat", "cart");
println!("{}", result); // insert 'r' — distance: 1 — similarity: 75%

// Builder API for search configuration
let searcher = FuzzySearcher::new(0.5)
    .case_insensitive(true)
    .max_results(5);
let words = ["apple", "apply", "ample"]; // any slice of candidate strings
let best = searcher.find_best("APLE", &words);
```