LLMs Work Best When the User Defines Acceptance Criteria First: A Practical Case Study

The developer and quantitative practitioner behind KatanaQuant published a detailed analysis showing that an LLM-generated Rust reimplementation of SQLite was 20,171 times slower than the original on primary key lookups, not because the code was incorrect but because the query planner omitted a four-line `is_rowid_ref()` check that maps named INTEGER PRIMARY KEY columns to B-tree seeks. The post, which drew 95 HN points and 74 comments, concludes that LLMs optimize for plausibility over correctness: the code compiles, passes tests, and looks architecturally sound, yet contains semantic bugs that surface only under benchmarking or production load. The author recommends that developers define specific, measurable acceptance criteria, including performance benchmarks, before generating any code with an LLM.
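
The missing check is small enough to reconstruct. Below is a minimal sketch of the logic the post describes, under stated assumptions: `is_rowid_ref()` is the function named in the post, but the `ColumnDef` struct, its fields, and the helper `is_rowid_alias()` are hypothetical illustrations, not the project's actual API. The key SQLite rule is that a column declared exactly as `INTEGER PRIMARY KEY` is an alias for the rowid, so the planner must treat a reference to it as a rowid reference.

```rust
/// Built-in rowid aliases that SQLite always recognizes.
fn is_rowid_alias(name: &str) -> bool {
    matches!(
        name.to_ascii_lowercase().as_str(),
        "rowid" | "_rowid_" | "oid"
    )
}

/// Hypothetical column metadata (illustrative, not the project's real type).
struct ColumnDef {
    name: String,
    /// Set when the column was declared exactly as INTEGER PRIMARY KEY,
    /// which makes it an alias for the rowid under SQLite's rules.
    is_integer_primary_key: bool,
}

/// The branch the generated planner reportedly missed: a name is a rowid
/// reference either via the built-in aliases *or* via a named INTEGER
/// PRIMARY KEY column. Without the second branch, `SELECT ... WHERE id = N`
/// falls back to a full table scan instead of a B-tree seek.
fn is_rowid_ref(name: &str, columns: &[ColumnDef]) -> bool {
    is_rowid_alias(name)
        || columns
            .iter()
            .any(|c| c.is_integer_primary_key && c.name.eq_ignore_ascii_case(name))
}

fn main() {
    let cols = vec![ColumnDef {
        name: "id".to_string(),
        is_integer_primary_key: true,
    }];
    assert!(is_rowid_ref("rowid", &cols)); // built-in alias
    assert!(is_rowid_ref("id", &cols)); // the case the generated planner missed
    assert!(!is_rowid_ref("email", &cols)); // ordinary column: scan or secondary index
    println!("planner check ok");
}
```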

Key Takeaways

  • SQLite Rust reimplementation (576,000 lines of LLM-generated Rust) achieves only ~0.005% of SQLite's throughput (1/20,171) on `SELECT WHERE id = N` because `is_rowid_ref()` checks only the built-in aliases 'rowid', '_rowid_', and 'oid', not named `INTEGER PRIMARY KEY` columns, so each lookup degrades from an O(log n) B-tree seek to an O(n) full scan (O(n²) rather than O(n log n) across a benchmark of n lookups)
  • METR's 2025 RCT with 16 experienced open-source developers found AI-assisted developers were 19% slower on average, while estimating afterward that AI had sped them up by 20%; GitClear's analysis of 211M changed lines found copy-pasted code exceeding refactored code for the first time in 2024
  • Practical fix: define acceptance criteria as specific, measurable benchmarks before prompting, e.g. 'primary key lookup must be O(log n) and within 2x of SQLite on 100k rows', so that the LLM's plausible-but-wrong output fails the criteria and triggers another iteration (a minimal scaling-check harness is sketched after this list)
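
To make that criterion executable, here is a minimal sketch of a scaling acceptance check, under stated assumptions: the engine under test is stood in by a sorted `Vec` with `binary_search`, and the 100k/1M sizes and 2x threshold are illustrative; in practice you would swap in real database lookups and also compare absolute numbers against SQLite itself. The idea is that an O(log n) lookup barely slows down when the data grows 10x, while a full-scan implementation slows down roughly 10x and fails at once.

```rust
use std::time::Instant;

/// Time one pass of `lookup` over every key, in nanoseconds.
fn batch_lookup_nanos(mut lookup: impl FnMut(i64), keys: &[i64]) -> u128 {
    let start = Instant::now();
    for &k in keys {
        lookup(k);
    }
    start.elapsed().as_nanos()
}

fn main() {
    // Stand-in "engine": binary search over a sorted Vec (O(log n) per
    // lookup). Replace these closures with calls into the database under
    // test; the sizes and threshold below are illustrative.
    let small: Vec<i64> = (0..100_000).collect();
    let large: Vec<i64> = (0..1_000_000).collect();
    let keys: Vec<i64> = (0..10_000).map(|i| (i * 7919) % 100_000).collect();

    let t_small = batch_lookup_nanos(|k| { small.binary_search(&k).ok(); }, &keys);
    let t_large = batch_lookup_nanos(|k| { large.binary_search(&k).ok(); }, &keys);

    // Acceptance criterion: 10x more rows must cost well under 2x in
    // lookup latency. An O(log n) engine passes (the log factor is ~1.2);
    // an O(n) full-scan engine comes in near 10x and fails.
    assert!(
        t_large < t_small.saturating_mul(2).max(1),
        "lookup latency not sub-linear: {t_small}ns -> {t_large}ns"
    );
    println!("ok: {t_small}ns -> {t_large}ns for 10x the rows");
}
```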

Original source: KatanaQuant Blog