
Benchmarks

70 tasks. 6 categories. CodeSift CLI vs Bash (grep/Read/find). Tested on a real 4,127-file TypeScript codebase.

Each task was given to two separate Claude agents in separate conversations: one equipped with CodeSift, one with standard Bash tools. Both were required to answer in an identical format.

A

Text Search

Find text patterns across the codebase — regex, imports, usage patterns

Tool: search_text (CodeSift wins)
Tokens: 48,930 vs 72,993 (-33%, 24,063 saved)
Tool calls: 11 vs 29 (10 tasks)
Time: 1m 10s vs 1m 36s (wall clock)
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
A1 Find all Prisma transactions in service files 1 / 2 20 matches 20 matches
A2 Find all files that import from @/lib/errors 1 / 2 97 files 88 files CS
A3 Find all TODO and FIXME comments in src/ 2 / 2 8 items 8 items
A4 Find all files that use the withAuth wrapper 1 / 2 103 files 97 files CS
A5 Find all process.env usage across the entire project 1 / 1 40 env vars 38 env vars CS
A6 Find all async functions matching *Risk using regex 1 / 2 35 functions 35 functions
A7 Find all places that throw AppError in the codebase 1 / 2 262 throw sites 185 throw sites CS
A8 Find all Redis usage (case-insensitive search) 1 / 2 3 files 3 files
A9 Find all exported HTTP route handlers (GET/POST/PATCH/DELETE) 1 / 2 147 handlers 147 handlers
A10 Find console.log statements in production (non-test) files 1 / 2 5 statements 1 statement CS
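For context on the tool-call gap, the Bash side of a task like A2 ("find all files that import from @/lib/errors") reduces to a recursive grep. The sketch below runs against a throwaway fixture; the file names are illustrative, not from the benchmark repo:

```shell
#!/bin/sh
# Build a tiny fixture: one file imports the target module, one does not.
workdir=$(mktemp -d)
mkdir -p "$workdir/src"
cat > "$workdir/src/errors-user.ts" <<'EOF'
import { AppError } from "@/lib/errors";
EOF
cat > "$workdir/src/logger-user.ts" <<'EOF'
import { log } from "@/lib/logger";
EOF

# -r: recurse, -l: print matching file names only, -F: treat pattern as a literal.
importers=$(grep -rlF '@/lib/errors' "$workdir/src")
echo "$importers"
```

The differing counts in A2, A4, and A7 suggest this is where a single index-backed call and a hand-written regex pass diverge: the pattern itself is easy, but covering every import style in one grep invocation is not.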
B

Symbol Search

Find function definitions, interfaces, types, and components by name

Tool: search_symbols (CodeSift wins)
Tokens: 49,609 vs 60,282 (-18%, 10,673 saved)
Tool calls: 10 vs 20 (10 tasks)
Time: 1m 3s vs 2m (wall clock)
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
B1 Find the definition of the createRisk function 1 / 2 file:line + signature file:line (no signature) CS
B2 Find the DocumentDetail interface definition 1 / 2 2 definitions found 2 definitions found
B3 Find all React hooks (use*) in the components directory 1 / 2 10 hooks 10 hooks
B4 List all functions exported by risk.service.ts 1 / 2 4 functions + signatures 4 functions CS
B5 Find the AuditAction type/enum definition 1 / 2 definition + related types definition only CS
B6 Find all functions whose name starts with "create" 1 / 2 100 results ~80 results CS
B7 Find the RiskSummary interface definition 1 / 2 2 definitions 2 definitions
B8 Find all Zod validation schemas in the validators directory 1 / 2 100 schemas ~60 schemas CS
B9 Find the RiskPanel React component definition 1 / 2 found (.tsx parsed) found
B10 Find the withWorkspace higher-order function and its body 1 / 2 definition + body definition + body
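A plausible reason the Bash counts in B6 and B8 come in low is that regex-only symbol search keys on one declaration form at a time. A naive pattern for "functions starting with create" catches function declarations but silently misses arrow functions bound with const. The fixture below is illustrative:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cat > "$workdir/svc.ts" <<'EOF'
export function createRisk() {}
export const createUser = async () => {};
function updateRisk() {}
EOF

# Catches `function createX(...)` declarations only; the const-bound arrow
# function `createUser` is not matched, so the count comes in low.
matches=$(grep -rnE 'function +create[A-Za-z]*' "$workdir")
echo "$matches"
```

An AST-based symbol search sees both declaration forms as functions, which would account for the 100 vs ~80 gap in B6.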
C

File Structure

Navigate directory trees, file outlines, and repository structure

Tools: get_file_tree / get_file_outline (CodeSift wins)
Tokens: 36,580 vs 45,489 (-20%, 8,909 saved)
Tool calls: 10 vs 10 (10 tasks)
Time: 21s vs 46s (wall clock)
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
C1 List the contents of the src/lib directory 1 / 1 files + symbol counts files only CS
C2 Show the outline of risk.service.ts (exports, functions, types) 1 / 1 structured AST outline grep-based approximation CS
C3 Find all test files (*.test.ts) in the project 1 / 1 files + symbol metadata file paths only CS
C4 Show the directory tree at depth 2 1 / 1 compact flat list tree output
C5 Show the structure of the components directory 1 / 1 symbol-enriched file list CS
C6 Find files with more than 20 symbols (complex files) 1 / 1 22x less output (compact + min_symbols) full listing CS
C7 Find all route.ts files across the project 1 / 1 8x less output (name_pattern) find + list CS
C8 Overview of all service files in src/lib 1 / 1 grouped by file, symbol counts file list CS
C9 Generate a compact overview of the entire repository 1 / 1 structured compact output tree + wc CS
C10 List the API routes directory with handler functions 1 / 1 routes + handlers listed file paths only CS
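The "file paths only" entries above reflect what the Bash toolchain natively returns: find yields names and nothing else, so any per-file symbol information costs additional reads. A minimal sketch of the C3-style baseline, on an illustrative fixture:

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir -p "$workdir/src/lib"
touch "$workdir/src/lib/risk.service.ts" "$workdir/src/lib/risk.service.test.ts"

# find -name returns matching paths only; there is no per-file symbol
# metadata without opening each file afterwards.
tests=$(find "$workdir" -name '*.test.ts')
echo "$tests"
```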
D

Code Retrieval

Read specific function bodies, type definitions, and code blocks

Tools: get_symbol / get_symbols (CodeSift wins)
Tokens: 57,703 vs 60,482 (-5%, 2,779 saved)
Tool calls: 32 vs 29 (10 tasks)
Time: 5m vs 1m 55s (wall clock)
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
D1 Read the createRisk function body 3 / 2 exact symbol boundaries grep + line range CS
D2 Read the RiskSummary interface definition 2 / 2 full interface full interface
D3 Read 3 related functions in risk.service.ts 3 / 3 batch get_symbols 3 separate reads
D4 Read the AppError class definition 2 / 2 class + methods class + methods
D5 Read a Prisma enum definition 4 / 3 found after fallback direct grep Bash
D6 Read the withAuth HOF and its return type 3 / 3 function + types function + types
D7 Read multiple related type definitions 4 / 4 batch retrieval sequential reads
D8 Read a test case helper function 4 / 3 found (IDs were undefined, fixed) direct read Bash
D9 Read the risk analysis pipeline entry point 3 / 3 exact function file + line range CS
D10 Read a React component with its prop types 4 / 4 component + Props type full file read CS
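The Bash baseline in this category is a two-step pattern: grep -n to find where a symbol starts, then sed to print a line range. Plain text tools have no notion of where the symbol ends, so the range is a guess. A sketch on a throwaway fixture (the 3-line window is deliberately arbitrary):

```shell
#!/bin/sh
workdir=$(mktemp -d)
cat > "$workdir/risk.service.ts" <<'EOF'
export function createRisk() {
  return { ok: true };
}
export function deleteRisk() {}
EOF

# Step 1: locate the declaration's line number.
start=$(grep -n 'function createRisk' "$workdir/risk.service.ts" | cut -d: -f1)
# Step 2: print a fixed-size window. Unlike symbol-aware retrieval there is
# no exact end boundary; too small a window truncates, too large wastes tokens.
body=$(sed -n "${start},$((start + 2))p" "$workdir/risk.service.ts")
echo "$body"
```

That boundary guessing cuts both ways, which is consistent with D being the closest category: the two-step grep + sed pattern is cheap when the guess lands (D5, D8 went to Bash), while exact boundaries pay off on longer symbols.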
E

Relationships

Find references, trace call chains, understand code connections

Tools: find_references / trace_call_chain (CodeSift wins)
Tokens: 52,312 vs 60,810 (-14%, 8,498 saved)
Tool calls: 10 vs 10 (10 tasks)
Time: 1m 19s vs 1m 28s (wall clock)
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
E1 Find all callers of the createRisk function 1 / 1 all callers found all callers found
E2 What functions does analyzeDocument call? 1 / 1 structured call tree flat grep list CS
E3 Trace createRisk call chain 2 levels deep 1 / 1 transitive tree in 1 call manual 2-step trace CS
E4 Find all references to the RiskSummary type 1 / 1 imports + types + usages grep matches CS
E5 Find every file that references withAuth 1 / 1 all usage sites all usage sites
E6 Trace acceptRisk call chain 2 levels deep 1 / 1 structured tree manual trace CS
E7 Find all call sites of getRiskById 1 / 1 all call sites all call sites
E8 Find all files that import or render RiskPanel 1 / 1 import + render sites import + render sites
E9 Full call chain of createRisk (max depth) 1 / 1 deepest trace OK partial (manual) CS
E10 Find all usages of the RiskItem component 1 / 1 all usages all usages
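The "flat grep list" vs "structured call tree" distinction in E2/E3 comes down to this: grep reports every textual occurrence of a name (the definition, imports, and call sites) as one undifferentiated list, and tracing the chain one level deeper means re-running it for each discovered caller. Illustrative fixture:

```shell
#!/bin/sh
workdir=$(mktemp -d)
cat > "$workdir/risk.service.ts" <<'EOF'
export function createRisk() {}
EOF
cat > "$workdir/api.ts" <<'EOF'
import { createRisk } from "./risk.service";
const risk = createRisk();
EOF

# One flat list: the definition, the import, and the call site all look the
# same to grep. Building a 2-level call chain (task E3) means repeating this
# search per caller found in the first pass.
refs=$(grep -rn 'createRisk' "$workdir")
echo "$refs"
```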
G

Semantic Search

Answer conceptual questions about the codebase using embeddings

Tool: codebase_retrieval (semantic) (CodeSift wins)
Quality: 7.8/10 vs 6.5/10 (+20% better answers)
Tasks: 10 conceptual questions
Metric: human-rated, 1-10 scale, on answer completeness and accuracy
ID Task Calls (CodeSift / Bash) CodeSift result Bash result Win
G1 How does the permission and auth system work? 1 / 3 9/10 — auth middleware + decorators + guards 7/10 — partial — missed decorators CS
G2 What caching strategies are used in this project? 1 / 2 8/10 — Redis + Next.js cache + AI prompt cache 7/10 — Redis + Next.js cache CS
G3 How are errors handled across the application? 1 / 2 9/10 — AppError hierarchy + handlers + middleware 4/10 — grep for "catch" only CS
G4 How is multi-tenancy implemented? 1 / 2 7/10 — org-based isolation patterns 6/10 — partial org references CS
G5 How does the analysis pipeline work end-to-end? 1 / 3 10/10 — full pipeline: ingestion → analysis → scoring 6/10 — partial — missed scoring step CS
G6 What API security measures are in place? 1 / 2 7/10 — auth guards + rate limiting + CORS 5/10 — auth guards only CS
G7 How is state managed in the React frontend? 1 / 2 7/10 — context + hooks patterns 8/10 — useState/useContext grep Bash
G8 What testing patterns and frameworks are used? 1 / 2 6/10 — Vitest + testing-library (noise from test files) 7/10 — Vitest + testing-library (clean) Bash
G9 How does the Qdrant vector database integration work? 1 / 2 9/10 — init + indexing + query flow 7/10 — init + query (missed indexing) CS
G10 How are database transactions handled in services? 1 / 2 6/10 — $transaction patterns (some noise) 8/10 — $transaction patterns (clean) Bash

Benchmarks Planned

19 tools awaiting benchmarks. 10 admin/utility tools have no comparison target.

LSP Bridge

go_to_definition planned New tool — LSP bridge added after benchmark round
get_type_info planned New tool — LSP bridge added after benchmark round
rename_symbol planned New tool — LSP bridge added after benchmark round

Context

get_context_bundle planned Combines multiple tools — needs composite benchmark
assemble_context planned L0-L3 compression needs token efficiency benchmark
detect_communities planned No grep equivalent — needs quality-based evaluation
get_knowledge_map planned No grep equivalent — needs quality-based evaluation

Analysis

find_dead_code planned No grep equivalent — needs precision/recall evaluation
analyze_complexity planned No grep equivalent — needs accuracy evaluation
find_clones planned No grep equivalent — needs precision evaluation
analyze_hotspots planned Requires git history — needs accuracy evaluation
search_patterns planned Anti-pattern detection needs false-positive benchmark
impact_analysis planned Needs blast radius accuracy evaluation

Cross-Repo

cross_repo_search planned Requires multi-repo setup for benchmark
cross_repo_refs planned Requires multi-repo setup for benchmark

Search

find_and_show planned Compound of search_symbols + get_symbol — needs benchmark

Graph

trace_route planned HTTP route tracing — needs accuracy evaluation

Diff

changed_symbols planned Git-based — needs accuracy evaluation
diff_outline planned Git-based — needs accuracy evaluation

Admin

suggest_queries n/a Discovery tool — no performance comparison possible
generate_report n/a Output tool — no comparison target
generate_claude_md n/a Output tool — no comparison target
usage_stats n/a Reporting tool — no comparison target
index_folder n/a Setup tool — no comparison target
index_repo n/a Setup tool — no comparison target
index_file n/a Setup tool — no comparison target
invalidate_cache n/a Admin tool — no comparison target
list_repos n/a Admin tool — no comparison target
list_patterns n/a Admin tool — no comparison target

Methodology

Test Setup

Codebase: promptvault — 4,127 files, 19,707 symbols
Date: 2026-03-14
Agents: Claude with identical prompts, separate conversations
Tasks: 70 tasks across 6 categories

Metrics

Token efficiency: total tokens consumed to complete each task (lower is better for cost and speed)
Tool calls: number of tool invocations required (fewer means faster iteration)
Quality (semantic): human-rated on a 1-10 scale for answer completeness and accuracy
Reproducibility: all benchmark scripts are available in the CodeSift repository
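The savings percentages in the category cards are plain arithmetic over the token totals, rounded to the nearest whole percent. Using category A's numbers:

```shell
#!/bin/sh
codesift_tokens=48930
bash_tokens=72993
saved=$((bash_tokens - codesift_tokens))
# Round to the nearest percent: integer shell division would truncate
# 32.96 down to 32, so delegate the division to awk's printf.
pct=$(awk -v s="$saved" -v b="$bash_tokens" 'BEGIN { printf "%.0f", 100 * s / b }')
echo "${saved} tokens saved (${pct}%)"
```

This prints "24063 tokens saved (33%)", matching the category A card above.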