Benchmarked
Benchmarks
70 tasks. 6 categories. CodeSift CLI vs Bash (grep/Read/find).
Tested on a real 4,127-file TypeScript codebase.
The same tasks were given to separate Claude agents: one with CodeSift, one with standard Bash tools. Both agents had to answer in an identical format.
A. Text Search
Find text patterns across the codebase: regex, imports, usage patterns.
Tool: search_text
Result: CodeSift wins on tokens, with 33% fewer (24,063 saved).

| Metric | CodeSift | Bash |
|---|---|---|
| Tokens | 48,930 | 72,993 |
| Tool calls (10 tasks) | 11 | 29 |
| Wall-clock time | 1m 10s | 1m 36s |

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| A1 | Find all Prisma transactions in service files | 1 / 2 | 20 matches | 20 matches | — |
| A2 | Find all files that import from @/lib/errors | 1 / 2 | 97 files | 88 files | CS |
| A3 | Find all TODO and FIXME comments in src/ | 2 / 2 | 8 items | 8 items | — |
| A4 | Find all files that use the withAuth wrapper | 1 / 2 | 103 files | 97 files | CS |
| A5 | Find all process.env usage across the entire project | 1 / 1 | 40 env vars | 38 env vars | CS |
| A6 | Find all async functions matching *Risk using regex | 1 / 2 | 35 functions | 35 functions | — |
| A7 | Find all places that throw AppError in the codebase | 1 / 2 | 262 throw sites | 185 throw sites | CS |
| A8 | Find all Redis usage (case-insensitive search) | 1 / 2 | 3 files | 3 files | — |
| A9 | Find all exported HTTP route handlers (GET/POST/PATCH/DELETE) | 1 / 2 | 147 handlers | 147 handlers | — |
| A10 | Find console.log statements in production (non-test) files | 1 / 2 | 5 statements | 1 statement | CS |
B. Symbol Search
Find function definitions, interfaces, types, and components by name.
Tool: search_symbols
Result: CodeSift wins on tokens, with 18% fewer (10,673 saved).

| Metric | CodeSift | Bash |
|---|---|---|
| Tokens | 49,609 | 60,282 |
| Tool calls (10 tasks) | 10 | 20 |
| Wall-clock time | 1m 3s | 2m |

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| B1 | Find the definition of the createRisk function | 1 / 2 | file:line + signature | file:line (no signature) | CS |
| B2 | Find the DocumentDetail interface definition | 1 / 2 | 2 definitions found | 2 definitions found | — |
| B3 | Find all React hooks (use*) in the components directory | 1 / 2 | 10 hooks | 10 hooks | — |
| B4 | List all functions exported by risk.service.ts | 1 / 2 | 4 functions + signatures | 4 functions | CS |
| B5 | Find the AuditAction type/enum definition | 1 / 2 | definition + related types | definition only | CS |
| B6 | Find all functions whose name starts with "create" | 1 / 2 | 100 results | ~80 results | CS |
| B7 | Find the RiskSummary interface definition | 1 / 2 | 2 definitions | 2 definitions | — |
| B8 | Find all Zod validation schemas in the validators directory | 1 / 2 | 100 schemas | ~60 schemas | CS |
| B9 | Find the RiskPanel React component definition | 1 / 2 | found (.tsx parsed) | found | — |
| B10 | Find the withWorkspace higher-order function and its body | 1 / 2 | definition + body | definition + body | — |
C. File Structure
Navigate directory trees, file outlines, and repository structure.
Tools: get_file_tree, get_file_outline
Result: CodeSift wins on tokens, with 20% fewer (8,909 saved).

| Metric | CodeSift | Bash |
|---|---|---|
| Tokens | 36,580 | 45,489 |
| Tool calls (10 tasks) | 10 | 10 |
| Wall-clock time | 21s | 46s |

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| C1 | List the contents of the src/lib directory | 1 / 1 | files + symbol counts | files only | CS |
| C2 | Show the outline of risk.service.ts (exports, functions, types) | 1 / 1 | structured AST outline | grep-based approximation | CS |
| C3 | Find all test files (*.test.ts) in the project | 1 / 1 | files + symbol metadata | file paths only | CS |
| C4 | Show the directory tree at depth 2 | 1 / 1 | compact flat list | tree output | — |
| C5 | Show the structure of the components directory | 1 / 1 | symbol-enriched | file list | CS |
| C6 | Find files with more than 20 symbols (complex files) | 1 / 1 | 22x less output (compact + min_symbols) | full listing | CS |
| C7 | Find all route.ts files across the project | 1 / 1 | 8x less output (name_pattern) | find + list | CS |
| C8 | Overview of all service files in src/lib | 1 / 1 | grouped by file, symbol counts | file list | CS |
| C9 | Generate a compact overview of the entire repository | 1 / 1 | structured compact output | tree + wc | CS |
| C10 | List the API routes directory with handler functions | 1 / 1 | routes + handlers listed | file paths only | CS |
D. Code Retrieval
Read specific function bodies, type definitions, and code blocks.
Tools: get_symbol, get_symbols
Result: CodeSift wins on tokens, with 5% fewer (2,779 saved).

| Metric | CodeSift | Bash |
|---|---|---|
| Tokens | 57,703 | 60,482 |
| Tool calls (10 tasks) | 32 | 29 |
| Wall-clock time | 5m | 1m 55s |

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| D1 | Read the createRisk function body | 3 / 2 | exact symbol boundaries | grep + line range | CS |
| D2 | Read the RiskSummary interface definition | 2 / 2 | full interface | full interface | — |
| D3 | Read 3 related functions in risk.service.ts | 3 / 3 | batch get_symbols | 3 separate reads | — |
| D4 | Read the AppError class definition | 2 / 2 | class + methods | class + methods | — |
| D5 | Read a Prisma enum definition | 4 / 3 | found after fallback | direct grep | Bash |
| D6 | Read the withAuth HOF and its return type | 3 / 3 | function + types | function + types | — |
| D7 | Read multiple related type definitions | 4 / 4 | batch retrieval | sequential reads | — |
| D8 | Read a test case helper function | 4 / 3 | found (IDs were undefined, fixed) | direct read | Bash |
| D9 | Read the risk analysis pipeline entry point | 3 / 3 | exact function | file + line range | CS |
| D10 | Read a React component with its prop types | 4 / 4 | component + Props type | full file read | CS |
E. Relationships
Find references, trace call chains, and understand code connections.
Tools: find_references, trace_call_chain
Result: CodeSift wins on tokens, with 14% fewer (8,498 saved).

| Metric | CodeSift | Bash |
|---|---|---|
| Tokens | 52,312 | 60,810 |
| Tool calls (10 tasks) | 10 | 10 |
| Wall-clock time | 1m 19s | 1m 28s |

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| E1 | Find all callers of the createRisk function | 1 / 1 | all callers found | all callers found | — |
| E2 | What functions does analyzeDocument call? | 1 / 1 | structured call tree | flat grep list | CS |
| E3 | Trace createRisk call chain 2 levels deep | 1 / 1 | transitive tree in 1 call | manual 2-step trace | CS |
| E4 | Find all references to the RiskSummary type | 1 / 1 | imports + types + usages | grep matches | CS |
| E5 | Find every file that references withAuth | 1 / 1 | all usage sites | all usage sites | — |
| E6 | Trace acceptRisk call chain 2 levels deep | 1 / 1 | structured tree | manual trace | CS |
| E7 | Find all call sites of getRiskById | 1 / 1 | all call sites | all call sites | — |
| E8 | Find all files that import or render RiskPanel | 1 / 1 | import + render sites | import + render sites | — |
| E9 | Full call chain of createRisk (max depth) | 1 / 1 | deepest trace OK | partial (manual) | CS |
| E10 | Find all usages of the RiskItem component | 1 / 1 | all usages | all usages | — |
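Tasks E3, E6, and E9 ask for a call chain traced to a fixed depth. As a rough illustration of what such a trace computes, here is a minimal sketch of a depth-limited breadth-first walk over a call graph. It is not CodeSift's implementation: the graph is hypothetical, every function name except `createRisk` is invented for the example, and `trace_call_chain` here is just a local helper that borrows the tool's name.

```python
from collections import deque

# Hypothetical call graph (caller -> callees). Only createRisk appears in the
# benchmark tasks; every other name is invented for this illustration.
CALL_GRAPH = {
    "createRisk": ["validateInput", "saveRisk"],
    "validateInput": ["parseSchema"],
    "saveRisk": ["writeAuditLog"],
    "writeAuditLog": ["formatEntry"],
}


def trace_call_chain(graph, root, max_depth):
    """Return (depth, caller, callee) edges reachable from root within max_depth levels."""
    edges = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand callees beyond the requested depth
        for callee in graph.get(name, []):
            edges.append((depth + 1, name, callee))
            if callee not in seen:  # guard against cycles (recursion)
                seen.add(callee)
                queue.append((callee, depth + 1))
    return edges


# Print the chain two levels deep, indented by depth.
for depth, caller, callee in trace_call_chain(CALL_GRAPH, "createRisk", 2):
    print("  " * depth + f"{caller} -> {callee}")
```

With grep, each level of this walk is a separate search for the callees found at the previous level, which is what the "manual 2-step trace" entries in the Bash column refer to.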
G. Semantic Search
Answer conceptual questions about the codebase using embeddings.
Tool: codebase_retrieval (semantic)
Result: CodeSift wins on answer quality.

| Metric | CodeSift | Bash |
|---|---|---|
| Mean quality (human-rated, 1-10) | 7.8 | 6.5 |

Quality difference: +20% better answers across 10 conceptual questions.

All 10 tasks:

| ID | Task | Calls (CodeSift / Bash) | CodeSift | Bash | Win |
|---|---|---|---|---|---|
| G1 | How does the permission and auth system work? | 1 / 3 | 9/10 — auth middleware + decorators + guards | 7/10 — partial — missed decorators | CS |
| G2 | What caching strategies are used in this project? | 1 / 2 | 8/10 — Redis + Next.js cache + AI prompt cache | 7/10 — Redis + Next.js cache | CS |
| G3 | How are errors handled across the application? | 1 / 2 | 9/10 — AppError hierarchy + handlers + middleware | 4/10 — grep for "catch" only | CS |
| G4 | How is multi-tenancy implemented? | 1 / 2 | 7/10 — org-based isolation patterns | 6/10 — partial org references | CS |
| G5 | How does the analysis pipeline work end-to-end? | 1 / 3 | 10/10 — full pipeline: ingestion → analysis → scoring | 6/10 — partial — missed scoring step | CS |
| G6 | What API security measures are in place? | 1 / 2 | 7/10 — auth guards + rate limiting + CORS | 5/10 — auth guards only | CS |
| G7 | How is state managed in the React frontend? | 1 / 2 | 7/10 — context + hooks patterns | 8/10 — useState/useContext grep | Bash |
| G8 | What testing patterns and frameworks are used? | 1 / 2 | 6/10 — Vitest + testing-library (noise from test files) | 7/10 — Vitest + testing-library (clean) | Bash |
| G9 | How does the Qdrant vector database integration work? | 1 / 2 | 9/10 — init + indexing + query flow | 7/10 — init + query (missed indexing) | CS |
| G10 | How are database transactions handled in services? | 1 / 2 | 6/10 — $transaction patterns (some noise) | 8/10 — $transaction patterns (clean) | Bash |
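The headline quality figures can be checked against the per-task ratings in the table. The sketch below copies the ten ratings for each agent and recomputes the means, the relative gain, and the number of tasks each side won. The variable names are illustrative and not part of any benchmark script.

```python
# Per-task ratings for G1..G10, copied from the table above.
codesift_scores = [9, 8, 9, 7, 10, 7, 7, 6, 9, 6]
bash_scores = [7, 7, 4, 6, 6, 5, 8, 7, 7, 8]


def mean(scores):
    """Arithmetic mean of a list of ratings."""
    return sum(scores) / len(scores)


cs_mean = mean(codesift_scores)
bash_mean = mean(bash_scores)

# Relative improvement of CodeSift over the Bash baseline.
relative_gain = (cs_mean - bash_mean) / bash_mean

# Count tasks where one agent was rated strictly higher than the other.
cs_wins = sum(c > b for c, b in zip(codesift_scores, bash_scores))
bash_wins = sum(b > c for c, b in zip(codesift_scores, bash_scores))

print(f"CodeSift mean: {cs_mean:.1f}, Bash mean: {bash_mean:.1f}")
print(f"Relative gain: {relative_gain:.0%}")
print(f"Task wins: CodeSift {cs_wins}, Bash {bash_wins}")
```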
Benchmarks Planned
19 tools awaiting benchmarks. 10 admin/utility tools have no comparison target.
LSP Bridge

| Tool | Status | Notes |
|---|---|---|
| go_to_definition | planned | New tool — LSP bridge added after benchmark round |
| get_type_info | planned | New tool — LSP bridge added after benchmark round |
| rename_symbol | planned | New tool — LSP bridge added after benchmark round |

Context

| Tool | Status | Notes |
|---|---|---|
| get_context_bundle | planned | Combines multiple tools — needs composite benchmark |
| assemble_context | planned | L0-L3 compression needs token efficiency benchmark |
| detect_communities | planned | No grep equivalent — needs quality-based evaluation |
| get_knowledge_map | planned | No grep equivalent — needs quality-based evaluation |

Analysis

| Tool | Status | Notes |
|---|---|---|
| find_dead_code | planned | No grep equivalent — needs precision/recall evaluation |
| analyze_complexity | planned | No grep equivalent — needs accuracy evaluation |
| find_clones | planned | No grep equivalent — needs precision evaluation |
| analyze_hotspots | planned | Requires git history — needs accuracy evaluation |
| search_patterns | planned | Anti-pattern detection needs false-positive benchmark |
| impact_analysis | planned | Needs blast radius accuracy evaluation |

Cross-Repo

| Tool | Status | Notes |
|---|---|---|
| cross_repo_search | planned | Requires multi-repo setup for benchmark |
| cross_repo_refs | planned | Requires multi-repo setup for benchmark |

Search

| Tool | Status | Notes |
|---|---|---|
| find_and_show | planned | Compound of search_symbols + get_symbol — needs benchmark |

Graph

| Tool | Status | Notes |
|---|---|---|
| trace_route | planned | HTTP route tracing — needs accuracy evaluation |

Diff

| Tool | Status | Notes |
|---|---|---|
| changed_symbols | planned | Git-based — needs accuracy evaluation |
| diff_outline | planned | Git-based — needs accuracy evaluation |

Admin

| Tool | Status | Notes |
|---|---|---|
| suggest_queries | n/a | Discovery tool — no performance comparison possible |
| generate_report | n/a | Output tool — no comparison target |
| generate_claude_md | n/a | Output tool — no comparison target |
| usage_stats | n/a | Reporting tool — no comparison target |
| index_folder | n/a | Setup tool — no comparison target |
| index_repo | n/a | Setup tool — no comparison target |
| index_file | n/a | Setup tool — no comparison target |
| invalidate_cache | n/a | Admin tool — no comparison target |
| list_repos | n/a | Admin tool — no comparison target |
| list_patterns | n/a | Admin tool — no comparison target |
Methodology
Test Setup
- Codebase: promptvault (4,127 files, 19,707 symbols)
- Date: 2026-03-14
- Agents: Claude with identical prompts, separate conversations
- Tasks: 70 tasks across 6 categories
Metrics
- Token efficiency: total tokens consumed to complete each task (lower is better for cost and speed)
- Tool calls: number of tool invocations required (fewer means faster iteration)
- Quality (semantic): human-rated on a 1-10 scale for answer completeness and accuracy
- Reproducibility: all benchmark scripts are available in the CodeSift repository
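
The token-efficiency percentages in the category summaries follow from the raw token counts as (Bash tokens minus CodeSift tokens) divided by Bash tokens. The sketch below recomputes them from the figures reported for categories A through E. The `TOKENS` dictionary and the `savings` helper are illustrative names, not part of the benchmark scripts.

```python
# (CodeSift tokens, Bash tokens) per category, copied from the summaries above.
TOKENS = {
    "A Text Search": (48_930, 72_993),
    "B Symbol Search": (49_609, 60_282),
    "C File Structure": (36_580, 45_489),
    "D Code Retrieval": (57_703, 60_482),
    "E Relationships": (52_312, 60_810),
}


def savings(codesift_tokens, bash_tokens):
    """Return (tokens saved, percent saved relative to the Bash baseline)."""
    saved = bash_tokens - codesift_tokens
    return saved, round(100 * saved / bash_tokens)


for name, (cs, bash) in TOKENS.items():
    saved, pct = savings(cs, bash)
    print(f"{name}: {saved:,} tokens saved ({pct}% fewer)")
```

The percentage is taken relative to the Bash baseline, so it reads as "CodeSift used N% fewer tokens than Bash" rather than "Bash used N% more".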