Chunking Strategies for AI Code Review on Large Repos

Aziz Q. — Thu, 21 May 2026 20:00:06 +0000

i spent the last few days building an open-source AI code reviewer called Basira. one of the hardest design problems was figuring out how to feed entire github repos to an LLM without blowing past the context window or burning the budget. here's what i landed on.

The Problem

a medium repo is 50-200 files, 5-50k lines. claude sonnet has a 200k token context window, but stuffing the whole repo in is wasteful: most files don't need review at the same time, and the model loses focus on a wall of unrelated code.

Naive Approaches That Don't Work

One file per call: explodes API costs and loses cross-file context. an issue in auth.py might depend on a model defined in users.py.
Whole repo in one call: hits the context limit on anything much past a couple hundred files (a 96-file repo already runs to roughly 93k tokens), and quality drops as the model can't focus on what matters.
Random chunks: breaks logical units. you get half a class or half a function reviewed.

Three-Pass Chunking

Pass 1: Inventory

walk the repo, build a file tree with sizes and languages. skip binaries, lockfiles, generated code, vendored deps. apply user-configured ignore patterns. there are no LLM calls in this pass, so it's cheap.

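from pathlib import Path

# FileEntry, should_skip, detect_language and estimate_tokens are defined elsewhere in the project
def inventory_repo(repo_path: Path) -> list[FileEntry]:
    """Pass 1: collect size, language and a token estimate for every reviewable file."""
    entries = []
    for path in repo_path.rglob("*"):
        # rglob("*") also yields directories, so keep regular files only
        if not path.is_file() or should_skip(path):
            continue
        entries.append(FileEntry(
            path=path,
            size=path.stat().st_size,
            language=detect_language(path),
            tokens=estimate_tokens(path),
        ))
    return entries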
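
for illustration, here is a minimal sketch of what the skip check and the token estimate could look like. the directory names, lockfile names, suffixes, and the 4-characters-per-token ratio are all assumptions made for this example, not the rules Basira actually ships, and `estimate_tokens_from_text` is a hypothetical helper that works on text rather than a path.

```python
from pathlib import Path

# Hypothetical skip rules for illustration; the real lists may differ.
SKIP_DIRS = {".git", "node_modules", "vendor", "dist", "build", "__pycache__"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock", "go.sum", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".exe", ".woff2"}


def should_skip(path: Path) -> bool:
    """Return True for vendored/generated directories, lockfiles, and binaries."""
    if any(part in SKIP_DIRS for part in path.parts):
        return True
    if path.name in SKIP_NAMES:
        return True
    return path.suffix.lower() in BINARY_SUFFIXES


def estimate_tokens_from_text(text: str) -> int:
    """Rough estimate at ~4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)
```

a character-count heuristic keeps the pass free of network calls; swapping in a real tokenizer would be more accurate but slower.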

Pass 2: Grouping

bin files into chunks of ~8k tokens each, but keep related files together. files in the same directory tend to depend on each other, so they go in the same chunk. tests follow their source file when possible.
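
a rough sketch of the grouping idea, assuming a greedy packer: order files by directory, slot each test right after the source file it is named for, then cut a new chunk whenever the next file would push past the budget. the `test_` / `_test` naming rule and every function name here are assumptions for the example, not the code in scan_engine.py.

```python
from collections import defaultdict
from pathlib import PurePosixPath

TOKEN_BUDGET = 8_000  # per-chunk target from the post


def is_test(path: str) -> bool:
    stem = PurePosixPath(path).stem
    return stem.startswith("test_") or stem.endswith("_test")


def source_name(test_path: str) -> str:
    """test_auth.py -> auth.py, auth_test.go -> auth.go (assumed naming rule)."""
    p = PurePosixPath(test_path)
    stem = p.stem
    if stem.startswith("test_"):
        stem = stem[len("test_"):]
    elif stem.endswith("_test"):
        stem = stem[: -len("_test")]
    return stem + p.suffix


def dir_then_path(path: str) -> tuple[str, str]:
    # Sorting by (directory, path) keeps files from one directory adjacent.
    return (str(PurePosixPath(path).parent), path)


def order_files(paths) -> list[str]:
    tests_for = defaultdict(list)
    for test in sorted((p for p in paths if is_test(p)), key=dir_then_path):
        tests_for[source_name(test)].append(test)
    ordered = []
    for src in sorted((p for p in paths if not is_test(p)), key=dir_then_path):
        ordered.append(src)
        # Pull in the tests named after this source file, wherever they live.
        ordered.extend(tests_for.pop(PurePosixPath(src).name, []))
    for leftover in tests_for.values():  # tests with no matching source
        ordered.extend(leftover)
    return ordered


def group_files(files: dict[str, int], budget: int = TOKEN_BUDGET) -> list[list[str]]:
    """files maps path -> estimated tokens; returns a list of chunks (lists of paths)."""
    chunks, current, used = [], [], 0
    for path in order_files(files):
        tokens = files[path]
        if current and used + tokens > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        chunks.append(current)
    return chunks
```

a file bigger than the budget ends up alone in an oversized chunk here; splitting it further would run into the "half a class" problem from earlier.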

Pass 3: Review

send each chunk to claude with a structured prompt asking for findings in JSON, with severity, line numbers, and reasoning. parallelize chunks but rate-limit the calls so we don't hit anthropic's rate limits.
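
a sketch of how this pass could be wired up, under a few assumptions: `call_model` is a hypothetical async function you supply (prompt in, raw reply text out), the prompt wording and the `Finding` fields are made up for the example, and the semaphore only caps concurrent requests, which is a simpler stand-in for real requests-per-minute limiting.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    severity: str  # "critical", "major" or "minor"
    reasoning: str


def build_prompt(project_summary: str, chunk_text: str) -> str:
    return (
        "You are reviewing one part of a larger project.\n"
        f"Project summary:\n{project_summary}\n\n"
        "Reply with ONLY a JSON array of objects with the keys "
        '"file", "line", "severity" (critical|major|minor) and "reasoning".\n\n'
        f"Code:\n{chunk_text}"
    )


def parse_findings(raw: str) -> list[Finding]:
    """Parse a model reply; malformed output yields no findings instead of crashing."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    findings = []
    for item in items:
        try:
            findings.append(Finding(
                file=str(item["file"]),
                line=int(item["line"]),
                severity=str(item["severity"]),
                reasoning=str(item["reasoning"]),
            ))
        except (KeyError, TypeError, ValueError):
            continue  # skip entries that don't match the schema
    return findings


async def review_all(chunks, project_summary, call_model, max_concurrent=3):
    """Review every chunk with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def review_one(chunk_text: str) -> list[Finding]:
        async with sem:
            raw = await call_model(build_prompt(project_summary, chunk_text))
        return parse_findings(raw)

    results = await asyncio.gather(*(review_one(c) for c in chunks))
    return [finding for chunk_findings in results for finding in chunk_findings]
```

taking `call_model` as a parameter keeps the sketch independent of any particular SDK and makes it easy to test with a fake.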

Tradeoffs

chunk boundary loss: if a function in chunk A is misused in chunk B, you won't catch it. mitigated partly by including a project summary in each chunk's prompt.
token budget per chunk: 8k is a sweet spot for sonnet. smaller = more API calls = more cost. bigger = quality drops.
ordering: putting more important files first means if budget runs out, you've reviewed the critical stuff. determining "important" is the hard part, currently using a heuristic (entry points + recently changed files).
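
to make the ordering heuristic concrete, here is one way "entry points + recently changed files" could be scored. the file names treated as entry points and the weights are invented for the example; the actual heuristic in Basira may look quite different.

```python
from pathlib import PurePosixPath

# Hypothetical entry-point names and weights, chosen only for illustration.
ENTRY_POINT_NAMES = {"main.py", "app.py", "main.go", "index.ts", "index.js"}
ENTRY_POINT_BONUS = 10.0
RECENCY_WEIGHT = 5.0


def priority(path: str, days_since_change: float) -> float:
    """Higher score means review sooner."""
    score = 0.0
    if PurePosixPath(path).name in ENTRY_POINT_NAMES:
        score += ENTRY_POINT_BONUS
    # Recently changed files score higher; the bonus decays with age.
    score += RECENCY_WEIGHT / (1.0 + days_since_change)
    return score


def order_by_priority(files: dict[str, float]) -> list[str]:
    """files maps path -> days since last change; ties break alphabetically."""
    return sorted(files, key=lambda p: (-priority(p, files[p]), p))
```

the days-since-change value would have to come from somewhere like git history, which is outside this sketch.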

Real Numbers

a scan of my own LogHunter repo (96 files, ~15k lines of python+go+react):

8 chunks
93k tokens in, 7k tokens out
$0.39 total
3 min wall clock
65 findings (7 critical, 32 major, 26 minor)
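
a quick sanity check on those numbers. the per-token prices below are an assumption (sonnet list pricing of $3 per million input tokens and $15 per million output tokens), not something stated above, so check current pricing before relying on them.

```python
# Assumed list prices in USD per million tokens; verify against current pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def scan_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost of one scan in USD under the assumed prices."""
    return (tokens_in / 1_000_000) * INPUT_PRICE_PER_MTOK \
        + (tokens_out / 1_000_000) * OUTPUT_PRICE_PER_MTOK


cost = scan_cost(93_000, 7_000)    # 0.279 input + 0.105 output
tokens_per_chunk = 93_000 / 8      # average input tokens per chunk
print(f"${cost:.3f} total, {tokens_per_chunk:.0f} input tokens per chunk")
```

that comes to about $0.38, in line with the $0.39 above given that the token counts are rounded. the average of roughly 11.6k input tokens per chunk sits above the ~8k code budget, presumably because each prompt also carries the instructions and project summary.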

What I Don't Know Yet

how this scales to monorepos (100k+ files). probably needs a different strategy entirely, maybe diff-based review.
whether semantic clustering (group files by what they do, not where they sit) beats directory-based grouping. would need embeddings.
if there's a way to get cross-chunk context without re-sending shared files.

Code

implementation is open source under MIT. chunking logic lives in backend/app/services/scan_engine.py. happy to discuss design decisions or take feedback.

repo: github.com/2lba/basira

if you've solved this differently i'd genuinely like to hear how.

I built an open-source SIEM that detects attacks in real time

Aziz Q. — Mon, 18 May 2026 23:49:57 +0000

I'm a Mechanical Engineering student but I spend most of my free time on cybersecurity. After a while of just doing CTFs and reading write-ups I wanted to actually build something real.

Most open-source SIEM tools are either too basic (a script that greps auth.log) or too heavy to set up without a dedicated team. I wanted something in the middle — something that looks like a real product and deploys with one command.

So I built LogHunter.

what it does

The platform has three parts:

Go collector — sits on your servers, tails SSH and Nginx log files, parses them, and ships events in batches to the engine. The binary is about 15MB.

Python detection engine (FastAPI) — runs every event through three detectors:

Brute force — tracks failed logins per IP using Redis sliding windows. 5 failures in 5 minutes = alert. (MITRE T1110)
Web attacks — regex matching for SQL injection, XSS, and path traversal. (MITRE T1190)
Impossible travel — flags when the same user logs in from two countries within an hour. (MITRE T1078)
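
To show the logic behind two of these detectors, here is a minimal in-memory sketch. It uses plain Python dictionaries as a stand-in for Redis, assumes the country for each login has already been resolved upstream (for example by a GeoIP lookup), and the class and method names are hypothetical rather than taken from the LogHunter code.

```python
from collections import defaultdict, deque


class BruteForceDetector:
    """Sliding window of failed logins per IP: 5 failures in 5 minutes alerts."""

    def __init__(self, window_seconds: float = 300, threshold: int = 5):
        self.window = window_seconds
        self.threshold = threshold
        self._failures = defaultdict(deque)  # ip -> timestamps of failures

    def record_failure(self, ip: str, ts: float) -> bool:
        """Record one failed login; return True if the IP is at the threshold."""
        q = self._failures[ip]
        q.append(ts)
        # Drop failures that have slid out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) >= self.threshold


class ImpossibleTravelDetector:
    """Flags a user who logs in from two different countries within an hour."""

    def __init__(self, window_seconds: float = 3600):
        self.window = window_seconds
        self._last = {}  # user -> (country, timestamp) of the previous login

    def record_login(self, user: str, country: str, ts: float) -> bool:
        previous = self._last.get(user)
        self._last[user] = (country, ts)
        if previous is None:
            return False
        prev_country, prev_ts = previous
        return prev_country != country and ts - prev_ts <= self.window
```

Keeping the state in Redis instead, as the engine does, lets several engine processes share the same counters and survive restarts.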

React dashboard — dark theme, live WebSocket feed, SVG world map with animated threat dots, host monitoring, and notification management. You add Slack/Discord/Telegram channels from the UI.

architecture

Collector (Go) → Engine (FastAPI) → Dashboard (React) → Postgres + Redis → Slack / Discord / Telegram

security

Since it's a security tool I tried to actually do this part right:

API key auth on event ingestion
JWT + bcrypt for dashboard access
rate-limited login (5/min per IP)
WebSocket requires valid token
CORS restricted to dashboard origin
engine refuses to start if secret key is still default
databases not exposed outside docker network
webhook secrets masked in API responses
non-root containers
all queries through SQLAlchemy ORM
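
Two of these checks are small enough to sketch: refusing to start on a default secret, and masking webhook secrets in responses. This is an illustration only; the placeholder values, the `SECRET_KEY` variable name, the 32-character minimum, and both function names are assumptions, not LogHunter's actual code.

```python
from collections.abc import Mapping

# Hypothetical placeholder values that should never reach production.
DEFAULT_SECRETS = {"", "changeme", "secret", "your-secret-key"}
MIN_SECRET_LENGTH = 32  # openssl rand -hex 32 yields 64 hex characters


def require_real_secret(env: Mapping[str, str], name: str = "SECRET_KEY") -> str:
    """Return the secret, or raise at startup if it is missing, default, or short."""
    value = env.get(name, "")
    if value in DEFAULT_SECRETS or len(value) < MIN_SECRET_LENGTH:
        raise RuntimeError(
            f"{name} is unset, default, or too short; "
            "generate one with: openssl rand -hex 32"
        )
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters, e.g. for API responses."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

Calling `require_real_secret(os.environ)` before the app object is created makes a misconfigured deployment fail immediately instead of running with a guessable key.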

try it

git clone https://github.com/2lba/loghunter.git
cd loghunter
cp .env.example .env
# generate secrets with: openssl rand -hex 32
docker-compose up --build -d

There's a demo script that fills the dashboard with realistic attack data:

chmod +x demo-data.sh
./demo-data.sh

bugs that wasted my time

passlib doesn't work with bcrypt 5.x. had to switch to raw bcrypt.
react-simple-maps doesn't support React 19. rewrote the map with d3-geo.
FastAPI CORS middleware doesn't cover error responses. wrote custom middleware.
Postgres INET columns return IPv4Address objects that Pydantic can't serialize.
special characters in .env passwords break shell scripts.

what's next

ML anomaly detection
eBPF collector
Kubernetes operator
mobile alerts app

repo: github.com/2lba/loghunter

feedback welcome.

Forem: Aziz Q.