<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aziz Q.</title>
    <description>The latest articles on Forem by Aziz Q. (@2lba).</description>
    <link>https://forem.com/2lba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939069%2F969b8f49-96fb-44e7-b291-d50b7cd3d3ca.jpeg</url>
      <title>Forem: Aziz Q.</title>
      <link>https://forem.com/2lba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/2lba"/>
    <language>en</language>
    <item>
      <title>Chunking Strategies for AI Code Review on Large Repos</title>
      <dc:creator>Aziz Q.</dc:creator>
      <pubDate>Thu, 21 May 2026 20:00:06 +0000</pubDate>
      <link>https://forem.com/2lba/chunking-strategies-for-ai-code-review-on-large-repos-3b14</link>
      <guid>https://forem.com/2lba/chunking-strategies-for-ai-code-review-on-large-repos-3b14</guid>
      <description>&lt;p&gt;i spent the last few days building an open-source AI code reviewer called Basira. one of the hardest design problems was figuring out how to feed entire github repos to an LLM without blowing past the context window or burning the budget. here's what i landed on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;a medium repo is 50-200 files, 5-50k lines. claude sonnet has a 200k token context window, but stuffing the whole repo in is wasteful: most files don't need review at the same time, and the model loses focus on a wall of unrelated code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive Approaches That Don't Work
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One file per call&lt;/strong&gt;: explodes API costs and loses cross-file context. an issue in &lt;code&gt;auth.py&lt;/code&gt; might depend on a model defined in &lt;code&gt;users.py&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whole repo in one call&lt;/strong&gt;: hits context limits once the repo grows past a few tens of thousands of lines, and quality drops as the model can't focus on what matters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random chunks&lt;/strong&gt;: breaks logical units. you get half a class or half a function reviewed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Three-Pass Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pass 1: Inventory
&lt;/h3&gt;

&lt;p&gt;walk the repo and build a file tree with each file's size, language, and estimated token count. skip binaries, lockfiles, generated code, and vendored deps. apply user-configured ignore patterns. there are no LLM calls in this pass, so it's cheap.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inventory_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FileEntry&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rglob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;should_skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FileEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;st_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;detect_language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pass 2: Grouping
&lt;/h3&gt;

&lt;p&gt;bin files into chunks of ~8k tokens each, but keep related files together. files in the same directory tend to depend on each other, so they go in the same chunk. tests follow their source file when possible.&lt;/p&gt;
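
&lt;p&gt;a minimal sketch of the grouping idea, not Basira's actual code: files are bucketed by parent directory, then packed greedily until the ~8k budget would be exceeded. &lt;code&gt;FileEntry&lt;/code&gt; here is a cut-down stand-in, &lt;code&gt;group_into_chunks&lt;/code&gt; is a hypothetical name, and the "tests follow their source file" rule is left out to keep it short.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath

TOKEN_BUDGET = 8_000  # target size from the post; a single oversized file still gets its own chunk


@dataclass(frozen=True)
class FileEntry:
    """Cut-down stand-in for the inventory entry (hypothetical fields)."""
    path: str
    tokens: int


def group_into_chunks(entries: list[FileEntry], budget: int = TOKEN_BUDGET) -> list[list[FileEntry]]:
    # Bucket files by parent directory so neighbours stay together.
    by_dir: dict[str, list[FileEntry]] = defaultdict(list)
    for entry in entries:
        by_dir[str(PurePosixPath(entry.path).parent)].append(entry)

    chunks: list[list[FileEntry]] = []
    for directory in sorted(by_dir):
        current: list[FileEntry] = []
        used = 0
        for entry in sorted(by_dir[directory], key=lambda e: e.path):
            # Close the current chunk when adding this file would exceed the budget.
            if current and used + entry.tokens > budget:
                chunks.append(current)
                current, used = [], 0
            current.append(entry)
            used += entry.tokens
        if current:
            # Chunks never span directories in this sketch.
            chunks.append(current)
    return chunks
```

&lt;p&gt;closing a chunk at every directory boundary is the simplest choice but leaves small directories as tiny chunks. merging neighbouring small directories would cut the number of API calls.&lt;/p&gt;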

&lt;h3&gt;
  
  
  Pass 3: Review
&lt;/h3&gt;

&lt;p&gt;send each chunk to claude with a structured prompt asking for findings in JSON, with severity, line numbers, and reasoning. parallelize chunks but rate-limit so we don't hit anthropic limits.&lt;/p&gt;
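
&lt;p&gt;one way to parallelize with a cap, sketched with &lt;code&gt;asyncio&lt;/code&gt;. the model call is replaced by a stub (&lt;code&gt;fake_review&lt;/code&gt;) that returns canned JSON, and &lt;code&gt;MAX_CONCURRENT&lt;/code&gt; is a made-up value. a semaphore only bounds the number of in-flight requests, so a real rate limiter would also need to track requests and tokens per minute.&lt;/p&gt;

```python
import asyncio
import json

MAX_CONCURRENT = 3  # hypothetical cap on in-flight requests


async def fake_review(chunk_id: int) -> str:
    """Stub standing in for the real model call; returns findings as a JSON string."""
    await asyncio.sleep(0.01)
    return json.dumps(
        [{"severity": "minor", "line": 1, "reasoning": f"stub finding for chunk {chunk_id}"}]
    )


async def review_all(chunk_ids, review=fake_review, limit: int = MAX_CONCURRENT) -> dict:
    semaphore = asyncio.Semaphore(limit)

    async def review_one(chunk_id):
        # At most `limit` coroutines get past this point at the same time.
        async with semaphore:
            raw = await review(chunk_id)
        return chunk_id, json.loads(raw)

    results = await asyncio.gather(*(review_one(c) for c in chunk_ids))
    return dict(results)


findings = asyncio.run(review_all(range(8)))
```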

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;chunk boundary loss&lt;/strong&gt;: if a function in chunk A is misused in chunk B, you won't catch it. mitigated partly by including a project summary in each chunk's prompt.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;token budget per chunk&lt;/strong&gt;: 8k is a sweet spot for sonnet. smaller = more API calls = more cost. bigger = quality drops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ordering&lt;/strong&gt;: putting more important files first means that if the budget runs out, you've already reviewed the critical stuff. deciding what counts as "important" is the hard part. currently it's a heuristic (entry points + recently changed files).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real Numbers
&lt;/h2&gt;

&lt;p&gt;a scan of my own LogHunter repo (96 files, ~15k lines of python+go+react):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 chunks&lt;/li&gt;
&lt;li&gt;93k tokens in, 7k tokens out&lt;/li&gt;
&lt;li&gt;$0.39 total&lt;/li&gt;
&lt;li&gt;3 min wall clock&lt;/li&gt;
&lt;li&gt;65 findings (7 critical, 32 major, 26 minor)&lt;/li&gt;
&lt;/ul&gt;
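
&lt;p&gt;a quick sanity check on the cost figure. the per-token prices below are an assumption ($3 per million input tokens and $15 per million output tokens), not something stated in the numbers above, so check current pricing before reusing them.&lt;/p&gt;

```python
# Assumed prices in USD per million tokens; not taken from the scan output.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

tokens_in = 93_000
tokens_out = 7_000
chunks = 8

cost = (tokens_in * INPUT_PRICE_PER_MTOK + tokens_out * OUTPUT_PRICE_PER_MTOK) / 1_000_000
avg_input_per_call = tokens_in / chunks

print(f"estimated cost: ${cost:.3f}")
print(f"average input tokens per call: {avg_input_per_call:.0f}")
```

&lt;p&gt;under those assumed prices the estimate comes to about $0.38, close to the reported $0.39 given that the token counts are rounded.&lt;/p&gt;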

&lt;h2&gt;
  
  
  What I Don't Know Yet
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;how this scales to monorepos (100k+ files). probably needs a different strategy entirely, maybe diff-based review.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;whether semantic clustering (group files by what they do, not where they sit) beats directory-based grouping. would need embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;if there's a way to get cross-chunk context without re-sending shared files.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;implementation is open source under MIT. chunking logic lives in &lt;code&gt;backend/app/services/scan_engine.py&lt;/code&gt;. happy to discuss design decisions or take feedback.&lt;/p&gt;

&lt;p&gt;repo: &lt;a href="https://github.com/2lba/basira" rel="noopener noreferrer"&gt;github.com/2lba/basira&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;if you've solved this differently i'd genuinely like to hear how.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>programming</category>
    </item>
    <item>
      <title>I built an open-source SIEM that detects attacks in real time</title>
      <dc:creator>Aziz Q.</dc:creator>
      <pubDate>Mon, 18 May 2026 23:49:57 +0000</pubDate>
      <link>https://forem.com/2lba/i-built-an-open-source-siem-that-detects-attacks-in-real-time-5dp2</link>
      <guid>https://forem.com/2lba/i-built-an-open-source-siem-that-detects-attacks-in-real-time-5dp2</guid>
      <description>&lt;p&gt;I'm a Mechanical Engineering student but I spend most of my free time on cybersecurity. After a while of just doing CTFs and reading write-ups I wanted to actually build something real.&lt;/p&gt;

&lt;p&gt;Most open-source SIEM tools are either too basic (a script that greps auth.log) or too heavy to set up without a dedicated team. I wanted something in the middle — something that looks like a real product and deploys with one command.&lt;/p&gt;

&lt;p&gt;So I built LogHunter.&lt;/p&gt;

&lt;h2&gt;
  
  
  what it does
&lt;/h2&gt;

&lt;p&gt;The platform has three parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go collector&lt;/strong&gt; — sits on your servers, tails SSH and Nginx log files, parses them, and ships events in batches to the engine. The binary is about 15MB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python detection engine&lt;/strong&gt; (FastAPI) — runs every event through three detectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Brute force&lt;/strong&gt; — tracks failed logins per IP using Redis sliding windows. 5 failures in 5 minutes = alert. (MITRE T1110)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web attacks&lt;/strong&gt; — regex matching for SQL injection, XSS, and path traversal. (MITRE T1190)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impossible travel&lt;/strong&gt; — flags when the same user logs in from two countries within an hour. (MITRE T1078)&lt;/li&gt;
&lt;/ul&gt;
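
&lt;p&gt;The brute-force rule can be sketched as a per-IP sliding window. This version keeps timestamps in memory with a &lt;code&gt;deque&lt;/code&gt; instead of Redis, so it is an illustration of the idea rather than LogHunter's implementation. The class name is hypothetical, and it assumes events for an IP arrive in time order.&lt;/p&gt;

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5 minutes, as in the rule above
THRESHOLD = 5         # failures within the window that trigger an alert


class BruteForceDetector:
    """In-memory stand-in for a Redis-backed sliding window (illustrative only)."""

    def __init__(self, window: int = WINDOW_SECONDS, threshold: int = THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.failures: dict[str, deque] = defaultdict(deque)

    def record_failure(self, ip: str, timestamp: float) -> bool:
        """Record one failed login and return True if the IP should raise an alert."""
        recent = self.failures[ip]
        recent.append(timestamp)
        # Drop failures that have aged out of the window. The entry just
        # appended has age 0, so the deque is never emptied here.
        while timestamp - recent[0] >= self.window:
            recent.popleft()
        return len(recent) >= self.threshold


detector = BruteForceDetector()
alerts = [detector.record_failure("203.0.113.7", t) for t in (0, 60, 120, 180, 240)]
```

&lt;p&gt;In this run only the fifth failure, at 240 seconds, returns &lt;code&gt;True&lt;/code&gt;. A later failure after a long quiet period starts a fresh window.&lt;/p&gt;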

&lt;p&gt;&lt;strong&gt;React dashboard&lt;/strong&gt; — dark theme, live WebSocket feed, SVG world map with animated threat dots, host monitoring, and notification management. You add Slack/Discord/Telegram channels from the UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  screenshots
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxvpl2w0kj6grsogsaki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxvpl2w0kj6grsogsaki.png" alt="overview" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xyjaj4qdnj512zgdqkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xyjaj4qdnj512zgdqkr.png" alt="threat map" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fropws2br5o9esfdznuv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fropws2br5o9esfdznuv7.png" alt="notifications" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  architecture
&lt;/h2&gt;

&lt;p&gt;Collector (Go) → Engine (FastAPI) → Dashboard (React) → Postgres + Redis → Slack / Discord / Telegram&lt;/p&gt;

&lt;h2&gt;
  
  
  security
&lt;/h2&gt;

&lt;p&gt;Since it's a security tool I tried to actually do this part right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API key auth on event ingestion&lt;/li&gt;
&lt;li&gt;JWT + bcrypt for dashboard access&lt;/li&gt;
&lt;li&gt;rate limited login (5/min per IP)&lt;/li&gt;
&lt;li&gt;WebSocket requires valid token&lt;/li&gt;
&lt;li&gt;CORS restricted to dashboard origin&lt;/li&gt;
&lt;li&gt;engine refuses to start if secret key is still default&lt;/li&gt;
&lt;li&gt;databases not exposed outside docker network&lt;/li&gt;
&lt;li&gt;webhook secrets masked in API responses&lt;/li&gt;
&lt;li&gt;non-root containers&lt;/li&gt;
&lt;li&gt;all queries through SQLAlchemy ORM&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  try it
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/2lba/loghunter.git
cd loghunter
cp .env.example .env
# generate secrets with: openssl rand -hex 32
docker-compose up --build -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There's a demo script that fills the dashboard with realistic attack data:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod +x demo-data.sh&lt;br&gt;
./demo-data.sh&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  bugs that wasted my time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;passlib doesn't work with bcrypt 5.x. had to switch to raw bcrypt.&lt;/li&gt;
&lt;li&gt;react-simple-maps doesn't support React 19. rewrote the map with d3-geo.&lt;/li&gt;
&lt;li&gt;FastAPI CORS middleware doesn't cover error responses. wrote custom middleware.&lt;/li&gt;
&lt;li&gt;Postgres INET columns return IPv4Address objects that Pydantic can't serialize.&lt;/li&gt;
&lt;li&gt;special characters in .env passwords break shell scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  what's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ML anomaly detection&lt;/li&gt;
&lt;li&gt;eBPF collector&lt;/li&gt;
&lt;li&gt;Kubernetes operator&lt;/li&gt;
&lt;li&gt;mobile alerts app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;repo: &lt;a href="https://github.com/2lba/loghunter" rel="noopener noreferrer"&gt;github.com/2lba/loghunter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;feedback welcome.&lt;/p&gt;

</description>
      <category>security</category>
      <category>opensource</category>
      <category>python</category>
      <category>go</category>
    </item>
  </channel>
</rss>
