<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: us</title>
    <description>The latest articles on Forem by us (@us).</description>
    <link>https://forem.com/us</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F463049%2F9ebb1ef6-458c-44cb-9bbc-60c515f91b24.jpeg</url>
      <title>Forem: us</title>
      <link>https://forem.com/us</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/us"/>
    <language>en</language>
    <item>
      <title>I couldn't stand waiting 4.6s per page on Firecrawl, so I wrote my own web scraper in Rust</title>
      <dc:creator>us</dc:creator>
      <pubDate>Mon, 09 Mar 2026 19:41:25 +0000</pubDate>
      <link>https://forem.com/us/i-couldnt-stand-waiting-46s-per-page-on-firecrawl-so-i-wrote-my-own-web-scraper-in-rust-2h46</link>
      <guid>https://forem.com/us/i-couldnt-stand-waiting-46s-per-page-on-firecrawl-so-i-wrote-my-own-web-scraper-in-rust-2h46</guid>
      <description>&lt;p&gt;Last year I was building an AI agent. Simple job: scrape web pages, convert to markdown, feed to an LLM. Classic RAG pipeline stuff.&lt;/p&gt;

&lt;p&gt;I tried Firecrawl. Nice API, solid docs, everything looked great — until I put it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4.6 seconds.&lt;/strong&gt; For a single page. My agent was spending more time waiting than thinking.&lt;/p&gt;

&lt;p&gt;Switched to Crawl4AI. Speed was okay-ish but I couldn't deploy the damn thing. Python, Playwright, Chromium, a mountain of dependencies. Docker image was 2GB. Running it on a simple VPS was an adventure in itself.&lt;/p&gt;

&lt;p&gt;Looked at Spider.cloud. Fast, but closed-source and expensive. You don't own your infrastructure.&lt;/p&gt;

&lt;p&gt;One night I thought "how hard can this be in Rust?"&lt;/p&gt;

&lt;p&gt;8 months later, here we are.&lt;/p&gt;




&lt;h2&gt;What is CRW?&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/us/crw" rel="noopener noreferrer"&gt;CRW&lt;/a&gt; is an open-source web scraping API written in Rust. It does everything Firecrawl does — scrape, crawl, map, LLM extraction — but as a single 8MB binary.&lt;/p&gt;

&lt;p&gt;Here's where it stands right now:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CRW&lt;/th&gt;
&lt;th&gt;Firecrawl&lt;/th&gt;
&lt;th&gt;Crawl4AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;833ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,600ms&lt;/td&gt;
&lt;td&gt;3,200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crawl coverage&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;77.2%&lt;/td&gt;
&lt;td&gt;~80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.6MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500MB+&lt;/td&gt;
&lt;td&gt;300MB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker image&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500MB+&lt;/td&gt;
&lt;td&gt;~2GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(&lt;a href="https://scrapeway.com" rel="noopener noreferrer"&gt;Scrapeway benchmark&lt;/a&gt; data, same 500-URL corpus)&lt;/p&gt;

&lt;p&gt;Honestly, the first benchmark results surprised even me. I knew Rust would be faster, but I didn't expect a 5.5x gap. The real surprise was coverage — the lol-html parser does seriously good work.&lt;/p&gt;




&lt;h2&gt;Okay but why "yet another scraper"?&lt;/h2&gt;

&lt;p&gt;Fair question. There are plenty of scrapers out there. But here's the thing:&lt;/p&gt;

&lt;p&gt;Firecrawl's API is actually well-designed. &lt;code&gt;/v1/scrape&lt;/code&gt;, &lt;code&gt;/v1/crawl&lt;/code&gt;, &lt;code&gt;/v1/map&lt;/code&gt; — clean, intuitive, useful. The problem isn't the API design, it's the engine underneath.&lt;/p&gt;

&lt;p&gt;So I thought: let's take the same API and rewrite the engine from scratch in Rust. That way anyone using Firecrawl can switch by &lt;strong&gt;changing one line&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- const BASE_URL = "https://api.firecrawl.dev";
&lt;/span&gt;&lt;span class="gi"&gt;+ const BASE_URL = "https://api.fastcrw.com";
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I mean it. That's literally it. Same endpoints, same request/response format.&lt;/p&gt;
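
&lt;p&gt;To make that concrete, here's a minimal client sketch where the base URL really is the only thing that changes. The &lt;code&gt;build_scrape_request&lt;/code&gt; helper is mine, not part of either API — it just shows that the &lt;code&gt;/v1/scrape&lt;/code&gt; path and JSON body stay identical across both backends:&lt;/p&gt;

```python
import json
import urllib.request

def build_scrape_request(base_url, page_url, formats=("markdown",)):
    """Build a POST request for the /v1/scrape endpoint.

    The endpoint path and JSON payload are the same for both
    backends; only base_url differs.
    """
    payload = json.dumps({"url": page_url, "formats": list(formats)}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Swapping providers is literally one string:
firecrawl_req = build_scrape_request("https://api.firecrawl.dev", "https://example.com")
crw_req = build_scrape_request("http://localhost:3000", "https://example.com")
```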




&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;Simplest possible usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 ghcr.io/us/crw:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. You now have a web scraping API running on localhost:3000. Unlimited requests, zero cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:3000/v1/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"url": "https://example.com", "formats": ["markdown"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back clean markdown. No HTML tags, no ads, no navigation menus. Clean text you can feed directly to an LLM.&lt;/p&gt;
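
&lt;p&gt;What does that look like on the wire? The shape below mirrors Firecrawl's v1 response format (a success flag plus a &lt;code&gt;data&lt;/code&gt; object with markdown and metadata); given the drop-in compatibility claim I'm assuming CRW matches it, so treat the exact field names and values as illustrative:&lt;/p&gt;

```python
import json

# Illustrative response body (assumed shape, mirroring Firecrawl's v1 format)
raw = json.dumps({
    "success": True,
    "data": {
        "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples.",
        "metadata": {"title": "Example Domain", "sourceURL": "https://example.com"},
    },
})

response = json.loads(raw)
if response["success"]:
    markdown = response["data"]["markdown"]
    # Clean markdown goes straight into an LLM prompt — no HTML cleanup step
    prompt = f"Summarize this page:\n\n{markdown}"
```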




<h2>Using it with AI agents — my original motivation</h2>

&lt;p&gt;The whole reason I built CRW was for AI agents. So the MCP server comes built-in.&lt;/p&gt;

&lt;p&gt;If you're using Claude Desktop or Cursor, just add this to your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"crw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Claude can run "scrape this page" or "crawl this site" commands directly. Your agent can freely browse the web.&lt;/p&gt;

&lt;p&gt;You can do this with Firecrawl too, but their MCP server is a separate package, separate setup, separate config. With CRW it's all inside the same binary.&lt;/p&gt;




&lt;h2&gt;LLM extraction — my favorite feature&lt;/h2&gt;

&lt;p&gt;You give it a JSON schema, CRW extracts structured data from the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:3000/v1/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://news.ycombinator.com",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "stories": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "title": {"type": "string"},
                "points": {"type": "number"}
              }
            }
          }
        }
      }
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get titles and points from the Hacker News front page as clean JSON. No regex, not even CSS selectors. The LLM understands the page and extracts what you need.&lt;/p&gt;

&lt;p&gt;I use this constantly in RAG pipelines. Pulling product data from e-commerce sites, grabbing article metadata from blogs — always this endpoint.&lt;/p&gt;
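
&lt;p&gt;The payoff is that downstream code can rely on the schema. Here's a sketch with a made-up result matching the schema from the curl example above (the titles and points are invented, not live HN data):&lt;/p&gt;

```python
# Illustrative "extract" result conforming to the schema from the
# curl example (the stories here are made up, not real HN data).
result = {
    "stories": [
        {"title": "Show HN: A tiny Rust scraper", "points": 128},
        {"title": "Why memory safety matters", "points": 342},
    ]
}

# Because the output follows the schema, this code never needs
# defensive parsing — every story has a title and numeric points.
top = max(result["stories"], key=lambda s: s["points"])
titles = [s["title"] for s in result["stories"]]
```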




&lt;h2&gt;Self-host vs Cloud&lt;/h2&gt;

&lt;p&gt;Two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-host (free, forever):&lt;/strong&gt; &lt;code&gt;docker run&lt;/code&gt; and you're done. Runs on your server, your data stays with you, no request limits. AGPL-3.0 license.&lt;/p&gt;

&lt;p&gt;It runs comfortably on a $5 DigitalOcean droplet because it uses 6.6MB of RAM at idle. Try self-hosting Firecrawl — you'll need at least 1GB of RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud (&lt;a href="https://fastcrw.com" rel="noopener noreferrer"&gt;fastcrw.com&lt;/a&gt;):&lt;/strong&gt; For when you don't want to manage servers. 50 free credits, no credit card. Proxy network, auto-scaling, Chromium fleet all included. I actually use the cloud version for my own agents because I don't want to deal with proxy management.&lt;/p&gt;




&lt;h2&gt;Why Rust?&lt;/h2&gt;

&lt;p&gt;I get this question a lot. "You could've written it in Go, you could've written it in Node" — yeah, I could've.&lt;/p&gt;

&lt;p&gt;But once the bytes arrive, web scraping is CPU-bound work. Parsing HTML, traversing the DOM, converting to markdown — these are all byte-level operations. Rust's zero-cost abstractions make a real difference here.&lt;/p&gt;

&lt;p&gt;Then there's memory. Firecrawl's Node.js + Playwright stack eats 500MB+ at idle. At 10 concurrent requests it blows past 2GB. CRW handles the same load under 50MB. That means 10x more throughput on the same server.&lt;/p&gt;
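
&lt;p&gt;The 10x claim is back-of-the-envelope arithmetic. A rough sketch using the loaded-memory figures above as per-instance footprints (real capacity also depends on CPU and workload, so this is an upper bound on instance count, not a benchmark):&lt;/p&gt;

```python
# Rough capacity estimate on a 2 GB server, using the loaded-memory
# figures quoted above as per-instance footprints (assumption, not a benchmark).
SERVER_RAM_MB = 2048

footprints = {"crw": 50, "firecrawl": 500}  # MB under load

# How many instances fit in RAM on the same box:
capacity = {name: SERVER_RAM_MB // mb for name, mb in footprints.items()}
```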

&lt;p&gt;And finally: single binary. You run &lt;code&gt;cargo build --release&lt;/code&gt;, you get an 8MB file. Alpine-based Docker image, minimal total size. Even CI/CD build times are fast.&lt;/p&gt;




&lt;h2&gt;Known gaps (let me be honest)&lt;/h2&gt;

&lt;p&gt;It's not perfect. Here's what I know is missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No WebSocket streaming yet&lt;/strong&gt; — you track crawl progress via polling. Coming soon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No screenshot/PDF capture&lt;/strong&gt; — it's on the roadmap but I haven't implemented it yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs are still growing&lt;/strong&gt; — I add to them every week but they're not as comprehensive as Firecrawl's yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm telling you this because I don't trust projects that claim to do everything perfectly with zero issues. CRW is a fast and reliable scraper, but it's still evolving.&lt;/p&gt;




&lt;h2&gt;Try it out&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Self-host in 30 seconds:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 ghcr.io/us/crw:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Try the cloud:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://fastcrw.com" rel="noopener noreferrer"&gt;fastcrw.com&lt;/a&gt; — 50 free credits, no credit card&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/us/crw" rel="noopener noreferrer"&gt;github.com/us/crw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full docs:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://us.github.io/crw/" rel="noopener noreferrer"&gt;us.github.io/crw&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you find bugs or want features, &lt;a href="https://github.com/us/crw/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;. I look at every single one.&lt;/p&gt;

&lt;p&gt;And honestly, if you star the repo I'd appreciate it. That's how open-source motivation works — you see a star come in and suddenly you're writing code at 2am again.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>rust</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How AI Learns To Read Lips?</title>
      <dc:creator>us</dc:creator>
      <pubDate>Fri, 04 Sep 2020 16:45:37 +0000</pubDate>
      <link>https://forem.com/us/how-ai-learns-to-read-lips-3ohc</link>
      <guid>https://forem.com/us/how-ai-learns-to-read-lips-3ohc</guid>
      <description>&lt;p&gt;In this study, the effect of other expressions other than the lip on the face on understanding and synthesizing what is said was investigated. For example, the words ‘park’, ‘bark’ and ‘mark’ in English can be easily confused. Indeed, only 25% to 30% of the English language can be distinguished by lip reading alone. A professional lip reader takes into account not only lip movements but also spoken subject, facial expression, head movements and grammar. And the most important features that distinguish this research from other lip-to-sound models are that they examine the unique face and speech style of each person and the contextual integrity of the sentence. And it is really being trained with much more data versus other models. &lt;/p&gt;

&lt;p&gt;For more, check out my blog post:&lt;br&gt;
&lt;a href="https://us.github.io/how-ai-learns-to-read-lips"&gt;https://us.github.io/how-ai-learns-to-read-lips&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>developer</category>
      <category>development</category>
      <category>python</category>
    </item>
    <item>
      <title>How TF-Coder Works? (Explained)</title>
      <dc:creator>us</dc:creator>
      <pubDate>Fri, 04 Sep 2020 16:43:06 +0000</pubDate>
      <link>https://forem.com/us/how-tf-coder-works-explained-52c6</link>
      <guid>https://forem.com/us/how-tf-coder-works-explained-52c6</guid>
      <description>&lt;p&gt;How does TF-Coder synthesize the answers to TensorFlow questions in  StackOverflow at the ‘superhuman’ level? What are the technologies  behind it?&lt;br&gt;
Check out my new blog post. &lt;br&gt;
&lt;a href="https://us.github.io/how-tf-coder-works"&gt;https://us.github.io/how-tf-coder-works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>development</category>
      <category>machinelearning</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
