<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: paramaw</title>
    <description>The latest articles on Forem by paramaw (@paramaw).</description>
    <link>https://forem.com/paramaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F709945%2F87b51f28-fd93-41d0-82c1-8e3d9a85be52.jpeg</url>
      <title>Forem: paramaw</title>
      <link>https://forem.com/paramaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/paramaw"/>
    <language>en</language>
    <item>
      <title>I built a web scraping companion tool to instantly make any scrapers scalable and unblockable</title>
      <dc:creator>paramaw</dc:creator>
      <pubDate>Mon, 20 Sep 2021 18:05:29 +0000</pubDate>
      <link>https://forem.com/paramaw/i-built-a-web-scraping-companion-tool-to-instantly-make-any-scrapers-scalable-and-unblockable-16jl</link>
      <guid>https://forem.com/paramaw/i-built-a-web-scraping-companion-tool-to-instantly-make-any-scrapers-scalable-and-unblockable-16jl</guid>
      <description>&lt;p&gt;Over the years of web scraping for many clients, and over billions of pages scraped at &lt;a href="https://www.datahen.com" rel="noopener noreferrer"&gt;DataHen&lt;/a&gt;, I realized that we kept on doing the same things over and over again with regards to scalability, unblockability and &lt;a href="https://github.com/DataHenHQ/till#problems-with-web-scraping" rel="noopener noreferrer"&gt;general problems that web scraping typically face&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/DataHenHQ/till" rel="noopener noreferrer"&gt;Till&lt;/a&gt;, a companion tool that integrates with any scraper in 5 minutes, with few code changes.&lt;/p&gt;

&lt;p&gt;It works as a man-in-the-middle proxy that your scraper connects to. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5yqrmhh96o62gg63op0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5yqrmhh96o62gg63op0.png" alt="How Till Works"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All you need to do is configure your scraper to use Till as its HTTP proxy, and Till handles things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User agent generation and randomization&lt;/li&gt;
&lt;li&gt;Proxy IP randomization&lt;/li&gt;
&lt;li&gt;Cookie management&lt;/li&gt;
&lt;li&gt;HTTP caching&lt;/li&gt;
&lt;li&gt;HTTP request interception&lt;/li&gt;
&lt;li&gt;Sticky sessions&lt;/li&gt;
&lt;li&gt;Request logging&lt;/li&gt;
&lt;/ul&gt;
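
&lt;p&gt;As a rough illustration of what "connecting through the proxy" could look like, here is a minimal Python sketch using only the standard library. The address &lt;code&gt;http://localhost:2933&lt;/code&gt; and the helper names &lt;code&gt;proxy_settings&lt;/code&gt; and &lt;code&gt;build_opener_via_proxy&lt;/code&gt; are assumptions made for this example, not something taken from Till itself, so check your own Till instance for the real host and port.&lt;/p&gt;

```python
import urllib.request

# Assumption: Till is running locally and listening on this address.
# The host and port are placeholders; use whatever your Till instance reports.
TILL_PROXY_URL = "http://localhost:2933"


def proxy_settings(proxy_url):
    """Route both plain HTTP and HTTPS traffic through the same proxy."""
    return {"http": proxy_url, "https": proxy_url}


def build_opener_via_proxy(proxy_url):
    """Return a urllib opener whose requests go through the given proxy."""
    handler = urllib.request.ProxyHandler(proxy_settings(proxy_url))
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    opener = build_opener_via_proxy(TILL_PROXY_URL)
    # The scraping code itself is unchanged; only the opener differs.
    # HTTPS through a man-in-the-middle proxy usually also requires
    # trusting the proxy's CA certificate (not shown here).
    with opener.open("http://example.org", timeout=30) as response:
        print(response.status)
```

&lt;p&gt;The same idea should carry over to most HTTP clients, since nearly all of them accept a proxy URL in their configuration.&lt;/p&gt;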

&lt;p&gt;When you use Till, you don't need to build much of the repetitive logic required to scale and unblock scrapers. You can simply focus on the scraping tasks themselves.&lt;/p&gt;

&lt;p&gt;Let me know if you have any feedback or comments.&lt;br&gt;
Here is the &lt;a href="https://github.com/DataHenHQ/till" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt;. Please give it a star if you find it useful.&lt;br&gt;
And here is the &lt;a href="https://till.datahen.com" rel="noopener noreferrer"&gt;product link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapingtools</category>
    </item>
  </channel>
</rss>
