<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Petr Pátek</title>
    <description>The latest articles on Forem by Petr Pátek (@petrpatek).</description>
    <link>https://forem.com/petrpatek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F451973%2F1bc0e84c-4af2-47bf-9acd-d4414e8728c8.jpeg</url>
      <title>Forem: Petr Pátek</title>
      <link>https://forem.com/petrpatek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/petrpatek"/>
    <language>en</language>
    <item>
      <title>Bypassing web scraping protection: get the most out of your proxies with shared IP address emulation</title>
      <dc:creator>Petr Pátek</dc:creator>
      <pubDate>Thu, 13 Aug 2020 14:35:39 +0000</pubDate>
      <link>https://forem.com/apify/bypassing-web-scraping-protection-get-the-most-out-of-your-proxies-with-shared-ip-address-emulation-291c</link>
      <guid>https://forem.com/apify/bypassing-web-scraping-protection-get-the-most-out-of-your-proxies-with-shared-ip-address-emulation-291c</guid>
      <description>&lt;p&gt;&lt;em&gt;Learn about modern web scraping protection techniques and how to bypass them. Scrape up to three times more pages by combining IP address rotation with shared IP address emulation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9rpegjwryjc13o54f9hl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9rpegjwryjc13o54f9hl.png" alt="Emulating multiple users routed through the same IP address." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Web scraping is used everywhere. From e-commerce to automotive, industries are collecting valuable data from the web to get ahead of the competition. But as web scraping grows in popularity and accessibility, websites employ ever more sophisticated techniques to block robots.&lt;/p&gt;

&lt;p&gt;We compare the effectiveness of plain IP address rotation and shared IP address emulation (aka &lt;a href="https://en.wikipedia.org/wiki/Session_multiplexing" rel="noopener noreferrer"&gt;session multiplexing&lt;/a&gt;) at bypassing the protections of &lt;a href="https://www.alibaba.com/" rel="noopener noreferrer"&gt;Alibaba&lt;/a&gt;, &lt;a href="https://www.google.com/" rel="noopener noreferrer"&gt;Google&lt;/a&gt; and &lt;a href="https://www.amazon.com/" rel="noopener noreferrer"&gt;Amazon&lt;/a&gt;, three sites notoriously protective of their data.&lt;/p&gt;

&lt;p&gt;Our results show that shared IP address emulation can help you bypass blocking and significantly extend the efficiency of your proxies.&lt;/p&gt;

&lt;h3&gt;What is shared IP address emulation?&lt;/h3&gt;

&lt;p&gt;Emulating shared IP address sessions relies on websites knowing that many different users can be behind a single IP address. Requests from mobile phones, for example, are usually routed through only a few IP addresses. Meanwhile, users protected by a single corporate firewall may all be using the same IP address.&lt;/p&gt;

&lt;p&gt;You can trick websites into limiting their blocking by emulating these user sessions. Shared IP address emulation relies on managing the requests you send to websites by using cookies, authentication tokens and browser HTTP signatures that make the requests look like they’re coming from multiple users routed through the same IP address.&lt;/p&gt;

&lt;h3&gt;Evaluation of shared IP address emulation&lt;/h3&gt;

&lt;p&gt;In this test, we ran a simple scraper that extracts a web page’s title and search result titles on randomly generated Alibaba, Google and Amazon search pages. Each run was performed using a new, &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;free Apify account&lt;/a&gt;, which is allocated 30 random &lt;a href="https://oxylabs.io/blog/data-center-proxies" rel="noopener noreferrer"&gt;datacenter proxies&lt;/a&gt; from a shared pool.&lt;/p&gt;

&lt;p&gt;We scraped each site first using only IP rotation and then with a fresh account using shared IP address emulation. Scraping with shared IP address emulation allowed us to scrape between two and three times more pages before being blocked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F0%2AAEinODeyMVBTdw-j" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2400%2F0%2AAEinODeyMVBTdw-j" alt="Comparison of the number of pages scraped with IP address rotation vs shared session emulation" width="1200" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Shared IP address emulation made simple with Apify SDK’s SessionPool&lt;/h2&gt;

&lt;p&gt;The open-source &lt;a href="https://sdk.apify.com/" rel="noopener noreferrer"&gt;Apify SDK&lt;/a&gt; library for Node.js provides a toolbox for web scraping, crawling and web automation tasks. Its built-in &lt;a href="https://sdk.apify.com/docs/api/session-pool" rel="noopener noreferrer"&gt;SessionPool&lt;/a&gt; class enables shared IP address emulation with a few simple configuration parameters and method calls. It is easily pluggable into parts of the Apify ecosystem such as the &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt; and &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;actors&lt;/a&gt; but can also be used separately.&lt;/p&gt;

&lt;p&gt;The code example below shows how you can create a simple crawler that uses the Apify Proxy and shared IP address emulation with the &lt;a href="https://sdk.apify.com/" rel="noopener noreferrer"&gt;Apify SDK&lt;/a&gt;. The crawler recursively crawls the Apify domain, saving the title of each page it visits.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
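&lt;p&gt;&lt;em&gt;The gist embedded here does not render in this feed, so the sketch below reconstructs it from the description above. The option names (&lt;code&gt;useApifyProxy&lt;/code&gt;, &lt;code&gt;useSessionPool&lt;/code&gt;, &lt;code&gt;persistCookiesPerSession&lt;/code&gt;) follow the 2020-era Apify SDK 0.x API and may differ in newer releases.&lt;/em&gt;&lt;/p&gt;

```javascript
// Sketch: recursively crawl the Apify domain with the Apify Proxy and a
// session pool (Apify SDK 0.x API; a reconstruction, not the original gist).
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://apify.com' });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // Rotate IP addresses through the Apify Proxy pool.
        useApifyProxy: true,
        // Emulate distinct user sessions sharing those IP addresses.
        useSessionPool: true,
        // Keep each session's cookies between requests.
        persistCookiesPerSession: true,
        handlePageFunction: async ({ request, $ }) => {
            // Save the title of each page the crawler visits.
            await Apify.pushData({ url: request.url, title: $('title').text() });
            // Recursively enqueue links that stay on the Apify domain.
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                baseUrl: request.loadedUrl,
                pseudoUrls: ['https://apify.com[.*]'],
            });
        },
    });

    await crawler.run();
});
```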


&lt;p&gt;The example uses &lt;a href="https://sdk.apify.com/docs/api/cheerio-crawler#__docusaurus" rel="noopener noreferrer"&gt;CheerioCrawler&lt;/a&gt;, Apify’s framework for the parallel crawling of web pages using plain HTTP requests and the &lt;a href="https://www.npmjs.com/package/cheerio" rel="noopener noreferrer"&gt;cheerio&lt;/a&gt; HTML parser. Cheerio is a fast, flexible and lean implementation of core &lt;a href="https://jquery.com/" rel="noopener noreferrer"&gt;jQuery&lt;/a&gt; designed specifically for the server. It parses markup and provides an API for traversing and manipulating the resulting data structure.&lt;/p&gt;

&lt;p&gt;Because it crawls with plain HTTP requests and cheerio rather than a full browser, the resulting crawler is extremely efficient.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Implementing shared IP address emulation with Apify SDK’s &lt;a href="https://sdk.apify.com/docs/api/session-pool" rel="noopener noreferrer"&gt;SessionPool&lt;/a&gt; is an easy task that can significantly reduce blocking when web scraping. It can reduce your proxy costs or simply allow you to scrape more pages.&lt;/p&gt;

&lt;p&gt;Would you like to learn more about the Apify SDK? Check out this guide on &lt;a href="https://sdk.apify.com/docs/guides/getting-started" rel="noopener noreferrer"&gt;getting started with Apify&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feel free to let us know in the comments how this approach works for you!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>dataextraction</category>
      <category>apify</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
