<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pritesh Ranjan</title>
    <description>The latest articles on Forem by Pritesh Ranjan (@pritesh_ranjan).</description>
    <link>https://forem.com/pritesh_ranjan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F978518%2Fe4429cd7-23bf-4aaa-a7ad-6ebef314bab2.jpeg</url>
      <title>Forem: Pritesh Ranjan</title>
      <link>https://forem.com/pritesh_ranjan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pritesh_ranjan"/>
    <language>en</language>
    <item>
      <title>Java Cucumber Maven Test Automation Framework: A Comprehensive Guide to RESTful API Testing</title>
      <dc:creator>Pritesh Ranjan</dc:creator>
      <pubDate>Mon, 30 Jan 2023 14:46:56 +0000</pubDate>
      <link>https://forem.com/pritesh_ranjan/java-cucumber-maven-test-automation-framework-a-comprehensive-guide-to-restful-api-testing-35lp</link>
      <guid>https://forem.com/pritesh_ranjan/java-cucumber-maven-test-automation-framework-a-comprehensive-guide-to-restful-api-testing-35lp</guid>
      <description>&lt;h2&gt;
  
  
  What is the tech stack used for this?
&lt;/h2&gt;

&lt;p&gt;The Java Cucumber Maven Test Automation Framework is a comprehensive solution for testing RESTful APIs. It combines Java, Maven as the build tool, and the simplicity of Cucumber to provide a flexible and reliable testing framework, designed to make test automation as straightforward and hassle-free as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;p&gt;The key features of this framework include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It maintains a test context between steps, allowing test data to be shared easily across Cucumber steps.&lt;/li&gt;
&lt;li&gt;It uses REST-Assured, a powerful and flexible library for API testing.&lt;/li&gt;
&lt;li&gt;It generates Cucumber reports, providing a clear and concise view of the test results.&lt;/li&gt;
&lt;li&gt;It implements the BDD methodology, giving stakeholders a clear and understandable way to read the tests.&lt;/li&gt;
&lt;li&gt;It uses Lombok to reduce boilerplate code, such as getters and setters, and improve the maintainability of the code: annotate a class with a few annotations, and Lombok generates the necessary code at compile time.&lt;/li&gt;
&lt;li&gt;It allows easy environment switching using dynamic config keys, making it easy to switch between environments such as &lt;em&gt;DEV&lt;/em&gt;, &lt;em&gt;QA&lt;/em&gt;, or &lt;em&gt;UAT&lt;/em&gt;. This is achieved with the help of the &lt;em&gt;org.aeonbits.owner&lt;/em&gt; package.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;org.framework.bdd.utils&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.aeonbits.owner.Config&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LoadPolicy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LoadType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;MERGE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@Config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="o"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"file:${user.dir}/src/main/resources/config.properties"&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;FrameworkConfiguration&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Key&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"${environment}.base-uri"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;baseUri&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="nd"&gt;@Key&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"reports"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;reportPath&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;qa&lt;/span&gt;
&lt;span class="py"&gt;reports&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;target/cucumber-reports/report.html&lt;/span&gt;
&lt;span class="c"&gt;############## DEV ###################
&lt;/span&gt;&lt;span class="py"&gt;dev.base-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;https://api.restful-api.dev&lt;/span&gt;


&lt;span class="c"&gt;############## QA #####################
&lt;/span&gt;&lt;span class="py"&gt;qa.base-uri&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;https://api.restful-api.dev&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Getting started with this framework is simple. You can clone the repository from &lt;a href="https://github.com/pritesh-ranjan/java-cucumber-framework"&gt;https://github.com/pritesh-ranjan/java-cucumber-framework&lt;/a&gt; and open the project in IntelliJ IDEA or Eclipse. Then, run the command "&lt;em&gt;mvn clean test&lt;/em&gt;" in the root directory of the project. The tests will automatically run, and the results will be generated in the &lt;em&gt;target/cucumber-reports&lt;/em&gt; directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework Structure
&lt;/h2&gt;

&lt;p&gt;The framework is structured into several key components:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;src/main/java&lt;/em&gt;: contains abstract classes, test context, utilities, models, constants, and config reader&lt;/p&gt;

&lt;p&gt;&lt;em&gt;src/main/resources&lt;/em&gt;: contains the configuration properties and test data&lt;/p&gt;

&lt;p&gt;&lt;em&gt;src/test/resources&lt;/em&gt;: contains the feature files&lt;/p&gt;

&lt;p&gt;&lt;em&gt;src/test/java&lt;/em&gt;: contains the step definitions and test runners&lt;/p&gt;

&lt;p&gt;&lt;em&gt;pom.xml&lt;/em&gt;: contains the dependencies and build configuration&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Context
&lt;/h2&gt;

&lt;p&gt;A test context is an important aspect of any Cucumber framework.&lt;/p&gt;

&lt;p&gt;It is implemented using the enum-based singleton pattern. The TestContext enum allows us to share test data between Cucumber BDD steps in a thread-safe manner. In this case, we use it to share payload, request, and response objects. Credit for this test context approach goes to: &lt;a href="https://medium.com/@bcarunmail/sharing-state-between-cucumber-step-definitions-using-java-and-spring-972bc31117af"&gt;https://medium.com/@bcarunmail/sharing-state-between-cucumber-step-definitions-using-java-and-spring-972bc31117af&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test context class is located in the &lt;em&gt;src/main/java&lt;/em&gt; directory.&lt;/p&gt;
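&lt;p&gt;A minimal sketch of what an enum-based singleton test context can look like (the class and method names here are illustrative, not the repository's exact code; a &lt;em&gt;ConcurrentHashMap&lt;/em&gt; keeps the shared state thread-safe):&lt;/p&gt;

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public enum TestContext {
    CONTEXT;  // the single, JVM-wide instance

    // Shared state (payload, request, response objects) stored by string key.
    private final Map store = new ConcurrentHashMap();

    public void set(String key, Object value) {
        store.put(key, value);
    }

    public Object get(String key) {
        return store.get(key);
    }

    // Called from a Background step to reset state between scenarios.
    public void reset() {
        store.clear();
    }
}
```

&lt;p&gt;A step definition can then call &lt;em&gt;TestContext.CONTEXT.set(...)&lt;/em&gt; in one step and read the value back in a later step of the same scenario.&lt;/p&gt;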

&lt;h2&gt;
  
  
  API Testing
&lt;/h2&gt;

&lt;p&gt;API testing is done using the REST-Assured library, which provides a powerful and flexible way to make HTTP requests and verify the responses. We use REST-Assured's &lt;em&gt;RequestSpecification&lt;/em&gt; for a one-time setup of the content headers and other common items used when making a new request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Behavior Driven Development
&lt;/h2&gt;

&lt;p&gt;The framework uses Cucumber to implement BDD. The feature files are located in the &lt;em&gt;src/test/resources&lt;/em&gt; directory and the step definitions are located in the &lt;em&gt;src/test/java&lt;/em&gt; directory. BDD provides a clear and understandable way to write tests, making it easy for stakeholders to understand the tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Gadgets API tests

  &lt;span class="kn"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; reset the test context
    &lt;span class="nf"&gt;Given &lt;/span&gt;test context is reset

  &lt;span class="kn"&gt;Scenario Outline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Get object by id
    &lt;span class="nf"&gt;Given &lt;/span&gt;a get request is made for fetching details for object with &lt;span class="s"&gt;"&amp;lt;id&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;When &lt;/span&gt;response has status 200
    &lt;span class="nf"&gt;Then &lt;/span&gt;response has valid schema
    &lt;span class="nf"&gt;And &lt;/span&gt;response contains the following &lt;span class="s"&gt;"&amp;lt;name&amp;gt;"&lt;/span&gt;

    &lt;span class="nn"&gt;Examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;id&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;2&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Apple&lt;/span&gt; &lt;span class="n"&gt;iPhone&lt;/span&gt; &lt;span class="n"&gt;12&lt;/span&gt; &lt;span class="n"&gt;Mini,&lt;/span&gt; &lt;span class="n"&gt;256GB,&lt;/span&gt; &lt;span class="n"&gt;Blue&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;7&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;HP&lt;/span&gt; &lt;span class="n"&gt;Pavilion&lt;/span&gt; &lt;span class="n"&gt;Plus&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This framework also demonstrates the use of Hooks like @&lt;em&gt;Before&lt;/em&gt; and @&lt;em&gt;After&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;org.framework.bdd.steps&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.cucumber.java.After&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io.cucumber.java.Before&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Hooks&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Before&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;beforeScenario&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"starting"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@After&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;afterScenario&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ending"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from hooks, the background steps are implemented with the &lt;em&gt;Background&lt;/em&gt; keyword in the feature file.&lt;/p&gt;

&lt;p&gt;While &lt;em&gt;Background&lt;/em&gt; specifies steps common to every scenario in a feature file, hooks run specific code before or after each scenario, or the entire feature, depending on where they are placed. Background steps execute before each scenario; hooks let you perform actions such as setting up test data or cleaning up after each scenario in an efficient and organized manner. Used together, hooks and Background can significantly improve the readability and maintainability of your test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting
&lt;/h2&gt;

&lt;p&gt;Cucumber reporting is integrated with the framework. After running the tests, the report can be found in the &lt;em&gt;target/cucumber-reports&lt;/em&gt; directory. The report provides a clear and concise view of the test results. Currently, we are using Cucumber HTML reports, but Extent Reports can be added as required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extensibility
&lt;/h2&gt;

&lt;p&gt;The framework is built to be easily extensible and can be integrated with other testing libraries and frameworks. This makes it easy to add new functionality as needed, allowing for growth and expansion as your testing needs change. The &lt;em&gt;AbstractSteps&lt;/em&gt; class stores all common methods and variables. The &lt;em&gt;FrameworkException&lt;/em&gt; class is a base exception class for all custom-defined exceptions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The framework provides a powerful and flexible platform for test automation. With its combination of Java, Cucumber, and Maven, this framework provides a robust and maintainable solution for testing RESTful APIs using BDD methodology. Whether you are just getting started with test automation or looking for a more robust solution, this type of framework design is a great choice for your testing needs.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>automation</category>
      <category>restassured</category>
      <category>java</category>
    </item>
    <item>
      <title>Advanced Web Scraping using Python-Scrapy and Splash</title>
      <dc:creator>Pritesh Ranjan</dc:creator>
      <pubDate>Thu, 24 Nov 2022 14:31:07 +0000</pubDate>
      <link>https://forem.com/epam_india_python/advanced-web-scraping-using-python-scrapy-and-splash-972</link>
      <guid>https://forem.com/epam_india_python/advanced-web-scraping-using-python-scrapy-and-splash-972</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Scrapy is a free and open-source web-crawling framework written in Python programming language. Designed for web scraping, it can also be used to extract data using APIs or as general-purpose web automation.&lt;/p&gt;

&lt;p&gt;The best part about Scrapy is its speed. Since it is asynchronous, Scrapy can make multiple requests in parallel. This makes Scrapy more memory- and CPU-efficient than conventional tools like Selenium, python-requests, Java's jsoup, or REST-Assured.&lt;/p&gt;

&lt;p&gt;One of the limitations of Scrapy is that it cannot process JavaScript. To overcome this limitation, we can use JS rendering engines like Playwright, Splash, and Selenium. Splash is a JavaScript rendering engine with an HTTP API.&lt;/p&gt;

&lt;p&gt;Now you may ask: why use Splash with Scrapy when there are already so many JS rendering engines? Because Scrapy and Splash are both built on Twisted, an event-driven networking framework. Splash is also very lightweight and capable of processing multiple pages in parallel. In my experience, Splash complements Scrapy's ability to crawl the web without hampering its performance.&lt;/p&gt;

&lt;p&gt;In this blog, we will learn how to use Splash for web crawling and automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup:
&lt;/h2&gt;

&lt;p&gt;As this is an advanced tutorial, it is assumed that you have already worked with Python 3 and the Scrapy framework and have them set up on your machine. The easiest way to set up Splash is through Docker.&lt;/p&gt;

&lt;p&gt;Let us download and install Docker from the &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; official website. Once Docker is set up, we can pull the Splash image using the following command in the terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pqyzf5gs8msi4hlfbxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pqyzf5gs8msi4hlfbxo.png" alt="Image description" width="691" height="109"&gt;&lt;/a&gt;&lt;/p&gt;
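&lt;p&gt;In case the screenshot is hard to read, the setup commands are typically the following (&lt;em&gt;scrapinghub/splash&lt;/em&gt; is the official image, and 8050 is Splash's default HTTP port):&lt;/p&gt;

```shell
# Pull the official Splash image and start it on the default port 8050.
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
```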

&lt;p&gt;Docker should now be running on your system. But before we can use it in our Scrapy framework, we need to install the Python &lt;em&gt;scrapy-splash&lt;/em&gt; package using pip.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fyo6fl9ofyiefwvl1i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fyo6fl9ofyiefwvl1i5.png" alt="Image description" width="294" height="95"&gt;&lt;/a&gt;&lt;/p&gt;
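&lt;p&gt;The install command shown in the screenshot amounts to a standard pip install:&lt;/p&gt;

```shell
# Install the Scrapy-Splash integration package.
pip install scrapy-splash
```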

&lt;p&gt;Then we need to add the Splash middleware settings into the &lt;em&gt;settings.py&lt;/em&gt; file of our Scrapy project. The most important setting to modify is the DOWNLOADER_MIDDLEWARES. It should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eex7g8k9zpq756jwcrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eex7g8k9zpq756jwcrj.png" alt="Image description" width="734" height="391"&gt;&lt;/a&gt;&lt;/p&gt;
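&lt;p&gt;For readers who cannot see the screenshot, the standard scrapy-splash settings, as documented by the package, look like this; the Splash URL assumes the Docker container from the setup step is listening on port 8050:&lt;/p&gt;

```python
# settings.py -- standard scrapy-splash wiring.

# Address of the running Splash instance (the Docker container above).
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Optional, but recommended: avoid re-crawling identical Splash requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```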

&lt;p&gt;These settings allow the Scrapy engine to communicate seamlessly with Splash. Please note that SPIDER_MIDDLEWARES and DUPEFILTER_CLASS are optional; they help us avoid duplicate requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage:
&lt;/h2&gt;

&lt;p&gt;We need to be able to tell the Scrapy engine when to use Splash for a particular request. The scrapy-splash package we just installed provides a handy &lt;em&gt;SplashRequest&lt;/em&gt; class that does just that: whenever we wish to invoke Splash and use its JS rendering capabilities, we issue a &lt;em&gt;SplashRequest&lt;/em&gt; instead of the usual &lt;em&gt;Request&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Like other headless-browser tools, Scrapy-Splash can perform certain actions on a page and modify its behavior before returning the HTML response. Splash can be configured to do this by passing arguments or by using Lua scripts; this is done through the &lt;em&gt;args&lt;/em&gt; parameter of the &lt;em&gt;SplashRequest&lt;/em&gt; class.&lt;/p&gt;

&lt;p&gt;We can also run custom JS code by passing it within the &lt;em&gt;args&lt;/em&gt; dictionary. Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi96wp4s9r9qf7qlwswf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi96wp4s9r9qf7qlwswf6.png" alt="Image description" width="525" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;SplashRequest&lt;/em&gt; call returns an object of the Scrapy &lt;em&gt;Response&lt;/em&gt; class and can be used to grab data and perform required actions. This means that we can easily swap any &lt;em&gt;Request&lt;/em&gt; call with &lt;em&gt;SplashRequest&lt;/em&gt; and the rest of the code will not be affected.&lt;/p&gt;

&lt;p&gt;Here is a list of some of the actions that can be performed using Splash (apart from custom JS execution):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wait for page elements to load&lt;/li&gt;
&lt;li&gt;Scroll the page&lt;/li&gt;
&lt;li&gt;Click on page elements&lt;/li&gt;
&lt;li&gt;Turn off images or use Adblock rules to make rendering faster&lt;/li&gt;
&lt;li&gt;Take screenshots&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While the first four actions can only be performed using a Lua script, and are out of scope for this tutorial, let us see how we can take screenshots and save them to a file. Our &lt;em&gt;SplashRequest&lt;/em&gt; call should look something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrzakkeeiyc4optrr3uf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrzakkeeiyc4optrr3uf.png" alt="Image description" width="440" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the parse method that is being referenced (in the callback argument):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnlr72h0fabvtj1dzj2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnlr72h0fabvtj1dzj2r.png" alt="Image description" width="551" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We grab the &lt;em&gt;png&lt;/em&gt; attribute of the response, which Splash returns as a base64-encoded string; after decoding it, the raw bytes can be written to a file.&lt;/p&gt;
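&lt;p&gt;To make the decoding step concrete, here is a minimal sketch (the function and variable names are assumptions): Splash delivers the screenshot as a base64 string, so we decode it before writing in binary mode:&lt;/p&gt;

```python
import base64

def save_screenshot(response_data, path):
    # 'png' arrives as a base64-encoded string from Splash;
    # decode it to raw PNG bytes and write them in binary mode.
    png_bytes = base64.b64decode(response_data["png"])
    with open(path, "wb") as f:
        f.write(png_bytes)
```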

&lt;h2&gt;
  
  
  Data extraction and storage with Scrapy:
&lt;/h2&gt;

&lt;p&gt;Whether it is a plain Scrapy &lt;em&gt;Request&lt;/em&gt; or a &lt;em&gt;SplashRequest&lt;/em&gt;, the process for extracting data from the &lt;em&gt;Response&lt;/em&gt; remains the same. Because Scrapy uses &lt;em&gt;lxml&lt;/em&gt; to build the HTML DOM tree, we can use traditional tools like &lt;em&gt;XPath&lt;/em&gt; and &lt;em&gt;CSS selectors&lt;/em&gt; to grab HTML text, attribute values, etc. To keep this tutorial simple, we will build a spider that crawls all the pages of the website and writes the data to a file. This data will include the quote text, the quote's author, and its tags. Let us see how we can extract data using XPath.&lt;/p&gt;

&lt;p&gt;Just as we perform all major actions in Selenium via the driver, here the &lt;em&gt;response&lt;/em&gt; variable can be used to perform various tasks, like grabbing an element and extracting data from it. To get all the quotes on a page, we use the following XPath:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;//div[@class="quote"]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the rest of the code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv79db8uagmd6hoo877zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv79db8uagmd6hoo877zz.png" alt="Image description" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are not only extracting data from the webpage but also following all the pagination links.&lt;/p&gt;

&lt;p&gt;Executing this code took less than 5 seconds. Performing a similar task with something like Selenium would take at least 50 seconds. Since there is no dependency on a web browser, and the approach is asynchronous, the process is sped up quite a bit.&lt;/p&gt;

&lt;p&gt;Let's see how data is extracted when we run this code. The image below shows a sample terminal output of the scraping process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt8jq2ueq734j9vjrq8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt8jq2ueq734j9vjrq8b.png" alt="Image description" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this is of no use unless we can store this scraped data into a file or database. Let's see how that's done.&lt;/p&gt;

&lt;p&gt;Scrapy allows us to store scraped data in &lt;em&gt;.csv&lt;/em&gt; or &lt;em&gt;.json&lt;/em&gt; files without writing any additional code; we enable this feature with the &lt;em&gt;-o&lt;/em&gt; parameter. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4827k8esxn090puu3f0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4827k8esxn090puu3f0k.png" alt="Image description" width="394" height="125"&gt;&lt;/a&gt;&lt;/p&gt;
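&lt;p&gt;The command in the screenshot is of the form (the spider name &lt;em&gt;quotes&lt;/em&gt; is an assumption for illustration):&lt;/p&gt;

```shell
# Run the spider and dump all yielded items to a JSON file.
scrapy crawl quotes -o output.json
```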

&lt;p&gt;This will store all the scraped data into a file called "&lt;em&gt;output.json&lt;/em&gt;". Here's a snapshot of the output file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q8yflfly6s1yj2up7e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q8yflfly6s1yj2up7e4.png" alt="Image description" width="607" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrapy Architecture:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbeg85g6063jjfrjr216.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbeg85g6063jjfrjr216.png" alt="Image description" width="739" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows the official architecture of the Scrapy framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  User agent rotation:
&lt;/h2&gt;

&lt;p&gt;A user agent identifies the client to the website: it tells the server details like the browser name, version, etc. Let us look at how to set custom user agents using Scrapy-Splash. The &lt;em&gt;SplashRequest&lt;/em&gt; call has an optional &lt;em&gt;headers&lt;/em&gt; parameter for providing custom headers; custom user agents are supplied there in a dictionary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqmwsnat5bvx59il9ajj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqmwsnat5bvx59il9ajj.png" alt="Image description" width="577" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above snippet demonstrates this in a simple, easy-to-understand manner. Say we have a list of user agents in a file: to use a new user agent with every request, we can pick one with Python's &lt;em&gt;random.choice&lt;/em&gt; and set it in the headers. Note that the &lt;em&gt;scrapy-user-agents&lt;/em&gt; package can provide an up-to-date list of user agents instead of maintaining one locally.&lt;/p&gt;
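&lt;p&gt;A self-contained sketch of the rotation idea (the user-agent strings below are illustrative placeholders, not a curated list):&lt;/p&gt;

```python
import random

# Local pool of user agents; in practice this would be loaded from a
# file or provided by the scrapy-user-agents package.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:119.0) Gecko/20100101 Firefox/119.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]

def random_headers():
    # Pick a fresh user agent for each request.
    return {"User-Agent": random.choice(USER_AGENTS)}
```

&lt;p&gt;The resulting dictionary can then be passed via the &lt;em&gt;headers&lt;/em&gt; parameter of &lt;em&gt;SplashRequest&lt;/em&gt;.&lt;/p&gt;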

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;One thing to note is that Scrapy is not a test automation tool like Selenium. Although third-party frameworks can be used alongside Scrapy for test automation, it was never designed for that purpose. It should be used for efficient and fast API or web automation, web crawling, and similar tasks.&lt;/p&gt;

&lt;p&gt;Splash can be used alongside Scrapy to process requests where rendering JS is necessary. In my experience, tools like Selenium can also be used with Scrapy, but that drastically slows it down. Splash is a natural choice for JS rendering in Scrapy because both are developed by the same company. And since Scrapy is written in Python, it is easy to learn and extremely popular in the data mining and data science community.&lt;/p&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://python.plainenglish.io/rotating-user-agent-with-scrapy-78ca141969fe" rel="noopener noreferrer"&gt;https://python.plainenglish.io/rotating-user-agent-with-scrapy-78ca141969fe&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/scrapinghub/splash" rel="noopener noreferrer"&gt;https://github.com/scrapinghub/splash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scrapeops.io/python-scrapy-playbook/scrapy-splash/" rel="noopener noreferrer"&gt;https://scrapeops.io/python-scrapy-playbook/scrapy-splash/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://splash.readthedocs.io/en/stable/index.html" rel="noopener noreferrer"&gt;https://splash.readthedocs.io/en/stable/index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/" rel="noopener noreferrer"&gt;https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scrapy.org/" rel="noopener noreferrer"&gt;https://scrapy.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;https://docs.scrapy.org/en/latest/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/scrapy/scrapy" rel="noopener noreferrer"&gt;https://github.com/scrapy/scrapy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy" rel="noopener noreferrer"&gt;https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Disclaimers:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Although web crawling is important in a variety of fields, please note that web crawling must be done ethically, and we must ensure that the script runs in accordance with the Terms of Use, etc of the website(s) being crawled.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a personal blog. The views and opinions expressed here are only those of the author and do not represent those of any organization or any individual with whom the author may be associated, professionally or personally.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
    </item>
  </channel>
</rss>
