<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: vianel</title>
    <description>The latest articles on Forem by vianel (@vianeltxt).</description>
    <link>https://forem.com/vianeltxt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F23242%2F653cfd6e-86bd-4f13-8ade-90569a108d01.jpg</url>
      <title>Forem: vianel</title>
      <link>https://forem.com/vianeltxt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vianeltxt"/>
    <language>en</language>
    <item>
      <title>Flow to Source implementation with Akka streams</title>
      <dc:creator>vianel</dc:creator>
      <pubDate>Mon, 03 Aug 2020 23:58:30 +0000</pubDate>
      <link>https://forem.com/vianeltxt/flow-to-source-implementation-with-akka-streams-1gbh</link>
      <guid>https://forem.com/vianeltxt/flow-to-source-implementation-with-akka-streams-1gbh</guid>
      <description>&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; This post requires previous knowledge of scala using the &lt;a href="https://akka.io/"&gt;Akka&lt;/a&gt; framework specifically &lt;a href="https://doc.akka.io/docs/akka/current/stream/index.html"&gt;Akka streams&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Akka streams structure
&lt;/h1&gt;

&lt;p&gt;The Akka Streams module provides an elegant DSL that helps you move data asynchronously. An Akka stream is built from three main components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source ~&amp;gt; Flow ~&amp;gt; Sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Source: emits data into the stream; it has only output ports&lt;/li&gt;
&lt;li&gt;Flow: receives data, transforms it, and sends it to the next component&lt;/li&gt;
&lt;li&gt;Sink: the end of the stream and the last step of any stream; it has only input ports&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  The problem
&lt;/h1&gt;

&lt;p&gt;So far so good, but what happens when we need to get data from a source that receives its key on the fly? For example, we need to download a file, but we don't know its name because it arrives from a previous step, and the library we use to download the file exposes it as a source.&lt;/p&gt;

&lt;h3&gt;
  
  
  FlatMapConcat FTW
&lt;/h3&gt;

&lt;p&gt;The FlatMapConcat operator lets us transform any input element into a source, which means we can connect a flow to a source, something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source ~&amp;gt; Flow ~&amp;gt; Source ~&amp;gt; Sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;A simple example would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    //Simple Source
2    val src = Source(List(1, 2, 3))
3
4    val fromSrc = b.add(Flow[Int].log("from-Src"))
5
6    //Emit new list per element received
7    val flow = b.add(Flow[Int].flatMapConcat{ element =&amp;gt;
8      Source(List(4, 5, 6))
9    })
10
11    //¯\_(ツ)_/¯ nothing to do, just for sample purposes
12    val map = b.add(Flow[Int].map{ element =&amp;gt;
13      element
14    })
15
16    src ~&amp;gt; fromSrc ~&amp;gt; flow ~&amp;gt; map ~&amp;gt; sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;At line 16 of the main graph, the third step of our process looks like a regular flow, but digging into it at line 8 we notice this flow is actually a Source: it receives data from the fromSrc flow and emits new data to the map flow. 🙌🏻&lt;/p&gt;

&lt;p&gt;That's it: we just connected a source in the middle of a stream, and it emits new data based on the data it receives.&lt;/p&gt;

&lt;p&gt;...&lt;br&gt;
If you want to run the sample, check out my gist &lt;a href="https://gist.github.com/vianel/e9a0ac0ae8c542fe0a1709a1501cafc9"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>akka</category>
      <category>scala</category>
      <category>functionalprograming</category>
      <category>streams</category>
    </item>
    <item>
      <title>How to build a Web Scraper using golang with colly</title>
      <dc:creator>vianel</dc:creator>
      <pubDate>Sat, 18 Jul 2020 02:49:57 +0000</pubDate>
      <link>https://forem.com/vianeltxt/how-to-build-a-web-scraper-using-golang-with-colly-18lh</link>
      <guid>https://forem.com/vianeltxt/how-to-build-a-web-scraper-using-golang-with-colly-18lh</guid>
      <description>&lt;p&gt;I have seen a lot of examples of how to build a web scraper in lots of programming languages mostly in python specifically using &lt;a href="https://scrapy.org/"&gt;scrapy&lt;/a&gt; tool but only a few in &lt;a href="https://golang.org/"&gt;golang&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As many golang fans know, golang has tons of benefits when it comes to concurrency and parallelism. Combined with a modern framework, these features let us scrape the web easily and fast. But first of all, what does a web scraper do?&lt;/p&gt;

&lt;h2&gt;
  
  
  Explaining Web scraper like I'm five
&lt;/h2&gt;

&lt;p&gt;Quoting &lt;a href="https://en.wikipedia.org/wiki/Web_scraping"&gt;Wikipedia&lt;/a&gt; definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So web scraping is a technique for extracting data from websites over HTTP. Think of it this way: a web scraper is basically a robot that can read the data on a website the way a human brain can read this post; it can take the text of this post, extract the data from the HTML, and use it for many purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web scraper vs Web crawler
&lt;/h2&gt;

&lt;p&gt;To keep this short: a web crawler is a bot that browses the web so a search engine like Google can index new websites, while a web scraper is responsible for extracting the data from those websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to business
&lt;/h2&gt;

&lt;p&gt;Now that we know what we are building, let's get our hands dirty. Our first step will be to create a simple server in golang with a ping endpoint; using the standard library, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log"
    "net/http"
)

func ping(w http.ResponseWriter, r *http.Request) {
    log.Println("Ping")
    w.Write([]byte("ping"))
}

func main() {
    addr := ":7171"

    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's it, we just created a simple server with golang 👏🏻. Run it like any other golang program using &lt;code&gt;go run main.go&lt;/code&gt; and you will see a log saying our server is listening on port &lt;code&gt;7171&lt;/code&gt;. To test it with curl, just run &lt;code&gt;curl -s 'http://127.0.0.1:7171/ping'&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Thanks to the http package we can easily create a server just by calling the &lt;code&gt;ListenAndServe&lt;/code&gt; function with a port, and with &lt;code&gt;HandleFunc&lt;/code&gt; we can manage the different endpoints of our API. &lt;/p&gt;

&lt;p&gt;Now we have a server up and running. Let's create another endpoint, this one will extract some data from the colly website.&lt;/p&gt;

&lt;p&gt;First of all, we need to install the colly dependency. To do this I highly recommend using &lt;a href="https://blog.golang.org/using-go-modules"&gt;go modules&lt;/a&gt;: just run &lt;code&gt;go mod init &amp;lt;project-name&amp;gt;&lt;/code&gt;, which generates the &lt;code&gt;go.mod&lt;/code&gt; file where all the dependencies used in the project are declared. Open &lt;code&gt;go.mod&lt;/code&gt; and add the colly dependency in the require section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require (
github.com/gocolly/colly v1.2.0
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and that's it: go modules will take care of downloading the dependency to your local machine.&lt;/p&gt;

&lt;p&gt;We are all set to extract data from websites, so let's create a function that gets all the links from any website:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func getData(w http.ResponseWriter, r *http.Request) {
    // Verify the param "url" exists
    URL := r.URL.Query().Get("url")
    if URL == "" {
        log.Println("missing URL argument")
        http.Error(w, "missing url argument", http.StatusBadRequest)
        return
    }
    log.Println("visiting", URL)

    // Create a new collector, which is in charge of collecting the data from the HTML
    c := colly.NewCollector()

    // Slice to store the data
    var response []string

    // OnHTML registers a callback that runs whenever the collector reaches
    // the given HTML selector; in this case, whenever it finds an anchor
    // tag with an href, the anonymous function below resolves the href to
    // an absolute URL and appends it to our slice
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            response = append(response, link)
        }
    })

    // Visit the website and report any failure
    if err := c.Visit(URL); err != nil {
        log.Println("failed to visit:", err)
        http.Error(w, "failed to visit URL", http.StatusBadGateway)
        return
    }

    // Serialize our response slice into JSON format
    b, err := json.Marshal(response)
    if err != nil {
        log.Println("failed to serialize response:", err)
        http.Error(w, "failed to serialize response", http.StatusInternalServerError)
        return
    }
    // Add the content-type header and write the body for our endpoint
    w.Header().Add("Content-Type", "application/json")
    w.Write(b)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This function will extract all links from any website specified in the &lt;code&gt;url&lt;/code&gt; param.&lt;/p&gt;

&lt;p&gt;And finally, let's add our new endpoint to our server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func main() {
    addr := ":7171"

    http.HandleFunc("/search", getData)
    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Done! We have our new search GET endpoint, which receives the &lt;code&gt;url&lt;/code&gt; param and extracts all the links from the specified website. To test it with curl you can use &lt;code&gt;curl -s 'http://127.0.0.1:7171/search?url=http://go-colly.org/'&lt;/code&gt;; it will respond with something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;["http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/","http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/docs/","https://github.com/gocolly/colly/blob/master/LICENSE.txt","https://github.com/gocolly/colly","http://go-colly.org/contact/","http://go-colly.org/docs/","http://go-colly.org/services/","https://github.com/gocolly/colly","https://github.com/gocolly/site/","http://go-colly.org/sitemap.xml"]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
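&lt;p&gt;Looking at the response above, you may notice the same link appears more than once, because the collector fires for every anchor on the page. If the repeats bother you, a small hypothetical &lt;code&gt;dedupe&lt;/code&gt; helper (not part of colly, just plain Go) can filter them before marshaling:&lt;br&gt;
&lt;/p&gt;

```go
package main

import "fmt"

// dedupe keeps the first occurrence of each link and drops repeats,
// preserving the original order.
func dedupe(links []string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0, len(links))
	for _, l := range links {
		if !seen[l] {
			seen[l] = true
			out = append(out, l)
		}
	}
	return out
}

func main() {
	links := []string{"http://go-colly.org/", "http://go-colly.org/docs/", "http://go-colly.org/"}
	fmt.Println(dedupe(links)) // prints: [http://go-colly.org/ http://go-colly.org/docs/]
}
```

&lt;p&gt;Calling &lt;code&gt;dedupe(response)&lt;/code&gt; right before &lt;code&gt;json.Marshal&lt;/code&gt; in &lt;code&gt;getData&lt;/code&gt; would keep only the first occurrence of each link.&lt;/p&gt;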



&lt;h1&gt;
  
  
  Conquer all the web
&lt;/h1&gt;

&lt;p&gt;Congrats! You just created a web scraper API in golang 🙌🏻.&lt;br&gt;
Before conquering the whole web, remember to check the website's robots.txt file to ensure you are allowed to extract its data. For those who are not aware of what &lt;a href="http://www.robotstxt.org/"&gt;robots.txt&lt;/a&gt; is:&lt;br&gt;
it is a file websites publish that tells robots, like web crawlers and web scrapers, how to handle their data and which endpoints the robots may use. Google's, for example, is at &lt;code&gt;https://www.google.com/robots.txt&lt;/code&gt; &lt;/p&gt;
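&lt;p&gt;To give an idea of what honoring robots.txt can look like, here is a very simplified, hypothetical checker in plain Go. Real parsers also handle user-agent groups, Allow rules, and wildcards, so treat this only as a sketch:&lt;br&gt;
&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// disallowed reports whether path matches any Disallow rule in a
// (very simplified) robots.txt body. A real parser would also honor
// user-agent groups, Allow rules, and wildcards.
func disallowed(robots, path string) bool {
	for _, line := range strings.Split(robots, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Disallow:") {
			rule := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:"))
			if rule != "" && strings.HasPrefix(path, rule) {
				return true
			}
		}
	}
	return false
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\nDisallow: /tmp/"
	fmt.Println(disallowed(robots, "/private/data")) // prints: true
	fmt.Println(disallowed(robots, "/articles/"))    // prints: false
}
```

&lt;p&gt;In a real scraper you would first fetch the site's robots.txt and skip any URL whose path this kind of check reports as disallowed.&lt;/p&gt;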

&lt;h2&gt;
  
  
  Next step
&lt;/h2&gt;

&lt;p&gt;To keep going, check out the &lt;a href="http://go-colly.org/"&gt;colly&lt;/a&gt; documentation; it has a lot of web scraper and web crawler examples. &lt;/p&gt;

</description>
      <category>go</category>
      <category>scraper</category>
      <category>web</category>
      <category>likeimfive</category>
    </item>
  </channel>
</rss>
