<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: vianel</title>
    <description>The latest articles on Forem by vianel (@vianeltxt).</description>
    <link>https://forem.com/vianeltxt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F23242%2F653cfd6e-86bd-4f13-8ade-90569a108d01.jpg</url>
      <title>Forem: vianel</title>
      <link>https://forem.com/vianeltxt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vianeltxt"/>
    <language>en</language>
    <item>
      <title>Flow to Source implementation with Akka streams</title>
      <dc:creator>vianel</dc:creator>
      <pubDate>Mon, 03 Aug 2020 23:58:30 +0000</pubDate>
      <link>https://forem.com/vianeltxt/flow-to-source-implementation-with-akka-streams-1gbh</link>
      <guid>https://forem.com/vianeltxt/flow-to-source-implementation-with-akka-streams-1gbh</guid>
      <description>&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; This post requires previous knowledge of scala using the &lt;a href="https://akka.io/"&gt;Akka&lt;/a&gt; framework specifically &lt;a href="https://doc.akka.io/docs/akka/current/stream/index.html"&gt;Akka streams&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Akka streams structure
&lt;/h1&gt;

&lt;p&gt;The Akka Streams module provides an elegant DSL that helps you move data asynchronously. An Akka stream is built from three main components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source ~&amp;gt; Flow ~&amp;gt; Sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Source: emits data into the stream; it has only output ports&lt;/li&gt;
&lt;li&gt;Flow: receives data, transforms it, and sends it to the next component&lt;/li&gt;
&lt;li&gt;Sink: the end of the stream and the last step of any stream; it has only input ports&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  The problem
&lt;/h1&gt;

&lt;p&gt;So far so good, but what happens when we need to get data from a source that receives its key on the fly? For example, we need to download a file, but we don't know its name because it arrives from a previous step, and the library we use to download the file exposes it as a source.&lt;/p&gt;

&lt;h3&gt;
  
  
  FlatMapConcat FTW
&lt;/h3&gt;

&lt;p&gt;The FlatMapConcat operator lets us transform any input element into a source, which means we can connect a flow to a source, something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source ~&amp;gt; Flow ~&amp;gt; Source ~&amp;gt; Sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;A simple example would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1    //Simple Source
2    val src = Source(List(1, 2, 3))
3
4    val fromSrc = b.add(Flow[Int].log("from-Src"))
5
6    //Emit new list per element received
7    val flow = b.add(Flow[Int].flatMapConcat{ element =&amp;gt;
8      Source(List(4, 5, 6))
9    })
10
11    //¯\_(ツ)_/¯ nothing to do, just for sample purposes
12    val map = b.add(Flow[Int].map{ element =&amp;gt;
13      element
14    })
15
16    src ~&amp;gt; fromSrc ~&amp;gt; flow ~&amp;gt; map ~&amp;gt; sink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;At line 16 of the main graph, the third step of our process looks like a regular flow, but digging into it at line 8 we notice this flow is actually a Source: it receives data from the fromSrc flow and emits new data to the map flow. 🙌🏻&lt;/p&gt;

&lt;p&gt;That's it: we just connected a source in the middle of a stream, and it emits new data based on the data it receives.&lt;/p&gt;

&lt;p&gt;...&lt;br&gt;
If you want to run the sample, check out my gist &lt;a href="https://gist.github.com/vianel/e9a0ac0ae8c542fe0a1709a1501cafc9"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>akka</category>
      <category>scala</category>
      <category>functionalprograming</category>
      <category>streams</category>
    </item>
    <item>
      <title>How to build a Web Scraper using golang with colly</title>
      <dc:creator>vianel</dc:creator>
      <pubDate>Sat, 18 Jul 2020 02:49:57 +0000</pubDate>
      <link>https://forem.com/vianeltxt/how-to-build-a-web-scraper-using-golang-with-colly-18lh</link>
      <guid>https://forem.com/vianeltxt/how-to-build-a-web-scraper-using-golang-with-colly-18lh</guid>
      <description>&lt;p&gt;I have seen a lot of examples of how to build a web scraper in lots of programming languages mostly in python specifically using &lt;a href="https://scrapy.org/"&gt;scrapy&lt;/a&gt; tool but only a few in &lt;a href="https://golang.org/"&gt;golang&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As many golang fans know, golang has tons of benefits when it comes to concurrency and parallelism. Combined with a modern framework, these features let us scrape the web easily and fast. But first of all, what does a web scraper do?&lt;/p&gt;

&lt;h2&gt;
  
  
  Explaining Web scraper like I'm five
&lt;/h2&gt;

&lt;p&gt;Quoting &lt;a href="https://en.wikipedia.org/wiki/Web_scraping"&gt;Wikipedia&lt;/a&gt; definition:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So web scraping is a technique for extracting data from websites over HTTP. Think of it this way: a web scraper is basically a robot that can read the data on a website the way a human brain can read this post; it can take the text of this post, extract the data from the HTML, and use it for many purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web scraper vs Web crawler
&lt;/h2&gt;

&lt;p&gt;To keep this short: a web crawler is a bot that browses the web so a search engine like Google can index new websites, while a web scraper is responsible for extracting the data from those websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to business
&lt;/h2&gt;

&lt;p&gt;Now that we know what we are building, let's get our hands dirty. Our first step will be to create a simple server in golang with a ping endpoint; using the standard library, it looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log"
    "net/http"
)

func ping(w http.ResponseWriter, r *http.Request) {
    log.Println("Ping")
    w.Write([]byte("ping"))
}

func main() {
    addr := ":7171"

    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's it, we just created a simple server with golang 👏🏻. Run it like any other golang program using &lt;code&gt;go run main.go&lt;/code&gt; and you will see a log saying our server is listening on port &lt;code&gt;7171&lt;/code&gt;. To test it with curl, just run &lt;code&gt;curl -s 'http://127.0.0.1:7171/ping'&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Thanks to the http package we can easily create a server just by calling the &lt;code&gt;ListenAndServe&lt;/code&gt; function with a port, and with &lt;code&gt;HandleFunc&lt;/code&gt; we can manage the different endpoints of our API. &lt;/p&gt;

&lt;p&gt;Now we have a server up and running. Let's create another endpoint, this one will extract some data from the colly website.&lt;/p&gt;

&lt;p&gt;First of all, we need to install the colly dependency. To do this I highly recommend using &lt;a href="https://blog.golang.org/using-go-modules"&gt;go modules&lt;/a&gt;: just run &lt;code&gt;go mod init &amp;lt;project-name&amp;gt;&lt;/code&gt;, which generates the &lt;code&gt;go.mod&lt;/code&gt; file where all the dependencies used in the project are declared. Open &lt;code&gt;go.mod&lt;/code&gt; and add the colly dependency in the require section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require (
github.com/gocolly/colly v1.2.0
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and that's it: go modules will take care of downloading the dependency to your local machine.&lt;/p&gt;

&lt;p&gt;We are all set to extract data from websites, so let's create a function that gets all the links from any website:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func getData(w http.ResponseWriter, r *http.Request) {
    // Verify the param "url" exists
    URL := r.URL.Query().Get("url")
    if URL == "" {
        log.Println("missing URL argument")
        http.Error(w, "missing url argument", http.StatusBadRequest)
        return
    }
    log.Println("visiting", URL)

    // Create a new collector, which is in charge of collecting the data from the HTML
    c := colly.NewCollector()

    // Slice to store the data
    var response []string

    // OnHTML registers a callback that runs whenever the collector reaches
    // the given HTML selector; in this case, whenever it finds an anchor
    // tag with an href, the anonymous function below resolves the href to
    // an absolute URL and appends it to our slice
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            response = append(response, link)
        }
    })

    // Visit the website and report any failure
    if err := c.Visit(URL); err != nil {
        log.Println("failed to visit:", err)
        http.Error(w, "failed to visit URL", http.StatusBadGateway)
        return
    }

    // Serialize our response slice into JSON format
    b, err := json.Marshal(response)
    if err != nil {
        log.Println("failed to serialize response:", err)
        http.Error(w, "failed to serialize response", http.StatusInternalServerError)
        return
    }
    // Add the content-type header and write the body for our endpoint
    w.Header().Add("Content-Type", "application/json")
    w.Write(b)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This function will extract all links from any website specified in the &lt;code&gt;url&lt;/code&gt; param.&lt;/p&gt;

&lt;p&gt;And finally, let's add our new endpoint to our server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func main() {
    addr := ":7171"

    http.HandleFunc("/search", getData)
    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Done! We have our new search GET endpoint, which receives the &lt;code&gt;url&lt;/code&gt; param and extracts all the links from the specified website. To test it with curl you can use &lt;code&gt;curl -s 'http://127.0.0.1:7171/search?url=http://go-colly.org/'&lt;/code&gt;; it will respond with something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;["http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/","http://go-colly.org/","http://go-colly.org/docs/","http://go-colly.org/articles/","http://go-colly.org/services/","http://go-colly.org/datasets/","https://godoc.org/github.com/gocolly/colly","https://github.com/gocolly/colly","https://github.com/gocolly/colly","http://go-colly.org/docs/","https://github.com/gocolly/colly/blob/master/LICENSE.txt","https://github.com/gocolly/colly","http://go-colly.org/contact/","http://go-colly.org/docs/","http://go-colly.org/services/","https://github.com/gocolly/colly","https://github.com/gocolly/site/","http://go-colly.org/sitemap.xml"]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
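&lt;p&gt;Looking at the response above, you may notice the same link appears more than once, because the collector fires for every anchor on the page. If the repeats bother you, a small hypothetical &lt;code&gt;dedupe&lt;/code&gt; helper (not part of colly, just plain Go) can filter them before marshaling:&lt;br&gt;
&lt;/p&gt;

```go
package main

import "fmt"

// dedupe keeps the first occurrence of each link and drops repeats,
// preserving the original order.
func dedupe(links []string) []string {
	seen := make(map[string]bool)
	out := make([]string, 0, len(links))
	for _, l := range links {
		if !seen[l] {
			seen[l] = true
			out = append(out, l)
		}
	}
	return out
}

func main() {
	links := []string{"http://go-colly.org/", "http://go-colly.org/docs/", "http://go-colly.org/"}
	fmt.Println(dedupe(links)) // prints: [http://go-colly.org/ http://go-colly.org/docs/]
}
```

&lt;p&gt;Calling &lt;code&gt;dedupe(response)&lt;/code&gt; right before &lt;code&gt;json.Marshal&lt;/code&gt; in &lt;code&gt;getData&lt;/code&gt; would keep only the first occurrence of each link.&lt;/p&gt;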



&lt;h1&gt;
  
  
  Conquer all the web
&lt;/h1&gt;

&lt;p&gt;Congrats! You just created a web scraper API in golang 🙌🏻.&lt;br&gt;
Before conquering the whole web, remember to check the website's robots.txt file to ensure you are allowed to extract its data. For those who are not aware of what &lt;a href="http://www.robotstxt.org/"&gt;robots.txt&lt;/a&gt; is:&lt;br&gt;
it is a file websites publish that tells robots, like web crawlers and web scrapers, how to handle their data and which endpoints the robots may use. Google's, for example, is at &lt;code&gt;https://www.google.com/robots.txt&lt;/code&gt; &lt;/p&gt;
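&lt;p&gt;To give an idea of what honoring robots.txt can look like, here is a very simplified, hypothetical checker in plain Go. Real parsers also handle user-agent groups, Allow rules, and wildcards, so treat this only as a sketch:&lt;br&gt;
&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// disallowed reports whether path matches any Disallow rule in a
// (very simplified) robots.txt body. A real parser would also honor
// user-agent groups, Allow rules, and wildcards.
func disallowed(robots, path string) bool {
	for _, line := range strings.Split(robots, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Disallow:") {
			rule := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:"))
			if rule != "" && strings.HasPrefix(path, rule) {
				return true
			}
		}
	}
	return false
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\nDisallow: /tmp/"
	fmt.Println(disallowed(robots, "/private/data")) // prints: true
	fmt.Println(disallowed(robots, "/articles/"))    // prints: false
}
```

&lt;p&gt;In a real scraper you would first fetch the site's robots.txt and skip any URL whose path this kind of check reports as disallowed.&lt;/p&gt;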

&lt;h2&gt;
  
  
  Next step
&lt;/h2&gt;

&lt;p&gt;To keep going, check out the &lt;a href="http://go-colly.org/"&gt;colly&lt;/a&gt; documentation; it has a lot of web scraper and web crawler examples. &lt;/p&gt;

</description>
      <category>go</category>
      <category>scraper</category>
      <category>web</category>
      <category>likeimfive</category>
    </item>
  </channel>
</rss>
