Forem: Ivan Mushketyk

Should you use DynamoDB?

Ivan Mushketyk — Mon, 06 Nov 2017 18:47:08 +0000

Selecting the right technology for a new project is always stressful. We need to find something that fits all existing requirements, does not restrict further growth, delivers the necessary performance, does not impose a heavy operational burden, and so on. It’s only natural that selecting a database can be tricky.

In this article, I would like to describe DynamoDB, a database created by AWS. My goal is to give you enough knowledge to answer a simple question: “Should I use DynamoDB in my next project?” I will describe the cases in which DynamoDB can be used efficiently and the pitfalls to avoid. I hope this will make your life easier.

This article starts with a general overview of DynamoDB; then I will show how to structure data for DynamoDB and what options you have for working with it. The article finishes with a rundown of some advanced DynamoDB features.

What is DynamoDB

Let’s start with what AWS DynamoDB is. DynamoDB is a NoSQL, key-value/document-oriented database. As a key-value database, it allows storing an item with an id and then getting the item back by that id. As a document-oriented database, it allows storing complex nested documents.

DynamoDB is a serverless database, meaning that when you work with it you do not need to worry about individual machines. In fact, there is not even a way to find out how many machines Amazon is using to serve your data. Instead of working with individual servers, you specify how many read and write requests your database should process.

On the one hand, this allows Amazon to provide predictable low latency, which according to many sources is less than 10 ms if a request comes from an EC2 host in the same AWS region. The latency can be even lower (<1 ms) if you enable a cache for DynamoDB (more on this in a later section).

On the other hand, specifying a number of requests instead of a number of servers lets you concentrate on the business value of the database and not on implementation details. If you have an estimate of how many requests you need to process, all you need to do is specify a number in the AWS console or perform an API request. It can’t get much simpler.

The price that you pay for using DynamoDB is primarily determined by the provisioned capacity of your DynamoDB tables. The higher it is, the higher the monthly bill. DynamoDB also provides a small amount of so-called “burst capacity”, which can be used for a short period of time to read or write more data than your provisioned capacity allows. If you’ve consumed it and still read or write more data, DynamoDB will return a ProvisionedThroughputExceededException, and you need either to retry the operation or to provision more capacity.

In addition to these major features, there are a few other reasons why you might consider DynamoDB:
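The retry option can be sketched as a small loop with exponential backoff. This is only an illustration in plain Java: ThroughputExceededException is a hypothetical stand-in for the SDK exception, callWithRetry is a made-up helper name, and the attempt count and delays are arbitrary assumptions.

```java
import java.util.function.Supplier;

public class RetryWithBackoff {

    // Hypothetical stand-in for the SDK's ProvisionedThroughputExceededException,
    // declared here only so that the sketch compiles without the AWS SDK
    static class ThroughputExceededException extends RuntimeException {
        ThroughputExceededException(String message) {
            super(message);
        }
    }

    // Runs the operation and retries it with exponentially growing pauses
    // while it is throttled; rethrows the error after maxAttempts tries
    static <T> T callWithRetry(Supplier<T> operation, int maxAttempts, long baseDelayMillis) {
        long delay = baseDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (ThroughputExceededException e) {
                if (attempt >= maxAttempts) {
                    throw e; // still throttled: time to provision more capacity
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", interrupted);
                }
                delay *= 2; // double the pause before the next attempt
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request that is throttled twice and then succeeds
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ThroughputExceededException("throttled");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The AWS SDKs generally retry throttled requests on their own, so a loop like this is mostly useful for understanding the behavior or for adding a policy on top of the SDK defaults.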

Massive scale – just like other AWS services, DynamoDB can work on a massive scale. Many companies, such as Airbnb, Lyft, and Duolingo, are using DynamoDB in production.
Low operational overhead – you still need to do some operational tasks, such as ensuring that you have enough provisioned capacity, but most of the operational load is taken on by the AWS team.
Reliable – despite a few outages, DynamoDB has a proven track record as a rock-solid database solution. Also, all data that is written to DynamoDB is replicated to three different locations.
Schemaless – just like many other NoSQL databases, DynamoDB does not impose a strict schema, which allows more flexibility.
Simple API – the DynamoDB API is very straightforward. Overall it has fewer than twenty methods, and only a handful of them are related to writing and reading data.
Autoscaling – it is pretty straightforward to scale a DynamoDB table up or down. All you need to do is enable autoscaling on a particular table, and AWS will automatically increase or decrease provisioned capacity depending on the current load. Alternatively, you can perform the UpdateTable API call and change the provisioned capacity yourself.
Integration with other AWS services – DynamoDB is one of the core AWS services and has good integration with other services. You can use it together with CloudSearch to enable full-text search, perform data analytics with AWS EMR, back up data with AWS Data Pipeline, etc.

Data model in DynamoDB

Now let’s take a look at how to store data in DynamoDB. Data in DynamoDB is separated into tables. When you create a table, you need to decide on the key type that your table will have. DynamoDB has two types of keys, and once you select a key type you can’t change it:

Simple key – in this case, you need to identify which attribute in the table contains the key. This key is called a partition key. With this key type, DynamoDB does not give you a lot of flexibility, and the only operation that you can do efficiently is to store an item with a key and get the item back by that key.
Composite key – in this case, you need to specify two key attributes, which are called a partition key and a sort key. As in the previous case, you can get an item by key, but you can also query this data in a more elaborate way. For example, you can get all items with the same partition key, sort the results by the value of the sort key, filter items using the value of the sort key, etc. The partition key/sort key pair should be unique for each item.

Let’s take a look at some examples of using simple and composite keys. For example, if we want to store a table with users in our database we could use a table with a simple key and store it like this:

With this table, the only operation that we can perform efficiently is to get a user by id.

Composite keys allow more flexibility. For example, we could define a table with a composite key that stores forum messages and select the user id as a partition key and the timestamp as a sort key:

This structure would allow performing more complex queries like:

Get all forum posts written by a user with a specified id
Get all forum posts written by a specified user sorted by time (we can do this because we have the sort key)
Get all forum posts written by a specified user in a specified time interval (we can do this because we can specify a condition on the sort key)

Notice that in this case, you can only query data for a specified partition key. If you want to search for items across partition keys you need to use the scan operation. It allows finding all items in a table that match a specified filter expression. This operation is less restrictive than the first two, but it does not exploit any knowledge about where data is stored in DynamoDB and ends up scanning the whole table.

You should try to use the scan operation as little as possible. While it allows much more flexibility, it is significantly slower and consumes more provisioned capacity. If you are extensively relying on scanning big tables, you won’t achieve high performance and will have to provision more capacity and hence pay more money.
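To make the difference between a query and a scan more tangible, here is a toy in-memory model of the forum messages table. It is a sketch, not how DynamoDB is implemented: the ForumMessagesModel class and its methods are invented for illustration, and the itemsExamined counter only approximates the idea of consumed capacity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ForumMessagesModel {

    // Partition key (user id) -> items ordered by sort key (timestamp)
    private final Map<Integer, TreeMap<Long, String>> partitions = new HashMap<>();
    // Counts how many items the operations had to look at
    int itemsExamined = 0;

    void put(int userId, long timestamp, String message) {
        partitions.computeIfAbsent(userId, k -> new TreeMap<>()).put(timestamp, message);
    }

    // "Query": touches one partition and a range of sort keys only;
    // results come back ordered by the sort key
    List<String> query(int userId, long from, long to) {
        TreeMap<Long, String> partition = partitions.getOrDefault(userId, new TreeMap<>());
        List<String> result = new ArrayList<>(partition.subMap(from, true, to, true).values());
        itemsExamined += result.size();
        return result;
    }

    // "Scan": visits every item in every partition and filters afterwards
    List<String> scan(long from, long to) {
        List<String> result = new ArrayList<>();
        for (TreeMap<Long, String> partition : partitions.values()) {
            for (Map.Entry<Long, String> item : partition.entrySet()) {
                itemsExamined++;
                if (item.getKey() >= from && item.getKey() <= to) {
                    result.add(item.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ForumMessagesModel table = new ForumMessagesModel();
        table.put(1, 100L, "first");
        table.put(1, 200L, "second");
        table.put(1, 300L, "third");
        table.put(2, 150L, "other user");

        System.out.println(table.query(1, 150L, 300L)); // [second, third]
        System.out.println("examined after query: " + table.itemsExamined); // 2
        System.out.println(table.scan(150L, 300L).size()); // 3
        System.out.println("examined after scan: " + table.itemsExamined); // 6
    }
}
```

The query examines only the two matching items, while the scan has to examine all four items in the table to return three of them.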

Consistency in DynamoDB

As with many other NoSQL databases, you can select a consistency level when you perform operations with DynamoDB. DynamoDB stores three copies of each item, and when you write data to DynamoDB it only acknowledges a write after two copies out of three have been updated. The third copy is updated later.

When you read data from DynamoDB, you have two options. You can either use strong consistency, in which case DynamoDB will read data from two copies and return the latest data, or you can select eventual consistency, in which case DynamoDB will only read data from one copy at random and may return stale data.

Indexes

The simple key and composite key model is quite restrictive and is not enough to support complex use cases. To help with that, DynamoDB supports two index types:

Local secondary index – this is very similar to a composite key and is used to define an additional sort order or to filter items by a different criterion. The main difference from a composite key is that a partition key/sort key pair must be unique, while a partition key/secondary index key pair does not have to be unique.
Global secondary index – this allows using a different partition key for your data. You can use a global secondary index if you want to get an item from a table by one of two ids, for example, a book in an online shop can have two ids: ISBN-10 and ISBN-13. As with regular tables, global secondary indexes can have simple and composite keys.

Internally, a global secondary index is simply a copy of your original data in a separate DynamoDB table with a different key. When an item is written into a table with a global secondary index, DynamoDB copies the data in the background. Because of this, reading data from a global secondary index is always eventually consistent.

At the time of writing, you can create up to five local secondary indexes and up to five global secondary indexes per table.

Complex data

All examples so far presented data in table format but DynamoDB also supports complex data types. In addition to simple data types like numbers and strings DynamoDB supports these types:

Nested object – a value of an attribute in DynamoDB can be a complex nested object
Set – a set of numbers, strings, or binary values
List – an untyped list that can contain any values

Programmatic access

If you want to access DynamoDB, you have two main options: the low-level API and the DynamoDB mapper. All communication with DynamoDB is performed via HTTP. To read data, there are just four methods:

GetItem – get a single item by id from a database
BatchGetItem – get several items by id in one call
Query – query a composite key or an index
Scan – scan through a table

And there are just four methods to change data in DynamoDB:

PutItem – write a new item to a table
BatchWriteItem – write multiple items to a table
UpdateItem – update some fields in a specified item
DeleteItem – remove an item by id

Most of these methods work on a single table. The two batch methods can address items in several tables in one call, but there are no methods that join or query data across tables.

The low-level API is a thin wrapper over these HTTP methods. It is verbose and cumbersome to use. For example, to get a single item from a DynamoDB table you need to write this much code:

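import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import com.amazonaws.services.dynamodbv2.model.GetItemResult;
import java.util.HashMap;

// Create DynamoDB client
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();

// Create a composite key
HashMap<String, AttributeValue> key = new HashMap<>();
key.put("UserId", new AttributeValue().withN("1"));
key.put("Timestamp", new AttributeValue().withN("1498928631"));

// Create a request object
GetItemRequest request = new GetItemRequest()
    .withTableName("ForumMessages")
    .withKey(key);

// Perform API request
GetItemResult result = client.getItem(request);
// Get attribute from the result item
AttributeValue messageAttribute = result.getItem().get("Message");
// Get string value
String message = messageAttribute.getS();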

The code is pretty straightforward. First, we create a key to get an item by a key, then as with other AWS API methods we create a request object and perform a request. The example is in Java, but API clients for AWS exist for other languages such as Python, .NET platform, and JavaScript.

Now you may be wondering if there is a library that helps to avoid all this boilerplate code. In fact, AWS has implemented a high-level library for this called DynamoDB mapper. To use it, you first need to define the structure of your data, similarly to how you define it with ORM frameworks:

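// Specify table name
@DynamoDBTable(tableName = "ForumUsers")
public class User {
    private int userId;
    private String name;

    // "UserId" attribute is a key
    @DynamoDBHashKey(attributeName = "UserId")
    public int getUserId() {
        return userId;
    }

    public void setUserId(int userId) {
        this.userId = userId;
    }

    // Map this value to the "Name" attribute
    @DynamoDBAttribute(attributeName = "Name")
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}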

Now accessing data in DynamoDB is much simpler. All we need to do is create an instance of the DynamoDBMapper and call the load method:

// Create DynamoDB client (just as in the previous example)
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();
// Create DynamoDB mapper
DynamoDBMapper mapper = new DynamoDBMapper(client);

// Create a key instance
User key = new User();
key.setUserId(1);

// Get item by keys
User result = mapper.load(key);

Notice that now to specify a key of an item we can use a Java POJO class and not a HashMap instance as in the previous case.

Unless you need to access some nitty-gritty details of DynamoDB or need to implement a custom way of doing things, I recommend using DynamoDB mapper and falling back to the low-level API only if necessary. The examples here were provided in Java, but there are also DynamoDB mapper implementations for other languages like .NET platform and Python.

You can also use unofficial libraries to access DynamoDB. For example, if you are using Spring you can consider using Amazon DynamoDB module for Spring Data.

Advanced features

All features described so far are core DynamoDB features, but it also has some additional more advanced features that you can use to build complex applications. In this section, I’ll provide a short list of what these features are and what you can use them for.

Optimistic locking

A common problem in distributed systems is different actors stepping on each other’s toes while performing operations in parallel. A common example is two different processes trying to update the same item in a database. In this case, the second update can overwrite the data written by the first update.

To solve this problem, DynamoDB allows specifying a condition for performing an update. If the condition is satisfied, the new data is written; otherwise, DynamoDB returns an error.

A common way to use this feature, which is implemented in DynamoDBMapper, is to maintain a version field with an item and increment it on every update. If the version has not changed since the process read the item, the process can write new data. Otherwise, it has to re-read the data and try to perform the operation again.

This technique is also called optimistic locking and is similar to the compare-and-swap operation that is used to implement lock-free data structures.
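The version check can be sketched with an in-memory table. This is a simplified illustration in plain Java, not DynamoDBMapper code: the OptimisticLockingSketch class, putIfVersion and update are hypothetical names, and a missing item is assumed to have version 0.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class OptimisticLockingSketch {

    // An item together with the version that was current when it was written
    static class Item {
        final String data;
        final long version;

        Item(String data, long version) {
            this.data = data;
            this.version = version;
        }
    }

    private final Map<String, Item> table = new HashMap<>();

    synchronized Item get(String key) {
        return table.get(key);
    }

    // Conditional write: succeeds only if the stored version still equals
    // the version the caller has read; a missing item counts as version 0
    synchronized boolean putIfVersion(String key, String newData, long expectedVersion) {
        Item current = table.get(key);
        long currentVersion = (current == null) ? 0 : current.version;
        if (currentVersion != expectedVersion) {
            return false; // somebody else updated the item in the meantime
        }
        table.put(key, new Item(newData, expectedVersion + 1));
        return true;
    }

    // Read, modify, write; if the conditional write fails, re-read and try again
    void update(String key, UnaryOperator<String> change) {
        while (true) {
            Item current = get(key);
            String oldData = (current == null) ? "" : current.data;
            long version = (current == null) ? 0 : current.version;
            if (putIfVersion(key, change.apply(oldData), version)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        OptimisticLockingSketch db = new OptimisticLockingSketch();
        // Two writers have both read version 0; only the first write wins
        System.out.println(db.putIfVersion("post", "first", 0));  // true
        System.out.println(db.putIfVersion("post", "second", 0)); // false
        db.update("post", text -> text + " (edited)");
        System.out.println(db.get("post").data + ", version " + db.get("post").version);
    }
}
```

The second writer is rejected because it still expects version 0 after the first writer has already moved the item to version 1.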

Transactions

At the time of writing, DynamoDB does not have built-in support for transactions, but AWS has implemented a Java library that provides transactions on top of existing DynamoDB features. A detailed design is described here, but in a nutshell, when you perform an operation with the DynamoDB transaction library, it stores a list of performed operations in a separate table and applies the stored operations on commit.

Since the library is open sourced there is nothing that prevents developers from implementing a similar solution for other languages, but it seems currently DynamoDB transaction support is only implemented for Java.

Time to live

Not all data that you store in DynamoDB should be stored forever. Older data can be moved to cheaper storage, like S3, or simply removed. To automatically delete old data, DynamoDB implements the time-to-live feature, which allows specifying an attribute that stores the timestamp at which an item should be removed. DynamoDB tracks expired items and removes them at no extra cost.
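The time-to-live attribute holds a number with the expiration time as a Unix epoch timestamp in seconds. Computing such a value could look like the sketch below; the TtlTimestamp class, the expiresAt helper, and the 30-day retention period are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class TtlTimestamp {

    // Returns the expiration time as Unix epoch seconds,
    // the numeric format a TTL attribute is expected to hold
    static long expiresAt(Instant createdAt, Duration timeToLive) {
        return createdAt.plus(timeToLive).getEpochSecond();
    }

    public static void main(String[] args) {
        // Timestamp reused from the forum message example above
        Instant created = Instant.ofEpochSecond(1498928631L);
        // Keep the item for 30 days (an arbitrary retention period)
        System.out.println(expiresAt(created, Duration.ofDays(30))); // 1501520631
    }
}
```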

DynamoDB Streams

Another powerful DynamoDB feature is DynamoDB streams. If it is enabled, it lets you read an immutable, ordered stream of updates to a DynamoDB table. A record is written to the stream after an update is performed, which allows you to react to changes in DynamoDB. This is a crucial feature if you need to implement one of the following use cases:

Cross-region replication – you may want to store a copy of your data in a separate region to keep it closer to your users or to have a backup of your data. To implement it, you can read a DynamoDB stream and replay the update operations in a second database.
Aggregated table – the DynamoDB model might not suit some of the queries you need to perform. For example, if you need to group data, DynamoDB does not support it out of the box. To implement this feature, you can read the DynamoDB update stream and maintain an aggregated table that fits the DynamoDB model and allows efficient queries.
Keep data in sync – in many cases, you need to maintain a copy of your data in a different datastore, such as a cache or CloudSearch. DynamoDB streams are an immense help here, since a record appears in a DynamoDB stream only if the data was stored in DynamoDB.

The DynamoDB streams implementation is very similar to another AWS service called Kinesis. They have similar APIs, and to read data from either system you can use the Kinesis Client Library (KCL), which provides a high-level interface for reading data from a Kinesis or DynamoDB stream. Keep in mind that the DynamoDB streams API and the Kinesis API are slightly different, so you need to use an adapter to use KCL with DynamoDB.

DynamoDB accelerator (DAX)

As with every other database, it can be beneficial to use a cache with DynamoDB. Unfortunately, it may be tricky to maintain cache consistency. If you use a caching solution like ElastiCache, it may be a good idea to use a DynamoDB stream to maintain a copy of your data in the cache, but recently DynamoDB has introduced a new feature called DynamoDB accelerator.

DAX is a write-through caching layer for DynamoDB. It has exactly the same API as DynamoDB, and if you’ve enabled it, you are supposed to read from and write to DynamoDB through it. DAX keeps track of what data was written to DynamoDB and only stores it if the write was acknowledged by DynamoDB.

One of the key benefits of using DAX is that it has a sub-millisecond latency which may be especially important if you have strict SLAs.

Conclusions

DynamoDB is a great database. It has a rich feature set, predictable low latency, and almost no operational load. The key to using it efficiently is to understand its data model and to check whether your data fits it. Remember, to achieve stellar performance you need to use queries as much as possible and try to avoid scan operations.

This was an introductory article to DynamoDB, and I will write more, so stay tuned. In the meantime, you can take a look at my deep dive DynamoDB course. You can watch the preview for the course here.

The post Should you use DynamoDB? appeared first on Brewing Codes.

Sending additional data to and from Flink cluster

Ivan Mushketyk — Tue, 24 Oct 2017 09:00:02 +0000

If you know anything about Apache Flink, you are probably familiar with how to send data to it and how to get results back. But in some cases, we need to send configuration data to the Flink cluster and receive some additional data from it.

In the first part of the article, I’ll describe how to send configuration data to our Flink cluster. There are many things that we want to configure: function parameters, configuration files, machine learning models. Flink provides several different ways to do this, and we will cover how to use them and when to use each one. In the second part of the article, I will describe a non-trivial way of sending data back from a Flink cluster.

This article requires some basic knowledge of Apache Flink. If you are not familiar with it, you can read some of my other articles on the topic: here, here, and here.

Sending data to task managers

Before we dig into the details of how to send data between different components in Apache Flink, let’s first talk about what components there are in a Flink cluster and what we are trying to achieve. The following diagram presents the main moving parts of Flink and how they interact:

When we need to execute a Flink application, we interact with a job manager that stores details about the job it is running, such as an execution graph. It controls task managers, and each task manager contains a portion of the data and executes the data processing functions that we’ve defined.

In many cases, we would like to configure the behavior of our functions that run in the Flink cluster. Depending on a use-case we may need to set a single variable or submit a file with a static configuration, and we will discuss how Flink supports these and other cases.

In addition to sending configuration data to task managers, sometimes we may want to return data from our functions in addition to regular outputs.

Configuring user-defined functions

Let’s say we have an application that reads a list of movies from a CSV file and needs to filter all movies of a particular genre:

// Read a dataset of movies
DataSet<Tuple3<Long, String, String>> lines = env.readCsvFile("movies.csv")
        .ignoreFirstLine()
        .parseQuotedStrings('"')
        .ignoreInvalidLines()
        .types(Long.class, String.class, String.class);

lines.filter((FilterFunction<Tuple3<Long, String, String>>) movie -> {
    // Genres for a movie separated by the "|" symbol
    String[] genres = movie.f2.split("\\|");

    // Find all movies that have the "Action" genre
    return Stream.of(genres).anyMatch(g -> g.equals("Action"));
}).print();

It is very likely that we would like to extract movies of a different genre, and to do this we need to be able to configure our filter function. When you implement a function like this, the most straightforward way to configure it is to implement a constructor:

// Pass a genre name
lines.filter(new FilterGenre("Action"))
    .print();

...

class FilterGenre implements FilterFunction<Tuple3<Long, String, String>> {

    String genre;
    // Initialize filter function
    public FilterGenre(String genre) {
        this.genre = genre;
    }

    @Override
    public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
        String[] genres = movie.f2.split("\\|");

        return Stream.of(genres).anyMatch(g -> g.equals(genre));
    }
}

Alternatively, if you are using lambda functions you can simply use a variable from its closure:

final String genre = "Action";

lines.filter((FilterFunction<Tuple3<Long, String, String>>) movie -> {
    String[] genres = movie.f2.split("\\|");

    // Using variable
    return Stream.of(genres).anyMatch(g -> g.equals(genre));
}).print();

Flink will serialize this variable and send it with the function to the cluster.
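The effect can be reproduced with plain Java serialization. The sketch below does not use Flink: Filter is a hypothetical stand-in for Flink’s FilterFunction (which is also Serializable), and roundTrip only imitates shipping a function to another JVM by writing it to bytes and reading it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

public class ClosureSerialization {

    // Hypothetical stand-in for Flink's FilterFunction, which is Serializable as well
    interface Filter<T> extends Serializable {
        boolean filter(T value);
    }

    // The lambda captures the genre variable, so it becomes part of the serialized form
    static Filter<String> genreFilter(String genre) {
        return genres -> Arrays.asList(genres.split("\\|")).contains(genre);
    }

    // Serializes the function to bytes and reads it back, roughly what happens
    // when a function is shipped from the client to another JVM
    @SuppressWarnings("unchecked")
    static Filter<String> roundTrip(Filter<String> function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Filter<String>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Filter<String> copy = roundTrip(genreFilter("Action"));
        System.out.println(copy.filter("Action|Comedy")); // true
        System.out.println(copy.filter("Drama|Romance")); // false
    }
}
```

One consequence is that everything a function captures must itself be serializable, which is worth keeping in mind when a lambda references fields of an enclosing object.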

All these methods can get annoying if you need to pass a lot of variables to your function. To help with that Apache Flink provides the withParameters method. To use it you need to implement a Rich version of a function you are interested in, so instead of implementing the MapFunction interface, you will have to implement the RichMapFunction.

Rich functions allow you to pass a number of parameters using the withParameters method:

// Class in Flink to store parameters
Configuration configuration = new Configuration();
configuration.setString("genre", "Action");

lines.filter(new FilterGenreWithParameters())
        // Pass parameters to a function
        .withParameters(configuration)
        .print();

To read these parameters, we need to implement the open method and read the parameters in it:

class FilterGenreWithParameters extends RichFilterFunction<Tuple3<Long, String, String>> {

    String genre;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read the parameter
        genre = parameters.getString("genre", "");
    }

    @Override
    public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
        String[] genres = movie.f2.split("\\|");

        return Stream.of(genres).anyMatch(g -> g.equals(genre));
    }
}

All these options will work, but it can be tedious if you need to set the same parameter for multiple functions. To handle this, Flink allows setting global job parameters that are accessible from all task managers.

To do this, you first need to read arguments from a command line using the ParameterTool.fromArgs:

public static void main(String... args) {
    // Read command line arguments
    ParameterTool parameterTool = ParameterTool.fromArgs(args);
    ...
}

and then set global job parameters using the setGlobalJobParameters:

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameterTool);
...

// This function will be able to read these global parameters
lines.filter(new FilterGenreWithGlobalEnv())
                .print();

Now we can implement a function that will read these parameters. As before it should be a rich function:

class FilterGenreWithGlobalEnv extends RichFilterFunction<Tuple3<Long, String, String>> {

    @Override
    public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
        String[] genres = movie.f2.split("\\|");
        // Get global parameters
        ParameterTool parameterTool = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        // Read parameter
        String genre = parameterTool.get("genre");

        return Stream.of(genres).anyMatch(g -> g.equals(genre));
    }
}

To read a parameter, we need to call getGlobalJobParameters to get all global parameters and then use the get method to get the parameter we are interested in.

Broadcast variables

All these methods that we’ve discussed before will suit you if you want to send data from a client to task managers, but what if data exists in task managers in the form of a dataset? In this case, it’s better to use another Flink feature called broadcast variables. It simply allows sending a dataset to task managers that will execute your functions.

Let’s say we have a dataset that contains words that we should ignore when we do text processing, and we want to pass it to our function. To set a broadcast variable for a single function, we need to use the withBroadcastSet method and pass a dataset to it.

// Get a dataset with words to ignore
DataSet<String> wordsToIgnore = ...

data.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {

    // A collection to store words. This will be stored in memory
    // of a task manager
    Collection<String> wordsToIgnore;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read a collection of words to ignore
        wordsToIgnore = getRuntimeContext().getBroadcastVariable("wordsToIgnore");
    }

    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            // Emit only the words that are not in the ignore list
            if (!wordsToIgnore.contains(word)) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
    // Pass a dataset via a broadcast variable
}).withBroadcastSet(wordsToIgnore, "wordsToIgnore");

You should keep in mind that if you use broadcast variables, a dataset will be stored in a task manager’s memory, so you should only use it for small datasets.

If you want to send more data to each task manager and do not want to store this data in memory, you can send a static file to task managers using Flink’s distributed cache. To use it, you first need to store a file in a distributed file system like HDFS and then register this file in the cache:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Register a file from HDFS
env.registerCachedFile("hdfs:///path/to/file", "machineLearningModel");

...

env.execute();

To access the distributed cache, we again need to implement a rich function:

class MyClassifier extends RichMapFunction<String, Integer> {

    @Override
    public void open(Configuration config) {
      File machineLearningModel = getRuntimeContext().getDistributedCache().getFile("machineLearningModel");
      ...
    }

    @Override
    public Integer map(String value) throws Exception {
      ...
    }
}

Notice that to access a file in the distributed cache we need to use the same key that we used to register it.

Accumulators

We’ve covered how to send data to task managers, but now let’s talk about how to send data back from task managers. You may wonder why we need to do anything special. After all, Apache Flink is all about building data processing pipelines that read input data, process it, and return a result.

To clarify what else we could possibly want, let’s take a look at an example. Let’s say we need to count how many times each word occurs in a text, and at the same time we want to calculate how many lines the text has:

// Text to process
DataSet<String> lines = ...

// Word count algorithm
lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            out.collect(new Tuple2<>(word, 1));
        }
    }
})
.groupBy(0)
.sum(1)
.print();

// Count a number of lines in the text to process
long linesCount = lines.count();
System.out.println(linesCount);

The problem is that if we run this application as it is, it will run two Flink jobs: the first to get the word count and the second to count the number of lines.

This is definitely inefficient, but how can we avoid it? One way is to use accumulators. They allow you to send data from task managers and have this data aggregated using a predefined function. Flink has the following built-in accumulators:

IntCounter, LongCounter, DoubleCounter – allow summing together int, long, and double values sent from task managers
AverageAccumulator – calculates an average of double values
LongMaximum, LongMinimum, IntMaximum, IntMinimum, DoubleMaximum, DoubleMinimum – accumulators to determine maximum and minimum values for different types
Histogram – used to compute the distribution of values from task managers

To use an accumulator, we need to create it and register it in a user-defined function, and then read the result on the client. Here is how we can do this:

lines.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {

    // Create an accumulator
    private IntCounter linesNum = new IntCounter();

    @Override
    public void open(Configuration parameters) throws Exception {
        // Register accumulator
        getRuntimeContext().addAccumulator("linesNum", linesNum);
    }

    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
        String[] words = line.split("\\W+");
        for (String word : words) {
            out.collect(new Tuple2<>(word, 1));
        }

        // Increment after each line is processed
        linesNum.add(1);
    }
})
.groupBy(0)
.sum(1)
.print();

// Get accumulator result
int linesNum = env.getLastJobExecutionResult().getAccumulatorResult("linesNum");
System.out.println(linesNum);

This allows us to count how many times each word occurs in the input text and how many lines it has, all in a single job.

If you need a custom accumulator, you can also implement your own using the Accumulator or SimpleAccumulator interfaces.
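The idea behind any accumulator, built-in or custom, is that each parallel task updates its own local copy and the local values are merged at the end. The sketch below shows this with a deliberately simplified contract: MergeableAccumulator is a hypothetical interface, not Flink’s actual Accumulator interface, and the partitions are simulated with plain lists.

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {

    // Simplified, hypothetical accumulator contract (not Flink's actual interface)
    interface MergeableAccumulator<V, R> {
        void add(V value);
        R getLocalValue();
        void merge(MergeableAccumulator<V, R> other);
    }

    // Custom accumulator that tracks the length of the longest line seen so far
    static class MaxLineLength implements MergeableAccumulator<String, Integer> {
        private int max = 0;

        @Override
        public void add(String line) {
            max = Math.max(max, line.length());
        }

        @Override
        public Integer getLocalValue() {
            return max;
        }

        @Override
        public void merge(MergeableAccumulator<String, Integer> other) {
            max = Math.max(max, other.getLocalValue());
        }
    }

    // Each inner list plays the role of the data seen by one parallel task:
    // every task fills its own accumulator, and the results are merged afterwards
    static int longestLine(List<List<String>> partitions) {
        MaxLineLength total = new MaxLineLength();
        for (List<String> partition : partitions) {
            MaxLineLength local = new MaxLineLength();
            for (String line : partition) {
                local.add(line);
            }
            total.merge(local);
        }
        return total.getLocalValue();
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("to be", "or not to be"),
            Arrays.asList("that is the question"));
        System.out.println(longestLine(partitions)); // 20
    }
}
```

For the merged result to be independent of how the data is split between tasks, the merge operation has to be associative and commutative, as taking a maximum is here.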

More information

I hope you liked this article and found it useful. You can find the source code for this article in my git repository with other Apache Flink examples.

I will write more articles about Flink in the near future, so stay tuned! You can read my other articles here, or you can take a look at my Pluralsight course where I cover Apache Flink in more detail: Understanding Apache Flink. Here is a short preview of this course.

The post Sending additional data to and from Flink cluster appeared first on Brewing Codes.

Four ways to optimize your Flink applications

Ivan Mushketyk — Tue, 17 Oct 2017 10:00:43 +0000

Flink is a complicated framework and provides many ways to tweak its execution. In this article, I’ll show four different ways to improve the performance of your Flink applications.

If you are not familiar with Flink, you can read other introductory articles like this, this, and this one. But if you are already familiar with Apache Flink this article will help you to make your applications a little bit faster.

Use Flink tuples

When you use operations like groupBy, join, or keyBy, Flink gives you a number of options to select a key in your dataset. You can use a key selector function:

// Join movies and ratings datasets
movies.join(ratings)
        // Use movie id as a key in both cases
        .where(new KeySelector<Movie, String>() {
            @Override
            public String getKey(Movie m) throws Exception {
                return m.getId();
            }
        })
        .equalTo(new KeySelector<Rating, String>() {
            @Override
            public String getKey(Rating r) throws Exception {
                return r.getMovieId();
            }
        })

Or you can specify field names in POJO types:

movies.join(ratings)
    // Use same fields as in the previous example
    .where("id")
    .equalTo("movieId")

But if you are working with Flink tuple types, you can simply specify the position of the tuple field that will be used as a key:

DataSet<Tuple2<String, String>> movies = ...
DataSet<Tuple3<String, String, Double>> ratings = ...

movies.join(ratings)
    // Specify fields positions in tuples
    .where(0)
    .equalTo(1)

The last option will give you the best performance, but what about readability? Does it mean that your code will look like this now:

DataSet<Tuple3<Integer, String, Double>> result = movies.join(ratings)
    .where(0)
    .equalTo(0)
    .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,Double>, Tuple3<Integer, String, Double>>() {
        // What is happening here?
        @Override
        public Tuple3<Integer, String, Double> join(Tuple2<Integer, String> first, Tuple2<Integer, Double> second) throws Exception {
            // Some tuples are joined with some other tuples and some fields are returned???
            return new Tuple3<>(first.f0, first.f1, second.f1);
        }
    });

A common idiom to improve readability in this case is to create a class that inherits from one of the TupleX classes and implements getters and setters for its fields. Here is how the Edge class from the Flink Gelly library, which has three fields, extends the Tuple3 class:

public class Edge<K, V> extends Tuple3<K, K, V> {

    public Edge(K source, K target, V value) {
        this.f0 = source;
        this.f1 = target;
        this.f2 = value;
    }

    // Getters and setters for readability
    public void setSource(K source) {
        this.f0 = source;
    }

    public K getSource() {
        return this.f0;
    }

    // Also has getters and setters for other fields
    ...
}

Reuse Flink objects

Another option that you can use to improve the performance of your Flink application is to use mutable objects when you return data from a user-defined function. Take a look at this example:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
        @Override
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
            long changesCount = ...
            // A new Tuple instance is created on every execution
            collector.collect(new Tuple2<>(userName, changesCount));
        }
    });

As you can see on every execution of the apply function, we create a new instance of the Tuple2 class, which increases pressure on a garbage collector. One way to fix this problem would be to reuse the same instance again and again:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
        // Create an instance that we will reuse on every call
        private Tuple2<String, Long> result = new Tuple2<>();

        @Override
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
            long changesCount = ...

            // Set fields on an existing object instead of creating a new one
            result.f0 = userName;
            // Auto-boxing!! A new Long value may be created
            result.f1 = changesCount;

            // Reuse the same Tuple2 object
            collector.collect(result);
        }
    });

That is a bit better. We no longer create a new Tuple2 instance on every call, but we still indirectly create an instance of the Long class. To solve this problem, Flink has a number of so-called value classes: IntValue, LongValue, StringValue, FloatValue, etc. The point of these classes is to provide mutable versions of built-in types, so that we can reuse them in our user-defined functions. Here is how we can use them:

stream
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, LongValue>, String, TimeWindow>() {
        // Create a mutable count instance
        private LongValue count = new LongValue();
        // Assign mutable count to the tuple
        private Tuple2<String, LongValue> result = new Tuple2<>("", count);

        @Override
        // Notice that now we have a different return type
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, LongValue>> collector) throws Exception {
            long changesCount = ...

            // Set fields on an existing object instead of creating a new one
            result.f0 = userName;
            // Update mutable count value
            count.setValue(changesCount);

            // Reuse the same tuple and the same LongValue instance
            collector.collect(result);
        }
    });

This idiom is commonly used in Flink libraries like Flink Gelly.
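One caveat worth keeping in mind with this idiom: whoever receives a reused object has to read or copy the value right away and not hold on to the reference. The following plain-Java sketch illustrates the effect. The MutableLong class is a hypothetical stand-in for a value class such as LongValue, and the example is a general Java illustration, not a statement about how Flink's collectors behave internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ObjectReuseSketch {

    /** Hypothetical stand-in for a mutable value class such as LongValue. */
    static final class MutableLong {
        private long value;

        void setValue(long value) {
            this.value = value;
        }

        long getValue() {
            return value;
        }
    }

    /** Emits every count through the same MutableLong instance (one allocation in total). */
    static void emitCounts(long[] counts, Consumer<MutableLong> collector) {
        MutableLong reused = new MutableLong();
        for (long count : counts) {
            reused.setValue(count);
            collector.accept(reused);
        }
    }

    /** Safe consumer: reads the value out immediately. */
    static List<Long> copyValues(long[] counts) {
        List<Long> out = new ArrayList<>();
        emitCounts(counts, v -> out.add(v.getValue()));
        return out;
    }

    /** Unsafe consumer: keeps references to the reused object and reads them later. */
    static List<Long> keepReferences(long[] counts) {
        List<MutableLong> held = new ArrayList<>();
        emitCounts(counts, held::add);
        List<Long> out = new ArrayList<>();
        for (MutableLong v : held) {
            out.add(v.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        long[] counts = {3, 7, 11};
        System.out.println(copyValues(counts));     // [3, 7, 11]
        System.out.println(keepReferences(counts)); // [11, 11, 11]
    }
}
```

The second result shows only the last value because all three list entries point to the same object.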

Use function annotations

One more way to optimize your Flink application is to provide some information about what your user-defined functions are doing with input data. Since Flink can’t parse and understand code, you can provide crucial information that will help to build a more efficient execution plan. There are three annotations that we can use:

@ForwardedFields – specifies what fields in an input value were left unchanged and are used in an output value.
@NonForwardedFields – specifies fields which were not preserved in the same positions in the output.
@ReadFields – specifies what fields were used to compute a result value. You should only specify fields that were used in computations and not merely copied to the output.

Let’s take a look at how we can use ForwardedFields annotation:

// Specify that the first element is copied without any changes
@ForwardedFields("0")
class MyFunction implements MapFunction<Tuple2<Long, Double>, Tuple2<Long, Double>> {
    @Override
    public Tuple2<Long, Double> map(Tuple2<Long, Double> value) {
       // Copy first field without change
        return new Tuple2<>(value.f0, value.f1 + 123);
    }
}

This means that the first element in an input tuple is not being changed and it is returned in the same position.

If you don’t change a field but simply move it to a different position, you can specify this with the ForwardedFields annotation as well. In the next example, we swap the fields of an input tuple and tell Flink about it:

// 1st element goes into the 2nd position, and 2nd element goes into the 1st position
@ForwardedFields("0->1; 1->0")
class SwapArguments implements MapFunction<Tuple2<Long, Double>, Tuple2<Double, Long>> {
    @Override
    public Tuple2<Double, Long> map(Tuple2<Long, Double> value) {
       // Swap elements in a tuple
        return new Tuple2<>(value.f1, value.f0);
    }
}

The annotations mentioned above can only be applied to functions that have one input parameter, such as map or flatMap. If you have two input parameters, you can use the ForwardedFieldsFirst and the ForwardedFieldsSecond annotations that provide information about the first and the second parameters respectively.

Here is how we can use these annotations in an implementation of the JoinFunction interface:

// Both fields of the first input tuple are copied to the first and second positions of the output tuple
@ForwardedFieldsFirst("0; 1")
// The second field of the second input tuple is copied to the third position of the output tuple
@ForwardedFieldsSecond("1->2")
class MyJoin implements JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, Double>, Tuple3<Integer, String, Double>> {
    @Override
    public Tuple3<Integer, String, Double> join(Tuple2<Integer, String> first, Tuple2<Integer, Double> second) throws Exception {
        return new Tuple3<>(first.f0, first.f1, second.f1);
    }
}

Flink also provides the NonForwardedFieldsFirst, NonForwardedFieldsSecond, ReadFieldsFirst, and ReadFieldsSecond annotations for similar purposes.

Select join type

You can make your joins faster if you give Flink another hint, but before we discuss why it works, let’s talk about how Flink executes joins.

When Flink is processing batch data, each machine in a cluster stores a part of the data. To perform a join, Apache Flink needs to find all pairs of elements from the two datasets for which the join condition is satisfied. To do this, Flink first has to put items from both datasets that have the same key on the same machine in the cluster. There are two strategies for this:

Repartition-Repartition strategy – in this case, both datasets are partitioned by their keys and sent across the network. It means that if datasets are big, it may take a significant amount of time to copy them across the network.
Broadcast-Forward strategy – in this case, one dataset is left untouched, but the second dataset is copied to every machine in the cluster that has part of the first dataset.

If you are joining a small dataset with a much bigger dataset, you can use the Broadcast-Forward strategy and avoid costly repartitioning of the bigger dataset. This is really easy to do:

ds1.join(ds2, JoinHint.BROADCAST_HASH_FIRST)

This hints that the first dataset is much smaller than the second one.
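To see why broadcasting only pays off for small datasets, here is a back-of-the-envelope cost model in plain Java. It is a deliberately rough sketch and not the cost model that Flink's optimizer uses: it assumes that repartitioning ships every record of both datasets once, and that broadcasting ships the small dataset to every machine. All class and method names are made up for this illustration.

```java
/** Rough network-cost comparison of the two join strategies (illustrative model only). */
final class JoinCostSketch {

    /** Upper bound on records shipped when both inputs are partitioned by key. */
    static long repartitionCost(long firstSize, long secondSize) {
        return firstSize + secondSize;
    }

    /** Records shipped when the small input is copied to every machine. */
    static long broadcastCost(long smallSize, int machines) {
        return smallSize * machines;
    }

    /** Picks the strategy that ships fewer records under this simple model. */
    static String cheaperStrategy(long smallSize, long largeSize, int machines) {
        return broadcastCost(smallSize, machines) < repartitionCost(smallSize, largeSize)
                ? "broadcast"
                : "repartition";
    }

    public static void main(String[] args) {
        // 1,000 movies joined with 10,000,000 ratings on 20 machines
        System.out.println(broadcastCost(1_000, 20));                   // 20000
        System.out.println(repartitionCost(1_000, 10_000_000));         // 10001000
        System.out.println(cheaperStrategy(1_000, 10_000_000, 20));     // broadcast
        // Two datasets of comparable size: broadcasting becomes the expensive option
        System.out.println(cheaperStrategy(5_000_000, 10_000_000, 20)); // repartition
    }
}
```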

You can also use other join hints:

BROADCAST_HASH_SECOND – the second dataset is much smaller
REPARTITION_HASH_FIRST – the first dataset is a bit smaller
REPARTITION_HASH_SECOND – the second dataset is a bit smaller
REPARTITION_SORT_MERGE – repartition both datasets and use sorting and merging strategy
OPTIMIZER_CHOOSES – Flink optimizer will decide how to join datasets

You can read more about how Flink performs joins in this article.

More information

I hope you liked this article and found it useful.

The post Four ways to optimize your Flink applications appeared first on Brewing Codes.

Getting started with stream processing using Apache Flink

Ivan Mushketyk — Mon, 09 Oct 2017 09:00:43 +0000

If “Apache Flink” and “streaming programming” are not strongly linked in your mind, you probably have not been following the news recently. Apache Flink has taken the world of Big Data by storm. Now is a perfect opportunity for a tool like this to thrive: stream processing is becoming more and more prevalent in data processing, and Apache Flink presents a number of important innovations.

In this article, I will show how to start writing stream processing algorithms using Apache Flink. We will read a stream of Wikipedia edits and see how we can get some meaningful data out of it. In the process, you will see how to read and write stream data, how to perform simple operations, and how to implement more complex algorithms.

Getting started

I believe that if you are new to Apache Flink, it’s better to start with learning about batch processing since it is simpler and will give you a solid foundation to learning stream processing. I’ve written an introductory blog post about how to start with batch processing using Apache Flink, so I encourage you to read it first.

If you already know how to use batch processing in Apache Flink, stream processing does not have a lot of surprises for you. As before we will take a look at three distinct phases in your application: reading data from a source, processing data, and writing data to an external system.

There are a few notable differences compared to batch processing. First of all, in batch processing, all data is available in advance. We do not process new data even if it arrives while the process is running.

It is different in stream processing though. We read data as it is being generated and the stream of data that we need to process is potentially infinite. With this approach, we can process incoming data in almost real-time.

In the stream mode, Flink will read data from and write data to different systems, including Apache Kafka and RabbitMQ: basically, any system that produces and consumes a constant stream of data. Notice that we can read data from HDFS or S3 as well. In this case, Apache Flink will constantly monitor a folder and will process files as they arrive.

Here is how we can read data from a file in the stream mode:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> stream = env.readTextFile("file/path");

Notice that to use stream processing we need to use the StreamExecutionEnvironment class instead of the ExecutionEnvironment. Also, methods that read data return an instance of the DataStream class that we will use later for data processing.

We can also create finite streams from collections or arrays as in the batch processing case:

DataStream<Integer> numbersFromList = env.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6));
DataStream<Integer> numbersFromElements = env.fromElements(1, 2, 3, 4, 5);

Simple data processing

To process items in a stream, Flink provides operators similar to the batch processing operators, such as map, filter, and reduce.

Let’s implement our first stream processing example. We will read a stream of edits to Wikipedia and display items that we are interested in.

First, to read edits stream, we need to use the WikipediaEditsSource:

DataStream<WikipediaEditEvent> edits = env.addSource(new WikipediaEditsSource());

To use it we need to call the addSource method that is used to read data from various sources such as Kafka, Kinesis, RabbitMQ, etc. This method returns a stream of edits that we can now process.

Let’s keep only the edits that were not made by a bot and that changed more than a thousand bytes:

edits.filter((FilterFunction<WikipediaEditEvent>) edit -> {
    return !edit.isBotEdit() && edit.getByteDiff() > 1000;
})
.print();

This is very similar to how you can use the filter method in the batch case, with the only exception that it processes an infinite stream.

Now the last step is to run our application. As before we need to call the execute method:

env.execute();

The application will print filtered Wikipedia edits until we stop it:

2> WikipediaEditEvent{timestamp=1506499898043, channel='#en.wikipedia', title='17 FIBA Womens Melanesia Basketball Cup', diffUrl='https://en.wikipedia.org/w/index.php?diff=802608251&oldid=802520770', user='Malto15', byteDiff=1853, summary='/* Preliminary round */', flags=0}
7> WikipediaEditEvent{timestamp=1506499911216, channel='#en.wikipedia', title='User:MusikBot/StaleDrafts/Report', diffUrl='https://en.wikipedia.org/w/index.php?diff=802608262&oldid=802459885', user='MusikBot', byteDiff=11674, summary='Reporting 142 stale non-AfC drafts', flags=0}
...

Stream windows

Notice that the methods that we’ve discussed so far all work on individual elements in a stream. It’s unlikely that we can come up with many interesting stream algorithms that can be implemented using these simple operators. Using just them, it would be impossible to implement the following use cases:

Count a number of edits that are performed every minute
Count how many edits were performed by each user every ten minutes

It’s obvious that to answer these questions we need to process groups of elements. This is what stream windows are for.

In a nutshell, stream windows allow us to group elements in a stream and execute a user-defined function on each group. This user-defined function can return zero, one, or more elements, and in this way it creates a new stream that we can process or store in a separate system.

How can we group elements in a stream? Flink provides several options to do this:

Tumbling window – creates non-overlapping adjacent windows in a stream. We can either group elements by time (say, all elements from 10:00 to 10:05 go into one group) or by count (first 50 elements go into a separate group). For example, we can use this to answer a question like: count a number of elements in a stream for non-overlapping five-minute intervals.
Sliding window – similar to the tumbling window but here windows can overlap. We can use it if we need to calculate a metric for the last five minutes, but we want to display an output every minute.
Session window – in this case, Flink groups events that occurred close in time to each other.
Global window – in this case, Flink puts all elements to a single window. This is only useful if we define a custom trigger that defines when a window is finished.
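The difference between tumbling and sliding windows is easiest to see in how a single element is assigned to windows. Below is a minimal plain-Java sketch of that arithmetic. It assumes non-negative timestamps and windows aligned to zero, and the class and method names are made up for illustration; it is not how you would use windows through the Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of time-window assignment; timestamps, sizes, and slides use the same unit. */
final class WindowAssignmentSketch {

    /** A tumbling window: every element belongs to exactly one window. */
    static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** A sliding window: an element belongs to every window that overlaps its timestamp. */
    static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        // Walk back one slide at a time while the window [start, start + size) still contains the timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // An element at second 425 with five-minute (300 s) tumbling windows
        System.out.println(tumblingWindowStart(425, 300));     // 300
        // The same element with five-minute windows that slide every minute (60 s)
        System.out.println(slidingWindowStarts(425, 300, 60)); // [420, 360, 300, 240, 180]
    }
}
```

With a slide of one minute and a size of five minutes, every element lands in five overlapping windows, which is why sliding windows produce more output than tumbling ones.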

In addition to selecting how to assign elements to different windows, we need to select a stream type. Flink has two stream types:

Keyed stream – with this stream type Flink will partition a single stream into multiple independent streams by a key (e.g., name of a user who made an edit). When we process a window in a keyed stream a function that we define only has access to items with the same key, but working with multiple independent streams allows Flink to parallelize work.
Non-keyed stream – in this case, all elements in the stream will be processed together and our user-defined function will have access to all elements in a stream. The downside of this stream type is that it gives no parallelism and only one machine in the cluster will be able to execute our code.
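The reason a keyed stream can be processed in parallel is that all elements with the same key can be routed to the same worker. The plain-Java sketch below illustrates this with a simple hash-modulo routing rule and a per-key counter. This is an assumption made for illustration only: Flink's actual key distribution is more involved, and the class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of routing elements by key and counting per key (not Flink's real partitioner). */
final class KeyedCountSketch {

    /** The same key always maps to the same worker index in [0, parallelism). */
    static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Counts elements per key; each key's count is independent of all other keys. */
    static Map<String, Long> countPerKey(String[] users) {
        Map<String, Long> counts = new TreeMap<>();
        for (String user : users) {
            counts.merge(user, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] edits = {"bob", "alice", "bob"};
        System.out.println(countPerKey(edits)); // {alice=1, bob=2}
        for (String user : edits) {
            System.out.println(user + " -> worker " + partitionFor(user, 4));
        }
    }
}
```

Because each key's count depends only on elements with that key, different workers can handle different keys without coordinating with each other.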

Now let’s implement some demos using stream windows. First of all, let’s find out how many edits are performed on Wikipedia every minute. First, we need to read a stream of edits:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<WikipediaEditEvent> edits = env.addSource(new WikipediaEditsSource());

Then we need to specify that we want to separate the stream into one-minute non-overlapping windows:

edits
    // Non-overlapping one-minute windows
    .timeWindowAll(Time.minutes(1))

And now we can define a custom function that will process all elements in each one-minute window. To do this, we will use the apply method and pass an implementation of the AllWindowFunction:

edits
    .timeWindowAll(Time.minutes(1))
    .apply(new AllWindowFunction<WikipediaEditEvent, Tuple3<Date, Long, Long>, TimeWindow>() {
        @Override
        public void apply(TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple3<Date, Long, Long>> collector) throws Exception {
            long count = 0;
            long bytesChanged = 0;

            // Count number of edits
            for (WikipediaEditEvent event : iterable) {
                count++;
                bytesChanged += event.getByteDiff();
            }

            // Output a number of edits and window's end time
            collector.collect(new Tuple3<>(new Date(timeWindow.getEnd()), count, bytesChanged));
        }
    })
    .print();

Despite being a bit verbose, the method is pretty straightforward. The apply method receives three parameters:

timeWindow – contains information about a window we are processing
iterable – an iterable over the elements in a single window
collector – an object that we can use to output elements into the result stream

All we do here is count the number of edits and the number of bytes changed, and then use the collector instance to output the result of our calculation together with the end timestamp of the window.

If we run this application we will see items produced by the apply method printed into the output stream:

1> (Wed Sep 27 12:58:00 IST 2017,62,62016)
2> (Wed Sep 27 12:59:00 IST 2017,82,12812)
3> (Wed Sep 27 13:00:00 IST 2017,89,45532)
4> (Wed Sep 27 13:01:00 IST 2017,79,11128)
5> (Wed Sep 27 13:02:00 IST 2017,82,26582)

Keyed stream example

Now let’s take a look at a slightly more complicated example. Let’s count how many edits each user makes in every ten-minute interval. This can help to identify the most active users or to find some unusual activity in the system.

Of course, we could just use a non-keyed stream, iterate over all elements in a window and maintain a dictionary to track counts, but this approach won’t scale since non-keyed streams are not parallelizable. To use resources of a Flink cluster efficiently, we need to key our stream by user name, which will create multiple logical streams: one per user.

DataStream<WikipediaEditEvent> edits = env.addSource(new WikipediaEditsSource());

edits
    // Key by user name
    .keyBy((KeySelector<WikipediaEditEvent, String>) WikipediaEditEvent::getUser)
    // Ten-minute non-overlapping windows
    .timeWindow(Time.minutes(10))

The only difference here is that we use the keyBy method to specify a key for our streams. Here we simply use a username as a partition key.

Now that we have a keyed stream, we can apply a function that will be executed to process each window. As before, we will use the apply method:

edits
    .keyBy((KeySelector<WikipediaEditEvent, String>) WikipediaEditEvent::getUser)
    .timeWindow(Time.minutes(10))
    .apply(new WindowFunction<WikipediaEditEvent, Tuple2<String, Long>, String, TimeWindow>() {
        @Override
        public void apply(String userName, TimeWindow timeWindow, Iterable<WikipediaEditEvent> iterable, Collector<Tuple2<String, Long>> collector) throws Exception {
            long changesCount = 0;

            // Count number of changes
            for (WikipediaEditEvent ignored : iterable) {
                changesCount++;
            }
            // Output user name and number of changes
            collector.collect(new Tuple2<>(userName, changesCount));
        }
    })
    .print();

The only significant difference here is that this version of the apply method has four parameters. The additional first parameter specifies a key for the logical stream that our function is processing.

If we execute this application we will get a stream where each element contains a user name and a number of edits this user performed per ten-minute interval:

...
5> (InternetArchiveBot,6)
1> (Francis Schonken,1)
6> (.30.124.210,1)
1> (MShabazz,1)
5> (Materialscientist,18)
1> (Aquaelfin,1)
6> (Cote d'Azur,2)
1> (Daniel Cavallari,3)
5> (00:1:F159:6D32:2578:A6F7:AB88:C8D,2)
...

As you can see, some users are on a Wikipedia edit spree today!

More information

This was an introductory article, and there is much more to Apache Flink. I will write more articles about Flink in the near future, so stay tuned! You can read my other articles here, or you can take a look at my Pluralsight course where I cover Apache Flink in more detail: Understanding Apache Flink. Here is a short preview of this course.

The post Getting started with stream processing using Apache Flink appeared first on Brewing Codes.

Getting started with batch processing using Apache Flink

Ivan Mushketyk — Sun, 01 Oct 2017 11:44:04 +0000

If you’ve been following software development news recently you probably heard about the new project called Apache Flink. I’ve already written about it a bit here and here, but if you are not familiar with it, Apache Flink is a new generation Big Data processing tool that can process either finite sets of data (this is also called batch processing) or potentially infinite streams of data (stream processing). In terms of new features, many believe Apache Flink is a game changer and can even replace Apache Spark in the future.

In this article, I’ll introduce you to how you can use Apache Flink to implement simple batch processing algorithms. We will start with setting up our development environment, and then we will see how we can load data, process a dataset, and write data back to an external system.

Why batch processing?

You might have heard that stream processing is “the new hot thing right now” and that Apache Flink is a tool for stream processing. This raises a question: why do we need to learn how to implement batch processing applications?

While it is true that stream processing is becoming more and more widespread, many tasks still require batch processing. Also, if you are just getting started with Apache Flink, in my opinion, it is better to start with batch processing since it is simpler and in a way resembles working with a database. Once you’ve covered batch processing, you can learn about stream processing where Apache Flink really shines!

How to follow examples

If you want to implement some Apache Flink applications yourself, first you need to create a Flink project. In this article, we are going to write applications in Java, but you can also write Flink applications in Scala, Python or R.

To create a Flink Java project execute the following command:

mvn archetype:generate \
      -DarchetypeGroupId=org.apache.flink \
      -DarchetypeArtifactId=flink-quickstart-java \
      -DarchetypeVersion=1.3.2

After you enter group id, artifact id, and a project version this command will create the following project structure:

.
├── pom.xml
└── src
    └── main
        ├── java
        │   └── flinkProject
        │       ├── BatchJob.java
        │       ├── SocketTextStreamWordCount.java
        │       ├── StreamingJob.java
        │       └── WordCount.java
        └── resources
            └── log4j.properties

The most important file here is the massive pom.xml that specifies all the necessary dependencies. The automatically created Java classes are examples of simple Flink applications that you can take a look at, but we don’t need them for our purposes.

To start developing your first Flink application create a class with the main method like this:

public class FlinkProgram {

    public static void main(String[] args) throws Exception {
        // Create Flink execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // We will write our code here

        // Start Flink application
        env.execute();
    }
}

There is nothing special about this main method. All we have to do is to add some boilerplate code.

First, we need to create a Flink execution environment that will behave differently if you run it on a local machine or in a Flink cluster:

On a local machine, it will create a full-fledged Flink cluster with multiple local nodes. This is a good way to test how your application will work in a realistic environment
On a Flink cluster, it won’t create anything but will use existing cluster resources instead

Alternatively, you could create a collection environment like this:

ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();

This will create a Flink execution environment that, instead of running a Flink application on a local cluster, will emulate all operations using in-memory collections in a single Java process. Your application will run faster, but this environment has some subtle differences from a local cluster with multiple nodes.

Where do we start?

Before we can do anything, we need to read data into Apache Flink. We can read data from numerous systems including local filesystem, S3, HDFS, HBase, Cassandra, etc. No matter where we read a dataset from, Apache Flink allows us to work with data in a uniform way using the DataSet class:

DataSet<Integer> numbers = ...

All items in a dataset should have the same type. The single generics parameter specifies a type of the data that is stored in a dataset.

To read data from a file, we can use the readTextFile method that will read lines in a file line by line and return a dataset of type String:

DataSet<String> lines = env.readTextFile("path/to/file.txt");

If you specify a file path like this, Flink will attempt to read a local file. If you want to read a file from HDFS you need to specify the hdfs:// protocol:

env.readTextFile("hdfs:///path/to/file.txt");

Flink also has support for CSV files, but in this case, it won’t return a dataset of strings. It will try to parse every line and return a dataset of Tuple instances:

DataSet<Tuple2<Long, String>> lines = env.readCsvFile("data.csv")
                .types(Long.class, String.class);

Tuple2 is a class that stores a pair of two fields, but there are other classes like Tuple0, Tuple1, Tuple3, up to Tuple25 that store from zero to twenty-five fields. Later we will see how to work with these classes.

The types method specifies the types and the number of columns in a CSV file, so that Flink can read and parse them.

We can also create small datasets that are very good for small experiments and unit tests:

// Create from a list
DataSet<String> letters = env.fromCollection(Arrays.asList("a", "b", "c"));
// Create from an array
DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

A question that you may ask is what data we can store in a DataSet? Not every Java type can be used in a dataset, and there are four different categories of types that you can use:

Built-in Java types and POJO classes
Flink Tuples and Scala case classes
Values – these are special mutable wrappers for Java primitive types that you can use to increase performance (I’ll write about this in one of the upcoming articles)
Implementations of Hadoop Writable interface

Processing data with Apache Flink

Now to the data processing part! How do you implement an algorithm for processing your data? To do this, you can use a number of operations that resemble Java 8 streams operations, such as:

map – converts items in a dataset using a user-defined function. Every input element is converted into exactly one output element
filter – filters items in a dataset according to a user-defined function
flatMap – similar to the map operator, but allows returning zero, one or many elements
groupBy – groups elements by a key. Similar to the GROUP BY operator in SQL
project – select specified fields in a dataset of tuples, similar to the SELECT operator from SQL
reduce – combines elements in a dataset into a single value using a user-defined function

Keep in mind that the biggest difference between Java streams and these operations is that Java 8 works with data in memory and can access local data, while Flink works with data on a cluster in a distributed environment.
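To make the resemblance to Java 8 streams concrete, here is a small word count written with plain Java streams only (no Flink involved). It is a sketch that runs in a single JVM: flatMap splits lines into words, and groupingBy with counting plays the role that groupBy followed by an aggregation plays in Flink. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Word count with plain Java 8 streams, for comparison with the Flink operators. */
final class StreamsWordCount {

    static Map<String, Long> wordCount(String... lines) {
        return Arrays.stream(lines)
                // One line produces zero or more words, just like flatMap in Flink
                .flatMap(line -> Arrays.stream(line.split("\\W+")))
                // A line that starts with punctuation yields an empty first token
                .filter(word -> !word.isEmpty())
                // Group by the word itself and count each group; TreeMap keeps keys sorted
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("to be or", "not to be")); // {be=2, not=1, or=1, to=2}
    }
}
```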

Let’s take a look at a simple example that uses these operations. The following example is very straightforward. It creates a dataset of numbers, squares every number, and filters out all odd numbers.

// Create a dataset of numbers
DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6, 7);

// Square every number
DataSet<Integer> result = numbers.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer integer) throws Exception {
        return integer * integer;
    }
})
// Leave only even numbers
.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer integer) throws Exception {
        return integer % 2 == 0;
    }
});

If you have any experience with Java 8 you are probably wondering why I don’t use lambdas here. We can use lambdas here but it can cause some complications, as I’ve written here.

Saving data back

After we’ve finished processing our data it would make sense to save the result of our hard work. Flink can store data into a number of third-party systems such as HDFS, S3, Cassandra, etc.

For example, to write data to a file, we need to use the writeAsText method from the DataSet class:

DataSet<Integer> ds = ...

ds.writeAsText("path/to/file");

For debugging/testing purposes, Flink can write data to standard output or to standard error:

DataSet<Integer> ds = ...

// Output dataset to the standard output
ds.print();

// Output dataset to the standard error
ds.printToErr();

More complicated example

To implement some meaningful algorithms we need to first download a Grouplens movies dataset. It contains several CSV files with information about movies and movie ratings. We are going to work with the movies.csv file from this dataset which contains a list of all movies and looks like this:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller

It has three columns:

movieId – a unique movie id for a movie in this dataset
title – a title of the movie
genres – a “|” separated list of genres for each movie

We can now load this CSV file in Apache Flink and perform some meaningful processing. Here we will load a file from a local filesystem, while in a realistic environment you would read a much bigger dataset and it would probably reside in a distributed system, such as S3 or HDFS.

In this demo let’s find all movies of the “Action” genre. Here is a code snippet that does this:

// Load dataset of movies
DataSet<Tuple3<Long, String, String>> lines = env.readCsvFile("movies.csv")
            .ignoreFirstLine()
            .parseQuotedStrings('"')
            .ignoreInvalidLines()
            .types(Long.class, String.class, String.class);

DataSet<Movie> movies = lines.map(new MapFunction<Tuple3<Long,String,String>, Movie>() {
    @Override
    public Movie map(Tuple3<Long, String, String> csvLine) throws Exception {
        String movieName = csvLine.f1;
        String[] genres = csvLine.f2.split("\\|");
        return new Movie(movieName, new HashSet<>(Arrays.asList(genres)));
    }
});
DataSet<Movie> filteredMovies = movies.filter(new FilterFunction<Movie>() {
    @Override
    public boolean filter(Movie movie) throws Exception {
        return movie.getGenres().contains("Action");
    }
});

filteredMovies.writeAsText("output.txt");

Let’s break it down. First, we read a CSV file using the readCsvFile method:

DataSet<Tuple3<Long, String, String>> lines = env.readCsvFile("movies.csv")
    // ignore CSV header
    .ignoreFirstLine()
    // Set strings quotes character
    .parseQuotedStrings('"')
    // Ignore invalid lines in the CSV file
    .ignoreInvalidLines()
    // Specify types of columns in the CSV file
    .types(Long.class, String.class, String.class);

Using helper methods, we specify how to parse strings in the CSV file and that we need to skip the first line. In the last line, we specify the type of each column in the CSV file, and Flink parses the data for us.

Now that we have a dataset loaded in a Flink cluster we can do some data processing. First, we parse a list of genres for every movie using the map method:

DataSet<Movie> movies = lines.map(new MapFunction<Tuple3<Long,String,String>, Movie>() {
    @Override
    public Movie map(Tuple3<Long, String, String> csvLine) throws Exception {
        String movieName = csvLine.f1;
        String[] genres = csvLine.f2.split("\\|");
        return new Movie(movieName, new HashSet<>(Arrays.asList(genres)));
    }
});

To transform every movie we need to implement the MapFunction that will receive every CSV record as a Tuple3 instance and will convert it into the Movie POJO class:

class Movie {
    private String name;
    private Set<String> genres;

    public Movie(String name, Set<String> genres) {
        this.name = name;
        this.genres = genres;
    }

    public String getName() {
        return name;
    }

    public Set<String> getGenres() {
        return genres;
    }
}

If you recall the structure of the CSV file, the second column contains a name of a movie and the third column contains a list of genres. Hence we access these columns using fields f1 and f2 respectively.

Now that we have a dataset of movies we can implement the core part of our algorithm and filter all action movies:

DataSet<Movie> filteredMovies = movies.filter(new FilterFunction<Movie>() {
    @Override
    public boolean filter(Movie movie) throws Exception {
        return movie.getGenres().contains("Action");
    }
});

This will only return movies that contain “Action” in the set of genres.

Now the last step is very straightforward; we store result data into a file:

filteredMovies.writeAsText("output.txt");

This simply stores the result data into a local text file, but as with the readTextFile method, we could write this file into HDFS or S3 by specifying a protocol like hdfs://.
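If you want to sanity-check the expected output without starting Flink, the same three steps (parse quoted CSV, split genres on "|", keep "Action") can be sketched in a few lines of plain Python 3. This is only an illustration of the logic, not Flink code; the row with id 11 is a made-up example of a title that contains a comma, which is the reason the Flink job enables quoted-string parsing.

```python
import csv
import io

# A tiny in-memory sample; row 11 is hypothetical and only shows a quoted title.
sample = '''movieId,title,genres
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
9,Sudden Death (1995),Action
11,"Example, The (1995)",Comedy|Drama
'''

reader = csv.reader(io.StringIO(sample), quotechar='"')
next(reader)  # skip the header line, like ignoreFirstLine()

# "map" step: turn every CSV row into a (title, set of genres) pair
movies = [(title, set(genres.split("|"))) for _movie_id, title, genres in reader]

# "filter" step: keep only titles whose genre set contains "Action"
action_titles = [title for title, genres in movies if "Action" in genres]

print(action_titles)
```

Note that Python's str.split takes a plain string, while Java's String.split takes a regular expression, which is why the Java code has to escape the pipe as "\\|".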

More information

The post Getting started with batch processing using Apache Flink appeared first on Brewing Codes.

Python Data Structures Idioms

Ivan Mushketyk — Fri, 29 Sep 2017 16:38:32 +0000

As developers, we spend a significant portion of our time writing code that manipulates basic data structures: traversing a list, creating a map, filtering elements in a collection. Therefore it is important to know how to do this effectively in Python and make your code more readable and efficient.

Using lists

Iterate over a list

There are many ways to iterate over a list in Python. The simplest way would be to maintain the current position in the list and increment it on each iteration:

## SO WRONG
l = [1, 2, 3, 4, 5]
i = 0
while i < len(l):
    print l[i]
    i += 1

This works, but Python provides a more convenient way to do it using the range function. range can be used to generate numbers from 0 to N - 1, and this can be used as an analog of a for loop in C:

## STILL WRONG
for i in range(len(l)):
    print l[i]

While this is more concise, there is a better way to do it since Python lets us iterate over a list directly, similarly to foreach loops in other languages:

# RIGHT
for v in l:
    print v

Iterate over a list in reverse order

How can we iterate over a list in reverse order? One way to do it would be to use the hard-to-read three-argument version of the range function and provide the position of the last element in the list (first argument), the position just before the first element in the list (second argument), and a negative step to go in reverse order (third argument):

# WRONG
for i in range(len(l) - 1, -1, -1):
    print l[i]

But as you may have already guessed, Python offers a much better way to do it. We can just use the reversed function in a for loop:

# RIGHT
for i in reversed(l):
    print i

Access the last element

A commonly used idiom to access the last element in a list would be: get the length of the list, subtract 1 from it, and use the resulting number as the position of the last element:

# WRONG
>>> l = [1, 2, 3, 4, 5]
>>> l[len(l) - 1]
5

This is cumbersome in Python since it supports negative indexes to access elements from the end of the list. So -1 is the last element:

# RIGHT
>>> l[-1]
5

Negative indexes can also be used to access the next-to-last element and so on:

# RIGHT
>>> l[-2]
4
>>> l[-3]
3

Use sequence unpacking

A common way to extract values from a list to multiple variables in other programming languages would be to use indexes:

# WRONG
l1 = l[0]
l2 = l[1]
l3 = l[2]

But Python supports sequence unpacking that lets us extract values from a list into multiple variables:

# RIGHT
l1, l2, l3 = [1, 2, 3]

>>> l1
1
>>> l2
2
>>> l3
3
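Unpacking is not limited to flat lists. Here is a short sketch of a few other places where it helps; it is written for Python 3, and the starred form in particular does not exist in Python 2. The variable names are just for illustration.

```python
# Swap two values without a temporary variable.
a, b = 1, 2
a, b = b, a

# Unpack a nested structure in one statement.
name, (x, y) = ("origin", (0, 0))

# Python 3 only: a starred name collects the remaining items into a list.
first, *rest = [1, 2, 3, 4, 5]

print(a, b, name, x, y, first, rest)
```

Keep in mind that without a starred name the number of variables has to match the number of elements exactly, otherwise Python raises a ValueError.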

Use lists comprehensions

Let's say we want to select all grades for a movie posted by users of age 18 or below.

How many times did you write code like this:

# WRONG
under_18_grades = []
for grade in grades:
    if grade.age <= 18:
        under_18_grades.append(grade)

Do it no more in Python and use a list comprehension with an if clause instead:

# RIGHT
under_18_grades = [grade for grade in grades if grade.age <= 18]
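To make the snippet above runnable, here is a minimal Python 3 sketch. The Grade type and its fields are my assumption, since the grade objects are never defined in this post. It also shows that a comprehension can filter and transform in the same pass.

```python
from collections import namedtuple

# Hypothetical record type; only the age field comes from the example above.
Grade = namedtuple("Grade", ["user", "age", "score"])

grades = [
    Grade("ann", 17, 8),
    Grade("bob", 25, 6),
    Grade("cid", 18, 9),
]

# Filter only: keep whole records for users aged 18 or below.
under_18_grades = [grade for grade in grades if grade.age <= 18]

# Filter and transform: keep just the scores of those users.
under_18_scores = [grade.score for grade in grades if grade.age <= 18]

print(under_18_grades)
print(under_18_scores)
```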

Use enumerate function

Sometimes you need to iterate over a list and keep track of the position of each element. Say you need to display menu items in a shell; you could simply use the range function:

# WRONG
for i in range(len(menu_items)):
    menu_item = menu_items[i]
    print "{}. {}".format(i, menu_item)

A better way to do it would be to use the enumerate function. It returns an iterator that yields pairs, each of which contains the position of an element and the element itself:

# RIGHT
for i, menu_item in enumerate(menu_items):
    print "{}. {}".format(i, menu_item)
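Menus usually start numbering from 1 rather than 0. enumerate takes an optional start argument for exactly this, so there is no need to write i + 1 by hand. A small Python 3 sketch with made-up menu entries:

```python
menu_items = ["Open", "Save", "Quit"]

# start=1 makes the displayed numbers begin at 1 instead of 0.
lines = ["{}. {}".format(i, item) for i, item in enumerate(menu_items, start=1)]

for line in lines:
    print(line)
```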

Use keys to sort

A typical way to sort elements in other programming languages is to provide a function that compares two objects along with a collection to sort. In Python it would look like:

people = [Person('John', 30), Person('Peter', 28), Person('Joe', 42)]

# WRONG
def compare_people(p1, p2):
    if p1.age < p2.age:
        return -1
    if p1.age > p2.age:
        return 1
    return 0

sorted(people, cmp=compare_people)

[Person(name='Peter', age=28), Person(name='John', age=30), Person(name='Joe', age=42)]

But this is not the best way to do it. All we need to do to compare two instances of the Person class is to compare the values of their age fields, so why should we write a complex compare function for this?

Specifically for this case, the sorted function accepts a key function that extracts the value used to compare two objects:

# RIGHT
sorted(people, key=lambda p: p.age)
[Person(name='Peter', age=28), Person(name='John', age=30), Person(name='Joe', age=42)]
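Here is a self-contained Python 3 version of the same idea. I assume Person is a namedtuple, which matches the printed output above but is not stated in the post. Note that Python 3 dropped the cmp argument of sorted entirely; if you are stuck with an old comparison function, functools.cmp_to_key can wrap it.

```python
from collections import namedtuple
from functools import cmp_to_key
from operator import attrgetter

# Assumed definition; the post only shows how Person instances are printed.
Person = namedtuple("Person", ["name", "age"])
people = [Person("John", 30), Person("Peter", 28), Person("Joe", 42)]

# attrgetter("age") does the same job as lambda p: p.age
by_age = sorted(people, key=attrgetter("age"))

# reverse=True flips the order without touching the key function
oldest_first = sorted(people, key=attrgetter("age"), reverse=True)

def compare_people(p1, p2):
    # Returns -1, 0 or 1, like an old-style cmp function
    return (p1.age > p2.age) - (p1.age < p2.age)

# Python 3 has no cmp= argument, so wrap the comparison function instead
by_age_cmp = sorted(people, key=cmp_to_key(compare_people))

print([p.name for p in by_age])
print([p.name for p in oldest_first])
```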

Use all/any functions

If you want to check whether all or any values in a collection are True, one way would be to iterate over the list:

# WRONG
def all_true(lst):
    for v in lst:
        if not v:
            return False
    return True

But Python already has the all and any functions for that. all returns True if all values in the iterable passed to it are true, while any returns True if at least one of the values passed to it is true:

# RIGHT
>>> all([True, False])
False

>>> any([True, False])
True

To check if all items comply with a certain condition, you can convert a list of arbitrary objects to a list of booleans using list comprehension:

all([person.age > 18 for person in people])

Or you can pass a generator (just omit square braces around the list comprehension):

all(person.age > 18 for person in people)

Not only will this save you two keystrokes, it will also avoid creating an intermediate list (more about this later).
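There is one more benefit of the generator form: all and any stop as soon as the answer is known. The Python 3 sketch below records which values were actually checked; the is_adult helper and the ages list are made up for this illustration.

```python
checked = []

def is_adult(age):
    # Remember every value we were asked about
    checked.append(age)
    return age > 18

ages = [30, 12, 42, 55]

# all() stops at the first falsy result, so 42 and 55 are never checked
everyone_adult = all(is_adult(age) for age in ages)

# any() stops at the first truthy result in the same way
any_minor = any(age < 18 for age in ages)

print(everyone_adult, any_minor, checked)
```

With a list comprehension inside all, every element would be evaluated before all even starts looking at the results.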

Use slicing

You can take part of a list using a technique called slicing. Instead of providing a single index in square brackets when accessing a list, you can provide the following three values:

lst[start:end:step]

All of these parameters are optional and you can get different parts of a list if you omit some of them. If only start position is provided it will return all elements in a list starting from the specified index:

# RIGHT
>>> lst = range(10)
>>> lst[3:]
[3, 4, 5, 6, 7, 8, 9]

If only end position is provided slicing will return all elements up to the provided position:

>>> lst[:-3]
[0, 1, 2, 3, 4, 5, 6]

You can also get part of a list between two indexes:

>>> lst[3:6]
[3, 4, 5]

By default the step in slicing is equal to one, which means that all elements between the start and end positions are returned. If you want to get only every second element or every third element you need to provide a step value:

>>> lst[2:8:2]
[2, 4, 6]
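A few more slicing combinations that come up often, as a Python 3 sketch (so range is wrapped in list to get a real list):

```python
lst = list(range(10))

last_three = lst[-3:]      # negative start counts from the end
every_third = lst[::3]     # only a step: walk the whole list
reversed_copy = lst[::-1]  # a negative step walks the list backwards
shallow_copy = lst[:]      # no arguments at all: a copy of the list

print(last_three, every_third, reversed_copy)
```

Slicing always builds a new list, so the original is left untouched.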

Do not create unnecessary objects

Use xrange

range is a useful function if you need to generate consecutive integer values in a range, but in Python 2 it has one drawback: it returns a list with all generated values:

# WRONG
# Returns a too big list
for i in range(1000000000):
    ...

The solution here is to use the xrange function. It immediately returns a lazy object that generates values on demand as you iterate over it, instead of creating a list:

# RIGHT
# Returns an iterator
for i in xrange(1000000000):
    ...

The drawback of xrange compared to the range function is that its result is not a list: you can iterate over it and index it, but list operations such as slicing are not supported.

New in Python 3

In Python 3 xrange was removed and the range function behaves like xrange in Python 2.x. If you need an actual list in Python 3, you can convert the output of range into a list:

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
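To see how cheap a Python 3 range object is, here is a small check. The exact byte sizes are specific to the interpreter, so I only compare them with each other instead of printing expected numbers.

```python
import sys

big = range(1000000000)  # no billion-element list is built here

size_of_range = sys.getsizeof(big)
size_of_small_list = sys.getsizeof(list(range(1000)))

# A range object still knows its length and supports indexing
# and membership tests without generating every value.
length = len(big)
tenth = big[10]
has_last = 999999999 in big

print(size_of_range, size_of_small_list, length, tenth, has_last)
```

Even a list of only a thousand integers takes more memory than a range object that describes a billion of them.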

Use izip

If you need to generate pairs from elements in two collections, one way to do it would be to use the zip function:

# WRONG
names = ['Joe', 'Kate', 'Peter']
ages = [30, 28, 41]
# Creates a list
zip(names, ages)

[('Joe', 30), ('Kate', 28), ('Peter', 41)]

Instead we can use the izip function that returns an iterator instead of creating a new list:

# RIGHT
from itertools import izip
# Creates an iterator
it = izip(names, ages)

New in Python 3

In Python 3 the izip function was removed and zip behaves like the izip function in Python 2.x.
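One consequence that surprises people moving to Python 3: because zip now returns an iterator, its result can be consumed only once. A short sketch using the same names and ages as above:

```python
names = ['Joe', 'Kate', 'Peter']
ages = [30, 28, 41]

pairs = zip(names, ages)    # an iterator, nothing is computed yet
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # the iterator is already exhausted

# zip feeds nicely into dict() when you want a lookup table
age_by_name = dict(zip(names, ages))

print(first_pass, second_pass, age_by_name)
```

If you need the pairs more than once, keep the list from the first pass or call zip again.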

Use generators

List comprehensions are a powerful tool in Python, but they can use an extensive amount of memory since each list comprehension creates a new list:

# WRONG

# Original list
lst = range(10)
# This will create a new list
lst_1 = [i + 1 for i in lst]
# This will create another list
lst_2 = [i ** 2 for i in lst_1]

A way to avoid this is to use generator expressions instead of list comprehensions. The difference in syntax is minimal: you use parentheses instead of square brackets, but the difference in behavior is crucial. The following example does not create any intermediate lists:

# RIGHT

# Original list
lst = range(10)
# Won't create a new list
lst_1 = (i + 1 for i in lst)
# Won't create another list
lst_2 = (i ** 2 for i in lst_1)

This is especially handy if you need to process only part of the resulting collection, say to find the first element that matches a certain condition.
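The "find the first match" case can be sketched like this in Python 3. The record helper is something I added only to show which input values were actually pulled through the pipeline; next with a default value is the standard way to take the first item of a generator.

```python
seen = []

def record(n):
    # Remember which inputs were actually processed
    seen.append(n)
    return n

lst = range(10)
lst_1 = (record(i) + 1 for i in lst)
lst_2 = (i ** 2 for i in lst_1)

# Take the first squared value above 10; None if there is no such value
first_big = next((v for v in lst_2 if v > 10), None)

print(first_big, seen)
```

Only the inputs 0 to 3 are processed; the remaining six values are never touched because nothing asks for them.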

Use dictionaries idiomatically

Avoid using keys() function

If you need to iterate over keys in a dictionary you may be inclined to use the keys method of the dictionary:

# WRONG
for k in d.keys():
    print k

But there is a better way. When you iterate over a dictionary, the iteration goes over its keys, so you can simply do:

# RIGHT
for k in d:
    print k

Not only will it save you some typing, it will also avoid creating a copy of all keys in the dict, as the keys method does in Python 2.

Iterate over keys and values

If you iterate over keys, it's really easy to access both keys and values in a dictionary like this:

# WRONG
for k in d:
    v = d[k]
    print k, v

But there is a better way. You can use items function that returns key-value pairs from a dictionary:

# RIGHT
for k, v in d.items():
    print k, v

Not only is this method more concise, it's more efficient too.

Use dictionaries comprehension

One way to create a dictionary is to assign values to it one-by-one:

# WRONG

d = {}
for person in people:
    d[person.name] = person

Instead you can use a dictionary comprehension to turn this into a one liner:

# RIGHT
d = {person.name: person for person in people}
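Like list comprehensions, dictionary comprehensions accept an if clause and can transform both keys and values. A minimal Python 3 sketch, again assuming that Person is a namedtuple with name and age fields:

```python
from collections import namedtuple

# Assumed definition, matching how Person is used earlier in this post
Person = namedtuple("Person", ["name", "age"])
people = [Person("John", 30), Person("Peter", 28), Person("Joe", 42)]

# Index people by name
by_name = {person.name: person for person in people}

# Filter and transform at the same time: name -> age for people aged 30+
ages_30_plus = {p.name: p.age for p in people if p.age >= 30}

print(by_name["Joe"], ages_30_plus)
```

If two people share a name, the later one silently replaces the earlier one, exactly as it would with one-by-one assignment.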

Use collections module

Use namedtuple

If you need a struct-like type you may just define a class with an __init__ method and a bunch of fields:

# WRONG
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

However, the collections module from the Python standard library provides a namedtuple type that turns this into a one-liner:

# RIGHT
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])

In addition namedtuple implements __str__, __repr__, and __eq__ methods:

>>> Point(1, 2)
Point(x=1, y=2)
>>> Point(1, 2) == Point(1, 2)
True
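One thing to be aware of before replacing a class with a namedtuple: instances are immutable, like regular tuples. The Python 3 sketch below shows this, along with a few helpers that every namedtuple gets for free (_replace and _asdict are part of the documented namedtuple API despite the leading underscore).

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)

# A namedtuple is still a tuple, so it can be unpacked
x, y = p

# _replace returns a new instance with some fields changed
moved = p._replace(x=10)

# _asdict converts the instance to a mapping of field names to values
as_dict = dict(p._asdict())

# Fields cannot be reassigned in place
try:
    p.x = 5
except AttributeError:
    immutable = True
else:
    immutable = False

print(x, y, moved, as_dict, immutable)
```

If you need to mutate fields after creation, a regular class is still the right tool.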

Use defaultdict

If we need to count a number of times an element is encountered in a collection, we can use a common approach:

# WRONG
d = {}
for v in lst:
    if v not in d:
        d[v] = 1
    else:
        d[v] += 1

The collections module provides a very handy class for this case called defaultdict. Its constructor accepts a function that is used to calculate the value for a missing key:

>>> from collections import defaultdict
>>> d = defaultdict(lambda: 42)
>>> d['key']
42

To rewrite the counting example we can pass the int function to defaultdict; int returns zero when called with no arguments:

# RIGHT
from collections import defaultdict
d = defaultdict(int)
for v in lst:
    d[v] += 1

defaultdict is useful when you need to create any kind of grouping of items in a collection, but if you just need to count elements you may use the Counter class instead:

# RIGHT
from collections import Counter

>>> counter = Counter(lst)
>>> counter
Counter({4: 3, 1: 2, 2: 1, 3: 1, 5: 1})
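The grouping case mentioned above can be sketched as follows in Python 3. The word list is made up; passing list to defaultdict means every missing key starts out as an empty list. I have also added most_common, which is the usual way to read the top entries from a Counter.

```python
from collections import Counter, defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry", "apple"]

# Group words by their first letter; missing keys start as empty lists
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# Count occurrences and ask for the single most frequent word
counts = Counter(words)
top = counts.most_common(1)

print(dict(by_letter))
print(top)
```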

This post was originally posted at Brewing Codes blog.

Apache Spark vs. Apache Flink

Ivan Mushketyk — Wed, 27 Sep 2017 14:40:21 +0000

If you look at this image with a list of Big Data tools it may seem that all possible niches in this field are already occupied. With so much competition it should be very tough to come up with a groundbreaking technology.

Apache Flink creators have a different thought about this. It started as a research project called Stratosphere. Stratosphere was forked, and this fork became what we know as Apache Flink. In 2014 it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project. At the time of this writing, the project has almost twelve thousand commits and more than 300 contributors.

Why is there so much attention? This is because Apache Flink was called a new generation Big Data processing framework and has enough innovations under its belt to replace Apache Spark and become the new de-facto tool for batch and stream processing.

Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Or is Apache Flink just a new gimmick? This article will attempt to give you answers to these and other questions.

Apache Spark is old news

Unless you have been living under a rock for the last couple of years, you have heard about Apache Spark. It looks like every modern system that does any kind of data processing is using Apache Spark in one way or another.

For a long time, Spark was the latest and greatest tool in this area. It delivered some impressive features compared to its predecessors, such as:

Impressive speed - it is advertised as up to ten times faster than Hadoop MapReduce if data is processed on disk and up to 100 times faster if data is processed in memory.
Simpler Directed acyclic graph model - instead of defining your data processing jobs using the rigid MapReduce framework, Spark lets you define a graph of tasks that can implement complex data processing algorithms.
Stream processing - with the advent of new technologies such as the Internet of Things, it is not enough to simply process a huge amount of data. Now we need to process a huge amount of data as it arrives in real time. This is why Apache Spark has introduced stream processing that allows processing a potentially infinite stream of data.
Rich set of libraries - In addition to its core features Apache Spark provides powerful libraries for machine learning, graph processing, and performing SQL queries.

To get a better idea of how you write applications with Apache Spark, let's take a look at how you can implement a simple Word Count application that would count how many times each word was used in a text document:

// Read file
val file = sc.textFile("file/path")
val wordCount = file
  // Extract words from every line
  .flatMap(line => line.split(" "))
  // Convert words to pairs
  .map(word => (word, 1))
  // Count how many times each word was used
  .reduceByKey(_ + _)

If you know Scala, this code should seem straightforward and is similar to working with regular collections. First we read a list of lines from a file located in "file/path". This file can be either a local file or a file in HDFS or S3.

Then every line is split into a list of words using the flatMap method that simply splits a string by the space symbol. Then to implement the word counting we use the map method to convert every word into a pair where the first element of the pair is a word from the input text and the second element is simply a number one.

Then the last step simply counts how many times each word was used by summing up numbers for all pairs for the same word.
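If Scala is not your thing, the same three steps can be mimicked on an in-memory list in plain Python 3. To be clear, this is not Spark code and nothing here is distributed; it only mirrors what flatMap, map and reduceByKey do to the data, using two made-up input lines.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line and flatten the results into one stream of words
words = chain.from_iterable(line.split(" ") for line in lines)

# map: turn every word into a (word, 1) pair
pairs = ((word, 1) for word in words)

# reduceByKey: sum the numbers of all pairs that share the same word
word_count = defaultdict(int)
for word, n in pairs:
    word_count[word] += n

print(dict(word_count))
```

The difference in Spark is that each of these steps runs in parallel on partitions of the data, and the reduce step shuffles pairs between machines so that all pairs for the same word end up together.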

Apache Spark seems like a great and versatile tool. But what does Apache Flink bring to the table?

New kid on the block

At first glance, there do not seem to be many differences. The architecture diagram looks very similar:

If you take a look at the code example for the Word Count application for Apache Flink you would see that there is almost no difference:

val file = env.readTextFile("file/path")
val counts = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupBy(0)
  .sum(1)

The few notable differences are that in this case we need to use the readTextFile method instead of the textFile method and that we need to use a pair of methods, groupBy and sum, instead of reduceByKey.

So what is all the fuss about? Apache Flink may not have any visible differences on the outside, but it definitely has enough innovations to become the next-generation data processing tool. Here are just some of them:

Implements actual streaming processing - when you process a stream in Apache Spark, it treats it as many small batch problems, hence making stream processing a special case. Apache Flink, in contrast, treats batch processing as a special case of stream processing and does not use micro batching.
Better support for cyclical and iterative processing - Flink provides some additional operations that allow implementing cycles in your streaming application and algorithms that need to perform several iterations on batch data.
Custom memory management - Apache Flink is a Java application, but it does not rely entirely on the JVM garbage collector. It implements a custom memory manager that stores data to process in byte arrays. This reduces the load on the garbage collector and increases performance. You can read about it in this blog post.
Lower latency and higher throughput - multiple tests done by third parties suggest that Apache Flink has lower latency and higher throughput than its competitors.
Powerful windows operators - when you need to process a stream of data in most cases you need to apply a function to a finite group of elements in a stream. For example, you may need to count how many clicks your application has received in each five-minute interval, or you may want to know what was the most popular tweet on Twitter in each ten-minute interval. While Spark supports some of these use-cases, Apache Flink provides a vastly more powerful set of operators for stream processing.
Implements lightweight distributed snapshots - this allows Apache Flink to provide low overhead and exactly-once processing guarantees in stream processing, without using micro batching as Spark does.
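To make the window example from the list above more concrete, here is what "count clicks in each five-minute interval" means, sketched in plain Python 3 on a finite list of timestamps. This is not the Flink API; the function name and the sample timestamps (seconds since some starting point) are made up for the illustration.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60

def count_per_window(timestamps, window=WINDOW_SECONDS):
    """Count events in fixed, non-overlapping windows keyed by window start."""
    counts = Counter()
    for ts in timestamps:
        # Round the timestamp down to the start of its window
        window_start = ts - ts % window
        counts[window_start] += 1
    return dict(counts)

clicks = [0, 12, 299, 300, 301, 650, 899, 900]
clicks_per_window = count_per_window(clicks)

print(clicks_per_window)
```

A stream processor has to do the same thing on data that never ends: it must decide when a window is complete, emit its result, and deal with events that arrive late or out of order. This is the part where dedicated window operators earn their keep.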

What to choose

So, you are working on a new project, and you need to pick software for it? What should you use? Spark? Flink?

Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements.

If you don't need bleeding edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project, and it has a bigger user base, more training materials, and more third-party libraries. But keep in mind that Apache Flink is closing this gap by the minute. More and more projects are choosing Apache Flink as it becomes a more mature project.

If on the other hand, you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.

Does all this mean that Apache Spark is obsolete and in a couple of years we all are going to use Apache Flink? The answer may surprise you. While Flink has some impressive features, Spark is not staying the same. For example, Apache Spark introduced custom memory management in 2015 with the release of project Tungsten, and since then it has been adding features that were first introduced by Apache Flink. The winner is not decided yet.

More information

In the upcoming blog posts I will write more about how you can use Apache Flink for batch and stream processing, so stay tuned!

If you want to know more about Apache Flink, you can take a look at my Pluralsight course where I cover Apache Flink in more detail: Understanding Apache Flink. Here is a short preview of this course.

This post was originally posted at Brewing Codes blog.