<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ivan Mushketyk</title>
    <description>The latest articles on Forem by Ivan Mushketyk (@mushketyk).</description>
    <link>https://forem.com/mushketyk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F33698%2F0d79a5e8-f0a5-4a0d-9327-6ef9e6455808.jpg</url>
      <title>Forem: Ivan Mushketyk</title>
      <link>https://forem.com/mushketyk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mushketyk"/>
    <language>en</language>
    <item>
      <title>Should you use DynamoDB?</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Mon, 06 Nov 2017 18:47:08 +0000</pubDate>
      <link>https://forem.com/mushketyk/should-you-use-dynamodb-5m5</link>
      <guid>https://forem.com/mushketyk/should-you-use-dynamodb-5m5</guid>
      <description>&lt;p&gt;Selecting a proper technology for a new project is always a stressful event. We need to find something that will fit all existing requirements, does not restrict further growth, allows to achieve necessary performance, does not put a heavy operational burden, etc. It’s only natural that selecting a database can be tricky.&lt;/p&gt;

&lt;p&gt;In this article, I would like to describe DynamoDB database created by AWS. My goal is to give you enough knowledge so you would be able to answer a simple question: “Should I use DynamoDB in my next project?”. I will describe in what cases DynamoDB can be used efficiently and what pitfalls to avoid. I hope this will help you to make your life easier.&lt;/p&gt;

&lt;p&gt;This article will start with a general overview of DynamoDB; then I will show how to structure data for DynamoDB and what options do you have to work with DynamoDB. The article will finish with a rundown of some advanced features of DynamoDB.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is DynamoDB
&lt;/h1&gt;

&lt;p&gt;Let’s start with what is AWS DynamoDB. DynamoDB is a &lt;a href="https://brewing.codes/2017/10/30/nosql-guide/"&gt;NoSQL&lt;/a&gt;, key-value/document-oriented database. As a key-value database, it allows storing an item with an id and then get an item back. As a document-oriented database, it allows storing complex nested documents.&lt;/p&gt;

&lt;p&gt;DynamoDB is a serverless database, meaning that when you work with it you do not need to worry about individual machines. In fact, there is even no way to find out how many machines Amazon is using to serve your data. Instead of working with individual servers you need to specify how many read and write requests your database should process.&lt;/p&gt;

&lt;p&gt;On the one hand, this allows Amazon to provide a predictable low latency which is according to many sources is less than 10 ms if a request is coming from an EC2 host from the same AWS region. The latency can be even lower (&amp;lt;1 ms) if you enable cache for DynamoDB (more on this in the later section).&lt;/p&gt;

&lt;p&gt;On the other hand specifying a number of requests instead of a number of servers allows you to concentrate on the business value of the database and not on the implementation details. If you have an estimate of how many requests you need to process, all you need to do is to specify a number in AWS console or perform an API request. This can’t be simpler.&lt;/p&gt;

&lt;p&gt;The price that you pay for using DynamoDB is primarily determined by the provisioned capacity for your DynamoDB database. The higher it is, the higher the monthly bill is. DynamoDB also provides a small amount so-called “burst capacity which can be used for a short period of time to read or write more data that your provisioned capacity allows. If you’ve consumed it and still read or write more data, DynamoDB will return a &lt;code&gt;ProvisionedThroughputExceededException&lt;/code&gt;, and you need either to retry an operation or provision more capacity.&lt;/p&gt;

&lt;p&gt;In addition to these major features there are few other reasons why you might consider DynamoDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Massive scale&lt;/strong&gt; – just as other AWS services DynamoDB can work on a massive scale. Many other companies such as Airbnb, Lyft, and Duolingo are using DynamoDB in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low operational overhead&lt;/strong&gt; – you still need to do some operational tasks, such as, ensure that you have enough provisioned capacity, but most of the operational load is taken by the AWS team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable&lt;/strong&gt; – despite few outages DynamoDB has a proven track of being rock-solid database solution. Also, all data that is written to DynamoDB is replicated to three different location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schemaless&lt;/strong&gt; – just as many other NoSQL databases DynamoDB does not impose strict schema allowing more flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple API&lt;/strong&gt; – DynamoDB API is very straightforward. Overall it has less than &lt;a href="http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Operations_Amazon_DynamoDB.html"&gt;twenty methods&lt;/a&gt; an only a handful of them is related to writing and reading data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling&lt;/strong&gt; – it is pretty straightforward to scale DynamoDB database up or down. All you need to do is to enable autoscaling on a particular table, and AWS will automatically increase or decrease provisioned capacity depending on current load. Alternatively, you can perform the &lt;a href="http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateTable.html"&gt;UpdateTable&lt;/a&gt; API call and change the provisioned capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with other AWS services&lt;/strong&gt; – DynamoDB is one of the core AWS services and has good integration with other services. You can use it together with CloudSearch to enable full-text search, perform data analytics with AWS EMR, back up data with AWS Data Pipeline, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Data model in DynamoDB
&lt;/h1&gt;

&lt;p&gt;Now let’s take a look at how to store data in DynamoDB. Data in DynamoDB is separated into tables. When you create a table, you need to decide on the key type that your table will have. DynamoDB has two types of keys and when you select a key type and you can’t change once it is selected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple key&lt;/strong&gt; – in this case, you need to identify what attribute in the table contains a key. This key is called a partition key. With this key type, DynamoDB does not give you a lot of flexibility and the only operation that you can do efficiently is to store an element with a key and get an element by a key back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite key&lt;/strong&gt; – in this case, you need to specify two key values which are called partition key and a sort key. As in the previous case, you can get an item by key, but you can also query this data in a more elaborate way. For example, you can get all items with the same partition key, sort result data by the value of the sort key, filter items using the value of the sort key, etc. The pair of partition/sort should be unique for each item.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take a look at some examples of using simple and composite keys. For example, if we want to store a table with users in our database we could use a table with a simple key and store it like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7E0oi5A9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/partition-key-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7E0oi5A9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/partition-key-3.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this table, the only operation that we can perform efficiently is to get a user by id.&lt;/p&gt;

&lt;p&gt;Composite keys allow more flexibility. For example, we could define a table with a composite that stores forum messages and select user id as a partition key and timestamp as a sort key:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LnYyCJHT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/composite-key-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LnYyCJHT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/composite-key-2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This structure would allow performing more complex queries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get all forum posts written by user with a specified id&lt;/li&gt;
&lt;li&gt;Get all forum posts written by a specified user sorted by time (we can do this because we have the sort key)&lt;/li&gt;
&lt;li&gt;Get all forum posts that were written in a specified time interval (we can do this because we can specify filtering expression on a sort key)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that in this case, you can only query data for a specified partition key. If you want to search for items across partition keys you need to use the scan operation. It allows finding all items in a table that match a specified filter expression. This operation is less restrictive than the first two, but it does not exploit any knowledge about where data is stored in DynamoDB and ends up scanning the whole table.&lt;/p&gt;

&lt;p&gt;You should try to use the scan operation as little as possible. While it allows much more flexibility, it is significantly slower and consumes more provisioned capacity. If you are extensively relying on scanning big tables, you won’t achieve high performance and will have to provision more capacity and hence pay more money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistency in DynamoDB
&lt;/h2&gt;

&lt;p&gt;As with many other NoSQL databases, you can select consistency level when you perform operations with DynamoDB. DynamoDB stores three copies of each item and when you write data to DynamoDB it only acknowledges a write after two copies out of three were updated. The third copy is updated later.&lt;/p&gt;

&lt;p&gt;When you read data from DynamoDB, you have two options. You can either use strong consistency and in this DynamoDB will read data from two copies and return the latest data, or you can select eventual consistency and in this case, DynamoDB will only read data from one copy at random, and may return stale data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;Simple key and composite key model is quite restrictive and is not enough to support complex use cases. To help with that DynamoDB supports two index types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local secondary index&lt;/strong&gt; – is very similar to composite key and is used to define additional sort order or to filter items by a different criteria. The main difference from a composite key is that a pair of a partition/sort key should be unique, but a pair of partition/secondary index should not be unique.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global secondary index&lt;/strong&gt; – this allows using a different partition key for your data. You can use a global secondary index if you want to get an item from a table by one of two ids, for example, a book in an online shop can have two ids: ISBN-10 and ISBN-13. As with regular tables, global secondary indexes can have simple and composite keys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Internally, global secondary index is simply a copy of your original data in a separate DynamoDB table with a different key. When an item is written into a table with a global secondary index, DynamoDB copies data in the background. Because of this writing data into a global secondary is always eventually consistent.&lt;/p&gt;

&lt;p&gt;With DynamoDB you can create up to five local secondary indexes and up to five global secondary indexes per table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complex data
&lt;/h2&gt;

&lt;p&gt;All examples so far presented data in table format but DynamoDB also supports complex data types. In addition to simple data types like numbers and strings DynamoDB supports these types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nested object&lt;/strong&gt; – a value of an attribute in DynamoDB can be a complex nested object&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set&lt;/strong&gt; – a set of numbers, strings, or binary values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;List&lt;/strong&gt; – a untyped list that can contain any values&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Programmatic access
&lt;/h1&gt;

&lt;p&gt;If you want to access DynamoDB,f you have two main options: &lt;a href="http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Operations_Amazon_DynamoDB.html"&gt;low-level API&lt;/a&gt; and DynamoDB mapper. All communication with DynamoDB is performed via HTTP. To read data, there are just four methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GetItem&lt;/strong&gt; – get a single item by id from a database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BatchGetItem&lt;/strong&gt; – get several items by id in one call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; – query a composite key or an index&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scan&lt;/strong&gt; – scan through a table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are just four methods to change data in DynamoDB:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PutItem&lt;/strong&gt; – write a new item to a table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BatchWriteItem&lt;/strong&gt; – write multiple items to a table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UpdateItem&lt;/strong&gt; – update some fields in a specified item&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeleteItem&lt;/strong&gt; – remove an item by id&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All methods work on a table level. There are no methods that work across different tables.&lt;/p&gt;

&lt;p&gt;The low-level API is a thin wrapper over these HTTP methods. It is verbose and cumbersome to use. For example, to get a single item from a DynamoDB table you need write that much code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create DynamoDB client&lt;/span&gt;
&lt;span class="nc"&gt;AmazonDynamoDB&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AmazonDynamoDBClientBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;standard&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Create a composite key&lt;/span&gt;
&lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AttributeValue&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="nc"&gt;UserId&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AttributeValue&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withN&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="nc"&gt;Timestamp&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AttributeValue&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;withN&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="mi"&gt;1498928631&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Create a request object&lt;/span&gt;
&lt;span class="nc"&gt;GetItemRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GetItemRequest&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;         
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withTableName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="nc"&gt;ForumMessages&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Perform API request&lt;/span&gt;
&lt;span class="nc"&gt;GetItemResult&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getItem&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Get attribute from the result item&lt;/span&gt;
&lt;span class="nc"&gt;AttributeValue&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getItem&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="err"&gt;"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Get string value &lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributeValue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getS&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is pretty straightforward. First, we create a key to get an item by a key, then as with other AWS API methods we create a request object and perform a request. The example is in Java, but API clients for AWS exist for other languages such as Python, .NET platform, and JavaScript.&lt;/p&gt;

&lt;p&gt;Now you may be wondering if there is a library that will help to avoid all this massive amount boilerplate code. And in fact, AWS implemented a high-level library for this called DynamoDB mapper. To use it you first need to define a structure of your data similarly to how you define it with ORM frameworks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Specify table name&lt;/span&gt;
&lt;span class="nd"&gt;@DynamoDBTable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="nc"&gt;ForumUsers&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// "UserId" attribute is a key&lt;/span&gt;
    &lt;span class="nd"&gt;@DynamoDBHashKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributeName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="nc"&gt;UserId&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;getUserId&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Map this value to the "Name" attribute&lt;/span&gt;
    &lt;span class="nd"&gt;@DynamoDBAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributeName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="nc"&gt;Name&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now accessing data in DynamoDB is much simpler. All we need to do to is to create an instance of the &lt;code&gt;DynamoDBMapper&lt;/code&gt; and call the &lt;code&gt;load&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create DynamoDB client (just as in the previous example)&lt;/span&gt;
&lt;span class="nc"&gt;AmazonDynamoDB&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AmazonDynamoDBClientBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;standard&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Create DynamoDB mappper&lt;/span&gt;
&lt;span class="nc"&gt;DynamoDBMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DynamoDBMapper&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create a key instance&lt;/span&gt;
&lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setUserId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get item by keys&lt;/span&gt;
&lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that now to specify a key of an item we can use a Java POJO class and not a &lt;code&gt;HashMap&lt;/code&gt; instance as in the previous case.&lt;/p&gt;

&lt;p&gt;Unless you need to access some nitty-gritty details of DynamoDB or need to implement a custom way of doing things I recommend to use DynamoDB mapper and only to use low-level API if necessary. Examples here were provided in Java, but there are also DynamoDB mapper implementations for other languages like &lt;a href="http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DotNetSDKHighLevel.html"&gt;.NET platform&lt;/a&gt; and &lt;a href="http://dynamodb-mapper.readthedocs.io/en/latest/"&gt;Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can also use unofficial libraries to access DynamoDB. For example, if you are using Spring you can consider using Amazon DynamoDB module for &lt;a href="https://github.com/derjust/spring-data-dynamodb"&gt;Spring Data&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Advanced features
&lt;/h1&gt;

&lt;p&gt;All features described so far are core DynamoDB features, but it also has some additional more advanced features that you can use to build complex applications. In this section, I’ll provide a short list of what these features are and what you can use them for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimistic locking
&lt;/h2&gt;

&lt;p&gt;A common problem in a distributed systems is when different actors are stepping on each other toes while performing operations in parallel. A common example is when two different processes are trying to update the same item in a database. In this case, a second update can override data written by the first update.&lt;/p&gt;

&lt;p&gt;To solve this problem, DynamoDB allows specifying a condition for performing an update. If a condition is satisfied, new data is written, otherwise, DynamoDB will return an error.&lt;/p&gt;

&lt;p&gt;A common way to use this feature that is implemented in DynamoDBMapper is to maintain a version field with an item and increment it on every update. If a version was not changed a process can write new data. Otherwise, it will have to re-read data and try to perform the operation again.&lt;/p&gt;

&lt;p&gt;This technique is also called optimistic locking and is similar to the &lt;a href="https://en.wikipedia.org/wiki/Compare-and-swap"&gt;compare-and-swap&lt;/a&gt; operation that is used to implement lock-free data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transactions
&lt;/h2&gt;

&lt;p&gt;While DynamoDB does not have built-in support for transactions, AWS has implemented a &lt;a href="https://github.com/awslabs/dynamodb-transactions"&gt;Java library&lt;/a&gt; that implements transactions on top of existing DynamoDB features. A detailed design is described &lt;a href="https://github.com/awslabs/dynamodb-transactions/blob/master/DESIGN.md"&gt;here&lt;/a&gt;, but in the nutshell when you perform an operation with the DynamoDB transaction library it stores a list of performed operations into a separate table and on commit it applies stored operations.&lt;/p&gt;

&lt;p&gt;Since the library is open sourced there is nothing that prevents developers from implementing a similar solution for other languages, but it seems currently DynamoDB transaction support is only implemented for Java.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time to live
&lt;/h2&gt;

&lt;p&gt;Not all data that you store in DynamoDB should be stored forever. Older data can be moved to a cheaper data storage, like S3, or simply removed. To automatically delete old data DynamoDB implements the time-to-live feature which allows specifying an attribute that stores a timestamp when an item should be removed. DynamoDB tracks expired items and removes them for no extra cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB Streams
&lt;/h2&gt;

&lt;p&gt;Another powerful DynamoDB feature is DynamoDB streams. If it is enabled it allows reading an immutable, ordered stream of updates to a DynamoDB table. An item is written to a stream after an update is performed and allow to react to changes to DynamoDB. This is a crucial feature if you need to implement one of the following use-cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region replication&lt;/strong&gt; – you may want to store a copy of your data in a separate region to keep it closer to your users or to have a back-up of your data. To implement it you can read a DynamoDB stream and replay update operations in a second database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregated table&lt;/strong&gt; – DynamoDB model might not suit some of the queries you need to perform. For example, if you need to group by data, DynamoDB does not support it out of the box. To implement this feature, you may read DynamoDB update stream and maintain an aggregated table that fits DynamoDB model and allows efficient queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep data in sync&lt;/strong&gt; – in many cases, you need to maintain a copy of your data in a different datastore such as cache or CloudSearch. DynamoDB streams is an immense help for that. Since a record in DynamoDB stream appears only if data was stored in DynamoDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_LgkQ0gI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/kcl-image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_LgkQ0gI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/kcl-image-2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DynamoDB streams implementation is very similar to another AWS services called Kinesis. They both have similar API and to read data from any of the systems you can use &lt;a href="https://github.com/awslabs/amazon-kinesis-client"&gt;Kinesis Client Library&lt;/a&gt; (KCL) that provides a high-level interface for reading data from a Kinesis or DynamoDB stream. Keep in mind that DynamoDB streams API and Kinesis API are slightly different, so you need to use an &lt;a href="https://github.com/awslabs/dynamodb-streams-kinesis-adapter"&gt;adapter&lt;/a&gt; to use KCL with DynamoDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB accelerator (DAX)
&lt;/h2&gt;

&lt;p&gt;As with every other database, it can be beneficial to use cache if you use DynamoDB. Unfortunately, it may be tricky to &lt;a href="https://martinfowler.com/bliki/TwoHardThings.html"&gt;maintain cache consistency&lt;/a&gt;. If you use a caching solution like &lt;a href="https://aws.amazon.com/elasticache/"&gt;ElastiCache&lt;/a&gt;, it may be a good idea to utilize DynamoDB stream to maintain a copy of your data in a cache, but recently DynamoDB has introduced a new feature called DynamoDB accelerator.&lt;/p&gt;

&lt;p&gt;DAX is write-through caching layer for DynamoDB. It has exactly the same API as DynamoDB, and if you’ve enabled it, you are supposed to read and write to DynamoDB through it. DAX keeps track of what data was written to DynamoDB and only stores it if a write was acknowledged by DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LO_M-lxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/dax-image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LO_M-lxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://brewing.codes/wp-content/uploads/2017/11/dax-image-2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the key benefits of using DAX is that it has a sub-millisecond latency which may be especially important if you have strict SLAs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusions
&lt;/h1&gt;

&lt;p&gt;DynamoDB is a great database. It has a rich feature set, predictable low latency, and almost no operational load. The key to using it efficiently is to understand its data model and to check if you can fit your data in DynamoDB. Remember, to achieve the stellar performance you need to use queries as much as possible and try to avoid scans operations.&lt;/p&gt;

&lt;p&gt;This was an introductory article to DynamoDB, and I will write more, so stay tuned. In the meantime, you can take a look at my &lt;a href="http://bit.ly/dynamo-deep-dive"&gt;deep dive DynamoDB course&lt;/a&gt;. You can watch the preview for the course &lt;a href="http://bit.ly/dynamodb-preview"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://brewing.codes/2017/11/06/dynamodb-overview/"&gt;Should you use DynamoDB?&lt;/a&gt; appeared first on &lt;a href="https://brewing.codes"&gt;Brewing Codes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>dynamodb</category>
    </item>
    <item>
      <title>Sending additional data to and from Flink cluster</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Tue, 24 Oct 2017 09:00:02 +0000</pubDate>
      <link>https://forem.com/mushketyk/sending-additional-data-to-and-from-flink-cluster-b8g</link>
      <guid>https://forem.com/mushketyk/sending-additional-data-to-and-from-flink-cluster-b8g</guid>
      <description>

&lt;p&gt;If you know anything about Apache Flink, you are probably familiar with how to send data to it and how to get results back. But in some cases, we need to send configuration data to the Flink cluster and receive some additional data from it.&lt;/p&gt;

&lt;p&gt;In the first part of the article, I’ll describe how to send configuration data to our Flink cluster. There are many things that we want to configure: function parameters, configuration files, machine learning models. Flink provides several different ways to do this, and we will cover how to use them and when to use each one. In the second part of the article, I will describe a non-trivial way of sending data back from a Flink cluster.&lt;/p&gt;

&lt;p&gt;This article requires some basic knowledge of Apache Flink. If you are not familiar with it, you can read some of my other articles on the topic: &lt;a href="https://brewing.codes/2017/09/25/flink-vs-spark/"&gt;here&lt;/a&gt;, &lt;a href="https://dev.to/mushketyk/getting-started-with-batch-processing-using-apache-flink-bnh"&gt;here&lt;/a&gt;, and &lt;a href="https://dev.to/mushketyk/getting-started-with-stream-processing-using-apache-flink-530"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Sending data to task managers
&lt;/h1&gt;

&lt;p&gt;Before we dig into the details of how to send data between different components in Apache Flink, let’s first talk about what components there are in a Flink cluster and what are we trying to achieve. The following diagram presents what main moving parts Flink has and how they interact:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bi_g0R0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/lh3.googleusercontent.com/-VC1iepeKON8/WdM9D0APYZI/AAAAAAAAGdk/iwpXPl5dVNc4mRLQ9zPjZpohT8W4MzGjACHMYCw/I/15070159503265.jpg%3Fw%3D690%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bi_g0R0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i2.wp.com/lh3.googleusercontent.com/-VC1iepeKON8/WdM9D0APYZI/AAAAAAAAGdk/iwpXPl5dVNc4mRLQ9zPjZpohT8W4MzGjACHMYCw/I/15070159503265.jpg%3Fw%3D690%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we need to execute a Flink application, we interact with a job manager that stores details about the job it is running, such as an execution graph. It controls task managers and each task manager contains a portion of the data and execute data processing functions that we’ve defined.&lt;/p&gt;

&lt;p&gt;In many cases, we would like to configure the behavior of our functions that run in the Flink cluster. Depending on a use-case we may need to set a single variable or submit a file with a static configuration, and we will discuss how Flink supports these and other cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KyybHwe_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/lh3.googleusercontent.com/-BrwTPnr-cFk/WdM9sexcS-I/AAAAAAAAGds/J4da5TaNL3sfCZ0UqPgF5j1DlZnFpxitQCHMYCw/I/15070161117794.jpg%3Fw%3D690%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KyybHwe_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i1.wp.com/lh3.googleusercontent.com/-BrwTPnr-cFk/WdM9sexcS-I/AAAAAAAAGds/J4da5TaNL3sfCZ0UqPgF5j1DlZnFpxitQCHMYCw/I/15070161117794.jpg%3Fw%3D690%26ssl%3D1" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to sending configuration data to task managers, sometimes we may want to return data from our functions in addition to regular outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring user-defined functions
&lt;/h2&gt;

&lt;p&gt;Let’s say we have an application that reads a list of movies from a CSV file and needs to filter all movies of a particular genre:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Read a dataset of movies&lt;/span&gt;
&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readCsvFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"movies.csv"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ignoreFirstLine&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parseQuotedStrings&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'"'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ignoreInvalidLines&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;FilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Genres for a movie separated by the "|" symbol&lt;/span&gt;
    &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\|"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Find all movies that has the "Action" genre&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;anyMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Action"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}).&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It is very likely that we would like to extract movies of a different genre and to this we need to be able to configure our filter function. When you implement a function like this, the most straightforward way to configure it is to implement a constructor:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Pass a genre name&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;FilterGenre&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Action"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FilterGenre&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="n"&gt;FilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Initialize filter function&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;FilterGenre&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;genre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\|"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;anyMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, if you are using lambda functions you can simply use a variable from its &lt;a href="https://en.wikipedia.org/wiki/Closure_(computer_programming)"&gt;closure&lt;/a&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Action"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;FilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\|"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Using variable&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;anyMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="o"&gt;}).&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Flink will serialize this variable and send it with the function to the cluster.&lt;/p&gt;

&lt;p&gt;All these methods can get annoying if you need to pass a lot of variables to your function. To help with that Apache Flink provides the &lt;code&gt;withParameters&lt;/code&gt; method. To use it you need to implement a &lt;code&gt;Rich&lt;/code&gt; version of a function you are interested in, so instead of implementing the &lt;code&gt;MapFunction&lt;/code&gt; interface, you will have to implement the &lt;code&gt;RichMapFunction&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Rich functions allow you to pass a number of parameters using the &lt;code&gt;withParameters&lt;/code&gt; method:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Class in Flink to store parameters&lt;/span&gt;
&lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"genre"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Action"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;FilterGenreWithParameters&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="c1"&gt;// Pass parameters to a function&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withParameters&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To read these parameters we need to implement the &lt;code&gt;open&lt;/code&gt; and read parameters in it:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FilterGenreWithParameters&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;RichFilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Read the parameter&lt;/span&gt;
        &lt;span class="n"&gt;genre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"genre"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\|"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;anyMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;All these options will work, but it can be tedious if you need to set the same parameter for multiple functions. To handle this Flink allows setting global environments variable that will be accessible by all task managers. &lt;/p&gt;

&lt;p&gt;To do this, you first need to read arguments from a command line using the &lt;code&gt;ParameterTool.fromArgs&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Read command line arguments&lt;/span&gt;
    &lt;span class="n"&gt;ParameterTool&lt;/span&gt; &lt;span class="n"&gt;parameterTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromArgs&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and then set global job parameters using the &lt;code&gt;setGlobalJobParameters&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;ExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConfig&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;setGlobalJobParameters&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameterTool&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;// This function will be able to read these global parameters&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;FilterGenreWithGlobalEnv&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now we can implement a function that will read these parameters. As before it should be a rich function:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FilterGenreWithGlobalEnv&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;RichFilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\|"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Get global parameters&lt;/span&gt;
        &lt;span class="n"&gt;ParameterTool&lt;/span&gt; &lt;span class="n"&gt;parameterTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ParameterTool&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getExecutionConfig&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getGlobalJobParameters&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// Read parameter&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parameterTool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"genre"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genres&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;anyMatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To read a parameter we need to call the &lt;code&gt;getGlobalJobParameter&lt;/code&gt; to get all global parameters and then use the &lt;code&gt;get&lt;/code&gt; method to get the parameter we are interested in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Broadcast variables
&lt;/h2&gt;

&lt;p&gt;All these methods that we’ve discussed before will suit you if you want to send data from a client to task managers, but what if data exists in task managers in the form of a dataset? In this case, it’s better to use another Flink feature called broadcast variables. It simply allows sending a dataset to task managers that will execute your functions.&lt;/p&gt;

&lt;p&gt;Let’s say we have a dataset that contains words that we &lt;a href="https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html"&gt;should ignore&lt;/a&gt; when we do text processing, and we want to set it our function. To set a broadcast variable for a single function, we need to use the &lt;code&gt;withBroadcastSet&lt;/code&gt; method and a dataset to it.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;toBroadcast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Get a dataset with words to ignore&lt;/span&gt;
&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;wordsToIgnore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;RichFlatMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// A collection to store words. This will be stored in memory&lt;/span&gt;
    &lt;span class="c1"&gt;// of a task manager&lt;/span&gt;
    &lt;span class="n"&gt;Collection&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;wordsToIgnore&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Read a collection of words to ignore&lt;/span&gt;
        &lt;span class="n"&gt;wordsToIgnore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getBroadcastVariable&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"wordsToIgnore"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\W+"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;// Use the collection of words to ignore&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordsToIgnore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Pass a dataset via a broadcast variable&lt;/span&gt;
&lt;span class="o"&gt;}).&lt;/span&gt;&lt;span class="na"&gt;withBroadcastSet&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordsToIgnore&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wordsToIgnore"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You should keep in mind that if you use broadcast variables, a dataset will be stored in a task manager’s memory, so you should only use it for small datasets.&lt;/p&gt;

&lt;p&gt;If you want to send more data to each task manager and do not want to store this data in memory, you can send a static file to task managers using Flink’s distributed cache. To use it you, first, need to store a file in one of the distributed file systems like HDFS and then register this file in the cache:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;ExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Register a file from HDFS&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;registerCachedFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hdfs:///path/to/file"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"machineLearningModel"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To access the distributed cache, we again need to implement a rich function:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyClassifier&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;RichMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="n"&gt;machineLearningModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getDistributedCache&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"machineLearningModel"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Notice that to access a file in the distributed cache we need to use the same key that we used to register it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Accumulators
&lt;/h1&gt;

&lt;p&gt;We’ve covered how we can send data to task managers but now let’s talk about how we can send data from task managers back. You may wonder why do we need to do anything special. After all, Apache Flink is all about building data processing pipelines that read input data, process it, and return a result back.&lt;/p&gt;

&lt;p&gt;To clarify what else can we possibly want let’s take a look at an example. Let’s say we need to count how many times each word occurs in a text and at the same time we want to calculate how many lines do we have in the text:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Text to process&lt;/span&gt;
&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;// Word count algorithm&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;FlatMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\W+"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;groupBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Count a number of lines in the text to process&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;linesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linesCount&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The problem is that if we run this application as it is will run two Flink jobs! First to get the word count and second to count a number of lines.&lt;/p&gt;

&lt;p&gt;This is definitely inefficient, but how can we avoid this? One way is to use accumulators. They allow you to send data from task managers and this data to be aggregated using a predefined function. Flink has following built-in accumulators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IntCounter&lt;/strong&gt; , &lt;strong&gt;LongCounter&lt;/strong&gt; , &lt;strong&gt;DoubleCounter&lt;/strong&gt; – allows summing together int, long, double values sent from task managers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AverageAccumulator&lt;/strong&gt; – calculates an average of double values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LongMaximum&lt;/strong&gt; , &lt;strong&gt;LongMinimum&lt;/strong&gt; , &lt;strong&gt;IntMaximum&lt;/strong&gt; , &lt;strong&gt;IntMinimum&lt;/strong&gt; , &lt;strong&gt;DoubleMaximum&lt;/strong&gt; , &lt;strong&gt;DoubleMinimum&lt;/strong&gt; – accumulators to determine maximum and minimum values for different types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Histogram&lt;/strong&gt; – used to computed distribution of values from task managers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To use an accumulator, we need to create and register it an user-defined function and then read the result on the client. Here is how we can do this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;RichFlatMapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Create an accumulator&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="n"&gt;IntCounter&lt;/span&gt; &lt;span class="n"&gt;linesNum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;IntCounter&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Register accumulator&lt;/span&gt;
        &lt;span class="n"&gt;getRuntimeContext&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;addAccumulator&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"linesNum"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linesNum&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\W+"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Increment after each line is processed&lt;/span&gt;
        &lt;span class="n"&gt;linesNum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;groupBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Get accumulator result&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;linesNum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getLastJobExecutionResult&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getAccumulatorResult&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"linesNum"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linesNum&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This allows us to count how many times each word occurs in the input text and how many lines does it have.&lt;/p&gt;

&lt;p&gt;If you need a custom accumulator, you can also implement your own accumulators using Accumulator or SimpleAccumulator interfaces.&lt;/p&gt;

&lt;h1&gt;
  
  
  More information
&lt;/h1&gt;

&lt;p&gt;I hope you liked this article and found it useful. You can find the source code for this article in my git repository with other &lt;a href="https://github.com/mushketyk/flink-examples"&gt;Apache Flink examples&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I will write more articles about Flink in the near future, so stay tuned! You can read my other articles &lt;a href="https://brewing.codes/"&gt;here&lt;/a&gt;, or you can you can take a look at my Pluralsight course where I cover Apache Flink in more details: &lt;a href="http://bit.ly/understanding-flink"&gt;Understanding Apache Flink&lt;/a&gt;. Here is a &lt;a href="http://bit.ly/understanding-flink-preview"&gt;short preview&lt;/a&gt; of this course.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://brewing.codes/2017/10/24/flink-additional-data/"&gt;Sending additional data to and from Flink cluster&lt;/a&gt; appeared first on &lt;a href="https://brewing.codes"&gt;Brewing Codes&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>flink</category>
      <category>accumulators</category>
      <category>broadcastvariables</category>
    </item>
    <item>
      <title>Four ways to optimize your Flink applications</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Tue, 17 Oct 2017 10:00:43 +0000</pubDate>
      <link>https://forem.com/mushketyk/four-ways-to-optimize-your-flink-applications-363</link>
      <guid>https://forem.com/mushketyk/four-ways-to-optimize-your-flink-applications-363</guid>
      <description>

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmBXnkU1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cv65nwycvzita8ggz2qf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmBXnkU1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/cv65nwycvzita8ggz2qf.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flink is a complicated framework and provides many ways to tweak its execution. In this article, I’ll show four different ways to improve the performance of your Flink applications.&lt;/p&gt;

&lt;p&gt;If you are not familiar with Flink, you can read other introductory articles like &lt;a href="https://brewing.codes/2017/09/25/flink-vs-spark/"&gt;this&lt;/a&gt;, &lt;a href="https://dev.to/mushketyk/getting-started-with-batch-processing-using-apache-flink-bnh"&gt;this&lt;/a&gt;, and &lt;a href="https://dev.to/mushketyk/getting-started-with-stream-processing-using-apache-flink-530"&gt;this one&lt;/a&gt;. But if you are already familiar with Apache Flink this article will help you to make your applications a little bit faster.&lt;/p&gt;

&lt;h1&gt;
  
  
  Use Flink tuples
&lt;/h1&gt;

&lt;p&gt;When you use operations like &lt;code&gt;groupBy&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt; or &lt;code&gt;keyBy&lt;/code&gt; Flink provides you a number of options to select a key in your dataset. You can use a key selector function:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Join movies and ratings datasets&lt;/span&gt;
&lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// Use movie id as a key in both cases&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeySelector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;@Override&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;getKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;})&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equalTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeySelector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Rating&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;@Override&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;getKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Rating&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMovieId&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or you can specify a field names in POJO types:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// Use same fields as in the previous example&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equalTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"movieId"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But if you are working with Flink tuple types you can simply specify a position of a field tuple that will be used as key:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ratings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// Specify fields positions in tuples&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equalTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The last option will give you the best performance, but what about readability? Does it mean that your code will look like this now:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;DataSet&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;movies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratings&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equalTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;JoinFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// What is happening here?&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Some tuples are joined with some other tuples and some fields are returned???&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;A common idiom to improve readability, in this case, is to create a class that inherits from one of the &lt;code&gt;TupleX&lt;/code&gt; classes and implements getters and setters for these fields. Here how an &lt;code&gt;Edge&lt;/code&gt; class from the Flink Gelly library that has three classes and extends the &lt;code&gt;Tuple3&lt;/code&gt; class:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Edge&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;Edge&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Getters and setters for readability&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;setSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="nf"&gt;getSource&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Also has getters and setters for other fields&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Reuse Flink objects
&lt;/h1&gt;

&lt;p&gt;Another option that you can use to improve the performance of your Flink application is to use mutable objects when you return data from a user-defined function. Take a look at this example:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;WindowFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt; &lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="c1"&gt;// A new Tuple instance is created on every execution&lt;/span&gt;
            &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can see on every execution of the &lt;code&gt;apply&lt;/code&gt; function, we create a new instance of the &lt;code&gt;Tuple2&lt;/code&gt; class, which increases pressure on a garbage collector. One way to fix this problem would be to reuse the same instance again and again:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;WindowFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Create an instance that we will reuse on every call&lt;/span&gt;
        &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;

        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt; &lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

            &lt;span class="c1"&gt;// Set fields on an existing object instead of creating a new one&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="c1"&gt;// Auto-boxing!! A new Long value may be created&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Reuse the same Tuple2 object&lt;/span&gt;
            &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It’s a bit better. We create a new &lt;code&gt;Tuple2&lt;/code&gt; instance on every call, but we still, indirectly, create an instance of the &lt;code&gt;Long&lt;/code&gt; class. To solve this problem, Flink has a number of so-called value classes: &lt;code&gt;IntValue&lt;/code&gt;, &lt;code&gt;LongValue&lt;/code&gt;, &lt;code&gt;StringValue&lt;/code&gt;, &lt;code&gt;FloatValue&lt;/code&gt;, etc. The point of this classes is to provide mutable versions of built-in types, so we could reuse them in our user-defined functions. Here is how we can use them:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;stream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;WindowFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Create a mutable count instance&lt;/span&gt;
        &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="n"&gt;LongValue&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;IntValue&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// Assign mutable count to the tuple&lt;/span&gt;
        &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LongValue&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="c1"&gt;// Notice that now we have a different return type&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeWindow&lt;/span&gt; &lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LongValue&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;

            &lt;span class="c1"&gt;// Set fields on an existing object instead of creating a new one&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="c1"&gt;// Update mutable count value&lt;/span&gt;
            &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;changesCount&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// Reuse the same tuple and the same LongValue instance&lt;/span&gt;
            &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This idiom is commonly used in Flink libraries like Flink Gelly.&lt;/p&gt;

&lt;h1&gt;
  
  
  Use function annotations
&lt;/h1&gt;

&lt;p&gt;One more way to optimize your Flink application is to provide some information about what your user-defined functions are doing with input data. Since Flink can’t parse and understand code, you can provide crucial information that will help to build a more efficient execution plan. There are three annotations that we can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;@ForwardedFields – specifies what fields in an input value were left unchanged and are used in an output value.&lt;/li&gt;
&lt;li&gt;@NotForwardedFields – specifies fields which were not preserved in the same positions in the output.&lt;/li&gt;
&lt;li&gt;@ReadFields – specifies what fields were used to compute a result value. You should only specify fields that were used in computations and not merely copied to the output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take a look at how we can use &lt;code&gt;ForwardedFields&lt;/code&gt; annotation:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Specify that the first element is copied without any changes&lt;/span&gt;
&lt;span class="nd"&gt;@ForwardedFields&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyFunction&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="n"&gt;MapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;// Copy first field without change&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This means that the first element in an input tuple is not being changed and it is returned in the same position.&lt;/p&gt;

&lt;p&gt;If you don’t change a field, but simply move it into a different position, you can specify this with the &lt;code&gt;ForwardedFields&lt;/code&gt; annotation as well. In the next example we swap fields in an input tuple and warn Flink about this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1st element goes into the 2nd position, and 2nd element goes into the 1st position&lt;/span&gt;
&lt;span class="nd"&gt;@ForwardedFields&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0-&amp;gt;1; 1-&amp;gt;0"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SwapArguments&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="n"&gt;MapFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;// Swap elements in a tuple&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The annotations mentioned above can only be applied to functions that have one input parameter, such as &lt;code&gt;map&lt;/code&gt; or &lt;code&gt;flatMap&lt;/code&gt;. If you have two input parameters, you can use the &lt;code&gt;ForwardedFieldsFirst&lt;/code&gt; and the &lt;code&gt;ForwardedFieldsSecond&lt;/code&gt; annotations that provide information about the first and the second parameters respectively.&lt;/p&gt;

&lt;p&gt;Here how we can use these annotations in an implementation of the &lt;code&gt;JoinFunction&lt;/code&gt; interface:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Two fields from the input tuple are copied to the first and second positions of the output tuple&lt;/span&gt;
&lt;span class="nd"&gt;@ForwardedFieldsFirst&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0; 1"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// The third field from the input tuple is copied to the third position of the output tuple&lt;/span&gt;
&lt;span class="nd"&gt;@ForwardedFieldsSecond&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyJoin&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="n"&gt;JoinFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Flink also provides &lt;code&gt;NotForwardedFieldsFirst&lt;/code&gt;, &lt;code&gt;NotForwardedFieldsSecond&lt;/code&gt;, &lt;code&gt;ReadFieldsFirst&lt;/code&gt;, and &lt;code&gt;ReadFirldsSecond&lt;/code&gt; annotations for similar purposes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Select join type
&lt;/h1&gt;

&lt;p&gt;You can make your joins faster if you give Flink another hint, but before we discuss why it works, let’s talk about how Flink executes joins.&lt;/p&gt;

&lt;p&gt;When Flink is processing batch data, each machine in a cluster stores part of data. To perform a join Apache Flink needs to find all pairs of two datasets where a join condition is satisfied. To do this Flink first has to put items from both datasets that have the same key on the same machine in the cluster. There are two strategies for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repartition-Repartition strategy&lt;/strong&gt; – in this case, both datasets are partitioned by their keys and send across the network. It means that if datasets are big, it may take a significant amount of time to copy them across the network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadcast-Forward strategy&lt;/strong&gt; – in this case, one dataset is left untouched, but the second dataset is copied to every machine in the cluster that has part of the first dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are joining a small dataset with a much bigger dataset, you can use the Broadcast-Forward strategy and avoid costly partition of the first dataset. This is really easy to do:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;ds1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JoinHint&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BROADCAST_HASH_FIRST&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This hints that the first dataset is a much smaller than the second one.&lt;/p&gt;

&lt;p&gt;You can also use other join hints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BROADCAST_HASH_SECOND&lt;/strong&gt; – the second dataset is much smaller&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REPARTITION_HASH_FIRST&lt;/strong&gt; – the first dataset it a bit smaller&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REPARTITION_HASH_SECOND&lt;/strong&gt; – the second dataset is a bit smaller&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REPARTITION_SORT_MERGE&lt;/strong&gt; – repartition both datasets and use sorting and merging strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPTIMIZER_CHOOSES&lt;/strong&gt; – Flink optimizer will decide how to join datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can read more about how Flink performs joins in &lt;a href="https://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  More information
&lt;/h1&gt;

&lt;p&gt;I hope you liked this article and found it useful.&lt;/p&gt;

&lt;p&gt;I will write more articles about Flink in the near future, so stay tuned! You can read my other articles &lt;a href="https://brewing.codes/"&gt;here&lt;/a&gt;, or you can you can take a look at my Pluralsight course where I cover Apache Flink in more details: &lt;a href="http://bit.ly/understanding-flink"&gt;Understanding Apache Flink&lt;/a&gt;. Here is a &lt;a href="http://bit.ly/understanding-flink-preview"&gt;short preview&lt;/a&gt; of this course.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://brewing.codes/2017/10/17/flink-optimize/"&gt;Four ways to optimize your Flink applications&lt;/a&gt; appeared first on &lt;a href="https://brewing.codes"&gt;Brewing Codes&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>apacheflink</category>
      <category>forsyndication</category>
      <category>batching</category>
      <category>flink</category>
    </item>
    <item>
      <title>Getting started with stream processing using Apache Flink</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Mon, 09 Oct 2017 09:00:43 +0000</pubDate>
      <link>https://forem.com/mushketyk/getting-started-with-stream-processing-using-apache-flink-530</link>
      <guid>https://forem.com/mushketyk/getting-started-with-stream-processing-using-apache-flink-530</guid>
      <description>&lt;p&gt;If in your mind “Apache Flink and “streaming programming does not have a strong link you probably was not following news recently. Apache Flink took the world of Big Data by storm. Now is a perfect opportunity for a tool like this to thrive: stream processing becomes more and more prevalent in data processing, and Apache Flink presents a number of important innovations.&lt;/p&gt;

&lt;p&gt;In this article, I will show how to start writing stream processing algorithms using Apache Flink. We will read a stream of Wikipedia edits and will see how can get some meaningful data out of it. In the process, you will see how to read and write stream data, how to perform simple operations, and how to implement more complex algorithms.&lt;/p&gt;

&lt;h1&gt;
  
  
  Getting started
&lt;/h1&gt;

&lt;p&gt;I believe that if you are new to Apache Flink, it’s better to start with learning about batch processing since it is simpler and will give you a solid foundation to learning stream processing. I’ve written an introductory blog post about how to &lt;a href="https://dev.to/mushketyk/getting-started-with-batch-processing-using-apache-flink-bnh"&gt;start with batch processing&lt;/a&gt; using Apache Flink, so I encourage you to read it first.&lt;/p&gt;

&lt;p&gt;If you already know how to use batch processing in Apache Flink, stream processing does not have a lot of surprises for you. As before we will take a look at three distinct phases in your application: reading data from a source, processing data, and writing data to an external system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Flh3.googleusercontent.com%2F-ckQmoSoDqq0%2FWcwhOUEXAYI%2FAAAAAAAAGcc%2F7H4rKTdC-ZAEWn54dMPMzoLJ2MrnARabgCHMYCw%2FI%2F15065500601313.jpg%3Fresize%3D690%252C338%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Flh3.googleusercontent.com%2F-ckQmoSoDqq0%2FWcwhOUEXAYI%2FAAAAAAAAGcc%2F7H4rKTdC-ZAEWn54dMPMzoLJ2MrnARabgCHMYCw%2FI%2F15065500601313.jpg%3Fresize%3D690%252C338%26ssl%3D1" alt="Flink streaming overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are few notable differences comparing to the batch processing. First of all, in batch processing, all data is available in advance. We do not process new data even if it arrives while we the process is running.&lt;/p&gt;

&lt;p&gt;It is different in stream processing though. We read data as it is being generated and the stream of data that we need to process is potentially infinite. With this approach, we can process incoming data in almost real-time.&lt;/p&gt;

&lt;p&gt;In the stream mode, Flink will read data from and write data to different systems including Apache Kafka, Rabbit MQ, basically systems that produce and consume a constant stream of data. Notice that we can read data from HDFS or S3 as well. In this case, Apache Flink will constantly monitor a folder and will process files as they arrive.&lt;/p&gt;

&lt;p&gt;Here is how we can read data from a file in the stream mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readTextFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file/path"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that to use stream processing we need to use the &lt;code&gt;StreamExecutionEnvironment&lt;/code&gt; class instead of the &lt;code&gt;ExecutionEnvironment&lt;/code&gt;. Also methods that read data return an instance of &lt;code&gt;DataStream&lt;/code&gt; class that we will use later for data processing.&lt;/p&gt;

&lt;p&gt;We can also create finite streams from collections or arrays as in the batch processing case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;numbers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromCollection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Arrays&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;numbers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromElements&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Simple data processing
&lt;/h1&gt;

&lt;p&gt;To process a stream of items in an stream Flink provides operators similar to batch processing operators like: &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, and &lt;code&gt;mapReduce&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s implement our first stream processing example. We will read a stream of edits to Wikipedia and display items that we are interested in.&lt;/p&gt;

&lt;p&gt;First, to read edits stream, we need to use the &lt;code&gt;WikipediaEditsSource&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;edits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WikipediaEditsSource&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use it we need to call the &lt;code&gt;addSource&lt;/code&gt; method that is used to read data from various sources such as Kafka, Kinesis, RabbitMQ, etc. This method returns a stream of edits that we can now process.&lt;/p&gt;

&lt;p&gt;Let’s filter all edits that were not made by a bot and that have changed more than a thousand bytes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;edits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;FilterFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;edit&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isBotEdit&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;edit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByteDiff&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is very similar to how you can use the &lt;code&gt;filter&lt;/code&gt; method in the batch case, with the only exception that it process an infinite stream.&lt;/p&gt;

&lt;p&gt;Now the last step is to run our application. As before we need to call the &lt;code&gt;execute&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application will start to print filtered wikipedia edits until we stop it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2&amp;gt; WikipediaEditEvent{timestamp=1506499898043, channel='#en.wikipedia', title='17 FIBA Womens Melanesia Basketball Cup', diffUrl='https://en.wikipedia.org/w/index.php?diff=802608251&amp;amp;oldid=802520770', user='Malto15', byteDiff=1853, summary='/* Preliminary round */', flags=0}
7&amp;gt; WikipediaEditEvent{timestamp=1506499911216, channel='#en.wikipedia', title='User:MusikBot/StaleDrafts/Report', diffUrl='https://en.wikipedia.org/w/index.php?diff=802608262&amp;amp;oldid=802459885', user='MusikBot', byteDiff=11674, summary='Reporting 142 stale non-AfC drafts', flags=0}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Stream windows
&lt;/h1&gt;

&lt;p&gt;Notice that methods that we’ve discussed so far before all work on individual elements in a stream. It’s unlikely that we can come up with many interesting stream algorithms that can be implemented using these simple operators. Using just them it will be impossible to implement following use-cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count a number of edits that are performed every minute&lt;/li&gt;
&lt;li&gt;Count how many edits were performed by each user every ten minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s obvious that to answer these questions we need to process groups of elements. This is what stream windows are for.&lt;/p&gt;

&lt;p&gt;In the nutshell stream windows allow us to group elements in a stream and execute a user-defined function on each group. This user-defined function can return zero, one, or more elements and in this way, it creates a new stream that we can process or store in a separate system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Flh3.googleusercontent.com%2F-DOB6f7mUQDo%2FWctftzIMJTI%2FAAAAAAAAGa0%2FDSAzJvXrGB4BJfF0HCVUVDrMkF5dirwUACHMYCw%2FI%2F15065005339869.jpg%3Fresize%3D690%252C290%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi0.wp.com%2Flh3.googleusercontent.com%2F-DOB6f7mUQDo%2FWctftzIMJTI%2FAAAAAAAAGa0%2FDSAzJvXrGB4BJfF0HCVUVDrMkF5dirwUACHMYCw%2FI%2F15065005339869.jpg%3Fresize%3D690%252C290%26ssl%3D1" alt="Input stream"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can we group elements in a stream? Flink provides several options to do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tumbling window&lt;/strong&gt; – creates non-overlapping adjacent windows in a stream. We can either group elements by time (say, all elements from 10:00 to 10:05 go into one group) or by count (first 50 elements go into a separate group). For example, we can use this to answer a question like: count a number of elements in a stream for non-overlapping five-minute intervals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window&lt;/strong&gt; – similar to the tumbling window but here windows can overlap. We can use it if we need to calculate a metric for the last five minutes, but we want to display an output every minute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session window&lt;/strong&gt; – in this case, Flink groups events that occurred close in time to each other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global window&lt;/strong&gt; – in this case, Flink puts all elements to a single window. This is only useful if we define a custom trigger that defines when a window is finished.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Flh3.googleusercontent.com%2F-GloeTPTCeYE%2FWctfx_4GEuI%2FAAAAAAAAGa4%2Fs4Mf-8H3HAE7w7RzOmavKOMhFgxMR5bCACHMYCw%2FI%2F15065005489096.jpg%3Fresize%3D690%252C277%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi1.wp.com%2Flh3.googleusercontent.com%2F-GloeTPTCeYE%2FWctfx_4GEuI%2FAAAAAAAAGa4%2Fs4Mf-8H3HAE7w7RzOmavKOMhFgxMR5bCACHMYCw%2FI%2F15065005489096.jpg%3Fresize%3D690%252C277%26ssl%3D1" alt="Windows assigners"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to selecting how to assign elements to different windows, we need to select a stream type. Flink has two window types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keyed stream&lt;/strong&gt; – with this stream type Flink will partition a single stream into multiple independent streams by a key (e.g., name of a user who made an edit). When we process a window in a keyed stream a function that we define only has access to items with the same key, but working with multiple independent streams allows Flink to parallelize work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-keyed stream&lt;/strong&gt; – in this case, all elements in the stream will be processed together and our user-defined function will have access to all elements in a stream. The downside of this stream type is that it gives no parallelism and only one machine in the cluster will be able to execute our code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Flh3.googleusercontent.com%2F-T5fJVX7KzGY%2FWctfn-9EPOI%2FAAAAAAAAGaw%2F5NENoLqM7GUPFrAcpSyITV0RksNnyqmdACHMYCw%2FI%2F15065005103874.jpg%3Fresize%3D690%252C319%26ssl%3D1" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi2.wp.com%2Flh3.googleusercontent.com%2F-T5fJVX7KzGY%2FWctfn-9EPOI%2FAAAAAAAAGaw%2F5NENoLqM7GUPFrAcpSyITV0RksNnyqmdACHMYCw%2FI%2F15065005103874.jpg%3Fresize%3D690%252C319%26ssl%3D1" alt="Keyed stream"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s implement some demos using stream windows. First of all let’s find how many edits are performed on Wikipedia every minute. First we need to read a stream of edits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;edits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WikipediaEditsSource&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we need to specify that we want to separate the stream into one-minute non-overlapping windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;edits&lt;/span&gt;
    &lt;span class="c1"&gt;// Non-overlapping one-minute windows&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeWindowAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we can define a custom function that will process all elements in each one-minute window. To do this, we will use the &lt;code&gt;apply&lt;/code&gt; method and pass an implementation of the &lt;code&gt;AllWindowFunction&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;edits&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeWindowAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AllWindowFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TimeWindow&lt;/span&gt; &lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Iterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;bytesChanged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Count number of edits&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;++;&lt;/span&gt;
                &lt;span class="n"&gt;bytesChanged&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByteDiff&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;

            &lt;span class="c1"&gt;// Output a number of edits and window's end time&lt;/span&gt;
            &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Tuple3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEnd&lt;/span&gt;&lt;span class="o"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytesChanged&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;})&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Despite being a bit verbose the method is pretty straightforward, the &lt;code&gt;apply&lt;/code&gt; method receives three parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;timeWindow&lt;/strong&gt; – contains information about a window we are processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iterable&lt;/strong&gt; – iterator for elements in a single window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;collector&lt;/strong&gt; – an object that we can use to output elements into the result stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All we do here is counting a number of changes and then using the &lt;code&gt;collector&lt;/code&gt; instance to output the result of our calculation together with the end timestamp of a window.&lt;/p&gt;

&lt;p&gt;If we run this application we will see items produced by the &lt;code&gt;apply&lt;/code&gt; method printed into the output stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1&amp;gt; (Wed Sep 27 12:58:00 IST 2017,62,62016)
2&amp;gt; (Wed Sep 27 12:59:00 IST 2017,82,12812)
3&amp;gt; (Wed Sep 27 13:00:00 IST 2017,89,45532)
4&amp;gt; (Wed Sep 27 13:01:00 IST 2017,79,11128)
5&amp;gt; (Wed Sep 27 13:02:00 IST 2017,82,26582)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Keyed stream example
&lt;/h1&gt;

&lt;p&gt;Now let’s take a look at a bit more complicated example. Let’s count how many edits a user does per each ten-minutes interval. This can help to identify most active users or to find some unusual activity in the system.&lt;/p&gt;

&lt;p&gt;Of course, we could just use a non-keyed stream, iterate over all elements in a window and maintain a dictionary to track counts, but this approach won’t scale since non-keyed streams are not parallelizable. To use resources of a Flink cluster efficiently, we need to key our stream by user name, which will create multiple logical streams: one per user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;edits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WikipediaEditsSource&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="n"&gt;edits&lt;/span&gt;
    &lt;span class="c1"&gt;// Key by user name&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;KeySelector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="nl"&gt;WikipediaEditEvent:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;getUser&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// Ten-minute non-overlapping windows&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only difference here is that we use the &lt;code&gt;keyBy&lt;/code&gt; method to specify a key for our streams. Here we simply use a username as a partition key.&lt;/p&gt;

&lt;p&gt;Now when we have a keyed stream, we can apply a function that will be executed to process each window. As before we will use the &lt;code&gt;apply&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;edits&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;KeySelector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="nl"&gt;WikipediaEditEvent:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;getUser&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WindowFunction&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeWindow&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeWindow&lt;/span&gt; &lt;span class="n"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Iterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Collector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Count number of changes&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WikipediaEditEvent&lt;/span&gt; &lt;span class="n"&gt;ignored&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;changesCount&lt;/span&gt;&lt;span class="o"&gt;++;&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="c1"&gt;// Output user name and number of changes&lt;/span&gt;
            &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;changesCount&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;})&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;print&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only significant difference here is that this version of the &lt;code&gt;apply&lt;/code&gt; method has four parameters. The additional first parameter specifies a key for the logical stream that our function is processing.&lt;/p&gt;

&lt;p&gt;If we execute this application we will get a stream where each element contains a user name and a number of edits this user performed per ten-minute interval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
5&amp;gt; (InternetArchiveBot,6)
1&amp;gt; (Francis Schonken,1)
6&amp;gt; (.30.124.210,1)
1&amp;gt; (MShabazz,1)
5&amp;gt; (Materialscientist,18)
1&amp;gt; (Aquaelfin,1)
6&amp;gt; (Cote d'Azur,2)
1&amp;gt; (Daniel Cavallari,3)
5&amp;gt; (00:1:F159:6D32:2578:A6F7:AB88:C8D,2)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see some users have a Wikipedia edit spree today!&lt;/p&gt;

&lt;h1&gt;
  
  
  More information
&lt;/h1&gt;

&lt;p&gt;This was an introductory article, and there is much more to Apache Flink. I will write more articles about Flink in the near future, so stay tuned! You can read my other articles here, or you can you can take a look at my Pluralsight course where I cover Apache Flink in more details: &lt;a href="http://bit.ly/understanding-flink" rel="noopener noreferrer"&gt;Understanding Apache Flink&lt;/a&gt;. Here is a &lt;a href="http://bit.ly/understanding-flink-preview" rel="noopener noreferrer"&gt;short preview&lt;/a&gt; of this course.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://brewing.codes/2017/10/09/start-flink-streaming/" rel="noopener noreferrer"&gt;Getting started with stream processing using Apache Flink&lt;/a&gt; appeared first on &lt;a href="https://brewing.codes" rel="noopener noreferrer"&gt;Brewing Codes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>apacheflink</category>
      <category>forsyndication</category>
      <category>bigdata</category>
      <category>flink</category>
    </item>
    <item>
      <title>Getting started with batch processing using Apache Flink</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Sun, 01 Oct 2017 11:44:04 +0000</pubDate>
      <link>https://forem.com/mushketyk/getting-started-with-batch-processing-using-apache-flink-bnh</link>
      <guid>https://forem.com/mushketyk/getting-started-with-batch-processing-using-apache-flink-bnh</guid>
      <description>

&lt;p&gt;If you’ve been following software development news recently you probably heard about the new project called Apache Flink. I’ve already written about it a bit &lt;a href="https://brewing.codes/2017/09/25/flink-vs-spark/"&gt;here&lt;/a&gt; and &lt;a href="https://brewing.codes/2017/01/16/apache-flink-a-new-landmark/"&gt;here&lt;/a&gt;, but if you are not familiar with it, Apache Flink is a new generation Big Data processing tool that can process either finite sets of data (this is also called batch processing) or potentially infinite streams of data (stream processing). In terms of new features, many believe Apache Flink is a game changer and can even replace Apache Spark in the future.&lt;/p&gt;

&lt;p&gt;In this article, I’ll introduce you to how you can use Apache Flink to implement simple batch processing algorithms. We will start with setting up our development environment, and then we will see how we can load data, process a dataset, and write data back to an external system.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why batch processing?
&lt;/h1&gt;

&lt;p&gt;You might have heard that stream processing is “the new hot thing right now” and that Apache Flink is a tool for stream processing. This can pose a question, why do we need to learn how to implement batch processing applications.&lt;/p&gt;

&lt;p&gt;While it is true, that stream processing becomes more and more widespread; many tasks still require batch processing. Also if you are just getting started with Apache Flink, in my opinion, it is better to start with batch processing since it is simpler and in a way resembles working with a database. Once you’ve covered batch processing, you can learn about stream processing where Apache Flink really shines!&lt;/p&gt;

&lt;h1&gt;
  
  
  How to follow examples
&lt;/h1&gt;

&lt;p&gt;If you want to implement some Apache Flink applications yourself, first you need to create a Flink project. In this article, we are going to write applications in Java, but you can also write Flink application in Scala, Python or R.&lt;/p&gt;

&lt;p&gt;To create a Flink Java project execute the following command:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mvn archetype:generate \
      -DarchetypeGroupId=org.apache.flink \
      -DarchetypeArtifactId=flink-quickstart-java \
      -DarchetypeVersion=1.3.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;After you enter group id, artifact id, and a project version this command will create the following project structure:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
â”œâ”€â”€ pom.xml
â””â”€â”€ src
    â””â”€â”€ main
        â”œâ”€â”€ java
        â”‚Â Â  â””â”€â”€ flinkProject
        â”‚Â Â  â”œâ”€â”€ BatchJob.java
        â”‚Â Â  â”œâ”€â”€ SocketTextStreamWordCount.java
        â”‚Â Â  â”œâ”€â”€ StreamingJob.java
        â”‚Â Â  â””â”€â”€ WordCount.java
        â””â”€â”€ resources
            â””â”€â”€ log4j.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The most important here is the massive &lt;code&gt;pom.xml&lt;/code&gt; that specifies all the necessary dependencies. Automatically created Java classes are examples of some simple Flink applications that you can take a look at, but we don’t need them for our purposes.&lt;/p&gt;

&lt;p&gt;To start developing your first Flink application create a class with the &lt;code&gt;main&lt;/code&gt; method like this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class FlinkProgram {

    public static void main(String[] args) throws Exception {
        // Create Flink execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // We will write our code here

        // Start Flink application
        env.execute();
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There is nothing special about this &lt;code&gt;main&lt;/code&gt; method. All we have to do is to add some boilerplate code.&lt;/p&gt;

&lt;p&gt;First, we need to create a Flink execution environment that will behave differently if you run it on a local machine or in a Flink cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On a local machine, it will create a full-fledged Flink cluster with multiple local nodes. This is a good way to test how your application will work in a realistic environment&lt;/li&gt;
&lt;li&gt;On a Flink cluster, it won’t create anything but will use existing cluster resources instead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternatively, you could create a collection environment like this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will create a Flink execution environment that instead of running Flink application on a local cluster will emulate all operations using in-memory collections in a single Java process. Your application will run faster, but this environment some subtle differences from a local cluster with multiple nodes.&lt;/p&gt;

&lt;h1&gt;
  
  
  Where do we start?
&lt;/h1&gt;

&lt;p&gt;Before we can do anything, we need to read data into Apache Flink. We can read data from numerous systems including local filesystem, S3, HDFS, HBase, Cassandra, etc. No matter where we read a dataset from, Apache Flink allows us to work with data in a uniform way using the &lt;code&gt;DataSet&lt;/code&gt; class:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Integer&amp;gt; numbers = ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;All items in a dataset should have the same type. The single generics parameter specifies a type of the data that is stored in a dataset.&lt;/p&gt;

&lt;p&gt;To read data from a file, we can use the &lt;code&gt;readTextFile&lt;/code&gt; method that will read lines in a file line by line and return a dataset of type &lt;code&gt;String&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;String&amp;gt; lines = env.readTextFile("path/to/file.txt");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you specify a file path like this, Flink will attempt to read a local file. If you want to read a file from HDFS you need to specify the &lt;code&gt;hdfs://&lt;/code&gt; protocol:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;env.readCsvFile("hdfs:///path/to/file.txt")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Flink also has support for CSV files, but in this case, it won’t return a dataset of strings. It will try to parse every line and return a dataset of &lt;code&gt;Tuple&lt;/code&gt; instances:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Tuple2&amp;lt;Long, String&amp;gt;&amp;gt; lines = env.readCsvFile("data.csv")
                .types(Long.class, String.class);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Tuple2&lt;/code&gt; is a class that stores an immutable pair of two fields, but there are other classes like &lt;code&gt;Tuple0&lt;/code&gt;, &lt;code&gt;Tuple1&lt;/code&gt;, &lt;code&gt;Tuple3&lt;/code&gt;, up to &lt;code&gt;Tuple25&lt;/code&gt; that store from zero to twenty-five fields. Later we will see how to work with these classes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;types&lt;/code&gt; method specifies types and number of columns in a CSV file, so Flink could read a parse them.&lt;/p&gt;

&lt;p&gt;We can also create small datasets that are very good for small experiments and unit tests:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create from a list
DataSet&amp;lt;String&amp;gt; letters = env.fromCollection(Arrays.asList("a", "b", "c"));
// Create from an array
DataSet&amp;lt;Integer&amp;gt; numbers = env.fromElements(1, 2, 3, 4, 5);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;A question that you may ask is what data we can store in a DataSet? Not every Java type can be used in a dataset, and there are four different categories of types that you can use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in Java types and POJO classes&lt;/li&gt;
&lt;li&gt;Flink Tuples and Scala case classes&lt;/li&gt;
&lt;li&gt;Values – these are special mutable wrappers for Java primitive types that you can use to increase performance (I’ll write about this in one of the upcoming articles)&lt;/li&gt;
&lt;li&gt;Implementations of Hadoop Writable interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Processing data with Apache Flink
&lt;/h1&gt;

&lt;p&gt;Now to the data processing part! How do you implement an algorithm for processing your data? To do this, you can use a number of operations that resemble Java 8 streams operations, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;map&lt;/strong&gt; – converts items in a dataset using a user-defined function. Every input element is converted into exactly one output element&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;filter&lt;/strong&gt; – filters items in a dataset according to a user-defined function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flatMap&lt;/strong&gt; – similar to the &lt;strong&gt;map&lt;/strong&gt; operator, but allows returning zero, one or many elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;groupBy&lt;/strong&gt; – groups elements by a key. Similar to the &lt;code&gt;GROUP BY&lt;/code&gt; operator in SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;project&lt;/strong&gt; – select specified fields in a dataset of tuples, similar to the &lt;code&gt;SELECT&lt;/code&gt; operator from SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reduce&lt;/strong&gt; – combines elements in a dataset into a single value using a user-defined function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep in mind that the biggest difference between Java streams and these operations is that Java 8, works with data in memory and can access local data, while Flink works with data on a cluster in a distributed environment.&lt;/p&gt;

&lt;p&gt;Let’s take a look at a simple example that uses these operations. The following example is very straightforward. It creates a dataset of numbers, which squares every number and filters out all odd numbers.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Create a dataset of numbers
DataSet&amp;lt;Integer&amp;gt; numbers = env.fromElements(1, 2, 3, 4, 5, 6, 7);

// Square every number
DataSet&amp;lt;Integer&amp;gt; result = numbers.map(new MapFunction&amp;lt;Integer, Integer&amp;gt;() {
    @Override
    public Integer map(Integer integer) throws Exception {
        return integer * integer;
    }
})
// Leave only even numbers
.filter(new FilterFunction&amp;lt;Integer&amp;gt;() {
    @Override
    public boolean filter(Integer integer) throws Exception {
        return integer % 2 == 0;
    }
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you have any experience with Java 8 you are probably wondering why I don’t use lambdas here. We can use lambdas here but it can cause some complications, as I’ve written &lt;a href="https://brewing.codes/2017/01/31/using-apache-flink-with-java-8/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Saving data back
&lt;/h1&gt;

&lt;p&gt;After we’ve finished processing our data it would make sense to save the result of our hard work. Flink can store data into a number of third-party systems such as HDFS, S3, Cassandra, etc.&lt;/p&gt;

&lt;p&gt;For example, to write data to a file, we need to use the &lt;code&gt;writeAsText&lt;/code&gt; method from the &lt;code&gt;DataSet&lt;/code&gt; class:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Integer&amp;gt; ds = ...

ds.writeAsText("path/to/file");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For debugging/testing purposes Flink can write data to standard output or to standard output:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Integer&amp;gt; ds = ...

// Output dataset to the standard output
ds.print();

// Output dataset to the standard err
ds.printToErr()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  More complicated example
&lt;/h1&gt;

&lt;p&gt;To implement some meaningful algorithms we need to first download a &lt;a href="https://grouplens.org/datasets/movielens/"&gt;Grouplens movies dataset&lt;/a&gt;. It contains several CSV files with information about movies and movie ratings. We are going to work with the &lt;code&gt;movies.csv&lt;/code&gt; file from this dataset which contains a list of all movies and looks like this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It has three columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;movieId&lt;/strong&gt; – a unique movie id for a movie in this dataset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt; – a title of the movie&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;genres&lt;/strong&gt; – a “|” separated list of genres for each movie&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can now load this CSV file in Apache Flink and perform some meaningful processing. Here we will load a file from a local filesystem, while in a realistic environment you would read a much bigger dataset and it would probably reside in a distributed system, such as S3 or HDFS.&lt;/p&gt;

&lt;p&gt;In this demo let’s find all movies of the “Action” genre. Here is a code snippet that that does this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Load dataset of movies
DataSet&amp;lt;Tuple3&amp;lt;Long, String, String&amp;gt;&amp;gt; lines = env.readCsvFile("movies.csv")
            .ignoreFirstLine()
            .parseQuotedStrings('"')
            .ignoreInvalidLines()
            .types(Long.class, String.class, String.class);

DataSet&amp;lt;Movie&amp;gt; movies = lines.map(new MapFunction&amp;lt;Tuple3&amp;lt;Long,String,String&amp;gt;, Movie&amp;gt;() {
    @Override
    public Movie map(Tuple3&amp;lt;Long, String, String&amp;gt; csvLine) throws Exception {
        String movieName = csvLine.f1;
        String[] genres = csvLine.f2.split("\\|");
        return new Movie(movieName, new HashSet&amp;lt;&amp;gt;(Arrays.asList(genres)));
    }
});
DataSet&amp;lt;Movie&amp;gt; filteredMovies = movies.filter(new FilterFunction&amp;lt;Movie&amp;gt;() {
    @Override
    public boolean filter(Movie movie) throws Exception {
        return movie.getGenres().contains("Action");
    }
});

filteredMovies.writeAsText("output.txt");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let’s break it down. First, we read a CSV file using the &lt;code&gt;readCsvFile&lt;/code&gt; method:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Tuple3&amp;lt;Long, String, String&amp;gt;&amp;gt; lines = env.readCsvFile("movies.csv")
    // ignore CSV header
    .ignoreFirstLine()
    // Set strings quotes character
    .parseQuotedStrings('"')
    // Ignore invalid lines in the CSV file
    .ignoreInvalidLines()
    // Specify types of columns in the CSV file
    .types(Long.class, String.class, String.class);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Using helper methods, we specify how to parse strings in the CSV file and that we need to skip the first line. In the last line, we specify a type of each column in the CSV file and Flink will parse data for us.&lt;/p&gt;

&lt;p&gt;Now when we have a dataset loaded in a Flink cluster we can do some data processing. First, we parse a list of genres for every movie using the &lt;code&gt;map&lt;/code&gt; method:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Movie&amp;gt; movies = lines.map(new MapFunction&amp;lt;Tuple3&amp;lt;Long,String,String&amp;gt;, Movie&amp;gt;() {
    @Override
    public Movie map(Tuple3&amp;lt;Long, String, String&amp;gt; csvLine) throws Exception {
        String movieName = csvLine.f1;
         String[] genres = csvLine.f2.split("\\|");
         return new Movie(movieName, new HashSet&amp;lt;&amp;gt;(Arrays.asList(genres)));
    }
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To transform every movie we need to implement the &lt;code&gt;MapFunction&lt;/code&gt; that will receive every CSV record as a &lt;code&gt;Tuple3&lt;/code&gt; instance and will convert it into the &lt;code&gt;Movie&lt;/code&gt; POJO class:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Movie {
    private String name;
    private Set&amp;lt;String&amp;gt; genres;

    public Movie(String name, Set&amp;lt;String&amp;gt; genres) {
        this.name = name;
        this.genres = genres;
    }

    public String getName() {
        return name;
    }

    public Set&amp;lt;String&amp;gt; getGenres() {
        return genres;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you recall the structure of the CSV file, the second column contains a name of a movie and the third column contains a list of genres. Hence we access these columns using fields &lt;code&gt;f1&lt;/code&gt; and &lt;code&gt;f2&lt;/code&gt; respectively.&lt;/p&gt;

&lt;p&gt;Now when we have a dataset of movies we can implement the core part of our algorithm and filter all action movies:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DataSet&amp;lt;Movie&amp;gt; filteredMovies = movies.filter(new FilterFunction&amp;lt;Movie&amp;gt;() {
    @Override
    public boolean filter(Movie movie) throws Exception {
        return movie.getGenres().contains("Action");
    }
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will only return movies that contain “Action” in the set of genres.&lt;/p&gt;

&lt;p&gt;Now the last step is very straightforward; we store result data into a file:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filteredMovies.writeAsText("output.txt");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This simply stores the result data into a local text file, but as with the &lt;code&gt;readTextFile&lt;/code&gt; method, we could write this file into HDFS or S3 by specifying a protocol like &lt;code&gt;hdfs://&lt;/code&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  More information
&lt;/h1&gt;

&lt;p&gt;This was an introductory article, and there is much more to Apache Flink. I will write more articles about Flink in the near future, so stay tuned! You can read my other &lt;a href="https://brewing.codes/tag/flink/"&gt;articles here&lt;/a&gt;, or you can you can take a look at my Pluralsight course where I cover Apache Flink in more details: &lt;a href="http://bit.ly/understanding-flink"&gt;Understanding Apache Flink&lt;/a&gt;. Here is a &lt;a href="http://bit.ly/understanding-flink-preview"&gt;short preview&lt;/a&gt; of this course.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://brewing.codes/2017/10/01/start-flink-batch/"&gt;Getting started with batch processing using Apache Flink&lt;/a&gt; appeared first on &lt;a href="https://brewing.codes"&gt;Brewing Codes&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>apacheflink</category>
      <category>batch</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Python Data Structures Idioms</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Fri, 29 Sep 2017 16:38:32 +0000</pubDate>
      <link>https://forem.com/mushketyk/python-data-structures-idioms-6ae</link>
      <guid>https://forem.com/mushketyk/python-data-structures-idioms-6ae</guid>
      <description>&lt;p&gt;Significant portion of our time we as a developers spend writing code that manipulates basic data structures: traverse a list, create a map, filter elements in a collection. Therefore it is important to know how effectively do it in Python and make your code more readable and efficient.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using lists
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Iterate over a list
&lt;/h2&gt;

&lt;p&gt;There are many ways to iterate over a list in Python. And the simplest way would be just to maintain current position in list and increment it on each iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## SO WRONG
&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but Python provides a more convenient way to do using &lt;a href="https://docs.python.org/2/library/functions.html#range"&gt;&lt;strong&gt;range&lt;/strong&gt;&lt;/a&gt; function. &lt;strong&gt;range&lt;/strong&gt; function can be used to generate numbers from 0 to N and this can be used as an analog of a &lt;strong&gt;for&lt;/strong&gt; loop in C:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;## STILL WRONG
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this is more concise, there is a better way to do it since Python let us iterate over a list directly, similarly to &lt;strong&gt;foreach&lt;/strong&gt; loops in other languages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Iterate over a list in reverse order
&lt;/h2&gt;

&lt;p&gt;How can we iterate a list in the reverse order? One way to do it would be to use an unreadable 3 arguments version of the &lt;strong&gt;range&lt;/strong&gt; function and provide position of the last element in a list (first argument), position of an element before the first element in the list (second argument) and negative step to go in reverse order (third argument):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But as you've may already guessed Python should offer a much better way to do it. We can just use &lt;a href="https://docs.python.org/2/library/functions.html#reversed"&gt;&lt;strong&gt;reversed&lt;/strong&gt;&lt;/a&gt; function in a &lt;strong&gt;for&lt;/strong&gt; loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Access the last element
&lt;/h2&gt;

&lt;p&gt;A commonly used idiom to access the last element in a list would be: get length of a list, subtract 1 from it, use result number as a position of the last element:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is cumbersome in Python since it supports negative indexes to access elements from the end of the list. So -1 is the last element:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negative indexes can also be used to access a next to last element and so on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use sequence unpacking
&lt;/h2&gt;

&lt;p&gt;A common way to extract values from a list to multiple variables in other programming languages would be to use indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;l2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;l3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But Python supports sequence unpacking that lets us to extract values from a list to multiple variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="n"&gt;l1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l1&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l2&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;l3&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use lists comprehensions
&lt;/h2&gt;

&lt;p&gt;Let's say we want to filter all grades for a movie posted by users of age 18 or bellow.&lt;/p&gt;

&lt;p&gt;How many times did you write code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="n"&gt;under_18_grades&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;grade&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grades&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;grade&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;under_18_grades&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grade&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do it no more in Python and use list comprehensions with &lt;strong&gt;if&lt;/strong&gt; statement instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="n"&gt;under_18_grades&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;grade&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;grade&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;grades&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;grade&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use enumerate function
&lt;/h2&gt;

&lt;p&gt;Sometimes you need to iterate over a list and keep track of a position of each element. Say, if you need to display a menu items in a shell you can simply use the &lt;strong&gt;range&lt;/strong&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;menu_items&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;menu_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;menu_items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="s"&gt;"{}. {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;menu_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better way to do it would be to use &lt;a href="https://docs.python.org/2/library/functions.html#enumerate"&gt;&lt;strong&gt;enumerate&lt;/strong&gt;&lt;/a&gt; function. It is a iterator that returns pairs each of which contains position of an element and the element itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;menu_items&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;menu_items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="s"&gt;"{}. {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;menu_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use keys to sort
&lt;/h2&gt;

&lt;p&gt;A typical way to sort elements in other programming languages is to provide a function that compares two objects along with a collection to sort. In Python it would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;people&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'John'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Joe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_people&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;cmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compare_people&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'John'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Joe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this is not the best way to do it. Since all we need to do to compare two instances of &lt;strong&gt;Person&lt;/strong&gt; class is to compare values of their &lt;strong&gt;age&lt;/strong&gt; field. Why should we write a complex compare function for this?&lt;/p&gt;

&lt;p&gt;Specifically for this case &lt;a href="https://docs.python.org/2/library/functions.html#sorted"&gt;&lt;strong&gt;sorted&lt;/strong&gt;&lt;/a&gt; function accepts &lt;strong&gt;key&lt;/strong&gt; function that is used to extract a key that will be used to compare two instances of an object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'John'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Joe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use all/any functions
&lt;/h2&gt;

&lt;p&gt;If you want to check if all or any value in a collection is True one way would be iterate over a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;all_true&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But Python already has &lt;a href="https://docs.python.org/2/library/functions.html#all"&gt;&lt;strong&gt;all&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://docs.python.org/2/library/functions.html#any"&gt;&lt;strong&gt;any&lt;/strong&gt;&lt;/a&gt; functions for that. &lt;strong&gt;all&lt;/strong&gt; returns True if all values in an iterable passed to it are True, while &lt;strong&gt;any&lt;/strong&gt; returns True if at least one of values passed to it is True:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check if all items comply with a certain condition, you can convert a list of arbitrary objects to a list of booleans using list comprehension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can pass a generator (just omit square braces around the list comprehension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only this will save you two keystrokes it will also omit creation of an intermediate list (more about this later).&lt;/p&gt;

&lt;h2&gt;
  
  
  Use slicing
&lt;/h2&gt;

&lt;p&gt;You can take part of a list using a technique called slicing. Instead of providing a single index in a square brackets when accessing a list you can provide the following three values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of these parameters are optional and you can get different parts of a list if you omit some of them. If only start position is provided it will return all elements in a list starting from the specified index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If only end position is provided slicing will return all elements up to the provided position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also get part of a list between two indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default step in slicing is equal to one which mean that all elements between start and end positions are returned. If you want to get only every second element or every third element you need to provide a step value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Do not create unnecessary objects
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Use xrange
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;range&lt;/strong&gt; is a useful function if you need to generate consistent integer values in a range, but it has one drawback: it returns a list with all generated values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
# Returns a too big list
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solution here is to use &lt;a href="https://docs.python.org/2/library/functions.html#xrange"&gt;&lt;strong&gt;xrange&lt;/strong&gt;&lt;/a&gt; function. It immediately return an iterator instead of creating a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
# Returns an iterator
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;xrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The drawback of &lt;strong&gt;xrange&lt;/strong&gt; comparing to the &lt;strong&gt;range&lt;/strong&gt; function is that it's output can be iterated only once.&lt;/p&gt;

&lt;h3&gt;
  
  
  New in Python 3
&lt;/h3&gt;

&lt;p&gt;In Python 3 &lt;strong&gt;xrange&lt;/strong&gt; was removed and &lt;strong&gt;range&lt;/strong&gt; function behaves like &lt;strong&gt;xrange&lt;/strong&gt; in Python 2.x. If you need to iterate over an output of &lt;strong&gt;range&lt;/strong&gt; in Python 3 multiple times you can convert its output in to a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use izip
&lt;/h2&gt;

&lt;p&gt;If you need to generate pairs from elements in two collections, one way to do it would be to use the &lt;strong&gt;zip&lt;/strong&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Joe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Kate'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Creates a list
&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s"&gt;'Joe'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Kate'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead we can use the &lt;a href="https://docs.python.org/2/library/itertools.html#itertools.izip"&gt;&lt;strong&gt;izip&lt;/strong&gt;&lt;/a&gt; function that would return a return an iterator instead of creating a new list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;izip&lt;/span&gt;
&lt;span class="c1"&gt;# Creates an iterator
&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;izip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  New in Python 3
&lt;/h3&gt;

&lt;p&gt;In Python 3 &lt;strong&gt;izip&lt;/strong&gt; function is removed and &lt;strong&gt;zip&lt;/strong&gt; behaves like &lt;strong&gt;izip&lt;/strong&gt; function in Python 2.x.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use generators
&lt;/h2&gt;

&lt;p&gt;Lists comprehensions is a powerful tool in Python, but since it can use extensive amount of memory since each list comprehension will create a new list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;
&lt;span class="c1"&gt;# Original list
&lt;/span&gt;&lt;span class="n"&gt;lst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# This will create a new list
&lt;/span&gt;&lt;span class="n"&gt;lst_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# This will create another list
&lt;/span&gt;&lt;span class="n"&gt;lst_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst_1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A way to avoid this is to use generators instead of list comprehensions. The difference in syntax is minimal: you should use parenthesis instead of square brackets, but the difference is crucial. The following example does not create any intermediate lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;
&lt;span class="c1"&gt;# Original list
&lt;/span&gt;&lt;span class="n"&gt;lst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Won't create a new list
&lt;/span&gt;&lt;span class="n"&gt;lst_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Won't create another list
&lt;/span&gt;&lt;span class="n"&gt;lst_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is especially handy if you may need to process only part of the result collection to get a result, say to find a first element that match a certain condition.&lt;/p&gt;

&lt;h1&gt;
  
  
  Use dictionaries idiomatically
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Avoid using keys() function
&lt;/h2&gt;

&lt;p&gt;If you need to iterate over keys in a dictionary you may be inclined to use &lt;strong&gt;keys&lt;/strong&gt; function on a hash map:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there is a better way, you use iterate over a dictionary it performs iteration over its keys, so you can do simply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only it will save you some typing it will prevent from creating a copy of all keys in a dict as &lt;strong&gt;keys&lt;/strong&gt; method does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Iterate over keys and values
&lt;/h2&gt;

&lt;p&gt;If you use &lt;strong&gt;keys&lt;/strong&gt; method it's really easy to iterate keys and values in a dictionary like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;#WRONG
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there is a better way. You can use &lt;strong&gt;items&lt;/strong&gt; function that returns key-value pairs from a dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only this method is more concise, it's a more efficient too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use dictionaries comprehension
&lt;/h2&gt;

&lt;p&gt;One way to create a dictionary is to assign values to it one-by-one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;
&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead you can use a dictionary comprehension to turn this into a one liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;person&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Use collections module
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Use namedtuple
&lt;/h2&gt;

&lt;p&gt;If you need a struct like type you may just define a class with an &lt;strong&gt;&lt;strong&gt;init&lt;/strong&gt;&lt;/strong&gt; method and a bunch of fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However &lt;a href="https://docs.python.org/2/library/collections.html"&gt;&lt;strong&gt;collections&lt;/strong&gt;&lt;/a&gt; module from Python library provides a &lt;a href="https://docs.python.org/2/library/collections.html#collections.namedtuple"&gt;&lt;strong&gt;namedtuple&lt;/strong&gt;&lt;/a&gt; type that turns this into a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;
&lt;span class="n"&gt;Point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;namedtuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Point'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition &lt;strong&gt;namedtuple&lt;/strong&gt; implements &lt;strong&gt;__str__&lt;/strong&gt;, &lt;strong&gt;__repr__&lt;/strong&gt;, and &lt;strong&gt;__eq__&lt;/strong&gt; methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;Point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use defaultdict
&lt;/h2&gt;

&lt;p&gt;If we need to count a number of times an element is encountered in a collection, we can use a common approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;collections&lt;/strong&gt; module provides a very handy class for this case which is called &lt;strong&gt;defaultdict&lt;/strong&gt;. It's constructor accepts a function that will be used to calculate a value for a non-existing key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'key'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To rewrite counting example we can pass the &lt;strong&gt;int&lt;/strong&gt; function to &lt;strong&gt;defaultdict&lt;/strong&gt; which returns zero if called with no arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;defaultdict&lt;/strong&gt; is useful when you need to create any kind of grouping of items in a collection, but you just need to get count of elements you may use &lt;strong&gt;Counter&lt;/strong&gt; class instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RIGHT
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lst&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
&lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This post was originally posted at &lt;a href="https://brewing.codes/2017/01/12/python-data-structures-idioms/"&gt;Brewing Codes&lt;/a&gt; blog.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Apache Spark vs. Apache Flink</title>
      <dc:creator>Ivan Mushketyk</dc:creator>
      <pubDate>Wed, 27 Sep 2017 14:40:21 +0000</pubDate>
      <link>https://forem.com/mushketyk/apache-spark-vs-apache-flink-co9</link>
      <guid>https://forem.com/mushketyk/apache-spark-vs-apache-flink-co9</guid>
      <description>&lt;p&gt;If you look at &lt;a href="http://mattturck.com/wp-content/uploads/2017/05/Matt-Turck-FirstMark-2017-Big-Data-Landscape.png" rel="noopener noreferrer"&gt;this image&lt;/a&gt; with a list of Big Data tools it may seem that all possible niches in this field are already occupied. With so much competition it should be very tough to come up with a groundbreaking technology.&lt;/p&gt;

&lt;p&gt;Apache Flink creators have a different thought about this. It started as a research project called &lt;a href="http://stratosphere.eu/" rel="noopener noreferrer"&gt;Stratosphere&lt;/a&gt;. Stratosphere was forked, and this fork became what we know as Apache Flink. In 2014 it was accepted as an Apache Incubator project, and just a few months later it became a top-level Apache project. At the time of this writing, the project has almost twelve thousand commits and more than 300 contributors.&lt;/p&gt;

&lt;p&gt;Why is there so much attention? This is because Apache Flink was called a new generation Big Data processing framework and has enough innovations under its belt to replace Apache Spark and become the new de-facto tool for batch and stream processing.&lt;/p&gt;

&lt;p&gt;Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Or is Apache Flink just a new gimmick? This article will attempt to give you answers to these and other questions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Apache Spark is an old news
&lt;/h1&gt;

&lt;p&gt;Unless you have been living under a rock for the last couple of years, you have heard about Apache Spark. It looks like every modern system that does any kind data processing is using Apache Spark in one way or another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fspark.apache.org%2Fimages%2Fspark-stack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fspark.apache.org%2Fimages%2Fspark-stack.png" alt="Spark architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a long time, Spark was the latest and greatest tool in this area. It delivered some impressive features comparing to its predecessors such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Impressive speed - it is ten times faster than Hadoop if data is processed on a disk and up to 100 times faster if data is processed in memory.&lt;/li&gt;
&lt;li&gt;Simpler Directed acyclic graph model - instead of defining your data processing jobs using rigid MapReduce framework Spark allows to define a graph of tasks that can implement complex data processing algorithms&lt;/li&gt;
&lt;li&gt;Stream processing - with the advent of new technologies such as the Internet of Things it is not enough to simply to process a huge amount of data. Now we need processing a huge amount of data as it arrives in real time. This is why Apache Spark has introduced stream processing that allows to process a potentially infinite stream of data.&lt;/li&gt;
&lt;li&gt;Rich set of libraries - In addition to its core features Apache Spark provides powerful libraries for machine learning, graph processing, and performing SQL queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To get a better idea of how you write applications with Apache Spark, let's take a look at how you can implement a simple Word Count application that would count how many times each word was used in a text document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Read file&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;file&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;textFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file/path"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;wordCount&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
  &lt;span class="c1"&gt;// Extract words from every line&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="c1"&gt;// Convert words to pairs&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="c1"&gt;// Count how many times each word was used&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;reduceByKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you know Scala, this code should seem straightforward and is similar to working with regular collections. First we read a list of lines from a file located in "file/path". This file can be either a local file or a file in HDFS or S3.&lt;/p&gt;

&lt;p&gt;Then every line is split into a list of words using the &lt;code&gt;flatMap&lt;/code&gt; method that simply splits a string by the space symbol. Then to implement the word counting we use the &lt;code&gt;map&lt;/code&gt; method to convert every word into a pair where the first element of the pair is a word from the input text and the second element is simply a number one.&lt;/p&gt;

&lt;p&gt;Then the last step simply counts how many times each word was used by summing up numbers for all pairs for the same word.&lt;/p&gt;

&lt;p&gt;Apache Spark seems like a great and versatile tool. But what does Apache Flink brings to the table?&lt;/p&gt;

&lt;h1&gt;
  
  
  New kid on the block
&lt;/h1&gt;

&lt;p&gt;At first glance, there does not seem to be many differences. The architecture diagram looks very similar:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.0%2Ffig%2Fstack.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fci.apache.org%2Fprojects%2Fflink%2Fflink-docs-release-1.0%2Ffig%2Fstack.png" alt="Apache Flink architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you take a look at the code example for the Word Count application for Apache Flink you would see that there is almost no difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;file&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;readTextFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file/path"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;counts&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;groupBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Few notable differences, is that in this case we need to use the &lt;code&gt;readTextFile&lt;/code&gt; method instead of the &lt;code&gt;textFile&lt;/code&gt; method and that we need to use a pair of methods: &lt;code&gt;groupBy&lt;/code&gt; and &lt;code&gt;sum&lt;/code&gt; instead of &lt;code&gt;reduceByKey&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So what is all the fuss about? Apache Flink may not have any visible differences on the outside, but it definitely has enough innovations, to become the next generation data processing tool. Here are just some of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implements actual streaming processing - when you process a stream in Apache Spark, it treats it as many small batch problems, hence making stream processing a special case. Apache Flink, in contrast, treats batch processing as a special and does not use micro batching.&lt;/li&gt;
&lt;li&gt;Better support for cyclical and iterative processing - Flink provides some additional operations that allow implementing cycles in your streaming application and algorithms that need to perform several iterations on batch data.&lt;/li&gt;
&lt;li&gt;Custom memory management - Apache Flink is  a Java application, but it does not rely entirely on JVM garbage collector. It implements custom memory manager that stores data to process in byte arrays. This allows to reduce the load on a garbage collector and increase performance. You can read about it in this &lt;a href="https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lower latency and higher throughput - multiple tests done by third parties &lt;a href="https://jobs.zalando.com/tech/blog/apache-showdown-flink-vs.-spark/?gh_src=4n3gxh1" rel="noopener noreferrer"&gt;suggest&lt;/a&gt; that Apache Flink has lower latency and higher throughput than its competitors.&lt;/li&gt;
&lt;li&gt;Powerful windows operators - when you need to process a stream of data in most cases you need to apply a function to a finite group of elements in a stream. For example, you may need to count how many clicks your application has received in each five-minute interval, or you may want to know what was the most popular tweet on Twitter in each ten-minute interval. While Spark supports some of these use-cases, Apache Flink provides a vastly more powerful set of operators for stream processing.&lt;/li&gt;
&lt;li&gt;Implements lightweight distributed snapshots - this allows Apache Flink to provide low overhead and only-once processing guarantees in stream processing, without using micro batching as Spark does.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What to choose
&lt;/h1&gt;

&lt;p&gt;So, you are working on a new project, and you need to pick a software for it? What should use? Spark? Flink? &lt;/p&gt;

&lt;p&gt;Of course, there is no right or wrong answer here. If you need to do complex stream processing, then I would recommend using Apache Flink. It has better support for stream processing and some significant improvements.&lt;/p&gt;

&lt;p&gt;If you don't need bleeding edge stream processing features and want to stay on the safe side, it may be better to stick with Apache Spark. It is a more mature project it has a bigger user base, more training materials, and more third-party libraries. But keep in mind that Apache Flink is closing this gap by the minute. More and more projects are choosing Apache Flink as it becomes a more mature project.&lt;/p&gt;

&lt;p&gt;If on the other hand, you like to experiment with the latest technology, you definitely need to give Apache Flink a shot.&lt;/p&gt;

&lt;p&gt;Does all this mean that Apache Spark is obsolete and in a couple of years we all are going to use Apache Flink? The answer may surprise you. While Flink has some impressive features, Spark is not staying the same. For example, Apache Spark introduced custom memory management in 2015 with the release of project &lt;a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html" rel="noopener noreferrer"&gt;Tungsten&lt;/a&gt;, and since then it has been adding features that were first introduced by Apache Flink. The winner is not decided yet.&lt;/p&gt;

&lt;h1&gt;
  
  
  More information
&lt;/h1&gt;

&lt;p&gt;In the upcoming blog posts I will write more about how you can use Apache Flink for batch and stream processing, so stay tuned!&lt;/p&gt;

&lt;p&gt;If you want to know more about Apache Flink, you can take a look at my Pluralsight course where I cover Apache Flink in more details: &lt;a href="http://bit.ly/understanding-flink" rel="noopener noreferrer"&gt;Understanding Apache Flink&lt;/a&gt;. Here is a &lt;a href="http://bit.ly/understanding-flink-preview" rel="noopener noreferrer"&gt;short preview&lt;/a&gt; of this course.&lt;/p&gt;

&lt;p&gt;This post was originally posted at &lt;a href="https://brewing.codes/2017/09/25/flink-vs-spark/" rel="noopener noreferrer"&gt;Brewing Codes&lt;/a&gt; blog.&lt;/p&gt;

</description>
      <category>flink</category>
      <category>spark</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
