<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: lukaszkuczynski</title>
    <description>The latest articles on Forem by lukaszkuczynski (@lukaszkuczynski).</description>
    <link>https://forem.com/lukaszkuczynski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F17768%2F9d527077-50d4-4ff0-ae0f-19bf4a105099.jpeg</url>
      <title>Forem: lukaszkuczynski</title>
      <link>https://forem.com/lukaszkuczynski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lukaszkuczynski"/>
    <language>en</language>
    <item>
      <title>Spark is Pandas on steroids</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Tue, 17 Sep 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/spark-is-pandas-on-steroids-3j4g</link>
      <guid>https://forem.com/lukaszkuczynski/spark-is-pandas-on-steroids-3j4g</guid>
      <description>&lt;p&gt;Read it if you fell in love in Pandas, and started to think about whether parallel processing can be easy&lt;/p&gt;

&lt;h2&gt;
  
  
  Who are you
&lt;/h2&gt;

&lt;p&gt;Let me guess. You are a data scientist who runs preprocessing and modeling on your machine using pandas and other cool libraries. Or you are just a mature software engineer who once heard about parallel processing but has never seen a real use case for it. Even if you don’t belong to either group, think about it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;why do I need parallel processing?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Why might it concern you, even if you did not realize it before? Let me share my experience with you. Lately, we had a pipeline where &lt;code&gt;fbprophet&lt;/code&gt; was used to predict several thousand series. It took ages to complete the whole process. A thought: maybe running it on some fast Azure compute will speed it up? Yes, but it is still hours of waiting. So what is the bottleneck here? &lt;code&gt;fbprophet&lt;/code&gt; and all of that processing is a single-machine solution. You cannot just spread it across nodes to be processed in parallel. But it would be nice, huh? You say: these 10k rows go to worker #1, the other 10k rows go to worker #2, and so on. Want to make it happen?&lt;/p&gt;
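The chunking scheme described above - splitting rows into batches and farming them out to workers - is exactly what a cluster scheduler automates. As a toy single-machine illustration (this is not the actual fbprophet pipeline; `process_chunk` is a hypothetical stand-in for the per-series work), the same idea can be sketched with Python's own process pool:

```python
from concurrent.futures import ProcessPoolExecutor

def chunked(rows, size):
    """Split the input into fixed-size batches, one per worker."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def process_chunk(chunk):
    """Hypothetical stand-in for the per-chunk work (e.g. fitting one series)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    rows = list(range(40000))
    # These 10k rows go to worker #1, the next 10k to worker #2, and so on.
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(process_chunk, chunked(rows, 10000)))
    # Finally the driver collects the partial results, just like a cluster would.
    total = sum(partial_results)
```

The hard part that Spark solves is doing this across many machines, with scheduling and fault tolerance included.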

&lt;h2&gt;
  
  
  Spark is the answer
&lt;/h2&gt;

&lt;p&gt;Apache Spark was created to make parallel processing possible for you and me. But what came before? The era of Hadoop: a big ecosystem of tools for processing big data. Maybe now some experts claim it is out of date, but it was a big leap forward - you can &lt;strong&gt;split your big job into smaller chunks and send them to be processed on several machines&lt;/strong&gt;. Then the results are collected and here you go!&lt;/p&gt;

&lt;p&gt;You can utilize the commodity machines you already have without needing to invest in a top-class machine, which can be very expensive. You can scale out (as opposed to scaling up)! But using Hadoop involved some limitations. You had to stick to MapReduce, and Java was the only building and API language. Moreover, all your computation steps used the disk heavily, which was one of the biggest bottlenecks of the architecture.&lt;/p&gt;

&lt;p&gt;So this is why Spark was introduced. The main advantage of this engine - as compared to Hadoop - is that it processes all of its data in memory. And you do not think only in terms of the machine which drives the process (the &lt;em&gt;driver&lt;/em&gt;): you can instantiate as many &lt;em&gt;workers&lt;/em&gt; as you need. All of these together - making up a &lt;em&gt;cluster&lt;/em&gt; - are quite a powerful tool in your hands. What is more, you are not glued to Java-based MapReduce; there is a nice DataFrame API at hand, with support for multiple languages: Python, Scala, Java, R, and even SQL. Sounds expensive and hard to get? On the contrary!&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I get it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spark or hosted Spark?
&lt;/h3&gt;

&lt;p&gt;You could install Spark on your machine. But why, if you can try it online? For free! No more installing binaries on your machine just to try something out. There is not even a need to have a Docker daemon running somewhere in the background. Why? Because of the blessing (or curse) of the PaaS and SaaS abundance all around you. So how about you? Would you like to try Spark yourself?&lt;/p&gt;

&lt;h3&gt;
  
  
  Databricks
&lt;/h3&gt;

&lt;p&gt;Let me introduce Databricks. This is a product available online, created for broadly understood analytics. It is also hosted on Azure (Microsoft’s cloud offering), where you can just try and run it. I mention Azure here because it was the first place where I encountered Databricks.&lt;/p&gt;

&lt;p&gt;Especially for the sake of this article, I created a new account on the community version of Databricks to check how easy it is. And yes, it took me roughly 2 minutes to create and start my first Databricks account, without any credit card information and so on - just fill in the form &lt;a href="https://databricks.com/signup/signup-community"&gt;on this page&lt;/a&gt;. When you have your cluster ready, you may play with it using the Spark API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use case
&lt;/h2&gt;

&lt;p&gt;I do not like learning for learning’s sake. I like to notice how tools empower me to solve problems faster. Maybe, as an IT person, you have been traversing long, boring log files to find out what happened to an application. You have gigabytes of logs and you wish you could see the patterns in them. You may continue grepping, but no… how are you going to make it happen with Spark?&lt;/p&gt;

&lt;p&gt;You don’t have to upload your own log files (though you may, if you wish) to start playing in the Community version of Databricks. There are many datasets already available on Databricks, including sample log files. First, we will read them all into a structure called an RDD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rdd_original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;textFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/databricks-datasets/sample_logs/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rdd_original&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The previous command just reads all the files from the provided location into an abstraction called an RDD. This one can be used for further transformations (map/reduce/filter) or transformed into a data frame. Some operations were done to parse data out of the file rows. For the sake of simplicity, I skip them in this article - the full notebook, however, &lt;a href="https://lukaszkuczynski.github.io/assets/logs-databricks.dbc"&gt;is also available there&lt;/a&gt;. To create a data frame, we just need to provide the schema.&lt;br&gt;
&lt;/p&gt;
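The parsing step that the notebook performs between these two snippets can be sketched in plain Python. The exact layout of the sample logs is an assumption here (a classic Apache access-log line, where the third field is the user and `-` means anonymous); in Spark you would apply such a function with `rdd_original.map(parse_log_line)`:

```python
import re

# Assumed Apache-style access log layout; the real sample_logs format may differ.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_log_line(line):
    """Turn one raw log line into a tuple of fields (None if it does not match)."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    d = match.groupdict()
    return (d["host"], d["user"], d["timestamp"], d["request"], int(d["status"]))

sample = '10.0.0.1 - alice [17/Sep/2019:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043'
print(parse_log_line(sample))
```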

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimestampType&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="c1"&gt;# your data types definitions go here
&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rdd_mapped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With the data frame ready, we may start doing some aggregations and drawing conclusions. For example, you may be interested in how often different errors occurred for specific users. We can learn about it using the Python API or SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_identified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'user'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;'-'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_grouped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_identified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;'status'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For you, my dear Pandas developer, these operations should look pretty familiar, am I right? Of course, you may also express your query with SQL syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'log'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;"-"&lt;/span&gt;
&lt;span class="k"&gt;group&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Databricks gives you not only a Spark-in-the-cloud experience, but it also contains basic visualization capabilities. I encourage you to use them for your scientific purposes, not as a tool to present data to your customers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The data analysis done here is publicly available for some period of time under &lt;a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3276244303656844/2182517269869031/287949548447532/latest.html"&gt;this public link&lt;/a&gt;. I also uploaded it as &lt;a href="https://lukaszkuczynski.github.io/assets/logs-databricks.dbc"&gt;a downloadable asset here&lt;/a&gt;. With Spark’s features and the DataFrame API we can do really a lot. What is important: the full advantage shows when you deal with amounts of data that are hard for a single machine to ingest.&lt;/p&gt;

</description>
      <category>pandas</category>
      <category>spark</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Predicting dev.to posts</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Mon, 18 Feb 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/predicting-devto-posts-21pk</link>
      <guid>https://forem.com/lukaszkuczynski/predicting-devto-posts-21pk</guid>
      <description>

&lt;p&gt;The following is a short description of a small end-to-end data science project I chose for myself. I always wanted to play with text data, so I chose to create my own corpus. I will use it later to apply some machine learning on top of it. Take a look at how I predict the category of blog posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goal
&lt;/h2&gt;

&lt;p&gt;Recently I have been learning about Machine Learning (ML). There are lots of possible areas of use: analyzing video, images, speech, and text. The last one is the most appealing to me. Online content grows every day; we have too much to read. Natural Language Processing (NLP) is the data science response to this problem. The content for my small project comes from a great developer community blog: &lt;strong&gt;dev.to&lt;/strong&gt;. To show basic ML, I will guess the tag based on the &lt;em&gt;dev.to&lt;/em&gt; post content. Thus, I will run supervised learning. In simple words: I paste in the content of a post and it should tell me whether it is about “java” or maybe “python”. The effect of my work is a web app, so feel free to play with it &lt;a href="http://guess.lukaszkuczynski.usermd.net"&gt;here&lt;/a&gt;. For the sake of simplicity, I used only 4 major categories. Take a look at this live demo gif:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KvXbF1Pf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/guess.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KvXbF1Pf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/guess.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Corpus
&lt;/h2&gt;

&lt;p&gt;What is a &lt;em&gt;corpus&lt;/em&gt;? In the NLP world, it is a collection of documents. We will analyze these documents later on. Doing any ML project, you have to start with something to analyze - with some data. In NLP you have to get some texts. It is not obligatory to scrape them from some internet portal; you can start with the built-in corpora that are part of the &lt;code&gt;nltk&lt;/code&gt; library, &lt;a href="https://www.nltk.org/book/ch02.html"&gt;described here&lt;/a&gt;. In my case, I want to use &lt;code&gt;dev.to&lt;/code&gt; data, so I will use the API they expose and create my own corpus. The process of data acquisition is visible as part of my &lt;a href="https://github.com/lukaszkuczynski/data-analysis/blob/master/devto/fetch_docs.py"&gt;GitHub project here&lt;/a&gt;. Doing it with Python is so easy because I can use great libraries such as &lt;code&gt;BeautifulSoup&lt;/code&gt; and &lt;code&gt;requests&lt;/code&gt;.&lt;/p&gt;
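To get a feel for the acquisition step, here is a minimal sketch against the public dev.to API (the endpoint path and JSON field names are assumptions based on that API; the actual fetch_docs.py in the repo may differ):

```python
import json
import urllib.request

API_ROOT = "https://dev.to/api/articles"  # public dev.to API; details are assumptions

def articles_url(tag, page=1):
    """Build the listing URL for one page of articles with a given tag."""
    return f"{API_ROOT}?tag={tag}&page={page}"

def fetch_articles(tag, page=1):
    """Download one page of article metadata for a tag (performs a network call)."""
    with urllib.request.urlopen(articles_url(tag, page)) as resp:
        return json.loads(resp.read().decode("utf-8"))

def to_corpus_entry(article):
    """Keep only the fields useful for a labeled text corpus."""
    return {"title": article["title"], "tags": article.get("tag_list", [])}

# e.g. corpus = [to_corpus_entry(a) for a in fetch_articles("python")]
```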

&lt;h2&gt;
  
  
  BoW
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;BoW&lt;/em&gt; acronym stands for &lt;em&gt;Bag of Words&lt;/em&gt;. This is a simple way of &lt;em&gt;vectorizing&lt;/em&gt; text. Its name comes from the idea of having a &lt;em&gt;bag&lt;/em&gt; into which we put all the words; we don’t care how they are ordered. We just need to vectorize our text - a prerequisite before applying any ML algorithm to it. So what does this vector look like for a simple sentence? Taking the &lt;code&gt;CountVectorizer&lt;/code&gt; available as part of the scikit-learn library, let us see how it transforms text into a vector.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s"&gt;"I like playing the violin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"Playing football is nice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s"&gt;"How do you like playing football?"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CountVectorizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;operator&lt;/span&gt;
&lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CountVectorizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vocabulary_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;vocabulary_sorted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vocabulary is:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary_sorted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"matrix resulted is:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toarray&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The result would be:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vocabulary is:
[('do', 0), ('football', 1), ('how', 2), ('is', 3), ('like', 4), ('nice', 5), ('playing', 6), ('the', 7), ('violin', 8), ('you', 9)]
matrix resulted is:
[[0 0 0 0 1 0 1 1 1 0]
 [0 1 0 1 0 1 1 0 0 0]
 [1 1 1 0 1 0 1 0 0 1]]

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And let us see what it looks like when we visualize it as a heat map:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---86OOwuE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/count-heat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---86OOwuE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/count-heat.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we will produce a model soon, we want it to be of good quality. We will perform a &lt;em&gt;Tf-Idf&lt;/em&gt; (Term Frequency - Inverse Document Frequency) transformation. It means we take into consideration how frequent a word is in a document (TF) and how rare it is among the other documents (IDF). Applying this transformation in &lt;code&gt;sklearn&lt;/code&gt; results in a matrix a little different from what we saw before:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vocabulary is:
[('do', 0), ('football', 1), ('how', 2), ('is', 3), ('like', 4), ('nice', 5), ('playing', 6), ('the', 7), ('violin', 8), ('you', 9)]
TfIdf matrix resulted is:
[[0. 0. 0. 0. 0.44451431 0.
  0.34520502 0.5844829 0.5844829 0. ]
 [0. 0.44451431 0. 0.5844829 0. 0.5844829
  0.34520502 0. 0. 0. ]
 [0.4711101 0.35829137 0.4711101 0. 0.35829137 0.
  0.27824521 0. 0. 0.4711101 ]]

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the visualization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O1_tEn1w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/tfidf-heat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O1_tEn1w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lukaszkuczynski.github.io/assets/img/posts/tfidf-heat.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output matrix is a little different now. That is because we care how often terms occur across the whole data set. In our case &lt;em&gt;TfIdf&lt;/em&gt; promotes rare words like &lt;em&gt;violin&lt;/em&gt; and demotes popular ones like &lt;em&gt;playing&lt;/em&gt;.&lt;/p&gt;
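A rough way to see where these numbers come from is to compute one row by hand. Assuming scikit-learn's default settings (smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; single-letter tokens like "I" are dropped by the default tokenizer), a plain-Python reimplementation reproduces the first row of the matrix above:

```python
import math

documents = [
    ["like", "playing", "the", "violin"],  # "I" dropped: default tokenizer ignores 1-letter words
    ["playing", "football", "is", "nice"],
    ["how", "do", "you", "like", "playing", "football"],
]
n_docs = len(documents)

def idf(term):
    df = sum(term in doc for doc in documents)  # document frequency of the term
    return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf, as in sklearn's default

def tfidf_row(doc):
    raw = {t: doc.count(t) * idf(t) for t in set(doc)}  # tf * idf per term
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 normalization of the row
    return {t: v / norm for t, v in raw.items()}

row0 = tfidf_row(documents[0])
print({t: round(v, 4) for t, v in sorted(row0.items())})
```

The rare terms *the* and *violin* (idf ≈ 1.693) end up with larger weights than *playing* (idf = 1), which appears in every document.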

&lt;h2&gt;
  
  
  Machine Learning
&lt;/h2&gt;

&lt;p&gt;With the vectors ready, we can apply math on top of them. In supervised learning we have to find a function that maps the input vectors X to labels y. Given the structure and kind of data we process, we need to apply the correct algorithm, not just a random one. If you’re facing this problem, check this &lt;a href="http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"&gt;&lt;code&gt;sklearn&lt;/code&gt; cheatsheet&lt;/a&gt;. In my case, I am doing classification of textual data. I found that many experienced data scientists tend to use Naive Bayes for that. After several trials I found it useful, too. You can check my &lt;a href="https://github.com/lukaszkuczynski/data-analysis/blob/master/devto/cluster_devto_articles.ipynb"&gt;notebook here&lt;/a&gt;. Remember: with textual data we use the Multinomial, not the Gaussian, variant of the algorithm. Here is a snippet from the model training process.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.naive_bayes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultinomialNB&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_naivebayes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;bayes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MultinomialNB&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;bayes_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bayes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bayes_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Please find &lt;a href="https://github.com/lukaszkuczynski/guess"&gt;the full codebase in my repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy
&lt;/h2&gt;

&lt;p&gt;Once your model is ready, it is time to share it with others! You can do it with Azure or AWS. They usually have ready-to-use Docker containers where you just have to put your model inside, and they magically expose it as a REST service. However, the first time I exposed a model I wanted to have everything under control. This is why I decided to build my model into a web application myself. It is as easy as serializing the building blocks to files and then uploading these files to a server. You can go there and check &lt;a href="http://guess.lukaszkuczynski.usermd.net"&gt;my app deployed&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;I do not think the project is perfect. I measured the accuracy of the model, and it is around 80%. To get better results we could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use another algorithm, e.g. an ensemble algorithm, or tune the existing one&lt;/li&gt;
&lt;li&gt;have more data than hundreds of entries (more data almost always means better results)&lt;/li&gt;
&lt;li&gt;clean the data better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am happy to see what a nice experience working with data in &lt;code&gt;sklearn&lt;/code&gt; was. Python provides a must-have ML tool belt. I also tasted the full stack of an ML problem: I collected the data, analyzed it, fitted a model, and finally deployed it.&lt;/p&gt;


</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>devto</category>
    </item>
    <item>
      <title>Clustering snacks and vegetables</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Thu, 10 Jan 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/clustering-snacks-and-vegetables-48an</link>
      <guid>https://forem.com/lukaszkuczynski/clustering-snacks-and-vegetables-48an</guid>
      <description>&lt;h2&gt;
  
  
  Supervised vs Unsupervised
&lt;/h2&gt;

&lt;p&gt;Lesson no. 9 of Udacity’s &lt;a href="https://eu.udacity.com/course/intro-to-machine-learning--ud120"&gt;Introduction to Machine Learning&lt;/a&gt; showed me another aspect of machine learning. I could strengthen my knowledge in the area of clustering. This topic is particularly interesting as it reveals insights which may inspire you to further analysis. As I am lately into NLP, one possible use case is grouping similar documents together so that you can then discover the connections between them. Supervised learning is about having labels and then checking whether these labels fit the features of newly acquired data. Unsupervised learning has no such comparison phase, as no labels are known at the beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  K-Means and sklearn
&lt;/h2&gt;

&lt;p&gt;Clustering using the K-means algorithm is one of the most widely used unsupervised techniques. It is about finding cluster centres that allow the whole system to be in a “harmony”. Distances are calculated using the Euclidean distance.&lt;br&gt;&lt;br&gt;
There is a nice visualization that was recommended during the training; you can take a look at the fantastic work done &lt;a href="https://www.naftaliharris.com/blog/visualizing-k-means-clustering/"&gt;at the naftaliharris blog&lt;/a&gt;. As with every algorithm, we have to be aware of its limitations. One of these is the fact that K-means is a hill-climbing algorithm. This very fact has its own &lt;a href="https://en.wikipedia.org/wiki/Hill_climbing"&gt;Wikipedia page that you can check&lt;/a&gt;. So the algorithm is very sensitive to local minima. Thus, a specific choice of initial points (centroids) can lead to clusters we would rather not have. This is why in the sklearn implementation you are encouraged to run the clustering several times, after which the best clustering is chosen.&lt;/p&gt;
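The "run it several times, keep the best" advice is what sklearn's KMeans exposes as its n_init parameter. The sketch below is a plain-Python toy on 1-D data illustrating that idea (restarts from random centroids, keep the run with the lowest inertia); it is not sklearn's actual implementation:

```python
import random

def kmeans_1d(points, k, iterations=20):
    """One run of Lloyd's algorithm on 1-D data from a random initialization."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # Inertia: sum of squared distances to the nearest center (lower is better).
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

def kmeans_best_of(points, k, n_init=10):
    """Restart from several random initializations and keep the lowest inertia."""
    runs = [kmeans_1d(points, k) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])

random.seed(0)
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
centers, inertia = kmeans_best_of(data, k=2)
print(sorted(centers), inertia)
```

With two well-separated blobs, a single unlucky initialization can still land in a poor local minimum; keeping the best of several runs makes that much less likely.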

&lt;h2&gt;
  
  
  Use case: veggies vs snacks
&lt;/h2&gt;

&lt;p&gt;The Kaggle kernel for the following &lt;a href="https://www.kaggle.com/panlukaszk/vegetables-vs-snacks"&gt;you can find here&lt;/a&gt;. I thought: maybe I could use clustering somewhere? Maybe a computer can be smart enough to distinguish junk food from the good stuff? So I found this dataset, which is &lt;a href="http://www.foodstandards.gov.au/science/monitoringnutrients/ausnut/ausnutdatafiles/Pages/foodnutrient.aspx"&gt;the Australian Food Nutrient Database&lt;/a&gt;. I played around a bit, and you know what? It works. Of course, if I had spent more time on the careful assignment of the snack and vegetable categories it would be more meaningful; without that I have some outliers, like tomatoes with a lot of fat, for example. Anyhow, KMeans was able to mark a clear distinction between high-calorie, low-vitamin-C snacks and low-calorie, healthy vegetables.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>udacity</category>
      <category>clustering</category>
    </item>
    <item>
      <title>2018 Accomplishments</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Fri, 04 Jan 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/2018-accomplishments-3iha</link>
      <guid>https://forem.com/lukaszkuczynski/2018-accomplishments-3iha</guid>
      <description>&lt;h2&gt;
  
  
  what is in the past
&lt;/h2&gt;

&lt;p&gt;It is good to ‘find enjoyment in all your hard work’. I can see many people doing their year summaries. Obviously, there are areas more important for me than job matters, but here I will focus on business only. What has changed?&lt;/p&gt;

&lt;h3&gt;
  
  
  machine learning
&lt;/h3&gt;

&lt;p&gt;There was a lot of learning this year. I always thought this was too much for me. I have always liked math, but when I started to dig into the math behind machine learning, the task seemed daunting. But Udacity and Pluralsight brought much to the table, especially the &lt;a href="https://dev.to/lukaszkuczynski/machine-learning-with-udacity-549e"&gt;Udacity one&lt;/a&gt;. With simple examples and lots of pictures, they were able to explain difficult matters easily. I immediately thought about business use cases, so there was a need and motivation. I took some data and successfully applied this knowledge in 2 of my projects. Both were about text analysis. The first was supervised machine learning: automatic team assignment for the ticketing tool we have. I was able to achieve pretty high accuracy for teams that had distinct areas of responsibility. The second was applying clustering, as an example of unsupervised learning, with the famous K-Means algorithm. Of course, I cannot share this data. But while playing with my machine learning basics, I am happy to share some insights in my Kaggle account based on public datasets. You can see some of my kernels &lt;a href="https://www.kaggle.com/panlukaszk"&gt;here&lt;/a&gt;. By the way, learning about Kaggle is another big event for me this year, as it is a great place to share my knowledge and progress. And not only that: I can gain inspiration from others, too. Some of my posts are proof of this; check &lt;a href="https://dev.to/lukaszkuczynski/regression-and-outliers-and-beer-57k8"&gt;this one about beer in Brazil&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  cloud
&lt;/h3&gt;

&lt;p&gt;This year I understood some of the good aspects of the “everything into the cloud” hype. When I think of a local or on-premises environment I can see problems with scalability and a lack of flexibility. Everything seems so hard to get when for every change you have to raise an incident to a dedicated team. How different it is when using cloud infrastructure! This year I had a chance to work with Openshift as a PaaS. All you need is to package your dependencies inside a Docker image and reuse it whenever you need.&lt;/p&gt;

&lt;p&gt;I also played with the big guys on the market: AWS and Azure. I even made a little comparison of their services, deploying similar infrastructure on both of them, because I wanted to compare their serverless support. I built a simple notification engine leveraging both &lt;a href="https://dev.to/lukaszkuczynski/serverless-monitoring-of-weather-with-azure-4d89"&gt;Azure Functions&lt;/a&gt; and &lt;a href="https://dev.to/lukaszkuczynski/monitoring-wrocaw-weather-with-aws-903"&gt;AWS Lambdas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main part of my current billable assignment was to conduct several data analysis tasks. I will take a closer look at it in the next section, but I can tell that without Azure Databricks this journey would not have been that easy.&lt;/p&gt;

&lt;p&gt;These are my 3 favorite things about the cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it is unlimited&lt;/li&gt;
&lt;li&gt;pay as you go&lt;/li&gt;
&lt;li&gt;you don’t care about the infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  data analysis
&lt;/h3&gt;

&lt;p&gt;As I mentioned before, I started to use Spark some years ago, but at that time it was – to put it mildly – not so justified. So what has changed lately? I was given several tasks to make use of the big data we have in our log files on user activity, and I noticed when parallel computing unleashes its power: when the volume grows well beyond megabytes. Of course, it is not yet terabytes of processed data, but we are getting there. Spark is cool when you run it somewhere the platform is available. In my current assignment I can leverage the power of Databricks, as we have an Azure account to utilize. Personally, I find Databricks makes things really easy and fast. Why?&lt;/p&gt;

&lt;p&gt;First, I love playing with Python notebooks. They make your job so descriptive; I think the idea of mixing code with textual explanations is great.&lt;/p&gt;

&lt;p&gt;Second, it is so nice to use the flexibility of fast provisioning available there. You just choose the cluster you need and it is ready within minutes. If you are dissatisfied – or you are running out of money! – you just dismiss it. Simple as that.&lt;/p&gt;

&lt;p&gt;I was able to gain really nice insights that had been hidden from the stakeholders, so now we can run the decision process better. We could answer the following questions: which areas are a user’s favourite? How is a user responding to a new feature? Where are the users located, and does it affect their choices?&lt;/p&gt;

&lt;p&gt;Gaining insights from data is important, and I think data analysis with the help of Azure Databricks is crucial here. This work can open the eyes of decision-makers to some challenges or successes. It can make complicated data simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  more blogging
&lt;/h3&gt;

&lt;p&gt;This year I could do some blogging. On dev.to alone I published 11 entries starting in May 2018. This year I left my &lt;a href="http://lukcreates.pl/"&gt;Wordpress account&lt;/a&gt; for the sake of a &lt;a href="http://lukaszkuczynski.github.io"&gt;Github pages&lt;/a&gt; portal, which I loved because I can write markdown only, and I like git-pushing my blog entries. Not only was I able to write about my progress, but I could also report on 2 nice events I was part of: &lt;a href="https://dev.to/lukaszkuczynski/pyconpl-2018-report-2loo"&gt;the Python conference PyConPL in August&lt;/a&gt; and the internal &lt;a href="https://dev.to/lukaszkuczynski/innovation-day-at-volvo-1l64"&gt;AI days at Volvo&lt;/a&gt;. Of course, dev.to plays a very important role here because the community there is great and I like the user experience when dealing with both content and its design.&lt;/p&gt;

&lt;h2&gt;
  
  
  what comes next?
&lt;/h2&gt;

&lt;p&gt;What about my business-related plans for this year? In 2019 I do &lt;strong&gt;not&lt;/strong&gt; plan to learn any new language. I am &lt;strong&gt;not&lt;/strong&gt; going to change my direction. Rather, I would like to continue the progress I started in 2018 – the year in which I understood even more that my change from Java to Python was a good choice. With Python I feel I am young again; I like programming, and writing code I feel like a poet now, not like a journalist forced to write a few articles before the end of the month.&lt;/p&gt;

&lt;p&gt;I am going to read new books; I don’t know which yet… But for sure there is a goal to finish the great one I started to read lately, &lt;a href="https://www.amazon.com/Storytelling-Data-Visualization-Business-Professionals-ebook/dp/B016DHQSM2"&gt;about storytelling with data&lt;/a&gt;. Wow, it really opens my mind to fields that are still so green for me. I want to finish the Udacity course in February and start playing with text data, seriously. By the way, you can check my progress on a &lt;a href="https://github.com/lukaszkuczynski/ud120-projects/commits/master"&gt;dedicated git repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To keep things short:&lt;/p&gt;

&lt;p&gt;I want to be a help and use the help of others.  &lt;/p&gt;

&lt;p&gt;I want to communicate to bring results. &lt;/p&gt;

&lt;p&gt;I want to make things simpler with visualization. &lt;/p&gt;

&lt;p&gt;I want to automate to make things faster.&lt;/p&gt;

</description>
      <category>accomplishment</category>
      <category>resolutions</category>
    </item>
    <item>
      <title>Regression and Outliers and Beer</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Wed, 19 Dec 2018 21:22:12 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/regression-and-outliers-and-beer-57k8</link>
      <guid>https://forem.com/lukaszkuczynski/regression-and-outliers-and-beer-57k8</guid>
      <description>

&lt;h2&gt;
  
  
  Distribution
&lt;/h2&gt;

&lt;p&gt;It is all about data distribution. Before applying any machine learning algorithm you have to answer the question: do I know what I want to know?  Having this question clear in mind is required when one wants to stay focused.  So what is your data distribution?  In lesson 7 of &lt;a href="https://mena.udacity.com/course/intro-to-machine-learning--ud120"&gt;Introduction to Machine Learning&lt;/a&gt; a student is given several introductory tasks to comprehend whether a data distribution is continuous or discrete.  A discrete distribution can be likened to a categorical dataset with elements like human names or car brands.  A continuous one is connected to variables that can take any value within a range, like age or the volume of something.&lt;/p&gt;

&lt;h2&gt;
  
  
  Regression Parameters
&lt;/h2&gt;

&lt;p&gt;Regression is a way to find the pattern in a continuous distribution.  Given X values (one or more features) you find what outcome they produce.  People sometimes describe a regression by saying: the more something happens, the more some value grows.  Linear regression is about finding the relationship between two figures.  Say you want to know how much the air temperature affects beer consumption – for this study please check the next sections of this entry.  There are extensive and nice explanations of what a regression means mathematically (e.g. &lt;a href="https://eli.thegreenplace.net/2016/linear-regression/"&gt;this blog&lt;/a&gt; or, even simpler, &lt;a href="http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm"&gt;here&lt;/a&gt;), so I will not go into details in this post.  To be concise, I can tell that I was taught about two main parameters of a regression.  These are the &lt;strong&gt;slope&lt;/strong&gt; and the &lt;strong&gt;intercept&lt;/strong&gt;.  Because the regression is a mathematical formula it can be represented as follows&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Y = a + bX
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;scikit-learn&lt;/code&gt; these two parameters can be easily fetched, which I will show in the following section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sklearn is &lt;strong&gt;fit&lt;/strong&gt;ting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simple Example
&lt;/h3&gt;

&lt;p&gt;Having data imported into your data science toolbox you can leverage the power of &lt;code&gt;scikit-learn&lt;/code&gt;.  It has the major machine learning algorithms already built in, so you just have to follow a simple flow of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data preparation&lt;/li&gt;
&lt;li&gt;choosing X and y&lt;/li&gt;
&lt;li&gt;fitting your data&lt;/li&gt;
&lt;li&gt;the algorithm evaluation&lt;/li&gt;
&lt;li&gt;visualization (optional)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With sklearn doing so is as simple as the following (imports excluded):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Score is &lt;/span&gt;&lt;span class="si"&gt;%&lt;/span&gt;&lt;span class="s"&gt;f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;reg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intercept_&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Regression calculated for equation y=ax+b. Params are a (slope)=&lt;/span&gt;&lt;span class="si"&gt;%.1&lt;/span&gt;&lt;span class="s"&gt;f, b (intercept)=&lt;/span&gt;&lt;span class="si"&gt;%.1&lt;/span&gt;&lt;span class="s"&gt;f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For the complete notebook please refer to my &lt;a href="https://www.kaggle.com/panlukaszk/the-simpliest-linear-regression-ever"&gt;Kaggle kernel&lt;/a&gt;.  Normally we would not use the training values (X and y) for scoring, but this snippet is just for presentation purposes.&lt;/p&gt;
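&lt;p&gt;To sketch the proper way – my own minimal illustration, not part of the kernel – we can hold a point out and evaluate on it. I use &lt;code&gt;np.polyfit&lt;/code&gt; here instead of &lt;code&gt;LinearRegression&lt;/code&gt; just to keep the dependencies tiny; for a degree-1 fit it returns the same slope and intercept.&lt;/p&gt;

```python
import numpy as np

# The same toy data as above: y = 2x exactly
x = np.array([1.0, 2.0, 3.0, 6.0])
y = np.array([2.0, 4.0, 6.0, 12.0])

# Hold the last point out instead of scoring on the training set
x_train, y_train = x[:3], y[:3]
x_test, y_test = x[3:], y[3:]

# Degree-1 polyfit returns (slope, intercept), like reg.coef_ and reg.intercept_
slope, intercept = np.polyfit(x_train, y_train, 1)
y_pred = slope * x_test + intercept

print("slope=%.1f, intercept=%.1f, prediction for x=6: %.1f" % (slope, intercept, y_pred[0]))
```

&lt;p&gt;On real, noisy data the held-out error is the honest measure of how well the line generalizes.&lt;/p&gt;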

&lt;h3&gt;
  
  
  The real story: Kaggle and Beer consumption
&lt;/h3&gt;

&lt;p&gt;I was thinking about which dataset would make a nice visualization of linear regression.  And this is how I reached the &lt;em&gt;Beer consumption in Sao Paulo&lt;/em&gt; &lt;a href="https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo"&gt;dataset&lt;/a&gt;. I did some calculations harnessing &lt;code&gt;scikit-learn&lt;/code&gt; and the visualization given by the &lt;code&gt;matplotlib&lt;/code&gt; integration built into &lt;code&gt;pandas&lt;/code&gt;.  If you are interested in how the data was prepared and in my steps, please refer to my &lt;a href="https://www.kaggle.com/panlukaszk/beer-in-saopaolo"&gt;Kaggle notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outliers
&lt;/h2&gt;

&lt;p&gt;Lesson 8 of the course teaches about outliers.  They are an important factor, especially when calculating a linear regression.  Most of the time you can just take a look at the data with a scatter plot and you’re done.  That was the way with the &lt;em&gt;beer consumption&lt;/em&gt; dataset: I didn’t see any major anomalies, so no &lt;em&gt;outlier removal&lt;/em&gt; technique was used.  But it is good to remember that such techniques exist, to make use of them when needed.  During this lesson, I saw how practical outlier removal is when dealing with the &lt;em&gt;Enron&lt;/em&gt; data.&lt;/p&gt;
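&lt;p&gt;The removal idea boils down to: fit once, compute the residuals, drop the worst-fitting fraction of points, and refit. Here is a rough sketch of mine (toy data, not the course code):&lt;/p&gt;

```python
import numpy as np

# Toy data: a clean y = 2x trend plus one obvious outlier at the end
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.0, 8.1, 30.0])

# First fit on everything; the outlier drags the slope up
slope, intercept = np.polyfit(x, y, 1)
residuals = np.abs(y - (slope * x + intercept))

# Keep the 80% of points with the smallest residuals, then refit
keep = residuals.argsort()[: int(len(x) * 0.8)]
slope2, intercept2 = np.polyfit(x[keep], y[keep], 1)

print("slope before: %.2f, slope after: %.2f" % (slope, slope2))
```

&lt;p&gt;After dropping the single bad point, the slope falls back from about 6 to the true value of about 2.&lt;/p&gt;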

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Lessons 7 and 8 of Udacity’s Intro to Machine Learning focused my attention on continuous distributions.  I had a chance to work with real-world data, and the outcome refreshed my &lt;a href="https://www.kaggle.com/panlukaszk"&gt;Kaggle account&lt;/a&gt;.  It was another proof for viewing Python as a primary language for data analysis.  Of course, I continue to update the &lt;a href="https://github.com/lukaszkuczynski/ud120-projects"&gt;GitHub repo&lt;/a&gt; that serves as an insight into my coding while studying this course.&lt;/p&gt;


</description>
      <category>machinelearning</category>
      <category>udacity</category>
      <category>training</category>
      <category>regression</category>
    </item>
    <item>
      <title>AI days</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Fri, 30 Nov 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/ai-days-3e95</link>
      <guid>https://forem.com/lukaszkuczynski/ai-days-3e95</guid>
      <description>

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;The following is a report about the event that took place on Nov 29th this year, inside Volvo. We could listen to many talks concerning the AI revolution in both a global and a local scope. We are already using AI. Where, and what tools do we use? Let me explain.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI application, trends
&lt;/h2&gt;

&lt;p&gt;There are some misconceptions about what AI is and how different it is from Machine Learning. Thanks to Jair Ribeiro we could learn that AI is currently a popular trend, and we have to face it. In Volvo, there are some use-cases with the famous &lt;a href="https://www.youtube.com/watch?v=2Gc1zz5bl8I"&gt;Vera vehicle&lt;/a&gt; and other projects using data to draw conclusions and to be informed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chatbots
&lt;/h3&gt;

&lt;p&gt;Virtual assistants: this is a topic which has become more popular recently, and it is surely not a dream! Whenever you visit some website and a nice chat window pops out at the bottom right of your browser – it is probably a machine talking to you. In Volvo, there is a use-case of a chatbot in the ticketing system. Singh Kumar showed us the chatbot that helps you with the internal knowledge database, helping you choose the correct person or team for your problem. It can also answer you directly if it knows the answer. There are many challenges we can face when installing a system like this one. For example, you have to teach your chatbot when it should give up and say “I don’t know”.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Mindful?
&lt;/h3&gt;

&lt;p&gt;Thanks to Patrick Kozakiewicz we could have a moment of reflection on what kind of people we are and how it affects the software we produce. Of course, AI is another type of software, and it was interesting to learn that it can be biased by the way it was produced. There were some tricky questions about “bad AI”, and of course it can happen if we feed AI with bad content. Should we teach AI our normal, human behaviors? I believe we should. And we should all be aware of the fact that when we are under stress our productivity is worse, so take care of your surroundings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data collection
&lt;/h3&gt;

&lt;p&gt;The very important and most time-consuming part of a machine learning process is the preparation phase. As Bartek Starościak explained, we have a project for the improvement of a production process: we want to predict which features have an impact on future vehicle failures. To collect all the data you need, you have to face many kinds of inputs. Apart from nicely formatted files, we have some hand-written notebooks and a variety of different systems and databases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qlik
&lt;/h3&gt;

&lt;p&gt;Having lots of data, you will sooner or later face the issue of choosing which visualization is best for explaining your data to a business person. As told by Tiago Hubner and his colleagues, we can leverage QlikSense for that purpose. There are lots of features on that platform, but one really caught my attention: &lt;strong&gt;Insights&lt;/strong&gt;. It is a nice feature suggesting what you &lt;strong&gt;can&lt;/strong&gt; find in your data and &lt;strong&gt;how&lt;/strong&gt; you can visualize it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Black box
&lt;/h3&gt;

&lt;p&gt;There are some machine learning algorithms that are pretty hard to explain. When is a model explanation important? As Bartosz Kurlej explained, there are some business areas where there is a need for a detailed explanation. For example, if you want to prove to bank management that someone should not be given a loan, it is better to explain what the reasons for that are. To make it happen we can use the &lt;a href="https://www.datacamp.com/community/tutorials/understanding-model-predictions-lime"&gt;LIME algorithm&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Azure and Databricks
&lt;/h3&gt;

&lt;p&gt;I cannot imagine AI without tools. At some point you will face the following challenge: your local machine is not powerful enough. Training models requires a lot of computational power which – more than likely – you do not have. What should you do then? Use clusters and make your job distributed. It is far easier to find lots of regular computers than just one big machine with a big processor.&lt;br&gt;&lt;br&gt;
Damian Kowalczyk explained what Microsoft’s response toward AI is and how we can use Azure to do our AI work. Azure really has a wide variety of tools that can be used; you can find:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trained models ready to be used like computer vision models or speech models&lt;/li&gt;
&lt;li&gt;frameworks, f.e. Tensorflow, Keras&lt;/li&gt;
&lt;li&gt;services: ready to use, f.e. Databricks (my favorite)&lt;/li&gt;
&lt;li&gt;infrastructure: you don’t care about how it is running&lt;/li&gt;
&lt;li&gt;deployment: your model can be easily packaged in a container and exposed as a web service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am particularly interested in the last element. When developing a model, sooner or later you would like to share the great results of your work with others. We could also learn about Spark on Azure, that is, the Databricks platform. Running your data processing in a distributed environment is not that old an idea: everything started to happen around 15 years ago, when Google published its paper on MapReduce (2004). Then things changed dramatically when Spark was introduced, as it was different from its precursor, Hadoop: Spark can do its job fully in-memory, which makes its computation more powerful. And this is why Databricks offers easy access to this technology using the data-science standard: Jupyter notebooks.&lt;/p&gt;

&lt;h3&gt;
  
  
  NLP
&lt;/h3&gt;

&lt;p&gt;Maciej Szymczak provided an introduction to NLP for developers. I am particularly interested in NLP lately, so I was listening intensively, and I can assure you he did a great job. I like the way he explained NLP for developers: it was pitched at such a level that anyone who starts playing with text could be encouraged. We were able to see a live demo of sentiment analysis using the standard bag-of-words technique. But that is not everything: we could also learn about the current trends in NLP, with names such as &lt;code&gt;spaCy&lt;/code&gt; or &lt;code&gt;gensim&lt;/code&gt; to be used in modern NLP projects. Obviously, &lt;strong&gt;Bag of Words&lt;/strong&gt; is not perfect, and we have alternatives: &lt;strong&gt;word embeddings&lt;/strong&gt; were mentioned, with the continuous version of Bag of Words. Thus, we can avoid skipping the importance of the context of the words we analyze.&lt;/p&gt;
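&lt;p&gt;To show what the bag-of-words representation actually loses, here is a minimal sketch of my own (not the demo from the talk): each document becomes a vector of word counts over a shared vocabulary, and word order – hence context – disappears.&lt;/p&gt;

```python
from collections import Counter

# Two tiny example documents (made up for illustration)
docs = ["the beer was great", "the service was not great"]

# Shared vocabulary: every distinct word across all documents, sorted
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

for doc in docs:
    print(doc, "->", bag_of_words(doc))
```

&lt;p&gt;Note that “not great” and “great” differ here only by a single count – exactly the kind of context problem that word embeddings try to address.&lt;/p&gt;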

&lt;h3&gt;
  
  
  Advanced analytics
&lt;/h3&gt;

&lt;p&gt;Is advanced analytics something that your business needs? This is the question Fabio Bezerra answered. We live in a world of big data, and sooner or later we have to face it. There are other companies making good use of it, so we have to follow up. Big data analytics has several forms. We have &lt;strong&gt;descriptive analytics&lt;/strong&gt;, when we want to make our data more explainable. There is also &lt;strong&gt;prescriptive analytics&lt;/strong&gt;, when I am analyzing my data to make it useful for making a decision later on. What are the &lt;em&gt;Business Intelligence&lt;/em&gt; tools making that analysis possible? We could learn about some of them, including Excel, QlikSense and Knime.&lt;/p&gt;


</description>
      <category>conference</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Machine Learning with Udacity</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Thu, 22 Nov 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/machine-learning-with-udacity-549e</link>
      <guid>https://forem.com/lukaszkuczynski/machine-learning-with-udacity-549e</guid>
<description>&lt;p&gt;I am studying &lt;a href="https://classroom.udacity.com/courses/ud120"&gt;Machine Learning with Udacity&lt;/a&gt;. Let me explain why, and why I think Python nicely fits into the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  My motivation
&lt;/h2&gt;

&lt;p&gt;I am studying Machine Learning. I want to explore the topic of Natural Language Processing. Before going deeper into NLP insights, I felt I needed to learn the basics of ML. I have to understand supervised techniques better, and this course seems to be helpful. So far I have learned about SVM and Naïve Bayes and what makes the difference between the two. Before, it was not easy for me – I admit – but the explanation given by the speakers was more than satisfying. Really, there are lots of graphics and examples with just enough information. Very good job, Udacity guys! I will try to regularly update my &lt;a href="https://github.com/lukaszkuczynski/ud120-projects"&gt;Git repo forked from Udacity&lt;/a&gt;, so stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Python to play with data
&lt;/h2&gt;

&lt;p&gt;Every time I create something in Python I have this feeling that I am using the right tool for the task! Interestingly, I do not have that feeling when trying to fight with Spark ML. Every line of code I write here is just a pleasure. Haven’t you experienced that? If you have not, try using Python for data exploration. After several years of experience in Java, I cannot imagine myself doing the same in Java.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choose a model
&lt;/h2&gt;

&lt;p&gt;The mindful choice of an ML algorithm is crucial. You cannot just take the first one and try to manipulate it to the point where it will be OK. You have to mind the use case from the very beginning. And when you are ready, it is time to apply the algorithm. Every choice has to be justified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tune params
&lt;/h2&gt;

&lt;p&gt;When learning about SVM there was a nice hint: separate a small amount of data to check your model params, and then, when they seem good, apply them to the full set. This is because training a model requires a lot of try-and-fail attempts, so decrease your waiting time to the minimum! Parameters can also be nicely visualized in diagrams, so when you apply some change, try to see it on a plot. Why not, when you have so many visualization libraries like &lt;code&gt;matplotlib&lt;/code&gt; or &lt;code&gt;seaborn&lt;/code&gt;?&lt;/p&gt;
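&lt;p&gt;The subsample trick is not SVM-specific; a generic sketch of the idea (my own toy example, fitting polynomial degrees with &lt;code&gt;np.polyfit&lt;/code&gt; as a stand-in for a real parameter search) looks like this:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical big dataset: noisy y = 3x + 1 with 100,000 points
x = rng.uniform(0, 10, 100_000)
y = 3 * x + 1 + rng.normal(0, 0.5, x.size)

# Tune on a small subsample first, to keep the try-and-fail loop fast
idx = rng.choice(x.size, 200, replace=False)
xs, ys = x[idx], y[idx]

# Toy "parameter search": which polynomial degree fits best on the subsample?
for degree in (1, 2, 3):
    coeffs = np.polyfit(xs, ys, degree)
    err = np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    print("degree %d: mean squared error %.3f" % (degree, err))
```

&lt;p&gt;Once the winning parameters are found on 200 points, the final fit on all 100,000 runs only once.&lt;/p&gt;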

&lt;h2&gt;
  
  
  Udacity is cool
&lt;/h2&gt;

&lt;p&gt;I don’t know if Udacity is the best, but compared to other platforms I can see its benefits. Let me show you a few of them that made a difference for me, as a programmer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Git repo with prepared data from the course – you just fork &lt;a href="https://github.com/udacity/ud120-projects"&gt;it&lt;/a&gt; and start doing things!&lt;/li&gt;
&lt;li&gt;very interactive training, with many questions and lots of graphics&lt;/li&gt;
&lt;li&gt;the ideal tradeoff between a professional look and the level of knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Short Summary
&lt;/h2&gt;

&lt;p&gt;Why do I study ML? The reason is: I want to understand NLP. And everything is &lt;a href="https://github.com/lukaszkuczynski/ud120-projects"&gt;Git-visible&lt;/a&gt;. Read the story of me studying the first 3 lessons from Udacity’s “Introduction to Machine Learning” &lt;a href="https://classroom.udacity.com/courses/ud120"&gt;training&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>udacity</category>
      <category>training</category>
    </item>
    <item>
      <title>NLTK revisited</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Wed, 31 Oct 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/nltk-revisited-13ea</link>
      <guid>https://forem.com/lukaszkuczynski/nltk-revisited-13ea</guid>
      <description>&lt;h2&gt;
  
  
  NLTK revisited: why
&lt;/h2&gt;

&lt;p&gt;When you start working on a text-analysis project, sooner or later you will encounter the following problems: where to find sample text, how to get resources, where to start. When I &lt;a href="http://lukcreates.pl/dajsiepoznac2017/porownanie-tekstu-scikit-learn-i-nltk-nlp/" rel="noopener noreferrer"&gt;first had contact (Polish-language post)&lt;/a&gt; with NLP I didn’t appreciate the power that lies behind NLTK – the first-choice Python library for NLP. However, after several years, I see that I could have used it earlier. Why? NLTK comes with easy access to various sources of text. I am going to show you what I particularly like and what caught my attention when studying the first 3 chapters of the &lt;a href="http://nltk.org/book/" rel="noopener noreferrer"&gt;official book&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But what is the main business goal? I would like to (finally) build a &lt;em&gt;Suggestion tool&lt;/em&gt; that will provide a Virtual Assistant to help in the decision-making process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bundled resources available
&lt;/h2&gt;

&lt;h3&gt;
  
  
  brown and its categories
&lt;/h3&gt;

&lt;p&gt;NLTK comes with various corpora, that is, big packs of text. You can utilize them as shown in the following example. All you need to do is download the appropriate corpus and start exploring it. Let us see.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;brown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileids&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;Brown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; corpus contain &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;package&lt;/span&gt; &lt;span class="n"&gt;brown&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Unzipping&lt;/span&gt; &lt;span class="n"&gt;corpora&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Brown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="n"&gt;contain&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this corpus you will find different texts categorized into categories, so it nicely fits into the classification area of machine learning. Below are the categories of these texts together with some samples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;brown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; contains following categories %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;brown_adventure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;brown_government&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;government&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Following we have some sentences from &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; category:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;brown_adventure&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And here we have some sentences from &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;government&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; category:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;brown_government&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 


&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;brown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;belles_lettres&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;editorial&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fiction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;government&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hobbies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;humor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learned&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mystery&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;religion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reviews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;romance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;science_fiction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;Following&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;some&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dan&lt;/span&gt; &lt;span class="n"&gt;Morgan&lt;/span&gt; &lt;span class="n"&gt;told&lt;/span&gt; &lt;span class="n"&gt;himself&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;would&lt;/span&gt; &lt;span class="n"&gt;forget&lt;/span&gt; &lt;span class="n"&gt;Ann&lt;/span&gt; &lt;span class="n"&gt;Turner&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;He&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;well&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;her&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;He&lt;/span&gt; &lt;span class="n"&gt;certainly&lt;/span&gt; &lt;span class="n"&gt;didn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t want a wife who was fickle as Ann .
 &amp;gt; If he had married her , he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;been&lt;/span&gt; &lt;span class="n"&gt;asking&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trouble&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;But&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;rationalization&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;And&lt;/span&gt; &lt;span class="n"&gt;here&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;have&lt;/span&gt; &lt;span class="n"&gt;some&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;government&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Office&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Business&lt;/span&gt; &lt;span class="nc"&gt;Economics &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;OBE&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Department&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Commerce&lt;/span&gt; &lt;span class="n"&gt;provides&lt;/span&gt; &lt;span class="n"&gt;basic&lt;/span&gt; &lt;span class="n"&gt;measures&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;national&lt;/span&gt; &lt;span class="n"&gt;economy&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;short&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;economic&lt;/span&gt; &lt;span class="n"&gt;situation&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;business&lt;/span&gt; &lt;span class="n"&gt;outlook&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;It&lt;/span&gt; &lt;span class="n"&gt;develops&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;analyzes&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;national&lt;/span&gt; &lt;span class="n"&gt;income&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;international&lt;/span&gt; &lt;span class="n"&gt;payments&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt; &lt;span class="n"&gt;business&lt;/span&gt; &lt;span class="n"&gt;indicators&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Such&lt;/span&gt; &lt;span class="n"&gt;measures&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;essential&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;its&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;presenting&lt;/span&gt; &lt;span class="n"&gt;business&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Government&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;facts&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;meet&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;objective&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;expanding&lt;/span&gt; &lt;span class="n"&gt;business&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;improving&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;operation&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;economy&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Contact&lt;/span&gt;
 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;For&lt;/span&gt; &lt;span class="n"&gt;further&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="n"&gt;contact&lt;/span&gt; &lt;span class="n"&gt;Director&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Office&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Business&lt;/span&gt; &lt;span class="n"&gt;Economics&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Department&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Commerce&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Washington&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;D&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;brown&lt;/em&gt; corpus also ships with a very important NLP component: Part-Of-Speech (POS) tagging. How is it organized? Let us look at an example from one of the sentences printed a moment ago.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;adv_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adv_words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;adv_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;brown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tagged_words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adventure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adv_words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Morgan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;told&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;himself&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;would&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;forget&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ann&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Turner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Morgan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;told&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VBD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;himself&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PPL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;he&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PPS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;would&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;forget&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ann&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Turner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NP&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes! It is tagged and ready to be analyzed. What do these symbols mean? They are part-of-speech tags, nicely described &lt;a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. For example, &lt;code&gt;NP&lt;/code&gt; stands for &lt;em&gt;Proper Noun&lt;/em&gt; and &lt;code&gt;VBD&lt;/code&gt; for &lt;em&gt;Verb, past tense&lt;/em&gt;.&lt;/p&gt;
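&lt;p&gt;Since the tagged corpus is just a sequence of &lt;code&gt;(word, tag)&lt;/code&gt; tuples, counting part-of-speech frequencies is a one-liner with &lt;code&gt;collections.Counter&lt;/code&gt;. A minimal sketch; the list below is hardcoded from the sentence above, so it runs without downloading the corpus:&lt;/p&gt;

```python
from collections import Counter

# (word, tag) pairs copied from the tagged sentence shown above
tagged = [('Dan', 'NP'), ('Morgan', 'NP'), ('told', 'VBD'),
          ('himself', 'PPL'), ('he', 'PPS'), ('would', 'MD'),
          ('forget', 'VB'), ('Ann', 'NP'), ('Turner', 'NP'), ('.', '.')]

# Count how often each POS tag occurs
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(1))  # → [('NP', 4)]
```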

&lt;h3&gt;
  
  
  sentiments
&lt;/h3&gt;

&lt;p&gt;One of the areas where NLP is used is Sentiment Analysis, which plays an important role in digital marketing. Imagine how nice it is to process a vast amount of opinions and instantly recognize whether a product is approved or rejected by the community; spotting trends becomes possible, too. So what is one of the tools NLTK bundles for dealing with sentiments? It is &lt;code&gt;opinion_lexicon&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;opinion_lexicon&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;opinion_lexicon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;negatives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;opinion_lexicon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;negative&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;opinion_lexicon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;positive&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If you find some negative words, here you are: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;negatives&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;But let us try to see the positive side of life! Described with these words: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;positives&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;package&lt;/span&gt; &lt;span class="n"&gt;opinion_lexicon&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Unzipping&lt;/span&gt; &lt;span class="n"&gt;corpora&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;opinion_lexicon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;find&lt;/span&gt; &lt;span class="n"&gt;some&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;here&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2-faced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2-faces&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abnormal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abolish&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abominable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abominably&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abominate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abomination&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abort&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aborted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;But&lt;/span&gt; &lt;span class="n"&gt;let&lt;/span&gt; &lt;span class="n"&gt;us&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;see&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt; &lt;span class="n"&gt;side&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;life&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="n"&gt;Described&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;these&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abound&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abounds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abundance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abundant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accessable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accessible&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acclaim&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span 
class="s"&gt;acclaimed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;acclamation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
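&lt;p&gt;As a minimal sketch of how such a lexicon could be used, here is a naive word-counting scorer. The word sets below are tiny hardcoded stand-ins for the full &lt;code&gt;opinion_lexicon.positive()&lt;/code&gt; and &lt;code&gt;opinion_lexicon.negative()&lt;/code&gt; lists, so the snippet runs without any downloads:&lt;/p&gt;

```python
# Tiny stand-ins for opinion_lexicon.positive() / opinion_lexicon.negative()
positives = {'abound', 'abundant', 'acclaimed', 'nice', 'great'}
negatives = {'abnormal', 'abominable', 'aborted', 'bad', 'awful'}

def naive_sentiment(text):
    """Score = (# positive words) - (# negative words) in the text."""
    words = text.lower().split()
    return sum(w in positives for w in words) - sum(w in negatives for w in words)

print(naive_sentiment("The acclaimed product is great"))   # → 2
print(naive_sentiment("An abominable , aborted attempt"))  # → -2
```

&lt;p&gt;Real-world scorers normalize by text length and handle negation, but even this crude difference already separates clearly positive reviews from clearly negative ones.&lt;/p&gt;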



&lt;h3&gt;
  
  
  and much more…
&lt;/h3&gt;

&lt;p&gt;There are many other corpora available. It is not the task of this post/notebook to repeat what one can read in the &lt;a href="http://www.nltk.org/howto/corpus.html#tagged-corpora" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;. After downloading the relevant material, NLTK gives you access to resources such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multilingual corpora (like the &lt;em&gt;Universal Declaration of Human Rights&lt;/em&gt; with 300+ languages)&lt;/li&gt;
&lt;li&gt;lexical resources (&lt;em&gt;WordNet&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;pronouncing dictionaries (&lt;em&gt;CMU Pronouncing Dictionary&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lots of things to browse. But it is worth taking a look at some of them, so that you get that “I saw it somewhere…” feeling when facing an NLP task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fetch anything
&lt;/h2&gt;

&lt;p&gt;If the bundled resources are not enough for you, just start using external ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  requests
&lt;/h3&gt;

&lt;p&gt;Python has libraries for everything, so it is possible to pull any resource from the net into your app as long as you have its URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://databricks.com/blog/2018/09/26/whats-new-for-apache-spark-on-kubernetes-in-the-upcoming-apache-spark-2-4-release.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;blog_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="n"&gt;blog_text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;html lang=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; prefix=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# video: http://ogp.me/ns/video# ya: http://webmaster.yandex.ru/vocabularies/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s"&gt; &amp;lt;meta charset=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTF&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  RSS
&lt;/h3&gt;

&lt;p&gt;Ready to consume RSS? With Python, nothing is easier. You can create an instance of &lt;code&gt;nltk.Text&lt;/code&gt; with an RSS feed as the input. Here is a snippet showing how it could be done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://jvm-bloggers.com/pl/rss.xml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;feed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look ma! I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve just parsed RSS from a very nice Polish blogging platform. It has a title %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And there we go with 5 exemplary entries:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;Collecting&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;
&lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;built&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;
&lt;span class="n"&gt;Installing&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;
&lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;installed&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;5.2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;Look&lt;/span&gt; &lt;span class="n"&gt;ma&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve just parsed RSS from a very nice Polish blogging platform. It has a title JVMBloggers
And there we go with 5 exemplary entries:
 &amp;gt; Odpowiedź: 42
 &amp;gt; Thanks for explaining the behaviour of dynamic (partition overwrite) mode.
 &amp;gt; Non-blocking and async Micronaut - quick start (part 3)
 &amp;gt; Strefa VIP
 &amp;gt; Mikroserwisy – czy to dla mnie?

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  cleaning
&lt;/h3&gt;

&lt;p&gt;When your HTML doc is fetched, you probably have a document full of HTML mess, and there is no added value in having &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; tags in your text. So some clean-up work has to be done, and there are tools that make it easy, although they are not part of the &lt;code&gt;nltk&lt;/code&gt; package. Let us take BeautifulSoup as an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blog_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blog-content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;text_without_markup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;text_without_markup&lt;/span&gt;


&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n\n&lt;/span&gt;&lt;span class="s"&gt;What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release&lt;/span&gt;&lt;span class="se"&gt;\n\n\n&lt;/span&gt;&lt;span class="s"&gt;September 26&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Normalization - languages are not easy
&lt;/h2&gt;

&lt;p&gt;Your language is not easy. If you are Polish like me, it is soooo true. But even English and other European languages add complexity to NLP. Why? Words take different forms, and grammar rules have to be respected when a machine analyzes text. Taking the English word &lt;code&gt;going&lt;/code&gt; as an example, you mean the verb &lt;code&gt;go&lt;/code&gt; (its lemma), plus the &lt;code&gt;-ing&lt;/code&gt; suffix that has to be recognized and stripped for the purpose of analysis. Which three processes have built-in support in NLTK? Read on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokenization
&lt;/h3&gt;

&lt;p&gt;Text consists of sentences, and sentences contain words. Oftentimes we would like to have words presented as vectors, since we will apply some algebra to them. The simplest approach to tokenization can be implemented as follows, but it has limitations; instead, you can use the variety of &lt;em&gt;tokenizers&lt;/em&gt; available in the &lt;code&gt;nltk.tokenize&lt;/code&gt; &lt;a href="https://www.nltk.org/api/nltk.tokenize.html" rel="noopener noreferrer"&gt;package&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# write tokenizer yourself?
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two smart coders are coding very quickly. Why? The end of the sprint is coming! The code has to be finished!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokens_manual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\s+]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens taken manually %s &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;tokens_manual&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# or maybe choose the one from the abundance in `nltk`
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.casual&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TweetTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;casual_tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.mwe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MWETokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.punkt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PunktSentenceTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.regexp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RegexpTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WhitespaceTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;BlanklineTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WordPunctTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;wordpunct_tokenize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regexp_tokenize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;blankline_tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.repp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReppTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.sexpr&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SExprTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sexpr_tokenize&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.simple&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SpaceTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TabTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LineTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;line_tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.texttiling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextTilingTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.toktok&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToktokTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.treebank&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TreebankWordTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.util&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;string_span_tokenize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regexp_span_tokenize&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize.stanford_segmenter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StanfordSegmenter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt;

&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;punkt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;Tokens&lt;/span&gt; &lt;span class="n"&gt;taken&lt;/span&gt; &lt;span class="n"&gt;manually&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Two&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;very&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quickly.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Why?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coming!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;finished!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;package&lt;/span&gt; &lt;span class="n"&gt;punkt&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Unzipping&lt;/span&gt; &lt;span class="n"&gt;tokenizers&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;punkt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Two&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;very&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quickly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Why&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coming&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;finished&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stemming
&lt;/h3&gt;

&lt;p&gt;One of the tasks that can be done is &lt;em&gt;stemming&lt;/em&gt;: getting rid of word endings. Let us see what the popular Porter &lt;em&gt;stemmer&lt;/em&gt; does to the text we tokenized before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;porter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PorterStemmer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tokens_stemmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;porter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens_stemmed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;two&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;veri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quickli&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;come&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;finish&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lemmatization
&lt;/h3&gt;

&lt;p&gt;If stemming is not enough, &lt;em&gt;lemmatization&lt;/em&gt; comes next, so your words can be matched against a real dictionary. Below is an example of running it on our text sample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wordnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;wnl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WordNetLemmatizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;lemmas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;wnl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lemmatize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lemmas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;package&lt;/span&gt; &lt;span class="n"&gt;wordnet&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nltk_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;Unzipping&lt;/span&gt; &lt;span class="n"&gt;corpora&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;wordnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Two&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;very&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quickly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Why&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span 
class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sprint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coming&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ha&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;finished&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;So where did we go? I analyzed the NLTK book available online, chapters 2 and 3, and gave a try to a few of the many tools available in this Natural Language Toolkit. Now it is time to explore the other chapters over there. Stay tuned. I still have to build my &lt;em&gt;Suggestion tool&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;All of this was created as a Jupyter notebook.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>nltk</category>
      <category>text</category>
      <category>python</category>
    </item>
    <item>
      <title>Innovation day at Volvo</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Wed, 17 Oct 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/innovation-day-at-volvo-1l64</link>
      <guid>https://forem.com/lukaszkuczynski/innovation-day-at-volvo-1l64</guid>
      <description>&lt;p&gt;As a Volvo employee, I had a chance to attend several sessions during the Innovation Days summit held a few days ago at Volvo in Wrocław. Several stories were told, and it is nice that these stories are not mere theory; they are practised in some rooms of Volvo's buildings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation
&lt;/h2&gt;

&lt;p&gt;Automation is visible in many business areas. One of them is process automation. There is always some mundane work that can be automated: if you can use a browser to perform an action, surely a robot can click through the webpage for you. One such automation tool is &lt;a href="https://www.uipath.com/"&gt;UIPath&lt;/a&gt;, and what is nice is that there is also a Big Data tool linked to it that allows for better control over the metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Convince them
&lt;/h2&gt;

&lt;p&gt;Another presentation was about how to convince your colleagues that they need innovation. Of course, there is always the question of money and time. When can busy workers find time to work on innovations? Maybe they can find a slot in their downtime. There is always some period of doing nearly nothing, so why not utilize it?&lt;/p&gt;

&lt;h2&gt;
  
  
  UX
&lt;/h2&gt;

&lt;p&gt;User experience is another hot topic. Why do you like the software tools that you use often? Mostly because they are easy to use: you don’t need to read a manual to start using them. So when creating a tool, remember not to fall into the common traps: don’t be too trendy, don’t overdesign, and don’t let your user be the designer. A simple example was shown in which a user had to fill in a complicated HTML form; when it was simplified (fewer HTML elements, more open search capabilities), usage of the tool increased and more users became happy ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Artificial and Business Intelligence
&lt;/h2&gt;

&lt;p&gt;AI is really here. As the global results of projects like &lt;a href="https://www.youtube.com/watch?v=2Gc1zz5bl8I"&gt;Vera Car&lt;/a&gt; show, other teams are utilizing Artificial Intelligence too. The intro was very nice. It began with a question: how do you know that an apple is an apple? Because you are able to properly identify and measure its properties, its features. This is what AI and machine learning are about, except it is done by a machine and not by you. Do you fear the progress of AI in our lives? You should, because soon you may learn that your job can be done by a machine better, faster, and cheaper! At the end, I also learned about QlikSense. There is no value in data that is not easily viewable and discoverable; the data you have brings value to your business if it helps you decide on your actions and make the right decisions. QlikSense makes it possible to create dashboards and analyze your datasets with easy-to-use visualizations.&lt;/p&gt;

</description>
      <category>report</category>
      <category>conference</category>
    </item>
    <item>
      <title>Monitoring Wrocław weather with AWS</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Tue, 18 Sep 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/monitoring-wrocaw-weather-with-aws-903</link>
      <guid>https://forem.com/lukaszkuczynski/monitoring-wrocaw-weather-with-aws-903</guid>
      <description>&lt;p&gt;What if I wanted to compare AWS Lambdas with Azure Function apps? In the previous weeks I built a sample Azure app; &lt;a href="https://dev.to/lukaszkuczynski/serverless-monitoring-of-weather-with-azure-4d89"&gt;there is an article on that&lt;/a&gt;. But I have always felt that AWS is smarter than Azure: the GUI is lighter, and I love Python. So why not port my project to AWS?&lt;/p&gt;

&lt;h2&gt;
  
  
  Plan
&lt;/h2&gt;

&lt;p&gt;Create a notification mechanism that tells me about weather changes in Wrocław. It will be built with AWS Lambda functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Check the weather
&lt;/h3&gt;

&lt;p&gt;A Lambda function can be fired by &lt;em&gt;CloudWatch Events&lt;/em&gt;, where I can set up a cron expression. This way I will check the weather exposed by the &lt;a href="https://openweathermap.org/current"&gt;openweather API&lt;/a&gt; and put it on a queue. Queues in AWS are part of the so-called &lt;em&gt;SQS&lt;/em&gt; service, and handling SQS is pretty easy using the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html"&gt;Python boto clients&lt;/a&gt;. Below is my function to fetch the data: every 10 minutes it calls the &lt;em&gt;openweather&lt;/em&gt; API to query the current weather and puts its description on a queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;SITE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'site'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# URL of the site to check, stored in the site environment variable
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;EXPECTED&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Checking {} at {}...'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'User-Agent'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'AWS Lambda'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'weather'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'main'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'dt'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;dt_iso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fromtimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"weather is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with UTC time &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dt_iso&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;queueUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SQS_NEW_WEATHER'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sqs'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MessageBody&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Check failed!'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Check passed!'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Check complete at {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
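&lt;p&gt;For reference, the 10-minute schedule itself can also be sketched with the AWS CLI instead of the console. This is only a rough sketch: the rule name, region, account id, and function ARN below are placeholders, not values from my setup.&lt;/p&gt;

```shell
# Create a CloudWatch Events rule that fires every 10 minutes
# (the rule name is a placeholder).
aws events put-rule \
  --name checkWeatherEvery10Minutes \
  --schedule-expression "rate(10 minutes)"

# Point the rule at the weather-checking Lambda function
# (region, account id, and function name are placeholders).
aws events put-targets \
  --rule checkWeatherEvery10Minutes \
  --targets "Id"="1","Arn"="arn:aws:lambda:eu-west-1:123456789012:function:checkWeather"
```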



&lt;h3&gt;
  
  
  Write to persistence
&lt;/h3&gt;

&lt;p&gt;Having received the value, I would like to compare it against the previous one to decide whether there was a change, so I have to put the &lt;strong&gt;state&lt;/strong&gt; somewhere. I will use the default AWS database: DynamoDB.&lt;/p&gt;

&lt;p&gt;The function is triggered by a new message on SQS, which is easily configurable in the AWS Lambda GUI. The message received from SQS is available in the &lt;em&gt;event&lt;/em&gt; parameter of the lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;boto3.dynamodb.conditions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"received!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Records'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'body'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;dynamodb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'dynamodb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'weather'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'%Y%m%d_%H%M%S'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="s"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'wroclawWeather'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"time created with value "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the above function creates a new item in the DynamoDB instance. As you can see, the &lt;em&gt;boto3&lt;/em&gt; library handles the DynamoDB connection: all you need is to call &lt;em&gt;resource&lt;/em&gt; and &lt;em&gt;Table&lt;/em&gt; and you are ready to &lt;code&gt;put_item&lt;/code&gt; into the table. This will enable the comparison in the near future. Behold!&lt;/p&gt;
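&lt;p&gt;Note that the upcoming comparison depends on the &lt;code&gt;created_at&lt;/code&gt; sort key ordering items chronologically. The &lt;code&gt;%Y%m%d_%H%M%S&lt;/code&gt; format guarantees this, because its lexicographic order matches chronological order. A quick standalone check, plain Python with no AWS involved:&lt;/p&gt;

```python
from datetime import datetime

# Two timestamps, ten minutes apart, formatted like the DynamoDB sort key.
fmt = '%Y%m%d_%H%M%S'
earlier = datetime(2018, 9, 18, 12, 0, 0).strftime(fmt)
later = datetime(2018, 9, 18, 12, 10, 0).strftime(fmt)

# String comparison agrees with chronological order, so DynamoDB
# can sort items correctly by this string key.
assert earlier < later
print(earlier, later)  # 20180918_120000 20180918_121000
```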

&lt;h3&gt;
  
  
  Compare time!
&lt;/h3&gt;

&lt;p&gt;With the new item inserted into the table, I can compare it against the previous one of the same type. Again I will use the appropriate trigger, this time the “new row” trigger for DynamoDB. The last two items are fetched with this complicated query using &lt;em&gt;KeyConditionExpression&lt;/em&gt;, because I only want the two most recent values of the exact type (like a &lt;code&gt;WHERE&lt;/code&gt; clause in SQL).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;boto3.dynamodb.conditions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# print("Received event: " + json.dumps(event, indent=2))
&lt;/span&gt;    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Records'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"len records != 1, len = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"len(REC) &amp;lt;&amp;gt; 0"&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'dynamodb'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'NewImage'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'S'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"current_type = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dynamodb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'dynamodb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'weather'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;KeyConditionExpression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'type'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_type&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;ScanIndexForward&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;Limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'dynamodb'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'NewImage'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'S'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;old_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Items'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;old_value&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;change_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"changed IN DYNAMO trigger! "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;old_value&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;"-&amp;gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;queueUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SQS_CHANGED_WEATHER'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sqs'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queueUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MessageBody&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;change_text&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Current FROM DYNAMO &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, previous &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;old_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
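&lt;p&gt;The query semantics are easy to mimic without AWS: with &lt;em&gt;ScanIndexForward=False&lt;/em&gt; and &lt;em&gt;Limit=2&lt;/em&gt;, the items come back newest-first, so &lt;code&gt;Items[0]&lt;/code&gt; is the current value and &lt;code&gt;Items[1]&lt;/code&gt; the previous one. A tiny pure-Python sketch of the same change-detection logic; the item dicts are made up for illustration:&lt;/p&gt;

```python
# Pure-Python sketch of the change detection above; no AWS calls.
# The items are already sorted newest-first, like the DynamoDB query
# with ScanIndexForward=False and Limit=2 (values are made up).
items = [
    {'created_at': '20180918_121000', 'type': 'wroclawWeather', 'text': 'Rain'},
    {'created_at': '20180918_120000', 'type': 'wroclawWeather', 'text': 'Clouds'},
]
new_value = items[0]['text']
old_value = items[1]['text']

# Only a real change produces a notification message.
if old_value != new_value:
    change_text = old_value + '->' + new_value
    print(change_text)  # Clouds->Rain
```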



&lt;p&gt;If the value changed, the message wanders to another SQS queue, which I will use to notify the user about the change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notification with SNS
&lt;/h3&gt;

&lt;p&gt;When the weather changes, the information goes to the &lt;strong&gt;Simple Notification Service&lt;/strong&gt;, which can be used to… apply notification rules. In my case, I will simply have an email sent with the text of the change.&lt;/p&gt;

&lt;p&gt;And once again, the &lt;em&gt;boto3&lt;/em&gt; library lets me create an SNS client, so I can send an SNS message super-easily from within the lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"called weatherChanged"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;change_text_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Records'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'body'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"I received a change: "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;change_text_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;arn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'SNS_TOPIC'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;change_text_value&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'sns'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TargetArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;change_text_value&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, I just needed to configure the SNS connection.&lt;/p&gt;
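&lt;p&gt;The extraction done by the handler above can be exercised locally with a hand-written event in the SQS shape (the body text below is made up):&lt;/p&gt;

```python
# Minimal local check of the event parsing the handler performs.
# The event mirrors the SQS 'Records' shape; the body text is made up.
sample_event = {"Records": [{"body": "temperature dropped to 5 degrees"}]}

def extract_change(event):
    # Same extraction as in the handler: the first record's body
    return event["Records"][0]["body"]

print(extract_change(sample_event))  # prints: temperature dropped to 5 degrees
```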

&lt;h3&gt;
  
  
  Configuring SNS
&lt;/h3&gt;

&lt;p&gt;To receive emails I had to create a topic. The topic has an SNS address (ARN) that I used in the Lambda function code. Then a subscription needs to be created: if you use a subscription of the “Email” type, AWS will push the messages straight to your email inbox.&lt;/p&gt;
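&lt;p&gt;The steps above can also be scripted with &lt;em&gt;boto3&lt;/em&gt;; a minimal sketch (the topic name is made up, and the recipient still has to confirm the subscription before messages arrive):&lt;/p&gt;

```python
def setup_weather_notifications(email_address):
    """Create an SNS topic and an email subscription for it.

    A sketch of the setup described above; the topic name is
    illustrative. Needs AWS credentials, hence the deferred import.
    """
    import boto3
    sns = boto3.client("sns")
    # create_topic is idempotent by name: it returns the existing ARN
    # if a topic with this name already exists
    topic = sns.create_topic(Name="weather-changes")
    topic_arn = topic["TopicArn"]
    # The 'email' protocol makes AWS push published messages to the
    # inbox, once the recipient confirms the subscription
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email_address)
    return topic_arn
```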

&lt;h2&gt;
  
  
  AWS or Azure?
&lt;/h2&gt;

&lt;p&gt;I configured the same business logic in both AWS and Azure. Time for a comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure pros:
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;security made easier&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I wasn’t forced to care about security too much; all the functions were connected to each other out-of-the-box.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;functions bundled together&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because a Function App in Azure is the root for all the functions inside it, I could nicely put all the related functions in one resource.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS pros:
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;python&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The language is just way better for me. I have no experience in C#, and Python is beautiful for playing with functions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;no hassle with param bindings&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bindings in Azure are not as straightforward for me as AWS triggers and parameter handling are for Lambda. Just better.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;boto3&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Azure I had to deal with strange return values and AsyncCollectors, while AWS gives you &lt;em&gt;boto3&lt;/em&gt;, which makes everything simple.&lt;/p&gt;

&lt;p&gt;My choice? &lt;strong&gt;AWS&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>weather</category>
      <category>comparison</category>
    </item>
    <item>
      <title>My girl in IT</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Fri, 14 Sep 2018 06:05:24 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/my-girl-in-it-3pml</link>
      <guid>https://forem.com/lukaszkuczynski/my-girl-in-it-3pml</guid>
<description>&lt;p&gt;Recently I was asked by my wife if there is something she could learn or do to help me in my IT job. What options do you see?&lt;br&gt;
Say you want to work 4 days and your girl 1 day, and you want to co-operate.&lt;br&gt;
What IT stuff can she do? What should I teach her?&lt;br&gt;
I am a cloud/python/java/devops person.&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>PyConPL 2018 report</title>
      <dc:creator>lukaszkuczynski</dc:creator>
      <pubDate>Sun, 26 Aug 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/lukaszkuczynski/pyconpl-2018-report-2loo</link>
      <guid>https://forem.com/lukaszkuczynski/pyconpl-2018-report-2loo</guid>
<description>&lt;p&gt;It was time for a Python conference. Having attended many JVM ones, I never had a chance to give Python its place. Now, whenever a problem stands before me, Python is my default language of choice; I love the freedom and possibilities it gives, especially in the data science world. Let’s be honest, it has been my “love” for two years now, so attending &lt;a href="https://pl.pycon.org/2018/"&gt;PyCon&lt;/a&gt; was a must. What kept my focus? I will share it with you:&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed is what matters
&lt;/h2&gt;

&lt;p&gt;One of the talks I attended was about &lt;a href="https://pypy.org/"&gt;&lt;code&gt;pypy&lt;/code&gt;&lt;/a&gt;. Mr Cuni presented it as the default choice when you want to write fast Python code: execute your code with &lt;code&gt;pypy&lt;/code&gt; instead of &lt;code&gt;python&lt;/code&gt;. The use case shown during the presentation was video processing, and indeed the version of the program run by &lt;em&gt;pypy&lt;/em&gt; was much faster. But the next day, when I discussed the topic with video processing people, they said the comparison chosen for the presentation was not quite fair: Python is just a scripting language, not a real-time video processing tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud is everywhere
&lt;/h2&gt;

&lt;p&gt;We were shown how important cloud services are. Two main players were mentioned. Who won? In my opinion the stage belongs to AWS. It is hard to imagine how we could handle processing great amounts of data without the cloud. First of all, Suryansh Tibarewala presented Alexa, which is a service on AWS. Unfortunately we didn’t see any live demo, but at least the theory was introduced. Beware: speech recognition brings lots of difficulties, one of them being the handling of consecutive questions within a context. The next day, Michał Smereczyński shed some light on Microsoft as a cloud provider. He was convincing us that Azure is not only about Windows. &lt;em&gt;Azure Batch&lt;/em&gt; is a nice Azure service, thanks to which you can run your scripts in parallel. If you have some batch job to be executed occasionally, there is no need to buy and maintain big servers. Furthermore, you are not forced to set up your environment manually each time, as &lt;em&gt;ACR&lt;/em&gt; allows you to store and use your Docker images. Friday’s Ansible talk given by Paweł Kopka also made it clear that, with a professional configuration management tool, I can play with any cloud provider using the same (or similar) scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are you healthy?
&lt;/h2&gt;

&lt;p&gt;Two of the presentations were especially cool, in my humble opinion. And important! They proved how Python may help in disease prevention. On Friday I learned about the use of Python in bioinformatics, in a talk given by Jacek Śmietański. We learned what makes DNA, RNA and proteins different, and how to read very long genome sequences and analyze them with &lt;code&gt;Biopython&lt;/code&gt;, which you can &lt;a href="https://biopython.org/"&gt;fetch from here&lt;/a&gt;. Unfortunately the talk was not given in English, which kept some of the interested foreign audience from following it. On Saturday, in the &lt;em&gt;MRI talk&lt;/em&gt;, we were shown how Python helps scientists analyze MRI pictures. The &lt;a href="http://nipy.org/"&gt;libraries used&lt;/a&gt; for that purpose come from the &lt;code&gt;nipy&lt;/code&gt; family, but you can also make good use of non-Python apps such as &lt;em&gt;FreeSurfer&lt;/em&gt;. The main use case of Mikołaj Buchwald was to learn which parts of our brain are stimulated when we are shown particular images. What motivates me to keep a positive mindset is the following sentence I took away from that talk:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Performing an action and visualizing it are almost the same, so keep thinking with a positive attitude&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Be safe!
&lt;/h2&gt;

&lt;p&gt;Ms Christine Bejerasco showed &lt;em&gt;F-Secure&lt;/em&gt;’s way toward security standards. She started as a Perl developer and then switched to Python, as she could continue working with regexes there, and well… we all know it: Python is far better. In my opinion her cybersecurity talk was one of the best: it was of great quality, no excessive details were forced on us, and a lot of practical examples were shown. When the Internet was not so popular, malicious code was attached to existing &lt;em&gt;exe&lt;/em&gt; files and could infect machines via physical drives. Now, living in times of such a popular “online life”, we have to be careful about what we browse and what we open. I learned that even visiting dangerous sites, without intentionally downloading anything, can lead to disaster. My browser can be &lt;strong&gt;probed&lt;/strong&gt; for vulnerabilities and then &lt;strong&gt;injected&lt;/strong&gt; with malicious code. Viruses can even lie dormant for some time and then wake up at the most critical moment (Stuxnet). How do you play with viruses without harming your machine? Researchers use sandboxes for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to teach a machine
&lt;/h2&gt;

&lt;p&gt;Machine learning is not an obvious task, and it is not only a buzzword nowadays. Words like &lt;em&gt;numpy&lt;/em&gt;, &lt;em&gt;scipy&lt;/em&gt; and &lt;em&gt;sklearn&lt;/em&gt; were mentioned heavily throughout the conference. I was happy to attend Saturday’s workshop, which showed me how neural networks help to recognize pictures. The challenges and solutions of &lt;em&gt;computer vision&lt;/em&gt; were presented by Prakhar Srivastava. It was pretty new for me to learn that already in 2015 convolutional neural networks surpassed the human ability to recognize simple pictures. Sounds a little creepy. I learned that before using advanced libraries such as &lt;code&gt;keras&lt;/code&gt; you had better understand how it all works under the hood. Then you will be better prepared to 1) cherish what &lt;code&gt;import sklearn&lt;/code&gt; enables you to do and 2) use these tools well. You can also play with a nice neural network demo online, &lt;a href="http://scs.ryerson.ca/~aharley/vis/conv/"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>pycon</category>
      <category>python</category>
      <category>conference</category>
      <category>report</category>
    </item>
  </channel>
</rss>
