<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vijay</title>
    <description>The latest articles on Forem by Vijay (@vnarayaj).</description>
    <link>https://forem.com/vnarayaj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1006589%2F27bbb813-0840-4d4b-b593-9fd0ee5b98e4.png</url>
      <title>Forem: Vijay</title>
      <link>https://forem.com/vnarayaj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vnarayaj"/>
    <language>en</language>
    <item>
      <title>Analysing Github Stars - Extracting and analyzing data from Github using Apache NiFi®, Apache Kafka® and Apache Druid®</title>
      <dc:creator>Vijay</dc:creator>
      <pubDate>Thu, 12 Jan 2023 07:53:17 +0000</pubDate>
      <link>https://forem.com/vnarayaj/analysing-github-stars-extracting-and-analyzing-data-from-github-using-apache-nifir-apache-kafkar-and-apache-druidr-280</link>
      <guid>https://forem.com/vnarayaj/analysing-github-stars-extracting-and-analyzing-data-from-github-using-apache-nifir-apache-kafkar-and-apache-druidr-280</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.tourl"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
As part of the developer relations team in Imply, I thought it would be interesting to extract data about users who had &lt;a href="https://docs.github.com/en/get-started/exploring-projects-on-github/saving-repositories-with-stars" rel="noopener noreferrer"&gt;starred&lt;/a&gt; the &lt;a href="https://github.com/apache/druid" rel="noopener noreferrer"&gt;apache/druid&lt;/a&gt; repository. Stars don’t just help us understand how many people find Druid interesting, they also give insight into what other repositories people find interesting. And that is really important to me as an advocate – I can work out what topics people might be interested in knowing more about in my articles and at Druid &lt;a href="https://www.meetup.com/pro/apache-druid/" rel="noopener noreferrer"&gt;meetups&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/spencerwkimball/" rel="noopener noreferrer"&gt;Spencer Kimball&lt;/a&gt; (now CEO at &lt;a href="https://www.cockroachlabs.com/" rel="noopener noreferrer"&gt;CockroachDB&lt;/a&gt;) wrote an interesting &lt;a href="https://www.cockroachlabs.com/blog/what-can-we-learn-from-our-github-stars/" rel="noopener noreferrer"&gt;article&lt;/a&gt; on this topic in 2021 where they created &lt;a href="https://github.com/spencerkimball/stargazers" rel="noopener noreferrer"&gt;spencerkimball/stargazers&lt;/a&gt; based on a Python script. So I started thinking: could I create a data pipeline using &lt;a href="https://nifi.apache.org/" rel="noopener noreferrer"&gt;Nifi&lt;/a&gt; and &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; (two OSS tools often used with Druid)  to get the API data into &lt;a href="https://druid.apache.org/" rel="noopener noreferrer"&gt;Druid&lt;/a&gt; - and then use SQL to do the analytics? The answer was yes! And I have documented the outcome below. Here’s my analytical pipeline for Github stars data using Nifi, Kafka and Druid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources - the Github API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Github provides an API (&lt;a href="https://docs.github.com/en/rest/activity/starring?apiVersion=2022-11-28#list-stargazers" rel="noopener noreferrer"&gt;/repos/{owner}/{repo}/stargazers&lt;/a&gt;) for extracting stargazer data. It returns 30 users per page, with results spread over multiple pages. Each page is an array like the one below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "starred_at": "2012-10-23T19:08:07Z",
    "user": {
      "login": "user1",
      "id": 45,
      "node_id": "MDQ6VXNlcjQ1",
      "avatar_url": "https://avatars.githubusercontent.com/u/45?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/user1",
      "html_url": "https://github.com/user1",
      "followers_url": "https://api.github.com/users/user1/followers",
      "following_url": "https://api.github.com/users/user1/following{/other_user}",
      "gists_url": "https://api.github.com/users/user1/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/user1/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/user1/subscriptions",
      "organizations_url": "https://api.github.com/users/user1/orgs",
      "repos_url": "https://api.github.com/users/user1/repos",
      "events_url": "https://api.github.com/users/user1/events{/privacy}",
      "received_events_url": "https://api.github.com/users/user1/received_events",
      "type": "User",
      "site_admin": false
    }
  },
  {
    "starred_at": "2012-10-23T19:08:07Z",
    "user": {
      "login": "user2",
      "id": 168,
      "node_id": "MDQ6VXNlcjE2OA==",
      "avatar_url": "https://avatars.githubusercontent.com/u/168?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/user2",
      "html_url": "https://github.com/user2",
      "followers_url": "https://api.github.com/users/user2/followers",
      "following_url": "https://api.github.com/users/user2/following{/other_user}",
      "gists_url": "https://api.github.com/users/user2/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/user2/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/user2/subscriptions",
      "organizations_url": "https://api.github.com/users/user2/orgs",
      "repos_url": "https://api.github.com/users/user2/repos",
      "events_url": "https://api.github.com/users/user2/events{/privacy}",
      "received_events_url": "https://api.github.com/users/user2/received_events",
      "type": "User",
      "site_admin": false
    }
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
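&lt;p&gt;Paging through this endpoint is the heart of the extraction. As a purely illustrative sketch (this is not part of the NiFi flow itself, and the owner/repo/page values are assumptions), here is how the per-page URLs can be built; note that requests need the Accept header application/vnd.github.star+json, otherwise the API omits the starred_at timestamp:&lt;/p&gt;

```python
# Illustrative sketch only: build the stargazer page URLs the pipeline walks.
ACCEPT_HEADER = "application/vnd.github.star+json"  # makes the API return starred_at


def stargazer_page_url(owner: str, repo: str, page: int) -> str:
    """URL for one 30-user page of the stargazers endpoint."""
    return f"https://api.github.com/repos/{owner}/{repo}/stargazers?page={page}"


def page_urls(owner: str, repo: str, pages: int):
    """Yield one URL per page, mirroring the repeated HTTP calls in the flow."""
    for page in range(1, pages + 1):
        yield stargazer_page_url(owner, repo, page)
```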

&lt;p&gt;This is the pipeline that I decided to build:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr5xjnh0qsftf140v49b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr5xjnh0qsftf140v49b.png" alt="Image description" width="501" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;NiFi - fetches JSON results from the multiple pages returned by the API, then splits the JSON into multiple JSONs - one for each star.&lt;/li&gt;
&lt;li&gt;Kafka - acts as the reliable delivery mechanism from NiFi to Druid, and then as the source of ingestion.&lt;/li&gt;
&lt;li&gt;Druid - ingests the JSON, lets me use JSON paths in SQL queries, and supports analytics along the timeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As this was a bit of an experiment, I decided to build two tables: “blog3” for the users who have starred the Druid repository on GitHub, and “blog4” for the organisation names. In Kafka, I would create two topics with the same names.&lt;br&gt;
Druid expects newline-delimited JSON – it doesn’t support JSON arrays at the top level (arrays inside the JSON are fine). To get this data into Druid easily, I decided to break each [...] array up into separate JSON records and publish them to a Kafka topic.&lt;br&gt;
&lt;a href="https://druid.apache.org/docs/latest/ingestion/schema-design.html" rel="noopener noreferrer"&gt;As for the schema&lt;/a&gt;, Druid needs a __time column. This was easy to work out – I would use the datetime from the JSON recording when the repository was starred. I’d also have a field “user”: a JSON object containing the properties of the user who did the starring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install and configure Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kafka.apache.org" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.&lt;/p&gt;

&lt;p&gt;Kafka is a natural fit with Druid – Druid has an out-of-the-box Kafka consumer that guarantees exactly-once ingestion, and one that I could scale up and down quickly thanks to Druid’s architecture.&lt;/p&gt;

&lt;p&gt;You can install Kafka from &lt;a href="https://kafka.apache.org/quickstart" rel="noopener noreferrer"&gt;https://kafka.apache.org/quickstart&lt;/a&gt;.&lt;br&gt;
Because Druid and Kafka both use &lt;a href="https://zookeeper.apache.org" rel="noopener noreferrer"&gt;Apache Zookeeper&lt;/a&gt;, I opted for the Zookeeper deployment that comes with Druid, so I didn’t start the one bundled with Kafka.&lt;br&gt;
Once Kafka was running, I &lt;a href="https://kafka.apache.org/documentation/#basic_ops_add_topic" rel="noopener noreferrer"&gt;created&lt;/a&gt; the two topics to post the data into, and for Druid to ingest from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./kafka-topics.sh --create --bootstrap-server localhost:9092 --topic blog3 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install and configure NiFi&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.&lt;br&gt;
NiFi is very useful when data needs to be loaded from different sources. In this case, I used NiFi to access the Github API, as it makes it very easy to make repeated calls to an HTTP endpoint and get data from multiple pages.&lt;br&gt;
You can see what I did by downloading NiFi yourself and then adding my template from the Druid Datasets repo:&lt;br&gt;
&lt;a href="https://github.com/implydata/druid-datasets/blob/main/githubstars/github_stars.xml" rel="noopener noreferrer"&gt;https://github.com/implydata/druid-datasets/blob/main/githubstars/github_stars.xml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a screenshot of the flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zhnvyo2lh3j9470diuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zhnvyo2lh3j9470diuv.png" alt="Image description" width="492" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.GenerateFlowFile/index.html" rel="noopener noreferrer"&gt;GenerateFlowFile&lt;/a&gt;: Generates dummy content to trigger the flow&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.19.1/org.apache.nifi.processors.attributes.UpdateAttribute/index.html" rel="noopener noreferrer"&gt;UpdateAttribute&lt;/a&gt;: Generates an attribute “p3” to handle the multiple pages from the Github end point.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.17.0/org.apache.nifi.processors.standard.InvokeHTTP/index.html" rel="noopener noreferrer"&gt;InvokeHTTP&lt;/a&gt;: Invokes the Github stargazers endpoint (&lt;a href="https://api.github.com/repos/apache/druid/stargazers?page=$%7Bp3%7D" rel="noopener noreferrer"&gt;https://api.github.com/repos/apache/druid/stargazers?page=${p3}&lt;/a&gt;), substituting the “p3” page attribute.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.SplitJson/index.html" rel="noopener noreferrer"&gt;SplitJson&lt;/a&gt;: Splits the JSON array returned by the endpoint into one JSON per star.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs.html" rel="noopener noreferrer"&gt;MergeContent&lt;/a&gt;: Merge the split JSONS with a new line separator&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-6-nar/1.19.1/org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_6/index.html" rel="noopener noreferrer"&gt;PublishKafka_2_6&lt;/a&gt;: Post JSON to kafka.&lt;/li&gt;
&lt;li&gt;EvaluateJsonPath: Extract loginid and then…&lt;/li&gt;
&lt;li&gt;InvokeHTTP: …uses it to invoke this API: &lt;a href="https://api.github.com/users/$%7Bloginid%7D/orgs" rel="noopener noreferrer"&gt;https://api.github.com/users/${loginid}/orgs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SplitJson: Splits the returned array.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.EvaluateJsonPath/index.html" rel="noopener noreferrer"&gt;EvaluateJsonPath&lt;/a&gt;: Extracts orgid from JSON. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.AttributesToJSON/index.html" rel="noopener noreferrer"&gt;AttributesToJSON&lt;/a&gt;: Creates JSON with loginid and orgid.&lt;/li&gt;
&lt;li&gt;PublishKafka_2_6: Posts the JSON to Kafka.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 8-12 use the login id retrieved in step 7 to get the orgid associated with each starring user by calling the corresponding endpoint in the Github API.&lt;/p&gt;
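&lt;p&gt;The record that steps 8-12 build up can be sketched as follows. This is illustrative Python, not the NiFi processors themselves, and taking each org object’s login field as the orgid is my assumption about the flow:&lt;/p&gt;

```python
import json


def org_records(loginid, orgs_response):
    """For one starring user, pair their loginid with each organisation
    returned by https://api.github.com/users/{loginid}/orgs.
    Mirrors the SplitJson, EvaluateJsonPath and AttributesToJSON steps."""
    return [json.dumps({"loginid": loginid, "orgid": org["login"]})
            for org in orgs_response]


# Trimmed illustration of the orgs endpoint response
sample_orgs = [{"login": "apache"}, {"login": "fossasia"}]
records = org_records("user1", sample_orgs)
```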

&lt;p&gt;For the InvokeHTTP processors, I set my Github API key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq19ng3yvu89o50nr3z75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq19ng3yvu89o50nr3z75.png" alt="Image description" width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install and configure Druid&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://druid.apache.org" rel="noopener noreferrer"&gt;Apache Druid&lt;/a&gt; is a real-time database to power modern analytics applications.&lt;br&gt;
This was going to be how I would look at the data from the APIs using SQL.&lt;br&gt;
Druid can be downloaded from &lt;a href="https://druid.apache.org/docs/latest/tutorials/index.html" rel="noopener noreferrer"&gt;https://druid.apache.org/docs/latest/tutorials/index.html&lt;/a&gt; - I just started it up with the default configuration.&lt;br&gt;
The druid console is on &lt;a href="http://localhost:8888" rel="noopener noreferrer"&gt;http://localhost:8888&lt;/a&gt; by default, so I could quickly get into the ingestion setup wizard and connect to Kafka. The wizard creates a JSON-version of the ingestion specification – you can see mine here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/implydata/druid-datasets/blob/main/githubstars/blog3_ingest.json" rel="noopener noreferrer"&gt;https://github.com/implydata/druid-datasets/blob/main/githubstars/blog3_ingest.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/implydata/druid-datasets/blob/main/githubstars/blog4_ingest.json" rel="noopener noreferrer"&gt;https://github.com/implydata/druid-datasets/blob/main/githubstars/blog4_ingest.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
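&lt;p&gt;For orientation, here is a heavily trimmed sketch of the general shape of such a Kafka supervisor spec. The real specs are in the links above; the field values here are assumptions, and tuningConfig is omitted:&lt;/p&gt;

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "topic": "blog3",
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "blog3",
      "timestampSpec": { "column": "starred_at", "format": "iso" },
      "dimensionsSpec": { "dimensions": [ { "type": "json", "name": "user" } ] }
    }
  }
}
```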

&lt;p&gt;You can submit the specifications yourself in the Supervisors pane under ingestion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f2x2hllch9my98dguml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9f2x2hllch9my98dguml.png" alt="Image description" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’ll show you a preview before you submit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ilstdnac835b7abvj48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ilstdnac835b7abvj48.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Querying Druid&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As soon as the Kafka ingestion supervisor was running, I could see the two sources in Druid’s query tab: table blog3 and table blog4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foekknb798x4ce3h0udja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foekknb798x4ce3h0udja.png" alt="Image description" width="534" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I can see blog3: the users who have starred the Druid repository on GitHub.&lt;br&gt;
And I can also see blog4: the organization names for each login.&lt;/p&gt;

&lt;p&gt;I could straight away do some SQL querying on the incoming Kafka data. Some examples are below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of stargazers who are site admins&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select JSON_VALUE("user",'$.site_admin'), count(*) from blog3 group by 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Stargazers added by month&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select TIME_FLOOR(__time,'P1M'), count(*) from blog3 group by 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Number of users by org who have starred the Druid repo&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select orgid, count(*) from blog4 group by 1 order by count(*) desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
I started this wanting to get some insights into engaging with the community. Where am I on that?&lt;br&gt;
From the last query above, I get the top ten orgs below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra187mokc5wd6n80snx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra187mokc5wd6n80snx2.png" alt="Image description" width="700" height="922"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clearly I should think about gaming-focused content. Maybe see if there are Trino meetups I can present at. I should also try reaching out to fossasia.&lt;/p&gt;

&lt;p&gt;What else can I see? While going through the Github API I realized that the endpoint &lt;a href="https://api.github.com/users/USERNAME/starred" rel="noopener noreferrer"&gt;https://api.github.com/users/USERNAME/starred&lt;/a&gt; lets me fetch the other repositories starred by the users who starred the Druid repository.&lt;br&gt;
I enhanced the NiFi template to add this new endpoint (&lt;a href="https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml" rel="noopener noreferrer"&gt;https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml&lt;/a&gt;)&lt;br&gt;
and used the Druid supervisor spec (&lt;a href="https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml" rel="noopener noreferrer"&gt;https://github.com/implydata/druid-datasets/blob/main/githubstars/nifi_other_repos.xml&lt;/a&gt;) to ingest this into the same datasource (blog4), then ran the query&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select repo, APPROX_COUNT_DISTINCT_DS_THETA(loginid) from blog4 where repo not in ('apache/druid') and repo&amp;lt;&amp;gt;'' group by 1 order by 2 desc limit 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;to get&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bjyy9fs1q1mc95r1zu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bjyy9fs1q1mc95r1zu9.png" alt="Image description" width="800" height="895"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This clearly tells me that I should look at content related to React (ant-design is a UI framework using React), Superset, TensorFlow, Flink, Kubernetes, Metabase and Spark. I should also try to engage these communities to help them use Druid alongside these other products.&lt;/p&gt;

&lt;p&gt;Using NiFi, Kafka and Druid, I’ve put together the beginnings of a real-time modern analytics application. This pipeline fetches data from the Github API and helps me analyze the users who have starred the Druid repository on GitHub. All three products are capable of handling large data volumes, can be run as clusters, and are horizontally scalable.&lt;/p&gt;

&lt;p&gt;Next step – a UI to sit on top of the data! Watch this space for a follow-up post.&lt;/p&gt;

&lt;p&gt;If you have questions on using Druid, do go to the community link below and sign up, or come to the &lt;a href="https://calendly.com/vijay-narayanan/druid-poc-clinic-by-imply" rel="noopener noreferrer"&gt;POC clinic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn more&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://druid.apache.org/community" rel="noopener noreferrer"&gt;https://druid.apache.org/community&lt;/a&gt; – connect with other Apache Druid users&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.imply.io/" rel="noopener noreferrer"&gt;https://learn.imply.io/&lt;/a&gt; - Free druid courses with hands on labs&lt;br&gt;
&lt;a href="https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html" rel="noopener noreferrer"&gt;Kafka Ingestion tutorial&lt;/a&gt; on Druid docs&lt;br&gt;
&lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;https://kafka.apache.org/&lt;/a&gt; - all things Kafka&lt;br&gt;
&lt;a href="https://nifi.apache.org/" rel="noopener noreferrer"&gt;https://nifi.apache.org/&lt;/a&gt; - all things Nifi&lt;br&gt;
&lt;a href="https://druid.apache.org/community" rel="noopener noreferrer"&gt;https://druid.apache.org/community&lt;/a&gt; – connect with other Apache Druid users&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
