<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: kuwala</title>
    <description>The latest articles on Forem by kuwala (@kuwala_io).</description>
    <link>https://forem.com/kuwala_io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F667570%2F00783208-2278-4334-9c6f-1842f5134063.PNG</url>
      <title>Forem: kuwala</title>
      <link>https://forem.com/kuwala_io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kuwala_io"/>
    <language>en</language>
    <item>
      <title>What are the hottest dbt Repositories you should star on Github 2022? - Here are mine.</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Wed, 08 Jun 2022 10:11:57 +0000</pubDate>
      <link>https://forem.com/kuwala_io/what-are-the-hottest-dbt-repositories-you-should-star-on-github-2022-here-are-mine-2fj3</link>
      <guid>https://forem.com/kuwala_io/what-are-the-hottest-dbt-repositories-you-should-star-on-github-2022-here-are-mine-2fj3</guid>
<description>&lt;p&gt;Data engineering has just been named the most in-demand profession of 2022. The process of data engineering can be understood as extracting, transforming, and loading data (ETL). Data is extracted from a source, transformed, and loaded into a table in a data warehouse. This process is automated and repetitive so that clean, up-to-date, and reliable data ultimately flows into dashboards and other data products (e.g. a recommendation engine).&lt;/p&gt;

&lt;h2&gt;The dbt movement and the rise of Analytics Engineering&lt;/h2&gt;

&lt;p&gt;In recent years, dbt in particular has enjoyed growing popularity. dbt is a software framework that sits in the middle of the ETL process: it represents the transformation layer after data is loaded from its original source. dbt combines SQL with software engineering principles. In plain language, this means that SQL can now be used to develop data models, new views, and tables. dbt incorporates the following important software engineering practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dbt encourages &lt;strong&gt;DRY code&lt;/strong&gt; through reusable models and macros&lt;/li&gt;
&lt;li&gt;dbt manages dependencies well. Through its &lt;strong&gt;lineage capabilities&lt;/strong&gt;, it is excellent for complex data warehouse structures and enables you to build out DAGs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;version control&lt;/strong&gt; with Git&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;advanced data tests&lt;/strong&gt;, including data freshness checks&lt;/li&gt;
&lt;/ul&gt;
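&lt;p&gt;To make these principles concrete, here is a minimal sketch of a dbt model (all model and column names are illustrative): a plain SQL file in which &lt;code&gt;ref()&lt;/code&gt; points to an upstream model, which is exactly how dbt derives its dependency graph and lineage:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- models/orders_per_customer.sql
-- ref() is resolved by dbt to the upstream table and recorded as an edge in the DAG
select
    customer_id,
    count(order_id) as order_count
from {{ ref('stg_orders') }}
group by customer_id
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;dbt run&lt;/code&gt; materializes this query as a table or view in the warehouse; &lt;code&gt;dbt test&lt;/code&gt; runs the tests defined alongside it, all under Git version control.&lt;/p&gt;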

&lt;p&gt;&lt;strong&gt;dbt managed to establish an entire job class with the title Analytics Engineer.&lt;/strong&gt; I think this nails both the importance of dbt and the audience that mostly uses it, namely engineers. A new generation of open-source tools has been built on top of dbt. Here are the hottest and &lt;strong&gt;highest-ranked projects on GitHub&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lightdash&lt;/strong&gt; ( &lt;a href="https://github.com/lightdash/lightdash"&gt;https://github.com/lightdash/lightdash&lt;/a&gt; )&lt;br&gt;
Lightdash builds on your dbt models and makes it &lt;strong&gt;possible to define and easily visualize additional metrics&lt;/strong&gt; via a visual interface. The front end helps you understand and extend the underlying SQL queries. Lightdash also visualizes business metrics and makes them shareable with the data team. It is also possible to integrate all data into another visualization tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UDtQerq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esfywepg26mx0qsnpwne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UDtQerq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esfywepg26mx0qsnpwne.png" alt="Image description" width="700" height="490"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;re_data&lt;/strong&gt; ( &lt;a href="https://github.com/re-data/re-data"&gt;https://github.com/re-data/re-data&lt;/a&gt; )&lt;br&gt;
re_data is an &lt;strong&gt;abstraction layer that helps users monitor dbt projects and their underlying data.&lt;/strong&gt; For example, you get alerted when a test fails or a data anomaly occurs in a dbt project, along with the underlying metric that is affected. In addition, the lineage graph is displayed intuitively. re_data is one of several frameworks focusing on the observability aspect of lengthy dbt pipelines (also check out OpenMetadata and Elementary).&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NAqUA_3A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5l3hg2byj5qyyafdrdjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NAqUA_3A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5l3hg2byj5qyyafdrdjm.png" alt="Image description" width="700" height="499"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Evidence&lt;/strong&gt; ( &lt;a href="https://github.com/evidence-dev/evidence"&gt;https://github.com/evidence-dev/evidence&lt;/a&gt; )&lt;br&gt;
Evidence is another tool for &lt;strong&gt;lightweight BI reporting.&lt;/strong&gt; With Evidence you can build simple, Medium-style reports using SQL queries and Markdown. It is reminiscent of Jupyter Notebooks, except that it is based on SQL instead of Python. You can also run SQL queries from the reports you create. I haven’t used the tool myself yet, but it seems ideal for quickly prototyping metrics in a report.&lt;/p&gt;
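&lt;p&gt;To give a flavor of this, here is a rough sketch of what an Evidence report page looks like: a Markdown file with a named SQL query and a chart component that references it. All names are illustrative, and the exact component syntax is an assumption based on my reading of Evidence’s docs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Orders per month

```sql orders_by_month
select month, count(*) as orders
from orders
group by month
```

&amp;lt;BarChart data={orders_by_month} x=month y=orders /&amp;gt;
&lt;/code&gt;&lt;/pre&gt;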


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l3lGN9Lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkbzq5g0uhkcyjxvf9cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l3lGN9Lp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkbzq5g0uhkcyjxvf9cf.png" alt="Image description" width="604" height="546"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Kuwala&lt;/strong&gt; ( &lt;a href="https://github.com/kuwala-io/kuwala"&gt;https://github.com/kuwala-io/kuwala&lt;/a&gt; )&lt;br&gt;
Kuwala is a &lt;strong&gt;data workspace that consolidates the Modern Data Stack and makes it usable for BI analysts and engineers.&lt;/strong&gt; Even though dbt was originally targeted at BI analysts, it is mainly used by engineers. This shifts a large amount of pipeline engineering effort to the IT department. With Kuwala, a BI analyst can intuitively build advanced data workflows on top of the modern data stack using a drag-and-drop interface, without coding. Consequently, the BI analyst can work more iteratively and maintain the complete workflow from source to metrics in a dashboard. Under the hood, dbt models are generated so that a more experienced engineer can customize the pipelines at any time. In addition, engineers can easily convert dbt models into reusable drag-and-drop components.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QuEuEWdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1th3gpirg1841ddhibm7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QuEuEWdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1th3gpirg1841ddhibm7.gif" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fal-AI&lt;/strong&gt; ( &lt;a href="https://github.com/fal-ai/fal"&gt;https://github.com/fal-ai/fal&lt;/a&gt; )&lt;br&gt;
Fal &lt;strong&gt;helps you run Python scripts directly from a dbt project.&lt;/strong&gt; For example, you can load dbt models directly into a Python context, which makes it possible to apply data science libraries like scikit-learn and Prophet to dbt models. This especially improves the data science capabilities within a data pipeline. What I really like about fal is that it extends dbt from an interesting angle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--73U0nGa---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixs1q6zjn7l3foishi98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--73U0nGa---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ixs1q6zjn7l3foishi98.png" alt="Image description" width="700" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Of course, these are not the only five interesting projects among the 16K repositories on GitHub that use dbt. So what is your hottest dbt repo?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>sql</category>
    </item>
    <item>
      <title>Data Engineering Should Not Be A Problem For BI Analysts</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Thu, 26 May 2022 17:17:03 +0000</pubDate>
      <link>https://forem.com/kuwala_io/data-engineering-should-not-be-a-problem-for-bi-analysts-39fg</link>
      <guid>https://forem.com/kuwala_io/data-engineering-should-not-be-a-problem-for-bi-analysts-39fg</guid>
<description>&lt;p&gt;The Modern Data Stack is a tooling set that lets engineers process data flexibly. Each framework has a specific task: Snowflake stores data, Airflow orchestrates, Airbyte loads data, dbt transforms it, and Superset visualizes it. Complex frameworks are needed to deal with complex data. &lt;strong&gt;It's no wonder that with the divergence of tools, it's easy to lose sight of the big picture.&lt;/strong&gt; In fact, it's a crowded market now.&lt;/p&gt;

&lt;p&gt;While in the past a BI analyst could analyze data in Excel or MySQL and then create a report, today they are just the person who visualizes data and briefs the IT team. &lt;strong&gt;The BI analyst is left out&lt;/strong&gt;, and the &lt;strong&gt;engineering department is left alone&lt;/strong&gt; and overloaded with tasks. The important discourse between business and IT is missing, as is a fast iterative process. &lt;strong&gt;No wonder most data projects fail&lt;/strong&gt;, don't deliver ROI, or take way too long. Cross-functional teams suffer from speaking different languages: a BI analyst wants to use dbt, but can't; an engineer wants more support so they don't have to work through ad-hoc and redundant tasks all the time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ti2zQkjQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kw12lnjtg420sv8ere89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ti2zQkjQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kw12lnjtg420sv8ere89.png" alt="Image description" width="827" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For this very reason, Matti and I have built an open-source tool&lt;/strong&gt; that allows BI analysts to build advanced data pipelines directly through an abstraction layer. BI analysts can build workflows from data blocks (data sources, Airbyte under the hood), transformation blocks (dbt under the hood), and data science blocks, and connect them to a dashboard. Here is an overview of all integrations: &lt;a href="https://www.kuwala.io/data-libary"&gt;https://www.kuwala.io/data-libary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iLH5W0_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ndvsfpc4c0nf7lgwftlz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iLH5W0_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ndvsfpc4c0nf7lgwftlz.gif" alt="Image description" width="450" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;creates a clean codebase under the hood.&lt;/strong&gt; For example, dbt projects are created and built automatically. Our idea is to &lt;strong&gt;build a tool for the data analytics space similar to what Webflow is for web designers&lt;/strong&gt;. An experienced engineer can customize the dbt models or create new ones and make them available on the Kuwala canvas for no-coders. How to do this (&lt;strong&gt;for engineers&lt;/strong&gt;) is described here: &lt;a href="https://docs.kuwala.io/contributing/adding-transformations"&gt;https://docs.kuwala.io/contributing/adding-transformations&lt;/a&gt;. Did I mention we are open-source? 😅 It would be awesome to work together with you on a PR. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back to topic:&lt;/strong&gt; This flexibility is absolutely necessary because data projects are individual, grow, and need to be customizable. &lt;/p&gt;

&lt;p&gt;Our current project state covers the following points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;💯 The setup of Kuwala is now straightforward. Just clone the repo and pull the Docker images!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🗂 You can now easily connect to Snowflake, Postgres, or BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🧱 20+ transformations covering merging, aggregating, and filtering data. Under the hood, we generate clean dbt models! (&lt;a href="https://www.kuwala.io/data-libary"&gt;https://www.kuwala.io/data-libary&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🛠 You can now also easily add new transformations using dbt-core. (&lt;a href="https://docs.kuwala.io/contributing/adding-transformations"&gt;https://docs.kuwala.io/contributing/adding-transformations&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Want to give Kuwala a try? It's easy:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit our GitHub repo: &lt;a href="https://github.com/kuwala-io/kuwala"&gt;https://github.com/kuwala-io/kuwala&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Clone the repo: &lt;a href="https://github.com/kuwala-io/kuwala.git"&gt;https://github.com/kuwala-io/kuwala.git&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cd&lt;/code&gt; into the root directory&lt;/li&gt;
&lt;li&gt;From there, run &lt;code&gt;docker-compose --profile kuwala up&lt;/code&gt; in the terminal&lt;/li&gt;
&lt;li&gt;Open &lt;a href="http://localhost:3000"&gt;http://localhost:3000&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;We are now looking for more people to use Kuwala and build out separate parts so we can grow as a community.&lt;/strong&gt; We are set. Are you?&lt;/p&gt;

&lt;p&gt;Start hacking! Send us your issues! Start contributing! And if something doesn't work, join our Slack community and we will help you 🚀 &lt;a href="https://kuwala-community.slack.com/ssb/redirect"&gt;https://kuwala-community.slack.com/ssb/redirect&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>beginners</category>
      <category>python</category>
      <category>tooling</category>
    </item>
    <item>
      <title>The Paradigm Shift of Business Models in the Data Space is Real</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Wed, 16 Mar 2022 18:08:34 +0000</pubDate>
      <link>https://forem.com/kuwala_io/the-paradigm-shift-of-business-models-in-the-data-space-is-real-1m8n</link>
      <guid>https://forem.com/kuwala_io/the-paradigm-shift-of-business-models-in-the-data-space-is-real-1m8n</guid>
<description>&lt;p&gt;Last year we saw an explosion of startup investments in no-code and low-code data platforms as well as open-source projects in the data space, which together primarily define the modern data stack. It is time to dive a little deeper into the topic and understand the dynamics of these markets. In this article, I highlight the reasons for these booms and draw attention to the problems of the markets to ultimately pose the question: are the days of the classic SaaS model in the data space numbered?&lt;/p&gt;

&lt;h2&gt;The complexity of data projects&lt;/h2&gt;

&lt;p&gt;Eight years ago, companies started to look at how to make better business decisions using the data they collect. At that time, I started my career as a data scientist, and I heard for the first time that there was a shortage of data scientists and software developers. Today in 2022, the picture has become even more dramatic. Demand for data and engineering jobs is growing at twice the rate of supply. Salaries for data talent exploded, and data science was named the “sexiest job in the world” twice in a row. And the demands on developers continue to grow: new frameworks, new apps, and new ideas require complex solutions. Two years ago, in Rasa’s office in Kreuzberg, Berlin, one of my favorite data bloggers told me that a “data science fallout winter” was coming. What he meant was that companies were disappointed because data projects rarely end in success. From my observations, this was due to several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It takes a long time for a manager to align with a data scientist and software engineer&lt;/strong&gt; in a way that translates business goals into a data strategy and data project. The feedback loops were taking too long.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expectations were set too high and timelines were planned far too tightly.&lt;/strong&gt; A data project cannot be treated like a normal software project, which is already complex enough in itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data collection and cleansing tied up&lt;/strong&gt; a lot of the resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, on a VentureBeat panel, the often-quoted figure was thrown into the room: &lt;strong&gt;85% of data projects fail. The symptoms of the data science fallout winter were there.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksv1e261v4g07qdlf5mz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksv1e261v4g07qdlf5mz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The renaissance of open-source led by the modern data stack&lt;/h2&gt;

&lt;p&gt;While companies were a bit timid about investing in data science and business intelligence projects, grandiose solutions were emerging among software developers. GitHub was the place where big solutions were born: a safe space for engineers with a vision. On GitHub, some of the hottest frameworks in the data space quietly emerged. We are talking about Elasticsearch (database), Airflow (data pipelines), dbt (transformations), Meltano (data extraction), and most recently Airbyte (data extraction). These tools are often summarized under the buzzword “modern data stack”. They run flexibly on a data warehouse like Snowflake, the setup is relatively simple, and the services can be combined to cover the processing and analysis of data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vlvx1ky20mmsspap16b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vlvx1ky20mmsspap16b.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But why open-source? Why hasn’t Microsoft or another tech giant owned and branded the modern data stack? The answer is trivial and the realization has been world-changing: data projects are among the most complex tasks in software development. Every data project is an edge case, even if the business question is the same. Data from a wide variety of sources, with varying data quality, rushes into enterprises at an even wider variety of frequencies. The data must be melded together, and the methods for analyzing and interpreting it are as diverse as fauna and flora. Such complex software can only be built with many differently qualified developers. So many, in fact, that even a tech giant could not have built the foundation of the modern data stack without neglecting its core business.&lt;/p&gt;

&lt;h2&gt;What did the modern data stack change for companies?&lt;/h2&gt;

&lt;p&gt;Meanwhile, on the business side, more and more companies are adopting the modern data stack, with a data warehouse at the center of business intelligence (BI). I am very happy to follow this development, since it lets companies work with data in a lean and agile way. But one problem remains: companies still have too few developers. And in data projects, managers and engineers still don’t speak the same language. The role of the business analyst became more and more relevant: the mediator between both parties, a BI analyst who understands the fundamentals of data analysis and the objective of increasing the company’s ROI but lacks the technical coding skills.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm66y1luka7jolixuosmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm66y1luka7jolixuosmr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Startups emerged in 2021 with the promise of connecting data without coding skills, creating analytics, and moving companies forward in digital transformation and data literacy. These startups build on the open modern data stack and give it a user-friendly wrapper. Prominently, they even advertise that their UI replaces dbt, Airbyte, Snowflake, and Airflow. The truth, however, is that all of these technologies run in the background of the software, which is resold under their license. The modern data stack that was developed to be freely available is now being sold to customers as a SaaS tool?&lt;/p&gt;

&lt;p&gt;If you take the combined market cap of closed SaaS solutions and compare it over time with the market cap of open-source solutions, you can see how the market gains for SaaS are decreasing. For open-source solutions, on the other hand, very steep growth is visible. Early open-source companies like Red Hat, Databricks, and Elastic are nowadays established, in some cases publicly traded, companies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmm67ckdzjxstbxnkaqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmm67ckdzjxstbxnkaqo.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OSS Capital puts it this way: “Open-source is eating software faster than software is eating the world”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited scalability of SaaS in a fast-changing world:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each use case must be built by the internal development team of a SaaS company. This means that the focus can only be on simple, top use cases.&lt;/li&gt;
&lt;li&gt;In SaaS, sales takes place via sales reps and conventional marketing. Scalability is limited.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;High costs for sales and product development&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uncomfortable lock-in effect for customers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No adaptability of the tool to the individual use case.&lt;/li&gt;
&lt;li&gt;High lock-in effect.&lt;/li&gt;
&lt;li&gt;Training for new users required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;A low cost-benefit ratio for the customer in the long run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the end, we see a frustrated customer who uses a tool because it was bought once, even though the results are not satisfying. This issue is well illustrated by the example of Airbyte vs. Fivetran. Fivetran started in the market 10 years ago and helps software engineers extract data from one source and load it into another, e.g. a data warehouse. Over time, many new data sources emerged from SaaS tools such as advertising platforms (Facebook), CRMs, and reporting tools. Today, the number of SaaS tools exceeds 10,000, and with Fivetran’s closed approach it was only possible to maintain connectors for about 1% of these data sources. Building that internally is heavy lifting on its own. It is also very costly, which is reflected in the subscription prices. As a result, customers were increasingly frustrated, paying a high price for limited usability. Airbyte recognized this problem 1.5 years ago. Instead of relying on a closed approach, Airbyte adopted an open-source business model. The Airbyte team makes it easy for developers to connect new data sources that are still missing. Airbyte is free to use and easy to set up by any engineer who finds the repo on GitHub. Thus, the complexity of the tool continues to grow independently with each new use case and data connector built by the community. Key capabilities shift hand in hand with the different business models: it is now about serving the contributors in the community, making the tool easily accessible, and building out the overarching roadmap, while developing a convenient solution for enterprises that does not feel like a rip-off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagkdv9e3oxdsrjatw2lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagkdv9e3oxdsrjatw2lt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The success proves Airbyte right: within 1.5 years, Airbyte won 16,000 customers, while Fivetran stands at 2,000. Of course, Fivetran’s customers are all paying, so Airbyte has a lower turnover, but considering the long-term development and the renaissance of open-source, it is only a matter of time until enough customers pay for Airbyte. The popularity, the developer-friendly features, and the transparent subscription model are too convincing for that.&lt;/p&gt;

&lt;h2&gt;The remaining problem with open-source&lt;/h2&gt;

&lt;p&gt;The success story of Airbyte should not hide the problems of open-source. An average contributor works only 3 months on an open-source project before moving on to a new adventure. All the pressure rests on the main contributors and initiators of the project. This problem became obvious with the Log4j vulnerability in 2021. Many large companies, including Apple, Microsoft, and Cloudflare, were using the open-source library. When the vulnerability became known, these companies turned to the Log4j team to fix the problem as soon as possible. We are talking about multi-billion-dollar companies here, turning to a group of idealists, some of whom only work on the library in their spare time. Understandably, the situation feels thankless, since most companies neither pay for tech support nor even let the initiators know beforehand that they are using the Log4j library. The call became louder for more transparency and for a system of monetizing and valuing the work. This is the biggest hurdle open-source software has to overcome. Contributors should feel appreciated and get something in return (dollars are not always the answer). Systems should be developed that keep contributors on a project longer and value their work adequately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F700i97th95bcn7o26tb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F700i97th95bcn7o26tb5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What do we learn from this?&lt;/h2&gt;

&lt;p&gt;In summary, we are currently experiencing a renaissance of open-source software. Traditional SaaS business models in the data space are difficult to implement because data projects are too complex to solve for customers in a proprietary tool. I gave the example of no-code data platforms that rely on SaaS business models in particular but build on open-source without openly admitting it, which can lead to a problem in terms of customer satisfaction. That this is problematic is shown by the Log4j case as well as the Airbyte/Fivetran case study. Closed SaaS tools are not only expensive to build, they also fail to address the complexity of customer problems. And because closed tools rely heavily on open-source libraries, non-transparent communication also creates major security vulnerabilities. There is a certain frustration and tension on the side of open-source contributors. This holds opportunities to change something and to create completely new tools, business models, and software categories. And this is what Kuwala has now turned into: an open-source no-code data platform. Give us a try on GitHub! (&lt;a href="https://github.com/kuwala-io/kuwala" rel="noopener noreferrer"&gt;https://github.com/kuwala-io/kuwala&lt;/a&gt;) Do you have a different perspective, or would you like to discuss this in more depth? You can easily join our discussion on Slack (&lt;a href="https://kuwala-community.slack.com/ssb/redirect" rel="noopener noreferrer"&gt;https://kuwala-community.slack.com/ssb/redirect&lt;/a&gt;), or just comment below.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>nocode</category>
      <category>discuss</category>
      <category>github</category>
    </item>
    <item>
      <title>Data Science Feels Like a Fake Entrepreneur in a YouTube Ad</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Thu, 11 Nov 2021 17:39:13 +0000</pubDate>
      <link>https://forem.com/kuwala_io/data-science-feels-like-a-fake-entrepreneur-in-a-youtube-ad-3g53</link>
      <guid>https://forem.com/kuwala_io/data-science-feels-like-a-fake-entrepreneur-in-a-youtube-ad-3g53</guid>
      <description>&lt;p&gt;&lt;strong&gt;Philosophers, politicians, and visionaries&lt;/strong&gt; are talking about a future in 2030 in which autonomous vehicles will be driving through the streets, cities will be able to adapt efficiently to environmental influences, and digitization will be a crucial tool for operating efficiently and saving resources.&lt;/p&gt;

&lt;p&gt;The world we live in today is subject to constant technological change. With the commercialization of the Internet, companies quickly recognized the value of data. The most authoritative innovations of this time are based on fast data-driven decisions and intelligent algorithms fed by the constant linking of data.&lt;/p&gt;

&lt;p&gt;Examples include the recommendation algorithms of Amazon, Youtube, and Netflix or surge pricing of Uber and Airbnb and any Ad-based business (hello Facebook).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uSi7jaOl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96vl47kulzipzlpdk4hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uSi7jaOl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96vl47kulzipzlpdk4hn.png" alt="Utopia_2030" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data are archived building blocks of knowledge&lt;/strong&gt; that can help us make better predictions in a world of rapid change and shape that world into a better place.&lt;/p&gt;

&lt;p&gt;Governments launched open data initiatives: public data is made available to the general public so that businesses and individuals can work with it. Data scientists, a profession that uses programming to analyze large amounts of data, became the sexiest job in the technology world and, as a result, a scarce resource for companies. &lt;/p&gt;

&lt;p&gt;They are under enormous pressure by expectations to make the world and especially the company more innovative.&lt;/p&gt;

&lt;p&gt;While at the same time, they face a massive flood of data from a wide variety of data sources and formats. Surveys and studies show that data scientists spend up to 80% of their time searching, collecting, preparing, and integrating data. Rarely have the desires of policymakers and businesses been so far from reality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lErbG8fc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cim0r4jljeq4obge7z8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lErbG8fc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cim0r4jljeq4obge7z8h.png" alt="reality_2021" width="682" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I worked as a Data Scientist and Data Consultant for many larger companies.&lt;/p&gt;

&lt;p&gt;In a project for the city of Leipzig, I advised their administration on its Open Data strategy.&lt;/p&gt;

&lt;p&gt;So in December 2019, I audited hundreds of data sources of the city of Leipzig to find out how to make the data more readily available for relevant target groups (Data Scientists and Developers) and how to simplify data access.&lt;/p&gt;

&lt;p&gt;I spent my time just before Christmas combining hundreds of CSV files and then evaluating when, where, and to which granularity the data was available.&lt;/p&gt;

&lt;p&gt;Together with the city of Leipzig, we presented and discussed the results of the data audit at the world-famous Chaos Communication Congress. The audience was eager to challenge us! The results were clear: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data scientists approach external data with great skepticism&lt;/strong&gt; regarding data quality, and for good reason: the documentation is mostly not very useful, file formats vary, and links between data sets are virtually non-existent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Searching for high-quality data is tedious and takes too long.&lt;/strong&gt; Open Data platforms mostly publish their data on scattered subdomains, and the data sets are not connected to each other.&lt;/li&gt;
&lt;li&gt;Moreover, &lt;strong&gt;integrating data for digital products from various data sources requires considerable data standardization&lt;/strong&gt; and harmonization effort. Let’s face it. My mother also works with Excel files for her private finances. But CSV, Excel, and PDF formats are not formats that facilitate the work of a Data Scientist or even an Engineer. And that’s the target group an open data platform should go for.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Then 2020 — Covid Year.&lt;/strong&gt; Give yourself a second to go through the changes and impacts on us, on you, and others.&lt;/p&gt;

&lt;p&gt;An interesting side effect: the whole world became a statistician. Curves were analyzed everywhere, yet intensive care bed occupancy was never predicted correctly.&lt;/p&gt;

&lt;p&gt;Traffic modelers from TU Berlin (Technical University of Berlin) and physicists were consulted, but even they can’t build good models with bad data. When I had a chance to talk with a science task force working on predicting ICU bed occupancy, they told me they were nowhere near modeling.&lt;/p&gt;

&lt;p&gt;They were still trying to find and use the correct numbers from three different data sources on ICU bed occupancy. And to be honest, I don’t think they have cracked that nut to this day. I have never felt as far away from the 2030 utopias as I did in spring 2020. &lt;/p&gt;

&lt;p&gt;How can we talk about mobility transformation when e-scooters remain a fun factor for hipsters but not an efficient mobility solution? How can we dream of self-driving cars? How are we going to use technology to solve our problems?&lt;/p&gt;

&lt;p&gt;Throw away all your AI/ML bullshit bingo if you cannot describe the world in clean data. In that quarter, data science became a joke to me. But when you lose confidence in others, you find it in yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yY63NCr5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dxdygvn3h2hhm1dnapsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yY63NCr5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dxdygvn3h2hhm1dnapsq.png" alt="greta_meme" width="682" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I didn’t get it. Why is data integration from external data sources so tricky? After all, the ETL process has been around since 1980. There is at least a starting point for standardizing data.&lt;/p&gt;

&lt;p&gt;ETL is the process of extracting, transforming, and loading data. The term first became prominent for me when &lt;a href="https://www.alteryx.com"&gt;Alteryx&lt;/a&gt;, with its visual interface, made the process accessible to analysts and data scientists. Data came in, then it was transformed, and at the end, you had a dataset that you could visualize and report in Tableau.&lt;/p&gt;

&lt;p&gt;In an era with a lot of data, the ETL process has shifted back towards the engineer. Along the way, the last two letters were swapped to ELT: data is now extracted via APIs and loaded into a data warehouse (Snowflake), the loading is orchestrated (Airflow), the data is observed (Great Expectations), and it is transformed directly inside the warehouse tables (dbt).&lt;/p&gt;
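&lt;p&gt;As a toy illustration of that ELT flow, the sketch below uses Python’s built-in sqlite3 as a stand-in for a cloud warehouse and a hard-coded record list in place of a real API response; the table and column names are invented for this example.&lt;/p&gt;

```python
import sqlite3

# Extract: in a real ELT pipeline this would be an API call to the source
# system; here a hard-coded payload stands in for the API response.
records = [("a@example.com", "2021-11-01"), ("b@example.com", "2021-11-02")]

# Load: write the raw, untransformed data into the warehouse first
# (sqlite3 stands in for Snowflake or another cloud warehouse).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_signups (email TEXT, signup_date TEXT)")
con.executemany("INSERT INTO raw_signups VALUES (?, ?)", records)

# Transform: model the data in SQL inside the warehouse, dbt-style, by
# materializing a derived table on top of the raw one.
con.execute(
    """CREATE TABLE signups_per_day AS
       SELECT signup_date, COUNT(*) AS signups
       FROM raw_signups GROUP BY signup_date"""
)

rows = con.execute("SELECT * FROM signups_per_day ORDER BY signup_date").fetchall()
print(rows)
```

&lt;p&gt;The point of the ordering is the same as in the real stack: raw data lands first, and all modeling happens in SQL inside the warehouse afterwards.&lt;/p&gt;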

&lt;p&gt;This works pretty well if you want to combine, say, your Mailchimp data with your customer IDs, as long as you have a reliable source: the data is connected via an API and managed by a commercial provider. But there hasn’t been a similar eruption in third-party data integration.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Today, if you ask a third-party data provider for transactional data, &lt;strong&gt;you get 10 CSV files with 1M rows each for your +20K bucks.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If you look for data on an open data platform, &lt;strong&gt;you don’t find sufficient documentation.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When you scrape data, you trust your own&lt;/strong&gt; results but probably none of those scraping providers.&lt;/li&gt;
&lt;li&gt;And one last question, &lt;strong&gt;have you tried working with OpenStreetMap data?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are lucky, you will find the external data you are looking for. But there, the luck usually stops. The data is not of high quality, nor is it easy to integrate, or at least as adequately documented as you would expect from the APIs of SaaS solutions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VNd1erKM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkzdal0kmmi8451b4jtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VNd1erKM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkzdal0kmmi8451b4jtx.png" alt="osm_usage" width="880" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenStreetMap is an excellent example of a buried data treasure. One of my favorite slides shows how open-source contributions from commercial companies have been increasing since 2019. Yes, your Apple Maps is based on free, open, external data. &lt;/p&gt;

&lt;p&gt;Your Tesla sends new road segments to this data treasure. Microsoft even put all the buildings in the U.S. into GeoJSON and shared it with the world on OpenStreetMap. Most of Mapbox is based on OpenStreetMap (it would be just fair to attribute them well and contribute more back!).&lt;/p&gt;

&lt;p&gt;These are billion-dollar companies that put a lot of developer effort into cleaning and preparing OpenStreetMap data to build outstanding products. Just imagine if your young scooter startup had access to those data points. And what if you didn’t have just one open data treasure but hundreds of them?&lt;/p&gt;

&lt;p&gt;I am currently working with Matti on &lt;a href="https://kuwala.io/?utm_source=dev-to"&gt;Kuwala&lt;/a&gt;, an &lt;a href="https://github.com/kuwala-io/kuwala/?utm_source=dev-to"&gt;open-source platform&lt;/a&gt; to transfer the ELT logic we know to external, third-party providers. For example, we are also integrating OpenStreetMap data. We pre-process the data, clean it, and connect the features to each other.&lt;/p&gt;

&lt;p&gt;This data pipeline can then be easily connected to, e.g., the High-Resolution Demographics data from Facebook for Good. The setup is straightforward via a CLI. Then you can launch a Jupyter Notebook with which you can transform the data directly in your familiar environment and create stunning insights.&lt;/p&gt;
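&lt;p&gt;To make the “connect the data” step concrete, here is a minimal, pure-Python sketch of the kind of join such a notebook performs between two external datasets that share a spatial cell id. The cell ids, counts, and field names are invented for illustration; a real pipeline would use actual H3 indexes and a DataFrame merge.&lt;/p&gt;

```python
# Hypothetical POI counts from the OSM pipeline, keyed by spatial cell id.
poi_counts = {"cell-aa": 12, "cell-ab": 3}
# Hypothetical population counts from the demographics pipeline, same keys.
population = {"cell-aa": 540, "cell-ab": 210}

# Inner join on the shared cell id: keep only cells present in both sources.
joined = {
    cell: {"pois": poi_counts[cell], "population": population[cell]}
    for cell in poi_counts
    if cell in population
}
print(joined)
```

&lt;p&gt;Once two external sources share a common spatial key, combining them is as simple as this; producing that shared key is the hard part the pipelines take care of.&lt;/p&gt;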

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SKHAGr5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3d1981vk7fdupdr3h8bp.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SKHAGr5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3d1981vk7fdupdr3h8bp.gif" alt="kuwala_jupyter_notebook" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are looking for more collaborators to help us enable smaller companies, not only Apple and Co., to connect many external data sources.&lt;/p&gt;

&lt;p&gt;Do you have a completely different opinion or a use case, or are you just curious? Visit us on &lt;a href="https://kuwala-community.slack.com/ssb/redirect"&gt;Slack&lt;/a&gt; and join the discussion.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>programming</category>
      <category>beginners</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to build an Uber-like intelligence system for your New Mobility Startup without a big data team</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Tue, 02 Nov 2021 17:09:50 +0000</pubDate>
      <link>https://forem.com/kuwala_io/how-to-build-an-uber-like-intelligence-system-for-your-new-mobility-startup-without-a-big-data-team-1ala</link>
      <guid>https://forem.com/kuwala_io/how-to-build-an-uber-like-intelligence-system-for-your-new-mobility-startup-without-a-big-data-team-1ala</guid>
      <description>&lt;p&gt;Sometime in 2016, car and ridesharing services were suddenly joined by e-scooters, mopeds, and bicycles for rent on the corner. And, of course, more car and ridesharing concepts followed in the years after. New Mobility startups set out to simplify people’s journeys. While they can make a crucial contribution to lowering emissions from personal car travel, it’s not as simple as it seemed ten years ago. The companies’ big bet is that people will use the services to get to the closest public transport hub or to cover short distances in cities (intermodality). In the long run, this should develop into a network of options that motivates people to give up their privately owned cars, since shared vehicles are more reliable, cheaper, and more trustworthy than driving their own. You can probably see the twist coming! And for developers and data scientists, this article provides a guide on using the &lt;a href="https://github.com/kuwala-io/kuwala"&gt;Kuwala software framework&lt;/a&gt; to solve some of these issues.&lt;/p&gt;

&lt;p&gt;In fact, people love the service of Uber or Tier, but the hoped-for effect has failed to convert into the big vision. Uber leads to congested streets in New York because individuals do not give up their own vehicles. An e-scooter is moved just five times per day, so on average it is in use for just 60–80 minutes a day. There are currently about 60,000 sharing vehicles in Berlin (distributed among more than 40 regional and global players). On the other hand, there are over 1.2 million vehicles, with new registrations of privately owned cars at a 5-year high and an increase of 1.1% (YoY growth). Therefore, achieving measurable success and company profitability will take more sharing offerings that are also used more frequently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cHpN37Bd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871328923/_M7rALS51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cHpN37Bd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871328923/_M7rALS51.png" alt="New_Mobility_Visual_medium.png" width="880" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matti and I still believe that the hype around New Mobility was not just hot air. Maybe also because we are working with one of the few profitable car-sharing services in the industry. The two key goals of utilization and profitability go hand in hand with the utopian goal of traffic transformation. In this regard, the goal can be operationalized further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vehicle availability and distribution&lt;/strong&gt; must be optimized to be located precisely where they are needed in terms of actual demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vehicle maintenance and recharging times&lt;/strong&gt; must be optimized so that the short-term loss of missing vehicles on the road is not noticeable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service areas must be chosen to balance intermodality&lt;/strong&gt;, coverage, and profitability in new and existing markets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But how can you turn a great idea into a profitable business? We think it takes a certain amount of data intelligence. Ideally, you would know the theoretical maximum demand at a specific time in a particular place in a vast city like Berlin. Your gut feeling (the biggest neural network in the world) might not be too helpful here and leads to decisions with considerable opportunity costs. For this reason, it is essential to identify and estimate the actual demand with the help of data. Possible influencing factors include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weather&lt;/strong&gt; (many new mobility startups already use this source with success)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events such as concerts&lt;/strong&gt; with an estimation of the number of people who need a ride.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total visitation frequency and popularity&lt;/strong&gt; at places and locations on an hourly estimation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-resolution demographics information&lt;/strong&gt; to have finely graded adjustments instead of top-level decisions on a zip code level.&lt;/li&gt;
&lt;/ul&gt;
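&lt;p&gt;One naive way to combine such factors is a weighted score per area, as in the sketch below. The weights, normalization thresholds, and sample values are invented purely for illustration and are not calibrated against real demand data.&lt;/p&gt;

```python
# Toy demand score combining the influencing factors listed above; every
# weight and threshold here is an uncalibrated, illustrative assumption.
def demand_score(weather, event_attendance, popularity, target_population):
    # Each factor is normalized to roughly the 0..1 range before weighting.
    return (
        0.2 * weather                                  # 0 = storm, 1 = perfect
        + 0.3 * min(event_attendance / 10_000, 1.0)    # cap very large events
        + 0.3 * popularity                             # hourly visitation, 0..1
        + 0.2 * min(target_population / 5_000, 1.0)    # cap dense areas
    )

score = demand_score(weather=0.8, event_attendance=4_000,
                     popularity=0.9, target_population=6_000)
print(round(score, 2))
```

&lt;p&gt;A real model would of course learn such weights from historical utilization data instead of hard-coding them; the sketch only shows how the factors slot together.&lt;/p&gt;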

&lt;p&gt;One company that has mastered external and dynamic data processing is Uber. It even incorporates data from electric toothbrushes into its predictions to dynamically adjust pricing and fleet distribution. However, not every company is Uber. Not every company manages to pay the brightest data scientists and develop its own prediction models. We believe, however, that changing people’s mobility for good is only possible if all players can develop such algorithms. One thing in advance: we have already built something so that your mobility startup can make data-driven decisions with just one hands-on developer. Let us start with a little recipe that spots some challenges when starting with spatial mobility analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yh42tCMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871536379/KWLQH3W4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yh42tCMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871536379/KWLQH3W4d.png" alt="Mobility_Analytics_Path.png" width="880" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Are you curious whether it’s actually that easy? With Kuwala, we have already implemented exactly the case described above, looking at Lisbon as an example. We decided to correlate Uber traversals with the holistic popularity score through Kuwala. Everything we show you in the next paragraph is easy to reproduce with a well-set-up computer, Python 3, Docker (and docker-compose), and no more than five lines of code. Since we talk data and give you the skills to run everything during a commercial break on TV, make sure you have 8 GB of RAM and 10 GB of free disk space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Clone our GitHub repository (&lt;a href="https://github.com/kuwala-io/kuwala"&gt;https://github.com/kuwala-io/kuwala&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Launch Docker in the background&lt;/li&gt;
&lt;li&gt;From inside the root directory, run with Shell or Git Bash:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd kuwala/scripts &amp;amp;&amp;amp; sh initialize_core_components.sh &amp;amp;&amp;amp; sh run_cli.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Select to download the demo data&lt;/li&gt;
&lt;li&gt;Now a Jupyter Notebook will open&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Launch and Analyze in Jupyter Notebook
&lt;/h2&gt;

&lt;p&gt;The Jupyter Notebook will now guide you through the following analytical steps. Just execute the commands to get a feeling for the already integrated functions for data quality review, analytics, and visualization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load CSV with traversals&lt;/li&gt;
&lt;li&gt;Load popularity score for Lisbon&lt;/li&gt;
&lt;li&gt;Join data frames&lt;/li&gt;
&lt;li&gt;Data quality report with Pandas Profiling&lt;/li&gt;
&lt;li&gt;Correlation analysis between popularity score data of Lisbon and Uber traversals&lt;/li&gt;
&lt;li&gt;Launch a map for explorative results with the Unfolded.ai SDK&lt;/li&gt;
&lt;/ul&gt;
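&lt;p&gt;The correlation step boils down to computing a Pearson coefficient between the two joined columns. The stdlib-only sketch below uses invented per-cell numbers in place of the real Lisbon data.&lt;/p&gt;

```python
from math import sqrt

# Invented per-cell values standing in for the real Lisbon data:
traversals = [120, 340, 560, 80, 410]   # Uber traversals per cell
popularity = [0.2, 0.5, 0.9, 0.1, 0.6]  # popularity score per cell

def pearson(xs, ys):
    # Plain Pearson correlation: covariance over the product of std devs.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

r = pearson(traversals, popularity)
print(round(r, 3))  # close to 1.0 for these strongly related toy numbers
```

&lt;p&gt;In the notebook, the same coefficient comes out of the joined data frames; the interesting part is interpreting whether popularity is a useful proxy for actual traversal demand.&lt;/p&gt;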

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rxKAcZCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871766989/5xHlyu_Un.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rxKAcZCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871766989/5xHlyu_Un.gif" alt="GIF_Jupyter_Notebook.gif" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do you want to replicate the results in a web environment? We have also hosted the complete example on Binder. No installation is needed; just run the commands to get a first impression: &lt;a href="https://bit.ly/3nX0Wq6"&gt;https://bit.ly/3nX0Wq6&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn to adjust Kuwala to your use case, region, and data...
&lt;/h2&gt;

&lt;p&gt;Now it’s your turn. With the built-in CLI of Kuwala, you can populate the three external data pipelines, namely, Point-of-Interest data, popularity score, and high-resolution demographics into a local database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run CLI&lt;/li&gt;
&lt;li&gt;Select pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L6-c2iu0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871805911/_er7GoiYj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L6-c2iu0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871805911/_er7GoiYj.png" alt="CLI_Screenshot_Anonymous.png" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Populate in database&lt;/li&gt;
&lt;li&gt;Integrate your own data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H-62FnEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871836792/owvdbomcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H-62FnEk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1635871836792/owvdbomcv.png" alt="Jupyter_Notebook_Screenshot.png" width="880" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query with the Jupyter Notebook&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where do we go from here? &lt;a href="https://github.com/kuwala-io/kuwala"&gt;Kuwala is an open-source project&lt;/a&gt;, and therefore we depend on other developers using our tools, giving feedback, and developing them further. For example, we would love to integrate weather data or other data sources. You can contact us directly or join our &lt;a href="https://kuwala-community.slack.com/ssb/redirect"&gt;Slack community&lt;/a&gt;. For interested mobility or location-based startups, we are happy to evaluate how quickly we can build a scaled system into your company together. In any case, get in touch one way or the other.&lt;/p&gt;

&lt;p&gt;For additional content from Kuwala, we recommend our &lt;a href="https://anchor.fm/kuwala-io"&gt;weekly podcast episodes&lt;/a&gt;, which you can also find below!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why Instant Grocery Delivery Should Follow a Data-Driven Path Like Uber to Survive (Part 1)</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Thu, 12 Aug 2021 14:50:25 +0000</pubDate>
      <link>https://forem.com/kuwala_io/why-instant-grocery-delivery-should-follow-a-data-driven-path-like-uber-to-survive-part-1-35cb</link>
      <guid>https://forem.com/kuwala_io/why-instant-grocery-delivery-should-follow-a-data-driven-path-like-uber-to-survive-part-1-35cb</guid>
      <description>&lt;p&gt;Instant grocery delivery is the startup hype of the year in Europe. You select a few groceries via the shopping app, pay via PayPal, and 10 minutes later a bike courier is at your door with your purchases. It’s a business model that spreads magic among users: a multi-billion-dollar idea like Uber, easy to explain and still magical. But highly disruptive business models like this also come with apparent problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overworked bike couriers going on strike.&lt;/li&gt;
&lt;li&gt;Issues with the districts because of noise pollution from warehouses located in the middle of residential areas.&lt;/li&gt;
&lt;li&gt;A low margin on products and little price tolerance from customers.&lt;/li&gt;
&lt;li&gt;Business growth is occurring geographically from district to district and city to city for companies like Gorillas.&lt;/li&gt;
&lt;li&gt;The colossal competition (I count 12 providers in Germany alone by now).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CEAI62nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw7d51p1rryaizfkg2pb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CEAI62nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw7d51p1rryaizfkg2pb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The US company GoPuff, founded in 2013, is considered a pioneer for the startups Gorillas, Flink, Zap, or Getir. GoPuff makes data-driven decisions to minimize the risks mentioned above. To boost these ambitions, GoPuff recently acquired the data science startup RideOS for $115 million. In markets with aggressive pricing, many direct competitors, and existing substitutes, quickly building a competitive advantage via technology has proven to make the business model more efficient. A bold but also expensive move by GoPuff. In this article, I will show how to integrate geospatial analytics for an instant grocery delivery use case within a day, without spending multi-millions on a startup acquisition.&lt;/p&gt;

&lt;p&gt;But how exactly can we think of data-driven decision-making for instant grocery delivery? Important questions to answer are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where should I set up warehouses?&lt;/li&gt;
&lt;li&gt;What is the optimal size of the driver fleet?&lt;/li&gt;
&lt;li&gt;What are the preferences of target customers in the region?&lt;/li&gt;
&lt;li&gt;How big is the market potential overall?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we ask ourselves the fictitious question, should an instant grocery delivery company go to the outlying Berlin district of Pankow? We do this using external data sources that can scale globally and use the data integration framework of Kuwala (it’s open-source). With Kuwala, we can easily extract scalable and granular behavioral data in entire cities and countries. Below you see activity patterns at grocery shops in Hamburg. We will make use of some of the functionalities to derive insights from the described areas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pwTupV53--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/midas32f47cet326zod8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pwTupV53--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/midas32f47cet326zod8.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We start our analysis by comparing data on a neighborhood of Pankow with the neighboring part of Prenzlauer Berg (“PBerg”). The two selected areas are similar in size (square kilometers). Using the Kuwala framework, we first integrate high-resolution demographics data. On a top-level view, the areas are comparable to each other in total population and within subgroups of gender and age.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7VaaWjJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e2a53l3vk9m7mgnbsp0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7VaaWjJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e2a53l3vk9m7mgnbsp0v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next step, we analyze the current status quo of Point-of-Interests regarding groceries (e.g., supermarkets). We build the data pipeline on OpenStreetMap data and extract categorization and name as well as price level. We combine that data with hourly popularity and visitation frequency at those POIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JLdDS2_1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9x8r4il4ytb06tjua9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JLdDS2_1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9x8r4il4ytb06tjua9k.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We find that Pankow has significantly fewer supermarkets per square kilometer. In addition, the price level of grocery stores is much higher in PBerg. Furthermore, we identify that grocery stores in Pankow receive about 10% more of their visits during the evening than those in PBerg. In summary, we can now assume that people in Pankow…&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;… travel longer to supermarkets on average.&lt;/li&gt;
&lt;li&gt;… often spend more time in the evening hours in supermarkets.&lt;/li&gt;
&lt;li&gt;… have a lower price elasticity towards groceries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies can now use that information in a market entry strategy. An aggressive cashback activation, for example, could convince people in Pankow to skip evening shopping in a supermarket for the comfortable alternative of receiving their purchases right at their door.&lt;/p&gt;

&lt;p&gt;We aggregated the high-resolution demographics data on an H3 resolution of 11 (based on raw data representing 30x30 meter areas). By that, we can analyze in-depth the distribution of people in a comparatively small district.&lt;/p&gt;
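&lt;p&gt;As a rough illustration of that aggregation step: with the real h3 library, the grouping key would come from something like h3.cell_to_parent; the stdlib-only stand-in below uses invented cell ids and a toy parent rule just to show the roll-up.&lt;/p&gt;

```python
from collections import defaultdict

# Invented fine-grained population cells (the real data would be H3 cells
# derived from 30x30 m raster areas); the id scheme is purely illustrative.
raw_cells = {
    "8b1f-aa-01": 14,  # fine cell under parent "8b1f-aa"
    "8b1f-aa-02": 9,   # another fine cell under the same parent
    "8b1f-ab-01": 22,  # fine cell under parent "8b1f-ab"
}

# Roll the population up to the coarser cells by summing over each parent.
aggregated = defaultdict(int)
for cell_id, people in raw_cells.items():
    parent = cell_id.rsplit("-", 1)[0]  # toy stand-in for a parent-cell lookup
    aggregated[parent] += people

print(dict(aggregated))
```

&lt;p&gt;The choice of target resolution is the trade-off: finer cells show micro-neighborhood detail, while coarser cells smooth out noise from the raw 30x30 m estimates.&lt;/p&gt;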

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9RHEy77J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w863ejtkw525thleai1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9RHEy77J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w863ejtkw525thleai1g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can spot areas with a high population of the young target demographic but few reachable grocery options.&lt;/li&gt;
&lt;li&gt;In addition, we can spot micro-neighborhoods with a low population density, which makes those areas a perfect spot to open a warehouse: close enough to the service areas and far enough away from residents who could be disturbed by noise.&lt;/li&gt;
&lt;/ul&gt;
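&lt;p&gt;A minimal sketch of that kind of screening, assuming per-cell counts for the young demographic and the supermarkets in reach are already available (the field names and thresholds are illustrative, not from the pipeline itself):&lt;/p&gt;

```python
# Hypothetical per-cell aggregates: cell id -> (youth population, supermarkets in reach).
cells = {
    "cell_a": (450, 1),
    "cell_b": (120, 6),
    "cell_c": (500, 5),
}

def underserved(cells, min_youth=300, max_supermarkets=2):
    """Cells with a large young demographic but few reachable grocery options."""
    return [
        cell
        for cell, (youth, supermarkets) in cells.items()
        if youth >= min_youth and supermarkets <= max_supermarkets
    ]

print(underserved(cells))  # only cell_a: many young residents, one supermarket
```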

&lt;p&gt;In the next part of this article, I will share some more advanced algorithms to identify over- and under-served areas and scale everything up by comparing entire cities and the popularity of those places. If you want to discuss geospatial topics with us in the meantime, I recommend joining our Slack community.&lt;/p&gt;

&lt;p&gt;Our Slack - &lt;a href="https://app.slack.com/client/T01FG2CNZPB/C01EWM5R19U"&gt;https://app.slack.com/client/T01FG2CNZPB/C01EWM5R19U&lt;/a&gt;&lt;br&gt;
Our Website - &lt;a href="https://kuwala.io/"&gt;https://kuwala.io/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Querying the Most Granular Demographics Dataset</title>
      <dc:creator>kuwala</dc:creator>
      <pubDate>Mon, 19 Jul 2021 14:59:20 +0000</pubDate>
      <link>https://forem.com/kuwala_io/querying-the-most-granular-demographics-dataset-4dpf</link>
      <guid>https://forem.com/kuwala_io/querying-the-most-granular-demographics-dataset-4dpf</guid>
      <description>&lt;p&gt;There is a plethora of use cases that require detailed population data. For example, a detailed breakdown of the demographic structure is a significant factor in predicting real estate prices. Humanitarian projects such as vaccination campaigns or rural electrification plans also depend heavily on good population data.&lt;br&gt;
It is very challenging to find high-quality and up-to-date data on a global scale for these use cases. Census data is usually published only every few years, which makes those datasets outdated quickly. Arguably the best datasets out there for population densities and demographics are published by Facebook under their Data for Good initiative. They combine official census data with their internal data and leverage machine learning algorithms for image recognition to determine buildings’ location and type.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: Facebook Data for Good and Kuwala (2021)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Using those different sources can give a detailed statistical breakdown of demographic groups in 1-arcsecond blocks, a resolution of approximately 30 meters. Each square contains statistical values for the following demographic groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total&lt;/li&gt;
&lt;li&gt;Female&lt;/li&gt;
&lt;li&gt;Male&lt;/li&gt;
&lt;li&gt;Children under 5&lt;/li&gt;
&lt;li&gt;Youth 15–24&lt;/li&gt;
&lt;li&gt;Elderly 60 plus&lt;/li&gt;
&lt;li&gt;Women of reproductive age 15–49&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each country, Facebook delivers one file per demographic group, either as a GeoTIFF or a CSV. The CSV contains the latitude and longitude of each cell and the respective population value.&lt;/p&gt;
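&lt;p&gt;As a sketch, reading such a per-country CSV boils down to parsing one (latitude, longitude, population) row per square. The column names below are illustrative assumptions, not Facebook’s exact headers:&lt;/p&gt;

```python
import csv
import io

# Illustrative file content; real files hold one row per 1-arcsecond square.
SAMPLE = """latitude,longitude,population
52.5651,13.4020,12.0
52.5900,13.4300,20.0
"""

def read_population_rows(fileobj):
    """Parse (lat, lng, population) tuples from a demographics CSV."""
    reader = csv.DictReader(fileobj)
    return [
        (float(r["latitude"]), float(r["longitude"]), float(r["population"]))
        for r in reader
    ]

rows = read_population_rows(io.StringIO(SAMPLE))
print(rows)
```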

&lt;p&gt;Working with static CSV files alone can be cumbersome. That is why we created an open-source wrapper that exposes the data over an API. You can download the data for entire countries directly via a CLI. We preprocess the data to make it easily queryable, leveraging the power of Uber’s H3 spatial indexing.&lt;br&gt;
Thanks to the H3 indexing, it is easy to build queries on top of the database. Using either H3 cells or coordinate pairs, you can retrieve the population for a point, a given radius, or a polygon. That way, it is straightforward to aggregate the population at the zip code level, for example.&lt;/p&gt;
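&lt;p&gt;A minimal sketch of a radius query over indexed cells, assuming each cell is keyed by its centroid coordinates (a stand-in for resolving H3 cells to centroids with the &lt;code&gt;h3&lt;/code&gt; library):&lt;/p&gt;

```python
import math

# Hypothetical index: cell centroid (lat, lng) -> aggregated population.
cells = {
    (52.5651, 13.4020): 20.0,
    (52.5900, 13.4300): 35.0,
    (52.4000, 13.1000): 50.0,
}

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two coordinates in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def population_within(cells, lat, lng, radius_km):
    """Sum the population of all cells whose centroid lies within the radius."""
    return sum(
        pop
        for (clat, clng), pop in cells.items()
        if haversine_km(lat, lng, clat, clng) <= radius_km
    )

print(population_within(cells, 52.57, 13.41, 5.0))  # the two nearby cells: 55.0
```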

&lt;p&gt;&lt;em&gt;Image: Uber H3 and Kuwala (2021)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We aggregate the squares into H3 cells at resolution 11 and store them in a MongoDB collection with the aggregated values for each demographic group. Using JS streams and MongoDB’s aggregation pipelines, memory usage stays low, and you can process millions of rows on your local machine.&lt;/p&gt;
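&lt;p&gt;For illustration, such an aggregation pipeline could look as follows; the field names (&lt;code&gt;h3_index&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt;, &lt;code&gt;female&lt;/code&gt;, &lt;code&gt;male&lt;/code&gt;) are assumptions, not the repository’s actual schema. It groups raw squares by their H3 cell and sums each demographic value:&lt;/p&gt;

```python
# A pipeline like this would be passed to collection.aggregate(...) via pymongo.
# Field names (h3_index, total, female, male) are illustrative assumptions.
pipeline = [
    {
        "$group": {
            "_id": "$h3_index",                # one output document per H3 cell
            "total": {"$sum": "$total"},       # sum each demographic group
            "female": {"$sum": "$female"},
            "male": {"$sum": "$male"},
        }
    },
    {"$sort": {"total": -1}},                  # most populated cells first
]

# With a live database this would stream results without loading everything:
# for doc in db.population.aggregate(pipeline, allowDiskUse=True):
#     print(doc["_id"], doc["total"])
```

&lt;p&gt;The aggregation runs server-side, so only the grouped per-cell documents travel back to the client.&lt;/p&gt;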

&lt;p&gt;For quick data exploration and visualization, you can directly create datasets compatible with Kepler.gl or Unfolded.ai to make beautiful maps. We published an example map for Malta. The map makes it immediately visible where the densely populated regions are and where the heart of the city lies.&lt;/p&gt;

&lt;p&gt;With Facebook’s population data now directly queryable, creating predictive models or visualizations is much faster, so data teams can spend their time on the value-adding tasks. That is also the main reason why we are building an open-source community for third-party data integration with Kuwala. So if you want to get your hands on more connectors like this one, star us on GitHub and join our Slack community.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
