<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Marc Dupuis</title>
    <description>The latest articles on Forem by Marc Dupuis (@mfdupuis).</description>
    <link>https://forem.com/mfdupuis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2297803%2F61b13340-8f52-4ae8-9753-084a178a7858.png</url>
      <title>Forem: Marc Dupuis</title>
      <link>https://forem.com/mfdupuis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mfdupuis"/>
    <language>en</language>
    <item>
      <title>Pandas histogram: creating histogram in Python with examples</title>
      <dc:creator>Marc Dupuis</dc:creator>
      <pubDate>Wed, 22 Jan 2025 17:45:00 +0000</pubDate>
      <link>https://forem.com/mfdupuis/pandas-histogram-creating-histogram-in-python-with-examples-1o4l</link>
      <guid>https://forem.com/mfdupuis/pandas-histogram-creating-histogram-in-python-with-examples-1o4l</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Use Matplotlib's plt.hist() to get started quickly with pandas histograms. Consider Searborn or Plotly for more visually appealing or interactive charts. Other alternatives to histograms include boxplots, violin plots and hexbins.&lt;/p&gt;




&lt;p&gt;Histograms are one of the most fundamental and widely used visualization tools in data analysis. Whether you're exploring the distribution of numerical data or comparing datasets, histograms provide a quick and intuitive way to understand patterns and trends. In this post, we’ll walk you through how to create histograms using Python’s pandas library, explore advanced visualization techniques, and discuss alternatives that can offer deeper insights for specific scenarios.&lt;/p&gt;

&lt;p&gt;If you’re more of a visual learner or want to see some examples in action, check out our video:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/TjITJHbCKf4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a histogram?
&lt;/h2&gt;

&lt;p&gt;A histogram is a type of bar chart that displays the distribution of numerical data by grouping values into intervals (or "bins"). It’s an essential tool for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding the spread and central tendency of data.&lt;/li&gt;
&lt;li&gt;Identifying outliers or anomalies.&lt;/li&gt;
&lt;li&gt;Comparing distributions across datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each bar in a histogram represents the frequency of data points within a specific range, making it easy to visualize patterns, skewness, and variability at a glance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5xbcyh3j1xpjvhu3u6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5xbcyh3j1xpjvhu3u6i.png" alt="Basic pandas histogram using Matplotlib" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, if you’re analyzing customer age data for a product, a histogram can show you the most common age groups, helping guide targeted marketing strategies. Histograms are a fundamental tool for exploratory data analysis and storytelling.&lt;/p&gt;
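&lt;p&gt;To make the binning concrete, here’s a minimal sketch using NumPy’s np.histogram (the same binning routine plt.hist relies on under the hood), with a tiny made-up sample:&lt;/p&gt;

```python
import numpy as np

# np.histogram splits the value range into equal-width bins
# and counts how many values fall into each one.
data = np.array([1, 2, 2, 3, 3, 3, 8, 9])
counts, bin_edges = np.histogram(data, bins=4)

print(counts)     # [3 3 0 2] -- number of values in each bin
print(bin_edges)  # [1. 3. 5. 7. 9.] -- 4 bins need 5 edges
```

&lt;p&gt;Each count here is the height of one bar in the corresponding histogram.&lt;/p&gt;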

&lt;h2&gt;
  
  
  How to create a histogram from a Python pandas DataFrame
&lt;/h2&gt;

&lt;p&gt;Pandas is a powerful, fundamental library for data manipulation in Python. If you’re doing any sort of data analysis in Python, you’re likely using either pandas or Polars. Pandas’ tight integration with Matplotlib makes it incredibly easy to create histograms directly from a DataFrame.&lt;/p&gt;

&lt;p&gt;Let’s start by creating a histogram with Matplotlib, but in later sections, we’ll explore other options that are a bit more visually appealing, and also interactive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Python histograms using Matplotlib
&lt;/h3&gt;

&lt;p&gt;Matplotlib is effectively the standard charting library for Python and is tightly integrated with pandas.&lt;/p&gt;

&lt;p&gt;In the examples below, we’re going to use some generated fake data. We have a script to generate this data so that you can play around with it and see how changes affect the resulting chart. If you would like to follow along, here’s the script that generates some fake sales data for our Superdope company:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Set random seed for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate 1000 orders with different distributions for each channel
&lt;/span&gt;&lt;span class="n"&gt;n_orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;

&lt;span class="c1"&gt;# Generate channel data with different means
&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Web&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;In-Store&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
    &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Different probabilities for each channel
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate units sold with different distributions per channel
&lt;/span&gt;&lt;span class="n"&gt;basket_sizes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# Initialize the list
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Mobile: Lower average order size
&lt;/span&gt;        &lt;span class="n"&gt;basket_sizes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Web&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Web: Medium average order size
&lt;/span&gt;        &lt;span class="n"&gt;basket_sizes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# In-Store
&lt;/span&gt;        &lt;span class="c1"&gt;# In-Store: Highest average order size
&lt;/span&gt;        &lt;span class="n"&gt;basket_sizes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame
&lt;/span&gt;&lt;span class="n"&gt;superdope_sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_orders&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;basket_sizes&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;This script generates a list of orders, each recording the number of items sold and the channel where the order was placed (mobile, web or in store), and stores the result as a “superdope_sales” pandas DataFrame.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9o8c5pkdd24t0fkmn8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9o8c5pkdd24t0fkmn8l.png" alt="Superdope sample DataFrame" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s now plot this data using Matplotlib:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Create a single histogram for all transactions
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;superdope_sales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Distribution of Order Sizes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Basket Size per Order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;In the code above, plt.hist() is the key function to focus on. It takes the data to plot (here, the “basket_size” column of the DataFrame) as its first argument, and the “bins” argument specifies the number of bins to use. The other critical function is plt.show(), which displays the chart. The rest mostly adds details like chart color and axis and chart labels.&lt;/p&gt;

&lt;p&gt;Here’s the chart it produces:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5mlgx7h9x3i85kd7fi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5mlgx7h9x3i85kd7fi4.png" alt="Matplotlib histogram (bins = 10)" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the chart above, where we only use 10 bins, we start to see a pattern that looks like a bimodal distribution. But let’s see what happens when we increase the number of bins to 100:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptyljpen0n4f3s2s2hz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptyljpen0n4f3s2s2hz7.png" alt="Matploblib histogram (bins = 100)" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can see that there are actually three distinct distributions hidden in the data. We did this deliberately: in the data generation script, we gave each channel a different mean.&lt;/p&gt;

&lt;p&gt;Think of the number of bins like the focus of a camera. You’ll need to tweak the bin count to capture the right profile of the data.&lt;/p&gt;
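&lt;p&gt;To see the camera-focus effect in action, here’s a small sketch (using a made-up bimodal sample rather than the Superdope data) that plots the same values at three different bin counts:&lt;/p&gt;

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up bimodal sample: two normal distributions mixed together
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(20, 5, 500), rng.normal(60, 5, 500)])

# Same data, three "focus" settings: too few bins blurs the two modes
# together, too many turns the shape into noise
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, bins in zip(axes, [4, 30, 200]):
    ax.hist(data, bins=bins, edgecolor='black')
    ax.set_title(f'bins={bins}')
plt.show()
```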

&lt;p&gt;Okay, so we’ve created a basic histogram, but admittedly, it’s not the most visually appealing plot and it’s also static (users can’t hover over individual bins to see the content). Let’s explore some more advanced libraries to help you create a beautiful histogram.&lt;/p&gt;

&lt;h2&gt;
  
  
  More advanced histogram charting libraries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Seaborn
&lt;/h3&gt;

&lt;p&gt;Seaborn builds on Matplotlib’s capabilities, offering a cleaner syntax and more visually appealing charts. It’s particularly useful for overlaying distributions and adding kernel density estimates (KDE).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;superdope_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
             &lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Distribution of Order Sizes by Channel (Seaborn)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Basket Size per Order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The kde=True option overlays a smooth curve, providing a better sense of the data’s density.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c2f3kldp48etlms65r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c2f3kldp48etlms65r8.png" alt="Seaborn histogram with KDE" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is starting to look pretty good. Let’s take it a step further and make it interactive.&lt;/p&gt;
&lt;h3&gt;
  
  
  Plotly
&lt;/h3&gt;

&lt;p&gt;For interactive histograms, Plotly is an excellent choice. You can zoom, pan, and hover over data points for deeper exploration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;

&lt;span class="c1"&gt;# Create a single histogram using Plotly
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;superdope_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Distribution of Order Sizes by Channel (Plotly)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;nbins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Basket Size per Order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With just a few lines of code, you can create a chart that users can interact with directly in a browser or dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyyfoptsfsh8dqvpjs5a.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyyfoptsfsh8dqvpjs5a.gif" alt="Interactive Plotly histogram" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Advanced Histogram Plotting Tips
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Overlaid histograms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the examples above, we’ve only been showing a single histogram for the entire dataset. But as we saw when we increased the number of bins, the distribution profile may actually differ depending on whether the sale was made on mobile, web or in store (and we know from our data generation script that that’s the case). So rather than view all sales data as a single histogram, it would be helpful to view each distribution by channel.&lt;/p&gt;

&lt;p&gt;Let’s use Plotly again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;

&lt;span class="c1"&gt;# Create a single histogram using Plotly
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;superdope_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Distribution of Order Sizes by Channel (Plotly)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;nbins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                   &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;basket_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Basket Size per Order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Show the plot
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll notice that this code is almost identical to the code above, but we’ve added the “color” parameter, and suddenly we can distinctly spot the pattern: In-store sales have a much higher basket size!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tahdt1fsl0on6fxbi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6tahdt1fsl0on6fxbi8.png" alt="Overlaid histogram" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Histogram facets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of trying to overlay all the histograms on a single chart, you can also break them out into separate facets. This clearly disambiguates each distribution. However, if you do this, you’ll want to be mindful of the min and max of each axis to make sure you’re telling the true story. In the example below, we’ve forced the min and max of the X axis to be the same for each chart to make sure we’re comparing apples to apples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qylgnit6cc8hgf6j1rl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qylgnit6cc8hgf6j1rl.png" alt="Histograms with facets" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives to histograms
&lt;/h2&gt;

&lt;p&gt;While histograms are powerful, they may not always be the best choice for every analysis. Here are some useful alternative visualizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Box Plots&lt;/strong&gt;&lt;br&gt;
Box plots show the distribution of data while highlighting key statistics like the median, quartiles and outliers. They serve much the same purpose as histograms, but they do a better job of highlighting the main range that the data falls within. In other words, you can simplify your view of a dataset even more: in-store basket sizes tend to fall between $137 and $160. This kind of simplification can help when it comes to decision making.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovhex0734cryttg9fq04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovhex0734cryttg9fq04.png" alt="Boxplots" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Violin plots&lt;/strong&gt;&lt;br&gt;
If you want to take your box plots to the next level, consider violin plots. They’re very similar, but offer a few key advantages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distribution details&lt;/strong&gt;: In many ways, violin plots resemble histograms more than box plots. You can see the detailed shape of the distribution, which matters if you suspect different distributions within each group. For example, if you think there might be two modes for “in-store purchases”, a box plot would obscure that.&lt;br&gt;
&lt;strong&gt;Outlier details&lt;/strong&gt;: You can see outliers in a box plot, but a violin plot makes it much more obvious how extreme an outlier truly is.&lt;br&gt;
&lt;strong&gt;Visual appeal&lt;/strong&gt;: Violin plots are a bit more visually appealing, which can be important for user engagement!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ahc174321anx4tqmt9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ahc174321anx4tqmt9v.png" alt="Violin plots" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raincloud plots&lt;/strong&gt;&lt;br&gt;
If you want to get even a bit fancier with your distribution plots and you like violin and box plots, you may want to check out raincloud plots. These require a bit more technical know-how and aren’t necessarily ready out of the box (no pun intended), but can make for some very neat looking charts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9hkxyqpc16sifzb08q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9hkxyqpc16sifzb08q2.png" alt="Raincloud plot" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot above was generated using the ptitprince Python library (named after the drawing of a snake that swallowed an elephant in Le Petit Prince).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hexbin plots&lt;/strong&gt;&lt;br&gt;
Sometimes you want to understand the distribution of items along two dimensions. Let’s say you want to view the distribution of baskets by number of items per basket and basket total value. You could create two histograms, or you could look at the orders on a scatter plot. A scatter plot doesn’t tell the full story, though, because it fails to convey the density of the points. That’s where hexbins come into play.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63es3rodyzl0fel62gf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63es3rodyzl0fel62gf8.png" alt="Hexbin plot" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see here that most baskets are in the 3 to 7 item and $80 to $160 range.&lt;/p&gt;
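&lt;p&gt;A sketch of the idea with pandas’ built-in hexbin support, using simulated basket data (the column names and distributions are made up for illustration; gridsize controls the hexagon resolution):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_items = rng.poisson(5, 2000) + 1  # items per basket
df = pd.DataFrame({
    "n_items": n_items,
    "basket_total": n_items * rng.normal(25, 5, 2000),  # total value scales with item count
})

# Color encodes how many baskets fall in each hexagonal bin
ax = df.plot.hexbin(x="n_items", y="basket_total", gridsize=20, cmap="viridis")
plt.show()
```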

&lt;p&gt;&lt;strong&gt;Ridgeline plots&lt;/strong&gt;&lt;br&gt;
If you want to display the distributions of many distinct groups, you could use a ridgeline plot. Joypy is the best Python package for getting started quickly (it’s named after Joy Division’s 1979 album cover for Unknown Pleasures).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcioobzim1n1s2rydmx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcioobzim1n1s2rydmx6.png" alt="Ridgeline (joyplot)" width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, ridgeline plots are a bit better looking than they are practical. They can be a fun way to tell a story but may not be the most scientifically useful plots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to try these out?
&lt;/h2&gt;

&lt;p&gt;Histograms are fundamental to exploratory data analysis. They provide a quick way to profile data and understand how it’s distributed. You can quickly generate a pandas histogram in a few lines of code using matplotlib, but you can also get more sophisticated with boxplots, hexbins, violin plots and other distribution density charts.&lt;/p&gt;

&lt;p&gt;If you want to try these out and get started without the hassle of setting up a Python environment locally, you can sign up for free at Fabi.ai and take these for a spin. Fabi.ai is an AI data analysis platform designed to make data exploration and collaboration incredibly easy.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Interactive Python plots: Getting started and best packages</title>
      <dc:creator>Marc Dupuis</dc:creator>
      <pubDate>Fri, 10 Jan 2025 16:51:00 +0000</pubDate>
      <link>https://forem.com/mfdupuis/interactive-python-plots-getting-started-and-best-packages-49p7</link>
      <guid>https://forem.com/mfdupuis/interactive-python-plots-getting-started-and-best-packages-49p7</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: If you want to create a quick, visually appealing interactive Python plot, your best bet is to start with Plotly. It offers interactivity out of the box, the syntax is relatively simple, and it supports almost any chart type you would ever need. If you want to share these plots, solutions like Streamlit, Gradio, or a hosted platform like Fabi.ai can make the process seamless.&lt;/p&gt;




&lt;p&gt;Data visualization is a crucial skill for anyone working with data. Whether you're analyzing trends, presenting findings, or exploring datasets, visualizations make complex data more accessible and understandable. While static charts are useful, interactive plots can take your visualizations to the next level by allowing viewers to zoom, pan, hover, and more. This interactivity is a great way to create a more engaging data experience and put your work forward.&lt;/p&gt;

&lt;p&gt;If you're a Python user, you have access to a wide array of libraries designed to create interactive plots. In this post, we’ll walk through why interactive visualizations matter, how you can share them, and the best Python libraries to get started with. We’ll teach you how to plot data in Python with beautiful interactive charts, and in no time you’ll be sharing your insights with your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use interactive Python plotting libraries?
&lt;/h2&gt;

&lt;p&gt;First, let’s talk about what makes a chart “interactive”. There are many ways to interpret this term, but in its most basic form, chart interactivity means that the viewer can hover over data points to get more information, zoom, or pan around the chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc3mey6bqmm6gdzt09d.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sc3mey6bqmm6gdzt09d.gif" alt="Interactive map using Plotly" width="908" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This interactivity offers a few key benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhanced data exploration&lt;/strong&gt;: With features like zooming, filtering, and tooltips, interactive charts allow users to dig deeper into the data without creating multiple static plots. This means simpler, more powerful visualizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved engagement&lt;/strong&gt;: Interactive plots are naturally more engaging for your viewers. By exploring the data on their own, they can choose their own adventure, which draws them in and makes your data analysis more impactful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credibility&lt;/strong&gt;: Interactive plots often look polished and modern, ideal for sharing insights with stakeholders. Sharing interactive visualizations lends extra credibility to you and your work.&lt;/p&gt;

&lt;p&gt;You can also take chart interactivity to the next level with “drill-down” and callback functionality. For example, if you want a viewer of a bar chart to be able to click on a specific bar and view the underlying records, this can also be accomplished. However, this type of behavior is significantly more complex to achieve and merits its own tutorial. &lt;/p&gt;

&lt;p&gt;Finally, in this tutorial, we’re going to focus on interactivity within the plot itself. With Python you can also easily let your viewer slice and dice the data with filters and inputs. This is fairly easy to accomplish with any Python charting library. You can learn more about that in our documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0w4xhwnnhzzrawbgaf2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0w4xhwnnhzzrawbgaf2.gif" alt="Creating charts with filters in Fabi.ai using Python" width="752" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on sharing and using interactive charts
&lt;/h2&gt;

&lt;p&gt;Creating an interactive Python plot usually means rendering the plot in a browser or an environment that supports HTML and JavaScript outputs. This introduces some challenges when sharing your work, since you can’t simply share a screenshot. You have a few options at your disposal depending on your technical skill level:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streamlit or Gradio&lt;/strong&gt;: These Python frameworks let you build lightweight web apps to showcase your plots. They are best suited for semi-technical and technical users. Creating a shareable URL also requires a good understanding of Docker and cloud hosting. More on that in this tutorial, where we show you how to deploy a Streamlit app to AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jupyter notebooks&lt;/strong&gt;: Ideal for data exploration and sharing code, though they require viewers to have Python set up. Creating interactive charts in Jupyter notebooks has its own challenges, and sharing notebooks as interactive apps is best suited for a very technical audience. We’ve covered this topic in detail in a previous post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fabi.ai&lt;/strong&gt;: A hosted solution for sharing interactive plots and dashboards. You can start for free and share your visualizations without worrying about setup or hosting. This option is best suited for Python users of all levels.&lt;/p&gt;


&lt;h2&gt;
  
  
  Pandas plot: Why DataFrames are important
&lt;/h2&gt;

&lt;p&gt;If you’re already familiar with Pandas or Polars DataFrames and you’re simply looking to plot your data, you can skip ahead. However, if you’re less familiar with these concepts, here’s a quick crash course.&lt;/p&gt;

&lt;p&gt;Python DataFrames are effectively “Tables” in the Python world. They’re a two-dimensional data structure, similar to an Excel spreadsheet or a SQL table. DataFrames allow you to organize, manipulate, and analyze your data in an intuitive and efficient manner. Each column in a DataFrame can hold data of different types, and operations like filtering, grouping, and aggregating data become straightforward with this structure.&lt;/p&gt;

&lt;p&gt;The reason this concept is important is that nearly all Python charting libraries leverage DataFrames. So, you’ll want to start by ensuring your data is clean and organized within a DataFrame before plotting. This means handling missing values, ensuring consistent data types, and possibly reshaping your data to fit the needs of your chosen visualization library.&lt;/p&gt;
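&lt;p&gt;That cleaning step might look something like this (the columns and messy values below are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Raw data with a missing category, a missing value, and strings instead of numbers
raw = pd.DataFrame({
    "month": ["Jan", "Feb", "Feb", "Mar", None],
    "sales": ["200", "250", None, "300", "400"],
})

clean = (
    raw
    .dropna(subset=["month"])  # drop rows missing the category
    .assign(sales=lambda d: pd.to_numeric(d["sales"], errors="coerce"))
    .dropna(subset=["sales"])  # drop rows where conversion failed
    .drop_duplicates()
)
print(clean.dtypes["sales"])  # float64 -- now ready for plotting
```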

&lt;p&gt;For example, let’s create a simple DataFrame and plot it using pandas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Create a sample DataFrame
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;January&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;February&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;March&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;April&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Plot the data using pandas' built-in plotting capabilities
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monthly Sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This quick example demonstrates how easy it is to visualize data directly from a DataFrame. While pandas' built-in plotting is great for basic charts, more advanced interactive plots can be created by integrating libraries like Plotly, Altair, or Bokeh, which also work seamlessly with DataFrames.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Simple bar chart using matplotlib, a static Python plotting library.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Python interactive plotting libraries to get started
&lt;/h2&gt;

&lt;p&gt;Several Python libraries excel at creating interactive visualizations. Here, we focus on a few popular, open-source options that are beginner-friendly and well-documented. These libraries also integrate well with tools like AI-assisted coding, making them accessible to a broad audience.&lt;/p&gt;
&lt;h3&gt;
  
  
  Plotly
&lt;/h3&gt;

&lt;p&gt;Why it’s great: Plotly is a versatile library with out-of-the-box interactivity. It supports a wide variety of chart types, including line charts, scatter plots, heatmaps, and 3D plots. Check out some examples of what you can do with Plotly here.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All-in-one solution&lt;/strong&gt;: From simple visualizations to complex dashboards, Plotly covers it all&lt;br&gt;
&lt;strong&gt;Interactive by default&lt;/strong&gt;: Features like tooltips, zooming, and panning require no extra configuration&lt;br&gt;
&lt;strong&gt;Seamless integration with Dash&lt;/strong&gt;: Build full-fledged interactive web apps with Plotly’s dashboard framework&lt;br&gt;
&lt;strong&gt;Wide chart variety&lt;/strong&gt;: Supports over 40 types of visualizations, including advanced ones like Gantt charts and 3D scatter plots&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Sample data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;D&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Create an interactive bar chart
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Interactive Bar Chart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Altair
&lt;/h3&gt;

&lt;p&gt;Why it’s great: Altair focuses on simplicity and declarative plotting. It’s built on the Vega-Lite framework, offering a balance between ease of use and flexibility. Check out some examples of what you can do with Altair here.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declarative syntax&lt;/strong&gt;: Define visualizations with just a few lines of code, ideal for rapid prototyping&lt;br&gt;
&lt;strong&gt;Customizable statistical charts&lt;/strong&gt;: Great for scatter plots, histograms, and regressions&lt;br&gt;
&lt;strong&gt;Efficient data transformations&lt;/strong&gt;: Built-in support for aggregations, filtering, and other data manipulations&lt;br&gt;
&lt;strong&gt;Compact yet powerful&lt;/strong&gt;: Aimed at generating insights quickly with minimal boilerplate code&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;altair&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;alt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Sample data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Create a scatter plot
&lt;/span&gt;&lt;span class="n"&gt;chart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mark_point&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tooltip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bokeh
&lt;/h3&gt;

&lt;p&gt;Why it’s great: Bokeh is highly customizable and ideal for creating interactive visualizations that look professional. Check out some examples of what you can do with Bokeh here.&lt;/p&gt;

&lt;p&gt;Key features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensive customization&lt;/strong&gt;: Tailor every aspect of your visualization, from tools to widgets&lt;br&gt;
&lt;strong&gt;Server-side interactivity&lt;/strong&gt;: Create dynamic dashboards and web apps with live data updates&lt;br&gt;
&lt;strong&gt;Multiple output options&lt;/strong&gt;: Embed in Jupyter notebooks, standalone HTML files, or integrate with Flask/Django&lt;br&gt;
&lt;strong&gt;Sophisticated widgets and tools&lt;/strong&gt;: Add sliders, dropdowns, and more for a richer user experience&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.plotting&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bokeh.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;output_notebook&lt;/span&gt;

&lt;span class="nf"&gt;output_notebook&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a simple line plot
&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Simple Line Plot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_axis_label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_axis_label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;line&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;line_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Comparison Chart
&lt;/h3&gt;

&lt;p&gt;Quick reference comparison chart of Python interactive plot libraries:&lt;br&gt;
&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1982i0oc73z9bxz67vxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1982i0oc73z9bxz67vxw.png" alt="Interactive Python charting libraries comparison" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with Plotly and go from there
&lt;/h2&gt;

&lt;p&gt;If you’re new to interactive plotting in Python, Plotly is an excellent place to start. Its combination of ease of use, robust interactivity, and extensive chart options makes it suitable for beginners and advanced users alike. Once you’re comfortable, explore Altair for concise code or Bokeh for advanced customizations.&lt;/p&gt;

&lt;p&gt;If you would like to get started quickly building interactive charts in Python and want to easily share your work, we invite you to try out Fabi.ai. You can &lt;a href="https://www.fabi.ai/" rel="noopener noreferrer"&gt;get started for free in less than 2 minutes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>data</category>
    </item>
    <item>
      <title>Who should be your first data hire and when should you hire them?</title>
      <dc:creator>Marc Dupuis</dc:creator>
      <pubDate>Thu, 21 Nov 2024 18:26:00 +0000</pubDate>
      <link>https://forem.com/mfdupuis/who-should-be-your-first-data-hire-and-when-should-you-hire-them-1llf</link>
      <guid>https://forem.com/mfdupuis/who-should-be-your-first-data-hire-and-when-should-you-hire-them-1llf</guid>
      <description>&lt;p&gt;Before we dive in, a bit about the authors and why our perspective may be relevant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/marcfdupuis/" rel="noopener noreferrer"&gt;Marc&lt;/a&gt; is CEO and co-founder of &lt;a href="https://www.fabi.ai/" rel="noopener noreferrer"&gt;Fabi.ai&lt;/a&gt;, and in his day to day role has spoken with hundreds of data leaders across all company sizes and industries. This gives Marc a unique vantage point on what generally does and doesn’t work for data teams. Marc has also led product teams at Clari and Assembled pre-first data hire and has experienced first-hand and assisted in the set up of the data team&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/aditya-goyal-58793947/" rel="noopener noreferrer"&gt;Aditya&lt;/a&gt; is the data leader at Shogun, and has led data teams at companies such as Poshmark and StrongDM. He comes with a wide variety of experiences and a deep experience with the modern data stack.
So… you’re a founder, maybe a CTO, or perhaps an engineering or product leader, and you’re starting to ask yourself if you should hire your first full-time data “person”. This is fantastic. If you’re asking yourself this question, it means you’ve likely made it &lt;a href="https://www.linkedin.com/posts/peterjameswalker_cartadata-seed-seriesa-activity-7128075191611523072-twip/?utm_source=share&amp;amp;utm_medium=member_desktop" rel="noopener noreferrer"&gt;past the chaos of the first few years&lt;/a&gt; as a startup and you’re starting to think about scale and operational efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the question of who you should hire for this role can be complicated. Do you hire a data scientist who can perform some magic on your data and uncover insights no one thought about? Or perhaps a data engineer who can actually clean up the data so that you can even try and start making sense of it?&lt;/p&gt;

&lt;p&gt;In this post, we’re going to first touch on what signs you should be looking for that indicate that you might be ready for your first “data hire”, and then once you’re ready to proceed, what experience that individual should come with and what you can realistically expect of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  When is the right time for your first data hire?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What should happen prior to your first hire
&lt;/h3&gt;

&lt;p&gt;Long before you’ve hired your first data employee, you’ve likely already been doing some form of data analytics; you just haven’t had a person doing it full time. Generally speaking, there are two stages an early-stage company goes through before it thinks about “analytics” as a full-time role:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Basic product &amp;amp; marketing analytics&lt;/strong&gt;: On the product front, you’re using a product analytics solution that stores events, such as Posthog, Amplitude, Mixpanel or Pendo. This gives you some basic user metrics to help you understand how customers are using your product. In a separate thread, you’ve set up some web analytics (likely in GA4) to start tracking website traffic and attribution, and if you’re following more of a B2B sales motion, you likely have a CRM (Hubspot, Attio or Clarify) that provides basic reports. At this stage, you should be embracing canned reports as much as possible since you’re mostly trying to get directional insights from your data. You likely don’t have too many users or such a large sales pipeline that you don’t already have a good feel for what’s going on. This is also the stage where you can, and should, embrace spreadsheets as much as possible. Your business is evolving so quickly that you should expect what you care about to continually change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL queries = BI&lt;/strong&gt;: At this stage you want to start digging into more custom reports that are unique to your business. This is when data stops fitting nicely into canned reporting solutions. You may have a certain type of record in your customer database that’s powering your product that you want to better understand. For example, going back to our fictional Superdope company that sells widgets, you may want to see how many customers have more than X widgets in their cart but haven’t checked out. The best way to do this is likely just to write a SQL query against your production data. If you’re just doing one-off queries and are comfortable in a SQL IDE, that will probably get the job done, or if you want to start building dashboards or board reports, you might adopt a &lt;a href="https://www.fabi.ai/blog/the-7-best-starter-bi-solutions" rel="noopener noreferrer"&gt;light-weight BI solution&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, before you hire your first data person, you should already have some data (it doesn’t need to be all unified and cleaned in a data warehouse), and you should already have a sense of the types of questions you’re trying to answer with data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Signs that you’re ready to hire your first full time data employee
&lt;/h3&gt;

&lt;p&gt;Deciding to hire a full-time data person isn’t a light decision. With your typical senior data scientist coming in at roughly $240k in total comp, and the tools they will likely ask for easily running you $10k-$20k a year, you should expect a real return on investment.&lt;/p&gt;

&lt;p&gt;Note that your first data hire likely should not be a data scientist (we touch on this more below), but this salary should give you a good gauge of what to expect in terms of compensation.&lt;/p&gt;

&lt;p&gt;The types of questions that you should be asking yourself that likely indicate that you’re ready are some flavors of the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How can we pull all our data together to get a big picture of the business and tell a cohesive story to the board and investors?&lt;/li&gt;
&lt;li&gt;What should we be working on to better serve our customers and accelerate growth?&lt;/li&gt;
&lt;li&gt;Should we be doing more A/B testing and quick experimentation to figure out what might help us hit our growth plans?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You may have noticed something about these questions: for the most part, they are not just the output and end of the journey. They’re actually the input into decisions that the business then has to make. So a very large part of being ready for your first data hire is also being ready to support them both financially and operationally. We touch on this below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites before you make the hire
&lt;/h3&gt;

&lt;p&gt;To ensure fertile ground for your first data hire, you need three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data&lt;/li&gt;
&lt;li&gt;Willingness to support them financially&lt;/li&gt;
&lt;li&gt;Willingness to execute on their insights&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Going back to what we discussed above about what should happen before you hire your first data employee, you need data. The data doesn’t have to be pristine, well defined and centralized. That’s part of their job. But it does need to exist and have some kernel of usefulness. If you’re not sure whether you have this, circle back with your sales, marketing, product and engineering teams to see what you have today.&lt;/p&gt;

&lt;p&gt;The second point is critical. As we touched on, even though a data scientist wouldn’t be your first hire, using their salaries as a measuring stick, you can easily expect to have to pay them $240k. The real cost of an employee is approximately 115% of their salary, and you need to be able and willing to purchase some of the tools they need to do their job. If we do some basic math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Salary: $240k&lt;/li&gt;
&lt;li&gt;Benefits and other costs: $36k&lt;/li&gt;
&lt;li&gt;Data warehouse: $10k&lt;/li&gt;
&lt;li&gt;ETL tooling: $2k&lt;/li&gt;
&lt;li&gt;Lightweight BI solution: $3k (a more heavy-duty BI solution typically starts around $10k)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That adds up to roughly $291k a year. You may already have a data warehouse and some of the other tools, so adjust as you see fit, but the general idea is that there are very material additional costs, and they need to be part of the plan.&lt;/p&gt;

&lt;p&gt;There are a lot of great “free”, open-source alternatives for some of these tools as well, but if that’s the plan, it needs to be factored into the hiring profile, and it requires more technical expertise, which comes at a cost. “Free” in this case doesn’t literally mean free: there are a lot of hidden costs to self-managing data platforms.&lt;/p&gt;

&lt;p&gt;Finally, you need to be ready to act on their insights. An insight is only useful if it influences a decision. If your roadmap and plans are set in stone for the next 6-12 months, it may not be that valuable to hire a data person. If you think data may help you make better decisions in the next few months, and you’re open to experimentation or adjusting plans, you’ll be setting them up for success.&lt;/p&gt;

&lt;p&gt;You’ll notice here that your first data hire should be providing insights that feed into your corporate strategy. For this very reason, you likely don’t want to hire a junior individual who doesn’t have executive exposure. More on this in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  What background should your first data hire have?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Technical expertise and experience for the first hire
&lt;/h3&gt;

&lt;p&gt;As we saw above, your first data hire should be able to work with data at a very technical level but also be a strategic advisor to your leadership team. That sounds like a unicorn and may feel a bit daunting, but fret not, they’re out there! And with the rise of powerful AI tools built specifically for data analytics and reporting, the profile of this individual has changed a bit.&lt;/p&gt;

&lt;p&gt;Starting with technical expertise: you may or may not already be extracting data and pulling it into a centralized data warehouse, but either way, this individual will own that process going forward. Doing so requires advanced SQL and Python at the very least. In addition, they should have experience with data modeling (dbt or Coalesce), version control (GitHub), and data warehousing (e.g. PostgreSQL, Snowflake, Redshift, MotherDuck, BigQuery).&lt;/p&gt;

&lt;p&gt;This type of technical expertise is particularly valuable in the age of AI. AI is only as good as the underlying data, and with AI transforming the way reporting is done, specific skills at the reporting level tend to matter less. That said, if you already have a BI solution that you’ve spent a lot of money on and started building out (we don’t recommend doing this before you hire your first full-time data employee), then this person should ideally have experience with that specific tool. But again, if you had to choose, you should lean more heavily on experience deeper in the stack since everything else is built on top of that.&lt;/p&gt;

&lt;p&gt;On the non-technical front, you should be looking for someone who you would invite into executive discussions. This is a person who has been an individual contributor, but also grown a team and ideally worked at a startup. If you have to pick one: focus on finding someone who has worked at a startup that grew its data function and had a strong mentor at the very least. During the interview process, you should consider asking candidates about some of their past projects which should ideally include some of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using event data to understand a customer’s journey&lt;/li&gt;
&lt;li&gt;Building a unified customer model&lt;/li&gt;
&lt;li&gt;Leveraging historical data to trace customer or user changes over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all the reasons above, we highly advise against trying to save a few dollars by hiring a relatively inexperienced individual. Someone with this profile, without proper guidance, will struggle to draw useful insights and will likely cause you to rack up tech debt that will only cost you more down the road.&lt;/p&gt;

&lt;p&gt;At this stage, we also feel that it’s important to call out that this person likely isn’t your finance person. Although you may be lucky enough to find someone who is deeply technical and can also help with financial planning and forecasting, these skillsets tend to be very distinct. Data can be very forgiving, but not when it comes to the financial health of the company, so best to leave that to someone who is experienced in that specific area.&lt;/p&gt;

&lt;p&gt;A note on titles: Titles in the data space are notoriously confusing and interchangeable. You may see someone with a “Senior Data Scientist” title whose experience is mostly building out BI while you may find someone with a “BI Lead” title who is actually quite knowledgeable on the machine learning and statistics front. But to attract the best candidates, this role should be titled “Head/Director of Data”.&lt;/p&gt;

&lt;h3&gt;
  
  
  30/60/90 plan and ROI
&lt;/h3&gt;

&lt;p&gt;If you’re considering hiring your first full time data employee, you’re likely wondering what the ramp up period should look like and when you can expect positive ROI.&lt;/p&gt;

&lt;p&gt;Let’s start with the more difficult question: How do you define ROI for this role and when should you expect positive ROI? Unfortunately, ROI on data projects remains nearly impossible to measure despite much debate (try searching “ROI on data teams”). The reality is that you simply can’t expect to put in $X and get out $Y when it comes to reporting. You’ll know your investment is worthwhile when it feels like you couldn’t operate without the insights that the team delivers. Put another way: following ramp up time, if your executive team is not turning towards the data team to get answers to help guide them, you may not be getting your value.&lt;/p&gt;

&lt;p&gt;Now for a ramp up plan for this first data hire:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First 30 days - The first month is about two things: understanding the business and understanding the state of the data stack. The very first thing this hire should do is gain a deep understanding of how your company makes money and how that’s measured today. Every other question branches out from there, and a failure to understand the fundamentals of the business model will cause this individual to pursue vanity metrics. They should meet with every executive, starting with the CEO and the acting heads of marketing and sales.&lt;/li&gt;
&lt;li&gt;First 60 days - At the end of 60 days, your data hire should understand the top priorities for all key functions of the business and how they can support them. Within those priorities, they should have identified the top 3 priorities across the entire business and have developed some form of reporting for those areas. At this point, you should reasonably expect this person to have developed any new data models (tables) they need for reporting and delivered useful and accurate reports in the executive team’s tool of choice - spreadsheets or lightweight BI are totally fine and perhaps even encouraged. A first data hire that comes in and immediately dives into BI and spends the first few months just getting set up is likely not going to be a good long term fit.&lt;/li&gt;
&lt;li&gt;First 90 days - By the end of the first 3 months, this person should have a deep understanding of what makes the business run, they should have already delivered reports for the top areas of focus and they should have started putting in place an operational heartbeat. They should be driving meetings or updates (&lt;a href="https://www.fabi.ai/product/workflows" rel="noopener noreferrer"&gt;new technologies make this incredibly easy&lt;/a&gt;) and be actively participating in strategic discussions. They should also have a full roadmap that encompasses the needs of all key executives.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Biggest mistakes when hiring your first data person
&lt;/h3&gt;

&lt;p&gt;Although we’ve already touched on this above, it’s worth calling out specifically the biggest mistakes we tend to see.&lt;/p&gt;

&lt;p&gt;The first is not setting clear expectations. Either with yourself or with the hire. If all you have is a general sense that “we need more reporting”, it might be a rough ride for all parties. If you’re able to clearly articulate why you’re hiring for this role and what you hope to see in the first 30/60/90 days, all parties will be much better off.&lt;/p&gt;

&lt;p&gt;The second mistake is not providing the proper support. We touched on this in detail above, so we won’t linger on this too much here, but not building in the monetary or resource budget to provide them the tools they need and the time to react to their plans or suggestions will render this first hire completely ineffective. Make sure you’re ready to provide more than just their salary.&lt;/p&gt;

&lt;p&gt;The third mistake is hiring someone who is very specialized in a certain department. For example, you may see someone with extensive experience as both an IC and a leader in “marketing analytics”, but unfortunately, if this is the individual’s only experience, they may have a really difficult time working across functions and will likely default back to what they’re most comfortable with: marketing. This is great for your marketing department, but they’re surely not the only ones in need of data support.&lt;/p&gt;

&lt;p&gt;Finally, the most common mistake is thinking this person just needs to write some SQL, and hiring someone with very limited experience. Without proper mentorship, they will likely get stuck in the technology, miss the forest for the trees, and ultimately end up costing you much more in both strategic direction and tech debt than you might expect.&lt;/p&gt;


&lt;h2&gt;
  
  
  Your first hire should be deeply technical with experience providing strategic guidance to executives
&lt;/h2&gt;

&lt;p&gt;If you’ve outgrown your pre-canned product and marketing analytics platforms, and you’re starting to wonder if there are insights in your data that could help drive the company strategy, you’re ready to start thinking about building out a data team. But before you do so, make sure you’re ready to provide the financial support they will need along with a willingness to experiment and adjust the plans based on their feedback.&lt;/p&gt;

&lt;p&gt;You’ll want an individual with deep SQL and Python experience, who has ideally led and grown data teams in previous startups. This person should be someone that you expect to turn towards for strategic advice, but they’re going to be on their own initially so they should be self-sufficient. Make sure you don’t hire too junior or too much of a specialist.&lt;/p&gt;

&lt;p&gt;It may feel daunting to find someone who fits the bill, but they’re out there, and once you find the right person, they will be a force multiplier on your business. Data can truly deliver competitive insights, and conversely, it can be a money pit, so it’s worth taking the time and waiting for the right moment to find the right person. Ultimately, you’ll need to consider the unique nature of your business and team and weigh your priorities to determine which traits are the most important for this role.&lt;/p&gt;

</description>
      <category>data</category>
      <category>analytics</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why use Python for data analysis (when you have Excel or Google Sheets)</title>
      <dc:creator>Marc Dupuis</dc:creator>
      <pubDate>Thu, 14 Nov 2024 18:47:00 +0000</pubDate>
      <link>https://forem.com/mfdupuis/why-use-python-for-data-analysis-when-you-have-excel-or-google-sheets-2b23</link>
      <guid>https://forem.com/mfdupuis/why-use-python-for-data-analysis-when-you-have-excel-or-google-sheets-2b23</guid>
<description>&lt;p&gt;TL;DR: While spreadsheets are perfect for many data tasks, Python becomes essential when you need to handle large datasets, create advanced visualizations, automate workflows, or use machine learning models. The key is knowing when to leverage each tool's strengths for your specific data analysis needs.&lt;/p&gt;

&lt;p&gt;While Python is often considered essential for data work, spreadsheets remain the most practical tool for many analysts' daily needs – and that's perfectly fine. But knowing when to graduate beyond them is crucial for advancing your data capabilities.&lt;/p&gt;

&lt;p&gt;If you look at any data analyst or data scientist curriculum, you'll find the same core tools: spreadsheets, SQL, Python, and various Business Intelligence (BI) solutions. Yet when I talk with data practitioners and leaders, a common question emerges: "Why switch to Python when spreadsheets handle most of my needs?"&lt;/p&gt;

&lt;p&gt;As someone who co-founded a company built on SQL, Python, and AI, my stance might surprise you: if a spreadsheet can do the job, use it. These tools have endured since the 1970s for good reason – they're intuitive, flexible, and excellent for explaining your work to others.&lt;/p&gt;

&lt;p&gt;But they have their limits. &lt;/p&gt;

&lt;p&gt;When you start conducting more ad hoc analysis or exploratory data analysis or dealing with more data in the enterprise, you’ll quickly run into a few issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They struggle with large datasets&lt;/li&gt;
&lt;li&gt;They offer limited visualization and dashboarding capabilities&lt;/li&gt;
&lt;li&gt;They make it difficult to build automated data pipelines&lt;/li&gt;
&lt;li&gt;They lack advanced statistical and machine learning capabilities&lt;/li&gt;
&lt;li&gt;They don't support version control, making it hard to follow engineering best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below, I’ll break down why spreadsheets remain invaluable for many tasks, and when Python becomes the necessary next step in your data journey.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why use Excel or Google Sheets?
&lt;/h2&gt;

&lt;p&gt;At their core, spreadsheets are powerful because they put you in complete control of your data workspace. Like having your own custom-built dashboard, they let you instantly manipulate, visualize, and analyze data exactly how you want.&lt;/p&gt;

&lt;p&gt;There are two main reasons that folks gravitate toward spreadsheets: &lt;/p&gt;
&lt;h3&gt;
  
  
  1. Spreadsheets are flexible and personalized
&lt;/h3&gt;

&lt;p&gt;Let’s start with the most obvious reasons why data practitioners, regardless of skill level, love spreadsheets: They’re incredibly flexible and customizable. &lt;/p&gt;

&lt;p&gt;In a spreadsheet, you’re working in your own environment, and you have full control over it. You want to highlight specific rows and create a quick chart? Easy. You want to add some conditional formatting to highlight a specific pattern? No problem. You want to add a row or column for some extra inputs? Go right ahead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac6on49x3iob3edzjrd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac6on49x3iob3edzjrd9.png" alt="Creating a quick chart in a spreadsheet by highlighting specific rows and columns." width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a user, you’re in full control, even in shared workspace environments like Google Sheets. This is really powerful, especially in contrast with traditional BI solutions where you can’t edit the data directly in line the same way, nor can you call out specific pieces of data without having to slice the data into smaller subsets, which can quickly get out of hand. As a matter of fact, some new BI solutions such as Sigma are capitalizing on this idea with a spreadsheet-like interface being their main pitch.&lt;/p&gt;

&lt;p&gt;All in all, there’s something deeply intuitive about the user experience of a spreadsheet. We learn math from a young age, and spreadsheets offer a nicely structured way of looking at data and understanding how all the numbers add up. &lt;/p&gt;
&lt;h3&gt;
  
  
  2. Spreadsheets are reactive &amp;amp; explainable
&lt;/h3&gt;

&lt;p&gt;Reactivity in spreadsheets means that when you change one number, everything connected to it updates automatically. This instant feedback makes them perfect for understanding how different pieces of data affect each other.&lt;/p&gt;

&lt;p&gt;For example, let’s say you have cells that are connected like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C1 = A1 + B2&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Reactivity means that when you update A1 or B2, C1 is automatically updated. There’s effectively a DAG which tracks the dependencies, or lineage, between all cells. This is an incredibly powerful concept, because, unlike with code, you don’t have to “run” the spreadsheet. You can simply create a model of the world and adjust inputs and see how the results react to that change.&lt;/p&gt;
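&lt;p&gt;To make this concrete, here’s a toy sketch (an illustration only, not how spreadsheet engines are actually implemented) of reactive cells in Python: each cell tracks its dependents, and an update ripples through the dependency graph automatically.&lt;/p&gt;

```python
# Toy sketch of spreadsheet-style reactivity: each cell remembers which
# cells depend on it, and changing a value recomputes everything downstream.
class Cell:
    def __init__(self, value=0):
        self._value = value
        self._formula = None    # callable that recomputes this cell
        self._dependents = []   # cells to refresh when this one changes

    @property
    def value(self):
        return self._value

    def set(self, value):
        """Update an input cell and propagate the change."""
        self._value = value
        for dep in self._dependents:
            dep._recompute()

    def define(self, formula, *inputs):
        """Make this cell reactive: value = formula(input values)."""
        self._formula = lambda: formula(*(c.value for c in inputs))
        for c in inputs:
            c._dependents.append(self)
        self._recompute()

    def _recompute(self):
        self._value = self._formula()
        for dep in self._dependents:
            dep._recompute()

# C1 = A1 + B2, as in the example above
a1, b2, c1 = Cell(1), Cell(2), Cell()
c1.define(lambda a, b: a + b, a1, b2)
print(c1.value)  # 3
a1.set(10)
print(c1.value)  # 12
```

&lt;p&gt;Real spreadsheet engines do something like this at scale, walking the dependency DAG so affected cells are recomputed in order.&lt;/p&gt;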

&lt;p&gt;This reactivity is also in large part what makes a spreadsheet so easy to understand. I can view an easily understood formula, click on it to highlight the cells it depends on, and adjust those cells to see how the number I’m looking at reacts and relates to them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rqp7j6h8qhh77dul0z4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rqp7j6h8qhh77dul0z4.png" alt="Cell reactivity and dependency in a spreadsheet" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the image above, if you want to know which numbers contribute most to Net Income Before Tax, you can simply click on the cell, view the dependent cells, and immediately understand which variables feed into Net Income Before Tax.&lt;/p&gt;

&lt;p&gt;For these reasons, if you’re able to do your work in a spreadsheet, it’s probably a good idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Python
&lt;/h2&gt;

&lt;p&gt;While spreadsheets excel at many tasks, Python opens up a whole new world of possibilities for data work. From handling massive datasets to creating complex visualizations and automating repetitive tasks, there are five reasons why Python is a powerful tool for your data workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Python easily tackles large amounts of data
&lt;/h3&gt;

&lt;p&gt;The first and most obvious reason to use Python shows up when dealing with large datasets. Excel supports roughly 1M rows by 16k columns, and Google Sheets supports approximately 10M cells. This may sound like a lot, and in many cases it’s plenty, but chances are you’ll eventually run up against these limits. In contrast, Python on a powerful machine can handle many orders of magnitude more data, especially if you leverage newer technologies like Polars and DuckDB.&lt;/p&gt;

&lt;p&gt;We may see an increase in limits with spreadsheets over time, but Python (especially in tandem with SQL) will always be able to handle more.&lt;/p&gt;
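&lt;p&gt;To illustrate the gap, here’s a small made-up example that streams an aggregation over 2 million rows, roughly double Excel’s row limit, in plain Python without ever holding the full dataset in memory:&lt;/p&gt;

```python
# Stream-aggregate 2,000,000 (region, sales) rows without loading them all:
# a generator yields rows one at a time, and we keep only running totals.
from collections import defaultdict

def rows(n):
    """Lazily generate n synthetic (region, sales) rows."""
    regions = ["NA", "EMEA", "APAC"]
    for i in range(n):
        yield regions[i % 3], i % 100

totals = defaultdict(int)
for region, sales in rows(2_000_000):
    totals[region] += sales

print(dict(totals))  # total sales per region
```

&lt;p&gt;Libraries like Polars and DuckDB apply the same streaming idea, with far better performance, to datasets that are orders of magnitude larger still.&lt;/p&gt;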
&lt;h3&gt;
  
  
  2. Python supports advanced &amp;amp; customized visualizations
&lt;/h3&gt;

&lt;p&gt;Spreadsheets offer some pretty powerful visuals, but they cover only a small fraction of what you can do with Python. I’m a big believer that bar charts, line charts, and maps cover the vast majority of cases, but telling a story with data often requires breaking from the mundane and creating an engaging canvas.&lt;/p&gt;

&lt;p&gt;For example, I love a good Sankey diagram to tell the story of how data flows from point A to point B. Or perhaps you want to create a radar plot to compare attributes from different categories.&lt;/p&gt;

&lt;p&gt;These can be incredibly easy to build in Python with libraries like Plotly, Seaborn, or Bokeh.&lt;/p&gt;

&lt;p&gt;To give you an example, let’s go back to our Superdope example from previous posts and say you want to compare product performance on a sunburst plot like the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzrqhtb1q1vzy5s1qc9s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzrqhtb1q1vzy5s1qc9s.png" alt="A sunburst chart can be a fun and powerful way to show how different categories compare." width="800" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generating this chart with code using a library such as plotly is rather straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import plotly.express as px

# Create the sunburst plot
fig = px.sunburst(
    df_sunburst,
    path=['Region', 'Category', 'Product'],
    values='Sales',
    color='Region',
    title='Sales Distribution by Region, Category, and Product',
    width=800,
    height=450
)

# Update layout
fig.update_layout(
    margin=dict(t=50, l=0, r=0, b=0)
)

# Show the plot
fig.show()
And this code can be generated by AI in about 3 seconds. Building something similar in a spreadsheet would require a lot more time and effort.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Python helps you automate data pipelines &amp;amp; cleaning
&lt;/h3&gt;

&lt;p&gt;When working with data, you’ll oftentimes find yourself doing repetitive data transformation tasks. Say, for example, you work in an industry where your clients regularly send you CSV or Excel files and you have to clean up and format the data, and turn it into a report or prepare it for another step. This is a perfect task for Python. If you’re managing your own server and are resourceful, you can write a script and schedule it to run using a Cron job, or if you would like to go with managed solutions that work out of the box and handle orchestration and more complex jobs, you can use a solution like Dagster or Airflow.&lt;/p&gt;

&lt;p&gt;As a general rule, these days it’s usually best to avoid home-grown Cron jobs unless you know exactly what you’re doing. Ensuring that these remain up and running, have proper logging and monitoring and are orchestrated properly can quickly turn into a lot of work.&lt;/p&gt;
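&lt;p&gt;As a minimal sketch of such a cleaning step (the data here is invented, and a real pipeline would read from actual files rather than in-memory strings), standard-library Python is often all you need:&lt;/p&gt;

```python
# Minimal CSV cleaning sketch: normalize names, validate numeric fields,
# drop bad rows, and write a clean copy. io.StringIO stands in for files.
import csv
import io

raw = io.StringIO(
    "name,amount\n"
    "  Alice ,100\n"
    "Bob,not_a_number\n"
    "carol,250\n"
)

cleaned = []
for row in csv.DictReader(raw):
    try:
        amount = float(row["amount"])
    except ValueError:
        continue  # skip rows that fail validation
    cleaned.append({"name": row["name"].strip().title(), "amount": amount})

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "amount"])
writer.writeheader()
writer.writerows(cleaned)
print(out.getvalue())
```

&lt;p&gt;Swap the StringIO objects for real file handles and schedule the script, and you have the skeleton of an automated pipeline.&lt;/p&gt;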

&lt;p&gt;Note: If you’re simply looking for a lightweight and quick way to build data pipelines, Fabi.ai may be a good option for you. We can help you easily set up a data wrangling and cleaning pipeline from and to any source, including CSV files and Excel files, in a matter of minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Python supports complex data analysis &amp;amp; machine learning
&lt;/h3&gt;

&lt;p&gt;You can do a lot in a spreadsheet, but building and using more advanced statistical and machine learning models is generally not one of them. If you’re simply doing a univariate data analysis and some simple calculations like distributions, averages, etc., a spreadsheet should get the job done. But if you want to venture into more advanced multivariate analysis, or perhaps even clustering, forecasting and churn prediction, Python is equipped with a rich suite of tools that work out of the box.&lt;/p&gt;

&lt;p&gt;Here are a few examples of the types of analysis you may want to do along with the corresponding Python package:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buyer or customer grouping using clustering: sklearn.cluster (ex. KMeans)&lt;/li&gt;
&lt;li&gt;Sales or marketing pipeline time series forecasting: Prophet or statsmodels (ex. ARIMA)&lt;/li&gt;
&lt;li&gt;Customer churn prediction: scikit-survival&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all advanced machine learning and statistical models implemented by some of the best engineers and researchers in the world, available for free and immediately ready to use in Python.&lt;/p&gt;
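&lt;p&gt;As a small taste of the first item, here’s a sketch using scikit-learn’s KMeans (assuming scikit-learn is installed; the customer data is invented for illustration):&lt;/p&gt;

```python
# Cluster customers on two toy features (spend, visits) with KMeans.
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: low-spend/low-visit and high-spend/high-visit
X = np.array([[20, 1], [25, 2], [22, 1],
              [200, 12], [210, 15], [190, 11]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # centroid of each cluster
```

&lt;p&gt;Three lines of modeling code give you a clustering that would be genuinely painful to reproduce in a spreadsheet.&lt;/p&gt;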

&lt;h3&gt;
  
  
  5. Python helps you follow code versioning &amp;amp; engineering best practices
&lt;/h3&gt;

&lt;p&gt;Finally, in a lot of cases, it’s good practice to ensure that your work is traceable and reproducible. &lt;/p&gt;

&lt;p&gt;In practice, what this means is that when someone else (or perhaps yourself at a later date), looks at your analysis, this individual should be able to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the data came from&lt;/li&gt;
&lt;li&gt;How the data was manipulated and how you got to your results&lt;/li&gt;
&lt;li&gt;How to reproduce the same results independently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, if working in a spreadsheet means exporting data and manipulating it somewhere disconnected from the original source, the results can become very hard to reproduce. It also means the steps you take during your analysis aren’t version controlled: as you make adjustments, the exact steps may not get recorded. This can set you up for a tough situation we’ve all been in at least once: you’ve built a beautiful analysis in a spreadsheet, shared it with some co-workers, gone back at a later date, and noticed that the data was different. You may go through the change history to understand what happened, to no avail.&lt;/p&gt;

&lt;p&gt;Using a version control system like GitHub or GitLab and committing changes to the underlying code as you conduct your analysis can help you avoid this type of situation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Verdict: For large data sets; advanced analysis and visualization; and automation, Python wins🏅
&lt;/h2&gt;

&lt;p&gt;If you’re looking to do complex ad hoc or exploratory data analysis, use advanced machine learning techniques, or build complex visualizations, Python is one of the best and most powerful tools for the job. &lt;/p&gt;

&lt;p&gt;Yes, spreadsheets are incredibly popular for good reason. If you’re dealing with relatively small datasets, in a one-off analysis that doesn’t need to be automated, Excel or Google Sheets are great tools. &lt;/p&gt;

&lt;p&gt;However, Python performs exceptionally well with large datasets that would be an issue for Excel or Google Sheets. Python is also commonly used to automate data pipelines, especially those that require some form of data transformation and cleaning.&lt;/p&gt;
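&lt;p&gt;As a rough sketch of the kind of cleaning step such a pipeline automates (the column names and values here are invented for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical raw export with duplicates, string-typed numbers, and a gap
raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-06", None],
    "amount": ["10.50", "10.50", "20.00", "15.25"],
})

# A typical cleaning pipeline: dedupe, fix types, drop incomplete rows
clean = (
    raw.drop_duplicates()
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"]),
           amount=lambda d: d["amount"].astype(float),
       )
       .dropna(subset=["order_date"])
)

print(clean)
```

Because the steps live in code, rerunning the pipeline on next week’s export is just a matter of executing the script again, rather than repeating a sequence of manual spreadsheet edits.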

&lt;p&gt;Like most things, there’s a time and place to use certain tools to make the most of their strengths. We built Fabi.ai to act as the bridge between all the tools, so you can have the best of both worlds. &lt;/p&gt;

&lt;p&gt;We make it incredibly easy to connect to any data source, including spreadsheets and files, and to build lightweight data pipelines. Our built-in SQL and Python interface, augmented with AI, makes it simple to leverage advanced machine learning and statistical models, regardless of prior experience. If you’re interested in checking us out, you can get started for &lt;a href="https://www.fabi.ai/" rel="noopener noreferrer"&gt;free today in less than 2 minutes&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>spreadsheets</category>
      <category>analytics</category>
      <category>python</category>
      <category>sheets</category>
    </item>
    <item>
      <title>The future of AI data visualization</title>
      <dc:creator>Marc Dupuis</dc:creator>
      <pubDate>Tue, 29 Oct 2024 05:01:48 +0000</pubDate>
      <link>https://forem.com/mfdupuis/the-future-of-ai-data-visualization-56oe</link>
      <guid>https://forem.com/mfdupuis/the-future-of-ai-data-visualization-56oe</guid>
<description>&lt;p&gt;Since LLMs hit the scene, one of the very first use cases demonstrated was data analysis. At this stage, most of us have used ChatGPT, Claude or some other AI to generate a chart, but it feels like the jury is still out on the role AI will play in data visualization. Will we continue to default to point-and-click charting? Will AI generate 100% of charts? Or is the future hybrid, intermixing some AI generation and some point-and-click?&lt;/p&gt;

&lt;p&gt;As a founder in the AI and data visualization space, I find this topic almost existential. As a company founded post-2022 (i.e. after LLMs hit the scene in a real way), we have to make a decision about how we want to handle charting. Do we invest hours and hours of dev work (and funds) to develop charting functionality, or is that going away, a sunk cost for all tools built pre-LLMs? Or is the future hybrid? I recently came across &lt;a href="https://github.com/microsoft/data-formulator" rel="noopener noreferrer"&gt;Data Formulator&lt;/a&gt;, a research project that explores some really interesting interactions between AI and traditional charting, and it revived this question for me.&lt;/p&gt;

&lt;p&gt;In this post I’m going to take a look at where we are today for text-to-chart (or text-to-visualization) and where we might be headed in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  The current state of text-to-visualization
&lt;/h2&gt;

&lt;p&gt;Like all things AI, this post likely won’t age very well. Some new piece of information or model will come out in the next 6 months and completely change how we think about this topic. Nonetheless, let’s take a look at the various states of data visualization and AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pure point-and-click charting
&lt;/h3&gt;

&lt;p&gt;I won’t linger on this one too much since most readers know this one well. Open up Excel, Google Sheets or any other data tool built pre-2023 and you’ll have some form of this. Sometimes you click to add data to an axis, sometimes you drag and drop a field, but the concept is the same: You structure the data appropriately, then you press a few buttons to generate a chart. &lt;/p&gt;

&lt;p&gt;In this paradigm, the vast majority of data cleaning and transformation happens prior to the charting. You can generally apply aggregation metrics like average, median, count, min, max etc. but all transformations are fairly rudimentary.&lt;/p&gt;

&lt;h3&gt;
  
  
  100% AI generated charting
&lt;/h3&gt;

&lt;p&gt;AI-generated charts, or text-to-visualization, have only really existed since the advent of modern LLMs (if we dig around, there were experiments before then, but for all practical purposes we can focus on post-2022 LLMs). &lt;/p&gt;

&lt;p&gt;OpenAI’s ChatGPT can generate non-interactive charts using Python, or a limited set of interactive charts using front-end libraries (see OpenAI Canvas for some examples). Anthropic has its own analogous concept in Claude’s Artifacts.&lt;/p&gt;

&lt;p&gt;It’s worth noting here that AI-generated charts can be subdivided into two families: purely Pythonic, back-end-generated charts, or a mix of back end and front end.&lt;/p&gt;

&lt;p&gt;ChatGPT and Claude alternate between the two. Training an AI to generate front-end code, and integrating that code to create visualizations, can be a lot more work than just relying on Python with a library such as Plotly, Matplotlib or Seaborn. On the other hand, front-end libraries give providers and users more control over the look, feel and interactivity of the chart. This is why LLM providers have their AI render basic charts like bar charts, line charts or scatter plots in the front end, but fall back to Python for anything more sophisticated, like a Sankey diagram or waterfall chart.&lt;/p&gt;

&lt;p&gt;A brief sidebar on &lt;a href="https://www.fabi.ai/" rel="noopener noreferrer"&gt;Fabi.ai&lt;/a&gt;: seeing as we’re a data analysis platform, we obviously offer charting, and despite some point-and-click charting, the vast majority of charts created by our users are AI-generated. So far, we’ve found that AI is remarkably good at generating charts, and by leveraging pure Python for charting, we’ve been able to train the AI to generate nearly any chart the user can dream up. For now, we’ve chosen that accuracy and flexibility over point-and-click functionality and custom UI designs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid: AI generation in a point-and-click paradigm
&lt;/h3&gt;

&lt;p&gt;This is where things start to get interesting in the debate over where AI text-to-visualization is headed. Fast forward three years: when someone doing an analysis uses AI, will they let the AI take 100% control, or will the AI operate in a mixed environment where it can only edit charts within the confines of certain point-and-click functionality?&lt;/p&gt;

&lt;p&gt;To help make this picture more concrete, check out Data Formulator. This is a recent research project that attempts to offer a true mixed environment where AI can make certain edits, but the user can take over and use the point-and-click functionality as needed.&lt;/p&gt;

&lt;p&gt;To ask the question using a car analogy: do you believe that cars of the future will have no steering wheel at all, or that a driver will still have to sit there, pay attention and occasionally take over, similar to how Tesla’s self-driving functionality currently works?&lt;/p&gt;
&lt;h2&gt;
  
  
  First principles: What I believe to be true
&lt;/h2&gt;

&lt;p&gt;The question of where things are headed is really important to us at Fabi.ai seeing as this could greatly influence certain decisions we make: Do we invest in integrating a charting library in the front end? Do we even bother with point-and-click functionality at all? As a growing, innovative company leading in the AI data analysis space, we need to be thinking about where the puck is going, not where it currently is.&lt;/p&gt;

&lt;p&gt;So to answer this question, I’m going to use some first-principle thinking.&lt;/p&gt;
&lt;h3&gt;
  
  
  AI is only getting better, faster, cheaper
&lt;/h3&gt;

&lt;p&gt;From the very first time I used AI, when complaints arose around speed and cost, I’ve believed that AI would continue getting better, faster and cheaper. Roughly speaking, the cost per token has fallen by 87% per year over the past few years. Not only has the cost gone down, but accuracy and speed have both gone up drastically as well. &lt;/p&gt;
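&lt;p&gt;To make the compounding effect of that rate concrete, here’s a quick back-of-the-envelope calculation (assuming the roughly 87%-per-year decline holds steady):&lt;/p&gt;

```python
# If cost per token falls ~87% per year, ~13% of the cost remains each year
annual_retention = 1 - 0.87

for years in (1, 2, 3):
    remaining = annual_retention ** years
    print(f"After {years} year(s), ~{remaining:.2%} of the original cost remains")
```

At that pace, a workload that costs $100 in tokens today would cost well under $1 three years from now.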

&lt;p&gt;In the next 10 years, we’re going to look back on 2024 LLMs the same way we look back on “supercomputers” from the 80s and 90s now that we all have supercomputers in our pockets everywhere we go.&lt;/p&gt;

&lt;p&gt;All that to say: any argument for or against the charting approaches mentioned above cannot be that AI is too slow, expensive or inaccurate to generate charts. In other words, to believe that point-and-click charting will still exist in any way, shape or form, you have to believe that there is something about the user experience or the use case that merits that functionality.&lt;/p&gt;
&lt;h3&gt;
  
  
  The hard part about data visualization is the data wrangling and cleaning
&lt;/h3&gt;

&lt;p&gt;In my experience, when doing any form of data analysis that involves visualization, the hard part is not the charting. The hard part is getting the data cleaned and ready in the right format for the chart I’m trying to create.&lt;/p&gt;

&lt;p&gt;Say I have some user event data with the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event ID&lt;/li&gt;
&lt;li&gt;Event start timestamp&lt;/li&gt;
&lt;li&gt;Event end timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now say I want to plot the average event duration by hour to measure latency. Before I can do any sort of charting in a spreadsheet or legacy charting tool, I have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate the end time minus the start time (I probably have to do some sort of formatting first)&lt;/li&gt;
&lt;li&gt;Pivot the data by hour, which is actually surprisingly hard to do&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But ask AI to do this, and it takes care of all of that, plus the charting, in just a second or two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate the event duration in hours
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event end datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event start datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;

&lt;span class="c1"&gt;# Extract the start hour from the start datetime
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Start hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event start datetime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt;

&lt;span class="c1"&gt;# Group by start hour and calculate the average duration
&lt;/span&gt;&lt;span class="n"&gt;average_duration_by_hour&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Start hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Plot using Plotly
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;average_duration_by_hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Start hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average Event Duration by Hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Average Duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Start hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hour of Day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Event duration (hours)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the figure
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this was one of the simplest possible examples. Most of the time, real-world data is much more complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  The future of AI text-to-visualization: some point and click with 100% AI generated
&lt;/h2&gt;

&lt;p&gt;At this point, you likely have a sense of where I’m leaning. As long as you can get your dataset roughly right with all the data needed for an analysis, AI already does a remarkably good job at manipulating it and charting it in the blink of an eye. Fast forward one, two or three years from now, it’s hard to imagine that this won’t be the standard.&lt;/p&gt;

&lt;p&gt;That said, some interesting hybrid approaches are cropping up, like &lt;a href="https://github.com/microsoft/data-formulator" rel="noopener noreferrer"&gt;Data Formulator&lt;/a&gt;. The case for this type of approach is that our hands and brains may be able to make quick tweaks faster than we can think about what we want and explain it clearly enough for the AI to do its job. If I ask “Show me total sales by month over the last 12 months” while assuming the result should be a stacked bar chart broken out by region, I may find it easier to just move my mouse around. If that’s the case, the hybrid approach may be the most interesting: ask the AI to take a first stab at it, then a few clicks and you have what you want.&lt;/p&gt;

&lt;p&gt;The key to success for either a full AI approach or a hybrid approach is going to be in the user experience. Especially for the hybrid approach, the AI and human interactions have to work perfectly hand in hand and be incredibly intuitive to the user.&lt;/p&gt;

&lt;p&gt;I’m excited to watch the space develop and where we head with text-to-visualization in the next 12 months.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
