<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Julien Kervizic</title>
    <description>The latest articles on Forem by Julien Kervizic (@julienkervizic).</description>
    <link>https://forem.com/julienkervizic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F204174%2Fda7c2063-bda3-4ab5-8bad-de913803d5ac.jpg</url>
      <title>Forem: Julien Kervizic</title>
      <link>https://forem.com/julienkervizic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/julienkervizic"/>
    <language>en</language>
    <item>
      <title>Azure Message Brokers patterns for Data Applications</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Tue, 24 Sep 2019 19:19:05 +0000</pubDate>
      <link>https://forem.com/julienkervizic/azure-message-brokers-patterns-for-data-applications-1din</link>
      <guid>https://forem.com/julienkervizic/azure-message-brokers-patterns-for-data-applications-1din</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AK8LFtGiFSRF2m8Jj" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AK8LFtGiFSRF2m8Jj"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@adeolueletu?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Adeolu Eletu&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have previously written about how pub/sub patterns can help put &lt;a href="https://medium.com/analytics-and-data/overview-of-the-different-approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86#d6d6" rel="noopener noreferrer"&gt;machine learning models into production&lt;/a&gt;, but message brokers have a wider use in data applications than just machine learning. On Azure, there are two main message brokers, Service Bus and Event Hub. Knowing what these message brokers can do, how they should be used for developing data applications, and which one is more suitable for a given type of use case is essential to get the most out of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Service bus and Event Hub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;According to the Azure &lt;a href="https://docs.microsoft.com/en-us/azure/event-grid/compare-messaging-services#comparison-of-services" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;, Service Bus is meant for high-value messaging while Event Hub is meant for big data pipelines. Both message broker services have a use in analytics applications, but the key is to understand the specifics of each system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Service Bus&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some of the key features of Service Bus include duplicate detection, transaction processing and routing.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Duplicate detection:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Service Bus checks the MessageId of each incoming message for topics that have the “requiresDuplicateDetection” property set to true.&lt;/p&gt;

&lt;p&gt;It looks back at the history of messages that have passed through the topic, based on the duplicateDetectionHistoryTimeWindow property. An example ARM template incorporating the deduplication mechanism is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "$schema": "[https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#](https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#)",
  "contentVersion": "1.0.0.0",
  "parameters": {
  },
  "variables": { },
  "resources": [
    {
      "apiVersion": "2017-04-01",
      "name": "my\_sb\_namespace",
      "type": "Microsoft.ServiceBus/namespaces",
      "location": "West Europe",
      "properties": {
      },
      "resources": [
        {
          "apiVersion": "2017-04-01",
          "name": "my\_serviceBusTopicName",
          "type": "topics",
          "dependsOn": [
            "[concat('Microsoft.ServiceBus/namespaces/', 'my\_sb\_namespace')]"
          ],
          "properties": {
            **"requiresDuplicateDetection": "true",  
            "duplicateDetectionHistoryTimeWindow": "PT10M",**"path": "my\_serviceBusTopicName",
            "lockDuration": "PT5M",
          },
          "resources": []
        }
      ]
    }
  ],
  "outputs": {}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This property of Service Bus can, for instance, be leveraged in an application capturing order information from a tracking script, to ensure that no duplicate information flows into downstream systems from occurrences such as a page refresh on a thank-you page.&lt;/p&gt;
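&lt;p&gt;To make the behaviour concrete, here is a minimal in-memory sketch of a duplicate-detection window in plain Python. The class and values are illustrative only, mirroring the PT10M window from the template above; this is not an Azure SDK API.&lt;/p&gt;

```python
import time

class DedupTopic:
    """Illustrative sketch of Service Bus duplicate detection:
    a MessageId seen again within the detection window is dropped."""

    def __init__(self, window_seconds=600):  # PT10M, as in the ARM template
        self.window = window_seconds
        self.seen = {}        # message_id -> timestamp of last accepted message
        self.accepted = []    # messages that made it past deduplication

    def publish(self, message_id, body, now=None):
        now = time.time() if now is None else now
        last = self.seen.get(message_id)
        if last is not None and now - last < self.window:
            return False      # duplicate within the window: dropped
        self.seen[message_id] = now
        self.accepted.append(body)
        return True

topic = DedupTopic(window_seconds=600)
topic.publish("order-123", {"total": 40}, now=0)
topic.publish("order-123", {"total": 40}, now=5)     # page refresh: dropped
topic.publish("order-123", {"total": 40}, now=1200)  # outside window: accepted
```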

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F709%2F1%2A8nZUuFklFkJvCs1X_v40Uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F709%2F1%2A8nZUuFklFkJvCs1X_v40Uw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One example of this type of application is shown above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A tracking script generates an HTTP call to an API endpoint&lt;/li&gt;
&lt;li&gt;The API pushes the message to a Service Bus topic with duplicate detection enabled&lt;/li&gt;
&lt;li&gt;An analytics ingestor application reads the messages from a topic subscription and pushes the data to Google Analytics using the &lt;a href="https://medium.com/analytics-and-data/the-complexity-of-implementing-google-analytics-22b3061248e6#a784" rel="noopener noreferrer"&gt;measurement protocol&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Peek / Lock:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Azure Service Bus offers two ways to handle messages: a destructive read, which removes each message as it is read, and a non-destructive read using a peek/lock mechanism.&lt;/p&gt;

&lt;p&gt;Setting up the non-destructive read only requires setting one parameter, the receive mode, on the receiving function.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;The peek/lock mechanism makes it possible to implement &lt;strong&gt;transactions&lt;/strong&gt; using Azure Service Bus. Messages can be completed (deleted) on success, and otherwise either abandoned explicitly or released when the lock time expires. Abandoning a message puts it back into the queue for further processing attempts.&lt;/p&gt;
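&lt;p&gt;These semantics can be sketched with a small in-memory queue. This is an illustration of the peek/lock behaviour only, not the Azure SDK; all names and the PT5M lock duration are taken as assumptions from the template earlier.&lt;/p&gt;

```python
class PeekLockQueue:
    """Illustrative sketch of peek/lock semantics: a received message is
    locked (invisible to other receivers) until it is completed, abandoned,
    or its lock expires."""

    def __init__(self, lock_duration=300):   # PT5M, as in the ARM template
        self.lock_duration = lock_duration
        self.messages = {}    # msg_id -> body
        self.locks = {}       # msg_id -> lock expiry time
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = body
        self._next_id += 1

    def receive(self, now):
        for msg_id, body in self.messages.items():
            if self.locks.get(msg_id, 0) <= now:   # unlocked, or lock expired
                self.locks[msg_id] = now + self.lock_duration
                return msg_id, body
        return None

    def complete(self, msg_id):   # success: destructive removal
        self.messages.pop(msg_id, None)
        self.locks.pop(msg_id, None)

    def abandon(self, msg_id):    # failure: message becomes visible again
        self.locks.pop(msg_id, None)

queue = PeekLockQueue()
queue.send("order created")
msg_id, body = queue.receive(now=0)   # locked: invisible to other receivers
queue.complete(msg_id)                # transaction succeeded: message removed
```

&lt;p&gt;While a lock is held, a second call to receive does not return the locked message, which is what lets several receivers scale out safely; abandoning the message, or letting the lock expire, makes it visible again for a retry.&lt;/p&gt;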

&lt;p&gt;Being able to handle transactions is particularly important when dealing with compliance issues, such as revenue recognition, or with messages that have a direct downstream impact, for example a request for delivery of inventory that needs to be planned.&lt;/p&gt;

&lt;p&gt;Both the lock and the destructive read mechanisms enable &lt;strong&gt;receiver scaling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F696%2F1%2Ay47yR2Y6FPJruFRqkGNnqA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F696%2F1%2Ay47yR2Y6FPJruFRqkGNnqA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, each receiver locks a set of messages. While locked, those messages are invisible to the other receivers. This mechanism allows scaling the downstream ingestion of messages without the risk of ingesting the same message multiple times.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Routing:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Routing is enabled using &lt;a href="https://docs.microsoft.com/en-us/azure/service-bus-messaging/topic-filters" rel="noopener noreferrer"&gt;topic filter&lt;/a&gt; rules on subscriptions. There are three kinds of filters that can be applied to a subscription: boolean, SQL and correlation filters. These can leverage the data present in system and user (custom) properties.&lt;/p&gt;

&lt;p&gt;For instance, to only receive the messages on a specific topic for a given country, assuming {‘country’: ‘GB’} had been set in the message’s custom properties, the following Azure CLI command would create the necessary filter on the subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;az servicebus topic subscription rule create \
  --resource-group ${AZURE\_RESOURCE\_GROUP}\
  --namespace-name ${SB\_NAMESPACE\_NAME} \
  --topic-name ${TOPIC\_NAME} \
  --subscription-name ${SUBSCRIPTION\_NAME}\
  --name ${FILTER\_NAME}\
  --filter-sql-expression "user.country='${COUNTRY}'"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example of how this kind of routing mechanism could be useful is described below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F838%2F1%2AKJFuoZ8aBJZ4kiPpMahixw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F838%2F1%2AKJFuoZ8aBJZ4kiPpMahixw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A churn scoring application could publish messages to a Service Bus topic with two subscriptions: one that only receives customers likely to churn, and a generic one that receives all customers. Based on these topic subscriptions, two applications could then each consume their own stream of messages, one for instance to trigger a retention campaign, the other to simply update the score of every customer in the database.&lt;/p&gt;
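&lt;p&gt;The routing logic of this churn example can be sketched as follows; the predicates stand in for subscription filter rules and are purely illustrative, not a Service Bus API.&lt;/p&gt;

```python
class FilteredTopic:
    """Illustrative sketch of a topic whose subscriptions receive only the
    messages matching their filter rule, evaluated on custom (user) properties."""

    def __init__(self):
        self.subscriptions = {}   # name -> (predicate, inbox)

    def add_subscription(self, name, predicate=lambda props: True):
        self.subscriptions[name] = (predicate, [])

    def publish(self, body, **user_properties):
        # Each subscription whose rule matches gets its own copy of the message.
        for predicate, inbox in self.subscriptions.values():
            if predicate(user_properties):
                inbox.append(body)

    def messages(self, name):
        return self.subscriptions[name][1]

topic = FilteredTopic()
topic.add_subscription("likely_to_churn", lambda p: p.get("churn_score", 0) > 0.8)
topic.add_subscription("all_customers")
topic.publish({"customer": "a"}, churn_score=0.95)  # delivered to both subscriptions
topic.publish({"customer": "b"}, churn_score=0.10)  # only to all_customers
```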

&lt;h3&gt;
  
  
  EventHub
&lt;/h3&gt;

&lt;p&gt;EventHub is known for its low latency and high scalability, making it particularly well suited to handle real-time data and big data. Lesser known are its replay and data capture functionalities as well as its Kafka integration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Partition ownership and Checkpoints
&lt;/h4&gt;

&lt;p&gt;While Azure Service Bus uses the concept of message locks to handle being able to use multiple consumers, Azure Event Hub relies on the concept of partition ownership.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F770%2F1%2A0JecxSet59biSg-f321KdA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F770%2F1%2A0JecxSet59biSg-f321KdA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Partitions for a given consumer group get assigned to a receiver application for processing purposes. In the example above, receiver application 1 gets assigned partitions 1 and 2, while receiver application 2 gets assigned partitions 3 and 4. It is best practice to have only one active receiver application per consumer group / partition pair.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F799%2F1%2AWcj6uPy_YK61KvVdAnPHaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F799%2F1%2AWcj6uPy_YK61KvVdAnPHaw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of locks and deletion, Event Hub uses the concept of checkpoints to track which events have been processed. Once events have been processed, the position of the last event in (ingestion) time is checkpointed within a partition, indicating that the receiver should process messages from that point onward.&lt;/p&gt;

&lt;p&gt;Since the events are kept within the Event Hub for the duration of the retention period, this approach makes it possible to “replay” the data. This can be useful when needing to train Machine Learning models and see how they behave on &lt;a href="https://medium.com/streaming-at-scale-in-azure/streaming-at-scale-with-event-hub-capture-79c30fb9f3e5#d6b6" rel="noopener noreferrer"&gt;past historical data&lt;/a&gt;.&lt;/p&gt;
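&lt;p&gt;Checkpointing and replay can be sketched as follows: since events are retained rather than deleted, a checkpoint is just a position in the partition, and replay is a seek. The class is an illustration, not the Event Hub SDK.&lt;/p&gt;

```python
class PartitionReader:
    """Illustrative sketch of Event Hub-style checkpointing: events stay in
    the partition for the retention period; a checkpoint only records the
    reader's position, so moving it back replays past events."""

    def __init__(self, events):
        self.events = list(events)   # retained event log for one partition
        self.checkpoint = 0          # offset of the next event to process

    def read_batch(self, max_count):
        batch = self.events[self.checkpoint:self.checkpoint + max_count]
        self.checkpoint += len(batch)   # checkpoint advances past processed events
        return batch

    def replay_from(self, offset):
        self.checkpoint = offset     # nothing is deleted, so replay is just a seek

reader = PartitionReader(["e1", "e2", "e3", "e4"])
first = reader.read_batch(2)    # ["e1", "e2"]
reader.replay_from(0)           # e.g. re-score a model on past historical data
again = reader.read_batch(4)    # all four events, replayed
```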

&lt;h4&gt;
  
  
  Data Ingestion
&lt;/h4&gt;

&lt;p&gt;EventHub offers the possibility to export the ingested data directly onto Blob Storage or Azure Data Lake, through its &lt;a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview" rel="noopener noreferrer"&gt;capture&lt;/a&gt; functionality.&lt;/p&gt;

&lt;p&gt;Once in a Data Lake/Blob Storage, you can leverage &lt;a href="https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-get-started" rel="noopener noreferrer"&gt;Azure Data Lake Analytics&lt;/a&gt;, or computation engines such as &lt;a href="https://medium.com/@ArsenVlad/presto-querying-data-in-azure-blob-storage-and-azure-data-lake-store-99149c8c796" rel="noopener noreferrer"&gt;Presto&lt;/a&gt; or &lt;a href="https://medium.com/streaming-at-scale-in-azure/apache-drill-azure-blobs-and-azure-stream-analytics-ef34a1360d2b" rel="noopener noreferrer"&gt;Apache Drill&lt;/a&gt;, to directly query this data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kafka
&lt;/h4&gt;

&lt;p&gt;Event Hub offers a &lt;a href="https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview" rel="noopener noreferrer"&gt;Kafka interface&lt;/a&gt;, which notably enables an integration with Apache Spark. Spark offers a &lt;a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html" rel="noopener noreferrer"&gt;straightforward&lt;/a&gt; way to deal with streams, and notably a way to perform &lt;a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations" rel="noopener noreferrer"&gt;windowed operations&lt;/a&gt;, which is particularly useful for doing &lt;strong&gt;real-time aggregations&lt;/strong&gt;. The following &lt;a href="https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html" rel="noopener noreferrer"&gt;post from Databricks&lt;/a&gt; explains how this time aggregation works.&lt;/p&gt;
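&lt;p&gt;To illustrate what a windowed aggregation computes (Spark’s actual streaming API differs), here is a batch sketch of a tumbling-window count in plain Python; the function and data are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Sketch of a windowed aggregation over a stream: events (timestamp, key)
    are bucketed into fixed, non-overlapping windows and counted per key,
    e.g. page views per country per minute."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "GB"), (10, "GB"), (30, "NL"), (70, "GB")]
tumbling_window_counts(events, window_seconds=60)
# {(0, 'GB'): 2, (0, 'NL'): 1, (60, 'GB'): 1}
```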

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/julienkervizic/" rel="noopener noreferrer"&gt;Julien Kervizic - Senior Enterprise Data Architect - GrandVision NV | LinkedIn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data" rel="noopener noreferrer"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37" rel="noopener noreferrer"&gt;ON the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/overview-of-the-different-approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86" rel="noopener noreferrer"&gt;Overview of the different approaches to putting Machine Learning (ML) models in production&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b" rel="noopener noreferrer"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/overview-of-efficiency-concepts-in-big-data-engineering-418995f5f992" rel="noopener noreferrer"&gt;Overview of efficiency concepts in Big Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>azure</category>
    </item>
    <item>
      <title>ON the evolution of Data Engineering</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Tue, 30 Jul 2019 21:40:30 +0000</pubDate>
      <link>https://forem.com/julienkervizic/on-the-evolution-of-data-engineering-51hk</link>
      <guid>https://forem.com/julienkervizic/on-the-evolution-of-data-engineering-51hk</guid>
      <description>&lt;p&gt;A few years ago being a data engineer meant managing data in and out of a database, creating pipelines in SQL or Procedural SQL and doing some form of ETL to load data in a data-warehouse, creating data-structures to unify, standardize and (de)normalize datasets for analytical purpose in a non-realtime manner. Some companies were adding to that a more front facing business components that involved building analytic cubes and dashboard for business users.&lt;/p&gt;

&lt;p&gt;In 2018 and beyond, the role and scope of data engineers have changed quite drastically. The emergence of data products created a gap to fill, requiring a mix of skills not traditionally embedded within typical development teams; the more software-development-oriented data engineers and the more data-oriented backend engineers were in a prime position to fill this gap.&lt;/p&gt;

&lt;p&gt;This evolution was facilitated by a growing number of technologies that helped to bridge the gap both for those of Data Engineering and those of a more Backend Engineering background.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8pPWXcm0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/1%2A8cv0TDr2AgJRlPZOlT6vig.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8pPWXcm0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/1%2A8cv0TDr2AgJRlPZOlT6vig.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Big Data&lt;/strong&gt;: The emergence of Big Data, and the associated technologies that came with it, drastically changed the data landscape. With Hadoop open sourced in 2006, it became easier and cheaper to store large amounts of data; unlike traditional RDBMS databases, Hadoop did not require a lot of structuring in order to be able to process the data. The complexity of developing on Hadoop was initially quite high, requiring the development of MapReduce jobs in Java. The challenges of processing big data forced the emergence of Backend Engineers working on analytical data workflows. It was not until Hive was open sourced in 2010 that the more traditional data engineers got an easy bridge to onboard into this era of Big Data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HG1iY75H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/0%2AwpCdSbrSFvmHbMT_" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HG1iY75H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1400/0%2AwpCdSbrSFvmHbMT_" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Data Orchestration Engines&lt;/strong&gt;: With the development of Big Data, large internet companies were faced with the challenge of operating complex data flows without tools such as SSIS, used for more traditional RDBMS work, being available in this ecosystem. Spotify open sourced Luigi in 2012, and Airbnb Airflow (inspired by a similar system at Facebook) in 2015. Coming from a heavily engineering-driven background, these orchestration engines were essentially data flows as code.&lt;/p&gt;

&lt;p&gt;Python being the language most of these orchestration engines were built on helped them gain ground, benefiting from the traction of the PyData ecosystem and from the increased use of Python among production engineers. Traditional data engineers coming into this ecosystem needed to adapt and up-skill in software engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yOsSf3Y4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r7afxqsi92vxgkndt0yj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yOsSf3Y4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r7afxqsi92vxgkndt0yj.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Machine Learning&lt;/strong&gt;: With the trove of data it was now possible to collect from the internet, Machine Learning quickly gained traction. Until the advent of Hadoop, Machine Learning models were usually trained on a single machine and applied in a very ad-hoc manner. For large internet companies in the early days of Hadoop, leveraging machine learning models required some advanced software development knowledge in order to train and apply models in production, with the use of frameworks such as Mahout building upon MapReduce.&lt;/p&gt;

&lt;p&gt;Some backend engineers started to specialize in this area to become Machine Learning Engineers, very production-focused Data Scientists. For a lot of startups, this kind of development was however overkill. Improvements in SKLearn, a Python project started in 2007, and the popularization of orchestration engines made it fairly easy to go from a proof of concept by a Data Scientist to production-ready workflows by Data Engineers for moderately sized datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_6b-ZnHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2ybocd9kj9xirdhgh30f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_6b-ZnHU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/2ybocd9kj9xirdhgh30f.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Spark &amp;amp; Real-time&lt;/strong&gt;: It was the release of Spark’s MLlib for Python in 2014 that democratized machine learning computation on Big Data. The API was fairly similar to the one data scientists were used to from the PyData ecosystem, and further development of Spark helped bridge the gap. Spark also offered a way for data engineers to easily process streaming data, opening a window towards real-time processing. Spark enabled an increased contribution of data engineers towards data products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CUUL6bF---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bw3bwp3tl8t688huk0g7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CUUL6bF---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bw3bwp3tl8t688huk0g7.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Cloud development &amp;amp; Serverless&lt;/strong&gt;: AWS was officially launched in 2006, and its storage layer S3 could serve as a backing store for Hadoop, the traditional big data platform. Elastic MapReduce was launched in 2009, making it easier to dynamically spin up and scale Hadoop clusters for processing purposes.&lt;/p&gt;

&lt;p&gt;The move to the cloud had multiple implications for data engineers. The cloud abstracted physical limitations; for most users it meant that storage and compute were essentially infinite, provided one could pay for them. Optimizations previously done to keep the business running while waiting for new servers to be installed or upgraded were no longer needed, nor was the work previously done scheduling tasks to spread load across time due to resource constraints. The cloud, by allowing resources to scale up and down, made it much easier to handle the high-peak batch jobs typical of data engineering. This however came at the cost of having to manage infrastructure and the scaling process through code.&lt;/p&gt;

&lt;p&gt;The introduction of Lambda functions on AWS in late 2014 kicked off the serverless movement. From a data perspective, data could now easily be ingested without managing infrastructure. The release of Athena in late 2016 pushed things further, allowing queries directly on S3 without the need to set up a cluster. This frees data engineers from managing infrastructure that scales with requests, allowing them to spend more time on development.&lt;/p&gt;




&lt;p&gt;The role of the data engineer is no longer just to provide support for analytics purposes, but to own data flows and to be able to serve data both to production and for analytics purposes.&lt;/p&gt;

&lt;p&gt;To that end, Data Engineering has been looking more towards a software engineering perspective. Maxime Beauchemin’s post on functional data engineering advocates borrowing patterns from functional programming and applying them to data engineering. The emerging DataOps movement and its manifesto in turn borrow from the DevOps movement in software engineering.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>etl</category>
      <category>sql</category>
    </item>
    <item>
      <title>CDPs and DMPs a story of Advertising platforms</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Fri, 19 Jul 2019 06:25:55 +0000</pubDate>
      <link>https://forem.com/julienkervizic/cdps-and-dmps-a-story-of-advertising-platforms-4kl3</link>
      <guid>https://forem.com/julienkervizic/cdps-and-dmps-a-story-of-advertising-platforms-4kl3</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Afnr80MmoEuZNQJPtsJzfjA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2Afnr80MmoEuZNQJPtsJzfjA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Customer Data Platforms and Data Management Platforms are some of the components of a data-driven marketing technology stack. They provide tools for marketers to effectively target different segments of the population.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Customer Data Platform (CDP)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Customer Data Platform is meant as a data store for all internal attributes and events related to a given customer. It is meant to provide a way to unify data across multiple data sources, with potentially different types of customer identifiers, and as such needs to offer a way to resolve identities.&lt;/p&gt;
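&lt;p&gt;One common way to implement identity resolution, shown here as an illustrative sketch rather than any particular vendor’s approach, is a union-find structure: any two identifiers observed together on the same event are merged into one profile. All names below are hypothetical.&lt;/p&gt;

```python
class IdentityGraph:
    """Sketch of identity resolution via union-find: identifiers observed
    together (e.g. a cookie and an email on the same login event) are
    merged into a single customer profile."""

    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def link(self, a, b):
        # Record that identifiers a and b belong to the same customer.
        self.parent[self._find(a)] = self._find(b)

    def same_customer(self, a, b):
        return self._find(a) == self._find(b)

graph = IdentityGraph()
graph.link("cookie:abc", "email:jo@example.com")   # login event
graph.link("email:jo@example.com", "crm:4711")     # CRM import
graph.same_customer("cookie:abc", "crm:4711")      # True
```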

&lt;h4&gt;
  
  
  Use cases for a CDP:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Single Source of Truth:&lt;/strong&gt; The aim of the customer data platform is to act as the single source of truth for customer data, offering a 360-degree view of customer behavior across the different touch points and identities.&lt;/p&gt;

&lt;p&gt;It also enables structuring and standardizing the inputs and data models of customers, for instance with mParticle’s &lt;a href="https://docs.mparticle.com/developers/server/json-reference/#commerce_event" rel="noopener noreferrer"&gt;commerce events&lt;/a&gt;, which structure different product actions onto the same schema. Beside the standardization component, part of the CDP’s role during the data integration phase is also to provide better data quality by cleansing and de-duping the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advertising:&lt;/strong&gt; Since CDPs are able to collect data from the different touch points of customer interaction and to resolve customer identities, they are also able to harmonize the communication towards customers across the different communication channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empower Personalization:&lt;/strong&gt; &lt;a href="https://medium.com/analytics-and-data/principles-of-e-commerce-personalization-4feb6de4b6e" rel="noopener noreferrer"&gt;Personalization&lt;/a&gt; can be taken to the highest level by the use of a customer data platform. Customer data platforms allow for the creation and utilization of data from a merged identity, through means of a) &lt;a href="https://community.tealiumiq.com/t5/Tealium-API/Visitor-Lookup-API/ta-p/22859#toc-hId-1330156159" rel="noopener noreferrer"&gt;attribute retrieval&lt;/a&gt;, b) &lt;a href="https://www.qubit.com/customer-segmentation/" rel="noopener noreferrer"&gt;segment creation&lt;/a&gt; and c) &lt;a href="https://relay42.com/resources/blog/optimising-customer-journeys-with-ai" rel="noopener noreferrer"&gt;machine learning scores&lt;/a&gt;. These data points can then be used to power advertising campaigns and website recommendations, or to empower personalization across the full user journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ease Integration:&lt;/strong&gt; One of the roles of customer data platforms is to facilitate the integration of different data sources, both inbound and outbound. The customer data platform needs to be able to integrate data sources such as web analytics, e-commerce data and email behavior, and to export to the different end points that can make use of the data, such as advertising platforms, marketing automation tools or a data warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  List of CDPs
&lt;/h4&gt;

&lt;p&gt;There is quite a large list of CDP vendors out there, each usually with a different focus.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CDPs originating from a &lt;strong&gt;Tag Management System&lt;/strong&gt; provider such as &lt;a href="https://medium.com/analytics-and-data/tealiumiq-tms-3195095cab66" rel="noopener noreferrer"&gt;Tealium&lt;/a&gt; or &lt;a href="https://medium.com/analytics-and-data/google-analytics-with-ensighten-tms-cd7e1933fc89" rel="noopener noreferrer"&gt;Ensighten&lt;/a&gt;. These systems tend to have pivoted into customer profiles by leveraging the common data model used for tag integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure Play Oriented:&lt;/strong&gt; such as &lt;a href="https://segment.com" rel="noopener noreferrer"&gt;&lt;em&gt;Segment&lt;/em&gt;&lt;/a&gt;, one of the few CDPs offering a free developer trial version, or &lt;a href="https://www.mparticle.com/" rel="noopener noreferrer"&gt;&lt;em&gt;mParticle&lt;/em&gt;&lt;/a&gt;, a CDP focused on providing data integration, offering a unified data model, ingesting different types of events such as commerce events onto a common data model and exporting them to different end points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Campaign Management Oriented:&lt;/strong&gt; &lt;a href="https://www.lytics.com/" rel="noopener noreferrer"&gt;Lytics&lt;/a&gt;, a CDP with a strong focus on campaign management, customer journeys and the enrichment of profiles from 2nd party data; &lt;a href="https://www.agile-one.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Agile One&lt;/em&gt;&lt;/a&gt;, a CDP with both campaign and analytics capabilities; or &lt;a href="https://www.sessionm.com/" rel="noopener noreferrer"&gt;&lt;em&gt;SessionM&lt;/em&gt;&lt;/a&gt;, a CDP focused on providing loyalty management as part of its offering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personalization Oriented:&lt;/strong&gt; &lt;a href="https://qubit.com" rel="noopener noreferrer"&gt;&lt;em&gt;Qubit&lt;/em&gt;&lt;/a&gt; does not sell itself as a CDP but rather as a personalization platform; it nonetheless shares a number of functionalities with CDPs, such as the ability to deduplicate customers across identities, storage of data at the user level, audience segmentation and some integration options with other platforms (e.g. Salesforce Marketing Cloud).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Source:&lt;/strong&gt; &lt;a href="https://unomi.apache.org/" rel="noopener noreferrer"&gt;&lt;em&gt;Unomi&lt;/em&gt;&lt;/a&gt; is an open-source CDP; its drawbacks include the lack of a UI, the need to host it yourself and the lack of an extensive integration ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Caveats&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Seeing a CDP as a golden record for the customer needs to be taken with a grain of salt: ingested data is usually persisted together with its attributes; for instance, product information is usually persisted at the same time as the events are triggered. The backends of most CDPs don’t allow for the kind of merging activities necessary to maintain dimensions such as a product category dimension.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Management Platform (DMP)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What’s the purpose of a DMP? A data management platform exists for the collection, enrichment and management of data for digital marketing purposes. One of the main sources of value from a DMP is the ability to provide a consistent experience across different marketing channels.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data collection and enrichment:
&lt;/h4&gt;

&lt;p&gt;The DMP collects data from a client’s first-party sources, for instance the users visiting its websites; from second-party sources, i.e. data obtained through collaboration, such as targeting settings for a Facebook campaign that can be identified when a user visits the website; and from third-party sources, data obtained from other activities.&lt;/p&gt;

&lt;p&gt;At the heart of the DMP is the notion of an anonymous user identity, typically identified through cookie ids; these cookies can be matched through rule-based systems or through probabilistic matching.&lt;/p&gt;
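&lt;p&gt;As a minimal illustration of the rule-based route (the pairing key and names below are hypothetical, not any vendor’s actual algorithm), cookie ids sharing a deterministic key such as a hashed email can be collapsed into a single profile:&lt;/p&gt;

```python
from collections import defaultdict

def merge_identities(observations):
    """observations: list of (cookie_id, matching_key) pairs, where the key is
    a deterministic identifier such as a hashed email."""
    by_key = defaultdict(set)
    for cookie_id, key in observations:
        by_key[key].add(cookie_id)
    # every set of cookies sharing the same key collapses into one profile
    return list(by_key.values())

profiles = merge_identities([
    ("cookie_a", "h1"), ("cookie_b", "h1"), ("cookie_c", "h2"),
])
# "cookie_a" and "cookie_b" end up in the same profile
```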

&lt;h4&gt;
  
  
  Use cases for a DMP:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Advertising&lt;/strong&gt;: The main purpose of DMPs is to offer cohesive audience targeting. This is done through segment building and the creation of different segments for different types of personas. DMPs typically provide different types of segments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static Segments:&lt;/strong&gt; Segments run once, based on a snapshot of a visitor’s attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Segments:&lt;/strong&gt; Segments constantly updated based on the data being fed to the platform. For instance, an adaptive segment could be “website visitors in the last 30 days”.&lt;/li&gt;
&lt;li&gt;Segments obtained from Machine Learning models for &lt;strong&gt;Lookalike modeling&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
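&lt;p&gt;An adaptive segment such as “website visitors in the last 30 days” boils down to a filter that is re-evaluated as new data arrives; the sketch below is illustrative and assumes events arrive as (visitor_id, visit_date) pairs:&lt;/p&gt;

```python
import datetime
import operator

def visitors_last_30_days(events, today):
    """events: list of (visitor_id, visit_date); re-run on every new batch so
    the segment adapts as data flows into the platform."""
    cutoff = today - datetime.timedelta(days=30)
    # operator.ge(d, cutoff) keeps visits on or after the cutoff date
    return {visitor for visitor, d in events if operator.ge(d, cutoff)}

today = datetime.date(2019, 9, 24)
segment = visitors_last_30_days(
    [("u1", datetime.date(2019, 9, 20)), ("u2", datetime.date(2019, 7, 1))],
    today,
)
# segment contains only "u1"
```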

&lt;p&gt;These segments can then be exported to DSPs or other activation channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audience Insight&lt;/strong&gt;: DMPs also provide insights into the behavioral patterns of your customers, and let you learn more about who your customers are by leveraging information provided by third parties.&lt;/p&gt;

&lt;h4&gt;
  
  
  List of DMP
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.salesforce.com/products/marketing-cloud/data-management/" rel="noopener noreferrer"&gt;&lt;em&gt;Salesforce DMP&lt;/em&gt;&lt;/a&gt; (Krux): Salesforce integrated Krux into its Marketing Cloud platform, offering a way to target “unknown” visitors on your websites and to enhance the information you have on them with third-party data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.adobe.com/analytics/audience-manager.html" rel="noopener noreferrer"&gt;&lt;em&gt;Adobe Audience Manager&lt;/em&gt;&lt;/a&gt;: Adobe’s DMP, well integrated into Adobe Marketing Cloud; it enables marketers to get a better grasp of the different audience segments interacting with their ads or their website.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mediamath.com/" rel="noopener noreferrer"&gt;&lt;em&gt;MediaMath&lt;/em&gt;&lt;/a&gt;: an independent adtech vendor positioned between a DMP and a DSP solution, benefiting from over $600M of funding.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.oracle.com/marketingcloud/products/data-management-platform/" rel="noopener noreferrer"&gt;Oracle BlueKai&lt;/a&gt;: Oracle’s DMP, with a heavy data-enrichment slant; it boasts over 200 media partners.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrap up
&lt;/h3&gt;

&lt;p&gt;At their core, CDPs and DMPs have an overlap in functionality, but also some differences, which mostly stem from their treatment of identity. CDPs also overlap in functionality with other systems such as Tag Management, Master Data Management and Marketing Automation tooling.&lt;/p&gt;




</description>
      <category>datamanagement</category>
      <category>programmatic</category>
      <category>analytics</category>
      <category>data</category>
    </item>
    <item>
      <title>How to collect the data you need to bootstrap your digital marketing analytics</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Mon, 20 May 2019 19:51:16 +0000</pubDate>
      <link>https://forem.com/julienkervizic/how-to-collect-the-data-you-need-to-bootstrap-your-digital-marketing-analytics-hhm</link>
      <guid>https://forem.com/julienkervizic/how-to-collect-the-data-you-need-to-bootstrap-your-digital-marketing-analytics-hhm</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GZWB9bse--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AZxBcyOPlfApyfJT6" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GZWB9bse--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AZxBcyOPlfApyfJT6" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@campaign_creators?utm_source=medium&amp;amp;utm_medium=referral"&gt;Campaign Creators&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To a large extent, bootstrapping your marketing activities with data revolves around collecting data from two or three specific domains, depending on the scope of your business.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Campaign Activity&lt;/em&gt;&lt;/strong&gt;: Be it from digital marketing, pushing ads through Facebook or Google Ads, or through email service providers/marketing automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Clickstreams:&lt;/em&gt;&lt;/strong&gt; Provide an understanding of the customer journey on site and the different factors that contributed to a conversion. Leveraging clickstream data makes it possible to properly attribute conversions to specific campaigns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Sales:&lt;/em&gt;&lt;/strong&gt; Digital Marketing for e-commerce websites revolves around generating online sales. These should be tracked and the campaign spend should be optimized against this objective.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;There are quite a few use cases around marketing analytics, but we can easily show how these data sources (Campaign Activity, Clickstreams and E-commerce Sales) can drive some of the biggest marketing analytics use cases, such as media-mix modeling, marketing attribution and churn prevention.&lt;/p&gt;

&lt;h4&gt;
  
  
  Media-Mix Modeling
&lt;/h4&gt;

&lt;p&gt;Media-mix modeling (MMM) gives an understanding of how to shift your budget mix among different advertising channels to optimize your outcome. It relies on statistical techniques to understand where one unit of marginal spend would be best placed.&lt;/p&gt;

&lt;p&gt;MMM relies on two specific sources of information: 1. spend data (Campaign) and 2. outcome data (e.g. e-commerce sales) that we want to optimize for. In order to be effective, it is important for MMM to have a view of all the different channels’ spend contributing to the outcome.&lt;/p&gt;
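&lt;p&gt;The statistical idea can be illustrated with a toy regression of the outcome on per-channel spend; real MMM adds effects such as adstock and saturation, so this is only a minimal sketch on synthetic data:&lt;/p&gt;

```python
import numpy as np

# synthetic spend per channel and a synthetic outcome with known coefficients
tv     = np.array([10.0, 20.0, 30.0, 40.0])
search = np.array([ 5.0, 10.0, 10.0, 20.0])
sales  = 2.0 * tv + 3.0 * search + 100.0   # baseline of 100, known marginals

# least-squares fit: the coefficients estimate marginal sales per unit of spend
X = np.column_stack([tv, search, np.ones_like(tv)])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
# coef recovers roughly [2.0, 3.0, 100.0]
```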

&lt;h4&gt;
  
  
  Marketing Attribution
&lt;/h4&gt;

&lt;p&gt;Marketing attribution’s role is to assign credit to specific marketing campaigns, to get a better sense of their contribution to a specific objective. Attribution methods such as “last click” give full credit to a single campaign when an objective is reached: the last touchpoint having contributed to the objective receives all the credit, while other techniques can assign partial credit. Explanations of different attribution techniques are provided in the following Medium posts: &lt;a href="https://medium.com/@odolenakostova/online-attribution-in-a-multi-channel-marketing-world-d2141f08f9d6"&gt;here&lt;/a&gt; and &lt;a href="https://medium.com/analytics-for-humans/an-in-depth-look-at-attribution-modeling-in-digital-marketing-2ed0170c6f3b"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Marketing attribution relies on campaign data (spend), clickstream information and sales data in order to properly attribute campaign activities. Along with setting up the different systems to collect this information, proper URL tagging (UTM) and potentially setting the right cookies on the website are necessary first steps to enable this use case.&lt;/p&gt;
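&lt;p&gt;A last-click rule is simple enough to sketch in a few lines; the journeys below are illustrative:&lt;/p&gt;

```python
from collections import Counter

def last_click(journeys):
    """journeys: ordered lists of touchpoints for converting users."""
    credit = Counter()
    for touchpoints in journeys:
        credit[touchpoints[-1]] += 1   # full credit to the final touchpoint
    return credit

credit = last_click([
    ["facebook", "email", "google_search"],
    ["google_search"],
    ["email", "facebook"],
])
# credit: google_search gets 2 conversions, facebook gets 1, email gets 0
```

&lt;p&gt;Partial-credit techniques replace that single increment with a split (e.g. linear or position-based weights) over every touchpoint in the journey.&lt;/p&gt;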

&lt;h4&gt;
  
  
  Churn Prevention
&lt;/h4&gt;

&lt;p&gt;Churn identification and prevention is one of the most traditional CRM use cases. It leverages Sales and Campaign (CRM) data to get a better understanding of which customers are likely to churn, and which offers they would tend to respond to in order to stick with the service offering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Collection
&lt;/h3&gt;

&lt;p&gt;In general, one of the hardest parts of empowering marketing analytics is sourcing the data. It requires information to be pulled from a variety of sources, and depending on the specifics of the business trying to kick-start this use case, there can be a variety of applicable ways to integrate each of the data sources required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Campaigns data collection
&lt;/h3&gt;

&lt;p&gt;There exist quite a few ways to integrate and collect campaign data: using specific data integration solutions, leveraging Singer taps, using the different tools’ built-in data export capabilities, or building specific API pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Integration solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Different tools exist to simplify the collection of data from marketing campaigns: &lt;a href="http://talend.com"&gt;Talend&lt;/a&gt;, &lt;a href="http://www.adverity.com"&gt;Adverity&lt;/a&gt;, &lt;a href="http://fivetran.com"&gt;Fivetran&lt;/a&gt; and &lt;a href="https://www.alooma.com/integrations"&gt;Alooma&lt;/a&gt; (recently acquired by Google) provide a series of connectors that make data integration from these different ad sources fairly easy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Singer Taps
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hvs0BEwn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AhqorE7mNxyPOJD0_" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hvs0BEwn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AhqorE7mNxyPOJD0_" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@arstyy?utm_source=medium&amp;amp;utm_medium=referral"&gt;Austin Neill&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.singer.io/#taps"&gt;Singer taps&lt;/a&gt; provides open-source pre-built connectors to a series of advertising sources such Facebook, Google, Outbrain, Salesforce, Marketo, Selligent,.. The data from these sources can be then be easily fetched by only modifying some configuration settings and executing a command line call, for instance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tap-adwords -c config.json -p properties.json -s state.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
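&lt;p&gt;A tap’s stdout is newline-delimited JSON following the Singer spec (SCHEMA, RECORD and STATE messages), which a downstream consumer can filter; the sample lines below are illustrative, not actual tap-adwords output:&lt;/p&gt;

```python
import json

def extract_records(lines):
    """Keep only RECORD messages from a tap's newline-delimited JSON output."""
    records = []
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append((message["stream"], message["record"]))
    return records

sample_output = [
    '{"type": "SCHEMA", "stream": "campaigns", "schema": {}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "campaigns", "record": {"id": 1, "cost": 12.5}}',
    '{"type": "STATE", "value": {"start_date": "2019-01-01"}}',
]
records = extract_records(sample_output)
```

&lt;p&gt;In practice a Singer "target" (e.g. target-csv) plays this consumer role for you, and the tap’s output is simply piped into it.&lt;/p&gt;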



&lt;h4&gt;
  
  
  &lt;strong&gt;Data Exports&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OIbjgc8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A5u5Yc2AVP6D_uBRS" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OIbjgc8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A5u5Yc2AVP6D_uBRS" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@chuttersnap?utm_source=medium&amp;amp;utm_medium=referral"&gt;chuttersnap&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Certain advertising tools allow for data exports to BigQuery, a different data-warehouse tool, or as file exports; for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/doubleclick-campaign-transfer"&gt;DoubleClick&lt;/a&gt; and &lt;a href="https://help.emarsys.com/hc/en-us/articles/360000176214-Open-Data-End-user-guide"&gt;Emarsys&lt;/a&gt; provide exports to Google BigQuery&lt;/li&gt;
&lt;li&gt;Salesforce Marketing Cloud allows &lt;a href="https://help.salesforce.com/articleView?id=mc_es_export_de.htm&amp;amp;type=5"&gt;FTP/CSV exports&lt;/a&gt;, a flow which can be automated in Automation Studio and then ingested into a database/data-warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Custom Development — API Pipelines&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YiKCJBg9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ag1fMzVezKLP4SkTD" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YiKCJBg9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ag1fMzVezKLP4SkTD" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@lastly?utm_source=medium&amp;amp;utm_medium=referral"&gt;Tyler Lastovich&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pipelines can be developed to pull the data directly from Facebook or Google through API calls. This requires a data engineer or developer to set up the different data flows. Most advertising sources provide SDKs for easy integration with their platforms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Google Ads&lt;/em&gt;: &lt;a href="https://github.com/googleapis/google-api-python-client"&gt;Google API Python Client&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Facebook:&lt;/em&gt; &lt;a href="https://developers.facebook.com/docs/business-sdk/"&gt;Facebook Business SDK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Salesforce Marketing Cloud:&lt;/em&gt; &lt;a href="https://github.com/salesforce-marketingcloud/FuelSDK-Python"&gt;FuelSDK&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
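&lt;p&gt;Whatever the SDK, such pipelines usually reduce to a paging loop; in this sketch fetch_page is a hypothetical stand-in for a real SDK call (which would also handle auth and rate limits):&lt;/p&gt;

```python
def pull_all(fetch_page, page_size=100):
    """Collect every row from a paginated endpoint; fetch_page(page, size)
    is a placeholder for a real SDK/API call."""
    rows, page = [], 0
    while True:
        batch = fetch_page(page, page_size)
        rows.extend(batch)
        if len(batch) != page_size:   # a short page means the source is exhausted
            return rows
        page += 1

# stubbed source with 250 rows, in place of a real advertising API
data = list(range(250))
def fake_fetch(page, size):
    return data[page * size:(page + 1) * size]

rows = pull_all(fake_fetch)
# rows contains all 250 items, gathered over three pages
```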

&lt;h3&gt;
  
  
  ClickStream Collection
&lt;/h3&gt;

&lt;p&gt;Different alternatives exist for collecting clickstream data: relying on a premium analytics tool such as Google Analytics 360 or Adobe Analytics, leveraging a Customer Data Platform, using a clickstream collector, or setting up some custom development.&lt;/p&gt;

&lt;h4&gt;
  
  
  Google Analytics/Adobe Analytics
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---MBlb_-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A99-7FyjAUlsFYVMS" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---MBlb_-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A99-7FyjAUlsFYVMS" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@austindistel?utm_source=medium&amp;amp;utm_medium=referral"&gt;Austin Distel&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simplest way, if you can afford it, to collect raw clickstream data is through Google/Adobe Analytics. Google offers the possibility to export raw clickstream data to BigQuery as part of its Google Analytics 360 offering. One of the major drawbacks of going that route is the $150k annual cost of Google Analytics 360.&lt;/p&gt;

&lt;h4&gt;
  
  
  Customer Data Platform
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jiCRNfYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AmmO5q1LR9o1ctSxv" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jiCRNfYX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AmmO5q1LR9o1ctSxv" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@blakewisz?utm_source=medium&amp;amp;utm_medium=referral"&gt;Blake Wisz&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most customer data platforms offer the possibility to export ingested events to a data-warehouse. They ingest data from multiple sources, including website activity, and are able to stream it back for processing or analysis. Depending on the size of your business, a customer data platform might prove more expensive than purchasing a Google Analytics 360 license, but it offers additional benefits. Certain CDPs, such as &lt;a href="https://segment.com/pricing/"&gt;Segment&lt;/a&gt;, offer a free version up to a certain number of events or active users.&lt;/p&gt;

&lt;h4&gt;
  
  
  ClickStream Collector
&lt;/h4&gt;

&lt;p&gt;Different open-source clickstream collectors exist; the best known are &lt;a href="https://snowplowanalytics.com/"&gt;Snowplow&lt;/a&gt; and &lt;a href="https://divolte.io/"&gt;Divolte&lt;/a&gt;. They offer a way to ingest clickstream data without having to fully develop it yourself. The drawback of using these is that you need to manage the infrastructure.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom development
&lt;/h4&gt;

&lt;p&gt;Another way to collect clickstream data is through custom development. A Logic App, Function App or Lambda function combined with an EventHub/Kinesis/PubSub setup would allow for scalable ingestion of data, but at the cost of managing code and infrastructure.&lt;/p&gt;
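&lt;p&gt;The custom route can be sketched as a collection endpoint that validates and enriches each hit before publishing it to a broker; here an in-memory list stands in for the EventHub/Kinesis/PubSub client, and the field names are illustrative:&lt;/p&gt;

```python
import datetime
import json

event_hub = []   # in-memory stand-in for an EventHub/Kinesis/PubSub producer

def collect(raw_body):
    """Validate and enrich one clickstream hit before publishing it."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        return False
    if "visitor_id" not in event or "event_type" not in event:
        return False                              # reject malformed hits
    event["received_at"] = datetime.datetime.utcnow().isoformat()
    event_hub.append(json.dumps(event))           # a broker .send() in production
    return True

accepted = collect('{"visitor_id": "u1", "event_type": "page_view"}')
rejected = collect('{"event_type": "page_view"}')
```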

&lt;h3&gt;
  
  
  (Online) Sales data collection
&lt;/h3&gt;

&lt;p&gt;There are different ways to source information related to online sales, all of which have their own set of pros and cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Analytics Tags:&lt;/em&gt;&lt;/strong&gt; Analytics tags on the website give a sense of online sales data. Data from your analytics tools can then be exported manually, through API calls or, for those with Google Analytics 360, through a BigQuery export.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Database replication/mirroring:&lt;/em&gt;&lt;/strong&gt; Some e-commerce platforms expose their database or let you choose your own database solution. This allows for using database replication/mirroring to get a live copy of the data for analysis or reporting purposes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data Integration Solutions:&lt;/em&gt;&lt;/strong&gt; Pre-built data integration solutions exist for certain platforms; acquiring one of these solutions can significantly ease the development and maintenance needed for this part of the data collection process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Singer Taps:&lt;/em&gt;&lt;/strong&gt; Certain platforms have Singer taps created for them, essentially pre-built ETL/API clients that only need to be configured before pulling the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;API Pipelines:&lt;/em&gt;&lt;/strong&gt; API pipelines can run on a schedule to pull the relevant information. They do require some custom development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Webhooks:&lt;/em&gt;&lt;/strong&gt; Webhooks provide, for platforms supporting them, a way to export and ingest data in real time. Their main drawback is that you need to develop and maintain infrastructure: an API/webhook receiver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Streams:&lt;/em&gt;&lt;/strong&gt; Some e-commerce solutions can feed transactions to an event-stream export. This allows for real-time ingestion, bypassing an API/webhook receiver.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Analytics Tags
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S2-bm3In--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIyvoarznC1u60EfA" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S2-bm3In--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AIyvoarznC1u60EfA" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@nevenkrcmarek?utm_source=medium&amp;amp;utm_medium=referral"&gt;Neven Krcmarek&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is possible to capture online sales data through an analytics tag such as Google’s. Google provides a structured way to pass this information to Analytics through its &lt;a href="https://developers.google.com/analytics/devguides/collection/analyticsjs/enhanced-ecommerce"&gt;enhanced ecommerce&lt;/a&gt; plug-in. This provides a good first pass at capturing e-commerce data; there are however quite a few drawbacks to that approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Ad-blockers&lt;/em&gt;: This approach faces issues for users running certain ad-blockers, which blacklist the Google Analytics domain and block any attempt from the analytics.js tag, although there are &lt;a href="https://medium.freecodecamp.org/save-your-analytics-from-content-blockers-7ee08c6ec7ee"&gt;some workarounds&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Page Load issues:&lt;/em&gt; Users who leave the thank-you page before the tag has had a chance to fire would not have their order data pushed to Google Analytics&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Delayed Payment:&lt;/em&gt; Orders with payment methods that allow for delayed payment, such as bank transfer or PayPal, might be incorrectly classified as successful sales or not recorded at all, depending on the approach taken.&lt;/li&gt;
&lt;/ol&gt;
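&lt;p&gt;One mitigation for the ad-blocker and page-load issues is to send the transaction server-side through Google Analytics’ Measurement Protocol; the sketch below only builds the hit payload, and all field values are placeholders:&lt;/p&gt;

```python
from urllib.parse import urlencode

def transaction_payload(property_id, client_id, order_id, revenue):
    """Build a Measurement Protocol (v1) transaction hit as a query string."""
    return urlencode({
        "v": 1,                 # protocol version
        "tid": property_id,     # the GA property id
        "cid": client_id,       # anonymous client id, ideally from the GA cookie
        "t": "transaction",     # hit type
        "ti": order_id,         # transaction id
        "tr": revenue,          # transaction revenue
    })

payload = transaction_payload("UA-XXXXX-Y", "555", "1001", "49.90")
# POST the payload to https://www.google-analytics.com/collect
```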

&lt;p&gt;The main advantages of this approach are its universality and the speed at which it can be deployed, usually requiring only some tag integration for most web-shops, and for certain platforms such as Shopify, only some &lt;a href="https://help.shopify.com/en/manual/reports-and-analytics/google-analytics/google-analytics-setup"&gt;simple configuration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another advantage of setting up enhanced e-commerce tracking is the ability to tie purchases to specific sessions and therefore rely on Google’s last-click attribution. With it, specific orders can be attributed to specific campaigns and sales channels.&lt;/p&gt;

&lt;h4&gt;
  
  
  Database Replication &amp;amp; Mirroring
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ROHME1Oo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ABTBtbVzE4j8sYmgi" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ROHME1Oo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ABTBtbVzE4j8sYmgi" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@fodelwdc?utm_source=medium&amp;amp;utm_medium=referral"&gt;Fares Hamouche&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some e-commerce platforms allow those operating them to set up their own database; this is the case of &lt;a href="https://devdocs.magento.com/guides/v2.3/install-gde/install/web/install-web_2-db.html"&gt;Magento&lt;/a&gt;, &lt;a href="https://world.episerver.com/documentation/Items/Developers-Guide/Episerver-CMS/8/Deployment/About-the-database/"&gt;EpiServer&lt;/a&gt; or SiteCore, for instance. In these cases, it is possible to set up master-slave database replication or &lt;a href="https://docs.microsoft.com/en-us/sql/database-engine/database-mirroring/database-mirroring-sql-server?view=sql-server-2017"&gt;database mirroring&lt;/a&gt;, so that the data can be used for reporting purposes without affecting the production environment.&lt;/p&gt;

&lt;p&gt;These can be set up without custom development and allow for a quick turnaround in providing data for reporting purposes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Integration solutions
&lt;/h4&gt;

&lt;p&gt;As for campaign data, data integration tools exist that provide turnkey integration for e-commerce data. Each of the previously mentioned vendors provides connectors to certain e-commerce platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Talend&lt;/em&gt;: Shopify, BigCommerce and Magento are available as Talend connectors from &lt;a href="https://cloudbee.com/talend-open-studio-components/"&gt;Cloudbee&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Adverity&lt;/em&gt;: supports Hybris, Shopware, Shopify and Magento&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FiveTran&lt;/em&gt;: supports Salesforce CommerceCloud, Magento, WooCommerce, Shopify and SpreeCommerce&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Alooma:&lt;/em&gt; supports Shopify and Magento&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These data integration solutions are an alternative when technical capabilities are lacking within the team or department.&lt;/p&gt;

&lt;h4&gt;
  
  
  Singer Taps
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pDVZvlSv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AcG-p110dxzgQZ9Pm" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pDVZvlSv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AcG-p110dxzgQZ9Pm" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@dchuck?utm_source=medium&amp;amp;utm_medium=referral"&gt;Daniel Chekalov&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently &lt;a href="https://github.com/singer-io/tap-woocommerce"&gt;WooCommerce&lt;/a&gt; is the only web-shop with a &lt;a href="https://www.singer.io/#taps"&gt;singer-tap&lt;/a&gt; connector, making its use quite restricted. It is however possible to develop custom Singer taps for specific uses. This can be a good move when already operating online campaign data collection through Singer taps.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom Development — API Pipelines
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2JfC9qEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Af7r7p7QkXdJYe94r" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2JfC9qEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Af7r7p7QkXdJYe94r" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@quinten149?utm_source=medium&amp;amp;utm_medium=referral"&gt;Quinten de Graaf&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most web-shops these days allow for pulling order information directly through API calls; this is the case of pure SaaS platforms such as &lt;a href="https://help.shopify.com/en/api/reference/orders"&gt;Shopify&lt;/a&gt;, &lt;a href="https://developers.lightspeedhq.com/ecom/endpoints/order/"&gt;Lightspeed&lt;/a&gt;, Commercecloud or &lt;a href="https://docs.commercetools.com/http-api-projects-orders"&gt;Commercetools&lt;/a&gt;, but also of the likes of &lt;a href="https://devdocs.magento.com/guides/m1x/api/rest/Resources/Orders/sales_orders.html#RESTAPI-Resource-SalesOrders-HTTPMethod-GET-orders"&gt;Magento&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some of these are supported by Python SDKs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Shopify/shopify_python_api"&gt;Shopify Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Commercetools with a &lt;a href="https://github.com/labd/commercetools-python-sdk"&gt;Python SDK&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/magento/2.1/"&gt;Magento Python SDK&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the major drawbacks of this approach is that it requires custom data engineering or software engineering work; it requires polling and, to a certain degree, is not “real-time”. Certain platforms furthermore have rate limitations that might make it impractical for pulling large amounts of orders.&lt;/p&gt;

&lt;p&gt;One of its advantages, however, is the ability to pull updated information about specific orders or date ranges.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom Development — Webhooks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YFH4ZWLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A8CLTHM1lmpfuf-Xp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YFH4ZWLu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A8CLTHM1lmpfuf-Xp" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@chris_j_scott?utm_source=medium&amp;amp;utm_medium=referral"&gt;Chris Scott&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Webhooks are essentially a way to send a notification over HTTP when some type of event happens; for our purpose, they provide a way to create real-time ingestion of data from an e-commerce platform. Webhooks can also provide a way around some of the rate limitations if there isn’t any need to make callbacks.&lt;/p&gt;

&lt;p&gt;They do require some sort of webhook listener API and an ingestion layer in order to capture the data. These can be built with the same type of technologies used for capturing clickstream data, for example a Logic App / EventHub combination.&lt;/p&gt;
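&lt;p&gt;The listener should also authenticate incoming hooks. Shopify, for instance, signs the raw request body with a shared secret (HMAC-SHA256, base64-encoded, sent in the X-Shopify-Hmac-Sha256 header); a minimal check assuming that scheme:&lt;/p&gt;

```python
import base64
import hashlib
import hmac

def verify_webhook(raw_body, header_signature, shared_secret):
    """Recompute the body's HMAC-SHA256 and compare it to the header value."""
    digest = hmac.new(shared_secret, raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, header_signature)

# simulate what the platform would send
secret = b"shared-secret"
body = b'{"id": 1001}'
good_signature = base64.b64encode(
    hmac.new(secret, body, hashlib.sha256).digest()).decode()
```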

&lt;p&gt;Most e-commerce platforms support webhooks: &lt;a href="https://help.shopify.com/en/api/reference/events/webhook"&gt;Shopify&lt;/a&gt;, &lt;a href="https://developers.lightspeedhq.com/ecom/tutorials/webhooks/"&gt;Lightspeed&lt;/a&gt;, &lt;a href="https://docs.woocommerce.com/document/webhooks/"&gt;WooCommerce&lt;/a&gt;, &lt;a href="https://docs.enterprise.shopware.com/connect/rest_api/"&gt;Shopware&lt;/a&gt; and &lt;a href="https://developer.bigcommerce.com/api-docs/getting-started/webhooks/about-webhooks"&gt;BigCommerce&lt;/a&gt; support them natively, while Magento supports them through 3rd-party plugins, and platforms such as &lt;a href="https://community.sitecore.net/developers/f/8/t/4489"&gt;Sitecore&lt;/a&gt; or &lt;a href="https://www.c2experience.com/blog/getting-hooked-on-episerver-webhooks"&gt;Episerver&lt;/a&gt; need custom development.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom Development — Streams
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bVLZisp1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A4jgcEDgC6IZ1hzig" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bVLZisp1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A4jgcEDgC6IZ1hzig" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@yan_slg?utm_source=medium&amp;amp;utm_medium=referral"&gt;Boudhayan Bardhan&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some e-commerce platforms are able to publish events directly onto a message queue (e.g. Google Pub/Sub, Azure Service Bus, AWS SQS). This is the case for &lt;a href="https://docs.commercetools.com/http-api-projects-subscriptions"&gt;Commercetools&lt;/a&gt;, which preferred this approach to standard HTTP &lt;a href="https://techblog.commercetools.com/webhooks-the-devil-in-the-details-ca7f7982c24f"&gt;webhooks&lt;/a&gt;. This allows, for instance, to natively “duplicate” the relevant data both for processing (e.g. order fulfillment) and for long-term storage in a data warehouse, letting the different consumers of the data “subscribe” to that single source of information.&lt;/p&gt;

&lt;p&gt;Besides Google Pub/Sub, which has a turnkey &lt;a href="https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#cloudpubsubsubscriptiontobigquery"&gt;export to a data warehouse (BigQuery)&lt;/a&gt;, the other technology choices will still require development work in order to ingest the data.&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37"&gt;One the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/leveraging-facebook-python-api-for-marketing-analytics-f4372f042112?source=friends_link&amp;amp;sk=43a90fda6d1bcdfa03379162ea00e46b"&gt;Leveraging Facebook Python API for Marketing Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/5-domains-of-ecommerce-data-strategy-82b61356042c?source=friends_link&amp;amp;sk=2a29ed95b7882db2ff9d7adc3803554c"&gt;5 domains of ecommerce Data Strategy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/10-benefits-to-using-airflow-33d312537bae?source=friends_link&amp;amp;sk=9449eedcb64c30fb1823c72901041f44"&gt;10 Benefits to using Airflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>marketing</category>
      <category>analytics</category>
    </item>
    <item>
      <title>10 Benefits to using Airflow</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Thu, 16 May 2019 16:33:46 +0000</pubDate>
      <link>https://forem.com/julienkervizic/10-benefits-to-using-airflow-14dn</link>
      <guid>https://forem.com/julienkervizic/10-benefits-to-using-airflow-14dn</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mc79JUUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/845/1%2AVdzG4IrO_zKbh4_Sz2dP7A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mc79JUUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/845/1%2AVdzG4IrO_zKbh4_Sz2dP7A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In continuation of a series of posts where I have explained the &lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21"&gt;basics of airflow&lt;/a&gt;, how to set up &lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b?source=your_stories_page---------------------------"&gt;airflow on azure&lt;/a&gt;, and what &lt;a href="https://dev.to/julienkervizic/5-considerations-to-have-when-using-airflow-5c81-temp-slug-8477774"&gt;considerations to have when using airflow&lt;/a&gt;, I wanted to cover in detail what makes Airflow a great tool for data processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. DAGs&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;DAGs are a way to set up workflows; they define a sequence of operations that can be individually retried on failure and restarted from where the operation failed. DAGs provide a nice abstraction over a series of operations.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Programmatic Workflow Management:
&lt;/h4&gt;

&lt;p&gt;Airflow provides a way to set up programmatic workflows; tasks, for instance, can be generated on the &lt;a href="https://www.data-essential.com/create-dynamic-workflow-in-apache-airflow/"&gt;fly within a DAG&lt;/a&gt;, while Sub-DAGs and &lt;a href="https://airflow.apache.org/concepts.html?highlight=xcom"&gt;XCom&lt;/a&gt; allow the creation of complex dynamic workflows.&lt;/p&gt;

&lt;p&gt;Dynamic DAGs can, for instance, be set up based on variables or connections defined within the &lt;a href="https://www.astronomer.io/guides/dynamically-generating-dags/"&gt;Airflow UI&lt;/a&gt;.&lt;/p&gt;
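&lt;p&gt;The dynamic-task pattern boils down to generating tasks in a loop over some configuration. A minimal sketch with Airflow itself left out (in a real DAG each generated callable would be wrapped in an operator, e.g. a PythonOperator; the table names are purely illustrative):&lt;/p&gt;

```python
# Sketch of the dynamic-task pattern: one task generated per table name.
# In a real DAG each callable would be wrapped in an operator, e.g.
# PythonOperator(task_id=task_id, python_callable=fn, dag=dag).

def make_load_task(table):
    # factory function, so each closure captures its own table name
    def load():
        return "loading table " + table
    return load

TABLES = ["orders", "customers", "products"]  # could come from an Airflow Variable

tasks = {"load_" + table: make_load_task(table) for table in TABLES}
```

&lt;p&gt;The list driving the loop is what makes the DAG dynamic: change the variable or connection it is read from, and the generated tasks change with it.&lt;/p&gt;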

&lt;h4&gt;
  
  
  3. Automate your Queries, Python Code or Jupyter Notebook
&lt;/h4&gt;

&lt;p&gt;Airflow has a lot of operators set up to run code. It has operators for most databases, and being written in Python, it offers a PythonOperator that allows for quickly porting Python code to production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airflow.readthedocs.io/en/latest/papermill.html"&gt;Papermilll&lt;/a&gt; is an extension to jupyter notebook, allowing the parametrization and execution of notebooks, it is supported through airflow PapermillOperator. Netflix notably has suggested a combination or airflow and &lt;a href="https://papermill.readthedocs.io/en/latest/"&gt;papermill&lt;/a&gt; to automate and deploy notebook in production:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6"&gt;Part 2: Scheduling Notebooks at Netflix&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Task Dependency management:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;It is extremely good at managing different sorts of dependencies, be it a task completion, a DAG run status, or file or partition presence through specific sensors. Airflow also handles task dependency concepts such as &lt;a href="http://airflow.apache.org/concepts.html?highlight=branch#branching"&gt;branching&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@guillaume_payen/use-conditional-tasks-with-apache-airflow-98bab35f1846"&gt;Use conditional tasks with Apache Airflow&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Extendable model:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;It is fully extendable through the development of custom sensors, hooks and operators. Airflow notably benefits from a large number of &lt;a href="https://github.com/apache/airflow/tree/master/airflow/contrib/operators"&gt;community contributed operators&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Operators in different programming languages, such as R &lt;a href="https://github.com/apache/airflow/pull/3115"&gt;[AIRFLOW-2193]&lt;/a&gt;, are being built using Python wrappers; in the future, operators for other programming languages such as JavaScript, which also has a &lt;a href="https://code.google.com/archive/p/pyv8/"&gt;Python wrapper (pyv8)&lt;/a&gt;, could also be created.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6. Monitoring and management interface:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X9EmDh9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFdCmNWA-rKgmsZshGpaj2Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X9EmDh9R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFdCmNWA-rKgmsZshGpaj2Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow provides a monitoring and management interface, where it is possible to have a quick overview of the status of the different tasks, as well as the possibility to trigger and clear task or DAG runs.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Retry policy built in:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lieppKr---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/853/1%2AoDnmyS_c-HUZduhZ28b8Hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lieppKr---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/853/1%2AoDnmyS_c-HUZduhZ28b8Hw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has an auto-retry policy built in, configurable through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;retries:&lt;/strong&gt; number of retries before failing the task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;retry_delay:&lt;/strong&gt; (timedelta) delay between retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;retry_exponential_backoff:&lt;/strong&gt; (boolean) to set up an exponential backoff between retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_retry_delay:&lt;/strong&gt; maximum delay (timedelta) between retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These arguments can be passed through the context to any operator, as they are supported by the BaseOperator class.&lt;/p&gt;
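&lt;p&gt;As a sketch, these retry arguments would typically be set in the default_args dictionary handed to a DAG (the values here are illustrative):&lt;/p&gt;

```python
from datetime import timedelta

# Sketch of the retry arguments as they might appear in the default_args
# dictionary passed to a DAG; every operator inherits them from BaseOperator.
default_args = {
    "retries": 3,                              # retry up to 3 times before failing
    "retry_delay": timedelta(minutes=5),       # wait 5 minutes between retries
    "retry_exponential_backoff": True,         # grow the wait on each retry
    "max_retry_delay": timedelta(minutes=30),  # but never wait longer than 30 minutes
}
```

&lt;p&gt;Individual operators can still override any of these values on a per-task basis.&lt;/p&gt;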

&lt;h4&gt;
  
  
  &lt;strong&gt;8. Easy interface to interact with logs:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fnrfnfCI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnNl14Q4P0BSQiz-HJzWkAw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fnrfnfCI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnNl14Q4P0BSQiz-HJzWkAw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow provides easy access to the logs of each of the different task runs through its web UI, making it easy to debug tasks in production.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;9. Rest API:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/api.html"&gt;Airflow’s API&lt;/a&gt; allows to create workflows from external sources, and to be data product on top of it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/google-cloud/using-airflow-experimental-rest-api-on-google-cloud-platform-cloud-composer-and-iap-9bd0260f095a"&gt;Using Airflow Experimental Rest API on Google Cloud Platform: Cloud Composer and IAP&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The REST API allows the same paradigm used to build pipelines to be reused for asynchronous workflows, such as &lt;a href="https://medium.com/adobetech/adobe-experience-platform-orchestration-service-with-apache-airflow-952203723c0b"&gt;custom machine learning training operations&lt;/a&gt;.&lt;/p&gt;
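&lt;p&gt;As a sketch, triggering a DAG run through the experimental API boils down to a POST against a dag_runs endpoint with an optional conf payload (the host and DAG id below are illustrative):&lt;/p&gt;

```python
import json

# Sketch of triggering a DAG run through Airflow's experimental REST API:
# build the endpoint URL and the JSON body carrying the run configuration.
def build_trigger_request(host, dag_id, conf):
    url = host + "/api/experimental/dags/" + dag_id + "/dag_runs"
    payload = json.dumps({"conf": conf})
    return url, payload

url, payload = build_trigger_request(
    "http://localhost:8080", "train_model", {"model_version": "v2"}
)
# the request itself would then be sent with e.g.
# requests.post(url, data=payload, headers={"Content-Type": "application/json"})
```

&lt;p&gt;The conf payload is what lets an external system parametrize the run, for example pointing a training DAG at a specific model version.&lt;/p&gt;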

&lt;h4&gt;
  
  
  &lt;strong&gt;10. Alerting system:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;It provides a default alerting system on task failure; email is the default, but alerting through Slack can be set up using a callback and the Slack operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105"&gt;Integrating Slack Alerts in Airflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37"&gt;One the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/overview-of-efficiency-concepts-in-big-data-engineering-418995f5f992"&gt;Overview of efficiency concepts in Big Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b?source=friends_link&amp;amp;sk=c9c7c39100e7eaef45c7793caaa265b7"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21"&gt;Airflow, the easy way&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>data</category>
      <category>airflow</category>
      <category>bigdataetl</category>
    </item>
    <item>
      <title>5 Considerations to have when using Airflow</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Tue, 14 May 2019 19:00:52 +0000</pubDate>
      <link>https://forem.com/julienkervizic/5-considerations-to-have-when-using-airflow-5ea5</link>
      <guid>https://forem.com/julienkervizic/5-considerations-to-have-when-using-airflow-5ea5</guid>
      <description>&lt;h3&gt;
  
  
  5 considerations to have when using Airflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F796%2F1%2ArkbkMYnVwYncKlO4dPEnUg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F796%2F1%2ArkbkMYnVwYncKlO4dPEnUg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In previous posts, I have explained the &lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21" rel="noopener noreferrer"&gt;basics of airflow&lt;/a&gt; and how to set up &lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b?source=your_stories_page---------------------------" rel="noopener noreferrer"&gt;airflow on azure&lt;/a&gt;; I haven’t, however, covered what considerations to give when using Airflow.&lt;/p&gt;

&lt;p&gt;I see 5 main considerations to have when using airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What type of &lt;strong&gt;infrastructure&lt;/strong&gt; to setup to support it&lt;/li&gt;
&lt;li&gt;What type of &lt;strong&gt;operator&lt;/strong&gt; model to abide by, and which operators to choose&lt;/li&gt;
&lt;li&gt;How to architect your different &lt;strong&gt;DAGs&lt;/strong&gt; and setup your tasks&lt;/li&gt;
&lt;li&gt;Whether to leverage &lt;strong&gt;templated&lt;/strong&gt; code or not&lt;/li&gt;
&lt;li&gt;Whether and how to use its &lt;strong&gt;REST API&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These considerations will dictate how you and your team will be using airflow and how it will be managed.&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Airflow Infrastructure — Go for a Managed Service if Possible
&lt;/h3&gt;

&lt;p&gt;Setting up and maintaining airflow isn’t so easy; if you need to set it up, you will most likely need quite a bit more than the base image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption needs to be set up to safely store secrets and credentials&lt;/li&gt;
&lt;li&gt;An authorization layer needs to be set up, if only through the Flask login setup, and preferably through an OAuth2 provider such as Google&lt;/li&gt;
&lt;li&gt;SSL needs to be configured&lt;/li&gt;
&lt;li&gt;The web server needs to be moved to a more production-ready setup (for example using WSGI/nginx)&lt;/li&gt;
&lt;li&gt;Libraries and drivers need to be installed to support the different types of operations you wish to handle&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the simplest use cases, it is possible to rely solely on the local executor, but once real processing needs arise, more distributed computation is required and management of the infrastructure becomes more complex.&lt;/p&gt;

&lt;p&gt;These distributed setups also require more resources to run than a local executor setup, where worker, scheduler and web server can lie in the same container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Celery executor&lt;/em&gt;: Webserver (UI), Redis (MQ), Postgres (Metadata), Flower (monitoring), Scheduler, Worker&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Mesos Executor&lt;/em&gt;: Webserver (UI), Redis (MQ), Postgres (Metadata), Mesos infra&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Kubernetes&lt;/em&gt;: Webserver (UI), Postgres (Metadata) and Scheduler, Kubernetes infra&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The high number of components raises the complexity and makes problems harder to maintain and debug, requiring that one understands how the &lt;a href="https://blog.sicara.com/using-airflow-with-celery-workers-54cb5212d405" rel="noopener noreferrer"&gt;Celery executor&lt;/a&gt; works with Airflow or how to interact with &lt;a href="https://towardsdatascience.com/kubernetesexecutor-for-airflow-e2155e0f909c" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Managed versions of airflow exist on Google Cloud through &lt;a href="https://cloud.google.com/composer/" rel="noopener noreferrer"&gt;Cloud Composer&lt;/a&gt;; &lt;a href="https://www.astronomer.io/" rel="noopener noreferrer"&gt;Astronomer.io&lt;/a&gt; also offers managed versions, and &lt;a href="https://www.qubole.com/" rel="noopener noreferrer"&gt;qubole&lt;/a&gt; offers it as part of its data platform. Where applicable, it is more than recommended to go for a managed version rather than setting up and managing this infrastructure yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A8cURwaSgf8iYk_61NZbRtA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A8cURwaSgf8iYk_61NZbRtA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AlSTvs2S3QSIe9kvgx8Ib8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AlSTvs2S3QSIe9kvgx8Ib8g.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) Sensors, Hooks and &lt;strong&gt;Operators — Find your fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Depending on your use case, you might want to be able to use certain sensors, hooks or operators. While airflow has decent support for the most &lt;a href="https://airflow.apache.org/_api/airflow/operators/index.html" rel="noopener noreferrer"&gt;common operators&lt;/a&gt; and good support on &lt;a href="https://airflow.apache.org/howto/operator/gcp/index.html#" rel="noopener noreferrer"&gt;google cloud&lt;/a&gt;, if you have a more uncommon use case you will probably need to check the &lt;a href="https://github.com/apache/airflow/tree/master/airflow/contrib/operators" rel="noopener noreferrer"&gt;user contributed operators list&lt;/a&gt; or develop your own.&lt;/p&gt;

&lt;p&gt;Understanding how to use operators depending on your particular company setup is also important. Some have a radical stance &lt;a href="https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753" rel="noopener noreferrer"&gt;with respect to operators&lt;/a&gt;, but the reality is that the use of operators needs to be considered in the context of your company.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does your company have an engineering bias that supports the use of Kubernetes or other container style instances?&lt;/li&gt;
&lt;li&gt;Is your company’s use of airflow driven more by your Data Science department, with little engineering support? For them, it might make more sense to use a python operator or the still &lt;a href="https://github.com/apache/airflow/pull/3115" rel="noopener noreferrer"&gt;pending R operator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Is your company only planning to use airflow to operate data transfers (SFTP/S3 …) and SQL queries to maintain a data warehouse? In that case, using K8s or any container instances would be overkill. This is for example the approach taken at &lt;a href="http://bytepawn.com/fetchr-data-science-infra.html#fetchr-data-science-infra" rel="noopener noreferrer"&gt;Fetchr&lt;/a&gt;, where most of the processing is done in EMR/Presto.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selecting your operator setup is not one-size-fits-all.&lt;/p&gt;

&lt;h3&gt;
  
  
  (3) DAGS — Keep them simple
&lt;/h3&gt;

&lt;p&gt;There are quite a few ways to architect your DAGs in airflow, but as a general rule it is good to keep them simple. Keep within a DAG only tasks that are truly dependent on each other; when dealing with dependencies across multiple DAGs, abstract them into another DAG and file.&lt;/p&gt;

&lt;p&gt;When dealing with a lot of data sources and interdependencies, things can get messy, and setting up DAGs as self-contained files, kept as simple as possible, can go a long way toward making your code maintainable. The &lt;a href="https://airflow.readthedocs.io/en/latest/_modules/airflow/sensors/external_task_sensor.html" rel="noopener noreferrer"&gt;external task sensor&lt;/a&gt; helps to separate DAGs and their dependencies into multiple self-contained DAGs.&lt;/p&gt;

&lt;p&gt;As in most distributed systems, it is important to set up operations to be as idempotent as possible, at least within a DAG run. Certain operations between DAG runs may rely on the depends_on_past setting.&lt;/p&gt;
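&lt;p&gt;A common way to make a load idempotent is to overwrite the partition for the run date rather than append to it, so that retries converge to the same state. A minimal sketch, with an in-memory dict standing in for a date-partitioned table:&lt;/p&gt;

```python
# Sketch of an idempotent load: overwriting the partition for a given run
# date means re-running the task yields the same end state. The dict here
# stands in for a table partitioned by date.
def load_partition(store, run_date, rows):
    store[run_date] = list(rows)  # delete-and-replace, never append
    return store

store = {}
load_partition(store, "2019-05-14", [1, 2, 3])
load_partition(store, "2019-05-14", [1, 2, 3])  # a retry: same result, no duplicates
```

&lt;p&gt;An append-based load, by contrast, would duplicate the rows on every retry, which is exactly what Airflow's retry policy makes likely to happen.&lt;/p&gt;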

&lt;p&gt;Sub-DAGs should be used sparingly, for the same reason of code maintainability. One of the only valid reasons for me to use Sub-DAGs is the creation of dynamic DAGs.&lt;/p&gt;

&lt;p&gt;Communication between tasks, although possible with &lt;a href="https://airflow.apache.org/concepts.html?highlight=xcom" rel="noopener noreferrer"&gt;XCom&lt;/a&gt;, should be minimized as much as possible in favor of self-contained functions/operators. This makes the code more legible and stateless; unless you want to be able to re-run only one part of an operation, there is little to justify its use. Dynamic DAGs are one of the notable exceptions to this.&lt;/p&gt;

&lt;h3&gt;
  
  
  (4) Templates and Macros — Legible Code
&lt;/h3&gt;

&lt;p&gt;Airflow leverages &lt;a href="http://jinja.pocoo.org/" rel="noopener noreferrer"&gt;jinja&lt;/a&gt; for &lt;a href="https://airflow.apache.org/tutorial.html#templating-with-jinja" rel="noopener noreferrer"&gt;templating&lt;/a&gt;. Commands such as Bash or SQL commands can easily be templated for execution, with variables fitted or computed from the context. Templates can provide a more readable alternative to direct string manipulation in Python (e.g. through a format command). Jinja is the default templating engine of most Flask developers, and can also provide a good bridge for Python web developers getting into data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/macros.html" rel="noopener noreferrer"&gt;Macros&lt;/a&gt; provides a way to take further advantage of templating by exposing objects and functions to the templating engine. User can leverage a set of default macros, or customize theirs at global or DAG level.&lt;/p&gt;

&lt;p&gt;Using templated code does, however, take you away from vanilla Python and exposes one more layer of complexity for engineers who typically already need to leverage quite a large array of technologies and APIs.&lt;/p&gt;

&lt;p&gt;Whether or not you choose to leverage templates is a team/personal choice; there are more traditional ways to obtain the same results, for example wrapping the same commands in Python format calls, but templating can make the code more legible.&lt;/p&gt;
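&lt;p&gt;As a sketch, here is the same parameterized query written once as an Airflow Jinja template (rendered by Airflow at run time, so it stays a plain string here) and once with a Python format call; the table and column names are illustrative:&lt;/p&gt;

```python
# The same parameterized query, as a Jinja template (Airflow substitutes
# {{ ds }} with the execution date when it renders the operator's fields)
# versus a plain Python format call.
templated_sql = "DELETE FROM sales WHERE ds = '{{ ds }}'"

def format_sql(ds):
    return "DELETE FROM sales WHERE ds = '{}'".format(ds)
```

&lt;p&gt;Both end up producing the same SQL; the templated version simply delegates the substitution to Airflow's rendering step instead of doing it in Python.&lt;/p&gt;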

&lt;h3&gt;
  
  
  (5) Event Driven — REST API for building Data Products
&lt;/h3&gt;

&lt;p&gt;Airflow’s &lt;a href="https://airflow.readthedocs.io/en/latest/api.html" rel="noopener noreferrer"&gt;REST API&lt;/a&gt; allows for the creation of event-driven workflows. The key feature of the API is to let you trigger DAG runs with a specific configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A3xj72HcV9m5ocwkThZGN7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A3xj72HcV9m5ocwkThZGN7w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The REST API allows for building data product applications on top of airflow, with use cases such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spinning up clusters and processing based on an &lt;a href="https://medium.com/google-cloud/using-airflow-experimental-rest-api-on-google-cloud-platform-cloud-composer-and-iap-9bd0260f095a" rel="noopener noreferrer"&gt;HTTP request&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Setting up a workflow based on a message or file appearing in, respectively, a &lt;a href="https://blog.godatadriven.com/airflow-experimental-rest-api" rel="noopener noreferrer"&gt;message topic or blob storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Building fully fledged &lt;a href="https://medium.com/adobetech/adobe-experience-platform-orchestration-service-with-apache-airflow-952203723c0b" rel="noopener noreferrer"&gt;Machine Learning platforms&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Leveraging the REST API allows for the construction of complex asynchronous processing patterns, while re-using the same architecture, platform and possibly code that are used for more traditional data processing.&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data" rel="noopener noreferrer"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37" rel="noopener noreferrer"&gt;One the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/overview-of-efficiency-concepts-in-big-data-engineering-418995f5f992" rel="noopener noreferrer"&gt;Overview of efficiency concepts in Big Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b?source=friends_link&amp;amp;sk=c9c7c39100e7eaef45c7793caaa265b7" rel="noopener noreferrer"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21" rel="noopener noreferrer"&gt;Airflow, the easy way&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/e-commerce-analysis-data-structures-and-applications-6420c4fa65e7" rel="noopener noreferrer"&gt;E-commerce Analysis: Data-Structures and Applications&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>analytics</category>
      <category>machinelearning</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Python Screening Interview question for DataScientists</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Sat, 11 May 2019 11:28:51 +0000</pubDate>
      <link>https://forem.com/julienkervizic/python-screening-interview-question-for-datascientists-30g4</link>
      <guid>https://forem.com/julienkervizic/python-screening-interview-question-for-datascientists-30g4</guid>
      <description>&lt;h3&gt;
  
  
  Python Screening Interview questions for DataScientists
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AAVvpe5YdYeRiA1ML" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AAVvpe5YdYeRiA1ML" width="800" height="533"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@marius?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Marius Masalar&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data science requires an interdisciplinary set of skills, from handling databases, to running statistical models, to setting up business cases and programming itself. More often than not, technical interviews for data scientists assess knowledge of specific data manipulation APIs such as pandas, sklearn or spark, rather than a programming way of thinking.&lt;/p&gt;

&lt;p&gt;While I think that knowledge of the more “applied” APIs is something that should be tested when hiring data scientists, so is knowledge of more traditional programming.&lt;/p&gt;

&lt;h3&gt;
  
  
  String Reversal &amp;amp; Palindrome
&lt;/h3&gt;

&lt;p&gt;String reversal questions can provide some information as to how well candidates deal with text in Python and handle basic operations.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Question 1&lt;/em&gt;&lt;/strong&gt;:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Reverse the string “the fox jumps over the lazy dog”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a = "the fox jumps over the lazy dog"
a[::-1]

or ''.join(x for x in reversed(a)) [less efficient]
or ''.join(a[-x] for x in range(1, len(a)+1)) [less efficient]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is more of a warmup question than anything else, and while it is good to know the slice notation, especially as it denotes some knowledge of how Python deals with strings (e.g. the substring a[0:7] for “the fox”), it is not necessary for most data science purposes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Question 2:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question&lt;/strong&gt;: identify all words that are palindromes in the following sentence: “Lol, this is a gag, I didn’t laugh so much in a long time”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def isPalindrome(word: str) -\&amp;gt; bool:
 if(word == word[::-1]):
 return True
 return False

def getPalindromesFromStr(inputStr: str) -\&amp;gt; list:
 cleanStr = inputStr.replace(",","").lower()
 words = set(cleanStr.split(" "))
 wPalindromes = [
 word for word in words 
 if isPalindrome(word) and word != ""
 ]
 return wPalindromes

getPalindromesFromStr(“Lol, this is a gag, I didn’t laugh so much in a long time”)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the candidate think about cleaning his/her inputs?&lt;/li&gt;
&lt;li&gt;Does the candidate know the basics of word processing in Python, such as replace / split / lower?&lt;/li&gt;
&lt;li&gt;Does the candidate know how to use list comprehension?&lt;/li&gt;
&lt;li&gt;How does the candidate structure his/her code?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;FizzBuzz&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;FizzBuzz is a traditional programming screening question that tests whether a candidate can think through a problem that is more than a simple if/else statement. The approach they take can also shed some light on their understanding of the language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Write a program that prints the numbers from 1 to 50; for numbers that are multiples of 2 print fizz instead of the number, for multiples of 3 print buzz, and for numbers that are multiples of both 2 and 3 print fizzbuzz.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fizzbuzzfn(num) -\&amp;gt; str:
 mod\_2 = (num % 2 == 0) 
 mod\_3 = (num % 3 == 0)
 if (mod\_2 or mod\_3):
 return (mod\_2 \* 'Fizz') + (mod\_3 \* 'Buzz')
 return str(num)

print('\n'.join([fizzbuzzfn(x) for x in range(1,51)]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Assessment:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do they know the modulo operator and are they able to apply it?&lt;/li&gt;
&lt;li&gt;Are they storing the result of the modulo operators in variables for re-use?&lt;/li&gt;
&lt;li&gt;Do they understand how True/False interact with a String?&lt;/li&gt;
&lt;li&gt;Are they bombarding their code with if statements?&lt;/li&gt;
&lt;li&gt;Do they return a consistent type or mix both integer and string?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First Duplicate word
&lt;/h3&gt;

&lt;p&gt;Finding the first duplicate word tests whether candidates know the basics of text processing in Python and whether they are able to handle some basic data structures.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Question 1&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Given a string, find the first duplicate word; example string: “this is just a wonder, wonder why do I have this in mind”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;string = "this is just a wonder, wonder why do I have this in mind"

def firstduplicate(string: str) -\&amp;gt; str:
 import re
 cleanStr = re.sub("[^a-zA-Z -]", "", string)

 words = cleanStr.lower().split(" ")
 seen\_words = set()
 for word in words:
 if word in seen\_words:
 return word
 else: 
 seen\_words.add(word)
 return None

firstduplicate(string)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the candidate ask about the constraints to work with, for instance in terms of memory?&lt;/li&gt;
&lt;li&gt;Does the candidate clean the string of punctuation? With replace or a regexp? If using a regexp, should the expression be compiled or used directly?&lt;/li&gt;
&lt;li&gt;Does the candidate know the right data structure to check for existence?&lt;/li&gt;
&lt;li&gt;Does the function terminate as soon as the match is found?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Question 2:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; What if we wanted to find the first word with more than 2 duplicates in a string?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;string = "this is just a wonder, wonder why do I have this in mind. This is all that matters."

def first2timesduplicate(string: str) -\&amp;gt; str:
 import re
 cleanStr = re.sub("[^a-zA-Z -]", "", string)

 words = cleanStr.lower().split(" ")
 seen\_words = dict()

for word in words:
 previous\_count = seen\_words.get(word, 0)
 seen\_words[word] = previous\_count + 1
 if previous\_count \&amp;gt;= 2:
 return word
 return None

first2timesduplicate(string)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a small modification is needed to accommodate the change, the main one arising from the use of a dictionary data structure rather than a set. &lt;a href="https://docs.python.org/2/library/collections.html" rel="noopener noreferrer"&gt;Counters&lt;/a&gt; are also a valid data structure for this use case.&lt;/li&gt;
&lt;li&gt;There is little difficulty in modifying the previous function to cope with this change request; it is worth checking that the candidate instantiates each key correctly, taking default values into account.&lt;/li&gt;
&lt;/ul&gt;
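&lt;p&gt;As a sketch of the Counter-based alternative mentioned above (the function name and the early-return-on-third-occurrence structure are illustrative, not part of the original answer):&lt;/p&gt;

```python
import re
from collections import Counter

def first2timesduplicate_counter(string: str) -> str:
    # same cleaning step as the dict-based answer above
    clean_str = re.sub("[^a-zA-Z -]", "", string)
    counts = Counter()  # Counter handles the default value of 0 for us
    for word in clean_str.lower().split(" "):
        counts[word] += 1
        if counts[word] == 3:  # third occurrence: already duplicated twice
            return word
    return None

print(first2timesduplicate_counter(
    "this is just a wonder, wonder why do I have this in mind. This is all that matters."))
```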

&lt;h3&gt;
  
  
  Quick Fire questions
&lt;/h3&gt;

&lt;p&gt;Some quick-fire questions can also be asked to test general knowledge of the Python language.&lt;/p&gt;

&lt;h4&gt;
  
  
  Question 1:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Replicate the built-in sum for any number of arguments, e.g. sum(1, 2, 3, 4, 5, …)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sum(\*args):
 val = 0
 for arg in args:
 val += arg
 return val
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A quick interview question to check the knowledge of variable arguments and how to set up one of the most basic functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Question 2:
&lt;/h4&gt;

&lt;p&gt;Questions around the Fibonacci sequence are a classic of programming interviews, and candidates should in general be at least familiar with them. They test recursive thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Fibonacci sequences are defined as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F\_0 = 0 ; F\_1 = 1
F\_n = F\_{-1} + F\_{-2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write a function that gives the sum of all Fibonacci numbers from 0 to n.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Answer:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fibonacci(n: int) -\&amp;gt; int:
 # fib series don't exist \&amp;lt; 0 
 # might want to throw an error or a null 
 # for that
 if n \&amp;lt;= 0: 
 return 0
 if n == 1: 
 return 1
 else:
 return fibonacci(n-1) + fibonacci(n-2)

def naiveFibSum(n: int) -\&amp;gt; int:
 return sum([fibonacci(x) for x in range(0, n+1)])

def sumFib(n: int) -\&amp;gt; int:
 return fibonacci(n + 2) -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Assessment:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, is the candidate able to think recursively?&lt;/li&gt;
&lt;li&gt;Is the candidate only thinking about a naive solution to the sum of the Fibonacci series, or does s/he understand that it can also be computed more directly, as F(n+2) - 1?&lt;/li&gt;
&lt;/ul&gt;
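&lt;p&gt;A possible follow-up is to ask how to avoid the exponential blow-up of the naive recursion; an iterative version (a sketch with illustrative names, not part of the original answer) runs in linear time and keeps the same closed-form sum:&lt;/p&gt;

```python
def fib_iter(n: int) -> int:
    # linear-time alternative to the exponential naive recursion
    if n <= 0:
        return 0
    prev, curr = 0, 1
    for _ in range(n - 1):
        prev, curr = curr, prev + curr
    return curr

def sum_fib_iter(n: int) -> int:
    # sum of F_0..F_n collapses to F_{n+2} - 1
    return fib_iter(n + 2) - 1

print(sum_fib_iter(5))  # 0 + 1 + 1 + 2 + 3 + 5 = 12
```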

&lt;h3&gt;
  
  
  Wrap up
&lt;/h3&gt;

&lt;p&gt;These questions are just meant to be a first screener for data-scientists and should be combined with statistical and data-manipulation types of questions. They are meant to give a quick glimpse of whether a candidate has the minimum knowledge needed to go through a full round of interviews.&lt;/p&gt;

&lt;p&gt;More advanced programming questions for Python would tend to cover the use of generators, decorators, cython or the efficient use of libraries such as pandas/numpy.&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data" rel="noopener noreferrer"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/sql-interview-questions-for-aspiring-data-scientist-the-histogram-76a46de9b0a3" rel="noopener noreferrer"&gt;SQL interview Questions For Aspiring Data Scientist — The Histogram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/julienkervizic/become-a-pro-at-pandas-python-s-data-manipulation-library-4214-temp-slug-5230196"&gt;Become a Pro at Pandas, Python’s data manipulation Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/e-commerce-analysis-data-structures-and-applications-6420c4fa65e7" rel="noopener noreferrer"&gt;E-commerce Analysis: Data-Structures and Applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b" rel="noopener noreferrer"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/3-simple-rules-to-build-machine-learning-models-that-add-value-61106db88461" rel="noopener noreferrer"&gt;3 simple rules to build machine learning Models that add value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>engineering</category>
      <category>python</category>
      <category>programming</category>
      <category>technology</category>
    </item>
    <item>
      <title>Become a Pro at Pandas, Python’s data manipulation Library</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Thu, 09 May 2019 17:34:26 +0000</pubDate>
      <link>https://forem.com/julienkervizic/become-a-pro-at-pandas-python-s-data-manipulation-library-2161</link>
      <guid>https://forem.com/julienkervizic/become-a-pro-at-pandas-python-s-data-manipulation-library-2161</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/analytics-and-data/become-a-pro-at-pandas-pythons-data-manipulation-library-264351b586b1?source=rss----5bbc46989439---4" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2600%2F0%2AUAKmqEnb3YsEUTdt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pandas library is the most popular data manipulation library for python. It provides an easy way to manipulate data through its…&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding The pandas library
&lt;/h2&gt;

&lt;p&gt;One of the keys to getting a good understanding of pandas is to understand that pandas is mostly a wrapper around a series of other Python libraries, the main ones being NumPy, SQLAlchemy, Matplotlib and openpyxl.&lt;/p&gt;

&lt;p&gt;The core internal model of the data-frame is a series of NumPy arrays, and pandas functions, such as the now deprecated “as_matrix” function, return results in NumPy’s &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.as_matrix.html" rel="noopener noreferrer"&gt;internal representation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Pandas leverages other libraries to get data in and out of data-frames: SQLAlchemy, for instance, is used through the read_sql and to_sql functions, while openpyxl and XlsxWriter are used for the read_excel and to_excel functions.&lt;/p&gt;

&lt;p&gt;Matplotlib and Seaborn, in turn, are used to provide an easy interface to plot information available within a data-frame, using commands such as df.plot().&lt;/p&gt;

&lt;h2&gt;
  
  
  Numpy’s Panda — Efficient pandas
&lt;/h2&gt;

&lt;p&gt;One of the complaints you often hear is that Python is slow or that it is difficult to handle large amounts of data. More often than not, this is due to poor efficiency of the code being written. It is true that native Python code tends to be slower than compiled code, but libraries like pandas provide a Python interface to compiled code, and knowing how to properly use this interface lets us get the best out of pandas/Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  APPLY &amp;amp; VECTORIZED OPERATIONS
&lt;/h3&gt;

&lt;p&gt;Pandas, like its underlying library NumPy, performs vectorized operations more efficiently than loops. These efficiencies are due to vectorized operations being performed through C-compiled code rather than native Python code, and to the ability of vectorized operations to operate on entire datasets rather than just a sub-portion at a time.&lt;/p&gt;

&lt;p&gt;The apply interface allows you to gain some of this efficiency by using a CPython interface to do the looping:&lt;br&gt;
&lt;code&gt;df.apply(lambda x: x['col_a'] * x['col_b'], axis=1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But most of the performance gain is obtained from the use of vectorized operations themselves, be it directly in pandas or by calling its internal NumPy arrays directly.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi24rqfxlr848vy0fwsoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fi24rqfxlr848vy0fwsoz.png"&gt;&lt;/a&gt;&lt;br&gt;
As you can see from the picture above, the difference in performance can be drastic between processing with a vectorized operation (3.53 ms) and looping with apply to do an addition (27.8 s). Additional efficiencies can be obtained by directly invoking NumPy’s arrays and API, e.g.:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ovdsppfccnwuq2dbrhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7ovdsppfccnwuq2dbrhk.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Swifter&lt;/strong&gt;: &lt;a href="https://medium.com/@jmcarpenter2/swiftapply-automatically-efficient-pandas-apply-operations-50e1058909f9?source=post_page-----264351b586b1----------------------" rel="noopener noreferrer"&gt;swifter&lt;/a&gt; is a Python library that makes it easy to vectorize different types of operations on data-frames; its API is fairly similar to that of the apply function.&lt;/p&gt;
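&lt;p&gt;As a minimal sketch of the comparison above (toy data; column names are illustrative), the three variants compute the same result, from slowest to fastest:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_a": np.arange(5), "col_b": np.arange(5)})

# row-wise apply: a Python-level loop over rows
looped = df.apply(lambda x: x["col_a"] * x["col_b"], axis=1)

# vectorized: one C-level operation over whole columns
vectorized = df["col_a"] * df["col_b"]

# invoking the underlying numpy arrays directly is faster still
raw = df["col_a"].to_numpy() * df["col_b"].to_numpy()

print(list(raw))  # [0, 1, 4, 9, 16]
```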

&lt;h3&gt;
  
  
  EFFICIENT DATA STORING THROUGH DTYPES
&lt;/h3&gt;

&lt;p&gt;When loading a data-frame into memory, be it through read_csv, read_excel or some other data-frame read function, pandas performs type inference, which might prove to be inefficient. These APIs allow you to specify the type of each column explicitly, which allows for more efficient storage of data in memory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df.astype({'testColumn': str, 'testCountCol': float})&lt;/code&gt;&lt;br&gt;
Dtypes are native objects from &lt;a href="https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt;, which allow you to define the exact type and the number of bits used to store certain information.&lt;/p&gt;

&lt;p&gt;NumPy’s dtype &lt;code&gt;np.dtype('int32')&lt;/code&gt; would for instance represent a 32-bit integer. Pandas defaults to 64-bit integers; we could save half the space by using 32 bits:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5n6yxvep3tq96dv41j03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5n6yxvep3tq96dv41j03.png"&gt;&lt;/a&gt;&lt;br&gt;
memory_usage() shows the number of bytes used by each of the columns; since there is only one entry (row) per column, the size of each int64 column is 8 bytes and that of each int32 column is 4 bytes.&lt;/p&gt;

&lt;p&gt;Pandas also introduces the categorical dtype, which allows for efficient memory utilization for frequently occurring values. In the example below, we can see a 28x decrease in memory utilization for the field posting_date when we convert it to a categorical value.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F03mr9qqq9zdhyt2n96v4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F03mr9qqq9zdhyt2n96v4.png"&gt;&lt;/a&gt;&lt;br&gt;
In our example, the overall size of the data-frame drops by more than 3x just by changing this data type:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcdkrb6qh24ob8getaaln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcdkrb6qh24ob8getaaln.png"&gt;&lt;/a&gt;&lt;br&gt;
Not only does using the right dtypes allow you to handle larger datasets in memory, it also makes some computations more efficient. In the example below, we can see that using the categorical type brought a 3x speed improvement to the groupby / sum operation.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foozf7jnk2ujas0j8mwl6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foozf7jnk2ujas0j8mwl6.png"&gt;&lt;/a&gt;&lt;br&gt;
Within pandas, you can define the dtypes either during the data load (read_) or as a type conversion (astype).&lt;/p&gt;
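&lt;p&gt;A minimal sketch of both points (sample values and column names are illustrative): narrowing an integer column with astype, and converting a repetitive column to categorical:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"testCountCol": [1, 2, 3, 4]})

# pandas defaults to int64: 8 bytes per value
bytes_int64 = df["testCountCol"].memory_usage(index=False)

# narrowing to int32 halves the storage
df = df.astype({"testCountCol": np.int32})
bytes_int32 = df["testCountCol"].memory_usage(index=False)
print(bytes_int64, bytes_int32)  # 32 16

# a frequently repeated value stored as category uses far less memory
dates = pd.Series(["2019-01-01"] * 1000)
as_cat = dates.astype("category")
print(dates.memory_usage(deep=True) > as_cat.memory_usage(deep=True))
```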

&lt;p&gt;&lt;strong&gt;CyberPandas&lt;/strong&gt;: &lt;a href="https://www.anaconda.com/cyberpandas-extending-pandas-with-richer-types/" rel="noopener noreferrer"&gt;CyberPandas&lt;/a&gt; is one of several library extensions that enable a richer variety of datatypes, supporting ipv4 and ipv6 data types and storing them efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  HANDLING LARGE DATASETS WITH CHUNKS
&lt;/h3&gt;

&lt;p&gt;Pandas allows for loading data into a data-frame by chunks; it is therefore possible to process data-frames as iterators and to handle data-frames larger than the available memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fixt4ujgpp0qxcpdbiryu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fixt4ujgpp0qxcpdbiryu.png"&gt;&lt;/a&gt;&lt;br&gt;
The combination of defining a chunksize when reading a data source and the get_chunk method allows pandas to process data as an iterator, as in the example shown above, where the data-frame is read 2 rows at a time. These chunks can then be iterated through:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;i = 0&lt;br&gt;
for chunk in df_iter:&lt;br&gt;
  # do some processing&lt;br&gt;
  i += 1&lt;br&gt;
  new_chunk = chunk.apply(lambda x: do_something(x), axis=1)&lt;br&gt;
  new_chunk.to_csv("chunk_output_%i.csv" % i)&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The output of which can then be fed to a csv file, pickled, exported to a database, etc…&lt;/p&gt;

&lt;p&gt;Setting up operations by chunks also allows certain operations to be performed through &lt;a href="https://docs.python.org/2/library/multiprocessing.html" rel="noopener noreferrer"&gt;multi-processing&lt;/a&gt;.&lt;/p&gt;
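&lt;p&gt;A minimal sketch of chunked processing, reading from an in-memory CSV for illustration (the data and variable names are made up):&lt;/p&gt;

```python
import io

import pandas as pd

csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# chunksize turns read_csv into an iterator of small data-frames,
# so the full file never has to fit in memory at once
partial_sums = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial_sums.append(chunk["a"].sum())

print(sum(partial_sums))  # 1 + 3 + 5 + 7 = 16
```

&lt;p&gt;Each chunk (or partial result) could equally be handed off to a multiprocessing worker.&lt;/p&gt;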

&lt;p&gt;Dask, for instance, is a framework built on top of pandas with multi-processing and distributed processing in mind. It makes use of collections of chunks of pandas data-frames, both in memory and on disk.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Alchemy’s Pandas — Database Pandas
&lt;/h2&gt;

&lt;p&gt;Pandas is also built on top of SQLAlchemy to interface with databases; as such, it is able to download datasets from diverse SQL databases as well as push records to them. Using the SQLAlchemy interface rather than the pandas API directly allows us to do certain operations not natively supported within pandas, such as transactions or upserts:&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL TRANSACTIONS
&lt;/h3&gt;

&lt;p&gt;Pandas can also make use of SQL transactions, handling commits and rollbacks. Pedro Capelastegui notably explained &lt;a href="https://capelastegui.wordpress.com/2018/05/21/commit-and-rollback-with-pandas-dataframe-to_sql/" rel="noopener noreferrer"&gt;in one of his blog posts&lt;/a&gt; how pandas can take advantage of transactions through a SQLAlchemy context manager.&lt;br&gt;
&lt;code&gt;with engine.begin() as conn:&lt;br&gt;
  df.to_sql(&lt;br&gt;
    tableName,&lt;br&gt;
    con=conn,&lt;br&gt;
    ...&lt;br&gt;
  )&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The advantage of using a SQL transaction is that the transaction rolls back should the data load fail.&lt;/p&gt;
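&lt;p&gt;A minimal, self-contained sketch of the pattern (using an in-memory SQLite database; the table and column names are illustrative):&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # throwaway in-memory database
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# engine.begin() opens a transaction: if to_sql fails midway,
# the whole load rolls back instead of leaving partial rows
with engine.begin() as conn:
    df.to_sql("test_table", con=conn, index=False)

loaded = pd.read_sql("SELECT * FROM test_table", con=engine)
print(len(loaded))  # 2
```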

&lt;h3&gt;
  
  
  SQL extension
&lt;/h3&gt;

&lt;h4&gt;
  
  
  PandaSQL
&lt;/h4&gt;

&lt;p&gt;Pandas has a few SQL extensions, such as &lt;a href="https://pypi.org/project/pandasql/" rel="noopener noreferrer"&gt;pandasql&lt;/a&gt;, a library that allows you to perform SQL queries on top of data-frames. Through pandasql, data-frame objects can be queried directly as if they were database tables.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnjmhg5eaiufy99mu6erz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnjmhg5eaiufy99mu6erz.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  SQL UPSERTs
&lt;/h4&gt;

&lt;p&gt;Pandas doesn’t natively support upserts when exporting to SQL databases that support this operation. &lt;a href="https://github.com/ryanbaumann/Pandas-to_sql-upsert" rel="noopener noreferrer"&gt;Patches to pandas&lt;/a&gt; exist to allow this feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  MatplotLib/Seaborn — Visual Pandas
&lt;/h2&gt;

&lt;p&gt;Matplotlib and Seaborn visualizations are already integrated in some of the data-frame APIs, such as through the .plot command. There is fairly comprehensive documentation of how the interface works on &lt;a href="https://pandas.pydata.org/pandas-docs/version/0.22/visualization.html" rel="noopener noreferrer"&gt;pandas’ website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensions&lt;/strong&gt;: Different extensions exist, such as Bokeh and Plotly, to provide interactive visualization within Jupyter notebooks, while it is also possible to extend matplotlib to handle &lt;a href="https://pythonprogramming.net/3d-graphing-pandas-matplotlib/" rel="noopener noreferrer"&gt;3D graphs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Extensions
&lt;/h2&gt;

&lt;p&gt;Quite a few other extensions for pandas exist to handle non-core functionality. One of them is tqdm, which provides a progress-bar functionality for certain operations; another is PrettyPandas, which allows you to format data-frames and add summary information.&lt;/p&gt;

&lt;h4&gt;
  
  
  tqdm
&lt;/h4&gt;

&lt;p&gt;tqdm is a progress-bar extension in Python that interacts with pandas; it allows users to see the progress of map and apply operations on pandas data-frames when using the relevant functions (progress_map and progress_apply):&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4ofsslfxahr40wy0m5ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4ofsslfxahr40wy0m5ff.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  PrettyPandas
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/HHammond/PrettyPandas" rel="noopener noreferrer"&gt;PrettyPandas&lt;/a&gt; is a library that provides an easy way to format data-frames and to add table summaries to them:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F73ituvg6fufcgczbpm52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F73ituvg6fufcgczbpm52.png"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data" rel="noopener noreferrer"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37" rel="noopener noreferrer"&gt;On the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21" rel="noopener noreferrer"&gt;Airflow, the easy way&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/e-commerce-analysis-data-structures-and-applications-6420c4fa65e7" rel="noopener noreferrer"&gt;E-commerce Analysis: Data-Structures and Applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b" rel="noopener noreferrer"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/3-simple-rules-to-build-machine-learning-models-that-add-value-61106db88461" rel="noopener noreferrer"&gt;3 simple rules to build machine learning Models that add value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>engineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>New roles of Analytics — The Data Product Owner &amp; Analytics Translator</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Wed, 01 May 2019 16:25:17 +0000</pubDate>
      <link>https://forem.com/julienkervizic/new-roles-of-analytics-the-data-product-owner-analytics-translator-3clm</link>
      <guid>https://forem.com/julienkervizic/new-roles-of-analytics-the-data-product-owner-analytics-translator-3clm</guid>
      <description>&lt;h3&gt;
  
  
  New roles of Analytics — The Data Product Owner &amp;amp; Analytics Translator
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jzk2rDVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AGOByG5Sxm1GNe4JM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jzk2rDVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AGOByG5Sxm1GNe4JM" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@markusspiske?utm_source=medium&amp;amp;utm_medium=referral"&gt;Markus Spiske&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two new roles have emerged in the world of data in the past few years, roles that provide a softer touch in a world of technical data. These roles aim to fill a clear gap, in a function where juniors are plenty and seniors few, and to bring product and project leadership in a world constrained by the lack of analytical leadership talent.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data-Science Product Owner&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The role was created by companies like Booking.com, heavily involved in Agile and employing over 200 data-scientists. Nowadays the role can be found in companies hiring only a couple of data-scientists.&lt;/p&gt;

&lt;p&gt;Ignoring the ill fit that data-science has with Agile (where the Product Owner title comes from), there are preconditions and drawbacks to having data-science-specific product owners.&lt;/p&gt;

&lt;p&gt;Overall the company and team needs to have a certain technical orientation and composition, critical size and focus to make efficient use of a data product owner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Orientation:&lt;/strong&gt; In a similar vein to how some companies have TPMs (Technical Product Managers) to cope with the technicality of the role, data-science product owners should have a technical background, preferably within the field of data-science; Booking.com notably used to hire ex-data-scientists for this role. Product Owners are meant to set the strategic vision and roadmap and to prioritize features for development; this is not possible in a data-science team without a deep understanding of data-science, its constraints, how to set up an MVP, and how to differentiate what can add value from what would likely bring minimal improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team composition:&lt;/strong&gt; Some of the issues that often happen with data-science teams staffed with a product owner relate to team composition. It is very unusual, for instance, for a data-scientist to put a model into production by themselves; they tend to leverage the expertise of data engineers, and often of backend software engineers, for that purpose. In terms of team composition, it is also worth making the distinction between what people often call type “A” (A for Analysis) and type “B” (B for Build) data-scientists. The distinction is worth considering, as the work of type “A” data-scientists is ported to production with more difficulty and might need additional engineering support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical size:&lt;/strong&gt; In order for it to make sense to have a dedicated product owner for a data-science team, there needs to be a critical mass of data-scientists in your organization that can be led through a single vision. For this to apply, you need enough data-scientists focused on a single product area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team Focus:&lt;/strong&gt; One of the issues with having a product owner focused purely on data-science is the focus it gives the team. More often than not, a data-science problem is more easily solved by changing the different business or product processes to provide more signal in the data. Having product teams focused purely on data-science can hinder their output by limiting their scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Analytics Translator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Back in early 2018, McKinsey, noticing a gap in the market, coined the title of “&lt;a href="https://hbr.org/2018/02/you-dont-have-to-be-a-data-scientist-to-fill-this-must-have-analytics-role?autocomplete=true"&gt;Analytics Translator&lt;/a&gt;”. They described the role as requiring deep domain knowledge, to help better prioritize business opportunities.&lt;/p&gt;

&lt;p&gt;There are definite parallels and overlaps between the roles of an analytics translator and a data(-science) product owner, notably around setting the team’s prioritization, providing domain knowledge and acting as an interface between the data-science/development team and the business.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prioritization:&lt;/strong&gt; Both the analytics translator role and the product owner role help prioritize the work carried out by the different members of the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Knowledge:&lt;/strong&gt; Both the analytics translator and product owner roles require business domain knowledge, although it is only stated as a hard requirement for the analytics translator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface:&lt;/strong&gt; The analytics translator is meant to bridge the gap towards the business and make sure the business can leverage what the team produces, be it an insight or a product being built. The role of the product owner, similarly, is to act as a representative of the business to the development team, and of the development team to the business.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are some differences between the roles, however. While the product owner role tends to focus more on building products, the analytics translator role is geared more towards the business and generating insights.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product / Project orientation:&lt;/strong&gt; The role of the analytics translator is defined as being more project-oriented than product-oriented. It encompasses the full delivery of insights and turning the information provided into action; as I described in the &lt;a href="https://medium.com/analytics-and-data/4-pillars-of-analytics-1ee79e2e5f5f?source=friends_link&amp;amp;sk=e151e23ae11937eaf5f7e75b3bc0aa5b"&gt;4 Pillars of analytics&lt;/a&gt;, Analytics Translators are responsible for the Last Mile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical fluency:&lt;/strong&gt; McKinsey describes the skills needed for an analytics translator as “general technical fluency”: they need some general understanding of the technical concepts used in analytics, although they do not necessarily need to be able to apply them themselves. This is a much higher level of data technical competency than that of a general product owner, but would likely be lower than that of most data product owners at large technology organizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Wrap Up
&lt;/h3&gt;

&lt;p&gt;Both roles aim to fill the gaps in a field that is becoming increasingly technical and where certain actors are detaching from their business roots. They can help compensate for a lack of more senior analytics talent, such as a Head of Data Science, by providing more product / project guidance.&lt;/p&gt;

&lt;p&gt;We need to be very particular, however, about the conditions under which we bring about these roles: they need a certain critical mass of technical data talent in order to be truly beneficial, and should only be introduced in organizations that provide the right context for them.&lt;/p&gt;

&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/4-pillars-of-analytics-1ee79e2e5f5f?source=friends_link&amp;amp;sk=e151e23ae11937eaf5f7e75b3bc0aa5b"&gt;4 Pillars of Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37"&gt;One the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/principles-of-e-commerce-personalization-4feb6de4b6e"&gt;Principles of e-commerce Personalization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-customer-lifetime-value-in-ecommerce-d3c151c6fdc0"&gt;On customer lifetime value in ecommerce&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-attribution-and-decomposition-of-mix-and-rate-effects-a681d01b9315"&gt;ON the attribution and decomposition of mix and rate effects&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>analytics</category>
      <category>business</category>
      <category>datascience</category>
      <category>product</category>
    </item>
    <item>
      <title>Overview of the different approaches to putting Machine Learning (ML) models in production</title>
      <dc:creator>Julien Kervizic</dc:creator>
      <pubDate>Mon, 29 Apr 2019 20:21:50 +0000</pubDate>
      <link>https://forem.com/julienkervizic/overview-of-the-different-approaches-to-putting-machine-learning-ml-models-in-production-12ae</link>
      <guid>https://forem.com/julienkervizic/overview-of-the-different-approaches-to-putting-machine-learning-ml-models-in-production-12ae</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/analytics-and-data/overview-of-the-different-approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86?source=rss----5bbc46989439---4" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2600%2F0%2AN-zRq8Vu4A_jDgkT"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are different approaches to putting models into production, with benefits that can vary depending on the specific use case. Take, for example, churn prediction: there is value in having a static score that can easily be looked up when someone calls customer service, but there is extra value to be gained if, for specific events, the model can be re-run with the newly acquired information.&lt;/p&gt;

&lt;p&gt;There are generally different ways to both train and serve models in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Train&lt;/strong&gt;: one-off, batch and real-time/online training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serve&lt;/strong&gt;: batch, real-time (database trigger, pub/sub, web-service, in-app)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each approach has its own set of benefits and tradeoffs that need to be considered.&lt;/p&gt;

&lt;h2&gt;
  
  
  One off Training
&lt;/h2&gt;

&lt;p&gt;Models don’t necessarily need to be continuously trained in order to be pushed to production. Quite often a model can just be trained ad-hoc by a data-scientist and pushed to production until its performance deteriorates enough that they are called upon to refresh it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F692%2F1%2AuDrP8CmxK74DEtXqA738uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F692%2F1%2AuDrP8CmxK74DEtXqA738uw.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;From Jupyter to Prod&lt;/em&gt;&lt;br&gt;
Data-scientists prototyping and doing machine learning tend to operate in their environment of choice, Jupyter Notebooks: essentially an advanced GUI on top of a REPL that allows you to save both code and command outputs.&lt;br&gt;
Using that approach, it is more than feasible to push an ad-hoc trained model from a piece of code in Jupyter to production. Different types of libraries and other notebook providers help further tighten the link between the data-scientist workbench and production.&lt;/p&gt;
&lt;h3&gt;
  
  
  Model Format
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.python.org/3/library/pickle.html" rel="noopener noreferrer"&gt;Pickle&lt;/a&gt; converts a python object to to a bitstream and allows it to be stored to disk and reloaded at a later time. It is provides a good format to store machine learning models provided that their intended applications is also built in python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/onnx" rel="noopener noreferrer"&gt;ONNX&lt;/a&gt; the Open Neural Network Exchange format, is an open format that supports the storing and porting of predictive model across libraries and languages. Most deep learning libraries support it and sklearn also has a library extension to convert their model to ONNX’s format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language" rel="noopener noreferrer"&gt;PMML&lt;/a&gt; or Predictive model markup language, is another interchange format for predictive models. Like for ONNX sklearn also has another library extension for converting the models to PMML format. It has the drawback however of only supporting certain type of prediction models.PMML has been around since 1997 and so has a large footprint of applications leveraging the format. Applications such as SAP for instance is able to leverage certain versions of the PMML standard, likewise for CRM applications such as PEGA.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html#about-pojos-and-mojos" rel="noopener noreferrer"&gt;POJO and MOJO&lt;/a&gt; are H2O.ai’s export format, that intendeds to offers an easily embeddable model into java application. They are however very specific to using the H2O’s platform.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;For one off training of models, the model can either be trained and fine-tuned ad-hoc by a data-scientist or trained through AutoML libraries. Having an easily reproducible setup, however, helps push into the next stage of productionalization, i.e. batch training.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch Training
&lt;/h2&gt;

&lt;p&gt;While not strictly necessary to implement a model in production, batch training allows you to have a constantly refreshed version of your model based on the latest training run.&lt;/p&gt;

&lt;p&gt;Batch training can benefit a lot from AutoML-type frameworks. AutoML enables you to perform/automate activities such as feature processing, feature selection, model selection and parameter optimization. Their recent performance has been on par with, or has bested, that of the most diligent data-scientists.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AP6odfU6Xa39t18c70fie3w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AP6odfU6Xa39t18c70fie3w.jpeg"&gt;&lt;/a&gt;&lt;br&gt;
Using them allows for a more comprehensive model training than what was typically done prior to their ascent: simply retraining the model weights.&lt;/p&gt;

&lt;p&gt;Different technologies exist to support this continuous batch training. It could, for instance, be set up through a mix of Airflow to manage the workflow and an AutoML library such as TPOT. Different cloud providers also offer AutoML solutions that can be put in a data workflow; Azure, for instance, integrates machine learning prediction and model training with its Data Factory offering.&lt;/p&gt;
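&lt;p&gt;As a rough sketch of the kind of retraining task such a workflow could schedule (all names are hypothetical, and a trivial threshold model stands in for a real AutoML fit such as TPOT):&lt;/p&gt;

```python
# Hypothetical batch retraining task, of the kind an Airflow DAG might run daily.
# The "model" is a toy threshold classifier standing in for an AutoML fit.
import pickle
import time


def fetch_training_data():
    # Placeholder for an ETL step pulling the latest labelled data.
    X = [0.1, 0.2, 0.8, 0.9]
    y = [0, 0, 1, 1]
    return X, y


def train_model(X, y):
    # Stand-in for AutoML: midpoint between class means as a decision threshold.
    mean0 = sum(x for x, label in zip(X, y) if label == 0) / y.count(0)
    mean1 = sum(x for x, label in zip(X, y) if label == 1) / y.count(1)
    return {"threshold": (mean0 + mean1) / 2}


def retrain_job():
    X, y = fetch_training_data()
    model = train_model(X, y)
    # Version the artifact so serving can roll back to a previous train.
    artifact = {"model": model, "trained_at": time.time()}
    return pickle.dumps(artifact)


blob = retrain_job()
model = pickle.loads(blob)["model"]
```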
&lt;h2&gt;
  
  
  Real time training
&lt;/h2&gt;

&lt;p&gt;Real-time training is possible with “online machine learning” models; algorithms supporting this method of training include K-means (through mini-batch), linear and logistic regression (through stochastic gradient descent) as well as the Naive Bayes classifier.&lt;/p&gt;

&lt;p&gt;Spark has StreamingLinearAlgorithm/StreamingLinearRegressionWithSGD to perform these operations, while sklearn has SGDRegressor and SGDClassifier that can be trained incrementally. In sklearn, incremental training is done through the partial_fit method, as shown below:&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
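&lt;p&gt;A minimal sketch of such incremental training (assuming scikit-learn is installed; the mini-batches are toy placeholders):&lt;/p&gt;

```python
# Minimal sketch of sklearn incremental (online) training via partial_fit.
# Assumes scikit-learn is installed; the mini-batches are toy placeholders.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=42)

# On the first call, all possible classes must be declared, since later
# mini-batches are not guaranteed to contain every label.
X_batch1 = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]]
y_batch1 = [0, 1, 0, 1]
clf.partial_fit(X_batch1, y_batch1, classes=[0, 1])

# As new data streams in, the same model is updated in place.
X_batch2 = [[0.2, 0.8], [0.8, 0.2]]
y_batch2 = [0, 1]
clf.partial_fit(X_batch2, y_batch2)

preds = clf.predict([[0.05, 0.95], [0.95, 0.05]])
```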
&lt;p&gt;When deploying this type of model, there needs to be serious operational support and monitoring, as the model can be sensitive to new data and noise, and model performance needs to be monitored on the fly. In offline training, you can filter out points of high leverage and correct for this type of incoming data; this is much harder to do when you are constantly updating your model based on a stream of new data points.&lt;/p&gt;

&lt;p&gt;Another challenge with training online models is that they don’t decay historical information. This means that, in case of structural changes in your datasets, the model will need to be re-trained anyway, and that there will be a big onus on model lifecycle management.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch vs. Real-time Prediction
&lt;/h2&gt;

&lt;p&gt;When deciding whether to set up batch or real-time prediction, it is important to understand why real-time prediction would be important. It can, for instance, be to get a new score when a significant event happens, such as the churn score of a customer when they call a contact center. These benefits need to be weighed against the complexity and cost implications that arise from doing real-time predictions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Load implications
&lt;/h3&gt;

&lt;p&gt;Catering to real-time prediction requires a way to handle peak load. Depending on the approach taken and how the prediction ends up being used, choosing a real-time approach might also require having machines with extra computing power available in order to provide a prediction within a certain SLA. This contrasts with a batch approach, where the prediction computation can be spread out throughout the day based on available capacity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Infrastructure Implications
&lt;/h3&gt;

&lt;p&gt;Going for real-time puts a much higher operational responsibility on the team. People need to be able to monitor how the system is working, be alerted when there are issues, and give some consideration to failover responsibilities. For batch prediction the operational obligation is much lower: some monitoring is definitely needed and alerting is desired, but the need to know about issues as they arise is much lower.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cost Implications
&lt;/h3&gt;

&lt;p&gt;Going for real-time predictions also has cost implications: not being able to spread the load throughout the day can force you into purchasing more computing capacity than you would otherwise need, or into paying for spot price increases. Depending on the approach and requirements, there might also be extra cost from needing more powerful compute capacity in order to meet SLAs. Furthermore, there tends to be a higher infrastructure footprint when choosing real-time predictions. One potential caveat is where the choice is made to rely on in-app prediction; for that specific scenario, the cost might actually end up being cheaper than a batch approach.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evaluation Implications
&lt;/h3&gt;

&lt;p&gt;Evaluating prediction performance in a real-time manner can be more challenging than for batch predictions. How do you evaluate performance when faced with a succession of actions in a short burst producing multiple predictions for a given customer, for instance? Evaluating and debugging real-time prediction models is significantly more complex. They also require a log collection mechanism that allows you to collect both the different predictions and the features that yielded each score, for further evaluation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch Prediction Integration
&lt;/h2&gt;

&lt;p&gt;Batch predictions rely on two different sets of information: one is the predictive model, and the other is the features we will feed the model. In most types of batch prediction architecture, an ETL is performed to either fetch pre-calculated features from a specific datastore (feature-store) or perform some type of transformation across multiple datasets to provide the input to the prediction model. The prediction model then iterates over all the rows in the dataset, providing the different scores.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1388%2F1%2AGtbLV1h4KwN4JB4ffKK04Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1388%2F1%2AGtbLV1h4KwN4JB4ffKK04Q.png"&gt;&lt;/a&gt;&lt;br&gt;
Once all the predictions have been computed, we can then “serve” the scores to the different systems wanting to consume the information. This can be done in different manners depending on the use case for which we want to consume the score: if we wanted to consume the score in a front-end application, we would most likely push the data to a “cache” or NoSQL database such as Redis so that we can offer millisecond responses, while for certain use cases, such as the creation of an email journey, we might just rely on a CSV SFTP export or a data load to a more traditional RDBMS.&lt;/p&gt;
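&lt;p&gt;The batch flow above can be sketched as follows; a plain dict stands in for the Redis cache, and the scoring rule is a hypothetical placeholder for a real trained model:&lt;/p&gt;

```python
# Sketch of a batch scoring job: fetch features, score every row, serve to a cache.
# A plain dict stands in for Redis; the scoring rule is a toy placeholder.

def fetch_features():
    # Placeholder for the ETL step reading from a feature store.
    return [
        {"customer_id": "c1", "complaints": 0, "days_since_order": 12},
        {"customer_id": "c2", "complaints": 3, "days_since_order": 95},
    ]


def predict_churn(row):
    # Toy stand-in for model.predict_proba on one row of features.
    score = 0.1 + 0.2 * row["complaints"] + 0.002 * row["days_since_order"]
    return min(score, 1.0)


cache = {}  # in production: redis.set(f"churn:{customer_id}", score)
for row in fetch_features():
    cache[f"churn:{row['customer_id']}"] = predict_churn(row)
```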
&lt;h2&gt;
  
  
  Real-time Prediction integration
&lt;/h2&gt;

&lt;p&gt;Being able to push a model into production for real-time applications requires three base components: a customer/user profile, a set of triggers and predictive models.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1052%2F1%2AYTH1CUGPhrIh6aLbcSj3vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1052%2F1%2AYTH1CUGPhrIh6aLbcSj3vg.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Profile&lt;/strong&gt;: The customer profile contains all the attributes related to the customer, as well as the different attributes (e.g. counters) necessary to make a given prediction. It is required for customer-level prediction in order to reduce the latency of pulling information from multiple places, as well as to simplify the integration of machine learning models in production. In most cases, a similar type of data store would be needed in order to effectively fetch the data needed to power the prediction model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triggers&lt;/strong&gt;: Triggers are events causing the initiation of a process; for churn, they could for instance be a call to a customer service center, checking information within your order history, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt;: Models need to have been pre-trained and typically exported to one of the formats previously mentioned (e.g. pickle, ONNX, PMML) so that they can easily be ported to production.&lt;/p&gt;

&lt;p&gt;There are quite a few different approaches to putting models in production for scoring purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Relying on in-database integration&lt;/em&gt;: a lot of database vendors have made significant efforts to tie advanced analytics use cases into the database, be it through direct integration of Python or R code, or through the import of PMML models.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Exploiting a Pub/Sub model&lt;/em&gt;: the prediction model is essentially an application feeding off a data-stream and performing certain operations, such as pulling customer profile information.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Webservice&lt;/em&gt;: setting up an API wrapper around the model prediction and deploying it as a web-service. Depending on how the web-service is set up, it might or might not itself pull the data needed to power the model.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;inApp&lt;/em&gt;: it is also possible to deploy the model directly into a native or web application and have the model run on local or external datasources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Database integrations
&lt;/h3&gt;

&lt;p&gt;If the overall size of your database is fairly small (&amp;lt; 1M user profiles) and the update frequency is occasional, it can make sense to integrate some of the real-time update process directly within the database.&lt;/p&gt;

&lt;p&gt;Postgres possesses an integration called PL/Python that allows running Python code as functions or stored procedures. This implementation has access to all the libraries that are part of the PYTHONPATH, and as such is able to use libraries such as Pandas and sklearn to run some operations.&lt;/p&gt;

&lt;p&gt;This can be coupled with Postgres’ trigger mechanism to re-run the model and update the churn score. For instance, if a new entry is made to a complaint table, it would be valuable to have the model re-run in real-time.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1082%2F1%2ADWZxL_wO0pOob_bF3adzlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1082%2F1%2ADWZxL_wO0pOob_bF3adzlg.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Sequence flow
&lt;/h4&gt;

&lt;p&gt;The flow could be setup in the following way:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;New Event&lt;/em&gt;: When a new row is inserted in the complaint table, an event trigger is generated.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trigger&lt;/em&gt;: The trigger function would update the number of complaints made by this customer in the customer profile table and fetch the updated record for the customer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prediction Request&lt;/em&gt;: Based on that it would re-run the churn model through PL/Python and retrieve the prediction.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Customer Profile Update&lt;/em&gt;: It can then re-update the customer profile with the updated prediction. Downstream flows can then react upon checking whether the customer profile has been updated with a new churn prediction value.&lt;/p&gt;
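&lt;p&gt;Sketched outside the database, the four steps above amount to something like the following pure-Python simulation (table/field names and the scoring rule are hypothetical; in Postgres the update and scoring would live inside the PL/Python trigger function):&lt;/p&gt;

```python
# Pure-Python simulation of the trigger flow: new complaint -> counter update
# -> model re-run -> profile update. Names and the scoring rule are hypothetical.

profiles = {"c1": {"complaints": 1, "churn_score": 0.2}}


def churn_model(profile):
    # Stand-in for the model call done through PL/Python.
    return min(0.1 + 0.15 * profile["complaints"], 1.0)


def on_complaint_inserted(customer_id):
    # Trigger: update the complaint counter on the customer profile...
    profile = profiles[customer_id]
    profile["complaints"] += 1
    # ...re-run the churn model with the refreshed record, then write the
    # new prediction back to the profile for downstream flows to pick up.
    profile["churn_score"] = churn_model(profile)


on_complaint_inserted("c1")
```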
&lt;h4&gt;
  
  
  Technologies
&lt;/h4&gt;

&lt;p&gt;Different databases are able to support running Python scripts. This is the case for Postgres, which has a native Python integration as previously mentioned, but also for MS SQL Server through its Machine Learning Services (In-Database); other databases, such as Teradata, are able to run R/Python scripts through an external script command, while Oracle supports PMML models through its data mining extension.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pub/Sub
&lt;/h3&gt;

&lt;p&gt;Implementing real-time prediction through a pub/sub model allows you to properly handle the load through throttling. For engineers, it also means that they can just feed the event data through a single “logging” feed, to which different applications can subscribe.&lt;/p&gt;

&lt;p&gt;An example of how this could be set up is shown below:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A3Rgyzsh78S8yAAGWTz_vNg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A3Rgyzsh78S8yAAGWTz_vNg.png"&gt;&lt;/a&gt;&lt;br&gt;
The page view event is fired to a specific event topic, to which two applications subscribe: a page view counter and a prediction app. Both applications filter the events relevant to their purpose out of the topic and consume the different messages. The page view counter app provides data to power a dashboard, while the prediction app updates the customer profile.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1112%2F1%2APKuD-hTLyhG1f20UuvZb3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1112%2F1%2APKuD-hTLyhG1f20UuvZb3w.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Sequence flow:
&lt;/h4&gt;

&lt;p&gt;Event messages are pushed to the pub/sub topic as they occur, and the prediction app polls the topic for new messages. When a new message is retrieved, the prediction app requests and retrieves the customer profile, and uses the message and the profile information to make a prediction, which it ultimately pushes back to the customer profile for further use.&lt;/p&gt;

&lt;p&gt;A slightly different flow can be set up where the data is first consumed by an “enrichment app” that adds the profile information to the message and pushes it back to a new topic, to finally be consumed by the prediction app and pushed onto the customer profile.&lt;/p&gt;
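&lt;p&gt;The consumer side of this flow can be sketched as follows, with queue.Queue standing in for the broker topic and a dict for the customer profile store (all names and the scoring rule are hypothetical):&lt;/p&gt;

```python
# Sketch of a pub/sub prediction consumer. queue.Queue stands in for the
# broker topic; a dict stands in for the customer profile store.
import queue

topic = queue.Queue()
profiles = {"c1": {"page_views": 0, "churn_score": 0.5}}


def churn_model(profile):
    # Toy stand-in for a real model: more page views, less churn.
    return max(0.5 - 0.01 * profile["page_views"], 0.0)


def prediction_app():
    # Poll the topic until it is drained (a real consumer would loop forever).
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            break
        if event["type"] != "page_view":
            continue  # filter out events irrelevant to this app
        profile = profiles[event["customer_id"]]
        profile["page_views"] += 1
        profile["churn_score"] = churn_model(profile)


# Publisher side: events fired onto the topic as they occur.
topic.put({"type": "page_view", "customer_id": "c1"})
topic.put({"type": "click", "customer_id": "c1"})
topic.put({"type": "page_view", "customer_id": "c1"})

prediction_app()
```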
&lt;h4&gt;
  
  
  Technologies
&lt;/h4&gt;

&lt;p&gt;The typical open-source combination you would find supporting this kind of use case in the data ecosystem is Kafka and Spark Streaming, but a different setup is possible in the cloud. On Google Cloud, notably, Pub/Sub with Dataflow (Beam) provides a good alternative to that combination, while on Azure a combination of Azure Service Bus or Event Hub and Azure Functions can serve as a good way to consume the messages and generate these predictions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Web Service
&lt;/h3&gt;

&lt;p&gt;We can implement models in production as web-services. Implementing prediction models as web-services is particularly useful for engineering teams that are fragmented and need to handle multiple different interfaces such as web, desktop and mobile.&lt;/p&gt;

&lt;p&gt;Interfacing with the web-service could be set up in different ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;either by providing an identifier and having the web-service pull the required information, compute the prediction and return its value,&lt;/li&gt;
&lt;li&gt;or by accepting a payload, converting it to a data-frame, making the prediction and returning its value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second approach is usually recommended when there is a lot of interaction happening and a local cache is used to buffer synchronization with the backend systems, or when needing to make predictions at a different grain than a customer id, for instance when doing session-based predictions.&lt;/p&gt;

&lt;p&gt;Systems making use of local storage tend to have a reducer function, whose role is to calculate what the customer profile would be, should the events in local storage be integrated back. As such, it provides an approximation of the customer profile based on local data.&lt;/p&gt;
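&lt;p&gt;A minimal sketch of such a reducer (the profile fields and event types are hypothetical):&lt;/p&gt;

```python
# Hypothetical reducer: fold locally stored events back into the last known
# profile to approximate the up-to-date customer profile.

def reduce_profile(profile, events):
    updated = dict(profile)  # don't mutate the stored profile
    for event in events:
        if event["type"] == "page_view":
            updated["page_views"] = updated.get("page_views", 0) + 1
        elif event["type"] == "complaint":
            updated["complaints"] = updated.get("complaints", 0) + 1
    return updated


profile = {"page_views": 10, "complaints": 1}
local_events = [{"type": "page_view"}, {"type": "complaint"}, {"type": "page_view"}]
approx = reduce_profile(profile, local_events)
```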

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1362%2F1%2AWZAWyyDXsq3Uy490suEefA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1362%2F1%2AWZAWyyDXsq3Uy490suEefA.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Sequence Flow
&lt;/h4&gt;

&lt;p&gt;The flow for handling the prediction using a mobile app, with local storage can be described in 4 phases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Application Initialization (1 to 3)&lt;/em&gt;: The application initializes, makes a request to the customer profile, retrieves its initial value back, and initializes the profile in local storage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Applications (4)&lt;/em&gt;: The application stores the different events happening within the application into an array in local storage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prediction Preparation (5 to 8)&lt;/em&gt;: The application wants to retrieve a new churn prediction and therefore needs to prepare the information it must provide to the churn web-service. For that, it makes an initial request to local storage to retrieve the values of the profile and the array of events it has stored. Once they are retrieved, it makes a request to a reducer function, providing these values as arguments; the reducer function outputs an updated profile with the local events incorporated back into it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Web-service Prediction (9 to 10)&lt;/em&gt;: The application makes a request to the churn prediction web-service, providing the updated/reduced customer profile from step 8 as part of the payload. The web-service can then use the information provided in the payload to generate the prediction and output its value back to the application.&lt;/p&gt;

&lt;p&gt;There are quite a few technologies that can be used to power a prediction web-service:&lt;/p&gt;
&lt;h5&gt;
  
  
  Functions
&lt;/h5&gt;

&lt;p&gt;AWS Lambda functions, Google Cloud Functions and Microsoft Azure Functions (although Python support was in beta at the time of writing) offer an easy-to-set-up interface to deploy scalable web-services.&lt;/p&gt;

&lt;p&gt;For instance, on Azure, a prediction web-service could be implemented through a function looking roughly like this:&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
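&lt;p&gt;A rough sketch of what the core of such a function could look like (the payload fields, the toy model and the scoring rule are hypothetical; in an actual Azure Function, handle_request would be adapted to take an azure.functions.HttpRequest and return an HttpResponse):&lt;/p&gt;

```python
# Rough sketch of a prediction web-service handler. In an Azure Function the
# handler would receive an azure.functions.HttpRequest; here it takes the
# raw JSON body directly. Payload fields and scoring rule are hypothetical.
import json

# Loaded once at cold start in a real function (e.g. pickle.load from disk).
MODEL = {"intercept": 0.1, "complaints_weight": 0.2}


def predict(profile):
    # Toy linear scoring rule standing in for model.predict_proba.
    score = MODEL["intercept"] + MODEL["complaints_weight"] * profile["complaints"]
    return min(score, 1.0)


def handle_request(body: str) -> str:
    payload = json.loads(body)  # the reduced customer profile from the app
    score = predict(payload["profile"])
    return json.dumps({"churn_score": score})


response = handle_request('{"profile": {"complaints": 2}}')
```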


&lt;h5&gt;
  
  
  Container
&lt;/h5&gt;

&lt;p&gt;An alternative to functions is to deploy a Flask or Django application through a Docker container (Amazon ECS, Azure Container Instances or Google Kubernetes Engine). Azure, for instance, provides an easy way to set up prediction containers through its Azure Machine Learning service.&lt;/p&gt;

&lt;h5&gt;
  
  
  Notebooks
&lt;/h5&gt;

&lt;p&gt;Different notebook providers, such as Databricks and Dataiku, have notably worked on simplifying model deployment from their environments. They offer features for setting up a web-service in a local environment or deploying to external systems such as Azure ML Service, Kubernetes Engine, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  in App
&lt;/h3&gt;

&lt;p&gt;In certain situations, when there are legal or privacy requirements that do not allow data to be stored outside an application, or when there are constraints such as having to upload a large number of files, leveraging a model within the application tends to be the right approach.&lt;/p&gt;

&lt;p&gt;Android ML Kit or the likes of Caffe2 allow leveraging models within native applications, while TensorFlow.js and ONNX.js allow running models directly in the browser or in apps leveraging JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations
&lt;/h2&gt;

&lt;p&gt;Besides the method of deployment, there are quite a few important considerations to keep in mind when deploying models to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Complexity
&lt;/h3&gt;

&lt;p&gt;The complexity of the model itself is the first consideration. Models such as linear regression and logistic regression are fairly easy to apply and do not usually take much space to store. Using more complex models, such as a neural network or a complex ensemble of decision trees, will take more time to compute and more time to load into memory on cold start, and will prove more expensive to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Sources
&lt;/h3&gt;

&lt;p&gt;It is important to consider the differences that can occur between the data sources in production and the ones used for training. While it is important for the data used in training to be in sync with the context it will be used in in production, it is often impractical to recalculate every value so that it becomes perfectly in sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimentation framework
&lt;/h3&gt;

&lt;p&gt;It is worth setting up an experimentation framework to A/B test the performance of different models against objective metrics, and ensuring that there is sufficient tracking to accurately debug and evaluate model performance a posteriori.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Choosing how to deploy predictive models into production is quite a complex affair: there are different ways to handle their lifecycle management, different formats to store them in, multiple ways to deploy them, and a very vast technical landscape to pick from.&lt;/p&gt;

&lt;p&gt;Understanding the specific use cases, the team’s technical and analytics maturity, and the overall organization structure and its interactions helps in coming to the right approach for deploying predictive models to production.&lt;/p&gt;




&lt;p&gt;More from me on &lt;a href="https://medium.com/analytics-and-data" rel="noopener noreferrer"&gt;Hacking Analytics&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/on-the-evolution-of-data-engineering-c5e56d273e37" rel="noopener noreferrer"&gt;On the evolution of Data Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/airflow-the-easy-way-f1c26859ee21" rel="noopener noreferrer"&gt;Airflow, the easy way&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/e-commerce-analysis-data-structures-and-applications-6420c4fa65e7" rel="noopener noreferrer"&gt;E-commerce Analysis: Data-Structures and Applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/setting-up-airflow-on-azure-connecting-to-ms-sql-server-8c06784a7e2b" rel="noopener noreferrer"&gt;Setting up Airflow on Azure &amp;amp; connecting to MS SQL Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-and-data/3-simple-rules-to-build-machine-learning-models-that-add-value-61106db88461" rel="noopener noreferrer"&gt;3 simple rules to build machine learning Models that add value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
      <category>software</category>
    </item>
  </channel>
</rss>
