<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: hkad98</title>
    <description>The latest articles on Forem by hkad98 (@hkad98).</description>
    <link>https://forem.com/hkad98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F764446%2F1c4e6b51-408b-4f6d-9728-b6335f09db29.JPG</url>
      <title>Forem: hkad98</title>
      <link>https://forem.com/hkad98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hkad98"/>
    <language>en</language>
    <item>
      <title>How To Automate Your Statistical Data Analysis</title>
      <dc:creator>hkad98</dc:creator>
      <pubDate>Mon, 21 Mar 2022 12:26:41 +0000</pubDate>
      <link>https://forem.com/hkad98/how-to-automate-your-statistical-data-analysis-2556</link>
      <guid>https://forem.com/hkad98/how-to-automate-your-statistical-data-analysis-2556</guid>
      <description>&lt;p&gt;During my university studies, I attended a course named Statistical Data Analysis. I was excited about this course because it taught me all the basic statistical analysis methods such as (non-)linear regression, ANOVA, MANOVA, LDA, PCA, etc. However, I never learned about the business application of these methods. During the course, we worked with several examples. Still, all the samples were CSV datasets, mainly from &lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt;. I wondered how I could perform some statistical analysis over a database such as Postgres and automate it by creating a pipeline, trying all possible combinations of columns, yielding a result saying which models are better than others. Years passed by, and now I am here to present you with my proof of concept of statistical data analysis automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Statistical Method Will Be Automated?
&lt;/h2&gt;

&lt;p&gt;For my proof of concept, I chose to automate linear regression. I decided on linear regression because I consider this method an entry point to statistical data analysis. Since linear regression has assumptions, I can easily demonstrate how I imagine sorted models from the best to the worst.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Linear Regression, and What Are Its Assumptions?
&lt;/h3&gt;

&lt;p&gt;Simple linear regression is a statistical method that models a relationship between two continuous variables, and if there are multiple variables, we call it multiple linear regression. The most straightforward approach for linear regression is ordinary least squares estimation. &lt;/p&gt;

&lt;p&gt;We can extend ordinary least squares estimation with regularization. Among the best-known regularization techniques are the shrinkage methods (ridge, lasso). They are called shrinkage methods because they force the estimated parameters to shrink towards zero, which reduces variance.&lt;/p&gt;
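To make the shrinkage effect concrete, here is a minimal NumPy sketch (not part of the original pipeline; the data and the alpha value are made up for illustration) that fits ridge regression in closed form and shows the coefficients shrinking towards zero as the regularization strength grows:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

coef_ols = ridge_fit(X, y, alpha=0.0)      # plain least squares
coef_ridge = ridge_fit(X, y, alpha=50.0)   # heavily regularized

# The ridge coefficients are shrunken towards zero relative to OLS.
print(np.abs(coef_ridge).sum() < np.abs(coef_ols).sum())  # True
```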

&lt;p&gt;Let's remind ourselves what the goal of the ordinary least squares estimation method is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DAecvc3m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l9ivg3tr01x8779nxm7x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DAecvc3m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l9ivg3tr01x8779nxm7x.jpg" alt="The visualized goal of the ordinary least squares estimation method." width="880" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the figure above, we can see points (red) in space and a line (black). We define a residual (the s values) as the distance between the predicted value and the observed point. The ordinary least squares estimation method aims to find the line (its coefficients – intercept and slope) that minimizes the sum of the squared residuals.&lt;/p&gt;
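As a quick illustration (independent of the GoodData pipeline, with made-up sample data), the intercept and slope that minimize the sum of squared residuals can be computed directly with NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly y = 2x

# Design matrix with a column of ones for the intercept term.
A = np.column_stack([np.ones_like(x), x])

# The least-squares solution minimizes sum((A @ beta - y)**2).
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (intercept + slope * x)

print(slope, intercept)  # slope ≈ 1.99, intercept ≈ 0.05
```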

&lt;p&gt;In this article, I will consider four assumptions of linear regression. I should mention that some modifications of the method are not as strict about the assumptions and may require them only in a weaker form than standard linear regression does.&lt;/p&gt;

&lt;h4&gt;
  
  
  The four assumptions for linear regression are:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Linearity

&lt;ul&gt;
&lt;li&gt;the existence of a linear relationship between variables&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Homoscedasticity

&lt;ul&gt;
&lt;li&gt;constant variance – the variance of the residuals does not depend on the predictor variables&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Independence

&lt;ul&gt;
&lt;li&gt;residuals are independent of each other&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Normality

&lt;ul&gt;
&lt;li&gt;residuals are normally distributed&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
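The four assumptions can be checked programmatically. The sketch below uses only NumPy and deliberately crude checks (Pearson correlation for linearity, correlation of absolute residuals with the predictor for homoscedasticity, the Durbin–Watson statistic for independence, and sample skewness/kurtosis for normality); a real pipeline would likely use formal tests such as Shapiro–Wilk or Breusch–Pagan, and the thresholds here are arbitrary illustrative choices:

```python
import numpy as np

def check_assumptions(x, y):
    """Return a dict of crude pass/fail checks for the four assumptions."""
    # Fit simple OLS to obtain the residuals.
    slope, intercept = np.polyfit(x, y, 1)
    res = y - (slope * x + intercept)

    # Linearity: strong Pearson correlation between x and y.
    linearity = abs(np.corrcoef(x, y)[0, 1]) > 0.8

    # Homoscedasticity: |residuals| should not correlate with x.
    homoscedasticity = abs(np.corrcoef(x, np.abs(res))[0, 1]) < 0.5

    # Independence: Durbin-Watson statistic close to 2.
    dw = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)
    independence = 1.0 < dw < 3.0

    # Normality: sample skewness and excess kurtosis near zero.
    z = (res - res.mean()) / res.std()
    normality = abs(np.mean(z ** 3)) < 1.0 and abs(np.mean(z ** 4) - 3.0) < 1.5

    return {"linearity": linearity, "homoscedasticity": homoscedasticity,
            "independence": independence, "normality": normality}

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
print(check_assumptions(x, y))  # all four checks pass for this data
```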

&lt;h2&gt;
  
  
  What Technologies Am I Going To Use?
&lt;/h2&gt;

&lt;p&gt;I'm enthusiastic about Python, so my programming language of choice is evident. I chose to use &lt;a href="https://www.gooddata.com/developers/cloud-native/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddata_statistical_data_analysis_21032022&amp;amp;utm_content=autor_jan"&gt;GoodData.CN&lt;/a&gt;, a cloud-native analytics platform.&lt;/p&gt;

&lt;p&gt;The reasons why I chose GoodData.CN are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.gooddata.com/blog/how-build-logical-data-models-scale-analytical-applications/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddata_statistical_data_analysis_21032022&amp;amp;utm_content=autor_jan"&gt;The logical data model (LDM)&lt;/a&gt; holds information about attributes, facts, datasets (tables), and their relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://help.gooddata.com/doc/enterprise/en/dashboards-and-insights/maql-analytical-query-language/get-started-with-using-maql-to-write-metrics/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddata_statistical_data_analysis_21032022&amp;amp;utm_content=autor_jan"&gt;Metrics&lt;/a&gt; – GoodData.CN allows users to create, store and reuse metrics. Metrics are aggregation functions that do not require any joins and adapt to the context, thanks to LDM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.gooddata.com/blog/python-sdk-for-composable-and-reusable-analytics/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddata_statistical_data_analysis_21032022&amp;amp;utm_content=autor_jan"&gt;GoodData's Python SDK&lt;/a&gt; allows me to work with LDM, metrics, and access data from GoodData.CN in Python pandas data frame.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J_-jmVoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i29gnc9qiaj6pgpkewif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J_-jmVoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i29gnc9qiaj6pgpkewif.png" alt="An example of Logical Data Model" width="880" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of a Logical Data Model&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i4lrIt8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s9ru9eym0as18s72wtgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i4lrIt8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s9ru9eym0as18s72wtgh.png" alt="An example of a revenue metric containing another metric order amount showing that we can reuse metrics in other metrics." width="629" height="50"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of a revenue metric containing another metric order amount showing that we can reuse metrics in other metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j7_uV_N0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vyp3ybrozsthyl7zb1z3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j7_uV_N0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vyp3ybrozsthyl7zb1z3.png" alt="The figure above shows an example of a pandas data frame containing attribute campaign name, fact price, and metric revenue." width="380" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above shows an example of a pandas data frame containing attribute campaign name, fact price, and metric revenue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3faN92Iy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/37khrp092sldx80cxpme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3faN92Iy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/37khrp092sldx80cxpme.png" alt="Note that metric revenue without any joins is used, and even though facts budget and price are both in different datasets, we get valid results." width="374" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that metric revenue without any joins is used, and even though facts budget and price are both in different datasets, we get valid results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Automate Our Statistical Data Analysis
&lt;/h2&gt;

&lt;p&gt;I will now describe my proposed pipeline, and then I shall present my results. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate all possible combinations without repetition – triplets consisting of exactly one attribute and two metrics/facts. Thanks to the logical data model and the Python SDK, this task becomes trivial: the logical data model tells me which attributes, facts, and metrics I can combine.&lt;/li&gt;
&lt;li&gt;For every combination, get a data frame using GoodData Python SDK.&lt;/li&gt;
&lt;li&gt;For every data frame combination, evaluate linear regression assumptions.&lt;/li&gt;
&lt;li&gt;Sort triplets in descending order of the number of valid assumptions.&lt;/li&gt;
&lt;li&gt;Enjoy your analyzed data.&lt;/li&gt;
&lt;/ol&gt;
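The five steps above can be sketched end to end. Everything here is a toy stand-in: in the real pipeline, the catalog and data frames come from GoodData's Python SDK, and a scorer would fit a regression and run the statistical tests (the names and fake scores below are illustrative only):

```python
import itertools

# Toy catalog (in reality this comes from the logical data model).
attributes = ["campaign_name"]
numbers = ["price", "budget", "revenue"]  # metrics and facts together

def count_valid_assumptions(triplet):
    # Hypothetical scorer: a real implementation would fetch the data
    # frame for `triplet` and evaluate the four regression assumptions,
    # returning how many of them hold (0-4). Scores here are made up.
    fake_scores = {("campaign_name", "price", "budget"): 4,
                   ("campaign_name", "price", "revenue"): 1,
                   ("campaign_name", "budget", "revenue"): 3}
    return fake_scores[triplet]

# Step 1: triplets of exactly one attribute and two metrics/facts.
triplets = [(a, *pair)
            for a in attributes
            for pair in itertools.combinations(numbers, 2)]

# Steps 2-4: evaluate each triplet and sort best-first.
ranked = sorted(triplets, key=count_valid_assumptions, reverse=True)
print(ranked[0])  # ('campaign_name', 'price', 'budget') – 4 valid assumptions
```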

&lt;p&gt;My proposed pipeline seems quite simple, but the tricky parts are hidden in the details – which statistical tests are used for assumption validation, the significance thresholds for those tests, etc. Such details are a matter of personal choice.&lt;/p&gt;

&lt;p&gt;In the following section, I will demonstrate how simple it is to implement parts of the pipeline mentioned above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;content_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog_workspace_content&lt;/span&gt;
&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_full_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get attributes
&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get metrics
&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;

&lt;span class="c1"&gt;# Get facts
&lt;/span&gt;&lt;span class="n"&gt;facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;facts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;facts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see above, using a few lines of code, you can access workspace content – LDM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Numeric&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;facts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;combinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;valid_objects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute_valid_objects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attribute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;attribute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_label_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"label/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Triplet&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the compute valid objects method, I can find out which other LDM entities can be added to a pair of metrics/facts to form a valid, computable combination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GoodPandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'HOST'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'TOKEN'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df_factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_frames&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;combinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_combinations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;as_computable_dictionary&lt;/span&gt;
&lt;span class="n"&gt;data_frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;not_indexed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above is an example of getting a non-indexed pandas data frame from one of the generated triplet combinations.&lt;/p&gt;

&lt;p&gt;You can find the code implementing the pipeline above with a script for setting up the GoodData.CN CE environment in the &lt;a href="https://github.com/hkad98/statistical-data-analysis-automation"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C-3zh4MI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddirceri0lxfzwg0hoj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C-3zh4MI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddirceri0lxfzwg0hoj1.png" alt="The figure above shows the linear regression returned as the best one with all four passed assumptions." width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above shows the linear regression returned as the best one with all four passed assumptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UcZ9qPQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcsoian4v14tkd17n46b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UcZ9qPQ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dcsoian4v14tkd17n46b.png" alt="The above example shows the worst linear regression with only one valid assumption - independence." width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above example shows the worst linear regression with only one valid assumption - independence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Potential Simplifications
&lt;/h3&gt;

&lt;p&gt;I would like to address the possible simplifications I noted while writing the code. The first simplification arises from generating combinations: in my example, I can omit combinations containing the primary keys of datasets because their role is purely technical.&lt;/p&gt;

&lt;p&gt;Furthermore, I found that generating combinations is time-consuming for massive databases, so I propose creating the combinations only once, storing them (using Pickle, for example), and then just loading them. I added an option to cache combinations to the code on GitHub, so I encourage you to cache combinations when trying it out. When you add a new column to your database, you can regenerate all combinations, or just generate the combinations related to the new column and add them to the existing ones.&lt;/p&gt;

&lt;p&gt;My last simplification relates to assumption validation. Three out of the four assumptions need residuals, which means they require an existing linear regression model. The linearity assumption does not require residuals, so no model has to be fitted first. We can validate it using, e.g., Pearson's correlation coefficient, and if linearity is not satisfied, we can skip the other assumptions and speed up the pipeline.&lt;/p&gt;
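This early-exit idea can be sketched in a few lines (the 0.3 cutoff is an arbitrary illustrative threshold, not a recommendation):

```python
import numpy as np

def worth_fitting(x, y, r_threshold=0.3):
    """Cheap linearity pre-check: skip model fitting when |r| is too low."""
    r = np.corrcoef(x, y)[0, 1]
    return abs(r) >= r_threshold

rng = np.random.default_rng(7)
x = rng.normal(size=500)

print(worth_fitting(x, 2 * x + rng.normal(size=500)))  # clearly linear -> True
print(worth_fitting(x, rng.normal(size=500)))          # pure noise -> False
```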

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;If you've come this far, thank you for reading about my proof of concept. I consider statistical methods essential for data analysis, and I find it convenient to simplify the work using the right technologies and to automate the research process. I welcome any feedback on this topic, and I will be pleased to hear from you if you have come across solutions to similar problems. I encourage you to check out my proof of concept and try it for yourself.&lt;/p&gt;

&lt;p&gt;I look forward to seeing your modifications of statistical tests. &lt;/p&gt;

</description>
      <category>python</category>
      <category>computerscience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MAQL: Powerful Analytical Querying Made Simple</title>
      <dc:creator>hkad98</dc:creator>
      <pubDate>Mon, 29 Nov 2021 16:05:28 +0000</pubDate>
      <link>https://forem.com/hkad98/maql-powerful-analytical-querying-made-simple-1eik</link>
      <guid>https://forem.com/hkad98/maql-powerful-analytical-querying-made-simple-1eik</guid>
      <description>&lt;p&gt;If you are reading this you may know what MAQL is and want to know more or, alternatively, you may have never heard of MAQL but are keen to find out what it is. Either way, welcome and let me introduce you to MAQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is MAQL?
&lt;/h2&gt;

&lt;p&gt;MAQL is GoodData's analytical querying language. It simplifies querying over multidimensional data and provides predefined functions ranging from simple aggregations to more complex ones like skewness or kurtosis.&lt;/p&gt;

&lt;p&gt;You’re probably thinking that the same thing can be done using SQL, and you would be right, but MAQL makes it significantly easier. When you are carrying out multidimensional analysis using SQL, you have to be careful about querying and joining your data. MAQL, on the other hand, uses a semantic model that allows you to omit many of these things, making querying much easier. With this in mind, let me first tell you what the semantic model is and why it is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic Model
&lt;/h2&gt;

&lt;p&gt;When a database is modeled, the first step is to create a conceptual model. The conceptual model helps database architects understand the structure of the database. MAQL uses this knowledge as well and calls it a semantic model. &lt;a href="https://www.gooddata.com/blog/deliver-efficient-data-consumption-semantic-model/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddatacn_maql_29112021&amp;amp;utm_content=autor_jan"&gt;The semantic model&lt;/a&gt; plays the same role for MAQL that the conceptual model plays for database architects: it gives MAQL knowledge of the database – its attributes and the relationships between tables. This knowledge is crucial for efficient and user-friendly querying of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  MAQL Syntax and Usage
&lt;/h2&gt;

&lt;p&gt;MAQL syntax is similar to SQL syntax, but with one main difference: MAQL does away with keywords like FROM and JOIN. Why? Thanks to the semantic model, they are simply not required: the semantic model already knows the relations and where each attribute occurs. Let's take a look at an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qd740e0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ci9rtcp69vifq2usig10.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qd740e0J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ci9rtcp69vifq2usig10.gif" alt="Image description" width="716" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gAamZ0CH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ihisukozjq6ddcyz3fg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gAamZ0CH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ihisukozjq6ddcyz3fg.gif" alt="Image description" width="712" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen above, MAQL significantly simplifies the querying process. All we need to tell MAQL is what we want to do: the rest is up to the MAQL engine, which uses the semantic model to query our data. A MAQL query is called a metric: an aggregation producing a single number.&lt;/p&gt;
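For a flavor of the syntax, a metric summing a fact with a simple filter might look like the sketch below (the object identifiers `fact/price` and `label/order_status` are hypothetical; consult the MAQL documentation for the exact identifiers in your workspace). Note the absence of FROM and JOIN:

```
SELECT SUM({fact/price}) WHERE {label/order_status} = "delivered"
```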

&lt;p&gt;Another crucial advantage of MAQL is that you can specify the "ground truth analysis" (i.e., single source of truth) of your data by creating and storing metrics. Stored metrics can be reused in other metrics, as can be seen below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wQ7z9wFG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97waul5tpd2l456alp21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wQ7z9wFG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97waul5tpd2l456alp21.png" alt="Image description" width="880" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another use for stored metrics is to provide results to other data consumers such as data analysts, data scientists, data engineers, and the relevant data consumption tools. The main idea can be seen in the picture below, where these tools access stored metrics via an API, and all data consumers use the same "ground truth analysis". This approach prevents duplicate analyses and simplifies work for data consumers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ej0tkWNl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oqkf369e5z9hd9u53109.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ej0tkWNl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oqkf369e5z9hd9u53109.png" alt="Image description" width="880" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Learn More?
&lt;/h2&gt;

&lt;p&gt;MAQL forms part of the &lt;a href="https://www.gooddata.com/headless-bi/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddatacn_maql_29112021&amp;amp;utm_content=autor_jan"&gt;headless BI&lt;/a&gt; family, with headless BI providing consistent, real-time insights to an array of different data consumers. To try out MAQL and see it in action, simply pull the GoodData.CN docker image. For help and further info, see the &lt;a href="https://www.gooddata.com/developers/cloud-native/doc/?utm_medium=blogpost&amp;amp;utm_source=dev.to&amp;amp;utm_campaign=gooddatacn_maql_29112021&amp;amp;utm_content=autor_jan"&gt;MAQL documentation&lt;/a&gt;, step-by-step tutorials, community forum, and community Slack channel. It is worth trying MAQL on the demo data included in the docker image, on your own data, or on any data you can find. Get as creative as you like.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>analytics</category>
      <category>docker</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
