Leveling with cluster analysis in Python: basic Python concepts

Hilton Fernandes — Thu, 06 Nov 2025 21:36:06 +0000

This is the 2nd in a series of 5 short articles that present a simple idea about time series and its implementation in Python. The purpose is to present both a time series problem and how it can be solved with simple Python code.

Only very basic knowledge of Python and time series is needed, as most concepts will be explained with care and with references to longer tutorials.

Roadmap

The 1st article of the series presented the general concepts behind the solution. This one, the 2nd, presents the basic Python concepts and techniques to be used in the solution. The 3rd will present a solution implemented in Python. The 4th will add a sinusoidal decomposition of the data after the filtering done by the solution. And the 5th and last will use all these elements to address a real problem in cryptocurrencies.

Some simple Python ideas

Libraries, modules and submodules

Isaac Newton, who created a large part of modern Physics and Mathematics, once said that he could see further because he stood on the shoulders of giants.

This idea is behind most software: programs do not create everything from scratch, they reuse a large part of what was already created, mainly in the form of software libraries. Python is very good at this, and here is the part of the code that brings in some libraries. In Python, a library is usually made available as one or more modules.

import numpy as np 

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

The idea that the current code brings in code from elsewhere is expressed by the word import in the code above. Another important point is that programmers usually prefer to write less. So, in the 1st line, the library numpy is given the shorter name np. In the next import line, the module matplotlib.pyplot is given the name plt.

And the last line of the code shows another kind of shortening: instead of giving a module a shorter name, Python lets a programmer pick only what is needed from it. In this case, only KMeans is picked from sklearn.cluster.

A less important point is that a dot selects a part of a module, that is, a submodule. In the code shown, matplotlib.pyplot means the submodule pyplot of the module matplotlib. And, of course, sklearn.cluster is the submodule cluster of the module sklearn. Creating submodules improves the organization of larger modules, as it divides them into specialized parts.
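The three import forms described above can be tried without installing anything. The following is only a small sketch that uses the standard library submodule os.path in place of the libraries of this article; the file name prices.csv is a made-up example.

```python
# Three ways to reach the same function, using only the standard library.
import os.path              # import a submodule and use its full dotted name
import os.path as osp       # the same submodule under a shorter alias
from os.path import join    # pick a single name out of the submodule

# "prices.csv" is a hypothetical file name used only for illustration.
full = os.path.join("data", "prices.csv")
short = osp.join("data", "prices.csv")
picked = join("data", "prices.csv")

# The three spellings build the same path ...
print(full == short == picked)
# ... because the three names refer to the very same function object.
print(os.path.join is osp.join is join)
```

Which form to use is mostly a matter of taste: an alias keeps the origin of a name visible, while picking a single name gives the shortest code.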

Pieces of information in scalar variables

The next part of the code deals with storing the information that will be processed. Here is a list of scalar variables, that is, variables that each hold a single value:

coeff: float = 0.25
base: float = 0.0
ladder: float = 1.0
n_points: int = 101
n_half: int = n_points // 2
delta: float = 1.0 / n_points

There are two types of information in this code fragment. The first is int, which holds only integer values: in this case the count of points, n_points, and half of that count, n_half.

The double slash in the line for n_half makes sure that the division of n_points by 2 produces a number of type int, and not a float. That is, n_half will hold 50, and not 50.5, which is what the ordinary division of 101 by 2 would produce.

The second type, float, holds numbers with a fractional part. For instance, coeff holds a coefficient, in this case 0.25. Since the problem here is to represent a discontinuity, one set of values will be close to a base level, held in base, while the other values will be above it by the amount held in ladder.

And finally, delta holds the step that will be used as a clock tick in our time series.
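The difference between the two division operators used for n_half and delta can be checked directly. This is just a small sketch; the names true_division and floor_division are illustrative and not part of the article's code.

```python
n_points: int = 101

true_division = n_points / 2     # "/" always produces a float
floor_division = n_points // 2   # "//" rounds down; int when both operands are int

print(true_division, type(true_division).__name__)    # 50.5 float
print(floor_division, type(floor_division).__name__)  # 50 int

# Floor division rounds toward negative infinity, not toward zero.
print(-101 // 2)  # -51
```

An int result matters here because n_half will later be used as a position inside an array, and positions must be whole numbers.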

A data generator

A time series is usually a series of data collected over time, for instance the mean wage in each year. But to avoid the need to get real data, this code will generate its own data by means of a pseudorandom number generator, aka PRNG. In a few words, a PRNG is a mathematical algorithm that generates a sequence of numbers without any apparent pattern, that is, they look random. It's called pseudorandom because, if such an algorithm is fed with the same seed, it will always generate the same sequence of numbers.

In this case it is:

rng = np.random.default_rng(42)

In this code, rng is the name of the number factory, and it's created by calling the function (or piece of code) default_rng with the parameter 42, which is the seed. This function is in the submodule np.random.
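The role of the seed can be seen in a small experiment. The sketch below assumes only numpy; the names rng_a, rng_b and rng_c and the second seed, 7, are arbitrary choices made for the illustration.

```python
import numpy as np

# Two generators created with the same seed ...
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
# ... and a third one with a different seed (7 is an arbitrary choice).
rng_c = np.random.default_rng(7)

sample_a = rng_a.uniform(-0.5, 0.5, size=5)
sample_b = rng_b.uniform(-0.5, 0.5, size=5)
sample_c = rng_c.uniform(-0.5, 0.5, size=5)

print(np.array_equal(sample_a, sample_b))  # True: same seed, same sequence
print(np.array_equal(sample_a, sample_c))  # False: different seed
```

Fixing the seed makes the charts in these articles reproducible: anyone who runs the code gets the same numbers.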

Pieces of information in arrays

Since a time series contains many values, a single scalar variable can't hold it. The module numpy offers arrays, or ndarrays, for this purpose. In an array, the elements are identified by an index, which is analogous to the apartment number in a building. In Python, indexes start at 0.

The following code fragment creates the arrays needed for this problem:

x: np.ndarray = np.linspace(0.0, 1.0, n_points)
y: np.ndarray = rng.uniform(-0.5, 0.5, size=n_points)
y[n_half:] += ladder

In the 1st line, the array x receives n_points (that is, 101) evenly spaced numbers from 0.0 to 1.0, in increments of 0.01: 0.0, 0.01, 0.02, ..., 0.99, 1.0.

In the 2nd line, the array y receives n_points numbers chosen pseudorandomly between -0.5 and 0.5.

In the 3rd line, the 2nd half of y, from index n_half to the end, is increased by ladder. That creates the discontinuity to be dealt with in the remaining articles.
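The three statements about x and y can be verified with a few checks. This is a sketch that repeats the article's values so that it runs on its own; the name step is introduced only for the illustration.

```python
import numpy as np

n_points = 101
n_half = n_points // 2
ladder = 1.0
rng = np.random.default_rng(42)

x = np.linspace(0.0, 1.0, n_points)
y = rng.uniform(-0.5, 0.5, size=n_points)
y[n_half:] += ladder   # shift every element from index 50 to the end

# linspace includes both end points, so the spacing is 1 / (n_points - 1).
step = x[1] - x[0]
print(x.shape, round(step, 6))

# The 1st half stays within -0.5 and 0.5; the 2nd half is moved up by ladder.
print(y[:n_half].min() >= -0.5, y[:n_half].max() < 0.5)
print(y[n_half:].min() >= 0.5, y[n_half:].max() <= 1.5)
```

Note that the spacing of x is 1 / (n_points - 1) = 0.01, because both ends are included, while the variable delta defined earlier is 1.0 / n_points, about 0.0099. The two are close but not equal.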

Plotting the data

Please consider the following code fragment:

fig, ax = plt.subplots()
ax.plot(x, y)

ax.set_xlabel('x')
ax.set_ylabel('y')

plt.grid(visible=True)
plt.show()

The 1st line is a generic one that can be used to create plots, aka charts, much more complicated than the ones used here, for instance several plots in the same image.

The 2nd line plots the elements of the two arrays, connecting consecutive points with lines.

The 3rd and 4th lines of the code label the x and y axes.

The 5th line of the code creates a grid to make the data easier to read. And the 6th and final line causes the chart to be shown on the screen.

The final result is like this:
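Separately from the chart above, the remark that the 1st line of the plotting code can hold several plots in the same image can be illustrated with a small sketch. The histogram in the 2nd panel is an addition made only for this example, and the Agg backend is an assumption that lets the code run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove for a window
import matplotlib.pyplot as plt
import numpy as np

# Same synthetic data as in the article, repeated so the sketch runs alone.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 101)
y = rng.uniform(-0.5, 0.5, size=101)
y[50:] += 1.0

# One figure holding two charts side by side: 1 row, 2 columns.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(x, y)             # the series, points joined by lines
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].grid(visible=True)

axes[1].hist(y, bins=10)       # how often each range of y values occurs
axes[1].set_xlabel("y")
axes[1].set_ylabel("count")

fig.tight_layout()
```

With an interactive backend, plt.show() displays the figure; fig.savefig with a file name stores it as an image instead.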

Leveling with cluster analysis in Python: general concepts

Hilton Fernandes — Thu, 30 Oct 2025 18:52:11 +0000

Financial markets have discontinuities: sometimes a price jumps up or down in so short a time that the jump would be a true discontinuity if the time measured by our clocks were really continuous, like the real line.

Those discontinuities create problems for many forms of mathematical modelling, since the models are based upon continuous functions. For instance, many price oscillations look like periodic functions, but when a discontinuity is present, harmonic analysis becomes troublesome.

Actually, a trend can also be troublesome when fitting periodic functions to financial data. But in this case, fitting a polynomial of low degree to the data can filter out the trend, and then a series of periodic functions can be fitted to the residuals, that is, the differences between the original data and the fitted polynomial.
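The removal of a trend with a low-degree polynomial can be sketched with numpy alone. The data below are synthetic and every name and number in the sketch (t, series, the seed, the period 2.5) is an assumption made for the illustration, not something taken from real prices.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
t = np.linspace(0.0, 10.0, 200)

# Synthetic series: linear trend + one sinusoid + small noise.
trend = 2.0 + 0.5 * t
cycle = np.sin(2.0 * np.pi * t / 2.5)
series = trend + cycle + rng.normal(0.0, 0.1, size=t.size)

# Fit a polynomial of degree 1 and evaluate it on the same time axis.
coefficients = np.polyfit(t, series, deg=1)
fitted_trend = np.polyval(coefficients, t)

# The residuals keep the oscillation but no longer drift upward.
residuals = series - fitted_trend

print(coefficients)        # slope first, then intercept
print(residuals.mean())    # very close to zero
```

The fitted slope is close to, but not exactly, the 0.5 used to build the data, because the sinusoid and the noise also influence the fit.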

The purpose of this series of articles is to present a simple method to eliminate jumps from the observed data. Of course, when the data are reconstructed from the fitted model, the discontinuity is added back.

Only very basic knowledge of Python and time series is needed, as most concepts will be explained with care and with references to longer tutorials.

Roadmap

This one, the 1st of 5 short articles, introduces the general concepts behind the solution. The 2nd will present the basic Python concepts and techniques to be used in the solution. The 3rd will present a solution implemented in Python. The 4th will add a sinusoidal decomposition of the data after the filtering done by the solution. And the 5th and last will use all these elements to address a real problem in cryptocurrencies.

Cluster analysis as a means to group similar data

Cluster analysis is a well-known technique for grouping data elements based on their similarities. In a metric space, being similar means being at a small distance. There are several ways to form the groups, or clusters, of data, and one of the simplest is called k-means clustering. In very few words, it represents each cluster by a centroid, a point whose coordinates are the mean of the coordinates of the points in the cluster, and assigns every point to the cluster with the nearest centroid. Throughout these articles we shall use only k-means clustering.

The following image is a typical two-dimensional representation of two groups.

The points are in blue, and the centroids are in red.
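To make the idea of centroids concrete, here is a minimal sketch of the k-means iteration written with numpy only. It is a simplified illustration, not the implementation found in scikit-learn; the function k_means, its parameters, the way the starting centroids are chosen, and the two synthetic clouds are all assumptions made for the example.

```python
import numpy as np

def k_means(points: np.ndarray, k: int, n_iter: int = 20, seed: int = 0):
    """Simplified k-means: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start with one random point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[nearest.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Each centroid becomes the mean of the points assigned to it
        # (a cluster left without points keeps its previous centroid).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Two synthetic clouds around (0, 0) and (3, 3); the values are arbitrary.
rng = np.random.default_rng(42)
cloud_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
cloud_b = rng.normal(loc=(3.0, 3.0), scale=0.5, size=(50, 2))
points = np.vstack([cloud_a, cloud_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

The two printed centroids should lie near the centers used to generate the clouds. A library implementation adds refinements, such as stopping as soon as the assignments no longer change.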

Cluster analysis in a curve

Since k-means clustering is based upon the distances between points, an interesting effect happens when the points lie along a curve: they are then much closer to each other than points dispersed in a cloud, like the ones in the previous image.

Please consider the following image that shows a time series with a discontinuity.

When a k-means cluster analysis is applied to it, the centroids of the clusters are shown in red.

It's easy to see that there are two groups at different levels, as shown in the following image:

Group 2 is around the green line, while Group 1 is around the red line.

Then, to eliminate the discontinuity, it's enough to lower Group 1 to the level of Group 2, that is, to subtract from the y coordinate of its points the difference between the levels of the two groups.

That can be shown in the following image:

Now no difference in level can be seen between the two groups of points.
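The leveling step can be sketched in a few lines. This is only one possible reading of the idea and not necessarily the solution presented in the 3rd article: it assumes that the (x, y) points are clustered with KMeans from scikit-learn and that the level of a group is the y coordinate of its centroid. The synthetic data and the names jump and y_leveled are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic series: uniform noise plus a jump of 1.0 in the 2nd half.
rng = np.random.default_rng(42)
n_points = 101
x = np.linspace(0.0, 1.0, n_points)
y = rng.uniform(-0.5, 0.5, size=n_points)
y[n_points // 2:] += 1.0

# Cluster the (x, y) points into two groups.
points = np.column_stack([x, y])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = model.labels_

# Assumed definition: the level of a group is the y of its centroid.
levels = model.cluster_centers_[:, 1]
upper, lower = levels.argmax(), levels.argmin()
jump = levels[upper] - levels[lower]

# Lower the upper group by the difference between the two levels.
y_leveled = y.copy()
y_leveled[labels == upper] -= jump

print(round(jump, 3))
print(y_leveled[labels == upper].mean() - y_leveled[labels == lower].mean())
```

The estimated jump should be close to the 1.0 used to build the data, and after the subtraction the mean levels of the two groups practically coincide. Keeping the value of jump allows the discontinuity to be added back later.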

Next step

The next article in this series will introduce the basic Python concepts needed to create the 1st of the images presented here, and also used in the other articles.
