<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Salman Khan</title>
    <description>The latest articles on Forem by Salman Khan (@salmank).</description>
    <link>https://forem.com/salmank</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1178279%2F09404a27-6f65-4b74-abd1-6327efe90d86.jpg</url>
      <title>Forem: Salman Khan</title>
      <link>https://forem.com/salmank</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/salmank"/>
    <language>en</language>
    <item>
      <title>A Practical Guide to Hypothesis Testing I: Association - From Fisher’s Exact Test to the Chi-Square</title>
      <dc:creator>Salman Khan</dc:creator>
      <pubDate>Sun, 07 Sep 2025 11:25:51 +0000</pubDate>
      <link>https://forem.com/salmank/a-practical-guide-to-testing-association-in-2x2-tables-from-fishers-exact-test-to-the-chi-square-5hmk</link>
      <guid>https://forem.com/salmank/a-practical-guide-to-testing-association-in-2x2-tables-from-fishers-exact-test-to-the-chi-square-5hmk</guid>
      <description>&lt;p&gt;In experimental research and A/B testing, analysts frequently compare two independent groups on a binary outcome such as success/failure, conversion/no conversion, or alive/dead. The situation has three ingredients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two independent groups (for example, treatment and control)&lt;/li&gt;
&lt;li&gt;A binary outcome variable (for example, death/survival)&lt;/li&gt;
&lt;li&gt;A comparison of proportions between the groups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data naturally form a 2×2 contingency table, and the central inferential question is whether any observed difference in proportions reflects a real effect or mere chance. &lt;/p&gt;

&lt;p&gt;Two of the most prominent methods for answering this question are &lt;em&gt;Fisher's Exact Test&lt;/em&gt; and the &lt;em&gt;Chi-Square Test of Independence&lt;/em&gt;. While both tests address the same core hypothesis, they are founded on different statistical philosophies and assumptions. This article provides a practical guide to navigating this choice. We will demystify the underlying principles of each test, clarify their assumptions, and address common misconceptions. &lt;/p&gt;

&lt;p&gt;To illustrate the practical application and differences between these tests, we will use a clinical trial example throughout this guide. Imagine a study comparing a new treatment to a control, with the one-sided hypothesis p&lt;sub&gt;T&lt;/sub&gt; &amp;lt; p&lt;sub&gt;C&lt;/sub&gt; (example adapted from &lt;a href="https://learning.edx.org/course/course-v1:MITx+6.419x+3T2025/home" rel="noopener noreferrer"&gt;MITx 6.419x&lt;/a&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Treatment&lt;/th&gt;
&lt;th&gt;Control&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Death&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survive&lt;/td&gt;
&lt;td&gt;30,961&lt;/td&gt;
&lt;td&gt;30,937&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;31,000&lt;/td&gt;
&lt;td&gt;31,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The observed mortality rates are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treatment: p̂&lt;sub&gt;T&lt;/sub&gt; = 39/31,000 ≈ 0.126%&lt;/li&gt;
&lt;li&gt;Control: p̂&lt;sub&gt;C&lt;/sub&gt; = 63/31,000 ≈ 0.203%&lt;/li&gt;
&lt;li&gt;Risk difference: p̂&lt;sub&gt;T&lt;/sub&gt; − p̂&lt;sub&gt;C&lt;/sub&gt; ≈ −0.077%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although the treatment group shows a lower mortality rate, we must employ statistical testing to determine whether this difference reflects a real effect or could plausibly have arisen by chance under the null hypothesis.&lt;/p&gt;
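&lt;p&gt;As a quick sanity check, the observed rates and risk difference above can be reproduced with a few lines of plain Python (nothing assumed beyond the table):&lt;/p&gt;

```python
# Counts taken from the 2x2 table above
deaths_t, n_t = 39, 31_000   # treatment arm
deaths_c, n_c = 63, 31_000   # control arm

p_t = deaths_t / n_t  # treatment mortality rate
p_c = deaths_c / n_c  # control mortality rate

print(f"Treatment:       {p_t:.3%}")        # 0.126%
print(f"Control:         {p_c:.3%}")        # 0.203%
print(f"Risk difference: {p_t - p_c:.3%}")  # -0.077%
```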
&lt;h2&gt;Fisher's Exact Test&lt;/h2&gt;

&lt;p&gt;Fisher's exact test is a non-parametric method that calculates the exact probability of observing a table as extreme as, or more extreme than, the one observed, given the fixed marginal totals. It is ideal for small sample sizes or rare events, because it does not rely on large-sample approximations.&lt;/p&gt;

&lt;p&gt;The hypergeometric probability of observing exactly &lt;em&gt;a&lt;/em&gt; events in the treatment group is&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(A=a)=(nTa) (nCs−a)(ns)
P(A=a)=\frac{\binom{n_T}{a}\,\binom{n_C}{s-a}}{\binom{n}{s}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size1"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing 
size1"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size1"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;a&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size1"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing 
size1"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;s&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;a&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size1"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;where n&lt;sub&gt;T&lt;/sub&gt; and n&lt;sub&gt;C&lt;/sub&gt; are the treatment and control sample sizes, &lt;em&gt;s&lt;/em&gt; is the total number of events, and n = n&lt;sub&gt;T&lt;/sub&gt; + n&lt;sub&gt;C&lt;/sub&gt;.&lt;/p&gt;
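&lt;p&gt;This is exactly the hypergeometric pmf available in scipy. As a sketch, here it is checked against a direct binomial-coefficient computation for the trial data (parameter mapping assumed: population n, s marked events, n&lt;sub&gt;T&lt;/sub&gt; draws):&lt;/p&gt;

```python
from fractions import Fraction
from math import comb

from scipy.stats import hypergeom

n_t, n_c = 31_000, 31_000
n = n_t + n_c
s = 39 + 63   # total deaths across both groups
a = 39        # deaths observed in the treatment group

# Direct evaluation of C(n_T, a) * C(n_C, s - a) / C(n, s) in exact arithmetic
exact = Fraction(comb(n_t, a) * comb(n_c, s - a), comb(n, s))

# Same quantity via scipy: k = a, population M = n, "successes" = s, draws = n_T
pmf = hypergeom.pmf(a, n, s, n_t)

# Summing tables with treatment counts at or below the observed count
# is the hypergeometric cdf, i.e. the one-sided tail probability
p_tail = hypergeom.cdf(a, n, s, n_t)

print(float(exact), pmf, p_tail)
```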

&lt;p&gt;For one-sided testing of p&lt;sub&gt;T&lt;/sub&gt; &amp;lt; p&lt;sub&gt;C&lt;/sub&gt;, we sum the hypergeometric probabilities of all tables with a treatment-group count less than or equal to the observed count.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fisher_exact&lt;/span&gt;

&lt;span class="c1"&gt;# Table format: [[Deaths_T, Survived_T], [Deaths_C, Survived_C]]
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30961&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
         &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30937&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;odds_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value_fisher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fisher_exact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alternative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;less&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fisher&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s exact test (one-sided, p_T &amp;lt; p_C):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Odds ratio:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;odds_ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  p-value:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_value_fisher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;em&gt;Assumption&lt;/em&gt;: Fisher's test conditions on both margins: the group sizes and the total number of events are treated as fixed. Under this conditioning the observed cell counts follow a hypergeometric distribution, and the resulting p-values are exact.&lt;/p&gt;
&lt;h2&gt;Barnard's Test&lt;/h2&gt;

&lt;p&gt;Barnard's test treats each group as an independent binomial experiment and constructs an exact test without conditioning on the column margin. In many small-sample settings it is more powerful than Fisher's test because it uses the larger sample space of possible outcomes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Assumption&lt;/em&gt;: Barnard's test models each group as an independent binomial with success probabilities p&lt;sub&gt;T&lt;/sub&gt; and p&lt;sub&gt;C&lt;/sub&gt;, and does not condition on the column margin (the total number of events). Because it is unconditional, it considers the full sample space of outcomes rather than only tables with fixed column totals, which is why it often has higher power than Fisher's test in small samples.&lt;/p&gt;
&lt;h2&gt;The Chi-Square Test of Independence and the Pooled Z-Test&lt;/h2&gt;

&lt;p&gt;The Pearson chi-square test and the pooled two-proportion z-test are asymptotic methods that test the hypothesis of equal proportions. For a 2×2 table, they are mathematically equivalent.&lt;/p&gt;
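&lt;p&gt;This equivalence is easy to verify numerically: the pooled z statistic squares to the Pearson chi-square statistic, provided no continuity correction is applied. A sketch on the trial data, using &lt;code&gt;scipy.stats.chi2_contingency&lt;/code&gt; with &lt;code&gt;correction=False&lt;/code&gt;:&lt;/p&gt;

```python
from math import sqrt

from scipy.stats import chi2_contingency, norm

a, c = 39, 63            # deaths in treatment / control
n_t = n_c = 31_000

p_t, p_c = a / n_t, c / n_c
p_hat = (a + c) / (n_t + n_c)                          # pooled proportion
se = sqrt(p_hat * (1 - p_hat) * (1 / n_t + 1 / n_c))   # pooled standard error

z = (p_t - p_c) / se
p_one_sided = norm.cdf(z)   # one-sided alternative: p_T below p_C

table = [[a, n_t - a], [c, n_c - c]]
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)

print("z^2 =", z * z, " chi2 =", chi2)   # identical up to rounding
print("one-sided z-test p-value:", p_one_sided)
```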

&lt;p&gt;Pooled proportion (where &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;c&lt;/em&gt; are the event counts in the treatment and control groups):&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;p^=a+cnT+nC
\hat p = \frac{a + c}{n_T + n_C}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span 
class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;Pooled standard error&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;SEpooled=p^(1−p^)(1nT+1nC)
SE_{\text{pooled}} = \sqrt{\hat p(1-\hat p)\left(\frac{1}{n_T} + \frac{1}{n_C}\right)}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;pooled&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span 
class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing 
size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Z-statistic for the difference in proportions (treatment minus control)&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;z=p^T−p^CSEpooled
z = \frac{\hat p_T - \hat p_C}{SE_{\text{pooled}}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;pooled&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span 
class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Note on one-sided versus two-sided p-values&lt;/p&gt;

&lt;p&gt;The pooled z-test can produce a one-sided p-value directly from the z-statistic.&lt;/p&gt;

&lt;p&gt;The chi-square routine typically returns a two-sided p-value based on the chi-square upper tail. If the z-statistic has the direction you expect (for example negative when testing 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;pT&amp;lt;pCp_T &amp;lt; p_C&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;C&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
), the one-sided p-value in that direction is approximately half the chi-square two-sided p-value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both tests treat cell counts as approximately normal via the central limit theorem when expected counts are sufficiently large. The pooled z-test uses a pooled variance estimate under the null; the chi-square test compares observed with expected counts under independence.
&lt;/li&gt;
&lt;li&gt;These methods are fast and accurate for moderate to large samples, but they are approximations. The usual rule of thumb is an expected count of at least 5 in every cell, but this is a heuristic. For rare events or very skewed margins, check the approximations against exact methods or simulation.&lt;/li&gt;
&lt;/ul&gt;
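&lt;p&gt;A minimal standard-library sketch of the pooled z-test, using hypothetical counts (30/200 events in treatment versus 45/200 in control). It also exhibits the halving relationship numerically: the chi-square statistic on 1 degree of freedom equals z&amp;sup2;, so the chi-square two-sided p-value equals twice the one-sided normal tail probability when the effect points in the hypothesized direction.&lt;/p&gt;

```python
from math import erf, sqrt

def norm_sf(x):
    # Upper-tail probability of the standard normal distribution
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def pooled_z(a, n_t, c, n_c):
    """z-statistic for the difference in proportions (treatment minus control),
    using the pooled variance estimate under the null."""
    p_t, p_c = a / n_t, c / n_c
    p_hat = (a + c) / (n_t + n_c)                         # pooled proportion
    se = sqrt(p_hat * (1 - p_hat) * (1 / n_t + 1 / n_c))  # pooled standard error
    return (p_t - p_c) / se

# Hypothetical 2x2 table: 30/200 events in treatment, 45/200 in control
z = pooled_z(30, 200, 45, 200)
p_one_sided = norm_sf(-z)          # H1: p_T < p_C, so the lower tail of z
p_two_sided = 2 * norm_sf(abs(z))  # equals the chi-square (1 df) p-value, since chi2 = z^2
```

With these counts z is negative, so the one-sided p-value is exactly half the two-sided one; against a chi-square routine the agreement holds up to rounding.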

</description>
    </item>
    <item>
      <title>Enhancing Machine Learning Models: A Deep Dive into Feature Engineering</title>
      <dc:creator>Salman Khan</dc:creator>
      <pubDate>Sun, 31 Dec 2023 12:48:27 +0000</pubDate>
      <link>https://forem.com/salmank/enhancing-machine-learning-models-a-deep-dive-into-feature-engineering-2dh7</link>
      <guid>https://forem.com/salmank/enhancing-machine-learning-models-a-deep-dive-into-feature-engineering-2dh7</guid>
      <description>&lt;p&gt;The efficacy of machine learning models depends heavily on the quality of the input data and features [1]. In traditional machine learning, transforming raw data into features is crucial for model accuracy. Feature engineering aims to transform existing data into informative, relevant, and discriminative features. Although deep learning and end-to-end learning have revolutionized and largely automated feature extraction for images, text, and signals, feature engineering for relational and human behavioural data remains an iterative, slow, and laborious task [2].&lt;/p&gt;

&lt;p&gt;This article explores techniques for feature engineering to enhance the accuracy and reliability of a predictive model. Additionally, it presents solutions that can help streamline the feature engineering process.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flezpafemytdv0mziiuvr.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flezpafemytdv0mziiuvr.PNG" alt="Data Science Workflow" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Data Science Workflow: An Iterative Three-Step Process [2]
&lt;/h6&gt;

&lt;h6&gt;
  
  
  In the initial phase, analysts define the predictive objectives. Data engineers then extract, load, transform variables, engineer features, and define target labels. Finally, machine learning engineers construct models tailored to the specified predictive goals, exploring various techniques iteratively for the most suitable solution.
&lt;/h6&gt;




&lt;h2&gt;
  
  
  What is Feature Engineering?
&lt;/h2&gt;

&lt;p&gt;Feature engineering is an essential step in traditional machine learning, where experts manually design and extract relevant features from the processed data. The goal is to encode expert knowledge, intuitive judgement, and human preconceptions into the machine learning model. This enables easier learning, especially with smaller data sets, and increases model accuracy and interpretability.&lt;/p&gt;

&lt;p&gt;However, feature engineering is not a one-size-fits-all solution. The choice of features and techniques depends on the nature of the data, the complexity of the problem, and the goals of the model. Moreover, feature engineering is an iterative process where the model performance is evaluated, and the features are refined and updated accordingly.&lt;/p&gt;

&lt;p&gt;The figure below illustrates how simply projecting existing covariates into a higher-dimensional space, e.g. by creating polynomial variants of existing features, can make the data linearly separable and easier for a simple machine learning model to learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnbmxi4te60so5a1mvpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnbmxi4te60so5a1mvpu.png" alt="SVM" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Example of how feature engineering improves model accuracy - Adding a polynomial variant of existing features can make classes linearly separable and easier for a simple ML model to learn.
&lt;/h6&gt;
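&lt;p&gt;The idea in the figure can be reproduced in a short numpy sketch on synthetic data: points labelled by a circular boundary are not linearly separable in the raw coordinates, but after appending squared terms a single linear rule separates them perfectly.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 2))
# Circular decision boundary: not linearly separable in (x1, x2)
y = (x[:, 0] ** 2 + x[:, 1] ** 2 > 0.5).astype(int)

# Project into a higher-dimensional space by adding polynomial (squared) features
x_poly = np.hstack([x, x ** 2])

# In the lifted space the circle x1^2 + x2^2 = 0.5 becomes the hyperplane
# x1^2 + x2^2 = 0.5, so a purely linear rule classifies every point correctly
pred = (x_poly[:, 2] + x_poly[:, 3] > 0.5).astype(int)
accuracy = (pred == y).mean()
```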




&lt;h2&gt;
  
  
  Basic Feature Engineering Techniques for Relational and Temporal Data
&lt;/h2&gt;

&lt;p&gt;The table below summarizes some basic feature engineering techniques relevant to traditional machine learning models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Description and Use Cases&lt;/th&gt;
&lt;th&gt;Procedure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Imputation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filling or estimating missing values to complete the dataset. Critical for handling missing data before model training.&lt;/td&gt;
&lt;td&gt;Mean or median imputation for numerical variables; mode imputation for categorical variables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Normalizing numerical features to a similar scale to prevent bias. Essential for preventing the dominance of features with larger magnitudes.&lt;/td&gt;
&lt;td&gt;Min-Max scaling (values in [0, 1]). Z-score normalization (mean of 0, standard deviation of 1).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outlier Capping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Setting predefined upper and lower bounds for numerical values to limit extreme values (outliers). This prevents extreme values from disproportionately influencing the model.&lt;/td&gt;
&lt;td&gt;Define upper and lower bounds based on percentiles or specific thresholds, then clip values to lie within those bounds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One-Hot Encoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Representing categorical variables as binary vectors. Enables machine learning algorithms to work with categorical data.&lt;/td&gt;
&lt;td&gt;Create a binary column for each category. Assign 1 to the corresponding category, 0 otherwise.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transforming continuous numerical features into categorical ones. Useful for handling non-linear relationships in numerical data.&lt;/td&gt;
&lt;td&gt;Group values into discrete intervals or bins.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log Transform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Applying logarithmic transformation to skewed numerical features. Effective for variables like income with skewed distributions.&lt;/td&gt;
&lt;td&gt;Logarithm of values to handle right-skewed distributions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Polynomial Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creating new features via polynomial transformation. Captures non-linear relationships in data.&lt;/td&gt;
&lt;td&gt;If x is a feature, x^2 and x^3 become new features.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature Interactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creating new features by combining existing features. Captures joint effects on the target variable.&lt;/td&gt;
&lt;td&gt;If x1 and x2 are features, create a new feature x1 * x2.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Feature Aggregation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combining multiple related features into a single, more informative feature. Reduces dimensionality and captures consolidated information.&lt;/td&gt;
&lt;td&gt;Calculate averages or sums of related features.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-Based Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extracting temporal information from timestamps or time-related data. Useful for understanding temporal patterns in the data.&lt;/td&gt;
&lt;td&gt;Examples: Day of the week, hour of the day, time lags for time series data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regular Expressions Features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extracting patterns from text data using regular expressions. Useful for identifying specific structures or formats in text.&lt;/td&gt;
&lt;td&gt;Examples: Matching email addresses, extracting dates, identifying hashtags in social media text.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frequency Encoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Assigning numerical values based on the frequency of categorical variables. Useful for encoding categorical variables with varying frequencies.&lt;/td&gt;
&lt;td&gt;Preserves information about the distribution of categories. Suitable for high-cardinality categorical variables.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
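&lt;p&gt;Several of the techniques in the table (imputation, Min-Max scaling, binning, and one-hot encoding) can be sketched with pandas; the small frame below is purely illustrative.&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40_000.0, 55_000.0, None, 120_000.0],  # contains a missing value
    "age": [23, 35, 47, 62],
    "city": ["Lahore", "Karachi", "Lahore", "Quetta"],
})

# Imputation: fill the missing income with the median
df["income"] = df["income"].fillna(df["income"].median())

# Min-Max scaling: map age into [0, 1]
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Binning: discretize income into three equal-width intervals
df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# One-hot encoding: one binary column per city
df = pd.get_dummies(df, columns=["city"])
```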




&lt;h2&gt;
  
  
  Domain Specific Feature Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Natural Language Processing (NLP)&lt;/strong&gt;&lt;br&gt;
NLP tasks require extracting features from text data. Common techniques include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokenization: Splitting text into individual words or tokens.&lt;/li&gt;
&lt;li&gt;Stemming and lemmatization: Stemming strips prefixes and suffixes from words, while lemmatization maps words to their base or dictionary form.&lt;/li&gt;
&lt;li&gt;Part-of-speech (POS) tagging: Labelling each word with its corresponding POS, i.e. noun, verb, adjective, etc.&lt;/li&gt;
&lt;li&gt;Named-entity recognition: Locating and tagging named entities in text, such as persons, organizations, and locations.&lt;/li&gt;
&lt;li&gt;Bag of Words: Representing text as an integer vector of its word counts from a predefined vocabulary.&lt;/li&gt;
&lt;li&gt;Term Frequency-Inverse Document Frequency (TF-IDF): Weighting words by their frequency in a document relative to their frequency across all documents.&lt;/li&gt;
&lt;li&gt;Word Embeddings: Representing words as dense vectors in a continuous vector space based on semantic similarity.&lt;/li&gt;
&lt;li&gt;Sentiment analysis: Identifying the sentiment or tone of the text, whether it is positive, negative, or neutral.&lt;/li&gt;
&lt;li&gt;Topic modelling: Identifying the underlying topics in a document or set of documents.&lt;/li&gt;
&lt;/ul&gt;
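&lt;p&gt;As an example, the Bag of Words representation above can be built in pure Python over a toy two-document corpus:&lt;/p&gt;

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Fixed vocabulary drawn from the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Integer vector of word counts over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
```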

&lt;p&gt;&lt;strong&gt;Computer Vision (CV)&lt;/strong&gt;&lt;br&gt;
Feature engineering in computer vision involves techniques for extracting features from images, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image augmentation: Transforming the training set through geometric alterations like rotation and filtering to enlarge the training set and improve model generalization.&lt;/li&gt;
&lt;li&gt;Edge detection filter: Utilizing Sobel, Prewitt, Laplacian, or Canny edge filters to highlight changes in intensity or edges in the image.&lt;/li&gt;
&lt;li&gt;Scale-invariant feature transform (SIFT): Identifying and describing local features in images that are invariant to scaling and rotation.&lt;/li&gt;
&lt;li&gt;Colour Histogram: Representing an image by the distribution of its colours.&lt;/li&gt;
&lt;li&gt;Histogram of Oriented Gradients (HOG): Extracting features from an image based on the distribution of gradients in the image.&lt;/li&gt;
&lt;/ul&gt;
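&lt;p&gt;To illustrate edge detection, a 3x3 Sobel filter can be written with numpy alone; production pipelines would normally rely on OpenCV or scikit-image, and the image below is a synthetic two-tone block.&lt;/p&gt;

```python
import numpy as np

def sobel_edges(img):
    """Gradient magnitude from 3x3 Sobel kernels (valid-mode filtering, numpy only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            out[i, j] = np.hypot((patch * kx).sum(), (patch * ky).sum())
    return out

# Synthetic image: dark left half, bright right half -> one strong vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = sobel_edges(img)
```

The filter responds only at the brightness boundary in the middle of the image and is zero in the flat regions.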

&lt;p&gt;&lt;strong&gt;Time-Series Analysis&lt;/strong&gt;&lt;br&gt;
Feature engineering in time-series analysis involves techniques for extracting features from time-series data, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autocorrelation: Measuring the correlation between a time series and its lagged values.&lt;/li&gt;
&lt;li&gt;Moving averages: Calculating the average of a subset of time-series data over a defined window.&lt;/li&gt;
&lt;li&gt;Trend analysis: Identifying trends and patterns in the time series data.&lt;/li&gt;
&lt;li&gt;Fourier transforms: Decomposing a time-series signal into its frequency components.&lt;/li&gt;
&lt;li&gt;Mel-frequency cepstral coefficients (MFCCs): Representing an audio signal by its short-term power spectrum on the mel scale.&lt;/li&gt;
&lt;li&gt;Phonemes: Representing words by their phonemes, leveraging human preconceptions about how words are pronounced.&lt;/li&gt;
&lt;/ul&gt;
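&lt;p&gt;Lag, moving-average, and calendar features of this kind are one-liners in pandas (the daily sales series below is synthetic):&lt;/p&gt;

```python
import pandas as pd

ts = pd.DataFrame(
    {"sales": [10, 12, 13, 15, 14, 18, 21]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

ts["day_of_week"] = ts.index.dayofweek             # 0 = Monday
ts["lag_1"] = ts["sales"].shift(1)                 # previous day's sales
ts["ma_3"] = ts["sales"].rolling(window=3).mean()  # 3-day moving average
```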




&lt;h2&gt;
  
  
  Automated Feature Engineering
&lt;/h2&gt;

&lt;p&gt;A myriad of tools and open-source packages can help automate and streamline feature engineering. These packages use algorithms to generate and select features based on data characteristics, reducing manual effort and broadening the exploration of potential features.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.featuretools.com/en/v0.16.0/index.html" rel="noopener noreferrer"&gt;Featuretools&lt;/a&gt;: designed for automated feature engineering on temporal and relational data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tsfresh.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;tsfresh&lt;/a&gt;: designed for feature engineering from time-series and other sequential data &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pypi.org/project/autofeat/" rel="noopener noreferrer"&gt;AutoFeat&lt;/a&gt;: streamlines the generation of nonlinear features from data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://epistasislab.github.io/tpot/" rel="noopener noreferrer"&gt;TPOT (Tree-based Pipeline Optimization Tool)&lt;/a&gt;: Designed to automate all aspects of machine learning pipeline, i.e. feature engineering, feature selection and model optimization using genetic programming. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/AutoViML/featurewiz" rel="noopener noreferrer"&gt;featurewiz&lt;/a&gt;: automates feature engineering and selection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Featuretools, with its deep feature synthesis (DFS) [2] algorithm, stands out for its versatility, particularly when working with relational datasets and incorporating temporal aggregation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tutorial
&lt;/h4&gt;

&lt;p&gt;In this tutorial, we will implement Featuretools on a dataset consisting of four tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clients - information about clients at a credit union&lt;/li&gt;
&lt;li&gt;loans - previous loans taken out by the clients&lt;/li&gt;
&lt;li&gt;payments due - payments due date and amount&lt;/li&gt;
&lt;li&gt;outcomes - loan payment and date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Source: &lt;a href="https://www.kaggle.com/code/willkoehrsen/automated-feature-engineering-tutorial/input" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt; [3]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0bfulpjrzzg72xv773e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0bfulpjrzzg72xv773e.png" alt="Data Sample" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Data Sample: The objective of the machine learning model is to classify whether the customer will make or miss the next payment
&lt;/h6&gt;




&lt;p&gt;Featuretools offers three distinct advantages that make it a powerful tool for automated feature engineering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. EntitySet Approach&lt;/strong&gt;&lt;br&gt;
Featuretools operates on an EntitySet, i.e. a collection of data frames and the relationships between them. This simplifies the feature engineering process for relational datasets: users define the relationships between tables, and features are generated automatically based on those relationships.&lt;/p&gt;

&lt;h6&gt;
  
  
  Code:
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EntitySet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clients&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;## Entities Dataframe
&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataframe_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clients&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;time_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;joined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataframe_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loans&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loan_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;time_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loan_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataframe_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments_due&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payments_due&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;time_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;due_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataframe_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;time_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;## Adding Relationships in data frames
&lt;/span&gt;
&lt;span class="c1"&gt;# Relationship between clients and previous loans
&lt;/span&gt;&lt;span class="n"&gt;ed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_relationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clients&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loans&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Relationship between previous loans and payments
&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_relationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loans&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loan_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments_due&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loan_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Relationship between payments and outcome
&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_relationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments_due&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxkt75ht0qndvulstuoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxkt75ht0qndvulstuoh.png" alt="Featuretools EntitySet" width="296" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Featuretools EntitySet - Tables and their Relationship.
&lt;/h6&gt;




&lt;p&gt;&lt;strong&gt;2. Feature Primitives and Deep Feature Synthesis (DFS)&lt;/strong&gt;&lt;br&gt;
"Feature primitives are the building blocks of Featuretools. They define computations that can be applied to raw datasets to create new features." [4]&lt;/p&gt;

&lt;p&gt;Feature primitives fall into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregation: Functions that group child datapoints for each parent and compute summary statistics such as the mean or variance. [3]&lt;/li&gt;
&lt;li&gt;Transformation: Operations applied to one or more columns of a single table, e.g. extracting the day from a date or taking the difference between two columns. [3]&lt;/li&gt;
&lt;/ul&gt;
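&lt;p&gt;As a rough illustration of the two categories (a plain-pandas sketch with made-up column names, not the Featuretools API), an aggregation primitive summarises child rows per parent, while a transformation primitive derives a new column within a single table:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical child table: loans belonging to parent clients
loans = pd.DataFrame({
    "client_id": [1, 1, 2],
    "loan_amount": [1000.0, 500.0, 2000.0],
    "loan_start": pd.to_datetime(["2020-01-05", "2020-03-10", "2020-02-01"]),
})

# Aggregation primitive (e.g. SUM): collapse child rows per parent client
sum_per_client = loans.groupby("client_id")["loan_amount"].sum()

# Transformation primitive (e.g. DAY): derive a new column within one table
loans["loan_start_day"] = loans["loan_start"].dt.day
```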

&lt;h6&gt;
  
  
  Code:
&lt;/h6&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Index and the time to use as a cutoff time
&lt;/span&gt;&lt;span class="n"&gt;cutoff_times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments_due&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;due_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;due_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rename columns to avoid confusion
&lt;/span&gt;&lt;span class="n"&gt;cutoff_times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;due_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cutoff_times&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# subtract 1 day from the time
&lt;/span&gt;&lt;span class="n"&gt;cutoff_times&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cutoff_times&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;payments_due&lt;/span&gt;

&lt;span class="c1"&gt;# Feature Primitives
&lt;/span&gt;&lt;span class="n"&gt;agg_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;trans_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_since_previous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Deep feature synthesis 
&lt;/span&gt;&lt;span class="n"&gt;agg_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;trans_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_since_previous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Deep feature synthesis 
&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entityset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;es&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_dataframe_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments_due&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;agg_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agg_primitives&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;trans_primitives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trans_primitives&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;n_jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;cutoff_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cutoff_times&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    
                       &lt;span class="n"&gt;cutoff_time_in_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, Featuretools provides a visual representation to inspect, interpret, and validate generated features, enhancing understanding and interpretability of the subsequently built machine learning model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcjbjwnbmyaw1n50n1w2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcjbjwnbmyaw1n50n1w2.png" alt="Feature Details" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  Visual representation from Featuretools of how a particular feature is generated for the given dataset.
&lt;/h6&gt;




&lt;p&gt;&lt;strong&gt;3. Data Leakage and Handling Time&lt;/strong&gt;&lt;br&gt;
Data leakage in its many forms remains a challenge for machine learning models and systems [5]. Featuretools provides a built-in safeguard against 'temporal leakage' in feature generation: a cutoff time can be specified for each record, and all data after that timestamp is filtered out before features are calculated [6].&lt;/p&gt;
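&lt;p&gt;The cutoff-time idea can be mimicked in plain pandas (a simplified sketch with hypothetical column names, not the Featuretools internals): only data observed before a record's cutoff time may contribute to its features:&lt;/p&gt;

```python
import pandas as pd

payments = pd.DataFrame({
    "loan_id": [1, 1, 1],
    "payment_amount": [100.0, 120.0, 90.0],
    "payment_date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
})

# Cutoff time for a hypothetical label observed on 2020-02-15
cutoff_time = pd.Timestamp("2020-02-15")

# Keep only data known before the cutoff; the 2020-03-01 row would leak the future
history = payments[payments["payment_date"].lt(cutoff_time)]
total_paid_so_far = history["payment_amount"].sum()
```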




&lt;p&gt;&lt;strong&gt;A detailed tutorial on how to use Featuretools with careful handling of time can be found here: &lt;a href="https://github.com/SalK91/dfs_tutorial" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;In conclusion, feature engineering is a critical step in building machine learning models and can significantly enhance a model's performance and accuracy. Automated tools such as Featuretools can streamline the feature engineering process and contribute to the success of a machine learning system.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Domingos, P., 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10), pp.78-87.&lt;/p&gt;

&lt;p&gt;[2] Kanter, J.M. and Veeramachaneni, K., 2015, October. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA) (pp. 1-10). IEEE.&lt;/p&gt;

&lt;p&gt;[3] Automated Feature Engineering Tutorial - &lt;a href="https://www.kaggle.com/code/willkoehrsen/automated-feature-engineering-tutorial/input" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] Feature primitives - &lt;a href="https://featuretools.alteryx.com/en/stable/getting_started/primitives.html" rel="noopener noreferrer"&gt;Featuretools Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[5] Kapoor, S. and Narayanan, A., 2023. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9).&lt;/p&gt;

&lt;p&gt;[6] Handling Time - &lt;a href="https://docs.featuretools.com/en/v0.16.0/automated_feature_engineering/handling_time.html" rel="noopener noreferrer"&gt;Featuretools Documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Beyond Model Deployment: Catching Data Drift</title>
      <dc:creator>Salman Khan</dc:creator>
      <pubDate>Sun, 17 Dec 2023 12:20:18 +0000</pubDate>
      <link>https://forem.com/salmank/beyond-model-deployment-catching-data-drift-45ca</link>
      <guid>https://forem.com/salmank/beyond-model-deployment-catching-data-drift-45ca</guid>
      <description>&lt;p&gt;Machine learning and deep learning techniques 'learn' by recognizing and generalizing patterns and statistical properties within the training data. The efficacy of these models in real-world scenarios is contingent on the assumption that the training data is an accurate representation of the production data. However, this assumption often breaks in the real world. Consumer behaviours and market trends may undergo gradual or even drastic shifts. Sensors responsible for data collection can experience a decline in sensitivity over time. Additionally, disruptions such as broken data pipelines, alterations in upstream systems, and changes in external APIs can introduce gradual or abrupt changes to the data used for predictions in production. In essence, the dynamic nature of real-world conditions poses challenges to the sustained accuracy and reliability of any ML system. Therefore, it is crucial to understand how the model behaves in production and promptly identify and resolve any issues that may arise. One of the critical aspects of ML Monitoring is identifying data drift. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is data drift?
&lt;/h2&gt;

&lt;p&gt;"Data drift is a change in the statistical properties and characteristics of the input data. It occurs when a machine learning model production encounters data that deviates from the data the model was initially trained on or earlier production data"[1]. Simply put, data drift is a change in the distribution of the input features, as illustrated in the figure below.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpd3f4dmew5blwfb3og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqpd3f4dmew5blwfb3og.png" alt="Data-drift example" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How to identify data drift?
&lt;/h2&gt;

&lt;p&gt;Data drift monitoring necessitates a well-defined reference dataset. This dataset serves as a benchmark against which production data can be systematically compared and analyzed. Only by establishing a baseline via this reference dataset is it possible to discern variations in the distribution of features, enabling the timely identification of potential drift and ensuring the ongoing reliability and performance of the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Methods to identify data drift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule Based:&lt;/strong&gt;&lt;br&gt;
Heuristic-based alerts can be set up to indirectly monitor data drifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percentage of missing values.&lt;/li&gt;
&lt;li&gt;Percentage of numeric values outside a predefined min-max threshold.&lt;/li&gt;
&lt;li&gt;Percentage of new values in a categorical feature.&lt;/li&gt;
&lt;/ul&gt;
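&lt;p&gt;These heuristics are straightforward to compute; a minimal sketch (the function names and example data are my own, not from any monitoring library):&lt;/p&gt;

```python
import pandas as pd

def missing_rate(series: pd.Series) -> float:
    """Fraction of missing values in a column."""
    return float(series.isna().mean())

def out_of_range_rate(series: pd.Series, lo: float, hi: float) -> float:
    """Fraction of non-missing numeric values outside the predefined [lo, hi] range."""
    s = series.dropna()
    return float((s.lt(lo) | s.gt(hi)).mean())

def new_category_rate(series: pd.Series, known: set) -> float:
    """Fraction of non-missing categorical values unseen in the reference data."""
    s = series.dropna()
    return float((~s.isin(known)).mean())

# Hypothetical production batch
prod = pd.DataFrame({
    "age": [25, 40, None, 130],
    "country": ["US", "DE", "XX", "US"],
})
```

Each rate can then be compared against a team-chosen alert threshold.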

&lt;p&gt;&lt;strong&gt;Statistical Tests:&lt;/strong&gt;&lt;br&gt;
Parametric and non-parametric tests can be utilized to compare the production data against reference datasets, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html" rel="noopener noreferrer"&gt;Two sample t-test&lt;/a&gt; - to compare means for numeric features.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html" rel="noopener noreferrer"&gt;Kolmogorov Smirnov test (KS)&lt;/a&gt; - to test for equality of distribution of numerical features.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html" rel="noopener noreferrer"&gt;Chi-squared test&lt;/a&gt; - to test for equality of distribution of categorical features.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html" rel="noopener noreferrer"&gt;K-Sample Anderson-Darling (AK)&lt;/a&gt; - tests the null hypothesis that k-samples are drawn from the same population without having to specify the distribution function of that population.&lt;/li&gt;
&lt;/ul&gt;
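&lt;p&gt;As an example of the second test, SciPy's &lt;code&gt;ks_2samp&lt;/code&gt; can compare a production sample against the reference; a sketch on synthetic data (the sample sizes and the drift magnitude are illustrative assumptions):&lt;/p&gt;

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=500)   # feature at training time
production = rng.normal(loc=1.0, scale=1.0, size=500)  # mean shifted by one sigma

result = ks_2samp(reference, production)
# A small result.pvalue (e.g. below a chosen significance level such as 0.05)
# suggests the production sample no longer follows the reference distribution.
```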

&lt;p&gt;&lt;strong&gt;Distance Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kl_div.html" rel="noopener noreferrer"&gt;Kullback–Leibler Divergence (KL Divergence)&lt;/a&gt; -  it is a non-symmetric metric that measures the relative entropy or difference in information represented by two distributions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html" rel="noopener noreferrer"&gt;Jensen–Shannon distance (JS Distance) &lt;/a&gt; -  measures the similarity between two probability distributions. It is based on KL Divergence, and one main difference between JS divergence and KL divergence is that JS is symmetric and it always has a finite value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.aporia.com/learn/data-science/practical-introduction-to-population-stability-index-psi/" rel="noopener noreferrer"&gt;Population Stability Index (PSI)&lt;/a&gt; - measures the distance between the distribution of numeric and categorical features.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
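&lt;p&gt;As an illustration, PSI for a numeric feature can be computed by binning the production sample on the reference quantiles (one common convention; the function below is my own sketch, not a library API):&lt;/p&gt;

```python
import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between two numeric samples."""
    # Bin edges from reference quantiles, so each reference bin holds roughly equal mass
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip to avoid log(0) when a bin is empty in one of the samples
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 5000)      # reference feature
same = rng.normal(0.0, 1.0, 5000)     # production batch, no drift
shifted = rng.normal(1.0, 1.0, 5000)  # production batch, mean shifted by one sigma
```

Rules of thumb such as reading PSI below 0.1 as stable and above 0.25 as a major shift are common but, as noted below, such cutoffs should be validated empirically for each model.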

&lt;h2&gt;
  
  
  Selecting the right metric
&lt;/h2&gt;

&lt;p&gt;Each metric discussed above has distinct properties and inherent assumptions. Therefore, it is crucial to identify the metric that aligns most effectively with the problem, considering both the dataset volume and the magnitude of drift that is meaningful for the particular model. &lt;/p&gt;

&lt;p&gt;To gain a deeper understanding, we will analyze and compare these metrics across two distinct variations of a numerical feature in relation to reference data across various sample sizes. The figure below depicts the distribution of the two variants compared to the reference dataset.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3i176o72pjg80h8zcxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3i176o72pjg80h8zcxw.png" alt="Experiment Distribution" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5udh27dxdjvosn04ufun.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5udh27dxdjvosn04ufun.PNG" alt="Table Metrics" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Based on the experiments above, the following observations can be made:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Statistical Test Sensitivity&lt;/em&gt;: Statistical tests become increasingly sensitive as datasets grow. Even minute, near-zero differences can attain statistical significance with a sufficiently high volume of data.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Distance Metric Challenges&lt;/em&gt;: Distance metrics lack standardized cutoffs for alarms, and their interpretation depends on the specific context of application and analysis goals. Establishing suitable thresholds for these metrics necessitates empirical evaluation based on the data's characteristics and the ML model's objectives.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;strong&gt;The code employed for the aforementioned experiments is available on &lt;a href="https://github.com/SalK91/data_drift_measures" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In conclusion, the dynamic nature of real-world conditions poses significant challenges to the accuracy and reliability of machine learning systems. Changes in consumer behaviours, market trends, and potential disruptions in data collection mechanisms can introduce gradual or abrupt changes to the data used for predictions in production. In this context, monitoring and identifying data drift becomes paramount. As demonstrated through various statistical tests, distance metrics, and the analysis of experiment results, it is clear that selecting the right metric for monitoring data drift is a nuanced task. The sensitivity of statistical tests and the lack of standardized cutoffs for distance metrics highlight the need for a context-specific and empirical approach to establishing thresholds for effective monitoring. Ultimately, understanding how machine learning models behave in production and promptly addressing any identified issues are critical for ensuring these models' ongoing success and reliability in real-world applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://www.evidentlyai.com/ml-in-production/data-drift" rel="noopener noreferrer"&gt;https://www.evidentlyai.com/ml-in-production/data-drift&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://docs.scipy.org/doc/" rel="noopener noreferrer"&gt;https://docs.scipy.org/doc/&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.aporia.com/learn/data-science/practical-introduction-to-population-stability-index-psi/" rel="noopener noreferrer"&gt;https://www.aporia.com/learn/data-science/practical-introduction-to-population-stability-index-psi/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>monitoring</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Decoding Deep Learning: An Introductory Guide to Neural Networks</title>
      <dc:creator>Salman Khan</dc:creator>
      <pubDate>Sat, 09 Dec 2023 13:07:36 +0000</pubDate>
      <link>https://forem.com/salmank/decoding-deep-learning-an-introductory-guide-to-neural-networks-3j9h</link>
      <guid>https://forem.com/salmank/decoding-deep-learning-an-introductory-guide-to-neural-networks-3j9h</guid>
      <description>&lt;p&gt;Deep Learning has become a pivotal driving force in reshaping the technology landscape, empowering advancements in image analysis, text interpretation, and the development of Generative AI. &lt;/p&gt;

&lt;p&gt;The global Deep Learning market size is projected to grow from $17.60 billion in 2023 to $188.58 billion by 2030, at a CAGR of 40.3% during the forecast period [1]. While it might initially appear complex, Deep Learning basics are pretty straightforward. &lt;/p&gt;

&lt;p&gt;This article aims to cover the basics of Neural Networks and Deep Learning for beginners.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Deep Learning?
&lt;/h2&gt;

&lt;p&gt;Deep Learning is a subset of artificial intelligence (AI) that employs advanced algorithms to discern intricate patterns in data, primarily through multi-layered neural networks. Deep Learning offers the inherent ability to simultaneously process &lt;strong&gt;multi-modal&lt;/strong&gt; data, such as text, images, and audio, through specialised architectures that extract patterns from different modalities. It further enables &lt;strong&gt;end-to-end learning&lt;/strong&gt;. End-to-end learning involves training neural networks to directly transform raw input data into desired output, eliminating the need for manual feature engineering or intermediate processing steps. For example, in natural language processing, end-to-end learning can involve training a neural network to directly translate one language into another without requiring explicit linguistic feature extraction or rule-based translation systems. In computer vision, end-to-end learning can enable a neural network to take raw image data and output information about objects or scenes, bypassing manual image preprocessing steps.  This streamlined approach allows deep learning models to learn the most relevant features and representations from the data without requiring feature engineering. It is particularly well-suited for tasks where traditional, multi-stage processing pipelines may be cumbersome or less effective.&lt;/p&gt;

&lt;p&gt;The applications of Deep Learning are diverse and far-reaching - from enabling autonomous vehicles to medical diagnostics. Deep Learning has enabled exponential advancements in the realm of Generative AI, which can generate new data - from realistic images to chatbots that generate human-like responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neural Networks  - Building Blocks of Deep Learning
&lt;/h2&gt;

&lt;p&gt;Neural networks draw inspiration from how the human brain operates, especially how biological neurons manage data. In the brain, neurons receive, process, and transmit information, forming the fundamental basis for thinking and learning.&lt;/p&gt;

&lt;p&gt;Neural Networks are composed of layers that act as distinct stages of information processing. As data progresses from one layer to the next, it transforms, with each layer contributing to the final decision or prediction made by the network. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common layer types&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input layer:&lt;/strong&gt; This initial layer interfaces with the provided data. It takes in the raw information, be it pixels from an image, values from a dataset, or any other form of data, and forwards it to subsequent layers for processing. The number of neurons in this layer corresponds to the number of features or inputs in the dataset. This layer does not perform any computation on the input data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden layer:&lt;/strong&gt; Hidden layers are positioned between the input and output layers and are where the majority of computation occurs. They transform the data from the input layer and pass it on to another hidden layer or the output layer. The depth of a Neural Network often refers to the number of these hidden layers. Commonly used hidden layers include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dense Layer: Dense or Fully Connected Layers are the most common layers in feedforward neural networks. Neurons in a dense layer are connected to &lt;strong&gt;all&lt;/strong&gt; neurons in the previous layer. They are responsible for learning and representing complex patterns in the data.&lt;/li&gt;
&lt;li&gt;Convolutional layers: Mainly used in image processing tasks, they are specialised for spatial hierarchies in data. Using a mathematical operation called convolution, they can detect local patterns like edges, textures, and shapes in images. They are fundamental to Convolutional Neural Networks (CNNs), often used in image recognition and classification tasks.&lt;/li&gt;
&lt;li&gt;Pooling Layer: They are generally used in conjunction with convolution layers and reduce the spatial dimensions of the feature maps, which helps reduce computation and control overfitting.&lt;/li&gt;
&lt;li&gt;Recurrent layer: These layers are tailored for sequential data, like time series or natural language. Unlike traditional layers, recurrent layers maintain a form of memory of their previous outputs, which allows them to make decisions influenced by a sequence of data rather than individual data points. Recurrent Neural Networks (RNNs), which use these layers, are beneficial for tasks like language modelling, speech recognition, and time series forecasting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other widely used layers include the Dropout layer, the Long Short-Term Memory (LSTM) layer, and the Gated Recurrent Unit (GRU) layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output layer:&lt;/strong&gt; The final layer in the sequence delivers the result. Depending on the network's design and function, this result can be a single value, a set of values, a category label, or even a complex data structure.&lt;/p&gt;

&lt;p&gt;So, layers are the building blocks of Neural Networks. Each serves a unique purpose, ensuring the network can process various data types and perform different tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b5yo4u05vgihu2u5zee.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b5yo4u05vgihu2u5zee.PNG" alt="Fully Connected NN" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Neurons - Building Blocks of Neural Networks
&lt;/h2&gt;

&lt;p&gt;Each layer comprises a group of &lt;strong&gt;neurons&lt;/strong&gt; that generally utilise non-linear functions to transform the data. Three critical components define the behaviour of each neuron - &lt;strong&gt;weights, biases and the activation function&lt;/strong&gt;. Weights determine the significance of input data as a neuron processes it. In simple terms, weights amplify or dampen the input. Biases, on the other hand, are additional parameters that shift a neuron's weighted sum, allowing it to produce a non-zero output even when all its inputs are zero. The activation function, which is generally non-linear, enables the network to approximate and represent a wide range of functions and, hence, learn complex patterns in data.&lt;/p&gt;
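&lt;p&gt;As an illustration, a single neuron's computation can be sketched in a few lines of NumPy. The inputs, weights, and bias below are made-up values chosen only to show the mechanics:&lt;/p&gt;

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def neuron_output(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the activation
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # inputs to the neuron
w = np.array([0.4, 0.3, 0.1])   # weights: how much each input matters
b = 0.05                        # bias: shifts the weighted sum

# 0.4*0.5 + 0.3*(-1.0) + 0.1*2.0 + 0.05 = 0.15; positive, so ReLU passes it through
out = neuron_output(x, w, b)
```

&lt;p&gt;With a negative enough bias the same neuron would output zero, which is exactly how weights and biases together decide when a neuron "fires".&lt;/p&gt;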

&lt;p&gt;Overall, understanding a Neural Network's anatomy involves recognising its layers' purpose and function, the neurons within those layers, and the critical role weights and biases play in processing and refining information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v3s5ss2w12w5p1huuel.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v3s5ss2w12w5p1huuel.PNG" alt="Neuron" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Training a Neural Network
&lt;/h2&gt;

&lt;p&gt;Neural networks learn the underlying parameters of the model via an iterative process in which the calculations are carried out forward and backwards through the network until the desired level of accuracy is attained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forward Propagation&lt;/strong&gt;&lt;br&gt;
This is the first phase of training, where the input data is passed through the Neural Network to produce an output. This output is then compared to the actual desired output to compute the error (loss) between the two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward Propagation&lt;/strong&gt; &lt;br&gt;
Backward propagation involves calculating the gradient of the loss with respect to each model parameter (weight and bias). This gradient represents the sensitivity of the loss to changes in each parameter. With this gradient, the model parameters (weights and biases) are adjusted in reverse order, starting from the output layer and working back to the input layer.&lt;/p&gt;
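&lt;p&gt;To make the forward and backward passes concrete, here is a deliberately tiny example: a one-parameter linear model on a single made-up training example, with the gradients derived by hand via the chain rule:&lt;/p&gt;

```python
# Tiny linear model y_hat = w*x + b on one training example,
# illustrating one forward pass and the backward gradients.
x, y_true = 2.0, 5.0
w, b = 0.5, 0.0

# Forward propagation: compute the prediction and the loss
y_hat = w * x + b              # prediction = 1.0
loss = (y_hat - y_true) ** 2   # squared error = 16.0

# Backward propagation: gradient of the loss w.r.t. each parameter
dloss_dyhat = 2 * (y_hat - y_true)  # = -8.0
dloss_dw = dloss_dyhat * x          # chain rule through y_hat = w*x + b: -16.0
dloss_db = dloss_dyhat * 1.0        # = -8.0
```

&lt;p&gt;A real network repeats exactly this chain-rule bookkeeping layer by layer, from the output back to the input.&lt;/p&gt;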

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4qninhqyr7tml05l41o.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4qninhqyr7tml05l41o.PNG" alt="Forward and Backward Propogation" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loss Functions&lt;/strong&gt;&lt;br&gt;
These assess the performance of a Neural Network by measuring the disparity between its predictions and the actual values. Commonly used loss functions include Mean Squared Error for regression tasks and Cross-Entropy for classification tasks. The main goal of training is to minimise this loss value, which means that the model's predictions are getting closer to the actual values.&lt;/p&gt;
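&lt;p&gt;Both loss functions are short enough to write directly in NumPy. This is a sketch with arbitrary example values, not a library implementation:&lt;/p&gt;

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference, used for regression
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Categorical cross-entropy for a one-hot label and predicted
    # probabilities; predictions are clipped to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

reg_loss = mse(np.array([3.0, 1.0]), np.array([2.0, 1.0]))          # = 0.5
clf_loss = cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1]))
```

&lt;p&gt;The cross-entropy here is -log(0.7): the closer the predicted probability of the true class gets to 1, the closer the loss gets to 0.&lt;/p&gt;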

&lt;p&gt;&lt;strong&gt;Optimisation&lt;/strong&gt; &lt;br&gt;
Optimisation algorithms adjust the weights and biases in the network to minimise the loss. One of the most popular optimisation techniques is Gradient Descent, where the model iteratively updates its parameters in the direction that reduces the loss. Variants of Gradient Descent, like Stochastic Gradient Descent and Adam, offer more nuanced approaches to this adjustment process.&lt;/p&gt;
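&lt;p&gt;The core Gradient Descent update is simple enough to show on a one-parameter toy problem (the values below are chosen purely for illustration):&lt;/p&gt;

```python
# Plain gradient descent on a single-parameter model, minimising
# the squared error between w*x and the target y.
x, y = 3.0, 6.0   # one training example; the loss is minimised at w = 2.0
w = 0.0           # initial guess
lr = 0.02         # learning rate: how large each step is

for _ in range(200):
    grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
    w -= lr * grad              # step in the direction that reduces the loss
```

&lt;p&gt;Stochastic Gradient Descent applies the same update using small random batches of data, and Adam additionally adapts the step size per parameter.&lt;/p&gt;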

&lt;p&gt;Training a Neural Network is an iterative process of prediction, evaluation, and refinement. By adjusting its internal parameters in response to training data, the network tries to produce outputs that closely match the actual results.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does this all look in practice?
&lt;/h2&gt;

&lt;p&gt;Let us examine a basic deep-learning task: classifying handwritten numbers using the well-known MNIST dataset. This collection contains grayscale images of handwritten digits from 0 to 9 and is a standard reference for testing Neural Network models.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Objective:&lt;/em&gt; To build a Neural Network model that can accurately identify and classify images of handwritten digits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Preparation:&lt;/em&gt; The MNIST dataset is split into training and testing sets. Each 28x28-pixel image is flattened into an array of 784 values (28 multiplied by 28). These values, ranging from 0 to 255, represent pixel intensities. To make computations more stable, the pixel values are normalised to the range 0 to 1.&lt;/p&gt;
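&lt;p&gt;That preparation step can be sketched as follows, using random values in place of the real MNIST pixels:&lt;/p&gt;

```python
import numpy as np

# Stand-in batch of 32 raw 28x28 images with pixel values in 0..255
images = np.random.randint(0, 256, size=(32, 28, 28)).astype(np.float32)

# Flatten each image to 784 values and scale intensities into [0, 1]
flat = images.reshape(len(images), 28 * 28)
normalised = flat / 255.0
```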

&lt;p&gt;&lt;em&gt;Network Architecture:&lt;/em&gt; Our Neural Network consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An input layer with 784 nodes (corresponding to the 784-pixel values).&lt;/li&gt;
&lt;li&gt;A hidden layer with 128 nodes and a ReLU (Rectified Linear Unit) activation function.&lt;/li&gt;
&lt;li&gt;An output layer with ten nodes (representing digits 0–9) and a softmax activation function to provide classification probabilities.&lt;/li&gt;
&lt;/ul&gt;
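&lt;p&gt;A minimal NumPy sketch of this forward pass, with randomly initialised (untrained) parameters, might look like this:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised parameters for the 784-128-10 network
W1 = rng.standard_normal((784, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 10)) * 0.01
b2 = np.zeros(10)

def forward(x):
    # Hidden layer: dense (fully connected) + ReLU
    h = np.maximum(0.0, x @ W1 + b1)
    # Output layer: dense + softmax, giving probabilities for digits 0-9
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())  # subtract the max for stability
    return exp / exp.sum()

probs = forward(rng.random(784))  # ten probabilities summing to 1
```

&lt;p&gt;In practice TensorFlow or PyTorch builds this same structure for you and handles the backward pass automatically.&lt;/p&gt;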

&lt;p&gt;&lt;em&gt;Forward and Backward Propagation:&lt;/em&gt; As previously discussed, the model initialises with forward propagation to produce outputs, then backward propagation to adjust weights and biases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Loss Function:&lt;/em&gt; Given this classification problem, we use the Cross-Entropy loss function.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Optimisation:&lt;/em&gt; For this example, we will use the Adam optimiser, known for its efficiency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Training iterations:&lt;/em&gt; The model is trained over multiple passes through the data (often called epochs). After each epoch, the model's performance on a validation set can be assessed to guard against overfitting.&lt;/p&gt;
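&lt;p&gt;The epoch loop with a held-out validation set can be sketched on a toy model, with synthetic data standing in for MNIST:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (y is roughly 3*x), split into
# training and validation sets
X = rng.random((100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.01, 100)
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

w, lr = 0.0, 0.1
for epoch in range(300):  # each pass over the training data is one epoch
    # Gradient of the mean squared error over the training set
    grad = np.mean(2 * (w * X_train[:, 0] - y_train) * X_train[:, 0])
    w -= lr * grad
    # Validation loss after each epoch: if it starts rising while the
    # training loss keeps falling, the model is overfitting
    val_loss = np.mean((w * X_val[:, 0] - y_val) ** 2)
```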

&lt;p&gt;&lt;strong&gt;Tech solutions and tools:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TensorFlow: This open-source framework developed by Google offers several tools for building and training Machine Learning models. For our handwritten digit classification, TensorFlow provides predefined functions and structures that simplify the model-building process.&lt;/p&gt;

&lt;p&gt;PyTorch: This tool, developed by Facebook's AI Research lab, is another popular framework for Deep Learning. PyTorch offers an adaptable and straightforward approach, making it an excellent alternative to TensorFlow for Neural Network tasks.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Final thoughts and suggestions
&lt;/h2&gt;

&lt;p&gt;Neural Networks are central to current technological breakthroughs, including Generative AI. We have covered their core components and the detailed training process. For those just starting, this is only the beginning. A vast field of application and innovation remains to be explored.&lt;/p&gt;

&lt;p&gt;For further learning, you can find information in multiple sources:&lt;/p&gt;

&lt;p&gt;Books: "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville is an excellent first step.&lt;/p&gt;

&lt;p&gt;Online courses: Platforms like Coursera or Udemy offer detailed courses on the subject.&lt;/p&gt;

&lt;p&gt;Communities: Use platforms like Stack Overflow or dedicated subreddits to discuss and learn.&lt;/p&gt;

&lt;p&gt;Deep Learning is expansive and constantly evolving. Embrace the challenge and enjoy the journey.&lt;/p&gt;

&lt;p&gt;References:&lt;br&gt;
[1] &lt;a href="https://www.fortunebusinessinsights.com/deep-learning-market-107801" rel="noopener noreferrer"&gt;https://www.fortunebusinessinsights.com/deep-learning-market-107801&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
