Forem: ashwins-code

Cricket Match Simulation using Machine Learning [Part 1]

ashwins-code — Sat, 07 Jan 2023 18:26:26 +0000

Having watched Cricket as my favourite sport for years, I've recently come to wonder how well the outcomes of games could be predicted.

Especially in tournaments like the IPL, it seems difficult to predict the winner of each game since the difference in quality between sides often seems minimal on paper.

It would be interesting to see how effective machine learning could be in this task and which factors influence the outcomes of games.

Top-Down vs Bottom-Up Approach

One way to predict the outcome of a game is by taking in a list of players from two teams and predicting the win percentage of both teams.

While this approach works well to predict the final outcome of the match, it fails to answer any other questions we may have about a game.

For this reason, I have decided to take the bottom-up approach of predicting the outcome of each ball in a game, which would ultimately predict the final outcome of a game. This way, we can answer many questions such as:

Who will score the highest in a game?
Who will take the most wickets?
What will they score?
What happens if we reverse the batting order?

Predicting the outcome of balls

To accurately predict the outcome of each ball in a game, you would need to take into account two groups of data:

Data of the current match (team score, batsman score, wickets fallen etc.)
Pre-match historical data (batter and bowler skill)

We of course would need data from the current match so that the situation in which the ball is bowled is known.

To make predictions more accurate, the actual skill level of the bowler and batsman should be known, in order to know who is most likely to have the advantage. A player's skill level is found by looking at their data from previous matches.

For simplicity in this project, run outs are counted as a bowler's wicket and extras are counted as the batsman's runs. This means each ball outcome can be one of:

0 runs
1 run
2 runs
3 runs
4 runs
6 runs
Wicket
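As a small illustration, the mapping from a delivery to one of these seven outcomes could be sketched as below. The function name `encode_outcome` is made up for this example; treating 5 runs as 4, capping at 6 and using the label 5 for a wicket all follow the dataset-building code later in this post.

```python
WICKET = 5  # label used for a wicket, since no ball is recorded as 5 runs here


def encode_outcome(runs_off_bat: int, is_wicket: bool) -> int:
    """Map one delivery to an outcome label: 0, 1, 2, 3, 4, 6 (runs) or 5 (wicket)."""
    if is_wicket:
        return WICKET
    runs = min(runs_off_bat, 6)      # cap anything above 6 at 6
    return 4 if runs == 5 else runs  # treat the rare 5 runs as 4


print(encode_outcome(1, False))
print(encode_outcome(5, False))
print(encode_outcome(0, True))
```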

The Dataset

The dataset I have is a collection of CSV files, one for each IPL match since 2008.

Each CSV file holds the ball-by-ball data for its match.

This collection of CSV files will be used to calculate certain statistics of the players and will be ultimately used as part of the dataset for the machine learning model to predict the outcomes of balls.

Calculating Player Skill Ratings

Batsmen Ratings

The two metrics commonly used to indicate the skill of a batsman are:

Strike Rate (how many runs a batsman scores per 100 balls)
Average (how many runs a batsman scores on average before getting out)

Adding these together can give a rating for a batsman, where a higher number would indicate a more skilled player.

However, there are some limitations when using this as a rating:

Different batsmen excel in different situations in a game. For example, finishers are excellent at scoring quick runs near the end of an innings, but probably would not do as good a job when opening the innings. Strike Rates and Averages do not provide this insight into what role a batsman is best in.
Strike Rates and Averages do not give an insight into how a batsman scores their runs. Two batsmen can have a strike rate of 135 and an average of 30, but one can achieve that by frequently running between the wickets while the other can achieve it by hitting boundaries after every few balls. Predicting the outcome of a ball can be much more accurate if the type of batsman is considered.

With these things considered, I felt a single number was not enough to capture the skill of a batsman. After some thought, I decided that each batsman shall have the following ratings:

Explosivity Rating
- Quantifies whether a batsman is a big hitter who likes to accumulate their runs from boundaries
- Total no. of boundaries hit / Total no. of balls faced
Running Rating
- Quantifies whether a batsman likes to score their runs by running hard between the wickets
- Total no. of balls where batsman ran for their runs / Total no. of balls faced
Finisher Rating
- Quantifies how often a batsman is involved in the finish of an innings
- Total no. of not outs / Total no. of balls faced
Consistency Rating
- Measure of how consistently a batsman performs
- Batsman's Average
Quick Scorer Rating
- Measure of how quickly a batsman gets their runs
- Batsman's Strike Rate

After each rating above is calculated for each player in the database, the ratings will be standardised to acquire the final rating.
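To make these definitions concrete, here is a rough sketch of how the raw (unstandardised) ratings could be computed for one batsman from a list of balls. The field names `runs` and `out` and the function `raw_batting_ratings` are assumptions made for this example, and the finisher rating is left out because it needs per-innings information.

```python
def raw_batting_ratings(balls):
    """balls: non-empty list of dicts like {"runs": 4, "out": False} for one batsman."""
    faced = len(balls)
    runs = sum(b["runs"] for b in balls)
    boundaries = sum(1 for b in balls if b["runs"] in (4, 6))
    ran = sum(1 for b in balls if b["runs"] in (1, 2, 3))
    dismissals = sum(1 for b in balls if b["out"])

    return {
        "explosivity": boundaries / faced,
        "running": ran / faced,
        # the average is undefined for a batsman who was never dismissed;
        # falling back to total runs is a choice made for this sketch
        "consistency": runs / dismissals if dismissals else float(runs),
        "quick_scorer": 100 * runs / faced,
    }


balls = [
    {"runs": 4, "out": False},
    {"runs": 1, "out": False},
    {"runs": 0, "out": False},
    {"runs": 6, "out": False},
    {"runs": 1, "out": True},
]
print(raw_batting_ratings(balls))
```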

The formula for standardising a data point is:

$z = \frac{x - μ}{σ}$

$x$ is the data point in question.
$μ$ is the mean of the whole dataset
$σ$ is the standard deviation of the whole dataset
$z$ is the resulting score. It shows how many standard deviations above/below the average the datapoint is.

Bowler Ratings

The metrics commonly used to indicate the skill of a bowler are:

Economy (how many runs they concede per over)
Strike Rate (balls bowled per wicket)
Average (runs conceded per wicket)

I felt like these metrics were already good enough to indicate the type of bowler.

Economy measures how good a bowler is at conceding fewer runs, while strike rate measures the wicket-taking ability of the bowler.

A low average would mean a bowler has both a good economy and a good strike rate.

I named the ratings:

economy_rating
wicket_taking_rating
bowling_consistency_rating

These will be standardised using this formula:

$z = -\frac{x - μ}{σ}$

$x$ is the data point in question.
$μ$ is the mean of the whole dataset
$σ$ is the standard deviation of the whole dataset
$z$ is the resulting score. It shows how many standard deviations above/below the average the datapoint is.

This is almost the same as the formula used for batting, except for the negation at the front. Since lower economies, strike rates and averages are better, the scores are negated so that a higher rating would indicate a more skilled bowler.
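A minimal sketch of both variants, using made-up economy figures. The helper name `standardise` and its `lower_is_better` flag are assumptions for this example. It uses the sample standard deviation, which is also what pandas' `std()` computes by default.

```python
from statistics import mean, stdev


def standardise(values, lower_is_better=False):
    """Return z-scores; negate them when a lower raw value means a better player."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    sign = -1 if lower_is_better else 1
    return [sign * (x - mu) / sigma for x in values]


economies = [6.5, 7.5, 8.5]  # runs per over for three hypothetical bowlers
z = standardise(economies, lower_is_better=True)
print(z)
```

With the negation, the bowler with the lowest economy ends up with the highest rating.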

I also decided to use another rating:

specialist_rating
- used to measure whether a bowler is a specialist, part-timer or a non-bowler
- Number of balls bowled / number of matches played

This rating will be standardised using the original standardisation formula.

Experience

Right now, a tailender who has only faced 4 balls in their career and hit 16 runs would be given a high number in most of the batting ratings. This obviously should not be the case, as a tailender would not be able to keep up those numbers for an extended period of time. Similarly, a batsman may have bowled a few decent overs in their career and may end up being higher rated than some established bowlers.

Therefore, these ratings need to be adjusted to account for their experience as a batsman and a bowler.

The number of innings a player has batted in will determine the batting experience.

The number of balls a player has bowled will determine the bowling experience.

Both of these will be standardised (using the formula from before) and their values will be clipped from -1.5 to 1.5, so that the ratings don't just favour the players who have played the most.

Once they have been standardised, each player rating will be adjusted as follows:

If the rating is a batting rating, add the batting experience to the rating. Otherwise, add the bowling experience.
Re-standardise this new number
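The adjustment described above could look roughly like this. It is only a sketch: `standardise` and `adjust_for_experience` are names invented for the example, and the ratings and innings counts are invented too.

```python
from statistics import mean, stdev


def standardise(values):
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]


def adjust_for_experience(ratings, experience, limit=1.5):
    """ratings: already-standardised ratings; experience: raw counts (e.g. innings batted)."""
    # standardise the experience, then clip it so veterans are not over-rewarded
    exp_z = [max(-limit, min(limit, e)) for e in standardise(experience)]
    # add the experience to each rating and standardise the result again
    return standardise([r + e for r, e in zip(ratings, exp_z)])


# a tailender with one lucky cameo (high rating, 1 innings) vs two regular batsmen
ratings = [1.2, 0.9, -0.4]
innings = [1, 150, 120]
adjusted = adjust_for_experience(ratings, innings)
print(adjusted)
```

In this toy example the tailender starts with the highest rating, but the experienced batsman in second place overtakes them once experience is added.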

Exploring the ratings

I wrote a script to go through each ball from the dataset, calculate the overall ratings of players and save them to a CSV file. You can see a snippet of the data below.

According to these ratings...

The top 15 fastest scorers in the IPL have been:

PN Mankad
AD Russell
V Sehwag
AB de Villiers
GJ Maxwell
RR Pant
CH Gayle
KA Pollard
HH Pandya
SP Narine
YK Pathan
DA Warner
SR Watson
SA Yadav
DA Miller

The top 15 most consistent batsmen:

Iqbal Abdulla
KL Rahul
DA Warner
AB de Villiers
CH Gayle
MS Dhoni
DA Miller
V Kohli
RR Pant
F du Plessis
JC Buttler
S Dhawan
JP Duminy
SPD Smith
SK Raina

The top 15 most explosive batsmen (highest proportion of balls faced that were hit for boundaries):

PN Mankad
B Stanlake
RS Sodhi
V Sehwag
SP Narine
AD Russell
CH Gayle
GJ Maxwell
SR Watson
RR Pant
AB de Villiers
SA Yadav
BB McCullum
DR Smith
DA Warner

The top 15 finishers:

RA Jadeja
HH Pandya
DJ Bravo
MS Dhoni
DA Miller
Harbhajan Singh
JR Hazlewood
YK Pathan
IK Pathan
Iqbal Abdulla
Mukesh Choudhary
P Sahu
BA Bhatt
K Upadhyay
JE Taylor

Top 15 hardest runners (highest proportion of balls faced where the runs were scored by running):

CRD Fernando
DP Vijaykumar
NJ Rimmington
RG More
SPD Smith
DA Miller
RA Jadeja
V Kohli
AB de Villiers
MS Dhoni
MK Pandey
AR Patel
AT Rayudu
KL Rahul
SK Raina

Top 15 most economical bowlers:

AD Russell
SN Thakur
GJ Maxwell
SR Watson
PP Chawla
JA Morkel
JJ Bumrah
STR Binny
DL Chahar
HV Patel
Rashid Khan
Mohammed Siraj
R Vinay Kumar
KH Pandya
K Rabada

Top 15 best wicket takers:

Sandeep Sharma
RP Singh
MM Patel
UT Yadav
GJ Maxwell
MG Johnson
Rashid Khan
AR Patel
A Nehra
SN Thakur
MM Sharma
AB Dinda
JD Unadkat
DL Chahar
DJ Bravo

Top 15 most consistent bowlers:

MM Patel
Sandeep Sharma
RP Singh
A Nehra
MG Johnson
UT Yadav
AR Patel
AB Dinda
Rashid Khan
GJ Maxwell
JH Kallis
MM Sharma
Harbhajan Singh
R Bhatia
SN Thakur

Obviously there are a few anomalies seen in each rating, with some bowlers ranking higher than specialist batsmen, despite correcting for experience. However, these anomalous players do not carry over to the other ratings and the ratings as a whole do make sense.

Match Data

For each ball, along with player skill data, the context of the current match will also be considered.

The following pieces of information will be considered in each ball:

Ball Number
Batsman's Score
Balls faced by the batsman
Proportion of balls faced by the batsman that resulted in 0 runs
Proportion of balls faced by the batsman that resulted in 1 run
Proportion of balls faced by the batsman that resulted in 2 runs
Proportion of balls faced by the batsman that resulted in 3 runs
Proportion of balls faced by the batsman that resulted in 4 runs
Proportion of balls faced by the batsman that resulted in 6 runs
Runs conceded by the bowler
Number of balls bowled by the bowler
Number of wickets taken by the bowler
Proportion of balls bowled by the bowler that resulted in 0 runs
Proportion of balls bowled by the bowler that resulted in 1 run
Proportion of balls bowled by the bowler that resulted in 2 runs
Proportion of balls bowled by the bowler that resulted in 3 runs
Proportion of balls bowled by the bowler that resulted in 4 runs
Proportion of balls bowled by the bowler that resulted in 6 runs
Proportion of balls bowled by the bowler that resulted in a wicket
Chasing score (if applicable)
Required run rate (if applicable)
Innings score
Innings wickets

All of these will be standardised as done before.

Building the dataset

Now that the data to be used has been decided, it is time to process the ball-by-ball CSV files into the dataset to train on.

import pandas as pd
import numpy as np
import os
import pickle as pkl

# standardising formula
def zscore(col):
    mean = col.mean()
    std = col.std()

    return (col - mean)  / std

df = pd.DataFrame() # will hold the final dataset at the end
player_db = pd.read_csv("player-db.csv") # need it for player ratings

The code below goes through each match and adds the relevant data to the dataframe.

for file in os.listdir("matches"):
    f = os.path.join("matches", file)
    match_df = pd.read_csv(f)
    match_df = match_df.fillna(0)

    # columns of all the match data needed

    ball_no = []
    striker = []
    bowler = []
    batsman_runs = []
    batsman_balls = []
    batsman_outcome_dists = { outcome : [] for outcome in [0,1,2,3,4,6]}
    bowler_economy = []
    wicket_taking = []
    bowler_consistency = []
    bowler_wickets = []
    bowler_runs = []
    bowler_balls = []
    bowler_outcome_dists = { outcome : [] for outcome in range(0, 7) }
    innings_score = []
    innings_wickets = []
    chasing =  []
    req_run_rate = []

    outcome = []

    batsmen = {}
    bowlers = {}
    score = 0
    wickets = 0
    chasing_score = 0

    prev_innings = 1

    for _, ball in match_df.iterrows():

            batter = ball["striker"]
            _bowler = ball["bowler"]

            striker.append(batter)
            bowler.append(_bowler)

            runs = int(ball["runs_off_bat"])
            runs = min(runs, 6)
            runs = 4 if runs == 5 else runs
            wides = int(ball["wides"])
            wicket = 1 if ball["wicket_type"] else 0
            innings = int(ball["innings"])

            if innings != prev_innings:
                chasing_score = score
                score = 0
                wickets = 0

            prev_innings = innings


            if batter not in batsmen:
                # this will hold the number of balls faced for each outcome by a batsman
                batsmen[batter] = {
                    0: 0,
                    1: 0,
                    2: 0,
                    3: 0,
                    4: 0,
                    6: 0
                }

            if _bowler not in bowlers:
                # this will hold the number of balls bowled for each outcome by a bowler (5 means wicket)

                bowlers[_bowler] = {
                    0: 0,
                    1: 0,
                    2: 0,
                    3: 0,
                    4: 0,
                    5: 0,
                    6: 0
                }


            ## Batting Data ##

            batsman_balls_faced = np.sum([batsmen[batter][i] for i in batsmen[batter]])
            batsman_dist = { _outcome : 0 if batsman_balls_faced == 0 else batsmen[batter][_outcome] / batsman_balls_faced for _outcome in batsmen[batter]} # getting proportion of balls faced for each outcome
            batsman_runs_scored = np.sum([_outcome * batsmen[batter][_outcome] for _outcome in batsmen[batter]])

            for _outcome in batsman_outcome_dists:
                batsman_outcome_dists[_outcome].append(batsman_dist[_outcome])

            batsman_runs.append(batsman_runs_scored)
            batsman_balls.append(batsman_balls_faced)

            ## Bowling Data ##

            bowler_balls_bowled = np.sum([bowlers[_bowler][i] for i in bowlers[_bowler]])
            bowler_runs_given = np.sum([_outcome * bowlers[_bowler][_outcome] if _outcome != 5 else 0 for _outcome in bowlers[_bowler]])
            bowler_wickets_taken = bowlers[_bowler][5]
            bowler_dist = { _outcome : bowlers[_bowler][_outcome] / bowler_balls_bowled if bowler_balls_bowled != 0 else 0 for _outcome in bowlers[_bowler] } # getting proportion of balls bowled for each outcome

            for _outcome in bowler_outcome_dists:
                bowler_outcome_dists[_outcome].append(bowler_dist[_outcome])

            bowler_runs.append(bowler_runs_given)
            bowler_wickets.append(bowler_wickets_taken)
            bowler_balls.append(bowler_balls_bowled)

            ## Innings Data

            innings_score.append(score)
            innings_wickets.append(wickets)
            chasing.append(chasing_score)

            ball_outcome = runs if wicket == 0 else 5

            outcome.append(ball_outcome)

            discrete_ball_no = ball["ball"]
            discrete_ball_no = int(discrete_ball_no) * 6 + int(round((discrete_ball_no % 1) * 10)) # e.g. 3.2 -> 3 completed overs, ball 2 -> 20
            ball_no.append(discrete_ball_no)

            rem_balls = 120 - (discrete_ball_no - 1) if discrete_ball_no <= 120 else int(discrete_ball_no - 120)
            _req_run_rate = max(0, (chasing_score - score) / rem_balls)

            req_run_rate.append(_req_run_rate)

            ## Update State ##

            batsmen[batter][runs] += 1
            bowlers[_bowler][runs] += 1
            bowlers[_bowler][5] += wicket
            score += runs
            wickets += wicket

    new_match_df = pd.DataFrame()
    new_match_df["striker"] = striker
    new_match_df["bowler"] = bowler
    new_match_df["batsman_runs"] = batsman_runs
    new_match_df["batsman_balls"] = batsman_balls

    for _outcome in batsman_outcome_dists:
        new_match_df[f"batsman_{_outcome}"] = batsman_outcome_dists[_outcome]

    new_match_df["bowler_runs"] = bowler_runs
    new_match_df["bowler_balls"] = bowler_balls
    new_match_df["bowler_wickets"] = bowler_wickets

    for _outcome in bowler_outcome_dists:
        new_match_df[f"bowler_{_outcome}"] = bowler_outcome_dists[_outcome]

    new_match_df["innings_score"] = innings_score
    new_match_df["innings_wickets"] = innings_wickets
    new_match_df["ball"] = ball_no
    new_match_df["chasing"] = chasing
    new_match_df["req_run_rate"] = req_run_rate
    new_match_df["outcome"] = outcome

    frames = [df, new_match_df]
    df = pd.concat(frames)


# df now contains the relevant match data from every single ball in the dataset

This now adds the player ratings to the collected data and saves the dataset.

batting_ratings = player_db[["player", "explosivity_rating", "consistency_rating", "finisher_rating", "quick_scorer_rating", "running_rating"]]
bowling_ratings = player_db[["player", "economy_rating", "wicket_taking_rating", "bowling_consistency_rating", "specialist_rating"]]

df = df.join(batting_ratings.set_index("player"), on="striker") # add batsmen's batting ratings to dataframe
df = df.join(bowling_ratings.set_index("player"), on="bowler") # add bowler's bowling ratings to dataframe

df = df.drop(["striker", "bowler"], axis=1) # remove the striker and bowler columns since they are not part of the features needed to predict the outcome of a ball

# get the means and standard deviations of each column

df_mean = df.drop(["outcome"], axis=1).mean()
df_std = df.drop(["outcome"], axis=1).std()

# save the mean and std of all the columns. this would be needed later for preprocessing when predicting new data.

with open("mean-std.bin","wb") as f:
    pkl.dump({
        "mean": df_mean,
        "std": df_std
    }, f)

# standardise all the columns except the ratings and outcome

for col in df.columns:
    if col != "outcome" and "rating" not in col:
        df[col] = zscore(df[col])

# shuffle

df = df.sample(frac=1)

# split into a training set and a testing set 
training_split = int(len(df) * 0.85) # take 85% to train

training_df = df[:training_split]
testing_df = df[training_split:]

# save the datasets

training_df.to_csv("balls-train.csv", index=False)
testing_df.to_csv("balls-test.csv", index=False)

The dataset looks something like this.

Training and testing the model

I have chosen to use a Random Forest Classifier to train on the data, which is really simple and quick to train using sklearn.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import pickle as pkl
from random import choices

dataset = pd.read_csv("balls-train.csv")
x = dataset.drop(["outcome"], axis=1)
y = dataset["outcome"]

clf = RandomForestClassifier(max_depth=15, n_estimators=50)
clf.fit(x, y)

with open("clf.bin", "wb") as f:
    pkl.dump(clf, f)

Now to test this model, we could see its accuracy on the test set.

test = pd.read_csv("balls-test.csv")
predicted = clf.predict(test.drop(["outcome"], axis=1))
print (np.sum(predicted == test["outcome"]) / len(test))

Running this results in the following output:

0.4430872720835546

An accuracy of 44% does not seem too great. However, each ball can easily have 2 or 3 reasonable outcomes, so using accuracy is not a good measure to capture how the model is performing.

Instead, I thought it would be better to look at the input data of each outcome. For each outcome, take the inputs that were predicted to have this outcome by the model and also take the inputs in the test set that were labelled with this outcome. If these two sets of inputs are similar, then the model has predicted this outcome well.

test = pd.read_csv("balls-test.csv")
predicted = test.drop(["outcome"], axis=1)
preds = clf.predict(predicted)
predicted["outcome"] = preds

for outcome in range(0, 7):
    print ("Outcome ", outcome)

    # get rows which were labelled with this outcome in the test data
    test_slice = test[test["outcome"] == outcome]

    # get rows which were predicted with this outcome by the model
    predicted_slice = predicted[predicted["outcome"] == outcome]

    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean() 

    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0) 

    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5

    print ("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice)) # compare how many times this outcome appeared in the test set and how many times it got predicted
    print ("Average Distance", dist)

Running this results in the following output...

Outcome  0
Actual Count 12041 Predicted Count 13324
Average Distance 1.1953108714619187
Outcome  1
Actual Count 12564 Predicted Count 20402
Average Distance 0.8442394082114827
Outcome  2
Actual Count 2055 Predicted Count 16
Average Distance 4.355749118524852
Outcome  3
Actual Count 112 Predicted Count 0
Average Distance 4.526265034977223
Outcome  4
Actual Count 3848 Predicted Count 74
Average Distance 2.6306944519746605
Outcome  5
Actual Count 1646 Predicted Count 20
Average Distance 3.586353351892623
Outcome  6
Actual Count 1628 Predicted Count 58
Average Distance 4.243057588990431

This shows that the model is performing poorly. Its distances are quite large for each outcome, except for 0 and 1, considering the magnitudes of the values in the input.

It has also not been able to match the proportions of outcomes in the test set, with the predicted counts being very different to the actual counts.

The model is clearly biased towards predicting 0 and 1 for each ball. This does make sense, however, since those are the most common outcomes in games of Cricket.
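To put the 44% accuracy in context, here is a quick calculation using the actual counts printed above: the share of each outcome in the test set, and the accuracy a model would get by always predicting the most common outcome. The variable names are just for this example.

```python
# actual counts per outcome in the test set, copied from the output above
actual_counts = {0: 12041, 1: 12564, 2: 2055, 3: 112, 4: 3848, 5: 1646, 6: 1628}

total = sum(actual_counts.values())
proportions = {o: n / total for o, n in actual_counts.items()}

# the outcome that appears most often, and how often guessing it would be right
most_common = max(actual_counts, key=actual_counts.get)
baseline_accuracy = actual_counts[most_common] / total

print(total)
print(most_common, round(baseline_accuracy, 3))
```

Always guessing a single would be right roughly 37% of the time, so an accuracy of 44% is only a modest improvement over that.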

The problem here is how the predictions are made. The model outputs a probability distribution of the outcomes for each ball. Currently, the outcome with the highest probability is taken as the predicted outcome. It would make more sense to randomly select an outcome, using the probability distribution to weight the random selection.

Here is the same code but using weighted random selection.

test = pd.read_csv("balls-test.csv")
predicted = clf.predict_proba(test.drop(["outcome"], axis=1))
preds = []

for weights in predicted:
    outcomes = [0,1,2,3,4,5,6]

    # weighted random selection
    p = choices(outcomes, weights=weights)

    preds.append(p[0])

predicted = test.drop(["outcome"], axis=1)
predicted["outcome"] = preds

for outcome in range(0, 7):
    print ("Outcome ", outcome)

    test_slice = test[test["outcome"] == outcome]
    predicted_slice = predicted[predicted["outcome"] == outcome]

    # a vector of the average values of the inputs that were labelled with the outcome in the test set
    test_slice_mean = test_slice.mean() 

    # a vector of the average values of the inputs that were predicted to have this outcome by the model
    predicted_slice_mean = predicted_slice.mean().fillna(0) 

    # calculates the euclidean distance between the two vectors
    dist = ((predicted_slice_mean - test_slice_mean) ** 2).sum() ** 0.5

    print ("Actual Count", len(test_slice), "Predicted Count", len(predicted_slice))
    print ("Average Distance", dist)

Running this results in the following...

Outcome  0
Actual Count 12041 Predicted Count 12072
Average Distance 0.06285163994309419
Outcome  1
Actual Count 12564 Predicted Count 12507
Average Distance 0.047105551432281484
Outcome  2
Actual Count 2055 Predicted Count 2089
Average Distance 0.13890305459997243
Outcome  3
Actual Count 112 Predicted Count 94
Average Distance 0.779151006384056
Outcome  4
Actual Count 3848 Predicted Count 3858
Average Distance 0.09621869495553977
Outcome  5
Actual Count 1646 Predicted Count 1676
Average Distance 0.2084278773037825
Outcome  6
Actual Count 1628 Predicted Count 1598
Average Distance 0.29624599526308754

Now the model can be seen to be performing very well. It matches the test set's proportion of outcomes and the distances between the inputs have reduced to a small range.
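As a small self-contained illustration of why weighted selection reproduces the outcome proportions while always taking the most likely outcome does not, here is a sketch using made-up probabilities for a single ball. The numbers and variable names are assumptions for the example, not output from the model.

```python
from collections import Counter
from random import Random

outcomes = [0, 1, 2, 3, 4, 5, 6]
# made-up probabilities for a single ball, in the same order as `outcomes`
weights = [0.36, 0.37, 0.06, 0.01, 0.11, 0.05, 0.04]

rng = Random(0)  # fixed seed so the run is repeatable

# always taking the most likely outcome
argmax_pick = outcomes[weights.index(max(weights))]

# weighted random selection, repeated many times
samples = rng.choices(outcomes, weights=weights, k=10_000)
counts = Counter(samples)

print(argmax_pick)
print({o: counts[o] / len(samples) for o in outcomes})
```

The first approach returns a single every time, while the sampled frequencies end up close to the weights, including the rarer outcomes such as fours and wickets.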

Exploring each feature

I thought it would be intriguing to see what each feature of the dataset contributed to the outcome of a ball.

Shown below are box plots for each feature against each predicted outcome to help visualise the distribution of features against each output. This will help in showing how a feature contributes to the outcome.

[Box plots not included. One plot was shown for each feature, in this order: Batsman's runs; Batsman's balls; Batsman's dot ball proportion; Batsman's single proportion; Batsman's double proportion; Batsman's three runs proportion; Batsman's four runs proportion; Batsman's six runs proportion; Runs conceded by the bowler; Number of balls bowled by the bowler; Wickets taken by the bowler; Bowler's dot ball proportion; Bowler's single proportion; Bowler's double proportion; Bowler's three runs proportion; Bowler's four runs proportion; Bowler's wicket delivery proportion; Bowler's six runs proportion; Innings score; Innings wickets; Ball number of the innings; Score to chase; Required run rate; Explosivity rating; Consistency rating; Finisher rating; Quick scorer rating; Running rating; Economy rating; Wicket taking rating; Bowling consistency rating; Bowling specialist rating.]

From these box plots, we can see which features have more of an impact on the outcome.

Features that produce similar looking boxplots for each outcome do not have much of an effect on the outcome of a ball. These features include:

Batsman/Bowler ball outcome proportions
Bowler ratings
Number of wickets taken by a bowler
Chasing score
Required run rate

It was not surprising that the chasing score and required run rate features did not have much of an effect on the outcome, as half the balls in the dataset wouldn't have had these features applied to them.

It was also not surprising to see the ball outcome proportion and no. of wickets taken features in this list. The same values of these features can appear all throughout a cricket match, so they are bound to have a wide range of outcomes for the same values.

I was however surprised to see that bowler ratings did not have much of an impact on the outcome, while batting ratings did.

This can mean a few things:

The outcome of a game comes more down to how strong a team's batting lineup is rather than their bowling.
There must be a better way to quantify the skills of a bowler

After having a thought about this, while I do think it's a mix of both, I feel it is mainly due to the first point.

I believe this is because, especially in a competition like the IPL, the quality of bowlers does not vary as much as the quality of batsmen (statistically speaking at least). It is quite common for bowlers and other non-specialist batsmen to bat in a game, whereas a part-time bowler is only used occasionally and a specialist batsman almost never bowls.

The way to quantify a bowler's skill could still be improved however, maybe considering some of the following:

Average pace of the bowler
Average degree of turn the bowler gets (for spinners)
Adjusting their economy/wicket taking ratings to consider the contexts of the matches they bowled in

These pieces of data were outside the scope of the dataset I had, but would be interesting to implement to improve this project in the future.

Nevertheless, I am happy with how the model has trained, aligning itself closely to the test dataset.

Part 2

To avoid making this part too long, I have decided to split this up into two parts.

Part 2 will involve building the actual match simulator. I will use it to see how it does in simulating real games and to answer questions about hypothetical game situations.

Thank you for reading!

AI learns how to land on the moon

ashwins-code — Fri, 21 Oct 2022 07:26:17 +0000

Welcome everyone to this post where I teach an AI how to land on the moon.

Of course, I am not talking about the actual moon (although I wish I was). Instead, I am talking about a simulated environment: OpenAI's Lunar Lander Gym environment.

Reinforcement Learning

I recently came across an article about DeepMind's AlphaTensor.

You can read about it here

AlphaTensor learnt how to multiply two matrices together in an extremely efficient way, managing to complete multiplications in fewer steps than Strassen's algorithm (the previous-best algorithm).

Reading this article inspired me to read into an area of machine learning I did not know much about - reinforcement learning. DeepMind has built several other very impressive AIs, which were all trained using reinforcement learning algorithms.

What is reinforcement learning?

Reinforcement learning concerns learning the best actions to take in certain situations in an environment in order to achieve a certain goal. For example, in a game of chess, RL algorithms would learn what piece to move and where to move it, given the state of the game board and the goal of winning the game.

RL models learn purely from their interactions in an environment. They are given no training dataset listing the best actions to take in a given situation. They learn everything from experience.

How do they learn?

An agent is anything that interacts with an environment by observing it and taking actions based on those observations.

For each action the agent takes in an environment, the agent is given a reward which indicates how good that action was, given the observation and the aim of the agent in the environment.

RL algorithms improve the performance of these agents essentially through trial and error. They initially perform random actions and see what rewards they get from them. They then can develop a policy over time based on these action-reward pairs. A policy describes the best actions to take in given situations, with the goal of the environment kept in mind.
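The interaction loop described above can be sketched with a toy environment. `LineEnv` is a stand-in written for this example and is not part of Gym, and the policy here just acts randomly, which is how an untrained agent starts out.

```python
import random


class LineEnv:
    """Toy environment: start at position 0, reach position 3 to finish."""

    def reset(self):
        self.pos = 0
        return self.pos  # the observation is just the position

    def step(self, action):
        # action 0 moves left, action 1 moves right
        self.pos += 1 if action == 1 else -1
        self.pos = max(self.pos, -3)  # wall on the left
        done = self.pos == 3
        reward = 10 if done else -1   # small penalty per step, bonus at the goal
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of (obs, action, reward) tuples."""
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        history.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return history


random.seed(0)
history = run_episode(LineEnv(), policy=lambda obs: random.choice([0, 1]))
print(len(history), sum(r for _, _, r in history))
```

A learning algorithm would use the recorded rewards to prefer action 1 over time, since moving right reaches the goal in three steps.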

There are several algorithms available for developing a policy.

For this project, I used a DQN (Deep Q-Network) to develop a policy.

A DQN however is not the most efficient method for the Lunar Lander environment. Since the observation space is not too large for this environment, methods involving a Q-Table would be able to train much quicker to solve this environment.

I decided to use a DQN however since I wanted to learn how they worked as they could be applied to a much wider range of environments than Q-Tables.

Q-Values and DQNs

DQNs are neural networks that take in the state of the environment as input and output the q-values for each possible action that the agent can take. It describes the agent's policy.

There is no fixed model architecture for DQNs. It varies from environment to environment. For example, an agent playing an Atari game might observe the environment through a picture of the game. In that case, it would be best to use convolutional layers as part of the DQN architecture. Using a simple feedforward neural network would work fine with other environments, such as the lunar lander environment in OpenAI's Gym.

Q-values measure the expected future rewards when taking that action, assuming that the same policy is followed in the future. When an agent is following a policy, it takes the action with the greatest q-value at the current state it's in.
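To show the shape of this idea, here is a tiny feedforward network written in plain Python that maps an 8-value state to 4 q-values and then picks the action with the greatest one. Everything in it is an assumption for illustration: the helper names, the hidden layer size of 16 and the random, untrained weights. A real DQN would be built and trained with a deep learning library.

```python
import random


def dense(inputs, weights, biases, relu=False):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden = random_layer(8, 16, rng)   # 8 observation values in
output = random_layer(16, 4, rng)   # 4 q-values out, one per action


def q_values(state):
    h = dense(state, *hidden, relu=True)
    return dense(h, *output)


state = [0.0, 1.4, 0.1, -0.5, 0.0, 0.0, 0.0, 0.0]  # made-up observation
q = q_values(state)
action = max(range(len(q)), key=lambda a: q[a])  # action with the greatest q-value
print(q, action)
```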

The way DQNs are trained to determine accurate q-values will be explained as I go through the code.

DQNs are used when the action space in an environment is discrete, i.e. there is a finite number of possible actions an agent can take in an environment.

For example, if an environment involved driving a car, its action space would be considered discrete if the only actions allowed were to drive forward, turn left and right 10 degrees. Its action space would be continuous if actions involved specifying the angle of the steering wheel and the speed to travel at. This is because these can take an infinite number of values and, therefore, there are an infinite number of actions.

Problem and Code

As I mentioned earlier, I am going to train an agent within OpenAI's Lunar Lander Gym environment.

The aim of the agent in this environment is to land the lander on its legs between the two flags.

The agent can take 4 actions:

Do nothing
Fire left engine
Fire right engine
Fire main engine

An observation taken from this environment is an 8-dimensional vector containing:

X coordinate
Y coordinate
X velocity
Y velocity
Angle of the lander
Angular velocity
2 Booleans describing whether each leg is in contact with the ground or not

The agent is rewarded as follows:

-100 points for crashing
+100 points for coming to rest
+10 points for each leg in contact with the ground
-0.3 points for each frame firing main engine
-0.03 points for each frame firing a side engine
+100 to +140 points for moving from the top to the landing pad

The agent is considered to have solved the environment if it has collected at least 200 points in an episode.

episode - series of steps/frames that occur until some criteria for the environment to reset has been met (episode termination)

Episodes terminate if:

the lander crashes
the lander goes out of view horizontally

Training a DQN

An agent interacts with an environment by taking actions in it. For each step in the environment, the agent records the following into a replay memory:

The observation it took of the environment
The action it took from this observation
The reward it gained
The new observation of the environment
Whether the episode has terminated or not
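
A replay memory holding the five items above can be sketched as follows. This is a minimal illustration rather than the code used in this post (which uses a plain list): the `ReplayMemory` class and its method names are hypothetical, and a `deque` with a maximum length is assumed as the storage.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (obs, action, reward, new_obs, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, new_obs, done):
        self.buffer.append((obs, action, reward, new_obs, done))

    def sample(self, batch_size):
        # Uniform random batch without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.add(obs=[float(step)], action=0, reward=1.0,
               new_obs=[float(step + 1)], done=False)

print(len(memory))          # 3: only the three most recent transitions remain
print(memory.buffer[0][0])  # [2.0]: steps 0 and 1 were discarded
```

Capping the size keeps memory use bounded and means training batches are drawn from fairly recent experience.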

The agent has an initial exploration rate, which determines how often it should take random moves instead of taking actions from the DQN's policy.

This is so that all actions can be explored during the training phase and therefore allow the algorithm to see which actions would be best in certain situations. The DQN's initial policy is random, so having the agent follow it all the time would mean the DQN would struggle to train to develop a strong policy, since different actions haven't been explored for the same states.

The exploration rate decreases gradually as training goes on (in the code below, it is multiplied by a decay factor after every step). By the time the exploration rate is close to 0, the agent follows the DQN's policy almost all the time. By then, the DQN should have produced a strong policy.
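
As a rough sketch of how such a decaying exploration rate can work, the snippet below picks a decay factor so that the rate falls from 1 to a chosen minimum after a chosen number of steps, and uses the rate to choose between a random and a greedy action. The helper names `decay_factor` and `choose_action` are made up for this illustration, and the q-values are invented numbers.

```python
import math
import random


def decay_factor(min_epsilon, decay_steps):
    # Solve 1.0 * decay ** decay_steps == min_epsilon for decay.
    return math.exp(math.log(min_epsilon) / decay_steps)


def choose_action(q_values, epsilon, rng=random):
    # Explore with probability epsilon, otherwise take the greedy action.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


timesteps = 70000
decay_steps = int(timesteps * 0.85)
decay = decay_factor(0.05, decay_steps)

epsilon = 1.0
for _ in range(decay_steps):
    epsilon *= decay

print(round(epsilon, 4))                                  # 0.05
print(choose_action([0.1, 2.5, -1.0, 0.3], epsilon=0.0))  # 1, the greedy action
```

A multiplicative decay spends most of the early steps exploring and only gradually hands control over to the learned policy.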

The q-values are updated using the following equation:

$$Q(s, a) = r + \gamma Q(s', a')$$

$Q$ is the policy function. It takes in the environment state and an action as input and returns the q-value for that action.
$s$ is the current state
$a$ is a possible action
$r$ is the reward for taking action $a$ at state $s$
$s'$ is the state of the environment after taking action $a$ at state $s$
$a'$ is the action with the highest q-value at state $s'$
$\gamma$ is the discount rate. It is a specified constant measuring how important future rewards are in the environment.
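
To make the equation concrete, here is a small worked example with made-up numbers. The function name `q_target` is hypothetical; it computes the right-hand side of the equation, and returns the reward alone when the episode has ended, since there is no future state in that case.

```python
def q_target(reward, next_q_values, done, discount):
    # Terminal transitions have no future, so the target is just the reward.
    if done:
        return reward
    # Otherwise add the discounted value of the best next action.
    return reward + discount * max(next_q_values)


# Made-up numbers for illustration only.
next_qs = [0.5, 2.0, -1.0, 0.0]
print(q_target(reward=1.0, next_q_values=next_qs, done=False, discount=0.8))    # 2.6
print(q_target(reward=-100.0, next_q_values=next_qs, done=True, discount=0.8))  # -100.0
```

In the first call the best next action has a q-value of 2.0, so the target is 1.0 + 0.8 * 2.0 = 2.6.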

Every n steps in the environment, a random batch is taken from the replay memory.

The DQN then predicts the q-values at each state in the batch $Q (s, a)$ and the q-values at each of the new states, so that the best action at that state can be obtained $Q (s^{'}, a^{'})$ .

For each item in the batch, the calculated $Q (s, a)$ , $Q (s^{'}, a^{'})$ and reward values are substituted into the equation above. This should calculate a slightly better $Q (s, a)$ value for this batch item.

The DQN is then trained with the batch observations as input and the newly calculated $Q (s, a)$ values as output.

Note: the $Q(s, a)$ and $Q(s', a')$ values are calculated by separate networks, the policy network and the target network. They are initialised with the same weights. The policy network is the main network that is trained. The target network isn't trained; instead, the policy network's weights are copied to it every m steps in the environment.
This is done so that the training process becomes stable. If one network was used to predict both $Q(s, a)$ and $Q(s', a')$ and was also trained, the network would end up chasing a forever moving target, leading to poor results. Using the target network to calculate $Q(s', a')$ means that the policy network has a still target to aim at for a while before the target changes, instead of the target changing every step in the environment.
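
The effect of only syncing the target network every m steps can be seen in a toy sketch. `TinyNetwork` is a hypothetical stand-in for a real network (it just holds a list of numbers), and "training" is faked by adding 1 to every weight, so the numbers only illustrate the timing of the copies.

```python
import copy


class TinyNetwork:
    """Stand-in for a neural network: just a list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    def get_weights(self):
        return copy.deepcopy(self.weights)

    def set_weights(self, weights):
        self.weights = copy.deepcopy(weights)


policy = TinyNetwork([0.0, 0.0])
target = TinyNetwork(policy.get_weights())  # both start with the same weights

update_target_every = 3
history = []

for step in range(1, 7):
    # Pretend one training step nudges every policy weight.
    policy.weights = [w + 1.0 for w in policy.weights]

    # The target only catches up every few steps.
    if step % update_target_every == 0:
        target.set_weights(policy.get_weights())

    history.append((step, policy.weights[0], target.weights[0]))

for row in history:
    print(row)
```

Between syncs the target's weights stay fixed (0.0 for steps 1 and 2, 3.0 for steps 4 and 5) while the policy keeps moving.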

This is repeated for a specified number of steps. Over time, the policy should become stronger.

class DQN:
    def __init__(self, action_n, model):
        self.action_n = action_n
        self.policy = model(action_n)
        self.target = model(action_n)
        self.replay = []
        self.max_replay_size = 10000
        self.weights_initialised = False

    def play_episode(self, env, epsilon, max_timesteps):

        obs = env.reset()
        rewards = 0
        steps = 0

        for _ in range(max_timesteps):
            rand = np.random.uniform(0, 1)

            #taking a random action or the action described by the DQN policy
            if rand <= epsilon:
                action = env.action_space.sample()
            else:
                actions = self.policy(np.array([obs]).astype(float)).numpy()
                action = np.argmax(actions)

            if not self.weights_initialised:
                    self.target.set_weights(self.policy.get_weights())
                    self.weights_initialised = True

            new_obs, reward, done, _ = env.step(action)
            if len(self.replay) >= self.max_replay_size:
                self.replay = self.replay[(len(self.replay) - self.max_replay_size) + 1:]

            #save data into replay memory for training
            self.replay.append([obs, action, reward, new_obs, done])

            #count rewards and steps so that we can see some information during training
            rewards += reward
            obs = new_obs
            steps += 1

            yield steps, rewards

            if done:
                env.close()
                break


    def learn(self, env, timesteps, train_every = 5, update_target_every = 50, show_every_episode = 4, batch_size = 64, discount = 0.8, min_epsilon = 0.05, min_reward=150):
        max_episode_timesteps = 1000
        episodes = 1
        epsilon = 1 #exploration rate
        decay = np.e ** (np.log(min_epsilon) / (timesteps * 0.85)) #factor the exploration rate is multiplied by after every step (reaches min_epsilon after 85% of the timesteps)
        steps = 0

        episode_list = []
        rewards_list = []

        while steps < timesteps:
            for ep_len, rewards in self.play_episode(env, epsilon, max_episode_timesteps):
                epsilon *= decay
                steps += 1


                if steps % train_every == 0 and len(self.replay) > batch_size:
                    #taking random batch from replay memory
                    batch = random.sample(self.replay, batch_size)
                    obs = np.array([o[0] for o in batch])
                    new_obs = np.array([o[3] for o in batch])

                    #calculating the Q(s,a) values
                    curr_qs = self.policy(obs).numpy()

                    #calculating q-values of the "future"/new observations to obtain Q(s', a')
                    future_qs = self.target(new_obs).numpy()

                    for row in range(len(batch)):
                        action = batch[row][1]
                        reward = batch[row][2]
                        done = batch[row][4]

                        if not done:
                            #Q(s, a) = reward + discount * Q(s', a')
                            curr_qs[row][action] = reward + discount * np.max(future_qs[row])
                        else:
                            #if the environment is completed, there are no future actions, so Q(s, a) = reward only
                            curr_qs[row][action] = reward

                    #fitting DQN to newly calculated Q(s, a) values
                    self.policy.fit(obs, curr_qs, batch_size=batch_size, verbose=0)

                if steps % update_target_every == 0 and len(self.replay) > batch_size:
                    #updating target model
                    self.target.set_weights(self.policy.get_weights())

            episodes += 1
            #showing some training data
            if episodes % show_every_episode == 0:
                print ("epsiode: ", episodes)
                print ("explore rate: ", epsilon)
                print ("episode reward: ", rewards)
                print ("episode length: ", ep_len)
                print ("timesteps done: ", steps)



                if rewards > min_reward:
                    self.policy.save(f"policy-model-{rewards}")

            episode_list.append(episodes)
            rewards_list.append(rewards)
        self.policy.save("policy-model-final")
        plt.plot(episode_list, rewards_list)
        plt.show()

DQN.py

Now that training is out of the way, here is the code for the whole DQN.py file.

import numpy as np
import tensorflow as tf
import random
from matplotlib import pyplot as plt

def build_dense_policy_nn():
    def f(action_n):
        model = tf.keras.models.Sequential([
                tf.keras.layers.Dense(256, activation="relu"),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dense(64, activation="relu"),
                tf.keras.layers.Dense(32, activation="relu"),
                tf.keras.layers.Dense(action_n, activation="linear"),
            ])

        model.compile(loss=tf.keras.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.Adam(0.0001))

        return model

    return f

class DQN:
    def __init__(self, action_n, model):
        self.action_n = action_n
        self.policy = model(action_n)
        self.target = model(action_n)
        self.replay = []
        self.max_replay_size = 10000
        self.weights_initialised = False

    def play_episode(self, env, epsilon, max_timesteps):

        obs = env.reset()
        rewards = 0
        steps = 0

        for _ in range(max_timesteps):
            rand = np.random.uniform(0, 1)

            if rand <= epsilon:
                action = env.action_space.sample()
            else:
                actions = self.policy(np.array([obs]).astype(float)).numpy()
                action = np.argmax(actions)

                if not self.weights_initialised:
                    self.target.set_weights(self.policy.get_weights())
                    self.weights_initialised = True

            new_obs, reward, done, _ = env.step(action)
            if len(self.replay) >= self.max_replay_size:
                self.replay = self.replay[(len(self.replay) - self.max_replay_size) + 1:]

            self.replay.append([obs, action, reward, new_obs, done])
            rewards += reward
            obs = new_obs
            steps += 1

            yield steps, rewards

            if done:
                env.close()
                break


    def learn(self, env, timesteps, train_every = 5, update_target_every = 50, show_every_episode = 4, batch_size = 64, discount = 0.8, min_epsilon = 0.05, min_reward=150):
        max_episode_timesteps = 1000
        episodes = 1
        epsilon = 1
        decay = np.e ** (np.log(min_epsilon) / (timesteps * 0.85))
        steps = 0

        episode_list = []
        rewards_list = []

        while steps < timesteps:
            for ep_len, rewards in self.play_episode(env, epsilon, max_episode_timesteps):
                epsilon *= decay
                steps += 1


                if steps % train_every == 0 and len(self.replay) > batch_size:
                    batch = random.sample(self.replay, batch_size)
                    obs = np.array([o[0] for o in batch])
                    new_obs = np.array([o[3] for o in batch])

                    curr_qs = self.policy(obs).numpy()
                    future_qs = self.target(new_obs).numpy()

                    for row in range(len(batch)):
                        action = batch[row][1]
                        reward = batch[row][2]
                        done = batch[row][4]

                        if not done:
                            curr_qs[row][action] = reward + discount * np.max(future_qs[row])
                        else:
                            curr_qs[row][action] = reward

                    self.policy.fit(obs, curr_qs, batch_size=batch_size, verbose=0)

                if steps % update_target_every == 0 and len(self.replay) > batch_size:
                    self.target.set_weights(self.policy.get_weights())

            episodes += 1

            if episodes % show_every_episode == 0:
                print ("epsiode: ", episodes)
                print ("explore rate: ", epsilon)
                print ("episode reward: ", rewards)
                print ("episode length: ", ep_len)
                print ("timesteps done: ", steps)



                if rewards > min_reward:
                    self.policy.save(f"policy-model-{rewards}")

            episode_list.append(episodes)
            rewards_list.append(rewards)
        self.policy.save("policy-model-final")
        plt.plot(episode_list, rewards_list)
        plt.show()


    def play(self, env):
        for _ in range(10):
            obs = env.reset()
            done = False

            while not done:
                actions = self.policy(np.array([obs]).astype(float)).numpy()
                action = np.argmax(actions)
                obs, _, done, _ = env.step(action)
                env.render()

    def load(self, path):
      m = tf.keras.models.load_model(path)
      self.policy = m

play is the method that shows the agent in action!

load loads a saved DQN into the class

Testing it out!

import gym
from DQN import * #the file above is saved as DQN.py, and module names are case-sensitive on some systems

env = gym.make("LunarLander-v2")

dqn = DQN(4, build_dense_policy_nn())

dqn.play(env)
dqn.learn(env, 70000)
dqn.play(env)

Before training, the agent plays like this...

During training we can see how it's going...

epsiode:  4
explore rate:  0.9868456446936881
episode reward:  -124.58158870915031
episode length:  71
timesteps done:  263
epsiode:  8
explore rate:  0.9661965615592099
episode reward:  -120.64909734406406
episode length:  101
timesteps done:  683
epsiode:  12
explore rate:  0.9492716348733212
episode reward:  -115.3412820349026
episode length:  103
timesteps done:  1034
epsiode:  16
explore rate:  0.9321267977756045
episode reward:  -93.92673345696777
episode length:  85
timesteps done:  1396

...

epsiode:  44
explore rate:  0.8147960354481776
episode reward:  -81.70688741109889
episode length:  87
timesteps done:  4068
epsiode:  48
explore rate:  0.8007650685225999
episode reward:  -134.96785569534904
episode length:  95
timesteps done:  4413
epsiode:  52
explore rate:  0.7822352926206606
episode reward:  -252.0391992426531
episode length:  117
timesteps done:  4878
epsiode:  56
explore rate:  0.7682233340884487
episode reward:  -129.31041070395162
episode length:  118
timesteps done:  5237
epsiode:  60
explore rate:  0.7510891766906618
episode reward:  -42.51701614323742
episode length:  150
timesteps done:  5685

...

epsiode:  200
explore rate:  0.05587076853211827
episode reward:  -100.11491946673415
episode length:  1000
timesteps done:  57295
epsiode:  204
explore rate:  0.045679405591051145
episode reward:  -107.24645551050241
episode length:  1000
timesteps done:  61295
epsiode:  208
explore rate:  0.040104832222074366
episode reward:  -16.873940050515692
episode length:  1000
timesteps done:  63880
epsiode:  212
explore rate:  0.03278932696585445
episode reward:  116.37994616097882
episode length:  1000
timesteps done:  67880
epsiode:  216
explore rate:  0.03008289932359156
episode reward:  -200.89010177116512
episode length:  354
timesteps done:  69591

and this episode-vs-reward graph...

You might expect there to be a clearer trend showing the reward increasing over time. However, due to the exploration rate, this trend is distorted. As the episodes go on, the exploration rate decreases, so the expected trend of rewards increasing over time becomes slightly more apparent.
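
One common way to make a noisy trend like this easier to read is to plot a moving average of the episode rewards instead of the raw values. The sketch below is not part of the original training code: `moving_average` is a hypothetical helper and the rewards are made-up numbers.

```python
def moving_average(values, window):
    # Average each value with up to `window - 1` values before it.
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Made-up episode rewards, for illustration only.
episode_rewards = [-120.0, -80.0, -200.0, -40.0, 20.0, -60.0, 100.0]
smoothed = moving_average(episode_rewards, window=3)

print(smoothed[:2])  # [-120.0, -100.0]
```

Passing the smoothed list to `plt.plot` in place of the raw rewards would give a less jagged curve.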

Here is how the agent performs at the end of the training process!

It could still do with a smoother landing, but I think this is a good performance nonetheless.

Maybe you could try this out yourself and tweak some of the training parameters and see what results they yield!

Deep Learning Library From Scratch 7: Implementing RNN layers

ashwins-code — Mon, 03 Oct 2022 18:02:05 +0000

Hi guys! Welcome back to part 7 of this series, where we build our own deep learning library.

The last article went through the implementation of the automatic differentiation module. In this part, we will see how easy this module makes adding new layer types by implementing RNN layers.

The code for this series' library can be found at this repo:

ashwin6-dev / Zen-Deep-Learning-Library

Deep Learning library written in Python. Contains code for my blog series on building a deep learning library.

Zen - Deep Learning Library

A deep learning library written in Python.

Contains the code for my blog series where we build this library from scratch

mnist.py contains an example for a MNIST digit recogniser using the library
rnn.py contains an example of a Recurrent Neural Network which learns to fit to the graph sin(x) * cos(x)


What are RNNs?

So far in our library, we have only encountered simple feed forward neural networks.

These networks work well in quite a few cases, but when our inputs introduce the concept of time, feed forward networks begin to struggle, since they contain no mechanism to encapsulate contextual data.

Many kinds of data, such as language and stock prices, depend on historical data points. For example, when generating language, the next word in a sentence depends on the words that have already been generated. When predicting stock prices, the previous prices can be used to estimate whether the price will rise or fall. Our feed forward networks simply do not have the mechanisms to handle such types of data, so how can we model such data?

RNNs were designed for this specific problem.

RNN stands for recurrent neural network.

As the name suggests, RNNs rely on a recurrence mechanism to capture contextual data.

How do they work?

RNNs utilise a hidden state to encapsulate sequential data.

Exactly what the hidden state is will be explained a bit later.

RNNs take in a sequence as input. They can output both a new sequence and the final hidden state from the input.

x is the input sequence
y is the outputted sequence
h is the final hidden state
U, W, V are all linear functions

How does this exactly happen?

I like to think of RNNs as a unit that is applied to each time-step of the input sequence. A time-step refers to the item in a sequence at a specific point in time.

$x_t$ refers to the item of the input sequence $x$ at time-step $t$

$y_t$ refers to the item of the output sequence $y$ at time-step $t$

$h_t$ refers to the hidden state after iterating through $t$ time-steps

As the diagram shows, an RNN unit produces an output time-step and a new hidden state after being fed the input time-step and the previous hidden state, using the linear functions $U$, $W$, $V$.

$U$, $W$, $V$ each take the form

$$f(x) = Wx + B$$

where $W$ is an adjustable weight matrix, $B$ is an adjustable bias vector and $x$ is the input vector. Each of the three functions has its own weights and biases.

The hidden state is a vector that encapsulates the meaning of the sequence. It essentially carries the information of all the time-steps it has seen so far.

$y_t$ and $h_t$ are calculated as follows.

$$u = U(x_t)$$
$$w = W(h_{t-1})$$
$$h_t = \tanh(u + w)$$
$$y_t = V(h_t)$$

The initial hidden state $h_0$ is usually a vector of 0s.

Hopefully you can see that if these calculations are applied to each time-step of the input sequence, an output sequence and a final hidden state are produced.
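
The loop over time-steps can be sketched in plain Python using single numbers instead of vectors. This is only an illustration of the calculation, not the library's implementation: `linear` and `rnn_forward` are hypothetical helpers, and the weights are hand-picked rather than learned.

```python
import math


def linear(weight, bias):
    # Scalar version of f(x) = W x + B.
    return lambda x: weight * x + bias


def rnn_forward(sequence, U, W, V, initial_state=0.0):
    state = initial_state
    outputs = []
    for x_t in sequence:
        u = U(x_t)
        w = W(state)
        state = math.tanh(u + w)  # new hidden state
        outputs.append(V(state))  # output for this time-step
    return outputs, state


# Hypothetical hand-picked weights; a real layer would learn these.
U = linear(0.5, 0.0)
W = linear(0.9, 0.0)
V = linear(2.0, 0.1)

outputs, final_state = rnn_forward([1.0, 0.0, -1.0], U, W, V)
print(len(outputs))  # 3: one output per time-step
print(final_state)
```

Because of the `tanh`, the hidden state always stays between -1 and 1, however long the sequence is.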

This final hidden state is a vector that contains all the information about the sequence the RNN layer was given.

The output sequence could be given to another RNN layer to reveal more complex patterns within the data.

During training, the weights and biases used within the $U$ , $W$ and $V$ functions are adjusted, so that the output sequence and hidden state better represent the inputted sequence.

Code Implementation

We shall add our RNN layer class to "nn.py"

import autodiff as ad
import numpy as np
import loss 
import optim

...

class RNN(Layer):
    def __init__(self, units, hidden_dim, return_sequences=False):
        self.units = units
        self.hidden_dim = hidden_dim
        self.return_sequences = return_sequences
        self.U = None
        self.W = None
        self.V = None

    def one_forward(self, x):
        x = np.expand_dims(x, axis=1)
        state = np.zeros((x.shape[-1], self.hidden_dim))
        y = []

        for time_step in x:
            mul_u = self.U(time_step[0])
            mul_w = self.W(state)
            state = Tanh()(mul_u + mul_w)

            if self.return_sequences:
                y.append(self.V(state))

        if not self.return_sequences:
            state.value = state.value.squeeze()
            return state

        return y

    def __call__(self, x):
        if self.U is None:
            self.U = Linear(self.hidden_dim)
            self.W = Linear(self.hidden_dim)
            self.V = Linear(self.units)

        if not self.return_sequences:
            states = []
            for seq in x:
                state = self.one_forward(seq)
                states.append(state)

            s = ad.stack(states)
            return s

        sequences = []
        for seq in x:
            out_seq = self.one_forward(seq)
            sequences.append(out_seq)

        return sequences

    def update(self, optim):
        self.U.update(optim) 
        self.W.update(optim)

        if self.return_sequences:
            self.V.update(optim)

Let's break this down into its separate methods.

def __init__(self, units, hidden_dim, return_sequences=False):
        self.units = units
        self.hidden_dim = hidden_dim
        self.return_sequences = return_sequences
        self.U = None
        self.W = None
        self.V = None

The class constructor takes in three parameters.

units - the size of each time-step in the outputted sequence when return_sequences is true

hidden_dim - the size of the hidden state

return_sequences - if true, the RNN layer will return a newly calculated sequence the same length as the input sequence. If false, it returns the final hidden state.

self.U = None
self.W = None
self.V = None

U, W, V are initialised as None, but will represent the different weight matrices shown earlier after the layer's first forward pass.

The following methods have been commented to make them easier to follow.

def __call__(self, x):
        """
        This method takes in a batch of sequences and returns the final hidden states after iterating through each sequence if return_sequences is False, otherwise it returns the output sequences.
        """

        if self.U is None:
            #initialise U, W, V if this is the first forward pass
            #Since we know U, W, V are all linear functions with trainable parameters, we can simply assign them to instances of our already existing Linear class.
            self.U = Linear(self.hidden_dim)
            self.W = Linear(self.hidden_dim)
            self.V = Linear(self.units)

        if not self.return_sequences:
            states = []
            #go through each sequence
            for seq in x:
                #apply the "one_forward" method to sequence
                state = self.one_forward(seq)

                #append the final hidden state
                states.append(state)

            #use "stack" method to convert this list of tensors into a single tensor, so that its derivative can be calculated
            s = ad.stack(states)

            #return hidden states
            return s

        sequences = []
        #go through each sequence
        for seq in x:
            #apply the "one_forward" method to sequence
            out_seq = self.one_forward(seq)

            #append the output sequence
            sequences.append(out_seq)

        #return output sequences
        return sequences

def one_forward(self, x):
        """
        This method takes in a list representing a single sequence.
        """

        x = np.expand_dims(x, axis=1) #making list numpy array with 2 dimensions, so that its shape can be calculated for the following line
        state = np.zeros((x.shape[-1], self.hidden_dim)) #hidden state initialised as a matrix of 0s
        y = [] #this array will store the output sequence if return_sequences is True


        #iterate through each time-step in the sequence
        for time_step in x:
            #perform next state calculation
            mul_u = self.U(time_step[0]) 
            mul_w = self.W(state)
            state = Tanh()(mul_u + mul_w)

            if self.return_sequences:
                #calculate the output sequence time-step if return_sequences is True
                y.append(self.V(state))

        if not self.return_sequences:
            # return hidden state if return_sequences is False
            state.value = state.value.squeeze()
            return state

        #return output sequence if return_sequences is True
        return y

def update(self, optim):
        self.U.update(optim) 
        self.W.update(optim)

        if self.return_sequences:
            self.V.update(optim)

This update method was seen in the linear layer too. This method is called during backpropagation to adjust the weights of the layers.

You may have noticed a new function from our autodiff module - stack.

Here is the code for the stack function.

autodiff.py

def stack(tensor_list):
    tensor_values = [tensor.value for tensor in tensor_list]
    s = np.stack(tensor_values)

    var = Tensor(s)
    var.dependencies += tensor_list

    for tensor in tensor_list:
        var.grads.append(np.ones(tensor.value.shape))

    return var

This joins a list of tensors together into a single tensor.

This makes it possible to operate on a group of separately computed tensors at once, while still recording the operation on the computation graph.

Notice how we did not need to write any backpropagation code for this class. Our autodiff module provides that layer of abstraction for us, making implementing new layer types much easier!

Building an RNN model to test it all out!

To see if this all is working well, we are going to build an RNN model to extrapolate trigonometric curves.

The RNN will take in a sequence of y-values from consecutive points on the curve. Its job is to predict the y-value of the next point on the curve.

The RNN will extrapolate the curve by repeatedly being fed the most recent points on the curve.
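
The idea of repeatedly feeding the most recent points back in can be sketched without any model at all. In the snippet below, `extrapolate` and `toy_predictor` are hypothetical names: the predictor is a stand-in for the trained RNN and uses a known trigonometric identity so that the result can be checked.

```python
import math


def extrapolate(predict_next, window, n_steps):
    # Work on a copy so the caller's window is untouched.
    window = list(window)
    predictions = []
    for _ in range(n_steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest point, append the prediction.
        window = window[1:] + [next_value]
    return predictions


def toy_predictor(window):
    # Stand-in for the trained RNN. For samples of sin taken one unit apart,
    # sin(t + 1) = 2 * cos(1) * sin(t) - sin(t - 1) holds exactly.
    return 2 * math.cos(1) * window[-1] - window[-2]


start = [math.sin(i) for i in range(50)]
future = extrapolate(toy_predictor, start, n_steps=5)
print([round(v, 3) for v in future])
```

With a real model, each prediction carries some error, so errors can build up the further the curve is extrapolated.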

Imports...

import numpy as np
import nn
import optim
import loss
from matplotlib import pyplot as plt

Generate the first 200 points of a sine wave

x_axis = np.array(list(range(200))).T
seq = [np.sin(i) for i in range(200)]

Prepping dataset.

Inputs will contain a window of 50 points from the curve. The output is the y-value of the next point on the curve.

x = []
y = []

for i in range(len(seq) - 50):
    new_seq = [[value] for value in seq[i:i+50]] #each time-step is wrapped in its own list
    x.append(new_seq)
    y.append([seq[i+50]])
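
As a quick sanity check of the windowing, the same dataset can be rebuilt with the standard library only and its shape inspected. The variable names `window_size` and `start` are introduced here for readability and are not in the original snippet.

```python
import math

seq = [math.sin(i) for i in range(200)]
window_size = 50

x, y = [], []
for start in range(len(seq) - window_size):
    window = seq[start:start + window_size]
    x.append([[value] for value in window])  # each time-step is a 1-element list
    y.append([seq[start + window_size]])     # the point right after the window

print(len(x), len(x[0]), len(x[0][0]))  # 150 50 1
print(y[0][0] == seq[50])               # True
```

So there are 150 training examples, each made of 50 time-steps with one value per time-step.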

Building model

model = nn.Model([
    nn.RNN(32, hidden_dim=32, return_sequences=True),
    nn.RNN(0, hidden_dim=32), #units is 0 since it is irrelevant here. return_sequences is False by default!
    nn.Linear(8),
    nn.Linear(1)
])

Train model and predict!

The predictions are plotted to a graph and saved as a png file.

model.train(np.array(x[:50]), np.array(y[:50]), epochs=10, optimizer=optim.RMSProp(0.0005), loss_fn=loss.MSE, batch_size=8)

preds = []

for i in x:
    preds.append(model(np.expand_dims(np.array(i), axis=0)).value[0][0])

#each prediction is for the point right after its 50-point window, so plot from index 50 onwards
plt.plot(x_axis[50:], seq[50:])
plt.plot(x_axis[50:], preds)
plt.savefig("graph.png")

Running the code will result in something similar to the following:

**

EPOCH 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.24it/s]
LOSS 3.324489628181238

**

EPOCH 2
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.49it/s]
LOSS 2.75052987566751

**

EPOCH 3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.28it/s]
LOSS 2.2723695780732083

**

...

EPOCH 9
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.44it/s]
LOSS 0.07373163731751195

**

EPOCH 10
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  6.63it/s]
LOSS 0.06355743981401539

Results

The results for a few expressions are shown below.

Blue represents the true values
Orange represents the model's predictions

sin(x)

sin(x) * cos(x)

I am happy with these results!

Thanks for reading!

Thanks for reading through this blog post.

We will continue to add new layer types in the next few posts of this series, so be ready for those!

Deep Learning Library From Scratch 6: Integrating new autodiff module and MNIST digit classifier

ashwins-code — Sat, 17 Sep 2022 11:51:22 +0000

Hello and welcome to part 6 of this series on building a deep learning library from scratch.

The GitHub repo for this series is:

ashwins-code / Zen-Deep-Learning-Library

Deep Learning library written in Python. Contains code for my blog series on building a deep learning library.

Zen - Deep Learning Library

A deep learning library written in Python.

Contains the code for my blog series where we build this library from scratch

mnist.py contains an example for a MNIST digit recogniser using the library
rnn.py contains an example of a Recurrent Neural Network which learns to fit to the graph sin(x) * cos(x)


What are we doing?

If you recall from the previous post, we finished the code for our automatic differentiation module (for now at least!)

Deep learning libraries rely on an automatic differentiation module to handle the backpropagation process during model training. However, our library currently calculates weight derivatives "by hand". Now that we have our own autodiff module, let's have our library use it to carry out backpropagation!

We are also going to build a digit classifier to test out if everything works.

What was wrong with doing it without the module?

Doing it without the module was not wrong as such. After all, it did work perfectly fine.

However, when we start to implement more complex types of layers and activation functions in our library, hard coding the derivative calculations may become difficult to get your head around.

An autodiff module provides that layer of abstraction, calculating the derivatives for us so we don't have to.

nn.py

Let's create a file called "nn.py".

This file will contain all the components that make up a neural network such as layers, activations etc...

Linear Layer


import autodiff as ad
import numpy as np
import loss 
import optim
from tqdm import tqdm #progress bar used by Model.train

np.random.seed(345)

class Layer:
    def __init__(self):
        pass

class Linear(Layer):
    def __init__(self, units):
        self.units = units
        self.w = None
        self.b = None

    def __call__(self, x):
        if self.w is None:
            self.w = ad.Tensor(np.random.uniform(size=(x.shape[-1], self.units), low=-1/np.sqrt(x.shape[-1]), high=1/np.sqrt(x.shape[-1])))
            self.b = ad.Tensor(np.zeros((1, self.units)))

        return x @ self.w + self.b

Quite simple so far. __call__ simply carries out the forward pass when an instance of this class is called as a function. It also initialises the layer's parameters if it's being called for the first time.

The weights and biases are now instances of the Tensor class, which means they will become part of the computation graph when operations begin. This means that our autodiff module will be able to calculate their derivatives.

Note how there is no backward method like we had previously. We don't need it anymore since the autodiff module will calculate the derivatives for us!

Activations

class Sigmoid:
    def __call__(self, x):
        return 1 / (1 + np.e ** (-1 * x))

class Softmax:
    def __call__(self, x):
        e_x = np.e ** (x - np.max(x.value))
        s_x = (e_x) / ad.reduce_sum(e_x, axis=1, keepdims=True)
        return s_x

class Tanh:
    def __call__(self, x):
        return (2 / (1 + np.e ** (-2 * x))) - 1

These stay pretty much the same as before, just without the backward method of course!

Model class

class Model:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        output = x

        for layer in self.layers:
            output = layer(output)

        return output

    def train(self, x, y, epochs=10, loss_fn = loss.MSE, optimizer=optim.SGD(lr=0.1), batch_size=32):
        for epoch in range(epochs):
            _loss = 0
            print (f"EPOCH", epoch + 1)
            for batch in tqdm(range(0, len(x), batch_size)):
                output = self(x[batch:batch+batch_size])
                l = loss_fn(output, y[batch:batch+batch_size])
                optimizer(self, l)
                _loss += l

            print ("LOSS", _loss.value)

The model class stays similar to how it was before but now can train on the dataset in batches.

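Training in batches, rather than on the whole dataset at once, means the parameters are updated several times per epoch and each update needs less memory. The slightly noisy gradients from small batches often help the model generalise better too.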
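To make the batching concrete, here is a minimal sketch of how the range(0, len(x), batch_size) loop in train carves a dataset into slices. The batches helper is a hypothetical name used only for illustration; it is not part of the library.

```python
def batches(data, batch_size):
    """Yield consecutive slices of `data`, mirroring the loop in Model.train."""
    for start in range(0, len(data), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield data[start:start + batch_size]


samples = list(range(10))
for batch in batches(samples, batch_size=4):
    print(batch)
```

The final batch only holds whatever is left over, which is why the slice in train never needs an explicit bounds check.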

loss.py

loss.py will contain the different loss functions we implement in the library.

import autodiff as ad

def MSE(pred, real):
    loss = ad.reduce_mean((pred - real)**2)
    return loss

def CategoricalCrossentropy(pred, real):
    loss = -1 * ad.reduce_mean(real * ad.log(pred))

    return loss

Again, same as before, just without the backward methods.

New Autodiff functions

Before we get on to optimisers, you may have noticed that the code so far uses some new functions from the autodiff module!

Here are the new functions:

def reduce_sum(tensor, axis = None, keepdims=False):
    var = Tensor(np.sum(tensor.value, axis = axis, keepdims=keepdims))
    var.dependencies.append(tensor)
    var.grads.append(np.ones(tensor.value.shape))

    return var

def reduce_mean(tensor, axis = None, keepdims=False):
    return reduce_sum(tensor, axis, keepdims) / tensor.value.size

def log(tensor):
    var = Tensor(np.log(tensor.value))
    var.dependencies.append(tensor)
    var.grads.append(1 / tensor.value)

    return var
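The local gradients stored by these functions (1 for every element of a sum, 1 / x for log) can be sanity-checked numerically. The sketch below uses plain Python and central finite differences rather than the autodiff module itself; numerical_gradient is a hypothetical helper, not something from the library.

```python
import math


def numerical_gradient(f, values, eps=1e-6):
    """Approximate df/dx_i for each x_i using central differences."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        down = list(values)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


xs = [0.5, 2.0, 4.0]

# d(sum)/dx_i should be 1 for every element, matching np.ones(...) in reduce_sum.
sum_grads = numerical_gradient(sum, xs)

# d(log x)/dx should be 1 / x, matching the gradient stored by log.
log_grads = [numerical_gradient(lambda v: math.log(v[0]), [x])[0] for x in xs]

print(sum_grads)
print(log_grads)
```

A check like this is a handy habit whenever a new operation is added to an autodiff module, since a wrong local gradient silently breaks training.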

optim.py

optim.py will contain the different optimisers we implement in this library.

SGD

from nn import Layer

class SGD:
    def __init__(self, lr):
        self.lr = lr

    def delta(self, param):
        return param.gradient * self.lr

    def __call__(self, model, loss):
        loss.get_gradients()

        for layer in model.layers:
            if isinstance(layer, Layer):
                layer.update(self)

Momentum

class Momentum:
    def __init__(self, lr = 0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.averages = {}

    def momentum_average(self, prev, grad):
        return (self.beta * prev) + (self.lr * grad)

    def delta(self, param):
        param_id = param.id

        if param_id not in self.averages:
            self.averages[param_id] = 0

        self.averages[param_id] = self.momentum_average(self.averages[param_id], param.gradient)
        return self.averages[param_id]

    def __call__(self, model, loss):
        loss.get_gradients()
        for layer in model.layers:
            if isinstance(layer, Layer):
                layer.update(self)

RMSProp

class RMSProp:
    def __init__(self, lr = 0.01, beta=0.9, epsilon=10**-10):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon
        self.averages = {}

    def rms_average(self, prev, grad):
        return self.beta * prev + (1 - self.beta) * (grad ** 2)

    def delta(self, param):
        param_id = param.id

        if param_id not in self.averages:
            self.averages[param_id] = 0

        self.averages[param_id] = self.rms_average(self.averages[param_id], param.gradient)
        return (self.lr / (self.averages[param_id] + self.epsilon) ** 0.5) * param.gradient

    def __call__(self, model, loss):
        loss.get_gradients()
        for layer in model.layers:
            if isinstance(layer, Layer):
                layer.update(self)

Adam

class Adam:
    def __init__(self, lr = 0.01, beta1=0.9, beta2=0.999, epsilon=10**-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.averages = {}
        self.averages2 = {}
        self.steps = {}

    def rms_average(self, prev, grad):
        return (self.beta2 * prev) + (1 - self.beta2) * (grad ** 2)

    def momentum_average(self, prev, grad):
        return (self.beta1 * prev) + ((1 - self.beta1) * grad)

    def delta(self, param):
        param_id = param.id

        if param_id not in self.averages:
            self.averages[param_id] = 0
            self.averages2[param_id] = 0
            self.steps[param_id] = 0

        # number of updates this parameter has received so far
        self.steps[param_id] += 1
        t = self.steps[param_id]

        self.averages[param_id] = self.momentum_average(self.averages[param_id], param.gradient)
        self.averages2[param_id] = self.rms_average(self.averages2[param_id], param.gradient)

        # bias correction: both averages start at 0, so they are scaled by 1 / (1 - beta ** t)
        adjust1 = self.averages[param_id] / (1 - self.beta1 ** t)
        adjust2 = self.averages2[param_id] / (1 - self.beta2 ** t)

        return self.lr * (adjust1 / (adjust2 ** 0.5 + self.epsilon))

    def __call__(self, model, loss):
        loss.get_gradients()
        for layer in model.layers:            
            if isinstance(layer, Layer):
                layer.update(self)

The code here has changed quite a bit from what it was before.

Let's have a closer look at what's going on.

__call__

def __call__(self, model, loss):
    loss.get_gradients()
    for layer in model.layers:
        if isinstance(layer, Layer):
            layer.update(self)

When an instance of an optimiser class is called, it takes in the model it's training and the loss value.

loss.get_gradients()

Here we utilise our autodiff module!

If you remember, the get_gradients method is part of the Tensor class and computes the derivatives of all variables involved in the calculation of this tensor.

This means all the weights and biases in the network now have their derivatives computed, which are all stored in their gradient property.

for layer in model.layers:
    if isinstance(layer, Layer):
        layer.update(self)

Now that the derivatives have been computed, the optimiser iterates through each layer of the network and updates its parameters by calling the layer's update method, passing itself as an argument.

The update method in our Linear layer class looks like this...

#nn.py
class Linear(Layer):
    ...
    def update(self, optim):
        self.w.value -= optim.delta(self.w)
        self.b.value -= optim.delta(self.b)

        self.w.grads = []
        self.w.dependencies = []
        self.b.grads = []
        self.b.dependencies = []

This method takes an instance of an optimiser and updates the layer's parameters by a delta value calculated by the optimiser.

self.w.value -= optim.delta(self.w)
self.b.value -= optim.delta(self.b)

delta is a method in the optimiser's class. It takes in a tensor and uses its derivative to return how much the tensor's value should be adjusted by.

The delta method varies depending on the optimiser that is being used.

Let's have a look at one of the delta methods.

class RMSProp:
    ...

    def rms_average(self, prev, grad):
        return self.beta * prev + (1 - self.beta) * (grad ** 2)

    def delta(self, param):
        param_id = param.id

        if param_id not in self.averages:
            self.averages[param_id] = 0

        self.averages[param_id] = self.rms_average(self.averages[param_id], param.gradient)
        return (self.lr / (self.averages[param_id] + self.epsilon) ** 0.5) * param.gradient

    ...

param_id = param.id

if param_id not in self.averages:
   self.averages[param_id] = 0

Remember that most optimisers keep a running average of each parameter's gradients (or squared gradients) to help steer the updates towards a minimum.

This is why we assigned an id to each tensor, so that their gradient averages could be kept track of by an optimiser.

self.averages[param_id] = self.rms_average(self.averages[param_id], param.gradient)
return (self.lr / (self.averages[param_id] + self.epsilon) ** 0.5) * param.gradient

If necessary, the parameter's gradient average is recalculated (remember SGD does not maintain an average).

The method then computes how much the parameter should be adjusted by and returns this value.
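To see how the returned delta differs between optimisers, here is a small standalone sketch. FakeParam is a hypothetical stand-in for a Tensor (it only has an id and a gradient), and the two delta calculations use the same arithmetic as the SGD and RMSProp classes above, stripped of the model handling.

```python
class FakeParam:
    """Hypothetical stand-in for a Tensor: just an id and a gradient."""
    def __init__(self, param_id, gradient):
        self.id = param_id
        self.gradient = gradient


def sgd_delta(param, lr=0.1):
    # Same arithmetic as SGD.delta: the step scales directly with the gradient.
    return param.gradient * lr


class RMSPropDelta:
    """Same arithmetic as RMSProp.delta, without the model handling."""
    def __init__(self, lr=0.01, beta=0.9, epsilon=10**-10):
        self.lr, self.beta, self.epsilon = lr, beta, epsilon
        self.averages = {}

    def delta(self, param):
        prev = self.averages.get(param.id, 0)
        avg = self.beta * prev + (1 - self.beta) * param.gradient ** 2
        self.averages[param.id] = avg
        return (self.lr / (avg + self.epsilon) ** 0.5) * param.gradient


small = FakeParam(param_id=0, gradient=0.001)
large = FakeParam(param_id=1, gradient=100.0)
rms = RMSPropDelta()

for p in (small, large):
    print(p.gradient, sgd_delta(p), rms.delta(p))
```

With SGD the two steps differ by the same factor as the gradients do, whereas RMSProp's first step comes out at roughly the same size for both parameters, because each gradient is divided by the root of its own running average.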

Have a look at the other optimisers to help you figure out how it all works.

MNIST Digit Classifier

To see if all our new changes work as expected, let's build a neural network to classify images of handwritten digits.

Import modules

from sklearn.datasets import load_digits
import numpy as np
import nn
import optim
import loss
from autodiff import *
from matplotlib import pyplot as plt

Prepare dataset

def one_hot(n, max):
    arr = [0] * max

    arr[n] = 1

    return arr


mnist = load_digits()
images = np.array([image.flatten() for image in mnist.images])
targets = np.array([one_hot(n, 10) for n in mnist.target])

The dataset loaded into the mnist variable (scikit-learn's digits dataset, a small MNIST-style collection of 8x8 images) contains images as 2D arrays. However, our library does not have layers that accept 2D inputs yet, so we have to flatten them.

one_hot takes in a number and returns its one-hot encoded array of length max

one_hot(3, 10) => [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Building the model

model = nn.Model([
    nn.Linear(64),
    nn.Tanh(),
    nn.Linear(32),
    nn.Sigmoid(),
    nn.Linear(10),
    nn.Softmax()
])

This is a simple feed-forward network, which uses the softmax function to output a probability distribution.

This distribution specifies the probability of each class (each digit in this case) being true given its input (the image).
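As a quick illustration of what the final softmax layer does, here is a minimal plain-Python sketch for a single row of scores. It uses the same max-subtraction trick as the Softmax class above, but it is a standalone example and does not go through the library's tensors.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the largest score first avoids overflow in exp()
    # without changing the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Picking the index of the largest probability is exactly what np.argmax does when we test the model later on.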

Training the model

model.train(images[:1000], targets[:1000], epochs=50, loss_fn=loss.CategoricalCrossentropy, optimizer=optim.RMSProp(0.001), batch_size=128)

All we need to train our model is this one line.

I've decided to use the first 1000 images to train the model (there are 1797 in the dataset).

Feel free to see how the model reacts when you change the training configuration. Maybe try changing the optimiser, loss function or the learning rate and see how that affects training.

Testing the model

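# copy the held-out images so that shuffling leaves the original array (and its alignment with targets) untouched
test_images = images[1000:].copy()
np.random.shuffle(test_images)

for image in test_images: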
    plt.imshow(image.reshape((8, 8)), cmap='gray')
    plt.show()
    pred = np.argmax(model(np.array([image])).value, axis=1)
    print (pred)

Here we shuffle the images that the model didn't train on into a random order.

We then go through each image, display it and get our model to predict what digit the image shows.

Let's run it!

Here is the output just after the model trains

EPOCH 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 153.60it/s]
LOSS 3.4742936714113055

EPOCH 2
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 166.29it/s]
LOSS 3.046120806655171

...

EPOCH 49
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 143.15it/s]
LOSS 0.003474840987802178

EPOCH 50                                                                                                                                                                                   
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 139.63it/s]
LOSS 0.0031392859241948746

Here you can see the model finishes with a loss of 0.0031392859241948746. Not bad at all.

Note how each epoch doesn't even take a second to complete! It seems our autodiff module is able to compute several derivatives at an acceptable speed.

Now let's see if the model can really classify digits...

These were the first 4 random digits when I ran the code...

These look like a 5, 6, 5 and a 7 to me....

(maybe the last one is a 3. I can't tell haha)

Let's see what the model predicted

Looks good to me!

Just to make sure, let's make it go through all the images and calculate its accuracy.

correct = 0
cnt = 0

for image in images:
    pred = np.argmax(model(np.array([image])).value, axis=1)
    real = np.argmax(targets[cnt])
    if pred[0] == real:
        correct += 1
    cnt += 1

print ("accuracy: ", correct / cnt)

Running this results in the following output...

accuracy:  0.9604897050639956

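Wow! It is safe to say that our model has successfully trained for this specific task. (This figure works out to 1726 correct predictions out of all 1797 images, so it includes the 1000 images the model was trained on.)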

Thank you for reading

Thank you for reading through this post. The next few posts will involve implementing more layer types such as recurrent layers and convolutional layers, so I hope you stay tuned for that!

How do Convolutional Neural Networks work?

ashwins-code — Sun, 03 Jul 2022 19:35:41 +0000

Facial Recognition, Object Detection and automated cancer detection... how do these all work?

At the core of many state-of-the-art image processing models lie convolutional networks, which use what are known as convolutions to accurately capture the features of an image.

Not only can CNNs be used for image processing, they can also be handy in NLP tasks, often serving as an alternative to recurrent neural nets, due to the fact they can be trained much faster. This article, however, will focus mainly on CNNs with an image processing task in mind.

What are Convolutions?

At the core of CNNs is an operation known as a "convolution".

Each convolutional layer performs this operation.

A convolution uses a filter and passes this filter over a given 2D array of numbers, performing a weighted sum using the numbers the filter passes over to create a feature map.

For example...

As the diagram shows, a 3x3 filter and a 6x6 2D array of numbers have been set up.

Now let's do the first step of passing this filter over the 2D array...

The filter is applied to the first 3x3 section of the 2D array, performing a weighted sum of the numbers it covers.

In this case, the weighted sum is (0.4*1) + (0.2*0) + (1.0*1) + (0.7*0) + (0.1*2) + (0.2*0) + (0.7*1) + (0.3*0) + (1.0*1) = 3.3

The result of this sum is then added into the feature map.

Stride length specifies how many columns the filter should move across after doing a weighted sum. In our example, we will set the stride length to 1.

This means the following occurs...

This process repeats until the filter has passed through the whole 2D array, forming a full feature map.
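The sliding process above can be sketched in a few lines of plain Python. The top-left 3x3 corner of the grid and the filter weights are taken from the weighted sum worked out earlier (which factor is the array and which is the filter is my assumption); the rest of the 6x6 grid is made up, since the diagram's full values aren't reproduced here.

```python
def convolve2d(grid, kernel, stride=1):
    """Slide `kernel` over `grid` and return the feature map of weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(grid) - k_rows) // stride + 1
    out_cols = (len(grid[0]) - k_cols) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += grid[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


grid = [
    [0.4, 0.2, 1.0, 0.0, 0.5, 0.3],
    [0.7, 0.1, 0.2, 0.6, 0.9, 0.1],
    [0.7, 0.3, 1.0, 0.8, 0.2, 0.4],
    [0.1, 0.5, 0.3, 0.7, 0.6, 0.0],
    [0.9, 0.4, 0.2, 0.1, 0.3, 0.8],
    [0.6, 0.0, 0.5, 0.2, 0.7, 0.9],
]
kernel = [
    [1, 0, 1],
    [0, 2, 0],
    [1, 0, 1],
]

feature_map = convolve2d(grid, kernel)
print(len(feature_map), len(feature_map[0]))  # size of the feature map
print(feature_map[0][0])                      # first weighted sum
```

With a 6x6 input, a 3x3 filter and a stride of 1, the filter fits in 4 positions along each axis, so the feature map is 4x4. Note that this version does not flip the filter before sliding it, which is how deep learning libraries usually define a "convolution".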

In terms of image processing, images can be represented as a 2D array of numbers, with each element representing an image pixel.

After a convolution is applied, the feature map, if a useful filter is used, should represent a meaningful feature of the image.

Exploring convolution filters

There are some commonly known filters, which result in interesting feature maps.

Below is a simple blur filter (a rough, unnormalised stand-in for a Gaussian blur; a true blur kernel's weights would be scaled to sum to 1)...

$$\left(\begin{matrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{matrix}\right)$$

and here is a short script which convolves an image with this filter and shows the result...

from matplotlib import image
from matplotlib import pyplot
from scipy.ndimage import convolve
import numpy as np

def rgb2gray(rgb):

    r, g, b = rgb[:,:,0], rgb[:,:,1], rgb[:,:,2]
    gray = 0.2989 * r + 0.5870 * g + 0.1140 * b

    return gray

image = rgb2gray(image.imread('city.jpg')) #converts to grayscale for simpler convolving

blur_kernel = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

pyplot.imshow(image, cmap="gray", vmin=np.min(image), vmax=np.max(image))
pyplot.show()

convolved = convolve(image, blur_kernel)

pyplot.imshow(convolved, cmap="gray", vmin=np.min(image), vmax=np.max(image))
pyplot.show()

Using this script, if we inputted this image...

the convolution's feature map would look like this...

As you can see, this particular filter blurs the image it's given.

Another well known filter is this edge detection filter...

$$\left(\begin{matrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{matrix}\right)$$

Here is the same script, but the kernel/filter has been changed

from matplotlib import image
from matplotlib import pyplot
from scipy.ndimage import convolve
import numpy as np

def rgb2gray(rgb):

    r, g, b = rgb[:,:,0], rgb[:,:,1], rgb[:,:,2]
    gray = 0.2989 * r + 0.5870 * g + 0.1140 * b

    return gray

image = rgb2gray(image.imread('city.jpg'))
edge_kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])

pyplot.imshow(image, cmap="gray", vmin=np.min(image), vmax=np.max(image))
pyplot.show()

convolved = convolve(image, edge_kernel)

pyplot.imshow(convolved, cmap="gray", vmin=np.min(image), vmax=np.max(image))
pyplot.show()

and here is the resulting image...

As shown, this particular filter detects any edges in the input image.

Hopefully, you can see that filters are used to pick out specific features in an image and we can use different filters to pick out different features we would like to see.

How are Convolutions used in CNNs?

Learning Filters

Obviously, manually specifying what convolution filters the network should use would most likely yield poor results.

CNNs start with random filters and adjust them during the training process, so the network gradually learns filters that pick out useful features in the input, including ones that would not be obvious to the human eye.

Multiple Filters

A convolutional layer does not just use one filter - it learns many filters in parallel, commonly somewhere between 32 and 512.

This allows the network to look at the input from different perspectives and pick up very specific details within the image, e.g. certain curve shapes.

Multiple Channels

Coloured images consist of channels, usually one channel each for red, green and blue.

Images will now have a 3D shape, meaning the filter must also have a 3D shape, e.g. a 3x3x3 filter instead of a 3x3.

The mechanism is the same, except that the weighted sum happens across all of the channels instead of just one.

Working with multiple channels allows the network to work with more data from images and, therefore, develop more complex understandings of them.
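Here is a minimal sketch of a convolution over multiple channels, written in plain Python with nested lists. The image and filter values are made up for illustration; the point is only that one 3x3x3 filter still produces a single 2D feature map.

```python
def convolve_channels(image, kernel):
    """image: H x W x C nested lists, kernel: k x k x C. Returns a 2D feature map."""
    k = len(kernel)
    channels = len(kernel[0][0])
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            # The weighted sum now runs over rows, columns AND channels.
            for i in range(k):
                for j in range(k):
                    for ch in range(channels):
                        total += image[r + i][c + j][ch] * kernel[i][j][ch]
            row.append(total)
        feature_map.append(row)
    return feature_map


# A made-up 4x4 RGB image where every pixel is (1, 2, 3),
# and a 3x3x3 filter whose weights are all 1.
image = [[[1, 2, 3] for _ in range(4)] for _ in range(4)]
kernel = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

feature_map = convolve_channels(image, kernel)
print(feature_map)
```

Each entry is the sum over 9 pixels of (1 + 2 + 3), so every value in the 2x2 feature map is 54.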

Pooling

Along with Convolutional layers, pooling layers are commonly used.

There are 3 main types of Pooling:
- Average Pooling
- Max Pooling
- Global Pooling

Pooling is a type of dimensionality reduction, where a window slides across a 2D array of numbers and the numbers within this window are reduced to a single number.

Average Pooling

The average of all numbers within the window is taken and placed into the new resulting 2D array.

Max Pooling

The largest number within the window is taken and placed into the new resulting 2D array.
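Both kinds of pooling can be sketched with one small helper in plain Python. This assumes the window moves by its own width each time, so windows never overlap (a common default, but an assumption here); pool2d and average are hypothetical names used for illustration.

```python
def pool2d(grid, window, reduce_fn):
    """Slide a non-overlapping window x window square over grid, reducing each one."""
    pooled = []
    for r in range(0, len(grid) - window + 1, window):
        row = []
        for c in range(0, len(grid[0]) - window + 1, window):
            values = [grid[r + i][c + j] for i in range(window) for j in range(window)]
            row.append(reduce_fn(values))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


grid = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]

max_pooled = pool2d(grid, 2, max)      # [[7, 8], [9, 5]]
avg_pooled = pool2d(grid, 2, average)  # [[4.0, 5.0], [5.25, 2.25]]
print(max_pooled)
print(avg_pooled)
```

A 4x4 input with a 2x2 window shrinks to 2x2, so three quarters of the numbers are discarded while the strongest (or average) response in each region is kept.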

Global Pooling

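If you wanted to pool a 3D array (meaning the input has channels), you could either Global Max Pool it or Global Average Pool it.

Global pooling uses a window as large as the whole feature map, so each channel is reduced to a single number: its maximum value or its average value respectively. An input with 3 channels therefore becomes just 3 numbers, one per channel.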

Thank You

Thank you for reading and hopefully you understand the theory behind CNNs.

If you are interested in implementing a CNN, here is how you could do it in Tensorflow.

How I built a Deep Learning Powered Search Engine

ashwins-code — Sat, 28 May 2022 15:14:11 +0000

We use search engines every day. Without them, we would be using the Internet in a very different way, one I can't really imagine.

I decided to have a go at making my own small search engine, using Deep Learning to power the search results.

Breaking the problem down

Because making a fully fledged search engine is a very long and expensive task, I decided that my search engine would be for Wikipedia pages only (i.e. given a search query, it will return any relevant Wikipedia pages for it).

There were 3 main steps to building this search engine:

Find/Create a database of Wikipedia pages
Index the database of pages
Efficiently match search queries to pages in the index

Creating the Database

Instead of trying to find a database of Wikipedia pages online, I decided to create my own, simply because I wanted to build a web crawler!

What is a web crawler?

A web crawler is initially given a few starting URLs to visit. It then visits these URLs and parses the HTML to find more URLs it can visit. Any URLs it finds are added to a queue, so that it can visit them later.

Search engines, like Google and Bing, use web crawlers to keep adding to their index of website links, so that they can come up in their search results.

Since most of the URLs on Wikipedia pages are URLs to other Wikipedia pages, all I had to do was write a simple, generic web crawler and feed it a few initial Wikipedia pages.

import requests
import os
from tinydb import TinyDB
from html.parser import HTMLParser


db = TinyDB("results.json")

def get_host(url):
    while os.path.dirname(url) not in ["http:", "https:"]:
        url = os.path.dirname(url)

    return url

class Parser(HTMLParser):
    def __init__(self):
        super(Parser, self).__init__(convert_charrefs=True)
        self.url = ""
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False


    def set_url(self, url):
        if url[-1] == "/":
            if os.path.dirname(url[:-1]) not in ["http:", "https:"]:
                url = url[:-1]
        self.url = url

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            for attr in attrs:
                if attr[0] == "name" and attr[1] == "description":
                    self.set_description = True

                if self.set_description:
                    if attr[0] == "content":
                        self.meta_description = attr[1]
                        self.set_description = False

        elif tag == "a":
            for attr in attrs:
                if attr[0] == "href":
                    link = attr[1]

                    if link:
                        if link[0] == "/":
                            link = get_host(self.url) + link
                            self.urls.append(link)
                        elif "http://" in link and link.index("http://") == 0:
                            self.urls.append(link)
                        elif "https://" in link and link.index("https://") == 0:
                            self.urls.append(link)

        elif tag == "p" and len(self.paragraph_content) < 100:
            self.paragraph = True

        elif tag == "title":
            self.set_title = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.paragraph = False

    def handle_data(self, data):
        if self.set_title:
            self.title = data
            self.set_title = False
        elif self.paragraph:
            self.paragraph_content += data

    def clear(self):
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False


def crawl(start_queue):
    parser = Parser()

    queue = start_queue
    seen_urls = []
    while len(queue) > 0:
        if queue[0] not in seen_urls:
            try:
                print (queue[0])

                page = requests.get(queue[0])
                parser.set_url(queue[0])
                parser.feed(page.text)


                db.insert({
                    "title": parser.title,
                    "description": parser.meta_description,
                    "content": parser.paragraph_content,
                    "url": queue[0]
                })

                seen_urls.append(queue[0])
                queue = queue + parser.urls
                parser.clear()
            except Exception:
                # skip pages that fail to download or parse
                pass

        queue = queue[1:]


crawl(["https://en.wikipedia.org/wiki/Music", "https://en.wikipedia.org/wiki/Cricket", "https://en.wikipedia.org/wiki/Football"])

Every time the crawler visits a page, it scans through the HTML for any a tags and adds their href attributes to the queue.

It also records the page's title (found between the title tags), the opening text of the article (it keeps reading p tags until it has collected at least 100 characters) and the page's meta description (however, after crawling, I found that none of the Wikipedia pages have meta descriptions!)

After the page is scanned through, its URL and recorded details are saved to a local JSON file, using TinyDB.

I ran the crawler for a few minutes and managed to scrape around 1000 pages.
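The queue-and-seen-list logic of the crawler can be sketched on its own, without any network access. This is an alternative formulation rather than the crawler above: it swaps the list queue for a deque, the seen list for a set, and resolves relative links with urljoin from the standard library. The fake_pages dictionary and the example.org URLs are made up so the sketch runs offline.

```python
from collections import deque
from urllib.parse import urljoin


def crawl_order(start_urls, get_links, limit=100):
    """Return URLs in the order a breadth-first crawler would visit them.

    get_links(url) stands in for downloading and parsing a page; it returns
    the raw href values found on that page.
    """
    queue = deque(start_urls)
    seen = set()
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()  # O(1), unlike re-slicing a list
        if url in seen:        # set membership is O(1) as well
            continue
        seen.add(url)
        visited.append(url)
        for href in get_links(url):
            # urljoin resolves relative links such as "/wiki/Song"
            queue.append(urljoin(url, href))
    return visited


# A made-up, in-memory "web" so the sketch runs offline.
fake_pages = {
    "https://example.org/wiki/Music": ["/wiki/Song", "/wiki/Cricket"],
    "https://example.org/wiki/Song": ["/wiki/Music"],
    "https://example.org/wiki/Cricket": ["https://example.org/wiki/Song"],
}

order = crawl_order(["https://example.org/wiki/Music"], lambda u: fake_pages.get(u, []))
print(order)
```

Even though the pages link back to each other, every URL is visited only once, which is what stops a crawler from going round in circles.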

Indexing the Database

To return relevant pages to a user's search query, I was planning to use a KNN algorithm to compare the vector encoding of the search query with the vector encodings of the page contents stored in the database.

Vector encodings of natural language are crucial when it comes to understanding what a user is saying. As you will see later, I decided to use a Transformer model to encode sentences into a vector representation.

With what I had so far, it was possible to build a working search engine with the above method in mind, but it would be extremely inefficient, since it would involve going through EVERY record in the database, vectorising their article's content and comparing it to the query vector.

Increasing Search Efficiency

In order to increase the efficiency of the search, I had an idea to index/preprocess the database, in order to reduce the search space for the KNN algorithm.

The first step was to vectorise the content of all the pages I had scraped and stored in my database.

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

contents = []

for record in db:
    content = record["content"]
    contents.append(content)

embeddings = model.encode(contents)

As you can see, for vectorising sentences, I used the SentenceTransformer library and the "stsb-roberta-base-v2" transformer model, which was fine-tuned for semantic textual similarity, making it a reasonable fit for tasks like Neural Search, where a query needs to be matched with relevant documents (the exact task I had at hand).

Now that I had all the vector representations of the pages, I decided to cluster semantically similar pages together, using the K-Means clustering algorithm.

The idea behind this was that, when a search query is entered, the query would be vectorised and be classified into a cluster. Then we could perform KNN with the pages in that cluster only, instead of the whole database, which should improve efficiency.
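Here is a toy sketch of that idea in plain Python, so it can be run without the transformer model or scikit-learn. The 2D "embeddings" and the centroids are made up; in the real pipeline the vectors come from the transformer model and the centroids from the fitted K-Means model.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


# Made-up 2D "embeddings" and cluster centroids, for illustration only.
pages = {
    "Guitar": (0.9, 0.1),
    "Piano": (0.8, 0.2),
    "Cricket": (0.1, 0.9),
    "Football": (0.2, 0.8),
}
centroids = [(0.85, 0.15), (0.15, 0.85)]

# Indexing step: put every page into the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for title, vector in pages.items():
    buckets[nearest(vector, centroids)].append(title)


def search(query_vector, k=1):
    """Rank only the pages in the query's bucket, not the whole database."""
    bucket = buckets[nearest(query_vector, centroids)]
    ranked = sorted(bucket, key=lambda title: math.dist(query_vector, pages[title]))
    return ranked[:k]


print(buckets)
print(search((0.75, 0.3), k=2))
```

The query is only ever compared against the pages in one bucket, which is where the speed-up comes from. It is also where the accuracy risk comes from: a relevant page that landed in a different bucket can never be returned.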

kmeans = KMeans(n_clusters=3)
kmeans.fit(embeddings)
buckets = {}

for record, label in zip(db, kmeans.labels_):
    if label not in buckets:
        buckets[label] = []

    buckets[label].append(dict(record)) 

import pickle

pickle.dump(kmeans, open("kmeans-model.save", "wb"))
pickle.dump(buckets, open("buckets.save","wb"))

Clustering pages

from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')

kmeans = pickle.load(open("kmeans-model.save", 'rb'))  # same filename the model was saved under
buckets = pickle.load(open("buckets.save", 'rb'))

while True:
    query = input("Search: ")
    encoded_query = model.encode(query)
    encoded_query = np.expand_dims(encoded_query, axis=0)
    bucket = kmeans.predict(encoded_query)
    bucket = int(bucket.squeeze())
    embeddings = []

    for result in buckets[bucket]:
        embeddings.append(result["content"])
    embeddings = model.encode(embeddings)
    neigh = NearestNeighbors(n_neighbors=10)
    neigh.fit(embeddings)
    indexes = neigh.kneighbors(encoded_query, return_distance=False)

    for index in indexes.T:
      doc = buckets[bucket][int(index.squeeze())]
      print (doc["title"], doc["url"])
      print ("")

Matching user query to 10 most relevant pages

In the code above, it shows that I broke the database down into 3 clusters.

However, I did a lot of tinkering with the number of clusters.

If we group the database into too many clusters, searches would be extremely efficient, but there would be a huge drop in result accuracy. With too few clusters, the opposite happens: results stay accurate, but searches become slow.

I found that, with 3 clusters, the accuracy of the results was really high; however, it was still extremely slow to return results (~7 seconds for each search, but the worst was 32 seconds).

Search: how do i compose music

Musical instrument - Wikipedia https://en.wikipedia.org/wiki/Musical_instrument

Elements of music - Wikipedia https://en.wikipedia.org/wiki/Elements_of_music

Music criticism - Wikipedia https://en.wikipedia.org/wiki/Music_criticism

Contemporary classical music - Wikipedia https://en.wikipedia.org/wiki/Contemporary_music

Accompaniment - Wikipedia https://en.wikipedia.org/wiki/Accompaniment

Musical improvisation - Wikipedia https://en.wikipedia.org/wiki/Musical_improvisation

Musique concrète - Wikipedia https://en.wikipedia.org/wiki/Musique_concr%C3%A8te

Programming (music) - Wikipedia https://en.wikipedia.org/wiki/Programming_(music)

Film score - Wikipedia https://en.wikipedia.org/wiki/Film_score

Song - Wikipedia https://en.wikipedia.org/wiki/Song

Harpsichord - Wikipedia https://en.wikipedia.org/wiki/Harpsichord

Music theory - Wikipedia https://en.wikipedia.org/wiki/Music_theory

Music industry - Wikipedia https://en.wikipedia.org/wiki/Music_industry

Definition of music - Wikipedia https://en.wikipedia.org/wiki/Definitions_of_music

Wolfgang Amadeus Mozart - Wikipedia https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart

Invention (musical composition) - Wikipedia https://en.wikipedia.org/wiki/Invention_(musical_composition)

Music - Wikipedia https://en.wikipedia.org/wiki/Music

Music - Wikipedia https://en.wikipedia.org/wiki/Music

Aesthetics of music - Wikipedia https://en.wikipedia.org/wiki/Aesthetics_of_music

Musicology - Wikipedia https://en.wikipedia.org/wiki/Musicology

Searches relating to music took the most time, possibly because the database contained a lot of music pages, and so they were all grouped into one big cluster.

With anything above 6 clusters, I found that results were being returned quickly; however, the accuracy was poor. For example, I'd search something simple, such as "Liverpool Football", but the engine would fail to return the Liverpool F.C. page, despite it being present in the database.

A better solution

With the trade-off between speed and accuracy in the above solution being way too sensitive, I had to find a better solution.

After a bit of research, I came across ANNOY.

ANNOY stands for "Approximate Nearest Neighbours Oh Yeah" and is a small library, provided by Spotify, to search for points in space that are close to a given query point.

Spotify themselves use ANNOY for their user music recommendations!

Approximate Nearest Neighbours?

You may be wondering why we would want an approximate nearest neighbours algorithm. Why not an exact one?

For KNN to be exact, it has to iterate through each and every datapoint given to it, which gets very slow as the dataset grows.

Things can be drastically sped up if a little bit of accuracy is sacrificed, but, in practice, this sacrifice in accuracy does not matter at all. A user would not mind if the second closest datapoint and first closest datapoint are swapped around, since they are both probably good matches to their query.
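To make the cost of an exact search concrete, here is a minimal sketch of brute-force nearest neighbour search with NumPy. The function name `exact_knn` and the toy points are made up for illustration and are not part of the search engine itself.

```python
import numpy as np

def exact_knn(points, query, k):
    """Return the indices of the k points closest to query (Euclidean distance)."""
    # One distance per stored point: this full scan is the cost an approximate method avoids.
    distances = np.linalg.norm(points - query, axis=1)
    # Rank every point, nearest first, and keep the first k.
    return np.argsort(distances)[:k].tolist()

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]])
nearest = exact_knn(points, np.array([0.0, 0.0]), k=2)
print(nearest)
```

Every query touches every row of `points`, so the work grows linearly with the size of the database.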

How ANNOY works

Here is a good article explaining how ANNOY works (written by the man who built ANNOY himself!)

ANNOY works by building loads of binary trees (a forest) from the dataset it's given.

To build a tree, it selects two random points in the vector space and divides the space into two subspaces, by the hyperplane equidistant to the two random points.

The process repeats again, in the new subspaces that were just made.

This keeps going until there is a certain n number of points in each subspace.

Points that are near to each other should be in the same subspace, since it is unlikely that there would be a hyperplane to separate them into separate subspaces.

Now that we have all the subspaces, a binary tree can be constructed.

The nodes of the tree represent a hyperplane. So, when given a query vector, we can traverse down the tree, telling us which hyperplanes we should go down to, in order to find some x most relevant points in the vector space.
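The splitting and traversal described above can be sketched roughly as follows. This is a simplified illustration, assuming Euclidean space and a plain dictionary for each node; the names `build_tree`, `find_leaf` and `leaves` are made up, and the real ANNOY implementation is more refined than this.

```python
import numpy as np

def build_tree(points, ids, rng, leaf_size=2):
    """Recursively split ids with hyperplanes equidistant to two random points."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = points[rng.choice(ids, size=2, replace=False)]
    normal = a - b              # direction perpendicular to the hyperplane
    midpoint = (a + b) / 2      # the hyperplane passes through this point
    side = (points[ids] - midpoint) @ normal > 0
    left, right = ids[side], ids[~side]
    if len(left) == 0 or len(right) == 0:
        # Degenerate split (e.g. duplicate points): stop here.
        return {"leaf": ids}
    return {"normal": normal, "midpoint": midpoint,
            "left": build_tree(points, left, rng, leaf_size),
            "right": build_tree(points, right, rng, leaf_size)}

def find_leaf(tree, query):
    """Walk down the tree, picking the side of each hyperplane the query falls on."""
    while "leaf" not in tree:
        go_left = (query - tree["midpoint"]) @ tree["normal"] > 0
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

def leaves(tree):
    """Collect every leaf (subspace) of the tree."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["left"]) + leaves(tree["right"])

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
tree = build_tree(points, np.arange(50), rng)
candidates = find_leaf(tree, points[7])  # ids sharing a subspace with the query
```

A single tree only ever returns the handful of points in one leaf, which is why a forest of differently split trees is needed to get good candidates.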

ANNOY builds many of these trees to build a forest. The number of trees in the forest is specified by the programmer.

When given a query vector, ANNOY uses a priority queue to search this query through the binary trees in its forest. The priority queue allows for the search to focus on trees that are best for the query (aka trees whose hyperplanes are far from the query vector).

After it has finished searching, ANNOY looks at all the common points its trees have found, which would form the query vector's "neighbours".

Now the k nearest neighbours can be ranked and returned.

Refactoring Indexing and Search code

It didn't take long to change the code to use ANNOY, thanks to its straightforward API.

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

descriptions = []

for record in db:
    description = record["content"]
    descriptions.append(description)

embeddings = model.encode(descriptions)

index = AnnoyIndex(embeddings.shape[-1], "euclidean")

vec_idx = 0
for vec in embeddings:
    index.add_item(vec_idx, vec)
    vec_idx += 1

index.build(10)
index.save("index.ann") #stores the results of the indexing

Indexing code

from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

index = AnnoyIndex(768, "euclidean")
index.load("index.ann")

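all_db = db.all()  # load the records once; item ids were assigned in database order when indexing

while True:
    query = input("Search: ")

    vec = model.encode(query)
    indexes = index.get_nns_by_vector(vec, 20)  # ids of the 20 nearest descriptions

    for i in indexes:
        print(all_db[i]["title"], all_db[i]["url"])
        print("")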

User search code

Now let's see the results...

Search: great composers
Igor Stravinsky - Wikipedia https://en.wikipedia.org/wiki/Igor_Stravinsky

Contemporary classical music - Wikipedia https://en.wikipedia.org/wiki/Contemporary_music

Gustav Mahler - Wikipedia https://en.wikipedia.org/wiki/Gustav_Mahler

Symphony - Wikipedia https://en.wikipedia.org/wiki/Symphony

Art music - Wikipedia https://en.wikipedia.org/wiki/Art_music

Symphony No. 5 (Beethoven) - Wikipedia https://en.wikipedia.org/wiki/Symphony_No._5_(Beethoven)

Wolfgang Amadeus Mozart - Wikipedia https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart

Ludwig van Beethoven - Wikipedia https://en.wikipedia.org/wiki/Ludwig_van_Beethoven

Music of Central Asia - Wikipedia https://en.wikipedia.org/wiki/Central_Asian_music

Big band - Wikipedia https://en.wikipedia.org/wiki/Big_band

Program music - Wikipedia https://en.wikipedia.org/wiki/Program_music

Cello Suites (Bach) - Wikipedia https://en.wikipedia.org/wiki/Bach_cello_suites

Film score - Wikipedia https://en.wikipedia.org/wiki/Film_score

Johann Sebastian Bach - Wikipedia https://en.wikipedia.org/wiki/Johann_Sebastian_Bach

Elements of music - Wikipedia https://en.wikipedia.org/wiki/Elements_of_music

Music of China - Wikipedia https://en.wikipedia.org/wiki/Chinese_classical_music

Organ (music) - Wikipedia https://en.wikipedia.org/wiki/Organ_(music)

Toccata and Fugue in D minor, BWV 565 - Wikipedia https://en.wikipedia.org/wiki/Toccata_and_Fugue_in_D_minor,_BWV_565

Georg Philipp Telemann - Wikipedia https://en.wikipedia.org/wiki/Georg_Philipp_Telemann

Sonata form - Wikipedia https://en.wikipedia.org/wiki/Sonata_form


Search: what are the rules of cricket
No-ball - Wikipedia https://en.wikipedia.org/wiki/No-ball

International cricket - Wikipedia https://en.wikipedia.org/wiki/International_cricket

Toss (cricket) - Wikipedia https://en.wikipedia.org/wiki/Toss_(cricket)

Match referee - Wikipedia https://en.wikipedia.org/wiki/Match_referee

Cricket ball - Wikipedia https://en.wikipedia.org/wiki/Cricket_ball

Board of Control for Cricket in India - Wikipedia https://en.wikipedia.org/wiki/Board_of_Control_for_Cricket_in_India

India national cricket team - Wikipedia https://en.wikipedia.org/wiki/India_national_cricket_team

Caught - Wikipedia https://en.wikipedia.org/wiki/Caught

Substitute (cricket) - Wikipedia https://en.wikipedia.org/wiki/Substitute_(cricket)

International Cricket Council - Wikipedia https://en.wikipedia.org/wiki/International_Cricket_Council

Portal:Cricket - Wikipedia https://en.wikipedia.org/wiki/Portal:Cricket

Delivery (cricket) - Wikipedia https://en.wikipedia.org/wiki/Delivery_(cricket)

Cricketer (disambiguation) - Wikipedia https://en.wikipedia.org/wiki/Cricketer_(disambiguation)

Cricket (disambiguation) - Wikipedia https://en.wikipedia.org/wiki/Cricket_(disambiguation)

Cricket West Indies - Wikipedia https://en.wikipedia.org/wiki/Cricket_West_Indies

Zimbabwe Cricket - Wikipedia https://en.wikipedia.org/wiki/Zimbabwe_Cricket

Bowled - Wikipedia https://en.wikipedia.org/wiki/Bowled

Pakistan Cricket Board - Wikipedia https://en.wikipedia.org/wiki/Pakistan_Cricket_Board

World Cricket League - Wikipedia https://en.wikipedia.org/wiki/World_Cricket_League

West Indies cricket team - Wikipedia https://en.wikipedia.org/wiki/West_Indies_cricket_team

As you can see, the results returned are pretty accurate! It obviously doesn't help that the database I had was pretty small, but this did yield some good results despite it!

On top of that, these results were produced almost instantly, much much quicker than the previous solution.

Conclusion

The only thing I could see that stopped the search results from being better was the size of the database. If I did set out to build a much bigger search engine in the future, I'd look to use databases such as Firebase or MongoDB, and look into how ANNOY could interface with them.

Having said that, I built this project to investigate how deep learning models could be used in document searching tasks and how those searches can be performed efficiently, and I think I've taken a lot away from it.

Thank you for reading and I hope you've learnt something from this too!

How do Machines understand language? A look into the architectures behind Natural Language Understanding

ashwins-code — Sun, 08 May 2022 08:11:31 +0000

Machines are rapidly getting better and better at understanding our languages.

Personal Assistants like Google Assistant, Siri and Alexa can effectively understand user prompts and carry out any instructions that they might have been told.

Google Translate, a tool that allows you to translate between several different languages, is powered through deep learning techniques.

And the most impressive of all is OpenAI's GPT-3, which has produced unbelievable results. Given any text from the user, it provides what is known as a "completion" to this text. GPT-3 can "complete" texts to write its own little movie scripts and song lyrics, translate between languages, write code and so much more. Google has also recently announced its own similar model, PaLM, which is said to have about 3x as many parameters as GPT-3. This is extremely exciting, since we have the prospect of a machine that can produce even more human-like text.

But how do all of these models understand what we say? There are quite a few different approaches to understanding language, which we will go through today.

Encoders and Decoders

A common theme with all the models that seem to understand what we say well is that they consist of two parts: an encoder and a decoder.

The job of the encoder, as the name suggests, is to embed/encode the user's input into a vector, that captures the meaning of the user.

The job of the decoder is to take the vector produced by the encoder and to decode it into a meaningful output. If we think about translating between English and German, this would mean that the decoder would produce a sentence in German, given the vector embedding of the input English sentence.

Sequence To Sequence (Seq2Seq) RNN Architecture

"Sequence to Sequence" simply refers to the fact that this architecture is designed for taking in one sequence and outputting another sequence, which is what language translation and question answering are.

In this architecture, both the encoder and decoder are LSTM models, but GRUs can also be used.

If you are unsure as to how RNNs work (which LSTMs and GRUs are), I recommend researching a bit about them first, in order to understand this architecture.

Encoder

Firstly, the encoder reads over the input sequence a time step at a time (as RNNs always do). The outputs of these LSTMs/GRUs are discarded and only the final internal state(s) is preserved (LSTMs have 2 internal states while GRUs only have 1).

The name given to the final internal state is the context vector, which captures the meaning of the input sentence.

Decoder

The decoder, also being a LSTM/GRU model, takes the context vector as its initial hidden state.

It is then initially fed a "start" token and outputs a token (the start token essentially acts as a trigger for the sequence outputting). This outputted token is then fed back in as an input into the decoder, which produces another token. This process repeats until the decoder produces a stop token, which signals to the decoder that it no longer needs to produce any more tokens. All the tokens produced through this process form the output sequence.

Note: A token may be a word or a character in the output sequence, but most models usually produce the output sequence at word level.
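The decoding loop described above can be sketched like this. The `toy_decoder_step` function is a made-up stand-in for a trained LSTM/GRU step (it just replays a canned sentence), so treat this as an illustration of the feedback loop only, not a working translator.

```python
START, STOP = "<start>", "<stop>"

def toy_decoder_step(token, state):
    """Hypothetical stand-in for one decoder step: returns (next_token, new_state)."""
    canned = ["ich", "bin", "hier", STOP]
    return canned[state], state + 1

def greedy_decode(step, context, max_len=20):
    """Feed each produced token back in until a stop token appears."""
    token, state, output = START, context, []
    for _ in range(max_len):          # safety cap in case no stop token is produced
        token, state = step(token, state)
        if token == STOP:
            break
        output.append(token)
    return output

result = greedy_decode(toy_decoder_step, context=0)
print(result)
```

In a real model, `context` would be the encoder's context vector and `step` would pick the most likely next token from the decoder's output distribution.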

With the encoder and decoder together, here is what the whole thing looks like.

While this architecture can produce solid results, its reliance on recurrent units means it takes a long time to train. This is because data needs to be passed sequentially to recurrent units, time step by time step. In other words, we cannot parallelise the training of recurrent nets and, with modern GPUs being designed for parallel computation, this leaves a lot of potential untapped. This is where Transformers come in.

Attention Mechanism

Before we get into how Transformers work, we should look at what Attention is and how it works. It is core to how Transformers work, and it has also been used with Seq2Seq RNN models to improve their results.

One problem with encoder-decoder models in general was that it was difficult to decode long sequences. This was due to the fact that the context vector, due to the nature of recurrent units, ultimately captured information just from the ending of the input sequence, instead of the whole thing.

Attention was introduced to solve this limitation of Encoder-Decoder models.

Attention not only allows for the whole sequence's information to be captured but it also allows the model to see which part of the sequence has more importance than others.

For example, if we take the sentence "He is from Germany so he speaks German".

If we wanted to predict the word "German" in this sentence, we'd obviously have to pay more attention to "Germany", which came earlier on in the sentence.

In the Seq2Seq model that was talked about before, Bahdanau Attention is commonly used. It performs a weighted sum of the encoder's internal states from every time step, in order to capture the information of the whole sequence. The weights used in this weighted sum are produced by a small scoring network whose parameters are learned throughout the training process.
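Here is a rough NumPy sketch of that weighted sum, using additive (Bahdanau-style) scoring. The matrices `W_enc`, `W_dec` and the vector `v` would normally be learned; here they are random, and all the names and sizes are assumptions for illustration.

```python
import numpy as np

def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """Return (context vector, attention weights) for one decoder step."""
    # score_i = v . tanh(W_enc h_i + W_dec s), one score per encoder time step
    scores = np.tanh(encoder_states @ W_enc.T + decoder_state @ W_dec.T) @ v
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context is the weighted sum of all encoder states.
    context = weights @ encoder_states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))   # 6 time steps, hidden size 4
decoder_state = rng.normal(size=4)
W_enc, W_dec = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
v = rng.normal(size=3)
context, weights = additive_attention(encoder_states, decoder_state, W_enc, W_dec, v)
```

Because every encoder state contributes to `context`, information from the start of a long sentence is no longer squeezed through the final state alone.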

The Attention mechanism used in Transformers, however, is slightly different...

Self-Attention

Self Attention is a form of attention that aims to find how much weight each token in a sequence has with other tokens in the same sequence.

For example, let's take the sentence "The boy did not want to play because he was tired"...

If we look at the word "he", it is obvious to us that it is referring to "The boy". However, for a neural network, this relationship is not as straightforward. Self-Attention, however, enables a neural network to discover such relationships within a sequence.

How does this work then?

The aim of self-attention is to give each token in the sequence a list of scores, with each score corresponding to how much they relate to each token in the sequence it's in.

Self-Attention takes in 3 matrices: Query, Key and Value.

Through these matrices, each token is essentially given a query vector, a key vector and a value vector.

When scoring a token, its query vector is taken and scored against the key vectors of all the tokens in the sequence, including itself (after the softmax, the scores range from 0 to 1). The value vectors of those tokens (which represent the tokens themselves) are then multiplied by their respective scores. The idea behind this is to keep the values of the tokens that have relevance and to wipe out the tokens that don't have much relevance.

Here's what the maths looks like for this...

$softmax(\frac{Q \otimes K^T}{\sqrt{d}})V = Z$

Q is the Query Matrix
K is the Key Matrix
d is the number of dimensions (length of each row in the Q,K,V matrices)
V is the Value Matrix
Z is the output matrix of the tokens' scores
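The formula can be written out in a few lines of NumPy. This is a minimal single-head sketch; the projection matrices `W_q`, `W_k` and `W_v` are random here rather than learned, and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    # Row i holds token i's scores against every token in the sequence.
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
Z, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` sums to 1, so every row of `Z` is a weighted average of the value vectors.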

Transformers

Now that we have got Attention down, let's look into Transformers.

Transformers are able to do the same job as Seq2Seq models, but much more efficiently. They also aren't just limited to sequence to sequence modelling, but can be used for several classification tasks too.

As the diagram shows, transformers also have an encoder decoder architecture and use a combination of attention mechanisms and feed forward neural networks to produce internal representations of their input sequences.

Both the encoder and decoder are made of modules that can stack up on themselves as many times as they need to (shown by the "Nx" beside each module).

Positional Encoding

You may also notice that there are no recurrent units in transformers, which is what makes them special. This lack of recurrent units means they can train much quicker than Seq2Seq RNN models, since inputs can be processed in parallel and there is no backpropagation through time.

However, does this lack of recurrent units mean we can't capture any positional/contextual information? No!

Transformers are still able to capture positional information without the use of a recurrent unit. As the diagram shows, before the input sequences enter the encoder/decoder, they are first embedded into a vector (since neural networks work with vectors and not the words themselves) and then positionally encoded.

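The formulae for positional encoding are as follows (these are the standard sinusoidal encodings from the original Transformer paper, where $pos$ is the token's position in the sequence, $i$ indexes pairs of dimensions and $d_{model}$ is the embedding size)...

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

the first formula is applied to all even indices of the embedding vector
the second formula is applied to all odd indices of the embedding vector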
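As a rough sketch, sinusoidal positional encoding (sine on even indices, cosine on odd indices, with the base of 10000 used in the original Transformer paper) can be computed like this. The function name is my own and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices
    pe[:, 1::2] = np.cos(angles)   # odd indices
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
```

The result is simply added to the token embeddings, so each position gets its own distinctive pattern without any recurrence.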

You may also be wondering how the encoder passes its sequence representation to the decoder without the use of an RNN.

Well, if you look closely at the diagram, the output of the encoder is passed into the second attention block of the decoder module as the Key and Value vectors, while the Query vectors come from the decoder's own previous attention block.

Text Generating with Transformers

Transformers generate texts just like how the Seq2Seq RNN model does.

The decoder is fed a start token, which then produces an output. This output is fed back as an input into the decoder and this process repeats until a stop token is produced.

Classifying with Transformers

Transformers aren't just for generating text.

Since transformers end up building their own internal "understanding" of language, we can use the encoder to extract their language representation and use it to classify text!

For example, BERT is a transformer model that consists of the encoder ONLY, but can be effectively used for things like sentiment classification, question answering and named entity recognition.

Thank you!

I hope you've enjoyed learning a bit about how machines understand language. This is by no means the most detailed explanation of how these models work, but I hope it provides a solid overview of what goes on under the hood.

If you are interested in using transformers in code, visit

How DeepFakes Are Made: Generative Adversarial Networks

ashwins-code — Fri, 08 Apr 2022 10:52:33 +0000

What are Generative Adversarial Networks? (GANs)

Let's say we wanted to build a model that would draw us images of different types of cars that we haven't seen before.

It is easy for us to build a model that takes an image and predicts whether it is a car or not (a simple binary classification problem). Even though this model would have an "understanding" of what a car is, we cannot use it to produce an image of a car.

Another way that may come to mind is to use an RNN. RNNs are great at generating text, so they must be good at generating images, right? Well, yes. RNNs have been used for Image Generation. For example, DRAW utilises an encoder-decoder RNN architecture, for producing images.

Having said that, GANs are most commonly used for these types of generative problems.

GANs are an approach to training generative models, usually using deep learning methods. The architecture consists of two sub-models known as the discriminator and the generator.

How do GANs work?

As said before, GANs consist of two sub-models: the discriminator and the generator.

Generator

The job of the generator, as the name suggests, is to generate the images that we see.

It takes in a vector sampled from a random distribution and uses it to generate an image. The random vector essentially acts as a seed for the produced image.

Discriminator

The job of the discriminator is to tell whether an image is fake/generated or not with our task in mind.

In the case of our earlier example, the discriminator's job would be to take in an image and to tell whether it is a real image of a car or whether it is a fake/generated image of a car.

So what is the point of having these two sub-models?

These two sub-models help each other improve.

When a GAN trains, the generator is fed a random vector, so that it produces an image.

This image is then fed into the discriminator. Since this image is a generated image, the discriminator should classify it as fake. But the aim of the generator is to TRICK the discriminator into classifying the produced image as a real image.

If the discriminator is able to spot this fake, this means the generator needs to improve itself. So the parameters of the generator are adjusted so that it does so.

If the discriminator is unable to spot this fake, then that means the discriminator needs to improve, so its parameters are adjusted accordingly.

The discriminator is then fed real images (provided by some example dataset). In this case, the discriminator should classify it as real. If not, its parameters are adjusted to improve.

This repeats again and again throughout the training process.
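In practice, "needs to improve" is usually measured with a loss function. Here is a minimal sketch of the commonly used binary cross-entropy losses, assuming the discriminator outputs a probability where 1 means real and 0 means fake. The function names are made up, and no actual networks or parameter updates are shown.

```python
import numpy as np

def bce(predictions, targets, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(predictions, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))

def discriminator_loss(d_real, d_fake):
    """Low when real images score near 1 and generated images score near 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """Low when the discriminator is tricked into scoring generated images near 1."""
    return bce(d_fake, np.ones_like(d_fake))

d_real = np.array([0.9, 0.8])   # discriminator outputs on real images
d_fake = np.array([0.1, 0.3])   # discriminator outputs on generated images
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

With these example numbers the discriminator is winning: its loss is small while the generator's loss is large, so the generator's parameters would get the bigger push.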

As the training process goes on, the generator and discriminator keep getting better. The generator would start to produce images that are similar to the real images in the example dataset and the discriminator would be able to distinguish between real and fake, even if the fake images are getting better and better.

However, as the training process goes on and on, the generator, in theory, should get so good that it produces images indistinguishable from real images. The discriminator would be outputting 0.5 each time at this point (if 0 means fake and 1 means real), since it should not be able to tell what's real or not, so just does a 50/50 guess.

At this point, we can throw away the discriminator and feed the generator random vectors to produce images!

Latent Space

At the end of the training process, the generator would have produced a mapping from the random vector space it was trained from to images of the problem domain (one point from the random vector space would correspond to a point in the problem domain). In our example, the problem domain would be images of cars.

This mapping is known as a latent space. This latent space effectively gives meaning to the random distribution the generator was trained on.

This is important, since any transformation performed on the latent space would result in a meaningful change in the resulting image!

For example, say we fed the generator an initial random vector and it produced an image of a red sports car.

We could apply a transformation to this initial random vector (e.g. adding 1 to the whole vector) and that could lead to a meaningful change in the resulting image (e.g. it would now be a blue sports car instead of a red one).

Similarly, other transformations would correspond to different things. There may be a transformation for changing the car's colour, a transformation for the size of the car or maybe even a transformation for if the windows are tinted or not!

Conditional GANs

GANs are great at producing a real looking image, but they do not give us any control over what type of image is produced.

If we use the example of drawing digits, we can only ask a GAN "draw me a digit", but we can't really say "draw me the number 4".

This is where Conditional GANs come in.

The generator is still fed a random vector, but the vector is also conditioned with another input.

In our example, this other input would be whether we want the number 1, 2, 3, 4 etc.

The input to the discriminator is also conditioned in the same way. This means that the discriminator can tell whether an image is fake or not given, in our case, what the digit is meant to be. It could take a perfectly generated image of a 4, but is told that it's meant to be a 1 through the additional input, and therefore classify it as fake.
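One common way to do this conditioning (an assumption here, other schemes such as learned label embeddings exist) is to append a one-hot label to the random vector. A minimal sketch, with a made-up helper name and a noise size of 100 chosen arbitrarily:

```python
import numpy as np

def conditioned_input(noise, digit, num_classes=10):
    """Append a one-hot encoding of the requested digit to the noise vector."""
    one_hot = np.zeros(num_classes)
    one_hot[digit] = 1.0
    return np.concatenate([noise, one_hot])

rng = np.random.default_rng(0)
z = rng.normal(size=100)                      # the usual random seed vector
generator_in = conditioned_input(z, digit=4)  # "draw me the number 4"
```

The discriminator would receive the same label alongside the image, so it can reject a convincing 4 that was supposed to be a 1.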

This extension of GANs allows for really impressive applications such as text-to-image generation, style transfer, translating photos from day to night and so much more!

The rise of GANs is extremely exciting, but obviously comes with extreme danger, as you can imagine. DeepFakes (commonly generated by GANs) are getting increasingly realistic and have reached a point where they can easily fool large numbers of people. There is a huge worry about how governments can protect the public from maliciously used DeepFakes and how they should be controlled.

Regardless of that, GANs are really really cool 👍

Deep Learning Library From Scratch 5: Automatic Differentiation Continued

ashwins-code — Sat, 26 Mar 2022 09:18:07 +0000

Hi Guys! Welcome to part 5 of this series of building a deep learning library from scratch. This post will cover the code of the automatic differentiation part of the library. Automatic Differentiation was discussed in the previous post, so do check it out if you don't know what Autodiff is.

The github repo for this series is....

ashwins-code / Zen-Deep-Learning-Library

Deep Learning library written in Python. Contains code for my blog series on building a deep learning library.

Zen - Deep Learning Library

A deep learning library written in Python.

Contains the code for my blog series where we build this library from scratch

mnist.py contains an example for a MNIST digit recogniser using the library
rnn.py contains an example of a Recurrent Neural Network which learns to fit to the graph sin(x) * cos(x)

Approach

Automatic Differentiation relies on a computation graph to calculate derivatives.

Ultimately, it just boils down to nodes with some connections and we traverse these nodes in some way to calculate derivatives.

For our library, we will build our computation graph on the fly, meaning any calculations performed would be recorded onto the computation graph.

Once we have the graph, we need to find a way to use it to calculate the derivatives of all the variables in that graph.

For example, say we produced the following graph...

This represents...

$c = a + b$

$e = c * d$

Now, using the graph, our aim is to find the derivative of e with respect to all the variables in that graph (a, b, c, d, e).

For my implementation, I found it easiest to traverse the graph in a depth-first manner to calculate derivatives.

So firstly, we start at e and find $\frac{de}{de}$ (which is just 1).

Then, we look at node c, meaning we now need to calculate $\frac{de}{dc}$. We can see that $e$ is the result of a multiplication between $c$ and $d$, meaning $\frac{de}{dc} = d$ (since we treat everything apart from the variable we are on as a constant).

Remembering we are traversing depth-first, the next node we move onto is node a, meaning we calculate $\frac{de}{da}$. This is a bit more tricky, since a does not have a direct connection to e. However, using the chain rule, we know that $\frac{de}{da} = \frac{de}{dc}\frac{dc}{da}$. We just calculated $\frac{de}{dc}$, so all we need to calculate now is $\frac{dc}{da}$. We can see that c is an addition of a and b, so $\frac{dc}{da} = 1$ and therefore $\frac{de}{da} = \frac{de}{dc}\frac{dc}{da} = d$.

Hopefully, you can now see how we would use a graph to find the derivatives of all the variables in that graph.
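To check the reasoning with actual numbers, here is the same walk through the graph $c = a + b$, $e = c * d$ done by hand in plain Python. The values 2, 3 and 4 are arbitrary, and the finite-difference line at the end is only a sanity check.

```python
a, b, d = 2.0, 3.0, 4.0
c = a + b          # 5.0
e = c * d          # 20.0

de_de = 1.0                 # starting point
de_dc = de_de * d           # e = c * d, so de/dc = d
de_dd = de_de * c           # ... and de/dd = c
de_da = de_dc * 1.0         # chain rule: de/dc * dc/da, with dc/da = 1
de_db = de_dc * 1.0         # same for b

# Nudge a slightly and see how much e moves.
h = 1e-6
numeric_de_da = (((a + h) + b) * d - e) / h
print(de_da, numeric_de_da)
```

Both approaches agree that changing a by a small amount changes e by roughly d times as much.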

Tensor Class

Firstly, we need to create our tensor class, which would act as the variable nodes on our graph.

import numpy as np
import string
import random

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


np.seterr(invalid='ignore')

def is_matrix(o):
    return type(o) == np.ndarray

def same_shape(s1, s2):
    for a, b in zip(s1, s2):
        if a != b:
            return False

    return True

class Tensor:
    __array_priority__ = 1000
    def __init__(self, value, trainable=True):
        self.value = value
        self.dependencies = []
        self.grads = []
        self.grad_value = None
        self.shape = 0
        self.matmul_product = False
        self.gradient = 0
        self.trainable = trainable
        self.id = id_generator()

        if is_matrix(value):
            self.shape = value.shape

What's going on here?

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

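Function that generates an id from random characters. With 10 characters drawn from 36 options, collisions are extremely unlikely, so the id is unique in practice.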

def is_matrix(o):
    return type(o) == np.ndarray

Simple function which checks if a value is a numpy array or not.

self.value = value

This line should be self-explanatory, it just holds the value that is given to the tensor.

self.dependencies = []

If the tensor was the result of any operation e.g add or divide, this property would hold the list of tensors that were involved in the operation, producing this tensor (this is how the computation graph is built). If the tensor is not a result of any operation, then this is empty.

self.grads = []

This property holds the derivatives of the tensor with respect to each of its dependencies, in the same order as self.dependencies.

self.shape = 0
...
if is_matrix(value):
            self.shape = value.shape

self.shape holds the shape of the tensor's value. Only numpy arrays can have a shape, which is why it is 0 by default.

self.matmul_product = False

Specifies whether the tensor was the result of a matrix multiplication or not (this will help later since the chain rule works differently for matrix multiplication).

self.gradient = np.ones_like(self.value)

After we use the computation graph to calculate gradients, this property would hold the gradient calculated for the tensor. It is initially set to a matrix of 1s, the same shape as its value.

self.trainable = trainable

Some nodes on the graph do not need their derivatives to be calculated, so this property specifies whether this is the case or not for this tensor.

self.id = id_generator()

Tensors will need to have something to uniquely identify themselves. We will see this coming in use when we reimplement our optimisers to use this automatic differentiation module in a later post.
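To see how `dependencies` and `grads` work together, here is a stripped-down sketch that uses scalars only. The `Node` class and the `add`, `mul` and `backward` helpers are simplified stand-ins made up for this illustration, not the library's actual code.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.dependencies = []   # nodes this one was computed from
        self.grads = []          # d(self)/d(dependency), in the same order
        self.gradient = 0.0      # filled in by backward()

def add(x, y):
    out = Node(x.value + y.value)
    out.dependencies = [x, y]
    out.grads = [1.0, 1.0]
    return out

def mul(x, y):
    out = Node(x.value * y.value)
    out.dependencies = [x, y]
    out.grads = [y.value, x.value]
    return out

def backward(node, upstream=1.0):
    """Depth-first walk that applies the chain rule along every path."""
    node.gradient += upstream
    for dependency, local_grad in zip(node.dependencies, node.grads):
        backward(dependency, upstream * local_grad)

a, b, d = Node(2.0), Node(3.0), Node(4.0)
e = mul(add(a, b), d)     # records the graph as it computes
backward(e)
print(a.gradient, b.gradient, d.gradient)
```

Each operation records its inputs and local derivatives as it runs, so by the time `e` exists the whole graph needed for differentiation has already been built.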

Operations with Tensors

class Tensor:
    __array_priority__ = 1000
    def __init__(self, value, trainable=True):
        self.value = value
        self.dependencies = []
        self.grads = []
        self.grad_value = None
        self.shape = 0
        self.matmul_product = False
        self.gradient = 0
        self.trainable = trainable
        self.id = id_generator()

        if is_matrix(value):
            self.shape = value.shape

    def depends_on(self, target):
        if self == target:
            return True

        dependencies = self.dependencies

        for dependency in dependencies:
            if dependency == target:
                return True
            elif dependency.depends_on(target):
                return True

        return False

    def __mul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value * other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value)
        var.grads.append(self.value)
        return var

    def __rmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value * other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value)
        var.grads.append(self.value)
        return var

    def __add__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value + other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(np.ones_like(other.value))
        return var

    def __radd__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value + other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(np.ones_like(other.value))
        return var

    def __sub__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value - other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(-np.ones_like(other.value))
        return var

    def __rsub__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value - self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)
        var.grads.append(np.ones_like(other.value))
        var.grads.append(-np.ones_like(self.value))
        return var

    def __pow__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value ** other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)

        grad_wrt_self = other.value * self.value ** (other.value - 1)
        var.grads.append(grad_wrt_self)

        grad_wrt_other = (self.value ** other.value) * np.log(self.value)
        var.grads.append(grad_wrt_other)

        return var

    def __rpow__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value ** self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)

        grad_wrt_other = self.value * other.value ** (self.value - 1)
        var.grads.append(grad_wrt_other)

        grad_wrt_self = (other.value ** self.value) * np.log(other.value)
        var.grads.append(grad_wrt_self)

        return var


    def __truediv__(self, other):
        return self * (other ** -1)

    def __rtruediv__(self, other):
        return other * (self ** -1)

    def __matmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value @ other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value.T)
        var.grads.append(self.value.T)

        var.matmul_product = True
        return var

    def __rmatmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value @ self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)
        var.grads.append(self.value.T)
        var.grads.append(other.value.T)

        var.matmul_product = True

        return var

To understand what is happening here, let's look at one of the methods

def __mul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value * other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value)
        var.grads.append(self.value)
        return var

Firstly, __mul__ overloads the * operator. This just means that whenever a tensor is multiplied by something else, with the tensor on the left-hand side, this method is called. You can also see __rmul__, which does the same thing but is called when the Tensor object is on the right-hand side of the operation.

For example...

t = Tensor(10)
t * 5 #__mul__ is called
5 * t #__rmul__ is called

if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

other represents the thing that this tensor is being multiplied with. If other is not a tensor, we want to convert it to a tensor, holding the value of other. Since other was not already a tensor, it means that it is a constant, so we do not need to calculate its derivative, since we only want to calculate the derivative of something if we need to change its value, usually when training a model. This is why trainable=False.

var = Tensor(self.value * other.value)
var.dependencies.append(self)
var.dependencies.append(other)

var holds the resulting tensor of this operation. The second and third lines add the two tensors used in this operator to var's dependencies (look back up if you forgot what the dependencies property is used for)

var.grads.append(other.value) # dvar/dself
var.grads.append(self.value) # dvar/dother
return var

This adds the derivatives of var with respect to each of the two operands to var's grads. These lines differ in the other methods, since the derivatives depend on the operation being applied (in this case, multiplication). Note that the order in which they are added matches the order of the tensors in the dependencies property. We store them as raw values rather than tensors, since calculating derivatives with raw values is much quicker than with our Tensor class.
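
To make this bookkeeping concrete, here is a minimal sketch of just the multiplication case. MiniTensor is a hypothetical, stripped-down stand-in for the Tensor class (scalars only, no trainable flag), written purely for illustration and not taken from the library.

```python
class MiniTensor:
    """Hypothetical scalar-only stand-in for the Tensor class (illustration only)."""

    def __init__(self, value):
        self.value = value
        self.dependencies = []  # operands that produced this tensor
        self.grads = []         # local derivatives, in the same order as dependencies

    def __mul__(self, other):
        if not isinstance(other, MiniTensor):
            other = MiniTensor(other)  # wrap plain numbers
        var = MiniTensor(self.value * other.value)
        var.dependencies += [self, other]
        var.grads += [other.value, self.value]  # dvar/dself, dvar/dother
        return var

    # Scalar multiplication is commutative, so the reflected version reuses __mul__.
    __rmul__ = __mul__


t = MiniTensor(10)
left = t * 5    # __mul__ is called
right = 5 * t   # int cannot handle MiniTensor, so Python falls back to __rmul__

print(left.value, left.grads)    # 50 [5, 10]
print(right.value, right.grads)  # 50 [5, 10]
```

In both cases the first entry of grads lines up with the first entry of dependencies, which is what the gradient calculation relies on later.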

Calculating gradients using the graph!

def get_gradients(self, grad = None):
        grad = np.ones_like(self.value) if grad is None else grad
        grad = np.float32(grad)

        for dependency, _grad in zip(self.dependencies, self.grads):
            if dependency.trainable:
                local_grad = np.float32(_grad)

                if self.matmul_product:                
                    if dependency == self.dependencies[0]:
                        local_grad = grad @ local_grad
                    else:
                        local_grad = local_grad @ grad
                else:
                    if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):
                        ndims_added = grad.ndim - local_grad.ndim
                        for _ in range(ndims_added):
                            grad = grad.sum(axis=0)

                        for i, dim in enumerate(dependency.shape):
                            if dim == 1:
                                grad = grad.sum(axis=i, keepdims=True)

                    local_grad = local_grad * np.nan_to_num(grad)


                dependency.gradient += local_grad
                dependency.get_gradients(local_grad)

This method recursively implements the depth-first derivative calculation that was outlined earlier. It can be called on any tensor in the graph, not just the top one, and calculates the derivatives of all the trainable tensors that were used to produce this tensor.

grad holds the incoming gradients/the previously calculated gradient in the previous "level" of the graph.

for dependency, _grad in zip(self.dependencies, self.grads):
        if dependency.trainable:
            local_grad = np.float32(_grad)

            if self.matmul_product:                
                if dependency == self.dependencies[0]:
                    local_grad = grad @ local_grad
                else:
                    local_grad = local_grad @ grad
            else:
                if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):
                    ndims_added = grad.ndim - local_grad.ndim
                    for _ in range(ndims_added):
                        grad = grad.sum(axis=0)

                    for i, dim in enumerate(dependency.shape):
                        if dim == 1:
                            grad = grad.sum(axis=i, keepdims=True)

                local_grad = local_grad * np.nan_to_num(grad)

This part of the code goes through each of the tensor's dependencies along with the derivative of the tensor with respect to that dependency (held in _grad). If the dependency is trainable, it then checks whether the current tensor is the result of a matrix multiplication.

The chain rule is then accordingly applied, using the previously calculated gradient grad and the gradient of the current dependency _grad. local_grad stores the result after the chain rule is applied.

def same_shape(s1, s2):
    for a, b in zip(s1, s2):
        if a != b:
            return False

    return True

if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):
                ndims_added = grad.ndim - local_grad.ndim
                for _ in range(ndims_added):
                    grad = grad.sum(axis=0)

                for i, dim in enumerate(dependency.shape):
                    if dim == 1:
                        grad = grad.sum(axis=i, keepdims=True)

If we focus on this part of the code, it handles any case where local_grad and grad aren't the same shape (they need to be for the chain rule to be applied). This shape mismatch arises if broadcasting was performed in any of the calculations. Broadcasting is the term used to describe how numpy performs operations involving arrays of different shapes. You can read more about it in the numpy docs. All this part of the code does is sum across the broadcasted axes of grad, in order to reduce its shape to match the shape of local_grad.
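
As a small illustration of why the summing is needed, the sketch below adds a (1, 4) array to a (3, 4) array with plain numpy, outside of the Tensor class. The variable names are made up for this example.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)  # shape (3, 4)
b = np.ones((1, 4), dtype=np.float32)              # shape (1, 4), broadcast over the rows
z = x + b                                          # shape (3, 4)

grad = np.ones_like(z)  # incoming gradient, same shape as z

# b was stretched along axis 0, so each entry of b contributed to 3 entries of z.
# Its gradient is therefore the sum of the incoming gradient over that axis.
grad_b = grad.sum(axis=0, keepdims=True)

print(grad_b.shape)  # (1, 4)
print(grad_b)        # [[3. 3. 3. 3.]]
```

After the sum, the gradient has the same shape as b again, so it can be accumulated into b's gradient.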

dependency.gradient += local_grad
dependency.get_gradients(local_grad)

The gradient is then recorded to the gradient property of the dependency. It then continues the depth-first traversal by calling the get_gradients method on the dependency, passing through the gradient that was just calculated.

Overall, our code for autodiff should look like this...

import numpy as np
import string
import random

def id_generator(size=10, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


np.seterr(invalid='ignore')

def is_matrix(o):
    return type(o) == np.ndarray

def same_shape(s1, s2):
    for a, b in zip(s1, s2):
        if a != b:
            return False

    return True

class Tensor:
    __array_priority__ = 1000
    def __init__(self, value, trainable=True):
        self.value = value
        self.dependencies = []
        self.grads = []
        self.grad_value = None
        self.shape = 0
        self.matmul_product = False
        self.gradient = 0
        self.trainable = trainable
        self.id = id_generator()

        if is_matrix(value):
            self.shape = value.shape

    def depends_on(self, target):
        if self == target:
            return True

        dependencies = self.dependencies

        for dependency in dependencies:
            if dependency == target:
                return True
            elif dependency.depends_on(target):
                return True

        return False

    def __mul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value * other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value)
        var.grads.append(self.value)
        return var

    def __rmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value * other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value)
        var.grads.append(self.value)
        return var

    def __add__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value + other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(np.ones_like(other.value))
        return var

    def __radd__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value + other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(np.ones_like(other.value))
        return var

    def __sub__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value - other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(np.ones_like(self.value))
        var.grads.append(-np.ones_like(other.value))
        return var

    def __rsub__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value - self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)
        var.grads.append(np.ones_like(other.value))
        var.grads.append(-np.ones_like(self.value))
        return var

    def __pow__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value ** other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)

        grad_wrt_self = other.value * self.value ** (other.value - 1)
        var.grads.append(grad_wrt_self)

        grad_wrt_other = (self.value ** other.value) * np.log(self.value)
        var.grads.append(grad_wrt_other)

        return var

    def __rpow__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value ** self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)

        grad_wrt_other = self.value * other.value ** (self.value - 1)
        var.grads.append(grad_wrt_other)

        grad_wrt_self = (other.value ** self.value) * np.log(other.value)
        var.grads.append(grad_wrt_self)

        return var


    def __truediv__(self, other):
        return self * (other ** -1)

    def __rtruediv__(self, other):
        return other * (self ** -1)

    def __matmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(self.value @ other.value)
        var.dependencies.append(self)
        var.dependencies.append(other)
        var.grads.append(other.value.T)
        var.grads.append(self.value.T)

        var.matmul_product = True
        return var

    def __rmatmul__(self, other):
        if not (isinstance(other, Tensor)):
            other = Tensor(other, trainable=False)

        var = Tensor(other.value @ self.value)
        var.dependencies.append(other)
        var.dependencies.append(self)
        var.grads.append(self.value.T)
        var.grads.append(other.value.T)

        var.matmul_product = True

        return var

    def grad(self, target, grad = None):
        grad = self.value / self.value if grad is None else grad
        grad = np.float32(grad)

        if not self.depends_on(target):
            return 0

        if self == target:
            return grad

        final_grad = 0

        for dependency, _grad in zip(self.dependencies, self.grads):
            local_grad = np.float32(_grad) if dependency.depends_on(target) else 0

            if dependency.depends_on(target):
                if self.matmul_product:                
                    if dependency == self.dependencies[0]:
                        local_grad = grad @ local_grad
                    else:
                        local_grad = local_grad @ grad
                else:
                    if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):
                        ndims_added = grad.ndim - local_grad.ndim
                        for _ in range(ndims_added):
                            grad = grad.sum(axis=0)

                        for i, dim in enumerate(local_grad.shape):
                            if dim == 1:
                                grad = grad.sum(axis=i, keepdims=True)

                    local_grad *= grad

            final_grad += dependency.grad(target, local_grad)

        return final_grad


    def get_gradients(self, grad = None):
        grad = np.ones_like(self.value) if grad is None else grad
        grad = np.float32(grad)

        for dependency, _grad in zip(self.dependencies, self.grads):
            if dependency.trainable:
                local_grad = np.float32(_grad)

                if self.matmul_product:                
                    if dependency == self.dependencies[0]:
                        local_grad = grad @ local_grad
                    else:
                        local_grad = local_grad @ grad
                else:
                    if dependency.shape != 0 and not same_shape(grad.shape, local_grad.shape):
                        ndims_added = grad.ndim - local_grad.ndim
                        for _ in range(ndims_added):
                            grad = grad.sum(axis=0)

                        for i, dim in enumerate(dependency.shape):
                            if dim == 1:
                                grad = grad.sum(axis=i, keepdims=True)

                    local_grad = local_grad * np.nan_to_num(grad)


                dependency.gradient += local_grad
                dependency.get_gradients(local_grad)

    def __repr__(self):
        return f"Tensor ({self.value})"

and it can be used like so...

a = Tensor(10)
b = Tensor(5)
c = 2
d = (a*b)**c
d.get_gradients()

print (a.gradient, b.gradient)

OUTPUT:
500.0 1000.0
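
These numbers can be sanity-checked without the Tensor class at all. The sketch below estimates the same two gradients with central finite differences on plain floats; f and central_difference are helper names made up for this check.

```python
def f(a, b, c=2):
    # Same calculation as d = (a*b)**c, using plain floats
    return (a * b) ** c


def central_difference(func, x, h=1e-4):
    # Numerical derivative: (f(x + h) - f(x - h)) / (2h)
    return (func(x + h) - func(x - h)) / (2 * h)


a, b = 10.0, 5.0
grad_a = central_difference(lambda v: f(v, b), a)  # vary a, hold b constant
grad_b = central_difference(lambda v: f(a, v), b)  # vary b, hold a constant

print(round(grad_a, 3), round(grad_b, 3))  # 500.0 1000.0
```

This agrees with the analytic result: d = (ab)^2, so the gradient with respect to a is 2ab^2 = 500 and with respect to b is 2a^2b = 1000.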

Thank you

Thank you for reading through all of this post! The code for this post can be found in autodiff.py in the GitHub repo linked at the start.

Business Applications of AI

ashwins-code — Tue, 08 Mar 2022 11:26:40 +0000

Hello and welcome to this slightly different blog post where we talk about the applications of AI in the business world.

In general, AI is used to increase efficiency in a workplace, which often means improving profits. This post will highlight how different types of businesses would utilise AI to their advantage.

McDonald's Automated Drive-Thrus

70% of McDonald's sales come from their drive-thrus, so finding ways for the business to increase drive-thru efficiency is crucial.

In 2021, McDonald's partnered with IBM to help develop their ideas for an automated drive-thru.

The order-taking process would be controlled by an AI powered bot which utilises machine learning for voice recognition and for natural language processing.

McDonald's have already implemented this in a small number of drive-thrus and have said it has cut the average time a customer spends in a drive-thru by 30 seconds, which resulted in an increase in customer satisfaction.

To further increase drive-thru efficiency, McDonald's decided to implement order prediction at a few of their restaurant locations in 2018.

They devised a machine learning solution which considered the time of day, restaurant traffic, current order selections, the weather and the popularity of menu items to update the digital menu displays at drive-thrus.

Amazon AI Powered Grocery Stores

Amazon Fresh / Amazon Go are AI powered stores, where customers can simply walk into the store, get what they need and walk out. It is a completely till-less experience, which revolutionises shopping. No need to wait in long queues to check out - you just leave.

Users need to download the Amazon Go app, which links to their Amazon account for billing. Users enter the shop through a turnstile, where they scan a QR code provided by the app.

The shop uses a combination of machine vision and in-store sensors to watch customers and keep track of their virtual baskets. They can put the products into any of their bags without the need to scan them and they can put them back on the shelf if they change their mind. The app shows what items you have picked up so far and has a receipt screen to show what you bought by the time you leave.

AI In Recruitment

Several companies use Natural Language Processing to scan through the resumes of shortlisted candidates.

The algorithm would look at several different factors such as education, location, skills and some would even recommend other job positions for the candidate.

This saves a huge amount of time for recruiters, while also aiming to make the selection unbiased.

However, it is questionable whether the algorithm is truly unbiased. All machine learning models learn from real-world data, generated by human behaviour. All humans have a certain degree of bias, whether subconscious or not, so it is possible that the machine learning models learn this bias too. It all comes down to the quality of the dataset used to train the model.

This small game shows a fun insight into how bias in AI models could appear!

Social Media Insights

Many businesses use social media platforms to gain more exposure and utilise AI to reach the right people.

This can be seen on YouTube or Instagram where users are shown ads, based on the type of content they commonly see and their location.

This gives rise to the concern that people's data is being sold to third-party companies without their knowledge or permission. Often, a company simply tells a social media platform that it wants to advertise there, and the platform uses its AI models to recommend the company's adverts to the right users, all without giving any data to the advertising company. But the selling of data does happen with some companies, and it is unknown what the businesses may do with the information they are given.

However, I believe a lot of the data collected is anonymised, so the selling of data may not be as malicious as it is thought to be. Most of the time, the data is used for targeted ads and not for any malicious purpose, and users are given options to say that they do not want their data to be shared.

Thank you for reading through the post and please do check out my other posts!

Deep Learning Library From Scratch 4: Automatic differentiation

ashwins-code — Sun, 06 Mar 2022 12:05:58 +0000

Welcome to part 4 of this series, where we will talk about automatic differentiation.

Github repo to code for this series:

ashwins-code / Zen-Deep-Learning-Library

Deep Learning library written in Python. Contains code for my blog series on building a deep learning library.

Last post: https://dev.to/ashwinscode/deep-learning-library-from-scratch-3-more-optimisers-4l23

What is automatic differentiation?

Firstly, we need to recap on what a derivative is.

In simple terms, a derivative of a function with respect to a variable measures how much the result of the function would change with a change in the variable. It essentially measures how sensitive the function is to a change in that variable. This is an essential part of training neural networks.

So far in our library, we have been calculating derivatives of variables by hand. However, in practice, deep learning libraries rely on automatic differentiation.

Automatic differentiation is the process of accurately calculating derivatives of any numerical function expressed as code.

In simpler terms, for any calculations we perform in our code, we should be able to calculate the derivatives of any variables used in that calculation.

...
y = 2*x + 10
y.grad(x) #what is the gradient of x???
...

Forward-mode autodiff and reverse-mode autodiff

There are two popular methods of performing automatic differentiation: forward-mode and reverse-mode.

Forward-mode utilises dual numbers to compute derivatives.

A dual number is any number of the form...

x = a + b ε

where $ε$ is treated as an infinitesimally small, non-zero quantity, such that $ε^2 = 0$

If we apply a function to a dual number as such...

f(x) = f(a + bε) = f(a) + (f'(a) \cdot b)ε

you can see we calculate both the result of $f(a)$ and the gradient of $f$ at $a$, given by the coefficient of $ε$ (with $b = 1$, the coefficient is exactly $f'(a)$).
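
As a rough sketch of how this works in code, the class below implements dual numbers for addition and multiplication only. Dual and its attribute names are made up for this example and are not part of our library.

```python
class Dual:
    """Dual number a + b*eps with eps**2 == 0 (illustrative sketch)."""

    def __init__(self, real, eps=0.0):
        self.real = real  # the value a
        self.eps = eps    # the coefficient b of eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 = 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__


def f(x):
    return 2 * x * x + 10  # f'(x) = 4x


result = f(Dual(3.0, 1.0))      # seed the eps coefficient with b = 1
print(result.real, result.eps)  # 28.0 12.0
```

The real part is f(3) = 28 and the eps coefficient is f'(3) = 12, both produced by a single evaluation of f.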

Forward-mode is preferred when the input dimensions are smaller than the output dimensions of the function. In a deep learning setting, however, the input dimensions are usually larger than the output dimensions, so reverse-mode is preferred.

In our library, we will implement reverse-mode differentiation for this reason.

Reverse-mode differentiation is a bit more difficult to implement.

As calculations are performed, a computation graph is built.

For example, the following diagram shows the computation graph for $\frac{2x^2 + 2y}{4}$

After this graph is built, the function is evaluated.

Using the function evaluation and the graph, derivatives of all variables used in the function can be calculated.

This is because each operator node comes with a mechanism to calculate its partial derivatives with respect to the nodes that it involves.

If we look at the bottom right node of the diagram (the $2y$ node), the multiplier node should be able to calculate its derivative with respect to the "y" node and the "2" node.

Each operator node would have different mechanisms, since the way a derivative is calculated depends on the operation involved.

When using the graph to calculate derivatives, I find it easier to traverse the graph in a depth-first manner. You start at the very top node and calculate its derivative with respect to the next node (remember, you are traversing depth-first) and record that node's gradient. Move down to that node and repeat the process. Each time you move down a level in the graph, multiply the gradient you just calculated by the gradient you calculated in the previous level (this is due to the chain rule). Repeat until all the nodes' gradients have been recorded.
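
The traversal described above can be sketched for scalars in a few lines. Node, mul and add are hypothetical names used only for this illustration, and division by 4 is written as multiplication by 0.25 to keep the sketch short.

```python
class Node:
    """Hypothetical scalar graph node, for illustration only."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this node was computed from
        self.local_grads = local_grads  # d(self)/d(parent), in the same order
        self.gradient = 0.0

    def backward(self, upstream=1.0):
        # Depth-first: pass upstream * local gradient down to each parent (chain rule)
        for parent, local in zip(self.parents, self.local_grads):
            grad = upstream * local
            parent.gradient += grad
            parent.backward(grad)


def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))


def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))


# (2x^2 + 2y) / 4 at x = 3, y = 5
x, y = Node(3.0), Node(5.0)
two, quarter = Node(2.0), Node(0.25)
top = mul(add(mul(two, mul(x, x)), mul(two, y)), quarter)
top.backward()

print(top.value, x.gradient, y.gradient)  # 7.0 3.0 0.5
```

The results match the derivatives worked out by hand: the gradient with respect to x is 4x/4 = 3 and the gradient with respect to y is 2/4 = 0.5.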

Note: it is not necessary to calculate all the gradients in the graph. If you want to find the gradient of a single variable, you can stop once its gradient has been calculated. However, we'd usually want to find the gradients of many variables, so calculating all the gradients in the graph at once is computationally much cheaper, since it only requires one traversal of the graph. If you calculated ONLY the gradient of each variable you wanted, one at a time, you would have to traverse the graph once per variable, which would be much more computationally expensive.

Differentiation rules

Here are the differentiation rules used by each type of node when calculating the derivatives in the computation graph.

Note: all of these will show the partial derivative, meaning everything that is not the variable we are finding the gradient of is treated as a constant.

In the following, think of $x$ and $y$ as nodes in the graph and $z$ as the result of the operation applied between these nodes.

At multiplication nodes...

z = x \cdot y \newline \frac{dz}{dx} = y \newline \frac{dz}{dy} = x

At division nodes...

z = \frac{x}{y} \newline \frac{dz}{dx} = \frac{1}{y} \newline \frac{dz}{dy} = -xy^{-2}

At addition nodes...

z = x + y \newline \frac{dz}{dx} = 1 \newline \frac{dz}{dy} = 1

At subtraction nodes...

z = x - y \newline \frac{dz}{dx} = 1 \newline \frac{dz}{dy} = -1

At power nodes...

z = x^y \newline \frac{dz}{dx} = yx^{y-1} \newline \frac{dz}{dy} = x^y \cdot \ln(x)
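
The two power-node rules are the least obvious ones, so here is a quick numerical check of them on plain floats. central_difference is a helper name made up for this check, and the values of x and y are arbitrary.

```python
import math


def central_difference(func, v, h=1e-6):
    # Numerical derivative: (f(v + h) - f(v - h)) / (2h)
    return (func(v + h) - func(v - h)) / (2 * h)


x, y = 2.0, 3.0

dz_dx = y * x ** (y - 1)      # rule for the base: y * x^(y-1)
dz_dy = x ** y * math.log(x)  # rule for the exponent: x^y * ln(x)

# Compare each rule with a numerical estimate, holding the other variable constant
check_dx = central_difference(lambda v: v ** y, x)
check_dy = central_difference(lambda v: x ** v, y)

print(dz_dx, round(dz_dy, 4))                       # 12.0 5.5452
print(math.isclose(dz_dx, check_dx, rel_tol=1e-6))  # True
print(math.isclose(dz_dy, check_dy, rel_tol=1e-6))  # True
```

Note that the ln(x) term is only defined for x > 0, so the derivative with respect to the exponent does not exist as a real number for a negative base.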

The chain rule is then used to backpropagate all the gradients in the graph...

y = f(g(x)) \newline \frac{dy}{dx} = f'(g(x)) \cdot g'(x)

However, when matrix multiplying, the chain rule gets a bit different...

z = x \cdot y \newline \frac{dz}{dx} = f'(z) \otimes y^T \newline \frac{dz}{dy} = x^T \otimes f'(z)

... where $f (z)$ is a function that involves $z$ , meaning $f^{'} (z)$ would be the gradient calculated in the previous layer of the graph. By default (aka if z is the highest node in the graph), $f (z) = z$ , meaning $f^{'} (z)$ would be a matrix of 1s with the same shape as $z$ .
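
The shapes are the easiest way to see why the order of the products matters. The sketch below uses plain numpy with arbitrary matrix sizes, takes z as the top node (so the incoming gradient is all ones) and compares one entry against a numerical estimate. All variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
y = rng.normal(size=(3, 4))
upstream = np.ones((2, 4))  # f'(z) for the top node: all ones, same shape as z

grad_x = upstream @ y.T  # shape (2, 3), same as x
grad_y = x.T @ upstream  # shape (3, 4), same as y

# Numerical check of one entry: nudge x[0, 1] and see how much sum(z) changes
h = 1e-6
x_nudged = x.copy()
x_nudged[0, 1] += h
numeric = ((x_nudged @ y).sum() - (x @ y).sum()) / h

print(grad_x.shape, grad_y.shape)                    # (2, 3) (3, 4)
print(np.isclose(grad_x[0, 1], numeric, atol=1e-4))  # True
```

Swapping the order of either product would give a shape that no longer matches the operand, which is why the left and right operands are handled separately.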

The code

The GitHub repo I linked at the start contains all the code for the automatic differentiation part of the library, and all the neural network layers, optimisers and loss functions have been updated to use automatic differentiation to calculate gradients.

To avoid this post being too long, I will show and explain the code in the next post!

Thank you for reading!

Deep Learning Library From Scratch 3: More optimisers

ashwins-code — Sat, 08 Jan 2022 10:19:58 +0000

Welcome to part 3 of this series, where we build a deep learning library from scratch.

In this post, we will add more optimisation functions and loss functions to our library.

Here is the Github repo for this series

ashwin6-dev / Zen-Deep-Learning-Library

Deep Learning library written in Python. Contains code for my blog series on building a deep learning library.

Last post: https://dev.to/ashwinscode/deep-learning-library-from-scratch-2-backpropagation-116m

Optimisation functions

The goal of an optimisation function is to tweak the network parameters to minimise the neural network's loss.

It does this by taking the gradient of the loss with respect to the parameters and using this gradient to update the parameters.

Different optimisation functions use the gradients in different ways, which can accelerate the training process!

If we graph out the loss function, as seen in the image above, optimisers aim to change the parameters of the neural network, so that the minimum loss value is produced aka the lowest dip in the graph.

The path which the optimisers take during the training process is represented by the black ball.

Momentum

Momentum is an optimisation function, which extends the gradient descent algorithm (which we looked at in the last post).

It is designed to accelerate the training process, meaning it would minimise the loss in fewer epochs. If we think about our "black ball", momentum causes this black ball to accelerate quickly towards the minimum, like rolling a ball down from the top of a hill.

Momentum accumulates the gradients calculated in previous epochs, which helps it to determine the direction to move in to minimise the loss.

The formula it uses to update parameters is as follows

d_{t} = \beta \cdot d_{t-1} + l \cdot g \newline p_{t+1} = p_{t} - d_{t}

$p_{t}$ is the parameter value at epoch t
$d_{t}$ is the "direction" to go at epoch t, calculated from previous epochs' gradients. It is initialised at 0.
$l$ is the learning rate
$\beta$ is a predetermined value (usually chosen to be 0.9)
$g$ is the gradient of the loss with respect to the parameter

Here is our python implementation of this optimiser

#optim.py

#...
class Momentum:
    def __init__(self, lr = 0.01, beta=0.9):
        self.lr = lr
        self.beta = beta

    def momentum_average(self, prev, grad):
        return (self.beta * prev) + (self.lr * grad)

    def __call__(self, model, loss):
        grad = loss.backward()

        for layer in tqdm.tqdm(model.layers[::-1]):
            grad = layer.backward(grad)

            if isinstance(layer, layers.Layer):
                if not hasattr(layer, "momentum"):
                    layer.momentum = {
                        "w": 0,
                        "b": 0
                    }
                layer.momentum["w"] = self.momentum_average(layer.momentum["w"], layer.w_gradient)
                layer.momentum["b"] = self.momentum_average(layer.momentum["b"], layer.b_gradient)

                layer.w -= layer.momentum["w"]
                layer.b -= layer.momentum["b"]
#...
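
To see the update rule in isolation, here is a minimal sketch that applies the same two equations to a single number instead of a layer. The toy loss p**2, the starting value and the learning rate of 0.1 are assumptions chosen for this example, and momentum_step is a made-up helper name.

```python
def momentum_step(p, d, grad, lr=0.1, beta=0.9):
    d = beta * d + lr * grad  # d_t = beta * d_{t-1} + l * g
    return p - d, d           # p_{t+1} = p_t - d_t


p, d = 5.0, 0.0  # the "direction" d starts at 0
for _ in range(300):
    grad = 2 * p  # gradient of the toy loss p**2
    p, d = momentum_step(p, d, grad)

print(abs(p) < 1e-3)  # True: p has settled very close to the minimum at 0
```

Because d keeps a share of the previous updates, p overshoots 0 and oscillates with shrinking amplitude before settling, much like the ball rolling past the bottom of the dip.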

RMSProp

RMSProp works by taking an exponential average of the squares of the previous gradients. An exponential average is used to give recent gradients more weight than earlier gradients.

This exponential average is used to determine the update in the parameter.

RMSProp aims to minimise the oscillations in the training step. In terms of our "black ball", the "ball" would take a smooth, straight path towards the minimum, instead of zig-zagging towards it, which often happens with other optimisers.

Here are the equations for parameter updates...

d_{t} = \beta \cdot d_{t-1} + (1 - \beta) \cdot g^{2} \newline \varDelta p_{t} = \frac{l}{\sqrt{d_{t} + \epsilon}} \cdot g \newline p_{t+1} = p_{t} - \varDelta p_{t}

$p_{t}$ is the parameter value at epoch t
$d_{t}$ is the exponential squared average of previous gradients. It is initialised at 0.
$l$ is the learning rate
$\beta$ is a predetermined value (usually chosen to be 0.9)
$g$ is the gradient of the loss with respect to the parameter
$\epsilon$ is a predetermined value, to avoid division by 0. Usually set at $10^{-10}$

As seen in the second equation, we divide the learning rate by the square root of the exponential average. The effective step size therefore grows as the exponential average shrinks, which tends to happen in later epochs as the gradients get smaller.

RMSProp also automatically slows down as it approaches the minimum, which is ideal, since too large a step size would cause an overcorrection when updating the parameters.

Here is our Python implementation...

#optim.py

#...
class RMSProp:
    def __init__(self, lr = 0.01, beta=0.9, epsilon=10**-10):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon

    def rms_average(self, prev, grad):
        return self.beta * prev + (1 - self.beta) * (grad ** 2)

    def __call__(self, model, loss):
        grad = loss.backward()

        for layer in tqdm.tqdm(model.layers[::-1]):
            grad = layer.backward(grad)

            if isinstance(layer, layers.Layer):
                if not hasattr(layer, "rms"):
                    layer.rms = {
                        "w": 0,
                        "b": 0
                    }

                layer.rms["w"] = self.rms_average(layer.rms["w"], layer.w_gradient)
                layer.rms["b"] = self.rms_average(layer.rms["b"], layer.b_gradient)

                layer.w -= self.lr / (np.sqrt(layer.rms["w"] + self.epsilon)) * layer.w_gradient
                layer.b -= self.lr / (np.sqrt(layer.rms["b"] + self.epsilon)) * layer.b_gradient
#...

Adam

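Adam combines the ideas behind Momentum and RMSProp: it keeps an exponential average of the gradients (as Momentum does) and an exponential average of their squares (as RMSProp does).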

Here are the update equations...

$$
\begin{aligned}
v_{t} &= \beta_{1} \cdot v_{t-1} + (1 - \beta_{1}) \cdot g \\
s_{t} &= \beta_{2} \cdot s_{t-1} + (1 - \beta_{2}) \cdot g^{2} \\
\varDelta p_{t} &= l \cdot \frac{v_{t}}{\sqrt{s_{t}} + \epsilon} \\
p_{t+1} &= p_{t} - \varDelta p_{t}
\end{aligned}
$$

$p_{t}$ is the parameter value at epoch t
$v_{t}$ is the exponential average of previous gradients. It is initialised at 0.
$s_{t}$ is the exponential squared average of previous gradients. It is initialised at 0.
$l$ is the learning rate
$\beta_{1}$ is a predetermined value (usually chosen to be 0.9)
$\beta_{2}$ is a predetermined value (usually chosen to be 0.999)
$g$ is the gradient of the loss with respect to the parameter
$\epsilon$ is a predetermined value, to avoid division by 0. Usually set at $10^{-8}$

Here is our Python implementation...

#optim.py

#...
class Adam:
    def __init__(self, lr = 0.01, beta1=0.9, beta2=0.999, epsilon=10**-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

    def rms_average(self, prev, grad):
        return (self.beta2 * prev) + (1 - self.beta2) * (grad ** 2)

    def momentum_average(self, prev, grad):
        return (self.beta1 * prev) + ((1 - self.beta1) * grad)

    def __call__(self, model, loss):
        grad = loss.backward()

        for layer in tqdm.tqdm(model.layers[::-1]):
            grad = layer.backward(grad)

            if isinstance(layer, layers.Layer):
                if not hasattr(layer, "adam"):
                    layer.adam = {
                        "w": 0,
                        "b": 0,
                        "w2": 0,
                        "b2": 0,
                        "t": 0  # number of updates applied to this layer so far
                    }

                layer.adam["t"] += 1
                layer.adam["w"] = self.momentum_average(layer.adam["w"], layer.w_gradient)
                layer.adam["b"] = self.momentum_average(layer.adam["b"], layer.b_gradient)
                layer.adam["w2"] = self.rms_average(layer.adam["w2"], layer.w_gradient)
                layer.adam["b2"] = self.rms_average(layer.adam["b2"], layer.b_gradient)

                # bias correction: the averages start at 0, so divide by (1 - beta ** t)
                w_adjust = layer.adam["w"] / (1 - self.beta1 ** layer.adam["t"])
                b_adjust = layer.adam["b"] / (1 - self.beta1 ** layer.adam["t"])
                w2_adjust = layer.adam["w2"] / (1 - self.beta2 ** layer.adam["t"])
                b2_adjust = layer.adam["b2"] / (1 - self.beta2 ** layer.adam["t"])

                # epsilon sits in the denominator to avoid division by 0
                layer.w -= self.lr * w_adjust / (np.sqrt(w2_adjust) + self.epsilon)
                layer.b -= self.lr * b_adjust / (np.sqrt(b2_adjust) + self.epsilon)
#...
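The `_adjust` values in the implementation play the role of Adam's bias correction, which the update equations above leave out: because $v_{t}$ and $s_{t}$ start at 0, the early averages underestimate the gradients. In the standard formulation of Adam, each average is divided by $1 - \beta^{t}$, where $t$ counts the updates made so far. A minimal standalone sketch of that formulation for a single scalar parameter, minimising the made-up function $f(p) = (p - 3)^{2}$ (the name `adam_step` is hypothetical and not part of the library):

```python
import math

def adam_step(p, v, s, g, t, lr=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter p with gradient g at step t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * g       # exponential average of gradients
    s = beta2 * s + (1 - beta2) * g ** 2  # exponential average of squared gradients
    v_hat = v / (1 - beta1 ** t)          # bias correction (standard Adam)
    s_hat = s / (1 - beta2 ** t)
    p = p - lr * v_hat / (math.sqrt(s_hat) + epsilon)
    return p, v, s

p, v, s = 0.0, 0.0, 0.0
history = []
for t in range(1, 2001):
    g = 2 * (p - 3)                       # gradient of f(p) = (p - 3)^2
    p, v, s = adam_step(p, v, s, g, t)
    history.append(p)

print(history[0])   # the first step is about lr in size, whatever the gradient's magnitude
print(history[-1])  # approaches the minimum at p = 3
```

With the correction in place, the first update has size close to `lr`, because the corrected averages reduce to `g` and `g ** 2` at `t = 1`.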

Using our new optimisers!

This is how we'd use our new optimisers in our library, training a model for the same problem we described last post (XOR gate).

import layers
import loss
import optim
import numpy as np


x = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]]) 

net = layers.Model([
    layers.Linear(8),
    layers.Linear(4),
    layers.Sigmoid(),
    layers.Linear(2),
    layers.Softmax()
])

net.train(x, y, optim=optim.RMSProp(lr=0.02), loss=loss.MSE(), epochs=200)

print (net(x))

epoch 190 loss 0.00013359948998165245
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 191 loss 0.00012832321751534635
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 192 loss 0.0001232564322705172
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 193 loss 0.00011839076882215646
100%|██████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 5018.31it/s]
epoch 194 loss 0.00011371819900553786
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 195 loss 0.00010923101808808603
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 196 loss 0.00010492183152425807
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 197 loss 0.00010078354226798005
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 198 loss 9.680933861835338e-05
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 199 loss 9.299268257548828e-05
100%|████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
epoch 200 loss 8.932729868441197e-05
[[0.00832775 0.99167225]
 [0.98903246 0.01096754]
 [0.99082742 0.00917258]
 [0.00833392 0.99166608]]

As you can see, our model has trained much better than it did in the last post, thanks to our new optimiser!
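Each row of the output is a pair of class probabilities, so the predicted XOR value is the index of the larger entry. A small sketch using the values printed above (copied by hand into plain lists; the `argmax` helper is written out here for illustration) confirms that all four predictions match the targets:

```python
# Network outputs printed above, and the one-hot targets y from the training script.
outputs = [
    [0.00832775, 0.99167225],
    [0.98903246, 0.01096754],
    [0.99082742, 0.00917258],
    [0.00833392, 0.99166608],
]
targets = [[0, 1], [1, 0], [1, 0], [0, 1]]

def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])

predicted = [argmax(row) for row in outputs]
expected = [argmax(row) for row in targets]

print(predicted)              # [1, 0, 0, 1]
print(predicted == expected)  # True
```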

Thanks for reading! Next post we will apply our library so far to a more advanced problem (handwritten digit recognition!)