Forem: Philipp Muens

Swift for modern Machine Learning

Philipp Muens — Thu, 13 Feb 2020 13:31:03 +0000

Note: In this post we'll compare and contrast different programming languages. Everything discussed should be taken with a grain of salt. There's no single programming language which solves all the problems in an elegant and performant way. Every language has its up- and downsides. Swift is no exception.

When entering the Data Science and Machine Learning world, there are various programming languages and tools to choose from. There's MATLAB, a commercial programming environment which is used across different industries and usually the tool of choice for practitioners with a heavy Math background. A free and Open Source alternative is the R project, a programming language created in 1993 to simplify statistical data processing. People working with R usually report enjoyment as R is "hackable" and comes bundled with different math modules and plotting libraries.

A more recent incarnation is the Julia scientific programming language which was created at MIT to resolve the issues older tools such as MATLAB and R struggled with. Julia cleverly incorporates modern engineering efforts from the fields of compiler construction and parallel computing, and given its Open Source nature it has gained a lot of industry-wide adoption since it reached v1 maturity in 2018.

Python as the de-facto standard

If you're doing some more research to find the most used programming language in Data Science and Machine Learning you might be surprised to see a language which wasn't built from the ground up for scientific computing: Python.

The Python programming language was created by Guido van Rossum in 1989 to help bridge the gap between Bash scripting and C programming. Since then Python has taken the world by storm, mainly due to its gentle learning curve, its expressiveness and its powerful standard library which makes it possible to focus on the core problems rather than reinventing the wheel over and over again.

Funny tangent: Open up a shell, run the Python interpreter via python and enter import this or import antigravity

Python is a general purpose programming language which was never designed to solve problems in a niche subject matter. Are you an Instagram user? They're running Python. Do you curate content via Pinterest? They're running Python. Do you store your data via Dropbox? They've developed their MVP in Python and still use it today. Even Google (then called BackRub) started out with Python and Java. The list goes on and on.

Given such an industry-wide adoption it's easy to see why a lot of care and effort were put into the ecosystem of reusable packages as well as the language itself. No matter what use case you're working on, chances are that there are numerous Python packages helping you solve your problems.

While more famous Python projects include Web Frameworks such as Django or Flask there are also a lot of mature Scientific and Machine Learning implementations written in Python. Having access to such a robust foundation it only makes sense that modern Deep Learning frameworks such as TensorFlow or PyTorch are also leveraging those libraries under the covers.

Hitting the limits

All of the things discussed so far sound great. Python, a general purpose programming language which has quite a few years of existence under its belt, is used across industries in mission critical software systems. Over the course of 30 years a vibrant Open Source community emerged which develops and maintains powerful libraries used by millions of users on a daily basis.

Why bother and replace Python? If it ain't broke, don't fix it!

Technology is constantly improving. What was once unthinkable might all of a sudden be possible thanks to breakthroughs in Hard- and Software development. Python was created in a different era with a different purpose. It was never engineered to directly interface with hardware or run complex computations across a fleet of distributed machines.

A modern Deep Learning Framework such as TensorFlow uses dozens of programming languages behind the scenes. The core of such a library might be written in high-performance C++ which occasionally interfaces with different C libraries, Fortran programs or even parts of Assembly language to squeeze out every bit of performance possible. A Python interface is usually built on top of the C++ core to expose a simple public API Data Scientists and Deep Learning enthusiasts use.

Why isn't Python used throughout the whole stack?

The answer to this question is rather involved but the gist of it is that Python's language design is more tailored towards high level programming. Furthermore it's just not fast enough to be used at the lower layers.

The following is an incomplete list of Python's (subjective) shortcomings:

Speed

Python code often runs an order of magnitude slower than equivalent code in compiled languages and in some other interpreted languages. Projects such as Cython, which compile Python code to C, try to mitigate this problem but they come with other issues (e.g. language inconsistencies, compatibility problems, ...).

Parallel Processing

It's not that straightforward to write Python code which reliably performs parallel processing tasks on multiple cores or even multiple machines. Deep Neural Networks can be expressed as graphs on which Tensor computations are carried out, making it a prime use case for parallel processing.

Hardware integration

Python is a high level language with lots of useful abstractions which unfortunately get in the way when trying to directly interface with the computer's underlying hardware. Because of that, heavy GPU computations are usually moved into lower-level code written in e.g. C or CUDA.

Interpreted rather than compiled

Since it's a scripting language at its core, Python comes with its own runtime that evaluates the script statement by statement as it runs it, a process called "interpretation".

The other branch of programming languages are compiled languages. Compiling code means that the human-readable program code is translated into code a machine can read and understand. Compiled programs have the downside that there's a compilation step in between writing and running the program. The upside of such a step is that various checks and optimizations can be performed while translating the code, eventually emitting highly efficient machine code.

Dynamic typing

Python is dynamically typed: types are attached to values rather than variables and are only checked while the program runs. There's no problem in passing an integer into a function which expects a string. Python will run the program and raise an exception as soon as it evaluates the broken code.

Statically typed languages have the upside that mistakes like the one described above are caught before the program ever runs. The types a function expects are declared (or inferred) and checked by the compiler.

Since version 3.5 Python supports type hints. The interpreter doesn't enforce them though, so on their own they merely serve as another form of documentation. It takes an external checker such as mypy to catch type misuses ahead of time.
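To make the point about type hints concrete, here's a minimal sketch. The function name shout is made up for illustration. Nothing in the interpreter stops the mistyped call; the program only fails once the offending line is evaluated.

```python
def shout(text: str) -> str:
    # The hint says `text` is a string, but the interpreter does not enforce it
    return text.upper() + "!"


print(shout("hello"))
# HELLO!

try:
    # Type hints don't stop this call: it only fails when `.upper()` is evaluated
    shout(42)
except AttributeError as err:
    print(f"Runtime error: {err}")
# Runtime error: 'int' object has no attribute 'upper'
```

In a long-running job the faulty call could sit behind hours of computation before it finally blows up.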

Interoperability

A lot of prominent packages such as NumPy wrap other languages such as Fortran or C to offer reliable performance when working on computationally expensive data processing tasks.

While it's certainly not impossible to introduce existing libraries written in other languages into Python, the process to do that is oftentimes rather involved.

Entering Swift

Without going into too much detail it makes sense to take a quick detour and study the origins of the Swift programming language in order to see why it has such a potential to replace Python as the go-to choice for Data Science and Machine Learning projects.

Chris Lattner, the inventor of Swift, has a long history and established track record in modern compiler development. During college he worked on a project which eventually became LLVM ("Low Level Virtual Machine"), the famous compiler infrastructure toolchain. The revolutionary idea behind LLVM is the introduction of frontends and backends which can be mixed and matched. One frontend could be written for Swift which is then coupled with a backend implementation for the x86 architecture. Making it possible to compile to another architecture is as simple as using another backend such as the one for PowerPC. Back in the early compiler days one had to write the compiler end-to-end, tightly coupling the frontend and backend, making it a heroic effort to offer the compiler for different platforms.

LLVM gained a lot of traction and Chris Lattner was eventually hired by Apple to work on its developer tooling which heavily relied on LLVM. During that time he worked on a C++ compiler and thought about what a better, more modern programming language might look like. He figured that it should be compiled, easy to learn, flexible enough to feel like a scripting language and at the same time "hackable" at every layer. Those ideas translated into the Swift programming language which was officially released at WWDC in 2014.

But what exactly makes Swift such a natural fit as a Python replacement? Isn't Swift only used for iOS and macOS apps? The following section shows why Swift could be Python's successor.

It's compiled

Swift is compiled via LLVM which means that its code is translated into optimized machine code directly running on the target platform. Improvements made to the LLVM compiler toolchain automatically benefit the Swift code generation.

There's the saying that Swift is "syntactic sugar for LLVM" which rings true as one can see with the Builtin usage for its core types. The linked code snippet shows that Swift's core types directly interface with their LLVM equivalents.

Python-like syntax

Despite the compilation process Swift feels like a dynamic, Python-esque language. Swift was designed from the ground up for programs to incrementally grow in complexity as necessary. The simplest of all Swift programs is just one line of code: print("Hello World").

let greeting = "Hello World"
print(greeting)
// Hello World

let num1 = 1
let num2 = 2
print(num1 + num2)
// 3

let scores = [10, 35, 52, 92, 88]
for score in scores {
    print(score)
}
// 10
// 35
// 52
// 92
// 88

class Cat {
    var name: String
    var livesRemaining: Int = 9

    init(name: String) {
        self.name = name
    }

    func describe() -> String {
        return "👋 I'm \(self.name) and I have \(self.livesRemaining) lives 😸"
    }
}
let mitsy = Cat(name: "Mitsy")
print(mitsy.describe())
// 👋 I'm Mitsy and I have 9 lives 😸

Static typing

Given that Swift is compiled via LLVM, it's statically type checked during the compilation process. There's no way you can pass an invalid type to a function and run into an error during runtime. If your code compiles you can be pretty sure that you're passing around the expected types.

func sum(xs: [Int]) -> Int {
    var result: Int = 0
    for x: Int in xs {
        result = result + x
    }
    return result
}

// Using correct types
let intNumbers: [Int] = [1, 2, 3, 4, 5]
let resultInt = sum(xs: intNumbers)
print(resultInt)
// 15

// Using incorrect types
let stringNumbers: [String] = ["one", "two", "three"]
let resultString = sum(xs: stringNumbers)
print(resultString)
// error: cannot convert value of type '[String]' to expected argument type '[Int]'

Hackable

Swift's concepts of protocols and extensions make it dead simple to add new functionality to existing libraries or even types which ship with the language core itself. Want to add a new method to Int? No problem!

// One needs to implement `help` when using the `Debugging` Protocol
protocol Debugging {
    func help() -> String
}

// Implementing `Debugging` for MatrixMultiply
class MatrixMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-matrix multiplications."
    }

    func multiply() {
        // ...
    }
}
var matMult = MatrixMultiply()
print(matMult.help())
// Offers methods to aid with matrix-matrix multiplications.

// Implementing `Debugging` for VectorMultiply
class VectorMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-vector multiplications."
    }
}
var vecMult = VectorMultiply()
print(vecMult.help())
// Offers methods to aid with matrix-vector multiplications.

// Makes it possible to emojify an existing type
protocol Emojifier {
    func emojify() -> String
}

// Here we're extending Swift's core `Int` type
extension Int: Emojifier {
    func emojify() -> String {
        if self == 8 {
            return "🎱"
        } else if self == 100 {
            return "💯"
        }
        return String(self)
    }
}

print(8.emojify())
// 🎱
print(100.emojify())
// 💯
print(42.emojify())
// 42

Value semantics

I'm sure everyone has run into this problem before. An object is passed into a function and modified without bad intentions. Meanwhile the object is used in a different place and all of a sudden its internal state isn't what it's supposed to be. The culprit is the data mutation within the function.
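The problem described above can be reproduced in a few lines of Python, where lists are passed by reference. This is only a sketch; the names add_bonus and team_scores are made up for illustration.

```python
def add_bonus(scores):
    # Mutates the caller's list in place instead of working on a copy
    scores.append(100)
    return sum(scores)


team_scores = [10, 35, 52]
total = add_bonus(team_scores)

print(total)
# 197
print(team_scores)
# [10, 35, 52, 100]  <-- the caller's list changed as a side effect
```

Any other code still holding on to team_scores now sees the extra element, even though it never asked for it.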

This problem can be mitigated easily via value semantics. When using value semantics a "copy" rather than an object reference is passed around.

// As seen on: https://marcosantadev.com/copy-write-swift-value-types/

import Foundation

// Prints the memory address of the given object
func address(of object: UnsafeRawPointer) -> String {
    let addr = Int(bitPattern: object)
    return String(format: "%p", addr)
}

var list1 = [1, 2, 3, 4, 5]
print(address(of: list1))
// 0x7f2021f845d8

var list2 = list1
print(address(of: list2))
// 0x7f2021f845d8 <-- Both lists share the same address

list2.append(6) // <-- Mutating `list2`

print(list1)
// [1, 2, 3, 4, 5]

print(list2)
// [1, 2, 3, 4, 5, 6]

print(address(of: list1))
// 0x7f2021f84a38
print(address(of: list2))
// 0x128fb50 <-- `list2` has a different address

First-class C interoperability

Given that Swift compiles via LLVM it has access to existing LLVM-based implementations to interoperate with. One such project is Clang, a C language family frontend written for LLVM. Thanks to Clang it's dead simple to wrap existing C libraries and bring them into Swift projects.

The following video demonstrates how easy it is:

Swift for TensorFlow (S4TF)

Given all the upsides described above, the TensorFlow team decided to experiment with Swift as a Python replacement to interface with TensorFlow. Early prototypes were fruitful, encouraging the TensorFlow team to officially release Swift for TensorFlow (S4TF) in 2019.

S4TF extends the Swift core language with various features especially useful for Machine Learning tasks. Such enhancements include first-class autodiff support to calculate derivatives for functions or Python interoperability which makes it possible to reuse existing Python packages such as matplotlib, scikit-learn or pandas via Swift.

The following is a demonstration which shows how Swift for TensorFlow can be used to describe and train a deep neural network in TensorFlow:

Do you want to play around with Swift for TensorFlow yourself? Just run the following code in a terminal to spin up a Jupyter Notebook server with Swift Kernel support in a Docker container:

docker run -it -p 8888:8888 --rm --name jupyter-s4tf \
  -v "$PWD":/home/jovyan/work \
  --ipc=host \
  pmuens/jupyter-s4tf:latest jupyter notebook \
  --ip=0.0.0.0 \
  --no-browser \
  --allow-root \
  --NotebookApp.token='' \
  --notebook-dir=/home/jovyan/work

The code for the repository can be found here and the Docker Hub entry is here.

Conclusion

Python, the de-facto standard programming language for Data Science and Machine Learning, has served the community very well in the past. Nevertheless, given the trajectory of technological advancements we're slowly but surely hitting the limits of the tooling we currently have.

Performance critical code is already pushed down into lower-level implementations written in programming languages such as C or Fortran and wrapped via public Python APIs. Wouldn't it be nice to write expressive, yet performant code from the get go at every layer? And what about all the libraries out there? Wouldn't it be nice to wrap and reuse them with only a couple lines of code?

The lack of static typing in Python makes it painful to work on larger, more complex projects. It's all too easy to define a model and train it on a huge dataset just to realize that a type error interrupts the training process halfway through, an error which could've been caught up front via thorough type checks.

And what if we're hitting other roadblocks? Wouldn't it be nice to be able to peek under the covers and fix the issues ourselves in an "official" way without all the monkey-patching?

Most large-scale Machine Learning projects have already faced some, if not all of the issues listed above. The TensorFlow team experienced them too and looked into ways to solve them once and for all. What they came up with is Swift for TensorFlow (S4TF), a Swift language extension tailored towards modern Machine Learning projects. The Swift programming language comes with various properties which make it a great fit as a Python replacement: It shares a similar syntax, is compiled (and therefore runs fast), has a static type system and seamlessly interoperates with existing C and Python libraries.

What do you think? Is Swift for TensorFlow the future or do we stick with Python for now? Will a language such as Julia dominate the Data Science and Machine Learning world in the future?

Additional Resources

The following is a list of resources I've used to compile this blog post. There are also a couple of other sources linked within the article itself.

Generics

Philipp Muens — Tue, 01 Oct 2019 10:44:50 +0000

Generic programming makes it possible to describe an implementation in an abstract way with the intention to reuse it with different data types.

While generic programming is a really powerful tool as it prevents the programmer from repeating herself it can be hard to grasp for newcomers. This is especially true if you're not too familiar with typed programming languages.

This blog post aims to shed some light on the topic of generic programming. We'll discover why Generics are useful and which thought process can be applied to easily derive generic function signatures. At the end of the post you'll be able to author and understand functions like this:

function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}

Note: Throughout this post we'll use TypeScript as our language of choice. Feel free to code along while reading through it.

Of course you can "just use JavaScript" (or another dynamically typed language) to not deal with concepts such as typing or Generics. But that's not the point. The point of this post is to introduce the concepts of Generics in a playful way. TypeScript is just a replaceable tool to express our thoughts.

Motivation

Before we jump right into the application of generic programming it might be useful to understand what problem Generics are solving. We'll re-implement one of JavaScript's built-in Array methods called filter to get first-hand experience as to why Generics were invented.

Let's start with an example to understand what filter actually does. The JavaScript documentation for filter states that:

The filter() method creates a new array with all elements that pass the test implemented by the provided function.

Let's take a look at a concrete example to see how we would use filter in our programs. First off we have to define an array. Let's call our array numbers as it contains some numbers:

const numbers = [1, 2, 3, 4, 5, 6]

Next up we need to come up with a function our filter method applies to each element of such an array. This function determines whether the element-under-test should be included in the resulting / filtered array. Based on the quote above and the description we just wrote down we can derive that our function which is used by the filter method should return a boolean value. The function should return true if the element passes the test and false otherwise.

To keep things simple we pretend that we want to filter our numbers array such that only even numbers will be included in our resulting array. Here's the isEven function which implements that logic:

const isEven = (num: number): boolean => num % 2 === 0

Our isEven function takes in a num argument of type number and returns a boolean. We use the modulo operation to determine whether the number-under-test is even.

Next up we can use this function as an argument for the filter method on our array to get a resulting array which only includes even numbers:

const res = numbers.filter(isEven)

console.log(res)
// --> [2, 4, 6]

As we've stated earlier our goal is to implement the filter function on our own. Now that we've used filter with an example we should be familiar with its API and usage.

To keep things simple we won't implement filter on arrays but rather define a standalone function which accepts an array and a function as its arguments.

What we do know is that filter loops through every element of the array and applies the custom function to it in order to see if it should be included in the resulting array. We can translate this into the following code:

function filter(xs: number[], func: (x: number) => boolean): number[] {
  const res: number[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}

Now there's definitely a lot happening here and it might look intimidating but bear with me. It's simpler than it might look.

In the first line we define our function called filter which takes an array called xs (you can imagine pronouncing this "exes") and a function called func as its arguments. The array xs is of type number as we're dealing with numbers and the function func takes an x of type number, runs some code and returns a boolean. Once done our filter function returns an array of type number.

The function body simply defines an intermediary array of type number which is used to store the resulting numbers. Other than that we're looping over every element of our array and apply the function func to it. If the function returns true we push the element into our res array. Once done looping over all elements we return the res array which includes all the numbers for which our func function returned the value true.

Alright. Let's see if our homebrew filter function works the same way the built-in JavaScript filter function does:

const res = filter(numbers, isEven)

console.log(res)
// --> [2, 4, 6]

Great! Looks like it's working!

If we think about filtering in the abstract we can imagine that there's more than just the filtering of numbers.

Let's imagine we're building a Rolodex-like application. Here's an array with some names from our Rolodex:

const names = ['Alice', 'Bob', 'John', 'Alex', 'Pete', 'Anthony']

Now one of our application requirements is to only display names that start with a certain letter.

That sounds like a perfect fit for our filter function as we basically filter all the names based on their first character!

Let's start by writing our custom function we'll use to filter out names that start with an a:

const startsWithA = (name: string): boolean =>
  name.toLowerCase().charAt(0) === 'a'

As we can see our function takes one argument called name of type string and it returns a boolean which our function computes by checking if the first character of the name is an a.

Now let's use our filter function to filter the names:

const res = filter(names, startsWithA)

console.log(res)
// --> Type Error

Hmm. Something seems to be off here.

Let's revisit the signature of our filter function:

function filter(xs: number[], func: (x: number) => boolean): number[] {
  /* ... */
}

Here we can see that the xs parameter is an array of type number. Furthermore the func parameter takes an x of type number and returns a boolean.

However in our new Rolodex application we're dealing with names which are strings and the startsWithA function we've defined takes a string as an argument, not a number.

One way to fix this problem would be to create a copy of filter called e.g. filter2 whose parameters can handle strings rather than numbers. But we programmers know that we shouldn't repeat ourselves to keep things maintainable. In addition to that we're lazy, so using one function to deal with different data types would be ideal.

Entering Generics

And that's exactly the problem Generics tackle. As the introduction of this blog post stated, Generics can be used to describe an implementation in an abstract way in order to reuse it with different data types.

Let's use Generics to solve our problem and write a function that can deal with any data type, not just numbers or strings.

Before we jump into the implementation we should articulate what we're about to implement. Talking in the abstract we're basically attempting to filter an array of type T (T is our "placeholder" for some valid type here) with the help of our custom function. Given that our array has elements of type T our function should take each element of such type and produce a boolean as a result (like we did before).

Alright. Let's translate that into code:

function filter<T>(xs: T[], func: (x: T) => boolean): T[] {
  const res: T[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}

At a first glance this might look confusing since we've sprinkled in our T type here and there. However overall it should look quite familiar. Let's take a closer look into how this implementation works.

In the first line we define our filter function as a function which takes an array named xs of type T and a function called func which takes a parameter x of type T and returns a boolean. Our function filter then returns a resulting array which is also of type T, since it's basically a subset of elements of our original array xs.

The code inside the function body is pretty much the same as before with the exception that our intermediary res array also needs to be of type T.

There's one little detail we haven't talked about yet. There's this <T> at the beginning of the function. What does that actually do?

Well, our compiler doesn't really know what the type T might be at the end of the day. And it doesn't really care that much whether it's a string, a number or an object. It only needs to know that it's "some placeholder" type. We programmers have to tell the compiler that we're abstracting the type away via Generics here. So in TypeScript for example we use the syntax <TheTypePlaceHolder> right after the function name to signal to the compiler that we want our function to be able to deal with lots of different types (to be generic). Using T is just a convention. You could use any name you want as your "placeholder type". If your function deals with more than one generic type you'd just list them comma-separated inside the <> like this: <A, B>.

That's pretty much all we have to do to turn our limited, number-focused filter function into a generic function which can deal with all kinds of types. Let's see if it works with our numbers and names arrays:

let res

// using `filter` with numbers and our `isEven` function
res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]

// using `filter` with strings and our `startsWithA` function
res = filter(names, startsWithA)

console.log(res)
// --> ['Alice', 'Alex', 'Anthony']

Awesome! It works!

Function signatures as documentation

One of the many benefits of using a type system is that you can get a good sense of what the function will be doing based solely on its signature.

Let's take the function signature from the beginning of the post and see if we can figure out what it'll be doing:

function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}

The first thing we notice is that it's a generic function as we're dealing with 2 "type placeholders" A and B here. Next up we can see that this function takes in an array called xs of type A and a function func which takes an A and turns it into a B. At the end the foo function returns an array of type B.

Take a couple of minutes to parse the function signature in order to understand what it's doing.

Do you know what this function is called? Here's a tip: It's also one of those functions from the realm of functional programming used on e.g. arrays.

Here's the solution: The function we called foo here is usually called map as it iterates over the elements of the array and uses the provided function to map every element from one type to the other (note that it can also map to the same type, i.e. from type A to type A).

I have to admit that this was a rather challenging question. Here's how map is used in the wild:

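// Minimal `map` matching the `foo` signature above, so the example runs
const map = <A, B>(xs: A[], func: (x: A) => B): B[] => xs.map(func)

const numbers = [1, 2, 3, 4, 5, 6]
const numToString = (num: number): string => num.toString()

const res = map(numbers, numToString)

console.log(res)
// --> ['1', '2', '3', '4', '5', '6']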

Conclusion

In this blog post we've looked into Generics as a way to write code in an abstract and reusable way.

We've implemented our own filter function to understand why generic programming is useful and how it helps us to allow the filtering of lists of numbers, strings or more broadly speaking Ts.

Once we understood how to read and write Generic functions we've discovered how typing and Generics can help us to get a sense of what a function might be doing just by looking at its signature.

I hope that you've enjoyed this journey and feel equipped to read and write highly generic code.

Do you have any questions, comments, feedback? Feel free to send me an E-Mail or reach out to me via Twitter.

The intuition behind Word2Vec

Philipp Muens — Tue, 04 Jun 2019 18:22:00 +0000

Have you ever wondered how YouTube knows which videos to recommend, how Google Translate is able to translate whole texts into a decent version of the target language or how your Smartphone keyboard knows which words and text snippets to suggest while you type your texts?

There’s a very high likelihood that so-called Embeddings were used behind the scenes. Embeddings are one of the central ideas behind modern Natural Language Processing models.

In the following writeup we’ll discover the main building blocks and basic intuition behind Embeddings. We’ll learn how and why they work and how Word2Vec, a method to turn words into vectors, can be used to show that:

king - man + woman = queen
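As a rough intuition for what this equation means, here’s a toy sketch with hand-picked two-dimensional vectors. The values and the meaning of the two dimensions are assumptions made up for illustration. Real Word2Vec vectors are learned from text, have many more dimensions and usually only land near the target word rather than exactly on it.

```python
# Hand-picked toy vectors, NOT real Word2Vec embeddings.
# Dimensions (an assumption for illustration): (royalty, gender)
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}


def add(a, b):
    # Element-wise vector addition
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    # Element-wise vector subtraction
    return tuple(x - y for x, y in zip(a, b))


result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(result)
# (1.0, -1.0)
print(result == vectors["queen"])
# True
```

Subtracting man removes the “male” component from king while keeping the “royalty” component, and adding woman puts the “female” component in its place.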

All the code we’ll write here can be found in my “Lab” repository on GitHub. Feel free to code along while reading through this tutorial.

Basic Setup

Before jumping right into the code we need to make sure that all Python packages we’ll be using are installed on our machine.

We install Seaborn, a visualization tool which helps us to plot nice-looking charts and diagrams. We don’t really work with Seaborn directly but rather use its styles in conjunction with Matplotlib to make our plots look a little bit more “modern”.

!pip install seaborn

Next up we need to import the modules we’ll use throughout this tutorial (the last few lines configure Matplotlib to use Seaborn styles).

import json
from pathlib import Path

import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import HTML, display

# prettier Matplotlib plots
import matplotlib.pyplot as plt
import matplotlib.style as style
# on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'
style.use('seaborn')

Since we’re dealing with different datasets we should create a separate directory to store them in.

!mkdir -p data
data_dir = Path('data')

Comparing Countries

Let’s start with our first data analysis task. Our goal is to compare and contrast different countries based on their surface area and population. The main idea being that we want to analyze which countries are quite similar and which are rather different based on those two metrics.

The dataset we’ll use is part of the country-json project by @samayo. Make sure to take some time to browse through the different JSON files to get an idea about the structure of the data.

In our example we’re only interested in the country-by-surface-area.json and country-by-population.json files. Let’s go ahead and download the files to our data directory.

After that we can define 2 variables which will point to the files on our file system.

SURFACE_AREA_FILE_NAME = 'country-by-surface-area.json'
POPULATION_FILE_NAME = 'country-by-population.json'

!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-surface-area.json -O data/country-by-surface-area.json
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-population.json -O data/country-by-population.json

surface_area_file_path = str(data_dir / SURFACE_AREA_FILE_NAME)
population_file_path = str(data_dir / POPULATION_FILE_NAME)

During our data analysis we’ll utilize Pandas, a great Python library which makes it dead simple to inspect and manipulate data.

Since our data is in JSON format we can use the Pandas read_json function to load the data into a so-called DataFrame (think of it as an Excel spreadsheet on steroids).

The dropna function makes sure that we remove all entries which are undefined and therefore useless for further inspection.

df_surface_area = pd.read_json(surface_area_file_path)
df_population = pd.read_json(population_file_path)

df_population.dropna(inplace=True)
df_surface_area.dropna(inplace=True)

You might’ve noticed that dealing with 2 separate files will get quite hairy if we want to compare countries based on their 2 metrics.

Since both files contain the same countries with the same names and only differ in terms of their area and population data we can use merge to create a new DataFrame containing all countries with their respective area and population numbers.

Another tweak we perform here is to set the index to the country name. This way we can easily query for country data based on the country names rather than having to deal with non-expressive integer values.

df = pd.merge(df_surface_area, df_population, on='country')
df.set_index('country', inplace=True)
df.head()

len(df)

227

As you can see we have a total of 227 countries in our DataFrame. That’s far too many for our needs, especially since we’re about to plot the data in the next step.

Let’s reduce our result set by performing some range-queries with the area and population data.

df = df[
    (df['area'] > 100000) & (df['area'] < 600000) &
    (df['population'] > 35000000) & (df['population'] < 100000000)
]
len(df)

12

Great! 12 countries are way easier to analyze once plotted.

Speaking of which, let’s do a 2D scatterplot of our 12 countries. We decide to plot the area on the X axis and the population on the Y axis.

fig, ax = plt.subplots()
df.plot(x='area', y='population', figsize=(10, 10), kind='scatter', ax=ax)

for k, v in df.iterrows():
    ax.annotate(k, v)

fig.canvas.draw()

Looking at the plotted data we can immediately see some relationships. It appears that Vietnam has a high population compared to its area. Kenya on the other hand has a large surface area but a smaller population compared to its size.

Plotting the data like this helps us to reason about it in a visual way. In addition to that we can also easily validate the integrity of our data.

While we as humans can immediately tell the relationships in our country data just by looking at our plot it’s necessary to translate our visual reasoning into raw numbers so our computer can understand them too.

Looking at the plot again it seems like the distance between the data points of the countries is a good measure to determine how “similar” or “different” the countries are.

There are several algorithms to calculate the distance between two (or more) coordinates. The Euclidean distance is a very common formula to do just that. Here’s the Math notation:

[d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} ]

While the formula might look intimidating at first it’s rather simple to turn it into code.

def euclidean_distance(x, y):
    # x holds the first coordinate of both points, y the second coordinate of both points
    x1, x2 = x
    y1, y2 = y
    result = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # we'll cast the result into an int which makes it easier to compare
    return int(round(result, 0))
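The function above is tailored to 2 dimensions. As a minimal sketch (not part of the original notebook), the same formula can be written for points with any number of dimensions. The name euclidean_distance_nd is made up for illustration, and math.dist from the standard library (Python 3.8+) serves as a cross-check.

```python
import math

def euclidean_distance_nd(p, q):
    # p and q are equal-length sequences, each holding the coordinates of one point
    if len(p) != len(q):
        raise ValueError('both points need the same number of dimensions')
    # sum of the squared differences per dimension, then the square root
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# a 3-4-5 triangle: the distance between (0, 0) and (3, 4) is 5
print(euclidean_distance_nd((0, 0), (3, 4)))
# the standard library computes the same value
print(math.dist((0, 0), (3, 4)))
```

Note that each argument here is one complete point, whereas euclidean_distance above receives both area values in x and both population values in y.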

According to our plot it seems like Thailand and Uganda are 2 countries which are very different. Computing the Euclidean distance between both validates our hunch.

# Uganda <--> Thailand
uganda = df.loc['Uganda']
thailand = df.loc['Thailand']

x = (uganda['area'], thailand['area'])
y = (uganda['population'], thailand['population'])

euclidean_distance(x, y)

26175969

If we compare this result to the Euclidean distance between Iraq and Morocco we can see that those countries seem to be more “similar”.

# Iraq <--> Morocco
iraq = df.loc['Iraq']
morocco = df.loc['Morocco']

x = (iraq['area'], morocco['area'])
y = (iraq['population'], morocco['population'])

euclidean_distance(x, y)

2535051

While this exercise was quite simple and intuitive if one is fluent in geography it also introduced us to the basic concepts of Embeddings. With Embeddings we map data (e.g. words or raw numbers) into multi-dimensional spaces and use Math to manipulate and calculate relationships between that data.

This might sound rather abstract and I agree that the relationship between our Country data analysis and Embeddings is still a little bit fuzzy.

Trust me, the upcoming example will definitely result in an “Aha Moment” and suddenly what we’ve learned so far will click!

Color Math

Now that we’ve seen some of the underlying principles of Embeddings let’s take another look at a slightly more complicated example. This time we’ll work with different colors and their representation as a combination of Red, Green and Blue values (also known as RGB).

Before we jump right into our analysis we’ll define a helper function which lets us render the color according to its RGB representation.

The following code defines a function which takes the integer values of Red, Green and Blue (values in the range of 0 - 255) and renders an HTML element with the given color as its background.

def render_color(r, g, b):
    display(HTML('''
      <div style="background-color: rgba(%d, %d, %d, 1); height: 100px;"></div>
    ''' % (r, g, b)),
    metadata=dict(isolated=True))

The color black is represented as 0 Red, 0 Green and 0 Blue. Let’s validate that our render_color function works as expected.

render_color(0, 0, 0)

Great. It works!

Next up it’s time to download the dataset we’ll be using for our color analysis. We’ve decided to use the 256 Colors dataset by @jonasjacek. It lists the 256 colors used by xterm, a widely used terminal emulator. Make sure to take a couple of minutes to familiarize yourself with the data and its structure.

Downloading the dataset follows the same instruction we’ve used in the beginning of this tutorial where we downloaded the Country data.

COLORS_256_FILE_NAME = 'colors-256.json'

!wget -nc https://jonasjacek.github.io/colors/data.json -O data/colors-256.json

colors_256_file_path = str(data_dir / COLORS_256_FILE_NAME)

Now that we have access to the data in our programming environment it’s time to inspect the structure and think about ways to further process it.

# a context manager makes sure that the file is closed again after reading
with open(colors_256_file_path, 'r') as f:
    color_data = json.load(f)
color_data[:5]

[{'colorId': 0,
  'hexString': '#000000',
  'rgb': {'r': 0, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 0, 'l': 0},
  'name': 'Black'},
 {'colorId': 1,
  'hexString': '#800000',
  'rgb': {'r': 128, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 100, 'l': 25},
  'name': 'Maroon'},
 {'colorId': 2,
  'hexString': '#008000',
  'rgb': {'r': 0, 'g': 128, 'b': 0},
  'hsl': {'h': 120, 's': 100, 'l': 25},
  'name': 'Green'},
 {'colorId': 3,
  'hexString': '#808000',
  'rgb': {'r': 128, 'g': 128, 'b': 0},
  'hsl': {'h': 60, 's': 100, 'l': 25},
  'name': 'Olive'},
 {'colorId': 4,
  'hexString': '#000080',
  'rgb': {'r': 0, 'g': 0, 'b': 128},
  'hsl': {'h': 240, 's': 100, 'l': 25},
  'name': 'Navy'}]

As we can see there are 3 different color representations available in this dataset. There’s a Hexadecimal, an HSL (Hue, Saturation, Lightness) and an RGB (Red, Green, Blue) representation. Furthermore we have access to the name of the color via the name attribute.

In our analysis we’re only interested in the name and the RGB value of every color. Given that we can create a simple dict whose keys are the lowercased color names and whose values are tuples containing the Red, Green and Blue values respectively.

colors = dict()

for color in color_data:
    name = color['name'].lower()
    r = color['rgb']['r']
    g = color['rgb']['g']
    b = color['rgb']['b']
    rgb = tuple([r, g, b])
    colors[name] = rgb

To validate that our data structure works the way we described above we can print out some sample colors with their RGB values.

print('Black: %s' % (colors['black'],))
print('White: %s' % (colors['white'],))

print()

print('Red: %s' % (colors['red'],))
print('Lime: %s' % (colors['lime'],))
print('Blue: %s' % (colors['blue'],))

Black: (0, 0, 0)
White: (255, 255, 255)

Red: (255, 0, 0)
Lime: (0, 255, 0)
Blue: (0, 0, 255)

While our dict is a good starting point it’s often easier and sometimes faster to do computations on the data if it’s stored in a Pandas DataFrame. The from_dict function helps us to turn a simple Python dictionary into a DataFrame.

df = pd.DataFrame.from_dict(colors, orient='index', columns=['r', 'g', 'b'])
df.head()

Seeing the data formatted in this way we can think of its representation as a mapping of the Red, Green and Blue values into a 3-Dimensional space where for example Red is the X axis, Green is the Y axis and Blue is the Z axis.

You might recall that we’ve used Euclidean distance in our Country example above to determine how “similar” countries are. The main idea was that similar countries have less distance between their data points compared to dissimilar countries whose data points are farther apart.

Another very useful formula to calculate the similarity of data points is the so-called Cosine similarity. The Cosine similarity is the cosine of the angle between two vectors in a multi-dimensional space. The smaller the angle, the closer the value is to 1 and the more similar the underlying data.

Translating this to our color example we can think of every color being represented as a vector with 3 values (Red, Green and Blue) which (as stated above) can be mapped to the X, Y and Z axis in a 3D coordinate system. Using the Cosine similarity we can take one of such vectors and calculate the distance between it and the rest of the vectors to determine how similar or dissimilar they are. And that’s exactly what we’ll be doing here.

The Math notation for the Cosine similarity looks like this:

[similarity = \cos(\Theta) = \frac{A \cdot B}{\left\lVert A\right\rVert \left\lVert B\right\rVert} ]

We take the dot-product of the two vectors A and B and divide it by the product of their magnitudes.

The following code-snippet implements this formula. Again, it might look intimidating and rather complicated but if you take some time to read through it you’ll see that it’s not that hard to understand.

In fact our implementation here does more than just calculate the Cosine similarity. In addition to that we copy our DataFrame containing the colors and add another column named distance to it which holds the Cosine similarity: a value between 0 and 1 for ordinary RGB colors, where a larger value means “more similar”. Once done we sort our copied DataFrame by this column in descending order so that the most similar colors come first. We do this to see the computed values when querying for similar colors later on.

def similar(df, coord, n=10):
    # turning our RGB values (3D coordinates) into a numpy array
    v1 = np.array(coord, dtype=np.float64)

    df_copy = df.copy()

    # looping through our DataFrame to calculate the distance for every color
    for i in df_copy.index:
        item = df_copy.loc[i]
        v2 = np.array([item.r, item.g, item.b], dtype=np.float64)
        # cosine similarity calculation starts here
        theta_sum = np.dot(v1, v2)
        theta_den = np.linalg.norm(v1) * np.linalg.norm(v2)
        # check if we're trying to divide by 0
        if theta_den == 0:
            theta = None
        else:
            theta = theta_sum / theta_den
        # adding the `distance` column with the result of our computation
        df_copy.at[i, 'distance'] = theta
    # sorting the resulting DataFrame by distance
    df_copy.sort_values(by='distance', axis=0, ascending=False, inplace=True)
    return df_copy.head(n)
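The loop in similar computes one similarity at a time. As a sketch, assuming the colors are available as a plain NumPy matrix with one RGB row per color (the helper name cosine_similarities and the four sample colors are made up for illustration), the same formula can be applied to all rows at once:

```python
import numpy as np

def cosine_similarities(matrix, vector):
    # matrix: one row per color, vector: the color we compare against
    matrix = np.asarray(matrix, dtype=np.float64)
    vector = np.asarray(vector, dtype=np.float64)
    # dot product of every row with the query vector
    dots = matrix @ vector
    # product of the magnitudes, one entry per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    # rows with a magnitude of 0 (e.g. black) have no direction, they stay NaN
    result = np.full(len(matrix), np.nan)
    np.divide(dots, norms, out=result, where=norms != 0)
    return result

sample = [
    (255, 0, 0),  # red
    (128, 0, 0),  # maroon
    (0, 0, 255),  # blue
    (0, 0, 0),    # black
]
print(cosine_similarities(sample, (255, 0, 0)))
```

Red and maroon point in the same direction, so both reach a similarity of 1 even though maroon is darker. The Cosine similarity only looks at the direction of a vector, not at its length.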

To validate that our similar function works we can use it to find similar colors to red.

similar(df, colors['red'])

We can also pass in colors as a list of RGB values.

similar(df, [100, 20, 120])

Since it’s hard to imagine what color 100, 20 and 120 represent it’s worthwhile to use our render_color function to see it.

render_color(100, 20, 120)

Looking at the list of most similar colors from above it appears that darkviolet is quite similar to 100, 20, 120. Let’s see what this color looks like.

darkviolet = df.loc['darkviolet']
render_color(darkviolet.r, darkviolet.g, darkviolet.b)

And we can validate that darkviolet in fact looks quite similar to 100, 20, 120!

But it doesn’t end here. Our 3 color values are numbers in the range of 0 - 255. Given that, it should be possible to do some basic Math computations such as addition or subtraction on them.

Since we only have access to 256 different colors it’s highly unlikely that our resulting color values for Red, Green and Blue will exactly match one of our 256 colors. That’s where our similar function comes in handy! The similar function should make it possible to calculate a new color and find its most similar representation in our 256 color dataset.

We can look at a Color Wheel to see that subtracting a red color from a purple one should result in a Blueish color. Let’s do the Math and check whether that’s true.

blueish = df.loc['purple'] - df.loc['red']

similar(df, blueish)

And sure enough the most similar colors in our dataset are Blueish ones. We can validate that by rendering darkblue, one of the best matches.

darkblue = df.loc['darkblue']
render_color(darkblue.r, darkblue.g, darkblue.b)

Here’s a simple one. If we have Black and add some White to the mix we should get something Greyish, correct?

greyish = df.loc['black'] + df.loc['white']

similar(df, greyish)

And sure enough we do. Rendering grey93 shows a light grey color.

grey93 = df.loc['grey93']
render_color(grey93.r, grey93.g, grey93.b)

Let’s end our color exploration with a more complex formula. So far we’ve only done some very simple Math like subtracting and adding colors. But there’s more we can do. We can also express our search for a color as a “solve for x” problem.

Mixing Yellow and Red will result in Orange. We can translate this behavior to other colors as well. Here we ask “Yellow is to Red as X is to Blue” and express it in Math notation to get the result for X.

# yellow is to red as X is to blue
yellow_to_red = df.loc['yellow'] - df.loc['red']
X = yellow_to_red + df.loc['blue']

similar(df, X)

Our calculation shows us that lightseagreen is to Blue as Yellow is to Red. Intuitively that makes sense if you think about it.

lightseagreen = df.loc['lightseagreen']
render_color(lightseagreen.r, lightseagreen.g, lightseagreen.b)

Word2Vec

In the beginning of this tutorial I promised that once done we should understand the intuition behind Word2Vec, a key component for modern Natural Language Processing models.

The Word2Vec model does to words what we did with our colors represented as RGB values. It maps words into a multi-dimensional space (our colors were mapped into a 3D space). Once such words are mapped into that space we can perform Math calculations on their vectors the same way we e.g. calculated the similarity between our color vectors.

Having a mapping of words into such a vector space makes it possible to do calculations resulting in:

[king - man + woman = queen ]
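To make this a bit more tangible, here's a toy sketch. The 2-dimensional vectors below are invented by hand for illustration (one axis loosely stands for "royalty", the other for "gender"). A trained Word2Vec model learns vectors with many more dimensions from text and its values will look nothing like these.

```python
import numpy as np

# hand-made toy vectors (royalty, gender), NOT the output of a trained model
words = {
    'king': np.array([0.9, 0.9]),
    'queen': np.array([0.9, -0.9]),
    'man': np.array([0.1, 0.9]),
    'woman': np.array([0.1, -0.9]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(vector):
    # pick the word whose vector points in the most similar direction
    return max(words, key=lambda word: cosine_similarity(vector, words[word]))

x = words['king'] - words['man'] + words['woman']
print(most_similar(x))
```

This is the same "solve for x" idea we used for our colors, only with words as labels for the vectors.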

Conclusion

In this tutorial we took a deep dive into the main building blocks and intuitions behind Embeddings, a powerful concept which is heavily utilized in modern Natural Language Processing models.

The main idea is to map data into a multi-dimensional space so that Math calculations from the realm of Linear Algebra can be performed on it.

We started our journey with a simple example in which we mapped the surface area and population of different countries into a 2D vector space. We then used the Euclidean distance to verify that certain countries are similar while others are dissimilar based on their metrics.

Another, more advanced example mapped colors and their RGB representation into a 3D vector space. We then used Cosine similarity and some basic Math to add and subtract colors.

With this knowledge we’re now able to understand how more advanced models such as Word2Vec or Doc2Vec make it possible to do calculations on words and texts.

The Lab

You can find more code examples, experiments and tutorials in my GitHub Lab repository.

Additional Resources

Eager to learn more? Here’s a list with all the resources I’ve used to write this post.

Minimax and Monte Carlo Tree Search

Philipp Muens — Tue, 02 Apr 2019 16:21:00 +0000

Do you remember your childhood days when you discovered the infamous game Tic-Tac-Toe and played it with your friends over and over again?

You might’ve wondered if there’s a certain strategy you can exploit that lets you win all the time (or at least force a draw). Is there such an algorithm that will show you how you can defeat your opponent at any given time?

It turns out there is. To be precise there are a couple of algorithms which can be utilized to predict the best possible moves in games such as Tic-Tac-Toe, Connect Four, Chess and Go among others. One such family of algorithms leverages tree search and operates on game state trees.

In this blog post we’ll discuss 2 famous tree search algorithms called Minimax and Monte Carlo Tree Search (abbreviated to MCTS). We’ll start our journey into tree search algorithms by discovering the intuition behind their inner workings. After that we’ll see how Minimax and MCTS can be used in modern game implementations to build sophisticated Game AIs. We’ll also shed some light on the computational challenges we’ll face and how to handle them via performance optimization techniques.

The Intuition behind tree search

Let’s imagine that you’re playing some games of Tic-Tac-Toe with your friends. While playing you’re wondering what the optimal strategy might be. What’s the best move you should pick in any given situation?

Generally speaking there are 2 modes you can operate in when determining the next move you want to play:

Aggressive:

Play a move which will cause an immediate win (if possible)
Play a move which sets up a future winning situation

Defensive:

Play a move which prevents your opponent from winning in the next round (if possible)
Play a move which prevents your opponent from setting up a future winning situation in the next round

These modes and their respective actions are basically the only strategies you need to follow to win the game of Tic-Tac-Toe.

The “only” thing you need to do is to look at the current game state you’re in and play simulations through all the potential next moves which could be played. You do this by pretending that you’ve played a given move and then continue playing the game until the end, alternating between the X and O player. While doing that you’re building up a game tree of all the possible moves you and your opponent would play.

The following illustration shows a simplified version of such a game tree:

Note that for the rest of this post we’ll only use simplified game tree examples to save screen space.
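Playing through all potential moves can be sketched in a few lines of code. The following is a minimal example which assumes a hypothetical board representation (a tuple of 9 cells holding 'X', 'O' or None); the helper names winner and count_games are made up for illustration. It walks the whole game tree below a given state and counts how many different games can still be played.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
         (0, 4, 8), (2, 4, 6)]             # diagonals

def winner(board):
    # returns 'X' or 'O' if a player owns a complete line, otherwise None
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_games(board, player):
    # a game ends when somebody has won or the board is full
    if winner(board) is not None or None not in board:
        return 1
    total = 0
    for i in range(9):
        if board[i] is None:
            # pretend that `player` has played cell i and continue with the opponent
            child = board[:i] + (player,) + board[i + 1:]
            total += count_games(child, 'O' if player == 'X' else 'X')
    return total

# X O X
# O X O
# . . .   (X to move)
board = ('X', 'O', 'X', 'O', 'X', 'O', None, None, None)
print(count_games(board, 'X'))  # 4 different ways to finish this game
```

Starting from the empty board with count_games((None,) * 9, 'X') the same function should arrive at the commonly cited 255,168 possible games, which illustrates how quickly such trees grow even for a tiny game.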

Of course, the set of strategic rules we’ve discussed at the top is specifically tailored to the game of Tic-Tac-Toe. However we can generalize this approach to make it work with other board games such as Chess or Go. Let’s take a look at Minimax, a tree search algorithm which abstracts our Tic-Tac-Toe strategy so that we can apply it to various other 2 player board games.

The Minimax Algorithm

Given that we’ve built up an intuition for tree search algorithms let’s switch our focus from simple games such as Tic-Tac-Toe to more complex games such as Chess.

Before we dive in let’s briefly recap the properties of a Chess game. Chess is a 2 player deterministic game of perfect information. Sound confusing? Let’s unpack it:

In Chess, 2 players (Black and White) play against each other. Every move which is performed is ensured to be “fulfilled” with no randomness involved (the game doesn’t use any random elements such as a die). During gameplay every player can observe the whole game state. There’s no hidden information, hence everyone has perfect information about the whole game at any given time.

Thanks to those properties we can always compute which player is currently ahead and which one is behind. There are several different ways to do this for the game of Chess. One approach to evaluate the current game state is to add up all the remaining white pieces on the board and subtract all the remaining black ones. Doing this will produce a single value where a large value favors white and a small value favors black. This type of function is called an evaluation function.

Based on this evaluation function we can now define the overall goal during the game for each player individually. White tries to maximize this objective while black tries to minimize it.
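Here's a minimal sketch of such a piece-counting evaluation function. It assumes a hypothetical board representation: a string with one character per square, uppercase letters for white pieces, lowercase letters for black pieces and a dot for empty squares.

```python
def evaluation_function(board):
    # every remaining white piece counts +1, every remaining black piece counts -1
    white = sum(1 for square in board if square.isupper())
    black = sum(1 for square in board if square.islower())
    return white - black

# made-up position: white has a king, a rook and a pawn, black has a king and a pawn
board = 'K..R..P.' + '........' * 6 + 'k...p...'
print(evaluation_function(board))  # 3 white pieces - 2 black pieces = 1
```

A positive result favors white, a negative one favors black. Real Chess programs usually weight the pieces differently (a queen is worth more than a pawn), but the simple count is enough to follow the rest of this post.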

Let’s pretend that we’re deep in an ongoing Chess game. We’re player white and have already played a couple of clever moves, resulting in a large number computed by our evaluation function. It’s our turn right now but we’re stuck. Which of the possible moves is the best one we can play?

We’ll solve this problem with the same approach we already encountered in our Tic-Tac-Toe gameplay example. We build up a tree of potential moves which could be performed based on the game state we’re in. To keep things simple we pretend that there are only 2 possible moves we can play (in Chess there are on average ~30 different options for every given game state). We start with a (white) root node which represents the current state. Starting from there we’re branching out 2 (black) child nodes which represent the game state we’re in after taking one of the 2 possible moves. From these 2 child nodes we’re again branching out 2 separate (white) child nodes. Each one of those represents the game state we’re in after taking one of the 2 possible moves we could play from the black node. This branching out of nodes goes on and on until we’ve reached the end of the game or hit a predefined maximum tree depth.

The resulting tree looks something like this:

Given that we’re at the end of the tree we can now compute the game outcome for each end state with our evaluation function:

With this information we now know the game outcome we can expect when we take all the outlined moves starting from the root node and ending at the last node where we calculated the game evaluation. Since we’re player white it seems like the best move to pick is the one which will set us up to eventually end in the game state with the highest outcome our evaluation function calculated.

While this is true there’s one problem. There’s still the black player involved and we cannot directly manipulate what move she’ll pick. If we cannot manipulate this why don’t we estimate what the black player will likely do based on our evaluation function? As a white player we always try to maximize our outcome. The black player always tries to minimize the outcome. With this knowledge we can now traverse back through our game tree and compute the values for all our individual tree nodes step by step.

White tries to maximize the outcome:

While black wants to minimize it:

Once done we can now pick the next move based on the evaluation values we’ve just computed. In our case we pick the next possible move which maximizes our outcome:

What we’ve just learned is the general procedure of the so-called Minimax algorithm. The Minimax algorithm got its name from the fact that one player wants to Mini-mize the outcome while the other tries to Max-imize it.

Code

import math

def minimax(state, max_depth, is_player_minimizer):
  if max_depth == 0 or state.is_end_state():
    # We're at the end. Time to evaluate the state we're in
    return evaluation_function(state)

  # Is the current player the minimizer?
  if is_player_minimizer:
    # Start at +infinity so that every evaluation is smaller
    value = math.inf
    for move in state.possible_moves():
      evaluation = minimax(move, max_depth - 1, False)
      value = min(value, evaluation)
    return value

  # Or the maximizer?
  # Start at -infinity so that every evaluation is larger
  value = -math.inf
  for move in state.possible_moves():
    evaluation = minimax(move, max_depth - 1, True)
    value = max(value, evaluation)
  return value
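To try the algorithm on a concrete tree, here's a self-contained sketch. The Node class and the leaf values are made up for illustration. The tree has the same shape as in our walkthrough (2 possible moves per state, white to move at the root) and the function is repeated in a compact form so that the snippet runs on its own.

```python
class Node:
    # hypothetical game state: either a leaf with a value or an inner node with children
    def __init__(self, value=None, children=()):
        self.value = value
        self.children = list(children)

    def is_end_state(self):
        return not self.children

    def possible_moves(self):
        return self.children

def evaluation_function(state):
    # only the leaves of this toy tree carry a value
    return state.value

def minimax(state, max_depth, is_player_minimizer):
    if max_depth == 0 or state.is_end_state():
        return evaluation_function(state)
    # the minimizer (black) keeps the smallest value, the maximizer (white) the largest
    pick = min if is_player_minimizer else max
    return pick(
        minimax(move, max_depth - 1, not is_player_minimizer)
        for move in state.possible_moves()
    )

# white (root) -> black -> white -> leaves
tree = Node(children=[
    Node(children=[
        Node(children=[Node(3), Node(5)]),
        Node(children=[Node(2), Node(9)]),
    ]),
    Node(children=[
        Node(children=[Node(0), Node(1)]),
        Node(children=[Node(7), Node(4)]),
    ]),
])

# white is the maximizer, hence is_player_minimizer=False at the root
print(minimax(tree, 3, False))  # 5
```

Walking back up by hand gives the same result: the white nodes above the leaves pick 5, 9, 1 and 7, black reduces those to 5 and 1, and white finally picks 5 at the root.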

Search space reduction with pruning

Minimax is a simple and elegant tree search algorithm. Given enough compute resources it will always find the optimal next move to play.

But there’s a problem. While this algorithm works flawlessly with simplistic games such as Tic-Tac-Toe, it’s computationally infeasible to implement it for strategically more involved games such as Chess. The reason for this is the so-called tree branching factor. We’ve already briefly touched on that concept before but let’s take a second look at it.

In our example above we've artificially restricted the potential moves one can play to 2 to keep the tree representation simple and easy to reason about. However the reality is that there are usually more than 2 possible next moves. On average there are ~30 moves a Chess player can play in any given game state. This means that every single node in the tree will have approximately 30 different children. This is called the width of the tree. We denote the tree's width as (w).

But there's more. It takes roughly 85 consecutive turns to finish a game of Chess. Translating this to our tree means that it will have an average depth of 85. We denote the tree's depth as (d).

Given (w) and (d) we can define the formula (w^d) which will show us how many different positions we have to evaluate on average.

Plugging in the numbers for Chess we get (30^{85}). Taking the Go board game as an example which has a width (w) of ~250 and an average depth (d) of ~150 we get (250^{150}). I encourage you to type those numbers into your calculator and hit enter. Needless to say that current generation computers and even large scale distributed systems will take "forever" to crunch through all those computations.
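If you don't have a calculator at hand, Python works too. Its integers have arbitrary precision, so a small sketch like the following can count the digits of both numbers (the variable names are made up for illustration):

```python
positions_chess = 30 ** 85
positions_go = 250 ** 150

# the numbers are too long to read comfortably, so we count their digits instead
print(len(str(positions_chess)))  # 126 digits
print(len(str(positions_go)))     # 360 digits
```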

Does this mean that Minimax can only be used for games such as Tic-Tac-Toe? Absolutely not. We can apply some clever tricks to optimize the structure of our search tree.

Generally speaking we can reduce the search tree's width and depth by pruning individual nodes and branches from it. Let's see how this works in practice.

Alpha-Beta Pruning

Recall that Minimax is built around the premise that one player tries to maximize the outcome of the game based on the evaluation function while the other one tries to minimize it.

This gameplay behavior is directly translated into our search tree. During traversal from the bottom to the root node we always picked the respective “best” move for any given player. In our case the white player always picked the maximum value while the black player picked the minimum value:

Looking at our tree above we can exploit this behavior to optimize it. Here’s how:

While walking through the potential moves we can play given the current game state we’re in we should build our tree in a depth-first fashion. This means that we should start at one node and expand it by playing the game all the way to the end before we back up and pick the next node we want to explore:

Following this procedure allows us to identify moves which will never be played early on. After all, one player maximizes the outcome while the other minimizes it. The part of the search tree where a player would end up in a worse situation based on the evaluation function can be entirely removed from the list of nodes we want to expand and explore. We prune those nodes from our search tree and therefore reduce its width.

The larger the branching factor of the tree, the higher the amount of computations we can potentially save!

Assuming we can reduce the width by an average of 10 we would end up with (w^d = (30 - 10)^{85} = 20^{85}) computations we have to perform. That's already a huge win.

This technique of pruning parts of the search tree which will never be considered during gameplay is called Alpha-Beta pruning. Alpha-Beta pruning got its name from the parameters (\alpha) and (\beta) which are used to keep track of the best score either player can achieve while walking the tree.

Code

import math

# Initial call: minimax(state, max_depth, False, -math.inf, math.inf)
def minimax(state, max_depth, is_player_minimizer, alpha, beta):
  if max_depth == 0 or state.is_end_state():
    return evaluation_function(state)

  if is_player_minimizer:
    # The minimizer starts at +infinity and keeps the smallest evaluation
    value = math.inf
    for move in state.possible_moves():
      evaluation = minimax(move, max_depth - 1, False, alpha, beta)
      value = min(value, evaluation)
      # Keeping track of our current best score
      beta = min(beta, evaluation)
      if beta <= alpha:
        break
    return value

  # The maximizer starts at -infinity and keeps the largest evaluation
  value = -math.inf
  for move in state.possible_moves():
    evaluation = minimax(move, max_depth - 1, True, alpha, beta)
    value = max(value, evaluation)
    # Keeping track of our current best score
    alpha = max(alpha, evaluation)
    if beta <= alpha:
      break
  return value

Using Alpha-Beta pruning to reduce the tree's width helps us utilize the Minimax algorithm in games with large branching factors which were previously considered computationally too expensive.

In fact Deep Blue, the Chess computer developed by IBM which defeated the Chess world champion Garry Kasparov in 1997, heavily utilized parallelized Alpha-Beta based search algorithms.

Monte Carlo Tree Search

It seems like Minimax combined with Alpha-Beta pruning is enough to build sophisticated game AIs. But there’s one major problem which can render such techniques useless. It’s the problem of defining a robust and reasonable evaluation function. Recall that in Chess our evaluation function added up all the white pieces on the board and subtracted all the black ones. This resulted in high values when white had an edge and in low values when the situation was favorable for black. While this function is a good baseline and is definitely worthwhile to experiment with there are usually more complexities and subtleties one needs to incorporate to come up with a sound evaluation function.

Simple evaluation metrics are easy to fool and exploit once the underlying internals are surfaced. This is especially true for more complex games such as Go. Engineering an evaluation function which is complex enough to capture the majority of the necessary game information requires a lot of thought and interdisciplinary domain expertise in Software Engineering, Math, Psychology and the game at hand.

Isn’t there a universally applicable evaluation function we could leverage for all games, no matter how simple or complex they are?

Yes, there is! And it’s called randomness. With randomness we let chance be our guide to figure out which next move might be the best one to pick.

In the following we’ll explore the so-called Monte Carlo Tree Search (MCTS) algorithm which heavily relies on randomness (the name “Monte Carlo” stems from the casino district of Monte Carlo in Monaco) as a core component for value approximations.

As the name implies, MCTS also builds up a game tree and does computations on it to find the path of the highest potential outcome. But there’s a slight difference in how this tree is constructed.

Let’s once again pretend that we’re playing Chess as player white. We’ve already played for a couple of rounds and it’s on us again to pick the next move we’d like to play. Additionally let’s pretend that we’re not aware of any evaluation function we could leverage to compute the value of each possible move. Is there any way we could still figure out which move might put us into a position where we could win at the end?

As it turns out there’s a really simple approach we can take to figure this out. Why don’t we let both players play dozens of random games starting from the state we’re currently in? While this might sound counterintuitive it makes sense if you think about it. If both players start in the given game state, play thousands of random games and player white wins 80% of the time, then there must be something about the state which gives white an advantage. What we’re doing here is basically exploiting the Law of large numbers (LLN) to approximate the “true” game outcome for every potential move we can play.
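To make the idea of random games a bit more tangible, here's a minimal sketch in Python. Chess is far too large for a short example, so the sketch assumes a toy Nim variant instead (players alternate removing 1 to 3 stones and whoever takes the last stone wins). The function names `random_playout` and `estimate_win_rate` are made up for this illustration.

```python
import random


def random_playout(stones, rng):
    """Play one random game of the toy Nim variant.

    Returns True if the player who moves first wins. The rules (take 1 to 3
    stones, last stone wins) are an assumption made for this sketch.
    """
    first_players_turn = True
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            # Whoever just moved took the last stone and wins.
            return first_players_turn
        first_players_turn = not first_players_turn


def estimate_win_rate(stones, games, seed=0):
    """Estimate the first player's win rate by averaging many random games."""
    rng = random.Random(seed)
    wins = sum(random_playout(stones, rng) for _ in range(games))
    return wins / games
```

The more games we average over, the closer the estimate gets to the real win probability under random play. That's the Law of large numbers at work.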

The following description will outline how the MCTS algorithm works in detail. For the sake of simplicity we again focus solely on 2 playable moves in any given state (as we’ve already discovered there are on average ~30 different moves we can play in Chess).

Before we move on we need to get some minor definitions out of the way. In MCTS we keep track of 2 different parameters for every single node in our tree. We call those parameters (t) and (n). (t) stands for "total" and represents the total value of that node. (n) is the "number of visits" which reflects the number of times we've visited this node while walking through the tree. When creating a new node we always initialize both parameters with the value 0.

In addition to the 2 new parameters we store for each node, there's the so-called "Upper Confidence Bound 1" (UCB1) formula which, when applied to trees, is usually abbreviated as UCT. It looks like this:

[x_i + C\sqrt{\frac{\ln(N)}{n_i}} ]

This formula basically helps us in deciding which upcoming node and therefore potential game move we should pick to start our random game series (called "rollout") from. In the formula (x_i) represents the average value of the game state we're working with, (C) is a constant called "temperature" which we need to define manually (we just set it to 1.5 in our example here, more on that later), (N) represents the parent node's visits and (n_i) represents the current node's visits. When using this formula on candidate nodes to decide which one to explore further, we're always interested in the largest result.

Don't be intimidated by the Math and just note that this formula exists and will be useful for us while working with our tree. We'll get into more details about the usage of it while walking through our tree.

With this out of the way it's time to apply MCTS to find the best move we can play.
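If code is easier to digest than a formula, the following Python sketch shows one way to write it down. The function name `uct_score` and its parameter names are assumptions for this illustration, and treating unvisited nodes as infinitely promising is a common convention rather than part of the formula itself.

```python
import math


def uct_score(total, visits, parent_visits, c=1.5):
    """Compute x_i + C * sqrt(ln(N) / n_i) for a single node.

    total is the node's (t), visits its (n) and parent_visits the (N) of
    its parent. Unvisited nodes get infinity so they're always tried first.
    """
    if visits == 0:
        return math.inf
    average = total / visits  # this is x_i
    return average + c * math.sqrt(math.log(parent_visits) / visits)
```

As a quick example, a node with a total of 30 after 1 visit whose parent has been visited twice gets `uct_score(30, 1, 2)`, which is roughly 31.25.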

We start with the same root node of the tree we're already familiar with. This root node is our start point and reflects the current game state. Based on this node we branch off our 2 child nodes:

The first thing we need to do is to use the UCT formula from above and compute the results for both child nodes. As it turns out we need to plug in 0 for almost every single variable in our UCT formula since we haven't done anything with our tree and its nodes yet. This will result in (\infty) for both calculations.

[S_1 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty ]

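We've replaced the 0 in the denominator with a very small number because division by zero is not defined. Strictly speaking (\ln(0)) is not defined either, which is why unvisited nodes are usually just assigned a score of (\infty) directly.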

Given this we're free to choose which node we want to explore further. We go ahead with the leftmost node and perform our rollout phase which means that we play dozens of random games starting with this game state.

Once done we get a result for this specific rollout (in our case the percentage of wins for player white). The next thing we need to do is to propagate this result up the tree until we reach the root node. While doing this we update both (t) and (n) with the respective values for every node we encounter. Once done our tree looks like this:

Next up we start at our root node again. Once again we use the UCT formula, plug in our numbers and compute its score for both nodes:

[S_1 = 30 + 1.5\sqrt{\frac{\ln(1)}{1}} = 30 ]

[S_2 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty ]

Given that we always pick the node with the highest value we'll now explore the rightmost one. Once again we perform our rollout based on the move this node proposes and collect the end result after we've finished all our random games.

The last thing we need to do is to propagate this result up until we reach the root of the tree. While doing this we update the parameters of every node we encounter.
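The propagation step is short enough to sketch in a few lines of Python. The `Node` class and the `backpropagate` function are hypothetical names for this illustration. The sketch follows the simplified description above where every node on the path receives the same result.

```python
class Node:
    """A tree node which only keeps track of (t), (n) and its parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.t = 0.0  # total value
        self.n = 0    # number of visits


def backpropagate(node, result):
    """Walk from the given node up to the root and update (t) and (n)."""
    while node is not None:
        node.t += result
        node.n += 1
        node = node.parent
```

Calling `backpropagate(child, 30)` on a fresh child of a fresh root leaves both nodes with a total of 30 and a visit count of 1.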

We've now successfully explored 2 child nodes in our tree. You might've guessed it already. We'll start again at our root node and calculate every child node's UCT score to determine the node we should further explore. In doing this we get the following values:

[S_1 = 30 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 31.25 ]

[S_2 = 20 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 21.25 ]

The largest value is the one we've computed for the leftmost node so we decide to explore that node further.

Given that this node has no child nodes we add two new nodes which represent the potential moves we can play to the tree. We initialize both of their parameters ((t) and (n)) with 0.

Now we need to decide which one of those two nodes we should explore further. And you're right. We use the UCT formula to calculate their values. Given that both have (t) and (n) values of zero they're both (\infty) so we decide to pick the leftmost node. Once again we do a rollout, retrieve the value of those games and propagate this value up the tree until we reach the tree's root node, updating all the node parameters along the way.

The next iteration will once again start at the root node where we use the UCT formula to decide which child node we want to explore further. Since we can see a pattern here and I don't want to bore you I'm not going to describe the upcoming steps in great detail. What we'll be doing is following the exact same procedure we've used above which can be summarized as follows:

1. Start at the root node and use the UCT formula to calculate the score for every child node
2. Pick the child node for which you've computed the highest UCT score
3. Check if the child has already been visited
   - If not, do a rollout
   - If yes, determine the potential next states from there
     - Use the UCT formula to decide which child node to pick
     - Do a rollout
4. Propagate the result back through the tree until you reach the root node

We iterate over this algorithm until we run out of time or reach a predefined threshold value of visits, depth or iterations. Once this happens we evaluate the current state of our tree and pick the child node(s) which maximize the value (t). Thanks to the many random games we've played and the Law of large numbers we can be reasonably confident that this move is one of the best we can play.
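Putting the pieces together, a compact version of the whole loop could look like the following Python sketch. It's a rough illustration under a couple of assumptions and not a reference implementation. It again uses the toy Nim variant (take 1 to 3 stones, last stone wins) instead of Chess, plays a single random game per rollout and flips the result at every level while propagating so that each node's value is seen from the perspective of the player who moved into it. All names (`Node`, `uct`, `rollout`, `mcts`) are made up for this example.

```python
import math
import random


class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones  # stones left after the move leading here
        self.parent = parent
        self.children = {}    # move (stones taken) -> child node
        self.t = 0.0          # total value
        self.n = 0            # number of visits


def uct(node, c):
    if node.n == 0:
        return math.inf
    return node.t / node.n + c * math.sqrt(math.log(node.parent.n) / node.n)


def rollout(stones, rng):
    """Return 1.0 if the player who just moved into this state wins a random game."""
    just_moved_wins = True  # with 0 stones left the mover took the last one
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        just_moved_wins = not just_moved_wins
    return 1.0 if just_moved_wins else 0.0


def mcts(root_stones, iterations=2000, c=1.5, seed=0):
    """Return the number of stones to take from a pile of root_stones (> 0)."""
    rng = random.Random(seed)
    root = Node(root_stones)
    for _ in range(iterations):
        # Walk down the tree, always following the highest UCT score.
        node = root
        while node.children:
            node = max(node.children.values(), key=lambda ch: uct(ch, c))
        # An already visited leaf (or the root) gets its child nodes added.
        if (node.n > 0 or node is root) and node.stones > 0:
            for move in range(1, min(3, node.stones) + 1):
                node.children[move] = Node(node.stones - move, parent=node)
            node = next(iter(node.children.values()))
        # Rollout followed by propagating the result up to the root.
        result = rollout(node.stones, rng)
        while node is not None:
            node.t += result
            node.n += 1
            result = 1.0 - result  # switch to the other player's view
            node = node.parent
    # Pick the move whose child node has the largest total value (t).
    return max(root.children, key=lambda move: root.children[move].t)
```

In this Nim variant leaving a multiple of 4 stones is the winning strategy, so with 5 stones on the table we'd expect `mcts(5)` to suggest taking 1 stone.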

That's all there is. We've just learned, applied and understood Monte Carlo Tree Search!

You might agree that it seems like MCTS is very compute intensive since you have to run through thousands of random games. This is definitely true and we need to be very clever as to where we should invest our resources to find the most promising path in our tree. We can control this behavior with the aforementioned "temperature" parameter (C) in our UCT formula. With this parameter we balance the trade-off between "exploration vs. exploitation".

A large (C) value puts us into "exploration" mode. We'll spend more time visiting least-explored nodes. A small value for (C) puts us into "exploitation" mode where we'll revisit already explored nodes to gather more information about them.
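A tiny Python experiment illustrates this trade-off. The node statistics below are invented for the sake of the example: one child looks strong and has been visited 10 times while the other one looks weaker but has only been visited once.

```python
import math


def uct_score(average, visits, parent_visits, c):
    return average + c * math.sqrt(math.log(parent_visits) / visits)


# Hypothetical statistics per child: (average value, number of visits)
children = {"well_explored": (30.0, 10), "barely_explored": (20.0, 1)}
parent_visits = 11


def pick(c):
    """Return the name of the child with the highest UCT score for a given C."""
    return max(
        children,
        key=lambda name: uct_score(*children[name], parent_visits, c),
    )
```

With `pick(1.5)` the well explored child wins (exploitation) whereas `pick(10)` prefers the barely explored one (exploration).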

Given the simplicity and applicability due to the exploitation of randomness, MCTS is a widely used game tree search algorithm. DeepMind extended MCTS with Deep Neural Networks to optimize its performance in finding the best Go moves to play. The resulting Game AI was so strong that it reached superhuman level performance and defeated the Go World Champion Lee Sedol 4-1.

Conclusion

In this blog post we’ve looked into 2 different tree search algorithms which can be used to build sophisticated Game AIs.

While Minimax combined with Alpha-Beta pruning is a solid solution to approach games where an evaluation function to estimate the game outcome can easily be defined, Monte Carlo Tree Search (MCTS) is a universally applicable solution given that no evaluation function is necessary due to its reliance on randomness.

Raw Minimax and MCTS are only the start and can easily be extended and modified to work in more complex environments. DeepMind cleverly combined MCTS with Deep Neural Networks to predict Go game moves whereas IBM extended Alpha-Beta tree search to compute the best possible Chess moves to play.

I hope that this introduction to Game AI algorithms sparked your interest in Artificial Intelligence and helps you understand the underlying mechanics you’ll encounter the next time you pick up a board game on your computer.

Additional Resources

Do you want to learn more about Minimax and Monte Carlo Tree Search? The following list is a compilation of resources I found useful while studying such concepts.

If you’re really into modern Game AIs I highly recommend the book “Deep Learning and the Game of Go” by Max Pumperla and Kevin Ferguson. In this book you’ll implement a Go game engine and refine it step-by-step until at the end you implement the concepts DeepMind used to build AlphaGo and AlphaGo Zero based on their published research papers.

Learning Deep Learning

Philipp Muens — Tue, 05 Mar 2019 14:42:00 +0000

Deep Learning, a branch of Machine Learning, gained a lot of traction and press coverage over the last couple of years. Thanks to significant scientific breakthroughs we’re now able to solve a variety of hard problems with the help of machine intelligence.

Computer systems were taught to identify skin cancer with a significantly higher accuracy than human doctors do. Neural Networks can generate photorealistic images of fake people and fake celebrities. It’s even possible for an algorithm to teach itself entire games from first principles, surpassing human-level mastery after only a couple of hours of training.

In summary Deep Learning is amazing, mystical and sometimes even scary and intimidating.

In order to demystify and understand this “Black Box” end-to-end I decided to take a deep dive into Deep Learning, looking at it through the practical as well as the theoretical lens.

With this post I’d like to share the Curriculum I came up with after spending months following the space, reading books and research papers, doing lectures, classes and courses to find some of the best educational resources out there.

Before we take a closer look I’d like to point out that the Curriculum as a whole is still a work in progress and might change over time since new material covering state-of-the-art Deep Learning techniques is released on an ongoing basis. Feel free to bookmark this page and revisit it from time to time to stay up-to-date with the most recent changes.

The Approach

During the research phase which resulted in the following Curriculum I triaged dozens of classes, lectures, tutorials, talks, MOOCs, papers and books. While the topics covered were usually the same the required levels of expertise in advanced Mathematics and computer programming were not.

Generally speaking one can divide most educational Deep Learning resources into two categories: “Shallow” and “Deep”. Authors of “Shallow” resources tend to heavily utilize high-level Frameworks and abstractions without taking enough time to talk about the underlying theoretical pieces. “Deep” resources on the other hand usually take the bottom-up approach, starting with a lot of Mathematical fundamentals until eventually some code is written to translate the theory into practice.

I personally believe that both are important: understanding how the technology works under the covers while using Frameworks to put this knowledge into practice. The proposed Curriculum is structured in a way to achieve exactly that: learning and understanding Deep Learning from a theoretical as well as a practical point of view.

In our case we’ll approach our Deep Learning journey with a slight twist. We won’t follow a strict bottom-up or top-down approach but will blend both learning techniques together.

Our first touchpoint with Deep Learning will be in a practical way. We’ll use high-level abstractions to build and train Deep Neural Networks which will categorize images, predict and generate text and recommend movies based on historical user data. This first encounter is 100% practice-oriented. We won’t take too much time to learn about the Mathematical portions just yet.

Excited about the first successes we had we’ll brush up our Mathematical understanding and take a deep dive into Deep Learning, this time following a bottom-up approach. Our prior, practical exposure will greatly benefit us here since we already know what outcomes certain methodologies produce and therefore have specific questions about how things might work under the hood.

In the last part of this Curriculum we’ll learn about Deep Reinforcement Learning which is the intersection of Reinforcement Learning and Deep Learning. A thorough analysis of AlphaGo Zero, the famous agent that learned the Go board game from scratch by playing against itself and became basically unbeatable by humans, will help us understand and appreciate the capabilities this approach has to offer.

During our journey we’ll work on two distinct Capstone projects (“Capstone I” and “Capstone II”) to put our knowledge into practice. While working on this we’ll solve real problems with Deep Neural Networks and build up a professional portfolio we can share online.

Once done we’ll be in a good position to continue our Deep Learning journey reading through the most recent academic research papers, implementing new algorithms and coming up with our own ideas to contribute to the Deep Learning community.

The Curriculum

As already discussed above, Deep Learning is… Deep. Given the traction and momentum, Universities, Companies and individuals have published a near endless stream of resources including academic research papers, Open Source tools, reference implementations as well as educational material. During the last couple of months I spent my time triaging those to find the highest quality, yet up-to-date learning resources.

I then took a step back and structured the materials in a way which makes it possible to learn Deep Learning from scratch up to a point where enough knowledge is gained to solve complex problems, stay on top of the current research and participate in it.

1. A Practical Encounter

We begin our journey in the land of Deep Learning with a top-down approach by introducing the subject “Deep Learning” in a practical and playful way. We won’t start with advanced college Math, theoretical explanations and abstract AI topics. Rather we’ll dive right into the application of tools and techniques to solve well-known problems.

The main reason for doing this is that it keeps us motivated since we’ll solve those problems with state-of-the-art implementations which will help us see and understand the bigger picture. It’s a whole lot easier to take a look under the covers of the abstractions we’ll use once we know what can be achieved with them. We’ll automatically come up with questions about certain results and behaviors and develop our own intuition and excitement to understand how the results came to be.

In doing this we’ll take the great “Practical Deep Learning for Coders” course by the Fast.ai team which will walk us through many real-world examples of Deep Neural Network usage. Theoretical concepts aren’t completely left out but will be discussed “just-in-time”.

It’s important to emphasize that it’s totally fine (and expected) that we won’t understand everything which is taught during this course the first time we hear about it. Most of the topics will be covered multiple times throughout this Curriculum so we’ll definitely get the hang of it later on. If you’re having problems with one topic or the other, feel free to rewatch the respective part in the video or do some research on your own. Keep in mind though that you shouldn’t get too deep into the weeds since our main focus is still on the practical portions.

You should definitely recreate each and every single Jupyter Notebook which was used in the Fast.ai course from scratch. This helps you to get a better understanding of the workflow and lets you play around with the parameters to see the effects they have on the data.

When done it’s a good idea to watch the following great talk by Google and this mini-course by Leo Isikdogan to solidify the knowledge we’ve just acquired.

Resources

2. Mathematical Foundations

Once we have a good understanding of what Deep Learning is, how it’s used in practice and how it roughly works under the hood it’s time to take a step back and refresh our Math knowledge. Deep Neural Networks heavily utilize Matrix multiplications, non-linearities and optimization algorithms such as Gradient Descent. We therefore need to familiarize ourselves with Linear Algebra, Calculus and some basic Probability Theory which build the Mathematical foundations of Deep Learning.

While this is certainly advanced Mathematics it’s important to highlight that High School level Math knowledge is usually enough to get by in the beginning. For the most part we should just refresh our knowledge a little bit. It’s definitely not advisable to spend weeks or even months studying every aspect of Linear Algebra, Calculus or Probability Theory (if that’s even possible) to consider this part “done”. Basic fluency in the aforementioned topics is enough. There’s always enough time to learn the more advanced topics as soon as we come across them.

Having a good Mathematical understanding will pay dividends later on as we progress with more advanced Deep Learning topics. Don’t be intimidated by this part of the Curriculum. Mathematics can and should be fun!

Stanford has some great refreshers on Linear Algebra and Probability Theory. If that’s too shallow and you need a little bit more to get up to speed you might find Part 1 of the Deep Learning Book helpful.

Once you’ve brushed up the basics it’s worthwhile to take a couple of days and thoroughly study “The Matrix Calculus You Need For Deep Learning” by Terence Parr and Jeremy Howard (one of the founders of Fast.ai) and the “Computational Linear Algebra” course by Rachel Thomas (also a co-founder of Fast.ai). Both resources are heavily tailored to teach the Math behind Deep Learning.

Resources

3. Deep Dive

Now we’re armed with a good understanding of the capabilities and the underlying Math of Deep Learning.

Given this it’s time to take a deep dive to broaden our knowledge of Deep Learning. The main goal of this part is to take the practical experience and blend it with our Mathematical exposure to fully understand the theoretical building blocks of Deep Neural Networks. A thorough understanding of this will be key later on once we learn more about topics such as Deep Reinforcement Learning.

The following describes 3 different ways to take the deep dive. The approaches are certainly not mutually exclusive but could (and should) be used in conjunction to complement each other.

The path you might want to take will depend on your prior exposure to Deep Learning and your favorite learning style.

If you’re a person who appreciates classical MOOCs in the form of high-quality, pre-recorded videos with quizzes and exercises you’ll definitely enjoy Andrew Ng’s Deeplearning.ai “Deep Learning Specialization”. This course is basically split up into 5 different sub-courses which will take you from the basics of Neural Networks to advanced topics such as Recurrent Neural Networks. While learning about all of this you’ll also pick up a lot of valuable nuggets Andrew shares as he talks about his prior experience as a Deep Learning practitioner.

You can certainly get around the tuition fee for the Deeplearning.ai specialization, but it’s important to emphasize that it’s definitely worth every penny! You’ll have access to high quality course content, can request help when you’re stuck and get project reviews by classmates and experts.

Readers who enjoy books should definitely look into the “Dive into Deep Learning” book. This book was created to be a companion guide for the STAT 157 course at UC Berkeley but turned into more than that. The main focus of this book is to be at the intersection of Mathematical formulations, real world applications and the intuition behind Deep Learning complemented by interactive Jupyter Notebooks to play around with. “Dive into Deep Learning” covers all of the important concepts of a modern Deep Learning class. It requires no prior knowledge and starts with the basics of Neural Networks while moving onwards to cover advanced topics such as Convolutional Neural Networks, ending in discussions about state-of-the-art NLP implementations.

Another method to study Deep Learning in great detail is with the help of recorded university class videos. MIT released the terrific “Introduction to Deep Learning” course which is basically a recording of their 6.S191 class accessible for everyone to watch! This option is definitely one of the more advanced ways to learn the subject as some prior university-level Math and Computer Science knowledge is necessary to grok it. The huge benefit of this format is that it touches on a lot of different topics other courses simply don’t cover due to missing prerequisites. If you’ve already been exposed to university-level Computer Science and Mathematics and like to learn with a focus on more rigorous theory, then this course is definitely for you.

Whatever route you take, it’s really important that you take your time to revisit concepts and recreate their implementations from scratch. It’s totally fine if you’re struggling at first. It’s this wandering through the dark alleys where you’ll actually learn the most! Don’t waste your time passively consuming content. Go out and reproduce what you’ve just learned!

At the end of the day it doesn’t really matter what format you choose. All courses will prepare you equally well for the next step in your journey to Deep Learning mastery which is your first Capstone project!

Resources

4. Capstone Project I

Focus: Supervised Deep Learning

Enough theory (for now). It’s time to put our hard-earned knowledge into practice.

In our first Capstone project we’ll demonstrate that we fully understand the basic building blocks of modern Deep Learning. We’ll pick a problem of interest and solve it with the help of a Deep Neural Network. Since we’ve mostly dealt with Supervised Learning so far it’s worth mentioning that our solution will be based on such an implementation.

Our programmatic environment will be a separate Jupyter Notebook where we code and describe every step together with a brief justification of its necessity in great detail. Taking the time to think through the steps necessary to solve our problem helps us check ourselves as we have to think through our architecture as well as the underlying processes that take place when our code is executed.

To further deepen our knowledge and help us get out of the comfort zone we’ll restrict our implementation to the usage of low-level Frameworks, meaning that we’re only allowed to use Frameworks such as PyTorch, TensorFlow or MXNet. Any usage of high-level abstraction libraries such as Fastai or Keras is strictly forbidden. Those libraries, while being great for the experienced practitioner, abstract too much away, preventing us from going through the tough decisions and tradeoffs we have to make when working on our problem.

Remember that this is the part where we’ll learn the most as we’re really getting into the weeds here. Don’t give up as enlightenment will find you once you’ve made it. It’s also more than ok to go back and reread / rewatch the course material if you’re having problems and need some help.

While working on this project always keep in mind that it’s one of your personal portfolio projects you should definitely share online. It’s those projects where you can demonstrate that you’re capable of solving complex problems with Deep Learning technologies. Make sure that you really spend a good portion of your time on it and “make it pretty”.

Are you struggling to find a good project to work on? Here are some project ideas which will help you get started:

5. Deep Reinforcement Learning

Deep Reinforcement Learning is the last major topic we’ll cover in this Curriculum.

One might ask the question as to what the difference between the Deep Learning we’re studying and Deep Reinforcement Learning is. All the techniques we’ve learned and used so far were built around the concept of Supervised Learning. The gist of Supervised Learning is that we utilize large datasets to train our model by showing it data, letting it make predictions about what it thinks the data represents and then using the labeled solution to compute the difference between the prediction and the actual solution. We then use algorithms such as Gradient Descent and Backpropagation to subsequently readjust our model until the predictions it makes meet our expectations.
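The Supervised Learning loop described above can be boiled down to a few lines of plain Python. The following sketch is deliberately tiny and makes a couple of assumptions. The "model" is just a line with a weight `w` and a bias `b`, the labeled dataset is generated from (y = 2x + 1) and the gradients of the mean squared error are written out by hand instead of being computed via Backpropagation. The function name `train` and its parameters are made up for this example.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b to labeled data using plain Gradient Descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Let the model make predictions and compare them with the labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Readjust the model a little bit in the direction which reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Labeled examples generated from y = 2x + 1 (hypothetical toy dataset)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b = train(xs, ys)
```

After training `w` ends up close to 2 and `b` close to 1. A Deep Neural Network follows the same predict, compare and readjust cycle, just with many more parameters.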

You might’ve already noticed that Supervised Learning heavily relies on huge datasets to train and test our models via examples.

What if there’s a way that our AI can teach itself what it should do based on self-exploration and guidelines we define? That’s where Reinforcement Learning comes into play. With Reinforcement Learning we’re able to let our model learn from first principles by exploring the environment. The researchers at DeepMind were among the first who successfully blended Deep Learning and Reinforcement Learning to let an AI teach itself to play Atari games. The only inputs the AI agent got were the raw input pixels and the score.
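To get a feeling for what "learning by exploring the environment" means, here's a minimal sketch of tabular Q-learning in Python. It's a classic Reinforcement Learning algorithm without any Deep Learning involved and isn't what DeepMind used for Atari. The environment is a made-up corridor with 5 cells where the agent starts on the very left and only gets a reward of 1 once it reaches the rightmost cell. All names and parameter values are assumptions for this illustration.

```python
import random


def q_learning(num_states=5, episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Learn which action to take in every cell of a small corridor."""
    rng = random.Random(seed)
    actions = ("left", "right")
    # One table entry per (state, action) pair, all starting at 0.
    q = [{a: 0.0 for a in actions} for _ in range(num_states)]
    goal = num_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Explore the environment purely at random.
            action = rng.choice(actions)
            next_state = max(0, state - 1) if action == "left" else state + 1
            # The only feedback is the reward for reaching the goal.
            reward = 1.0 if next_state == goal else 0.0
            best_next = 0.0 if next_state == goal else max(q[next_state].values())
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    # The learned policy: the action with the highest value in every cell.
    return [max(q[s], key=q[s].get) for s in range(goal)]
```

Nobody tells the agent that walking to the right is a good idea. It figures this out on its own, so `q_learning()` should return "right" for every cell.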

In this part of our Curriculum we’ll learn what Reinforcement Learning is and how we can combine Deep Learning and Reinforcement Learning to build machine intelligence which learns to master tasks in an autodidactic way.

As per usual there are different ways to learn Deep Reinforcement Learning.

Thomas Simonini has a great “Deep Reinforcement Learning Course” which focuses on the practical pieces of Deep Reinforcement Learning as you’ll implement real world applications throughout his class.

OpenAI’s “Spinning Up in Deep RL” course is another great resource which strikes a really good balance between practical examples and theoretical foundations.

If you’re looking for a University-level class which heavily focuses on the theoretical underpinnings I’d highly recommend the “Advanced Deep Learning and Reinforcement Learning Class” which was taught by UCL and DeepMind.

Every resource listed here will help you understand and apply Deep Reinforcement Learning techniques. While some are more focused on the practical portions others go really deep into the trenches of theoretical rigor. It’s definitely worthwhile to look into all of them to get the all-around view and best mixture between theory and practice.

Once you’ve successfully made your way through one of the Deep Reinforcement Learning courses it’s a good idea to revisit the key ideas by reading the excellent blog posts “Deep Reinforcement Learning: Pong from Pixels” by Andrej Karpathy and “A (Long) Peek into Reinforcement Learning” by Lilian Weng as they give a nice, broader overview of the different topics which were covered during class.

Aside: If you’re fascinated by the possibilities of Reinforcement Learning I’d highly recommend the book “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto. The recently updated 2nd edition includes chapters about Neuroscience, Deep Neural Networks and more. While it’s possible and desirable to buy the book at your local bookstore you can also access the book as a freely available PDF online.

Resources

6. Capstone Project II

Focus: Deep Reinforcement Learning

It’s time for our second and last Capstone Project where we’ll use Deep Reinforcement Learning to let our AI teach itself to solve difficult real-world problems.

The same restrictions from our first Capstone project also apply here. We’ll implement the solution in a dedicated Jupyter Notebook where we write our code and the prose to describe what we’re doing and why we’re doing it. This helps us test our knowledge since we have to take the time to think through our current implementation and its implications to the system as a whole.

As with the Capstone I project it’s forbidden to use higher level abstraction libraries such as Fastai or Keras. Our implementation here should only use APIs provided by lower-level Frameworks such as PyTorch, TensorFlow or MXNet.

Keep in mind that it’s totally fine to feel stuck at some point. Don’t be discouraged! Take your time to revisit the material and ensure that you fill your knowledge gaps before moving on. It’s those moments of struggle where you grow the most. Once you’ve made it, you’ll feel excited and empowered.

The result of this Capstone project is another crucial piece of your personal Deep Learning portfolio. Make sure to set aside enough time to be able to put in the effort so that you can showcase your implementation online.

Do you need some inspiration for projects you might want to work on? Here’s a list with some ideas:

Conclusion

Deep Learning has gained a lot of traction in the last couple of years as major scientific breakthroughs finally made it possible to train and utilize Deep Neural Networks to perform tasks at human expert level ranging from cancer detection to mastery in games such as Go or Space Invaders.

In this blog post I shared the Curriculum I follow to learn Deep Learning from scratch. Right in the beginning of the journey one learns how Deep Learning techniques are used in practice to solve real-world problems. Once a baseline understanding is established it’s time to take a deep dive into the Mathematical and theoretical pieces to demystify the Deep Learning “Black Box”. A final exploration of the intersection of Deep Learning and Reinforcement Learning puts the reader in a great position to understand state-of-the-art Deep Learning solutions. Throughout the whole Curriculum we’ll practice our skills and showcase our fluency in such while working on dedicated Capstone projects.

While putting this together I had the feeling that this Curriculum can look quite intimidating at first glance since lots of topics are covered and it’ll definitely take some time to get through it. While I’d advise the avid reader to follow every single step in the outlined order it’s totally possible to adapt and skip some topics given that everyone has different experiences, goals and interests. Learning Deep Learning should be fun and exciting. If you ever feel exhausted or struggle to get through a certain topic you should take a step back and revisit it later on. Oftentimes complicated facts and figures turn into no-brainers if we give ourselves the permission and time to do something else for the moment.

I personally believe that it’s important to follow a goal while learning a new topic or skill. Make sure that you know why you want to learn Deep Learning. Do you want to solve a problem at your company? Are you planning to switch careers? Is a high level overview enough for you since you just want to be educated about AI and its social impacts? Whatever it is, keep this goal in mind as it’ll make everything reasonable and easier during the hard times when the motivation might be lacking and everything just feels too hard to pick up.

Learning Advanced Mathematics

Philipp Muens — Tue, 19 Feb 2019 09:57:00 +0000

For me personally Math was one of those mysterious subjects I had to go through in school but never really understood, let alone appreciated. It was too abstract, involved lengthy computations and rote formula memorization with little to no explanation as to why it’s useful and how it’s applied in the real world. Frankly put Math was one of my weakest spots. My parents were surprised and shocked when I told them that I planned to study Computer Science, which is a branch of applied Mathematics. Throughout my life I had a love-hate relationship with Math. I still remember that feeling of relief when I passed my last Math exam in college.

During my career as a Software Engineer I mostly stayed away from Math. From time to time I consulted old Computer Science books to do some research on algorithms I then implemented. However those were usually the only touchpoints I had with Math.

Something changed over the last couple of years. While looking for the next personal challenges and goals to grow I figured that most of the really exciting achievements heavily utilize Math as a fundamental building block. That’s actually true for a lot of scientific fields including Econometrics, Data Science and Artificial Intelligence. It’s easy to follow the news and roughly understand how things might work but once you try to dig deeper and look under the hood it gets pretty hairy.

I found myself regularly lost somewhere in the dark alleys of Linear Algebra, Calculus and Statistics. Last year I finally reached a fork in the road. I wanted to fundamentally change my understanding and decided to re-learn Math from scratch. After countless late nights, early mornings and weekends doing classes, exercises and proofs I’m finally at a pretty decent level of understanding advanced Mathematics. Right now I’m building upon this foundation to learn even more.

During this process I learned one important thing: Math is really amazing!

Math is the language of nature. Understanding it helps you understand how our world works!

With this blog post I’d like to share how I went from “What is a root of a polynomial again?” to “Generalized Autoregressive Conditional Heteroskedasticity” (at least to some extent). I’ll share the Curriculum I created and followed, the mistakes I made (spoiler: I made a lot) and the most useful resources I used throughout this journey.

Before we start I want to be honest with you: Math is a really involved discipline. There’s a lot out there…

And it can certainly be overwhelming. However if you’re really dedicated and want to put in those hours you’ll make it! If I can do it so can you!

Please keep in mind that this is the path which worked for me. This doesn’t necessarily mean that it will be as efficient for you. In my case I need to study, self-explain and practice, practice, practice to really understand a topic at hand. I know of people who can just sit in class, listen and ultimately get it. That’s definitely not how I operate.

Alright. Let’s get started!

The Curriculum

Math is one of those subjects where you’ll find a nearly endless stream of resources. Looking closer they all vary widely in terms of quality, density and understandability.

My first approach to ramp up my Math skills was to skim through an interesting research paper, write down all the Math I didn’t understand and look those terms up to study them in greater detail. This was fundamentally wrong on many levels. After some trial and error I took a step back and did a lot of research to figure out which topics I should study to support my goal and how those topics are related to one another.

The Curriculum I finally put together is a good foundation if you want to jump into other “Hard Sciences”. My personal goal was to acquire the knowledge I need to take a really deep dive into Artificial Intelligence. To be more specific I’m really excited about Deep Learning and the next steps in the direction of Machine intelligence.

Every topic which is covered in this Curriculum uses 3 basic pillars to build a solid Mathematical foundation:

Intuition

Videos, interactive visualizations and other helpful resources which outline how the Math came to be and how it works on an intuitive level.

Deep Dive

A good enough “deep dive” to get familiar with the foundational concepts while avoiding confusion due to overuse of theorems, proofs, lemmas, etc.

Practicality

Practice, practice, practice. Resources such as books with lots of exercises to solidify the knowledge.

Algebra

Algebra is the first topic which should be studied extensively.

Having a really good understanding of Algebra makes everything a whole lot easier! Calculus comes down to 90% Algebra most of the time. If you know how to solve Algebra problems you won’t have a hard time in Calculus either.

Most of you might remember a phrase similar to

“Solve this equation for x”

That’s what Algebra is about. In an Algebra class you’ll learn about the following topics:

Solving equations
Solving inequalities
Polynomials
Factoring
Functions
Graphing
Symmetry
Fractions
Radicals
Exponents
Logarithms
Linear systems of equations
Nonlinear systems of equations

As stated above it’s of uber importance that you really hone your Algebra skills. I’m repeating myself but Algebra is one of the main building blocks for advanced Mathematics.
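To make “Solve this equation for x” a bit more tangible, here’s a minimal sketch which solves a quadratic equation with the well-known quadratic formula. The function name `solve_quadratic` and the example coefficients are my own picks for illustration and not taken from any particular course.

```python
import math


def solve_quadratic(a, b, c):
    """Return the real solutions of a*x^2 + b*x + c = 0, smallest first."""
    if a == 0:
        raise ValueError("a must be non-zero for a quadratic equation")
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        # No real solutions (the parabola never crosses the x-axis).
        return []
    root = math.sqrt(discriminant)
    # A set removes the duplicate when both solutions coincide.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


# x^2 - 5x + 6 factors into (x - 2)(x - 3), so we expect x = 2 and x = 3.
solutions = solve_quadratic(1, -5, 6)
print(solutions)  # [2.0, 3.0]
```

Factoring the polynomial by hand and comparing the result with the output is a nice way to double-check your own work.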

Resources

Trigonometry

In Trigonometry you’ll study the relationship of lengths and angles of triangles.

You’ll learn about the unit circle and its relation to sin and cos, cones and their relation to circles, ellipses, parabolas and hyperbolas, Pythagoras’ Theorem and more. Trigonometry is interesting in itself since it can be immediately applied to real life problems.

Here’s a list of topics you’ll usually learn in a Trigonometry class:

Pythagoras’ Theorem
Sin and cos
The unit circle
Trigonometric identities
Radians vs. Degrees

Generally speaking this course is rather short. Nevertheless it’s a good preparation class for Calculus.
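Several of the topics above fit into a few lines of code. The following sketch (the helper name `point_on_unit_circle` is made up for this example) converts degrees to radians, looks up a point on the unit circle via cos and sin and checks that Pythagoras’ Theorem holds for it.

```python
import math


def point_on_unit_circle(angle_degrees):
    """Return the (x, y) coordinates on the unit circle for an angle in degrees."""
    angle_radians = math.radians(angle_degrees)  # 180 degrees equal pi radians
    return math.cos(angle_radians), math.sin(angle_radians)


x, y = point_on_unit_circle(60)

# The radius of the unit circle is 1, so x^2 + y^2 should be (very close to) 1.
# This is the trigonometric identity cos^2 + sin^2 = 1.
print(x, y, x * x + y * y)
```

Due to floating point arithmetic the results are only approximately 0.5, sqrt(3) / 2 and 1, which is why comparisons should use a tolerance (e.g. via `math.isclose`).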

Resources

Calculus

The study of continuous change is one of the main focus areas in Calculus.

This might sound rather abstract and the intuition behind it is really paradoxical if you think about it (see “Essence of Calculus” below). However you might remember that you dealt with Derivatives, Limits and area calculations for functions.

There are usually 3 different Calculus classes (namely Calculus I, II and III) one can take. Those 3 classes range from easy topics such as “Derivatives” and “Limits” to advanced topics such as “Triple Integrals in Spherical Coordinates”. I’d suggest definitely taking the first class (Calculus I) and continuing with the second one (Calculus II) if time permits. If you’re in a hurry taking Calculus I is usually sufficient.

In Calculus I you’ll learn about:

Limits
Continuity
L’Hospital’s Rule
Derivatives
Power, Product, Quotient, Chain rule
Higher Order Derivatives
Min / Max Values
Concavity
Integrals
Substitution Rule

Calculus is an important topic since it’s heavily used in optimization problems to find local minima. The “Gradient Descent” algorithm uses Derivatives to decide in which direction the weights of Neurons in modern (Deep) Neural Networks should be adjusted, while Backpropagation computes those Derivatives efficiently via the Chain rule.
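Here’s a minimal sketch of Gradient Descent on a toy function with a single variable. The function `f`, the learning rate and the number of steps are arbitrary choices for illustration. A real Neural Network has many weights instead of one `x`, but the idea of repeatedly stepping against the Derivative is the same.

```python
def f(x):
    """A simple parabola with its minimum at x = 3."""
    return (x - 3) ** 2 + 1


def df(x):
    """Derivative of f, obtained via the Power and Chain rule."""
    return 2 * (x - 3)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Walk downhill by repeatedly stepping against the Derivative."""
    x = start
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x


minimum_x = gradient_descent(start=0.0)
print(minimum_x, f(minimum_x))  # x ends up very close to 3
```

Try a learning rate above 1.0 to see the updates overshoot and diverge. Picking this value well is one of the practical challenges when training models.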

Resources

Linear Algebra

Linear Algebra is one of the most, if not the most important topic when learning Math for Data Science, Artificial Intelligence and Deep Learning.

Linear Algebra is pretty much omnipresent in modern computing since it lets you efficiently do calculations on multi-dimensional data. During childhood you probably spent quite some time in front of your computer screen while wading through virtual worlds. Photorealistic 3D renderings are possible thanks to Math and more specifically Linear Algebra.

Linear Algebra courses usually cover:

Systems of Equations
Vectors
Matrices
Inverse Matrices
Identity Matrix
Matrix Arithmetic
Determinants
Dot & Cross Product
Vector Spaces
Basis and Dimension
Linear Transformation
Eigenvectors & Eigenvalues

As already stated above, Linear Algebra is one of the most important topics in modern computing. Lots of problems such as image recognition can be broken down into calculations on multi-dimensional data.
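To give a small taste of what such calculations look like, here’s a sketch of a Linear Transformation: rotating a 2D point with a Matrix-Vector product. Graphics engines apply the same idea (with larger Matrices) to every point of a 3D scene. The helper names `mat_vec` and `rotation_matrix` are my own and plain Python lists stand in for a proper Linear Algebra library.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a Matrix (list of rows) with a Vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rotation_matrix(angle_degrees):
    """Return the 2x2 Matrix which rotates counter-clockwise around the origin."""
    t = math.radians(angle_degrees)
    return [
        [math.cos(t), -math.sin(t)],
        [math.sin(t), math.cos(t)],
    ]


# Rotating the point (1, 0) by 90 degrees should land (almost exactly) on (0, 1).
rotated = mat_vec(rotation_matrix(90), [1.0, 0.0])
print(rotated)
```

The first coordinate will be a tiny number rather than an exact 0 because of floating point rounding.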

You might have heard about the Machine Learning framework TensorFlow which was developed and made publicly available by Google. Well, a Tensor is just a fancy word for a higher-dimensional way to organize information. Hence a Scalar is a Tensor of rank 0, a Vector is a Tensor of rank 1, an M x N Matrix is a Tensor of rank 2, etc.
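The rank idea can be sketched with nested Python lists. Note that this is only an illustration and not how TensorFlow represents Tensors internally. The function `tensor_rank` is a hypothetical helper which assumes that the nesting is regular (every row has the same shape).

```python
def tensor_rank(value):
    """Count how deeply lists are nested, assuming a regular shape."""
    rank = 0
    while isinstance(value, list):
        rank += 1
        if not value:
            # An empty list has no first element to descend into.
            break
        value = value[0]
    return rank


scalar = 5.0
vector = [1.0, 2.0, 3.0]
matrix = [[1.0, 2.0], [3.0, 4.0]]
cube = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

print(tensor_rank(scalar), tensor_rank(vector), tensor_rank(matrix), tensor_rank(cube))
# 0 1 2 3
```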

Another interesting fact is that Deep Neural Networks are usually trained on GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). The simple reason is that GPUs and TPUs are way better at processing Linear Algebra computations than CPUs since they (at least GPUs) were invented as dedicated hardware units to do exactly that when rendering computer graphics.

Aside: Here’s the original paper by Andrew Ng et al. where GPUs were first explored to carry out Deep Learning computations.

Resources

Statistics & Probabilities

The last topic which should be covered in this Curriculum is Statistics & Probabilities.

While both topics are sometimes taught separately it makes sense to learn them in conjunction since statistics and probabilities share a deep underlying relationship.

A typical Statistics & Probabilities class covers:

Charting and plotting
Probability
Conditional Probability
Bayes Rule
Probability Distributions
Average
Variance
Binomial Distribution
Central Limit Theorem
Normal Distribution
Confidence Intervals
Hypothesis Test
Regression
Correlation

In Data Science one usually has to deal with statistical analysis to see if the computations actually made sense. Furthermore it’s helpful to compute and visualize correlations between data and certain events. Bayes Rule is another important tool which helps us update our beliefs about “the world” when more evidence becomes available. The realms of Machine Learning and Deep Learning usually deal with lots of uncertainty. Having a good toolbox to deal with this makes our life a whole lot easier.
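Here’s a minimal sketch of such a belief update with Bayes Rule. The scenario (a test which detects a rare condition) and all the numbers are made up for illustration, as is the function name `bayes_update`.

```python
def bayes_update(prior, likelihood, false_positive_rate):
    """Return P(H | E) = P(E | H) * P(H) / P(E) for a positive test result.

    prior:               P(H), belief in the hypothesis before seeing evidence
    likelihood:          P(E | H), chance of the evidence if H is true
    false_positive_rate: P(E | not H), chance of the evidence if H is false
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# 1% of people have the condition, the test catches 90% of the cases
# and wrongly flags 5% of healthy people.
posterior = bayes_update(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.154

# A second, independent positive test uses the first posterior as the new prior.
second_posterior = bayes_update(posterior, likelihood=0.9, false_positive_rate=0.05)
print(round(second_posterior, 3))  # 0.766
```

Even with a fairly accurate test a single positive result only raises the belief to roughly 15% since the condition is so rare. More evidence moves it further.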

A pretty popular example of applied statistics is the Monte Carlo Tree Search algorithm. This heuristic algorithm was used in DeepMind’s AI breakthrough “AlphaGo” to determine which moves it should consider while playing the Go board game.

Feel free to read through the official paper for more information about the underlying technologies. Trust me, it’s amazing to read and understand how Math played such a huge role in building such a powerful contestant.

Resources

Mistakes

As I already stated above it’s been quite a journey and I made lots of mistakes along the way.

In this section I’d like to share some of those mistakes so that you don’t have to go through this yourself.

The first mistake I made was jumping straight into Math without having a clear plan / Curriculum and more importantly goal. I dived right into certain topics I picked up while reading research papers and quickly figured that some of them were too advanced since I understood very little (if anything at all). My approach was to back off and start somewhere else. “Trial and error” so to speak. This was obviously very costly in terms of time and resources.

The solution here was to actually have a clear goal (learning Math to understand the underlying principles of Artificial Intelligence) and take the time to research a lot to come up with a sound Curriculum and start from there. Having that sorted out I only had to follow the path and knew that I was good.

During this aforementioned trial and error phase I made the mistake of taking way too many MOOCs. Don’t get me wrong, MOOCs are great! It has never been possible before to take an MIT course from your couch. In my case exactly that was the problem. Most of the time I was passively watching the course content, nodding along. After a couple of “completed courses” and the feeling of knowing the ins and outs I jumped into more sophisticated problems, only to figure out that I had developed pretty shallow knowledge.

Doing a retrospective on the “completed courses” I saw that my learning style isn’t really tailored to MOOCs. I decided to switch my focus to the good old physical textbooks. I especially focused on textbooks with good didactics, lots of examples and exercises with solutions (the Schaum’s Outlines series is golden here). Switching from passively consuming to actively participating in the form of working through numerous exercises was really the breakthrough for me. It ensured that I left my comfort zone, went deep into the trenches and really battle-tested my knowledge about the topic at hand.

The other upside of using textbooks is that it made it possible to learn in a distraction free environment. No computer, no notifications, no distractions. Just me, a black coffee and my Math textbook!

One final tip I’d like to share is that you should really keep track of your feelings and engagement while studying. Do you feel fired up? Are you excited? Or are you just consuming and your thoughts are constantly wandering off because you don’t really care that much? If that’s the case then it’s usually time to move on. Don’t try to push through. There’s nothing worse than completing a course just for the sake of completing it. If it doesn’t feel right or isn’t working for you it’s important to let go and move on. There’s enough material and maybe the next one suits your needs.

Conclusion

In this blog post I’ve shared my journey from being someone who acknowledged Math as something one should’ve heard about to someone who learned to love Math and its applications to solve complex problems. In order to really understand and learn more about Artificial Intelligence and Deep Learning I created a Curriculum which not only covers the underlying Math concepts of such disciplines but will also serve the student well when learning more about other “Hard Sciences” such as Computer Science in general, Physics, Meteorology, Biology, etc.

I’m still early in my Math journey and there’s an infinite amount of exciting stuff to learn. With the given Curriculum I feel confident that I’ve gained a solid foundation to pick up more complex topics I’ll encounter while learning about Artificial Intelligence, Deep Learning and Advanced Mathematics in general.