<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Adam</title>
    <description>The latest articles on Forem by Adam (@adamcalica).</description>
    <link>https://forem.com/adamcalica</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F174434%2Feddcf023-8941-4c68-96d7-fc043c44b669.jpeg</url>
      <title>Forem: Adam</title>
      <link>https://forem.com/adamcalica</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adamcalica"/>
    <language>en</language>
    <item>
      <title>PyTorch vs. TensorFlow: Which Framework Is Best?</title>
      <dc:creator>Adam</dc:creator>
      <pubDate>Sun, 22 Sep 2019 14:04:04 +0000</pubDate>
      <link>https://forem.com/adamcalica/pytorch-vs-tensorflow-which-framework-is-best-5ea3</link>
      <guid>https://forem.com/adamcalica/pytorch-vs-tensorflow-which-framework-is-best-5ea3</guid>
      <description>&lt;p&gt;If you are reading this you've probably already started your journey into deep learning. If you are new to this field, in simple terms deep learning is an add-on to develop human-like computers to solve real-world problems with its special brain-like architectures called artificial neural networks. To help develop these architectures, tech giants like Google, Facebook and Uber have released various frameworks for the Python deep learning environment, making it easier for to learn, build and train diversified neural networks. In this article, we’ll take a look at two popular frameworks and compare them: PyTorch vs. TensorFlow. be comparing, in brief, the most used and relied Python frameworks TensorFlow and PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GOOGLE’S TENSORFLOW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TensorFlow is an open-source deep learning framework created by developers at Google and released in 2015. The official research is published in the paper “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” &lt;/p&gt;

&lt;p&gt;TensorFlow is now widely used by companies, startups, and business firms to automate things and develop new systems. It draws its reputation from its distributed training support, scalable production and deployment options, and support for various devices like Android.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FACEBOOK’S PYTORCH&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PyTorch is one of the latest deep learning frameworks and was developed by the team at Facebook and open sourced on GitHub in 2017. You can read more about its development in the research paper "Automatic Differentiation in PyTorch."&lt;/p&gt;

&lt;p&gt;PyTorch is gaining popularity for its simplicity, ease of use, dynamic computational graph and efficient memory usage, which we'll discuss in more detail later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHAT CAN WE BUILD WITH TENSORFLOW AND PYTORCH?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, neural networks were used to solve simple classification problems like handwritten digit recognition or identifying a car’s registration number using cameras. But thanks to the latest frameworks and NVIDIA’s high-performance graphics processing units (GPUs), we can train neural networks on terabytes of data and solve far more complex problems. A few notable achievements include reaching state-of-the-art performance on the ImageNet dataset using convolutional neural networks implemented in both TensorFlow and PyTorch. The trained model can be used in different applications, such as object detection, semantic image segmentation and more.&lt;/p&gt;

&lt;p&gt;Although the architecture of a neural network can be implemented on either of these frameworks, the result will not be the same. The training process has many parameters that are framework dependent. For example, if you are training a model in PyTorch, you can accelerate training with GPUs, which PyTorch drives through CUDA (a C++ backend). TensorFlow can also access GPUs, but it uses its own built-in GPU acceleration, so the time to train a model will always vary with the framework you choose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TOP TENSORFLOW PROJECTS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Magenta: An open source research project exploring the role of machine learning as a tool in the creative process. (&lt;a href="https://magenta.tensorflow.org/" rel="noopener noreferrer"&gt;https://magenta.tensorflow.org/&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;Sonnet: Sonnet is a library built on top of TensorFlow for building complex neural networks. (&lt;a href="https://sonnet.dev/" rel="noopener noreferrer"&gt;https://sonnet.dev/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Ludwig: Ludwig is a toolbox to train and test deep learning models without the need to write code. (&lt;a href="https://uber.github.io/ludwig/" rel="noopener noreferrer"&gt;https://uber.github.io/ludwig/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TOP PYTORCH PROJECTS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. (&lt;a href="https://stanfordmlgroup.github.io/projects/chexnet/" rel="noopener noreferrer"&gt;https://stanfordmlgroup.github.io/projects/chexnet/&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;PYRO: Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. (&lt;a href="https://pyro.ai/" rel="noopener noreferrer"&gt;https://pyro.ai/&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;Horizon: A platform for applied reinforcement learning (Applied RL) (&lt;a href="https://horizonrl.com" rel="noopener noreferrer"&gt;https://horizonrl.com&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;These are a few of the frameworks and projects built on top of TensorFlow and PyTorch. You can find more on GitHub and the official websites of TensorFlow and PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;COMPARING PYTORCH AND TENSORFLOW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key difference between PyTorch and TensorFlow is the way they execute code. Both frameworks work on the same fundamental data type: the tensor. You can imagine a tensor as a multi-dimensional array, as shown in the picture below.&lt;/p&gt;
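&lt;p&gt;As a rough sketch in plain Python (nested lists stand in for the optimized tensor types PyTorch and TensorFlow actually use), a tensor’s rank is simply its number of dimensions:&lt;/p&gt;

```python
# A tensor is an n-dimensional array. Nested Python lists can
# illustrate the idea of "rank" (the number of dimensions).
scalar = 5.0                      # rank 0: a single number
vector = [1.0, 2.0, 3.0]          # rank 1: a 1-D array
matrix = [[1.0, 2.0],
          [3.0, 4.0]]             # rank 2: a 2-D array

def rank(t):
    """Count nesting depth, i.e. the number of dimensions."""
    depth = 0
    while isinstance(t, list):
        depth += 1
        t = t[0]
    return depth

print(rank(scalar), rank(vector), rank(matrix))  # 0 1 2
```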

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxsvbplxxod2a7txar0e5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxsvbplxxod2a7txar0e5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MECHANISM: DYNAMIC VS STATIC GRAPH DEFINITION&lt;/strong&gt;&lt;br&gt;
TensorFlow is a framework composed of two core building blocks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A library for defining computational graphs and a runtime for executing such graphs on a variety of different hardware.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A computational graph, which has many advantages (but more on that in just a moment).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A computational graph is an abstract way of describing computations as a directed graph: a data structure consisting of nodes (vertices) connected pairwise by directed edges.&lt;/p&gt;

&lt;p&gt;When you run code in TensorFlow, the computation graphs are defined statically. All communication with the outer world is performed via the tf.Session object and tf.placeholder tensors, which are substituted with external data at runtime. For example, consider the following code snippet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foia6s01vygple1vzzrr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Foia6s01vygple1vzzrr5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how a computational graph is generated statically, before the code is run, in TensorFlow. The core advantage of a computational graph is that it allows parallelism and dependency-driven scheduling, which makes training faster and more efficient.&lt;/p&gt;
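&lt;p&gt;To make the define-then-run idea concrete without depending on TensorFlow itself, here is a minimal pure-Python sketch; the Placeholder, Add and run names are hypothetical stand-ins for tf.placeholder, graph operations and tf.Session.run:&lt;/p&gt;

```python
class Placeholder:
    """A named slot for data supplied at run time (like tf.placeholder)."""
    def __init__(self, name):
        self.name = name

class Add:
    """A graph node that describes a computation without performing it."""
    def __init__(self, a, b):
        self.a, self.b = a, b

def run(node, feed):
    """Walk the graph and compute a value (like tf.Session.run)."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    if isinstance(node, Add):
        return run(node.a, feed) + run(node.b, feed)
    return node  # a plain constant

x = Placeholder("x")
y = Placeholder("y")
z = Add(x, y)                     # graph defined; nothing computed yet
print(run(z, {"x": 2, "y": 3}))   # 5
```

The point of the sketch is the separation: building `z` costs nothing, and the same graph can be run repeatedly with different fed-in data.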

&lt;p&gt;View the rest, with more code examples, on BuiltIn.com: &lt;a href="https://builtin.com/data-science/pytorch-vs-tensorflow" rel="noopener noreferrer"&gt;PYTORCH VS. TENSORFLOW: WHICH FRAMEWORK IS BEST FOR YOUR DEEP LEARNING PROJECT?&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Recurrent neural networks: The powerhouse of language modeling</title>
      <dc:creator>Adam</dc:creator>
      <pubDate>Sun, 09 Jun 2019 15:42:31 +0000</pubDate>
      <link>https://forem.com/adamcalica/recurrent-neural-networks-the-powerhouse-of-language-modeling-gd7</link>
      <guid>https://forem.com/adamcalica/recurrent-neural-networks-the-powerhouse-of-language-modeling-gd7</guid>
      <description>&lt;p&gt;During the spring semester of my junior year in college, I had the opportunity to study abroad in Copenhagen, Denmark. I had never been to Europe before that, so I was incredibly excited to immerse myself into a new culture, meet new people, travel to new places, and, most important, encounter a new language. Now although English is not my native language (Vietnamese is), I have learned and spoken it since early childhood, making it second-nature. Danish, on the other hand, is an incredibly complicated language with a very different sentence and grammatical structure. Before my trip, I tried to learn a bit of Danish using the app Duolingo; however, I only got a hold of simple phrases such as Hello (Hej) and Good Morning (God Morgen).&lt;/p&gt;

&lt;p&gt;When I got there, I had to go to the grocery store to buy food. Well, all the labels there were in Danish, and I couldn’t seem to discern them. After a long half hour struggling to find the difference between whole grain and wheat breads, I realized that I had installed Google Translate on my phone not long ago. I took out my phone, opened the app, pointed the camera at the labels… and voila, those Danish words were translated into English instantly. Turns out that Google Translate can translate words from whatever the camera sees, whether it is a street sign, restaurant menu, or even handwritten digits. Needless to say, the app saved me a ton of time while I was studying abroad.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cU4KRg1---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bk0hkifdev9cfqd1r5ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cU4KRg1---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bk0hkifdev9cfqd1r5ld.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;Google Translate is a product developed by the &lt;/span&gt;&lt;a href="https://research.google.com/pubs/NaturalLanguageProcessing.html"&gt;&lt;u&gt;Natural Language Processing Research Group&lt;/u&gt;&lt;/a&gt;&lt;span&gt; at Google. This group focuses on algorithms that apply at scale across languages and across domains. Their work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;At a broader level, NLP sits at the intersection of computer science, artificial intelligence and linguistics. The goal is for computers to process or “understand” natural language in order to perform useful tasks, such as sentiment analysis, language translation and question answering. Fully understanding and representing the meaning of language is a very difficult goal; it has been estimated that perfect language understanding is only achievable by an AI-complete system. The first step in learning about NLP is the concept of &lt;em&gt;language modeling&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;&lt;span&gt;Language Modeling&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;Language Modeling is the task of predicting what word comes next. For example, given the sentence “I am writing a …”, the word coming next can be “letter”, “sentence”, “blog post” … More formally, given a sequence of words &lt;/span&gt;&lt;em&gt;x(1), x(2), …, x(t),&lt;/em&gt;&lt;span&gt; language models compute the probability distribution of the next word &lt;/span&gt;&lt;em&gt;x(t+1).&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;The most fundamental language model is the &lt;/span&gt;&lt;strong&gt;n-gram model&lt;/strong&gt;&lt;span&gt;. An n-gram is a chunk of n consecutive words. For example, given the sentence “I am writing a …”, then here are the respective n-grams:&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;strong&gt;unigrams:&lt;/strong&gt;&lt;span&gt; “I”, “am”, “writing”, “a”&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;strong&gt;bigrams:&lt;/strong&gt;&lt;span&gt; “I am”, “am writing”, “writing a”&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;strong&gt;trigrams:&lt;/strong&gt;&lt;span&gt; “I am writing”, “am writing a”&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;strong&gt;4-grams:&lt;/strong&gt;&lt;span&gt; “I am writing a”&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The basic idea behind n-gram language modeling is to collect statistics about how frequent different n-grams are, and use these to predict the next word. However, n-gram language models suffer from the sparsity problem: we do not observe enough data in a corpus to model language accurately (especially as n increases).&lt;/p&gt;
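&lt;p&gt;A minimal bigram model can be built in a few lines of plain Python; the toy corpus below is made up for illustration:&lt;/p&gt;

```python
from collections import Counter, defaultdict

corpus = "i am writing a letter . i am writing a blog post .".split()

# Count how often each word follows each preceding word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the next word, estimated from counts."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("am"))   # {'writing': 1.0}
print(next_word_probs("a"))    # {'letter': 0.5, 'blog': 0.5}
```

The sparsity problem is visible even here: any bigram absent from the tiny corpus gets probability zero, no matter how plausible it is.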

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lMkYyznU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1o0wihnkp477389l8yji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lMkYyznU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1o0wihnkp477389l8yji.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of the n-gram approach, we can try a &lt;strong&gt;window-based neural language model&lt;/strong&gt;, such as &lt;em&gt;feed-forward neural probabilistic language models&lt;/em&gt; and &lt;em&gt;recurrent neural network language models&lt;/em&gt;. This approach solves the data sparsity problem by representing words as vectors (word embeddings) and using them as inputs to a neural language model. The parameters are learned as part of the training process. Word embeddings obtained through neural language models exhibit the property whereby semantically close words are likewise close in the induced vector space. Moreover, recurrent neural language models can also capture contextual information at the sentence level, corpus level and subword level.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;&lt;span&gt;&lt;strong&gt;Recurrent Neural Net Language Model&lt;/strong&gt;&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;The idea behind RNNs is to make use of sequential information. RNNs are called &lt;strong&gt;recurrent&lt;/strong&gt; because they perform the same task for every element of a sequence, with the output dependent on previous computations. Theoretically, RNNs can make use of information in arbitrarily long sequences, but empirically they are limited to looking back only a few steps. This capability allows RNNs to solve tasks such as unsegmented, connected handwriting recognition or speech recognition.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;Let’s try an analogy. Suppose you are watching &lt;/span&gt;&lt;em&gt;Avengers: Infinity War&lt;/em&gt;&lt;span&gt; (by the way, a phenomenal movie). There are so many superheroes and multiple story plots happening in the movie, which may confuse many viewers who don’t have prior knowledge about the Marvel Cinematic Universe. However, you have the context of what’s going on because you have seen the previous Marvel series in chronological order (Iron Man, Thor, Hulk, Captain America, Guardians of the Galaxy) to be able to relate and connect everything correctly. It means that you remember everything that you have watched to make sense of the chaos happening in Infinity War.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YRdZEa6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3mb7j90vyntvn4wkbp2p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YRdZEa6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3mb7j90vyntvn4wkbp2p.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, an RNN remembers everything. In other neural networks, all the inputs are independent of each other, but in an RNN all the inputs are related to each other. Say you have to predict the next word in a given sentence: the relationships among all the previous words help to predict a better output. In other words, the RNN remembers all of these relationships while training itself.&lt;/p&gt;

&lt;p&gt;An RNN remembers what it knows from previous inputs using a simple loop. This loop takes the information from the previous time step and adds it to the input at the current time step. The figure below shows the basic RNN structure. At a particular time step &lt;strong&gt;t&lt;/strong&gt;, &lt;strong&gt;X(t)&lt;/strong&gt; is the input to the network and &lt;strong&gt;h(t)&lt;/strong&gt; is the output of the network. &lt;strong&gt;A&lt;/strong&gt; is the RNN cell, which contains neural networks just like a feed-forward net.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Gz6Cpgg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/sx5cmacmh9rrg5icktfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Gz6Cpgg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/sx5cmacmh9rrg5icktfh.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loop structure allows the neural network to take in a sequence of inputs. The unrolled version below makes this easier to see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jSTjpdkr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mybz1vun0i7vbk1utk2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jSTjpdkr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mybz1vun0i7vbk1utk2i.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the RNN takes &lt;strong&gt;X(0)&lt;/strong&gt; from the sequence of inputs and outputs &lt;strong&gt;h(0)&lt;/strong&gt;, which together with &lt;strong&gt;X(1)&lt;/strong&gt; forms the input for the next step. Next, &lt;strong&gt;h(1)&lt;/strong&gt; from that step, together with &lt;strong&gt;X(2)&lt;/strong&gt;, is the input for the step after, and so on. Through this recursion, the RNN keeps remembering the context while training.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;If you are a math nerd, many RNNs use the equation below to define the values of their hidden units:&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9VHPOJlr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kyeil4gjxjgade0zy4zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9VHPOJlr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kyeil4gjxjgade0zy4zn.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;strong&gt;h(t)&lt;/strong&gt; is the hidden state at time step t, &lt;strong&gt;∅&lt;/strong&gt; is the activation function (usually tanh or sigmoid), &lt;strong&gt;W&lt;/strong&gt; is the weight matrix from the input to the hidden layer at time step t, &lt;strong&gt;X(t)&lt;/strong&gt; is the input at time step t, &lt;strong&gt;U&lt;/strong&gt; is the weight matrix from the hidden layer at time t-1 to the hidden layer at time t, and &lt;strong&gt;h(t-1)&lt;/strong&gt; is the hidden state at time step t-1.&lt;/p&gt;

&lt;p&gt;The RNN learns the weights &lt;strong&gt;U&lt;/strong&gt; and &lt;strong&gt;W&lt;/strong&gt; through training using back propagation. These weights decide the importance of the hidden state from the previous time step versus the importance of the current input. Essentially, they decide how much of the previous hidden state and how much of the current input should be used to generate the current output. The activation function &lt;strong&gt;∅&lt;/strong&gt; adds non-linearity to the RNN, letting it model relationships that a purely linear map could not, while keeping the gradients needed for back propagation easy to compute.&lt;/p&gt;
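&lt;p&gt;The recurrence above can be sketched in a few lines of plain Python; the scalar weights W and U below are illustrative made-up values, not trained parameters:&lt;/p&gt;

```python
import math

# One-unit RNN: scalar weights keep the arithmetic easy to follow.
W, U = 0.5, 0.8   # illustrative values, not trained weights

def step(x_t, h_prev):
    """h(t) = tanh(W * X(t) + U * h(t-1))"""
    return math.tanh(W * x_t + U * h_prev)

h = 0.0                       # initial hidden state h(-1)
for x in [1.0, 0.5, -1.0]:    # a short input sequence X(0), X(1), X(2)
    h = step(x, h)            # each step feeds the previous state back in
print(h)
```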

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;&lt;span&gt;&lt;strong&gt;RNN Disadvantage&lt;/strong&gt;&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;RNNs are not perfect. They suffer from a major drawback, known as the &lt;strong&gt;vanishing gradient problem&lt;/strong&gt;, which limits their accuracy. As the context length increases, the layers in the unrolled RNN also increase. Consequently, as the network becomes deeper, the gradients flowing back in the back propagation step become smaller. As a result, learning becomes very slow, and it is infeasible to capture the long-term dependencies of the language. In other words, RNNs have difficulty memorizing words from far back in the sequence and make predictions based only on the most recent words.&lt;/p&gt;
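&lt;p&gt;A back-of-the-envelope sketch shows why: back propagation through time multiplies many local derivatives together, and when each factor is below 1 the product shrinks exponentially with sequence length. The numbers below are purely illustrative:&lt;/p&gt;

```python
def tanh_grad(h):
    """Derivative of tanh expressed via its output: 1 - tanh(x)^2."""
    return 1.0 - h * h

# Backpropagating through many time steps multiplies many small
# local gradients together, so the total gradient collapses to ~0.
grad = 1.0
h = 0.9    # a typical saturated hidden activation (illustrative)
for _ in range(30):
    grad *= tanh_grad(h) * 0.8   # 0.8 stands in for the recurrent weight U
print(grad)
```

After only 30 steps the surviving gradient is astronomically small, so weights tied to early inputs barely get updated.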

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ax01tBTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vlw6b52twy1vmf0fnvd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ax01tBTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vlw6b52twy1vmf0fnvd2.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3&gt;&lt;span&gt;&lt;strong&gt;RNN Extensions&lt;/strong&gt;&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;&lt;span&gt;Over the years, researchers have developed more sophisticated types of RNNs to deal with this shortcoming of the standard RNN model. Let’s briefly go over the most important ones:&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;a href="https://ai.intel.com/wp-content/uploads/sites/53/2017/06/BRNN.pdf"&gt;&lt;strong&gt;&lt;u&gt;Bidirectional RNNs&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;span&gt; are simply composed of 2 RNNs stacking on top of each other. The output is then composed based on the hidden state of both RNNs. The idea is that the output may not only depend on previous elements in the sequence but also on future elements.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;a href="http://www.felixgers.de/papers/phd.pdf"&gt;&lt;strong&gt;&lt;u&gt;Long Short-Term Memory Networks&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;span&gt; are quite popular these days. They inherit the exact architecture from standard RNNs, with the exception of the hidden state. The memory in LSTMs (called &lt;/span&gt;&lt;strong&gt;cells&lt;/strong&gt;&lt;span&gt;) take as input the previous state and the current input. Internally, these cells decide what to keep in and what to eliminate from the memory. Then, they combine the previous state, the current memory, and the input. This process efficiently solves the vanishing gradient problem.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;a href="https://pdfs.semanticscholar.org/2d9e/3f53fcdb548b0b3c4d4efb197f164fe0c381.pdf"&gt;&lt;strong&gt;&lt;u&gt;Gated Recurrent Unit Networks&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;span&gt; extends LSTM with a gating network generating signals that act to control how the present input and previous memory work to update the current activation, and thereby the current network state. Gates are themselves weighted and are selectively updated according to an algorithm.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;
    &lt;p&gt;&lt;span&gt;&lt;a href="https://arxiv.org/pdf/1410.5401.pdf"&gt;&lt;strong&gt;&lt;u&gt;Neural Turing Machines&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; &lt;/strong&gt;&lt;span&gt;extend the capabilities of standard RNNs by coupling them to external memory resources, which they can interact with through attention processes. The analogy is that of Alan Turing’s enrichment of finite-state machines by an infinite memory tape.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
    &lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;span&gt;&lt;span&gt;Read more about &lt;/span&gt;&lt;a href="https://builtin.com/data-science/recurrent-neural-networks-powerhouse-language-modeling"&gt;&lt;u&gt;Recurrent Neural Networks&lt;/u&gt;&lt;/a&gt;&lt;span&gt; on Built In. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Deep learning with Python: The human brain imitation.</title>
      <dc:creator>Adam</dc:creator>
      <pubDate>Fri, 07 Jun 2019 13:05:26 +0000</pubDate>
      <link>https://forem.com/adamcalica/deep-learning-with-python-the-human-brain-imitation-3n1b</link>
      <guid>https://forem.com/adamcalica/deep-learning-with-python-the-human-brain-imitation-3n1b</guid>
      <description>&lt;p&gt;The main idea behind &lt;a href="https://builtin.com/artificial-intelligence/deep-learning" rel="noopener noreferrer"&gt;deep learning&lt;/a&gt; is that artificial intelligence should draw inspiration from the brain. This perspective gave rise to the "neural network” terminology. The brain contains billions of neurons with tens of thousands of connections between them. Deep learning algorithms resemble the brain in many conditions, as both the brain and deep learning models involve a vast number of computation units (neurons) that are not extraordinarily intelligent in isolation but become intelligent when they interact with each other.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;I think people need to understand that deep learning is making a lot of things, behind-the-scenes, much better. Deep learning is already working in Google search, and in image search; it allows you to image search a term like “hug.”— Geoffrey Hinton&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h2 id="811b"&gt;Neurons&lt;/h2&gt;

&lt;p&gt;The basic building blocks of neural networks are artificial neurons, which imitate neurons in the human brain. These are simple, powerful computational units that take weighted input signals and produce an output signal using an activation function. These neurons are spread across several layers in the neural network.&lt;/p&gt;

&lt;p&gt;Below is an image of how a neuron is imitated in a neural network. The neuron takes in inputs, each carried over a connection with a particular weight. The activation function then introduces non-linearity, mapping the weighted sum into the region from which the output is estimated. &lt;/p&gt;
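&lt;p&gt;A single artificial neuron can be sketched in plain Python; the inputs, weights and bias below are made-up values for illustration:&lt;/p&gt;

```python
import math

def sigmoid(z):
    """A common activation function squashing any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

out = neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.1)
print(out)
```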

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feq87ieske9v3vlfbjwzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feq87ieske9v3vlfbjwzl.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="49b3"&gt;How Do Artificial Neural Network Works?&lt;/h2&gt;

&lt;p&gt;Deep learning consists of artificial neural networks that are modeled on similar networks present in the human brain. As data travels through this artificial mesh, each layer processes an aspect of the data, filters outliers, spots familiar entities, and produces the final output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbuiltin.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Fckeditor_optimize%2Fpublic%2Finline-images%2Fartificial-neural-network-deep-learning.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbuiltin.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Fckeditor_optimize%2Fpublic%2Finline-images%2Fartificial-neural-network-deep-learning.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input layer: &lt;/strong&gt;This layer consists of neurons that do nothing other than receive the inputs and pass them on to the other layers. The number of neurons in the input layer should equal the number of attributes or features in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output layer: &lt;/strong&gt;The output layer yields the predicted feature; its form basically depends on the type of model you’re building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden layer: &lt;/strong&gt;Between the input and output layers there are hidden layers, depending on the type of model. Hidden layers contain a vast number of neurons, which apply transformations to the inputs before passing them on. As the network is trained, the weights are updated to be more predictive.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h3 id="998f"&gt;&lt;strong&gt;Neuron Weights&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Weights refer to the strength or amplitude of a connection between two neurons. If you are familiar with linear regression, you can compare weights on inputs to the coefficients we use in a regression equation. Weights are often initialized to small random values, such as values in the range 0 to 1.&lt;/p&gt;


&lt;h3 id="5fad"&gt;&lt;strong&gt;Feedforward Deep Networks&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Feedforward supervised neural networks were among the first and most successful learning algorithms. They are also called deep networks, multi-layer perceptrons (MLP), or simply neural networks; the vanilla architecture has a single hidden layer. Each neuron is connected to other neurons with some weight.&lt;/p&gt;

&lt;p&gt;The network processes the input upward, &lt;strong&gt;activating&lt;/strong&gt; neurons as it goes, to finally produce an output value. This is called a forward pass on the network.&lt;/p&gt;
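&lt;p&gt;As a rough sketch of a forward pass (a toy NumPy example, not tied to TensorFlow or PyTorch; the layer sizes and the sigmoid activation are arbitrary choices for illustration):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Vanilla MLP: 3 input features -> 4 hidden neurons -> 1 output.
# Weights start as small random values in the range 0 to 1.
W1 = rng.uniform(0.0, 1.0, size=(3, 4))
b1 = np.zeros(4)
W2 = rng.uniform(0.0, 1.0, size=(4, 1))
b2 = np.zeros(1)

def forward(x):
    """One forward pass: each layer transforms its input and
    passes the activations upward to the next layer."""
    hidden = sigmoid(x @ W1 + b1)     # hidden-layer activations
    return sigmoid(hidden @ W2 + b2)  # output value

y = forward(np.array([0.5, -1.2, 3.0]))
print(y.shape)  # (1,)
```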

&lt;p&gt;The image below depicts how data passes through the series of layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbuiltin.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Fckeditor_optimize%2Fpublic%2Finline-images%2FFeedforward%2520Deep%2520Networks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbuiltin.com%2Fsites%2Fdefault%2Ffiles%2Fstyles%2Fckeditor_optimize%2Fpublic%2Finline-images%2FFeedforward%2520Deep%2520Networks.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="ac23"&gt;&lt;strong&gt;Activation Function&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;An activation function maps the summed weighted input to the output of the neuron. It is called an activation/transfer function because it governs the threshold at which the neuron is activated and the strength of the output signal.&lt;/p&gt;

&lt;p&gt;Mathematically, output = activation(sum(weights × inputs) + bias).&lt;/p&gt;

&lt;p&gt;There are several activation functions used for different use cases. The most commonly used are ReLU, tanh and softmax. A cheat sheet of activation functions is given below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fs2ycjew4t8c75gbxsrd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fs2ycjew4t8c75gbxsrd2.png"&gt;&lt;/a&gt;&lt;/p&gt;
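&lt;p&gt;For a hands-on feel, the three functions named above can be written in a few lines of NumPy (a quick sketch, independent of any framework):&lt;/p&gt;

```python
import numpy as np

def relu(z):
    # Clips negative inputs to zero, passes positives through.
    return np.maximum(0.0, z)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(np.tanh(z))   # squashed into (-1, 1)
print(softmax(z))   # probabilities summing to 1
```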

&lt;h3 id="b530"&gt;&lt;strong&gt;BackPropagation&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The predicted value of the network is compared to the expected output, and an error is calculated using a loss function. This error is then propagated back through the whole network, one layer at a time, and the weights are updated according to the amount they contributed to the error. This clever bit of math is called the &lt;strong&gt;backpropagation algorithm&lt;/strong&gt;. The process is repeated for all of the examples in your training data. One round of updating the network for the entire training dataset is called an epoch. A network may be trained for tens, hundreds or many thousands of epochs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbzt9s46ltnjvv446oige.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbzt9s46ltnjvv446oige.png"&gt;&lt;/a&gt;&lt;/p&gt;
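&lt;p&gt;The loop below sketches these ideas on a toy problem: a single sigmoid neuron learning the OR function with a squared-error loss. The data, learning rate and epoch count are made up purely for illustration:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy data: inputs and the expected OR output for each example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [1]], dtype=float)

w = rng.uniform(0.0, 1.0, size=(2, 1))
b = np.zeros(1)

for epoch in range(5000):            # one epoch = one pass over the data
    pred = sigmoid(X @ w + b)        # forward pass
    error = pred - targets           # compare prediction to expected output
    # Backpropagate: each weight moves in proportion to its
    # contribution to the error (the gradient of a squared-error loss).
    grad = error * pred * (1.0 - pred)
    w -= X.T @ grad
    b -= grad.sum()

print(np.round(sigmoid(X @ w + b)).ravel())  # [0. 1. 1. 1.]
```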

&lt;h3 id="ee5e"&gt;&lt;strong&gt;Cost Function and Gradient Descent&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The cost function is the measure of “how good” a neural network did for its given training input and the expected output. It may also depend on attributes such as weights and biases.&lt;/p&gt;

&lt;p&gt;A cost function is single-valued, not a vector because it rates how well the neural network performed as a whole. Using the gradient descent optimization algorithm, the weights are updated incrementally after each epoch.&lt;/p&gt;
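&lt;p&gt;The idea is easiest to see on a one-parameter toy cost (pure illustration; real networks have many parameters, but the update rule has the same shape):&lt;/p&gt;

```python
def cost(w):
    # A single value rating how good the parameter is; minimized at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for epoch in range(100):
    w -= learning_rate * gradient(w)  # incremental update each epoch

print(round(w, 4))  # 3.0
```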

&lt;p&gt;Read the rest with &lt;a href="https://builtin.com/data-science/deep-learning-python" rel="noopener noreferrer"&gt;deep learning Python examples and libraries here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A tour of the top 10 algorithms for machine learning newbies</title>
      <dc:creator>Adam</dc:creator>
      <pubDate>Mon, 03 Jun 2019 18:14:23 +0000</pubDate>
      <link>https://forem.com/adamcalica/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-155f</link>
      <guid>https://forem.com/adamcalica/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-155f</guid>
      <description>&lt;p&gt;In &lt;a href="https://builtin.com/data-science/introduction-to-machine-learning"&gt;machine learning&lt;/a&gt;, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling).&lt;/p&gt;

&lt;p&gt;For example, you can’t say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset.&lt;/p&gt;

&lt;p&gt;As a result, you should try many different algorithms for your problem, while using a hold-out “test set” of data to evaluate performance and select the winner.&lt;/p&gt;

&lt;p&gt;Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn’t bust out a shovel and start digging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dh52-UKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/okos5h4ivf7rp8myhvdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dh52-UKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/okos5h4ivf7rp8myhvdf.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE BIG PRINCIPLE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;However, there is a common principle that underlies all supervised machine learning algorithms for predictive modeling.&lt;/p&gt;

&lt;p&gt;Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y): Y = f(X)&lt;/p&gt;

&lt;p&gt;This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or its form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.&lt;/p&gt;

&lt;p&gt;The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.&lt;/p&gt;

&lt;p&gt;For machine learning newbies who are eager to understand the basics of machine learning, here is a quick tour of the top 10 machine learning algorithms used by data scientists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 — Linear Regression&lt;/strong&gt;&lt;br&gt;
Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.&lt;/p&gt;

&lt;p&gt;Predictive modeling is primarily concerned with minimizing the error of a model, or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics, and use them toward these ends.&lt;/p&gt;

&lt;p&gt;The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cNTZgEta--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bvlhnqzr7kdpbm0zr5au.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cNTZgEta--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/bvlhnqzr7kdpbm0zr5au.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example: y = B0 + B1 * x&lt;/p&gt;

&lt;p&gt;We will predict y given the input x and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.&lt;/p&gt;

&lt;p&gt;Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.&lt;/p&gt;

&lt;p&gt;Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible. It is a fast and simple technique and good first algorithm to try.&lt;/p&gt;
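&lt;p&gt;A minimal example of the linear algebra solution for ordinary least squares (a NumPy sketch; the data is made up and noise-free so the coefficients recover exactly):&lt;/p&gt;

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                  # generated with B0 = 1, B1 = 2

# Design matrix [1, x]: the column of ones carries the intercept B0.
A = np.column_stack([np.ones_like(x), x])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
B0, B1 = coeffs

print(round(B0, 2), round(B1, 2))  # 1.0 2.0
```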

&lt;p&gt;&lt;strong&gt;2 — Logistic Regression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).&lt;/p&gt;

&lt;p&gt;Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.&lt;/p&gt;

&lt;p&gt;The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. if the output is less than 0.5, predict 0; otherwise predict 1) and predict a class value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_d6UcSpo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gavi8uwy78wuvtkvk4wn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_d6UcSpo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gavi8uwy78wuvtkvk4wn.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful for problems where you need to give more rationale for a prediction.&lt;/p&gt;

&lt;p&gt;Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It’s a fast model to learn and effective on binary classification problems.&lt;/p&gt;
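&lt;p&gt;A sketch of the prediction step (the coefficients B0 and B1 here are hypothetical, as if already learned; the point is the logistic transform and the 0.5 rule):&lt;/p&gt;

```python
import math

def logistic(z):
    # The S-shaped curve: squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

B0, B1 = -4.0, 2.0   # hypothetical learned coefficients

def predict(x, threshold=0.5):
    p = logistic(B0 + B1 * x)           # probability of class 1
    label = 1 if p >= threshold else 0  # snap to a class value
    return label, p

label, p = predict(3.0)
print(label, round(p, 3))  # 1 0.881
```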

&lt;p&gt;&lt;strong&gt;3 — Linear Discriminant Analysis&lt;/strong&gt;&lt;br&gt;
Logistic Regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.&lt;/p&gt;

&lt;p&gt;The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:&lt;/p&gt;

&lt;p&gt;The mean value for each class.&lt;/p&gt;

&lt;p&gt;The variance calculated across all classes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9wiFQ99V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mc9l68quxz6ms4pwyihp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9wiFQ99V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mc9l68quxz6ms4pwyihp.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modeling problems.&lt;/p&gt;
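&lt;p&gt;For a single input variable, those class statistics give a working classifier in a few lines (a NumPy sketch with made-up data and equal priors; real libraries handle multiple variables for you):&lt;/p&gt;

```python
import numpy as np

# Two classes of a single input variable (made-up samples).
x0 = np.array([1.0, 1.5, 2.0])   # class 0
x1 = np.array([4.0, 4.5, 5.0])   # class 1

mu = [x0.mean(), x1.mean()]      # the mean value for each class
# Variance calculated across all classes (pooled, after centering).
var = np.concatenate([x0 - mu[0], x1 - mu[1]]).var()
prior = [0.5, 0.5]               # equal class priors for this sketch

def discriminant(x, k):
    # Linear discriminant value for class k (Gaussian assumption).
    return x * mu[k] / var - mu[k] ** 2 / (2.0 * var) + np.log(prior[k])

def predict(x):
    # Predict the class with the largest discriminant value.
    return int(np.argmax([discriminant(x, k) for k in range(2)]))

print(predict(1.8), predict(4.2))  # 0 1
```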

&lt;p&gt;&lt;strong&gt;4 — CLASSIFICATION AND REGRESSION TREES&lt;/strong&gt;&lt;br&gt;
Decision Trees are an important type of algorithm for predictive modeling machine learning.&lt;/p&gt;

&lt;p&gt;The representation of the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6F2q348Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ij6wggk19mtvqab07yyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6F2q348Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ij6wggk19mtvqab07yyk.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.&lt;/p&gt;

&lt;p&gt;Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.&lt;/p&gt;
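&lt;p&gt;The prediction walk is simple enough to sketch by hand (the tree below is hard-coded with made-up variables and split points; in practice the splits are learned from data):&lt;/p&gt;

```python
# A hand-built binary tree matching the description above: each internal
# node tests one numeric input variable against a split point, and each
# leaf holds the class value to output.
tree = {
    "var": 0, "split": 2.5,                    # if x[0] < 2.5 go left
    "left":  {"leaf": "class_A"},
    "right": {"var": 1, "split": 1.0,          # else test x[1]
              "left":  {"leaf": "class_B"},
              "right": {"leaf": "class_C"}},
}

def predict(node, x):
    """Walk the splits until a leaf, then output its class value."""
    while "leaf" not in node:
        node = node["left"] if x[node["var"]] < node["split"] else node["right"]
    return node["leaf"]

print(predict(tree, [1.0, 5.0]))  # class_A
print(predict(tree, [3.0, 0.5]))  # class_B
```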

&lt;p&gt;&lt;strong&gt;5 — NAIVE BAYES&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.&lt;/p&gt;

&lt;p&gt;The model is comprised of two types of probabilities that can be calculated directly from your training data: 1) The probability of each class; and 2) The conditional probability for each class given each x value. Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem. When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WJ4Iux-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3owbkbq21sfa8i41htct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WJ4Iux-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3owbkbq21sfa8i41htct.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data, nevertheless, the technique is very effective on a large range of complex problems.&lt;/p&gt;
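&lt;p&gt;Here is a toy Gaussian Naive Bayes sketch for one input variable (the class names, priors, means and standard deviations are invented for illustration):&lt;/p&gt;

```python
import math

# 1) The probability of each class, and 2) the conditional probability
# of the input given each class, modeled as a Gaussian (toy numbers).
prior = {"spam": 0.4, "ham": 0.6}
mean  = {"spam": 8.0, "ham": 2.0}   # e.g. links per email
std   = {"spam": 2.0, "ham": 1.0}

def gaussian(x, mu, sigma):
    # Gaussian (bell curve) probability density.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def predict(x):
    # Bayes Theorem, unnormalized: P(class) * P(x | class).
    scores = {c: prior[c] * gaussian(x, mean[c], std[c]) for c in prior}
    return max(scores, key=scores.get)

print(predict(7.0), predict(1.5))  # spam ham
```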

&lt;p&gt;View the rest at BuiltIn.com:&lt;br&gt;
&lt;a href="https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies"&gt;https://builtin.com/data-science/tour-top-10-algorithms-machine-learning-newbies&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>An easy introduction to natural language processing with Python</title>
      <dc:creator>Adam</dc:creator>
      <pubDate>Sat, 01 Jun 2019 13:32:00 +0000</pubDate>
      <link>https://forem.com/adamcalica/an-easy-introduction-to-natural-language-processing-1jfb</link>
      <guid>https://forem.com/adamcalica/an-easy-introduction-to-natural-language-processing-1jfb</guid>
      <description>&lt;p&gt;Computers are great at working with standardized and structured data like database tables and financial records. They are able to process that data much faster than we humans can. But us humans don’t communicate in “structured data” nor do we speak binary! We communicate using words, a form of unstructured data.&lt;/p&gt;

&lt;p&gt;Unfortunately, computers suck at working with unstructured data because there are no standardized techniques to process it. When we program computers using something like C++, Java, or Python, we are essentially giving the computer a set of rules that it should operate by. With unstructured data, these rules are quite abstract and challenging to define concretely.&lt;/p&gt;

&lt;p&gt;There’s a lot of unstructured natural language on the internet; sometimes even Google doesn’t know what you’re searching for!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HUMAN VS COMPUTER UNDERSTANDING OF LANGUAGE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Humans have been writing things down for thousands of years. Over that time, our brains have gained a tremendous amount of experience in understanding natural language. When we read something written on a piece of paper or in a blog post on the internet, we understand what that thing really means in the real world. We feel the emotions that reading it elicits, and we often visualize how it would look in real life.&lt;/p&gt;

&lt;p&gt;Natural language processing (NLP) is a sub-field of &lt;a href="https://builtin.com/artificial-intelligence"&gt;artificial intelligence&lt;/a&gt; that is focused on enabling computers to understand and process human languages, to get computers closer to a human-level understanding of language. Computers don’t yet have the same intuitive understanding of natural language that humans do. They can’t really understand what the language is really trying to say. In a nutshell, a computer can’t read between the lines.&lt;/p&gt;

&lt;p&gt;That being said, recent advances in &lt;a href="https://builtin.com/data-science/introduction-to-machine-learning"&gt;machine learning&lt;/a&gt; have enabled computers to do quite a lot of useful things with natural language! Deep learning has enabled us to write programs to perform things like language translation, semantic understanding, and text summarization. All of these things add real-world value, making it easy for you to understand and perform computations on large blocks of text without the manual effort.&lt;/p&gt;

&lt;p&gt;Let’s start with a quick primer on how NLP works conceptually. Afterwards we’ll dive into some Python code so you can get started with NLP yourself!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE REAL REASON WHY NLP IS HARD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process of reading and understanding language is far more complex than it seems at first glance. There are many things that go in to truly understanding what a piece of text means in the real-world. For example, what do you think the following piece of text means?&lt;/p&gt;

&lt;p&gt;“Steph Curry was on fire last night. He totally destroyed the other team”&lt;/p&gt;

&lt;p&gt;To a human it’s probably quite obvious what this sentence means. We know Steph Curry is a basketball player; or even if you don’t we know that he plays on some kind of team, probably a sports team. When we see “on fire” and “destroyed” we know that it means Steph Curry played really well last night and beat the other team.&lt;/p&gt;

&lt;p&gt;Computers tend to take things a bit too literally. Viewing things literally like a computer, we would see “Steph Curry” and, based on the capitalization, assume it’s a person, place, or otherwise important thing, which is great! But then we see that Steph Curry “was on fire”… A computer might tell you that someone literally lit Steph Curry on fire last night! Yikes. After that, the computer might say that Mr. Curry has physically destroyed the other team… they no longer exist, according to this computer.&lt;/p&gt;
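&lt;p&gt;You can see how shallow the literal view is with a naive capitalization rule (a pure-Python toy; real NLP libraries use trained statistical models instead of rules like this):&lt;/p&gt;

```python
sentence = "Steph Curry was on fire last night. He totally destroyed the other team"

# Naive rule: runs of two or more capitalized words form a candidate
# entity; lone capitalized words ("He", sentence starts) are ambiguous.
entities, current = [], []
for token in sentence.split():
    word = token.strip(".,!?")
    if word[:1].isupper():
        current.append(word)
    else:
        if len(current) > 1:
            entities.append(" ".join(current))
        current = []
if len(current) > 1:
    entities.append(" ".join(current))

print(entities)  # ['Steph Curry']
```

This picks out “Steph Curry” but has no idea what the phrase refers to, let alone that “on fire” is figurative.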

&lt;p&gt;The rest of this article with Python tutorials can be found here: &lt;a href="https://builtin.com/data-science/easy-introduction-natural-language-processing"&gt;https://builtin.com/data-science/easy-introduction-natural-language-processing&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
