Jeremy Jordan

Understanding the Transformer architecture for neural networks

Jeremy Jordan — Sat, 06 May 2023 18:17:57 GMT

In a previous post, we discussed a key innovation in sequence to sequence neural network architectures: the attention mechanism. This innovation significantly improved performance for encoder-decoder style recurrent neural network architectures by removing a bottleneck passing information from the encoder network to the decoder network. As a quick recap, before the introduction of attention we would typically summarize the entire input sequence in a single fixed-size vector to pass along as context to the decoder network. The attention mechanism allows us to merge a variable-length sequence of vectors into a fixed-size context vector at each time step in the decoder network. This provides significantly more information bandwidth to the decoder via a learnable mechanism which is capable of determining which time-steps in the input sequence are relevant at each time-step being generated in the output sequence.

Let's stop and think about that at a high-level. We have a mechanism which allows us to take a variable-length sequence and merge this information together to output a fixed-size vector. Doesn't that sound a lot like to role of a recurrent neural network? We introduced recurrence as a way to learn from variable-length sequences; but in an effort to squeeze more performance out of those recurrent networks, we accidentally stumbled across another way to process variable-length sequences.

This begs the question, what if we tried to get rid of the recurrent layers and simply used attention everywhere? This is the premise behind a seminal paper from 2017, Attention Is All You Need, which introduced the Transformer architecture for neural networks. Let's dive in and explore that idea in this blog post.

Overview

Swapping out recurrent layers with attention
Generalizing the attention mechanism
Defining the Transformer architecture
Benefits of an attention-only architecture
Resources

Swapping out recurrent layers with attention

Let's first walk through what exactly it means to "replace the recurrent layers with attention". Recall that a standard recurrent neural network architecture involves evolving a hidden state across a sequence of inputs. We process each time step by combining information from the current input with information from the previous hidden state. This allows us to pass information along the temporal dimension as we process our sequence.

The above visualization shows the "unrolled" representation of our network, but if we were to look at a single time step it becomes more clear that we're only processing the immediately previous hidden state and the current input.

This forces us to evolve a summary of the entire input sequence (represented by the previous hidden state) as we process the sequence step by step. What if we could instead leverage an attention mechanism to look back at every previous time step when building our representation of the current time step?

With this approach, we can flexibly look back across the sequence to retrieve relevant information as we process the next item in the input sequence.

Notice how there is no longer information flowing through the hidden state representation like we had in the unrolled recurrent neural network; in this new architecture, each time-step instead looks back across the entire input sequence.

In some applications, we want to see the entire sequence at each time step, not just the preceding time-steps. With recurrent neural networks we typically achieved this with a bidirectional recurrent layer where we had one recurrent layer processing the input in the forward direction and one recurrent layer processing the input in the backward direction, concatenating these outputs together.

We can achieve this same functionality with an attention-only architecture by simply allowing each time step to attend across all time steps in the sequence.

Further, we can still leverage the attention mechanism for its original design, passing information between an encoder and decoder model. In this case we have attention from two different perspectives:

self attention: looking across the current sequence being processed (e.g. an encoder network looking across the input sequence being encoded, such as the example shown above)
cross attention: looking across a related sequence (e.g. the decoder network looking across a sequence of representations from the encoder network)

We describe the type of attention based on the relationship between the sequence that we're currently processing and the sequence that we're using as context.

Generalizing the attention mechanism

With this conceptual understanding in place, let's spend some time digging deeper into how the attention mechanism looks across multiple time steps and combines relevant information into a fixed-size context vector.

At a high level, the attention mechanism:

compares a reference vector against a candidate vector to determine a relevance score between the two vectors
performs the above calculation for a set of candidate vectors
normalizes all of the computed relevance scores (e.g. softmax)
uses the normalized scores to take a weighted combination of the candidate vectors

This provides us with a single "context" vector which highlights relevant information from across the sequence, relative to the current time-step.

In their paper introducing the Transformer architecture, Vaswani et al refer to the original attention mechanism implementation as "additive attention" and compare this to another proposed mechanism known as "dot-product (multiplicative) attention". The authors then build on this second mechanism to introduce a variant known as "scaled dot product attention". Let's discuss these various implementations and show how they relate to the high-level structure presented above.

Additive attention

I previously introduced the attention model as a small neural network which maps $\left[ \text{reference} ,\ \text{candidate} \right] \rightarrow \text{relevance score}$ where our input is a concatenation of the reference and candidate vectors.

A really tiny attention network for the purposes of demonstration.

However, instead of concatenating these vectors together and using a single weights matrix to project the input into the hidden layer, we could also project the reference and candidate vectors separately, and then simply add the two projections.

A visualization of matrix operations to demonstrate $Wx^{\left[ \text{ref} ,\text{cand} \right] }=Wx^{\text{ref} }+Ux^{\text{cand} }$

It's advantageous to frame this as two projections which are added together, since the same set of candidate vectors will need to be projected for each reference vector. If we separate these projections, we can cache the results of $Ux^{\text{cand} }$ and reuse this projection in cases where we attend over the same set of candidate vectors multiple times (e.g. cross attention where a decoder autoregressively generates an output sequence and attends over the same sequence of vectors from our encoder representation at each time step).

With this reframing, we will define our additive attention model as:

$$\text{relevance} \left( x^{\text{ref} },\ x^{\text{cand} }\right) =v^{\intercal }\tanh \left( Wx^{\text{ref} }+Ux^{\text{cand} }\right) = e_{i,j}$$

$$\text{attention} \left( x^{\text{ref} },X^{\text{cand} }\right) =\text{softmax} \left( \begin{matrix}e_{i,1}\\ \ldots \\ e_{i,n}\end{matrix} \right) X^{\text{cand} }$$

where $W$ and $U$ are our weights matrices to project from the input to the hidden layer and $v$ represents the weights used to project from the hidden layer to our output. We use $\tanh$ as our nonlinearity in this case but in theory you could swap this out for your nonlinearity of choice.

$X^{\text{cand}}$ denotes our set of candidate vectors as a matrix. These candidate vectors are weighted by the normalized relevance scores and combined to produce our final context vector.

This is the attention model introduced by Bahdanau et al (see appendix section A.1.2) with our inputs renamed to use slightly more generic terminology.

Dot product (multiplicative) attention

Instead of using a small neural network to compute the relevance score between a reference and candidate vector, a simpler approach might be to take the dot product of the two vectors; this dot product gives us a direct measure of similarity between two vectors. As a quick reminder, the dot product between two vectors is calculated as:

$$\left[ \begin{matrix}a_{1}\\ a_{2}\\ a_{3}\end{matrix} \right] \cdot \left[ \begin{matrix}b_{1}\\ b_{2}\\ b_{3}\end{matrix} \right] =a_{1}b_{1}+a_{2}b_{2}+a_{3}b_{3}$$

If two vectors are aligned in the same direction with similar magnitudes the dot product will be a large positive number. If two vectors are pointing in opposite directions with similar magnitudes, the dot product will be a large negative number. If two vectors are perpendicular, their dot product is zero.

This attention model was originally introduced by Luong et al where attention scores are simply computed by the dot product of our reference and candidate vectors (in their case, the vectors were hidden states of encoder and decoder LSTM networks).

Vaswani et al make a slight modification to this technique, where rather than directly computing the dot product between our reference and candidate vectors, we'll opt to use two learnable weights matrices to project $x^{\text{ref} }$ and $x^{\text{cand} }$ into a new representation subspace, similar to what we did for additive attention.

The projection of our reference vector is referred to as our query. The query vector represents the information that we're searching for across our set of candidate vectors. The projection of our candidate vector is referred to as a key. This represents the information contained within a specified candidate vector. The dot product between these two vectors helps us find keys that a relevant for a given query.

This transformation into a new representation subspace allows us to extract/highlight information from the input vectors before computing their relevance as determined by the dot product. This also allows us to project the inputs into multiple representation subspaces such that our model can attend to different characteristics of the input in parallel; this is known as multi-headed attention which we'll discuss further in a later section.

Finally, rather than using the candidate vectors directly to perform a weighted combination of our candidate vectors (based on the normalized relevance scores), we'll instead introduce a third projection to use in producing our weighted combination across the candidate vectors.

The projection of our candidate vector used in producing the final weighted combination is known as the value. This represents the information from our candidate vector that we wish to express for a given query.

In summary, we'll define dot product attention as:

$$\text{relevance} \left( x^{\text{ref} },\ x^{\text{cand} }\right) =W_{\text{query}}x^{\text{ref} }\cdot W_{\text{key}}x^{\text{cand} }=q\cdot k = e_{i,j}$$

$$\text{attention} \left( x^{\text{ref} },X^{\text{cand} }\right) =\text{softmax} \left( \begin{matrix}e_{i,1}\\ \ldots \\ e_{i,n}\end{matrix} \right) W_{\text{value} }X^{\text{cand} }$$

where our weights matrices $W$ are used to project the input into our query, key, and value vectors and the corresponding relevance scores are denoted as $e_{i,j}$. Here, $i$ denotes our index into a matrix of reference vectors $X^{\text{ref}}$ and $j$ represents the index into the matrix of candidate vectors $X^{\text{cand} }$. For simplicity, we focus the attention calculation on a single reference vector but in practice this can be done in parallel as a single matrix operation treating every vector in our sequence as a reference vector.

Scaled dot product attention

Recall that our relevance scores, $e_{i,j}$, for a given set of candidate vectors are normalized using the softmax function to produce our final attention weights, $a_{i,j}$.

$$a_{i,j}=\frac{\exp \left( e_{i,j}\right)}{\sum^{K}_{k=1}\exp \left( e_{i,k}\right)}$$

Further recall that our dot product between the query and key vectors is computed as $q_{1}k_{1}+q_{2}k_{2}+\ldots +q_{d}k_{d}$. If we were to assume that our two vectors were normally distributed (mean of 0 and variance of 1), we would expect their dot product to have a mean of 0 and a variance of $d$. Thus, as the dimensionality $d$ grows larger, so does the variance of the dot product.

Unfortunately, large magnitudes of our relevance scores push the softmax function into regions of extremely small gradients, which attenuates our learning signal during training. To counteract this effect, Vaswani et al opt to scale the dot products by $\frac{1}{\sqrt{d_{k}} }$ before normalizing them with the softmax function.

With this variant of attention, we compute the relevance score as

$$\text{relevance} \left( x^{\text{ref} },\ x^{\text{cand} }\right) =\frac{1}{\sqrt{d_{k}} } \left( W_{\text{query} }x^{\text{ref} }\cdot W_{\text{key} }x^{\text{cand} }\right) =\frac{1}{\sqrt{d_{k}} } \left( q\cdot k\right)$$

where $d_{k}$ denotes the dimensionality of our query and key vectors.

Putting this all together, we can visualize a scaled dot product attention layer using self-attention to process a sequence of inputs. This visual just shows the computation to produce the context vector for the first time step in the sequence, but you can imagine a similar computation happening in parallel for all time steps.

Defining the Transformer architecture

Let's now discuss the Transformer architecture as presented by Vaswani et al. This architecture was constructed for the task of machine translation and leverages the typical encoder-decoder approach where the encoder processes a sentence in the input language and the decoder uses representations from the encoder to generate a sentence in the target language.

The overall Transformer architecture is mainly a composition and stacking of just a few building blocks:

scaled dot product attention,
residual connections,
layer normalization,
and feed forward networks.

Stacking multiple attention layers on top of each other has the effect of increasing the receptive field. The first attention layer produces context vectors based on interactions between pairs from the original sequence. The second attention layer produces context vectors based on pairs of pairs of the original sequence. As we continue to stack more attention layers on top of each other, we gain a wider perspective considering multiple levels of interactions between items in the original sequence.

Multi-head attention

For the attention layers, the authors leverage a multi-headed attention implementation where we perform multiple queries in parallel for each time step.

With this approach, we end up with a set of context vectors at each time step in the sequence. These context vectors represent different attention summaries of the same sequence which provides us with multiple perspectives on the same input. The vectors are then concatenated together and passed through a final linear layer.

A visualization of multi-head attention from the original paper.

Encoder

An encoder block is defined as:

multi-head (self) attention: queries are computed for each step in the input sequence and are compared against keys computed across all steps in the input sequence
residual connection and layer normalization: the input embeddings are added to context vectors produced from the self-attention layer to create a residual connection, and normalized across the layer dimension (e.g. the output vectors for each sequence in the batch will be separately normalized to have zero mean and unit variance)
feed forward: we then pass the normalized output from our attention sublayer through a linear projection, ReLU activation, and another linear projection
residual connection and layer normalization: finally, we add the projection from the feed forward sublayer to the normalized output from our attention sublayer, and once again normalize this vector to have zero mean and unit variance

The encoder model is comprised of 6 of these blocks stacked on top of each other.

Decoder

A decoder block is defined as:

multi-head (masked self) attention: queries are computed for each step in the output sequence and are compared against keys computed across all steps in the output sequence, due to the fact output sequence is intended to be generated autoregressively we apply a causal mask to the computed relevance scores (limiting the model to only attend across previous time steps)

A visualization of causal masking for self-attention.

residual connection and layer normalization: the embeddings of our output sequence are added to context vectors produced from the causal self-attention layer to create a residual connection, and normalized across the layer dimension
multi-head (cross) attention: we use the output of the previous self-attention sublayer to generate a second set of queries for each step in the output sequence, the keys and values are computed from the output of the final encoder block across all steps in the input sequence

A visualization of cross attention, where the keys and values are computed on a separate sequence from the queries.

residual connection and layer normalization: the context vectors from the causal self-attention sublayer are added to context vectors produced from the cross-attention sublayer to create a residual connection, and normalized across the layer dimension
feed forward: we then pass the normalized output from our attention sublayer through a linear projection, ReLU activation, and another linear projection
residual connection and layer normalization: finally, we add the projection from the feed forward sublayer to the normalized output from our cross-attention sublayer, and once again normalize this vector to have zero mean and unit variance

The decoder model is comprised of 6 of these blocks stacked on top of each other.

Embeddings

Our input and output sequences are ultimately represented as a sequence of token ids (integers). To prepare these sequences for our model, we will leverage the standard technique of using this token id as an index to look up a vector of learnable parameters from an embeddings matrix. However, we must also inject some additional information into these embeddings so that the model can learn positional relationships between tokens in our input and output sequences.

This is because the attention mechanism doesn't have any notion of spatial relationships between tokens in the sequence. Our query at each time step simply looks across the set of keys and determines which keys are most relevant to our current time step; if we shuffled up the order of the sequence this would have no effect on our pairwise relevance comparisons.

However, we know intuitively that token order should matter, so the authors develop a technique known as positional encoding to inject information about the order of the tokens in our sequence.

We encode the position $pos$ in a sequence using a combination of $sin$ and $cos$ functions (alternating these functions for each dimension $i$ in the $d$-dimensional embedding vector).

$$PE_{\left( pos,2i\right) }=sin(\frac{pos}{10000^{\frac{2i}{d} }} )$$

$$PE_{\left( pos,2i+1\right) }=cos(\frac{pos}{10000^{\frac{2i}{d} }} )$$

These position encodings are then added to our embedding vectors before being passed to the encoder or decoder models.

💡

Note: Many subsequent Transformer-based architectures have opted to inject positional information in different ways. One promising approach (ALiBi) advocates for removing the positional encoding entirely from our embeddings and rather introducing an inductive bias to our relevance scores to give higher weight to nearby tokens.

Overall architecture

Putting all this together, we have the Transformer architecture.

A visualization of the overall Transformer architecture from the original paper.

💡

Note: Many subsequent Transformer-based architectures have opted to to move layer normalization to be performed before the attention/feed forward layers instead of after. This has led to more stable training dynamics.

Benefits of an attention-only architecture

One of the biggest benefits of replacing recurrence with attention is that we now have a fully parallelizable architecture. Whereas recurrence requires us to process a sequence one time-step at a time (due to the fact that we rely on the previous hidden state to process the next item in a sequence), an attention-only architecture does not have such a limitation. You can use a single matrix operation to project queries, keys, and values for every time-step in your sequence, simultaneously attending across the sequence for every time-step in parallel. This allows us to more efficiently leverage accelerated compute infrastructure (e.g. GPUs) and train models faster.

The attention mechanism also allows our gradient signal to flow through the network more efficiently during backpropagation. With recurrent neural networks, if you needed propagate a gradient signal from $t=3$ back to $t=0$, you would have to back-propagate through all of the recurrent hidden layers between each of these time-steps.

However, with the attention mechanism we can directly back-propagate our gradient signal across an arbitrarily number of time steps using a constant number of operations.

These shorter path lengths make it easier for the model to learn long-term dependencies between items in our sequence.

Finally, the attention mechanism has an inherent level of interpretability as a result of inspecting the attention weights. That is, we can use the computed attention weights to understand which tokens across the sequence were deemed relevant and used to produce a context vector for any given token in the sequence.

An example from the Transformers paper showing how visualizing the attention weights can help aid in understanding how the model processes our input sequence.

Resources

Papers

Neural Machine Translation by Jointly Learning to Align and Translate
Effective Approaches to Attention-based Neural Machine Translation
Attention Is All You Need 🐐
On Layer Normalization in the Transformer Architecture
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (replace positional encoding)
- The author gives a nice overview of the paper in this talk

Lectures

3Blue1Brown Series
- But what is a GPT? Visual intro to Transformers
- Visualizing Attention, a Transformer's Heart
Sebastian Raschka's lectures on this topic
Andrej Karpathy's deep dive + live coding
- Let's build GPT: from scratch, in code, spelled out

Blog posts

Other

Calculus (OpenStax) 12.3: The Dot Product
LLM Visualization
- This is a really impressive interactive visualization.

Thanks to James Black, Tim Hopper, and ChatGPT for pointing me to ffmpeg and helping me figure out the right settings to use to create GIFs from my diagrams. I hope you found the animations to be useful!

Understanding the attention mechanism in sequence models

Jeremy Jordan — Wed, 01 Mar 2023 22:44:22 GMT

In this blog post, we'll discuss a key innovation in sequence-to-sequence model architectures: the attention mechanism. This architecture innovation dramatically improved model performance for sequence-to-sequence tasks such as machine translation and text summarization.

Moreover, the success of this attention mechanism led to the seminal paper, "Attention Is All You Need", which introduced the Transformer model architecture. I'll discuss the Transformer model architecture in subsequent posts, but I feel that it's useful to start by building a foundational understanding of the attention mechanism itself (which is the focus of this post). With that said, let's dive right in!

Overview

A quick refresher: sequence to sequence models
The context vector bottleneck
Introducing the attention mechanism
Training an attention model
Benefits of attention: shorter path lengths
Conclusion
Resources

A quick refresher: sequence to sequence models

Let us consider the sequence modeling task where we have a variable-length input sequence and we're expected to predict a variable-length output sequence. A common example of this type of task is machine translation, where the input might be "i love you" (English) and the expected output might be "te amo" (Spanish).

Notice how the input length (3 words) doesn't necessarily match the output length (2 words). This poses a challenging sequence prediction task since we can't simply train a model which learns to map each input token to its corresponding output token; there isn't a 1:1 mapping!

Cho et al introduced the RNN Encoder–Decoder neural network architecture for this type of sequence modeling where the variable-length input is encoded into a fixed-size vector (using a recurrent neural network) which is subsequently decoded into another variable-length output sequence (using a second recurrent neural network).

A visualization of the proposed architecture from Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

Sutskever et al introduced a similar architecture that builds on this approach using LSTM networks to encode and decode variable-length sequences. In this architecture, however, we simply take the last hidden state of the encoder and use that to initialize the decoder rather than passing it as context to every step during decoding. Also note how the input sequence is reversed in order to make the optimization problem easier, we'll dig deeper into this architecture choice in a later section.

A visualization of the proposed architecture from Sequence to Sequence Learning with Neural Networks.

While these two architectures had some slight differences, the overall approach was the same. We encode our variable-length input sequence into a fixed-size vector which provides a summary of the entire input sequence. This summary is then provided as context to the decoder when generating the output sequence.

The context vector bottleneck

Although the fixed-size context vector enables us to flexibly learn sequence-to-sequence tasks such as machine translation or text summarization, let's consider the implications of this architecture design.

A fixed-size vector implies that there's a fixed amount of information that we can store in our summary of the input sequence. For the sake of an example, let's assume that our context vector can store 100 bits of information. Let's also assume that each time-step in the input sequence contributes equally to the information stored in the context vector. For an input sequence of 5 tokens, each token would contribute 20 bits of information to the context vector. However, for an input sequence of 10 tokens, each token would only contribute 10 bits of information to the context vector.

Through this example one can see how performance may degrade in sequence-to-sequence models for longer input sequences. As the size of the input grows, we can store less and less information per token in the input sequence.

Introducing the attention mechanism

In order to address this context bottleneck challenge, Bahdanau et al introduced a neural network architecture which allows us to build a unique context vector for every decoder time-step based on different weighted aggregations across all of the encoder hidden states.

Whereas in the previous hypothetical example we only had 100 bits of information to summarize the entire input sequence, with this new approach we can provide a hypothetical 100 bits of information about the input sequence at each step of the decoder model.

In the example visualized below, you can see how we take a weighted combination across all of the encoder hidden states to produce our context vector, rather than simply using the final hidden state. These weights are determined by an attention mechanism which determines the encoder hidden states that are relevant given the current decoder hidden state. We'll discuss more about how this attention mechanism is trained in the next section.

Notice how the attention weights in the example above place most of the weight on the last token of the input sequence when generating the first token of the output sequence. If you recall our running example of translating "i love you" to Spanish, we're effectively aligning the related tokens of "i love you" and "te amo". Bahdanau et al refer to this attention mechanism as an alignment model since it allows the decoder to search across the input sequence for the most relevant time-steps regardless of the temporal position of the decoder model.

When decoding the second token, you can see that we have an entirely different weighted combination of contexts from the encoder hidden states, which helps align the "i love you" with "te amo" (in case you don't speak Spanish, "amo" is the first person singular conjugation of the Spanish word for love).

As the decoder model continues to generate the output sequence, we'll continue to compare the current decoder hidden state against all of our encoder hidden states to produce a unique context vector.

Training an attention model

Now that you understand how the attention weights are used to produce a unique context vector at each step in the decoder model, let's dive into the architecture and training process for our attention mechanism.

As we discussed previously, the attention model is effectively comparing a given decoder hidden state with a specific encoder hidden state to determine the relevance between these two vectors.

One way to compute this relevance score is to concatenate the encoder hidden state and decoder hidden state vectors and pass them through a multi-layer perception. The hidden layer of our MLP model can learn to identify relationships between certain dimensions in the encoder hidden state and the decoder hidden state. The output of this model is effectively answering the question: "how relevant is the information in this encoder hidden state for computing the next output in the sequence, given my current decoder hidden state?"

As a concrete example, the figure below visualizes how the attention weights are computed for the first time-step in the decoder model. Notice how we compare the previous decoder hidden state to every encoder hidden state in our sequence. The attention model outputs a "relevance" score (e_i,j) for each combination, which we can normalize via the softmax function to obtain our attention weights (α_i,j).

This attention mechanism is trained jointly with the rest of the sequence-to-sequence network, since we can backpropagate through the entire calculation of our attention weights and corresponding context vector. This means that the attention model is optimized to find relevant hidden states for the specific sequence-to-sequence task at hand. In other words, the attention mechanism for a machine translation model might learn to find different patterns than the attention mechanism for a text summarization model, depending on what's most useful for the task at hand.

Benefits of attention: shorter path lengths

In a previous section we touched on Sutskever et al's architecture decision to reverse the input sequence in order to make the optimization problem easier. In fact, they state "the simple trick of reversing the words in the source sentence is one of the key technical contributions of this work." Let's dig in to why this may be the case.

Let us assume that, on average, the position of a token in the input sequence is roughly equivalent to its relevant position in the output sequence. For example, in the English to Spanish translation of "this is an example" to "esto es un ejemplo", the first word "this" of the input sequence can be translated directly to the first word "esto" of the output sequence.

Recall that a standard encoder-decoder model architecture (without attention) uses the last hidden state of the encoder model as the context vector. This means that the information from the first time-step of the input sequence must be preserved in the hidden state passed through all of the time-steps in the input sequence. This information is then used to decode the first time-step of the output sequence.

The path from an input token and it's corresponding output token in a standard encoder-decoder model is visualized in red.

If we instead reverse the order of the input sequence, the information from the first time-step in the input has a shorter path length through the unrolled recurrent networks to the corresponding output. In general, shorter path lengths are desired because it's easier for gradient signals to propagate. Neural networks have historically struggled with learning longer-term dependencies (read: patterns across longer path lengths) due to the vanishing/exploding gradient problem coupled with the fact that we can only store a limited amount of information in our hidden state.

In section 3.3 the authors state:

By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance. – Sequence to Sequence Learning with Neural Networks

The trick of reversing the input sequence is effectively making it easier to pass information from the earlier parts of the input sequence while making it harder to pass information from the later parts of the input sequence.

As a quick aside, this tradeoff makes sense when you consider the fact that our decoder model autoregressively generates the output vector. By making it easier for the model to start off with good initial output sequence predictions (by decreasing the path length to the initial input tokens), we're reducing the likelihood of introducing errors early on which would continue to accumulate as you generate the rest of the output sequence.

While this does appear to make the optimization problem easier, reversing the input sequence doesn't have an effect on the average distance between related tokens in the input and output sequences.

Let's contrast this with an encoder-decoder model architecture which leverages attention, where every time-step in the input sequence now has an equal path length to each time-step in the output sequence.

Notice how the information from the encoder's first hidden state is directly available to the decoder; our encoder model no longer needs to remember all of the details from this first time step throughout the entire input sequence.

Because the attention mechanism introduces shorter path lengths at every time step in the input sequence, we no longer need to make a tradeoff regarding which portions of the input sequence are most important for the decoder; the entire sequence is now easier to optimize as a result of shorter path lengths.

Conclusion

Standard encoder-decoder sequence models struggle from an information bottleneck passing information from the encoder to the decoder. The attention mechanism helps alleviate this bottleneck by allowing the decoder to search across the entire input sequence for information at each step during the output sequence generation.

This is accomplished by introducing a small attention model which learns to compute a relevance score between an encoder hidden state and a decoder hidden state. From these relevance scores, we can produce a unique weighted combination of encoder hidden states as context for each step of the decoder model.

Resources

Papers

Lectures

Blog posts

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

Managing your machine learning infrastructure as code with Terraform

Jeremy Jordan — Wed, 27 Jul 2022 04:38:22 GMT

Let's say you want to deploy a recommender system at your company. A typical architecture might include a set of inference servers to run your embedding and ranking models, an approximate nearest neighbor index to select a set of candidate items that match your query, a database to retrieve features about your selected candidate items, and a server that runs some business logic and filters your set of candidate items. These servers, databases, and the networking configurations that allow them to communicate with each other comprise the infrastructure of your machine learning system.

In this blog post, I'll provide an introduction to Terraform and how you can use it to provision the infrastructure layer of a machine learning system.

Overview

What is infrastructure as code?
Benefits of infrastructure as code
A conceptual introduction to Terraform
Typical workflow deploying infrastructure with Terraform
Managing ML infrastructure with Terraform
Best practices
Design guide for creating modules
Debugging
Resources

What is infrastructure as code?

Terraform and similar infrastructure-as-code tools (CloudFormation, Pulumi, etc.) provide a mechanism for declaratively defining your infrastructure via a set of configuration files. This allows us to focus on what infrastructure we want to deploy (e.g. three servers with a load balancer) without needing to worry about how to provision these resources (e.g. first spin up three servers, then create a load balancer, then attach the servers to the load balancer).

These configuration files are stored in a version-controlled code repository which enables the same set of collaboration patterns that we already follow as part of the software development lifecycle. For example, if you wanted to propose changes to your team's infrastructure, you might create a new branch, update the relevant configuration files, and open a pull request to be reviewed by your team.

Benefits of infrastructure as code

When you first started working with cloud services, you probably just created whatever resources you needed directly in the AWS/GCP/Azure console. However, as the size of your organization or the number of projects grow, this becomes difficult to manage at scale.

Infrastructure-as-code tools simplify your life in a number of ways:

Easier to keep track of resources. You don’t have to remember which resources you spun up for a given project, it’s all tracked in a centralized location. Spin up/down resources for a project with a single command.
Create an audit log of changes to your infrastructure over time. By tracking our desired infrastructure state as code in a git repository, we can easily see how our infrastructure has evolved and query for the desired infrastructure state at a specific point in time.
Repeatable across multiple environments. Creating reusable components helps ensure that your development environment accurately mirrors your production environment.
Automation improves efficiency and reliability. By codifying how we provision and manage infrastructure for our projects, we can reduce the potential for human error.

A conceptual introduction to Terraform

Let's first spend some time to discuss the common set of abstractions in Terraform's design that allows us to flexibly create infrastructure for whatever configuration we might need.

Terraform communicates with a wide variety of infrastructure providers which are responsible for creating, updating, and deleting the set of resources that we use to deploy our applications.

Terraform has providers for generic cloud infrastructure companies (e.g. AWS, GCP, Azure) as well as more focused, higher-level infrastructure services (e.g. Auth0, DataDog, Snowflake, and many more). If you happen to come across an infrastructure component that isn't already available via Terraform's registry of providers, you can always create your own provider as long as there's an API you can communicate with to create, read, update, and delete resources.

Terraform providers are versioned code which can communicate (via APIs) with external infrastructure providers.

A resource is the smallest unit of infrastructure managed by a given provider; each provider will have a different set of documented resources supported for you to deploy. For example, using the AWS provider you can spin up an EC2 machine (server) or an RDS database. Using the Snowflake provider, you can configure your data warehouse. These are just a few examples, each provider's documentation page will list the full set of available resources.

Here, the gray boxes in the top row represent the actual resources that we've specified in our configuration. Terraform providers are responsible for reconciling the actual state in each of those environments to match the desired state specified in our configuration.

Let's take a look at a minimal example of a Terraform configuration to see what this looks like in practice.

provider "aws" {
  region = "us-west-2"
}

resource "aws_instance" "my_example_server" {
  ami           = "ami-005e54dee72cc1d00"
  instance_type = "t2.micro"
  tags = {
    name  = "example"
    owner = "jeremy"
  }
}

For a closer look at Terraform's syntax, check out: Terraform configuration: quick reference.

Here, we're using the AWS provider to configure a single EC2 instance. We could save this configuration as a file named main.tf, run terraform apply, and Terraform would get to work reading in your AWS credentials and communicating with AWS to spin up our requested instance. We'll revisit this process for actually spinning up/down resources in the next section, but there's a few more Terraform concepts that I want to cover first.

The above example looks pretty simple, but it's likely that you'll need to provision more than a single EC2 instance for your infrastructure needs. I mean, have you seen how many different services AWS puts in their reference architecture diagrams?

Terraform modules allow us to group together common resources, making it easier to reuse across projects or environments. A module is defined as a folder or directory containing one or more Terraform files (e.g. files with the .tf extension) which specify a set of providers and resources that we'd like to deploy. In order to facilitate module reuse, we can define variables to flexibly configure certain resource attributes and outputs to provide useful information that other resources/modules may need to know about.

A minimal example of a Terraform module.

As an example, suppose we want to standardize how servers are provisioned at our company. We can do this with Terraform by creating a terraform-cloud-server module which uses a (fictitious) cloud provider to create some example resources.

We'll define two resource specifications:

a web server which always uses Linux as the operating system
a log stream which retains the server's log output for 90 days

resource "cloud_server" "server" {
  operating_system = "linux"
  memory           = var.server_memory
  cpus             = var.server_cpus
  log_stream       = cloud_log_stream.logs.id
}

resource "cloud_log_stream" "logs" {
  name              = "${var.project}_log_stream"
  retention_in_days = 90
}

main.tf

In this scenario, operating_system and retention_in_days will be consistent across all of the servers created using this module. However, we still may want to configure some values, such as the server memory, depending on the requirements of each individual server. For these values, we can create variables to pass configuration from the module level to individual resources within the module (as seen above).

variable "project" {
  description = "Provide a unique name for your project"
  type        = string
}

variable "server_cpus" {
  description = "Number of CPUs requested for the server"
  type        = number
  default     = 2
}

variable "server_memory" {
  description = "Amount of memory requested for the server"
  type        = string
  default     = "4 GiB"
}

variables.tf

And finally, we may want to know some information about the low-level resources that are created as part of this terraform-cloud-server module. We can expose this information by specifying outputs.

output "server_ip_address" {
  description = "IP address for the server"
  value       = cloud_server.server.ip_address
}

outputs.tf

Putting this all together, our Terraform module would look something like:

├── terraform-cloud-server
    ├── README.md
    ├── main.tf
    ├── outputs.tf
    └── variables.tf

If we wanted to use this module in our infrastructure configuration, we'd define a module block which specifies where our module is defined and provides all of the necessary variables to configure the underlying resources.

module "standard_server" {
  source        = "../path/to/terraform-cloud-server"
  server_cpus   = 4
  server_memory = "8 GiB"
}

In some cases, you may want to query information from a given provider when configuring a resource. For example, suppose we want to use an existing IP network (and its corresponding configuration) when assigning an IP address to our cloud server. A data resource allows you to query a given provider for information about existing components that might be useful when configuring new resources in your module.

data "vpc" "selected" {
  id = var.vpc_id
}

# e.g. use the data resource when configuring other resources
resource "cloud_server" "server" {
  ...
  subnet = data.vpc.selected.private_subets[0]
}

Putting this all together, we can see that our infrastructure as code is largely comprised of modules (reusable groups of resource definitions) and configurations (a specification of the resources/modules that we want to deploy in a given environment).

Note: the "modules" and "configurations" that I refer to above are both technically Terraform modules, since a module is just a directory of Terraform files, but I find it useful to draw a distinction between the two groups since they each have a different intention.

Typical workflow deploying infrastructure with Terraform

Now that we've covered the basic concepts, let's walk through the standard workflow for deploying your infrastructure with Terraform.

Specify your desired state

The first step is to specify the collection of resources (and modules) that you want to provision. We do this by defining a series of configuration blocks in one or more Terraform files within the same directory.

For more details on how to specify these configuration blocks, you can either review the Terraform documentation (more likely to be up-to-date over time) or check out the Terraform configuration: quick reference that I put together.

Building on the previous example using the (fictitious) cloud provider, a configuration for project_a might look something like:

module "database" {
  source       = "../path/to/terraform-cloud-database-module"
  name         = "project_a_database"
  cpu_limit    = 4
  memory_limit = "16 GiB"
  backup_daily = true
}

module "app_server" {
  source          = "../path/to/terraform-cloud-app-server-module"
  replicas        = 5
  cpu_limit       = 2
  memory_limit    = "1 GiB"
  container_image = "my_org/project_a:latest"
}

resource "cloud_domain_name" "domain" {
  name   = "app.example.com"
  target = module.app_server.load_balancer.ip_address
}

An example project configuration using fictitious modules and resources.

Note: It's generally best to deploy your infrastructure as separate, logical components rather than one giant pile of Terraform configurations. This enables you to create/destroy resources for project_a without having any affect on project_b's resources.

terraform/
  configuration/
    development/
    staging/
    production/
      account_base/
        main.tf       # resources you only need to create once per environment
      region_a/
        region_base/
          main.tf     # resources you only need to create once per region
        project_a/
          main.tf     # project-specific resources
        project_b/
          main.tf     # project-specific resources
      region_b/
        ...

A typical organization pattern for Terraform configurations.

Build an execution plan

Once we've specified our desired infrastructure state, we'll run terraform plan from the same directory as our configuration files. Terraform will get to work communicating with the necessary providers in order to figure out the actual state of our infrastructure, how that differs from our desired state, and the set of changes we need to make in order to achieve our desired state.

Terraform will output this plan to the console, sharing the set of resources will get created, modified in place, or deleted if we execute the plan.

$ terraform plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.database.cloud_database_cluster.cluster will be created
  + resource "cloud_database_cluster" "cluster" {
      + id        = (known after apply)
      + name      = "project_a_database"
      ...
    }
  ...
  Plan: 10 to add, 0 to change, 0 to destroy.

In order for Terraform to know:

which resources need to be created,
which resources already exist and may simply need updated,
and which resources are no longer necessary,

Terraform maintains a state file which describes a mapping between resources defined in our configuration and resources that exist in the real world. This state file also contains other useful metadata such as resource dependencies, which helps Terraform know which order to safely destroy resources which are no longer needed. Most teams will configure Terraform to save this state file in a remote data store (e.g. Amazon S3) so that there's a single source of truth. This also allows us to set up state file locking (e.g. keeping track of locks in a DynamoDB table) so that multiple people can't be updating the same set of resources at the same time.

Apply your execution plan

Once you've reviewed the plan and confirmed that it looks correct, you can run terraform apply to execute the plan described in the previous step. Terraform will then perform the following steps:

Place a lock on the state file to prevent any other Terraform processes from modifying the same set of resources.
Create a new execution plan, unless you explicitly provide the output of a terraform plan that you ran previously.
Follow the above execution plan, leveraging the providers to create, update, and delete the appropriate resources in a safe order.
Update the state file to reflect changes made to our set of resources.
Remove the lock from the state file so that other Terraform processes can make changes when needed.

Clean up resources you're no longer using

Once we're ready to tear down the resources defined in your Terraform configuration, you can simply run terraform destroy from the same directory as our configuration files. Terraform will leverage its understanding of resource dependencies in order to delete the resources in a safe order.

Managing ML infrastructure with Terraform

Let's now walk through a standard machine learning model lifecycle and discuss how we might support the requisite infrastructure components with Terraform.

my generic mental model for the various components involved in operating a machine learning system. pic.twitter.com/CtL9sAGd6R
— Jeremy Jordan (@jeremyjordan) April 17, 2022

See this tweet as a high-level example of the various infrastructure components we may need to operationalize a machine learning model in our product.

We'll need the ability to define a batch job which runs a training job for our model, so we'll go ahead and define a terraform-aws-batch-job module which can (1) be triggered either based on an event or a cron schedule, (2) spin up the necessary compute to train our model, (3) connect to the appropriate data sources for generating our training and validation datasets, and (4) spin down the necessary compute once our job completes.

This training job will need to log key metrics and output the trained model artifacts into a model registry. We'll probably need to create a database and a cloud object store to save model metadata and artifacts respectively.

Further, we might configure the model registry to automate a deployment rollout when a user updates a registered model's stage from shadow mode to production. This may make calls out to a service such as Sagemaker to update the model artifact being served at a given endpoint. In this case, we may create some of the initial Sagemaker infrastructure using Terraform but mark certain attributes as ignore_changes in the lifecycle block (see more on this below) in order to allow our model registry to dynamically update the model configuration.

Best practices

Generate documentation for your Terraform modules

In order to provide a succinct summary of the modules you create, you can use terraform-docs to automatically generate a description of a module's variables and outputs, in addition to the resources and modules it creates.

terraform-docs markdown table \
  --output-file README.md \
  --output-mode inject .

Tag your resources

Some providers allow you to specify tags/labels for your resources as a method of organization. It's a good practice to leverage these tags to store information such as resource ownership, which projects a resource may belong to, how this resource should be categorized from a billing perspective, or other key information.

resource "resource_type" "resource_name" {
  # resource-specific configuration
  ...
  
  tags = {
    Owner       = "my-team"
    Project     = "my-project"
    Environment = "development"
    Billing     = "some-cost-center"
  }
}

Monitor for drift between your desired and actual infrastructure states

Terraform only checks for discrepancies between the actual state and desired state of your infrastructure when you run commands. It's entirely possible that someone has made changes to your infrastructure outside of Terraform (e.g. manual actions in the AWS console), so the desired state that you track in a code repository isn't always an accurate depiction of what's deployed. It's usually a good practice to periodically check for "infrastructure drift"; you can use tools such as driftctl to automate this process.

In order to bring Terraform configuration back in line with reality:

Run terraform plan -refresh-only to compare Terraform’s state file with the actual current infrastructure
Update the configuration files to match the current state (assuming you want to keep the resources which were created outside of Terraform)
Run terraform import to update Terraform’s state file and associate the updated configuration with the already-created infrastructure

You can run terraform apply -replace="resource_type.resource_name" to target a specific resource to recreate in the event that someone manually changed its configuration.

Safeguard critical resources from accidental downtime

Some terraform apply plan executions can introduce the possibility of downtime, which may not always be ideal for resources that need to be highly available. The Terraform lifecycle block allows you to specify some degree of control over how resources can be updated.

resource "resource_type" "resource_name" {

  lifecycle {
    prevent_destroy       = true   # prevent accidental deletion of resources
    create_before_destroy = true   # avoid downtime when replacing a resource
    ignore_changes        = [attr] # ignore differences between config and state
  }
}

Design guide for creating modules

Modules are incredibly useful for deploying the same set of infrastructure components across projects, regions, or environments. The first rule of Terraform modules is that you should use them. There's a wide variety of ready-to-use Terraform modules developed by the community; it's likely that a module already exists for some of your infrastructure needs. However, it's also likely that you'll end up needing to build some Terraform modules that are more specific to how your company provisions infrastructure.

Here are some tips to keep in mind when building Terraform modules:

Keep modules focused on a single purpose. Modules should be used to group a set of resources that are always deployed together.
Group resources together that change at a similar frequency. By keeping short-lived resources separate from long-lived resources, we can reduce risk associated with making changes to our infrastructure.
Consider the sensitivity of the infrastructure resources being grouped. A networking module might have a different set of security implications as compared to an application server module. It's best to keep highly privileged resources separate so that we can design with the principle of least access in mind.
Take a hierarchical approach when designing modules. In order to get the most reuse out of your modules, it can be helpful to define a base set of modules which define lower-level infrastructure components which can then be referenced across a set of higher-level modules.

Debugging

There are a number of areas where things can go wrong when managing infrastructure with Terraform. Here are a few of those scenarios, as well as some quick tips for identifying and resolving issues.

You have syntax errors in your configuration files.

Run terraform fmt to surface syntax errors. You can also configure your IDE to run the formatter every time you save a file.

You may improperly configured some of your resources.

Run terraform validate to check your configuration in the context of the infrastructure provider. If you want to interactively test your configuration expressions, you can run terraform console.

Your state file has been corrupted.

Run terraform state list [options] [address] to inspect the state file for a given resource or module. Alternatively, you can run terraform show to view the entire state file.

Your state file has a stale lock.

Terraform uses state file locking in order to prevent multiple users from overwriting the same set of resources at the same time. For the S3 remote state configuration, you can configure a locking table with DynamoDB where Terraform will simply write some lock metadata to a specified key in your table (and remove it when the lock is released). If Terraform ends up in a weird state, you may need to manually remove this lock with terraform force-unlock.

There's a bug in a provider that you're using.

Capture logs from the providers when executing a plan or apply.

export TF_LOG_CORE=TRACE
export TF_LOG_PROVIDER=TRACE
export TF_LOG_PATH=logs.txt

Resources

Tutorials

Official Terraform tutorials
- There's a lot of great content here, the only problem is it's a lot which is why I decided it might be useful to write a condensed introduction to Terraform (this blog post).
HashiCorp - Recommended Enterprise Patterns
Purpose of Terraform State
Troubleshoot Terraform
Terraform Best Practices: Key Concepts

Blog posts

Terraform Import - Leaving Click Ops Behind

Tools

Terraform Provider Iterative (TPI)
- The team at iterative.ai has created a Terraform provider which makes it really easy to create the infrastructure needed for running ML tasks, including best practices by default such as using spot instances and spinning down resources once the task successfully finishes.

GitHub - jeremyjordan/terraform-blog-examples: Code snippets from my Terraform blog post

Code snippets from my Terraform blog post. Contribute to jeremyjordan/terraform-blog-examples development by creating an account on GitHub.

GitHubjeremyjordan

Check out this repository containing the code examples from this blog post.

Acknowledgements

Thanks to Vicki Boykis, Jérémie Vallée, John Huffman, and Adam Laiacano for reading early drafts of this blog post and providing helpful feedback.

Terraform configuration: quick reference

Jeremy Jordan — Wed, 27 Jul 2022 04:36:50 GMT

This page contains a quick reference for writing Terraform configuration. For a conceptual introduction to Terraform and managing your infrastructure as code, read this blog post.

Disclaimer: for the most up-to-date and detailed information, check out the official Terraform documentation.

Overview - Configuration Language | Terraform by HashiCorp

You can use the Terraform language to write configuration files that tell Terraform how to manage a collection of infrastructure.

Terraform by HashiCorp

Jump to:

Configuration blocks
- Terraform
- Provider
- Resource
- Variable
- Output
- Module
- Data
- Locals
Dynamic expressions
- Conditional statements
- Wildcards

Configuration blocks

Terraform

Configuration block that allows you to specify various Terraform settings (such as required versions for Terraform and/or its providers).

terraform {
  required_providers {
    # e.g. include the AWS provider and require a specific major version
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0.0, < 5.0.0"
    }
  }
  required_version = ">= 1.2.3"
}

Provider

Versioned code which is used to communicate with an external infrastructure provider.

provider "provider_name" {
  # provider-specific configuration
  # e.g. what region should this provider create your resources in?
}

Resource

Smallest unit of infrastructure managed by a provider.

resource "resource_type" "resource_name" {
  # resource-specific configuration
}

The resource_type always starts with the prefix {provider}_
You can reference a given resource in other parts of your Terraform configuration using the format {resource_type}.{resource_name}
There's a small set of meta-arguments which you can use for any resource:
- count: create multiple instances of the same resource specification
- for_each: configure a set of resources based on a provider iterable
- provider: override the default provider configuration (e.g. changing the region to deploy a specific resource)
- depends_on: specify hidden dependencies between resources that Terraform isn't able to infer
- lifecycle: configuration around resource lifecycle management (e.g. create a new version of the resource before destroying the old one)

Variable

Allow end users to specify values (to be used in resource creation) when defining a module.

variable "variable_name" {
  type        = string
  description = ""
  
  # optional
  default     = ""
  sensitive   = true
  validation {
    condition     = ...
    error_message = ""
  }
}

Usually defined in a separate file, variables.tf, for readability
These values are constant, they cannot be changed during planning or execution
You can reference a given variable as var.{variable_name}
Supported types:
- simple (string, bool, number)
- collection (map, list, set)
- structural (tuple, object)
You can optionally add one or more validation conditions (e.g. to make sure a resource matches AWS resource naming requirements). This can be helpful for "failing fast" when you provide improper values.
By default, variables are not treated as sensitive. You can specify sensitive = true in order to prevent Terraform from outputing these values to the console.

Output

Expose a specific attribute of a resource within a module.

output "output_name" {
  description = ""
  value       = resource_type.resource_name.attribute
  
  # optional
  sensitive   = true
  precondition {
    condition     = ...
    error_message = ""
  }
}

Usually defined in a separate file, outputs.tf, for readability
You can reference a module's output as {module_name}.{output_name}
You can optionally add one or more preconditions to check before returning an output value (e.g. to check that a certificate status is set to ISSUED). This can be used as a last line of defense to validate your assumptions before returning data about a given resource.
By default, variables are not treated as sensitive. You can specify sensitive = true in order to prevent Terraform from outputing these values to the console.

Module

A logical grouping of resources which can be configured and deployed together.

module "module_name" {
  source          = "../path/to/module"
  
  # optional
  version = ""
}

Encourages reuse of configuration and provides consistency across your infrastructure stack
There’s a small set of universal “meta-arguments” that you can use for any module:
- count: create multiple instances of the same module specification
- for_each: configure a set of modules based on a provider iterable
- provider: override the default provider configuration (e.g. changing the region to deploy the resources in your module)
- depends_on: specify hidden dependencies between your module and other resources that Terraform isn't able to infer
Modules can be referenced from local filepaths or remote repositories (Terraform registry, Github, S3)

In addition to using existing modules available on the Terraform registry, you can also define your own. A module can be defined as a set of Terraform configuration files within a single directory.

.
├── LICENSE
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf

An example Terraform module definition.

Data

Retrieve information from an external infrastructure provider.

data "data_resource_type" "data_resource_name" {
  # parameters used to retrieve information
}

Locals

Used to provide a succinct or readable name for a Terraform expression which may be referenced multiple times.

locals {
  # e.g. define a set of required tags to add to all resources
  required_tags = {
    project     = var.project_name,
    environment = var.environment
  }
  tags = merge(var.resource_tags, local.required_tags)
  
  # e.g. define a common suffix for use in multiple resource definitions
  name_suffix = "${var.project_name}-${var.environment}"
}

Can be referenced as locals.{attribute}
Dynamic expressions (e.g. using Terraform functions) are allowed, providing more flexibility than you have available in Terraform variables
It's recommended you use locals sparringly, keeping in mind the trade-offs between readability and DRY principles
- In other words, don't make someone constantly scroll up to the locals block in order to figure out how a given resource is being configured

Dynamic expressions

Conditional statements

You can express conditional statements in Terraform, although I find the syntax a bit odd.

attribute  = (conditional_expression ? value_if_true : value_if_false)

Interpret ? as “then” and : as “else” when reading.

The conditional_expression itself is defined with common syntax:

Symbol	Meaning
`!=`	not equal
`==`	equal

Wildcards

You can use a splat * expression to reference a given attribute over a list of resources.

resource "aws_instance" "app_servers" {
  count = 5
  ...
}

output "private_addresses" {
  value = aws_instance.app_servers[*].private_dns
}

Comprehensions

You can create lists and maps using syntax similar to list/dictionary comprehensions in Python.

The general syntax is [for item in iterable: value] for a list and {for item in iterable: key => value} for a map. You can also add conditional statements, such as in the example below.

locals {
  tags = tomap({ foo = "123", bar = "456", empty = "" })
}

value = { for key, value in local.tags : key => value if value != "" }

What I'm working on in 2022

Jeremy Jordan — Fri, 07 Jan 2022 23:28:48 GMT

When I first started my self-study of machine learning after graduating from university, I found it really useful to have a structured curriculum of topics I wanted to cover. It was rewarding to be able to convert topics into blog posts as I progressed through my learning, and it kept me on a

A simple solution for monitoring ML systems.

Jeremy Jordan — Sun, 03 Jan 2021 01:20:13 GMT

This blog post aims to provide a simple, open-source solution for monitoring ML systems. We'll discuss industry-standard monitoring tools and practices for software systems and how they can be adapted to monitor ML systems.

To illustrate this, we'll use a scikit-learn model trained on the UCI Wine Quality dataset and served via FastAPI (see Github repo here). We'll collect metrics from the server using Prometheus and visualize the results in a Grafana dashboard. All of the services will be deployed on a Kubernetes cluster; if you're not familiar with Kubernetes, feel free to take a quick read through my introduction to Kubernetes blog post.

Overview

Why is monitoring important?
What should we be monitoring?
Monitoring a wine quality prediction model: a case study.
Going beyond a simple monitoring solution
- Purpose-built monitoring tools for ML models
Best practices for monitoring
Resources

Why is monitoring important?

It's a well-accepted practice to monitor software systems so that we can understand performance characteristics, react quickly to system failures, and ensure that we're upholding our Service Level Objectives.

Monitoring systems can help give us confidence that our systems are running smoothly and, in the event of a system failure, can quickly provide appropriate context when diagnosing the root cause.

When deploying machine learning models, we still have the same set of concerns discussed above. However, we'd also like to have confidence that our model is making useful predictions in production.

There are many reasons why a model can fail to make useful predictions in production:

The underlying data distribution has shifted over time and the model has gone stale.
The production data stream contains edge cases (not seen during model development) where the model performs poorly.
The model was misconfigured in its production deployment.

In all of these scenarios, the model could still make a "successful" prediction from a service perspective, but the predictions will likely not be useful. Monitoring our machine learning models can help us detect such scenarios and intervene (e.g. trigger a model retraining/deployment pipeline).

What should we be monitoring?

At a high level, there's three classes of metrics that we'll want to track and monitor.

Model metrics

Prediction distributions
Feature distributions
Evaluation metrics (when ground truth is available)

System metrics

Request throughput
Error rate
Request latencies
Request body size
Response body size

Resource metrics

CPU utilization
Memory utilization
Network data transfer
Disk I/O

Monitoring a wine quality prediction model: a case study.

Throughout the rest of this blog post, we'll walk through the process of instrumenting and monitoring a scikit-learn model trained on the UCI Wine Quality dataset. This model is trained to predict a wine's quality on the scale of 0 (lowest) to 10 (highest) based on a number of chemical attributes.

At a high level, we'll:

Create a containerized REST service to expose the model via a prediction endpoint.
Instrument the server to collect metrics which are exposed via a separate metrics endpoint.
Deploy Prometheus to collect and store metrics.
Deploy Grafana to visualize the collected metrics.
Finally, we'll simulate production traffic using Locust so that we have some data to see in our dashboards.

Feel free to clone this Github repository and follow along yourself. All of the instructions to deploy these components on your own cluster are provided in the README.md file.

Deploying a model with FastAPI

If you look in the model/ directory of the repo linked previously, you'll see a couple files.

train.py contains a simple script to produce a serialized model artifact.
app/api.py defines a few routes for our model service including a model prediction endpoint and a health-check endpoint.
app/schemas.py defines the expected schema for the request and response bodies in the model prediction endpoint.
Dockerfile lists the instructions to package our REST server as a container.

We can deploy this server on our Kubernetes cluster using the manifest defined in kubernetes/models/.

Instrumenting our service with metrics

In order to monitor this service, we'll need to collect and expose metrics data. We'll go into more details in the subsequent section, but for now our goal is to capture "metrics" and expose this data via a /metrics endpoint on our server.

For FastAPI servers, we can do this using prometheus-fastapi-instrumentator. This library includes FastAPI middleware that collects metrics for each request and exposes the metric data to a specified endpoint.

For our example, we'll capture some of the metrics included in the library (request size, response size, latency, request count) as well as one custom-defined metric (our regression model's output). You can see this configuration defined in model/app/monitoring.py.

After deploying our model service on the Kubernetes cluster, we can port forward to a pod running the server and check out the metrics endpoint running at 127.0.0.1:3000/metrics.

kubectl port-forward service/wine-quality-model-service 3000:80

An example of the data typically found at a /metrics endpoint.

Note: many of the framework-specific serving libraries offer the ability to expose a metrics endpoint out of the box. However, I'm not sure how you can define custom (model-specific) metrics to be logged using these serving platforms.

Capturing metrics with Prometheus

After exposing our metrics at a specified endpoint, we can use Prometheus to collect and store this metric data. We'll deploy Prometheus onto our Kubernetes cluster using helm, see the Setup section in the README.md file for full instructions.

Prometheus is an open-source monitoring service with a focus on reliability. It is responsible for collecting metrics data from our service endpoints and efficiently storing this data to be later queried.

Prometheus refers to endpoints containing metric data as targets which can be discovered either through service discovery or static configuration. In our example, we'll use service discovery to enable Prometheus to discover which targets should be scraped. We can do this by creating a ServiceMonitor resource for our wine quality prediction service. This resource specification is included in the kubernetes/models/wine_quality.yaml manifest. This resource must be defined in the same namespace that Prometheus is running in.

You can see all of the services configured to be discovered by Prometheus by running:

kubectl get servicemonitor -n monitoring

You'll notice that in addition to collecting metrics from our wine quality prediction service, Prometheus has already been configured to collect metrics from the Kubernetes cluster itself.

Prometheus will scrape the metrics data at each of these endpoints at a specified interval (every 15 seconds by default). There are four supported data types for metrics.

Counter: a single value that monotonically increases over time (e.g. for counting number of requests made)
Gauge: a single value that can increase or decrease over time (e.g. for tracking current memory utilization)
Summary: a collection of values aggregated by count and sum (e.g. for calculating average request size)
Histogram: a collection of values aggregated into buckets (e.g. for tracking request latency)

You can read more about these metric types here. Each metric has a name and an optional set of labels (key-value pairs) to describe the observed value. These metrics are designed such that they can be aggregated at different timescales for more efficient storage. This metric data can be queried by other services (such as a Grafana) which make requests to the HTTP server.

Note: Prometheus uses a PULL mechanism for collecting metrics, where services simply expose the metrics data at an endpoint and Prometheus collects the data. Some other monitoring services (e.g. AWS CloudWatch) use a PUSH mechanism where each service collects and sends its own metrics to the monitoring server. You can read about the tradeoffs for each approach here.

Visualizing results in Grafana

Now that our metrics are being collected and stored in Prometheus, we need a way to visualize the data. Grafana is often paired with Prometheus to provide the ability to create dashboards from the metric data. In fact, our helm install of the Prometheus stack included Grafana out of the box.

You can check out the Grafana dashboard by port-forwarding to the service and visiting 127.0.0.1:8000.

kubectl port-forward service/prometheus-stack-grafana 8000:80 -n monitoring

We can create visualizations by making queries to the Prometheus data source. Prometheus uses a query language called PromQL. Admittedly, it can take some time to get used to this query language. After reading through the official documentation, I'd recommend watching PromQL for Mere Mortals to get a better understanding of the query language.

The repository for this blog post contains a pre-built dashboard (see dashboards/model.json) for the wine quality prediction service which you can import and play around with.

Simulating production traffic with Locust

At this point, we have a model deployed as a REST service which is instrumented to export metric data. This metric data is being collected by Prometheus and we can visualize the results in a Grafana dashboard. However, in order to see something interesting in the dashboard we'll want to simulate production traffic.

We'll use locust, a Python load testing framework, to make requests to our model service and simulate production traffic. This behavior is defined in load_tests/locustfile.py where we define three tasks:

make a request to our health check endpoint
choose a random example from the wine quality dataset and make a request to our prediction service
choose a random example from the wine quality dataset, corrupt the data, and make a bad request to our prediction service

The manifests to deploy this load test to our cluster can be found in kubernetes/load_tests/.

Going beyond a simple monitoring solution

In the wine quality prediction model case study, we only tracked a single model metric: the prediction distributions. However, as I mentioned previously there's more model-related metrics that we may want to track such as feature distributions and evaluation metrics.

These remaining concerns introduce a few new challenges:

The Prometheus time series database was designed to store metric data, not features for ML models. This is not the right technology choice to store feature data for tracking their fluctuations over time.
Model evaluation metrics require a feedback signal containing the ground truth which is not available at inference time.

When it comes to logging feature distributions, there's a range of approaches you can take. The simplest approach for monitoring feature distributions might be to deploy a drift-detection service alongside the model. This service would fire (and increase a counter metric) when feature drift is detected.

For additional visibility into your production data, you can log a statistical profile of the features observed in production. This allows for a compact summary of data observed in production and can help you identify when data distributions are shifting. The information logged will vary depending on the type of data:

For tabular data, you can directly log features used through statistical measures such as histograms and value counts.
For text data, you can log metadata about the text such as character length, count of numeric characters, count of alphabetic characters, count of special characters, average word length, etc.
For image data, you can log metadata about the image such as average pixel value per channel, image width, image height, aspect ratio, etc.

For full visibility on your production data stream, you can log the full feature payload. This is not only useful for monitoring purposes, but also in data collection for labeling and future model training. If your deployment stack already supports saving production data, you can leverage this data source (typically a data lake) for monitoring whether or not the production data distribution is drifting away from the training data distribution.

An example architecture provided by Databricks for saving and monitoring production data.

In order to monitor model evaluation metrics, we'll need to persist model prediction results alongside a unique identifier which is included in the model inference response. This unique identifier will be used when we asynchronously receive feedback from the user or some upstream system regarding a prediction we made. Once we've received feedback we can calculate our evaluate metric and log the results via Prometheus to be displayed in the model dashboard.

Purpose-built monitoring tools for ML models

If you're looking for out-of-the-box support for some of these more advanced model monitoring use cases, there's a growing number of ML monitoring services that you can use.

WhyLogs is an open-source library for capturing statistical profiles of production data streams.

Seldon has developed an open-source library which runs on Kubernetes and uses a similar technology stack (Prometheus, Grafana, Elasticsearch) as discussed in this blog post.

Fiddler offers a slick monitoring product with a dashboard built specifically for monitoring models and understanding predictions on production data. Check out this pull request to see how I integrated Fiddler with the wine quality prediction model from this blog post.

Amazon Sagemaker offers a model monitoring suite which integrates well if you're already deploying models on Sagemaker.

Best practices for monitoring

Prometheus

Avoid storing high-cardinality data in labels. Every unique set of labels for is treated as a distinct time series, high-cardinality data in labels can drastically increase the amount of data being stored. As a general rule, try to keep the cardinality for a given metric (number of unique label-sets) under 10.
Metric names should have a suffix describing the unit (e.g. http_request_duration_seconds)
Use base units when recording values (e.g. seconds instead of milliseconds).
Use standard Prometheus exporters when available.

Grafana

Ensure your dashboards are easily discoverable and consistent by design.
Use template variables instead of hardcoding values or duplicating charts.
Provide appropriate context next to important charts.
Keep your dashboards in source control.
Avoid duplicating dashboards.

Resources

Blog posts

Talks

Acknowledgements

Thanks to Goku Mohandas, John Huffman, Shreya Shankar, and Binal Patel for reading early drafts of this blog post and providing feedback.

Effective testing for machine learning systems.

Jeremy Jordan — Wed, 19 Aug 2020 23:08:38 GMT

Working as a core maintainer for PyTorch Lightning, I've grown a strong appreciation for the value of tests in software development. As I've been spinning up a new project at work, I've been spending a fair amount of time thinking about how we should test machine learning systems. A couple weeks ago, one of my coworkers sent me a fascinating paper on the topic which inspired me to dig in, collect my thoughts, and write this blog post.

In this blog post, we'll cover what testing looks like for traditional software development, why testing machine learning systems can be different, and discuss some strategies for writing effective tests for machine learning systems. We'll also clarify the distinction between the closely related roles of evaluation and testing as part of the model development process. By the end of this blog post, I hope you're convinced of both the extra work required to effectively test machine learning systems and the value of doing such work.

What's different about testing machine learning systems?

In traditional software systems, humans write the logic which interacts with data to produce a desired behavior. Our software tests help ensure that this written logic aligns with the actual expected behavior.

However, in machine learning systems, humans provide desired behavior as examples during training and the model optimization process produces the logic of the system. How do we ensure this learned logic is going to consistently produce our desired behavior?

Let's start by looking at the best practices for testing traditional software systems and developing high-quality software.

A typical software testing suite will include:

unit tests which operate on atomic pieces of the codebase and can be run quickly during development,
regression tests replicate bugs that we've previously encountered and fixed,
integration tests which are typically longer-running tests that observe higher-level behaviors that leverage multiple components in the codebase,

and follow conventions such as:

don't merge code unless all tests are passing,
always write tests for newly introduced logic when contributing code,
when contributing a bug fix, be sure to write a test to capture the bug and prevent future regressions.

A typical workflow for software development.

When we run our testing suite against the new code, we'll get a report of the specific behaviors that we've written tests around and verify that our code changes don't affect the expected behavior of the system. If a test fails, we'll know which specific behavior is no longer aligned with our expected output. We can also look at this testing report to get an understanding of how extensive our tests are by looking at metrics such as code coverage.

An example output from a traditional software testing suite.

Let's contrast this with a typical workflow for developing machine learning systems. After training a new model, we'll typically produce an evaluation report including:

performance of an established metric on a validation dataset,
plots such as precision-recall curves,
operational statistics such as inference speed,
examples where the model was most confidently incorrect,

and follow conventions such as:

save all of the hyper-parameters used to train the model,
only promote models which offer an improvement over the existing model (or baseline) when evaluated on the same dataset.

A typical workflow for model development.

When reviewing a new machine learning model, we'll inspect metrics and plots which summarize model performance over a validation dataset. We're able to compare performance between multiple models and make relative judgements, but we're not immediately able to characterize specific model behaviors. For example, figuring out where the model is failing usually requires additional investigative work; one common practice here is to look through a list of the top most egregious model errors on the validation dataset and manually categorize these failure modes.

Assuming we write behavioral tests for our models (discussed below), there's also the question of whether or not we have enough tests! While traditional software tests have metrics such as the lines of code covered when running tests, this becomes harder to quantify when you shift your application logic from lines of code to parameters of a machine learning model. Do we want to quantify our test coverage with respect to the input data distribution? Or perhaps the possible activations inside the model?

Odena et al. introduce one possible metric for coverage where we track the model logits for all of the test examples and quantify the area covered by radial neighborhoods around these activation vectors. However, my perception is that as an industry we don't have a well-established convention here. In fact, it feels like that testing for machine learning systems is in such early days that this question of test coverage isn't really being asked by many people.

What's the difference between model testing and model evaluation?

While reporting evaluation metrics is certainly a good practice for quality assurance during model development, I don't think it's sufficient. Without a granular report of specific behaviors, we won't be able to immediately understand the nuance of how behavior may change if we switch over to the new model. Additionally, we won't be able to track (and prevent) behavioral regressions for specific failure modes that had been previously addressed.

This can be especially dangerous for machine learning systems since often times failures happen silently. For example, you might improve the overall evaluation metric but introduce a regression on a critical subset of data. Or you could unknowingly add a gender bias to the model through the inclusion of a new dataset during training. We need more nuanced reports of model behavior to identify such cases, which is exactly where model testing can help.

For machine learning systems, we should be running model evaluation and model tests in parallel.

Model evaluation covers metrics and plots which summarize performance on a validation or test dataset.
Model testing involves explicit checks for behaviors that we expect our model to follow.

Both of these perspectives are instrumental in building high-quality models.

In practice, most people are doing a combination of the two where evaluation metrics are calculated automatically and some level of model "testing" is done manually through error analysis (i.e. classifying failure modes). Developing model tests for machine learning systems can offer a systematic approach towards error analysis.

How do you write model tests?

In my opinion, there's two general classes of model tests that we'll want to write.

Pre-train tests allow us to identify some bugs early on and short-circuit a training job.
Post-train tests use the trained model artifact to inspect behaviors for a variety of important scenarios that we define.

Pre-train tests

There's some tests that we can run without needing trained parameters. These tests include:

check the shape of your model output and ensure it aligns with the labels in your dataset
check the output ranges and ensure it aligns with our expectations (eg. the output of a classification model should be a distribution with class probabilities that sum to 1)
make sure a single gradient step on a batch of data yields a decrease in your loss
make assertions about your datasets
check for label leakage between your training and validation datasets

The main goal here is to identify some errors early so we can avoid a wasted training job.

Post-train tests

However, in order for us to be able to understand model behaviors we'll need to test against trained model artifacts. These tests aim to interrogate the logic learned during training and provide us with a behavioral report of model performance.

Paper highlight: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

The authors of the above paper present three different types of model tests that we can use to understand behavioral attributes.

Invariance Tests

Invariance tests allow us to describe a set of perturbations we should be able to make to the input without affecting the model's output. We can use these perturbations to produce pairs of input examples (original and perturbed) and check for consistency in the model predictions. This is closely related to the concept of data augmentation, where we apply perturbations to inputs during training and preserve the original label.

For example, imagine running a sentiment analysis model on the following two sentences:

Mark was a great instructor.
Samantha was a great instructor.

We would expect that simply changing the name of the subject doesn't affect the model predictions.

Directional Expectation Tests

Directional expectation tests, on the other hand, allow us to define a set of perturbations to the input which should have a predictable effect on the model output.

For example, if we had a housing price prediction model we might assert:

Increasing the number of bathrooms (holding all other features constant) should not cause a drop in price.
Lowering the square footage of the house (holding all other features constant) should not cause an increase in price.

Let's consider a scenario where a model fails the second test - taking a random row from our validation dataset and decreasing the feature house_sq_ft yields a higher predicted price than the original label. This is surprising as it doesn't match our intuition, so we decide to look further into it. We realize that, without having a feature for the house's neighborhood/location, our model has learned that smaller units tend to be more expensive; this is due to the fact that smaller units from our dataset are more prevalent in cities where prices are generally higher. In this case, the selection of our dataset has influenced the model's logic in unintended ways - this isn't something we would have been able to identify simply by examining performance on a validation dataset.

Minimum Functionality Tests (aka data unit tests)

Just as software unit tests aim to isolate and test atomic components in your codebase, data unit tests allow us to quantify model performance for specific cases found in your data.

This allows you to identify critical scenarios where prediction errors lead to high consequences. You may also decide to write data unit tests for failure modes that you uncover during error analysis; this allows you to "automate" searching for such errors in future models.

Snorkel also has introduced a very similar approach through their concept of slicing functions. These are programatic functions which allow us to identify subsets of a dataset which meet certain criteria. For example, you might write a slicing function to identify sentences less than 5 words to evaluate how the model performs on short pieces of text.

Organizing tests

In traditional software tests, we typically organize our tests to mirror the structure of the code repository. However, this approach doesn't translate well to machine learning models since our logic is structured by the parameters of the model.

The authors of the CheckList paper linked above recommend structuring your tests around the "skills" we expect the model to acquire while learning to perform a given task.

For example, a sentiment analysis model might be expected to gain some understanding of:

vocabulary and parts of speech,
robustness to noise,
identifying named entities,
temporal relationships,
and negation of words.

For an image recognition model, we might expect the model to learn concepts such as:

object rotation,
partial occlusion,
perspective shift,
lighting conditions,
weather artifacts (rain, snow, fog),
and camera artifacts (ISO noise, motion blur).

Model development pipeline

Putting this all together, we can revise our diagram of the model development process to include pre-train and post-train tests. These tests outputs can be displayed alongside model evaluation reports for review during the last step in the pipeline. Depending on the nature of your model training, you may choose to automatically approve models provided that they meet some specified criteria.

A proposed workflow for developing high-quality models.

Conclusion

Machine learning systems are trickier to test due to the fact that we're not explicitly writing the logic of the system. However, automated testing is still an important tool for the development of high-quality software systems. These tests can provide us with a behavioral report of trained models, which can serve as a systematic approach towards error analysis.

Throughout this blog post, I've presented "traditional software development" and "machine learning model development" as two separate concepts. This simplification made it easier to discuss the unique challenges associated with testing machine learning systems; unfortunately, the real world is messier. Developing machine learning models also relies on a large amount of "traditional software development" in order to process data inputs, create feature representations, perform data augmentation, orchestrate model training, expose interfaces to external systems, and much more. Thus, effective testing for machine learning systems requires both a traditional software testing suite (for model development infrastructure) and a model testing suite (for trained models).

If you have experience testing machine learning systems, please reach out and share what you've learned!

Thank you, Xinxin Wu, for sending me the paper which inspired me to write this post! Additionally, I'd like to thank John Huffman, Josh Tobin, and Andrew Knight for reading earlier drafts of this post and providing helpful feedback.

An introduction to Kubernetes.

Jeremy Jordan — Wed, 27 Nov 2019 03:29:54 GMT

This blog post will provide an introduction to Kubernetes so that you can understand the motivation behind the tool, what it is, and how you can use it. In a follow-up post, I'll discuss how we can leverage Kubernetes to power data science workloads using more concrete (data science) examples. However, it helps to first build an understanding of the fundamentals - which is the focus of this post.

Prerequisites: I'm going to make the assumption that you're familiar with container technologies such as Docker. If you don't have experience building and running container images, I suggest starting here before you continue reading this post.

Overview

Here's what we'll discuss in this post. You can click on any top-level heading to jump directly to that section.

What is the point of Kubernetes?
Design principles.
- Declaritive
- Distributed
- Decoupled
- Immutable
Basic objects in Kubernetes.
- Pod
- Deployment
- Service
- Ingress
- Job
How? The Kubernetes control plane.
- Master node
  - API server
  - etcd
  - scheduler
  - controller-manager
- Worker nodes
  - kubelet
  - kube-proxy
When should you not use Kubernetes?
Resources

What is the point of Kubernetes?

Kubernetes is often described as a container orchestration platform. In order to understand what exactly that means, it helps to revisit the purpose of containers, what's missing, and how Kubernetes fills that gap.

Note: You will also see Kubernetes referred to by its numeronym, k8s. It means the same thing, just easier to type.

Why do we love containers? Containers provide a lightweight mechanism for isolating an application's environment. For a given application, we can specify the system configuration and libraries we want installed without worrying about creating conflicts with other applications that might be running on the same physical machine. We encapsulate each application as a container image which can be executed reliably on any machine* (as long as it has the ability to run container images), providing us the portability to enable smooth transitions from development to deployment. Additionally, because each application is self-contained without the concern of environment conflicts, it's easier to place multiple workloads on the same physical machine and achieve higher resource (memory and CPU) utilization - ultimately lowering costs.

What's missing? However, what happens if your container dies? Or even worse, what happens if the machine running your container fails? Containers do not provide a solution for fault tolerance. Or what if you have multiple containers that need the ability to communicate, how do you enable networking between containers? How does this change as you spin up and down individual containers? Container networking can easily become an entangled mess. Lastly, suppose your production environment consists of multiple machines - how do you decide which machine to use to run your container?

Kubernetes as a container orchestration platform. We can address many of the concerns mentioned above using a container orchestration platform.

The director of an orchestra holds the vision for a musical performance and communicates with the musicians in order to coordinate their individual instrumental contributions to achieve this overall vision. As the architect of a system, your job is simply to compose the music (specify the containers to be run) and then hand over control to the orchestra director (container orchestration platform) to achieve that vision.

Image credit

A container orchestration platform manages the entire lifecycle of individual containers, spinning up and shutting down resources as needed. If a container shuts down unexpectedly, the orchestration platform will react by launching another container in its place.

On top of this, the orchestration platform provides a mechanism for applications to communicate with each other even as underlying individual containers are created and destroyed.

Lastly, given (1) a set of container workloads to run and (2) a set of machines on a cluster, the container orchestrator examines each container and determines the optimal machine to schedule that workload. To understand why this can be valuable, watch Kelsey Hightower explain (17:47-20:55) the difference between automated deployments and container orchestration using an example game of Tetris.

Design principles.

Now that we understand the motivation for container orchestration in general, let's spend some time to discuss the motivating design principles behind Kubernetes. It helps to understand these principles so that you can use the tool as it was intended to be used.

Declarative

Perhaps the most important design principle in Kubernetes is that we simply define the desired state of our system and let Kubernetes automation work to ensure that the actual state of the system reflects these desires. This absolves you of the responsibility of fixing most things when they break; you simply need to state what your system should look like in an ideal state. Kubernetes will detect when the actual state of the system doesn't meet these expectations and it will intervene on your behalf to fix the problem. This enables our systems to be self-healing and react to problems without the need for human intervention.

The "state" of your system is defined by a collection of objects. Each Kubernetes object has (1) a specification in which you provide the desired state and (2) a status which reflects the current state of the object. Kubernetes maintains a list of all object specifications and constantly polls each object in order to ensure that its status is equal to the specification. If an object is unresponsive, Kubernetes will spin up a new version to replace it. If a object's status has drifted from the specification, Kubernetes will issue the necessary commands to drive that object back to its desired state.

Distributed

For a certain operating scale, it becomes necessary to architect your applications as a distributed system. Kubernetes is designed to provide the infrastructural layer for such distributed systems, yielding clean abstractions to build applications on top of a collection of machines (collectively known as a cluster). More specifically, Kubernetes provides a unified interface for interacting with this cluster such that you don't have to worry about communicating with each machine individually.

Decoupled

It is commonly recommended that containers be developed with a single concern in mind. As a result, developing containerized applications lends itself quite nicely to the microservice architecture design pattern, which recommends "designing software applications as suites of independently deployable services."

The abstractions provided in Kubernetes naturally support the idea of decoupled services which can be scaled and updated independently. These services are logically separated and communicate via well-defined APIs. This logical separation allows teams to deploy changes into production at a higher velocity since each service can operate on independent release cycles (provided that they respect the existing API contracts).

Immutable infrastructure

In order to achieve the most benefit from containers and container orchestration, you should be deploying immutable infrastructure. This is, rather than logging in to a container on a machine to make changes (eg. updating a library), you should build a new container image, deploy the new version, and terminate the older version. As you transition across environments during the life-cycle of a project (development -> testing -> production) you should use the same container image and only modify configurations external to the container image (eg. by mounting a config file).

This becomes very important since containers are designed to be ephemeral, ready to be replaced by another container instance at any time. If your original container had mutated state (eg. manual configuration) but was shut down due to a failed healthcheck, the new container spun up in its place would not reflect those manual changes and could potentially break your application.

When you maintain immutable infrastructure, it also becomes much easier to roll back your applications to a previous state (eg. if an error occurs) - you can simply update your configuration to use an older container image.

Basic objects in Kubernetes.

Previously, I mentioned that we describe our desired state of the system through a collection of Kubernetes objects. Up until now, our discussion of Kubernetes has been relatively abstract and high-level. In this section, we'll dig into more specifics regarding how you can deploy applications on Kubernetes by covering the basic objects available in Kubernetes.

Kubernetes objects can be defined using either YAML or JSON files; these files defining objects are commonly referred to as manifests. It's a good practice to keep these manifests in a version controlled repository which can act as the single source of truth as to what objects are running on your cluster.

Pod

The Pod object is the fundamental building block in Kubernetes, comprised of one or more (tightly related) containers, a shared networking layer, and shared filesystem volumes. Similar to containers, pods are designed to be ephemeral - there is no expectation that a specific, individual pod will persist for a long lifetime.

You won't typically explicitly create Pod objects in your manifests, as it's often simpler to use higher level components which manage Pod objects for you.

Deployment

A Deployment object encompasses a collection of pods defined by a template and a replica count (how many copies of the template we want to run). You can either set a specific value for the replica count or use a separate Kubernetes resource (eg. a horizontal pod autoscaler) to control the replica count based on system metrics such as CPU utilization.

Note: The Deployment object's controller actually creates another object, a ReplicaSet, under the hood. However, this is abstracted away from you as the user.

While you can't rely on any single pod to stay running indefinitely, you can rely on the fact that the cluster will always try to have $n$ pods available (where $n$ is defined by your specified replica count). If we have a Deployment with a replica count of 10 and 3 of those pods crash due to a machine failure, 3 more pods will be scheduled to run on a different machine in the cluster. For this reason, Deployments are best suited for stateless applications where Pods are able to be replaced at any time without breaking things.

The following YAML file provides an annotated example of how you might define a Deployment object. In this example, we want to run 10 instances of a container which serves an ML model over a REST interface.

Note: In order for Kubernetes to know how compute-intensive this workload might be, we should also provide resource limits in the pod template specification.

Deployments also allow us to specify how we would like to roll out updates when we have new versions of our container image; this blog post provides a good overview of your different options. If we wanted to override the defaults we would include an additional strategy field under the object spec. Kubernetes will make sure to gracefully shut down Pods running the old container image and spin up new Pods running the new container image.

Service

Each Pod in Kubernetes is assigned a unique IP address that we can use to communicate with it. However, because Pods are ephemeral, it can be quite difficult to send traffic to your desired container. For example, let's consider the Deployment from above where we have 10 Pods running a container serving a machine learning model over REST. How do we reliably communicate with a server if the set of Pods running as part of the Deployment can change at any time? This is where the Service object enters the picture. A Kubernetes Service provides you with a stable endpoint which can be used to direct traffic to the desired Pods even as the exact underlying Pods change due to updates, scaling, and failures. Services know which Pods they should send traffic to based on labels (key-value pairs) which we define in the Pod metadata.

Note: This blog post does a nice job explaining how traffic is actually routed.

In this example, our Service sends traffic to all healthy Pods with the label app="ml-model".

The following YAML file provides an example for how we might wrap a Service around the earlier Deployment example.

Ingress

While a Service allows us to expose applications behind a stable endpoint, the endpoint is only available to internal cluster traffic. If we wanted to expose our application to traffic external to our cluster, we need to define an Ingress object.

The benefit of this approach is that you can select which Services to make publicly available. For example, suppose that in addition to our Service for a machine learning model, we had a UI which leveraged the model's predictions as part of a larger application. We may choose to only make the UI available to public traffic, preventing users from being able to query the model serving Service directly.

The following YAML file defines an Ingress object for the above example, making the UI publicly accessible.

Job

The Kubernetes objects I've described up until this point can be composed to create reliable, long-running services. In contrast, the Job object is useful when you want to perform a discrete task. For example, suppose we want to retrain our model daily based on the information collected from the previous day. Each day, we want to spin up a container to execute a predefined workload (eg. a train.py script) and then shut down when the training finishes. Jobs provide us the ability to do exactly this! If for some reason our container crashes before finishing the script, Kubernetes will react by launching a new Pod in its place to finish the job. For Job objects, the "desired state" of the object is completion of the job.

The following YAML defines an example Job for training a machine learning model (assuming the training code is defined in train.py).

Note: This Job specification will only execute a single training run. If we wanted to execute this job daily, we could define a CronJob object instead.

...and many more.

The objects discussed above are certainly not an exhaustive list of resource types available in Kubernetes. Some other objects that you'll most likely find useful when deploying applications include:

Volume: for managing directories mounted onto Pods
Secret: for storing sensitive credentials
Namespace: for separating resources on your cluster
ConfigMap: for specifying application configuration values to be mounted as a file
HorizontalPodAutoscaler: for scaling Deployments based on the current resource utlization of existing Pods
StatefulSet: similar to a Deployment, but for when you need to run a stateful application

How? The Kubernetes control plane.

By this point, you're probably wondering how Kubernetes is capable of taking all of our object specifications and actually executing these workloads on a cluster. In this section we'll discuss the components that make up the Kubernetes control plane which govern how workloads are executed, monitored, and maintained on our cluster.

Before we dive in, it's important to distinguish two classes of machines on our cluster:

A master node contains most of the components which make up our control plane that we'll discuss below. In most moderate-sized clusters you'll only have a single master node, although it is possible to have multiple master nodes for high availability. If you use a cloud provider's managed Kubernetes service, they will typically abstract away the master node and you will not have to manage or pay for it.
A worker node is a machine which actually runs our application workloads. It is possible to have multiple different machine types tailored to different types of workloads on your cluster. For example, you might have some GPU-optimized nodes for faster model training and then use CPU-optimized nodes for serving. When you define object specifications, you can specify a preference as to what type of machine the workload gets assigned to.

Now let's dive in to the main components on our master node. When you're communicating with Kubernetes to provide a new or updated object specification, you're talking to the API server.

More specifically, the API server validates requests to update objects and acts as the unified interface for questions about our cluster's current state. However, the state of our cluster is stored in etcd, a distributed key-value store. We'll use etcd to persist information regarding: our cluster configuration, object specifications, object statuses, nodes on the cluster, and which nodes the objects are assigned to run on.

Note: etcd is the only stateful component in our control plane, all other components are stateless.

Speaking of where objects should be run, the scheduler is in charge of determining this! The scheduler will ask the API server (which will then communicate with etcd) which objects which haven't been assigned to a machine. The scheduler will then determine which machines those objects should be assigned to and will reply back to the API server to reflect this assignment (which gets propagated to etcd).

The last component on the master node that we'll discuss in this post is the controller-manager, which monitors the state of a cluster through the API server to see if the current state of the cluster aligns with our desired state. If the actual state differs from our desired state, the controller-manager will make changes via the API server in an attempt to drive the cluster towards the desired state. The controller-manager is defined by a collection of controllers, each of which are responsible for managing objects of a specific resource type on the cluster. At a very high level, a controller will watch a specific resource type (eg. deployments) stored in etcd and create specifications for the pods which should be run to acheive the object's desired state. It is then the controller's responsibility for ensuring that these pods stay healthy while running and are shut down when needed.

To summarize what we've covered so far...

Next, let's discuss the control plane components which are run on worker nodes. Most of the resources available on our worker nodes are spent running our actual applications, but our nodes do need to know which pods they should be running and how to communicate with pods on other machines. The two final components of the control plane that we'll discuss cover exactly these two concerns.

The kubelet acts as a node's "agent" which communicates with the API server to see which container workloads have been assigned to the node. It is then responsible for spinning up pods to run these assigned workloads. When a node first joins the cluster, kubelet is responsible for announcing the node's existence to the API server so the scheduler can assign pods to it.

Lastly, kube-proxy enables containers to be able to communicate with each other across the various nodes on the cluster. This component handles all the networking concerns such as how to forward traffic to the appropriate pod.

Hopefully, by this point you should be able to start seeing the whole picture for how things operate in a Kubernetes cluster. All of the components interact through the API server and we store the state of our cluster in etcd. There are various components which write to etcd (via the API server) to make changes to the cluster and nodes on the cluster listen to etcd (via the API server) to see which pods it should be running.

The overall system is designed such that failures have minimal impact on the overall cluster. For example, if our master node went down, none of our applications would be immediately affected; we would just be prevented from making any further changes to the cluster until a new master node is brought online.

When should you not use Kubernetes?

As with every new technology, you'll incur an overhead cost of adoption as you learn how it works and how it applies to the applications that you're building. It's a reasonable question to ask "do I really need Kubernetes?" so I'll attempt to provide some example situations where the answer might be no.

You can run your workload on a single machine. (Kubernetes can be viewed as a platform for building distributed systems, but you shouldn't build a distributed system if you don't need one!)
Your compute needs are light. (In this case, the compute spent on the orchestration framework is relatively high!)
You don't need high availability and can tolerate downtime.
You don't envision making a lot of changes to your deployed services.
You already have an effective tool stack that you're satisfied with.
You have a monolithic architecture and don't plan to separate it into microservices. (This goes back to using the tool as it was intended to be used.)
You read this post and thought "holy shit this is complicated" rather than "holy shit this is useful".

Gratitude

Thank you to Derek Whatley and Devon Kinghorn for teaching me most of what I know about Kubernetes and answering my questions as I've been trying to wrap my head around this technology. Thank you to Goku Mohandas, John Huffman, Dan Salo, and Zack Abzug for spending your time to review an early version of this post and provide thoughtful feedback. And lastly, thank you to Kelsey Hightower for all that you've contributed to the Kubernetes community - your talks have helped me understand the bigger picture and gave me confidence that I could learn about this topic.

Resources

Here are some resources that I found to be useful when learning about Kubernetes.

Blogs

Julia Evans - Reasons Kubernetes is cool
Julia Evans - A few things I've learned about Kubernetes (Julia's zines were a big inspiration for my visuals explaining the Kubernetes control plane)
Jessie Frazelle - You might not need Kubernetes
Matt Rogish - Is Kubernetes Overkill?
Major Trends in the 2019 Data & AI Landscape
Introduction to cloud-native applications and defining cloud native
Kubernetes Best Practices 101

Videos

Kelsey Hightower - Kuberenetes for Pythonistas discusses the motivation for Kubernetes and provides an example of running a Python application.
Kubernetes by Kelsey Hightower introduces the core components of Kubernetes and how they work together, with the API server at the core.
Kubernetes The Easy Way! presents a developer-centric workflow for building products and leveraging Kubernetes as the infrastructure.
Kubernetes in Real Life - Ian Crosby
Kubernetes Design Principles: Understand the Why - Saad Ali, Google
Kubernetes Deconstructed: Understanding Kubernetes by Breaking It Down - Carson Anderson, DOMO
From COBOL to Kubernetes: A 250 Year Old Bank's Cloud-Native Journey - Laura Rehorst

Books

Building machine learning products: a problem well-defined is a problem half-solved.

Jeremy Jordan — Sun, 22 Sep 2019 00:09:49 GMT

Previously, I wrote about organizing machine learning projects where I presented the framework that I use for building and deploying models. However, that framework operates on the implicit assumption that you already know generally what your model should do. In this post, we'll dig deeper into how to develop the requirements for a machine learning project when you're given a vague problem to solve. Some questions that we'll address include:

What specific task should our model be automating?
How does the user interact with the model?
What information should we expose to the user?

Note: Sometimes machine learning projects can be very straightforward; your stakeholders define an API specification stating the inputs to the system and the desired outputs and you agree that the task seems feasible. These projects typically support existing products with existing "intelligence" solutions - your task is to simply encapsulate the "intelligence" task using machine learning in lieu of the existing solution (ie. rule-based systems, mechanical turk workers, etc).

Throughout this blog post, I'll use the following problem statement as a running example.

We're building an application to help people keep their photos organized. Our target user base are casual smartphone photographers who have a large number of photos in their camera roll. We want to make it easier for these users to find photos of interest.

Notice how vague that problem statement is - what defines a photo of interest? There's a multitude of ways we could address this task and without understanding the problem in more detail we won't know which direction to take. At this point in time, we have insufficient information to specify an objective function for training a model.

Understanding the problem and developing the requirements isn't something you typically get right on the first attempt; this is often an iterative process where we initially define a set of coarse requirements and refine the detail as we gain more information.

Jump to:

Understand the problem from the perspective of the user
Mock out your machine learning model and iterate on the user experience
Develop a shared language with your project stakeholders
Win by shipping
Conclusion

Understand the problem from the perspective of the user.

The first step towards establishing any set of requirements for a project is understanding the problem you're setting out to solve. No one knows the problem better than the intended user - the person we are attempting to solve the problem for.

Perform informational interviews with end users to understand their perspective. The further removed you are from the end user, the less likely you are to solve the actual problem they're experiencing.

Do you remember playing the telephone game as a kid, where you have a chain of people and try to deliver a message from one end to the other by asking each person to transmit the message to the person next to them? Usually the message at the end is very different from the original message. In order to ensure you're solving the right problem, you'll want to be able to empathize with the people who currently experience that problem.

For the case of our photo app example, this would involve speaking with casual smartphone photographers and understanding how they use their app.

When they're searching through their camera roll, what are they typically looking for?
- Are they Instagram influencers who take 100 photos of the same thing and are trying to find the best photo to post?
- Are they nature photographers looking to share the scenic views from their latest camping trip?
- Are they recent parents trying to document the development of their newborn baby?
- Are they often looking for photos with a specific friend to share on social media with a happy birthday wish?
What are their current strategies for finding the photos of interest?
- Do they know when the photo was taken?
- Do they flip through their camera roll photo by photo or scroll through a series of thumbnails?
- Are they typically searching for a photo of a specific person?
- Do they mentally "chunk" photos together and search chunk by chunk? What's the criteria for chunking?

At this stage you don't want to start prescribing a solution, you're simply trying to understand the problem. However, it can be good to consider the capabilities of machine learning systems and ask questions which may eventually guide your scoping of a solution. For example, let's consider the motivation behind asking the first set of questions.

Are they Instagram influencers who take 100 photos of the same thing and are trying to find the best photo to post? Perhaps we want to consider clustering similar photos and then applying a ranking model to help them find the best photo to share.
Are they nature photographers looking to share the scenic views from their latest camping trip? Do we need to consider additional metadata such as the GPS coordinates of where the photo was taken to enrich the data representation?
Are they recent parents trying to document the development of their newborn baby? Perhaps we want to perform action recognition on the photographs to help the parents find the moment where their newborn walked for the first time or did a cute dance.
Are they often looking for photos with a specific friend to share on social media with a happy birthday wish? Maybe we'll need to use facial recognition to identify the user's friends.

Even though you might be considering solutions, it's important that the conversations with users are focused on the problems they experience. You won't ask them about their thoughts on specific solutions until the next stage.

Subject yourself to the problem. As a machine learning practitioner, it can be tempting to jump right into training a model to learn some task. However, it's often very instructive to first force yourself to perform the task manually. Additionally, this helps you empathize with the users. Pay close attention to how you solve the task, as this might inform what features might be important to include when you do train a model to perform the task.

For example, after speaking with some casual smartphone photographers you might construct a couple photo albums and go through the tasks described during your interviews. What strategies did you find effective for finding content more quickly?

Mock out your machine learning model and iterate on the user experience.

After (and only after) getting a better understanding of the problem, you'll want to start sketching out the set of possible solutions and approaches that you could take. It's generally a good idea to think broadly at this stage rather than prematurely honing in on your first decent idea. At this stage, we're trying to elicit the desired user experience, as this can ultimately drive the requirements of the project.

Prototype and iterate on the user experience using design tools to communicate possible solutions. There's something magical about seeing something concretely that has the ability to elicit tangible feedback from your users. When speaking in the abstract sense, it's possible for both you and the other stakeholders to have different understandings whilst under the illusion that you're in agreement. However, these latent misunderstandings often become very clear when you reduce an abstract idea to practice.

For example, I used a design tool called Figma to sketch out a couple different ways we might help users find photos of interest more quickly. The goal of these sketches is to spark discussion with our stakeholders and/or users in an attempt to start narrowing down the possible solution space.

These tools allow you to easily stitch together user flows and import data to use in your mockups.

Fake the machine learning component with "Wizard-of-Oz" experiments. Building a machine learning model takes a significant amount of work. You have to acquire data for training, decide on a model architecture, ensure the model is performing at a sufficient level, etc. Our goal here is to validate the utility of a model without actually going through all of the effort involved in building it. These types of experiments can be especially useful when there's still uncertainty surrounding how users will interact with your model.

Apple researchers wrote about one such study when deciding whether their digital assistant should mimic the conversational style of the user. Their core hypothesis was that users would prefer digital assistants that match their level of chattiness when conversing with the agent. Rather than training a machine learning model that would allow them to modulate the "chattiness" level in the agent's response, they first tested this hypothesis by faking the digital assistant component and having humans in a separate room follow a script pretending to be the digital assistant.

Figure out how to establish trust with the user. Ideally you'd like to design your product such that user interactions can improve your model, which in turn improve the user experience. This is commonly referred to as the "data flywheel" in machine learning products. However, in order to source meaningful interactions from your users, you may need to first establish trust with the user that those interactions will indeed improve their experience. For example, if we decided that we would perform facial recognition and allow you to search for photos with a specific person present, you'd likely require the user to assign identities to collections of photos with a common face. In order to motivate and incentivize users, they need to feel as if their effort in identifying faces in photos is meaningful. One common technique used here is to show the model's effort in an unobtrusive manner which informs the user how it's working. Continuing the facial recongition example, you might allow the user to toggle a setting which draws bounding boxes around the identified faces in photos.

Develop a shared language with your project stakeholders.

At this point, you're likely to have a decent grasp on the approach you'll take to solve the presented problem. However, much work still lies ahead! It'll be important that you communicate effectively with stakeholders as you work to build out your solution; this starts with speaking a common language.

Perhaps a number of users leverage a technique of chunking photos together when searching, while you typically refer to that same technique as clustering. Without converging on a shared language, it's all too easy to talk right past one another without realizing you're actually on the same page.

Over-communicate and ask a lot of clarifying questions at the onset. This can include asking questions when you're already pretty sure what the answer will be, sometimes you may be surprised.

Present ideas and progress often. It's important to share progress with your stakeholders and hold discussion to ensure you're still heading in the right direction. As you present your work, it can be helpful to discuss things such as model metrics which you're using to evaluate performance and why we should care about those metrics; keeping things simple combined with repeated exposure can go a long way here.

Win by shipping.

Getting your product in the hands of your users is one of the best ways to validate your ideas. Quicker iterations allow you to validate more ideas, which in turn allows you to fine-tune your solution offering in order to maximize value to the user. Further, incremental deployments help ensure that you're heading in the right direction as you work to build out the full solution.

In my post on organizing machine learning projects, I presented a diagram for model development which represents the iterative nature of the work.

I also called out Stephen Merity's advice and Martin Zinkevich's Rule #4 of Machine Learning, which both advocate for initially deploying simple models in the spirit of winning by shipping. In this context, "winning by shipping" involves making quick iterations through the outermost development loop.

It can take getting a few projects under your belt to fully appreciate this wisdom. When I initially published the diagram a year ago, I tended to view progression through this iterative flow as a stage-gate process where you might loop back to an earlier step if you gain more information, but you still advance through the cycle in the order that sections appear on the diagram. However, after joining the machine learning team at Proofpoint, I've learned from the team that there can be immense value in shortcutting parts of the development process (eg. see shadow mode deployment below) in order to gain insights from deploying models on production data.

Deploy a baseline model on production data as soon as possible. Deploying your model on production data can be enlightening. Often times, the "live data" varies in unexpected ways from the data you've collected to use during development (often referred to as "train/test skew" or "production data drift").

As a countermeasure, it's often a good idea to deploy a simple model on production data as soon as possible. Depending on the consequence of wrong predictions, you might choose to deploy this simple model in "shadow mode" (don't actually use the predictions) or as a canary deployment on a small subset of your users. The goal of this deployment is to observe and characterize the types of errors that the simple model makes, which can inform further model improvements.

You may choose to initially skip some steps in the model development process in order to prioritize a quick first iteration and gain insights.

Deploying a simple model with urgency also helps ensure that the additional engineering work required to run a machine learning model is done upfront. This allows you to deploy your incremental model improvements (enabling quick iterations) rather than slaving away trying to build the perfect model in isolation.

Deliver value incrementally and quickly. If you're automating a system, start with the simplest task and deploy a solution quickly. Let data guide the priority order of further automation. In other words, automate first according to the "happy path" (assume nothing goes wrong) and then categorize the times where control is reverted to a human.

Going back to our photo app example, there are many opportunities to apply machine learning models to the product. Ideally, we'd like to start at the intersection of a model which is simple to develop and provides value to users. For example, we might decide to initially deploy a facial recognition model given the fact that there are a number of open source implementations that we could use and user studies provided us with some level of evidence that this model would be valuable as part of the product.

Measure time to results, not results. Sometimes it can be tricky to get into the mindset of delivering value quickly. It's easy to think that in order to delight the user, we need to give them the perfect product. Measuring time to results rather than the results themselves instills a culture of quick iterations which ultimately enables your team to deliver better products, winning by shipping.

Conclusion

Often times we're presented with vague problems and tasked with developing a "machine learning solution." By spending the effort to properly scope your project and define its requirements up front, you can establish a foundation for smooth iterations through the model development loop as we work towards a final solution.

Now go build great things!

The insights shared in this blog post are shaped by my experience working on teams building and deploying machine learning models in products. On my current team at Proofpoint, we frequently discuss how we can refine our process to deliver value quickly on machine learning projects; these conversations and shared experiences are invaluable. Thank you to John Berryman, Dan Salo, John Huffman, and Michelle Carney for reading early drafts of this post and providing feedback.

External Resources

Blog posts

Machine learning, faster (introduced me to the concept of "measure time to results, not results")
Data Jujitsu: The art of turning data into product
What you need to know about product management for AI
Practical Skills for The AI Product Manager
Product Driven Machine Learning
Human-Centered Machine Learning
Design in a world where Machines are Learning
When are we going to start designing AI with purpose?
User Research Makes Your AI Smarter
What is the best way to write a Product Requirements Document?

Case studies

Talks

Guides

Guidelines for Human-AI Interaction (and also check out the pretty infographic here)
AI Meets Design
Google's People+AI Guidebook: Designing human-centered AI products
Designing Machine Learning
Human-centered AI cheat-sheet

People to follow

Michelle Carney @michellercarney works at the intersection of UX+ML and leads the MLUX meetup group.
Jess Holbrook @jessholbrook is the co-lead of Google’s People + AI Research team; he has a ton of great articles on Medium.
Nadia Piet @NadiaPiet works as a freelance design consultant; she put together the phenomenal aimeets.design toolkit referenced above.

Introduction to recurrent neural networks.

Jeremy Jordan — Sun, 09 Jun 2019 04:53:51 GMT

Jump to:

Overview
Evolving a hidden state over time
Common structures of recurrent networks
Bidirectionality
Limitations
Further reading

Overview

Previously, I've written about feed-forward neural networks as a generic function approximator and convolutional neural networks for efficiently extracting local information from data. In this post, I'll discuss a third type of neural networks, recurrent neural networks, for learning from sequential data.

For some classes of data, the order in which we receive observations is important. As an example, consider the two following sentences:

"I'm sorry... it's not you, it's me."
"It's not me, it's you... I'm sorry."

These two sentences are communicating quite different messages, but this can only be interpreted when considering the sequential order of the words. Without this information, we're unable to disambiguate from the collection of words: {'you', 'sorry', 'me', 'not', 'im', 'its'}.

Recurrent neural networks allow us to formulate the learning task in a manner which considers the sequential order of individual observations.

Evolving a hidden state over time

In this section, we'll build the intuition behind recurrent neural networks. We'll start by reviewing standard feed-forward neural networks and build a simple mental model of how these networks learn. We'll then build on that to discuss how we can extend this model to a sequence of related inputs.

Recall that neural networks perform a series of layer by layer transformations to our input data. The hidden layers of the network form intermediate representations of our input data which make it easier to solve the given task.

This is demonstrated in the example below. Observe how our input space is warped into one which allows for a linear decision boundary to cleanly separate the two classes. At a high level, you can think of the hidden layers as "useful representations" of the original input data.

Image credit

Now let's consider how we can leverage this insight for a sequence of related observations.

Let's first focus on the initial value in the sequence. As we calculate the forward pass through the network, we build a "useful representation" of our input in the hidden layers (the activations in these layers define our hidden state), continuing on to calculate an output prediction for the initial time-step.

When considering the next time-step in the sequence, we want to leverage any information we've already extracted from the sequence.

In order to do this, our next hidden state will be calculated as a combination of the previous hidden state and latest input.

The basic method for combining these two pieces of information is shown below; however, there exist other more advanced methods that we'll discuss later (gated recurrent units, long short-term memory units). Here, we have one set of weights $w_{ih}$ to transform the input to a hidden layer representation and a second set of weights $w_{hh}$ to bring along information from the previous hidden state into the next time-step.

We can continue performing this same calculation of incorporating new information to update the value of the hidden state for an arbitrarily long sequence of observations.

By always remembering the previous hidden state, we're able to chain a sequence of events together. This also allows us to backpropagate errors to earlier timesteps during training, often referred to as "backpropagation through time".

Common structures of recurrent networks

One of the benefits of recurrent neural networks is the ability to handle arbitrary length inputs and outputs. This flexibility allows us to define a broad range of tasks. In this section, I'll discuss the general architectures used for various sequence learning tasks.

One to many RNNs are used in scenarios where we have a single input observation and would like to generate an arbitrary length sequence related to that input. One example of this is image captioning, where you feed in an image as input and output a sequence of words to describe the image. For this architecture, we take our prediction at each time step and feed that in as input to the next timestep, iteratively generating a sequence from our initial observation and following predictions.

Image credit

Many to one RNNs are used to look across a sequence of inputs and make a single determination from that sequence. For example, you might look at a sequence of words and predict the sentiment of the sentence. Generally, this structure is used when you want to perform classification on sequences of data.

Image credit

Many to many (same) RNNs are used for tasks in which we would like to predict a label for each observation in a sequence, sometimes referred to as dense classification. For example, if we would like to detect named entities (person, organization, location) in sentences, we might produce a label for every single word denoting whether or not that word is part of a named entity. As another example, you could feed in a video (sequence of images) and predict the current activity in frame.

Image credit

Many to many (different) RNNs are useful for translating a sequence of inputs into a different but related sequence of outputs. In this case, both the input and the output can be arbitrary length sequences and the input length might not always be equal to the output length. For example, a machine translation model would be expected to translate "how are you" (input) into "cómo estás" (output) even though the sequence lengths are different.

Image credit

Bidirectionality

One of the weaknesses of a ordinary recurrent neural networks is that we can only use the set of observations which we have already seen when making a prediction. As an example, consider training a model for named entity recognition. Here, we want the model to output the start and end of phrases which contain a named entity. Consider the following two sentences:

"I can't believe that Teddy Roosevelt was your great grandfather!"

"I can't believe that Teddy bear is made out of chocolate!"

However, if you only read the input sequence from left to right, it's hard to tell whether or not you should mark "Teddy" as the start of a name.

Ideally, our model output would look something like this when reading the first sentence (roughly following the inside–outside–beginning tagging format).

When determining whether or not a token is the start of a name, it would sure be helpful to see which tokens follow after it; a bidirectional recurrent neural network provides exactly that. Here, we process the sequence reading from left-to-right and right-to-left in parallel and then combine these two representations such that at any point in a sequence you have knowledge of the tokens which came before and after it.

We have one set of recurrent cells which process the sequence from left to right...

... and another set of recurrent cells which process the sequence from right to left.

Thus, at any given time-step we have knowledge of all of the tokens which came before the current time-step and all of the tokens which came after that time-step.

Limitations

One key component that I glanced over previously is that the recurrent layer's weights are shared across time-steps. This provides us with the flexibility to process arbitrary length sequences, but also introduces a unique challenge when training the network.

For a concrete example, suppose you've trained a recurrent neural network as a language model (predict the next word in a sequence). As you're generating text, it might be important to know whether the current word is inside quotation marks. Let's assume this is true and consider the case where our model makes a wrong prediction because it wasn't paying attention to whether or not the current time-step is inside quotation marks. Ideally, you want a way to send back a signal to the earlier time-step where we entered the quotation mark to say "pay attention!" to avoid the same mistake in the future. Doing so requires sending our error signal back through many time-steps. (As an aside, Karpathy has a famous blog post which shows that a character-level RNN language model can indeed pay attention to this detail.)

Image credit

Let's consider what the backpropagation step would look like to send this signal to earlier time-steps.

As a reminder, the backpropagation algorithm states that we can define the relationship between a given layer's weights and the final loss using the following expression:

$$ \frac{{\partial E\left( w \right)}}{{\partial w^{(l)}}} = {\left( {{\delta ^{(l + 1)}}} \right)^T}{a^{(l)}} $$

where ${\delta ^{(l)}}$ (our "error" term) can be calculated as:

$$ {\delta ^{(l)}} = {\delta ^{(l + 1)}}{w ^{(l)}}f'\left( {{a^{(l)}}} \right) $$

This allows to efficiently calculate the gradient for any given layer by reusing the terms already computed at layer $l+1$. However, notice how there's a term for the weight matrix, ${w ^{(l)}}$, included in the computation at every layer. Now recall that I earlier mentioned recurrent layers share weights across time-steps. This means that the same exact value is being mulitplied every time we perform this layer by layer backpropagation through time.

Let's suppose one of the weights in our matrix is 0.5 and we're attempting to send a signal back 10 time-steps. By the time we've backpropagated to $t-10$, we've multiplied the overall gradient expression by $0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 = 0.00098$. This has the effect of drastically reducing the magnitude of our error signal! This phenomenon is known as the "vanishing gradient" problem which makes it very hard to learn using a vanilla recurrent neural network. The same problem can occur when the weight is greater than one, introducing an exploding gradient, although this is slightly easier to manage thanks to a technique known as gradient clipping.

In following posts, we'll look at two common variations of the standard recurrent cell which alleviate this problem of a vanishing gradient.

Scaling nearest neighbors search with approximate methods.

Jeremy Jordan — Mon, 04 Feb 2019 04:36:13 GMT

Jump to:

What is nearest neighbors search?
K-d trees
Quantization
Common datasets
Further reading

What is nearest neighbors search?

In the world of deep learning, we often use neural networks to learn representations of objects as vectors. We can then use these vector representations for a myriad of useful tasks.

To give a concrete example, let's consider the case of a facial recognition system powered by deep learning. For this use case, the objects are images of people's faces and the task is to identify whether or not the person in a submitted photo matches a person in a database of known identities. We'll use a neural network to build vector representations of all of the images; then, performing facial recognition is as simple as taking the vector representation of a submitted image (the query vector) and searching for similar vectors in our database. Here, we define similarity as vectors which are close together in vector-space. (How we actually train a network to produce these vector representations is outside of the scope of this blog post.)

All of these vectors were extracted from a ResNet50 model. Notice how the values in the query vector are quite similar to the vector in the top left of known identities.

The process of finding vectors that are close to our query is known as nearest neighbors search. A naive implementation of nearest neighbors search is to simply calculate the distance between the query vector and every vector in our collection (commonly referred to as the reference set). However, calculating these distances in a brute force manner quickly becomes infeasible as your reference set grows to millions of objects.

Imagine if Facebook had to compare each face in a new photo against all of its users every time it suggested who to tag, this would be computationally infeasible!

A class of methods known as approximate nearest neighbors search offer a solution to our scaling dilemma by partitioning the vector space in a clever way such that we only need to examine a small subset of the overall reference set.

Approximate methods alleviate this computational burden by cleverly partitioning the vectors such that we only need to focus on a small subset of objects.

In this blog post, I'll cover a couple of techniques used for approximate nearest neighbors search. This post will not cover approximate nearest neighbors methods exhaustively, but hopefully you'll be able to understand how people generally approach this problem and how to apply these techniques in your own work.

In general, the approximate nearest neighbor methods can be grouped as:

Tree-based data structures
Neighborhood graphs
Hashing methods
Quantization

K-dimensional trees

The first approximate nearest neighbors method we'll cover is a tree-based approach. K-dimensional trees generalize the concept of a binary search tree into multiple dimensions.

The general procedure for growing a k-dimensional tree is as follows:

pick a random dimension from your k-dimensional vector
find the median of that dimension across all of the vectors in your current collection
split the vectors on the median value
repeat!

A toy 2-dimensional example is visualized below. At the top level, we select a random dimension (out of the two possible dimensions, $x_0$ and $x_1$) and calculate the median. Then, we follow the same procedure of picking a dimension and calculating the median for each path independently. This process is repeated until some stopping criterion is satisfied; each leaf node in the tree contains a subset of vectors from our reference set.

We can view how the two-dimensional vectors are partitioned at each level of the k-d tree in the figure below. Take a minute to verify that this visualization matches what is described in the tree above.

In order to see the usefulness of this tree, let's now consider how we could use this data structure to perform an approximate nearest neighbor query. As we walk down the tree, notice how the highlighted area (the area in vector space that we're interested in) shrinks down to a small subset of the original space. (I'll use the level 4 subplot for this example.)

At the top level, we look at the first dimension of the query vector and ask whether or not its value is greater than or equal to 1. Since 4 is greater than 1, we walk down the "yes" path to the next level down. We can safely ignore any of the nodes that follow the first "no" path.

Now we look at the second dimension of the vector and ask whether its value is greater than or equal to 0. Since -2 is less than 0, we now walk down the "no" path. Notice again how the area of interest in our overall vector-space continues to shrink.

Finally, once we reach the bottom of the tree we are left with a collection of vectors. Thankfully, this is a small subset relative to the overall size of the reference set, so calculating the distance between the query vector and each vector in this subset is computationally feasible.

K-d trees are popular due to their simplicity, however, this technique struggles to perform well when dealing with high dimensional data. Further notice how we only returned vectors which are found in the same cell as the query point. In this example, the query vector happened to fall in the middle of a cell, but you could imagine a scenario where the query vector lies near the edge of a cell and we miss out on vectors which lie just outside of the cell.

Quantization

Another approach to the approximate nearest neighbors problem is to collapse our reference set into a smaller collection of representative vectors. We can find these "representative" vectors by simply running the K-means algorithm on our data. In the literature, this collection of "representative" vectors is commonly referred to as the codebook.

The right figure displays a Voronoi diagram which essentially partitions the space according to the set of points for which a given centroid is closest.

We'll then "map" all of our data onto these centroids. By doing this, we can represent our reference set of a couple hundred vectors with only 7 representative centroids. This greatly reduces the number of distance computations we need to perform (only 7!) when making an nearest neighbors query.

We can then maintain an inverted list to keep track of all of the original objects in relation to which centroid represents the quantized vector.

You can optionally retrieve the full vectors for all of the ids maintained in the inverted list for a given centroid, calculating the true distances between each vector and our query. This is a process known as re-ranking and can improve your query performance.

Similar to before, let's now look at how we can use this method to perform a query. For a given query vector, we'll calculate the distances between the query vector and each centroid in order to find the closest centroid. We can then look up the centroid in our inverted list in order to find all of the nearest vectors.

Unfortunately, in order to get good performance using quantization, you typically need to use very large numbers of centroids for quantization; this impedes on original goal of alleviating the computational burden of calculating too many distances.

Product quantization

Product quantization addresses this problem by first subdividing the original vectors into subcomponents and then quantizing (ie. running K-means on) each subcomponent separately. A single vector is now represented by a collection of centroids, one for each subcomponent.

To illustrate this, I've provided two examples. In the 8D case, you can see how our vector is divided into subcomponents and each subcomponent is represented by some centroid value. However, the 2D example shows us the benefit of this approach. In this case, we can only split our 2D vector into a maximum of two components. We'll then quantize each dimension separately, squashing all of the data onto the horizontal axis and running k-means and then squashing all of the data onto the vertical axis and running k-means again. We find 3 centroids for each subcomponent with a total of 6 centroids. However, the total set of all possible quantized states for the overall vector is the Cartesian product of the subcomponent centroids.

In other words, if we divide our vector into $m$ subcomponents and find $k$ centroids, we can represent $k^m$ possible quantizations using only $km$ vectors! The chart below shows how many centroids are needed in order to get 90% of the top 5 search results correct for an approximate nearest neighbors query. Notice how using product quantization ($m>1$) vastly reduces the number of centroids needed to represent our data. One of the reasons why I love this idea so much is that we've effectively turned the curse of dimensionality into something highly beneficial!

Image credit

Handling multi-modal data

Product quantization alone works great when our data is distributed relatively evenly across the vector-space. However, in reality our data is usually multi-modal. To handle this, a common technique involves first training a coarse quantizer to roughly "slice" up the vector-space, and then we'll run product quantization on each individual coarse cell.

Below, I've visualized the data that falls within a single coarse cell. We'll use product quantization to find a set of centroids which describe this local subset of data, and then repeat for each coarse cell. Commonly, people encode the vector residuals (the difference between the original vector and the closest coarse centroid) since the residuals tend to have smaller magnitudes and thus lead to less lossy compression when running product quantization. In simple terms, we treat each coarse centroid as a local origin and run product quantization on the data with respect to the local origin rather than the global origin.

Pro-tip: If you want to scale to really large datasets you can use product quantization as both the coarse quantizer and the fine-grained quantizer within each coarse cell. See this paper for the details.

Locally optimized product quantization

The ideal goal for quantization is to develop a codebook which is (1) concise and (2) highly representative of our data. More specifically, we'd like all of the vectors in our codebook to represent dense regions of our data in vector-space. A centroid in a low-density area of our data is inefficient at representing data and introduces high distortion error for any vectors which fall in its Voronoi cell.

One potential way we can attempt to avoid these inefficient centroids is to add an alignment step to our product quantization. This allows for our product quantizers to better cover the local data for each coarse Voronoi cell.

We can do this by applying a transformation to our data such that we minimize our quantization distortion error. One simple way to minimize this quantization distortion error is to simply apply PCA in order to mean-center the data and rotate it such that the axes capture most of the variance within the data.

Recall my earlier example where we ran product quantization on a toy 2D dataset. In doing so, we effectively squashed all of the data onto the horizontal axis and ran k-means and then repeated this for the vertical axis. By rotating the data such that the axes capture most of the variance, we can more effectively cover our data when using product quantization.

This technique is known as locally optimized product quantization, since we're manipulating the local data within each coarse Voronoi cell in order to optimize the product quantization performance. The authors who introduced this technique have a great illustrative example of how this technique can better fit a given set of vectors.

This blog post glances over (c) Optimized Product Quantization which is the same idea of aligning our data for better product quantization performance, but the alignment is performed globally instead of aligning local data in each Voronoi cell independently. Image credit

A quick sidenote regarding PCA alignment

The authors who introduced product quantization noted that the technique works best when the vector subcomponents had similar variance. A nice side effect of doing PCA alignment is that during the process we get a matrix of eigenvalues which describe the variance of each principal component. We can use this to our advantage by allocating principal components into buckets of equal variance.

Common datasets

Organizing machine learning projects: project management guidelines.

Jeremy Jordan — Sun, 02 Sep 2018 00:09:00 GMT

The goal of this document is to provide a common framework for approaching machine learning projects that can be referenced by practitioners. If you build ML models, this post is for you. If you collaborate with people who build ML models, I hope that this guide provides you with a good perspective on the common project workflow. Knowledge of machine learning is assumed.

Overview

This overview intends to serve as a project "checklist" for machine learning practitioners. Subsequent sections will provide more detail.

Project lifecycle
Machine learning projects are highly iterative; as you progress through the ML lifecycle, you’ll find yourself iterating on a section until reaching a satisfactory level of performance, then proceeding forward to the next task (which may be circling back to an even earlier step). Moreover, a project isn’t complete after you ship the first version; you get feedback from real-world interactions and redefine the goals for the next iteration of deployment.

Planning and project setup
- Define the task and scope out requirements
- Determine project feasibility
- Discuss general model tradeoffs (accuracy vs speed)
- Set up project codebase
Data collection and labeling
- Define ground truth (create labeling documentation)
- Build data ingestion pipeline
- Validate quality of data
- Label data and ensure ground truth is well-definend
- Revisit Step 1 and ensure data is sufficient for the task
Model exploration
- Establish baselines for model performance
- Start with a simple model using initial data pipeline
- Overfit simple model to training data
- Stay nimble and try many parallel (isolated) ideas during early stages
- Find SoTA model for your problem domain (if available) and reproduce results, then apply to your dataset as a second baseline
- Revisit Step 1 and ensure feasibility
- Revisit Step 2 and ensure data quality is sufficient
Model refinement
- Perform model-specific optimizations (ie. hyperparameter tuning)
- Iteratively debug model as complexity is added
- Perform error analysis to uncover common failure modes
- Revisit Step 2 for targeted data collection and labeling of observed failure modes
Testing and evaluation
- Evaluate model on test distribution; understand differences between train and test set distributions (how is “data in the wild” different than what you trained on)
- Revisit model evaluation metric; ensure that this metric drives desirable downstream user behavior
- Write tests for:
  - Input data pipeline
  - Model inference functionality
  - Model inference performance on validation data
  - Explicit scenarios expected in production (model is evaluated on a curated set of observations)
Model deployment
- Expose model via a REST API
- Deploy new model to small subset of users to ensure everything goes smoothly, then roll out to all users
- Maintain the ability to roll back model to previous versions
- Monitor live data and model prediction distributions
Ongoing model maintenance
- Understand that changes can affect the system in unexpected ways
- Periodically retrain model to prevent model staleness
- If there is a transfer in model ownership, educate the new team

Team roles

A typical team is composed of:

data engineer (builds the data ingestion pipelines)
machine learning engineer (train and iterate models to perform the task)
software engineer (aids with integrating machine learning model with the rest of the product)
project manager (main point of contact with the client)

Planning and project setup

It may be tempting to skip this section and dive right in to "just see what the models can do". Don't skip this section. All too often, you'll end up wasting time by delaying discussions surrounding the project goals and model evaluation criteria. Everyone should be working toward a common goal from the start of the project.

It's worth noting that defining the model task is not always straightforward. There's often many different approaches you can take towards solving a problem and it's not always immediately evident which is optimal. If your problem is vague and the modeling task is not clear, jump over to my post on defining requirements for machine learning projects before proceeding.

Prioritizing projects

Ideal: project has high impact and high feasibility.

Mental models for evaluating project impact:

Look for places where cheap prediction drives large value
Look for complicated rule-based software where we can learn rules instead of programming them

When evaluating projects, it can be useful to have a common language and understanding of the differences between traditional software and machine learning software. Andrej Karparthy's Software 2.0 is recommended reading for this topic.

Software 1.0

Explicit instructions for a computer written by a programmer using a programming language such as Python or C++. A human writes the logic such that when the system is provided with data it will output the desired behavior.

Software 2.0

Implicit instructions by providing data, "written" by an optimization algorithm using parameters of a specified model architecture. The system logic is learned from a provided collection of data examples and their corresponding desired behavior.

See this talk for more detail.

A quick note on Software 1.0 and Software 2.0 - these two paradigms are not mutually exclusive. Software 2.0 is usually used to scale the logic component of traditional software systems by leveraging large amounts of data to enable more complex or nuanced decision logic.

For example, Jeff Dean talks (at 27:15) about how the code for Google Translate used to be a very complicated system consisting of ~500k lines of code. Google was able to simplify this product by leveraging a machine learning model to perform the core logical task of translating text to a different language, requiring only ~500 lines of code to describe the model. However, this model still requires some "Software 1.0" code to process the user's query, invoke the machine learning model, and return the desired information to the user.

In summary, machine learning can drive large value in applications where decision logic is difficult or complicated for humans to write, but relatively easy for machines to learn. On that note, we'll continue to the next section to discuss how to evaluate whether a task is "relatively easy" for machines to learn.

Determining feasibility

Some useful questions to ask when determining the feasibility of a project:

Cost of data acquisition
- How hard is it to acquire data?
- How expensive is data labeling?
- How much data will be needed?
Cost of wrong predictions
- How frequently does the system need to be right to be useful?
- Are there scenarios where a wrong prediction incurs a large cost?
Availability of good published work about similar problems
- Has the problem been reduced to practice?
- Is there sufficient literature on the problem?
- Are there pre-trained models we can leverage?
Computational resources available both for training and inference
- Will the model be deployed in a resource-constrained environment?
- What are the latency requirements for the model?

Specifying project requirements

Establish a single value optimization metric for the project. Can also include several other satisficing metrics (ie. performance thresholds) to evaluate models, but can only optimize a single metric.

Example:

Optimize for accuracy
Prediction latency under 10 ms
Model requires no more than 1gb of memory
90% coverage (model confidence exceeds required threshold to consider a prediction as valid)

The optimization metric may be a weighted sum of many things which we care about. Revisit this metric as performance improves.

Some teams may choose to ignore a certain requirement at the start of the project, with the goal of revising their solution (to meet the ignored requirements) after they have discovered a promising general approach.

Decide at what point you will ship your first model.

Some teams aim for a “neutral” first launch: a first launch that explicitly deprioritizes machine learning gains, to avoid getting distracted. — Google Rules of Machine Learning

The motivation behind this approach is that the first deployment should involve a simple model with focus spent on building the proper machine learning pipeline required for prediction. This allows you to deliver value quickly and avoid the trap of spending too much of your time trying to "squeeze the juice."

Setting up a ML codebase

A well-organized machine learning codebase should modularize data processing, model definition, model training, and experiment management.

Example codebase organization:

configs/
    baseline.yaml
    latest.yaml
data/
docker/ 
project_name/
  api/
    app.py
  models/
    base.py
    simple_baseline.py
    cnn.py
  datasets.py
  train.py
  experiment.py
scripts/

data/ provides a place to store raw and processed data for your project. You can also include a data/README.md file which describes the data for your project.

docker/ is a place to specify one or many Dockerfiles for the project. Docker (and other container solutions) help ensure consistent behavior across multiple machines and deployments.

api/app.py exposes the model through a REST client for predictions. You will likely choose to load the (trained) model from a model registry rather than importing directly from your library.

models/ defines a collection of machine learning models for the task, unified by a common API defined in base.py. These models include code for any necessary data preprocessing and output normalization.

datasets.py manages construction of the dataset. Handles data pipelining/staging areas, shuffling, reading from disk.

experiment.py manages the experiment process of evaluating multiple models/ideas. This constructs the dataset and models for a given experiment.

train.py defines the actual training loop for the model. This code interacts with the optimizer and handles logging during training.

See other examples here, here, here and here.

Data collection and labeling

An ideal machine learning pipeline uses data which labels itself. For example, Tesla Autopilot has a model running that predicts when cars are about to cut into your lane. In order to acquire labeled data in a systematic manner, you can simply observe when a car changes from a neighboring lane into the Tesla's lane and then rewind the video feed to label that a car is about to cut in to the lane.

As another example, suppose Facebook is building a model to predict user engagement when deciding how to order things on the newsfeed. After serving the user content based on a prediction, they can monitor engagement and turn this interaction into a labeled observation without any human effort. However, just be sure to think through this process and ensure that your "self-labeling" system won't get stuck in a feedback loop with itself.

For many other cases, we must manually label data for the task we wish to automate. The quality of your data labels has a large effect on the upper bound of model performance.

Here is a real use case from work for model improvement and the steps taken to get there:

- Baseline: 53%
- Logistic: 58%
- Deep learning: 61%
- **Fixing your data: 77%**

Some good ol' fashion "understanding your data" is worth it's weight in hyperparameter tuning!
— Alex Gude (@alex_gude) April 24, 2019

Most data labeling projects require multiple people, which necessitates labeling documentation. Even if you're the only person labeling the data, it makes sense to document your labeling criteria so that you maintain consistency.

One tricky case is where you decide to change your labeling methodology after already having labeled data. For example, in the Software 2.0 talk mentioned previously, Andrej Karparthy talks about data which has no clear and obvious ground truth.

Image credit

If you run into this, tag "hard-to-label" examples in some manner such that you can easily find all similar examples should you decide to change your labeling methodology down the road. Additionally, you should version your dataset and associate a given model with a dataset version.

Tip: After labeling data and training an initial model, look at the observations with the largest error. These examples are often poorly labeled.

Active learning

Active learning is useful when you have a large amount of unlabeled data and you need to decide what data you should label. Labeling data can be expensive, so we'd like to limit the time spent on this task.

As a counterpoint, if you can afford to label your entire dataset, you probably should. Active learning adds another layer of complexity.

"The main hypothesis in active learning is that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less data for training." - DataCamp

General approach:

Starting with an unlabeled dataset, build a "seed" dataset by acquiring labels for a small subset of instances
Train initial model on the seed dataset
Predict the labels of the remaining unlabeled observations
Use the uncertainty of the model's predictions to prioritize the labeling of remaining observations

Leveraging weak labels
However, tasking humans with generating ground truth labels is expensive. Often times you'll have access to large swaths of unlabeled data and a limited labeling budget - how can you maximize the value from your data? In some cases, your data can have information which provides a noisy estimate of the ground truth. For example, if you're categorizing Instagram photos, you might have access to the hashtags used in the caption of the image. Other times, you might have subject matter experts which can help you develop heuristics about the data.

Snorkel is an interesting project produced by the Stanford DAWN (Data Analytics for What’s Next) lab which formalizes an approach towards combining many noisy label estimates into a probabilistic ground truth. I'd encourage you to check it out and see if you might be able to leverage the approach for your problem.

Model exploration

Establish performance baselines on your problem. Baselines are useful for both establishing a lower bound of expected performance (simple model baseline) and establishing a target performance level (human baseline).

Simple baselines include out-of-the-box scikit-learn models (i.e. logistic regression with default parameters) or even simple heuristics (always predict the majority class). Without these baselines, it's impossible to evaluate the value of added model complexity.
If your problem is well-studied, search the literature to approximate a baseline based on published results for very similar tasks/datasets.
If possible, try to estimate human-level performance on the given task. Don't naively assume that humans will perform the task perfectly, a lot of simple tasks are deceptively hard!

Start simple and gradually ramp up complexity. This typically involves using a simple model, but can also include starting with a simpler version of your task.

Before doing anything intelligent with "AI", do the unintelligent version fast and at scale.
At worst you understand the limits of a simplistic approach and what complexities you need to handle.
At best you realize you don't need the overhead of intelligence.
— Smerity (@Smerity) February 13, 2019

Once a model runs, overfit a single batch of data. Don't use regularization yet, as we want to see if the unconstrained model has sufficient capacity to learn from the data.

Practical Advice for Building Deep Neural Networks (see case study on overfitting an initial model)

Survey the literature. Search for papers on Arxiv describing model architectures for similar problems and speak with other practitioners to see which approaches have been most successful in practice. Determine a state of the art approach and use this as a baseline model (trained on your dataset).

Reproduce a known result. If you're using a model which has been well-studied, ensure that your model's performance on a commonly-used dataset matches what is reported in the literature.

Understand how model performance scales with more data. Plot the model performance as a function of increasing dataset size for the baseline models that you've explored. Observe how each model's performance scales as you increase the amount of data used for training.

Model refinement

Once you have a general idea of successful model architectures and approaches for your problem, you should now spend much more focused effort on squeezing out performance gains from the model.

Build a scalable data pipeline. By this point, you've determined which types of data are necessary for your model and you can now focus on engineering a performant pipeline.

Apply the bias variance decomposition to determine next steps. Break down error into: irreducible error, avoidable bias (difference between train error and irreducible error), variance (difference between validation error and train error), and validation set overfitting (difference between test error and validation error).

If training on a (known) different distribution than what is available at test time, consider having two validation subsets: val-train and val-test. The difference between val-train error and val-test error is described by distribution shift.
Addressing underfitting:
1. Increase model capacity
2. Reduce regularization
3. Error analysis
4. Choose a more advanced architecture (closer to state of art)
5. Tune hyperparameters
6. Add features
Addressing overfitting:
1. Add more training data
2. Add regularization
3. Add data augmentation
4. Error analysis
5. Tune hyperparameters
6. Reduce model size
Addressing distribution shift:
1. Perform error analysis to understand nature of distribution shift
2. Synthesize data (by augmentation) to more closely match the test distribution
3. Apply domain adaptation techniques

Use coarse-to-fine random searches for hyperparameters. Start with a wide hyperparameter space initially and iteratively hone in on the highest-performing region of the hyperparameter space.

Hyperparameter tuning for machine learning models.

Perform targeted collection of data to address current failure modes. Develop a systematic method for analyzing errors of your current model. Categorize these errors, if possible, and collect additional data to better cover these cases.

Debugging ML projects

Why is your model performing poorly?

Implementation bugs
Hyperparameter choices
Data/model fit
Dataset construction

Key mindset for DL troubleshooting: pessimism.

In order to complete machine learning projects efficiently, start simple and gradually increase complexity. Start with a solid foundation and build upon it in an incremental fashion.

Tip: Fix a random seed to ensure your model training is reproducible.

Common bugs:

oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won't make you silently fail, but they are spurious parameters
— Andrej Karpathy (@karpathy) July 1, 2018

Discovering failure modes

Use clustering to uncover failure modes and improve error analysis:

Select all incorrect predictions. (Optionally, sort your observations by their calculated loss to find the most egregious errors.)
Run a clustering algorithm such as DBSCAN across selected observations.
Manually explore the clusters to look for common attributes which make prediction difficult.

Categorize observations with incorrect predictions and determine what best action can be taken in the model refinement stage in order to improve performance on these cases.

Testing and evaluation

If you haven't already written tests for your code yet, you should write them at this point.

Different components of a ML product to test:

Training system processes raw data, runs experiments, manages results, stores weights.
- Required tests:
  - Test the full training pipeline (from raw data to trained model) to ensure that changes haven't been made upstream with respect to how data from our application is stored. These tests should be run nightly/weekly.
Prediction system constructs the network, loads the stored weights, and makes predictions.
- Required tests:
  - Run inference on the validation data (already processed) and ensure model score does not degrade with new model/weights. This should be triggered every code push.
  - You should also have a quick functionality test that runs on a few important examples so that you can quickly (<5 minutes) ensure that you haven't broken functionality during development. These tests are used as a sanity check as you are writing new code.
  - Also consider scenarios that your model might encounter, and develop tests to ensure new models still perform sufficiently. The "test case" is a scenario defined by the human and represented by a curated set of observations.
    - Example: For a self driving car, you might have a test to ensure that the car doesn't turn left at a yellow light. For this case, you may run your model on observations where the car is at a yellow light and ensure that the prediction doesn't tell the car to proceed forward.
Serving system exposed to accept "real world" input and perform inference on production data. This system must be able to scale to demand.
- Required monitoring:
  - Alerts for downtime and errors
  - Check for distribution shift in data

Image credit

Evaluating production readiness

The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction

Data:

Feature expectations are captured in a schema.
All features are beneficial.
No feature’s cost is too much.
Features adhere to meta-level requirements.
The data pipeline has appropriate privacy controls.
New features can be added quickly.
All input feature code is tested.

Model:

Model specs are reviewed and submitted.
Offline and online metrics correlate.
All hyperparameters have been tuned.
The impact of model staleness is known.
A simple model is not better.
Model quality is sufficient on important data slices.
The model is tested for considerations of inclusion.

Infrastructure:

Training is reproducible.
Model specs are unit tested.
The ML pipeline is integration tested.
Model quality is validated before serving.
The model is debuggable.
Models are canaried before serving.
Serving models can be rolled back.

Monitoring:

Dependency changes result in notification.
Data invariants hold for inputs.
Training and serving are not skewed.
Models are not too stale.
Models are numerically stable.
Computing performance has not regressed.
Prediction quality has not regressed.

Model deployment

Be sure to have a versioning system in place for:

Model parameters
Model configuration
Feature pipeline
Training dataset
Validation dataset

A common way to deploy a model is to package the system into a Docker container and expose a REST API for inference.

Canarying: Serve new model to a small subset of users (ie. 5%) while still serving the existing model to the remainder. Check to make sure rollout is smooth, then deploy new model to rest of users.

Shadow mode: Ship a new model alongside the existing model, still using the existing model for predictions but storing the output for both models. Measuring the delta between the new and current model's predictions will give an indication for how drastically things will change when you switch to the new model.

Ongoing model maintenance

Hidden Technical Debt in Machine Learning Systems (quoted below, emphasis mine)

A primer on concept of technical debt:

As with fiscal debt, there are often sound strategic reasons to take on technical debt. Not all debt is bad, but all debt needs to be serviced. Technical debt may be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation. The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability. Deferring such payments results in compounding costs. Hidden debt is dangerous because it compounds silently.

Machine learning projects are not complete upon shipping the first version. If you are "handing off" a project and transferring model responsibility, it is extremely important to talk through the required model maintenance with the new team.

Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.

Maintenance principles

CACE principle: Changing Anything Changes Everything
Machine learning systems are tightly coupled. Changes to the feature space, hyper parameters, learning rate, or any other "knob" can affect model performance.

Specific mitigation strategies:

Create model validation tests which are run every time new code is pushed.
Decompose problems into isolated components where it makes sense to do so.

Undeclared consumers of your model may be inadvertently affected by your changes.

"Without access controls, it is possible for some of these consumers to be undeclared consumers, consuming the output of a given prediction model as an input to another component of the system."

If your model and/or its predictions are widely accessible, other components within your system may grow to depend on your model without your knowledge. Changes to the model (such as periodic retraining or redefining the output) may negatively affect those downstream components.

Specific mitigation strategies:

Control access to your model by making outside components request permission and signal their usage of your model.

Avoid depending on input signals which may change over time.
Some features are obtained by a table lookup (ie. word embeddings) or simply an input pipeline which is outside the scope of your codebase. When these external feature representations are changed, the model's performance can suffer.

Specific mitigation strategies:

Create a versioned copy of your input signals to provide stability against changes in external input pipelines. These versioned inputs can be specified in a model's configuration file.

Eliminate unnecessary features.
Regularly evaluate the effect of removing individual features from a given model. A model's feature space should only contain relevant and important features for the given task.

There are many strategies to determine feature importances, such as leave-one-out cross validation and feature permutation tests. Unimportant features add noise to your feature space and should be removed.

Tip: Document deprecated features (deemed unimportant) so that they aren't accidentally reintroduced later.

Model performance will likely decline over time.
As the input distribution shifts, the model's performance will suffer. You should plan to periodically retrain your model such that it has always learned from recent "real world" data.

This guide draws inspiration from the Full Stack Deep Learning Bootcamp, best practices released by Google, my personal experience, and conversations with fellow practitioners.

Find something that's missing from this guide? Let me know!

External Resources

Blog posts

A Recipe for Training Neural Networks
How to put machine learning models into production
Designing collaborative AI (clever product design can reduce model performance requirements)
Data project checklist - Jeremy Howard
Checklist for debugging neural networks
Troubleshooting Deep Neural Networks
Nitpicking Machine Learning Technical Debt
Accelerate Machine Learning with Active Learning

Papers

Case studies

Talks

Leading Data Science Teams: A Framework To Help Guide Data Science Project Managers - Jeffrey Saltz
An Only One Step Ahead Guide for Machine Learning Projects - Chang Lee
- An entertaining talk discussing advice for approaching machine learning projects. This talk will give you a "flavor" for the details covered in this guide.
Microsoft Research: Active Learning and Annotation

An overview of object detection: one-stage methods.

Jeremy Jordan — Wed, 11 Jul 2018 14:06:15 GMT

In this post, I'll discuss an overview of deep learning techniques for object detection using convolutional neural networks. Object detection is useful for understanding what's in an image, describing both what is in an image and where those objects are found.

In general, there's two different approaches for this task – we can either make a fixed number of predictions on grid (one stage) or leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).

In this blog post, I'll discuss the one-stage approach towards object detection; a follow-up post will then discuss the two-stage approach. Each approach has its own strengths and weaknesses, which I'll discuss in the respective blog posts.

Jump to:

Understanding the task
Direct object prediction
- Predictions on a grid
- Non-maximum suppression
YOLO: You Only Look Once
SSD: Single Shot Detection
Addressing object imbalance with focal loss
Common datasets and competitions
Further reading

Understanding the task

The goal of object detection is to recognize instances of a predefined set of object classes (e.g. {people, cars, bikes, animals}) and describe the locations of each detected object in the image using a bounding box. Two examples are shown below.

Example images are taken from the PASCAL VOC dataset.

We'll use rectangles to describe the locations of each object, which may lead to imperfect localizations due to the shapes of objects. An alternative approach would be image segmentation which provides localization at the pixel-level.

Direct object prediction

This blog post will focus on model architectures which directly predict object bounding boxes for an image in a one-stage fashion. In other words, there is no intermediate task (as we'll discuss later with region proposals) which must be performed in order to produce an output. This leads to a simpler and faster model architecture, although it can sometimes struggle to be flexible enough to adapt to arbitrary tasks (such as mask prediction).

Predictions on a grid

In order to understand what's in an image, we'll feed our input through a standard convolutional network to build a rich feature representation of the original image. We'll refer to this part of the architecture as the "backbone" network, which is usually pre-trained as an image classifier to more cheaply learn how to extract features from an image. This is a result of the fact that data for image classification is easier (and thus cheaper) to label as it only requires a single label as opposed to defining bounding box annotations for each image. Thus, we can train on a very large labeled dataset (such as ImageNet) in order to learn good feature representations.

After pre-training the backbone architecture as an image classifier, we'll remove the last few layers of the network so that our backbone network outputs a collection of stacked feature maps which describe the original image in a low spatial resolution albeit a high feature (channel) resolution. In the example below, we have a 7x7x512 representation of our observation. Each of the 512 feature maps describe different characteristics of the original image.

We can relate this 7x7 grid back to the original input in order to understand what each grid cell represents relative to the original image.

We can also determine roughly where objects are located in the coarse (7x7) feature maps by observing which grid cell contains the center of our bounding box annotation. We'll assign this grid cell as being "responsible" for detecting that specific object.

In order to detect this object, we will add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps in order to produce an activation corresponding with the grid cell which contains our object.

If the input image contains multiple objects, we should have multiple activations on our grid denoting that an object is in each of the activated regions.

However, we cannot sufficiently describe each object with a single activation. In order to fully describe a detected object, we'll need to define:

The likelihood that a grid cell contains an object ($p_{obj}$)
Which class the object belongs to ($c_1$, $c_2$, ..., $c_C$)
Four bounding box descriptors to describe the $x$ coordinate, $y$ coordinate, width, and height of a labeled box ($t_x$, $t_y$, $t_w$, $t_h$)

Thus, we'll need to learn a convolution filter for each of the above attributes such that we produce $5 + C$ output channels to describe a single bounding box at each grid cell location. This means that we'll learn a set of weights to look across all 512 feature maps and determine which grid cells are likely to contain an object, what classes are likely to be present in each grid cell, and how to describe the bounding box for possible objects in each grid cell.

The full output of applying $5 + C$ convolutional filters is shown below for clarity, producing one bounding box descriptor for each grid cell.

However, some images might have multiple objects which "belong" to the same grid cell. We can alter our layer to produce $B(5 + C)$ filters such that we can predict $B$ bounding boxes for each grid cell location.

Visualizing the full convolutional output of our $B(5 + C)$ filters, we can see that our model will always produce a fixed number of $N \times N \times B$ predictions for a given image. We can then filter our predictions to only consider bounding boxes which has a $p_{obj}$ above some defined threshold.

Because of the convolutional nature of our detection process, multiple objects can be detected in parallel. However, we also end up predicting for a large number grid cells where no object is found. Although we can filter these bounding boxes out by their $p_{obj}$ score, this introduces quite a large imbalance between the predicted bounding boxes which contain an object and those which do not contain an object.

The two models I'll discuss below both use this concept of "predictions on a grid" to detect a fixed number of possible objects within an image. In the respective sections, I'll describe the nuances of each approach and fill in some of the details that I've glanced over in this section so that you can actually implement each model.

Non-maximum suppression

The "predictions on a grid" approach produces a fixed number of bounding box predictions for each image. However, we would like to filter these predictions in order to only output bounding boxes for objects that are actually likely to be in the image. Moreover, we want a single bounding box prediction for each object detected.

We can filter out most of the bounding box predictions by only considering predictions with a $p_{obj}$ above some defined confidence threshold. However, we still may be left with multiple high-confidence predictions describing the same object. Thus, we need a method for removing redundant object predictions such that each object is described by a single bounding box.

To accomplish this, we'll use a technique known as non-max suppression. At a high level, this technique will look at highly overlapping bounding boxes and suppress (or discard) all of the predictions except the highest confidence prediction.

We'll perform non-max suppression on each class separately. Again, the goal here is to remove redundant predictions so we shouldn't be concerned if we have two predictions that overlap if one box is describing a person and the other box is describing a bicycle. However, if two bounding boxes with high overlap are both describing a person, it's likely that these predictions are describing the same person.

YOLO: You Only Look Once

The YOLO model was first published (by Joseph Redmon et al.) in 2015 and subsequently revised in two following papers. In each section, I'll discuss the specific implementation details and refinements that were made to improve performance.

Backbone network

The original YOLO network uses a modified GoogLeNet as the backbone network. Redmond later created a new model named DarkNet-19 which follows the general design of a $3 \times 3$ filters, doubling the number of channels at each pooling step; $1 \times 1$ filters are also used to periodically compress the feature representation throughout the network. His latest paper introduces a new, larger model named DarkNet-53 which offers improved performance over its predecessor.

All of these models were first pre-trained as image classifiers before being adapted for the detection task. In the second iteration of the YOLO model, Redmond discovered that using higher resolution images at the end of classification pre-training improved the detection performance and thus adopted this practice.

Adapting the classification network for detection simply consists of removing the last few layers of the network and adding a convolutional layer with $B(5 + C)$ filters to produce the $N \times N \times B$ bounding box predictions.

Bounding boxes (and concept of anchor boxes)

The first iteration of the YOLO model directly predicts all four values which describe a bounding box. The $x$ and $y$ coordinates of each bounding box are defined relative to the top left corner of each grid cell and normalized by the cell dimensions such that the coordinate values are bounded between 0 and 1. We define the boxes width and height such that our model predicts the square-root width and height; by defining the width and height of the boxes as a square-root value, differences between large numbers are less significant than differences between small numbers (confirm this visually by looking at a plot of $y = \sqrt {x}$). Redmond chose this formulation because “small deviations in large boxes matter less than in small boxes" and thus when calculating our loss function we would like the emphasis to be placed on getting small boxes more exact. The bounding box width and height are normalized by the image width and height and thus are also bounded between 0 and 1. An L2 loss is applied during training.

This formulation was later revised to introduce the concept of a bounding box prior. Rather than expecting the model to directly produce unique bounding box descriptors for each new image, we will define a collection of bounding boxes with varying aspect ratios which embed some prior information about the shape of objects we're expecting to detect. Redmond offers an approach towards discovering the best aspect ratios by doing k-means clustering (with a custom distance metric) on all of the bounding boxes in your training dataset.

In the image below, you can see a collection of 5 bounding box priors (also known as anchor boxes) for the grid cell highlighted in yellow. With this formulation, each of the $B$ bounding boxes explicitly specialize in detecting objects of a specific size and aspect ratio.

Note: Although it is not visualized, these anchor boxes are present for each cell in our prediction grid.

Rather than directly predicting the bounding box dimensions, we'll reformulate our task in order to simply predict the offset from our bounding box prior dimensions such that we can fine-tune our predicted bounding box dimensions. This reformulation makes the prediction task easier to learn.

For similar reasons as originally predicting the square-root width and height, we'll define our task to predict the log offsets from our bounding box prior.

Objectness (and assigning labeled objects to a bounding box)

In the first version of the model, the "objectness" score $p_{obj}$ was trained to approximate the Intersection over Union (IoU) between the predicted box and the ground truth label. When we calculate our loss during training, we'll match objects to whichever bounding box prediction (on the same grid cell) has the highest IoU score. For unmatched boxes, the only descriptor which we'll include in our loss function is $p_{obj}$.

After the addition bounding box priors in YOLOv2, we can simply assign labeled objects to whichever anchor box (on the same grid cell) has the highest IoU score with the labeled object.

In the third version, Redmond redefined the "objectness" target score $p_{obj}$ to be 1 for the bounding boxes with highest IoU score for each given target, and 0 for all remaining boxes. However, we will not include bounding boxes which have a high IoU score (above some threshold) but not the highest score when calculating the loss. In simple terms, it doesn't make sense to punish a good prediction just because it isn't the best prediction.

Class labels

Originally, class prediction was performed at the grid cell level. This means that a single grid cell could not predict multiple bounding boxes of different classes. This was later revised to predict class for each bounding box using a softmax activation across classes and a cross entropy loss.

Redmond later changed the class prediction to use sigmoid activations for multi-label classification as he found a softmax is not necessary for good performance. This choice will depend on your dataset and whether or not your labels overlap (eg. "golden retriever" and "dog").

Output layer

The first YOLO model simply predicts the $N \times N \times B$ bounding boxes using the output of our backbone network.

In YOLOv2, Redmond adds a weird skip connection splitting a higher resolution feature map across multiple channels as visualized below.

The weird "skip connection from higher resolution feature maps" idea that I don't like.

Fortunately, this was changed in the third iteration for a more standard feature pyramid network output structure. With this method, we'll alternate between outputting a prediction and upsampling the feature maps (with skip connections). This allows for predictions that can take advantage of finer-grained information from earlier in the network, which helps for detecting small objects in the image.

Image credit

SSD: Single Shot Detection

The SSD model was also published (by Wei Liu et al.) in 2015, shortly after the YOLO model, and was also later refined in a subsequent paper. In each section, I'll discuss the specific implementation details for this model.

Backbone network

A VGG-16 model, pre-trained on ImageNet for image classification, is used as the backbone network. The authors make a few slight tweaks when adapting the model for the detection task, including: replacing fully connected layers with convolutional implementations, removing dropout layers, and replacing the last max pooling layer with a dilated convolution.

Bounding boxes (and concept of anchor boxes)

Rather than using k-means clustering to discover aspect ratios, the SSD model manually defines a collection of aspect ratios (eg. {1, 2, 3, 1/2, 1/3}) to use for the $B$ bounding boxes at each grid cell location.

For each bounding box, we'll predict the offsets from the anchor box for both the bounding box coordinates ($x$ and $y$) and dimensions (width and height). We'll use ReLU activations trained with a Smooth L1 loss.

Objectness (and assigning labeled objects to a bounding box)

One major distinction between YOLO and SSD is that SSD does not attempt to predict a value for $p_{obj}$. Whereas the YOLO model predicted the probability of an object and then predicted the probability of each class given that there was an object present, the SSD model attempts to directly predict the probability that a class is present in a given bounding box.

When calculating the loss, we'll match each ground truth box to the anchor box with the highest IoU — defining this box with being "responsible" for making the prediction. However, we'll also match the ground truth boxes with any other anchor boxes with an IoU above some defined threshold (0.5) in the same light of not punishing good predictions simply because they weren't the best. We can always rely on non-max suppression at inference time to filter out redundant predictions.

Class labels

As I mentioned previously, the class predictions for SSD bounding boxes are not conditioned on the fact that an object is present. Thus, we directly predict the probability of each class using a softmax activation and cross entropy loss. Because we don't explicitly predict $p_{obj}$, it's important to have a class for "background" so that we can predict when no object is present.

Due to the fact that most of the boxes will belong to the "background" class, we will use a technique known as "hard negative mining" to sample negative (no object) predictions such that there is at most a 3:1 ratio between negative and positive predictions when calculating our loss.

Output layer

To allow for predictions at multiple scales, the SSD output module progressively downsamples the convolutional feature maps, intermittently producing bounding box predictions (as shown with the arrows from convolutional layers to the predictions box).

Addressing object imbalance with focal loss

As I mentioned earlier, we often end up with a large amount of bounding boxes in which no object is contained due to the nature of our "predictions on a grid" approach. Although we can easily filter these boxes out after making a fixed set of bounding box predictions, there is still a (foreground-background) class imbalance present which can introduce difficulties during training. This is especially difficult for models which don't separate prediction of objectness and class probability into two separate tasks, and instead simply include a "background" class for regions with no objects.

Researchers at Facebook proposed adding a scaling factor to the standard cross entropy loss such that it places more the emphasis on "hard" examples during training, preventing easy negative predictions from dominating the training process.

Image credit

As the researchers point out, easily classified examples can incur a non-trivial loss for standard cross entropy loss ($\gamma=0$) which, summed over a large collection of samples, can easily dominate the parameter update. The ${\left( {1 - {p_t}} \right)^\gamma }$ term acts as a tunable scaling factor to prevent this from occuring.

As the paper points out, "with $\gamma=2$, an example classified with $p_t = 0.9$ would have 100X lower loss compared with CE and with $p_t = 0.968$ it would have 1000X lower loss."

Common datasets and competitions

Below I've listed some common datasets that researchers use when evaluating new object detection models.

Evaluating image segmentation models.

Jeremy Jordan — Wed, 30 May 2018 20:08:46 GMT

When evaluating a standard machine learning model, we usually classify our predictions into four categories: true positives, false positives, true negatives, and false negatives. However, for the dense prediction task of image segmentation, it's not immediately clear what counts as a "true positive" and, more generally, how we can evaluate our predictions. In this post, I'll discuss common methods for evaluating both semantic and instance segmentation techniques.

Semantic segmentation

Recall that the task of semantic segmentation is simply to predict the class of each pixel in an image.

Image credit

Our prediction output shape matches the input's spatial resolution (width and height) with a channel depth equivalent to the number of possible classes to be predicted. Each channel consists of a binary mask which labels areas where a specific class is present.

Intersection over Union

The Intersection over Union (IoU) metric, also referred to as the Jaccard index, is essentially a method to quantify the percent overlap between the target mask and our prediction output. This metric is closely related to the Dice coefficient which is often used as a loss function during training.

Quite simply, the IoU metric measures the number of pixels common between the target and prediction masks divided by the total number of pixels present across both masks.

$$ IoU = \frac{{target \cap prediction}}{{target \cup prediction}} $$

As a visual example, let's suppose we're tasked with calculating the IoU score of the following prediction, given the ground truth labeled mask.

The intersection ($A \cap B$) is comprised of the pixels found in both the prediction mask and the ground truth mask, whereas the union ($A \cup B$) is simply comprised of all pixels found in either the prediction or target mask.

We can calculate this easily using Numpy.

intersection = np.logical_and(target, prediction)
union = np.logical_or(target, prediction)
iou_score = np.sum(intersection) / np.sum(union)

The IoU score is calculated for each class separately and then averaged over all classes to provide a global, mean IoU score of our semantic segmentation prediction.

Pixel Accuracy

An alternative metric to evaluate a semantic segmentation is to simply report the percent of pixels in the image which were correctly classified. The pixel accuracy is commonly reported for each class separately as well as globally across all classes.

When considering the per-class pixel accuracy we're essentially evaluating a binary mask; a true positive represents a pixel that is correctly predicted to belong to the given class (according to the target mask) whereas a true negative represents a pixel that is correctly identified as not belonging to the given class.

$$ accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}} $$

This metric can sometimes provide misleading results when the class representation is small within the image, as the measure will be biased in mainly reporting how well you identify negative case (ie. where the class is not present).

Instance segmentation

Instance segmentation models are a little more complicated to evaluate; whereas semantic segmentation models output a single segmentation mask, instance segmentation models produce a collection of local segmentation masks describing each object detected in the image. As such, evaluation methods for instance segmentation are quite similar to that of object detection, with the exception that we now calculate IoU of masks instead of bounding boxes.

Image credit

Calculating Precision

To evaluate our collection of predicted masks, we'll compare each of our predicted masks with each of the available target masks for a given input.

A true positive is observed when a prediction-target mask pair has an IoU score which exceeds some predefined threshold.
A false positive indicates a predicted object mask had no associated ground truth object mask.
A false negative indicates a ground truth object mask had no associated predicted object mask.

Precision effectively describes the purity of our positive detections relative to the ground truth. Of all of the objects that we predicted in a given image, how many of those objects actually had a matching ground truth annotation?

$$ Precision = \frac{TP}{TP + FP} $$

Recall effectively describes the completeness of our positive predictions relative to the ground truth. Of all of the objected annotated in our ground truth, how many did we capture as positive predictions?

$$ Recall = \frac{TP}{TP + FN} $$

However, in order to calculate the prediction and recall of a model output, we'll need to define what constitutes a positive detection. To do this, we'll calculate the IoU score between each (prediction, target) mask pair and then determine which mask pairs have an IoU score exceeding a defined threshold value.

However, computing a single precision and recall score at the specified IoU threshold does not adequately describe the behavior of our model's full precision-recall curve. Instead, we can use average precision to effectively integrate the area under a precision-recall curve.

Let's use the precision-recall curve below as an example.

First, we'll adjust our curve such that the precision at a given point $r$ is adjusted to the maximum precision for recall greater than $r$.

Then, we'll simply calculate the area under the curve by numerical integration. This method replaces an older approach of averaging over a range of recall values.

Note that the precision-recall curve will likely not extend out to perfect recall due to our prediction thresholding according to each mask IoU.

As an example, the Microsoft COCO challenge's primary metric for the detection task evaluates the average precision score using IoU thresholds ranging from 0.5 to 0.95 (in 0.05 increments).

For prediction problems with multiple classes of objects, this value is then averaged over all of the classes.

An overview of semantic image segmentation.

Jeremy Jordan — Tue, 22 May 2018 03:11:27 GMT

In this post, I'll discuss how to use convolutional neural networks for the task of semantic image segmentation. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.

"What's in this image, and where in the image is it located?"

Jump to:

Representing the task
Constructing an architecture
Defining a loss function
Common datasets and segmentation competitions
Further reading

More specifically, the goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.

An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image. (Source)

One important thing to note is that we're not separating instances of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as instance segmentation models, which do distinguish between separate objects of the same class.

Segmentation models are useful for a variety of tasks, including:

Autonomous vehicles
We need to equip cars with the necessary perception to understand their environment so that self-driving cars can safely integrate into our existing roads.

A real-time segmented road scene for autonomous driving. (Source)

Medical image diagnostics
Machines can augment analysis performed by radiologists, greatly reducing the time required to run diagnositic tests.

A chest x-ray with the heart (red), lungs (green), and clavicles (blue) are segmented. (Source)

Representing the task

Simply, our goal is to take either a RGB color image ($height \times width \times 3$) or a grayscale image ($height \times width \times 1$) and output a segmentation map where each pixel contains a class label represented as an integer ($height \times width \times 1$).

Note: For visual clarity, I've labeled a low-resolution prediction map. In reality, the segmentation label resolution should match the original input's resolution.

Similar to how we treat standard categorical values, we'll create our target by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.

A prediction can be collapsed into a segmentation map (as shown in the first image) by taking the argmax of each depth-wise pixel vector.

We can easily inspect a target by overlaying it onto the observation.

When we overlay a single channel of our target (or prediction), we refer to this as a mask which illuminates the regions of an image where a specific class is present.

Constructing an architecture

A naive approach towards constructing a neural network architecture for this task is to simply stack a number of convolutional layers (with same padding to preserve dimensions) and output a final segmentation map. This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.

Image credit

Recall that for deep convolutional networks, earlier layers tend to learn low-level concepts while later layers develop more high-level (and specialized) feature mappings. In order to maintain expressiveness, we typically need to increase the number of feature maps (channels) as we get deeper in the network.

This didn't necessarily pose a problem for the task of image classification, because for that task we only care about what the image contains (and not where it is located). Thus, we could alleviate computational burden by periodically downsampling our feature maps through pooling or strided convolutions (ie. compressing the spatial resolution) without concern. However, for image segmentation, we would like our model to produce a full-resolution semantic prediction.

One popular approach for image segmentation models is to follow an encoder/decoder structure where we downsample the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and the upsample the feature representations into a full-resolution segmentation map.

Image credit

Methods for upsampling

There are a few different approaches that we can use to upsample the resolution of a feature map. Whereas pooling operations downsample the resolution by summarizing a local area with a single value (ie. average or max pooling), "unpooling" operations upsample the resolution by distributing a single value into a higher resolution.

Image credit

However, transpose convolutions are by far the most popular approach as they allow for us to develop a learned upsampling.

Image credit

Whereas a typical convolution operation will take the dot product of the values currently in the filter's view and produce a single value for the corresponding output position, a transpose convolution essentially does the opposite. For a transpose convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.

A simplified 1D example of upsampling through a transpose operation. (Source)

For filter sizes which produce an overlap in the output feature map (eg. 3x3 filter with stride 2 - as shown in the below example), the overlapping values are simply added together. Unfortunately, this tends to produce a checkerboard artifact in the output and is undesirable, so it's best to ensure that your filter size does not produce an overlap.

Input in blue, output in green. (Source)

Fully convolutional networks

The approach of using a "fully convolutional" network trained end-to-end, pixels-to-pixels for the task of image segmentation was introduced by Long et al. in late 2014. The paper's authors propose adapting existing, well-studied image classification networks (eg. AlexNet) to serve as the encoder module of the network, appending a decoder module with transpose convolutional layers to upsample the coarse feature maps into a full-resolution segmentation map.

Image credit (with modification)

The full network, as shown below, is trained according to a pixel-wise cross entropy loss.

Image credit

However, because the encoder module reduces the resolution of the input by a factor of 32, the decoder module struggles to produce fine-grained segmentations (as shown below).

The paper's authors comment eloquently on this struggle:

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where... Combining fine layers and coarse layers lets the model make local predictions that respect global structure. ― Long et al.

Adding skip connections

The authors address this tension by slowly upsampling (in stages) the encoded representation, adding "skip connections" from earlier layers, and summing these two feature maps.

Image credit (with modification)

These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grain detail with the addition of these skip connections.

Ronneberger et al. improve upon the "fully convolutional" architecture primarily through expanding the capacity of the decoder module of the network. More concretely, they propose the U-Net architecture which "consists of a contracting path to capture context and a symmetric expanding path that enables precise localization." This simpler architecture has grown to be very popular and has been adapted for a variety of segmentation problems.

Image credit

Note: The original architecture introduces a decrease in resolution due to the use of valid padding. However, some practitioners opt to use same padding where the padding values are obtained by image reflection at the border.

Whereas Long et al. (FCN paper) reported that data augmentation ("randomly mirroring and “jittering” the images by translating them up to 32 pixels") did not result in a noticeable improvement in performance, Ronneberger et al. (U-Net paper) credit data augmentations ("random elastic deformations of the training samples") as a key concept for learning. It appears as if the usefulness (and type) of data augmentation depends on the problem domain.

Advanced U-Net variants

The standard U-Net model consists of a series of convolution operations for each "block" in the architecture. As I discussed in my post on common convolutional network architectures, there exist a number of more advanced "blocks" that can be substituted in for stacked convolutional layers.

Drozdzal et al. swap out the basic stacked convolution blocks in favor of residual blocks. This residual block introduces short skip connections (within the block) alongside the existing long skip connections (between the corresponding feature maps of encoder and decoder modules) found in the standard U-Net structure. They report that the short skip connections allow for faster convergence when training and allow for deeper models to be trained.

Expanding on this, Jegou et al. proposed the use of dense blocks, still following a U-Net structure, arguing that the "characteristics of DenseNets make them a very good fit for semantic segmentation as they naturally induce skip connections and multi-scale supervision." These dense blocks are useful as they carry low level features from previous layers directly alongside higher level features from more recent layers, allowing for highly efficient feature reuse.

Image credit (with modification)

One very important aspect of this architecture is the fact that the upsampling path does not have a skip connection between the input and output of a dense block. The authors note that because the "upsampling path increases the feature maps spatial resolution, the linear growth in the number of features would be too memory demanding." Thus, only the output of a dense block is passed along in the decoder module.

The FC-DenseNet103 model acheives state of the art results (Oct 2017) on the CamVid dataset.

Dilated/atrous convolutions

One benefit of downsampling a feature map is that it broadens the receptive field (with respect to the input) for the following filter, given a constant filter size. Recall that this approach is more desirable than increasing the filter size due to the parameter inefficiency of large filters (discussed here in Section 3.1). However, this broader context comes at the cost of reduced spatial resolution.

Dilated convolutions provide alternative approach towards gaining a wide field of view while preserving the full spatial dimension. As shown in the figure below, the values used for a dilated convolution are spaced apart according to some specified dilation rate.

Image credit

Some architectures swap out the last few pooling layers for dilated convolutions with successively higher dilation rates to maintain the same field of view while preventing loss of spatial detail. However, it is often still too computationally expensive to completely replace pooling layers with dilated convolutions.

Defining a loss function

The most commonly used loss function for the task of image segmentation is a pixel-wise cross entropy loss. This loss examines each pixel individually, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.

Because the cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we're essentially asserting equal learning to each pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. Long et al. (FCN paper) discuss weighting this loss for each output channel in order to counteract a class imbalance present in the dataset.

Meanwhile, Ronneberger et al. (U-Net paper) discuss a loss weighting scheme for each pixel such that there is a higher weight at the border of segmented objects. This loss weighting scheme helped their U-Net model segment cells in biomedical images in a discontinuous fashion such that individual cells may be easily identified within the binary segmentation map.

Notice how the binary segmentation map produces clear borders around the cells. (Source)

Another popular loss function for image segmentation tasks is based on the Dice coefficient, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:

$$ Dice = \frac{{2\left| {A \cap B} \right|}}{{\left| A \right| + \left| B \right|}} $$

where ${\left| {A \cap B} \right|}$ represents the common elements between sets A and B, and $\left| A \right|$ represents the number of elements in set A (and likewise for set B).

For the case of evaluating a Dice coefficient on predicted segmentation masks, we can approximate ${\left| {A \cap B} \right|}$ as the element-wise multiplication between the prediction and target mask, and then sum the resulting matrix.

Because our target mask is binary, we effectively zero-out any pixels from our prediction which are not "activated" in the target mask. For the remaining pixels, we are essentially penalizing low-confidence predictions; a higher value for this expression, which is in the numerator, leads to a better Dice coefficient.

In order to quantify $\left| A \right|$ and $\left| B \right|$, some researchers use the simple sum whereas other researchers prefer to use the squared sum for this calculation. I don't have the practical experience to know which performs better empirically over a wide range of tasks, so I'll leave you to try them both and see which works better.

In case you were wondering, there's a 2 in the numerator in calculating the Dice coefficient because our denominator "double counts" the common elements between the two sets. In order to formulate a loss function which can be minimized, we'll simply use $1 - Dice$. This loss function is known as the soft Dice loss because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask.

With respect to the neural network output, the numerator is concerned with the common activations between our prediction and target mask, where as the denominator is concerned with the quantity of activations in each mask separately. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.

A soft Dice loss is calculated for each class separately and then averaged to yield a final score. An example implementation is provided below.

Common datasets and segmentation competitions

Below, I've listed a number of common datasets that researchers use to train new models and benchmark against the state of the art. You can also explore previous Kaggle competitions and read about how winning solutions implemented segmentation models for their given task.

Datasets

Past Kaggle Competitions

2018 Data Science Bowl
- Read about the first place solution.
Carvana Image Masking Challenge
- Read about the first place solution.
Dstl Satellite Imagery Feature Detection
- Read about the third place solution.