This blog post will provide an introduction to Kubernetes so that you can understand the motivation behind the tool, what it is, and how you can use it. In a follow-up post, I'll discuss how we can leverage Kubernetes to power data science workloads using more concrete (data science) examples.
]]>This blog post will provide an introduction to Kubernetes so that you can understand the motivation behind the tool, what it is, and how you can use it. In a follow-up post, I'll discuss how we can leverage Kubernetes to power data science workloads using more concrete (data science) examples. However, it helps to first build an understanding of the fundamentals - which is the focus of this post.
Prerequisites: I'm going to make the assumption that you're familiar with container technologies such as Docker. If you don't have experience building and running container images, I suggest starting here before you continue reading this post.
Here's what we'll discuss in this post. You can click on any top-level heading to jump directly to that section.
Kubernetes is often described as a container orchestration platform. In order to understand what exactly that means, it helps to revisit the purpose of containers, what's missing, and how Kubernetes fills that gap.
Note: You will also see Kubernetes referred to by its numeronym, k8s. It means the same thing, just easier to type.
Why do we love containers? Containers provide a lightweight mechanism for isolating an application's environment. For a given application, we can specify the system configuration and libraries we want installed without worrying about creating conflicts with other applications that might be running on the same physical machine. We encapsulate each application as a container image which can be executed reliably on any machine* (as long as it has the ability to run container images), providing us the portability to enable smooth transitions from development to deployment. Additionally, because each application is self-contained without the concern of environment conflicts, it's easier to place multiple workloads on the same physical machine and achieve higher resource (memory and CPU) utilization - ultimately lowering costs.
What's missing? However, what happens if your container dies? Or even worse, what happens if the machine running your container fails? Containers do not provide a solution for fault tolerance. Or what if you have multiple containers that need the ability to communicate, how do you enable networking between containers? How does this change as you spin up and down individual containers? Container networking can easily become an entangled mess. Lastly, suppose your production environment consists of multiple machines - how do you decide which machine to use to run your container?
Kubernetes as a container orchestration platform. We can address many of the concerns mentioned above using a container orchestration platform.
The director of an orchestra holds the vision for a musical performance and communicates with the musicians in order to coordinate their individual instrumental contributions to achieve this overall vision. As the architect of a system, your job is simply to compose the music (specify the containers to be run) and then hand over control to the orchestra director (container orchestration platform) to achieve that vision.
A container orchestration platform manages the entire lifecycle of individual containers, spinning up and shutting down resources as needed. If a container shuts down unexpectedly, the orchestration platform will react by launching another container in its place.
On top of this, the orchestration platform provides a mechanism for applications to communicate with each other even as underlying individual containers are created and destroyed.
Lastly, given (1) a set of container workloads to run and (2) a set of machines on a cluster, the container orchestrator examines each container and determines the optimal machine to schedule that workload. To understand why this can be valuable, watch Kelsey Hightower explain (17:47-20:55) the difference between automated deployments and container orchestration using an example game of Tetris.
Now that we understand the motivation for container orchestration in general, let's spend some time to discuss the motivating design principles behind Kubernetes. It helps to understand these principles so that you can use the tool as it was intended to be used.
Perhaps the most important design principle in Kubernetes is that we simply define the desired state of our system and let Kubernetes automation work to ensure that the actual state of the system reflects these desires. This absolves you of the responsibility of fixing most things when they break; you simply need to state what your system should look like in an ideal state. Kubernetes will detect when the actual state of the system doesn't meet these expectations and it will intervene on your behalf to fix the problem. This enables our systems to be self-healing and react to problems without the need for human intervention.
The "state" of your system is defined by a collection of objects. Each Kubernetes object has (1) a specification in which you provide the desired state and (2) a status which reflects the current state of the object. Kubernetes maintains a list of all object specifications and constantly polls each object in order to ensure that its status is equal to the specification. If an object is unresponsive, Kubernetes will spin up a new version to replace it. If a object's status has drifted from the specification, Kubernetes will issue the necessary commands to drive that object back to its desired state.
For a certain operating scale, it becomes necessary to architect your applications as a distributed system. Kubernetes is designed to provide the infrastructural layer for such distributed systems, yielding clean abstractions to build applications on top of a collection of machines (collectively known as a cluster). More specifically, Kubernetes provides a unified interface for interacting with this cluster such that you don't have to worry about communicating with each machine individually.
It is commonly recommended that containers be developed with a single concern in mind. As a result, developing containerized applications lends itself quite nicely to the microservice architecture design pattern, which recommends "designing software applications as suites of independently deployable services."
The abstractions provided in Kubernetes naturally support the idea of decoupled services which can be scaled and updated independently. These services are logically separated and communicate via well-defined APIs. This logical separation allows teams to deploy changes into production at a higher velocity since each service can operate on independent release cycles (provided that they respect the existing API contracts).
In order to achieve the most benefit from containers and container orchestration, you should be deploying immutable infrastructure. This is, rather than logging in to a container on a machine to make changes (eg. updating a library), you should build a new container image, deploy the new version, and terminate the older version. As you transition across environments during the life-cycle of a project (development -> testing -> production) you should use the same container image and only modify configurations external to the container image (eg. by mounting a config file).
This becomes very important since containers are designed to be ephemeral, ready to be replaced by another container instance at any time. If your original container had mutated state (eg. manual configuration) but was shut down due to a failed healthcheck, the new container spun up in its place would not reflect those manual changes and could potentially break your application.
When you maintain immutable infrastructure, it also becomes much easier to roll back your applications to a previous state (eg. if an error occurs) - you can simply update your configuration to use an older container image.
Previously, I mentioned that we describe our desired state of the system through a collection of Kubernetes objects. Up until now, our discussion of Kubernetes has been relatively abstract and high-level. In this section, we'll dig into more specifics regarding how you can deploy applications on Kubernetes by covering the basic objects available in Kubernetes.
Kubernetes objects can be defined using either YAML or JSON files; these files defining objects are commonly referred to as manifests. It's a good practice to keep these manifests in a version controlled repository which can act as the single source of truth as to what objects are running on your cluster.
The Pod object is the fundamental building block in Kubernetes, comprised of one or more (tightly related) containers, a shared networking layer, and shared filesystem volumes. Similar to containers, pods are designed to be ephemeral - there is no expectation that a specific, individual pod will persist for a long lifetime.
You won't typically explicitly create Pod objects in your manifests, as it's often simpler to use higher level components which manage Pod objects for you.
A Deployment object encompasses a collection of pods defined by a template and a replica count (how many copies of the template we want to run). You can either set a specific value for the replica count or use a separate Kubernetes resource (eg. a horizontal pod autoscaler) to control the replica count based on system metrics such as CPU utilization.
Note: The Deployment object's controller actually creates another object, a ReplicaSet, under the hood. However, this is abstracted away from you as the user.
While you can't rely on any single pod to stay running indefinitely, you can rely on the fact that the cluster will always try to have $n$ pods available (where $n$ is defined by your specified replica count). If we have a Deployment with a replica count of 10 and 3 of those pods crash due to a machine failure, 3 more pods will be scheduled to run on a different machine in the cluster. For this reason, Deployments are best suited for stateless applications where Pods are able to be replaced at any time without breaking things.
The following YAML file provides an annotated example of how you might define a Deployment object. In this example, we want to run 10 instances of a container which serves an ML model over a REST interface.
Note: In order for Kubernetes to know how compute-intensive this workload might be, we should also provide resource limits in the pod template specification.
Deployments also allow us to specify how we would like to roll out updates when we have new versions of our container image; this blog post provides a good overview of your different options. If we wanted to override the defaults we would include an additional strategy
field under the object spec
. Kubernetes will make sure to gracefully shut down Pods running the old container image and spin up new Pods running the new container image.
Each Pod in Kubernetes is assigned a unique IP address that we can use to communicate with it. However, because Pods are ephemeral, it can be quite difficult to send traffic to your desired container. For example, let's consider the Deployment from above where we have 10 Pods running a container serving a machine learning model over REST. How do we reliably communicate with a server if the set of Pods running as part of the Deployment can change at any time? This is where the Service object enters the picture. A Kubernetes Service provides you with a stable endpoint which can be used to direct traffic to the desired Pods even as the exact underlying Pods change due to updates, scaling, and failures. Services know which Pods they should send traffic to based on labels (key-value pairs) which we define in the Pod metadata.
Note: This blog post does a nice job explaining how traffic is actually routed.
In this example, our Service sends traffic to all healthy Pods with the label app="ml-model"
.
The following YAML file provides an example for how we might wrap a Service around the earlier Deployment example.
While a Service allows us to expose applications behind a stable endpoint, the endpoint is only available to internal cluster traffic. If we wanted to expose our application to traffic external to our cluster, we need to define an Ingress object.
The benefit of this approach is that you can select which Services to make publicly available. For example, suppose that in addition to our Service for a machine learning model, we had a UI which leveraged the model's predictions as part of a larger application. We may choose to only make the UI available to public traffic, preventing users from being able to query the model serving Service directly.
The following YAML file defines an Ingress object for the above example, making the UI publicly accessible.
The Kubernetes objects I've described up until this point can be composed to create reliable, long-running services. In contrast, the Job object is useful when you want to perform a discrete task. For example, suppose we want to retrain our model daily based on the information collected from the previous day. Each day, we want to spin up a container to execute a predefined workload (eg. a train.py
script) and then shut down when the training finishes. Jobs provide us the ability to do exactly this! If for some reason our container crashes before finishing the script, Kubernetes will react by launching a new Pod in its place to finish the job. For Job objects, the "desired state" of the object is completion of the job.
The following YAML defines an example Job for training a machine learning model (assuming the training code is defined in train.py
).
Note: This Job specification will only execute a single training run. If we wanted to execute this job daily, we could define a CronJob object instead.
The objects discussed above are certainly not an exhaustive list of resource types available in Kubernetes. Some other objects that you'll most likely find useful when deploying applications include:
By this point, you're probably wondering how Kubernetes is capable of taking all of our object specifications and actually executing these workloads on a cluster. In this section we'll discuss the components that make up the Kubernetes control plane which govern how workloads are executed, monitored, and maintained on our cluster.
Before we dive in, it's important to distinguish two classes of machines on our cluster:
Now let's dive in to the main components on our master node. When you're communicating with Kubernetes to provide a new or updated object specification, you're talking to the API server.
More specifically, the API server validates requests to update objects and acts as the unified interface for questions about our cluster's current state. However, the state of our cluster is stored in etcd, a distributed key-value store. We'll use etcd to persist information regarding: our cluster configuration, object specifications, object statuses, nodes on the cluster, and which nodes the objects are assigned to run on.
Note: etcd is the only stateful component in our control plane, all other components are stateless.
Speaking of where objects should be run, the scheduler is in charge of determining this! The scheduler will ask the API server (which will then communicate with etcd) which objects which haven't been assigned to a machine. The scheduler will then determine which machines those objects should be assigned to and will reply back to the API server to reflect this assignment (which gets propagated to etcd).
The last component on the master node that we'll discuss in this post is the controller-manager, which monitors the state of a cluster through the API server to see if the current state of the cluster aligns with our desired state. If the actual state differs from our desired state, the controller-manager will make changes via the API server in an attempt to drive the cluster towards the desired state. The controller-manager is defined by a collection of controllers, each of which are responsible for managing objects of a specific resource type on the cluster. At a very high level, a controller will watch a specific resource type (eg. deployments) stored in etcd and create specifications for the pods which should be run to acheive the object's desired state. It is then the controller's responsibility for ensuring that these pods stay healthy while running and are shut down when needed.
To summarize what we've covered so far...
Next, let's discuss the control plane components which are run on worker nodes. Most of the resources available on our worker nodes are spent running our actual applications, but our nodes do need to know which pods they should be running and how to communicate with pods on other machines. The two final components of the control plane that we'll discuss cover exactly these two concerns.
The kubelet acts as a node's "agent" which communicates with the API server to see which container workloads have been assigned to the node. It is then responsible for spinning up pods to run these assigned workloads. When a node first joins the cluster, kubelet is responsible for announcing the node's existence to the API server so the scheduler can assign pods to it.
Lastly, kube-proxy enables containers to be able to communicate with each other across the various nodes on the cluster. This component handles all the networking concerns such as how to forward traffic to the appropriate pod.
Hopefully, by this point you should be able to start seeing the whole picture for how things operate in a Kubernetes cluster. All of the components interact through the API server and we store the state of our cluster in etcd. There are various components which write to etcd (via the API server) to make changes to the cluster and nodes on the cluster listen to etcd (via the API server) to see which pods it should be running.
The overall system is designed such that failures have minimal impact on the overall cluster. For example, if our master node went down, none of our applications would be immediately affected; we would just be prevented from making any further changes to the cluster until a new master node is brought online.
As with every new technology, you'll incur an overhead cost of adoption as you learn how it works and how it applies to the applications that you're building. It's a reasonable question to ask "do I really need Kubernetes?" so I'll attempt to provide some example situations where the answer might be no.
Thank you to Derek Whatley and Devon Kinghorn for teaching me most of what I know about Kubernetes and answering my questions as I've been trying to wrap my head around this technology. Thank you to Goku Mohandas, John Huffman, Dan Salo, and Zack Abzug for spending your time to review an early version of this post and provide thoughtful feedback. And lastly, thank you to Kelsey Hightower for all that you've contributed to the Kubernetes community - your talks have helped me understand the bigger picture and gave me confidence that I could learn about this topic.
Here are some resources that I found to be useful when learning about Kubernetes.
Blogs
Videos
Books
]]>Previously, I wrote about organizing machine learning projects where I presented the framework that I use for building and deploying models. However, that framework operates on the implicit assumption that you already know generally what your model should do. In this post, we'll dig deeper into how to develop the
]]>Previously, I wrote about organizing machine learning projects where I presented the framework that I use for building and deploying models. However, that framework operates on the implicit assumption that you already know generally what your model should do. In this post, we'll dig deeper into how to develop the requirements for a machine learning project when you're given a vague problem to solve. Some questions that we'll address include:
Note: Sometimes machine learning projects can be very straightforward; your stakeholders define an API specification stating the inputs to the system and the desired outputs and you agree that the task seems feasible. These projects typically support existing products with existing "intelligence" solutions - your task is to simply encapsulate the "intelligence" task using machine learning in lieu of the existing solution (ie. rule-based systems, mechanical turk workers, etc).
Throughout this blog post, I'll use the following problem statement as a running example.
We're building an application to help people keep their photos organized. Our target user base are casual smartphone photographers who have a large number of photos in their camera roll. We want to make it easier for these users to find photos of interest.
Notice how vague that problem statement is - what defines a photo of interest? There's a multitude of ways we could address this task and without understanding the problem in more detail we won't know which direction to take. At this point in time, we have insufficient information to specify an objective function for training a model.
Understanding the problem and developing the requirements isn't something you typically get right on the first attempt; this is often an iterative process where we initially define a set of coarse requirements and refine the detail as we gain more information.
Jump to:
The first step towards establishing any set of requirements for a project is understanding the problem you're setting out to solve. No one knows the problem better than the intended user - the person we are attempting to solve the problem for.
Perform informational interviews with end users to understand their perspective. The further removed you are from the end user, the less likely you are to solve the actual problem they're experiencing.
Do you remember playing the telephone game as a kid, where you have a chain of people and try to deliver a message from one end to the other by asking each person to transmit the message to the person next to them? Usually the message at the end is very different from the original message. In order to ensure you're solving the right problem, you'll want to be able to empathize with the people who currently experience that problem.
For the case of our photo app example, this would involve speaking with casual smartphone photographers and understanding how they use their app.
At this stage you don't want to start prescribing a solution, you're simply trying to understand the problem. However, it can be good to consider the capabilities of machine learning systems and ask questions which may eventually guide your scoping of a solution. For example, let's consider the motivation behind asking the first set of questions.
Even though you might be considering solutions, it's important that the conversations with users are focused on the problems they experience. You won't ask them about their thoughts on specific solutions until the next stage.
Subject yourself to the problem. As a machine learning practitioner, it can be tempting to jump right into training a model to learn some task. However, it's often very instructive to first force yourself to perform the task manually. Additionally, this helps you empathize with the users. Pay close attention to how you solve the task, as this might inform what features might be important to include when you do train a model to perform the task.
For example, after speaking with some casual smartphone photographers you might construct a couple photo albums and go through the tasks described during your interviews. What strategies did you find effective for finding content more quickly?
After (and only after) getting a better understanding of the problem, you'll want to start sketching out the set of possible solutions and approaches that you could take. It's generally a good idea to think broadly at this stage rather than prematurely honing in on your first decent idea. At this stage, we're trying to elicit the desired user experience, as this can ultimately drive the requirements of the project.
Prototype and iterate on the user experience using design tools to communicate possible solutions. There's something magical about seeing something concretely that has the ability to elicit tangible feedback from your users. When speaking in the abstract sense, it's possible for both you and the other stakeholders to have different understandings whilst under the illusion that you're in agreement. However, these latent misunderstandings often become very clear when you reduce an abstract idea to practice.
For example, I used a design tool called Figma to sketch out a couple different ways we might help users find photos of interest more quickly. The goal of these sketches is to spark discussion with our stakeholders and/or users in an attempt to start narrowing down the possible solution space.
These tools allow you to easily stitch together user flows and import data to use in your mockups.
Fake the machine learning component with "Wizard-of-Oz" experiments. Building a machine learning model takes a significant amount of work. You have to acquire data for training, decide on a model architecture, ensure the model is performing at a sufficient level, etc. Our goal here is to validate the utility of a model without actually going through all of the effort involved in building it. These types of experiments can be especially useful when there's still uncertainty surrounding how users will interact with your model.
Apple researchers wrote about one such study when deciding whether their digital assistant should mimic the conversational style of the user. Their core hypothesis was that users would prefer digital assistants that match their level of chattiness when conversing with the agent. Rather than training a machine learning model that would allow them to modulate the "chattiness" level in the agent's response, they first tested this hypothesis by faking the digital assistant component and having humans in a separate room follow a script pretending to be the digital assistant.
Figure out how to establish trust with the user. Ideally you'd like to design your product such that user interactions can improve your model, which in turn improve the user experience. This is commonly referred to as the "data flywheel" in machine learning products. However, in order to source meaningful interactions from your users, you may need to first establish trust with the user that those interactions will indeed improve their experience. For example, if we decided that we would perform facial recognition and allow you to search for photos with a specific person present, you'd likely require the user to assign identities to collections of photos with a common face. In order to motivate and incentivize users, they need to feel as if their effort in identifying faces in photos is meaningful. One common technique used here is to show the model's effort in an unobtrusive manner which informs the user how it's working. Continuing the facial recongition example, you might allow the user to toggle a setting which draws bounding boxes around the identified faces in photos.
At this point, you're likely to have a decent grasp on the approach you'll take to solve the presented problem. However, much work still lies ahead! It'll be important that you communicate effectively with stakeholders as you work to build out your solution; this starts with speaking a common language.
Perhaps a number of users leverage a technique of chunking photos together when searching, while you typically refer to that same technique as clustering. Without converging on a shared language, it's all too easy to talk right past one another without realizing you're actually on the same page.
Over-communicate and ask a lot of clarifying questions at the onset. This can include asking questions when you're already pretty sure what the answer will be, sometimes you may be surprised.
Present ideas and progress often. It's important to share progress with your stakeholders and hold discussion to ensure you're still heading in the right direction. As you present your work, it can be helpful to discuss things such as model metrics which you're using to evaluate performance and why we should care about those metrics; keeping things simple combined with repeated exposure can go a long way here.
Getting your product in the hands of your users is one of the best ways to validate your ideas. Quicker iterations allow you to validate more ideas, which in turn allows you to fine-tune your solution offering in order to maximize value to the user. Further, incremental deployments help ensure that you're heading in the right direction as you work to build out the full solution.
In my post on organizing machine learning projects, I presented a diagram for model development which represents the iterative nature of the work.
I also called out Stephen Merity's advice and Martin Zinkevich's Rule #4 of Machine Learning, which both advocate for initially deploying simple models in the spirit of winning by shipping. In this context, "winning by shipping" involves making quick iterations through the outermost development loop.
It can take getting a few projects under your belt to fully appreciate this wisdom. When I initially published the diagram a year ago, I tended to view progression through this iterative flow as a stage-gate process where you might loop back to an earlier step if you gain more information, but you still advance through the cycle in the order that sections appear on the diagram. However, after joining the machine learning team at Proofpoint, I've learned from the team that there can be immense value in shortcutting parts of the development process (eg. see shadow mode deployment below) in order to gain insights from deploying models on production data.
Deploy a baseline model on production data as soon as possible. Deploying your model on production data can be enlightening. Often times, the "live data" varies in unexpected ways from the data you've collected to use during development (often referred to as "train/test skew" or "production data drift").
As a countermeasure, it's often a good idea to deploy a simple model on production data as soon as possible. Depending on the consequence of wrong predictions, you might choose to deploy this simple model in "shadow mode" (don't actually use the predictions) or as a canary deployment on a small subset of your users. The goal of this deployment is to observe and characterize the types of errors that the simple model makes, which can inform further model improvements.
You may choose to initially skip some steps in the model development process in order to prioritize a quick first iteration and gain insights.
Deploying a simple model with urgency also helps ensure that the additional engineering work required to run a machine learning model is done upfront. This allows you to deploy your incremental model improvements (enabling quick iterations) rather than slaving away trying to build the perfect model in isolation.
Deliver value incrementally and quickly. If you're automating a system, start with the simplest task and deploy a solution quickly. Let data guide the priority order of further automation. In other words, automate first according to the "happy path" (assume nothing goes wrong) and then categorize the times where control is reverted to a human.
Going back to our photo app example, there are many opportunities to apply machine learning models to the product. Ideally, we'd like to start at the intersection of a model which is simple to develop and provides value to users. For example, we might decide to initially deploy a facial recognition model given the fact that there are a number of open source implementations that we could use and user studies provided us with some level of evidence that this model would be valuable as part of the product.
Measure time to results, not results. Sometimes it can be tricky to get into the mindset of delivering value quickly. It's easy to think that in order to delight the user, we need to give them the perfect product. Measuring time to results rather than the results themselves instills a culture of quick iterations which ultimately enables your team to deliver better products, winning by shipping.
Often times we're presented with vague problems and tasked with developing a "machine learning solution." By spending the effort to properly scope your project and define its requirements up front, you can establish a foundation for smooth iterations through the model development loop as we work towards a final solution.
Now go build great things!
The insights shared in this blog post are shaped by my experience working on teams building and deploying machine learning models in products. On my current team at Proofpoint, we frequently discuss how we can refine our process to deliver value quickly on machine learning projects; these conversations and shared experiences are invaluable. Thank you to John Berryman, Dan Salo, John Huffman, and Michelle Carney for reading early drafts of this post and providing feedback.
Blog posts
Case studies
Talks
Guides
People to follow
Jump to:
Previously, I've written about feed-forward neural networks as a generic function approximator and convolutional neural networks for efficiently extracting local information from data. In this post, I'll discuss a third type
]]>Jump to:
Previously, I've written about feed-forward neural networks as a generic function approximator and convolutional neural networks for efficiently extracting local information from data. In this post, I'll discuss a third type of neural networks, recurrent neural networks, for learning from sequential data.
For some classes of data, the order in which we receive observations is important. As an example, consider the two following sentences:
These two sentences are communicating quite different messages, but this can only be interpreted when considering the sequential order of the words. Without this information, we're unable to disambiguate from the collection of words: {'you', 'sorry', 'me', 'not', 'im', 'its'}
.
Recurrent neural networks allow us to formulate the learning task in a manner which considers the sequential order of individual observations.
In this section, we'll build the intuition behind recurrent neural networks. We'll start by reviewing standard feed-forward neural networks and build a simple mental model of how these networks learn. We'll then build on that to discuss how we can extend this model to a sequence of related inputs.
Recall that neural networks perform a series of layer by layer transformations to our input data. The hidden layers of the network form intermediate representations of our input data which make it easier to solve the given task.
This is demonstrated in the example below. Observe how our input space is warped into one which allows for a linear decision boundary to cleanly separate the two classes. At a high level, you can think of the hidden layers as "useful representations" of the original input data.
Now let's consider how we can leverage this insight for a sequence of related observations.
Let's first focus on the initial value in the sequence. As we calculate the forward pass through the network, we build a "useful representation" of our input in the hidden layers (the activations in these layers define our hidden state), continuing on to calculate an output prediction for the initial time-step.
When considering the next time-step in the sequence, we want to leverage any information we've already extracted from the sequence.
In order to do this, our next hidden state will be calculated as a combination of the previous hidden state and latest input.
The basic method for combining these two pieces of information is shown below; however, there exist other more advanced methods that we'll discuss later (gated recurrent units, long short-term memory units). Here, we have one set of weights $w_{ih}$ to transform the input to a hidden layer representation and a second set of weights $w_{hh}$ to bring along information from the previous hidden state into the next time-step.
We can continue performing this same calculation of incorporating new information to update the value of the hidden state for an arbitrarily long sequence of observations.
By always remembering the previous hidden state, we're able to chain a sequence of events together. This also allows us to backpropagate errors to earlier timesteps during training, often referred to as "backpropagation through time".
One of the benefits of recurrent neural networks is the ability to handle arbitrary length inputs and outputs. This flexibility allows us to define a broad range of tasks. In this section, I'll discuss the general architectures used for various sequence learning tasks.
One to many RNNs are used in scenarios where we have a single input observation and would like to generate an arbitrary length sequence related to that input. One example of this is image captioning, where you feed in an image as input and output a sequence of words to describe the image. For this architecture, we take our prediction at each time step and feed that in as input to the next timestep, iteratively generating a sequence from our initial observation and following predictions.
Many to one RNNs are used to look across a sequence of inputs and make a single determination from that sequence. For example, you might look at a sequence of words and predict the sentiment of the sentence. Generally, this structure is used when you want to perform classification on sequences of data.
Many to many (same) RNNs are used for tasks in which we would like to predict a label for each observation in a sequence, sometimes referred to as dense classification. For example, if we would like to detect named entities (person, organization, location) in sentences, we might produce a label for every single word denoting whether or not that word is part of a named entity. As another example, you could feed in a video (sequence of images) and predict the current activity in frame.
Many to many (different) RNNs are useful for translating a sequence of inputs into a different but related sequence of outputs. In this case, both the input and the output can be arbitrary length sequences and the input length might not always be equal to the output length. For example, a machine translation model would be expected to translate "how are you" (input) into "cómo estás" (output) even though the sequence lengths are different.
One of the weaknesses of a ordinary recurrent neural networks is that we can only use the set of observations which we have already seen when making a prediction. As an example, consider training a model for named entity recognition. Here, we want the model to output the start and end of phrases which contain a named entity. Consider the following two sentences:
"I can't believe that Teddy Roosevelt was your great grandfather!"
"I can't believe that Teddy bear is made out of chocolate!"
However, if you only read the input sequence from left to right, it's hard to tell whether or not you should mark "Teddy" as the start of a name.
Ideally, our model output would look something like this when reading the first sentence (roughly following the inside–outside–beginning tagging format).
When determining whether or not a token is the start of a name, it would sure be helpful to see which tokens follow after it; a bidirectional recurrent neural network provides exactly that. Here, we process the sequence reading from left-to-right and right-to-left in parallel and then combine these two representations such that at any point in a sequence you have knowledge of the tokens which came before and after it.
We have one set of recurrent cells which process the sequence from left to right...
... and another set of recurrent cells which process the sequence from right to left.
Thus, at any given time-step we have knowledge of all of the tokens which came before the current time-step and all of the tokens which came after that time-step.
One key component that I glanced over previously is that the recurrent layer's weights are shared across time-steps. This provides us with the flexibility to process arbitrary length sequences, but also introduces a unique challenge when training the network.
For a concrete example, suppose you've trained a recurrent neural network as a language model (predict the next word in a sequence). As you're generating text, it might be important to know whether the current word is inside quotation marks. Let's assume this is true and consider the case where our model makes a wrong prediction because it wasn't paying attention to whether or not the current time-step is inside quotation marks. Ideally, you want a way to send back a signal to the earlier time-step where we entered the quotation mark to say "pay attention!" to avoid the same mistake in the future. Doing so requires sending our error signal back through many time-steps. (As an aside, Karpathy has a famous blog post which shows that a character-level RNN language model can indeed pay attention to this detail.)
Let's consider what the backpropagation step would look like to send this signal to earlier time-steps.
As a reminder, the backpropagation algorithm states that we can define the relationship between a given layer's weights and the final loss using the following expression:
$$ \frac{{\partial E\left( w \right)}}{{\partial w^{(l)}}} = {\left( {{\delta ^{(l + 1)}}} \right)^T}{a^{(l)}} $$
where ${\delta ^{(l)}}$ (our "error" term) can be calculated as:
$$ {\delta ^{(l)}} = {\delta ^{(l + 1)}}{w ^{(l)}}f'\left( {{a^{(l)}}} \right) $$
This allows to efficiently calculate the gradient for any given layer by reusing the terms already computed at layer $l+1$. However, notice how there's a term for the weight matrix, ${w ^{(l)}}$, included in the computation at every layer. Now recall that I earlier mentioned recurrent layers share weights across time-steps. This means that the same exact value is being mulitplied every time we perform this layer by layer backpropagation through time.
Let's suppose one of the weights in our matrix is 0.5 and we're attempting to send a signal back 10 time-steps. By the time we've backpropagated to $t-10$, we've multiplied the overall gradient expression by $0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 = 0.00098$. This has the effect of drastically reducing the magnitude of our error signal! This phenomenon is known as the "vanishing gradient" problem which makes it very hard to learn using a vanilla recurrent neural network. The same problem can occur when the weight is greater than one, introducing an exploding gradient, although this is slightly easier to manage thanks to a technique known as gradient clipping.
In following posts, we'll look at two common variations of the standard recurrent cell which alleviate this problem of a vanishing gradient.
Papers
Lectures/Notes
Blog posts
]]>Jump to:
In the world of deep learning, we often use neural networks to learn representations of objects as vectors. We can then use
]]>Jump to:
In the world of deep learning, we often use neural networks to learn representations of objects as vectors. We can then use these vector representations for a myriad of useful tasks.
To give a concrete example, let's consider the case of a facial recognition system powered by deep learning. For this use case, the objects are images of people's faces and the task is to identify whether or not the person in a submitted photo matches a person in a database of known identities. We'll use a neural network to build vector representations of all of the images; then, performing facial recognition is as simple as taking the vector representation of a submitted image (the query vector) and searching for similar vectors in our database. Here, we define similarity as vectors which are close together in vector-space. (How we actually train a network to produce these vector representations is outside of the scope of this blog post.)
All of these vectors were extracted from a ResNet50 model. Notice how the values in the query vector are quite similar to the vector in the top left of known identities.
The process of finding vectors that are close to our query is known as nearest neighbors search. A naive implementation of nearest neighbors search is to simply calculate the distance between the query vector and every vector in our collection (commonly referred to as the reference set). However, calculating these distances in a brute force manner quickly becomes infeasible as your reference set grows to millions of objects.
Imagine if Facebook had to compare each face in a new photo against all of its users every time it suggested who to tag, this would be computationally infeasible!
A class of methods known as approximate nearest neighbors search offer a solution to our scaling dilemma by partitioning the vector space in a clever way such that we only need to examine a small subset of the overall reference set.
Approximate methods alleviate this computational burden by cleverly partitioning the vectors such that we only need to focus on a small subset of objects.
In this blog post, I'll cover a couple of techniques used for approximate nearest neighbors search. This post will not cover approximate nearest neighbors methods exhaustively, but hopefully you'll be able to understand how people generally approach this problem and how to apply these techniques in your own work.
In general, the approximate nearest neighbor methods can be grouped as:
The first approximate nearest neighbors method we'll cover is a tree-based approach. K-dimensional trees generalize the concept of a binary search tree into multiple dimensions.
The general procedure for growing a k-dimensional tree is as follows:
A toy 2-dimensional example is visualized below. At the top level, we select a random dimension (out of the two possible dimensions, $x_0$ and $x_1$) and calculate the median. Then, we follow the same procedure of picking a dimension and calculating the median for each path independently. This process is repeated until some stopping criterion is satisfied; each leaf node in the tree contains a subset of vectors from our reference set.
We can view how the two-dimensional vectors are partitioned at each level of the k-d tree in the figure below. Take a minute to verify that this visualization matches what is described in the tree above.
In order to see the usefulness of this tree, let's now consider how we could use this data structure to perform an approximate nearest neighbor query. As we walk down the tree, notice how the highlighted area (the area in vector space that we're interested in) shrinks down to a small subset of the original space. (I'll use the level 4 subplot for this example.)
At the top level, we look at the first dimension of the query vector and ask whether or not its value is greater than or equal to 1. Since 4 is greater than 1, we walk down the "yes" path to the next level down. We can safely ignore any of the nodes that follow the first "no" path.
Now we look at the second dimension of the vector and ask whether its value is greater than or equal to 0. Since -2 is less than 0, we now walk down the "no" path. Notice again how the area of interest in our overall vector-space continues to shrink.
Finally, once we reach the bottom of the tree we are left with a collection of vectors. Thankfully, this is a small subset relative to the overall size of the reference set, so calculating the distance between the query vector and each vector in this subset is computationally feasible.
K-d trees are popular due to their simplicity, however, this technique struggles to perform well when dealing with high dimensional data. Further notice how we only returned vectors which are found in the same cell as the query point. In this example, the query vector happened to fall in the middle of a cell, but you could imagine a scenario where the query vector lies near the edge of a cell and we miss out on vectors which lie just outside of the cell.
Another approach to the approximate nearest neighbors problem is to collapse our reference set into a smaller collection of representative vectors. We can find these "representative" vectors by simply running the K-means algorithm on our data. In the literature, this collection of "representative" vectors is commonly referred to as the codebook.
The right figure displays a Voronoi diagram which essentially partitions the space according to the set of points for which a given centroid is closest.
We'll then "map" all of our data onto these centroids. By doing this, we can represent our reference set of a couple hundred vectors with only 7 representative centroids. This greatly reduces the number of distance computations we need to perform (only 7!) when making an nearest neighbors query.
We can then maintain an inverted list to keep track of all of the original objects in relation to which centroid represents the quantized vector.
You can optionally retrieve the full vectors for all of the ids maintained in the inverted list for a given centroid, calculating the true distances between each vector and our query. This is a process known as re-ranking and can improve your query performance.
Similar to before, let's now look at how we can use this method to perform a query. For a given query vector, we'll calculate the distances between the query vector and each centroid in order to find the closest centroid. We can then look up the centroid in our inverted list in order to find all of the nearest vectors.
Unfortunately, in order to get good performance using quantization, you typically need to use very large numbers of centroids for quantization; this impedes on original goal of alleviating the computational burden of calculating too many distances.
Product quantization addresses this problem by first subdividing the original vectors into subcomponents and then quantizing (ie. running K-means on) each subcomponent separately. A single vector is now represented by a collection of centroids, one for each subcomponent.
To illustrate this, I've provided two examples. In the 8D case, you can see how our vector is divided into subcomponents and each subcomponent is represented by some centroid value. However, the 2D example shows us the benefit of this approach. In this case, we can only split our 2D vector into a maximum of two components. We'll then quantize each dimension separately, squashing all of the data onto the horizontal axis and running k-means and then squashing all of the data onto the vertical axis and running k-means again. We find 3 centroids for each subcomponent with a total of 6 centroids. However, the total set of all possible quantized states for the overall vector is the Cartesian product of the subcomponent centroids.
In other words, if we divide our vector into $m$ subcomponents and find $k$ centroids, we can represent $k^m$ possible quantizations using only $km$ vectors! The chart below shows how many centroids are needed in order to get 90% of the top 5 search results correct for an approximate nearest neighbors query. Notice how using product quantization ($m>1$) vastly reduces the number of centroids needed to represent our data. One of the reasons why I love this idea so much is that we've effectively turned the curse of dimensionality into something highly beneficial!
Product quantization alone works great when our data is distributed relatively evenly across the vector-space. However, in reality our data is usually multi-modal. To handle this, a common technique involves first training a coarse quantizer to roughly "slice" up the vector-space, and then we'll run product quantization on each individual coarse cell.
Below, I've visualized the data that falls within a single coarse cell. We'll use product quantization to find a set of centroids which describe this local subset of data, and then repeat for each coarse cell. Commonly, people encode the vector residuals (the difference between the original vector and the closest coarse centroid) since the residuals tend to have smaller magnitudes and thus lead to less lossy compression when running product quantization. In simple terms, we treat each coarse centroid as a local origin and run product quantization on the data with respect to the local origin rather than the global origin.
Pro-tip: If you want to scale to really large datasets you can use product quantization as both the coarse quantizer and the fine-grained quantizer within each coarse cell. See this paper for the details.
The ideal goal for quantization is to develop a codebook which is (1) concise and (2) highly representative of our data. More specifically, we'd like all of the vectors in our codebook to represent dense regions of our data in vector-space. A centroid in a low-density area of our data is inefficient at representing data and introduces high distortion error for any vectors which fall in its Voronoi cell.
One potential way we can attempt to avoid these inefficient centroids is to add an alignment step to our product quantization. This allows for our product quantizers to better cover the local data for each coarse Voronoi cell.
We can do this by applying a transformation to our data such that we minimize our quantization distortion error. One simple way to minimize this quantization distortion error is to simply apply PCA in order to mean-center the data and rotate it such that the axes capture most of the variance within the data.
Recall my earlier example where we ran product quantization on a toy 2D dataset. In doing so, we effectively squashed all of the data onto the horizontal axis and ran k-means and then repeated this for the vertical axis. By rotating the data such that the axes capture most of the variance, we can more effectively cover our data when using product quantization.
This technique is known as locally optimized product quantization, since we're manipulating the local data within each coarse Voronoi cell in order to optimize the product quantization performance. The authors who introduced this technique have a great illustrative example of how this technique can better fit a given set of vectors.
This blog post glances over (c) Optimized Product Quantization which is the same idea of aligning our data for better product quantization performance, but the alignment is performed globally instead of aligning local data in each Voronoi cell independently. Image credit
The authors who introduced product quantization noted that the technique works best when the vector subcomponents had similar variance. A nice side effect of doing PCA alignment is that during the process we get a matrix of eigenvalues which describe the variance of each principal component. We can use this to our advantage by allocating principal components into buckets of equal variance.
Papers
I didn't cover binary codes in this post - but I should have! I may come back and edit the post to include more information soon. Until then, enjoy this paper.
Lectures
Blog posts/talks
Libraries and Github repos
]]>The goal of this document is to provide a common framework for approaching machine learning projects that can be referenced by practitioners. If you build ML models, this post is for you. If you collaborate with people who build ML models, I hope that this guide provides you with a
]]>The goal of this document is to provide a common framework for approaching machine learning projects that can be referenced by practitioners. If you build ML models, this post is for you. If you collaborate with people who build ML models, I hope that this guide provides you with a good perspective on the common project workflow. Knowledge of machine learning is assumed.
This overview intends to serve as a project "checklist" for machine learning practitioners. Subsequent sections will provide more detail.
Project lifecycle
Machine learning projects are highly iterative; as you progress through the ML lifecycle, you’ll find yourself iterating on a section until reaching a satisfactory level of performance, then proceeding forward to the next task (which may be circling back to an even earlier step). Moreover, a project isn’t complete after you ship the first version; you get feedback from real-world interactions and redefine the goals for the next iteration of deployment.
Team roles
A typical team is composed of:
It may be tempting to skip this section and dive right in to "just see what the models can do". Don't skip this section. All too often, you'll end up wasting time by delaying discussions surrounding the project goals and model evaluation criteria. Everyone should be working toward a common goal from the start of the project.
It's worth noting that defining the model task is not always straightforward. There's often many different approaches you can take towards solving a problem and it's not always immediately evident which is optimal. If your problem is vague and the modeling task is not clear, jump over to my post on defining requirements for machine learning projects before proceeding.
Ideal: project has high impact and high feasibility.
Mental models for evaluating project impact:
When evaluating projects, it can be useful to have a common language and understanding of the differences between traditional software and machine learning software. Andrej Karparthy's Software 2.0 is recommended reading for this topic.
Software 1.0
Software 2.0
See this talk for more detail.
A quick note on Software 1.0 and Software 2.0 - these two paradigms are not mutually exclusive. Software 2.0 is usually used to scale the logic component of traditional software systems by leveraging large amounts of data to enable more complex or nuanced decision logic.
For example, Jeff Dean talks (at 27:15) about how the code for Google Translate used to be a very complicated system consisting of ~500k lines of code. Google was able to simplify this product by leveraging a machine learning model to perform the core logical task of translating text to a different language, requiring only ~500 lines of code to describe the model. However, this model still requires some "Software 1.0" code to process the user's query, invoke the machine learning model, and return the desired information to the user.
In summary, machine learning can drive large value in applications where decision logic is difficult or complicated for humans to write, but relatively easy for machines to learn. On that note, we'll continue to the next section to discuss how to evaluate whether a task is "relatively easy" for machines to learn.
Some useful questions to ask when determining the feasibility of a project:
Establish a single value optimization metric for the project. Can also include several other satisficing metrics (ie. performance thresholds) to evaluate models, but can only optimize a single metric.
Example:
The optimization metric may be a weighted sum of many things which we care about. Revisit this metric as performance improves.
Some teams may choose to ignore a certain requirement at the start of the project, with the goal of revising their solution (to meet the ignored requirements) after they have discovered a promising general approach.
Decide at what point you will ship your first model.
Some teams aim for a “neutral” first launch: a first launch that explicitly deprioritizes machine learning gains, to avoid getting distracted. — Google Rules of Machine Learning
The motivation behind this approach is that the first deployment should involve a simple model with focus spent on building the proper machine learning pipeline required for prediction. This allows you to deliver value quickly and avoid the trap of spending too much of your time trying to "squeeze the juice."
A well-organized machine learning codebase should modularize data processing, model definition, model training, and experiment management.
Example codebase organization:
data/
docker/
api/
app.py
project_name/
models/
base.py
simple_baseline.py
cnn.py
configs/
baseline.yaml
latest.yaml
datasets.py
train.py
experiment.py
scripts/
data/
provides a place to store raw and processed data for your project. You can also include a data/README.md
file which describes the data for your project.
docker/
is a place to specify one or many Dockerfiles for the project. Docker (and other container solutions) help ensure consistent behavior across multiple machines and deployments.
api/app.py
exposes the model through a REST client for predictions. You will likely choose to load the (trained) model from a model registry rather than importing directly from your library.
models/
defines a collection of machine learning models for the task, unified by a common API defined in base.py
. These models include code for any necessary data preprocessing and output normalization.
datasets.py
manages construction of the dataset. Handles data pipelining/staging areas, shuffling, reading from disk.
experiment.py
manages the experiment process of evaluating multiple models/ideas. This constructs the dataset and models for a given experiment.
train.py
defines the actual training loop for the model. This code interacts with the optimizer and handles logging during training.
See other examples here, here and here.
An ideal machine learning pipeline uses data which labels itself. For example, Tesla Autopilot has a model running that predicts when cars are about to cut into your lane. In order to acquire labeled data in a systematic manner, you can simply observe when a car changes from a neighboring lane into the Tesla's lane and then rewind the video feed to label that a car is about to cut in to the lane.
As another example, suppose Facebook is building a model to predict user engagement when deciding how to order things on the newsfeed. After serving the user content based on a prediction, they can monitor engagement and turn this interaction into a labeled observation without any human effort. However, just be sure to think through this process and ensure that your "self-labeling" system won't get stuck in a feedback loop with itself.
For many other cases, we must manually label data for the task we wish to automate. The quality of your data labels has a large effect on the upper bound of model performance.
Here is a real use case from work for model improvement and the steps taken to get there:
— Alex Gude (@alex_gude) April 24, 2019
- Baseline: 53%
- Logistic: 58%
- Deep learning: 61%
- **Fixing your data: 77%**
Some good ol' fashion "understanding your data" is worth it's weight in hyperparameter tuning!
Most data labeling projects require multiple people, which necessitates labeling documentation. Even if you're the only person labeling the data, it makes sense to document your labeling criteria so that you maintain consistency.
One tricky case is where you decide to change your labeling methodology after already having labeled data. For example, in the Software 2.0 talk mentioned previously, Andrej Karparthy talks about data which has no clear and obvious ground truth.
If you run into this, tag "hard-to-label" examples in some manner such that you can easily find all similar examples should you decide to change your labeling methodology down the road. Additionally, you should version your dataset and associate a given model with a dataset version.
Tip: After labeling data and training an initial model, look at the observations with the largest error. These examples are often poorly labeled.
Active learning is useful when you have a large amount of unlabeled data and you need to decide what data you should label. Labeling data can be expensive, so we'd like to limit the time spent on this task.
As a counterpoint, if you can afford to label your entire dataset, you probably should. Active learning adds another layer of complexity.
"The main hypothesis in active learning is that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less data for training." - DataCamp
General approach:
Leveraging weak labels
However, tasking humans with generating ground truth labels is expensive. Often times you'll have access to large swaths of unlabeled data and a limited labeling budget - how can you maximize the value from your data? In some cases, your data can have information which provides a noisy estimate of the ground truth. For example, if you're categorizing Instagram photos, you might have access to the hashtags used in the caption of the image. Other times, you might have subject matter experts which can help you develop heuristics about the data.
Snorkel is an interesting project produced by the Stanford DAWN (Data Analytics for What’s Next) lab which formalizes an approach towards combining many noisy label estimates into a probabilistic ground truth. I'd encourage you to check it out and see if you might be able to leverage the approach for your problem.
Establish performance baselines on your problem. Baselines are useful for both establishing a lower bound of expected performance (simple model baseline) and establishing a target performance level (human baseline).
Start simple and gradually ramp up complexity. This typically involves using a simple model, but can also include starting with a simpler version of your task.
Before doing anything intelligent with "AI", do the unintelligent version fast and at scale.
— Smerity (@Smerity) February 13, 2019
At worst you understand the limits of a simplistic approach and what complexities you need to handle.
At best you realize you don't need the overhead of intelligence.
Once a model runs, overfit a single batch of data. Don't use regularization yet, as we want to see if the unconstrained model has sufficient capacity to learn from the data.
Survey the literature. Search for papers on Arxiv describing model architectures for similar problems and speak with other practitioners to see which approaches have been most successful in practice. Determine a state of the art approach and use this as a baseline model (trained on your dataset).
Reproduce a known result. If you're using a model which has been well-studied, ensure that your model's performance on a commonly-used dataset matches what is reported in the literature.
Understand how model performance scales with more data. Plot the model performance as a function of increasing dataset size for the baseline models that you've explored. Observe how each model's performance scales as you increase the amount of data used for training.
Once you have a general idea of successful model architectures and approaches for your problem, you should now spend much more focused effort on squeezing out performance gains from the model.
Build a scalable data pipeline. By this point, you've determined which types of data are necessary for your model and you can now focus on engineering a performant pipeline.
Apply the bias variance decomposition to determine next steps. Break down error into: irreducible error, avoidable bias (difference between train error and irreducible error), variance (difference between validation error and train error), and validation set overfitting (difference between test error and validation error).
Use coarse-to-fine random searches for hyperparameters. Start with a wide hyperparameter space initially and iteratively hone in on the highest-performing region of the hyperparameter space.
Perform targeted collection of data to address current failure modes. Develop a systematic method for analyzing errors of your current model. Categorize these errors, if possible, and collect additional data to better cover these cases.
Why is your model performing poorly?
Key mindset for DL troubleshooting: pessimism.
In order to complete machine learning projects efficiently, start simple and gradually increase complexity. Start with a solid foundation and build upon it in an incremental fashion.
Tip: Fix a random seed to ensure your model training is reproducible.
Common bugs:
oh: 5) you didn't use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won't make you silently fail, but they are spurious parameters
— Andrej Karpathy (@karpathy) July 1, 2018
Use clustering to uncover failure modes and improve error analysis:
Categorize observations with incorrect predictions and determine what best action can be taken in the model refinement stage in order to improve performance on these cases.
If you haven't already written tests for your code yet, you should write them at this point.
Different components of a ML product to test:
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Data:
Model:
Infrastructure:
Monitoring:
Be sure to have a versioning system in place for:
A common way to deploy a model is to package the system into a Docker container and expose a REST API for inference.
Canarying: Serve new model to a small subset of users (ie. 5%) while still serving the existing model to the remainder. Check to make sure rollout is smooth, then deploy new model to rest of users.
Shadow mode: Ship a new model alongside the existing model, still using the existing model for predictions but storing the output for both models. Measuring the delta between the new and current model's predictions will give an indication for how drastically things will change when you switch to the new model.
Hidden Technical Debt in Machine Learning Systems (quoted below, emphasis mine)
A primer on concept of technical debt:
As with fiscal debt, there are often sound strategic reasons to take on technical debt. Not all debt is bad, but all debt needs to be serviced. Technical debt may be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation. The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability. Deferring such payments results in compounding costs. Hidden debt is dangerous because it compounds silently.
Machine learning projects are not complete upon shipping the first version. If you are "handing off" a project and transferring model responsibility, it is extremely important to talk through the required model maintenance with the new team.
Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.
CACE principle: Changing Anything Changes Everything
Machine learning systems are tightly coupled. Changes to the feature space, hyper parameters, learning rate, or any other "knob" can affect model performance.
Specific mitigation strategies:
Undeclared consumers of your model may be inadvertently affected by your changes.
"Without access controls, it is possible for some of these consumers to be undeclared consumers, consuming the output of a given prediction model as an input to another component of the system."
If your model and/or its predictions are widely accessible, other components within your system may grow to depend on your model without your knowledge. Changes to the model (such as periodic retraining or redefining the output) may negatively affect those downstream components.
Specific mitigation strategies:
Avoid depending on input signals which may change over time.
Some features are obtained by a table lookup (ie. word embeddings) or simply an input pipeline which is outside the scope of your codebase. When these external feature representations are changed, the model's performance can suffer.
Specific mitigation strategies:
Eliminate unnecessary features.
Regularly evaluate the effect of removing individual features from a given model. A model's feature space should only contain relevant and important features for the given task.
There are many strategies to determine feature importances, such as leave-one-out cross validation and feature permutation tests. Unimportant features add noise to your feature space and should be removed.
Tip: Document deprecated features (deemed unimportant) so that they aren't accidentally reintroduced later.
Model performance will likely decline over time.
As the input distribution shifts, the model's performance will suffer. You should plan to periodically retrain your model such that it has always learned from recent "real world" data.
This guide draws inspiration from the Full Stack Deep Learning Bootcamp, best practices released by Google, my personal experience, and conversations with fellow practitioners.
Find something that's missing from this guide? Let me know!
In this post, I'll discuss an overview of deep learning techniques for object detection using convolutional neural networks. Object detection is useful for understanding what's in an image, describing both what is in an image and where those objects are found.
In general, there's two different approaches for this task
]]>In this post, I'll discuss an overview of deep learning techniques for object detection using convolutional neural networks. Object detection is useful for understanding what's in an image, describing both what is in an image and where those objects are found.
In general, there's two different approaches for this task – we can either make a fixed number of predictions on grid (one stage) or leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).
In this blog post, I'll discuss the one-stage approach towards object detection; a follow-up post will then discuss the two-stage approach. Each approach has its own strengths and weaknesses, which I'll discuss in the respective blog posts.
Jump to:
The goal of object detection is to recognize instances of a predefined set of object classes (e.g. {people, cars, bikes, animals}) and describe the locations of each detected object in the image using a bounding box. Two examples are shown below.
Example images are taken from the PASCAL VOC dataset.
We'll use rectangles to describe the locations of each object, which may lead to imperfect localizations due to the shapes of objects. An alternative approach would be image segmentation which provides localization at the pixel-level.
This blog post will focus on model architectures which directly predict object bounding boxes for an image in a one-stage fashion. In other words, there is no intermediate task (as we'll discuss later with region proposals) which must be performed in order to produce an output. This leads to a simpler and faster model architecture, although it can sometimes struggle to be flexible enough to adapt to arbitrary tasks (such as mask prediction).
In order to understand what's in an image, we'll feed our input through a standard convolutional network to build a rich feature representation of the original image. We'll refer to this part of the architecture as the "backbone" network, which is usually pre-trained as an image classifier to more cheaply learn how to extract features from an image. This is a result of the fact that data for image classification is easier (and thus cheaper) to label as it only requires a single label as opposed to defining bounding box annotations for each image. Thus, we can train on a very large labeled dataset (such as ImageNet) in order to learn good feature representations.
After pre-training the backbone architecture as an image classifier, we'll remove the last few layers of the network so that our backbone network outputs a collection of stacked feature maps which describe the original image in a low spatial resolution albeit a high feature (channel) resolution. In the example below, we have a 7x7x512 representation of our observation. Each of the 512 feature maps describe different characteristics of the original image.
We can relate this 7x7 grid back to the original input in order to understand what each grid cell represents relative to the original image.
We can also determine roughly where objects are located in the coarse (7x7) feature maps by observing which grid cell contains the center of our bounding box annotation. We'll assign this grid cell as being "responsible" for detecting that specific object.
In order to detect this object, we will add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps in order to produce an activation corresponding with the grid cell which contains our object.
If the input image contains multiple objects, we should have multiple activations on our grid denoting that an object is in each of the activated regions.
However, we cannot sufficiently describe each object with a single activation. In order to fully describe a detected object, we'll need to define:
Thus, we'll need to learn a convolution filter for each of the above attributes such that we produce $5 + C$ output channels to describe a single bounding box at each grid cell location. This means that we'll learn a set of weights to look across all 512 feature maps and determine which grid cells are likely to contain an object, what classes are likely to be present in each grid cell, and how to describe the bounding box for possible objects in each grid cell.
The full output of applying $5 + C$ convolutional filters is shown below for clarity, producing one bounding box descriptor for each grid cell.
However, some images might have multiple objects which "belong" to the same grid cell. We can alter our layer to produce $B(5 + C)$ filters such that we can predict $B$ bounding boxes for each grid cell location.
Visualizing the full convolutional output of our $B(5 + C)$ filters, we can see that our model will always produce a fixed number of $N \times N \times B$ predictions for a given image. We can then filter our predictions to only consider bounding boxes which has a $p_{obj}$ above some defined threshold.
Because of the convolutional nature of our detection process, multiple objects can be detected in parallel. However, we also end up predicting for a large number grid cells where no object is found. Although we can filter these bounding boxes out by their $p_{obj}$ score, this introduces quite a large imbalance between the predicted bounding boxes which contain an object and those which do not contain an object.
The two models I'll discuss below both use this concept of "predictions on a grid" to detect a fixed number of possible objects within an image. In the respective sections, I'll describe the nuances of each approach and fill in some of the details that I've glanced over in this section so that you can actually implement each model.
The "predictions on a grid" approach produces a fixed number of bounding box predictions for each image. However, we would like to filter these predictions in order to only output bounding boxes for objects that are actually likely to be in the image. Moreover, we want a single bounding box prediction for each object detected.
We can filter out most of the bounding box predictions by only considering predictions with a $p_{obj}$ above some defined confidence threshold. However, we still may be left with multiple high-confidence predictions describing the same object. Thus, we need a method for removing redundant object predictions such that each object is described by a single bounding box.
To accomplish this, we'll use a technique known as non-max suppression. At a high level, this technique will look at highly overlapping bounding boxes and suppress (or discard) all of the predictions except the highest confidence prediction.
We'll perform non-max suppression on each class separately. Again, the goal here is to remove redundant predictions so we shouldn't be concerned if we have two predictions that overlap if one box is describing a person and the other box is describing a bicycle. However, if two bounding boxes with high overlap are both describing a person, it's likely that these predictions are describing the same person.
The YOLO model was first published (by Joseph Redmon et al.) in 2015 and subsequently revised in two following papers. In each section, I'll discuss the specific implementation details and refinements that were made to improve performance.
The original YOLO network uses a modified GoogLeNet as the backbone network. Redmond later created a new model named DarkNet-19 which follows the general design of a $3 \times 3$ filters, doubling the number of channels at each pooling step; $1 \times 1$ filters are also used to periodically compress the feature representation throughout the network. His latest paper introduces a new, larger model named DarkNet-53 which offers improved performance over its predecessor.
All of these models were first pre-trained as image classifiers before being adapted for the detection task. In the second iteration of the YOLO model, Redmond discovered that using higher resolution images at the end of classification pre-training improved the detection performance and thus adopted this practice.
Adapting the classification network for detection simply consists of removing the last few layers of the network and adding a convolutional layer with $B(5 + C)$ filters to produce the $N \times N \times B$ bounding box predictions.
The first iteration of the YOLO model directly predicts all four values which describe a bounding box. The $x$ and $y$ coordinates of each bounding box are defined relative to the top left corner of each grid cell and normalized by the cell dimensions such that the coordinate values are bounded between 0 and 1. We define the boxes width and height such that our model predicts the square-root width and height; by defining the width and height of the boxes as a square-root value, differences between large numbers are less significant than differences between small numbers (confirm this visually by looking at a plot of $y = \sqrt {x}$). Redmond chose this formulation because “small deviations in large boxes matter less than in small boxes" and thus when calculating our loss function we would like the emphasis to be placed on getting small boxes more exact. The bounding box width and height are normalized by the image width and height and thus are also bounded between 0 and 1. An L2 loss is applied during training.
This formulation was later revised to introduce the concept of a bounding box prior. Rather than expecting the model to directly produce unique bounding box descriptors for each new image, we will define a collection of bounding boxes with varying aspect ratios which embed some prior information about the shape of objects we're expecting to detect. Redmond offers an approach towards discovering the best aspect ratios by doing k-means clustering (with a custom distance metric) on all of the bounding boxes in your training dataset.
In the image below, you can see a collection of 5 bounding box priors (also known as anchor boxes) for the grid cell highlighted in yellow. With this formulation, each of the $B$ bounding boxes explicitly specialize in detecting objects of a specific size and aspect ratio.
Note: Although it is not visualized, these anchor boxes are present for each cell in our prediction grid.
Rather than directly predicting the bounding box dimensions, we'll reformulate our task in order to simply predict the offset from our bounding box prior dimensions such that we can fine-tune our predicted bounding box dimensions. This reformulation makes the prediction task easier to learn.
For similar reasons as originally predicting the square-root width and height, we'll define our task to predict the log offsets from our bounding box prior.
In the first version of the model, the "objectness" score $p_{obj}$ was trained to approximate the Intersection over Union (IoU) between the predicted box and the ground truth label. When we calculate our loss during training, we'll match objects to whichever bounding box prediction (on the same grid cell) has the highest IoU score. For unmatched boxes, the only descriptor which we'll include in our loss function is $p_{obj}$.
After the addition bounding box priors in YOLOv2, we can simply assign labeled objects to whichever anchor box (on the same grid cell) has the highest IoU score with the labeled object.
In the third version, Redmond redefined the "objectness" target score $p_{obj}$ to be 1 for the bounding boxes with highest IoU score for each given target, and 0 for all remaining boxes. However, we will not include bounding boxes which have a high IoU score (above some threshold) but not the highest score when calculating the loss. In simple terms, it doesn't make sense to punish a good prediction just because it isn't the best prediction.
Originally, class prediction was performed at the grid cell level. This means that a single grid cell could not predict multiple bounding boxes of different classes. This was later revised to predict class for each bounding box using a softmax activation across classes and a cross entropy loss.
Redmond later changed the class prediction to use sigmoid activations for multi-label classification as he found a softmax is not necessary for good performance. This choice will depend on your dataset and whether or not your labels overlap (eg. "golden retriever" and "dog").
The first YOLO model simply predicts the $N \times N \times B$ bounding boxes using the output of our backbone network.
In YOLOv2, Redmond adds a weird skip connection splitting a higher resolution feature map across multiple channels as visualized below.
The weird "skip connection from higher resolution feature maps" idea that I don't like.
Fortunately, this was changed in the third iteration for a more standard feature pyramid network output structure. With this method, we'll alternate between outputting a prediction and upsampling the feature maps (with skip connections). This allows for predictions that can take advantage of finer-grained information from earlier in the network, which helps for detecting small objects in the image.
The SSD model was also published (by Wei Liu et al.) in 2015, shortly after the YOLO model, and was also later refined in a subsequent paper. In each section, I'll discuss the specific implementation details for this model.
A VGG-16 model, pre-trained on ImageNet for image classification, is used as the backbone network. The authors make a few slight tweaks when adapting the model for the detection task, including: replacing fully connected layers with convolutional implementations, removing dropout layers, and replacing the last max pooling layer with a dilated convolution.
Rather than using k-means clustering to discover aspect ratios, the SSD model manually defines a collection of aspect ratios (eg. {1, 2, 3, 1/2, 1/3}) to use for the $B$ bounding boxes at each grid cell location.
For each bounding box, we'll predict the offsets from the anchor box for both the bounding box coordinates ($x$ and $y$) and dimensions (width and height). We'll use ReLU activations trained with a Smooth L1 loss.
One major distinction between YOLO and SSD is that SSD does not attempt to predict a value for $p_{obj}$. Whereas the YOLO model predicted the probability of an object and then predicted the probability of each class given that there was an object present, the SSD model attempts to directly predict the probability that a class is present in a given bounding box.
When calculating the loss, we'll match each ground truth box to the anchor box with the highest IoU — defining this box with being "responsible" for making the prediction. However, we'll also match the ground truth boxes with any other anchor boxes with an IoU above some defined threshold (0.5) in the same light of not punishing good predictions simply because they weren't the best. We can always rely on non-max suppression at inference time to filter out redundant predictions.
As I mentioned previously, the class predictions for SSD bounding boxes are not conditioned on the fact that an object is present. Thus, we directly predict the probability of each class using a softmax activation and cross entropy loss. Because we don't explicitly predict $p_{obj}$, it's important to have a class for "background" so that we can predict when no object is present.
Due to the fact that most of the boxes will belong to the "background" class, we will use a technique known as "hard negative mining" to sample negative (no object) predictions such that there is at most a 3:1 ratio between negative and positive predictions when calculating our loss.
To allow for predictions at multiple scales, the SSD output module progressively downsamples the convolutional feature maps, intermittently producing bounding box predictions (as shown with the arrows from convolutional layers to the predictions box).
As I mentioned earlier, we often end up with a large amount of bounding boxes in which no object is contained due to the nature of our "predictions on a grid" approach. Although we can easily filter these boxes out after making a fixed set of bounding box predictions, there is still a (foreground-background) class imbalance present which can introduce difficulties during training. This is especially difficult for models which don't separate prediction of objectness and class probability into two separate tasks, and instead simply include a "background" class for regions with no objects.
Researchers at Facebook proposed adding a scaling factor to the standard cross entropy loss such that it places more the emphasis on "hard" examples during training, preventing easy negative predictions from dominating the training process.
As the researchers point out, easily classified examples can incur a non-trivial loss for standard cross entropy loss ($\gamma=0$) which, summed over a large collection of samples, can easily dominate the parameter update. The ${\left( {1 - {p_t}} \right)^\gamma }$ term acts as a tunable scaling factor to prevent this from occuring.
As the paper points out, "with $\gamma=2$, an example classified with $p_t = 0.9$ would have 100X lower loss compared with CE and with $p_t = 0.968$ it would have 1000X lower loss."
Below I've listed some common datasets that researchers use when evaluating new object detection models.
Papers
Lectures
Blog posts
Frameworks and GitHub repos
Tools for labeling data
]]>When evaluating a standard machine learning model, we usually classify our predictions into four categories: true positives, false positives, true negatives, and false negatives. However, for the dense prediction task of image segmentation, it's not immediately clear what counts as a "true positive" and, more generally, how we
]]>When evaluating a standard machine learning model, we usually classify our predictions into four categories: true positives, false positives, true negatives, and false negatives. However, for the dense prediction task of image segmentation, it's not immediately clear what counts as a "true positive" and, more generally, how we can evaluate our predictions. In this post, I'll discuss common methods for evaluating both semantic and instance segmentation techniques.
Recall that the task of semantic segmentation is simply to predict the class of each pixel in an image.
Our prediction output shape matches the input's spatial resolution (width and height) with a channel depth equivalent to the number of possible classes to be predicted. Each channel consists of a binary mask which labels areas where a specific class is present.
The Intersection over Union (IoU) metric, also referred to as the Jaccard index, is essentially a method to quantify the percent overlap between the target mask and our prediction output. This metric is closely related to the Dice coefficient which is often used as a loss function during training.
Quite simply, the IoU metric measures the number of pixels common between the target and prediction masks divided by the total number of pixels present across both masks.
$$ IoU = \frac{{target \cap prediction}}{{target \cup prediction}} $$
As a visual example, let's suppose we're tasked with calculating the IoU score of the following prediction, given the ground truth labeled mask.
The intersection ($A \cap B$) is comprised of the pixels found in both the prediction mask and the ground truth mask, whereas the union ($A \cup B$) is simply comprised of all pixels found in either the prediction or target mask.
We can calculate this easily using Numpy.
intersection = np.logical_and(target, prediction)
union = np.logical_or(target, prediction)
iou_score = np.sum(intersection) / np.sum(union)
The IoU score is calculated for each class separately and then averaged over all classes to provide a global, mean IoU score of our semantic segmentation prediction.
An alternative metric to evaluate a semantic segmentation is to simply report the percent of pixels in the image which were correctly classified. The pixel accuracy is commonly reported for each class separately as well as globally across all classes.
When considering the per-class pixel accuracy we're essentially evaluating a binary mask; a true positive represents a pixel that is correctly predicted to belong to the given class (according to the target mask) whereas a true negative represents a pixel that is correctly identified as not belonging to the given class.
$$ accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}} $$
This metric can sometimes provide misleading results when the class representation is small within the image, as the measure will be biased in mainly reporting how well you identify negative case (ie. where the class is not present).
Instance segmentation models are a little more complicated to evaluate; whereas semantic segmentation models output a single segmentation mask, instance segmentation models produce a collection of local segmentation masks describing each object detected in the image. As such, evaluation methods for instance segmentation are quite similar to that of object detection, with the exception that we now calculate IoU of masks instead of bounding boxes.
To evaluate our collection of predicted masks, we'll compare each of our predicted masks with each of the available target masks for a given input.
A true positive is observed when a prediction-target mask pair has an IoU score which exceeds some predefined threshold.
A false positive indicates a predicted object mask had no associated ground truth object mask.
A false negative indicates a ground truth object mask had no associated predicted object mask.
Precision effectively describes the purity of our positive detections relative to the ground truth. Of all of the objects that we predicted in a given image, how many of those objects actually had a matching ground truth annotation?
$$ Precision = \frac{TP}{TP + FP} $$
Recall effectively describes the completeness of our positive predictions relative to the ground truth. Of all of the objected annotated in our ground truth, how many did we capture as positive predictions?
$$ Recall = \frac{TP}{TP + FN} $$
However, in order to calculate the prediction and recall of a model output, we'll need to define what constitutes a positive detection. To do this, we'll calculate the IoU score between each (prediction, target) mask pair and then determine which mask pairs have an IoU score exceeding a defined threshold value.
However, computing a single precision and recall score at the specified IoU threshold does not adequately describe the behavior of our model's full precision-recall curve. Instead, we can use average precision to effectively integrate the area under a precision-recall curve.
Let's use the precision-recall curve below as an example.
First, we'll adjust our curve such that the precision at a given point $r$ is adjusted to the maximum precision for recall greater than $r$.
Then, we'll simply calculate the area under the curve by numerical integration. This method replaces an older approach of averaging over a range of recall values.
Note that the precision-recall curve will likely not extend out to perfect recall due to our prediction thresholding according to each mask IoU.
As an example, the Microsoft COCO challenge's primary metric for the detection task evaluates the average precision score using IoU thresholds ranging from 0.5 to 0.95 (in 0.05 increments).
For prediction problems with multiple classes of objects, this value is then averaged over all of the classes.
]]>In this post, I'll discuss how to use convolutional neural networks for the task of semantic image segmentation. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.
]]>"What's in this image, and where in the image is
In this post, I'll discuss how to use convolutional neural networks for the task of semantic image segmentation. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.
"What's in this image, and where in the image is it located?"
Jump to:
More specifically, the goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.
An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image. (Source)
One important thing to note is that we're not separating instances of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as instance segmentation models, which do distinguish between separate objects of the same class.
Segmentation models are useful for a variety of tasks, including:
A real-time segmented road scene for autonomous driving. (Source)
A chest x-ray with the heart (red), lungs (green), and clavicles (blue) are segmented. (Source)
Simply, our goal is to take either a RGB color image ($height \times width \times 3$) or a grayscale image ($height \times width \times 1$) and output a segmentation map where each pixel contains a class label represented as an integer ($height \times width \times 1$).
Note: For visual clarity, I've labeled a low-resolution prediction map. In reality, the segmentation label resolution should match the original input's resolution.
Similar to how we treat standard categorical values, we'll create our target by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.
A prediction can be collapsed into a segmentation map (as shown in the first image) by taking the argmax
of each depth-wise pixel vector.
We can easily inspect a target by overlaying it onto the observation.
When we overlay a single channel of our target (or prediction), we refer to this as a mask which illuminates the regions of an image where a specific class is present.
A naive approach towards constructing a neural network architecture for this task is to simply stack a number of convolutional layers (with same
padding to preserve dimensions) and output a final segmentation map. This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.
Recall that for deep convolutional networks, earlier layers tend to learn low-level concepts while later layers develop more high-level (and specialized) feature mappings. In order to maintain expressiveness, we typically need to increase the number of feature maps (channels) as we get deeper in the network.
This didn't necessarily pose a problem for the task of image classification, because for that task we only care about what the image contains (and not where it is located). Thus, we could alleviate computational burden by periodically downsampling our feature maps through pooling or strided convolutions (ie. compressing the spatial resolution) without concern. However, for image segmentation, we would like our model to produce a full-resolution semantic prediction.
One popular approach for image segmentation models is to follow an encoder/decoder structure where we downsample the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and the upsample the feature representations into a full-resolution segmentation map.
There are a few different approaches that we can use to upsample the resolution of a feature map. Whereas pooling operations downsample the resolution by summarizing a local area with a single value (ie. average or max pooling), "unpooling" operations upsample the resolution by distributing a single value into a higher resolution.
However, transpose convolutions are by far the most popular approach as they allow for us to develop a learned upsampling.
Whereas a typical convolution operation will take the dot product of the values currently in the filter's view and produce a single value for the corresponding output position, a transpose convolution essentially does the opposite. For a transpose convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.
A simplified 1D example of upsampling through a transpose operation. (Source)
For filter sizes which produce an overlap in the output feature map (eg. 3x3 filter with stride 2 - as shown in the below example), the overlapping values are simply added together. Unfortunately, this tends to produce a checkerboard artifact in the output and is undesirable, so it's best to ensure that your filter size does not produce an overlap.
Input in blue, output in green. (Source)
The approach of using a "fully convolutional" network trained end-to-end, pixels-to-pixels for the task of image segmentation was introduced by Long et al. in late 2014. The paper's authors propose adapting existing, well-studied image classification networks (eg. AlexNet) to serve as the encoder module of the network, appending a decoder module with transpose convolutional layers to upsample the coarse feature maps into a full-resolution segmentation map.
Image credit (with modification)
The full network, as shown below, is trained according to a pixel-wise cross entropy loss.
However, because the encoder module reduces the resolution of the input by a factor of 32, the decoder module struggles to produce fine-grained segmentations (as shown below).
The paper's authors comment eloquently on this struggle:
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where... Combining fine layers and coarse layers lets the model make local predictions that respect global structure. ― Long et al.
The authors address this tension by slowly upsampling (in stages) the encoded representation, adding "skip connections" from earlier layers, and summing these two feature maps.
Image credit (with modification)
These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grain detail with the addition of these skip connections.
Ronneberger et al. improve upon the "fully convolutional" architecture primarily through expanding the capacity of the decoder module of the network. More concretely, they propose the U-Net architecture which "consists of a contracting path to capture context and a symmetric expanding path that enables precise localization." This simpler architecture has grown to be very popular and has been adapted for a variety of segmentation problems.
Note: The original architecture introduces a decrease in resolution due to the use of valid
padding. However, some practitioners opt to use same
padding where the padding values are obtained by image reflection at the border.
Whereas Long et al. (FCN paper) reported that data augmentation ("randomly mirroring and “jittering” the images by translating them up to 32 pixels") did not result in a noticeable improvement in performance, Ronneberger et al. (U-Net paper) credit data augmentations ("random elastic deformations of the training samples") as a key concept for learning. It appears as if the usefulness (and type) of data augmentation depends on the problem domain.
The standard U-Net model consists of a series of convolution operations for each "block" in the architecture. As I discussed in my post on common convolutional network architectures, there exist a number of more advanced "blocks" that can be substituted in for stacked convolutional layers.
Drozdzal et al. swap out the basic stacked convolution blocks in favor of residual blocks. This residual block introduces short skip connections (within the block) alongside the existing long skip connections (between the corresponding feature maps of encoder and decoder modules) found in the standard U-Net structure. They report that the short skip connections allow for faster convergence when training and allow for deeper models to be trained.
Expanding on this, Jegou et al. proposed the use of dense blocks, still following a U-Net structure, arguing that the "characteristics of DenseNets make them a very good fit for semantic segmentation as they naturally induce skip connections and multi-scale supervision." These dense blocks are useful as they carry low level features from previous layers directly alongside higher level features from more recent layers, allowing for highly efficient feature reuse.
Image credit (with modification)
One very important aspect of this architecture is the fact that the upsampling path does not have a skip connection between the input and output of a dense block. The authors note that because the "upsampling path increases the feature maps spatial resolution, the linear growth in the number of features would be too memory demanding." Thus, only the output of a dense block is passed along in the decoder module.
The FC-DenseNet103 model acheives state of the art results (Oct 2017) on the CamVid dataset.
One benefit of downsampling a feature map is that it broadens the receptive field (with respect to the input) for the following filter, given a constant filter size. Recall that this approach is more desirable than increasing the filter size due to the parameter inefficiency of large filters (discussed here in Section 3.1). However, this broader context comes at the cost of reduced spatial resolution.
Dilated convolutions provide alternative approach towards gaining a wide field of view while preserving the full spatial dimension. As shown in the figure below, the values used for a dilated convolution are spaced apart according to some specified dilation rate.
Some architectures swap out the last few pooling layers for dilated convolutions with successively higher dilation rates to maintain the same field of view while preventing loss of spatial detail. However, it is often still too computationally expensive to completely replace pooling layers with dilated convolutions.
The most commonly used loss function for the task of image segmentation is a pixel-wise cross entropy loss. This loss examines each pixel individually, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.
Because the cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we're essentially asserting equal learning to each pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. Long et al. (FCN paper) discuss weighting this loss for each output channel in order to counteract a class imbalance present in the dataset.
Meanwhile, Ronneberger et al. (U-Net paper) discuss a loss weighting scheme for each pixel such that there is a higher weight at the border of segmented objects. This loss weighting scheme helped their U-Net model segment cells in biomedical images in a discontinuous fashion such that individual cells may be easily identified within the binary segmentation map.
Notice how the binary segmentation map produces clear borders around the cells. (Source)
Another popular loss function for image segmentation tasks is based on the Dice coefficient, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:
$$ Dice = \frac{{2\left| {A \cap B} \right|}}{{\left| A \right| + \left| B \right|}} $$
where ${\left| {A \cap B} \right|}$ represents the common elements between sets A and B, and $\left| A \right|$ represents the number of elements in set A (and likewise for set B).
For the case of evaluating a Dice coefficient on predicted segmentation masks, we can approximate ${\left| {A \cap B} \right|}$ as the element-wise multiplication between the prediction and target mask, and then sum the resulting matrix.
Because our target mask is binary, we effectively zero-out any pixels from our prediction which are not "activated" in the target mask. For the remaining pixels, we are essentially penalizing low-confidence predictions; a higher value for this expression, which is in the numerator, leads to a better Dice coefficient.
In order to quantify $\left| A \right|$ and $\left| B \right|$, some researchers use the simple sum whereas other researchers prefer to use the squared sum for this calculation. I don't have the practical experience to know which performs better empirically over a wide range of tasks, so I'll leave you to try them both and see which works better.
In case you were wondering, there's a 2 in the numerator in calculating the Dice coefficient because our denominator "double counts" the common elements between the two sets. In order to formulate a loss function which can be minimized, we'll simply use $1 - Dice$. This loss function is known as the soft Dice loss because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask.
With respect to the neural network output, the numerator is concerned with the common activations between our prediction and target mask, where as the denominator is concerned with the quantity of activations in each mask separately. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.
A soft Dice loss is calculated for each class separately and then averaged to yield a final score. An example implementation is provided below.
Below, I've listed a number of common datasets that researchers use to train new models and benchmark against the state of the art. You can also explore previous Kaggle competitions and read about how winning solutions implemented segmentation models for their given task.
Datasets
Past Kaggle Competitions
Papers
Lectures
Blog posts
Image labeling tools
Useful Github repos
]]>In Q4 of 2017, I made the decision to walk down the entrepreneurial path and dedicate a full-time effort towards launching a startup venture. I secured a healthy seed round of funding from a local angel investor and recruited three of my peers to join me in this effort. Five
]]>In Q4 of 2017, I made the decision to walk down the entrepreneurial path and dedicate a full-time effort towards launching a startup venture. I secured a healthy seed round of funding from a local angel investor and recruited three of my peers to join me in this effort. Five months later, we decided to put the project on an indefinite pause. This blog post discusses what I set out to accomplish, what happened, and what I've learned from the experience.
The venture was founded on a simple premise: we could use machine learning to forecast short-term fluctuations in market prices based on volume imbalances in order books.
Specifically, we were focused on the cryptoasset markets as these exchanges offered free, real-time data feeds to the order books and fulfilled trades.
An order book essentially maintains current unfulfilled orders for a given asset. The exchanges reference these order books in order to match interested buyers and sellers. If you're interested in trading something, you would publish your intentions to this order book which would be executed according to the type of order placed. New orders which can be matched with an existing order on the book are executed immediately and remove that volume from the order book.
To gain an understanding for how this order book works, you can reference the following cartoon.
Because the order book contains a collection of orders waiting to be matched, it essentially displays the schedule for supply and demand as a function of price. That is, you can see how many people have stated that they are willing to buy/sell when the asset moves to a given price. The market price changes when the volume at the "best available price" level is exhausted.
As I watched how the order book and market prices evolved over a period of time, I began to notice predictable patterns emerge.
Feel free to check out a live feed of the BTC-USD order book and see if you can spot any of the patterns mentioned alongside the corresponding price action.
It's also worth noting that cryptoassets are a unique asset class given that the underlying assets being traded are publicly inspectable in real-time. For example, you can listen in on the Bitcoin network and capture statistics such as the number of participants in the network, the total value being transacted on the network, and other useful metrics. Compare this to a US public equity which only releases internal information about the company in their quarterly reports. Given this radical openness and transparency, the asset class seems uniquely posed for quantitative, data-driven trading.
After securing funding and recruiting a team, we initially budgeted 2 to 3 months to build an initial proof of concept and further evaluate the feasibility of the idea.
As we set out to build the proof of concept, the initial days consisted of architecting and clarifying the exact needs and requirements of our technical infrastructure. We discussed various machine learning models and approaches, and specified the desired characteristics of our models. Because the cryptoasset market is rapidly evolving, we wanted models which supported continuous "online" learning. We also posited that this would help safeguard our models from being manipulated by bad actors who could "spoof" the order book data; a model capable of continuous learning could evolve to not be susceptible to repeated manipulation attempts.
As the weeks passed, we unknowingly succumbed to scope creep as our designs for a minimum viable product inched closer toward designs for a robust scalable product. I've realized the importance of building forcing functions into projects to specifically protect against scope creep.
Looking back, it's now clear that we had operated under the naive assumption that our fundamental hypothesis was correct (after all, it seemed like it was the right idea) which led us to focus our time addressing auxiliary problems such as scalability and performance optimizations. If we had instead focused all of our energy on testing our riskiest (fundamental) assumption, we would have discovered its flaws much more quickly.
After spending a fair amount of time designing and iterating on our data infrastructure, we settled down to perform some more rigorous analysis of the data being collected by the real-time market feeds.
Here's what we learned:
For machine learning models, datasets that are inherently more difficult to learn from see an amplification in the learning challenge when a class imbalance is introduced.
The initial hypothesis was that we could use machine learning to forecast short-term price fluctuations based on the order book data. After validating the idea, we concluded that the idea was not likely feasible, at least not in the continuous, automated fashion we had imagined with machine learning.
I had started working on this problem as a side project as I was looking to apply what I have been learning about (machine learning) to address a messy, real world problem. The problem was attractive as it appeared to offer a wealth of data which was freely accessible - however, I did not accurately gauge the challenge of learning from this firehose of market data.
When we discovered the flaws in our fundamental assumption, we had two main options: pivot or halt the project. After much deliberation, we eventually decided to halt this project. We each had our own reasons for reaching this conclusion, I'll only discuss my personal reasons.
At this stage in life, I’m focused on optimizing for learning and continuing to invest in myself. I’ve realized that your environment is a critical factor in this growth, so I’ve decided to embed myself in an environment where I can work alongside more senior engineers and gain experience working with teams who have a history of building machine learning products.
I'm 22 years old, I have a long career ahead of me. My insatiable desire to build things has cultivated an interest in both engineering and entrepreneurship, and I'm sure I'll walk down the entrepreneurial path once again later in my career. However, my current focus is on mastering machine learning and gaining practical work experience; I feel as if this focus will be beneficial for the long-term trajectory of my career.
"The average age of founders of the most successful startups—those with growth in the top 1% of their industry—was 45." Source
I learned a lot my from experience exploring this venture and I'm grateful for the challenges that I encountered. Here's to the next adventure.
]]>In this post, I'll discuss commonly used architectures for convolutional networks. As you'll see, almost all CNN architectures follow the same general design principles of successively applying convolutional layers to the input, periodically downsampling the spatial dimensions while increasing the number of feature maps.
While the classic network architectures were
]]>In this post, I'll discuss commonly used architectures for convolutional networks. As you'll see, almost all CNN architectures follow the same general design principles of successively applying convolutional layers to the input, periodically downsampling the spatial dimensions while increasing the number of feature maps.
While the classic network architectures were comprised simply of stacked convolutional layers, modern architectures explore new and innovative ways for constructing convolutional layers in a way which allows for more efficient learning. Almost all of these architectures are based on a repeatable unit which is used throughout the network.
These architectures serve as general design guidelines which machine learning practitioners will then adapt to solve various computer vision tasks. These architectures serve as rich feature extractors which can be used for image classification, object detection, image segmentation, and many other more advanced tasks.
Classic network architectures (included for historical purposes)
Modern network architectures
Yann Lecun's LeNet-5 model was developed in 1998 to identify handwritten digits for zip code recognition in the postal service. This pioneering model largely introduced the convolutional neural network as we know it today.
Architecture
Convolutional layers use a subset of the previous layer's channels for each filter to reduce computation and force a break of symmetry in the network. The subsampling layers use a form of average pooling.
Parameters: 60,000
Paper: Gradient-based learning applied to document recognition
AlexNet was developed by Alex Krizhevsky et al. in 2012 to compete in the ImageNet competition. The general architecture is quite similar to LeNet-5, although this model is considerably larger. The success of this model (which took first place in the 2012 ImageNet competition) convinced a lot of the computer vision community to take a serious look at deep learning for computer vision tasks.
Architecture
Parameters: 60 million
Paper: ImageNet Classification with Deep Convolutional Neural Networks
The VGG network, introduced in 2014, offers a deeper yet simpler variant of the convolutional structures discussed above. At the time of its introduction, this model was considered to be very deep.
Architecture
Parameters: 138 million
Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
In 2014, researchers at Google introduced the Inception network which took first place in the 2014 ImageNet competition for classification and detection challenges.
The model is comprised of a basic unit referred to as an "Inception cell" in which we perform a series of convolutions at different scales and subsequently aggregate the results. In order to save computation, 1x1 convolutions are used to reduce the input channel depth. For each cell, we learn a set of 1x1, 3x3, and 5x5 filters which can learn to extract features at different scales from the input. Max pooling is also used, albeit with "same" padding to preserve the dimensions so that the output can be properly concatenated.
These researchers published a follow-up paper which introduced more efficient alternatives to the original Inception cell. Convolutions with large spatial filters (such as 5x5 or 7x7) are beneficial in terms of their expressiveness and ability to extract features at a larger scale, but the computation is disproportionately expensive. The researchers pointed out that a 5x5 convolution can be more cheaply represented by two stacked 3x3 filters.
Whereas a $5 \times 5 \times c$ filter requires $25c$ parameters, two $3 \times 3 \times c$ filters only require $18c$ parameters. In order to most accurately represent a 5x5 filter, we shouldn't use any nonlinear activations between the two 3x3 layers. However, it was discovered that "linear activation was always inferior to using rectified linear units in all stages of the factorization."
It was also shown that 3x3 convolutions could be further deconstructed into successive 3x1 and 1x3 convolutions.
Generalizing this insight, we can more efficiently compute an $n \times n$ convolution as a $1 \times n$ convolution followed by a $n \times 1$ convolution.
Architecture
In order to improve overall network performance, two auxiliary outputs are added throughout the network. It was later discovered that the earliest auxiliary output had no discernible effect on the final quality of the network. The addition of auxiliary outputs primarily benefited the end performance of the model, converging at a slightly better value than the same network architecture without an auxiliary branch. It is believed the addition of auxiliary outputs had a regularizing effect on the network.
A revised, deeper version of the Inception network which takes advantage of the more efficient Inception cells is shown below.
Parameters: 5 million (V1) and 23 million (V3)
Papers:
Deep residual networks were a breakthrough idea which enabled the development of much deeper networks (hundreds of layers as opposed to tens of layers).
Its a generally accepted principle that deeper networks are capable of learning more complex functions and representations of the input which should lead to better performance. However, many researchers observed that adding more layers eventually had a negative effect on the final performance. This behavior was not intuitively expected, as explained by the authors below.
Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).
This phenomenon is referred to by the authors as the degradation problem - alluding to the fact that although better parameter initialization techniques and batch normalization allow for deeper networks to converge, they often converge at a higher error rate than their shallower counterparts. In the limit, simply stacking more layers degrades the model's ultimate performance.
The authors propose a remedy to this degradation problem by introducing residual blocks in which intermediate layers of a block learn a residual function with reference to the block input. You can think of this residual function as a refinement step in which we learn how to adjust the input feature map for higher quality features. This compares with a "plain" network in which each layer is expected to learn new and distinct feature maps. In the event that no refinement is needed, the intermediate layers can learn to gradually adjust their weights toward zero such that the residual block represents an identity function.
Note: It was later discovered that a slight modification to the original proposed unit offers better performance by more efficiently allowing gradients to propagate through the network during training.
Wide residual networks
Although the original ResNet paper focused on creating a network architecture to enable deeper structures by alleviating the degradation problem, other researchers have since pointed out that increasing the network's width (channel depth) can be a more efficient way of expanding the overall capacity of the network.
Architecture
Each colored block of layers represent a series of convolutions of the same dimension. The feature mapping is periodically downsampled by strided convolution accompanied by an increase in channel depth to preserve the time complexity per layer. Dotted lines denote residual connections in which we project the input via a 1x1 convolution to match the dimensions of the new block.
The diagram above visualizes the ResNet 34 architecture. For the ResNet 50 model, we simply replace each two layer residual block with a three layer bottleneck block which uses 1x1 convolutions to reduce and subsequently restore the channel depth, allowing for a reduced computational load when calculating the 3x3 convolution.
Parameters: 25 million (ResNet 50)
Papers:
The ResNeXt architecture is an extension of the deep residual network which replaces the standard residual block with one that leverages a "split-transform-merge" strategy (ie. branched paths within a cell) used in the Inception models. Simply, rather than performing convolutions over the full input feature map, the block's input is projected into a series of lower (channel) dimensional representations of which we separately apply a few convolutional filters before merging the results.
This idea is quite similar to group convolutions, which was an idea proposed in the AlexNet paper as a way to share the convolution computation across two GPUs. Rather than creating filters with the full channel depth of the input, the input is split channel-wise into groups with each as shown below.
It was discovered that using grouped convolutions led to a degree of specialization among groups where separate groups focused on different characteristics of the input image.
The ResNeXt paper refers to the number of branches or groups as the cardinality of the ResNeXt cell and performs a series of experiments to understand relative performance gains between increasing the cardinality, depth, and width of the network. The experiments show that increasing cardinality is more effective at benefiting model performance than increasing the width or depth of the network. The experiments also suggest that "residual connections are helpful for optimization, whereas aggregated transformations are (helpful for) stronger representations."
Architecture
The ResNeXt architecture simply mimicks the ResNet models, replacing the ResNet blocks for the ResNeXt block.
Paper: Aggregated Residual Transformations for Deep Neural Networks
The idea behind dense convolutional networks is simple: it may be useful to reference feature maps from earlier in the network. Thus, each layer's feature map is concatenated to the input of every successive layer within a dense block. This allows later layers within the network to directly leverage the features from earlier layers, encouraging feature reuse within the network. The authors state, "concatenating feature-maps learned by different layers increases variation in the input of subsequent layers and improves efficiency."
When I first came across this model, I figured that it would have an absurd number of parameters to support the dense connections between layers. However, because the network is capable of directly using any previous feature map, the authors found that they could work with very small output channel depths (ie. 12 filters per layer), vastly reducing the total number of parameters needed. The authors refer to the number of filters used in each convolutional layer as a "growth rate", $k$, since each successive layer will have $k$ more channels than the last (as a result of accumulating and concatenating all previous layers to the input).
When compared with ResNet models, DenseNets are reported to acheive better performance with less complexity.
Architecture
For a majority of the experiments in the paper, the authors mimicked the general ResNet model architecture, simply swapping in the dense block as the repeated unit.
Parameters:
Paper: Densely Connected Convolutional Networks
Video: CVPR 2017 Best Paper Award: Densely Connected Convolutional Networks
An Analysis of Deep Neural Network Models for Practical Applications
- Corresponding blog post
In my introductory post on autoencoders, I discussed various models (undercomplete, sparse, denoising, contractive) which take data as input and discover some latent state representation of that data. More specifically, our input data is converted into an encoding vector where each dimension represents some learned attribute about the data. The
]]>In my introductory post on autoencoders, I discussed various models (undercomplete, sparse, denoising, contractive) which take data as input and discover some latent state representation of that data. More specifically, our input data is converted into an encoding vector where each dimension represents some learned attribute about the data. The most important detail to grasp here is that our encoder network is outputting a single value for each encoding dimension. The decoder network then subsequently takes these values and attempts to recreate the original input.
A variational autoencoder (VAE) provides a probabilistic manner for describing an observation in latent space. Thus, rather than building an encoder which outputs a single value to describe each latent state attribute, we'll formulate our encoder to describe a probability distribution for each latent attribute.
To provide an example, let's suppose we've trained an autoencoder model on a large dataset of faces with a encoding dimension of 6. An ideal autoencoder will learn descriptive attributes of faces such as skin color, whether or not the person is wearing glasses, etc. in an attempt to describe an observation in some compressed representation.
In the example above, we've described the input image in terms of its latent attributes using a single value to describe each attribute. However, we may prefer to represent each latent attribute as a range of possible values. For instance, what single value would you assign for the smile attribute if you feed in a photo of the Mona Lisa? Using a variational autoencoder, we can describe latent attributes in probabilistic terms.
With this approach, we'll now represent each latent attribute for a given input as a probability distribution. When decoding from the latent state, we'll randomly sample from each latent state distribution to generate a vector as input for our decoder model.
Note: For variational autoencoders, the encoder model is sometimes referred to as the recognition model whereas the decoder model is sometimes referred to as the generative model.
By constructing our encoder model to output a range of possible values (a statistical distribution) from which we'll randomly sample to feed into our decoder model, we're essentially enforcing a continuous, smooth latent space representation. For any sampling of the latent distributions, we're expecting our decoder model to be able to accurately reconstruct the input. Thus, values which are nearby to one another in latent space should correspond with very similar reconstructions.
Suppose that there exists some hidden variable $z$ which generates an observation $x$.
We can only see $x$, but we would like to infer the characteristics of $z$. In other words, we’d like to compute $p\left( {z|x} \right)$.
$$ p\left( {z|x} \right) = \frac{{p\left( {x|z} \right)p\left( z \right)}}{{p\left( x \right)}} $$
Unfortunately, computing $p\left( x \right)$ is quite difficult.
$$ p\left( x \right) = \int {p\left( {x|z} \right)p\left( z \right)dz} $$
This usually turns out to be an intractable distribution. However, we can apply varitational inference to estimate this value.
Let's approximate $p\left( {z|x} \right)$ by another distribution $q\left( {z|x} \right)$ which we'll define such that it has a tractable distribution. If we can define the parameters of $q\left( {z|x} \right)$ such that it is very similar to $p\left( {z|x} \right)$, we can use it to perform approximate inference of the intractable distribution.
Recall that the KL divergence is a measure of difference between two probability distributions. Thus, if we wanted to ensure that $q\left( {z|x} \right)$ was similar to $p\left( {z|x} \right)$, we could minimize the KL divergence between the two distributions.
$$ \min KL\left( {q\left( {z|x} \right)||p\left( {z|x} \right)} \right) $$
Dr. Ali Ghodsi goes through a full derivation here, but the result gives us that we can minimize the above expression by maximizing the following:
$$ {E_{q\left( {z|x} \right)}}\log p\left( {x|z} \right) - KL\left( {q\left( {z|x} \right)||p\left( z \right)} \right) $$
The first term represents the reconstruction likelihood and the second term ensures that our learned distribution $q$ is similar to the true prior distribution $p$.
To revisit our graphical model, we can use $q$ to infer the possible hidden variables (ie. latent state) which was used to generate an observation. We can further construct this model into a neural network architecture where the encoder model learns a mapping from $x$ to $z$ and the decoder model learns a mapping from $z$ back to $x$.
Our loss function for this network will consist of two terms, one which penalizes reconstruction error (which can be thought of maximizing the reconstruction likelihood as discussed earlier) and a second term which encourages our learned distribution ${q\left( {z|x} \right)}$ to be similar to the true prior distribution ${p\left( z \right)}$, which we'll assume follows a unit Gaussian distribution, for each dimension $j$ of the latent space.
$$ {\cal L}\left( {x,\hat x} \right) + \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||p\left( z \right)} \right)} $$
In the previous section, I established the statistical motivation for a variational autoencoder structure. In this section, I'll provide the practical implementation details for building such a model yourself.
Rather than directly outputting values for the latent state as we would in a standard autoencoder, the encoder model of a VAE will output parameters describing a distribution for each dimension in the latent space. Since we're assuming that our prior follows a normal distribution, we'll output two vectors describing the mean and variance of the latent state distributions. If we were to build a true multivariate Gaussian model, we'd need to define a covariance matrix describing how each of the dimensions are correlated. However, we'll make a simplifying assumption that our covariance matrix only has nonzero values on the diagonal, allowing us to describe this information in a simple vector.
Our decoder model will then generate a latent vector by sampling from these defined distributions and proceed to develop a reconstruction of the original input.
However, this sampling process requires some extra attention. When training the model, we need to be able to calculate the relationship of each parameter in the network with respect to the final output loss using a technique known as backpropagation. However, we simply cannot do this for a random sampling process. Fortunately, we can leverage a clever idea known as the "reparameterization trick" which suggests that we randomly sample $\varepsilon$ from a unit Gaussian, and then shift the randomly sampled $\varepsilon$ by the latent distribution's mean $\mu$ and scale it by the latent distribution's variance $\sigma$.
With this reparameterization, we can now optimize the parameters of the distribution while still maintaining the ability to randomly sample from that distribution.
Note: In order to deal with the fact that the network may learn negative values for $\sigma$, we'll typically have the network learn $\log \sigma$ and exponentiate this value to get the latent distribution's variance.
To understand the implications of a variational autoencoder model and how it differs from standard autoencoder architectures, it's useful to examine the latent space. This blog post introduces a great discussion on the topic, which I'll summarize in this section.
The main benefit of a variational autoencoder is that we're capable of learning smooth latent state representations of the input data. For standard autoencoders, we simply need to learn an encoding which allows us to reproduce the input. As you can see in the left-most figure, focusing only on reconstruction loss does allow us to separate out the classes (in this case, MNIST digits) which should allow our decoder model the ability to reproduce the original handwritten digit, but there's an uneven distribution of data within the latent space. In other words, there are areas in latent space which don't represent any of our observed data.
Image credit (modified)
On the flip side, if we only focus only on ensuring that the latent distribution is similar to the prior distribution (through our KL divergence loss term), we end up describing every observation using the same unit Gaussian, which we subsequently sample from to describe the latent dimensions visualized. This effectively treats every observation as having the same characteristics; in other words, we've failed to describe the original data.
However, when the two terms are optimized simultaneously, we're encouraged to describe the latent state for an observation with distributions close to the prior but deviating when necessary to describe salient features of the input.
When I'm constructing a variational autoencoder, I like to inspect the latent dimensions for a few samples from the data to see the characteristics of the distribution. I encourage you to do the same.
If we observe that the latent distributions appear to be very tight, we may decide to give higher weight to the KL divergence term with a parameter $\beta>1$, encouraging the network to learn broader distributions. This simple insight has led to the growth of a new class of models - disentangled variational autoencoders. As it turns out, by placing a larger emphasis on the KL divergence term we're also implicitly enforcing that the learned latent dimensions are uncorrelated (through our simplifying assumption of a diagonal covariance matrix).
$$ {\cal L}\left( {x,\hat x} \right) + \beta \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||N\left( {0,1} \right)} \right)} $$
By sampling from the latent space, we can use the decoder network to form a generative model capable of creating new data similar to what was observed during training. Specifically, we'll sample from the prior distribution ${p\left( z \right)}$ which we assumed follows a unit Gaussian distribution.
The figure below visualizes the data generated by the decoder network of a variational autoencoder trained on the MNIST handwritten digits dataset. Here, we've sampled a grid of values from a two-dimensional Gaussian and displayed the output of our decoder network.
As you can see, the distinct digits each exist in different regions of the latent space and smoothly transform from one digit to another. This smooth transformation can be quite useful when you'd like to interpolate between two observations, such as this recent example where Google built a model for interpolating between two music samples.
Lectures
Blogs/videos
Papers/books
Papers on my reading list
]]>Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, we'll design a neural network architecture such that we impose a bottleneck in the network which forces a compressed knowledge representation of the original input. If the input features were each
]]>Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of representation learning. Specifically, we'll design a neural network architecture such that we impose a bottleneck in the network which forces a compressed knowledge representation of the original input. If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (ie. correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network's bottleneck.
As visualized above, we can take an unlabeled dataset and frame it as a supervised learning problem tasked with outputting $\hat x$, a reconstruction of the original input $x$. This network can be trained by minimizing the reconstruction error, ${\cal L}\left( {x,\hat x} \right)$, which measures the differences between our original input and the consequent reconstruction. The bottleneck is a key attribute of our network design; without the presence of an information bottleneck, our network could easily learn to simply memorize the input values by passing these values along through the network (visualized below).
A bottleneck constrains the amount of information that can traverse the full network, forcing a learned compression of the input data.
Note: In fact, if we were to construct a linear network (ie. without the use of nonlinear activation functions at each layer) we would observe a similar dimensionality reduction as observed in PCA. See Geoffrey Hinton's discussion of this here.
The ideal autoencoder model balances the following:
This trade-off forces the model to maintain only the variations in the data required to reconstruct the input without holding on to redundancies within the input. For most cases, this involves constructing a loss function where one term encourages our model to be sensitive to the inputs (ie. reconstruction loss ${\cal L}\left( {x,\hat x} \right)$) and a second term discourages memorization/overfitting (ie. an added regularizer).
$$ {\cal L}\left( {x,\hat x} \right) + regularizer $$
We'll typically add a scaling parameter in front of the regularization term so that we can adjust the trade-off between the two objectives.
In this post, I'll discuss some of the standard autoencoder architectures for imposing these two constraints and tuning the trade-off; in a follow-up post I'll discuss variational autoencoders which builds on the concepts discussed here to provide a more powerful model.
The simplest architecture for constructing an autoencoder is to constrain the number of nodes present in the hidden layer(s) of the network, limiting the amount of information that can flow through the network. By penalizing the network according to the reconstruction error, our model can learn the most important attributes of the input data and how to best reconstruct the original input from an "encoded" state. Ideally, this encoding will learn and describe latent attributes of the input data.
Because neural networks are capable of learning nonlinear relationships, this can be thought of as a more powerful (nonlinear) generalization of PCA. Whereas PCA attempts to discover a lower dimensional hyperplane which describes the original data, autoencoders are capable of learning nonlinear manifolds (a manifold is defined in simple terms as a continuous, non-intersecting surface). The difference between these two approaches is visualized below.
For higher dimensional data, autoencoders are capable of learning a complex representation of the data (manifold) which can be used to describe observations in a lower dimensionality and correspondingly decoded into the original input space.
An undercomplete autoencoder has no explicit regularization term - we simply train our model according to the reconstruction loss. Thus, our only way to ensure that the model isn't memorizing the input data is the ensure that we've sufficiently restricted the number of nodes in the hidden layer(s).
For deep autoencoders, we must also be aware of the capacity of our encoder and decoder models. Even if the "bottleneck layer" is only one hidden node, it's still possible for our model to memorize the training data provided that the encoder and decoder models have sufficient capability to learn some arbitrary function which can map the data to an index.
Given the fact that we'd like our model to discover latent attributes within our data, it's important to ensure that the autoencoder model is not simply learning an efficient way to memorize the training data. Similar to supervised learning problems, we can employ various forms of regularization to the network in order to encourage good generalization properties; these techniques are discussed below.
Sparse autoencoders offer us an alternative method for introducing an information bottleneck without requiring a reduction in the number of nodes at our hidden layers. Rather, we'll construct our loss function such that we penalize activations within a layer. For any given observation, we'll encourage our network to learn an encoding and decoding which only relies on activating a small number of neurons. It's worth noting that this is a different approach towards regularization, as we normally regularize the weights of a network, not the activations.
A generic sparse autoencoder is visualized below where the opacity of a node corresponds with the level of activation. It's important to note that the individual nodes of a trained model which activate are data-dependent, different inputs will result in activations of different nodes through the network.
One result of this fact is that we allow our network to sensitize individual hidden layer nodes toward specific attributes of the input data. Whereas an undercomplete autoencoder will use the entire network for every observation, a sparse autoencoder will be forced to selectively activate regions of the network depending on the input data. As a result, we've limited the network's capacity to memorize the input data without limiting the networks capability to extract features from the data. This allows us to consider the latent state representation and regularization of the network separately, such that we can choose a latent state representation (ie. encoding dimensionality) in accordance with what makes sense given the context of the data while imposing regularization by the sparsity constraint.
There are two main ways by which we can impose this sparsity constraint; both involve measuring the hidden layer activations for each training batch and adding some term to the loss function in order to penalize excessive activations. These terms are:
$$ {\cal L}\left( {x,\hat x} \right) + \lambda \sum\limits_i {\left| {a_i^{\left( h \right)}} \right|} $$
$$ {\cal L}\left( {x,\hat x} \right) + \sum\limits_{j} {KL\left( {\rho ||{{\hat \rho }_ j}} \right)} $$
Note: A Bernoulli distribution is "the probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q=1-p$". This corresponds quite well with establishing the probability a neuron will fire.
The KL divergence between two Bernoulli distributions can be written as $\sum\limits_{j = 1}^{{l^{\left( h \right)}}} {\rho \log \frac{\rho }{{{{\hat \rho }_ j}}}} + \left( {1 - \rho } \right)\log \frac{{1 - \rho }}{{1 - {{\hat \rho }_ j}}}$. This loss term is visualized below for an ideal distribution of $\rho = 0.2$, corresponding with the minimum (zero) penalty at this point.
So far I've discussed the concept of training a neural network where the input and outputs are identical and our model is tasked with reproducing the input as closely as possible while passing through some sort of information bottleneck. Recall that I mentioned we'd like our autoencoder to be sensitive enough to recreate the original observation but insensitive enough to the training data such that the model learns a generalizable encoding and decoding. Another approach towards developing a generalizable model is to slightly corrupt the input data but still maintain the uncorrupted data as our target output.
With this approach, our model isn't able to simply develop a mapping which memorizes the training data because our input and target output are no longer the same. Rather, the model learns a vector field for mapping the input data towards a lower-dimensional manifold (recall from my earlier graphic that a manifold describes the high density region where the input data concentrates); if this manifold accurately describes the natural data, we've effectively "canceled out" the added noise.
The above figure visualizes the vector field described by comparing the reconstruction of $x$ with the original value of $x$. The yellow points represent training examples prior to the addition of noise. As you can see, the model has learned to adjust the corrupted input towards the learned manifold.
It's worth noting that this vector field is typically only well behaved in the regions where the model has observed during training. In areas far away from the natural data distribution, the reconstruction error is both large and does not always point in the direction of the true distribution.
One would expect that for very similar inputs, the learned encoding would also be very similar. We can explicitly train our model in order for this to be the case by requiring that the derivative of the hidden layer activations are small with respect to the input. In other words, for small changes to the input, we should still maintain a very similar encoded state. This is quite similar to a denoising autoencoder in the sense that these small perturbations to the input are essentially considered noise and that we would like our model to be robust against noise. Put in other words (emphasis mine), "denoising autoencoders make the reconstruction function (ie. decoder) resist small but ﬁnite-sized perturbations of the input, while contractive autoencoders make the feature extraction function (ie. encoder) resist infinitesimal perturbations of the input."
Because we're explicitly encouraging our model to learn an encoding in which similar inputs have similar encodings, we're essentially forcing the model to learn how to contract a neighborhood of inputs into a smaller neighborhood of outputs. Notice how the slope (ie. derivative) of the reconstructed data is essentially zero for local neighborhoods of input data.
Image credit (modified)
We can accomplish this by constructing a loss term which penalizes large derivatives of our hidden layer activations with respect to the input training examples, essentially penalizing instances where a small change in the input leads to a large change in the encoding space.
In fancier mathematical terms, we can craft our regularization loss term as the squared Frobenius norm ${\left\lVert A \right\rVert_F}$ of the Jacobian matrix ${\bf{J}}$ for the hidden layer activations with respect to the input observations. A Frobenius norm is essentially an L2 norm for a matrix and the Jacobian matrix simply represents all first-order partial derivatives of a vector-valued function (in this case, we have a vector of training examples).
For $m$ observations and $n$ hidden layer nodes, we can calculate these values as follows.
$${\left\lVert A \right\rVert_F}= \sqrt {\sum\limits_{i = 1}^m {\sum\limits_{j = 1}^n {{{\left| {{a_{ij}}} \right|}^2}} } } $$
$$J=\left[\begin{array}{ccc}\frac{\delta {a}_{1}^{\left(h\right)}\left(x\right)}{\delta {x}_{1}}& \cdots & \frac{\delta {a}_{1}^{\left(h\right)}\left(x\right)}{\delta {x}_{m}}\\ \vdots & \ddots & \vdots \\ \frac{\delta {a}_{n}^{\left(h\right)}\left(x\right)}{\delta {x}_{1}}& \cdots & \frac{\delta {a}_{n}^{\left(h\right)}\left(x\right)}{\delta {x}_{m}}\end{array}\right]$$Written more succinctly, we can define our complete loss function as
$$ {\cal L}\left( {x,\hat x} \right) + \lambda {\sum\limits_i {\left\lVert {{\nabla _ x}a_i^{\left( h \right)}\left( x \right)} \right\rVert} ^2} $$
where ${{\nabla_x}a_i^{\left( h \right)}\left( x \right)}$ defines the gradient field of our hidden layer activations with respect to the input $x$, summed over all $i$ training examples.
An autoencoder is a neural network architecture capable of discovering structure within data in order to develop a compressed representation of the input. Many different variants of the general autoencoder architecture exist with the goal of ensuring that the compressed representation represents meaningful attributes of the original data input; typically the biggest challenge when working with autoencoders is getting your model to actually learn a meaningful and generalizable latent space representation.
Because autoencoders learn how to compress the data based on attributes (ie. correlations between the input feature vector) discovered from data during training, these models are typically only capable of reconstructing data similar to the class of observations of which the model observed during training.
Applications of autoencoders include:
Lectures/notes
Blogs/videos
Papers/books
]]>In previous posts, I've discussed how we can train neural networks using backpropagation with gradient descent. One of the key hyperparameters to set in order to train a neural network is the learning rate for gradient descent. As a reminder, this parameter scales the magnitude of our weight updates in
]]>In previous posts, I've discussed how we can train neural networks using backpropagation with gradient descent. One of the key hyperparameters to set in order to train a neural network is the learning rate for gradient descent. As a reminder, this parameter scales the magnitude of our weight updates in order to minimize the network's loss function.
If your learning rate is set too low, training will progress very slowly as you are making very tiny updates to the weights in your network. However, if your learning rate is set too high, it can cause undesirable divergent behavior in your loss function. I'll visualize these cases below - if you find these visuals hard to interpret, I'd recommend reading (at least) the first section in my post on gradient descent.
So how do we find the optimal learning rate?
3e-4 is the best learning rate for Adam, hands down.
— Andrej Karpathy (@karpathy) November 24, 2016
Perfect! I guess my job here is done.
Well... not quite.
(i just wanted to make sure that people understand that this is a joke...)
— Andrej Karpathy (@karpathy) November 24, 2016
(Humor yourself by reading through that thread after finishing this post.)
The loss landscape of a neural network (visualized below) is a function of the network's parameter values quantifying the "error" associated with using a specific configuration of parameter values when performing inference (prediction) on a given dataset. This loss landscape can look quite different, even for very similar network architectures. The images below are from a paper, Visualizing the Loss Landscape of Neural Nets, which shows how residual connections in a network can yield an smoother loss topology.
The optimal learning rate will be dependent on the topology of your loss landscape, which is in turn dependent on both your model architecture and your dataset. While using a default learning rate (ie. the defaults set by your deep learning library) may provide decent results, you can often improve the performance or speed up training by searching for an optimal learning rate. I hope you'll see in the next section that this is quite an easy task.
Ultimately, we'd like a learning rate which results is a steep decrease in the network's loss. We can observe this by performing a simple experiment where we gradually increase the learning rate after each mini batch, recording the loss at each increment. This gradual increase can be on either a linear or exponential scale.
For learning rates which are too low, the loss may decrease, but at a very shallow rate. When entering the optimal learning rate zone, you'll observe a quick drop in the loss function. Increasing the learning rate further will cause an increase in the loss as the parameter updates cause the loss to "bounce around" and even diverge from the minima. Remember, the best learning rate is associated with the steepest drop in loss, so we're mainly interested in analyzing the slope of the plot.
You should set the range of your learning rate bounds for this experiment such that you observe all three phases, making the optimal range trivial to identify.
This technique was proposed by Leslie Smith in Cyclical Learning Rates for Training Neural Networks and evangelized by Jeremy Howard in fast.ai's course.
Another commonly employed technique, known as learning rate annealing, recommends starting with a relatively high learning rate and then gradually lowering the learning rate during training. The intuition behind this approach is that we'd like to traverse quickly from the initial parameters to a range of "good" parameter values but then we'd like a learning rate small enough that we can explore the "deeper, but narrower parts of the loss function" (from Karparthy's CS231n notes). If you're having a hard time picturing what I just mentioned, recall that too high of a learning rate can cause the parameter update to "jump over" the ideal minima and subsequent updates will either result in a continued noisy convergence in the general region of the minima, or in more extreme cases may result in divergence from the minima.
The most popular form of learning rate annealing is a step decay where the learning rate is reduced by some percentage after a set number of training epochs.
More generally, we can establish that it is useful to define a learning rate schedule in which the learning rate is updating during training according to some specified rule.
In the previously mentioned paper, Cyclical Learning Rates for Training Neural Networks, Leslie Smith proposes a cyclical learning rate schedule which varies between two bound values. The main learning rate schedule (visualized below) is a triangular update rule, but he also mentions the use of a triangular update in conjunction with a fixed cyclic decay or an exponential cyclic decay.
Note: At the end of this post, I'll provide the code to implement this learning rate schedule. Thus, if you don't care to understand the mathematical formulation you can skim past this section.
We can write the general schedule as
$$ {\eta_t} = {\eta_{\min }} + \left( {{\eta_{\max }} - {\eta_{\min }}} \right)\left( {\max \left( {0,1 - x} \right)} \right) $$
where $x$ is defined as
$$ x = \left| {\frac{{iterations}}{stepsize} - 2\left( {cycle} \right) + 1} \right| $$
and $cycle$ can be calculated as
$$ cycle = floor\left( {\frac{{1 + iterations}}{{2\left( {stepsize} \right)}}} \right) $$
where $\eta_{\min }$ and $\eta_{\max }$ define the bounds of our learning rate, $iterations$ represents the number of completed mini-batches, $stepsize$ defines one half of a cycle length. As far as I can gather, $1-x$ should always be positive, so it seems the $\max$ operation is not strictly necessary.
In order to grok how this equation works, let's progressively build it with visualizations. For the visuals below, the triangular update for 3 full cycles are shown with a step size of 100 iterations. Remember, one iteration corresponds with one mini-batch of training.
Foremost, we can establish our "progress" during training in terms of half-cycles that we've completed. We measure our progress in terms of half-cycles and not full cycles so that we can acheive symmetry within a cycle (this should become more clear as you continue reading).
Next, we compare our half-cycle progress to the number of half-cycles which will be completed at the end of the current cycle. At the beginning of a cycle, we have two half-cycles yet to be completed. At the end of a cycle, this value reaches zero.
Next, we'll add 1 to this value in order to shift the function to be centered on the y-axis. Now we're showing our progress within a cycle with reference to the half-cycle point.
At this point, we can take the absolute value to acheive a triangular shape within each cycle. This is the value that we assign to $x$.
However, we'd like our learning rate schedule to start at the minumum value, increasing to the maximum value at the middle of a cycle, and then decrease back to the minimum value. We can accomplish this by simply calculating $1-x$.
We now have a value which we can use to modulate the learning rate by adding some fraction of the learning rate range to the minimum learning rate (also referred to as the base learning rate).
Smith writes, the main assumption behind the rationale for a cyclical learning rate (as opposed to one which only decreases) is "that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect." Indeed, his paper includes several examples of a loss function evolution which temporarily deviates to higher losses while ultimately converging to a lower loss when compared with a benchmark fixed learning rate.
To gain intuition on why this short-term effect would yield a long-term positive effect, it's important to understand the desirable characteristics of our converged minimum. Ultimately, we'd like our network to learn from the data in a manner which generalizes to unseen data. Further, a network with good generalization properties should be robust in the sense that small changes to the network's parameters don't cause drastic changes to performance. With this in mind, it makes sense that sharp minima lead to poor generalization as small changes to the parameter values may result in a drastically higher loss. By allowing for our learning rate to increase at times, we can "jump out" of sharp minima which would temporarily increase our loss but may ultimately lead to convergence on a more desirable minima.
Note: Although the "flat minima for good generalization" is widely accepted, you can read a good counter-argument here.
Additionally, increasing the learning rate can also allow for "more rapid traversal of saddle point plateaus." As you can see in the image below, the gradients can be very small at a saddle point. Because the parameter updates are a function of the gradient, this results in our optimization taking very small steps; it can be useful to increase the learning rate here to avoid getting stuck for too long at a saddle point.
Image credit (with modification)
Note: A saddle point, by definition, is a critical point in which some dimensions observe a local minimum while other dimensions observe a local maximum. Because neural networks can have thousands or even millions of parameters, it's unlikely that we'll observe a true local minimum across all of these dimensions; saddle points are much more likely to occur. When I referred to "sharp minima", realistically we should picture a saddle point where the minimum dimensions are very steep while the maximum dimensions are very wide (as shown below).
A similar approach cyclic approach is known as stochastic gradient descent with warm restarts where an aggressive annealing schedule is combined with periodic "restarts" to the original starting learning rate.
We can write this schedule as
$$ {\eta_t} = \eta_{\min }^i + \frac{1}{2}\left( {\eta_{\max }^i - \eta_{\min }^i} \right)\left( {1 + \cos \left( {\frac{{{T_{current}}}}{{{T_i}}}} \pi \right) } \right) $$
where ${\eta_t}$ is the learning rate at timestep $t$ (incremented each mini batch), ${\eta_{\max}^i}$ and ${\eta_{\min }^i}$ define the range of desired learning rates, $T_{current}$ represents the number of epochs since the last restart (this value is calculated at every iteration and thus can take on fractional values), and $T_{i}$ defines the number of epochs in an cycle. Let's try to break this equation down.
This annealing schedule relies on the cosine function, which varies between -1 and 1. ${\frac{T_{current}}{T_i}}$ is capable of taking on values between 0 and 1, which is the input of our cosine function. The corresponding region of the cosine function is highlighted below in green. By adding 1, our function varies between 0 and 2, which is then scaled by $\frac{1}{2}$ to now vary between 0 and 1. Thus, we're simply taking the minimum learning rate and adding some fraction of the specified learning rate range (${\eta_{\max }^i - \eta_{\min }^i}$). Because this function starts at 1 and decreases to 0, the result is a learning rate which starts at the maximum of the specified range and decays to the minimum value. Once we reach the end of a cycle, $T_{current}$ resets to 0 and we start back at the maximum learning rate.
The authors note that this learning rate schedule can further be adapted to:
By drastically increasing the learning rate at each restart, we can essentially exit a local minima and continue exploring the loss landscape.
Neat idea: By snapshotting the weights at the end of each cycle, researchers were able to build an ensemble of models at the cost of training a single model. This is because the network "settles" on various local optima from cycle to cycle, as shown in the figure below.
Both finding the optimal range of learning rates and assigning a learning rate schedule can be implemented quite trivially using Keras Callbacks.
We can write a Keras Callback which tracks the loss associated with a learning rate varied linearly over a defined range.
Step Decay
For a simple step decay, we can use the LearningRateScheduler
Callback.
Cyclical Learning Rate
To apply the cyclical learning rate technique, we can reference this repo which has already implemented the technique in the paper. In fact, this repo is cited in the paper's appendix.
Stochastic Gradient Descent with Restarts
Snapshot Ensembles
To apply the "Train 1, get M for free" technique, you can reference this repo.
In this blog post, I'll discuss a number of considerations and techniques for dealing with imbalanced data when training a machine learning model. The blog post will rely heavily on a sklearn contributor package called imbalanced-learn to implement the discussed techniques.
Training a machine learning model on an imbalanced dataset
]]>In this blog post, I'll discuss a number of considerations and techniques for dealing with imbalanced data when training a machine learning model. The blog post will rely heavily on a sklearn contributor package called imbalanced-learn to implement the discussed techniques.
Training a machine learning model on an imbalanced dataset can introduce unique challenges to the learning problem. Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes). For example, suppose you're building a classifier to classify a credit card transaction a fraudulent or authentic - you'll likely have 10,000 authentic transactions for every 1 fraudulent transaction, that's quite an imbalance!
To understand the challenges that a class imbalance imposes, let's consider two common ways we'll train a model: tree-based logical rules developed according to some splitting criterion, and parameterized models updated by gradient descent.
When building a tree-based model (such as a decision tree), our objective is to find logical rules which are capable of taking the full dataset and separating out the observations into their different classes. In other words, we'd like each split in the tree to increase the purity of observations such that the data is filtered into homogeneous groups. If we have a majority class present, the top of the decision tree is likely to learn splits which separate out the majority class into pure groups at the expense of learning rules which separate the minority class.
For a more concrete example, here's a decision tree trained on the wine quality dataset used as an example later on in this post. The field value
represents the number of observations for each class in a given node.
Similarly, if we're updating a parameterized model by gradient descent to minimize our loss function, we'll be spending most of our updates changing the parameter values in the direction which allow for correct classification of the majority class. In other words, many machine learning models are subject to a frequency bias in which they place more emphasis on learning from data observations which occur more commonly.
It's worth noting that not all datasets are affected equally by class imbalance. Generally, for easy classification problems in which there's a clear separation in the data, class imbalance doesn't impede on the model's ability to learn effectively. However, datasets that are inherently more difficult to learn from see an amplification in the learning challenge when a class imbalance is introduced.
When dealing with imbalanced data, standard classification metrics do not adequately represent your models performance. For example, suppose you are building a model which will look at a person's medical records and classify whether or not they are likely to have a rare disease. An accuracy of 99.5% might look great until you realize that it is correctly classifying the 99.5% of healthy people as "disease-free" and incorrectly classifying the 0.5% of people which do have the disease as healthy. I discussed this in my post on evaluating a machine learning model, but I'll provide a discussion here as well regarding useful metrics when dealing with imbalanced data.
Precision is defined as the fraction of relevant examples (true positives) among all of the examples which were predicted to belong in a certain class.
[{\rm{precision}} = \frac{{{\rm{true \hspace{2mm} positives}}}}{{{\rm{true \hspace{2mm} positives + false \hspace{2mm} positives}}}}]
Recall is defined as the fraction of examples which were predicted to belong to a class with respect to all of the examples that truly belong in the class.
[{\rm{recall}} = {\rm{ }}\frac{{{\rm{true \hspace{2mm} positives}}}}{{{\rm{true \hspace{2mm} positives + false \hspace{2mm} negatives}}}}]
The following graphic does a phenomenal job visualizing the difference between precision and recall.
We can further combine these two metrics into a single value by calcuating the f-score as defined below.
[{F _\beta } = \left( {1 + {\beta ^2}} \right)\frac{{{\rm{precision}} \cdot {\rm{recall}}}}{{\left( {{\beta ^2} \cdot {\rm{precision}}} \right) + {\rm{recall}}}}]
The $\beta$ parameter allows us to control the tradeoff of importance between precision and recall. $\beta < 1$ focuses more on precision while $\beta > 1$ focuses more on recall.
Another common tool used to understand a model's performance is a Receiver Operating Characteristics (ROC) curve. An ROC curve visualizes an algorithm's ability to discriminate the positive class from the rest of the data. We'll do this by plotting the True Positive Rate against the False Positive Rate for varying prediction thresholds.
[{\rm{TPR}} = {\rm{ }}\frac{{{\rm{true \hspace{2mm} positives}}}}{{{\rm{true \hspace{2mm} positives + false \hspace{2mm} negatives}}}}]
[{\rm{FPR}} = {\rm{ }}\frac{{{\rm{false \hspace{2mm} positives}}}}{{{\rm{false \hspace{2mm} positives + true \hspace{2mm} negatives}}}}]
For classifiers which only produce factor outcomes (ie. directly output a class), there exists a fixed TPR and FPR for a trained model. However, other classifiers, such as logistic regression, are capable of giving a probabilistic output (ie. the chance that a given observation belongs to the positive class). For these classifiers, we can specify the probability threshold by which above that amount we'll predict the observation belongs to the positive class.
If we set a very low value for this probability threshold, we can increase our True Positive Rate as we'll be more likely to capture all of the positive observations. However, this can also introduce a number of false positive classifications, increasing our False Positive Rate. Intuitively, there exists a tradeoff between maximizing our True Positive Rate and minimizing our False Positive Rate. The ideal model would correctly identify all positive observations as belonging to the positive class (TPR=1) and would not incorrectly classify negative observations as belonging to the positive class (FPR=0).
This tradeoff can be visualized in this demonstration in which you can adjust the class distributions and classification threshold.
The area under the curve (AUC) is a single-value metric for which attempts to summarize an ROC curve to evaluate the quality of a classifier. As the name implies, this metric approximates the area under the ROC curve for a given classifier. Recall that the ideal curve hugs the upper lefthand corner as closely as possible, giving us the ability to identify all true positives while avoiding false positives; this ideal model would have an AUC of 1. On the flipside, if your model was no better than a random guess, your TPR and FPR would increase in parallel to one another, corresponding with an AUC of 0.5.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
preds = model.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, preds, pos_label=1)
auc = roc_auc_score(y_test, preds)
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], color='navy', linestyle='--', label='random')
plt.title(f'AUC: {auc}')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
One of the simplest ways to address the class imbalance is to simply provide a weight for each class which places more emphasis on the minority classes such that the end result is a classifier which can learn equally from all classes.
To calculate the proper weights for each class, you can use the sklearn utility function shown in the example below.
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes, y)
In a tree-based model where you're determining the optimal split according to some measure such as decreased entropy, you can simply scale the entropy component of each class by the corresponding weight such that you place more emphasis on the minority classes. As a reminder, the entropy of a node can be calculated as
$$ \mathop - \sum \limits_i^{} {p_i}{\log}\left( {{p_i}} \right) $$
where $p_i$ is the fraction of data points within class $i$.
In a gradient-based model, you can scale the calculated loss for each observation by the appropriate class weight such that you place more significance on the losses associated with minority classes. As a reminder, a common loss function for classification is the categorical cross entropy (which is very similar to the above equation, albeit with slight differences). This may be calculated as
$$ -\sum_i y_i \log \hat{y}_ i $$
where $y_i$ represents the true class (typically a one-hot encoded vector) and $\hat{y}_ i$ represents the predicted class distribution.
Another approach towards dealing with a class imbalance is to simply alter the dataset to remove such an imbalance. In this section, I'll discuss common techniques for oversampling the minority classes to increase the number of minority observations until we've reached a balanced dataset.
The most naive method of oversampling is to randomly sample the minority classes and simply duplicate the sampled observations. With this technique, it's important to note that you're artificially reducing the variance of the dataset.
However, we can also use our existing dataset to synthetically generate new data points for the minority classes. Synthetic Minority Over-sampling Technique (SMOTE) is a technique that generates new observations by interpolating between observations in the original dataset.
For a given observation $x_i$, a new (synthetic) observation is generated by interpolating between one of the k-nearest neighbors, $x_{zi}$.
$$x_{new} = x_i + \lambda (x_{zi} - x_i)$$
where $\lambda$ is a random number in the range $\left[ {0,1} \right]$. This interpolation will create a sample on the line between $x_{i}$ and $x_{zi}$.
This algorithm has three options for selecting which observations, $x_i$, to use in generating new data points.
regular
: No selection rules, randomly sample all possible $x_i$.borderline
: Separates all possible $x_i$ into three classes using the k nearest neighbors of each point.
svm
: Uses an SVM classifier to identify the support vectors (samples close to the decision boundary) and samples $x_i$ from these points.Adaptive Synthetic (ADASYN) sampling works in a similar manner as SMOTE, however, the number of samples generated for a given $x_i$ is proportional to the number of nearby samples which do not belong to the same class as $x_i$. Thus, ADASYN tends to focus solely on outliers when generating new synthetic training examples.
Rather than oversampling the minority classes, it's also possible to achieve class balance by undersampling the majority class - essentially throwing away data to make it easier to learn characteristics about the minority classes.
As with oversampling, a naive implementation would be to simply sample the majority class at random until reaching a similar number of observations as the minority classes. For example, if your majority class has 1,000 observations and you have a minority class with 20 observations, you would collect your training data for the majority class by randomly sampling 20 observations from the original 1,000. As you might expect, this could potentially result in removing key characteristics of the majority class.
The general idea behind near miss is to only the sample the points from the majority class necessary to distinguish between other classes.
NearMiss-1 select samples from the majority class for which the average distance of the N closest samples of a minority class is smallest.
NearMiss-2 select samples from the majority class for which the average distance of the N farthest samples of a minority class is smallest.
A Tomek’s link is defined as two observations of different classes ($x$ and $y$) such that there is no example $z$ for which:
$$d(x, z) < d(x, y) \text{ or } d(y, z) < d(x, y)$$
where $d()$ is the distance between the two samples. In other words, a Tomek’s link exists if two observations of different classes are the nearest neighbors of each other. In the figure below, a Tomek’s link is illustrated by highlighting the samples of interest in green.
For this undersampling strategy, we'll remove any observations from the majority class for which a Tomek's link is identified. Depending on the dataset, this technique won't actually achieve a balance among the classes - it will simply "clean" the dataset by removing some noisy observations, which may result in an easier classification problem. As I discussed earlier, most classifiers will still perform adequately for imbalanced datasets as long as there's a clear separation between the classifiers. Thus, by focusing on removing noisy examples of the majority class, we can improve the performance of our classifier even if we don't necessarily balance the classes.
EditedNearestNeighbours applies a nearest-neighbors algorithm and “edit” the dataset by removing samples which do not agree “enough” with their neighboorhood. For each sample in the class to be under-sampled, the nearest-neighbours are computed and if the selection criterion is not fulfilled, the sample is removed.
This is a similar approach as Tomek's links in the respect that we're not necessarily focused on actually achieving a class balance, we're simply looking to remove noisy observations in an attempt to make for an easier classification problem.
To demonstrate these various techniques, I've trained a number of models on the UCI Wine Quality dataset where I've generated my target by asserting that observations with a quality rating less than or equal to 4 are "low quality" wine and observations with a quality rating greater than or equal to 5 are "high quality" wine.
As you can see, this has introduced an imbalance between the two classes.
I'll provide the notebook I wrote to explore these techniques in a Github repo if you're interested in exploring this further. I highly encourage you to check out this notebook and perform the same experiment on a different dataset to see how it compares - let me know in the comment section!
In this post, I'll discuss considerations for normalizing your data - with a specific focus on neural networks. In order to understand the concepts discussed, it's important to have an understanding of gradient descent.
As a quick refresher, when training neural networks we'll feed in observations and compare the expected
]]>In this post, I'll discuss considerations for normalizing your data - with a specific focus on neural networks. In order to understand the concepts discussed, it's important to have an understanding of gradient descent.
As a quick refresher, when training neural networks we'll feed in observations and compare the expected output to the true output of the network. We'll then use gradient descent to update the parameters of the model in the direction which will minimize the difference between our expected (or ideal) outcome and the true outcome. In other words, we're attempting to minimize the error we observe in our model's predictions.
The exact manner by which we update our model parameters will depend on the variant of gradient descent optimization techniques we select (stochastic gradient descent, RMSProp, Adam, etc.) but all of these update techniques will scale the magnitude of our parameter update by a learning rate. This rate ensures that we aren't changing the parameter too drastically such that we overshoot our update and fail to find the optimal value.
Ultimately, gradient descent is a search among a loss function surface in an attempt to find the values for each parameter such that the loss function is minimized. In other words, we're looking for the lowest value on the loss function surface. In my post on gradient descent, I discussed a few advanced techniques for efficiently updating our parameter values such that we can avoid getting stuck at saddle points. However, we can also improve the actual topology of our loss function by ensuring all of the parameters exist on the same scale.
Note: Understanding the topology of loss functions, and how network design affects this topology, is a current area of research in the field.
The important thing to remember throughout this discussion is that our loss function surface is characterized by the parameter values in the network. When visualizing this topology, each parameter will represent a dimension of which a range of values will have a resulting affect on the value of our loss function. Unfortunately, this becomes rather tricky to visualize once you extend beyond two parameters (a dimension characterized by each parameter, and the third dimension representing the value of the loss function).
In the above image, we're visualizing the loss function of a model parameterized by two weights (the x and y dimensions) with the z dimension representing the corresponding "error" (loss) of the network.
This 3D visualization is often also represented by a 2D contour plot.
Let's take a second to imagine a scenario in which you have a very simple neural network with two inputs. The first input value, $x_1$, varies from 0 to 1 while the second input value, $x_2$, varies from 0 to 0.01. Since your network is tasked with learning how to combine these inputs through a series of linear combinations and nonlinear activations, the parameters associated with each input will also exist on different scales.
Unfortunately, this can lead toward an awkward loss function topology which places more emphasis on certain parameter gradients.
More discussion on this subject found here.
By normalizing all of our inputs to a standard scale, we're allowing the network to more quickly learn the optimal parameters for each input node.
Additionally, it's useful to ensure that our inputs are roughly in the range of -1 to 1 to avoid weird mathematical artifacts associated with floating point number precision. In short, computers lose accuracy when performing math operations on really large or really small numbers. Moreover, if your inputs and target outputs are on a completely different scale than the typical -1 to 1 range, the default parameters for your neural network (ie. learning rates) will likely be ill-suited for your data.
It's a common practice to scale your data inputs to have zero mean and unit variance. This is known as the standard scaler approach.
However, you may opt for a different normalization strategy. For example, it's common for image data to simply be scaled by 1/255 so that the pixel intensity range is bound by 0 and 1.
Normalizing the input of your network is a well-established technique for improving the convergence properties of a network. A few years ago, a technique known as batch normalization was proposed to extend this improved loss function topology to more of the parameters of the network.
If we were to consider the above network as an example, normalizing our inputs will help ensure that our network can effectively learn the parameters in the first layer.
However, consider the fact that the second layer of our network accepts the activations from our first layer as input. Thus, by extending the intuition established in the previous section, one could posit that normalizing these values will help the network more effectively learn the parameters in the second layer.
By ensuring the activations of each layer are normalized, we can simplify the overall loss function topology. This is especially helpful for the hidden layers of our network, since the distribution of unnormalized activations from previous layers will change as the network evolves and learns more optimal parameters. Thus, by normalizing each layer, we're introducing a level of orthogonality between layers - which generally makes for an easier learning process.
To summarize, we'd like to normalize the activations of a given layer such that we improve learning of the weights which connect the next layer. In practice, people will typically normalize the value of ${z^{\left[ l \right]}}$ rather than ${a^{\left[ l \right]}}$ - although sometimes debated whether we should normalize before or after activation.
Given a vector of linear combinations from the previous layer ${z^{\left[ l \right]}}$ for each observation $i$ in a dataset, we can calculate the mean and variance as:
$$ \mu = \frac{1}{m}\sum\limits_i {z_i^{\left[ l \right]}} $$
$$ {\sigma ^2} = \frac{1}{m}{\sum\limits_i {\left( {z_i^{\left[ l \right]} - \mu } \right)} ^2} $$
Using these values, we can normalize the vectors ${z^{\left[ l \right]}}$ as follows.
[z_{norm}^{\left( i \right)} = \frac{{{z^{\left( i \right)}} - \mu }}{{\sqrt {{\sigma ^2} + \varepsilon } }}]
We add a very small number $\epsilon$ to prevent the chance of a divide by zero error.
However, it may not be the case that we always want to normalize $z$ to have zero mean and unit variance. In fact, this would perform poorly for some activation functions such as the sigmoid function. Thus, we'll allow our normalization scheme to learn the optimal distribution by scaling our normalized values by $\gamma$ and shifting by $\beta$.
[{{\tilde z}^{\left( i \right)}} = \gamma z_{norm}^{\left( i \right)} + \beta ]
In other words, we've now allowed the network to normalize a layer into whichever distribution is most optimal for learning.
One result of batch normalization is that we no longer need a bias vector for a batch normalized layer given that we are already shifting the normalized $z$ values with the $\beta$ parameter.
Note: $\mu$ and ${\sigma ^2}$ are calculated on a per-batch basis while $\gamma$ and $\beta$ are learned parameters used across all batches.
Ultimately, batch normalization allows us to build deeper networks without the need for exponentially longer training times. This is a result of introducing orthogonality between layers such that we avoid shifting distributions in activations as the parameters in earlier layers are updated.
]]>The paper that introduced Batch Norm https://t.co/vkT0LioKHc combines clear intuition with compelling experiments (14x speedup on ImageNet!!)
— David Page (@dcpage3) September 11, 2019
So why has 'internal covariate shift' remained controversial to this day?
Thread 👇 pic.twitter.com/L0BBmo0q4t