<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jeremy Jordan]]></title><description><![CDATA[Thoughts, ideas, and new things I've learned.

]]></description><link>https://www.jeremyjordan.me/</link><image><url>https://www.jeremyjordan.me/favicon.png</url><title>Jeremy Jordan</title><link>https://www.jeremyjordan.me/</link></image><generator>Ghost 3.42</generator><lastBuildDate>Wed, 24 Mar 2021 16:48:09 GMT</lastBuildDate><atom:link href="https://www.jeremyjordan.me/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[A simple solution for monitoring ML systems.]]></title><description><![CDATA[<p>This blog post aims to provide a simple, open-source solution for monitoring ML systems. We'll discuss industry-standard monitoring tools and practices for software systems and how they can be adapted to monitor ML systems.</p><p>To illustrate this, we'll use a scikit-learn model trained on the <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">UCI Wine Quality dataset</a> and</p>]]></description><link>https://www.jeremyjordan.me/ml-monitoring/</link><guid isPermaLink="false">5fef9429c0cccf00395925fc</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Sun, 03 Jan 2021 01:20:13 GMT</pubDate><content:encoded><![CDATA[<p>This blog post aims to provide a simple, open-source solution for monitoring ML systems. We'll discuss industry-standard monitoring tools and practices for software systems and how they can be adapted to monitor ML systems.</p><p>To illustrate this, we'll use a scikit-learn model trained on the <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">UCI Wine Quality dataset</a> and served via FastAPI (see Github repo <a href="https://github.com/jeremyjordan/ml-monitoring">here</a>). We'll collect metrics from the server using Prometheus and visualize the results in a Grafana dashboard. 
All of the services will be deployed on a Kubernetes cluster; if you're not familiar with Kubernetes, feel free to take a quick read through my <a href="https://www.jeremyjordan.me/kubernetes/">introduction to Kubernetes</a> blog post.</p><!--kg-card-begin: markdown--><h3 id="overview">Overview</h3>
<ul>
<li><a href="#why-monitor">Why is monitoring important?</a></li>
<li><a href="#what-monitor">What should we be monitoring?</a></li>
<li><a href="#case-study">Monitoring a wine quality prediction model: a case study.</a>
<ul>
<li><a href="#deploy">Deploying a model with FastAPI</a></li>
<li><a href="#instrument">Instrumenting our model service with metrics</a></li>
<li><a href="#prometheus">Capturing metrics with Prometheus</a></li>
<li><a href="#grafana">Visualizing results in Grafana</a></li>
<li><a href="#locust">Simulating production traffic with Locust</a></li>
</ul>
</li>
<li><a href="#advanced-monitoring">Going beyond a simple monitoring solution</a>
<ul>
<li><a href="#monitoring-products">Purpose-built monitoring tools for ML models</a></li>
</ul>
</li>
<li><a href="#best-practices">Best practices for monitoring</a></li>
<li><a href="#resources">Resources</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a id="why-monitor"></a></p>
<h2 id="whyismonitoringimportant">Why is monitoring important?</h2>
<!--kg-card-end: markdown--><p>It's a well-accepted practice to monitor software systems so that we can understand performance characteristics, react quickly to system failures, and ensure that we're upholding our <a href="https://sre.google/sre-book/service-level-objectives/">Service Level Objectives</a>.</p><p>Monitoring systems can help give us confidence that our systems are running smoothly and, in the event of a system failure, can quickly provide appropriate context when diagnosing the root cause.</p><p>When deploying machine learning models, we still have the same set of concerns discussed above. However, we'd also like to have confidence that our model is making <strong>useful</strong> predictions in production.</p><p>There are many reasons why a model can fail to make useful predictions in production:</p><ul><li>The underlying data distribution has shifted over time and the model has gone stale.</li><li>The production data stream contains edge cases (not seen during model development) where the model performs poorly.</li><li>The model was misconfigured in its production deployment.</li></ul><p>In all of these scenarios, the model could still make a "successful" prediction from a service perspective, but the predictions will likely not be useful. Monitoring our machine learning models can help us detect such scenarios and intervene (e.g. trigger a model retraining/deployment pipeline).</p><!--kg-card-begin: markdown--><p><a id="what-monitor"></a></p>
<h2 id="whatshouldwebemonitoring">What should we be monitoring?</h2>
<!--kg-card-end: markdown--><p>At a high level, there are three classes of metrics that we'll want to track and monitor.</p><p><strong>Model metrics</strong></p><ul><li>Prediction distributions</li><li>Feature distributions</li><li>Evaluation metrics (when ground truth is available)</li></ul><p><strong>System metrics</strong></p><ul><li>Request throughput</li><li>Error rate</li><li>Request latencies</li><li>Request body size</li><li>Response body size</li></ul><p><strong>Resource metrics</strong></p><ul><li>CPU utilization</li><li>Memory utilization</li><li>Network data transfer</li><li>Disk I/O</li></ul><!--kg-card-begin: markdown--><p><a id="case-study"></a></p>
<h2 id="monitoringawinequalitypredictionmodelacasestudy">Monitoring a wine quality prediction model: a case study.</h2>
<!--kg-card-end: markdown--><p>Throughout the rest of this blog post, we'll walk through the process of instrumenting and monitoring a scikit-learn model trained on the <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">UCI Wine Quality dataset</a>. This model is trained to predict a wine's quality on the scale of 0 (lowest) to 10 (highest) based on a number of chemical attributes.</p><p>At a high level, we'll:</p><ol><li>Create a containerized REST service to expose the model via a prediction endpoint.</li><li>Instrument the server to collect metrics which are exposed via a separate metrics endpoint.</li><li>Deploy Prometheus to collect and store metrics.</li><li>Deploy Grafana to visualize the collected metrics.</li><li>Finally, we'll simulate production traffic using Locust so that we have some data to see in our dashboards.</li></ol><p>Feel free to clone <a href="https://github.com/jeremyjordan/ml-monitoring">this Github repository</a> and follow along yourself. All of the instructions to deploy these components on your own cluster are provided in the <code>README.md</code> file.</p><!--kg-card-begin: markdown--><p><a id="deploy"></a></p>
<h3 id="deployingamodelwithfastapi">Deploying a model with FastAPI</h3>
<!--kg-card-end: markdown--><p>If you look in the <code>model/</code> directory of the repo linked previously, you'll see a couple files.</p><ul><li><code>train.py</code> contains a simple script to produce a serialized model artifact.</li><li><code>app/api.py</code> defines a few routes for our model service including a model prediction endpoint and a health-check endpoint.</li><li><code>app/schemas.py</code> defines the expected schema for the request and response bodies in the model prediction endpoint.</li><li><code>Dockerfile</code> lists the instructions to package our REST server as a container.</li></ul><p>We can deploy this server on our Kubernetes cluster using the manifest defined in <code>kubernetes/models/</code>.</p><!--kg-card-begin: markdown--><p><a id="instrument"></a></p>
<h3 id="instrumentingourservicewithmetrics">Instrumenting our service with metrics</h3>
<!--kg-card-end: markdown--><p>In order to monitor this service, we'll need to collect and expose metrics data. We'll go into more details in the subsequent section, but for now our goal is to capture "metrics" and expose this data via a <code>/metrics</code> endpoint on our server.</p><p>For FastAPI servers, we can do this using <code>prometheus-fastapi-instrumentator</code>. This library includes FastAPI middleware that collects metrics for each request and exposes the metric data to a specified endpoint.</p><p>For our example, we'll capture some of the metrics included in the library (request size, response size, latency, request count) as well as one custom-defined metric (our regression model's output). You can see this configuration defined in <code>model/app/monitoring.py</code>.</p><p>After deploying our model service on the Kubernetes cluster, we can port forward to a pod running the server and check out the metrics endpoint running at <a href="http://127.0.0.1:3000/metrics">127.0.0.1:3000/metrics</a>.</p><pre><code>kubectl port-forward service/wine-quality-model-service 3000:80
</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2021/01/metrics_endpoint.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2021/01/metrics_endpoint.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2021/01/metrics_endpoint.png 1000w, https://www.jeremyjordan.me/content/images/size/w1600/2021/01/metrics_endpoint.png 1600w, https://www.jeremyjordan.me/content/images/size/w2400/2021/01/metrics_endpoint.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>An example of the data typically found at a /metrics endpoint.</figcaption></figure><p><strong>Note</strong>: many of the framework-specific serving libraries offer the ability to expose a metrics endpoint out of the box. However, I'm not sure how you can define custom (model-specific) metrics to be logged using these serving platforms.</p><ul><li><a href="https://pytorch.org/serve/metrics_api.html">PyTorch Serve</a></li><li><a href="https://www.tensorflow.org/tfx/serving/serving_config#monitoring_configuration">Tensorflow Serving</a></li><li><a href="https://github.com/triton-inference-server/server/blob/r20.12/docs/metrics.md">NVIDIA Triton Inference Server</a></li></ul><!--kg-card-begin: markdown--><p><a id="prometheus"></a></p>
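As a sketch of what a custom model metric can look like, we can define a Prometheus `Histogram` for the model's output and observe a value on every prediction. The metric name and buckets below are my own illustrative choices, not necessarily those used in `model/app/monitoring.py`; `prometheus-fastapi-instrumentator` exposes an `add()` hook for wiring functions like this in alongside its built-in request metrics.

```python
from prometheus_client import Histogram, generate_latest

# Custom model metric: the distribution of predicted wine quality scores.
# The name and buckets here are illustrative choices for this sketch.
wine_quality_prediction = Histogram(
    "wine_quality_prediction",
    "Distribution of predicted wine quality scores",
    buckets=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)

def record_prediction(predicted_quality: float) -> None:
    # Called from the /predict handler after the model runs.
    wine_quality_prediction.observe(predicted_quality)

record_prediction(5.4)

# The /metrics endpoint serves this text exposition format:
exposition = generate_latest().decode()
```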
<h3 id="capturingmetricswithprometheus">Capturing metrics with Prometheus</h3>
<!--kg-card-end: markdown--><p>After exposing our metrics at a specified endpoint, we can use Prometheus to collect and store this metric data. We'll deploy Prometheus onto our Kubernetes cluster using <code>helm</code>; see the <strong>Setup</strong> section in the <code>README.md</code> file for full instructions.</p><p>Prometheus is an open-source monitoring service with a focus on reliability. It is responsible for collecting metrics data from our service endpoints and efficiently storing this data to be queried later.</p><figure class="kg-card kg-image-card"><img src="https://www.jeremyjordan.me/content/images/2021/01/prometheus_architecture.jpg" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2021/01/prometheus_architecture.jpg 600w, https://www.jeremyjordan.me/content/images/size/w1000/2021/01/prometheus_architecture.jpg 1000w, https://www.jeremyjordan.me/content/images/size/w1600/2021/01/prometheus_architecture.jpg 1600w, https://www.jeremyjordan.me/content/images/2021/01/prometheus_architecture.jpg 1613w" sizes="(min-width: 720px) 720px"></figure><p>Prometheus refers to endpoints containing metric data as <strong>targets</strong>, which can be discovered either through service discovery or static configuration. In our example, we'll use <em>service discovery</em> to enable Prometheus to discover which targets should be scraped. We can do this by creating a <code>ServiceMonitor</code> resource for our wine quality prediction service. This resource specification is included in the <code>kubernetes/models/wine_quality.yaml</code> manifest. This resource must be defined in the same namespace that Prometheus is running in.</p><p>You can see all of the services configured to be discovered by Prometheus by running:</p><pre><code>kubectl get servicemonitor -n monitoring
</code></pre><p>You'll notice that in addition to collecting metrics from our wine quality prediction service, Prometheus has already been configured to collect metrics from the Kubernetes cluster itself.</p><p>Prometheus will scrape the metrics data at each of these endpoints at a specified interval (every 15 seconds by default). There are four supported data types for metrics.</p><ul><li><strong>Counter</strong>: a single value that monotonically increases over time (e.g. for counting the number of requests made)</li><li><strong>Gauge</strong>: a single value that can increase or decrease over time (e.g. for tracking current memory utilization)</li><li><strong>Summary</strong>: a collection of values aggregated by <code>count</code> and <code>sum</code> (e.g. for calculating average request size)</li><li><strong>Histogram</strong>: a collection of values aggregated into buckets (e.g. for tracking request latency)</li></ul><p>You can read more about these metric types <a href="https://prometheus.io/docs/concepts/metric_types/">here</a>. Each metric has a name and an optional set of labels (key-value pairs) to describe the observed value. These metrics are designed such that they can be aggregated at different timescales for <a href="https://prometheus.io/docs/prometheus/latest/storage/#compaction">more efficient storage</a>. This metric data can then be queried by other services (such as Grafana) which make requests to Prometheus's HTTP API.</p><p><strong>Note</strong>: Prometheus uses a PULL mechanism for collecting metrics, where services simply expose the metrics data at an endpoint and Prometheus collects the data. Some other monitoring services (e.g. AWS CloudWatch) use a PUSH mechanism where each service collects and sends its own metrics to the monitoring server. You can read about the tradeoffs for each approach <a href="https://giedrius.blog/2019/05/11/push-vs-pull-in-monitoring-systems/">here</a>.</p><!--kg-card-begin: markdown--><p><a id="grafana"></a></p>
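Using the official Python client library, each of the four types maps to a class of the same name. The metric names below are illustrative (following the convention of suffixing the unit):

```python
from prometheus_client import Counter, Gauge, Summary, Histogram

# Illustrative definitions for the four Prometheus metric types.
requests_total = Counter("app_requests_total", "Total requests served")
memory_bytes = Gauge("app_memory_usage_bytes", "Current memory usage")
request_size = Summary("app_request_size_bytes", "Size of request bodies")
latency_seconds = Histogram(
    "app_request_latency_seconds",
    "Request latency",
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0],
)

requests_total.inc()            # counters only go up
memory_bytes.set(512 * 1024)    # gauges can be set to any value
request_size.observe(2048)      # summaries track count and sum
latency_seconds.observe(0.23)   # histograms assign observations to buckets
```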
<h3 id="visualizingresultsingrafana">Visualizing results in Grafana</h3>
<!--kg-card-end: markdown--><p>Now that our metrics are being collected and stored in Prometheus, we need a way to visualize the data. Grafana is often paired with Prometheus to provide the ability to create dashboards from the metric data. In fact, our <code>helm</code> install of the Prometheus stack included Grafana out of the box.</p><p>You can check out the Grafana dashboard by port-forwarding to the service and visiting <a href="http://127.0.0.1:8000">127.0.0.1:8000</a>.</p><pre><code>kubectl port-forward service/prometheus-stack-grafana 8000:80 -n monitoring
</code></pre><p>We can create visualizations by making queries to the Prometheus data source. Prometheus uses a query language called <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>. Admittedly, it can take some time to get used to this query language.  After reading through the official documentation, I'd recommend watching <a href="https://www.youtube.com/watch?v=hTjHuoWxsks">PromQL for Mere Mortals</a> to get a better understanding of the query language.</p><p>The repository for this blog post contains a pre-built dashboard (see <code>dashboards/model.json</code>) for the wine quality prediction service which you can import and play around with.</p><figure class="kg-card kg-image-card"><img src="https://www.jeremyjordan.me/content/images/2021/01/dashboard.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2021/01/dashboard.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2021/01/dashboard.png 1000w, https://www.jeremyjordan.me/content/images/size/w1600/2021/01/dashboard.png 1600w, https://www.jeremyjordan.me/content/images/2021/01/dashboard.png 1798w" sizes="(min-width: 720px) 720px"></figure><!--kg-card-begin: markdown--><p><a id="locust"></a></p>
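To make this concrete, here are a few illustrative PromQL queries. The metric names are assumptions based on the instrumentator defaults and a custom model-output histogram, so adjust them to whatever your `/metrics` endpoint actually exposes:

```promql
# Per-second request rate to the prediction endpoint, averaged over 5m:
rate(http_requests_total{handler="/predict"}[5m])

# 95th percentile request latency, computed from histogram buckets:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{handler="/predict"}[5m]))

# Average predicted wine quality over the last hour:
rate(wine_quality_prediction_sum[1h]) / rate(wine_quality_prediction_count[1h])
```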
<h3 id="simulatingproductiontrafficwithlocust">Simulating production traffic with Locust</h3>
<!--kg-card-end: markdown--><p>At this point, we have a model deployed as a REST service which is instrumented to export metric data. This metric data is being collected by Prometheus and we can visualize the results in a Grafana dashboard. However, in order to see something interesting in the dashboard we'll want to simulate production traffic.</p><p>We'll use <code>locust</code>, a Python load testing framework, to make requests to our model service and simulate production traffic. This behavior is defined in <code>load_tests/locustfile.py</code> where we define three tasks:</p><ul><li>make a request to our health check endpoint</li><li>choose a random example from the wine quality dataset and make a request to our prediction service</li><li>choose a random example from the wine quality dataset, corrupt the data, and make a bad request to our prediction service</li></ul><p>The manifests to deploy this load test to our cluster can be found in <code>kubernetes/load_tests/</code>.</p><!--kg-card-begin: markdown--><p><a id="advanced-monitoring"></a></p>
<h2 id="goingbeyondasimplemonitoringsolution">Going beyond a simple monitoring solution</h2>
<!--kg-card-end: markdown--><p>In the wine quality prediction model case study, we only tracked a single model metric: the prediction distributions. However, as mentioned previously, there are more model-related metrics that we may want to track, such as feature distributions and evaluation metrics.</p><p>These remaining concerns introduce a few new challenges:</p><ul><li>The Prometheus time series database was designed to store metric data, not features for ML models. It's not the right technology choice for storing feature data and tracking how its distribution shifts over time.</li><li>Model evaluation metrics require a feedback signal containing the ground truth, which is not available at inference time.</li></ul><p>When it comes to logging feature distributions, there's a range of approaches you can take. The simplest approach for monitoring feature distributions might be to deploy a <strong>drift-detection service</strong> alongside the model. This service would fire (and increase a counter metric) when feature drift is detected.</p><p>For additional visibility into your production data, you can log a <strong>statistical profile</strong> of the features observed in production. This provides a compact summary of the production data and can help you identify when data distributions are shifting. 
The information logged will vary depending on the type of data:</p><ul><li>For <em>tabular</em> data, you can directly log features used through statistical measures such as histograms and value counts.</li><li>For <em>text</em> data, you can log metadata about the text such as character length, count of numeric characters, count of alphabetic characters, count of special characters, average word length, etc.</li><li>For <em>image</em> data, you can log metadata about the image such as average pixel value per channel, image width, image height, aspect ratio, etc.</li></ul><p>For full visibility on your production data stream, you can <strong>log the full feature payload</strong>. This is not only useful for monitoring purposes, but also in data collection for labeling and future model training.  If your deployment stack already supports saving production data, you can leverage this data source (typically a data lake) for monitoring whether or not the production data distribution is drifting away from the training data distribution. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2021/01/model_drift_evaluation_workflow-1024x281.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2021/01/model_drift_evaluation_workflow-1024x281.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2021/01/model_drift_evaluation_workflow-1024x281.png 1000w, https://www.jeremyjordan.me/content/images/2021/01/model_drift_evaluation_workflow-1024x281.png 1024w" sizes="(min-width: 720px) 720px"><figcaption>An example architecture provided by <a href="https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html">Databricks</a> for saving and monitoring production data.</figcaption></figure><p>In order to monitor model evaluation metrics, we'll need to persist model prediction results alongside a unique identifier which is included in the model inference response. This unique identifier will be used when we <strong>asynchronously receive feedback</strong> from the user or some upstream system regarding a prediction we made. Once we've received feedback, we can calculate our evaluation metric and log the results via Prometheus to be displayed in the model dashboard.</p><!--kg-card-begin: markdown--><p><a id="monitoring-products"></a></p>
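For tabular features, a statistical profile can be as simple as a handful of summary statistics per feature. Here's a minimal, stdlib-only sketch of the idea; purpose-built profilers (e.g. whylogs) instead use mergeable sketches so that profiles from many servers can be combined:

```python
import math
import statistics

def profile_feature(values):
    """Compute a compact statistical profile of one numeric feature.

    A stdlib-only sketch; values may contain None for missing entries.
    """
    finite = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "count": len(values),
        "missing": len(values) - len(finite),
        "mean": statistics.fmean(finite),
        "stdev": statistics.pstdev(finite),
        "min": min(finite),
        "max": max(finite),
    }

# e.g. a batch of "alcohol" feature values observed in production
profile = profile_feature([9.4, 9.8, None, 10.5, 11.2])
```

Profiles logged at a regular interval can then be compared against the profile of the training set to flag drift.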
<h3 id="purposebuiltmonitoringtoolsformlmodels">Purpose-built monitoring tools for ML models</h3>
<!--kg-card-end: markdown--><p>If you're looking for out-of-the-box support for some of these more advanced model monitoring use cases, there's a growing number of ML monitoring services that you can use.</p><p><strong>WhyLogs</strong> is an <a href="https://whylogs.readthedocs.io/en/latest/">open-source library</a> for capturing statistical profiles of production data streams.</p><p><strong>Seldon</strong> has developed an <a href="https://docs.seldon.io/projects/seldon-core/en/latest/">open-source library</a> which runs on Kubernetes and uses a similar technology stack (Prometheus, Grafana, Elasticsearch) as discussed in this blog post.</p><p><strong>Fiddler</strong> offers a slick <a href="https://www.fiddler.ai/ml-monitoring">monitoring</a> product with a dashboard built specifically for monitoring models and understanding predictions on production data.</p><p><strong>Amazon Sagemaker</strong> offers a <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html">model monitoring suite</a> which integrates well if you're already deploying models on Sagemaker.</p><!--kg-card-begin: markdown--><p><a id="best-practices"></a></p>
<h2 id="bestpracticesformonitoring">Best practices for monitoring</h2>
<!--kg-card-end: markdown--><p><strong>Prometheus</strong></p><ul><li>Avoid storing high-cardinality data in labels. Every unique set of labels is treated as a distinct time series, so high-cardinality labels can drastically increase the amount of data being stored. As a general rule, try to keep the cardinality for a given metric (number of unique label-sets) under 10.</li><li>Metric names should have a suffix describing the unit (e.g. <code>http_request_duration_seconds</code>).</li><li>Use base units when recording values (e.g. seconds instead of milliseconds).</li><li>Use standard <a href="https://prometheus.io/docs/instrumenting/exporters/">Prometheus exporters</a> when available.</li></ul><p><strong>Grafana</strong></p><ul><li>Ensure your dashboards are easily discoverable and consistent by design.</li><li>Use template variables instead of hardcoding values or duplicating charts.</li><li>Provide appropriate context next to important charts.</li><li>Keep your dashboards in source control.</li><li>Avoid duplicating dashboards.</li></ul><!--kg-card-begin: markdown--><p><a id="resources"></a></p>
<h2 id="resources">Resources</h2>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><strong>Blog posts</strong></p>
<ul>
<li><a href="https://towardsdatascience.com/production-machine-learning-monitoring-outliers-drift-explainers-statistical-performance-d9b1d02ac158">Production Machine Learning Monitoring: Outliers, Drift, Explainers &amp; Statistical Performance</a></li>
<li><a href="https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/">Monitoring Machine Learning Models in Production: A Comprehensive Guide</a></li>
<li><a href="https://medium.com/whylabs/whylogs-embrace-data-logging-a9449cd121d">whylogs: Embrace Data Logging Across Your ML Systems</a></li>
<li><a href="https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility/">Building dashboards for operational visibility</a></li>
<li><a href="https://shopify.engineering/make-dashboards-using-product-thinking-approach">How to Make Dashboards Using a Product Thinking Approach</a></li>
<li><a href="https://www.robustperception.io/how-does-a-prometheus-histogram-work">How does a Prometheus Histogram work?</a></li>
</ul>
<p><strong>Talks</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=YE2aQFiMGfY">Fool-Proof Kubernetes Dashboards for Sleep-Deprived Oncalls - David Kaltschmidt</a></li>
<li><a href="https://www.youtube.com/watch?v=h4Sl21AKiDg">How Prometheus Monitoring works | Prometheus Architecture explained</a></li>
<li><a href="https://www.youtube.com/watch?v=hTjHuoWxsks">PromQL for Mere Mortals</a></li>
</ul>
<!--kg-card-end: markdown--><h3 id="acknowledgements">Acknowledgements</h3><p><em>Thanks to Goku Mohandas, John Huffman, Shreya Shankar, and Binal Patel for reading early drafts of this blog post and providing feedback.</em></p>]]></content:encoded></item><item><title><![CDATA[Effective testing for machine learning systems.]]></title><description><![CDATA[<p><em>Working as a core maintainer for <a href="https://github.com/PyTorchLightning/pytorch-lightning">PyTorch Lightning</a>, I've grown a strong appreciation for the value of tests in software development. As I've been spinning up a new project at work, I've been spending a fair amount of time thinking about how we should test machine learning systems. A couple</em></p>]]></description><link>https://www.jeremyjordan.me/testing-ml/</link><guid isPermaLink="false">5f305de3cea33b00392d5760</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Wed, 19 Aug 2020 23:08:38 GMT</pubDate><content:encoded><![CDATA[<p><em>Working as a core maintainer for <a href="https://github.com/PyTorchLightning/pytorch-lightning">PyTorch Lightning</a>, I've grown a strong appreciation for the value of tests in software development. As I've been spinning up a new project at work, I've been spending a fair amount of time thinking about how we should test machine learning systems. A couple weeks ago, one of my coworkers sent me <a href="https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf">a fascinating paper</a> on the topic which inspired me to dig in, collect my thoughts, and write this blog post.</em> </p><p>In this blog post, we'll cover what testing looks like for traditional software development, why testing machine learning systems can be different, and discuss some strategies for writing effective tests for machine learning systems. We'll also clarify the distinction between the closely related roles of evaluation and testing as part of the model development process. 
By the end of this blog post, I hope you're convinced of both the extra work required to effectively test machine learning systems and the value of doing such work.</p><h2 id="what-s-different-about-testing-machine-learning-systems">What's different about testing machine learning systems?</h2><p>In traditional software systems, humans write the logic which interacts with data to produce a desired behavior. Our software tests help ensure that this <strong>written logic</strong> aligns with the actual expected behavior.</p><figure class="kg-card kg-image-card"><img src="https://www.jeremyjordan.me/content/images/2020/08/Screen-Shot-2020-08-09-at-4.45.29-PM.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Screen-Shot-2020-08-09-at-4.45.29-PM.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Screen-Shot-2020-08-09-at-4.45.29-PM.png 1000w, https://www.jeremyjordan.me/content/images/size/w1600/2020/08/Screen-Shot-2020-08-09-at-4.45.29-PM.png 1600w, https://www.jeremyjordan.me/content/images/2020/08/Screen-Shot-2020-08-09-at-4.45.29-PM.png 1850w" sizes="(min-width: 720px) 720px"></figure><p>However, in machine learning systems, humans provide desired behavior as examples during training and the model optimization process produces the logic of the system. 
How do we ensure this <strong>learned logic</strong> is going to consistently produce our desired behavior?</p><figure class="kg-card kg-image-card"><img src="https://www.jeremyjordan.me/content/images/2020/08/Screen-Shot-2020-08-09-at-4.47.06-PM.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Screen-Shot-2020-08-09-at-4.47.06-PM.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Screen-Shot-2020-08-09-at-4.47.06-PM.png 1000w, https://www.jeremyjordan.me/content/images/size/w1600/2020/08/Screen-Shot-2020-08-09-at-4.47.06-PM.png 1600w, https://www.jeremyjordan.me/content/images/2020/08/Screen-Shot-2020-08-09-at-4.47.06-PM.png 1850w" sizes="(min-width: 720px) 720px"></figure><p>Let's start by looking at the best practices for testing traditional software systems and developing high-quality software. </p><p>A typical software testing suite will include:</p><ul><li><strong>unit tests</strong> which operate on atomic pieces of the codebase and can be run quickly during development,</li><li><strong>regression tests</strong> replicate bugs that we've previously encountered and fixed,</li><li><strong>integration tests</strong> which are typically longer-running tests that observe higher-level behaviors that leverage multiple components in the codebase,</li></ul><p>and follow conventions such as:</p><ul><li>don't merge code unless all tests are passing,</li><li>always write tests for newly introduced logic when contributing code,</li><li>when contributing a bug fix, be sure to write a test to capture the bug and prevent future regressions.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2020/08/Group-5-1.jpg" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Group-5-1.jpg 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Group-5-1.jpg 1000w, 
https://www.jeremyjordan.me/content/images/size/w1600/2020/08/Group-5-1.jpg 1600w, https://www.jeremyjordan.me/content/images/2020/08/Group-5-1.jpg 2348w" sizes="(min-width: 720px) 720px"><figcaption>A typical workflow for software development.</figcaption></figure><p>When we run our testing suite against the new code, we'll get a report of the specific behaviors that we've written tests around and verify that our code changes don't affect the expected behavior of the system. If a test fails, we'll know which specific behavior is no longer aligned with our expected output. We can also look at this testing report to get an understanding of how extensive our tests are by looking at metrics such as <strong><a href="https://en.wikipedia.org/wiki/Code_coverage">code coverage</a></strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2020/08/Group-1.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Group-1.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Group-1.png 1000w, https://www.jeremyjordan.me/content/images/2020/08/Group-1.png 1149w" sizes="(min-width: 720px) 720px"><figcaption>An example output from a traditional software testing suite.</figcaption></figure><p>Let's contrast this with a typical workflow for developing machine learning systems. After training a new model, we'll typically produce an evaluation report including:</p><ul><li>performance of an established metric on a validation dataset,</li><li>plots such as precision-recall curves,</li><li>operational statistics such as inference speed,</li><li>examples where the model was most confidently incorrect,</li></ul><p>and follow conventions such as:</p><ul><li>save all of the hyper-parameters used to train the model,</li><li>only promote models which offer an improvement over the existing model (or baseline) when evaluated on the same dataset. 
</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2020/08/Group-3-1.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Group-3-1.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Group-3-1.png 1000w, https://www.jeremyjordan.me/content/images/2020/08/Group-3-1.png 1174w" sizes="(min-width: 720px) 720px"><figcaption>A typical workflow for model development.</figcaption></figure><p>When reviewing a new machine learning model, we'll inspect metrics and plots which summarize model performance over a validation dataset. We're able to compare performance between multiple models and make relative judgements, but we're not immediately able to characterize specific model behaviors. For example, figuring out <em>where</em> the model is failing usually requires additional investigative work; one common practice here is to look through a list of the top most egregious model errors on the validation dataset and manually categorize these failure modes.</p><p>Assuming we write behavioral tests for our models (discussed below), there's also the question of whether or not we have enough tests! While traditional software tests have metrics such as the lines of code covered when running tests, this becomes harder to quantify when you shift your application logic from lines of code to parameters of a machine learning model. Do we want to quantify our test coverage with respect to the input data distribution? Or perhaps the possible activations inside the model?</p><p><a href="http://proceedings.mlr.press/v97/odena19a/odena19a.pdf">Odena et al</a>. introduce one possible metric for coverage where we track the model logits for all of the test examples and quantify the area covered by radial neighborhoods around these activation vectors. However, my perception is that as an industry we don't have a well-established convention here. 
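</p><p>To make this concrete, here's a toy sketch of one way such a coverage measure could work: an example's logit vector starts a new radial neighborhood unless it already lies within a fixed radius of a previously covered vector. This is only an illustration of the idea, not the authors' implementation.</p>

```python
import numpy as np

def coverage_count(logits: np.ndarray, radius: float) -> int:
    """Count the radial neighborhoods needed to cover a set of logit vectors.

    A vector starts a new neighborhood if it is farther than `radius`
    (in Euclidean distance) from every previously covered vector.
    """
    covered = []
    for vec in logits:
        if not any(np.linalg.norm(vec - c) <= radius for c in covered):
            covered.append(vec)
    return len(covered)

# Two tight clusters of logit vectors yield two neighborhoods at radius 1.0
logits = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(coverage_count(logits, radius=1.0))  # 2
```

<p>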
In fact, it feels like testing for machine learning systems is in such early days that this question of test coverage isn't really being asked by many people.</p><h2 id="what-s-the-difference-between-model-testing-and-model-evaluation">What's the difference between model testing and model evaluation?</h2><p>While reporting evaluation metrics is certainly a good practice for quality assurance during model development, I don't think it's sufficient. Without a granular report of specific behaviors, we won't be able to immediately understand the nuance of how behavior may change if we switch over to the new model. Additionally, we won't be able to track (and prevent) behavioral regressions for specific failure modes that had been previously addressed.</p><p>This can be especially dangerous for machine learning systems since failures often happen silently. For example, you might improve the overall evaluation metric but introduce a regression on a critical subset of data. Or you could unknowingly add a gender bias to the model through the inclusion of a new dataset during training. We need more nuanced reports of model behavior to identify such cases, which is exactly where model testing can help.</p><p>For machine learning systems, we should be running model evaluation and model tests in parallel. </p><ul><li><strong>Model evaluation</strong> covers metrics and plots which summarize performance on a validation or test dataset.</li><li><strong>Model testing</strong> involves explicit checks for behaviors that we expect our model to follow.</li></ul><p>Both of these perspectives are instrumental in building high-quality models. </p><p>In practice, most people are doing a combination of the two where evaluation metrics are calculated automatically and some level of model "testing" is done <a href="https://www.coursera.org/learn/machine-learning-projects/lecture/GwViP/carrying-out-error-analysis">manually through error analysis</a> (i.e. 
classifying failure modes). Developing model tests for machine learning systems can offer a systematic approach towards error analysis.</p><h2 id="how-do-you-write-model-tests">How do you write model tests?</h2><p>In my opinion, there are two general classes of model tests that we'll want to write.</p><ul><li><strong>Pre-train tests</strong> allow us to identify some bugs early on and short-circuit a training job.</li><li><strong>Post-train tests</strong> use the trained model artifact to inspect behaviors for a variety of important scenarios that we define.</li></ul><h3 id="pre-train-tests">Pre-train tests</h3><p>There are some tests we can run without needing trained parameters. These tests include:</p><ul><li>check the shape of your model output and ensure it aligns with the labels in your dataset</li><li>check the output ranges and ensure they align with our expectations (eg. the output of a classification model should be a distribution with class probabilities that sum to 1)</li><li>make sure a single gradient step on a batch of data yields a decrease in your loss</li><li>make <a href="https://greatexpectations.io/">assertions about your datasets</a></li><li>check for label leakage between your training and validation datasets</li></ul><p>The main goal here is to identify some errors early so we can avoid a wasted training job.</p><h3 id="post-train-tests">Post-train tests</h3><p>However, in order to understand model behaviors we'll need to test against trained model artifacts. 
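</p><p>As a brief aside, a few of the pre-train checks listed above can be sketched in a self-contained way. The toy logistic-regression model below (plain NumPy, synthetic data) simply stands in for whatever model you're about to train:</p>

```python
import numpy as np

# Synthetic, linearly separable data standing in for a real training set
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(4)  # untrained logistic-regression weights

def predict_proba(w, X):
    """Class probabilities under a logistic-regression model."""
    p1 = 1.0 / (1.0 + np.exp(-X @ w))
    return np.column_stack([1.0 - p1, p1])

def nll(w):
    """Mean negative log-likelihood (the training loss)."""
    p = predict_proba(w, X)[:, 1]
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Check: output shape aligns with the labels in the dataset
assert predict_proba(w, X).shape == (len(y), 2)

# Check: outputs form a valid probability distribution
assert np.allclose(predict_proba(w, X).sum(axis=1), 1.0)

# Check: a single gradient step on the batch decreases the loss
loss_before = nll(w)
grad = X.T @ (predict_proba(w, X)[:, 1] - y) / len(y)
w = w - 0.1 * grad
assert nll(w) < loss_before
```

<p>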
These tests aim to <strong>interrogate the logic learned during training</strong> and provide us with a behavioral report of model performance.</p><blockquote><strong>Paper highlight</strong>: <a href="https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf">Beyond Accuracy: Behavioral Testing of NLP Models with CheckList</a></blockquote><p>The authors of the above paper present three different types of model tests that we can use to understand behavioral attributes. </p><p><strong>Invariance Tests</strong></p><p>Invariance tests allow us to describe a set of perturbations we should be able to make to the input <em>without</em> affecting the model's output. We can use these perturbations to produce pairs of input examples (original and perturbed) and <strong>check for consistency</strong> in the model predictions. This is closely related to the concept of data augmentation, where we apply perturbations to inputs during training and preserve the original label. </p><p>For example, imagine running a sentiment analysis model on the following two sentences:</p><ul><li><u>Mark</u> was a great instructor.</li><li><u>Samantha</u> was a great instructor.</li></ul><p>We would expect that simply changing the name of the subject doesn't affect the model predictions. </p><p><strong>Directional Expectation Tests</strong></p><p>Directional expectation tests, on the other hand, allow us to define a set of perturbations to the input which <em>should</em> have a <em>predictable</em> effect on the model output. 
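</p><p>As a toy illustration of the invariance idea, here's what the name-swap check from the example above might look like in code. The function below is just a stub; a real test would load a trained sentiment classifier:</p>

```python
# A trivial keyword-based stub standing in for a trained sentiment model.
POSITIVE_WORDS = {"great", "excellent", "wonderful"}

def predict_sentiment(sentence: str) -> str:
    words = set(sentence.lower().rstrip(".").split())
    return "positive" if words & POSITIVE_WORDS else "negative"

# Invariance test: swapping the subject's name should not change the prediction.
pairs = [
    ("Mark was a great instructor.", "Samantha was a great instructor."),
    ("Mark was late to every class.", "Samantha was late to every class."),
]
for original, perturbed in pairs:
    assert predict_sentiment(original) == predict_sentiment(perturbed)
```

<p>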
</p><p>For example, if we had a housing price prediction model we might assert:</p><ul><li>Increasing the number of bathrooms (holding all other features constant) should not cause a drop in price.</li><li>Lowering the square footage of the house (holding all other features constant) should not cause an increase in price.</li></ul><p>Let's consider a scenario where a model fails the second test - taking a random row from our validation dataset and decreasing the feature <code>house_sq_ft</code> yields a higher predicted price than the original label. This is surprising as it doesn't match our intuition, so we decide to look further into it. We realize that, without having a feature for the house's neighborhood/location, our model has learned that smaller units tend to be more expensive; this is because the smaller units in our dataset are more prevalent in cities where prices are generally higher. In this case, the <em>selection</em> of our dataset has influenced the model's logic in unintended ways - this isn't something we would have been able to identify simply by examining performance on a validation dataset.</p><p><strong>Minimum Functionality Tests (aka data unit tests)</strong></p><p>Just as software unit tests aim to isolate and test atomic components in your codebase, data unit tests allow us to quantify model performance for specific cases found in your data. </p><p>This allows you to identify critical scenarios where prediction errors lead to high consequences. You may also decide to write data unit tests for failure modes that you uncover during error analysis; this allows you to "automate" searching for such errors in future models.</p><p>Snorkel has also introduced a very similar approach through their concept of <a href="https://www.snorkel.org/use-cases/03-spam-data-slicing-tutorial">slicing functions</a>. These are programmatic functions which allow us to identify subsets of a dataset which meet certain criteria. 
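</p><p>In code, a slicing function can be as small as a predicate over examples. Here's a hedged sketch in which we compute accuracy over a slice of short sentences (the records and predictions below are made up for illustration):</p>

```python
# Hypothetical (sentence, label, prediction) records from a validation set.
records = [
    ("Loved it", "positive", "positive"),
    ("Terrible", "negative", "positive"),
    ("The plot was slow but the acting made up for it", "positive", "positive"),
    ("I would not watch this again", "negative", "negative"),
]

def short_sentence_slice(record) -> bool:
    """Slicing function: keep sentences with fewer than 5 words."""
    sentence, _, _ = record
    return len(sentence.split()) < 5

sliced = [r for r in records if short_sentence_slice(r)]
accuracy = sum(label == pred for _, label, pred in sliced) / len(sliced)
print(f"short-sentence slice: {len(sliced)} examples, accuracy {accuracy:.2f}")
```

<p>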
For example, you might write a slicing function to identify sentences with fewer than 5 words to evaluate how the model performs on short pieces of text.</p><h3 id="organizing-tests">Organizing tests</h3><p>In traditional software tests, we typically organize our tests to mirror the structure of the code repository. However, this approach doesn't translate well to machine learning models since our logic is structured by the parameters of the model. </p><p>The authors of the CheckList paper linked above recommend structuring your tests around the "skills" we expect the model to acquire while learning to perform a given task. </p><p>For example, a sentiment analysis model might be expected to gain some understanding of:</p><ul><li>vocabulary and parts of speech,</li><li>robustness to noise, </li><li>identifying named entities,</li><li>temporal relationships,</li><li>and negation of words.</li></ul><p>For an image recognition model, we might expect the model to learn concepts such as:</p><ul><li>object rotation,</li><li>partial occlusion,</li><li>perspective shift,</li><li>lighting conditions,</li><li>weather artifacts (rain, snow, fog),</li><li>and camera artifacts (ISO noise, motion blur).</li></ul><h3 id="model-development-pipeline">Model development pipeline</h3><p>Putting this all together, we can revise our diagram of the model development process to include pre-train and post-train tests. These test outputs can be displayed alongside model evaluation reports for review during the last step in the pipeline. 
Depending on the nature of your model training, you may choose to automatically approve models provided that they meet some specified criteria.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jeremyjordan.me/content/images/2020/08/Group-7.png" class="kg-image" alt srcset="https://www.jeremyjordan.me/content/images/size/w600/2020/08/Group-7.png 600w, https://www.jeremyjordan.me/content/images/size/w1000/2020/08/Group-7.png 1000w, https://www.jeremyjordan.me/content/images/2020/08/Group-7.png 1333w" sizes="(min-width: 720px) 720px"><figcaption>A proposed workflow for developing high-quality models.</figcaption></figure><h2 id="conclusion">Conclusion</h2><p>Machine learning systems are trickier to test due to the fact that we're not explicitly writing the logic of the system. However, automated testing is still an important tool for the development of high-quality software systems. These tests can provide us with a behavioral report of trained models, which can serve as a systematic approach towards error analysis. </p><p>Throughout this blog post, I've presented "traditional software development" and "machine learning model development" as two separate concepts. This simplification made it easier to discuss the unique challenges associated with testing machine learning systems; unfortunately, the real world is messier. Developing machine learning models also relies on a large amount of "traditional software development" in order to process data inputs, create feature representations, perform data augmentation, orchestrate model training, expose interfaces to external systems, and much more. 
Thus, effective testing for machine learning systems requires <strong>both </strong>a traditional software testing suite (for model development infrastructure) and a model testing suite (for trained models).</p><p>If you have experience testing machine learning systems, please reach out and share what you've learned!</p><p><em>Thank you, Xinxin Wu, for sending me the paper which inspired me to write this post! Additionally, I'd like to thank John Huffman, <a href="http://josh-tobin.com/">Josh Tobin</a>, and <a href="https://automationpanda.com/">Andrew Knight</a> for reading earlier drafts of this post and providing helpful feedback.</em></p><h2 id="further-reading">Further reading</h2><p><strong>Papers</strong></p><ul><li><a href="https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf">Beyond Accuracy: Behavioral Testing of NLP Models with CheckList</a></li><li><a href="http://proceedings.mlr.press/v97/odena19a/odena19a.pdf">TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing</a></li></ul><p><strong>Blog posts</strong></p><ul><li><a href="https://eugeneyan.com/writing/testing-ml/">How to Test Machine Learning Code and Systems</a> by Eugene Yan</li><li><a href="https://www.snorkel.org/use-cases/03-spam-data-slicing-tutorial">Snorkel Intro Tutorial: <em><em>Data Slicing</em></em></a></li><li><a href="https://krokotsch.eu/cleancode/2020/08/11/Unit-Tests-for-Deep-Learning.html">How to Trust Your Deep Learning Code</a> by Tilman Krokotsch</li></ul><p><strong>Talks</strong></p><ul><li><a href="https://www.youtube.com/watch?v=k0naEYedv5I&amp;feature=youtu.be">MLOps Chat: How Should We Test ML Models? 
with Data Scientist Jeremy Jordan</a></li><li><a href="https://www.youtube.com/watch?v=Da-FL_1i6ps">Unit Testing for Data Scientists - Hanna Torrence</a> </li><li><a href="https://www.youtube.com/watch?v=GEqM9uJi64Q">Trey Causey: Testing for Data Scientists</a></li><li><a href="https://www.youtube.com/watch?v=x7lhb7ASyu0&amp;feature=emb_title">Bay Area NLP Meetup: Beyond Accuracy Behavioral Testing of NLP Models with CheckList</a></li></ul><p><strong>Code</strong></p><ul><li><a href="https://github.com/marcotcr/checklist">CheckList</a></li><li><a href="https://github.com/great-expectations/great_expectations">Great Expectations</a></li><li><a href="https://github.com/deepmind/chex">Chex</a> (testing library for Jax)</li></ul>]]></content:encoded></item><item><title><![CDATA[An introduction to Kubernetes.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>This blog post will provide an introduction to Kubernetes so that you can understand the motivation behind the tool, what it is, and how you can use it. In a follow-up post, I'll discuss how we can leverage Kubernetes to power data science workloads using more concrete (data science) examples.</p>]]></description><link>https://www.jeremyjordan.me/kubernetes/</link><guid isPermaLink="false">5dbf9a7808b3d60038a607a0</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Wed, 27 Nov 2019 03:29:54 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This blog post will provide an introduction to Kubernetes so that you can understand the motivation behind the tool, what it is, and how you can use it. In a follow-up post, I'll discuss how we can leverage Kubernetes to power data science workloads using more concrete (data science) examples. However, it helps to first build an understanding of the fundamentals - which is the focus of this post.</p>
<p><strong>Prerequisites</strong>: I'm going to make the assumption that you're familiar with container technologies such as Docker. If you don't have experience building and running container images, I suggest starting <a href="https://unsupervisedpandas.com/data-science/docker-for-data-science/">here</a> before you continue reading this post.</p>
<h2 id="overview">Overview</h2>
<p><em>Here's what we'll discuss in this post. You can click on any top-level heading to jump directly to that section.</em></p>
<ul>
<li><a href="#objective">What is the point of Kubernetes?</a></li>
<li><a href="#design">Design principles.</a>
<ul>
<li>Declarative</li>
<li>Distributed</li>
<li>Decoupled</li>
<li>Immutable</li>
</ul>
</li>
<li><a href="#objects">Basic objects in Kubernetes.</a>
<ul>
<li>Pod</li>
<li>Deployment</li>
<li>Service</li>
<li>Ingress</li>
<li>Job</li>
</ul>
</li>
<li><a href="#control-plane">How? The Kubernetes control plane.</a>
<ul>
<li>Master node
<ul>
<li>API server</li>
<li>etcd</li>
<li>scheduler</li>
<li>controller-manager</li>
</ul>
</li>
<li>Worker nodes
<ul>
<li>kubelet</li>
<li>kube-proxy</li>
</ul>
</li>
</ul>
</li>
<li><a href="#nah">When should you not use Kubernetes?</a></li>
<li><a href="#resources">Resources</a></li>
</ul>
<hr>
<p><a id="objective"></a></p>
<h2 id="whatisthepointofkubernetes">What is the point of Kubernetes?</h2>
<p>Kubernetes is often described as a <strong>container orchestration</strong> platform. In order to understand what exactly that means, it helps to revisit the purpose of containers, what's missing, and how Kubernetes fills that gap.</p>
<blockquote>
<p>Note: You will also see Kubernetes referred to by its <a href="https://en.wikipedia.org/wiki/Numeronym">numeronym</a>, <strong>k8s</strong>. It means the same thing, just easier to type.</p>
</blockquote>
<p><strong>Why do we love containers?</strong> Containers provide a lightweight mechanism for isolating an application's environment. For a given application, we can specify the system configuration and libraries we want installed without worrying about creating conflicts with other applications that might be running on the same physical machine. We encapsulate each application as a <em>container image</em> which can be executed reliably on any machine* (as long as it has the ability to run container images), providing us the portability to enable smooth transitions from development to deployment. Additionally, because each application is self-contained without the concern of environment conflicts, it's easier to place multiple workloads on the same physical machine and achieve higher resource (memory and CPU) utilization - ultimately lowering costs.</p>
<p><strong>What's missing?</strong> However, what happens if your container dies? Or even worse, what happens if the machine running your container fails? Containers do not provide a solution for <em>fault tolerance</em>. Or what if you have multiple containers that need the ability to communicate, how do you enable networking between containers? How does this change as you spin up and down individual containers? Container <em>networking</em> can easily become an entangled mess. Lastly, suppose your production environment consists of multiple machines - how do you decide which machine to use to run your container?</p>
<p><strong>Kubernetes as a container orchestration platform.</strong> We can address many of the concerns mentioned above using a container orchestration platform.</p>
<blockquote>
<p>The director of an orchestra holds the <strong>vision</strong> for a musical performance and <strong>communicates</strong> with the musicians in order to <strong>coordinate</strong> their individual instrumental contributions to achieve this overall vision. As the architect of a system, your job is simply to <strong>compose the music</strong> (specify the containers to be run) and then hand over control to the orchestra director (container orchestration platform) to achieve that vision.</p>
</blockquote>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/orchestration.gif" alt="orchestration"><br>
<small><a href="https://giphy.com/gifs/kennedycenter-nationalsymphonyorchestra-noseda-nso-classicalmusic-orchestra-3o8dFCeY6O5NyLVDYQ">Image credit</a></small></p>
<p>A container orchestration platform manages the entire lifecycle of individual containers, spinning up and shutting down resources as needed. If a container shuts down unexpectedly, the orchestration platform will react by launching another container in its place.</p>
<p>On top of this, the orchestration platform provides a mechanism for applications to communicate with each other even as underlying individual containers are created and destroyed.</p>
<p>Lastly, given (1) a set of container workloads to run and (2) a set of machines on a cluster, the container orchestrator examines each container and determines the optimal machine to schedule that workload. To understand why this can be valuable, <a href="https://youtu.be/u_iAXzy3xBA?t=1067">watch Kelsey Hightower explain</a> (17:47-20:55) the difference between automated deployments and container orchestration using an example game of Tetris.</p>
<hr>
<p><a id="design"></a></p>
<h2 id="designprinciples">Design principles.</h2>
<p>Now that we understand the motivation for container orchestration in general, let's discuss the design principles behind Kubernetes. It helps to understand these principles so that you can use the tool <em>as it was intended to be used</em>.</p>
<h3 id="declarative">Declarative</h3>
<p>Perhaps the most important design principle in Kubernetes is that we simply define the <em>desired state</em> of our system and let Kubernetes automation work to ensure that the <em>actual state</em> of the system reflects these desires. This absolves you of the responsibility of fixing most things when they break; you simply need to state what your system <em>should</em> look like in an ideal state. Kubernetes will detect when the actual state of the system doesn't meet these expectations and it will intervene on your behalf to fix the problem. This enables our systems to be <strong>self-healing</strong> and react to problems without the need for human intervention.</p>
<p>The &quot;state&quot; of your system is defined by a collection of <em>objects</em>. Each Kubernetes object has (1) a <em>specification</em> in which you provide the desired state and (2) a <em>status</em> which reflects the current state of the object. Kubernetes maintains a list of all object specifications and constantly polls each object in order to ensure that its status is equal to the specification. If an object is unresponsive, Kubernetes will spin up a new version to replace it. If an object's status has drifted from the specification, Kubernetes will issue the necessary commands to drive that object back to its desired state.</p>
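<p>The reconcile-to-spec idea can be illustrated with a toy control loop. This is plain Python sketching the concept, not Kubernetes internals:</p>

```python
# Toy reconciliation loop: drive the actual state toward the declared spec.
def reconcile(spec: dict, actual: dict) -> list:
    """Compare desired vs. actual state and return the corrective actions."""
    actions = []
    diff = spec["replicas"] - actual["replicas"]
    for _ in range(diff):
        actions.append("start pod")
    for _ in range(-diff):
        actions.append("stop pod")
    actual["replicas"] = spec["replicas"]
    return actions

print(reconcile({"replicas": 3}, {"replicas": 1}))  # ['start pod', 'start pod']
```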
<h3 id="distributed">Distributed</h3>
<p>At a certain operating scale, it becomes necessary to architect your applications as a distributed system. Kubernetes is designed to provide the infrastructural layer for such distributed systems, yielding clean abstractions to build applications on top of a collection of machines (collectively known as a cluster). More specifically, Kubernetes provides a <em>unified</em> interface for interacting with this cluster such that you don't have to worry about communicating with each machine individually.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/distributed_systems.png" alt="distributed_systems"></p>
<h3 id="decoupled">Decoupled</h3>
<p>It is <a href="https://www.redhat.com/en/resources/cloud-native-container-design-whitepaper">commonly recommended</a> that containers be developed with a <em>single concern</em> in mind. As a result, developing containerized applications lends itself quite nicely to the <a href="https://martinfowler.com/articles/microservices.html">microservice architecture</a> design pattern, which recommends &quot;designing software applications as suites of independently deployable services.&quot;</p>
<p>The abstractions provided in Kubernetes naturally support the idea of decoupled services which can be scaled and updated independently. These services are logically separated and communicate via well-defined APIs. This logical separation allows teams to deploy changes into production at a higher velocity since each service can operate on independent release cycles (provided that they respect the existing API contracts).</p>
<h3 id="immutableinfrastructure">Immutable infrastructure</h3>
<p>In order to achieve the most benefit from containers and container orchestration, you should be deploying immutable infrastructure. That is, rather than logging in to a container on a machine to make changes (eg. updating a library), you should build a new container image, deploy the new version, and terminate the older version. As you transition across environments during the life-cycle of a project (development -&gt; testing -&gt; production) you should <em>use the same container image</em> and only modify configurations external to the container image (eg. by mounting a config file).</p>
<p>This becomes very important since <strong>containers are designed to be ephemeral</strong>, ready to be replaced by another container instance at any time. If your original container had mutated state (eg. manual configuration) but was shut down due to a failed healthcheck, the new container spun up in its place would not reflect those manual changes and could potentially break your application.</p>
<p>When you maintain immutable infrastructure, it also becomes much easier to roll back your applications to a previous state (eg. if an error occurs) - you can simply update your configuration to use an older container image.</p>
<hr>
<p><a id="objects"></a></p>
<h2 id="basicobjectsinkubernetes">Basic objects in Kubernetes.</h2>
<p>Previously, I mentioned that we describe our <em>desired state</em> of the system through a collection of Kubernetes <strong>objects</strong>. Up until now, our discussion of Kubernetes has been relatively abstract and high-level. In this section, we'll dig into more specifics regarding how you can deploy applications on Kubernetes by covering the basic objects available in Kubernetes.</p>
<p>Kubernetes objects can be defined using either YAML or JSON files; these files defining objects are commonly referred to as <em>manifests</em>. It's a good practice to keep these manifests in a version controlled repository which can act as the single source of truth as to what objects are running on your cluster.</p>
<h3 id="pod">Pod</h3>
<p>The <strong>Pod</strong> object is the fundamental building block in Kubernetes, comprised of one or more (tightly related) containers, a shared networking layer, and shared filesystem volumes. Similar to containers, pods are designed to be ephemeral - there is no expectation that a <em>specific, individual pod</em> will persist for a long lifetime.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/pod.png" alt="pod"></p>
<p>You typically won't create Pod objects explicitly in your manifests, as it's often simpler to use higher-level components which manage Pod objects for you.</p>
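<p>Even so, it helps to see the shape of one. A hedged sketch of a minimal Pod manifest (all names here are hypothetical):</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  containers:
    - name: ml-model
      image: registry.example.com/ml-model:v1
      ports:
        - containerPort: 5000
```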
<h3 id="deployment">Deployment</h3>
<p>A <strong>Deployment</strong> object encompasses a collection of pods defined by a template and a replica count (how many copies of the template we want to run). You can either set a specific value for the replica count or use a separate Kubernetes resource (eg. a horizontal pod autoscaler) to control the replica count based on system metrics such as CPU utilization.</p>
<blockquote>
<p>Note: The Deployment object's controller actually creates another object, a ReplicaSet, under the hood. However, this is abstracted away from you as the user.</p>
</blockquote>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/deployment.png" alt="deployment"></p>
<p>While you can't rely on any <em>single</em> pod to stay running indefinitely, you can rely on the fact that the cluster will always try to have $n$ pods available (where $n$ is defined by your specified replica count). If we have a Deployment with a replica count of 10 and 3 of those pods crash due to a machine failure, 3 more pods will be scheduled to run on a different machine in the cluster. For this reason, <em>Deployments are best suited for stateless applications</em> where Pods are able to be replaced at any time without breaking things.</p>
<p>The following YAML file provides an annotated example of how you might define a Deployment object. In this example, we want to run 10 instances of a container which serves an ML model over a REST interface.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/deployment_spec.png" alt="deployment_spec"></p>
<blockquote>
<p>Note: In order for Kubernetes to know how compute-intensive this workload might be, we should also provide <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/">resource limits</a> in the pod template specification.</p>
</blockquote>
<p>Deployments also allow us to specify how we would like to roll out updates when we have new versions of our container image; <a href="https://blog.container-solutions.com/kubernetes-deployment-strategies">this blog post</a> provides a good overview of your different options. If we wanted to override the defaults we would include an additional <code>strategy</code> field under the object <code>spec</code>. Kubernetes will make sure to gracefully shut down Pods running the old container image and spin up new Pods running the new container image.</p>
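<p>For example, a hedged sketch of such an override (these are standard <code>apps/v1</code> Deployment fields, but the values here are arbitrary):</p>

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1       # at most one extra Pod during the rollout
      maxUnavailable: 0 # never dip below the desired replica count
```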
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/deployment_update.png" alt="deployment_update"></p>
<h3 id="service">Service</h3>
<p>Each Pod in Kubernetes is assigned a unique IP address that we can use to communicate with it. However, because Pods are ephemeral, it can be quite difficult to send traffic to your desired container. For example, let's consider the Deployment from above where we have 10 Pods running a container serving a machine learning model over REST. How do we reliably communicate with a server if the set of Pods running as part of the Deployment can change at any time? This is where the <strong>Service</strong> object enters the picture. A Kubernetes Service provides you with a stable endpoint which can be used to direct traffic to the desired Pods even as the exact underlying Pods change due to updates, scaling, and failures. Services know which Pods they should send traffic to based on <em>labels</em> (key-value pairs) which we define in the Pod metadata.</p>
<blockquote>
<p>Note: This <a href="https://www.asykim.com/blog/deep-dive-into-kubernetes-external-traffic-policies">blog post</a> does a nice job explaining how traffic is actually routed.</p>
</blockquote>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/service-1.png" alt="service"><br>
<small>In this example, our Service sends traffic to all healthy Pods with the label <code>app=&quot;ml-model&quot;</code>.</small></p>
<p>The following YAML file provides an example for how we might wrap a Service around the earlier Deployment example.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/service_spec.png" alt="service_spec"></p>
<h3 id="ingress">Ingress</h3>
<p>While a Service allows us to expose applications behind a stable endpoint, the endpoint is only available to internal cluster traffic. If we want to expose our application to traffic external to our cluster, we need to define an <strong>Ingress</strong> object.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/ingress.png" alt="ingress"></p>
<p>The benefit of this approach is that you can select which Services to make publicly available. For example, suppose that in addition to our Service for a machine learning model, we had a UI which leveraged the model's predictions as part of a larger application. We may choose to only make the UI available to public traffic, preventing users from being able to query the model serving Service directly.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/ingress_2.png" alt="ingress_2"></p>
<p>The following YAML file defines an Ingress object for the above example, making the UI publicly accessible.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/ingress_spec.png" alt="ingress_spec"></p>
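<p>Sketched out in text (the Service name, path, and port are illustrative assumptions), such an Ingress might look like:</p>
<pre><code>apiVersion: networking.k8s.io/v1   # networking.k8s.io/v1beta1 on older clusters
kind: Ingress
metadata:
  name: ui-ingress
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ui-service   # only the UI Service is exposed to external traffic
                port:
                  number: 80
</code></pre>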
<h3 id="job">Job</h3>
<p>The Kubernetes objects I've described up until this point can be composed to create reliable, long-running services. In contrast, the <strong>Job</strong> object is useful when you want to perform a discrete task. For example, suppose we want to retrain our model daily based on the information collected from the previous day. Each day, we want to spin up a container to execute a predefined workload (eg. a <code>train.py</code> script) and then shut down when the training finishes. Jobs provide us the ability to do exactly this! If for some reason our container crashes before finishing the script, Kubernetes will react by launching a new Pod in its place to finish the job. For Job objects, the &quot;desired state&quot; of the object is completion of the job.</p>
<p>The following YAML defines an example Job for training a machine learning model (assuming the training code is defined in <code>train.py</code>).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/job_spec.png" alt="job_spec"></p>
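<p>A rough text version of such a Job spec might look like the following (the container image name is an illustrative assumption):</p>
<pre><code>apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 4                 # retry a few times if the container crashes
  template:
    spec:
      containers:
        - name: train
          image: ml-training:latest         # image containing train.py and its dependencies
          command: ["python", "train.py"]
      restartPolicy: Never        # let the Job controller create replacement Pods
</code></pre>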
<blockquote>
<p>Note: This Job specification will only execute a single training run. If we wanted to execute this job daily, we could define a CronJob object instead.</p>
</blockquote>
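<p>For example, a CronJob wrapping this training workload might look like the following sketch (again, the container image name is an illustrative assumption):</p>
<pre><code>apiVersion: batch/v1              # batch/v1beta1 on older clusters
kind: CronJob
metadata:
  name: train-model-daily
spec:
  schedule: "0 0 * * *"           # run at midnight every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: train
              image: ml-training:latest
              command: ["python", "train.py"]
          restartPolicy: Never
</code></pre>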
<h3 id="andmanymore">...and many more.</h3>
<p>The objects discussed above are certainly not an exhaustive list of resource types available in Kubernetes. Some other objects that you'll most likely find useful when deploying applications include:</p>
<ul>
<li>Volume: for managing directories mounted onto Pods</li>
<li>Secret: for storing sensitive credentials</li>
<li>Namespace: for separating resources on your cluster</li>
<li>ConfigMap: for specifying application configuration values to be mounted as a file</li>
<li>HorizontalPodAutoscaler: for scaling Deployments based on the current resource utilization of existing Pods</li>
<li>StatefulSet: similar to a Deployment, but for when you need to run a stateful application</li>
</ul>
<hr>
<p><a id="control-plane"></a></p>
<h2 id="howthekubernetescontrolplane">How? The Kubernetes control plane.</h2>
<p>By this point, you're probably wondering how Kubernetes is capable of taking all of our object specifications and actually executing these workloads on a cluster. In this section we'll discuss the components that make up the Kubernetes <strong>control plane</strong> which govern how workloads are executed, monitored, and maintained on our cluster.</p>
<p>Before we dive in, it's important to distinguish two classes of machines on our cluster:</p>
<ul>
<li>A <strong>master node</strong> contains most of the components which make up our control plane that we'll discuss below. In most moderate-sized clusters you'll only have a single master node, although it is possible to have multiple master nodes for high availability. If you use a cloud provider's managed Kubernetes service, they will typically abstract away the master node and you will not have to manage or pay for it.</li>
<li>A <strong>worker node</strong> is a machine which actually runs our application workloads. It is possible to have multiple different machine types tailored to different types of workloads on your cluster. For example, you might have some GPU-optimized nodes for faster model training and then use CPU-optimized nodes for serving. When you define object specifications, you can specify a preference as to what type of machine the workload gets assigned to.</li>
</ul>
<p>Now let's dive into the main components on our master node. When you're communicating with Kubernetes to provide a new or updated object specification, you're talking to the <strong>API server</strong>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/api-server.png" alt="api-server"></p>
<p>More specifically, the API server validates requests to update objects and acts as the unified interface for questions about our cluster's current state. However, the <em>state</em> of our cluster is stored in <strong>etcd</strong>, a distributed key-value store. We'll use etcd to persist information such as our cluster configuration, object specifications, object statuses, the nodes on the cluster, and which nodes each object is assigned to run on.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/etcd.png" alt="etcd"></p>
<blockquote>
<p>Note: etcd is the only stateful component in our control plane; all other components are stateless.</p>
</blockquote>
<p>Speaking of where objects should be run, the <strong>scheduler</strong> is in charge of determining this! The scheduler asks the API server (which in turn communicates with etcd) which objects haven't yet been assigned to a machine, determines which machines those objects should be assigned to, and reports the assignments back to the API server (which propagates them to etcd).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/scheduler.png" alt="scheduler"></p>
<p>The last component on the master node that we'll discuss in this post is the <strong>controller-manager</strong>, which monitors the cluster through the API server to check whether its current state aligns with our desired state. If the actual state differs from the desired state, the controller-manager makes changes via the API server in an attempt to drive the cluster towards the desired state. The controller-manager is composed of a collection of <strong>controllers</strong>, each of which is responsible for managing objects of a specific resource type on the cluster. At a very high level, a controller watches a specific resource type (eg. deployments) stored in etcd and creates specifications for the pods which should be run to achieve the object's desired state. It is then the controller's responsibility to ensure that these pods stay healthy while running and are shut down when needed.</p>
<p>To summarize what we've covered so far...<br>
<img src="https://www.jeremyjordan.me/content/images/2019/11/master-node.png" alt="master-node"></p>
<p>Next, let's discuss the control plane components which are run on worker nodes. Most of the resources available on our worker nodes are spent running our actual applications, but our nodes do need to know which pods they should be running and how to communicate with pods on other machines. The two final components of the control plane that we'll discuss cover exactly these two concerns.</p>
<p>The <strong>kubelet</strong> acts as a node's &quot;agent&quot; which communicates with the API server to see which container workloads have been assigned to the node. It is then responsible for spinning up pods to run these assigned workloads. When a node first joins the cluster, kubelet is responsible for announcing the node's existence to the API server so the scheduler can assign pods to it.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/kubelet.png" alt="kubelet"></p>
<p>Lastly, <strong>kube-proxy</strong> enables containers to be able to communicate with each other across the various nodes on the cluster. This component handles all the networking concerns such as how to forward traffic to the appropriate pod.</p>
<p>Hopefully, by this point you should be able to start seeing the whole picture of how things operate in a Kubernetes cluster. All of the components interact through the API server, and we store the state of our cluster in etcd. Various components write to etcd (via the API server) to make changes to the cluster, and nodes on the cluster listen to etcd (via the API server) to see which pods they should be running.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/full_picture-1.png" alt="full_picture"></p>
<p>The overall system is designed such that failures have minimal impact on the overall cluster. For example, if our master node went down, none of our applications would be immediately affected; we would just be prevented from making any further changes to the cluster until a new master node is brought online.</p>
<hr>
<p><a id="nah"></a></p>
<h2 id="whenshouldyounotusekubernetes">When should you not use Kubernetes?</h2>
<p>As with every new technology, you'll incur an overhead cost of adoption as you learn how it works and how it applies to the applications that you're building. It's a reasonable question to ask &quot;do I really need Kubernetes?&quot; so I'll attempt to provide some example situations where the answer might be no.</p>
<ul>
<li>You can run your workload on a single machine. <em>(Kubernetes can be viewed as a platform for building distributed systems, but you shouldn't build a distributed system if you don't need one!)</em></li>
<li>Your compute needs are light. <em>(In this case, the compute spent on the orchestration framework is relatively high!)</em></li>
<li>You don't need high availability and can tolerate downtime.</li>
<li>You don't envision making a lot of changes to your deployed services.</li>
<li>You already have an effective tool stack that you're satisfied with.</li>
<li>You have a monolithic architecture and don't plan to <a href="https://martinfowler.com/bliki/MonolithFirst.html">separate it into microservices</a>. <em>(This goes back to using the tool as it was intended to be used.)</em></li>
<li>You read this post and thought &quot;holy shit this is complicated&quot; rather than &quot;holy shit this is useful&quot;.</li>
</ul>
<hr>
<h2 id="gratitude">Gratitude</h2>
<p>Thank you to Derek Whatley and Devon Kinghorn for teaching me most of what I know about Kubernetes and answering my questions as I've been trying to wrap my head around this technology. Thank you to Goku Mohandas, John Huffman, Dan Salo, and Zack Abzug for spending your time to review an early version of this post and provide thoughtful feedback. And lastly, thank you to Kelsey Hightower for all that you've contributed to the Kubernetes community - your talks have helped me understand the bigger picture and have given me confidence that I could learn about this topic.</p>
<hr>
<p><a id="resources"></a></p>
<h2 id="resources">Resources</h2>
<p>Here are some resources that I found to be useful when learning about Kubernetes.</p>
<p><strong>Blogs</strong></p>
<ul>
<li><a href="https://jvns.ca/blog/2017/10/05/reasons-kubernetes-is-cool/">Julia Evans - Reasons Kubernetes is cool</a></li>
<li><a href="https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/">Julia Evans - A few things I've learned about Kubernetes</a> (Julia's zines were a big inspiration for my visuals explaining the Kubernetes control plane)</li>
<li><a href="https://blog.jessfraz.com/post/you-might-not-need-k8s/">Jessie Frazelle - You might not need Kubernetes</a></li>
<li><a href="https://medium.com/faun/is-kubernetes-overkill-ac7796d18b6f">Matt Rogish - Is Kubernetes Overkill?</a></li>
<li><a href="https://mattturck.com/2019trends/">Major Trends in the 2019 Data &amp; AI Landscape</a></li>
<li><a href="https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/introduction">Introduction to cloud-native applications</a> and <a href="https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/definition">defining cloud native</a></li>
</ul>
<p><strong>Videos</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=u_iAXzy3xBA">Kelsey Hightower - Kubernetes for Pythonistas</a> discusses the motivation for Kubernetes and provides an example of running a Python application.</li>
<li><a href="https://www.youtube.com/watch?v=8SvQqZNP6uo">Kubernetes by Kelsey Hightower</a> introduces the core components of Kubernetes and how they work together, with the API server at the core.</li>
<li><a href="https://www.youtube.com/watch?v=kOa_llowQ1c">Kubernetes The Easy Way!</a> presents a developer-centric workflow for building products and leveraging Kubernetes as the infrastructure.</li>
<li><a href="https://www.youtube.com/watch?v=UUt7SuG3nW4">Kubernetes in Real Life - Ian Crosby</a></li>
<li><a href="https://www.youtube.com/watch?v=ZuIQurh_kDk">Kubernetes Design Principles: Understand the Why - Saad Ali, Google</a></li>
<li><a href="https://www.youtube.com/watch?v=90kZRyPcRZw">Kubernetes Deconstructed: Understanding Kubernetes by Breaking It Down - Carson Anderson, DOMO</a></li>
<li><a href="https://www.youtube.com/watch?v=uRvKGZ_fDPU&amp;feature=youtu.be">From COBOL to Kubernetes: A 250 Year Old Bank's Cloud-Native Journey - Laura Rehorst</a></li>
</ul>
<p><strong>Books</strong></p>
<ul>
<li><a href="https://azure.microsoft.com/en-us/resources/kubernetes-up-and-running/">Kubernetes: Up and Running, Second Edition</a> - free book!</li>
<li><a href="https://mapr.com/ebook/kubernetes-for-machine-learning-deep-learning-and-ai/assets/Cloud_EB_Kubernetes_MLDL_SPONSOR.pdf">Kubernetes for Machine Learning, Deep Learning &amp; AI</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Building machine learning products: a problem well-defined is a problem half-solved.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I wrote about <a href="https://www.jeremyjordan.me/ml-projects-guide/">organizing machine learning projects</a> where I presented the framework that I use for building and deploying models. However, that framework operates on the implicit assumption that you already know generally what your model should do. In this post, we'll dig deeper into how to <strong>develop the</strong></p>]]></description><link>https://www.jeremyjordan.me/ml-requirements/</link><guid isPermaLink="false">5d7efa4862e62e0038ee7e7d</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Sun, 22 Sep 2019 00:09:49 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Previously, I wrote about <a href="https://www.jeremyjordan.me/ml-projects-guide/">organizing machine learning projects</a> where I presented the framework that I use for building and deploying models. However, that framework operates on the implicit assumption that you already know generally what your model should do. In this post, we'll dig deeper into how to <strong>develop the requirements for a machine learning project</strong> when you're given a vague problem to solve. Some questions that we'll address include:</p>
<ul>
<li>What specific task should our model be automating?</li>
<li>How does the user interact with the model?</li>
<li>What information should we expose to the user?</li>
</ul>
<p><em><strong>Note:</strong> Sometimes machine learning projects can be very straightforward; your stakeholders define an API specification stating the inputs to the system and the desired outputs and you agree that the task seems feasible. These projects typically support existing products with existing &quot;intelligence&quot; solutions - your task is to simply encapsulate the &quot;intelligence&quot; task using machine learning in lieu of the existing solution (ie. rule-based systems, mechanical turk workers, etc).</em></p>
<p>Throughout this blog post, I'll use the following problem statement as a running example.</p>
<blockquote>
<p>We're building an application to help people keep their photos organized. Our target user base are casual smartphone photographers who have a large number of photos in their camera roll. We want to make it easier for these users to find photos of interest.</p>
</blockquote>
<p>Notice how vague that problem statement is - what defines a photo of interest? There's a multitude of ways we could address this task and without understanding the problem in more detail we won't know which direction to take. At this point in time, we have insufficient information to specify an objective function for training a model.</p>
<p>Understanding the problem and developing the requirements isn't something you typically get right on the first attempt; this is often an <em>iterative process</em> where we initially define a set of coarse requirements and refine the detail as we gain more information.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/design-process.png" alt="design-process"></p>
<p><strong>Jump to:</strong></p>
<ul>
<li><a href="#understand">Understand the problem from the perspective of the user</a></li>
<li><a href="#iterate">Mock out your machine learning model and iterate on the user experience</a></li>
<li><a href="#language">Develop a shared language with your project stakeholders</a></li>
<li><a href="#shipping">Win by shipping</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<hr>
<p><a id="understand"></a></p>
<h2 id="understandtheproblemfromtheperspectiveoftheuser">Understand the problem from the perspective of the user.</h2>
<p>The first step towards establishing any set of requirements for a project is understanding the problem you're setting out to solve. No one knows the problem better than the intended user - the person we are attempting to solve the problem for.</p>
<p><strong>Perform informational interviews with end users to understand their perspective.</strong> The further removed you are from the end user, the less likely you are to solve the actual problem they're experiencing.</p>
<p>Do you remember playing the telephone game as a kid, where you have a chain of people and try to deliver a message from one end to the other by asking each person to transmit the message to the person next to them? Usually the message at the end is very different from the original message. In order to ensure you're solving the right problem, you'll want to be able to empathize with the people who currently experience that problem.</p>
<p>For the case of our photo app example, this would involve speaking with casual smartphone photographers and understanding how they use their app.</p>
<ul>
<li>When they're searching through their camera roll, what are they typically looking for?
<ul>
<li>Are they Instagram influencers who take 100 photos of the same thing and are trying to find the <em>best</em> photo to post?</li>
<li>Are they nature photographers looking to share the scenic views from their latest camping trip?</li>
<li>Are they recent parents trying to document the development of their newborn baby?</li>
<li>Are they often looking for photos with a <em>specific</em> friend to share on social media with a happy birthday wish?</li>
</ul>
</li>
<li>What are their current strategies for finding the photos of interest?
<ul>
<li>Do they know when the photo was taken?</li>
<li>Do they flip through their camera roll photo by photo or scroll through a series of thumbnails?</li>
<li>Are they typically searching for a photo of a specific person?</li>
<li>Do they mentally &quot;chunk&quot; photos together and search chunk by chunk? What's the criteria for chunking?</li>
</ul>
</li>
</ul>
<p>At this stage you don't want to start prescribing a solution, you're simply trying to understand the problem. However, it can be good to consider the capabilities of machine learning systems and ask questions which may eventually guide your scoping of a solution. For example, let's consider the motivation behind asking the first set of questions.</p>
<ul>
<li>Are they Instagram influencers who take 100 photos of the same thing and are trying to find the <em>best</em> photo to post? <em>Perhaps we want to consider clustering similar photos and then applying a ranking model to help them find the best photo to share.</em></li>
<li>Are they nature photographers looking to share the scenic views from their latest camping trip? <em>Do we need to consider additional metadata such as the GPS coordinates of where the photo was taken to enrich the data representation?</em></li>
<li>Are they recent parents trying to document the development of their newborn baby? <em>Perhaps we want to perform action recognition on the photographs to help the parents find the moment where their newborn walked for the first time or did a cute dance.</em></li>
<li>Are they often looking for photos with a <em>specific</em> friend to share on social media with a happy birthday wish? <em>Maybe we'll need to use facial recognition to identify the user's friends.</em></li>
</ul>
<p>Even though you might be considering solutions, it's important that the conversations with users are focused on the problems they experience. You won't ask them about their thoughts on specific solutions until the next stage.</p>
<p><strong>Subject yourself to the problem.</strong> As a machine learning practitioner, it can be tempting to jump right into training a model to learn some task. However, it's often very instructive to first force yourself to perform the task manually. Additionally, this helps you <em>empathize</em> with the users. Pay close attention to how <em>you</em> solve the task, as this might inform what features might be important to include when you do train a model to perform the task.</p>
<p>For example, after speaking with some casual smartphone photographers you might construct a couple photo albums and go through the tasks described during your interviews. What strategies did <em>you</em> find effective for finding content more quickly?</p>
<hr>
<p><a id="iterate"></a></p>
<h2 id="mockoutyourmachinelearningmodelanditerateontheuserexperience">Mock out your machine learning model and iterate on the user experience.</h2>
<p>After (and <em><strong>only</strong></em> after) getting a better understanding of the problem, you'll want to start sketching out the set of possible solutions and approaches that you could take. It's generally a good idea to think broadly at this stage rather than prematurely homing in on your first decent idea. Here, we're trying to elicit the <em>desired user experience</em>, as this can ultimately drive the requirements of the project.</p>
<p><strong>Prototype and iterate on the user experience using design tools to communicate possible solutions.</strong> There's something magical about seeing an idea concretely; it elicits tangible feedback from your users. When speaking in the abstract, it's possible for both you and the other stakeholders to have different understandings whilst under the illusion that you're in agreement. However, these latent misunderstandings often become very clear when you reduce an abstract idea to practice.</p>
<p>For example, I used a design tool called Figma to sketch out a couple different ways we might help users find photos of interest more quickly. The goal of these sketches is to spark discussion with our stakeholders and/or users in an attempt to start narrowing down the possible solution space.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/four-ideas.png" alt="four-ideas"></p>
<p>These tools allow you to easily stitch together user flows and import data to use in your mockups.</p>
<p><strong>Fake the machine learning component with &quot;Wizard-of-Oz&quot; experiments.</strong> Building a machine learning model takes a significant amount of work. You have to acquire data for training, decide on a model architecture, ensure the model is performing at a sufficient level, etc. Our goal here is to validate the <em>utility</em> of a model without actually going through all of the effort involved in building it. These types of experiments can be especially useful when there's still uncertainty surrounding how users will interact with your model.</p>
<p>Apple researchers <a href="https://arxiv.org/abs/1904.01664">wrote about one such study</a> when deciding whether their digital assistant should mimic the conversational style of the user. Their core hypothesis was that users would prefer digital assistants that match their level of chattiness when conversing with the agent. Rather than training a machine learning model that would allow them to modulate the &quot;chattiness&quot; level in the agent's response, they first tested this hypothesis by faking the digital assistant component and having humans in a separate room follow a script pretending to be the digital assistant.</p>
<p><strong>Figure out how to establish trust with the user.</strong> Ideally you'd like to design your product such that user interactions can improve your model, which in turn improves the user experience. This is commonly referred to as the &quot;data flywheel&quot; in machine learning products. However, in order to source meaningful interactions from your users, you may need to first establish trust that those interactions will indeed improve their experience. For example, if we decided to perform facial recognition and allow users to search for photos with a specific person present, we'd likely require the user to assign identities to collections of photos with a common face. In order to motivate and incentivize users, they need to feel as if their effort in identifying faces in photos is meaningful. One common technique here is to <em>show the model's effort</em> in an unobtrusive manner which informs the user how it's working. Continuing the facial recognition example, you might allow the user to toggle a setting which draws bounding boxes around the identified faces in photos.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/faces_toggle.png" alt="faces_toggle"></p>
<hr>
<p><a id="language"></a></p>
<h2 id="developasharedlanguagewithyourprojectstakeholders">Develop a shared language with your project stakeholders.</h2>
<p>At this point, you're likely to have a decent grasp on the approach you'll take to solve the presented problem. However, much work still lies ahead! It'll be important that you <em>communicate effectively</em> with stakeholders as you work to build out your solution; this starts with speaking a common language.</p>
<p>Perhaps a number of users leverage a technique of <em>chunking</em> photos together when searching, while you typically refer to that same technique as <em>clustering</em>. Without converging on a shared language, it's all too easy to talk right past one another without realizing you're actually on the same page.</p>
<p><strong>Over-communicate and ask a lot of clarifying questions at the onset.</strong> This can include asking questions when you're already pretty sure what the answer will be; sometimes you may be surprised.</p>
<p><strong>Present ideas and progress often.</strong> It's important to share progress with your stakeholders and hold discussions to ensure you're still heading in the right direction. As you present your work, it can be helpful to discuss things such as the model metrics you're using to evaluate performance and <em>why we should care</em> about those metrics; keeping things simple combined with repeated exposure can go a long way here.</p>
<hr>
<p><a id="shipping"></a></p>
<h2 id="winbyshipping">Win by shipping.</h2>
<p>Getting your product in the hands of your users is one of the best ways to validate your ideas. Quicker iterations allow you to validate more ideas, which in turn allows you to fine-tune your solution offering in order to maximize value to the user. Further, incremental deployments help ensure that you're heading in the right direction as you work to build out the full solution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/giphy.gif" alt="giphy"></p>
<p>In my post on <a href="https://www.jeremyjordan.me/ml-projects-guide/">organizing machine learning projects</a>, I presented a diagram for model development which represents the iterative nature of the work.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/ml-development-cycle-1.png" alt="ml-development-cycle-1"></p>
<p>I also called out <a href="https://twitter.com/Smerity/status/1095490777860304896">Stephen Merity's advice</a> and Martin Zinkevich's <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">Rule #4 of Machine Learning</a>, which both advocate for initially deploying simple models in the spirit of winning by shipping. In this context, <em>&quot;winning by shipping&quot;</em> involves making quick iterations through the outermost development loop.</p>
<p>It can take getting a few projects under your belt to fully appreciate this wisdom. When I initially published the diagram a year ago, I tended to view progression through this iterative flow as a <a href="https://en.wikipedia.org/wiki/Phase-gate_process">stage-gate process</a> where you might loop back to an earlier step if you gain more information, but you still <em>advance</em> through the cycle in the order that sections appear on the diagram. However, after joining the machine learning team at Proofpoint, I've learned from the team that there can be immense value in shortcutting parts of the development process (eg. see <em>shadow mode deployment</em> below) in order to gain insights from deploying models on production data.</p>
<p><strong>Deploy a baseline model on production data as soon as possible.</strong> Deploying your model on production data can be enlightening. Oftentimes, the &quot;live data&quot; varies in unexpected ways from the data you've collected during development (often referred to as &quot;train/test skew&quot; or &quot;production data drift&quot;).</p>
<p>As a countermeasure, it's often a good idea to deploy a simple model <em>on production data</em> as soon as possible. Depending on the consequence of wrong predictions, you might choose to deploy this simple model in &quot;shadow mode&quot; (don't actually use the predictions) or as a canary deployment on a small subset of your users. The goal of this deployment is to observe and characterize the types of errors that the simple model makes, which can inform further model improvements.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/shadow-mode-1.png" alt="shadow-mode-1"><br>
<small>You may choose to initially skip some steps in the model development process in order to prioritize a quick first iteration and gain insights.</small></p>
<p>Deploying a simple model with urgency also helps ensure that the additional engineering work required to run a machine learning model is done upfront. This allows you to <em>deploy</em> your incremental model improvements (enabling quick iterations) rather than slaving away trying to build the perfect model in isolation.</p>
<p><strong>Deliver value incrementally and quickly.</strong> If you're automating a system, start with the simplest task and deploy a solution quickly. Let data guide the priority order of further automation. In other words, automate first according to the &quot;happy path&quot; (assume nothing goes wrong) and then categorize the times where control is reverted to a human.</p>
<p>Going back to our photo app example, there are <em>many</em> opportunities to apply machine learning models to the product. Ideally, we'd like to start at the intersection of a model which is simple to develop and provides value to users. For example, we might decide to initially deploy a facial recognition model given the fact that there are a number of open source implementations that we could use and user studies provided us with some level of evidence that this model would be valuable as part of the product.</p>
<p><strong>Measure time to results, not results.</strong> Sometimes it can be tricky to get into the mindset of delivering value quickly. It's easy to think that in order to delight the user, we need to give them the <em>perfect</em> product. Measuring <em>time to results</em> rather than the results themselves instills a culture of quick iterations which ultimately enables your team to deliver better products, winning by shipping.</p>
<hr>
<p><a id="conclusion"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>Oftentimes we're presented with vague problems and tasked with developing a &quot;machine learning solution.&quot; By spending the effort to properly scope your project and define its requirements up front, you establish a foundation for <em>smooth iterations</em> through the model development loop as you work towards a final solution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/full-diagram.png" alt="full-diagram"></p>
<p>Now go build great things!</p>
<hr>
<p><em>The insights shared in this blog post are shaped by my experience working on teams building and deploying machine learning models in products. On my current team at Proofpoint, we frequently discuss how we can refine our process to deliver value quickly on machine learning projects; these conversations and shared experiences are invaluable. Thank you to <a href="http://jnbrymn.com/">John Berryman</a>, <a href="https://dancsalo.github.io/">Dan Salo</a>, John Huffman, and Michelle Carney for reading early drafts of this post and providing feedback.</em></p>
<h2 id="externalresources">External Resources</h2>
<p><strong>Blog posts</strong></p>
<ul>
<li><a href="https://medium.com/@neal_lathia/machine-learning-faster-ce404dc4cd2">Machine learning, faster</a> (introduced me to the concept of &quot;measure time to results, not results&quot;)</li>
<li><a href="http://radar.oreilly.com/2012/07/data-jujitsu.html">Data Jujitsu: The art of turning data into product</a></li>
<li><a href="https://www.oreilly.com/radar/what-you-need-to-know-about-product-management-for-ai/">What you need to know about product management for AI</a></li>
<li><a href="https://www.oreilly.com/radar/practical-skills-for-the-ai-product-manager/">Practical Skills for The AI Product Manager</a></li>
<li><a href="https://hackernoon.com/product-driven-machine-learning-and-parking-tickets-in-nyc-4a3b74cfe496">Product Driven Machine Learning</a></li>
<li><a href="https://medium.com/google-design/human-centered-machine-learning-a770d10562cd">Human-Centered Machine Learning</a></li>
<li><a href="http://www.r2d3.us/talks/design-in-a-world-where-machines-are-learning/">Design in a world where Machines are Learning</a></li>
<li><a href="https://medium.com/microsoft-design/user-research-makes-your-ai-smarter-70f6ef6eb25a">User Research Makes Your AI Smarter</a></li>
<li><a href="https://www.vindhyac.com/posts/best-prd-templates-from-companies-we-adore/">What is the best way to write a Product Requirements Document?</a></li>
</ul>
<p><strong>Case studies</strong></p>
<ul>
<li><a href="https://design.google/library/control-and-simplicity/">Control and Simplicity in the Age of AI</a></li>
<li><a href="https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/">150 successful machine learning models: 6 lessons learned at Booking.com</a></li>
</ul>
<p><strong>Talks</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=O5Gx1CFp0-Y">Triptech: A Method for Evaluating Early Design Concepts</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLUW5utnrTMQc2VMdpf8cPz9qQpHGWL-iu">AI x Design - Youtube Playlist</a></li>
</ul>
<p><strong>Guides</strong></p>
<ul>
<li><a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/01/Guidelines-for-Human-AI-Interaction-camera-ready.pdf">Guidelines for Human-AI Interaction</a> (and also check out the pretty infographic <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/03/AI_Guidelines_Poster_PrintQuality.pdf">here</a>)</li>
<li><a href="http://aimeets.design/">AI Meets Design</a></li>
<li><a href="https://pair.withgoogle.com/">Google's People+AI Guidebook: Designing human-centered AI products</a></li>
<li><a href="https://designwith.ml/">Designing Machine Learning</a></li>
<li><a href="https://uxdesign.cc/human-centered-ai-cheat-sheet-1da130ba1bab">Human-centered AI cheat-sheet</a></li>
</ul>
<p><strong>People to follow</strong></p>
<ul>
<li>Michelle Carney <a href="https://twitter.com/michellercarney">@michellercarney</a> works at the intersection of UX+ML and leads the <a href="https://www.youtube.com/channel/UC8WlnMDt6LdqoZhnPakagXw">MLUX meetup group</a>.</li>
<li>Jess Holbrook <a href="https://medium.com/@jessholbrook">@jessholbrook</a> is the co-lead of Google’s People + AI Research team; he has a ton of great articles on Medium.</li>
<li>Nadia Piet <a href="https://twitter.com/NadiaPiet">@NadiaPiet</a> works as a freelance design consultant; she put together the phenomenal <a href="http://aimeets.design/">aimeets.design</a> toolkit referenced above.</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Introduction to recurrent neural networks.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><strong>Jump to:</strong></p>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#evolving">Evolving a hidden state over time</a></li>
<li><a href="#common_structures">Common structures of recurrent networks</a></li>
<li><a href="#bidirectional">Bidirectionality</a></li>
<li><a href="#limitations">Limitations</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p><a id="overview"></a></p>
<h2 id="overview">Overview</h2>
<p>Previously, I've written about <a href="https://www.jeremyjordan.me/intro-to-neural-networks/">feed-forward neural networks</a> as a generic function approximator and <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a> for efficiently extracting local information from data. In this post, I'll discuss a third type</p>]]></description><link>https://www.jeremyjordan.me/introduction-to-recurrent-neural-networks/</link><guid isPermaLink="false">5ce76c381b7d2c00bfcbbdce</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Sun, 09 Jun 2019 04:53:51 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><strong>Jump to:</strong></p>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#evolving">Evolving a hidden state over time</a></li>
<li><a href="#common_structures">Common structures of recurrent networks</a></li>
<li><a href="#bidirectional">Bidirectionality</a></li>
<li><a href="#limitations">Limitations</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p><a id="overview"></a></p>
<h2 id="overview">Overview</h2>
<p>Previously, I've written about <a href="https://www.jeremyjordan.me/intro-to-neural-networks/">feed-forward neural networks</a> as a generic function approximator and <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a> for efficiently extracting local information from data. In this post, I'll discuss a third type of neural network, the recurrent neural network, for learning from sequential data.</p>
<p>For some classes of data, the order in which we receive observations is important. As an example, consider the two following sentences:</p>
<ol>
<li>&quot;I'm sorry... it's not you, it's me.&quot;</li>
<li>&quot;It's not me, it's you... I'm sorry.&quot;</li>
</ol>
<p>These two sentences communicate quite different messages, but the difference can only be interpreted by considering the sequential order of the words. Without this information, we can't tell the two apart given only the unordered collection of words: <code>{'you', 'sorry', 'me', 'not', 'im', 'its'}</code>.</p>
<p>Recurrent neural networks allow us to formulate the learning task in a manner which considers the sequential order of individual observations.</p>
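<p>To make the ambiguity concrete, here's a small Python sketch (using a hypothetical <code>bag_of_words</code> helper) showing that both sentences collapse to the exact same unordered collection of words:</p>

```python
import re
from collections import Counter

def bag_of_words(sentence):
    # Lowercase and keep only letter runs, discarding order and punctuation.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

s1 = "I'm sorry... it's not you, it's me."
s2 = "It's not me, it's you... I'm sorry."

# Both sentences collapse to the same bag of words.
print(bag_of_words(s1) == bag_of_words(s2))  # True
```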
<p><a id="evolving"></a></p>
<h2 id="evolvingahiddenstateovertime">Evolving a hidden state over time</h2>
<p>In this section, we'll build the intuition behind recurrent neural networks. We'll start by reviewing standard feed-forward neural networks and build a simple mental model of how these networks learn. We'll then build on that to discuss how we can extend this model to a sequence of related inputs.</p>
<p>Recall that neural networks perform a series of layer by layer transformations to our input data. The hidden layers of the network form <em>intermediate representations</em> of our input data which make it easier to solve the given task.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/Screen-Shot-2019-05-27-at-11.39.48-AM.png" alt="Screen-Shot-2019-05-27-at-11.39.48-AM"></p>
<p>This is demonstrated in the example below. Observe how our input space is warped into one which allows for a linear decision boundary to cleanly separate the two classes. At a high level, you can think of the hidden layers as <strong>&quot;useful representations&quot;</strong> of the original input data.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/netvis.png" alt="netvis"><br>
<small><a href="https://colah.github.io/posts/2015-09-NN-Types-FP/">Image credit</a></small></p>
<p>Now let's consider how we can leverage this insight for a sequence of related observations.</p>
<p>Let's first focus on the initial value in the sequence. As we calculate the forward pass through the network, we build a &quot;useful representation&quot; of our input in the hidden layers (the activations in these layers define our <strong>hidden state</strong>), continuing on to calculate an output prediction for the initial time-step.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/intuition-1.png" alt="intuition-1"></p>
<p>When considering the next time-step in the sequence, we want to leverage any information we've already extracted from the sequence.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/intuition-2.png" alt="intuition-2"></p>
<p>In order to do this, our next hidden state will be calculated as a <strong>combination</strong> of the previous hidden state and latest input.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/intuition-3.png" alt="intuition-3"></p>
<p>The basic method for combining these two pieces of information is shown below; however, there exist other more advanced methods that we'll discuss later (gated recurrent units, long short-term memory units). Here, we have one set of weights $w_{ih}$ to transform the input to a hidden layer representation and a second set of weights $w_{hh}$ to bring along information from the previous hidden state into the next time-step.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/intuition-4.png" alt="intuition-4"></p>
<p>We can continue performing this <em>same calculation</em> of incorporating new information to update the value of the hidden state for an <em>arbitrarily long sequence</em> of observations.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/intuition-5.png" alt="intuition-5"></p>
<p>By always remembering the previous hidden state, we're able to chain a sequence of events together. This also allows us to backpropagate errors to earlier timesteps during training, often referred to as &quot;backpropagation through time&quot;.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-01-at-12.21.53-PM.png" alt="Screen-Shot-2019-06-01-at-12.21.53-PM"></p>
<p><a id="common_structures"></a></p>
<h2 id="commonstructuresofrecurrentnetworks">Common structures of recurrent networks</h2>
<p>One of the benefits of recurrent neural networks is the ability to handle arbitrary length inputs and outputs. This flexibility allows us to define a broad range of tasks. In this section, I'll discuss the general architectures used for various sequence learning tasks.</p>
<p><strong>One to many</strong> RNNs are used in scenarios where we have a single input observation and would like to generate an arbitrary length sequence related to that input. One example of this is image captioning, where you feed in an image as input and output a sequence of words to describe the image. For this architecture, we take our prediction at each time step and feed that in as input to the next timestep, iteratively generating a sequence from our initial observation and following predictions.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/rnn-one-to-many.png" alt="rnn-one-to-many"><br>
<small><a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#overview">Image credit</a></small></p>
<p><strong>Many to one</strong> RNNs are used to look across a sequence of inputs and make a single determination from that sequence. For example, you might look at a sequence of words and predict the sentiment of the sentence. Generally, this structure is used when you want to perform classification on sequences of data.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/rnn-many-to-one.png" alt="rnn-many-to-one"><br>
<small><a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#overview">Image credit</a></small></p>
<p><strong>Many to many (same)</strong> RNNs are used for tasks in which we would like to predict a label for each observation in a sequence, sometimes referred to as dense classification. For example, if we would like to detect named entities (person, organization, location) in sentences, we might produce a label for every single word denoting whether or not that word is part of a named entity. As another example, you could feed in a video (sequence of images) and predict the current activity in frame.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/rnn-many-to-many-same.png" alt="rnn-many-to-many-same"><br>
<small><a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#overview">Image credit</a></small></p>
<p><strong>Many to many (different)</strong> RNNs are useful for translating a sequence of inputs into a different but related sequence of outputs. In this case, both the input and the output can be arbitrary length sequences and the input length might not always be equal to the output length. For example, a machine translation model would be expected to translate &quot;how are you&quot; (input) into &quot;cómo estás&quot; (output) even though the sequence lengths are different.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/11/rnn-many-to-many-different.png" alt="rnn-many-to-many-different"><br>
<small><a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#overview">Image credit</a></small></p>
<p><a id="bidirectional"></a></p>
<h2 id="bidirectionality">Bidirectionality</h2>
<p>One of the weaknesses of an ordinary recurrent neural network is that we can only use the set of observations which we have already seen when making a prediction. As an example, consider training a model for named entity recognition. Here, we want the model to output the start and end of phrases which contain a named entity. Consider the following two sentences:</p>
<blockquote>
<p>&quot;I can't believe that Teddy Roosevelt was your great grandfather!&quot;</p>
</blockquote>
<blockquote>
<p>&quot;I can't believe that Teddy bear is made out of chocolate!&quot;</p>
</blockquote>
<p>In the first sentence, &quot;Teddy&quot; begins a person's name; in the second, it doesn't. If you only read the input sequence from left to right, it's hard to tell whether you should mark &quot;Teddy&quot; as the start of a name.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-03-at-10.42.54-PM.png" alt="Screen-Shot-2019-06-03-at-10.42.54-PM"></p>
<p>Ideally, our model output would look something like this when reading the first sentence (roughly following the <a href="https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)">inside–outside–beginning tagging</a> format).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-03-at-10.47.38-PM.png" alt="Screen-Shot-2019-06-03-at-10.47.38-PM"></p>
<p>When determining whether or not a token is the start of a name, it would sure be helpful to see which tokens follow after it; a <strong>bidirectional</strong> recurrent neural network provides exactly that. Here, we process the sequence reading from left-to-right and right-to-left in parallel and then combine these two representations such that at any point in a sequence you have knowledge of the tokens which came before <em><strong>and</strong></em> after it.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-03-at-11.12.52-PM.png" alt="Screen-Shot-2019-06-03-at-11.12.52-PM"></p>
<p>We have one set of recurrent cells which process the sequence from left to right...</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-03-at-11.13.23-PM.png" alt="Screen-Shot-2019-06-03-at-11.13.23-PM"></p>
<p>... and another set of recurrent cells which process the sequence from right to left.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-03-at-11.13.43-PM.png" alt="Screen-Shot-2019-06-03-at-11.13.43-PM"></p>
<p>Thus, at any given time-step we have knowledge of all of the tokens which came before the current time-step <em><strong>and</strong></em> all of the tokens which came after that time-step.</p>
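<p>Under the same toy assumptions as before (a simple tanh recurrence), a bidirectional pass can be sketched as two scans whose states are concatenated per time-step. Note that a real bidirectional RNN uses a separate set of weights for each direction; they're shared here only to keep the sketch short:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
w_ih = rng.normal(scale=0.1, size=(hidden_size, input_size))
w_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

def scan(sequence):
    # Collect the hidden state at every time-step of one direction.
    h, states = np.zeros(hidden_size), []
    for x_t in sequence:
        h = np.tanh(w_ih @ x_t + w_hh @ h)
        states.append(h)
    return states

sequence = rng.normal(size=(5, input_size))
forward = scan(sequence)               # left-to-right
backward = scan(sequence[::-1])[::-1]  # right-to-left, then re-aligned
# Each combined state sees tokens before *and* after its time-step.
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```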
<p><a id="limitations"></a></p>
<h2 id="limitations">Limitations</h2>
<p>One key component that I glossed over previously is that the recurrent layer's weights are <em><strong>shared</strong></em> across time-steps. This provides us with the flexibility to process arbitrary length sequences, but also introduces a unique challenge when training the network.</p>
<p>For a concrete example, suppose you've trained a recurrent neural network as a language model (predict the next word in a sequence). As you're generating text, it might be important to know whether the current word is inside quotation marks. Let's assume this is true and consider the case where our model makes a wrong prediction because it wasn't paying attention to whether or not the current time-step is inside quotation marks. Ideally, you want a way to send back a signal to the earlier time-step where we entered the quotation mark to say &quot;pay attention!&quot; to avoid the same mistake in the future. Doing so requires sending our error signal back through <strong>many time-steps</strong>. (As an aside, Karpathy has a <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">famous blog post</a> which shows that a character-level RNN language model can indeed pay attention to this detail.)</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-08-at-5.02.21-PM.png" alt="Screen-Shot-2019-06-08-at-5.02.21-PM"><br>
<small><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Image credit</a></small></p>
<p>Let's consider what the backpropagation step would look like to send this signal to earlier time-steps.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/06/Screen-Shot-2019-06-08-at-5.18.39-PM.png" alt="Screen-Shot-2019-06-08-at-5.18.39-PM"></p>
<p>As a reminder, the <a href="https://www.jeremyjordan.me/neural-networks-training/">backpropagation algorithm</a> states that we can define the relationship between a given layer's weights and the final loss using the following expression:</p>
<p>$$ \frac{{\partial E\left( w  \right)}}{{\partial w^{(l)}}} = {\left( {{\delta ^{(l + 1)}}} \right)^T}{a^{(l)}} $$</p>
<p>where ${\delta ^{(l)}}$ (our &quot;error&quot; term) can be calculated as:</p>
<p>$$ {\delta ^{(l)}} = {\delta ^{(l + 1)}}{w ^{(l)}}f'\left( {{a^{(l)}}} \right) $$</p>
<p>This allows us to efficiently calculate the gradient for any given layer by reusing the terms already computed at layer $l+1$. However, notice how there's a term for the weight matrix, ${w ^{(l)}}$, included in the computation at every layer. Now recall that I earlier mentioned recurrent layers share weights across time-steps. This means that the <strong>same exact value</strong> is being multiplied every time we perform this layer by layer backpropagation through time.</p>
<p>Let's suppose one of the weights in our matrix is 0.5 and we're attempting to send a signal back 10 time-steps. By the time we've backpropagated to $t-10$, we've multiplied the overall gradient expression by $0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 = 0.00098$. This has the effect of drastically reducing the magnitude of our error signal! This phenomenon is known as the <strong>&quot;vanishing gradient&quot; problem</strong>, which makes it very hard to learn long-range dependencies with a vanilla recurrent neural network. The same problem can occur when the weight is greater than one, introducing an exploding gradient, although this is slightly easier to manage thanks to a technique known as gradient clipping.</p>
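<p>The arithmetic above is easy to verify with a toy calculation, and the same scaling flips from vanishing to exploding once the weight exceeds one:</p>

```python
def gradient_scale(weight, num_steps):
    # The shared recurrent weight multiplies the error signal once
    # per time-step during backpropagation through time.
    scale = 1.0
    for _ in range(num_steps):
        scale *= weight
    return scale

print(gradient_scale(0.5, 10))  # 0.0009765625 -> vanishing
print(gradient_scale(1.5, 10))  # ~57.67       -> exploding
```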
<p>In following posts, we'll look at two common variations of the standard recurrent cell which alleviate this problem of a vanishing gradient.</p>
<p><a id="further_reading"></a></p>
<h2 id="furtherreading">Further reading</h2>
<p><strong>Papers</strong></p>
<ul>
<li><a href="http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf">Learning Long-Term Dependencies with Gradient Descent is Difficult</a></li>
<li><a href="https://arxiv.org/pdf/1211.5063.pdf">On the difficulty of training Recurrent Neural Networks</a></li>
<li><a href="https://arxiv.org/abs/1504.00941">A Simple Way to Initialize Recurrent Networks of Rectified Linear Units</a></li>
</ul>
<p><strong>Lectures/Notes</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=6niqTuYFZLQ">Stanford CS231n: Lecture 10 | Recurrent Neural Networks</a></li>
<li><a href="https://www.youtube.com/watch?v=yCC09vCHzF8">Stanford CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM</a></li>
<li><a href="https://www.youtube.com/watch?v=6niqTuYFZLQ">Stanford CS224n: Lecture 8: Recurrent Neural Networks and Language Models</a></li>
<li><a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks">Stanford CS230: Recurrent Neural Networks Cheatsheet</a></li>
<li><a href="https://www.youtube.com/watch?v=nFTQ7kHQWtc">MIT 6.S094: Recurrent Neural Networks for Steering Through Time</a></li>
<li><a href="https://www.cs.toronto.edu/~hinton/csc2535/notes/lec10new.pdf">University of Toronto CSC2535: Lecture 10 | Recurrent neural networks</a></li>
</ul>
<p><strong>Blog posts</strong></p>
<ul>
<li><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Scaling nearest neighbors search with approximate methods.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><strong>Jump to:</strong></p>
<ul>
<li><a href="#what">What is nearest neighbors search?</a></li>
<li><a href="#kd_tree">K-d trees</a></li>
<li><a href="#quantization">Quantization</a>
<ul>
<li><a href="#pq">Product quantization</a></li>
<li><a href="#multimodal">Handling multi-modal data</a></li>
<li><a href="#lopq">Locally optimized product quantization</a></li>
</ul>
</li>
<li><a href="#datasets">Common datasets</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p><a id="what"></a></p>
<h2 id="whatisnearestneighborssearch">What is nearest neighbors search?</h2>
<p>In the world of deep learning, we often use neural networks to learn <strong>representations of objects as vectors</strong>. We can then use</p>]]></description><link>https://www.jeremyjordan.me/scaling-nearest-neighbors-search-with-approximate-methods/</link><guid isPermaLink="false">5c4de12fc5d46700bfb57d38</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Mon, 04 Feb 2019 04:36:13 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p><strong>Jump to:</strong></p>
<ul>
<li><a href="#what">What is nearest neighbors search?</a></li>
<li><a href="#kd_tree">K-d trees</a></li>
<li><a href="#quantization">Quantization</a>
<ul>
<li><a href="#pq">Product quantization</a></li>
<li><a href="#multimodal">Handling multi-modal data</a></li>
<li><a href="#lopq">Locally optimized product quantization</a></li>
</ul>
</li>
<li><a href="#datasets">Common datasets</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p><a id="what"></a></p>
<h2 id="whatisnearestneighborssearch">What is nearest neighbors search?</h2>
<p>In the world of deep learning, we often use neural networks to learn <strong>representations of objects as vectors</strong>. We can then use these vector representations for a myriad of useful tasks.</p>
<p>To give a concrete example, let's consider the case of a facial recognition system powered by deep learning. For this use case, the objects are images of people's faces and the task is to identify whether or not the person in a submitted photo matches a person in a database of known identities. We'll use a neural network to build vector representations of all of the images; then, performing facial recognition is as simple as taking the vector representation of a submitted image (the query vector) and <strong>searching for similar vectors</strong> in our database. Here, we define similarity as vectors which are close together in vector-space. (How we actually train a network to produce these vector representations is outside of the scope of this blog post.)</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-27-at-10.00.13-PM.png" alt="Screen-Shot-2019-01-27-at-10.00.13-PM"><br>
<small>All of these vectors were extracted from a ResNet50 model. Notice how the values in the query vector are quite similar to the vector in the top left of known identities.</small></p>
<p>The process of finding vectors that are close to our query is known as <strong>nearest neighbors</strong> search. A naive implementation of nearest neighbors search is to simply calculate the distance between the <em>query vector</em> and every vector in our collection (commonly referred to as the <em>reference set</em>). However, calculating these distances in a brute force manner quickly becomes infeasible as your reference set grows to millions of objects.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-9.57.21-PM.png" alt="Screen-Shot-2019-01-29-at-9.57.21-PM"><br>
<small>Imagine if Facebook had to compare each face in a new photo against <strong>all</strong> of its users <em>every</em> time it suggested who to tag, this would be computationally infeasible!</small></p>
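<p>For reference, a naive brute-force query looks like this in numpy (the dimensions and data here are made up for illustration); note that the distance computation touches every row of the reference set:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
reference_set = rng.normal(size=(10_000, 128))  # e.g. 10k face embeddings
query = rng.normal(size=128)

def brute_force_search(query, reference_set, top_k=5):
    # O(N) distance computations per query -- fine at this scale,
    # infeasible once the reference set reaches many millions.
    distances = np.linalg.norm(reference_set - query, axis=1)
    return np.argsort(distances)[:top_k]

neighbors = brute_force_search(query, reference_set)
```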
<p>A class of methods known as <strong>approximate nearest neighbors</strong> search offer a solution to our scaling dilemma by partitioning the vector space in a clever way such that we only need to examine a small subset of the overall reference set.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-10.04.09-PM.png" alt="Screen-Shot-2019-01-29-at-10.04.09-PM"><br>
<small>Approximate methods alleviate this computational burden by cleverly partitioning the vectors such that we only need to focus on a small subset of objects.</small></p>
<p>In this blog post, I'll cover a couple of techniques used for approximate nearest neighbors search. This post will not cover approximate nearest neighbors methods exhaustively, but hopefully you'll be able to understand how people generally approach this problem and how to apply these techniques in your own work.</p>
<p>In general, the approximate nearest neighbor methods can be grouped as:</p>
<ul>
<li>Tree-based data structures</li>
<li>Neighborhood graphs</li>
<li>Hashing methods</li>
<li>Quantization</li>
</ul>
<p><a id="kd_tree"></a></p>
<h2 id="kdimensionaltrees">K-dimensional trees</h2>
<p>The first approximate nearest neighbors method we'll cover is a tree-based approach. K-dimensional trees generalize the concept of a <a href="https://medium.com/basecs/leaf-it-up-to-binary-trees-11001aaf746d">binary search tree</a> into multiple dimensions.</p>
<p>The general procedure for growing a k-dimensional tree is as follows:</p>
<ul>
<li>pick a random dimension from your k-dimensional vector</li>
<li>find the median of that dimension across all of the vectors in your current collection</li>
<li>split the vectors on the median value</li>
<li>repeat!</li>
</ul>
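<p>The procedure above can be sketched in a few lines of Python (a toy dict-based tree for illustration, not a production implementation; here the stopping criterion is a small leaf bucket size):</p>

```python
import random

def build_kdtree(points, k, leaf_size=8):
    # Stopping criterion: keep a small bucket of vectors per leaf.
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = random.randrange(k)                      # random dimension
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                          # split on the median
    return {"axis": axis, "median": points[mid][axis],
            "left": build_kdtree(points[:mid], k, leaf_size),
            "right": build_kdtree(points[mid:], k, leaf_size)}

def query_leaf(tree, q):
    # Walk down by asking "is this coordinate >= the median?" at each node.
    while "leaf" not in tree:
        side = "right" if q[tree["axis"]] >= tree["median"] else "left"
        tree = tree[side]
    return tree["leaf"]

random.seed(0)
points = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]
tree = build_kdtree(points, k=2)
candidates = query_leaf(tree, (4, -2))  # only these need exact distances
```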
<p>A toy 2-dimensional example is visualized below. At the top level, we select a random dimension (out of the two possible dimensions, $x_0$ and $x_1$) and calculate the median. Then, we follow the same procedure of picking a dimension and calculating the median for <em>each path</em> independently. This process is repeated until some stopping criterion is satisfied; each leaf node in the tree contains a subset of vectors from our reference set.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-12.35.09-AM.png" alt="Screen-Shot-2019-01-29-at-12.35.09-AM"></p>
<p>We can view how the two-dimensional vectors are partitioned at each level of the k-d tree in the figure below. Take a minute to verify that this visualization matches what is described in the tree above.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-30-at-11.16.21-PM.png" alt="Screen-Shot-2019-01-30-at-11.16.21-PM"></p>
<p>In order to see the usefulness of this tree, let's now consider how we could use this data structure to perform an approximate nearest neighbor query. As we walk down the tree, notice how the highlighted area (the area in vector space that we're interested in) shrinks down to a small subset of the original space. (I'll use the level 4 subplot for this example.)</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-10.21.41-PM.png" alt="Screen-Shot-2019-01-29-at-10.21.41-PM"></p>
<p>At the top level, we look at the first dimension of the query vector and ask whether or not its value is greater than or equal to 1. Since 4 is greater than 1, we walk down the &quot;yes&quot; path to the next level down. We can safely ignore any of the nodes that follow the first &quot;no&quot; path.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-10.22.10-PM-1.png" alt="Screen-Shot-2019-01-29-at-10.22.10-PM-1"></p>
<p>Now we look at the second dimension of the vector and ask whether its value is greater than or equal to 0. Since -2 is less than 0, we now walk down the &quot;no&quot; path. Notice again how the area of interest in our overall vector-space continues to shrink.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-10.22.43-PM.png" alt="Screen-Shot-2019-01-29-at-10.22.43-PM"></p>
<p>Finally, once we reach the bottom of the tree we are left with a collection of vectors. Thankfully, this is a small subset relative to the overall size of the reference set, so calculating the distance between the query vector and each vector in this subset is computationally feasible.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-10.23.05-PM.png" alt="Screen-Shot-2019-01-29-at-10.23.05-PM"></p>
<p>K-d trees are popular due to their simplicity; however, this technique struggles to perform well when dealing with high dimensional data. Further, notice how we only returned vectors which are found in the same cell as the query point. In this example, the query vector happened to fall in the middle of a cell, but you could imagine a scenario where the query vector lies near the edge of a cell and we miss out on vectors which lie just outside of it.</p>
<p><a id="quantization"></a></p>
<h2 id="quantization">Quantization</h2>
<p>Another approach to the approximate nearest neighbors problem is to collapse our reference set into a smaller collection of representative vectors. We can find these &quot;representative&quot; vectors by simply running the <a href="https://www.jeremyjordan.me/grouping-data-points-with-k-means-clustering/">K-means algorithm</a> on our data. In the literature, this collection of &quot;representative&quot; vectors is commonly referred to as the <strong>codebook</strong>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-11.29.48-PM.png" alt="Screen-Shot-2019-01-29-at-11.29.48-PM"><br>
<small>The right figure displays a <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi diagram</a> which essentially partitions the space according to the set of points for which a given centroid is closest.</small></p>
<p>We'll then &quot;map&quot; all of our data onto these centroids. By doing this, we can represent our reference set of a couple hundred vectors with only 7 representative centroids. This greatly reduces the number of distance computations we need to perform (only 7!) when making a nearest neighbors query.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-29-at-11.30.19-PM.png" alt="Screen-Shot-2019-01-29-at-11.30.19-PM"></p>
<p>We can then maintain an inverted list to keep track of all of the original objects in relation to which centroid represents the quantized vector.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/02/Screen-Shot-2019-02-02-at-11.45.21-PM.png" alt="Screen-Shot-2019-02-02-at-11.45.21-PM"></p>
<p>You can optionally retrieve the full vectors for all of the ids maintained in the inverted list for a given centroid, calculating the true distances between each vector and our query. This is a process known as <strong>re-ranking</strong> and can improve your query performance.</p>
<p>Similar to before, let's now look at how we can use this method to perform a query. For a given query vector, we'll calculate the distances between the query vector and each centroid in order to find the closest centroid. We can then look up the centroid in our inverted list in order to find all of the nearest vectors.</p>
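Putting the pieces together, here's a sketch (using made-up 2D data and scikit-learn's K-means, mirroring the 7-centroid example) of building the codebook, maintaining the inverted list, and answering a query with re-ranking:

```python
# Codebook quantization with an inverted list (illustrative sketch).
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 2))  # reference set of a couple hundred vectors

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(reference)
codebook = kmeans.cluster_centers_     # the 7 "representative" centroids

# Inverted list: centroid id -> ids of the original vectors it represents.
inverted_list = defaultdict(list)
for vec_id, centroid_id in enumerate(kmeans.labels_):
    inverted_list[centroid_id].append(vec_id)

def query_nn(q):
    # Only 7 distance computations to find the closest centroid...
    dists = np.linalg.norm(codebook - q, axis=1)
    nearest_centroid = int(np.argmin(dists))
    # ...then look up its inverted list for the candidate neighbors.
    candidates = inverted_list[nearest_centroid]
    # Re-ranking: compute the true distances for just these candidates.
    return sorted(candidates, key=lambda i: np.linalg.norm(reference[i] - q))

neighbors = query_nn(np.array([0.5, -0.5]))
```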
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-30-at-12.00.35-AM.png" alt="Screen-Shot-2019-01-30-at-12.00.35-AM"></p>
<p>Unfortunately, in order to get good performance using quantization, you typically need a very large number of centroids; this undermines the original goal of alleviating the computational burden of calculating too many distances.</p>
<p><a id="pq"></a></p>
<h3 id="productquantization">Product quantization</h3>
<p>Product quantization addresses this problem by first subdividing the original vectors into subcomponents and then quantizing (i.e. running K-means on) each subcomponent separately. A single vector is now represented by a collection of centroids, one for each subcomponent.</p>
<p>To illustrate this, I've provided two examples. In the 8D case, you can see how our vector is divided into subcomponents and each subcomponent is represented by some centroid value. The 2D example, however, shows us the benefit of this approach. In this case, we can only split our 2D vector into a maximum of two components. We'll then quantize each dimension separately: squashing all of the data onto the horizontal axis and running K-means, then squashing all of the data onto the vertical axis and running K-means again. We find 3 centroids for each subcomponent, for a total of 6 centroids. However, the total set of all possible quantized states for the overall vector is the <strong>Cartesian product</strong> of the subcomponent centroids.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-30-at-11.22.40-PM.png" alt="Screen-Shot-2019-01-30-at-11.22.40-PM"></p>
<p>In other words, if we divide our vector into $m$ subcomponents and find $k$ centroids per subcomponent, we can represent $k^m$ possible quantizations using only $km$ vectors! The chart below shows how many centroids are needed in order to get 90% of the top 5 search results correct for an approximate nearest neighbors query. Notice how using product quantization ($m&gt;1$) vastly reduces the number of centroids needed to represent our data. One of the reasons why I love this idea so much is that we've effectively turned <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">the curse of dimensionality</a> into something highly beneficial!</p>
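Here's a toy product quantizer on synthetic 8D data (a sketch, not a production implementation): split each vector into $m$ subvectors, fit K-means with $k$ centroids on each slice, and encode a vector as its $m$ centroid ids.

```python
# Toy product quantizer: m subcomponents, k centroids each (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 8))  # 8D vectors, as in the first example
m, k = 4, 16                      # 4 subcomponents, 16 centroids each

subdim = data.shape[1] // m
codebooks = []
for i in range(m):
    sub = data[:, i * subdim:(i + 1) * subdim]
    codebooks.append(KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub))

def pq_encode(v):
    # m centroid ids represent k**m possible quantizations
    # while only storing k*m centroids.
    return [int(cb.predict(v[i * subdim:(i + 1) * subdim][None])[0])
            for i, cb in enumerate(codebooks)]

def pq_decode(code):
    # Reconstruct the vector by concatenating the chosen centroids.
    return np.concatenate([codebooks[i].cluster_centers_[c]
                           for i, c in enumerate(code)])

code = pq_encode(data[0])
```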
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-30-at-11.24.00-PM.png" alt="Screen-Shot-2019-01-30-at-11.24.00-PM"><br>
<small><a href="https://loukratz.info/talks/2018-08-09-lopq">Image credit</a></small></p>
<p><a id="multimodal"></a></p>
<h3 id="handlingmultimodaldata">Handling multi-modal data</h3>
<p>Product quantization alone works great when our data is distributed relatively evenly across the vector-space. However, in reality our data is usually multi-modal. To handle this, a common technique involves first training a coarse quantizer to roughly &quot;slice&quot; up the vector-space, and then we'll run product quantization on each individual coarse cell.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/01/Screen-Shot-2019-01-31-at-12.13.25-AM.png" alt="Screen-Shot-2019-01-31-at-12.13.25-AM"></p>
<p>Below, I've visualized the data that falls within a single coarse cell. We'll use product quantization to find a set of centroids which describe this local subset of data, and then repeat for each coarse cell. Commonly, people encode the vector residuals (the difference between the original vector and the closest coarse centroid) since the residuals tend to have smaller magnitudes and thus lead to less lossy compression when running product quantization. In simple terms, we <strong>treat each coarse centroid as a local origin</strong> and run product quantization on the data with respect to the <em>local origin</em> rather than the global origin.</p>
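The residual-encoding idea can be sketched as follows (synthetic data; the coarse quantizer here is plain K-means, with product quantization of the residuals left as the next step):

```python
# Coarse quantization with residual encoding (illustrative sketch).
# Each vector is assigned to a coarse centroid, and we keep its residual:
# the vector expressed relative to that centroid as a local origin.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 4))

coarse = KMeans(n_clusters=8, n_init=4, random_state=0).fit(data)
residuals = data - coarse.cluster_centers_[coarse.labels_]

# The residuals have smaller magnitudes than the centered vectors, so
# product-quantizing them introduces less distortion than quantizing
# the raw vectors directly.
```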
<p><img src="https://www.jeremyjordan.me/content/images/2019/02/Screen-Shot-2019-02-03-at-10.16.01-PM.png" alt="Screen-Shot-2019-02-03-at-10.16.01-PM"></p>
<p><strong>Pro-tip:</strong> If you want to scale to <em>really</em> large datasets you can use product quantization as both the coarse quantizer <em>and</em> the fine-grained quantizer within each coarse cell. See <a href="https://cache-ash04.cdn.yandex.net/download.yandex.ru/company/cvpr2012.pdf">this paper</a> for the details.</p>
<p><a id="lopq"></a></p>
<h3 id="locallyoptimizedproductquantization">Locally optimized product quantization</h3>
<p>The ideal goal for quantization is to develop a codebook which is (1) concise and (2) highly representative of our data. More specifically, we'd like all of the vectors in our codebook to represent dense regions of our data in vector-space. A centroid in a low-density area of our data is inefficient at representing data and introduces high distortion error for any vectors which fall in its Voronoi cell.</p>
<p>One potential way we can attempt to avoid these inefficient centroids is to add an alignment step to our product quantization. This allows for our product quantizers to better cover the local data for each coarse Voronoi cell.</p>
<p>We can do this by applying a transformation to our data such that we minimize our quantization distortion error. One simple way to minimize this quantization distortion error is to simply apply <a href="https://www.jeremyjordan.me/principal-components-analysis/">PCA</a> in order to mean-center the data and rotate it such that the axes capture most of the variance within the data.</p>
<p>Recall my earlier example where we ran product quantization on a toy 2D dataset. In doing so, we effectively squashed all of the data onto the horizontal axis and ran k-means and then repeated this for the vertical axis. By rotating the data such that the axes capture most of the variance, we can more effectively cover our data when using product quantization.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/02/Screen-Shot-2019-01-31-at-11.48.58-PM.png" alt="Screen-Shot-2019-01-31-at-11.48.58-PM"></p>
<p>This technique is known as <strong>locally optimized product quantization</strong>, since we're manipulating the local data within each coarse Voronoi cell in order to optimize the product quantization performance. The authors who introduced this technique have a great illustrative example of how this technique can better fit a given set of vectors.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/02/Screen-Shot-2019-01-31-at-11.52.37-PM.png" alt="Screen-Shot-2019-01-31-at-11.52.37-PM"><br>
<small>This blog post glosses over (c) Optimized Product Quantization which is the same idea of aligning our data for better product quantization performance, but the alignment is performed globally instead of aligning local data in each Voronoi cell independently. <a href="http://image.ntua.gr/iva/files/lopq.pdf">Image credit</a></small></p>
<h4 id="aquicksidenoteregardingpcaalignment">A quick sidenote regarding PCA alignment</h4>
<p>The authors who introduced product quantization noted that the technique works best when the vector subcomponents had similar variance. A nice side effect of doing PCA alignment is that during the process we get a matrix of eigenvalues which describe the variance of each principal component. We can use this to our advantage by allocating principal components into buckets of equal variance.</p>
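One simple way to do this allocation (my own greedy variant for illustration; the LOPQ paper specifies its own allocation procedure) is to assign each principal component, largest eigenvalue first, to whichever bucket currently has the least total variance:

```python
# Greedy allocation of principal components into buckets of roughly
# equal variance (illustrative; eigenvalues are made up).
import numpy as np

eigenvalues = np.array([9.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.5])  # sorted desc
num_buckets = 2

bucket_vars = [0.0] * num_buckets
buckets = [[] for _ in range(num_buckets)]
for comp, var in enumerate(eigenvalues):
    b = int(np.argmin(bucket_vars))  # least-loaded bucket so far
    buckets[b].append(comp)
    bucket_vars[b] += var
```

Each bucket then becomes one subcomponent for product quantization, so every subquantizer sees a comparable amount of variance.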
<p><img src="https://www.jeremyjordan.me/content/images/2019/02/Screen-Shot-2019-02-03-at-11.32.59-PM.png" alt="Screen-Shot-2019-02-03-at-11.32.59-PM"></p>
<p><a id="datasets"></a></p>
<h2 id="commondatasets">Common datasets</h2>
<ul>
<li><a href="https://nlp.stanford.edu/projects/glove/">GLOVE word vectors</a></li>
<li><a href="http://corpus-texmex.irisa.fr/">SIFT image descriptors</a></li>
</ul>
<p><a id="further_reading"></a></p>
<h2 id="furtherreading">Further reading</h2>
<p><strong>Papers</strong></p>
<ul>
<li><a href="https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf">Product quantization for nearest neighbor search</a></li>
<li><a href="https://www.robots.ox.ac.uk/~vgg/rg/papers/ge__cvpr2013__optimized.pdf">Optimized Product Quantization for Approximate Nearest Neighbor Search</a> (see also this <a href="http://kaiminghe.com/cvpr13/cvpr13opq_ppt.pdf">presentation</a>)</li>
<li><a href="http://image.ntua.gr/iva/files/lopq.pdf">Locally Optimized Product Quantization for Approximate Nearest Neighbor Search</a></li>
<li><a href="https://cache-ash04.cdn.yandex.net/download.yandex.ru/company/cvpr2012.pdf">The Inverted Multi-Index</a></li>
</ul>
<p>I didn't cover binary codes in this post - but I should have! I <em>may</em> come back and edit the post to include more information soon. Until then, enjoy this paper.</p>
<ul>
<li><a href="https://arxiv.org/pdf/1708.02932.pdf">SUBIC: A supervised, structured binary code for image search</a></li>
</ul>
<p><strong>Lectures</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=gHdbqsDK9YY&amp;list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ">Victor Lavrenko - KNN Lectures</a></li>
</ul>
<p><strong>Blog posts/talks</strong></p>
<ul>
<li><a href="https://loukratz.info/talks/2018-08-09-lopq">Scaling Visual Search with Locally Optimized Product Quantization</a></li>
<li><a href="https://0x65.dev/blog/2019-12-07/indexing-billions-of-text-vectors.html">Indexing Billions of Text Vectors</a></li>
<li><a href="https://mccormickml.com/2017/10/13/product-quantizer-tutorial-part-1/">Product Quantizers for k-NN Tutorial Part 1</a></li>
<li><a href="https://mccormickml.com/2017/10/22/product-quantizer-tutorial-part-2/">Product Quantizers for k-NN Tutorial Part 2</a></li>
</ul>
<p><strong>Libraries and Github repos</strong></p>
<ul>
<li><a href="https://github.com/facebookresearch/faiss">Facebook AI Similarity Server (FAISS)</a></li>
<li><a href="https://github.com/spotify/annoy">Spotify Annoy</a></li>
<li><a href="https://github.com/yahoo/lopq">Yahoo LOPQ</a></li>
<li><a href="https://github.com/nmslib/nmslib">Non-Metric Space Library (NMSlib)</a></li>
<li><a href="https://scikit-learn.org/stable/modules/neighbors.html">Scikit-learn Nearest Neighbors</a></li>
<li><a href="https://github.com/erikbern/ann-benchmarks">Github: ANN Benchmarks</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Organizing machine learning projects: project management guidelines.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The goal of this document is to provide a common framework for approaching machine learning projects that can be referenced by practitioners. If you build ML models, this post is for you. If you collaborate with people who build ML models, I hope that this guide provides you with a</p>]]></description><link>https://www.jeremyjordan.me/ml-projects-guide/</link><guid isPermaLink="false">5b79d1d59eaeb100bf13db8c</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Sun, 02 Sep 2018 00:09:00 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The goal of this document is to provide a common framework for approaching machine learning projects that can be referenced by practitioners. If you build ML models, this post is for you. If you collaborate with people who build ML models, I hope that this guide provides you with a good perspective on the common project workflow. Knowledge of machine learning is assumed.</p>
<p><a id="overview"></a></p>
<h2 id="overview">Overview</h2>
<p>This overview intends to serve as a project &quot;checklist&quot; for machine learning practitioners. Subsequent sections will provide more detail.</p>
<p><strong>Project lifecycle</strong><br>
Machine learning projects are highly iterative; as you progress through the ML lifecycle, you’ll find yourself iterating on a section until reaching a satisfactory level of performance, then proceeding forward to the next task (which may be circling back to an even earlier step). Moreover, a project isn’t complete after you ship the first version; you get feedback from real-world interactions and redefine the goals for the next iteration of deployment.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/09/ml-development-cycle.png" alt="ml-development-cycle"></p>
<ol>
<li><a href="#planning"><strong>Planning and project setup</strong></a>
<ul>
<li>Define the task and scope out requirements</li>
<li>Determine project feasibility</li>
<li>Discuss general model tradeoffs (accuracy vs speed)</li>
<li>Set up project codebase</li>
</ul>
</li>
<li><a href="#data"><strong>Data collection and labeling</strong></a>
<ul>
<li>Define ground truth (create labeling documentation)</li>
<li>Build data ingestion pipeline</li>
<li>Validate quality of data</li>
<li>Label data and ensure ground truth is well-defined</li>
<li>Revisit Step 1 and ensure data is sufficient for the task</li>
</ul>
</li>
<li><a href="#exploration"><strong>Model exploration</strong></a>
<ul>
<li>Establish baselines for model performance</li>
<li>Start with a simple model using initial data pipeline</li>
<li>Overfit simple model to training data</li>
<li>Stay nimble and try many parallel (isolated) ideas during early stages</li>
<li>Find SoTA model for your problem domain (if available) and reproduce results, then apply to your dataset as a second baseline</li>
<li>Revisit Step 1 and ensure feasibility</li>
<li>Revisit Step 2 and ensure data quality is sufficient</li>
</ul>
</li>
<li><a href="#refinement"><strong>Model refinement</strong></a>
<ul>
<li>Perform model-specific optimizations (e.g. hyperparameter tuning)</li>
<li>Iteratively debug model as complexity is added</li>
<li>Perform error analysis to uncover common failure modes</li>
<li>Revisit Step 2 for targeted data collection and labeling of observed failure modes</li>
</ul>
</li>
<li><a href="#testing"><strong>Testing and evaluation</strong></a>
<ul>
<li>Evaluate model on test distribution; understand differences between train and test set distributions (how is “data in the wild” different than what you trained on)</li>
<li>Revisit model evaluation metric; ensure that this metric drives desirable downstream user behavior</li>
<li>Write tests for:
<ul>
<li>Input data pipeline</li>
<li>Model inference functionality</li>
<li>Model inference performance on validation data</li>
<li>Explicit scenarios expected in production (model is evaluated on a curated set of observations)</li>
</ul>
</li>
</ul>
</li>
<li><a href="#deployment"><strong>Model deployment</strong></a>
<ul>
<li>Expose model via a REST API</li>
<li>Deploy new model to small subset of users to ensure everything goes smoothly, then roll out to all users</li>
<li>Maintain the ability to roll back model to previous versions</li>
<li>Monitor live data and model prediction distributions</li>
</ul>
</li>
<li><a href="#maintenance"><strong>Ongoing model maintenance</strong></a>
<ul>
<li>Understand that changes can affect the system in unexpected ways</li>
<li>Periodically retrain model to prevent model staleness</li>
<li>If there is a transfer in model ownership, educate the new team</li>
</ul>
</li>
</ol>
<p><strong>Team roles</strong></p>
<p>A typical team is composed of:</p>
<ul>
<li><strong>data engineer</strong> (builds the data ingestion pipelines)</li>
<li><strong>machine learning engineer</strong> (train and iterate models to perform the task)</li>
<li><strong>software engineer</strong> (aids with integrating machine learning model with the rest of the product)</li>
<li><strong>project manager</strong> (main point of contact with the client)</li>
</ul>
<hr>
<p><a id="planning"></a></p>
<h2 id="planningandprojectsetup">Planning and project setup</h2>
<p>It may be tempting to skip this section and dive right in to &quot;just see what the models can do&quot;. Don't skip this section. All too often, you'll end up wasting time by delaying discussions surrounding the project goals and model evaluation criteria. Everyone should be working toward a common goal from the start of the project.</p>
<p>It's worth noting that defining the model task is not always straightforward. There are often many different approaches you can take towards solving a problem, and it's not always immediately evident which is optimal. If your problem is vague and the modeling task is not clear, jump over to my post on <a href="https://www.jeremyjordan.me/ml-requirements/">defining requirements for machine learning projects</a> before proceeding.</p>
<p><a id="project_selection"></a></p>
<h3 id="prioritizingprojects">Prioritizing projects</h3>
<p><em>Ideal: project has high impact and high feasibility.</em></p>
<p>Mental models for evaluating project impact:</p>
<ul>
<li>Look for places where cheap prediction drives large value</li>
<li>Look for complicated rule-based software where we can learn rules instead of programming them</li>
</ul>
<p>When evaluating projects, it can be useful to have a common language and understanding of the differences between traditional software and machine learning software. Andrej Karpathy's <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">Software 2.0</a> is recommended reading for this topic.</p>
<p><strong>Software 1.0</strong></p>
<ul>
<li>Explicit instructions for a computer written by a programmer using a <em>programming language</em> such as Python or C++. A human writes the logic such that when the system is provided with data it will output the desired behavior.</li>
</ul>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/Screen-Shot-2019-05-07-at-9.45.18-PM.png" alt="Screen-Shot-2019-05-07-at-9.45.18-PM"></p>
<p><strong>Software 2.0</strong></p>
<ul>
<li>Implicit instructions by providing data, &quot;written&quot; by an optimization algorithm using <em>parameters</em> of a specified model architecture. The system logic is learned from a provided collection of data examples and their corresponding desired behavior.</li>
</ul>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/Screen-Shot-2019-05-07-at-9.44.55-PM.png" alt="Screen-Shot-2019-05-07-at-9.44.55-PM"></p>
<p>See <a href="https://www.youtube.com/watch?v=zywIvINSlaI">this talk</a> for more detail.</p>
<p>A quick note on Software 1.0 and Software 2.0 - these two paradigms are <em><strong>not</strong></em> mutually exclusive. Software 2.0 is usually used to scale the <strong>logic</strong> component of traditional software systems by leveraging large amounts of data to enable more complex or nuanced decision logic.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2019/05/Screen-Shot-2019-05-10-at-9.52.20-PM.png" alt="Screen-Shot-2019-05-10-at-9.52.20-PM"></p>
<p>For example, <a href="https://twimlai.com/twiml-talk-124-systems-software-machine-learning-scale-jeff-dean/">Jeff Dean talks</a> (at 27:15) about how the code for Google Translate used to be a very complicated system consisting of ~500k lines of code. Google was able to simplify this product by leveraging a machine learning model to perform the core logical task of translating text to a different language, requiring only ~500 lines of code to describe the model. However, this model still requires some &quot;Software 1.0&quot; code to process the user's query, invoke the machine learning model, and return the desired information to the user.</p>
<p>In summary, machine learning can drive large value in applications where decision logic is difficult or complicated for humans to write, but relatively easy for machines to learn. On that note, we'll continue to the next section to discuss how to evaluate whether a task is &quot;relatively easy&quot; for machines to learn.</p>
<p><a id="feasibility"></a></p>
<h3 id="determiningfeasibility">Determining feasibility</h3>
<p>Some useful questions to ask when determining the feasibility of a project:</p>
<ul>
<li>Cost of data acquisition
<ul>
<li>How hard is it to acquire data?</li>
<li>How expensive is data labeling?</li>
<li>How much data will be needed?</li>
</ul>
</li>
<li>Cost of wrong predictions
<ul>
<li>How frequently does the system need to be right to be useful?</li>
<li>Are there scenarios where a wrong prediction incurs a large cost?</li>
</ul>
</li>
<li>Availability of good published work about similar problems
<ul>
<li>Has the problem been reduced to practice?</li>
<li>Is there sufficient literature on the problem?</li>
<li>Are there pre-trained models we can leverage?</li>
</ul>
</li>
<li>Computational resources available both for training and inference
<ul>
<li>Will the model be deployed in a resource-constrained environment?</li>
<li>What are the latency requirements for the model?</li>
</ul>
</li>
</ul>
<p><a id="requirements"></a></p>
<h3 id="specifyingprojectrequirements">Specifying project requirements</h3>
<p>Establish a single-value optimization metric for the project. You can also include several other <a href="https://en.wikipedia.org/wiki/Satisficing">satisficing</a> metrics (i.e. performance thresholds) to evaluate models, but you can only <em><strong>optimize</strong></em> a single metric.</p>
<p><em>Example:</em></p>
<ul>
<li>Optimize for accuracy</li>
<li>Prediction latency under 10 ms</li>
<li>Model requires no more than 1 GB of memory</li>
<li>90% coverage (model confidence exceeds required threshold to consider a prediction as valid)</li>
</ul>
<p>The optimization metric may be a weighted sum of many things which we care about. Revisit this metric as performance improves.</p>
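As a sketch of how the example requirements above play out in model selection (the candidate models and their numbers here are made up): filter candidates by the satisficing thresholds, then pick the best on the single optimization metric.

```python
# Model selection with one optimizing metric plus satisficing thresholds.
# Candidate names and numbers are illustrative, not real benchmarks.
candidates = [
    {"name": "logistic",  "accuracy": 0.88, "latency_ms": 4,  "memory_gb": 0.1},
    {"name": "small_cnn", "accuracy": 0.92, "latency_ms": 9,  "memory_gb": 0.8},
    {"name": "big_cnn",   "accuracy": 0.95, "latency_ms": 35, "memory_gb": 2.4},
]

def satisfices(m):
    # Hard requirements: prediction latency under 10 ms,
    # no more than 1 GB of memory.
    return m["latency_ms"] < 10 and m["memory_gb"] <= 1.0

feasible = [m for m in candidates if satisfices(m)]
best = max(feasible, key=lambda m: m["accuracy"])  # optimize a single metric
```

Note that the highest-accuracy model is discarded here because it fails a satisficing threshold; accuracy is only compared among feasible candidates.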
<p>Some teams may choose to ignore a certain requirement at the start of the project, with the goal of revising their solution (to meet the ignored requirements) after they have discovered a promising general approach.</p>
<p>Decide at what point you will ship your first model.</p>
<blockquote>
<p>Some teams aim for a “neutral” first launch: a first launch that explicitly deprioritizes machine learning gains, to avoid getting distracted. — <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">Google Rules of Machine Learning</a></p>
</blockquote>
<p>The motivation behind this approach is that the first deployment should involve a simple model with focus spent on building the proper machine learning pipeline required for prediction. This allows you to deliver value quickly and avoid the trap of spending too much of your time trying to <a href="http://karpathy.github.io/2019/04/25/recipe/#6-squeeze-out-the-juice">&quot;squeeze the juice.&quot;</a></p>
<p><a id="codebase"></a></p>
<h3 id="settingupamlcodebase">Setting up a ML codebase</h3>
<p>A well-organized machine learning codebase should modularize data processing, model definition, model training, and experiment management.</p>
<p>Example codebase organization:</p>
<pre><code class="language-bash">configs/
    baseline.yaml
    latest.yaml
data/
docker/ 
project_name/
  api/
    app.py
  models/
    base.py
    simple_baseline.py
    cnn.py
  datasets.py
  train.py
  experiment.py
scripts/
</code></pre>
<p><code>data/</code> provides a place to store raw and processed data for your project. You can also include a <code>data/README.md</code> file which describes the data for your project.</p>
<p><code>docker/</code> is a place to specify one or many Dockerfiles for the project. Docker (and other container solutions) help ensure consistent behavior across multiple machines and deployments.</p>
<p><code>api/app.py</code> exposes the model through a REST client for predictions. You will likely choose to load the (trained) model from a <a href="https://mlflow.org/docs/latest/model-registry.html">model registry</a> rather than importing directly from your library.</p>
<p><code>models/</code> defines a collection of machine learning models for the task, unified by a common API defined in <code>base.py</code>.  These models include code for any necessary data preprocessing and output normalization.</p>
<p><code>datasets.py</code> manages construction of the dataset. Handles data pipelining/staging areas, shuffling, reading from disk.</p>
<p><code>experiment.py</code> manages the experiment process of evaluating multiple models/ideas. This constructs the dataset and models for a given experiment.</p>
<p><code>train.py</code> defines the actual training loop for the model. This code interacts with the optimizer and handles logging during training.</p>
<p>See other examples <a href="https://github.com/jeremyjordan/data-science-template">here</a>, <a href="https://github.com/ml-tooling/ml-project-template">here</a>, <a href="https://github.com/cmawer/reproducible-model">here</a> and <a href="https://drivendata.github.io/cookiecutter-data-science/#directory-structure">here</a>.</p>
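As a concrete illustration of the layout above, the common API in <code>models/base.py</code> might look something like the following (the class and method names are my own sketch, not prescribed by the template):

```python
# Sketch of a shared model API (models/base.py); names are illustrative.
from abc import ABC, abstractmethod

class Model(ABC):
    """Unifies data preprocessing, prediction, and output handling."""

    @abstractmethod
    def preprocess(self, raw_inputs):
        ...

    @abstractmethod
    def predict(self, inputs):
        ...

    def __call__(self, raw_inputs):
        # Every model is invoked the same way, so experiment.py and
        # api/app.py can swap models without changing their own code.
        return self.predict(self.preprocess(raw_inputs))

class SimpleBaseline(Model):
    # e.g. simple_baseline.py: always predict the majority class.
    def __init__(self, majority_class):
        self.majority_class = majority_class

    def preprocess(self, raw_inputs):
        return raw_inputs

    def predict(self, inputs):
        return [self.majority_class for _ in inputs]
```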
<hr>
<p><a id="data"></a></p>
<h2 id="datacollectionandlabeling">Data collection and labeling</h2>
<p>An ideal machine learning pipeline uses data which labels itself. For example, Tesla Autopilot has a model running that predicts when cars are about to <a href="https://www.youtube.com/watch?v=Ucp0TTmvqOE&amp;feature=youtu.be&amp;t=7809">cut into your lane</a>. In order to acquire labeled data in a systematic manner, you can simply observe when a car changes from a neighboring lane into the Tesla's lane and then rewind the video feed to label that a car is about to cut in to the lane.</p>
<p>As another example, suppose Facebook is building a model to predict user engagement when deciding how to order things on the newsfeed. After serving the user content based on a prediction, they can monitor engagement and turn this interaction into a labeled observation without any human effort. However, just be sure to think through this process and ensure that your &quot;self-labeling&quot; system won't get stuck in a <em>feedback loop</em> with itself.</p>
<p>For many other cases, we must manually label data for the task we wish to automate. The quality of your data labels has a <em>large</em> effect on the upper bound of model performance.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Here is a real use case from work for model improvement and the steps taken to get there:<br><br>- Baseline: 53%<br>- Logistic: 58%<br>- Deep learning: 61%<br>- **Fixing your data: 77%**<br><br>Some good ol&#39; fashion &quot;understanding your data&quot; is worth it&#39;s weight in hyperparameter tuning!</p>&mdash; Alex Gude (@alex_gude) <a href="https://twitter.com/alex_gude/status/1121138827601383426?ref_src=twsrc%5Etfw">April 24, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Most data labeling projects require multiple people, which necessitates labeling <strong>documentation</strong>. Even if you're the only person labeling the data, it makes sense to document your labeling criteria so that you maintain consistency.</p>
<p>One tricky case is where you decide to change your labeling methodology after already having labeled data. For example, in the Software 2.0 talk mentioned previously, Andrej Karpathy <a href="https://www.youtube.com/watch?v=zywIvINSlaI&amp;feature=youtu.be&amp;t=20m43s">talks about</a> data which has no clear and obvious ground truth.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/09/lane_lines.png" alt="lane_lines"><br>
<small><em><a href="https://www.figure-eight.com/wp-content/uploads/2018/06/TRAIN_AI_2018_Andrej_Karpathy_Tesla.pdf">Image credit</a></em></small></p>
<p>If you run into this, <em>tag &quot;hard-to-label&quot; examples</em> in some manner such that you can easily find all similar examples should you decide to change your labeling methodology down the road. Additionally, you should <strong>version your dataset</strong> and associate a given model with a dataset version.</p>
<p><em>Tip: After labeling data and training an initial model, look at the observations with the largest error. These examples are often poorly labeled.</em></p>
<p><a id="active_learning"></a><br>
<strong>Active learning</strong></p>
<p>Active learning is useful when you have a large amount of unlabeled data and you need to decide what data you should label. Labeling data can be expensive, so we'd like to limit the time spent on this task.</p>
<p><em>As a counterpoint, if you can afford to label your entire dataset, you probably should. Active learning adds another layer of complexity.</em></p>
<blockquote>
<p>&quot;The main hypothesis in active learning is that if a learning algorithm can choose the data it wants to learn from, it can perform better than traditional methods with substantially less data for training.&quot; - <a href="https://www.datacamp.com/community/tutorials/active-learning">DataCamp</a></p>
</blockquote>
<p>General approach:</p>
<ol>
<li>Starting with an unlabeled dataset, build a &quot;seed&quot; dataset by acquiring labels for a small subset of instances</li>
<li>Train initial model on the seed dataset</li>
<li>Predict the labels of the remaining unlabeled observations</li>
<li>Use the uncertainty of the model's predictions to prioritize the labeling of remaining observations</li>
</ol>
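The loop above can be sketched with least-confident uncertainty sampling (one of several possible prioritization criteria), shown here on synthetic data with a logistic regression:

```python
# One round of uncertainty-based active learning (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

# Step 1: acquire labels for a small "seed" subset.
labeled = list(rng.choice(len(X), size=20, replace=False))
unlabeled = [i for i in range(len(X)) if i not in labeled]

# Step 2: train an initial model on the seed dataset.
model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# Steps 3-4: predict on the unlabeled pool and prioritize the examples
# the model is least confident about for the next labeling batch.
probs = model.predict_proba(X[unlabeled])
confidence = probs.max(axis=1)
to_label = [unlabeled[i] for i in np.argsort(confidence)[:10]]
```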
<p><a id="weak_labels"></a><br>
<strong>Leveraging weak labels</strong><br>
However, tasking humans with generating ground truth labels is expensive. Often times you'll have access to large swaths of unlabeled data and a limited labeling budget - how can you maximize the value from your data? In some cases, your data can have information which provides a noisy estimate of the ground truth. For example, <a href="https://code.fb.com/ml-applications/advancing-state-of-the-art-image-recognition-with-deep-learning-on-hashtags/">if you're categorizing Instagram photos, you might have access to the hashtags used in the caption of the image</a>. Other times, you might have subject matter experts which can help you develop heuristics about the data.</p>
<p><a href="https://hazyresearch.github.io/snorkel/">Snorkel</a> is an interesting project produced by the Stanford DAWN (Data Analytics for What’s Next) lab which formalizes an approach towards combining many noisy label estimates into a probabilistic ground truth. I'd encourage you to check it out and see if you might be able to leverage the approach for your problem.</p>
<hr>
<p><a id="exploration"></a></p>
<h2 id="modelexploration">Model exploration</h2>
<p><strong>Establish performance baselines on your problem.</strong> Baselines are useful for both establishing a lower bound of expected performance (simple model baseline) and establishing a target performance level (human baseline).</p>
<ul>
<li>Simple baselines include out-of-the-box scikit-learn models (e.g. logistic regression with default parameters) or even simple heuristics (always predict the majority class). Without these baselines, it's impossible to evaluate the value of added model complexity.</li>
<li>If your problem is well-studied, search the literature to approximate a baseline based on published results for very similar tasks/datasets.</li>
<li>If possible, try to estimate human-level performance on the given task. Don't naively assume that humans will perform the task perfectly; a lot of simple tasks are <a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">deceptively hard</a>!</li>
</ul>
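<p>As a concrete sketch of the first bullet, both baselines take only a few lines with scikit-learn (the dataset here is just a stand-in for your own):</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heuristic baseline: always predict the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Simple-model baseline: out-of-the-box logistic regression.
simple = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

baseline_scores = {
    "majority_class": majority.score(X_te, y_te),
    "logistic_regression": simple.score(X_te, y_te),
}
```

<p>Any added model complexity should now justify itself against these numbers.</p>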
<p><strong>Start simple and gradually ramp up complexity.</strong> This typically involves using a simple model, but can also include starting with a simpler version of your task.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Before doing anything intelligent with &quot;AI&quot;, do the unintelligent version fast and at scale.<br>At worst you understand the limits of a simplistic approach and what complexities you need to handle.<br>At best you realize you don&#39;t need the overhead of intelligence.</p>&mdash; Smerity (@Smerity) <a href="https://twitter.com/Smerity/status/1095490777860304896?ref_src=twsrc%5Etfw">February 13, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><strong>Once a model runs, overfit a single batch of data.</strong> Don't use regularization yet, as we want to see if the unconstrained model has sufficient capacity to learn from the data.</p>
<ul>
<li><a href="https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/">Practical Advice for Building Deep Neural Networks</a> (see case study on overfitting an initial model)</li>
</ul>
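<p>The same sanity check can be sketched without a deep learning framework: fit an unregularized, high-capacity model on a handful of examples and confirm the training error goes to (near) zero. With a neural network you'd instead run many gradient steps on one fixed batch, but the logic is identical:</p>

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# A single small "batch" of data.
X, y = make_classification(n_samples=16, n_features=10, random_state=0)

# No regularization (alpha=0): we only want to verify that the
# unconstrained model has enough capacity to memorize this batch.
model = MLPClassifier(hidden_layer_sizes=(64,), alpha=0.0,
                      max_iter=5000, random_state=0)
model.fit(X, y)
train_accuracy = model.score(X, y)  # should be at or very near 1.0
```

<p>If this check fails, suspect a bug or insufficient capacity before blaming the data.</p>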
<p><strong>Survey the literature.</strong> Search for papers on Arxiv describing model architectures for similar problems and speak with other practitioners to see which approaches have been most successful in practice. Determine a <em>state of the art</em> approach and use this as a baseline model (trained on your dataset).</p>
<p><strong>Reproduce a known result.</strong> If you're using a model which has been well-studied, ensure that your model's performance <em>on a commonly-used dataset</em> matches what is reported in the literature.</p>
<p><strong>Understand how model performance scales with more data.</strong> Plot the model performance as a function of increasing dataset size for the baseline models that you've explored. Observe how each model's performance scales as you increase the amount of data used for training.</p>
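<p>scikit-learn's <code>learning_curve</code> helper does exactly this bookkeeping; a sketch on a stand-in dataset:</p>

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

# Plot mean validation score against training-set size; a curve that has
# flattened suggests more data alone won't help this model.
mean_val = val_scores.mean(axis=1)
```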
<hr>
<p><a id="refinement"></a></p>
<h2 id="modelrefinement">Model refinement</h2>
<p>Once you have a general idea of successful model architectures and approaches for your problem, you should now spend much more focused effort on squeezing out performance gains from the model.</p>
<p><strong>Build a scalable data pipeline.</strong> By this point, you've determined which types of data are necessary for your model and you can now focus on engineering a performant pipeline.</p>
<p><strong>Apply the bias-variance decomposition to determine next steps.</strong> Break down error into: irreducible error, avoidable bias (difference between train error and irreducible error), variance (difference between validation error and train error), and validation set overfitting (difference between test error and validation error).</p>
<ul>
<li>If training on a (known) different distribution than what is available at test time, consider having <em>two validation subsets</em>: val-train and val-test. The difference between val-train error and val-test error is attributable to distribution shift.</li>
<li><em>Addressing underfitting</em>:
<ol>
<li>Increase model capacity</li>
<li>Reduce regularization</li>
<li>Error analysis</li>
<li>Choose a more advanced architecture (closer to state of art)</li>
<li>Tune hyperparameters</li>
<li>Add features</li>
</ol>
</li>
<li><em>Addressing overfitting</em>:
<ol>
<li>Add more training data</li>
<li>Add regularization</li>
<li>Add data augmentation</li>
<li>Error analysis</li>
<li>Tune hyperparameters</li>
<li>Reduce model size</li>
</ol>
</li>
<li><em>Addressing distribution shift</em>:
<ol>
<li>Perform error analysis to understand nature of distribution shift</li>
<li>Synthesize data (by augmentation) to more closely match the test distribution</li>
<li>Apply domain adaptation techniques</li>
</ol>
</li>
</ul>
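<p>The decomposition itself is simple arithmetic; with illustrative (made-up) error values:</p>

```python
# Illustrative error values - not from a real model.
irreducible_error = 0.02   # e.g. estimated from human-level performance
train_error = 0.07
val_error = 0.12
test_error = 0.14

avoidable_bias = train_error - irreducible_error     # 0.05 -> address underfitting
variance = val_error - train_error                   # 0.05 -> address overfitting
val_set_overfitting = test_error - val_error         # 0.02 -> refresh the val set
```

<p>Whichever term dominates tells you which of the lists above to work through first.</p>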
<p><strong>Use coarse-to-fine random searches for hyperparameters.</strong> Start with a wide hyperparameter space initially and iteratively hone in on the highest-performing region of the hyperparameter space.</p>
<ul>
<li><a href="https://www.jeremyjordan.me/hyperparameter-tuning/">Hyperparameter tuning for machine learning models</a>.</li>
</ul>
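<p>Sketched with NumPy and a toy objective standing in for a real training run (learning rate is the only hyperparameter here, for brevity):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high, n):
    """Sample values uniformly in log space (appropriate for learning rates)."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size=n)

def evaluate(lr):
    # Stand-in for a full training run; peaked near lr = 1e-3.
    return -abs(np.log10(lr) - np.log10(1e-3))

# Coarse pass over a wide range...
coarse = sample_log_uniform(1e-6, 1e0, 20)
best = max(coarse, key=evaluate)

# ...then a fine pass centered on the best coarse result.
fine = sample_log_uniform(best / 10, best * 10, 20)
best = max(list(coarse) + list(fine), key=evaluate)
```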
<p><strong>Perform targeted collection of data to address current failure modes.</strong> Develop a systematic method for analyzing errors of your current model. Categorize these errors, if possible, and collect additional data to better cover these cases.</p>
<p><a id="debugging"></a></p>
<h3 id="debuggingmlprojects">Debugging ML projects</h3>
<p>Why is your model performing poorly?</p>
<ul>
<li>Implementation bugs</li>
<li>Hyperparameter choices</li>
<li>Data/model fit</li>
<li>Dataset construction</li>
</ul>
<p><em>Key mindset for DL troubleshooting: pessimism.</em></p>
<p>In order to complete machine learning projects efficiently,  <em><strong>start simple</strong></em> and gradually increase complexity. Start with a solid foundation and build upon it in an incremental fashion.</p>
<p><em>Tip: Fix a <a href="https://towardsdatascience.com/properly-setting-the-random-seed-in-machine-learning-experiments-7da298d1320b">random</a> <a href="https://pytorch.org/docs/stable/notes/randomness.html">seed</a> to ensure your model training is reproducible.</em></p>
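<p>A sketch of such a seeding helper (add framework-specific calls such as <code>torch.manual_seed</code> if you use those libraries):</p>

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness in a Python ML project."""
    # Note: PYTHONHASHSEED only affects hash randomization if set before
    # the interpreter starts; it's included here for completeness.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)  # identical to `a` after re-seeding
```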
<p>Common bugs:</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">oh: 5) you didn&#39;t use bias=False for your Linear/Conv2d layer when using BatchNorm, or conversely forget to include it for the output layer .This one won&#39;t make you silently fail, but they are spurious parameters</p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/1013245864570073090?ref_src=twsrc%5Etfw">July 1, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><a id="failure_modes"></a></p>
<h3 id="discoveringfailuremodes">Discovering failure modes</h3>
<p>Use clustering to uncover failure modes and improve error analysis:</p>
<ul>
<li>Select all incorrect predictions. (Optionally, sort your observations by their calculated loss to find the most egregious errors.)</li>
<li>Run a clustering algorithm such as DBSCAN across selected observations.</li>
<li>Manually explore the clusters to look for common attributes which make prediction difficult.</li>
</ul>
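<p>A sketch of those three steps with scikit-learn's DBSCAN (random stand-ins here for your model's features and predictions):</p>

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-ins: feature vectors (or embeddings) plus true and predicted labels.
features = rng.normal(size=(200, 2))
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)

# 1. Select the incorrect predictions.
errors = features[y_true != y_pred]

# 2. Cluster them; a label of -1 marks unclustered "noise" points.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(errors)

# 3. Manually inspect each cluster for shared attributes.
for cluster_id in set(labels) - {-1}:
    members = errors[labels == cluster_id]
```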
<p>Categorize observations with incorrect predictions and determine what best action can be taken in the model refinement stage in order to improve performance on these cases.</p>
<hr>
<p><a id="testing"></a></p>
<h2 id="testingandevaluation">Testing and evaluation</h2>
<p>If you haven't already written tests for your code, you should write them at this point.</p>
<p>Different components of a ML product to test:</p>
<ul>
<li><strong>Training system</strong> processes raw data, runs experiments, manages results, stores weights.
<ul>
<li><em>Required tests</em>:
<ul>
<li>Test the full training pipeline (from raw data to trained model) to catch upstream changes in how data from our application is stored. These tests should be run nightly/weekly.</li>
</ul>
</li>
</ul>
</li>
<li><strong>Prediction system</strong> constructs the network, loads the stored weights, and makes predictions.
<ul>
<li><em>Required tests</em>:
<ul>
<li>Run inference on the validation data (already processed) and ensure model score does not degrade with new model/weights. This should be triggered every code push.</li>
<li>You should also have a quick functionality test that runs on a few important examples so that you can quickly (&lt;5 minutes) ensure that you haven't broken functionality during development. These tests are used as a sanity check as you are writing new code.</li>
<li>Also consider scenarios that your model might encounter, and develop tests to ensure new models still perform sufficiently. The &quot;test case&quot; is a scenario defined by a human and represented by a curated set of observations.
<ul>
<li><em>Example: For a self driving car, you might have a test to ensure that the car doesn't turn left at a yellow light.  For this case, you may run your model on observations where the car is at a yellow light and ensure that the prediction doesn't tell the car to proceed forward.</em></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><strong>Serving system</strong> is exposed to accept &quot;real world&quot; input and performs inference on production data. This system must be able to scale to demand.
<ul>
<li><em>Required monitoring</em>:
<ul>
<li>Alerts for downtime and errors</li>
<li>Check for distribution shift in data</li>
</ul>
</li>
</ul>
</li>
</ul>
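<p>The prediction-system check from the list above can be sketched as an ordinary pytest-style test; <code>evaluate_candidate_on_validation_set</code> is a hypothetical helper standing in for real inference over your processed validation data:</p>

```python
def evaluate_candidate_on_validation_set():
    # Hypothetical helper: load the candidate weights and run inference
    # over the validation set, returning the evaluation score.
    return 0.92

def test_model_score_has_not_degraded():
    """Run on every code push: compare the candidate model against the
    score recorded for the currently deployed model."""
    deployed_score = 0.91   # read from your experiment/metadata store
    tolerance = 0.01        # allow small fluctuations from retraining noise
    assert evaluate_candidate_on_validation_set() >= deployed_score - tolerance
```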
<p><img src="https://www.jeremyjordan.me/content/images/2018/09/test_infra.png" alt="test_infra"><br>
<small><em><a href="https://ai.google/research/pubs/pub46555">Image credit</a></em></small></p>
<p><a id="readiness"></a></p>
<h3 id="evaluatingproductionreadiness">Evaluating production readiness</h3>
<p><a href="https://ai.google/research/pubs/pub46555">The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction</a></p>
<p><em>Data:</em></p>
<ul>
<li>Feature expectations are captured in a schema.</li>
<li>All features are beneficial.</li>
<li>No feature’s cost is too much.</li>
<li>Features adhere to meta-level requirements.</li>
<li>The data pipeline has appropriate privacy controls.</li>
<li>New features can be added quickly.</li>
<li>All input feature code is tested.</li>
</ul>
<p><em>Model:</em></p>
<ul>
<li>Model specs are reviewed and submitted.</li>
<li>Offline and online metrics correlate.</li>
<li>All hyperparameters have been tuned.</li>
<li>The impact of model staleness is known.</li>
<li>A simple model is not better.</li>
<li>Model quality is sufficient on important data slices.</li>
<li>The model is tested for considerations of inclusion.</li>
</ul>
<p><em>Infrastructure:</em></p>
<ul>
<li>Training is reproducible.</li>
<li>Model specs are unit tested.</li>
<li>The ML pipeline is integration tested.</li>
<li>Model quality is validated before serving.</li>
<li>The model is debuggable.</li>
<li>Models are canaried before serving.</li>
<li>Serving models can be rolled back.</li>
</ul>
<p><em>Monitoring:</em></p>
<ul>
<li>Dependency changes result in notification.</li>
<li>Data invariants hold for inputs.</li>
<li>Training and serving are not skewed.</li>
<li>Models are not too stale.</li>
<li>Models are numerically stable.</li>
<li>Computing performance has not regressed.</li>
<li>Prediction quality has not regressed.</li>
</ul>
<hr>
<p><a id="deployment"></a></p>
<h2 id="modeldeployment">Model deployment</h2>
<p>Be sure to have a versioning system in place for:</p>
<ul>
<li>Model parameters</li>
<li>Model configuration</li>
<li>Feature pipeline</li>
<li>Training dataset</li>
<li>Validation dataset</li>
</ul>
<p>A common way to deploy a model is to package the system into a Docker container and expose a REST API for inference.</p>
<p><strong>Canarying</strong>: Serve the new model to a small subset of users (e.g. 5%) while still serving the existing model to the remainder. Check to make sure the rollout is smooth, then deploy the new model to the rest of the users.</p>
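<p>One simple way to implement the routing (an assumption on my part, not the only scheme): hash the user id so each user is deterministically pinned to the same model across requests, rather than flipping a coin per request.</p>

```python
import hashlib

def use_canary_model(user_id: str, rollout_fraction: float = 0.05) -> bool:
    """Deterministically route a fixed fraction of users to the new model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000   # stable bucket in [0, 10000)
    return bucket < rollout_fraction * 10_000

# Roughly 5% of users land in the canary bucket.
routed = sum(use_canary_model(f"user-{i}") for i in range(10_000))
```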
<p><strong>Shadow mode:</strong> Ship a new model alongside the existing model, still using the existing model for predictions but storing the output of both models. Measuring the delta between the new and current model's predictions will give an indication of how drastically things will change when you switch to the new model.</p>
<hr>
<p><a id="maintenance"></a></p>
<h2 id="ongoingmodelmaintenance">Ongoing model maintenance</h2>
<p><a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf">Hidden Technical Debt in Machine Learning Systems</a> (quoted below, emphasis mine)</p>
<p>A primer on concept of technical debt:</p>
<blockquote>
<p>As with fiscal debt, there are often sound strategic reasons to take on technical debt. <strong>Not all debt is bad, but all debt needs to be serviced.</strong> Technical debt may be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation. The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability. <strong>Deferring such payments results in compounding costs.</strong> Hidden debt is dangerous because it compounds silently.</p>
</blockquote>
<p>Machine learning projects are not complete upon shipping the first version. If you are &quot;handing off&quot; a project and transferring model responsibility, it is extremely important to talk through the required model maintenance with the new team.</p>
<blockquote>
<p>Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.</p>
</blockquote>
<h3 id="maintenanceprinciples">Maintenance principles</h3>
<p><strong>CACE principle: Changing Anything Changes Everything</strong><br>
Machine learning systems are tightly coupled. Changes to the feature space, hyperparameters, learning rate, or any other &quot;knob&quot; can affect model performance.</p>
<p><em>Specific mitigation strategies:</em></p>
<ul>
<li>Create model validation tests which are run every time new code is pushed.</li>
<li>Decompose problems into <em>isolated</em> components where it makes sense to do so.</li>
</ul>
<p><strong>Undeclared consumers of your model may be inadvertently affected by your changes.</strong></p>
<blockquote>
<p>&quot;Without access controls, it is possible for some of these consumers to be undeclared consumers, consuming the output of a given prediction model as an input to another component of the system.&quot;</p>
</blockquote>
<p>If your model and/or its predictions are widely accessible, other components within your system may grow to depend on your model without your knowledge. Changes to the model (such as periodic retraining or redefining the output) may negatively affect those downstream components.</p>
<p><em>Specific mitigation strategies:</em></p>
<ul>
<li>Control access to your model by making outside components request permission and signal their usage of your model.</li>
</ul>
<p><strong>Avoid depending on input signals which may change over time.</strong><br>
Some features are obtained by a table lookup (e.g. word embeddings) or from an input pipeline which is outside the scope of your codebase. When these external feature representations change, the model's performance can suffer.</p>
<p><em>Specific mitigation strategies:</em></p>
<ul>
<li>Create a versioned copy of your input signals to provide stability against changes in external input pipelines. These versioned inputs can be specified in a model's configuration file.</li>
</ul>
<p><strong>Eliminate unnecessary features.</strong><br>
Regularly evaluate the effect of removing individual features from a given model. A model's feature space should only contain relevant and important features for the given task.</p>
<p>There are many strategies to determine feature importance, such as leave-one-out cross validation and feature permutation tests. Unimportant features add noise to your feature space and should be removed.</p>
<p><em>Tip: Document deprecated features (deemed unimportant) so that they aren't accidentally reintroduced later.</em></p>
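<p>scikit-learn ships a permutation-test helper; a sketch on a stand-in dataset (the threshold for &quot;unimportant&quot; is a judgment call):</p>

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop;
# features whose permutation barely hurts are candidates for removal.
result = permutation_importance(model, X_val, y_val,
                                n_repeats=5, random_state=0)
removal_candidates = [i for i, imp in enumerate(result.importances_mean)
                      if imp <= 0]
```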
<p><strong>Model performance will likely decline over time.</strong><br>
As the input distribution shifts, the model's performance will suffer. You should plan to periodically retrain your model such that it has always learned from recent &quot;real world&quot; data.</p>
<hr>
<p>This guide draws inspiration from the <a href="https://fullstackdeeplearning.com/">Full Stack Deep Learning Bootcamp</a>, <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">best practices released by Google</a>, my personal experience, and conversations with fellow practitioners.</p>
<p>Find something that's missing from this guide? Let me know!</p>
<h2 id="externalresources">External Resources</h2>
<p><strong>Blog posts</strong></p>
<ul>
<li><a href="https://karpathy.github.io/2019/04/25/recipe/">A Recipe for Training Neural Networks</a></li>
<li><a href="https://stackoverflow.blog/2020/10/12/how-to-put-machine-learning-models-into-production/">How to put machine learning models into production</a></li>
<li><a href="https://medium.com/@Ben_Reinhardt/designing-collaborative-ai-5c1e8dbc8810">Designing collaborative AI</a> (clever product design can reduce model performance requirements)</li>
<li><a href="https://www.fast.ai/2020/01/07/data-questionnaire/">Data project checklist - Jeremy Howard</a></li>
<li><a href="https://towardsdatascience.com/checklist-for-debugging-neural-networks-d8b2a9434f21">Checklist for debugging neural networks</a></li>
<li><a href="http://josh-tobin.com/troubleshooting-deep-neural-networks.html">Troubleshooting Deep Neural Networks</a></li>
<li><a href="https://matthewmcateer.me/blog/machine-learning-technical-debt/">Nitpicking Machine Learning Technical Debt</a></li>
<li><a href="https://becominghuman.ai/accelerate-machine-learning-with-active-learning-96cea4b72fdb">Accelerate Machine Learning with Active Learning</a></li>
</ul>
<p><strong>Papers</strong></p>
<ul>
<li><a href="https://d1.awsstatic.com/whitepapers/aws-managing-ml-projects.pdf">Managing Machine Learning Projects</a></li>
<li><a href="http://burrsettles.com/pub/settles.activelearning.pdf">Active Learning Literature Survey</a></li>
</ul>
<p><strong>Case studies</strong></p>
<ul>
<li><a href="https://dropbox.tech/machine-learning/content-suggestions-machine-learning">Using machine learning to predict what file you need next</a></li>
<li><a href="https://medium.com/pinterest-engineering/a-better-clickthrough-rate-how-pinterest-upgraded-everyones-favorite-engagement-metric-27f6fa6cba14">A better clickthrough rate: How Pinterest upgraded everyone’s favorite engagement metric</a></li>
</ul>
<p><strong>Talks</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=C5JElgliTeE">Leading Data Science Teams: A Framework To Help Guide Data Science Project Managers - Jeffrey Saltz</a></li>
<li><a href="https://www.youtube.com/watch?v=7D8unG3XMzU">An Only One Step Ahead Guide for Machine Learning Projects - Chang Lee</a>
<ul>
<li>An entertaining talk discussing advice for approaching machine learning projects. This talk will give you a &quot;flavor&quot; for the details covered in this guide.</li>
</ul>
</li>
<li><a href="https://www.youtube.com/watch?v=FE1r7_SQq6Y">Microsoft Research: Active Learning and Annotation</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[An overview of object detection: one-stage methods.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss an overview of deep learning techniques for <strong>object detection</strong> using <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a>. Object detection is useful for understanding what's in an image, describing both <em>what</em> is in an image and <em>where</em> those objects are found.</p>
<p>In general, there's two different approaches for this task</p>]]></description><link>https://www.jeremyjordan.me/object-detection-one-stage/</link><guid isPermaLink="false">5b2d6e9ab98f3b00bfbb1e4b</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Wed, 11 Jul 2018 14:06:15 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss an overview of deep learning techniques for <strong>object detection</strong> using <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a>. Object detection is useful for understanding what's in an image, describing both <em>what</em> is in an image and <em>where</em> those objects are found.</p>
<p>In general, there are two different approaches for this task – we can either make a fixed number of predictions on a grid (one stage) <em><strong>or</strong></em> leverage a proposal network to find objects and then use a second network to fine-tune these proposals and output a final prediction (two stage).</p>
<p>In this blog post, I'll discuss the one-stage approach towards object detection; a follow-up post will then discuss the two-stage approach. Each approach has its own strengths and weaknesses, which I'll discuss in the respective blog posts.</p>
<p><strong>Jump to:</strong></p>
<ul>
<li><a href="#understanding">Understanding the task</a></li>
<li><a href="#direct_prediction">Direct object prediction</a>
<ul>
<li><a href="#grid">Predictions on a grid</a></li>
<li><a href="#nonmax_suppression">Non-maximum suppression</a></li>
</ul>
</li>
<li><a href="#yolo">YOLO: You Only Look Once</a></li>
<li><a href="#ssd">SSD: Single Shot Detection</a></li>
<li><a href="#focal">Addressing object imbalance with focal loss</a></li>
<li><a href="#datasets">Common datasets and competitions</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p><a id="understanding"></a></p>
<h2 id="understandingthetask">Understanding the task</h2>
<p>The goal of object detection is to recognize instances of a predefined set of object classes (e.g. {people, cars, bikes, animals}) and describe the locations of each detected object in the image using a <em>bounding box</em>. Two examples are shown below.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-29-at-2.52.10-PM.png" alt="Screen-Shot-2018-06-29-at-2.52.10-PM"><br>
<small><a href="http://host.robots.ox.ac.uk/pascal/VOC/">Example images are taken from the PASCAL VOC dataset.</a></small></p>
<p>We'll use rectangles to describe the locations of each object, which may lead to imperfect localizations due to the shapes of objects. An alternative approach would be <a href="https://www.jeremyjordan.me/semantic-segmentation/">image segmentation</a> which provides localization at the pixel-level.</p>
<p><a id="direct_prediction"></a></p>
<h2 id="directobjectprediction">Direct object prediction</h2>
<p>This blog post will focus on model architectures which directly predict object bounding boxes for an image in a <strong>one-stage</strong> fashion. In other words, there is no intermediate task (as we'll discuss later with region proposals) which must be performed in order to produce an output. This leads to a simpler and faster model architecture, although it can sometimes struggle to be flexible enough to adapt to arbitrary tasks (such as mask prediction).</p>
<p><a id="grid"></a></p>
<h4 id="predictionsonagrid">Predictions on a grid</h4>
<p>In order to understand what's in an image, we'll feed our input through a <a href="https://www.jeremyjordan.me/convnet-architectures/">standard convolutional network</a> to build a <strong>rich feature representation</strong> of the original image. We'll refer to this part of the architecture as the <em>&quot;backbone&quot; network</em>, which is usually pre-trained as an image classifier to more cheaply learn how to <em>extract features from an image</em>. This is because data for image classification is easier (and thus cheaper) to label, as it requires only a single label per image as opposed to bounding box annotations for each object. Thus, we can train on a very large labeled dataset (such as ImageNet) in order to learn good feature representations.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-01-at-11.25.31-PM.png" alt="backbone architecture"></p>
<p>After pre-training the backbone architecture as an image classifier, we'll remove the last few layers of the network so that our backbone network outputs a collection of stacked feature maps which describe the original image in a <em>low spatial resolution</em> albeit a <em>high feature (channel) resolution</em>. In the example below, we have a 7x7x512 representation of our observation. Each of the 512 feature maps describe different characteristics of the original image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-01-at-11.29.17-PM.png" alt="grid output"></p>
<p>We can relate this 7x7 grid back to the original input in order to understand what each grid cell represents relative to the original image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-10.42.33-PM.png" alt="feature map to receptive field"></p>
<p>We can also determine <em>roughly</em> where objects are located in the coarse (7x7) feature maps by observing which grid cell contains the center of our bounding box annotation. We'll assign this grid cell as being &quot;responsible&quot; for detecting that specific object.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-11.04.51-PM.png" alt="bounding box center"></p>
<p>In order to detect this object, we will add another convolutional layer and learn the kernel parameters which combine the context of all 512 feature maps in order to produce an activation corresponding with the grid cell which contains our object.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-9.52.11-PM.png" alt="convolution operation"></p>
<p>If the input image contains multiple objects, we should have multiple activations on our grid denoting that an object is in each of the activated regions.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-9.54.18-PM.png" alt="multiple objects"></p>
<p>However, we cannot sufficiently describe each object with a single activation. In order to fully describe a detected object, we'll need to define:</p>
<ul>
<li>The likelihood that a grid cell contains an object ($p_{obj}$)</li>
<li>Which class the object belongs to ($c_1$, $c_2$, ..., $c_C$)</li>
<li>Four bounding box descriptors to describe the $x$ coordinate, $y$ coordinate, width, and height of a labeled box ($t_x$, $t_y$, $t_w$, $t_h$)</li>
</ul>
<p>Thus, we'll need to learn a convolution filter for <em>each</em> of the above attributes such that we produce $5 + C$ output channels to describe a <em>single bounding box</em> at each grid cell location. This means that we'll learn a set of weights to look across all 512 feature maps and determine which grid cells are likely to contain an object, what classes are likely to be present in each grid cell, and how to describe the bounding box for possible objects in each grid cell.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-10.00.39-PM.png" alt="one bounding box"></p>
<p>The full output of applying $5 + C$ convolutional filters is shown below for clarity, producing one bounding box descriptor for each grid cell.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-05-at-1.05.06-PM.png" alt="bounding box all grid cells"></p>
<p>However, some images might have multiple objects which &quot;belong&quot; to the same grid cell. We can alter our layer to produce $B(5 + C)$ filters such that we can predict $B$ bounding boxes for each grid cell location.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-10.13.04-PM.png" alt="two bounding boxes"></p>
<p>Visualizing the full convolutional output of our $B(5 + C)$ filters, we can see that our model will always produce a fixed number of $N \times N \times B$ predictions for a given image. We can then filter our predictions to only consider bounding boxes which has a $p_{obj}$ above some defined threshold.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-10.58.21-PM.png" alt="full grid one object"></p>
<p>Because of the convolutional nature of our detection process, <strong>multiple objects can be detected in parallel</strong>. However, we also end up predicting for a large number of grid cells where no object is found. Although we can filter these bounding boxes out by their $p_{obj}$ score, this introduces quite a large imbalance between the predicted bounding boxes which contain an object and those which do not contain an object.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-02-at-10.58.30-PM.png" alt="full grid two objects"></p>
<p>The two models I'll discuss below both use this concept of <strong>&quot;predictions on a grid&quot;</strong> to detect a fixed number of possible objects within an image. In the respective sections, I'll describe the nuances of each approach and fill in some of the details that I've glossed over in this section so that you can actually implement each model.</p>
<p><a id="nonmax_suppression"></a></p>
<h4 id="nonmaximumsuppression">Non-maximum suppression</h4>
<p>The &quot;predictions on a grid&quot; approach produces a fixed number of bounding box predictions for each image. However, we would like to filter these predictions in order to only output bounding boxes for objects that are actually likely to be in the image. Moreover, we want a single bounding box prediction for each object detected.</p>
<p>We can filter out most of the bounding box predictions by only considering predictions with a $p_{obj}$ above some defined confidence threshold. However, we still may be left with multiple high-confidence predictions describing the <em>same</em> object. Thus, we need a method for removing redundant object predictions such that each object is described by a single bounding box.</p>
<p>To accomplish this, we'll use a technique known as <strong>non-max suppression</strong>. At a high level, this technique will look at highly overlapping bounding boxes and suppress (or discard) all of the predictions except the highest confidence prediction.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-9.46.29-PM.png" alt="NMS steps"></p>
<p>We'll perform non-max suppression on <strong>each class separately</strong>. Again, the goal here is to remove <em>redundant</em> predictions so we shouldn't be concerned if we have two predictions that overlap if one box is describing a person and the other box is describing a bicycle. However, if two bounding boxes with high overlap are both describing a person, it's likely that these predictions are describing the same person.</p>
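<p>A straightforward greedy implementation of the procedure described above (run it once per class, as noted; boxes are assumed to be given as $[x_1, y_1, x_2, y_2]$ corner coordinates):</p>

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection-over-union of the best box with the remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]    # suppress heavy overlaps
    return keep

# Two heavily overlapping boxes and one separate box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)  # the overlapping 0.8 box is suppressed
```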
<p><a id="yolo"></a></p>
<h2 id="yoloyouonlylookonce">YOLO: You Only Look Once</h2>
<p>The YOLO model was first published (by Joseph Redmon et al.) in 2015 and subsequently revised in two following papers. In each section, I'll discuss the specific implementation details and refinements that were made to improve performance.</p>
<h4 id="backbonenetwork">Backbone network</h4>
<p>The original YOLO network uses a modified GoogLeNet as the backbone network. Redmon later created a new model named DarkNet-19 which follows the general design of $3 \times 3$ filters, doubling the number of channels at each pooling step; $1 \times 1$ filters are also used to periodically compress the feature representation throughout the network. His latest paper introduces a new, larger model named DarkNet-53 which offers improved performance over its predecessor.</p>
<p>All of these models were first pre-trained as image classifiers before being adapted for the detection task. In the second iteration of the YOLO model, Redmon discovered that using higher resolution images at the end of classification pre-training improved the detection performance and thus adopted this practice.</p>
<p>Adapting the classification network for detection simply consists of removing the last few layers of the network and adding a convolutional layer with $B \times (5 + C)$ filters to produce the $N \times N \times B$ bounding box predictions (each box described by $5 + C$ values).</p>
<h4 id="boundingboxesandconceptofanchorboxes">Bounding boxes (and concept of anchor boxes)</h4>
<p>The first iteration of the YOLO model <strong>directly predicts</strong> all four values which describe a bounding box. The $x$ and $y$ coordinates of each bounding box are defined relative to the top left corner of each grid cell and normalized by the cell dimensions such that the coordinate values are bounded between 0 and 1. We define the box's width and height such that our model predicts the <em>square-root</em> width and height; by defining the width and height of the boxes as square-root values, differences between large numbers are less significant than differences between small numbers (confirm this visually by looking at a plot of $y = \sqrt {x}$). Redmon chose this formulation because &quot;small deviations in large boxes matter less than in small boxes&quot; and thus, when calculating our loss function, we would like the emphasis to be placed on getting small boxes more exact. The bounding box width and height are normalized by the image width and height and thus are also bounded between 0 and 1. An L2 loss is applied during training.</p>
<p>This formulation was later revised to introduce the concept of a <strong>bounding box prior</strong>. Rather than expecting the model to directly produce unique bounding box descriptors for each new image, we will define a collection of bounding boxes with varying aspect ratios which embed some prior information about the shape of objects we're expecting to detect. Redmon offers an approach towards discovering the best aspect ratios by doing k-means clustering (with a custom distance metric) on all of the bounding boxes in your training dataset.</p>
<p>In the image below, you can see a collection of 5 bounding box priors (also known as anchor boxes) for the grid cell highlighted in yellow. With this formulation, each of the $B$ bounding boxes explicitly specialize in detecting objects of a specific size and aspect ratio.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/download--3-.png" alt="bounding box prior"></p>
<p><em>Note: Although it is not visualized, these anchor boxes are present for each cell in our prediction grid.</em></p>
<p>Rather than directly predicting the bounding box dimensions, we'll reformulate our task in order to simply predict the <em>offset</em> from our bounding box prior dimensions such that we can fine-tune our predicted bounding box dimensions. This reformulation makes the prediction task easier to learn.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-09-at-12.36.54-PM.png" alt="dimension refinement"></p>
<p>For similar reasons as originally predicting the square-root width and height, we'll define our task to predict the <em>log offsets</em> from our bounding box prior.</p>
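<p>As a sketch, here is how those offsets are decoded at inference time, following the parameterization given in the YOLOv2 paper (the helper names are my own): the raw outputs $t_x, t_y, t_w, t_h$ are interpreted relative to a grid cell at $(c_x, c_y)$ and a prior of size $(p_w, p_h)$.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Decode raw outputs (t_*) against a grid cell at (c_x, c_y) and a
    bounding box prior of size (p_w, p_h), per the YOLOv2 parameterization."""
    b_x = sigmoid(t_x) + c_x   # sigmoid keeps the center inside the grid cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)    # log-space offset scales the prior's width
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h
```

<p>Notice that a prediction of all zeros recovers a box centered in the grid cell with exactly the prior's dimensions, which is part of what makes this reformulated task easier to learn.</p>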
<h4 id="objectnessandassigninglabeledobjectstoaboundingbox">Objectness (and assigning labeled objects to a bounding box)</h4>
<p>In the first version of the model, the &quot;objectness&quot; score $p_{obj}$ was trained to approximate the Intersection over Union (IoU) between the predicted box and the ground truth label. When we calculate our loss during training, we'll match objects to whichever bounding box prediction (on the same grid cell) has the highest IoU score. For unmatched boxes, the only descriptor which we'll include in our loss function is $p_{obj}$.</p>
<p>After the addition of bounding box priors in YOLOv2, we can simply assign labeled objects to whichever anchor box (on the same grid cell) has the highest IoU score with the labeled object.</p>
<p>In the third version, Redmon redefined the &quot;objectness&quot; target score $p_{obj}$ to be 1 for the bounding box with the highest IoU score for each given target, and 0 for all remaining boxes. However, we will not include bounding boxes which have a high IoU score (above some threshold) but not the highest score when calculating the loss. In simple terms, it doesn't make sense to punish a good prediction just because it isn't the <em>best</em> prediction.</p>
<h4 id="classlabels">Class labels</h4>
<p>Originally, class prediction was performed at the <em>grid cell</em> level. This means that a single grid cell could not predict multiple bounding boxes of different classes. This was later revised to predict class for each bounding box using a softmax activation across classes and a cross entropy loss.</p>
<p>Redmon later changed the class prediction to use sigmoid activations for multi-label classification as he found a softmax is not necessary for good performance. This choice will depend on your dataset and whether or not your labels overlap (eg. &quot;golden retriever&quot; and &quot;dog&quot;).</p>
<h4 id="outputlayer">Output layer</h4>
<p>The first YOLO model simply predicts the $N \times N \times B$ bounding boxes using the output of our backbone network.</p>
<p>In YOLOv2, Redmon adds a weird skip connection splitting a higher resolution feature map across multiple channels as visualized below.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-08-at-11.31.01-PM.png" alt="weird skip connection"><br>
<small>The weird &quot;skip connection from higher resolution feature maps&quot; idea that I don't like. </small></p>
<p>Fortunately, this was changed in the third iteration for a more standard feature pyramid network output structure. With this method, we'll alternate between outputting a prediction and upsampling the feature maps (with skip connections). This allows for predictions that can take advantage of finer-grained information from earlier in the network, which helps for detecting small objects in the image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-09-at-11.21.03-AM.png" alt="FPN structure"><br>
<small><a href="https://arxiv.org/abs/1612.03144">Image credit</a></small></p>
<p><a id="ssd"></a></p>
<h2 id="ssdsingleshotdetection">SSD: Single Shot Detection</h2>
<p>The SSD model was also published (by Wei Liu et al.) in 2015, shortly after the YOLO model, and was also later refined in a subsequent paper. In each section, I'll discuss the specific implementation details for this model.</p>
<h4 id="backbonenetwork">Backbone network</h4>
<p>A VGG-16 model, pre-trained on ImageNet for image classification, is used as the backbone network. The authors make a few slight tweaks when adapting the model for the detection task, including: replacing fully connected layers with convolutional implementations, removing dropout layers, and replacing the last max pooling layer with a dilated convolution.</p>
<h4 id="boundingboxesandconceptofanchorboxes">Bounding boxes (and concept of anchor boxes)</h4>
<p>Rather than using k-means clustering to discover aspect ratios, the SSD model manually defines a collection of aspect ratios (eg. {1, 2, 3, 1/2, 1/3}) to use for the $B$ bounding boxes at each grid cell location.</p>
<p>For each bounding box, we'll predict the <em>offsets</em> from the anchor box for both the bounding box coordinates ($x$ and $y$) and dimensions (width and height). We'll use ReLU activations trained with a Smooth L1 loss.</p>
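<p>A rough sketch of the target encoding and the Smooth L1 loss, following the box parameterization used in the SSD paper (boxes here are in $(c_x, c_y, w, h)$ form; the function names are my own):</p>

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: quadratic near zero, linear for large residuals."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def encode_offsets(ground_truth, anchor):
    """Regression targets for a matched (ground truth, anchor) pair,
    with boxes given as (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = ground_truth
    d_cx, d_cy, d_w, d_h = anchor
    return np.array([
        (g_cx - d_cx) / d_w,  # center offsets, normalized by anchor size
        (g_cy - d_cy) / d_h,
        np.log(g_w / d_w),    # log-space scale offsets
        np.log(g_h / d_h),
    ])
```

<p>An anchor that perfectly matches its ground truth box produces all-zero targets, so the model only has to learn small corrections.</p>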
<h4 id="objectnessandassigninglabeledobjectstoaboundingbox">Objectness (and assigning labeled objects to a bounding box)</h4>
<p>One major distinction between YOLO and SSD is that SSD <em>does not</em> attempt to predict a value for $p_{obj}$. Whereas the YOLO model predicted the probability of an object and then predicted the probability of each class given that there was an object present, the SSD model attempts to directly predict the probability that a class is present in a given bounding box.</p>
<p>When calculating the loss, we'll match each ground truth box to the anchor box with the highest IoU — defining this box as being &quot;responsible&quot; for making the prediction. However, we'll also match the ground truth boxes with any other anchor boxes with an IoU above some defined threshold (0.5) in the same light of not punishing good predictions simply because they weren't the best. We can always rely on non-max suppression at inference time to filter out redundant predictions.</p>
<h4 id="classlabels">Class labels</h4>
<p>As I mentioned previously, the class predictions for SSD bounding boxes <em>are not</em> conditioned on the fact that an object is present. Thus, we directly predict the probability of each class using a softmax activation and cross entropy loss. Because we don't explicitly predict $p_{obj}$, it's important to have a class for &quot;background&quot; so that we can predict when no object is present.</p>
<p>Due to the fact that most of the boxes will belong to the &quot;background&quot; class, we will use a technique known as &quot;hard negative mining&quot; to sample negative (no object) predictions such that there is at most a 3:1 ratio between negative and positive predictions when calculating our loss.</p>
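<p>Here's a minimal sketch of this selection step, assuming we've already computed a per-prediction loss and a boolean mask of positive matches (the function and its arguments are illustrative, not from any particular implementation):</p>

```python
import numpy as np

def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Select which predictions contribute to the loss.

    Keeps all positives, plus only the highest-loss ("hardest") negatives,
    capped at `neg_pos_ratio` negatives per positive.
    """
    losses = np.asarray(losses, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    num_neg_keep = neg_pos_ratio * int(is_positive.sum())

    neg_indices = np.flatnonzero(~is_positive)
    # hardest negatives = the negative predictions with the largest loss
    hardest = neg_indices[np.argsort(losses[neg_indices])[::-1][:num_neg_keep]]

    mask = is_positive.copy()
    mask[hardest] = True
    return mask
```
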
<h4 id="outputlayer">Output layer</h4>
<p>To allow for predictions at multiple scales, the SSD output module progressively downsamples the convolutional feature maps, intermittently producing bounding box predictions (as shown with the arrows from convolutional layers to the predictions box).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-09-at-12.26.30-PM.png" alt="multi scale output"></p>
<p><a id="focal"></a></p>
<h2 id="addressingobjectimbalancewithfocalloss">Addressing object imbalance with focal loss</h2>
<p>As I mentioned earlier, we often end up with a large amount of bounding boxes in which no object is contained due to the nature of our &quot;predictions on a grid&quot; approach. Although we can easily filter these boxes out after making a fixed set of bounding box predictions, there is still a (foreground-background) class imbalance present which can introduce difficulties during training. This is especially difficult for models which don't separate prediction of objectness and class probability into two separate tasks, and instead simply include a &quot;background&quot; class for regions with no objects.</p>
<p>Researchers at Facebook proposed adding a scaling factor to the standard cross entropy loss such that it places more emphasis on &quot;hard&quot; examples during training, preventing easy negative predictions from dominating the training process.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/07/Screen-Shot-2018-07-10-at-10.34.22-PM.png" alt="focal loss"></p>
<p><small><a href="https://arxiv.org/abs/1708.02002">Image credit</a></small></p>
<p>As the researchers point out, easily classified examples can incur a non-trivial loss for standard cross entropy loss ($\gamma=0$) which, summed over a large collection of samples, can easily dominate the parameter update. The ${\left( {1 - {p_t}} \right)^\gamma }$ term acts as a tunable scaling factor to prevent this from occurring.</p>
<p>As the paper points out, &quot;with $\gamma=2$, an example classified with $p_t = 0.9$ would have 100X lower loss compared with CE and with $p_t = 0.968$ it would have 1000X lower loss.&quot;</p>
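<p>We can verify this scaling behavior directly with a few lines of NumPy:</p>

```python
import numpy as np

def cross_entropy(p_t):
    """Standard cross entropy for the true-class probability p_t."""
    return -np.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross entropy scaled by (1 - p_t)^gamma; gamma=0 recovers CE."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

# easily classified examples are down-weighted by orders of magnitude
for p_t in (0.9, 0.968):
    ratio = cross_entropy(p_t) / focal_loss(p_t)
    print(f"p_t = {p_t}: focal loss is ~{ratio:.0f}x smaller than cross entropy")
```
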
<p><a id="datasets"></a></p>
<h2 id="commondatasetsandcompetitions">Common datasets and competitions</h2>
<p>Below I've listed some common datasets that researchers use when evaluating new object detection models.</p>
<ul>
<li><a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html">PASCAL VOC 2012 Detection Competition</a></li>
<li><a href="http://cocodataset.org/#detection-2018">COCO 2018 Stuff Object Detection Task</a></li>
<li><a href="https://www.kaggle.com/c/imagenet-object-detection-challenge">ImageNet Object Detection Challenge</a></li>
<li><a href="https://www.kaggle.com/c/google-ai-open-images-object-detection-track">Google AI Open Images - Object Detection Track</a></li>
<li><a href="http://www.aiskyeye.com/views/index">Vision Meets Drones: A Challenge</a></li>
</ul>
<p><a id="further_reading"></a></p>
<h2 id="furtherreading">Further reading</h2>
<p><strong>Papers</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1809.02165v1">Deep Learning for Generic Object Detection: A Survey</a></li>
<li>YOLO
<ul>
<li><a href="https://arxiv.org/abs/1506.02640">You Only Look Once: Unified, Real-Time Object Detection</a></li>
<li><a href="https://arxiv.org/abs/1612.08242">YOLO9000: Better, Faster, Stronger</a></li>
<li><a href="https://arxiv.org/abs/1804.02767">YOLOv3: An Incremental Improvement</a></li>
</ul>
</li>
<li>SSD
<ul>
<li><a href="https://arxiv.org/abs/1512.02325">SSD: Single Shot MultiBox Detector</a></li>
<li><a href="https://arxiv.org/abs/1701.06659">DSSD: Deconvolutional Single Shot Detector</a> (I didn't discuss this in the blog post but it's worth the read)</li>
</ul>
</li>
<li><a href="https://arxiv.org/abs/1708.02002">Focal Loss for Dense Object Detection</a></li>
<li><a href="https://arxiv.org/abs/1807.03247">An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution</a> (see relevant section on object detection)
<ul>
<li><a href="https://www.youtube.com/watch?v=8yFQc6elePA">Explainer video</a></li>
</ul>
</li>
</ul>
<p><strong>Lectures</strong></p>
<ul>
<li><a href="https://www.youtube.com/watch?v=nDPWywWRIRo&amp;t=1967s">Stanford CS 231n: Lecture 11 | Detection and Segmentation</a></li>
</ul>
<p><strong>Blog posts</strong></p>
<ul>
<li><a href="http://zoey4ai.com/2018/05/12/deep-learning-object-detection/">Understanding deep learning for object detection</a></li>
<li><a href="https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088">Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3</a></li>
</ul>
<p><strong>Frameworks and GitHub repos</strong></p>
<ul>
<li><a href="https://github.com/tryolabs/luminoth">Luminoth</a></li>
<li><a href="https://github.com/thtrieu/darkflow">Darkflow</a></li>
</ul>
<p><strong>Tools for labeling data</strong></p>
<ul>
<li><a href="https://github.com/opencv/cvat">Computer Vision Annotation Tool (CVAT)</a></li>
<li><a href="https://github.com/tzutalin/labelImg">LabelImg</a></li>
<li><a href="https://github.com/Microsoft/VoTT">Microsoft's Visual Object Tagging Tool</a></li>
<li><a href="https://github.com/tryolabs/taggerine">Taggerine</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Evaluating image segmentation models.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When evaluating a standard machine learning model, we usually classify our predictions into four categories: true positives, false positives, true negatives, and false negatives. However, for the dense prediction task of image segmentation, it's not immediately clear what counts as a &quot;true positive&quot; and, more generally, how we</p>]]></description><link>https://www.jeremyjordan.me/evaluating-image-segmentation-models/</link><guid isPermaLink="false">5b09e003db6e4c00bf385278</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Wed, 30 May 2018 20:08:46 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>When evaluating a standard machine learning model, we usually classify our predictions into four categories: true positives, false positives, true negatives, and false negatives. However, for the dense prediction task of image segmentation, it's not immediately clear what counts as a &quot;true positive&quot; and, more generally, how we can evaluate our predictions. In this post, I'll discuss common methods for evaluating both semantic and instance segmentation techniques.</p>
<h2 id="semanticsegmentation">Semantic segmentation</h2>
<p>Recall that the task of <a href="https://www.jeremyjordan.me/semantic-segmentation/">semantic segmentation</a> is simply to predict the class of each pixel in an image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-21-at-10.44.23-PM.png" alt="Screen-Shot-2018-05-21-at-10.44.23-PM"><br>
<small><a href="https://arxiv.org/abs/1611.09326">Image credit</a></small></p>
<p>Our prediction output shape matches the input's spatial resolution (width and height) with a channel depth equivalent to the number of possible classes to be predicted. Each channel consists of a binary mask which labels areas where a specific class is present.</p>
<h4 id="intersectionoverunion">Intersection over Union</h4>
<p>The Intersection over Union (IoU) metric, also referred to as the Jaccard index, is essentially a method to quantify the percent overlap between the target mask and our prediction output.  This metric is closely related to the Dice coefficient which is often used as a <a href="https://www.jeremyjordan.me/semantic-segmentation/#loss">loss function</a> during training.</p>
<p>Quite simply, the IoU metric measures the number of pixels common between the target and prediction masks divided by the total number of pixels present across <em>both</em> masks.</p>
<p>$$ IoU = \frac{{target \cap prediction}}{{target \cup prediction}} $$</p>
<p>As a visual example, let's suppose we're tasked with calculating the IoU score of the following prediction, given the ground truth labeled mask.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/target_prediction.png" alt="target_prediction"></p>
<p>The <strong>intersection</strong> ($A \cap B$) is comprised of the pixels found in both the prediction mask <em>and</em> the ground truth mask, whereas the <strong>union</strong> ($A \cup B$) is simply comprised of all pixels found in either the prediction <em>or</em> target mask.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/intersection_union.png" alt="intersection_union"></p>
<p>We can calculate this easily using Numpy.</p>
<pre><code class="language-python">import numpy as np

intersection = np.logical_and(target, prediction)  # pixels present in both masks
union = np.logical_or(target, prediction)          # pixels present in either mask
iou_score = np.sum(intersection) / np.sum(union)
</code></pre>
<p>The IoU score is calculated for each class separately and then <strong>averaged over all classes</strong> to provide a global, mean IoU score of our semantic segmentation prediction.</p>
<h4 id="pixelaccuracy">Pixel Accuracy</h4>
<p>An alternative metric to evaluate a semantic segmentation is to simply report the percent of pixels in the image which were correctly classified. The pixel accuracy is commonly reported for each class separately as well as globally across all classes.</p>
<p>When considering the per-class pixel accuracy we're essentially evaluating a binary mask; a true positive represents a pixel that is correctly predicted to belong to the given class (according to the target mask) whereas a true negative represents a pixel that is correctly identified as not belonging to the given class.</p>
<p>$$ accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}} $$</p>
<p>This metric can sometimes provide misleading results when the class representation is small within the image, as the measure will be biased in mainly reporting how well you identify the negative case (ie. where the class is not present).</p>
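<p>A small made-up example makes this bias easy to see: a model that never predicts the class at all can still score a high per-class pixel accuracy when the class occupies only a few pixels.</p>

```python
import numpy as np

def per_class_pixel_accuracy(target, prediction, cls):
    """Pixel accuracy for a single class, evaluated as a binary mask."""
    t = target == cls
    p = prediction == cls
    tp = np.sum(t & p)    # correctly predicted class pixels
    tn = np.sum(~t & ~p)  # correctly predicted background pixels
    return (tp + tn) / t.size

# class 1 occupies 1 of 9 pixels; predicting "absent" everywhere still
# scores 8/9 accuracy despite completely missing the object
target = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])
prediction = np.zeros((3, 3), dtype=int)
accuracy = per_class_pixel_accuracy(target, prediction, cls=1)
```
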
<h2 id="instancesegmentation">Instance segmentation</h2>
<p>Instance segmentation models are a little more complicated to evaluate; whereas semantic segmentation models output a single segmentation mask, instance segmentation models produce a collection of local segmentation masks describing each object detected in the image. As such, evaluation methods for instance segmentation are quite similar to that of object detection, with the exception that we now calculate IoU of masks instead of bounding boxes.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/semantic_instance-2.png" alt="semantic_instance-2"><br>
<small><a href="https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46">Image credit</a></small></p>
<h4 id="calculatingprecision">Calculating Precision</h4>
<p>To evaluate our collection of predicted masks, we'll compare each of our predicted masks with each of the available target masks for a given input.</p>
<ul>
<li>
<p>A <strong>true positive</strong> is observed when a prediction-target mask pair has an IoU score which exceeds some predefined threshold.</p>
</li>
<li>
<p>A <strong>false positive</strong> indicates a predicted object mask had no associated ground truth object mask.</p>
</li>
<li>
<p>A <strong>false negative</strong> indicates a ground truth object mask had no associated predicted object mask.</p>
</li>
</ul>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-30-at-12.01.19-PM.png" alt="Screen-Shot-2018-05-30-at-12.01.19-PM"></p>
<p><strong>Precision</strong> effectively describes the <em>purity</em> of our positive detections relative to the ground truth. Of all of the objects that we predicted in a given image, how many of those objects actually had a matching ground truth annotation?</p>
<p>$$ Precision = \frac{TP}{TP + FP} $$</p>
<p><strong>Recall</strong> effectively describes the <em>completeness</em> of our positive predictions relative to the ground truth. Of all of the objects annotated in our ground truth, how many did we capture as positive predictions?</p>
<p>$$ Recall = \frac{TP}{TP + FN} $$</p>
<p>However, in order to calculate the precision and recall of a model output, we'll need to define what constitutes a <em><strong>positive detection</strong></em>. To do this, we'll calculate the IoU score between each (prediction, target) mask pair and then determine which mask pairs have an IoU score <em>exceeding a defined threshold value</em>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/IoU-comparison.png" alt="IoU-comparison"></p>
<p>However, computing a single precision and recall score at the specified IoU threshold does not adequately describe the behavior of our model's full precision-recall curve. Instead, we can use <strong>average precision</strong> to effectively integrate the area under a precision-recall curve.</p>
<p>Let's use the precision-recall curve below as an example.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/12/calculated-PR-curve-1.png" alt="calculated-PR-curve"></p>
<p>First, we'll adjust our curve such that the precision at a given point $r$ is adjusted to the maximum precision for recall greater than $r$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/12/adjusted-PR-curve-1.png" alt="adjusted-PR-curve"></p>
<p>Then, we'll simply calculate the area under the curve by numerical integration. This method replaces an <a href="https://www.youtube.com/watch?v=yjCMEjoc_ZI">older approach of averaging over a range of recall values</a>.</p>
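<p>A minimal sketch of this computation, assuming the recall values are sorted in ascending order (the step-wise integration here is one reasonable choice, not the only one):</p>

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the adjusted precision-recall curve.

    The precision at recall r is first raised to the maximum precision
    observed at any recall >= r, then the area is computed as a step-wise sum.
    """
    r = np.asarray(recall, dtype=float)
    p = np.asarray(precision, dtype=float)
    # adjust: scan right-to-left so precision is monotonically non-increasing
    p = np.maximum.accumulate(p[::-1])[::-1]
    # width of each recall step times the adjusted precision over that step
    return np.sum(np.diff(np.concatenate(([0.0], r))) * p)
```
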
<p><em>Note that the precision-recall curve will likely not extend out to perfect recall due to our prediction thresholding according to each mask IoU.</em></p>
<p>As an example, the <a href="http://cocodataset.org/#detection-eval">Microsoft COCO challenge</a>'s primary metric for the detection task evaluates the average precision score using IoU thresholds ranging from 0.5 to 0.95 (in 0.05 increments).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/threshold_ranges.png" alt="threshold_ranges"></p>
<p>For prediction problems with multiple classes of objects, this value is then averaged over all of the classes.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[An overview of semantic image segmentation.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss how to use <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a> for the task of <strong>semantic image segmentation</strong>. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.</p>
<blockquote>
<p>&quot;What's in this image, and where in the image is</p></blockquote>]]></description><link>https://www.jeremyjordan.me/semantic-segmentation/</link><guid isPermaLink="false">5abd49c16eaef00022b75ecb</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Tue, 22 May 2018 03:11:27 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss how to use <a href="https://www.jeremyjordan.me/convolutional-neural-networks/">convolutional neural networks</a> for the task of <strong>semantic image segmentation</strong>. Image segmentation is a computer vision task in which we label specific regions of an image according to what's being shown.</p>
<blockquote>
<p>&quot;What's in this image, and where in the image is it located?&quot;</p>
</blockquote>
<p><strong>Jump to:</strong></p>
<ul>
<li><a href="#representing">Representing the task</a></li>
<li><a href="#constructing">Constructing an architecture</a>
<ul>
<li><a href="#upsampling">Methods for upsampling</a></li>
<li><a href="#fully_convolutional">Fully convolutional networks</a></li>
<li><a href="#skip_connections">Adding skip connections</a></li>
<li><a href="#advanced_unet">Advanced U-Net variants</a></li>
<li><a href="#dilated_convolutions">Dilated convolutions</a></li>
</ul>
</li>
<li><a href="#loss">Defining a loss function</a></li>
<li><a href="#datasets">Common datasets and segmentation competitions</a></li>
<li><a href="#further_reading">Further reading</a></li>
</ul>
<p>More specifically, the goal of semantic image segmentation is to label <em>each pixel</em> of an image with a corresponding <em><strong>class</strong></em> of what is being represented. Because we're predicting for every pixel in the image, this task is commonly referred to as <strong>dense prediction</strong>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-17-at-7.42.16-PM.png" alt="Screen-Shot-2018-05-17-at-7.42.16-PM"><br>
<small>An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image. <a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#devkit">(Source)</a></small></p>
<p>One important thing to note is that we're not separating <em>instances</em> of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as <em>instance segmentation</em> models, which <em>do</em> distinguish between separate objects of the same class.</p>
<p>Segmentation models are useful for a variety of tasks, including:</p>
<ul>
<li><strong>Autonomous vehicles</strong><br>
We need to equip cars with the necessary perception to understand their environment so that self-driving cars can safely integrate into our existing roads.</li>
</ul>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/deeplabcityscape.gif" alt="deeplabcityscape"><br>
<small>A real-time segmented road scene for autonomous driving. <a href="https://www.youtube.com/watch?v=ATlcEDSPWXY">(Source)</a></small></p>
<ul>
<li><strong>Medical image diagnostics</strong><br>
Machines can augment analysis performed by radiologists, greatly reducing the time required to run diagnostic tests.</li>
</ul>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-23-at-7.17.43-PM.png" alt="chest xray"><br>
<small>A chest x-ray with the heart (red), lungs (green), and clavicles (blue) are segmented. <a href="https://arxiv.org/abs/1701.08816">(Source)</a></small></p>
<p><a id="representing"></a></p>
<h2 id="representingthetask">Representing the task</h2>
<p>Simply, our goal is to take either an RGB color image ($height \times width \times 3$) or a grayscale image ($height \times width \times 1$) and output a segmentation map where each pixel contains a class label represented as an integer ($height \times width \times 1$).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-17-at-9.02.15-PM.png" alt="input to label"></p>
<p><em>Note: For visual clarity, I've labeled a low-resolution prediction map. In reality, the segmentation label resolution should match the original input's resolution.</em></p>
<p>Similar to how we treat standard categorical values, we'll create our <strong>target</strong> by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-16-at-9.36.00-PM.png" alt="one hot"></p>
<p>A prediction can be collapsed into a segmentation map (as shown in the first image) by taking the <code>argmax</code> of each depth-wise pixel vector.</p>
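<p>Both directions of this transformation are only a couple of lines in NumPy (the tiny label map below is my own toy example):</p>

```python
import numpy as np

num_classes = 3
labels = np.array([[0, 1],
                   [2, 1]])  # a toy 2x2 segmentation map of integer class ids

# one-hot encode: one binary mask channel per class, shape (H, W, C)
one_hot = np.eye(num_classes, dtype=int)[labels]

# collapse back to a segmentation map with a depth-wise argmax
segmentation_map = np.argmax(one_hot, axis=-1)
```
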
<p>We can easily inspect a target by overlaying it onto the observation.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-16-at-9.36.38-PM.png" alt="overlay"></p>
<p>When we overlay a <em>single channel</em> of our target  (or prediction), we refer to this as a <strong>mask</strong> which illuminates the regions of an image where a specific class is present.</p>
<p><a id="constructing"></a></p>
<h2 id="constructinganarchitecture">Constructing an architecture</h2>
<p>A naive approach towards constructing a neural network architecture for this task is to simply stack a number of convolutional layers (with <code>same</code> padding to preserve dimensions) and output a final segmentation map. This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-19-at-12.32.20-PM.png" alt="Screen-Shot-2018-05-19-at-12.32.20-PM"><br>
<small><a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">Image credit</a></small></p>
<p>Recall that for deep convolutional networks, earlier layers tend to learn low-level concepts while later layers develop more high-level (and specialized) feature mappings. <strong>In order to maintain expressiveness, we typically need to increase the number of feature maps (channels) as we get deeper in the network.</strong></p>
<p>This didn't necessarily pose a problem for the task of image classification, because for that task we only care about <em>what</em> the image contains (and not where it is located). Thus, we could alleviate computational burden by periodically downsampling our feature maps through pooling or strided convolutions (ie. compressing the spatial resolution) without concern. However, for image segmentation, we would like our model to produce a <em>full-resolution</em> semantic prediction.</p>
<p>One popular approach for image segmentation models is to follow an <strong>encoder/decoder structure</strong> where we <em>downsample</em> the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and then <em>upsample</em> the feature representations into a full-resolution segmentation map.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-16-at-10.33.29-PM.png" alt="Screen-Shot-2018-05-16-at-10.33.29-PM"><br>
<small><a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">Image credit</a></small></p>
<p><a id="upsampling"></a></p>
<h4 id="methodsforupsampling">Methods for upsampling</h4>
<p>There are a few different approaches that we can use to <em>upsample</em> the resolution of a feature map. Whereas pooling operations downsample the resolution by summarizing a local area with a single value (ie. average or max pooling), &quot;unpooling&quot; operations upsample the resolution by distributing a single value into a higher resolution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-19-at-12.54.50-PM.png" alt="Screen-Shot-2018-05-19-at-12.54.50-PM"><br>
<small><a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">Image credit</a></small></p>
<p>However, <strong>transpose convolutions</strong> are by far the most popular approach as they allow for us to develop a <em>learned upsampling</em>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-19-at-3.12.51-PM.png" alt="Screen-Shot-2018-05-19-at-3.12.51-PM"><br>
<small><a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">Image credit</a></small></p>
<p>Whereas a typical convolution operation will take the dot product of the values currently in the filter's view and produce a single value for the corresponding output position, a transpose convolution essentially does the opposite. For a transpose convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-21-at-11.01.29-PM.png" alt="Screen-Shot-2018-05-21-at-11.01.29-PM"><br>
<small>A simplified 1D example of upsampling through a transpose operation. <a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">(Source)</a></small></p>
<p>For filter sizes which produce an overlap in the output feature map (e.g. a 3x3 filter with stride 2, as shown in the example below), the overlapping values are simply added together. Unfortunately, this tends to produce a checkerboard artifact in the output, so it's best to ensure that your filter size and stride do not produce an overlap.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/padding_strides_transposed--1-.gif" alt="padding_strides_transposed"><br>
<small>Input in blue, output in green. <a href="https://github.com/vdumoulin/conv_arithmetic">(Source)</a></small></p>
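<p>The project-and-sum behavior described above can be sketched in a few lines of NumPy. This is a toy 1D version for illustration (not a library implementation): each input value is multiplied by every filter weight and written into the larger output, with overlapping positions summed.</p>

```python
import numpy as np

def transpose_conv1d(x, w, stride=2):
    """Upsample a 1D signal x by projecting each input value through filter w.

    Each input value multiplies every filter weight, and the weighted copies
    are written into the (larger) output; overlapping positions are summed.
    """
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(w)] += value * w
    return out

x = np.array([1.0, 2.0])
w = np.array([1.0, 1.0, 1.0])  # 3-tap filter with stride 2 -> one sample of overlap
print(transpose_conv1d(x, w))  # [1. 1. 3. 2. 2.] -- the middle value sums both inputs' contributions
```

With a 3-tap filter and stride 2, each input's projection overlaps its neighbor's by one position; those overlapping contributions add, which is exactly the source of the checkerboard artifact mentioned above.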
<p><a id="fully_convolutional"></a></p>
<h4 id="fullyconvolutionalnetworks">Fully convolutional networks</h4>
<p>The approach of using a &quot;fully convolutional&quot; network trained end-to-end, pixels-to-pixels for the task of image segmentation was introduced by <a href="https://arxiv.org/abs/1411.4038">Long et al.</a> in late 2014. The paper's authors propose adapting existing, well-studied <em>image classification</em> networks (eg. AlexNet) to serve as the encoder module of the network, appending a decoder module with transpose convolutional layers to upsample the coarse feature maps into a full-resolution segmentation map.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-9.53.20-AM.png" alt="Screen-Shot-2018-05-20-at-9.53.20-AM"><br>
<small><a href="https://arxiv.org/abs/1411.4038">Image credit (with modification)</a></small></p>
<p>The full network, as shown below, is trained according to a pixel-wise cross entropy loss.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-16-at-10.34.02-PM.png" alt="Screen-Shot-2018-05-16-at-10.34.02-PM"><br>
<small><a href="https://arxiv.org/abs/1411.4038">Image credit</a></small></p>
<p>However, because the encoder module reduces the resolution of the input by a factor of 32, the decoder module <strong>struggles to produce fine-grained segmentations</strong> (as shown below).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-10.15.09-AM.png" alt="Screen-Shot-2018-05-20-at-10.15.09-AM"></p>
<p>The paper's authors comment eloquently on this struggle:</p>
<blockquote>
<p>Semantic segmentation faces an inherent tension between semantics and location: global information resolves <strong>what</strong> while local information resolves <strong>where</strong>... Combining fine layers and coarse layers lets the model make local predictions that respect global structure. ― <a href="https://arxiv.org/abs/1411.4038">Long et al.</a></p>
</blockquote>
<p><a id="skip_connections"></a></p>
<h4 id="addingskipconnections">Adding skip connections</h4>
<p>The authors address this tension by slowly upsampling (in stages) the encoded representation, adding &quot;skip connections&quot; from earlier layers, and summing these two feature maps.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-12.26.53-PM.png" alt="FCN-8s"><br>
<small><a href="https://arxiv.org/abs/1411.4038">Image credit (with modification)</a></small></p>
<p>These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grained detail with the addition of these skip connections.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-12.10.25-PM.png" alt="Screen-Shot-2018-05-20-at-12.10.25-PM"></p>
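<p>The fusion step itself is just elementwise addition of an upsampled coarse map with an earlier, higher-resolution map. A toy NumPy sketch of that arithmetic (using a fixed nearest-neighbour upsampling purely for illustration; the paper learns the upsampling with a transpose convolution):</p>

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (H, W) feature map.
    (FCN learns this step with a transpose convolution; a fixed
    upsampling is used here just to show the skip-connection fusion.)"""
    return np.repeat(np.repeat(fmap, 2, axis=0), 2, axis=1)

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])       # deep, low-resolution class scores
skip = np.full((4, 4), 0.5)          # earlier, higher-resolution scores

fused = upsample2x(coarse) + skip    # FCN-style fusion: upsample, then sum
print(fused.shape)  # (4, 4)
```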
<p><a href="https://arxiv.org/abs/1505.04597">Ronneberger et al.</a> improve upon the &quot;fully convolutional&quot; architecture primarily through <em><strong>expanding the capacity of the decoder</strong></em> module of the network. More concretely, they propose the <strong>U-Net architecture</strong> which &quot;consists of a contracting path to capture context and a <em><strong>symmetric</strong></em> expanding path that enables precise localization.&quot; This simpler architecture has grown to be very popular and has been adapted for a variety of segmentation problems.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-1.46.43-PM.png" alt="U Net"><br>
<small><a href="https://arxiv.org/abs/1505.04597">Image credit</a></small></p>
<p><em>Note: The original architecture introduces a decrease in resolution due to the use of <code>valid</code> padding. However, some practitioners opt to use <code>same</code> padding where the padding values are obtained by image reflection at the border.</em></p>
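<p>To make the structure concrete, here is a minimal one-level U-Net sketch in PyTorch (framework choice is my assumption, not the paper's reference implementation). It uses the <code>same</code>-padding variant noted above so resolution is preserved; the defining feature is that the skip connection <em>concatenates</em> encoder features onto the upsampled decoder features, rather than summing them.</p>

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    # two 3x3 convolutions with `same` padding, as in the common U-Net variant
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """One-level U-Net: contract, expand, and concatenate the skip connection."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc = block(in_ch, 16)
        self.down = nn.MaxPool2d(2)
        self.mid = block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = block(32, 16)           # 32 = 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, n_classes, 1)

    def forward(self, x):
        skip = self.enc(x)                 # full-resolution encoder features
        x = self.mid(self.down(skip))      # contracting path
        x = self.up(x)                     # learned upsampling
        x = torch.cat([skip, x], dim=1)    # skip features are *concatenated*
        return self.head(self.dec(x))

out = TinyUNet()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64]) -- full-resolution class scores
```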
<p>Whereas <a href="https://arxiv.org/abs/1411.4038">Long et al.</a> (FCN paper) reported that data augmentation (&quot;randomly mirroring and “jittering” the images by translating them up to 32 pixels&quot;) did not result in a noticeable improvement in performance, <a href="https://arxiv.org/abs/1505.04597">Ronneberger et al.</a> (U-Net paper) credit data augmentations (&quot;random elastic deformations of the training samples&quot;) as a key concept for learning. It appears as if <strong>the usefulness (and type) of data augmentation depends on the problem domain</strong>.</p>
<p><a id="advanced_unet"></a></p>
<h4 id="advancedunetvariants">Advanced U-Net variants</h4>
<p>The standard U-Net model consists of a series of convolution operations for each &quot;block&quot; in the architecture. As I discussed in my post on <a href="https://www.jeremyjordan.me/convnet-architectures/">common convolutional network architectures</a>, there exist a number of more advanced &quot;blocks&quot; that can be substituted in for stacked convolutional layers.</p>
<p><a href="https://arxiv.org/abs/1608.04117">Drozdzal et al.</a> swap out the basic stacked convolution blocks in favor of <strong>residual blocks</strong>. This residual block introduces short skip connections (within the block) alongside the existing long skip connections (between the corresponding feature maps of encoder and decoder modules) found in the standard U-Net structure. They report that the short skip connections allow for faster convergence when training and allow for deeper models to be trained.</p>
<p>Expanding on this, <a href="https://arxiv.org/abs/1611.09326">Jegou et al.</a> proposed the use of <strong>dense blocks</strong>, still following a U-Net structure, arguing that the &quot;characteristics of DenseNets make them a very good fit for semantic segmentation as they <em>naturally induce skip connections and multi-scale supervision</em>.&quot; These dense blocks are useful as they carry low level features from previous layers directly alongside higher level features from more recent layers, allowing for highly efficient feature reuse.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-3.42.24-PM.png" alt="FC DenseNet"><br>
<small><a href="https://arxiv.org/abs/1611.09326">Image credit (with modification)</a></small></p>
<p>One very important aspect of this architecture is the fact that the upsampling path <em>does not</em> have a skip connection between the input and output of a dense block. The authors note that because the &quot;upsampling path <em>increases</em> the feature maps spatial resolution, the linear growth in the number of features would be too memory demanding.&quot;  Thus, only the <em>output</em> of a dense block is passed along in the decoder module.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-21-at-10.44.23-PM.png" alt="Screen-Shot-2018-05-21-at-10.44.23-PM"><br>
<small>The FC-DenseNet103 model achieves <a href="https://arxiv.org/abs/1611.09326">state of the art results</a> (Oct 2017) on the CamVid dataset.</small></p>
<p><a id="dilated_convolutions"></a></p>
<h4 id="dilatedatrousconvolutions">Dilated/atrous convolutions</h4>
<p>One benefit of downsampling a feature map is that it <em>broadens the receptive field</em> (with respect to the input) for the following filter, given a constant filter size. Recall that this approach is more desirable than increasing the filter size due to the parameter inefficiency of large filters (discussed <a href="https://arxiv.org/abs/1512.00567">here</a> in Section 3.1). However, this broader context comes at the cost of reduced spatial resolution.</p>
<p><strong>Dilated convolutions</strong> provide an alternative approach towards gaining a wide field of view while preserving the full spatial resolution. As shown in the figure below, the values used for a dilated convolution are spaced apart according to some specified <em>dilation rate</em>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/dilation.gif" alt="dilation"><br>
<small><a href="https://github.com/vdumoulin/conv_arithmetic">Image credit</a></small></p>
<p><a href="https://arxiv.org/abs/1511.07122">Some architectures</a> swap out the last few pooling layers for dilated convolutions with successively higher dilation rates to maintain the same field of view while preventing loss of spatial detail. However, it is often <a href="https://arxiv.org/abs/1606.00915">still too computationally expensive</a> to completely replace pooling layers with dilated convolutions.</p>
<p><a id="loss"></a></p>
<h2 id="definingalossfunction">Defining a loss function</h2>
<p>The most commonly used loss function for the task of image segmentation is a <strong>pixel-wise cross entropy loss</strong>. This loss examines <em>each pixel individually</em>, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-24-at-10.46.16-PM.png" alt="cross entropy"></p>
<p>Because the cross entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we're essentially assigning equal weight to every pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. <a href="https://arxiv.org/abs/1411.4038">Long et al.</a> (FCN paper) discuss weighting this loss for each <strong>output channel</strong> in order to counteract a class imbalance present in the dataset.</p>
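<p>As a concrete sketch of this channel-weighted loss (a toy NumPy version for illustration, not the FCN authors' implementation):</p>

```python
import numpy as np

def weighted_pixel_ce(probs, target, class_weights):
    """Pixel-wise cross entropy, weighted per output channel.

    probs:         (H, W, C) predicted class probabilities for each pixel
    target:        (H, W, C) one-hot encoded ground truth
    class_weights: (C,) per-class weights to counteract class imbalance
    """
    per_pixel = -np.sum(target * np.log(probs + 1e-9) * class_weights, axis=-1)
    return per_pixel.mean()  # average over all pixels

probs = np.array([[[0.9, 0.1], [0.4, 0.6]]])    # a 1x2 "image", 2 classes
target = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # one-hot label per pixel
print(weighted_pixel_ce(probs, target, np.array([1.0, 1.0])))  # unweighted
print(weighted_pixel_ce(probs, target, np.array([1.0, 5.0])))  # upweight rare class
```

Upweighting a channel makes mistakes on that class cost more, which pushes the model to attend to under-represented classes.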
<p>Meanwhile, <a href="https://arxiv.org/abs/1505.04597">Ronneberger et al.</a> (U-Net paper) discuss a loss weighting scheme for each <strong>pixel</strong> such that there is a higher weight at the border of segmented objects. This loss weighting scheme helped their U-Net model segment cells in biomedical images in a <em>discontinuous</em> fashion such that individual cells may be easily identified within the binary segmentation map.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-21-at-10.53.04-PM.png" alt="pixel loss weights"><br>
<small>Notice how the binary segmentation map produces clear borders around the cells. <a href="https://arxiv.org/abs/1505.04597">(Source)</a></small></p>
<hr>
<p>Another popular loss function for image segmentation tasks is based on the <a href="https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient">Dice coefficient</a>, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:</p>
<p>$$ Dice = \frac{{2\left| {A \cap B} \right|}}{{\left| A \right| + \left| B \right|}} $$</p>
<p>where ${\left| {A \cap B} \right|}$ represents the common elements between sets A and B, and $\left| A \right|$ represents the number of elements in set A (and likewise for set B).</p>
<p>For the case of evaluating a Dice coefficient on predicted segmentation masks, we can approximate ${\left| {A \cap B} \right|}$ as the element-wise multiplication between the prediction and target mask, and then sum the resulting matrix.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/intersection-1.png" alt="intersection"></p>
<p>Because our target mask is binary, we effectively zero-out any pixels from our prediction which are not &quot;activated&quot; in the target mask. For the remaining pixels, we are essentially penalizing low-confidence predictions; a higher value for this expression, which is in the numerator, leads to a better Dice coefficient.</p>
<p>In order to quantify $\left| A \right|$ and $\left| B \right|$, <a href="https://arxiv.org/abs/1608.04117">some researchers</a> use the simple sum whereas <a href="https://arxiv.org/abs/1606.04797">other researchers</a> prefer to use the squared sum for this calculation. I don't have the practical experience to know which performs better empirically over a wide range of tasks, so I'll leave you to try them both and see which works better.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-25-at-5.53.46-PM.png" alt="cardinality"></p>
<p>In case you were wondering, there's a 2 in the numerator in calculating the Dice coefficient because our denominator &quot;double counts&quot; the common elements between the two sets. In order to formulate a loss function which can be minimized, we'll simply use $1 - Dice$. This loss function is known as the <strong>soft Dice loss</strong> because we directly use the predicted probabilities instead of thresholding and converting them into a binary mask.</p>
<p>With respect to the neural network output, the numerator is concerned with the <em>common activations</em> between our prediction and target mask, whereas the denominator is concerned with the quantity of activations in each mask <em>separately</em>. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle to learn from classes with lesser spatial representation in an image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-24-at-10.50.59-PM.png" alt="soft dice"></p>
<p>A soft Dice loss is calculated for each class separately and then averaged to yield a final score. An example implementation is provided below.</p>
<script src="https://gist.github.com/jeremyjordan/9ea3032a32909f71dd2ab35fe3bacc08.js"></script>
<p><a id="datasets"></a></p>
<h2 id="commondatasetsandsegmentationcompetitions">Common datasets and segmentation competitions</h2>
<p>Below, I've listed a number of common datasets that researchers use to train new models and benchmark against the state of the art. You can also explore previous Kaggle competitions and read about how winning solutions implemented segmentation models for their given task.</p>
<p><strong>Datasets</strong></p>
<ul>
<li><a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html">PASCAL VOC 2012 Segmentation Competition</a></li>
<li><a href="http://cocodataset.org/#stuff-2018">COCO 2018 Stuff Segmentation Task</a></li>
<li><a href="http://bair.berkeley.edu/blog/2018/05/30/bdd/">BDD100K: A Large-scale Diverse Driving Video Database</a></li>
<li><a href="http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/">Cambridge-driving Labeled Video Database (CamVid)</a></li>
<li><a href="https://www.cityscapes-dataset.com/">Cityscapes Dataset</a></li>
<li><a href="https://www.mapillary.com/dataset/vistas">Mapillary Vistas Dataset</a></li>
<li><a href="http://apolloscape.auto/scene.html">ApolloScape Scene Parsing</a></li>
</ul>
<p><strong>Past Kaggle Competitions</strong></p>
<ul>
<li><a href="https://www.kaggle.com/c/data-science-bowl-2018">2018 Data Science Bowl</a>
<ul>
<li>Read about the <a href="https://www.kaggle.com/c/data-science-bowl-2018/discussion/54741">first place solution.</a></li>
</ul>
</li>
<li><a href="https://www.kaggle.com/c/carvana-image-masking-challenge">Carvana Image Masking Challenge</a>
<ul>
<li>Read about the <a href="https://arxiv.org/abs/1801.05746">first place solution.</a></li>
</ul>
</li>
<li><a href="https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection">Dstl Satellite Imagery Feature Detection</a>
<ul>
<li>Read about the <a href="https://arxiv.org/abs/1706.06169">third place solution.</a></li>
</ul>
</li>
</ul>
<p><a id="further_reading"></a></p>
<h2 id="furtherreading">Further Reading</h2>
<p>Papers</p>
<ul>
<li><a href="https://arxiv.org/abs/1605.06211">Fully Convolutional Networks for Semantic Segmentation</a></li>
<li><a href="https://arxiv.org/abs/1505.04597">U-Net: Convolutional Networks for Biomedical Image Segmentation</a></li>
<li><a href="https://arxiv.org/abs/1608.04117">The Importance of Skip Connections in Biomedical Image Segmentation</a></li>
<li><a href="https://arxiv.org/abs/1611.09326">The One Hundred Layers Tiramisu:<br>
Fully Convolutional DenseNets for Semantic Segmentation</a></li>
<li><a href="https://arxiv.org/abs/1511.07122">Multi-Scale Context Aggregation by Dilated Convolutions</a></li>
<li><a href="https://arxiv.org/abs/1606.00915">DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs</a></li>
<li><a href="https://arxiv.org/abs/1706.05587">Rethinking Atrous Convolution for Semantic Image Segmentation</a></li>
<li><a href="https://www.biorxiv.org/content/early/2018/05/31/335216">Evaluation of Deep Learning Strategies for Nucleus Segmentation in Fluorescence Images</a></li>
</ul>
<p>Lectures</p>
<ul>
<li><a href="https://youtu.be/nDPWywWRIRo">Stanford CS231n: Detection and Segmentation</a>
<ul>
<li><a href="http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf">Lecture Slides</a></li>
</ul>
</li>
</ul>
<p>Blog posts</p>
<ul>
<li><a href="http://matpalm.com/blog/counting_bees/">Mat Kelcey's (Twitter Famous) Bee Detector</a></li>
<li><a href="https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html">Semantic Image Segmentation with DeepLab in TensorFlow</a></li>
<li><a href="https://thegradient.pub/semantic-segmentation/">Going beyond the bounding box with semantic segmentation</a></li>
<li><a href="https://medium.com/@keremturgutlu/semantic-segmentation-u-net-part-1-d8d6f6005066">U-Net Case Study: Data Science Bowl 2018</a></li>
<li><a href="https://nikolasent.github.io/proj/comp2.html">Lyft Perception Challenge: 4th place solution</a></li>
</ul>
<p>Image labeling tools</p>
<ul>
<li><a href="https://github.com/wkentaro/labelme">labelme: Image Polygonal Annotation with Python</a></li>
</ul>
<p>Useful Github repos</p>
<ul>
<li><a href="https://github.com/meetshah1995/pytorch-semseg">Pytorch implementations</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Lessons learned from attempting to launch a startup.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In Q4 of 2017, I made the decision to walk down the entrepreneurial path and dedicate a full-time effort towards launching a startup venture. I secured a healthy seed round of funding from a local angel investor and recruited three of my peers to join me in this effort. Five</p>]]></description><link>https://www.jeremyjordan.me/mobius/</link><guid isPermaLink="false">5aec81508a2aae0022a40581</guid><category><![CDATA[Startups]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Wed, 09 May 2018 11:35:33 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In Q4 of 2017, I made the decision to walk down the entrepreneurial path and dedicate a full-time effort towards launching a startup venture. I secured a healthy seed round of funding from a local angel investor and recruited three of my peers to join me in this effort. Five months later, we decided to put the project on an indefinite pause. This blog post discusses what I set out to accomplish, what happened, and what I've learned from the experience.</p>
<h1 id="thebigidea">The big idea</h1>
<p>The venture was founded on a simple premise: <strong>we could use machine learning to forecast short-term fluctuations</strong> in market prices <strong>based on volume imbalances in order books</strong>.</p>
<p>Specifically, we were focused on the cryptoasset markets as these exchanges offered free, real-time data feeds to the order books and fulfilled trades.</p>
<p>An <strong>order book</strong> essentially maintains current unfulfilled orders for a given asset. The exchanges reference these order books in order to <em>match interested buyers and sellers</em>. If you're interested in trading something, you would publish your intentions to this order book, and your order would be executed according to the type of order placed. New orders which can be matched with an existing order on the book are executed immediately and <em>remove</em> that volume from the order book.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Order-book.png" alt="Order-book"></p>
<p>To gain an understanding for how this order book works, you can reference the following cartoon.</p>
<h5 id="anillustratedguidetoexchangeorderbooks">An illustrated guide to exchange order books</h5>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-08-at-9.33.12-AM-1.png#full" alt="OB cartoon"></p>
<p>Because the order book contains a collection of orders waiting to be matched, it essentially <em>displays the schedule for supply and demand</em> as a function of price. That is, you can see how many people have stated that they are willing to buy/sell when the asset moves to a given price. The market price changes when the volume at the &quot;best available price&quot; level is exhausted.</p>
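<p>A toy Python sketch of this mechanic (the prices and volumes are made up): a market buy consumes resting volume at the best ask, and the best available price moves only once that level is exhausted.</p>

```python
# A toy limit order book: bids/asks as {price: resting volume}.
asks = {100.0: 5, 101.0: 8}   # sellers waiting (price: volume)
bids = {99.0: 4, 98.0: 10}    # buyers waiting

def market_buy(quantity):
    """Match an incoming buy against the cheapest resting sell orders."""
    while quantity > 0 and asks:
        best = min(asks)                # best (lowest) available ask price
        fill = min(quantity, asks[best])
        quantity -= fill
        asks[best] -= fill
        if asks[best] == 0:
            del asks[best]              # level exhausted -> best price moves up

market_buy(5)
print(min(asks))  # 101.0 -- the 100.0 level was exhausted, so the best ask moved
```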
<p>As I watched how the order book and market prices evolved over a period of time, I began to notice predictable patterns emerge.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-08-at-10.10.26-AM.png" alt="patterns"></p>
<p>Feel free to check out a <a href="https://www.gdax.com/trade/BTC-USD">live feed of the BTC-USD order book</a> and see if you can spot any of the patterns mentioned alongside the corresponding price action.</p>
<p>It's also worth noting that <a href="https://www.jeremyjordan.me/blockchain-introduction/">cryptoassets</a> are a unique asset class given that the <strong>underlying assets being traded are publicly inspectable in real-time</strong>. For example, you can <a href="http://statoshi.info/dashboard/db/transactions">listen in on the Bitcoin network</a> and capture statistics such as the number of participants in the network, the total value being transacted on the network, and other useful metrics. Compare this to a US public equity which only releases internal information about the company in their quarterly reports. Given this radical openness and transparency, the asset class seems uniquely posed for quantitative, data-driven trading.</p>
<p>After securing funding and recruiting a team, we initially budgeted 2 to 3 months to build an initial proof of concept and further evaluate the feasibility of the idea.</p>
<h1 id="buildingtheproofofconcept">Building the proof of concept</h1>
<p>As we set out to build the proof of concept, the initial days consisted of architecting and clarifying the exact needs and requirements of our technical infrastructure. We discussed various machine learning models and approaches, and specified the desired characteristics of our models. Because the cryptoasset market is rapidly evolving, we wanted models which supported continuous &quot;online&quot; learning. We also posited that this would help safeguard our models from being manipulated by bad actors who could <a href="https://en.wikipedia.org/wiki/Spoofing_(finance)">&quot;spoof&quot; the order book data</a>; a model capable of continuous learning could evolve to not be susceptible to repeated manipulation attempts.</p>
<p>As the weeks passed, we unknowingly succumbed to <a href="https://en.wikipedia.org/wiki/Scope_creep"><em><strong>scope creep</strong></em></a> as our designs for a minimum viable product inched closer toward designs for a robust scalable product. I've realized the importance of building <em>forcing functions</em> into projects to specifically protect against scope creep.</p>
<p>Looking back, it's now clear that <strong>we had operated under the naive assumption that our fundamental hypothesis was correct</strong> (after all, it <em>seemed</em> like it was the right idea) which led us to focus our time addressing auxiliary problems such as scalability and performance optimizations. If we had instead focused <em>all of our energy</em> on testing our <a href="https://hackernoon.com/the-mvp-is-dead-long-live-the-rat-233d5d16ab02">riskiest (fundamental) assumption</a>, we would have discovered its flaws much more quickly.</p>
<h1 id="discoveringflawsinourfundamentalassumption">Discovering flaws in our fundamental assumption</h1>
<p>After spending a fair amount of time designing and iterating on our data infrastructure, we settled down to perform some more rigorous analysis of the data being collected by the real-time market feeds.</p>
<p>Here's what we learned:</p>
<ul>
<li>Most of the observed <strong>price fluctuations were too small to be profitable</strong> after accounting for trading fees.</li>
<li>The data was <strong>highly imbalanced</strong>: for most periods of time, the proper action was no action. When looking at snapshots of the order book once per second, <em>profitable buy and sell opportunities</em> were observed in only ~2% of the snapshots.</li>
<li>The <strong>signal wasn't as clear as it seemed</strong>. When I had manually been watching the exchange data for patterns, I was subject to confirmation bias where I was only attuned to the patterns when they <em>did</em> manifest, not when they didn't. The order book data in fact contains quite a bit of noise as traders can cancel their orders at any given time.</li>
</ul>
<p>For machine learning models, a class imbalance <em>amplifies</em> the learning challenge posed by datasets that are already inherently difficult to learn from.</p>
<p>The initial hypothesis was that we could use machine learning to forecast short-term price fluctuations based on the order book data. After validating the idea, we concluded that the idea was <em>not likely feasible</em>, at least not in the continuous, automated fashion we had imagined with machine learning.</p>
<p>I had started working on this problem as a side project as I was looking to apply what I have been learning about (machine learning) to address a messy, real world problem. The problem was attractive as it appeared to offer a wealth of data which was freely accessible - however, I did not accurately gauge the challenge of learning from this firehose of market data.</p>
<h1 id="movingforward">Moving forward</h1>
<p>When we discovered the flaws in our fundamental assumption, we had two main options: pivot or halt the project. After much deliberation, we eventually decided to halt this project. We each had our own reasons for reaching this conclusion, I'll only discuss my personal reasons.</p>
<p>At this stage in life, I’m focused on optimizing for learning and continuing to invest in myself. I’ve realized that your <em>environment</em> is a critical factor in this growth, so I’ve decided to embed myself in an environment where I can work alongside more senior engineers and gain experience working with teams who have a history of building machine learning products.</p>
<p>I'm 22 years old, I have a long career ahead of me. My insatiable desire to build things has cultivated an interest in both engineering and entrepreneurship, and I'm sure I'll walk down the entrepreneurial path once again later in my career. However, my current focus is on <em>mastering machine learning</em> and gaining practical work experience; I feel as if this focus will be beneficial for the <em>long-term trajectory</em> of my career.</p>
<blockquote>
<p>&quot;The average age of founders of the most successful startups—those with growth in the top 1% of their industry—was 45.&quot; <a href="https://work.qz.com/1260465/the-most-successful-startups-have-founders-over-the-age-of-40/">Source</a></p>
</blockquote>
<p>I learned a lot from my experience exploring this venture and I'm grateful for the challenges that I encountered. Here's to the next adventure.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Common architectures in convolutional neural networks.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss commonly used architectures for convolutional networks. As you'll see, almost all CNN architectures follow the same general design principles of successively applying convolutional layers to the input, periodically downsampling the spatial dimensions while increasing the number of feature maps.</p>
<p>While the classic network architectures were</p>]]></description><link>https://www.jeremyjordan.me/convnet-architectures/</link><guid isPermaLink="false">5a122a5a59e20e0022bcf030</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Fri, 20 Apr 2018 02:45:39 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In this post, I'll discuss commonly used architectures for convolutional networks. As you'll see, almost all CNN architectures follow the same general design principles of successively applying convolutional layers to the input, periodically downsampling the spatial dimensions while increasing the number of feature maps.</p>
<p>While the classic network architectures consisted simply of stacked convolutional layers, modern architectures explore new and innovative ways for constructing convolutional layers in a way which allows for more efficient learning. Almost all of these architectures are based on a repeatable unit which is used throughout the network.</p>
<p>These architectures serve as general design guidelines which machine learning practitioners will then adapt to solve various computer vision tasks. These architectures serve as <strong>rich feature extractors</strong> which can be used for image classification, object detection, image segmentation, and many other more advanced tasks.</p>
<p>Classic network architectures (included for historical purposes)</p>
<ul>
<li><a href="#lenet5">LeNet-5</a></li>
<li><a href="#alexnet">AlexNet</a></li>
<li><a href="#vgg16">VGG 16</a></li>
</ul>
<p>Modern network architectures</p>
<ul>
<li><a href="#inception">Inception</a></li>
<li><a href="#resnet">ResNet</a></li>
<li><a href="#resnext">ResNeXt</a></li>
<li><a href="#densenet">DenseNet</a></li>
</ul>
<p><a id="lenet5"></a></p>
<h2 id="lenet5">LeNet-5</h2>
<p>Yann Lecun's LeNet-5 model was developed in 1998 to identify handwritten digits for zip code recognition in the postal service. This pioneering model largely introduced the convolutional neural network as we know it today.</p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-16-at-11.34.51-AM.png" alt="Screen-Shot-2018-04-16-at-11.34.51-AM"></p>
<p>Convolutional layers use a subset of the previous layer's channels for each filter to reduce computation and force a break of symmetry in the network. The subsampling layers use a form of average pooling.</p>
<p><strong>Parameters:</strong> 60,000</p>
<p><strong>Paper:</strong> <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">Gradient-based learning applied to document recognition</a></p>
<p><a id="alexnet"></a></p>
<h2 id="alexnet">AlexNet</h2>
<p>AlexNet was developed by Alex Krizhevsky et al. in 2012 to compete in the ImageNet competition. The general architecture is quite similar to LeNet-5, although this model is considerably larger. The success of this model (which took first place in the 2012 ImageNet competition) convinced a lot of the computer vision community to take a serious look at deep learning for computer vision tasks.</p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/AlexNet-CNN-architecture-layers.png" alt="AlexNet-CNN-architecture-layers"></p>
<p><strong>Parameters:</strong> 60 million</p>
<p><strong>Paper:</strong> <a href="https://www.nvidia.cn/content/tesla/pdf/machine-learning/imagenet-classification-with-deep-convolutional-nn.pdf">ImageNet Classification with Deep Convolutional Neural Networks</a></p>
<p><a id="vgg16"></a></p>
<h2 id="vgg16">VGG-16</h2>
<p>The VGG network, introduced in 2014, offers a deeper yet simpler variant of the convolutional structures discussed above. At the time of its introduction, this model was considered to be very deep.</p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/vgg16.png" alt="vgg16"></p>
<p><strong>Parameters:</strong> 138 million</p>
<p><strong>Paper:</strong> <a href="https://arxiv.org/abs/1409.1556">Very Deep Convolutional Networks for Large-Scale Image Recognition</a></p>
<hr>
<p><a id="inception"></a></p>
<h2 id="inceptiongooglenet">Inception (GoogLeNet)</h2>
<p>In 2014, researchers at Google introduced the Inception network which took first place in the 2014 ImageNet competition for classification and detection challenges.</p>
<p>The model is built from a basic unit referred to as an &quot;Inception cell&quot; in which we perform a series of convolutions at different scales and subsequently aggregate the results. In order to save computation, 1x1 convolutions are used to reduce the input channel depth. For each cell, we learn a set of 1x1, 3x3, and 5x5 filters which can learn to extract features at different scales from the input. Max pooling is also used, albeit with &quot;same&quot; padding to preserve the dimensions so that the output can be properly concatenated.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-17-at-10.12.35-AM.png" alt="inception cell"></p>
<p>These researchers published a follow-up paper which introduced more efficient alternatives to the original Inception cell. Convolutions with large spatial filters (such as 5x5 or 7x7) are beneficial in terms of their expressiveness and ability to extract features at a larger scale, but the computation is disproportionately expensive. The researchers pointed out that a 5x5 convolution can be more cheaply represented by two stacked 3x3 filters.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-17-at-5.32.45-PM.png" alt="3x3 vs 5x5"></p>
<p>Whereas a $5 \times 5 \times c$ filter requires $25c$ parameters, two $3 \times 3 \times c$ filters only require $18c$ parameters. In order to most accurately represent a 5x5 filter, we shouldn't use any nonlinear activations between the two 3x3 layers. However, it was discovered that &quot;linear activation was always inferior to using rectified linear units in all stages of the factorization.&quot;</p>
<p>It was also shown that 3x3 convolutions could be further deconstructed into successive 3x1 and 1x3 convolutions.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-17-at-7.07.13-PM.png" alt="Screen-Shot-2018-04-17-at-7.07.13-PM"></p>
<p>Generalizing this insight, we can more efficiently compute an $n \times n$ convolution as a $1 \times n$ convolution followed by a $n \times 1$ convolution.</p>
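<p>As a rough sanity check on the savings claimed above, we can count the kernel weights directly. A minimal sketch (here <code>conv_params</code> is a hypothetical helper that counts only kernel weights, ignoring biases and assuming matching channel depths throughout):</p>

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels=1):
    """Number of kernel weights for a conv layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

c = 64  # example input channel depth

# One 5x5xc filter vs. two stacked 3x3xc filters
five_by_five = conv_params(5, 5, c)            # 25c -> 1600
two_three_by_three = 2 * conv_params(3, 3, c)  # 18c -> 1152
print(five_by_five, two_three_by_three)

# Factorizing an nxn filter into a 1xn followed by an nx1 convolution
n = 7
full = conv_params(n, n, c)                              # n*n*c -> 3136
factored = conv_params(1, n, c) + conv_params(n, 1, c)   # 2*n*c -> 896
print(full, factored)
```

<p>The same arithmetic scales with the output channel count, so the relative savings hold for full convolutional layers as well.</p>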
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/inception-model.png" alt="inception-model"></p>
<p>In order to improve overall network performance, two auxiliary outputs are attached at intermediate points in the network. It was later discovered that the earliest auxiliary output had no discernible effect on the final quality of the network; the auxiliary outputs primarily benefited the model near the end of training, converging at a slightly better value than the same network architecture without auxiliary branches. It is believed the auxiliary outputs had a regularizing effect on the network.</p>
<p>A revised, deeper version of the Inception network which takes advantage of the more efficient Inception cells is shown below.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/inception-v3-model.png" alt="inception-v3-model"></p>
<p><strong>Parameters:</strong> 5 million (V1) and 23 million (V3)</p>
<p><strong>Papers:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1409.4842">Going deeper with convolutions</a></li>
<li><a href="https://arxiv.org/abs/1512.00567">Rethinking the Inception Architecture for Computer Vision</a></li>
</ul>
<p><a id="resnet"></a></p>
<h2 id="resnet">ResNet</h2>
<p>Deep residual networks were a breakthrough idea which enabled the development of much deeper networks (hundreds of layers as opposed to tens of layers).</p>
<p>It's a generally accepted principle that deeper networks are capable of learning more complex functions and representations of the input, which should lead to better performance. However, many researchers observed that adding more layers eventually had a negative effect on the final performance. This behavior was not intuitively expected, as explained by the authors below.</p>
<blockquote>
<p>Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution <em>by construction</em> to the deeper model: the added layers are <em>identity</em> mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that <strong>a deeper model should produce no higher training error than its shallower counterpart</strong>. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).</p>
</blockquote>
<p>This phenomenon is referred to by the authors as the <em>degradation</em> problem - alluding to the fact that although better parameter initialization techniques and batch normalization allow for deeper networks to <em>converge</em>, they often converge at a higher error rate than their shallower counterparts. In the limit, simply stacking more layers degrades the model's ultimate performance.</p>
<p>The authors propose a remedy to this degradation problem by introducing <em>residual blocks</em> in which intermediate layers of a block learn a residual function with reference to the block input. You can think of this residual function as a refinement step in which we learn how to adjust the input feature map for higher quality features. This compares with a &quot;plain&quot; network in which each layer is expected to learn new and distinct feature maps. In the event that no refinement is needed, the intermediate layers can learn to gradually adjust their weights toward zero such that the residual block represents an identity function.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-16-at-6.29.19-PM.png" alt="residual unit"></p>
<p>Note: <a href="https://arxiv.org/abs/1603.05027">It was later discovered</a> that a slight modification to the original proposed unit offers better performance by more efficiently allowing gradients to propagate through the network during training.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-17-at-10.36.21-PM.png" alt="revised residual unit"></p>
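<p>The core idea of the residual block, output = x + F(x), can be sketched in a few lines of NumPy. This toy example uses small fully-connected layers in place of the paper's convolutions (purely for illustration), following the pre-activation ordering of the revised unit:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Toy pre-activation residual block: output = x + F(x).

    F is a small two-layer transformation; if w1 and w2 are zero,
    the block reduces exactly to the identity mapping.
    """
    residual = relu(x) @ w1
    residual = relu(residual) @ w2
    return x + residual

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# With zero weights the block is exactly the identity function,
# illustrating why "no refinement needed" is easy for the network to learn.
zeros = np.zeros((8, 8))
assert np.allclose(residual_block(x, zeros, zeros), x)
```

<p>Driving the residual weights toward zero recovers the identity, which is exactly the "constructed solution" the authors argue a plain network struggles to find.</p>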
<p><strong>Wide residual networks</strong><br>
Although the original ResNet paper focused on creating a network architecture to enable deeper structures by alleviating the degradation problem, <a href="https://arxiv.org/abs/1605.07146">other researchers have since pointed out</a> that increasing the network's width (channel depth) can be a more efficient way of expanding the overall capacity of the network.</p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-16-at-6.30.05-PM.png" alt="Screen-Shot-2018-04-16-at-6.30.05-PM"></p>
<p>Each colored block of layers represents a series of convolutions of the same dimension. The feature map is periodically downsampled via strided convolution, accompanied by an increase in channel depth to preserve the time complexity per layer. Dotted lines denote residual connections in which we project the input via a 1x1 convolution to match the dimensions of the new block.</p>
<p>The diagram above visualizes the ResNet 34 architecture. For the ResNet 50 model, we simply replace each two layer residual block with a three layer bottleneck block which uses 1x1 convolutions to reduce and subsequently restore the channel depth, allowing for a reduced computational load when calculating the 3x3 convolution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-16-at-5.47.38-PM.png" alt="Screen-Shot-2018-04-16-at-5.47.38-PM"></p>
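<p>To see why the bottleneck design reduces the computational load, we can count kernel weights for the two block variants at the 256-channel stage. A minimal sketch (<code>conv_w</code> is a hypothetical helper ignoring biases):</p>

```python
def conv_w(k, c_in, c_out):
    """Kernel weight count for a single kxk convolutional layer."""
    return k * k * c_in * c_out

# Two-layer basic block operating on 256 channels throughout
basic = 2 * conv_w(3, 256, 256)

# Bottleneck block: 1x1 reduce to 64, 3x3 at 64, 1x1 restore to 256
bottleneck = conv_w(1, 256, 64) + conv_w(3, 64, 64) + conv_w(1, 64, 256)

print(basic, bottleneck)  # 1179648 69632
```

<p>The bottleneck variant uses roughly 17x fewer weights for this stage, which is what makes the deeper ResNet 50/101/152 models tractable.</p>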
<p><strong>Parameters:</strong> 25 million (ResNet 50)</p>
<p><strong>Papers:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1512.03385">Deep Residual Learning for Image Recognition</a></li>
<li><a href="https://arxiv.org/abs/1603.05027">Identity Mappings in Deep Residual Networks</a></li>
<li><a href="https://arxiv.org/abs/1605.07146">Wide Residual Networks</a></li>
</ul>
<p><a id="resnext"></a></p>
<h2 id="resnext">ResNeXt</h2>
<p>The ResNeXt architecture is an extension of the deep residual network which replaces the standard residual block with one that leverages a &quot;<em>split-transform-merge</em>&quot; strategy (i.e. branched paths within a cell) similar to that used in the Inception models. Put simply, rather than performing convolutions over the full input feature map, the block's input is projected into a series of lower-dimensional (channel-wise) representations, to each of which we separately apply a few convolutional filters before merging the results.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-18-at-11.46.24-PM.png" alt="Screen-Shot-2018-04-18-at-11.46.24-PM"></p>
<p>This idea is quite similar to <a href="https://blog.yani.io/filter-group-tutorial/"><em>group convolutions</em></a>, which were proposed in the <a href="https://www.nvidia.cn/content/tesla/pdf/machine-learning/imagenet-classification-with-deep-convolutional-nn.pdf">AlexNet paper</a> as a way to share the convolution computation across two GPUs. Rather than creating filters with the full channel depth of the input, the input is split channel-wise into groups, and each group is convolved separately, as shown below.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-18-at-11.47.29-PM.png" alt="Screen-Shot-2018-04-18-at-11.47.29-PM"></p>
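<p>The parameter savings from grouping fall straight out of the arithmetic: each group only sees a fraction of the input channels. A minimal sketch (<code>conv_weights</code> is a hypothetical weight-counting helper, ignoring biases):</p>

```python
def conv_weights(k, c_in, c_out, groups=1):
    """Kernel weight count for a kxk conv split into `groups` groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    # each group sees c_in/groups input channels and
    # produces c_out/groups output channels
    return groups * k * k * (c_in // groups) * (c_out // groups)

dense = conv_weights(3, 256, 256)                # full channel depth
grouped = conv_weights(3, 256, 256, groups=32)   # cardinality-32 style split
print(dense, grouped)  # 589824 18432
```

<p>With 32 groups, the layer uses 32x fewer weights, which is why ResNeXt can raise cardinality without increasing overall model complexity.</p>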
<p>It was discovered that using grouped convolutions led to a degree of <em>specialization</em> among groups where separate groups focused on different characteristics of the input image.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-19-at-10.07.03-AM.png" alt="Screen-Shot-2018-04-19-at-10.07.03-AM"></p>
<p>The ResNeXt paper refers to the number of branches or groups as the <strong>cardinality</strong> of the ResNeXt cell and performs a series of experiments to understand relative performance gains between increasing the cardinality, depth, and width of the network. The experiments show that increasing cardinality is more effective at benefiting model performance than increasing the width or depth of the network. The experiments also suggest that &quot;residual connections are helpful for <em>optimization</em>, whereas aggregated transformations are (helpful for) <em>stronger representations</em>.&quot;</p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-19-at-12.08.53-AM.png" alt="Screen-Shot-2018-04-19-at-12.08.53-AM"></p>
<p>The ResNeXt architecture simply mimics the ResNet models, replacing the ResNet blocks with ResNeXt blocks.</p>
<p><strong>Paper:</strong> <a href="https://arxiv.org/abs/1611.05431">Aggregated Residual Transformations for Deep Neural Networks</a></p>
<p><a id="densenet"></a></p>
<h2 id="densenet">DenseNet</h2>
<p>The idea behind dense convolutional networks is simple: <strong>it may be useful to reference feature maps from earlier in the network</strong>. Thus, each layer's feature map is concatenated to the input of <em>every successive layer</em> within a dense block. This allows later layers within the network to <em>directly</em> leverage the features from earlier layers, encouraging feature reuse within the network. The authors state, &quot;concatenating feature-maps learned by <em>different layers</em> increases variation in the input of subsequent layers and improves efficiency.&quot;</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-19-at-5.15.59-PM.png" alt="Screen-Shot-2018-04-19-at-5.15.59-PM"></p>
<p>When I first came across this model, I figured that it would have an absurd number of parameters to support the dense connections between layers. However, because the network is capable of directly using any previous feature map, the authors found that they could work with very small output channel depths (i.e. 12 filters per layer), vastly <em>reducing</em> the total number of parameters needed. The authors refer to the number of filters used in each convolutional layer as a &quot;growth rate&quot;, $k$, since each successive layer will have $k$ more channels than the last (as a result of accumulating and concatenating all previous layers to the input).</p>
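<p>The channel accounting inside a dense block is simple enough to verify directly. A minimal sketch (hypothetical helper; <code>k0</code> is the block's input channel count):</p>

```python
def dense_block_channels(k0, growth_rate, num_layers):
    """Input channel count seen by each layer in a dense block.

    Layer l receives the block input (k0 channels) concatenated with
    the growth_rate-channel outputs of all l preceding layers.
    """
    return [k0 + l * growth_rate for l in range(num_layers + 1)]

# e.g. a 5-layer block with 24 input channels and growth rate k=12
print(dense_block_channels(24, 12, 5))  # [24, 36, 48, 60, 72, 84]
```

<p>The linear growth in channel count (rather than explosive growth in parameters) is what keeps DenseNets compact despite the dense connectivity.</p>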
<p>When compared with ResNet models, DenseNets are reported to achieve better performance with less complexity.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-19-at-7.27.08-PM.png" alt="Screen-Shot-2018-04-19-at-7.27.08-PM"></p>
<p><strong>Architecture</strong></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/04/Screen-Shot-2018-04-19-at-12.14.09-PM.png" alt="Screen-Shot-2018-04-19-at-12.14.09-PM"></p>
<p>For a majority of the experiments in the paper, the authors mimicked the general ResNet model architecture, simply swapping in the <em>dense block</em> as the repeated unit.</p>
<p><strong>Parameters:</strong></p>
<ul>
<li>0.8 million (DenseNet-100, k=12)</li>
<li>15.3 million (DenseNet-250, k=24)</li>
<li>40 million (DenseNet-190, k=40)</li>
</ul>
<p><strong>Paper:</strong> <a href="https://arxiv.org/abs/1608.06993">Densely Connected Convolutional Networks</a><br>
<strong>Video:</strong> <a href="https://www.youtube.com/watch?v=-W6y8xnd--U">CVPR 2017 Best Paper Award: Densely Connected Convolutional Networks</a></p>
<h2 id="furtherreading">Further reading</h2>
<ul>
<li><a href="https://arxiv.org/abs/1605.07678">An Analysis of Deep Neural Network Models for Practical Applications</a> (<a href="https://towardsdatascience.com/neural-network-architectures-156e5bad51ba">corresponding blog post</a>)</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Variational autoencoders.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In my <a href="https://www.jeremyjordan.me/autoencoders/">introductory post</a> on autoencoders, I discussed various models (undercomplete, sparse, denoising, contractive) which take data as input and discover some latent state representation of that data. More specifically, our input data is converted into an <em>encoding vector</em> where each dimension represents some learned attribute about the data. The</p>]]></description><link>https://www.jeremyjordan.me/variational-autoencoders/</link><guid isPermaLink="false">5aac1e0248f22a0022a825b3</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Mon, 19 Mar 2018 04:25:56 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In my <a href="https://www.jeremyjordan.me/autoencoders/">introductory post</a> on autoencoders, I discussed various models (undercomplete, sparse, denoising, contractive) which take data as input and discover some latent state representation of that data. More specifically, our input data is converted into an <em>encoding vector</em> where each dimension represents some learned attribute about the data. The most important detail to grasp here is that our encoder network is outputting a <em>single value</em> for each encoding dimension. The decoder network then subsequently takes these values and attempts to recreate the original input.</p>
<p>A variational autoencoder (VAE) provides a <em>probabilistic</em> manner for describing an observation in latent space. Thus, rather than building an encoder which outputs a single value to describe each latent state attribute, we'll formulate our encoder to describe a probability distribution for each latent attribute.</p>
<h2 id="intuition">Intuition</h2>
<p>To provide an example, let's suppose we've trained an autoencoder model on a large dataset of faces with an encoding dimension of 6. An ideal autoencoder will learn descriptive attributes of faces such as skin color, whether or not the person is wearing glasses, etc. in an attempt to describe an observation in some compressed representation.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-16-at-10.24.11-PM.png" alt="Screen-Shot-2018-03-16-at-10.24.11-PM"></p>
<p>In the example above, we've described the input image in terms of its latent attributes using a single value to describe each attribute. However, we may prefer to represent each latent attribute as a range of possible values. For instance, what <em>single value</em> would you assign for the smile attribute if you feed in a photo of the Mona Lisa? Using a variational autoencoder, we can describe latent attributes in probabilistic terms.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.46.16-PM.png" alt="Screen-Shot-2018-06-20-at-2.46.16-PM"></p>
<p>With this approach, we'll now represent <em>each latent attribute</em> for a given input as a probability distribution. When decoding from the latent state, we'll randomly sample from each latent state distribution to generate a vector as input for our decoder model.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.47.56-PM.png" alt="Screen-Shot-2018-06-20-at-2.47.56-PM"></p>
<p><em>Note: For variational autoencoders, the encoder model is sometimes referred to as the <strong>recognition model</strong> whereas the decoder model is sometimes referred to as the <strong>generative model</strong>.</em></p>
<p>By constructing our encoder model to output a range of possible values (a statistical distribution) from which we'll randomly sample to feed into our decoder model, we're essentially enforcing a continuous, smooth latent space representation. For any sampling of the latent distributions, we're expecting our decoder model to be able to accurately reconstruct the input. Thus, values which are nearby to one another in latent space should correspond with very similar reconstructions.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.48.42-PM.png" alt="Screen-Shot-2018-06-20-at-2.48.42-PM"></p>
<h2 id="statisicalmotivation">Statistical motivation</h2>
<p>Suppose that there exists some hidden variable $z$ which generates an observation $x$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-17-at-1.31.39-PM.png" alt="Screen-Shot-2018-03-17-at-1.31.39-PM"></p>
<p>We can only see $x$, but we would like to infer the characteristics of $z$. In other words, we’d like to compute $p\left( {z|x} \right)$.</p>
<p>$$ p\left( {z|x} \right) = \frac{{p\left( {x|z} \right)p\left( z \right)}}{{p\left( x \right)}} $$</p>
<p>Unfortunately, computing $p\left( x \right)$ is quite difficult.</p>
<p>$$ p\left( x \right) = \int {p\left( {x|z} \right)p\left( z \right)dz} $$</p>
<p>This usually turns out to be an <a href="https://stats.stackexchange.com/questions/4417/intractable-posterior-distributions">intractable distribution</a>. However, we can apply <a href="https://arxiv.org/pdf/1601.00670.pdf">variational inference</a> to estimate this value.</p>
<p>Let's approximate $p\left( {z|x} \right)$ by another distribution $q\left( {z|x} \right)$ which we'll define such that it has a tractable distribution. If we can define the parameters of $q\left( {z|x} \right)$ such that it is very similar to $p\left( {z|x} \right)$, we can use it to perform approximate inference of the intractable distribution.</p>
<p>Recall that the KL divergence is a measure of difference between two probability distributions. Thus, if we wanted to ensure that $q\left( {z|x} \right)$ was similar to $p\left( {z|x} \right)$, we could minimize the KL divergence between the two distributions.</p>
<p>$$ \min KL\left( {q\left( {z|x} \right)||p\left( {z|x} \right)} \right) $$</p>
<p>Dr. Ali Ghodsi goes through a full derivation <a href="https://youtu.be/uaaqyVS9-rM?t=19m42s">here</a>, but the result gives us that we can minimize the above expression by maximizing the following:</p>
<p>$$ {E_{q\left( {z|x} \right)}}\log p\left( {x|z} \right) - KL\left( {q\left( {z|x} \right)||p\left( z \right)} \right) $$</p>
<p>The first term represents the reconstruction likelihood and the second term ensures that our learned distribution $q$ is similar to the true prior distribution $p$.</p>
<p>To revisit our graphical model, we can use $q$ to infer the possible hidden variables (i.e. the latent state) used to generate an observation. We can further construct this model into a neural network architecture where the encoder model learns a mapping from $x$ to $z$ and the decoder model learns a mapping from $z$ back to $x$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-17-at-11.31.15-PM.png" alt="Screen-Shot-2018-03-17-at-11.31.15-PM"></p>
<p>Our loss function for this network will consist of two terms, one which penalizes reconstruction error (which can be thought of as maximizing the reconstruction likelihood as discussed earlier) and a second term which encourages our learned distribution ${q\left( {z|x} \right)}$ to be similar to the true prior distribution ${p\left( z \right)}$, which we'll assume follows a unit Gaussian distribution, for each dimension $j$ of the latent space.</p>
<p>$$ {\cal L}\left( {x,\hat x} \right) + \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||p\left( z \right)} \right)}  $$</p>
<h2 id="implementation">Implementation</h2>
<p>In the previous section, I established the statistical motivation for a variational autoencoder structure. In this section, I'll provide the practical implementation details for building such a model yourself.</p>
<p>Rather than directly outputting values for the latent state as we would in a standard autoencoder, the encoder model of a VAE will output parameters describing a distribution for each dimension in the latent space. Since we're assuming that our prior follows a normal distribution, we'll output <em>two</em> vectors describing the mean and variance of the latent state distributions. If we were to build a true multivariate Gaussian model, we'd need to define a covariance matrix describing how each of the dimensions are correlated. However, we'll make a simplifying assumption that our covariance matrix only has nonzero values on the diagonal, allowing us to describe this information in a simple vector.</p>
<p>Our decoder model will then generate a latent vector by sampling from these defined distributions and proceed to develop a reconstruction of the original input.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-18-at-12.24.19-AM.png" alt="Screen-Shot-2018-03-18-at-12.24.19-AM"></p>
<p>However, this sampling process requires some extra attention. When training the model, we need to be able to calculate the relationship of each parameter in the network with respect to the final output loss using a technique known as <a href="https://www.jeremyjordan.me/neural-networks-training/">backpropagation</a>. However, we simply cannot do this for a <em>random sampling</em> process. Fortunately, we can leverage a clever idea known as the &quot;reparameterization trick&quot; which suggests that we randomly sample $\varepsilon$ from a unit Gaussian, and then shift the randomly sampled $\varepsilon$ by the latent distribution's mean $\mu$ and scale it by the latent distribution's standard deviation $\sigma$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-18-at-4.36.34-PM.png" alt="Screen-Shot-2018-03-18-at-4.36.34-PM"></p>
<p>With this reparameterization, we can now optimize the <em>parameters</em> of the distribution while still maintaining the ability to randomly sample from that distribution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-18-at-4.39.41-PM.png" alt="Screen-Shot-2018-03-18-at-4.39.41-PM"></p>
<p><em>Note: In order to deal with the fact that the network may learn negative values for $\sigma$, we'll typically have the network learn $\log \sigma$ and exponentiate this value to recover the latent distribution's standard deviation.</em></p>
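<p>A minimal NumPy sketch of the reparameterization trick. This follows the common convention of parameterizing the log <em>variance</em> $\log \sigma^2$ (so the standard deviation is recovered as $\exp(0.5 \log \sigma^2)$); parameterizing $\log \sigma$ directly works equally well:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow
    through mu and log_var during backpropagation.
    """
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)  # recover std dev from log variance
    return mu + sigma * eps

mu = np.array([0.0, 2.0])
log_var = np.array([0.0, 0.0])  # unit variance in each dimension
z = sample_latent(mu, log_var)
```

<p>As the learned variance shrinks toward zero, samples collapse onto the mean, so the stochastic layer degrades gracefully into a deterministic one.</p>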
<h2 id="visualizationoflatentspace">Visualization of latent space</h2>
<p>To understand the implications of a variational autoencoder model and how it differs from standard autoencoder architectures, it's useful to examine the latent space. <a href="https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf">This blog post</a> introduces a great discussion on the topic, which I'll summarize in this section.</p>
<p>The main benefit of a variational autoencoder is that we're capable of learning <em>smooth</em> latent state representations of the input data. For standard autoencoders, we simply need to learn an encoding which allows us to reproduce the input. As you can see in the left-most figure, focusing only on reconstruction loss <em>does</em> allow us to separate out the classes (in this case, MNIST digits), which should allow our decoder model to reproduce the original handwritten digit, but there's an uneven distribution of data within the latent space. In other words, there are areas in latent space which don't represent <em>any</em> of our observed data.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-18-at-7.22.24-PM.png" alt="Screen-Shot-2018-03-18-at-7.22.24-PM"><br>
<small><a href="https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf">Image credit</a> (modified)</small></p>
<p>On the flip side, if we focus only on ensuring that the latent distribution is similar to the prior distribution (through our KL divergence loss term), we end up describing <em>every</em> observation using the same unit Gaussian, which we subsequently sample from to describe the latent dimensions visualized. This effectively treats every observation as having the same characteristics; in other words, we've failed to describe the original data.</p>
<p>However, when the two terms are optimized simultaneously, we're encouraged to describe the latent state for an observation with distributions close to the prior but deviating when necessary to describe salient features of the input.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/06/Screen-Shot-2018-06-20-at-2.51.06-PM.png" alt="Screen-Shot-2018-06-20-at-2.51.06-PM"></p>
<p>When I'm constructing a variational autoencoder, I like to inspect the latent dimensions for a few samples from the data to see the characteristics of the distribution. I encourage you to do the same.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/6-vae.png" alt="6-vae"></p>
<p>If we observe that the latent distributions appear to be very tight, we may decide to give higher weight to the KL divergence term with a parameter $\beta&gt;1$, encouraging the network to learn broader distributions. This simple insight has led to the growth of a new class of models - disentangled variational autoencoders. As it turns out, by placing a larger emphasis on the KL divergence term we're also implicitly enforcing that the learned latent dimensions are uncorrelated (through our simplifying assumption of a diagonal covariance matrix).</p>
<p>$$ {\cal L}\left( {x,\hat x} \right) + \beta \sum\limits_j {KL\left( {{q_j}\left( {z|x} \right)||N\left( {0,1} \right)} \right)} $$</p>
<h2 id="variationalautoencodersasagenerativemodel">Variational autoencoders as a generative model</h2>
<p>By sampling from the latent space, we can use the decoder network to form a generative model capable of creating new data similar to what was observed during training. Specifically, we'll sample from the prior distribution ${p\left( z \right)}$ which we assumed follows a unit Gaussian distribution.</p>
<p>The figure below visualizes the data generated by the decoder network of a variational autoencoder trained on the MNIST handwritten digits dataset. Here, we've sampled a grid of values from a two-dimensional Gaussian and displayed the output of our decoder network.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/vae-sample.png" alt="vae-sample"></p>
<p>As you can see, the distinct digits each exist in different regions of the latent space and smoothly transform from one digit to another. This smooth transformation can be quite useful when you'd like to interpolate between two observations, such as this recent example where <a href="https://magenta.tensorflow.org/music-vae">Google built a model for interpolating between two music samples</a>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/G5JT16flZwM" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
<h2 id="furtherreading">Further reading</h2>
<p>Lectures</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=uaaqyVS9-rM">Ali Ghodsi: Deep Learning, Variational Autoencoder (Oct 12 2017)</a></li>
<li><a href="https://www.youtube.com/watch?v=R3DNKE3zKFk">UC Berkeley Deep Learning Decall Fall 2017 Day 6: Autoencoders and Representation Learning</a></li>
<li><a href="https://youtu.be/5WoItGTWV54?t=26m32s">Stanford CS231n: Lecture on Variational Autoencoders</a></li>
</ul>
<p>Blogs/videos</p>
<ul>
<li><a href="https://danijar.com/building-variational-auto-encoders-in-tensorflow/">Building Variational Auto-Encoders in TensorFlow (with great code examples)</a></li>
<li><a href="https://blog.keras.io/building-autoencoders-in-keras.html">Building Autoencoders in Keras</a></li>
<li><a href="https://www.youtube.com/watch?v=9zKuYvjFFS8">Variational Autoencoders - Arxiv Insights</a></li>
<li><a href="https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf">Intuitively Understanding Variational Autoencoders</a></li>
<li><a href="http://ruishu.io/2018/03/14/vae/">Density Estimation: A Neurotically In-Depth Look At Variational Autoencoders</a></li>
<li><a href="http://pyro.ai/examples/vae.html">Pyro: Variational Autoencoders</a></li>
<li><a href="http://blog.fastforwardlabs.com/2016/08/22/under-the-hood-of-the-variational-autoencoder-in.html">Under the Hood of the Variational Autoencoder</a></li>
<li><a href="https://towardsdatascience.com/with-great-power-comes-poor-latent-codes-representation-learning-in-vaes-pt-2-57403690e92b">With Great Power Comes Poor Latent Codes: Representation Learning in VAEs</a></li>
<li><a href="https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained">Kullback-Leibler Divergence Explained</a></li>
<li><a href="https://avdnoord.github.io/homepage/vqvae/">Neural Discrete Representation Learning</a></li>
</ul>
<p>Papers/books</p>
<ul>
<li><a href="http://www.deeplearningbook.org/contents/generative_models.html">Deep learning book (Chapter 20.10.3): Variational Autoencoders</a></li>
<li><a href="https://arxiv.org/pdf/1601.00670.pdf">Variational Inference: A Review for Statisticians</a></li>
<li><a href="http://www.robots.ox.ac.uk/~sjrob/Pubs/vbTutorialFinal.pdf">A tutorial on variational Bayesian inference</a></li>
<li><a href="https://arxiv.org/abs/1312.6114">Auto-Encoding Variational Bayes</a></li>
<li><a href="https://arxiv.org/abs/1606.05908">Tutorial on Variational Autoencoders</a></li>
</ul>
<p>Papers on my reading list</p>
<ul>
<li><a href="https://arxiv.org/abs/1711.00937">Neural Discrete Representation Learning</a></li>
<li><a href="https://arxiv.org/abs/1606.05579">Early Visual Concept Learning with Unsupervised Deep Learning</a></li>
<li><a href="https://arxiv.org/abs/1804.04732">Multimodal Unsupervised Image-to-Image Translation</a>
<ul>
<li><a href="https://www.youtube.com/watch?v=ab64TWzWn40">Video from paper</a></li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Introduction to autoencoders.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of <strong>representation learning</strong>. Specifically, we'll design a neural network architecture such that we <em>impose a bottleneck in the network which forces a <strong>compressed</strong> knowledge representation of the original input</em>. If the input features were each</p>]]></description><link>https://www.jeremyjordan.me/autoencoders/</link><guid isPermaLink="false">5a9977dce1fb120022e8e646</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Mon, 19 Mar 2018 04:25:45 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Autoencoders are an unsupervised learning technique in which we leverage neural networks for the task of <strong>representation learning</strong>. Specifically, we'll design a neural network architecture such that we <em>impose a bottleneck in the network which forces a <strong>compressed</strong> knowledge representation of the original input</em>. If the input features were each independent of one another, this compression and subsequent reconstruction would be a very difficult task. However, if some sort of structure exists in the data (ie. correlations between input features), this structure can be learned and consequently leveraged when forcing the input through the network's bottleneck.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-06-at-3.17.13-PM.png" alt="Screen-Shot-2018-03-06-at-3.17.13-PM"></p>
<p>As visualized above, we can take an unlabeled dataset and frame it as a supervised learning problem tasked with outputting $\hat x$, a <strong>reconstruction of the original input</strong> $x$. This network can be trained by minimizing the <em>reconstruction error</em>, ${\cal L}\left( {x,\hat x} \right)$,  which measures the differences between our original input and the consequent reconstruction. The bottleneck is a key attribute of our network design; without the presence of an information bottleneck, our network could easily learn to simply memorize the input values by passing these values along through the network (visualized below).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-06-at-6.09.05-PM.png" alt="Screen-Shot-2018-03-06-at-6.09.05-PM"></p>
<p>A bottleneck constrains the amount of information that can traverse the full network, forcing a learned compression of the input data.</p>
<p><em>Note: In fact, if we were to construct a linear network (ie. without the use of nonlinear <a href="https://www.jeremyjordan.me/neural-networks-activation-functions/">activation functions</a> at each layer) we would observe a similar dimensionality reduction as observed in PCA. <a href="https://www.coursera.org/learn/neural-networks/lecture/JiT1i/from-pca-to-autoencoders-5-mins">See Geoffrey Hinton's discussion of this here.</a></em></p>
<p>The ideal autoencoder model balances the following:</p>
<ul>
<li>Sensitive enough to the inputs to accurately build a reconstruction.</li>
<li>Insensitive enough to the inputs that the model doesn't simply memorize or overfit the training data.</li>
</ul>
<p>This trade-off forces the model to maintain only the variations in the data required to reconstruct the input without holding on to redundancies within the input. For most cases, this involves constructing a loss function where one term encourages our model to be sensitive to the inputs (ie. reconstruction loss ${\cal L}\left( {x,\hat x} \right)$) and a second term discourages memorization/overfitting (ie. an added regularizer).</p>
<p>$$ {\cal L}\left( {x,\hat x} \right) + regularizer $$</p>
<p>We'll typically add a scaling parameter in front of the regularization term so that we can adjust the trade-off between the two objectives.</p>
<p>In this post, I'll discuss some of the standard autoencoder architectures for imposing these two constraints and tuning the trade-off; in a follow-up post I'll discuss <a href="https://www.jeremyjordan.me/variational-autoencoders/">variational autoencoders</a>, which build on the concepts discussed here to provide a more powerful model.</p>
<h4 id="undercompleteautoencoder">Undercomplete autoencoder</h4>
<p>The simplest architecture for constructing an autoencoder is to constrain the number of nodes present in the hidden layer(s) of the network, limiting the amount of information that can flow through the network. By penalizing the network according to the reconstruction error, our model can learn the most important attributes of the input data and how to best reconstruct the original input from an &quot;encoded&quot; state. Ideally, this encoding will <strong>learn and describe latent attributes of the input data</strong>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-07-at-8.24.37-AM.png" alt="Screen-Shot-2018-03-07-at-8.24.37-AM"></p>
<p>Because neural networks are capable of learning nonlinear relationships, this can be thought of as a more powerful (nonlinear) generalization of <a href="https://www.jeremyjordan.me/principal-components-analysis/">PCA</a>. Whereas PCA attempts to discover a lower dimensional hyperplane which describes the original data, autoencoders are capable of <a href="http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/">learning nonlinear manifolds</a> (a manifold is defined in <em>simple</em> terms as a continuous, non-intersecting surface). The difference between these two approaches is visualized below.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-07-at-8.52.21-AM.png" alt="Screen-Shot-2018-03-07-at-8.52.21-AM"></p>
<p>For higher dimensional data, autoencoders are capable of learning a complex representation of the data (manifold) which can be used to describe observations in a lower dimensionality and correspondingly decoded into the original input space.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/LinearNonLinear.png" alt="LinearNonLinear"><br>
<small><a href="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/unsupervised_learning.html">Image credit</a></small></p>
<p>An undercomplete autoencoder has no explicit regularization term - we simply train our model according to the reconstruction loss. Thus, our only way to ensure that the model isn't memorizing the input data is to ensure that we've sufficiently restricted the number of nodes in the hidden layer(s).</p>
<p>For deep autoencoders, we must also be aware of the <em>capacity</em> of our encoder and decoder models. Even if the &quot;bottleneck layer&quot; is only one hidden node, it's still possible for our model to memorize the training data provided that the encoder and decoder models have sufficient capability to learn some arbitrary function which can map the data to an index.</p>
<p>Given the fact that we'd like our model to discover latent attributes within our data, it's important to ensure that the autoencoder model is not simply learning an efficient way to memorize the training data. Similar to supervised learning problems, we can employ various forms of regularization to the network in order to encourage good generalization properties; these techniques are discussed below.</p>
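<p>As a minimal (and admittedly crude) illustration of the undercomplete idea, we can press scikit-learn's <code>MLPRegressor</code> into service as a linear autoencoder by using the inputs as their own targets and restricting the hidden layer to a small bottleneck - the names and hyperparameters below are illustrative choices, not a prescription:</p>

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(seed=0)
# Synthetic data with structure: 10 features driven by 2 latent factors.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(500, 10))

# A 2-node bottleneck forces a compressed representation; the model is
# trained purely on reconstruction error (inputs serve as targets).
# With identity activations this behaves like PCA, per the note above.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation='identity',
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)
X_hat = autoencoder.predict(X)
reconstruction_error = np.mean((X - X_hat) ** 2)
```
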
<h4 id="sparseautoencoders">Sparse autoencoders</h4>
<p>Sparse autoencoders offer us an alternative method for introducing an information bottleneck without <em>requiring</em> a reduction in the number of nodes at our hidden layers. Rather, we'll construct our loss function such that we penalize <em>activations</em> within a layer. For any given observation, we'll encourage our network to learn an encoding and decoding which only relies on activating a small number of neurons. It's worth noting that this is a different approach towards regularization, as we normally regularize the <em>weights</em> of a network, not the activations.</p>
<p>A generic sparse autoencoder is visualized below where the opacity of a node corresponds with the level of activation. It's important to note that the individual nodes of a trained model which activate are <em>data-dependent</em>; different inputs will result in activations of different nodes through the network.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-07-at-1.50.55-PM.png" alt="Screen-Shot-2018-03-07-at-1.50.55-PM"></p>
<p>One result of this fact is that <strong>we allow our network to sensitize individual hidden layer nodes toward specific attributes of the input data</strong>. Whereas an undercomplete autoencoder will use the entire network for every observation, a sparse autoencoder will be forced to selectively activate regions of the network depending on the input data. As a result, we've limited the network's capacity to memorize the input data without limiting the network's capability to extract features from the data. This allows us to consider the latent state representation and regularization of the network <em>separately</em>, such that we can choose a latent state representation (ie. encoding dimensionality) in accordance with what makes sense given the context of the data while imposing regularization by the sparsity constraint.</p>
<p>There are two main ways by which we can impose this sparsity constraint; both involve measuring the hidden layer activations for each training batch and adding some term to the loss function in order to penalize excessive activations. These terms are:</p>
<ul>
<li><strong>L1 Regularization</strong>: We can add a term to our loss function that penalizes the absolute value of the vector of activations $a$ in layer $h$ for observation $i$, scaled by a tuning parameter $\lambda$.</li>
</ul>
<p>$$ {\cal L}\left( {x,\hat x} \right) +  \lambda \sum\limits_i {\left| {a_i^{\left( h \right)}} \right|} $$</p>
<ul>
<li><strong>KL-Divergence</strong>: In essence, KL-divergence is a measure of the difference between two probability distributions. We can define a sparsity parameter $\rho$ which denotes the desired average activation of a neuron over a collection of samples. The observed average activation can be calculated as $ {{\hat \rho }_ j} = \frac{1}{m}\sum\limits_{i=1}^{m} {\left[ {a_j^{\left( h \right)}\left( x_i \right)} \right]} $ where the subscript $j$ denotes the specific neuron in layer $h$, averaging the activations over $m$ training observations denoted individually as $x_i$. In essence, by constraining the average activation of a neuron over a collection of samples we're encouraging neurons to only fire for a subset of the observations. We can describe $\rho$ as the parameter of a Bernoulli distribution such that we can leverage the KL divergence (expanded below) to compare the ideal distribution $\rho$ to the observed distributions over all hidden layer nodes ${{\hat \rho }_ j}$.</li>
</ul>
<p>$$ {\cal L}\left( {x,\hat x} \right) + \sum\limits_{j} {KL\left( {\rho ||{{\hat \rho }_ j}} \right)}  $$</p>
<p><em>Note: A <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a> is &quot;the probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q=1-p$&quot;. This corresponds quite well with establishing the probability a neuron will fire.</em></p>
<p>The KL divergence between two Bernoulli distributions can be written as $\sum\limits_{j = 1}^{{l^{\left( h \right)}}} {\rho \log \frac{\rho }{{{{\hat \rho }_ j}}}}  + \left( {1 - \rho } \right)\log \frac{{1 - \rho }}{{1 - {{\hat \rho }_ j}}}$. This loss term is visualized below for an ideal distribution of $\rho = 0.2$, corresponding with the minimum (zero) penalty at this point.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/KLPenaltyExample-1.png" alt="KLPenaltyExample-1"></p>
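<p>A small numpy sketch of this penalty (the example activations below are made up for illustration):</p>

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat):
    """KL divergence between a target Bernoulli(rho) and the observed
    average activations rho_hat of each hidden node, summed over nodes."""
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)  # guard against log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Average each node's activation over a batch, then penalize deviation
# from the target sparsity level rho.
activations = np.array([[0.9, 0.1], [0.7, 0.0], [0.8, 0.2]])  # (batch, nodes)
rho_hat = activations.mean(axis=0)
penalty = kl_sparsity_penalty(rho=0.2, rho_hat=rho_hat)
```

<p>The penalty is zero exactly when every node's average activation matches $\rho$, mirroring the minimum of the curve above.</p>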
<h4 id="denoisingautoencoders">Denoising autoencoders</h4>
<p>So far I've discussed the concept of training a neural network where the input and outputs are identical and our model is tasked with reproducing the input as closely as possible while passing through some sort of information bottleneck. Recall that I mentioned we'd like our autoencoder to be sensitive enough to recreate the original observation but insensitive enough to the training data such that the model learns a generalizable encoding and decoding. Another approach towards developing a generalizable model is to slightly corrupt the input data but still maintain the uncorrupted data as our target output.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-09-at-10.20.44-AM.png" alt="Screen-Shot-2018-03-09-at-10.20.44-AM"></p>
<p>With this approach, <strong>our model isn't able to simply develop a mapping which memorizes the training data because our input and target output are no longer the same</strong>. Rather, the model learns a vector field for mapping the input data towards a lower-dimensional manifold (recall from my earlier graphic that a manifold describes the high density region where the input data concentrates); if this manifold accurately describes the natural data, we've effectively &quot;canceled out&quot; the added noise.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-09-at-10.12.59-PM.png" alt="Screen-Shot-2018-03-09-at-10.12.59-PM"><br>
<small><a href="https://arxiv.org/abs/1211.4246">Image credit</a></small></p>
<p>The above figure visualizes the vector field described by comparing the reconstruction of $x$ with the original value of $x$. The yellow points represent training examples prior to the addition of noise. As you can see, the model has learned to adjust the corrupted input towards the learned manifold.</p>
<p>It's worth noting that this vector field is typically only well behaved in the regions that the model observed during training. In areas far away from the natural data distribution, the reconstruction error is both large and does not always point in the direction of the true distribution.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-10-at-10.17.44-AM.png" alt="Screen-Shot-2018-03-10-at-10.17.44-AM"><br>
<small><a href="https://arxiv.org/abs/1211.4246">Image credit</a></small></p>
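<p>The training setup is a one-line change from the ordinary autoencoder: corrupt the inputs but keep the clean data as the target. A sketch, again using scikit-learn's <code>MLPRegressor</code> as a stand-in network (the noise level and layer sizes are illustrative):</p>

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(seed=0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10))

# Corrupt the inputs, but keep the *clean* data as the target output.
noise_factor = 0.3
X_noisy = X + noise_factor * rng.normal(size=X.shape)

denoiser = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0)
denoiser.fit(X_noisy, X)  # learn to map corrupted inputs back to clean data
denoised = denoiser.predict(X_noisy)
```
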
<h4 id="contractiveautoencoders">Contractive autoencoders</h4>
<p>One would expect that <strong>for very similar inputs, the learned encoding would also be very similar</strong>. We can explicitly train our model in order for this to be the case by requiring that the <em>derivative of the hidden layer activations are small</em> with respect to the input. In other words, for small changes to the input, we should still maintain a very similar encoded state. This is quite similar to a denoising autoencoder in the sense that these small perturbations to the input are essentially considered noise and that we would like our model to be robust against noise. Put in <a href="https://arxiv.org/abs/1211.4246">other words</a> (emphasis mine), &quot;denoising autoencoders make the <em>reconstruction function</em> (ie. decoder) resist small but finite-sized perturbations of the input, while contractive autoencoders make the <em>feature extraction function</em> (ie. encoder) resist infinitesimal perturbations of the input.&quot;</p>
<p>Because we're explicitly encouraging our model to learn an encoding in which similar inputs have similar encodings, we're essentially forcing the model to learn how to <strong>contract</strong> a neighborhood of inputs into a smaller neighborhood of outputs. Notice how the slope (ie. derivative) of the reconstructed data is essentially zero for local neighborhoods of input data.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/Screen-Shot-2018-03-10-at-12.25.43-PM.png" alt="Screen-Shot-2018-03-10-at-12.25.43-PM"><br>
<small><a href="https://arxiv.org/abs/1211.4246">Image credit</a> (modified)</small></p>
<p>We can accomplish this by constructing a loss term which penalizes large <em>derivatives</em> of our <em>hidden layer activations</em> with respect to the input training examples, essentially penalizing instances where a small change in the input leads to a large change in the encoding space.</p>
<p>In fancier mathematical terms, we can craft our regularization loss term as the squared Frobenius norm ${\left\lVert A \right\rVert_F}$ of the Jacobian matrix ${\bf{J}}$ for the hidden layer activations with respect to the input observations. A Frobenius norm is essentially an L2 norm for a matrix and the Jacobian matrix simply represents all first-order partial derivatives of a vector-valued function (in this case, we have a vector of training examples).</p>
<p>For $m$ observations and $n$ hidden layer nodes, we can calculate these values as follows.</p>
<p>$${\left\lVert A \right\rVert_F}= \sqrt {\sum\limits_{i = 1}^m {\sum\limits_{j = 1}^n {{{\left| {{a_{ij}}} \right|}^2}} } } $$</p>
<p>$$ {\bf{J}} = \begin{bmatrix} \frac{\delta a_1^{\left( h \right)}\left( x \right)}{\delta x_1} & \cdots & \frac{\delta a_1^{\left( h \right)}\left( x \right)}{\delta x_m} \\ \vdots & \ddots & \vdots \\ \frac{\delta a_n^{\left( h \right)}\left( x \right)}{\delta x_1} & \cdots & \frac{\delta a_n^{\left( h \right)}\left( x \right)}{\delta x_m} \end{bmatrix} $$</p>
<p>Written more succinctly, we can define our complete loss function as</p>
<p>$$ {\cal L}\left( {x,\hat x} \right) + \lambda {\sum\limits_i {\left\lVert {{\nabla _ x}a_i^{\left( h \right)}\left( x \right)} \right\rVert} ^2} $$</p>
<p>where ${{\nabla_x}a_i^{\left( h \right)}\left( x \right)}$ defines the gradient field of our hidden layer activations with respect to the input $x$, summed over all $i$ training examples.</p>
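<p>For intuition, here's a numerical sketch of the penalty: approximate each column of the Jacobian by central finite differences and take the squared Frobenius norm. (In practice the Jacobian would come from your framework's automatic differentiation; the linear encoder below is just a test case.)</p>

```python
import numpy as np

def contractive_penalty(encoder, x, eps=1e-5):
    """Squared Frobenius norm of the Jacobian of `encoder` at `x`,
    approximated by central finite differences."""
    n = encoder(x).size
    jacobian = np.zeros((n, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        jacobian[:, i] = (encoder(x + step) - encoder(x - step)) / (2 * eps)
    return np.sum(jacobian ** 2)

# For a linear encoder z = Wx, the Jacobian is W itself, so the penalty
# equals the squared Frobenius norm of W: 1 + 4 + 9 + 16 = 30.
W = np.array([[1.0, 2.0], [3.0, 4.0]])
penalty = contractive_penalty(lambda v: W @ v, x=np.array([0.5, -0.5]))
```
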
<h2 id="summary">Summary</h2>
<p>An autoencoder is a neural network architecture capable of discovering structure within data in order to develop a compressed representation of the input. Many different variants of the general autoencoder architecture exist with the goal of ensuring that the compressed representation represents <em>meaningful</em> attributes of the original data input; typically the biggest challenge when working with autoencoders is getting your model to actually learn a meaningful and generalizable latent space representation.</p>
<p>Because autoencoders <em>learn</em> how to compress the data based on attributes (ie. correlations between the input feature vector) <em>discovered from data during training</em>, these models are typically only capable of reconstructing data similar to the classes of observations the model encountered during training.</p>
<p>Applications of autoencoders include:</p>
<ul>
<li>Anomaly detection</li>
<li>Data denoising (ex. images, audio)</li>
<li>Image inpainting</li>
<li>Information retrieval</li>
</ul>
<h2 id="furtherreading">Further reading</h2>
<p>Lectures/notes</p>
<ul>
<li><a href="http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/">Unsupervised feature learning - Stanford</a></li>
<li><a href="https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf">Sparse autoencoder - Andrew Ng CS294A Lecture notes</a></li>
<li><a href="https://www.youtube.com/watch?v=R3DNKE3zKFk">UC Berkeley Deep Learning Decall Fall 2017 Day 6: Autoencoders and Representation Learning</a></li>
</ul>
<p>Blogs/videos</p>
<ul>
<li><a href="https://blog.keras.io/building-autoencoders-in-keras.html">Building Autoencoders in Keras</a></li>
<li><a href="http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/">Neural Networks, Manifolds, and Topology - Chris Olah</a></li>
</ul>
<p>Papers/books</p>
<ul>
<li><a href="http://www.deeplearningbook.org/contents/autoencoders.html">Deep learning book (Chapter 14): Autoencoders</a></li>
<li><a href="https://arxiv.org/abs/1211.4246">What Regularized Auto-Encoders Learn from the Data Generating Distribution</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Setting the learning rate of your neural network.]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In previous posts, I've discussed how we can train neural networks using <a href="https://www.jeremyjordan.me/neural-networks-training/">backpropagation</a> with <a href="https://www.jeremyjordan.me/gradient-descent/">gradient descent</a>. One of the key hyperparameters to set in order to train a neural network is the <em>learning rate</em> for gradient descent. As a reminder, this parameter scales the magnitude of our weight updates in</p>]]></description><link>https://www.jeremyjordan.me/nn-learning-rate/</link><guid isPermaLink="false">5a8b946a0684220022f5784e</guid><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Jeremy Jordan]]></dc:creator><pubDate>Fri, 02 Mar 2018 04:39:10 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>In previous posts, I've discussed how we can train neural networks using <a href="https://www.jeremyjordan.me/neural-networks-training/">backpropagation</a> with <a href="https://www.jeremyjordan.me/gradient-descent/">gradient descent</a>. One of the key hyperparameters to set in order to train a neural network is the <em>learning rate</em> for gradient descent. As a reminder, this parameter scales the magnitude of our weight updates in order to minimize the network's loss function.</p>
<p>If your learning rate is set too low, training will progress very slowly as you are making very tiny updates to the weights in your network. However, if your learning rate is set too high, it can cause undesirable divergent behavior in your loss function. I'll visualize these cases below - if you find these visuals hard to interpret, I'd recommend reading (at least) the first section in my post on <a href="https://www.jeremyjordan.me/gradient-descent/">gradient descent</a>.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png" alt="Goldilocks of learning rates"></p>
<p>So how do we find the optimal learning rate?</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">3e-4 is the best learning rate for Adam, hands down.</p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/801621764144971776?ref_src=twsrc%5Etfw">November 24, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Perfect! I guess my job here is done.</p>
<p>Well... not quite.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">(i just wanted to make sure that people understand that this is a joke...)</p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/801694597009113088?ref_src=twsrc%5Etfw">November 24, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>(Humor yourself by reading through that thread after finishing this post.)</p>
<p>The loss landscape of a neural network (visualized below) is a function of the network's parameter values quantifying the &quot;error&quot; associated with using a specific configuration of parameter values when performing inference (prediction) on a given dataset. This loss landscape can look quite different, even for very similar network architectures. The images below are from a paper, <a href="https://arxiv.org/abs/1712.09913">Visualizing the Loss Landscape of Neural Nets</a>, which shows how residual connections in a network can yield a smoother loss topology.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-26-at-10.50.53-PM.png" alt="Screen-Shot-2018-02-26-at-10.50.53-PM"><br>
<small><a href="https://www.cs.umd.edu/~tomg/projects/landscapes/">Image credit</a></small></p>
<p>The optimal learning rate will be dependent on the topology of your loss landscape, which is in turn dependent on both your model architecture and your dataset. While using a default learning rate (ie. the defaults set by your deep learning library) may provide decent results, you can often improve the performance or speed up training by searching for an optimal learning rate. I hope you'll see in the next section that this is quite an easy task.</p>
<h2 id="asystematicapproachtowardsfindingtheoptimallearningrate">A systematic approach towards finding the optimal learning rate</h2>
<p>Ultimately, we'd like a learning rate which results in a <em>steep decrease</em> in the network's loss. We can observe this by performing a simple experiment where we gradually increase the learning rate after each mini-batch, recording the loss at each increment. This gradual increase can be on either a linear or exponential scale.</p>
<p>For learning rates which are too low, the loss may decrease, but at a very shallow rate. When entering the optimal learning rate zone, you'll observe a quick drop in the loss function. Increasing the learning rate further will cause an increase in the loss as the parameter updates cause the loss to &quot;bounce around&quot; and even diverge from the minima. Remember, the best learning rate is associated with the <em>steepest</em> drop in loss, so we're mainly interested in analyzing the slope of the plot.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/lr_finder.png" alt="lr_finder"></p>
<p>You should set the range of your learning rate bounds for this experiment such that you observe <strong>all three</strong> phases, making the optimal range trivial to identify.</p>
<p>This technique was proposed by Leslie Smith in <a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks</a> and evangelized by Jeremy Howard in <a href="http://course.fast.ai/index.html">fast.ai's course</a>.</p>
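<p>All three phases show up even on a toy problem. The sketch below runs the range test with plain gradient descent on a one-dimensional quadratic loss - purely for illustration, not the setup from the paper:</p>

```python
import numpy as np

def lr_range_test(lr_min=1e-4, lr_max=10.0, num_steps=50):
    """Take one gradient step at each (exponentially increasing) learning
    rate on the loss L(w) = w^2, recording the resulting loss."""
    losses = []
    for lr in np.geomspace(lr_min, lr_max, num_steps):
        w = 5.0            # reset the parameter for each trial
        w -= lr * 2 * w    # gradient of w^2 is 2w
        losses.append(w ** 2)
    return losses

losses = lr_range_test()
# Tiny lr: barely any progress. Moderate lr: steep drop in loss.
# lr > 1: the update overshoots the minimum and the loss explodes.
```
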
<h2 id="settingascheduletoadjustyourlearningrateduringtraining">Setting a schedule to adjust your learning rate during training</h2>
<p>Another commonly employed technique, known as <strong>learning rate annealing</strong>, recommends starting with a relatively high learning rate and then gradually lowering the learning rate during training. The intuition behind this approach is that we'd like to traverse quickly from the initial parameters to a range of &quot;good&quot; parameter values but then we'd like a learning rate small enough that we can explore the &quot;deeper, but narrower parts of the loss function&quot; (from <a href="http://cs231n.github.io/neural-networks-3/#annealing-the-learning-rate">Karpathy's CS231n notes</a>). If you're having a hard time picturing what I just mentioned, recall that too high of a learning rate can cause the parameter update to &quot;jump over&quot; the ideal minima and subsequent updates will either result in a continued noisy convergence in the general region of the minima, or in more extreme cases may result in divergence from the minima.</p>
<p>The most popular form of learning rate annealing is a <em>step decay</em> where the learning rate is reduced by some percentage after a set number of training epochs.</p>
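A minimal sketch of step decay as a function of the epoch number (the drop factor and interval here are arbitrary examples, not recommendations):

```python
import math

def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every 10 epochs; these parameter values
    # are illustrative defaults, not from the post.
    return initial_lr * math.pow(drop, math.floor(epoch / epochs_per_drop))
```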
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/stepdecay.png" alt="stepdecay"></p>
<p>More generally, it is useful to define a <strong>learning rate schedule</strong> in which the learning rate is updated during training according to some specified rule.</p>
<h4 id="cyclicallearningrates">Cyclical learning rates</h4>
<p>In the previously mentioned paper, <a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks</a>, Leslie Smith proposes a cyclical learning rate schedule which varies between two bound values. The main learning rate schedule (visualized below) is a triangular update rule, but he also mentions the use of a triangular update in conjunction with a fixed cyclic decay or an exponential cyclic decay.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-25-at-8.44.49-PM.png" alt="Screen-Shot-2018-02-25-at-8.44.49-PM"><br>
<small><a href="https://github.com/bckenstler/CLR">Image credit</a></small></p>
<p><em>Note: At the end of this post, I'll provide the code to implement this learning rate schedule. Thus, if you don't care to understand the mathematical formulation you can skim past this section.</em></p>
<p>We can write the general schedule as</p>
<p>$$ {\eta_t} = {\eta_{\min }} + \left( {{\eta_{\max }} - {\eta_{\min }}} \right)\left( {\max \left( {0,1 - x} \right)} \right) $$</p>
<p>where $x$ is defined as</p>
<p>$$ x = \left| {\frac{{iterations}}{stepsize} - 2\left( {cycle} \right) + 1} \right| $$</p>
<p>and $cycle$ can be calculated as</p>
<p>$$ cycle = floor\left( 1 + {\frac{{iterations}}{{2\left( {stepsize} \right)}}} \right) $$</p>
<p>where $\eta_{\min }$ and $\eta_{\max }$ define the bounds of our learning rate, $iterations$ represents the number of completed mini-batches, and $stepsize$ defines half of a cycle length. As far as I can gather, $1-x$ should always be nonnegative, so it seems the $\max$ operation is not strictly necessary.</p>
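Putting the three equations together, a direct translation into Python might look like this (the function and argument names are my own):

```python
import math

def triangular_lr(iterations, stepsize, min_lr, max_lr):
    # Direct translation of the cycle, x, and eta_t equations above.
    cycle = math.floor(1 + iterations / (2 * stepsize))
    x = abs(iterations / stepsize - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1 - x)
```

For example, with a step size of 100 iterations, the schedule starts at the minimum learning rate at iteration 0, reaches the maximum at iteration 100, and returns to the minimum at iteration 200.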
<p>In order to grok how this equation works, let's progressively build it with visualizations. For the visuals below, the triangular update for 3 full cycles is shown with a step size of 100 iterations. Remember, one iteration corresponds with one mini-batch of training.</p>
<p>First, we can establish our &quot;progress&quot; during training in terms of half-cycles that we've completed. We measure our progress in terms of half-cycles rather than full cycles so that we can achieve symmetry within a cycle (this should become more clear as you continue reading).</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/grok_step1.png" alt="grok_step1"></p>
<p>Next, we compare our half-cycle progress to the number of half-cycles which will be completed at the end of the current cycle. At the beginning of a cycle, we have two half-cycles yet to be completed. At the end of a cycle, this value reaches zero.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/grok_step2.png" alt="grok_step2"></p>
<p>Next, we'll add 1 to this value in order to shift the function to be centered on the y-axis. Now we're showing our progress within a cycle with reference to the half-cycle point.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/grok_step3.png" alt="grok_step3"></p>
<p>At this point, we can take the absolute value to achieve a triangular shape within each cycle. This is the value that we assign to $x$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/grok_step4.png" alt="grok_step4"></p>
<p>However, we'd like our learning rate schedule to start at the minimum value, increase to the maximum value at the middle of a cycle, and then decrease back to the minimum value. We can accomplish this by simply calculating $1-x$.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/03/grok_step5.png" alt="grok_step5"></p>
<p>We now have a value which we can use to modulate the learning rate by adding some fraction of the learning rate range to the minimum learning rate (also referred to as the base learning rate).</p>
<p>Smith writes, the main assumption behind the rationale for a cyclical learning rate (as opposed to one which <em>only</em> decreases) is &quot;that <strong>increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect</strong>.&quot; Indeed, his paper includes several examples of a loss function evolution which temporarily deviates to higher losses while ultimately converging to a lower loss when compared with a benchmark fixed learning rate.</p>
<p>To gain intuition on why this short-term effect would yield a long-term positive effect, it's important to understand the desirable characteristics of our converged minimum. Ultimately, we'd like our network to learn from the data in a manner which <em>generalizes</em> to unseen data. Further, a network with good generalization properties should be robust in the sense that small changes to the network's parameters don't cause drastic changes to performance. With this in mind, it makes sense that <a href="https://arxiv.org/abs/1609.04836">sharp minima lead to poor generalization</a> as small changes to the parameter values may result in a drastically higher loss. By allowing for our learning rate to <em>increase</em> at times, we can &quot;jump out&quot; of sharp minima which would temporarily increase our loss but may ultimately lead to convergence on a more desirable minima.</p>
<p><em>Note: Although the &quot;flat minima for good generalization&quot; is widely accepted, you can read a good counter-argument <a href="https://arxiv.org/abs/1703.04933">here</a>.</em></p>
<p>Additionally, <strong>increasing the learning rate can also allow for &quot;more rapid traversal of saddle point plateaus.&quot;</strong> As you can see in the image below, the gradients can be very small at a saddle point. Because the parameter updates are a function of the gradient, this results in our optimization taking very small steps; it can be useful to increase the learning rate here to avoid getting stuck for too long at a saddle point.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/Saddle_point.svg.png" alt="Saddle point"><br>
<small><a href="https://en.wikipedia.org/wiki/Saddle_point#/media/File:Saddle_Point_between_maxima.svg">Image credit</a> (with modification)</small></p>
<p><em>Note: A saddle point, by definition, is a critical point in which some dimensions observe a local minimum while other dimensions observe a local maximum. Because neural networks can have thousands or even millions of parameters, it's unlikely that we'll observe a true local minimum across all of these dimensions; saddle points are much more likely to occur. When I referred to &quot;sharp minima&quot;, realistically we should picture a saddle point where the minimum dimensions are very steep while the maximum dimensions are very wide (as shown below).</em></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/sharp_minima.png" alt="sharp_minima"></p>
<h4 id="stochasticgradientdescentwithwarmrestartssgdr">Stochastic Gradient Descent with Warm Restarts (SGDR)</h4>
<p>A similar cyclic approach is known as <strong>stochastic gradient descent with warm restarts</strong>, where an aggressive annealing schedule is combined with periodic &quot;restarts&quot; to the original starting learning rate.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/cosine_annealing.png" alt="cosine_annealing"></p>
<p>We can write this schedule as</p>
<p>$$ {\eta_t} = \eta_{\min }^i + \frac{1}{2}\left( {\eta_{\max }^i - \eta_{\min }^i} \right)\left( {1 + \cos \left( {\frac{{{T_{current}}}}{{{T_i}}}} \pi \right) } \right) $$</p>
<p>where ${\eta_t}$ is the learning rate at timestep $t$ (incremented each mini-batch), ${\eta_{\max}^i}$ and ${\eta_{\min }^i}$ define the range of desired learning rates, $T_{current}$ represents the number of epochs since the last restart (this value is calculated at every iteration and thus can take on fractional values), and $T_{i}$ defines the number of epochs in a cycle. Let's try to break this equation down.</p>
<p>This annealing schedule relies on the cosine function, which varies between -1 and 1. ${\frac{T_{current}}{T_i}}$ takes on values between 0 and 1 which, multiplied by $\pi$, forms the input to our cosine function. The corresponding region of the cosine function is highlighted below in green. By adding 1, our function varies between 0 and 2, which is then scaled by $\frac{1}{2}$ to now vary between 0 and 1. Thus, we're simply taking the minimum learning rate and adding some fraction of the specified learning rate range (${\eta_{\max }^i - \eta_{\min }^i}$). Because this function starts at 1 and decreases to 0, the result is a learning rate which starts at the maximum of the specified range and decays to the minimum value. Once we reach the end of a cycle, $T_{current}$ resets to 0 and we start back at the maximum learning rate.</p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/cosine.png" alt="cosine"></p>
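The annealing within a single cycle can be sketched directly from the equation above (the names are mine, and the restart bookkeeping for tracking $T_{current}$ across cycles is omitted):

```python
import math

def sgdr_lr(t_current, t_i, min_lr, max_lr):
    # Cosine annealing within one cycle; t_current is the number of epochs
    # since the last restart (may be fractional), t_i is the cycle length
    # in epochs. Starts at max_lr and decays to min_lr over the cycle.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t_current / t_i))
```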
<p>The authors note that this learning rate schedule can further be adapted to:</p>
<ol>
<li>Lengthen the cycle as training progresses.</li>
<li>Decay ${\eta_{\max }^i}$ and ${\eta_{\min }^i}$ after each cycle.</li>
</ol>
<p>By drastically increasing the learning rate at each restart, we can essentially exit a local minima and continue exploring the loss landscape.</p>
<p><em>Neat idea: By snapshotting the weights at the end of each cycle, researchers were able to <a href="https://arxiv.org/abs/1704.00109">build an ensemble of models at the cost of training a single model</a>. This is because the network &quot;settles&quot; on various local optima from cycle to cycle, as shown in the figure below.</em></p>
<p><img src="https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-26-at-9.12.57-PM.png" alt="Snapshot Ensembles: Train 1, get M for free"></p>
<h2 id="implementation">Implementation</h2>
<p>Both finding the optimal range of learning rates and assigning a learning rate schedule can be implemented quite trivially using Keras <a href="https://keras.io/callbacks/">Callbacks</a>.</p>
<h4 id="findingtheoptimallearningraterange">Finding the optimal learning rate range</h4>
<p>We can write a Keras Callback which tracks the loss associated with a learning rate varied linearly over a defined range.</p>
<div class="tex2jax_ignore">
<script src="https://gist.github.com/jeremyjordan/ac0229abd4b2b7000aca1643e88e0f02.js"></script>
</div>
<h4 id="settingalearningrateschedule">Setting a learning rate schedule</h4>
<p><strong>Step Decay</strong><br>
For a simple step decay, we can use the <code>LearningRateScheduler</code> Callback.</p>
<div class="tex2jax_ignore">
<script src="https://gist.github.com/jeremyjordan/86398d7c05c02396c24661baa4c88165.js"></script>
</div>
<p><strong>Cyclical Learning Rate</strong></p>
<p>To apply the cyclical learning rate technique, we can reference <a href="https://github.com/bckenstler/CLR">this repo</a> which has already implemented the technique in the paper. In fact, this repo is cited in the paper's appendix.</p>
<p><strong>Stochastic Gradient Descent with Restarts</strong></p>
<div class="tex2jax_ignore">
<script src="https://gist.github.com/jeremyjordan/5a222e04bb78c242f5763ad40626c452.js"></script>
</div>
<p><strong>Snapshot Ensembles</strong><br>
To apply the &quot;Train 1, get M for free&quot; technique, you can reference <a href="https://github.com/titu1994/Snapshot-Ensembles">this repo</a>.</p>
<h2 id="furtherreading">Further reading</h2>
<ul>
<li><a href="http://cs231n.github.io/neural-networks-3/#annealing-the-learning-rate">Stanford CS231n: Annealing the learning rate</a></li>
<li><a href="https://arxiv.org/abs/1506.01186">Cyclical Learning Rates for Training Neural Networks</a></li>
<li><a href="https://arxiv.org/abs/1702.04283">Exploring loss function topology with cyclical learning rates</a></li>
<li><a href="https://arxiv.org/abs/1608.03983">SGDR: Stochastic Gradient Descent with Warm Restarts</a></li>
<li><a href="https://arxiv.org/abs/1704.00109">Snapshot Ensembles: Train 1, get M for free</a>
<ul>
<li>(extension of concept) <a href="https://arxiv.org/abs/1802.10026">Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs</a></li>
</ul>
</li>
<li><a href="https://arxiv.org/abs/1712.09913">Visualizing the Loss Landscape of Neural Nets</a></li>
<li><a href="http://ruder.io/deep-learning-optimization-2017/">Optimization for Deep Learning Highlights in 2017: Sebastian Ruder</a></li>
<li><a href="https://arxiv.org/abs/1609.04836">On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima</a></li>
<li><a href="https://arxiv.org/abs/1703.04933">Sharp Minima Can Generalize For Deep Nets</a></li>
<li><a href="https://medium.com/intuitionmachine/the-peculiar-behavior-of-deep-learning-loss-surfaces-330cb741ec17">The Two Phases of Gradient Descent in Deep Learning</a></li>
<li><a href="https://towardsdatascience.com/recent-advances-for-a-better-understanding-of-deep-learning-part-i-5ce34d1cc914">Recent Advances for a Better Understanding of Deep Learning − Part I</a></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>