Decision trees for regression

Before you read this post, go ahead and check out my post on decision trees for classification. This post will be building on top of that, as you'll see that decision tree regressors are very similar to decision tree classifiers.

Regression models, in the general sense, take input variables and predict an output from a continuous range. A decision tree regressor, however, cannot produce truly continuous output: it can only return a finite set of values, one per leaf node. The model is trained on a set of examples whose outputs lie in a continuous range. The tree partitions these training examples, and a new example that ends up in a given leaf node takes on the mean of the training outputs that reside in that same node.
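To make the "mean of the leaf" idea concrete, here is a minimal sketch in plain Python. The toy data and the single split at 4.5 are made up for illustration (they are not from the diabetes example later in this post), and a real tree would learn its thresholds rather than have them hard-coded.

```python
from statistics import mean

# Toy training set: (input, output) pairs with continuous outputs.
training = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (6.0, 30.0), (7.0, 34.0)]

# A hypothetical one-split tree: x <= 4.5 goes left, everything else goes right.
THRESHOLD = 4.5
left_leaf = [y for x, y in training if x <= THRESHOLD]
right_leaf = [y for x, y in training if x > THRESHOLD]


def predict(x):
    """Return the mean training output of the leaf that x falls into."""
    leaf = left_leaf if x <= THRESHOLD else right_leaf
    return mean(leaf)


# However many inputs we try, only two distinct predictions are possible.
predictions = {predict(x) for x in [0.0, 1.5, 4.0, 5.0, 100.0]}
print(sorted(predictions))  # [11.0, 32.0]
```

With two leaves, the model can only ever output two values, which is why the prediction is a step function rather than a smooth curve.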

A fitted decision tree regressor looks something like this.

Alright, but how do we get there? As I mentioned before, the general process is very similar to a decision tree classifier with a few small changes.

Determining the optimal split

We'll still build our tree recursively, making splits on the data as we go, but we need a new method for determining the optimal split. For classification, we used information entropy (you can also use the Gini index or Chi-square method) to figure out which split provided the biggest gain in information about the new example's class. For regression, we're not trying to predict a class; rather, we're expected to generate a numeric output given the input features. Thus, we'll need a new criterion for determining the optimal split.

One way to do this is to measure whether a split reduces the variance of the target values. If a split is useful, the weighted average of the child nodes' variances (weighted by the number of examples in each child) will be lower than the variance of the parent node. We can continue to make recursive splits on our dataset until we've reduced the overall variance below a certain threshold, or until we hit another stopping criterion (such as a defined maximum depth). Notice how the mean squared error decreases as you step through the decision tree. For a node that predicts the mean of its training outputs, the mean squared error is exactly the variance of those outputs, so the two measures tell the same story.
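The variance-reduction criterion can be sketched for a single feature as follows. This is an illustrative implementation, not how scikit-learn does it internally. The function names (weighted_child_variance, best_split) and the toy data are my own, and candidate thresholds are taken as midpoints between consecutive sorted values, which is one common convention.

```python
from statistics import pvariance


def weighted_child_variance(left, right):
    """Average the two child variances, weighted by the number of examples."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def best_split(xs, ys):
    """Return (threshold, variance_reduction) for the best split, or None."""
    parent_variance = pvariance(ys)
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical feature values
        threshold = (lo + hi) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        reduction = parent_variance - weighted_child_variance(left, right)
        if best is None or reduction > best[1]:
            best = (threshold, reduction)
    return best


xs = [1.0, 2.0, 3.0, 6.0, 7.0]
ys = [10.0, 12.0, 11.0, 30.0, 34.0]
threshold, reduction = best_split(xs, ys)
print(threshold, round(reduction, 2))
```

A full tree builder would run this search over every feature, keep the split with the largest reduction, and then recurse into each child until a stopping criterion is met.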

Avoiding overfitting

The techniques for preventing overfitting remain largely the same as for decision tree classifiers. However, it seems that not many people actually take the time to prune a decision tree for regression. Instead, they elect to use a random forest regressor (a collection of decision trees), which is less prone to overfitting and typically performs better than a single optimized tree. The common argument for using a decision tree over a random forest is that decision trees are easier to interpret: you simply look at the decision tree logic. In a random forest, by contrast, you're not going to want to study the decision tree logic of 500 different trees. Luckily for us, there are still ways to maintain interpretability within a random forest without studying each tree manually.
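One commonly used way to inspect a forest without reading every tree is to look at its aggregated feature importances. The sketch below assumes scikit-learn's RandomForestRegressor and its feature_importances_ attribute. It uses 100 trees rather than 500 to keep it quick, and the random_state is fixed only for reproducibility. This is just one option, not necessarily the approach you'd prefer.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()

# Fit a forest of 100 trees on the same diabetes dataset used in this post.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Importances are averaged over all trees and normalized to sum to 1.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

This gives a single ranking of the features for the whole forest, at the cost of no longer seeing the individual split thresholds.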

Implementation

Here's the code I used to generate the graphic above.

from sklearn import datasets
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

import pandas as pd

# Put the ten input features in a DataFrame with placeholder column names
features = pd.DataFrame(diabetes.data, columns=[f"feature_{i}" for i in range(1, 11)])
# The target is a continuous measure of disease progression
target = pd.DataFrame(diabetes.target)

from sklearn.tree import DecisionTreeRegressor
# Limit the tree to a maximum depth of 4
regressor = DecisionTreeRegressor(max_depth=4)
regressor.fit(features, target)

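from io import StringIO  # sklearn.externals.six is gone from recent scikit-learn releases
from IPython.display import Image
import pydot
from sklearn import tree

# Write the fitted tree in Graphviz DOT format to an in-memory buffer.
# class_names is omitted because it only applies to classifiers.
dot_data = StringIO()
tree.export_graphviz(regressor, out_file=dot_data,
    feature_names=list(features.columns),
    filled=True, rounded=True,
    special_characters=True)
# graph_from_dot_data returns a list of graphs; render the first one as a PNG
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())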
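If Graphviz or pydot isn't installed, a plain-text view of the same kind of tree is a workable fallback. This sketch assumes scikit-learn's export_text helper. It rebuilds a small regressor so that it runs on its own, and it uses a depth of 2 (rather than the 4 used above) only to keep the printout short.

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

diabetes = load_diabetes()

# A shallow tree, refit here so this snippet is self-contained.
regressor = DecisionTreeRegressor(max_depth=2, random_state=0)
regressor.fit(diabetes.data, diabetes.target)

# Each indented line is a split; the innermost lines show the leaf values.
text = export_text(regressor, feature_names=list(diabetes.feature_names))
print(text)
```

The leaf values in the printout are the means of the training targets in each leaf, which are the only values the model can predict.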