title | duration | creator |
---|---|---|
Introduction to Classification | 9000 | |

DS | Lesson 8
After this lesson, you will be able to:
- Define class label and classification
- Build a K-Nearest Neighbors classifier using the scikit-learn library
- Evaluate and tune a model using metrics such as classification accuracy/error
Before this lesson, you should already be able to:
- Understand how to optimize for error in a model
- Understand the concept of iteration to solve problems
- Measure basic probability
TIMING | TYPE | TOPIC |
---|---|---|
5 min | Opening | Discuss lesson objectives |
10-15 mins | Introduction | What is Classification? |
20 mins | Guided Practice | Regression vs Classification |
20 mins | Independent Practice | Build a Classifier |
10-15 mins | Introduction | Introduction to K Nearest Neighbors |
10-15 mins | Demo | Demo of KNN |
10-15 mins | Introduction | First Classification Metrics |
30-35 mins | Independent Practice | Solving for K |
5-10 mins | Conclusion | Review topics |
In class so far, we've primarily worked with regression problems: machine learning approaches to predicting values from a continuous range. Since regression outputs are continuous and ordered (for example, 1 is greater than 0, and 100 is greater than 1), we've been able to use distance to measure how accurate our predictions are.
When predicting something like the cost of a house or the number of clicks on an ad, we can order and compare our predicted values. But other problems, like whether a loan is going to default or not, don't make sense within those parameters. A loan is either going to default, or it's not. We call these binary problems.
Check: What if we want to build a model to predict set values, like a social media status or a photo's color? How about the gender of a baby? Do we treat these as regression problems or as binary values? Do the same principles apply?
Classification is a machine learning problem in which we predict a discrete (set) value for an observation, given the knowledge we have about that observation.
Many classification problems boil down to a binary problem. For example, with patient data, one could be working on a treatment problem for smokers... but first we need to know whether each patient's medical history suggests, or is predictive of, that patient being a smoker.
Many problems don't appear to be binary at first glance, but with a limited set of variables, you can usually boil your model down to a boolean value. For instance, what if you are predicting whether an image pixel will be red or blue? In analyzing the image, we could check whether a pixel "is red" or "is not red."
Binary classification is the simplest form of classification, though classification problems can certainly be wrapped around multiple class labels.
A class label is a representation of what we are trying to predict: our target. The examples of class labels from above would be:
Data Problem | Class Labels |
---|---|
Patient data problem | is smoker, is not smoker |
pixel color | red, blue (etc) |
The easiest way to tell whether our y, the dependent variable, represents a classification problem or not, is to check whether its values can be mathematically ordered.
For example, if predicting revenue, $100MM is greater than $90MM (and revenue could even be negative!), so revenue prediction sounds like a regression problem. Red is not inherently greater than or less than blue, so predicting a pixel's color is a classification problem, with "red" and "blue" as the class labels.
Review the following situations and decide if each one is a regression problem, classification problem, or neither:
- Using the total number of explosions in a movie, predict if the movie is by JJ Abrams or Michael Bay.
- Determine how many tickets will be sold to a concert given who is performing, where, and the date and time.
- Given the temperature over the last year by day, predict tomorrow's temperature outside.
- Using data from four cell phone microphones, reduce the noisy sounds so the voice is crystal clear to the receiving phone.
- With customer data, determine if a user will return or not in the next 7 days to an e-commerce website.
Note: The primary difference between regression and classification is the result; the input data should look much like what we've used in the past. In fact, writing a classifier can look a lot like control flow, a common pattern in coding.
With our new knowledge about class labels and classification, it should be relatively straightforward to write a computer program that returns class labels based on some prior knowledge.
Our goal below is to (re)explore the iris dataset, which has 50 samples of each of 3 different class labels, and see if we can write a program that classifies the data. We can do this very easily with Python if-else statements and some pandas functions.
Afterward, measure the accuracy of your classifier using the math of "total correct" over "total samples."
Your classifier should be able to:
- Get one class label 100% correct: one of the irises is very easy to distinguish from the other 2.
- Accurately predict the majority of the other two, with some error.
- Note: the samples for the remaining class labels are a little intertwined, so you may need to generalize.
Here's some starter code to get you going:
```python
from sklearn import datasets, neighbors, metrics
import pandas as pd

iris = datasets.load_iris()

irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

# Map each class label (0, 1, 2) to a plotting color
cmap = {'0': 'r', '1': 'g', '2': 'b'}
irisdf['ctarget'] = irisdf.target.apply(lambda x: cmap[str(x)])

# Scatter plot of petal length vs. petal width, colored by class label
irisdf.plot('petal length (cm)', 'petal width (cm)', kind='scatter', c=irisdf.ctarget)
print(irisdf.describe())

# starter code
def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    else:
        return 1

predictions = irisdf.apply(my_classifier, axis=1)
```
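As a quick check (a minimal sketch, assuming the `predictions` and `irisdf` variables from the starter code above), accuracy is just "total correct" over "total samples":

```python
# Accuracy: "total correct" over "total samples"
correct = (predictions == irisdf['target']).sum()
print(correct / len(irisdf))
```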
- How simple could the if-else classifier be and still stay relatively accurate?
- How complicated would the if-else classifier have to be to be completely accurate? How many if-else statements, or nested if-else statements, would you need to get the classifier to 100% accuracy? (The example above uses two.)
- Recall: Which if-else classifier would work better against iris data it hasn't seen? Why is that the case?
K Nearest Neighbors (KNN) is a fairly straightforward algorithm used for classification (a minimal from-scratch sketch follows these steps):
- For a given point, calculate the distance to all other points.
- Given those distances, pick the k closest points.
- Calculate the probability of each class label given those points.
- The original point is classified as the class label with the largest probability ("votes").
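To make the voting procedure concrete, here is a minimal from-scratch sketch of those four steps (the function name `knn_predict`, the Euclidean distance choice, and the example sample index are our own, not part of any library):

```python
import numpy as np
from collections import Counter
from sklearn import datasets

def knn_predict(new_point, X, y, k=5):
    # 1. Calculate the distance from the new point to all known points
    distances = np.sqrt(((X - new_point) ** 2).sum(axis=1))
    # 2. Pick the k closest points
    nearest = np.argsort(distances)[:k]
    # 3. Count the class labels among those neighbors...
    votes = Counter(y[nearest])
    # 4. ...and classify as the label with the most "votes"
    return votes.most_common(1)[0][0]

# Example: classify one iris sample against the rest of the data
iris = datasets.load_iris()
print(knn_predict(iris.data[75], iris.data, iris.target, k=5))
```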
KNN uses distance to predict a class label. This is a different application of distance--previously we've used distance to calculate error in regression problems; now, we'll use it as a measure of similarity between classifications.
If I picked an arbitrary M&M from a table without looking, but knew exactly where on the table I picked it from, I could use the colors of the surrounding M&Ms to make my most educated guess about the color of the candy in my hand. This is a natural thing to do; if we're unfamiliar with something we're looking at, we'll typically look for things that are similar, identify which traits are shared the most, then use that information to gauge whether the new item is similar to something we already know.
Check: Can you think of other examples where we commonly use this heuristic?
Below is some sample code that loads the iris dataset and fits a KNN classifier to it.
```python
from sklearn import datasets, neighbors, metrics
import pandas as pd

iris = datasets.load_iris()

# n_neighbors is our main option in KNN. We'll tune this value to attempt to improve our prediction.
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')

# Fit on the last two features (petal length and petal width)
knn.fit(iris.data[:, 2:], iris.target)

print(knn.predict(iris.data[:, 2:]))
print(iris.target)
print(knn.score(iris.data[:, 2:], iris.target))
```
Above we have the simplest implementation of KNN using sklearn, attempting to predict one of three iris types based on petal size. We use the default n_neighbors of 5, which removes most ties. Of course, ties can still happen; for example, with three labels, two of them could get two votes each while the last label gets one vote.
It is certainly possible for a KNN classifier to have a tie for votes: in binary classification, we'd see this with k=4 and each class (0 and 1) getting two votes. In the case of a tie, sklearn assigns the class it saw first in the training set.
We can also implement a weighting, so that distance plays a more significant role in the vote. Try changing the weights argument in the previous code to "distance" and see how it affects the accuracy.
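For example, a small sketch comparing the two settings on the same features (note that scoring on the training data flatters distance weighting, since every point is its own zero-distance neighbor):

```python
from sklearn import datasets, neighbors

iris = datasets.load_iris()

# Compare uniform voting with distance-weighted voting
for w in ['uniform', 'distance']:
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights=w)
    knn.fit(iris.data[:, 2:], iris.target)
    print(w, knn.score(iris.data[:, 2:], iris.target))
```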
In regression, we could use L1 regularization when we have significantly more features than observations.
With KNN, we have no regularization, and we face a different problem: since KNN works with distance, higher-dimensional data (i.e. more features) requires significantly more samples in order to have the same predictive power. Consider this: with more dimensions, all points slowly start averaging out to be roughly equally distant from one another; this causes significant issues for KNN! Keep the feature space limited, and KNN can do well.
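A quick simulation (a sketch using made-up uniform random data) shows this effect: as the number of features grows, the nearest and farthest points end up almost the same distance away.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.random((500, d))      # 500 random points with d features
    query = rng.random(d)              # one new point to classify
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    # A ratio near 1 means "nearest" and "farthest" are barely distinguishable
    print(d, round(dists.min() / dists.max(), 3))
```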
In a related example, consider the similarity of users for a particular product. When the product is very broad (for example, a newspaper/news website), the audience will also be very broad, so the newspaper will likely have many features: different sections, topics, types of stories, writers, etc. Yet with so many different parts that appeal to such a broad audience, user similarity should actually be high!
What about a product like toothpaste? While it also appeals to a broad audience, the types of toothpaste (and the features that separate them) are quite limited. It would be much simpler to identify "distance" between toothpaste users, since the feature set ("has fluoride," "controls tartar," etc.) is much smaller.
The previous metrics we've used for regressions do not apply for classification.
We could measure distance between the probability of a given class and an item being in the class: for example, guessing .6 for a 1 is a .4 error, while guessing .99 for 1 is .01 error... but this overly complicates our current goal: understanding binary classifications, like whether something is right or wrong.
Instead, let's start with two new metrics, which are inverses of each other: accuracy and misclassification rate.
Accuracy's equation is simple: of all the samples/observations we predicted, how many were correct? This is a value we'd want to increase (like r-squared).
Misclassification rate is directly opposite; of all the samples/observations we predicted, how many were incorrect? This is a value we'd want to decrease (like mean squared error).
Since they are opposites of each other, you can pick one or the other; effectively they convey the same information. But when coding, do make sure that you are using a classification metric when solving a classification problem!
scikit-learn will not warn you if you pick the wrong kind of metric, and accidentally using mean squared error for classification, or accuracy for regression, is a common programming pitfall.
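In scikit-learn, accuracy is available directly, and the misclassification rate is just its complement (a small sketch with made-up labels):

```python
from sklearn import metrics

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = metrics.accuracy_score(y_true, y_pred)  # correct / total = 4/5
print(accuracy)        # 0.8
print(1 - accuracy)    # misclassification rate: 0.2
```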
One of the primary challenges of KNN is solving for k--how many neighbors do we use?
- The smallest k we can use is 1: however, using only one neighbor will probably perform poorly.
- The largest k we can use is n-1; that is, every other point in the data set. But without weighting, this would always predict the class with the largest sample size! Within the iris data set, we should see that at some value of k and above, performance flatlines (in a bad way).
Using the following starter code and the iris data set, test and evaluate the following questions:
- What is the accuracy for when using k=1?
- What is the accuracy when using most (or all) of the other points as neighbors?
- With cross-validation, at what value of k is accuracy optimized? What k gives the best accuracy score for this data set?
Look at the grid scores (gs.cv_results_, or gs.grid_scores_ in older versions of scikit-learn) and contextualize the results with a figure using matplotlib, where the x-axis represents k and the y-axis represents accuracy (we call this a "fit chart"). This fit chart, and a printout of the scores, will help you answer the questions above.
```python
# GridSearchCV lives in sklearn.model_selection in newer versions of scikit-learn
# (older versions imported it from sklearn.grid_search)
from sklearn.model_selection import GridSearchCV

# some n_list! keep in mind cross validation
# recall: what's an effective way to create a numerical list in python?
params = {'n_neighbors': }

gs = GridSearchCV(
    estimator=,
    param_grid=,
    cv=,
)
gs.fit(iris.data, iris.target)
gs.cv_results_
```
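One possible way to fill in the blanks and draw the fit chart (a sketch, not the only solution; the odd-k range and cv=5 are our own choices):

```python
import matplotlib.pyplot as plt
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# With cv=5, each training fold holds 120 samples, so k can range up to 120
k_values = list(range(1, 120, 2))
params = {'n_neighbors': k_values}

gs = GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=params,
    cv=5,
)
gs.fit(iris.data, iris.target)

# Fit chart: k on the x-axis, cross-validated accuracy on the y-axis
plt.plot(k_values, gs.cv_results_['mean_test_score'])
plt.xlabel('k (n_neighbors)')
plt.ylabel('cross-validated accuracy')
plt.show()

print(gs.best_params_, gs.best_score_)
```

You should generally see accuracy flatten out and then fall off as k approaches the training-fold size, echoing the point above about very large k.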
- By default, the KNN classifier in sklearn uses the Minkowski metric for distance, parameterized by p: this is how it calculates distance (think of a right triangle: p=1 walks along the two legs to get from a to c, while p=2 takes the straight-line hypotenuse). A small distance sketch follows this list.
- What type of data does this metric work best for?
- What type of data might this distance metric not work for?
- For help, read about distance metrics in the sklearn documentation.
- It's possible to use KNN as a regression estimator. Using independent reading or your own brilliant creativity, come up with the following:
- Steps that KNN Regression would follow
- How it predicts a regression value
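As a concrete illustration of the p=1 versus p=2 distinction mentioned above, here is a small sketch using a 3-4-5 right triangle (the coordinate values are made up):

```python
import numpy as np

a = np.array([0.0, 0.0])
c = np.array([3.0, 4.0])

# Minkowski distance: (sum over features of |a_i - c_i| ** p) ** (1 / p)
manhattan = np.abs(a - c).sum()            # p=1: walk the two legs, 3 + 4 = 7
euclidean = np.sqrt(((a - c) ** 2).sum())  # p=2: the straight-line hypotenuse, 5
print(manhattan, euclidean)
```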
Instructor Note: See solution code for more detail.
- What are class labels? What does it mean to classify?
- How is a classification problem different from a regression problem? How are they similar?
- How does the KNN algorithm work?
- What primary parameters are available for tuning a KNN estimator?
- How do you define: accuracy, misclassification?
UPCOMING PROJECTS |
---|
Final Project, Deliverable 2 |
Unit Project 3, Deliverable 1 |