
Support Vector Machines

2020-05-16 Mathematics

  1. Maximal Margin Classifier
    1. Hyperplane
    2. The Maximal Margin Classifier
    3. Construction of the Maximal Margin Classifier
    4. Problems
  2. Support Vector Classifiers
    1. Details of the Support Vector Classifier
  3. Support Vector Machines
    1. The Support Vector Machine
  4. SVMs with More than Two Classes
    1. One-Versus-One Classification
    2. One-Versus-All Classification

We could use KNN, logistic regression, LDA, and decision trees for classification problems. Here we discuss a new set of methods: the maximal margin classifier, the support vector classifier, and the support vector machine, which are often loosely referred to as “support vector machines”.

Maximal Margin Classifier

Hyperplane

A hyperplane, which is a flat affine subspace of dimension $p-1$ in a $p$-dimensional space, divides that space into two halves. For example, when $p=2$, a hyperplane is simply a line and divides the 2-dimensional space into two half-planes. In general, a hyperplane in $p$ dimensions is defined by the equation
$$ \beta_0+\sum_{j=1}^p\beta_jX_j = 0. $$
Those $X$ satisfying this equation lie on the hyperplane. Any other $X$ satisfies either
$$ \beta_0+\sum_{j=1}^p\beta_jX_j < 0 $$
or
$$ \beta_0+\sum_{j=1}^p\beta_jX_j > 0, $$
each inequality corresponding to one half of the space.

Classification Using a Separating Hyperplane

We can use a separating hyperplane for classification. Here we only consider the case where the observations fall into two classes, coded so that $y_i$ is either $1$ or $-1$. Finding a hyperplane that separates the two classes amounts to finding coefficients that satisfy
$$ y_i = \text{sgn} \left(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\right), $$
or equivalently,
$$ y_i\left(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\right) > 0 $$
for all $i=1,\dots,n$. We then classify a test observation $x^*$ based on
$$ \text{sgn} (f(x^*)) = \text{sgn} \left(\beta_0+\sum_{j=1}^p\beta_jx_{j}^* \right). $$
The confidence of the classification is reflected by the magnitude of $f(x^*)$: the farther $x^*$ lies from the hyperplane, the more confident we can be in its class assignment.
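
To make the decision rule concrete, here is a minimal sketch in Python (using NumPy) that classifies test points by the sign of $f(x)$; the hyperplane coefficients are made up purely for illustration.

```python
import numpy as np

# A hypothetical separating hyperplane in p = 2 dimensions:
# f(x) = beta_0 + beta_1 * x_1 + beta_2 * x_2 (coefficients are made up).
beta_0 = -1.0
beta = np.array([2.0, 3.0])

def f(x):
    """Evaluate the hyperplane function at one or more points x."""
    return beta_0 + x @ beta

# Classify test observations by the sign of f(x); the magnitude of f(x)
# reflects how far each point lies from the hyperplane, i.e. our confidence.
x_test = np.array([[1.0, 1.0], [0.1, 0.2], [-2.0, 0.5]])
scores = f(x_test)
print(scores)           # signed scores
print(np.sign(scores))  # predicted classes in {-1, +1}
```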

The Maximal Margin Classifier

The hyperplane obtained from the method above is often not unique: we can usually shift it slightly or rotate it without violating the separation requirement. We therefore need to decide which separating hyperplane is the most suitable, and a natural choice is the maximal margin classifier. For each training observation, we compute its perpendicular distance to a given separating hyperplane; the minimum of these distances is called the margin of that hyperplane. Maximizing the margin over all separating hyperplanes yields the maximal margin classifier. The maximal margin classifier is often successful, but it can overfit the data when $p$ is large.

Training observations whose distance to the separating hyperplane equals the margin are called support vectors. To understand the name:

  1. They are vectors in the $p$-dimensional space.
  2. It turns out that the separating hyperplane is directly determined only by these support vectors (a small set of the training observations), which means that moving any other observation without crossing the boundary of the margin will not cause the hyperplane to change.

Construction of the Maximal Margin Classifier

The maximal margin classifier is the solution to the following optimization problem: find $\beta_0, \beta_1, \dots, \beta_p$ satisfying
$$ \sum_{j=1}^p \beta_j^2=1, $$
that maximize $M$ subject to
$$ y_i\left(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\right) \ge M > 0 $$
for all $i=1,\dots,n$. The constraint $\sum_{j=1}^p \beta_j^2=1$ ensures that the left-hand side of each inequality equals the perpendicular distance from $x_i$ to the hyperplane, so the maximized $M$ is indeed the margin of the hyperplane.
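
In practice this optimization problem is rarely solved by hand. As a minimal sketch, one can approximate the hard-margin solution with scikit-learn's soft-margin solver by making its regularization parameter very large (note that scikit-learn's `C` argument is not the budget $C$ discussed later; a huge value here effectively forbids margin violations). The toy data below is made up and linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: two well-separated clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([-2, -2], 0.3, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the maximal margin (hard-margin) classifier.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)  # estimated hyperplane coefficients
print(clf.support_vectors_)       # the few observations that determine it
```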

Problems

  1. The non-separable case. The constraint $M>0$ requires that a hyperplane separating the two classes exists, which is not always the case.
  2. Possible high variance. The maximal margin classifier can be too sensitive to individual observations:
    1. This can lead to a dramatic reduction in the confidence of the predictions.
    2. This may overfit the training data.

Support Vector Classifiers

The maximal margin classifier needs improvement in terms of

  • greater robustness to individual observations, and
  • better classification of most of the observations.

The support vector classifier (a.k.a. the soft margin classifier) does exactly this: we allow some observations to be on the wrong side of the margin, or even of the hyperplane, within a given “budget”.

Details of the Support Vector Classifier

The support vector classifier is the solution to the following optimization problem: find $\beta_0, \beta_1, \dots, \beta_p$ and non-negative slack variables $\epsilon_1, \epsilon_2, \dots, \epsilon_n$ satisfying
$$ \sum_{j=1}^p \beta_j^2=1 $$
and
$$ \sum_{i=1}^n \epsilon_i \le C, $$
that maximize $M$ subject to
$$ y_i\left(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\right) \ge M(1-\epsilon_i) $$
for all $i=1,\dots,n$. We first discuss the slack variables $\epsilon_1, \epsilon_2, \dots, \epsilon_n$: $\epsilon_i$ represents the amount by which the $i$th training observation is allowed to violate the margin:

  • If $\epsilon_i=0$, then $y_i(\beta_0+\sum_{j=1}^p\beta_jx_{ij}) \ge M$, and thus $x_i$ is on the right side of the margin;
  • If $0<\epsilon_i \le 1$, then $0 \le y_i(\beta_0+\sum_{j=1}^p\beta_jx_{ij}) < M$, and thus $x_i$ is on the wrong side of the margin but the right side of the hyperplane;
  • If $\epsilon_i > 1$, then $y_i(\beta_0+\sum_{j=1}^p\beta_jx_{ij}) < 0$, and thus $x_i$ is on the wrong side of the hyperplane.

The tuning parameter $C$ determines the number and severity of the margin violations we tolerate. Since it bounds the sum of all $\epsilon_i$, it can be thought of as a budget for the total amount of violation:

  • If $C=0$, then all $\epsilon_i=0$ and thus this amounts to the maximal margin classifier;
  • If $C > 0$, then no more than $C$ observations can be on the wrong side of the hyperplane, because each such observation has $\epsilon_i > 1$ and the $\epsilon_i$ sum to at most $C$.

In practice, $C$ is often chosen by cross-validation. When $C$ increases, we are more tolerant of violations of the margin, so the margin becomes wider and more observations end up on the wrong side of it; the classifier therefore tends to have higher bias and lower variance.

It turns out that, as with the maximal margin classifier, the support vector classifier is directly determined only by the observations that lie on the margin or violate it; these observations are referred to as support vectors. This is in line with the claim that $C$ controls the bias-variance trade-off: when $C$ increases, more observations become support vectors and thus contribute to determining the hyperplane.

Because the support vector classifier is determined by a potentially small set of observations (the support vectors), it differs from LDA, which takes all the observations into account, but resembles logistic regression, to which it turns out to be deeply linked.
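
As a sketch of how this tuning is done in practice, here is an example using scikit-learn's `SVC` with a cross-validated grid search over its regularization parameter. Note that scikit-learn's `C` plays the opposite role of the budget $C$ above: a larger scikit-learn `C` penalizes violations more heavily, which corresponds to a smaller budget. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Choose the regularization strength by 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)

print(search.best_params_)                   # selected regularization value
print(len(search.best_estimator_.support_))  # number of support vectors
```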

Support Vector Machines

So far we have only considered linear decision boundaries, but what if we want non-linear ones? One way is to enlarge the feature space with polynomial terms (e.g., $X_j^2$) or interaction terms (e.g., $X_jX_{j'}$) of the predictors. However, there are many ways to enlarge the feature space, and the computations can quickly become unmanageable.

The Support Vector Machine

We want to generalize the optimization problem arising in the context of the support vector classifier to non-linear settings. Recall that the inner product of two $p$-dimensional observations is
$$ \langle x_i, x_{i'}\rangle = \sum_{j=1}^p x_{ij} x_{i'j}. $$
It turns out that to solve the optimization problem of the support vector classifier, we only need all the pairwise inner products of the training observations. Moreover, to evaluate the classification function at a point $x$, we only need the inner products between $x$ and the training observations:
$$ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i \langle x, x_i \rangle, $$
where the $\alpha_i$ are parameters. Furthermore, $\alpha_i$ is non-zero only when $x_i$ is a support vector, so
$$ f(x) = \beta_0 + \sum_{i\in S} \alpha_i \langle x, x_i \rangle, $$
where $S$ is the set of indices of the support vectors.

In short, throughout fitting and prediction, the only operation we ever apply to training and test observations is the inner product. We can therefore replace the inner product with a generalized form
$$ K(x_i, x_{i'}), $$
where $K$ is some function we refer to as a kernel. When $K$ is not the plain inner product (i.e., it is a non-linear kernel), the resulting classifier is referred to as a support vector machine. In this case,
$$ f(x) = \beta_0 + \sum_{i\in S} \alpha_i K(x, x_i). $$

One popular choice for $K$ is
$$ K(x_i, x_{i'}) = \left(1+\sum_{j=1}^p x_{ij}x_{i'j}\right)^d. $$
When $d = 1$, this reduces to the support vector classifier. When $d>1$, it is a polynomial kernel of degree $d$, which is essentially equivalent to fitting the classifier in a higher-dimensional space involving degree-$d$ polynomials of the original features. Another choice is the radial kernel
$$ K(x_i, x_{i'}) = \exp \left(-\gamma \sum_{j=1}^p(x_{ij}-x_{i'j})^2 \right), $$
where $\gamma$ is a positive constant. The radial kernel has very local behavior: observations far from $x^*$ (in Euclidean distance) play little role in evaluating $f(x^*)$.
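
As an illustrative sketch, both kernels are available in scikit-learn through the `kernel`, `degree`, `coef0`, and `gamma` arguments of `SVC`; the half-moon data below is synthetic and deliberately not linearly separable.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: a classic non-linearly separable data set.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Polynomial kernel of degree 3 (with coef0=1 it matches the
# (1 + <x, x'>)^d form above, up to scaling by gamma).
poly_svm = SVC(kernel="poly", degree=3, coef0=1, C=1.0)

# Radial (RBF) kernel: exp(-gamma * ||x - x'||^2).
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0)

for name, model in [("polynomial", poly_svm), ("radial", rbf_svm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```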

SVMs with More than Two Classes

We now extend support vector machines to settings with $K$ classes ($K>2$).

One-Versus-One Classification

  1. For each of the $K \choose 2$ pairs of classes, we construct an SVM.
  2. A test observation is classified by each of these SVMs, and the final prediction is the class that occurs most frequently among the pairwise classifications (a sketch follows below).
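
A minimal sketch of one-versus-one classification with scikit-learn (its `SVC` already uses this strategy internally for multiclass problems; the explicit `OneVsOneClassifier` wrapper is used here only to make the strategy visible):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Iris has K = 3 classes, so one-versus-one fits (3 choose 2) = 3 SVMs.
X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale", C=1.0))
ovo.fit(X, y)

print(len(ovo.estimators_))  # 3 pairwise classifiers
print(ovo.predict(X[:5]))    # majority vote over the pairwise predictions
```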

One-Versus-All Classification

  1. For each of the $K$ classes, we construct an SVM comparing the $k$th class against the remaining $K-1$ classes.
  2. For each test observation $x^*$, we choose the class with the largest confidence against the remaining $K-1$ classes, where confidence is measured by the value of $\beta_{0k} + \sum_{j=1}^p \beta_{jk}x^*_j$ (a sketch follows below).
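
A corresponding sketch of one-versus-all (one-versus-rest) classification with scikit-learn; each fitted classifier provides a confidence score via `decision_function`, and the predicted class is the one whose score is largest:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One linear SVM per class, each separating that class from the rest.
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))
ovr.fit(X, y)

# One confidence score per class; predict the class with the largest score.
print(ovr.decision_function(X[:5]))
print(ovr.predict(X[:5]))
```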