Non-Linear Models

2020-05-08 Mathematics

  1. Basis Functions
  2. Polynomial Regression
    1. Problems
  3. Step Functions
  4. Splines
    1. Natural Splines
    2. Choosing the Number and Locations of the Knots
    3. Advantages over Polynomial Regression
  5. Smoothing Splines
    1. Choosing the Smoothing Parameter
  6. Local Regression
  7. GAM

Linear models, including linear regression and the models that improve on it, can offer only a limited approximation of the real world. Moving beyond linearity is therefore necessary in situations where more flexibility is required.

(Note that ridge regression and the lasso actually reduce the variance, and hence the flexibility, of linear regression.)

Basis Functions

To move beyond linearity, one basic idea is to transform the predictor using some non-linear functions and fit a linear model on the transformed values. The transformations are called basis functions and are applied to $X$: $b_1(X), b_2(X), …, b_k(X)$. We then fit the model: $$ y_i = \beta_0 + \sum_{j=1}^k \beta_j b_j(x_i) + \epsilon_i $$
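To make this concrete, here is a minimal NumPy sketch that fits such a model by ordinary least squares on the transformed design matrix; the data and the particular basis functions are made up for illustration.

```python
import numpy as np

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

# An arbitrary collection of basis functions b_1, b_2, b_3.
basis = [lambda t: t, np.sin, np.cos]
B = np.column_stack([np.ones_like(x)] + [b(x) for b in basis])  # design matrix
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)                # [beta_0, beta_1, ..., beta_k]
y_hat = B @ beta_hat
```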

Polynomial Regression

The idea is simple: use not only $X$ as a predictor but also $X^2, X^3, …, X^d$. In other words, we choose $b_j(X) = X^j$ in the basis-function framework: $$ y_i = \beta_0 + \sum_{j=1}^d \beta_j x_i^j + \epsilon_i $$ The variance of the fit at a given point can be estimated from the covariance matrix of the $\hat \beta_j$, and the estimated pointwise standard error is the square root of this variance.
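A sketch of a degree-4 polynomial fit, including the pointwise standard error just described; the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 200))
y = x**3 - 2 * x + rng.standard_normal(x.size)

d = 4
X = np.column_stack([x**j for j in range(d + 1)])         # columns 1, x, ..., x^d
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Pointwise standard error of the fitted curve at each observed x.
resid = y - X @ beta_hat
sigma2 = resid @ resid / (len(y) - d - 1)                 # noise variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)                     # estimated Cov(beta_hat)
se_fit = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))     # sqrt of diag(X Cov X^T)
```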

Problems

When the degree is higher than 3 or 4, the fitted function is likely to take very strange, wiggly shapes, especially near the boundary of the range of $X$.

Step Functions

We break the range of $X$ into bins and fit a different constant in each bin. In other (fancier) words, we first select $K$ knots $c_i$, $1 \le i \le K$. We then choose $$ b_i(X) = I(c_i \le X < c_{i+1}) $$ for $1 \le i < K$ and $b_K(X) = I(c_K \le X)$, where $I$ is an indicator function that equals 1 when its argument is true and 0 otherwise. Notice that when $X < c_1$, the prediction is just the intercept $\beta_0$; in other words, $\beta_0$ is the mean of $Y$ over all observations with $X < c_1$. Each remaining $\beta_j$ is the average increase in the response in the $j$-th interval relative to that baseline. Step functions work best when the knots correspond to natural breakpoints in the predictor; in other situations they can easily miss the action.
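A sketch with made-up knots, building the indicator columns exactly as above and fitting by least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.where(x < 4, 1.0, np.where(x < 7, 3.0, 2.0)) + 0.5 * rng.standard_normal(x.size)

knots = np.array([2.5, 5.0, 7.5])                         # c_1, ..., c_K (illustrative)
bins = np.digitize(x, knots)                              # 0 for x < c_1, ..., K for x >= c_K
B = np.column_stack([np.ones_like(x)]
                    + [(bins == k).astype(float) for k in range(1, len(knots) + 1)])
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
# beta_hat[0] is the mean response for x < c_1; beta_hat[k] is the average
# shift of the response in the k-th interval relative to that baseline.
```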

Splines

We may generalize the idea of step functions and fit a separate polynomial in each interval. This is called a piecewise polynomial. On its own, this approach is often undesirable because the fitted function is likely to have breaks (discontinuities) at the knots, which look ridiculous. To fix this, one constrains the model by requiring the fitted function (and possibly its derivatives up to some order) to be continuous at the knots. The result is a spline.

Take the (popular) cubic spline as an example. We require that the cubic spline itself and its first two derivatives be continuous. Hence, moving from left to right along the real line, we may add an increment to its third derivative every time we pass a knot. Let $\xi$ denote a knot and let $$ u(x) = (x-\xi)^3 $$ Then $$ u'(x) = 3(x-\xi)^2 $$ $$ u''(x) = 6(x-\xi) $$ $$ u'''(x) = 6 $$ and thus $$ u(\xi) = u'(\xi) = u''(\xi) = 0 $$ Therefore, adding a multiple of $u(x)$ to our function each time we pass a knot adjusts its third derivative there while leaving the function and its first two derivatives continuous. This motivates the truncated power basis function $$ h(x, \xi) = (x-\xi)^3_+ = \begin{cases} (x-\xi)^3 &x>\xi \newline 0 & x\le \xi \end{cases} $$ We then choose our $K+3$ basis functions as $$ b_i(X) = \begin{cases} X^i & 1\le i \le 3 \newline h(X, \xi_{i-3}) & 4 \le i\le K+3 \end{cases} $$ In general, a spline of degree $d$ with $K$ knots has $d+K+1$ degrees of freedom (counting the intercept $\beta_0$). Cubic splines are widely used partly because the change at each knot (a jump in the third derivative) is invisible to most human eyes.
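A sketch that builds this $K+3$ truncated power basis for a cubic spline (plus an intercept column) and fits it by least squares; the data and the knot locations are made up for illustration.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Design matrix [1, x, x^2, x^3, h(x, xi_1), ..., h(x, xi_K)]."""
    powers = [x, x**2, x**3]
    truncated = [np.where(x > xi, (x - xi) ** 3, 0.0) for xi in knots]
    return np.column_stack([np.ones_like(x)] + powers + truncated)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])                 # K = 3 interior knots
B = cubic_spline_basis(x, knots)                          # K + 4 columns incl. intercept
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta_hat
```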

Natural Splines

Splines can still behave poorly in the boundary regions. A natural spline additionally requires the fit to be linear at the boundaries (for $X$ smaller than the smallest knot or larger than the largest knot). Compared with its ordinary spline counterpart, a natural spline has $2(d-1)$ fewer degrees of freedom: in each of the two boundary regions, the $d-1$ coefficients of the terms of degree 2 and higher are forced to zero, leaving only $\beta_0$ and $\beta_1$ free. Therefore, a natural spline of degree $d$ with $K$ knots has $K-d+3$ degrees of freedom. We commonly use natural cubic splines, which have $K$ degrees of freedom.
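One standard way to construct a natural cubic spline basis is to reduce the truncated power basis so that the cubic terms cancel beyond the boundary knots. The sketch below follows that construction and produces exactly $K$ columns, i.e. $K$ degrees of freedom including the constant column; the data and knots are illustrative.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """K-column natural cubic spline basis (reduced truncated-power form)."""
    knots = np.sort(np.asarray(knots, dtype=float))
    K = len(knots)
    pos3 = lambda t: np.where(t > 0, t, 0.0) ** 3          # (.)_+^3

    def d(k):                                              # scaled truncated-cube difference
        return (pos3(x - knots[k]) - pos3(x - knots[K - 1])) / (knots[K - 1] - knots[k])

    cols = [np.ones_like(x), x]                            # linear part: 1, X
    cols += [d(k) - d(K - 2) for k in range(K - 2)]        # K - 2 non-linear terms
    return np.column_stack(cols)                           # n x K design matrix

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
B = natural_cubic_basis(x, np.quantile(x, [0.1, 0.3, 0.5, 0.7, 0.9]))  # K = 5
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
```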

Choosing the Number and Locations of the Knots

Locations of the knots:

  • We can place more knots where the function might vary rapidly and fewer knots where it might be stable.
  • Alternatively, it is common to place knots in a uniform fashion. We can specify the desired degrees of freedom and let the software place the corresponding number of knots at uniform quantiles of the data (see the sketch after these lists).

Number of the knots:

  • We can try out different numbers of knots and see which produces the best-looking curve.
  • Alternatively, we can use cross-validation.
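As promised above, here is a sketch of the uniform-quantile convention: for a cubic spline with an intercept, a requested df corresponds to $K = \text{df} - 4$ knots. Software packages differ on whether the intercept is counted, so treat the arithmetic as illustrative.

```python
import numpy as np

def knots_from_df(x, df, degree=3):
    """Place K = df - degree - 1 knots at uniform quantiles of x."""
    K = df - degree - 1
    probs = np.linspace(0, 1, K + 2)[1:-1]                 # drop the 0% and 100% quantiles
    return np.quantile(x, probs)

x = np.random.default_rng(5).uniform(0, 10, 500)
print(knots_from_df(x, df=7))                              # 3 knots near the 25/50/75% quantiles
```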

Advantages over Polynomial Regression

  • We can increase the flexibility (degrees of freedom) by adding knots, without increasing the degree of the polynomial.
  • We generally get more stable estimates.
  • We can decide in which regions we want the function to be more flexible by placing more knots there.
  • We can often get desirable behaviour at the boundaries with natural cubic splines.

Smoothing Splines

Recall that in every model we want a balance between bias and variance. To reduce the bias, we usually minimize the RSS on the training data; in the linear setting, this is exactly how we fit linear regression. To reduce the variance, we introduced ridge regression and the lasso as improvements on linear regression, which penalize large coefficient values.

More generally, beyond the linear setting, to reduce the bias we want a function $g$ that makes $$ \text{RSS} = \sum_{i=1}^n (y_i-g(x_i))^2 $$ small. However, with no restriction on the variance of the model, we could simply choose a $g$ that interpolates every response, which is never what we want. A model with low variance is a “smooth” model. One way to measure the smoothness (or, inversely, the roughness) of a function is $$ \int g''(t)^2 \, dt $$ This makes sense: for a linear $g$ the integral is 0, meaning no roughness at all, and for a quadratic $g$ it is proportional to the square of the quadratic coefficient, which also matches intuition. It is therefore natural to minimize $$ \sum_{i=1}^n (y_i-g(x_i))^2 + \lambda \int g''(t)^2 \, dt $$ where $\lambda$ is a nonnegative tuning parameter, in order to obtain a function that balances bias against variance (roughness). The minimizer $g$ turns out to be a shrunken version of a natural cubic spline with knots at $x_1, x_2, …, x_n$, with $\lambda$ controlling the amount of shrinkage.
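For a quick experiment, SciPy's UnivariateSpline fits smoothing splines. Note that its smoothing factor s is an upper bound on the residual sum of squares rather than the $\lambda$ above, but it plays the same role: larger s gives a smoother fit, and s = 0 interpolates the data.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

interpolant = UnivariateSpline(x, y, s=0)                  # zero bias, huge variance
smoother = UnivariateSpline(x, y, s=len(x) * 0.3**2)       # s roughly matched to the noise level
x_grid = np.linspace(0, 10, 500)
y_smooth = smoother(x_grid)
```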

Choosing the Smoothing Parameter

We have mentioned that a smoothing spline is a shrunken version of a natural cubic spline. Shrinkage really means that the effective degrees of freedom of the model are reduced by the smoothing parameter $\lambda$. The effective degrees of freedom can be calculated from the observed data and $\lambda$, for which there is a formula (that I don’t really understand, so I leave it out here). What we really want to know is which $\lambda$ to choose, and the method is simple and familiar: cross-validation. We choose the $\lambda$ that minimizes the cross-validated RSS. In particular, leave-one-out cross-validation (LOOCV) can be computed very efficiently with a formula that is also left out here.
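The LOOCV shortcut is likewise omitted here, but a brute-force cross-validation over candidate smoothing levels conveys the idea, again using UnivariateSpline's s as the tuning knob; the candidate values are arbitrary.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def cv_rss(x, y, s, n_folds=5, seed=0):
    """Sum of held-out squared errors for a given smoothing level s."""
    folds = np.random.default_rng(seed).integers(0, n_folds, size=len(x))
    total = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        order = np.argsort(x[train])
        spline = UnivariateSpline(x[train][order], y[train][order], s=s)
        total += np.sum((y[test] - spline(x[test])) ** 2)
    return total

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
candidates = [1.0, 5.0, 20.0, 80.0]
best_s = min(candidates, key=lambda s: cv_rss(x, y, s))
```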

Local Regression

We may compute the fit at a target point $x_0$ only using the nearby training observations. This approach is called local regression. Its algorithm is described below:

  1. Preselect the fraction $s$ where $0<s\le 1$.

  2. Gather $k = ns$ nearest neighbors of $x_0$ (in terms of the distance between $x_0$ and $x_i$).

  3. Assign a weight $K_{i0}=K(x_i, x_0)$ to each point in the neighborhood so that the point furthest from $x_0$ has weight 0 and the closest has the highest weight. Points outside the neighborhood have weight 0.

  4. Fit a weighted least squares regression of the $y_i$ on the $x_i$ by finding $\hat \beta _0$ and $\hat \beta _1$ that minimize

$$ \sum_{i=1}^n K_{i0}(y_i-\beta_0-\beta_1x_i)^2 $$

  5. The fitted value at $x_0$ is given by $\hat f(x_0) = \hat \beta_0 + \hat \beta_1 x_0$.
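A hand-rolled sketch of this algorithm; the tricube kernel used for the weights is a common choice of mine (not specified above), and it does fall to zero at the edge of the neighborhood as required.

```python
import numpy as np

def local_linear_fit(x, y, x0, span=0.3):
    """Local linear regression at a single target point x0."""
    k = max(2, int(np.ceil(span * len(x))))                # k = n * s nearest neighbours
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]
    w = (1 - (dist[idx] / dist[idx].max()) ** 3) ** 3      # tricube weights; furthest point -> 0
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    b0, b1 = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])  # weighted least squares
    return b0 + b1 * x0

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
x_grid = np.linspace(0.5, 9.5, 100)
y_hat = np.array([local_linear_fit(x, y, x0) for x0 in x_grid])
```

For practical use, statsmodels ships a lowess smoother that implements a robustified version of the same idea.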

GAM

So far we have only discussed non-linear extensions of single-predictor (simple) linear regression. When we have multiple predictors $X_1, X_2, …, X_p$, generalized additive models (GAMs) provide a general framework: $$ y_i = \beta_0 + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i $$ where each $f_j$ is fit with one of the non-linear methods discussed above.
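A minimal backfitting sketch for such an additive model, fitting each $f_j$ with a smoothing spline on the partial residuals; the data, iteration count, and smoothing level are all illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_gam(X, y, n_iter=10, s=None):
    """Backfitting: cycle through predictors, smoothing the partial residuals."""
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                                   # fitted f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - f.sum(axis=1) + f[:, j]  # remove every other f_k
            order = np.argsort(X[:, j])
            spline = UnivariateSpline(X[order, j], partial[order], s=s)
            fj = spline(X[:, j])
            f[:, j] = fj - fj.mean()                       # centre each f_j for identifiability
    return beta0, f

rng = np.random.default_rng(9)
X = rng.uniform(0, 10, size=(400, 2))
y = np.sin(X[:, 0]) + 0.1 * (X[:, 1] - 5) ** 2 + 0.3 * rng.standard_normal(400)
beta0, f = fit_gam(X, y, s=400 * 0.3**2)                   # s roughly matched to the noise level
```

Dedicated libraries (for example pygam) provide more complete GAM implementations.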
