9/11/19: Linear Regression Tutorial by Dr. Sadia Khalil

2 minute read

Regression

We learned two models for regression: simple linear regression as an example of a parametric method, and the Gaussian process as an example of a non-parametric method.

Linear Regression

In simple linear regression, we determine the parameters (w, ε) by fitting a linear equation, y = wx + ε, to the observed data. The figure of merit is the mean squared error of the predicted model ŷ = w x* + ε, where x* are unseen points of interest. We did two examples. In the first example, we use all of the data for the regression and then test it on any other data point x* ∈ [1, …, 20]. In the second example, we use half of the data for training, i.e. to determine the parameters (w, ε) of the linear model, and use the rest of the data to test the fitted line. In both examples, we plot the residuals or the mean squared error.
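
The tutorial's own code isn't reproduced here, but a minimal sketch of the second example (train on half of the data, test on the rest) could look like the following; the generated line y = 2x + 1, the even/odd split, and the use of scikit-learn are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy 1-D dataset: 20 points on a noisy line (a stand-in for the tutorial's data)
rng = np.random.default_rng(0)
x = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0.0, 1.0, size=len(x))

# Train on half of the points, test the fitted line on the other half
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

model = LinearRegression().fit(x_train, y_train)   # estimates the slope w and the intercept
y_pred = model.predict(x_test)

print("w =", model.coef_[0], " intercept =", model.intercept_)
print("test MSE =", mean_squared_error(y_test, y_pred))
print("residuals =", y_test - y_pred)
```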

Gaussian Process Regression

Bayes’ rule allows us to infer the posterior distribution p(w|y, X) by specifying a prior distribution, p(w), on the parameters, w, and reallocating probability based on the evidence (i.e. the observed data).

𝑝(𝑀|𝑦,𝑋)=𝑝(π‘Œ|𝑋,𝑀)𝑝(𝑀) / 𝑝(π‘Œ|𝑋)

The updated distribution p(w|y, X), called the posterior distribution, thus incorporates information from both the prior distribution and the dataset. To get predictions at unseen points of interest, x*, the predictive distribution is calculated by weighting all possible predictions by their posterior probability:

𝑝(π‘“βˆ—|π‘₯βˆ—,𝑦,𝑋) = βˆ«π‘€π‘(π‘“βˆ—|π‘₯βˆ—,𝑀) 𝑝(𝑀|𝑦,𝑋) 𝑑𝑀

𝑝(π‘“βˆ—|π‘₯βˆ—,𝑦,𝑋) = N(π‘“βˆ—|πœ‡βˆ—,Ξ£βˆ—)

The prior and the likelihood are usually assumed to be Gaussian for the integration to be tractable, and thus the predictive distribution is also a Gaussian distribution, from which we can obtain a point prediction using its mean (μ*) and an uncertainty quantification using its variance (Σ*). Therefore, instead of calculating the probability distribution of the parameters of a specific function, a GP calculates the probability distribution over all admissible functions that fit the data.
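
As a concrete illustration of why the Gaussian assumptions make the integral tractable, here is the standard weight-space result for a linear model f(x) = xᵀw; the prior covariance Σ_p, the noise variance σ_n², and the convention that the columns of X are the training inputs are assumptions of this sketch, not notation taken from the tutorial:

A = σ_n⁻² X Xᵀ + Σ_p⁻¹

p(w|y, X) = N( σ_n⁻² A⁻¹ X y, A⁻¹ )

p(f*|x*, y, X) = N( σ_n⁻² x*ᵀ A⁻¹ X y, x*ᵀ A⁻¹ x* )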

In GP regression, we first assume a Gaussian process prior, which can be specified by a mean function, m(x), and a covariance function, k(x, x’):

𝑓(π‘₯) ~ GP ( m(X), k(x, x’) )

A Gaussian process is like an infinite-dimensional multivariate Gaussian distribution, in the sense that any finite collection of labels y(x) from the dataset has a joint Gaussian distribution. We can also incorporate independent, identically distributed Gaussian noise, ε ~ N(0, σ_y²), in the labels:

y(x) = 𝑓(π‘₯) + Ο΅

y(x) ~ GP ( m(X), k(x,x’) + Ξ΄ij σ²y)

Since the collection of training labels, y, and test outputs, f*, is jointly multivariate Gaussian, we can write their joint distribution (assuming a zero mean function for simplicity) as:

[y, f*] ~ N( 0, [[K(X, X) + σ_y² I, K(X, X*)], [K(X*, X), K(X*, X*)]] )

From this, we can calculate μ* and Σ*, which are the mean and variance of the predictive posterior distribution.
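
Concretely, μ* and Σ* come out of a few lines of linear algebra. The sketch below assumes a squared-exponential kernel, a small toy dataset, and a noise level sigma_y; none of these are the tutorial's actual settings.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    # Squared-exponential covariance k(x, x') for 1-D inputs
    sqdist = (x1.reshape(-1, 1) - x2.reshape(1, -1)) ** 2
    return np.exp(-0.5 * sqdist / length_scale**2)

# Toy training data and test inputs (illustrative, not the tutorial's dataset)
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X)
X_star = np.linspace(-5, 5, 100)
sigma_y = 0.1                                       # assumed noise standard deviation

K = rbf_kernel(X, X) + sigma_y**2 * np.eye(len(X))  # K(X, X) + sigma_y^2 I
K_s = rbf_kernel(X, X_star)                         # K(X, X*)
K_ss = rbf_kernel(X_star, X_star)                   # K(X*, X*)

K_inv = np.linalg.inv(K)
mu_star = K_s.T @ K_inv @ y                         # predictive mean
Sigma_star = K_ss - K_s.T @ K_inv @ K_s             # predictive covariance
```

In practice one would use a Cholesky factorization instead of an explicit matrix inverse, but the explicit form above mirrors the equations directly.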

In the third tutorial, we draw samples from a prior distribution with a given mean and covariance to visualize the prior. Then we compute the predictive posterior distribution assuming zero noise in the training data. We study the effect of the kernel parameters and of noise on the posterior predictive distribution. In the last step, we find the optimal kernel parameters and noise level by maximizing the marginal log-likelihood, and then plot the resulting predictive posterior distribution.
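
The notebook code for this last step isn't shown here; one common way to implement it is with scikit-learn's GaussianProcessRegressor, whose fit method maximizes the log marginal likelihood over the kernel and noise hyperparameters. The data and kernel below are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy noisy 1-D data (an illustrative stand-in for the tutorial's dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(25, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=len(X))

# RBF kernel for the signal plus a WhiteKernel for the noise; fit() optimizes
# the length scale and noise level by maximizing the log marginal likelihood
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

X_star = np.linspace(-5, 5, 200).reshape(-1, 1)
mu_star, std_star = gp.predict(X_star, return_std=True)  # predictive posterior
print(gp.kernel_)                           # kernel with optimized hyperparameters
print(gp.log_marginal_likelihood_value_)    # maximized marginal log-likelihood
```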
