One of the classic techniques in machine learning is linear regression. This approach models a function using a linear relationship between one or more input variables and the output set. It’s used in a ton of different situations, from building classifiers for large e-commerce sites to suggest new products, to fitting a line to a data set in Excel.
Today we’ll implement this algorithm in Haskell as both an exercise in using the hmatrix library, and as a practical endeavor in using a functional language for a useful purpose.
Linear regression is implemented using linear algebra and an approach that minimizes the least squares error of a matrix of weights, using a training data set.
Overview of Linear Regression
We define a linear equation that accepts an arbitrary number of variables.
We setup a matrix of X values with a column of ones appended to the beginning to allow for an offset term.
Our matrix is set up with m rows of data, and n columns of independent variables. This represents the a set of input data.
You’ll notice that when we multiply w by the data set, we produce a column matrix of output values, corresponding to the y value for each of the set of independent variables.
This defines a set of equations, but we still have to determine the value of the weight matrix. To do this we take a set of data, called the training set, and plug it into the input data matrix, along with the output values. We subtract the estimated value, obtained via the weight matrix, from the actual output value and square this, take the derivative of the equation, set it equal to zero, and solve for the weight matrix.
All these steps are a little complex, and if you’d like to learn more the Wikipedia article on linear regression is pretty well done and gives a lot of information on the statistical theory behind regression, but in the end you end up with the following equation to determine the weight matrix:
Implementation in Haskell
To implement these functions in Haskell is actually pretty simple, we make use of hmatrix, an excellent matrix library built in Haskell.
Installing hmatrix in Mac OS is a little tricky, we have to install the GNU Scientific Library development libs first, luckily MacPorts can help us out:
sudo port install gsl-devel cabal install hmatrix
After that we’ll go ahead and load in some data. The authors of hmatrix provide a wonderful manual to get up and running with the library and a number of functions have been built to allow us to quickly build matrices up.
In this case we’ll use the loadMatrix IO function, which will pull in some input data from files. We format our data files as space-separated rows of numbers:
0 0.660691332817078 0.1 0.754551916894156 0.2 0.925818388603147 0.3 0.904216776317371 0.4 0.754324606651347 0.5 0.572540852930199 0.6 0.226045290129906 0.7 0.135596809334937 0.8 0.075248476906341 0.9 0.186604404237266 1 0.546356452677449
0 0.465670943260193 0.1 0.799516151425082 0.2 0.894981363854821 0.3 0.836263760453476 0.4 0.749666819566768 0.5 0.491481010373484 0.6 0.138236347335107 0.7 0.101170288570109 0.8 0.170054617567581 0.9 0.319242745425697 1 0.455485771272382
The input parameter is on the left, with the function output on the right. We have both a test and training set of data so we can evaluate the performance of our regression algorithm. Loading these files in Haskell is a cinch.
dat1 <- loadMatrix "train.txt" dat2 <- loadMatrix "test.txt" let [x_in, y_in] = toColumns dat1 let [x_in_test, y_in_test] = toColumns dat2
Next we need to add some dimensionality to our input variable. A cool trick commonly used in linear regression is to allow fitting of higher-order polynomials by adding additional columns to our input variable data set, and computing higher order values of the input variable for them. In our case we’ll use a power series from 0 to 2, to allow for a quadratic fit:
let x = fromColumns $ map (x_in^) [0..2] let x_test = fromColumns $ map (x_in_test^) [0..2]
After performing this our X matrix will look something like this:
11x3 1.000 0.000 0.000 1.000 0.100 0.010 1.000 0.200 0.040 1.000 0.300 0.090 1.000 0.400 0.160 1.000 0.500 0.250 1.000 0.600 0.360 1.000 0.700 0.490 1.000 0.800 0.640 1.000 0.900 0.810 1.000 1.000 1.000
The first column is the input value raised to the 0th power, which gives us the ability to compute an offset in our weight matrix, the second column is the input value raised to the first power, and the third column is that value raised to the second power.
Next we’ll compute the weight matrix, using the equation defined previously. Using the reference sheet in hmatrix’s manual proved to be useful here to look up your basic matrix manipulation functions:
let w = (pinv $ (ctrans x) <> x) <> (ctrans x) <> y_in
Breaking this apart we use a couple functions from hmatrix.
pinv– This generates the pseudo-inverse of a matrix for us, equivalent to X^(-1).
ctrans– The transpose of a matrix. Swaps out the rows and columns.
<>– The cross product of two matrices.
The function operates over the entire training set, and computes a weight matrix that minimizes the error given the dimensional constraints of the input matrix.
Finally we can evaluate our function on both the test set and training set:
let train_y = x <> w let test_y = x_test <> w putStrLn "Training set root-mean-square value:" print $ sqrt . sum $ toList ((y_in - train_y) ^ 2) putStrLn "Test set root-mean-square value:" print $ sqrt . sum $ toList ((y_in_test - test_y) ^ 2)
It’s super easy to use the weight matrix, multiplying an input set of x values against the weight matrix generates the output values. In our case we’ll get output that looks something like this:
Training set root-mean-square value: 0.7082342551441069 Test set root-mean-square value: 0.7088158510569752
The data set was generated from a sinusoidal function added with a slight amount of Gaussian noise, so a quadratic fit is not the best approximation. A 3rd order fit will generate a very nice root-mean-square value, and a
sin(x) kernel function will generate an almost perfect fit.
The approach does generalize well though, hmatrix is a wonderful library and allows you to perform a lot of statistical operations in Haskell that you would typically reach for a program like MATLAB for.