Archive for the ‘Algorithms’ Category
Machine Learning in Haskell – Linear Regression
One of the classic techniques in machine learning is linear regression. This approach models a function using a linear relationship between one or more input variables and the output set. It’s used in a ton of different situations, from building classifiers for large e-commerce sites to suggest new products, to fitting a line to a data set in Excel.
Today we’ll implement this algorithm in Haskell as both an exercise in using the hmatrix library, and as a practical endeavor in using a functional language for a useful purpose.
Linear regression is implemented using linear algebra and an approach that minimizes the least squares error of a matrix of weights, using a training data set.
Overview of Linear Regression
We define a linear equation that accepts an arbitrary number of variables.
We setup a matrix of X values with a column of ones appended to the beginning to allow for an offset term.
Our matrix is set up with m rows of data, and n columns of independent variables. This represents the a set of input data.
You’ll notice that when we multiply w by the data set, we produce a column matrix of output values, corresponding to the y value for each of the set of independent variables.
This defines a set of equations, but we still have to determine the value of the weight matrix. To do this we take a set of data, called the training set, and plug it into the input data matrix, along with the output values. We subtract the estimated value, obtained via the weight matrix, from the actual output value and square this, take the derivative of the equation, set it equal to zero, and solve for the weight matrix.
All these steps are a little complex, and if you’d like to learn more the Wikipedia article on linear regression is pretty well done and gives a lot of information on the statistical theory behind regression, but in the end you end up with the following equation to determine the weight matrix:
Implementation in Haskell
To implement these functions in Haskell is actually pretty simple, we make use of hmatrix, an excellent matrix library built in Haskell.
Installing hmatrix in Mac OS is a little tricky, we have to install the GNU Scientific Library development libs first, luckily MacPorts can help us out:
sudo port install gsl-devel cabal install hmatrix
After that we’ll go ahead and load in some data. The authors of hmatrix provide a wonderful manual to get up and running with the library and a number of functions have been built to allow us to quickly build matrices up.
In this case we’ll use the loadMatrix IO function, which will pull in some input data from files. We format our data files as space-separated rows of numbers:
test.txt
0 0.660691332817078 0.1 0.754551916894156 0.2 0.925818388603147 0.3 0.904216776317371 0.4 0.754324606651347 0.5 0.572540852930199 0.6 0.226045290129906 0.7 0.135596809334937 0.8 0.075248476906341 0.9 0.186604404237266 1 0.546356452677449
train.txt
0 0.465670943260193 0.1 0.799516151425082 0.2 0.894981363854821 0.3 0.836263760453476 0.4 0.749666819566768 0.5 0.491481010373484 0.6 0.138236347335107 0.7 0.101170288570109 0.8 0.170054617567581 0.9 0.319242745425697 1 0.455485771272382
The input parameter is on the left, with the function output on the right. We have both a test and training set of data so we can evaluate the performance of our regression algorithm. Loading these files in Haskell is a cinch.
dat1 <- loadMatrix "train.txt"
dat2 <- loadMatrix "test.txt"
let [x_in, y_in] = toColumns dat1
let [x_in_test, y_in_test] = toColumns dat2
Next we need to add some dimensionality to our input variable. A cool trick commonly used in linear regression is to allow fitting of higher-order polynomials by adding additional columns to our input variable data set, and computing higher order values of the input variable for them. In our case we’ll use a power series from 0 to 2, to allow for a quadratic fit:
let x = fromColumns $ map (x_in^) [0..2]
let x_test = fromColumns $ map (x_in_test^) [0..2]
After performing this our X matrix will look something like this:
11x3 1.000 0.000 0.000 1.000 0.100 0.010 1.000 0.200 0.040 1.000 0.300 0.090 1.000 0.400 0.160 1.000 0.500 0.250 1.000 0.600 0.360 1.000 0.700 0.490 1.000 0.800 0.640 1.000 0.900 0.810 1.000 1.000 1.000
The first column is the input value raised to the 0th power, which gives us the ability to compute an offset in our weight matrix, the second column is the input value raised to the first power, and the third column is that value raised to the second power.
Next we’ll compute the weight matrix, using the equation defined previously. Using the reference sheet in hmatrix’s manual proved to be useful here to look up your basic matrix manipulation functions:
let w = (pinv $ (ctrans x) <> x) <> (ctrans x) <> y_in
Breaking this apart we use a couple functions from hmatrix.
pinv– This generates the pseudo-inverse of a matrix for us, equivalent to X^(-1).ctrans– The transpose of a matrix. Swaps out the rows and columns.<>– The cross product of two matrices.
The function operates over the entire training set, and computes a weight matrix that minimizes the error given the dimensional constraints of the input matrix.
Finally we can evaluate our function on both the test set and training set:
let train_y = x <> w
let test_y = x_test <> w
putStrLn "Training set root-mean-square value:"
print $ sqrt . sum $ toList ((y_in - train_y) ^ 2)
putStrLn "Test set root-mean-square value:"
print $ sqrt . sum $ toList ((y_in_test - test_y) ^ 2)
It’s super easy to use the weight matrix, multiplying an input set of x values against the weight matrix generates the output values. In our case we’ll get output that looks something like this:
Training set root-mean-square value: 0.7082342551441069 Test set root-mean-square value: 0.7088158510569752
Evaluation
The data set was generated from a sinusoidal function added with a slight amount of Gaussian noise, so a quadratic fit is not the best approximation. A 3rd order fit will generate a very nice root-mean-square value, and a sin(x) kernel function will generate an almost perfect fit.
The approach does generalize well though, hmatrix is a wonderful library and allows you to perform a lot of statistical operations in Haskell that you would typically reach for a program like MATLAB for.
Why does Haskell use Singly Linked Lists Anyway?
As I continue down the road often less traveled in my journey to learn Haskell, I’ve found some tasks are exceedingly difficult to perform efficiently. One of the strengths of a functional language is its ability to allow a programmer to work with a higher-level representation of the problem, leaving implementation details to the system, whether it be a compiler, interpreter, or operating system. Systems like relational databases have allowed us to take a declarative approach to specifying a program, merely naming fields and describing relationships is enough to allow the database engine to efficiently choose an algorithm and pull results from the database. Functional languages like Haskell promise this same abstraction with algorithms themselves. In imperative languages we set up carefully constructed loops, and iterate over lists to find values and calculate results. In functional programming, we instead focus on using higher order reusable iteration functions, and composing a chain of functions over a list that achieve the desired result.
In theory this sounds great, it’s a more mathematical way to think about programming. Traditionally you focus a lot on how exactly the computer should manipulate data structures to get a desired result, but in Haskell you focus on formulating your problem. If you can adequately describe the problem, GHC takes care of turning that functional description into runnable code, life is good, your job is done, and the program runs.
Functional Programming: The Op Amp of Computer Science
A popular sentiment in the functional programming community is that teaching a functional language like Haskell first is the way things should be done, and would create better CS students. I don’t agree with this. While expressing problems in a functional manner is great, I think it makes the performance of an algorithm too opaque, understanding what makes functional code truly efficient is extremely difficult.

In fact, Haskell reminds me a lot of another tool that was designed to be easy to use in a different domain, the classical op-amp.
Op-amps were created in the world of electronics to be ideal amplifiers, allowing for designers to create large, linear gains in the amplitude of a small signal using a few low-cost components. They were created because before their existence, doing this was hard to get right, and created a lot of headaches.
To their credit, op-amps solve a number of difficult problems, and make the lives of electrical engineers everywhere much, much easier. The problem arises when one tries to do something mildly complex with an op-amp, and the abstraction breaks down. A ton of scary mathematics is brought into the picture, and one has to understand all the theoretical underpinnings of the device. It turns out op amps can only be maximally useful to those who understand how to build circuits without them, what their drawbacks are, and all the theory behind their design.
This sounds a lot like Haskell. It’s a wonderful tool, and allows for a programmer to offload a lot of the mental effort required to solve an issue to the compiler, but once you move beyond simple tasks you can quickly find yourself in deep water, with poor, unexplained performance.
A good example is list manipulation in Haskell. In imperative languages there’s an intuitive grasp of how efficient an algorithm will be. You’ll literally be writing the loops out, so the asymptotic runtime is explicitly stated in the code. However, in Haskell, things aren’t so simple. Prepending to the front of lists will take on the order of O(1), while appending elements to the end of a list will operate on the order of O(n). Why is this exactly? Well, the answer isn’t immediately obvious to someone first getting started in functional programming. Some research will lead you to the conclusion that it’s due to lists in Haskell being implemented as singly linked data structures. Some further thought will give insight into why this data structure was chosen when at first it seems like a limiting design choice. One can imagine that a singly linked list would be the data structure of choice for a functional language with lazy evaluation. The compiler can make assumptions and reuse lists due to their immutable nature, and create branches from elements because they lack a backwards facing pointer, along with a number of other nifty optimizations.
That’s all great, but again, it’s not obvious to someone who isn’t skilled in the trade and has a deep understanding of traditional imperative languages.
Not to say it’s all bad
In Haskell’s defense, they have done a wonderful job creating tools and documentation that try to address this problem. GHC contains some beautiful profiling tools that allow you to analyze what the performance-limiting steps are in your computations, and the documentation for the prelude frequently mentions asymptotic runtime of functions.
Make no mistake, Haskell is a powerful tool, the class of languages it exemplifies may very well be the future of programming, but to properly utilize it requires a pretty deep understanding of what’s going on under the hood, and of the fundamentals of computer science. I think teaching a functional language as a first language would be a mistake.
Algorithm Puzzle: Finding Three Numbers that Sum to a Value
As part of the new year I’d like to start presenting a classic algorithm problem every week, briefly analyze the runtime, and shows implementations in a standard language such as C++ or Python, and an implementation in Haskell.
The first question I’d like to look at is a simple interview question I was asked a while ago. The basic problem is defined as follows:
Given a list of random numbers of arbitrary length, find a combination of three numbers that sum to a certain value with the best possible big-O runtime.
There are generally two accepted approaches to this problem, the naive approach, with a Big-O time complexity of O(n3), and a slightly more intelligent approach, with O(n2).
The Naive Approach
The obvious way to first attack this problem is by simply iterating over every possible combination of values, and testing each one. This is easily implemented in Python.
#Naive Approach def naiveApproach(numbers, value): for x in numbers: for y in numbers: for z in numbers: if (x+y+z == value): print 'Value found: %s' % str((x, y, z)) return
In Haskell this approach is also really elegant looking, I took the approach of simply creating a list of tuples representing every possible combination, and filtering it for tuples that only meet the criteria of being correct.
isSum :: Int -> (Int, Int, Int) -> Bool
isSum sum (a, b, c) = sum == (a + b + c)
naiveApproach :: [Int] -> Int -> (Int, Int, Int)
naiveApproach xs sum = head $ flip List.filter (combinations xs) $ isSum sum
where combinations xs = [ (a, b, c) | c <- xs, b <- xs, a <- xs ]
This works alright for smaller datasets, but becomes impractical for large sets of numbers and slows down dramatically.
The Better Approach
To solve this I take the typical approach and see if I can exploit some sort of underlying structure in the problem. In this case the data source is a random array of numbers, but we know that sorting a list can be done in O(nlg(n)) time, so this is a good starting point for optimization, considering that the runtime is pretty dismal as it stands.
After sorting the list it’s not entirely obvious how to approach this problem. I found that examining a simpler problem helped out. Simplifying the problem, I asked how I would go about finding two numbers in a list that sum together to a certain value. This turns out to be pretty easy. One can sort the list, assuming the sorted list ascends from left to right and then, starting from both ends, move towards the middle. For each pair, sum the values and compare to the required value. If the value summed is less than the required value, increment the left pointer by one, else decrement the right pointer. Once a value is found, or when these pointers collide, abort.
After solving this simpler problem, we apply it to the three number problem. We can apply this sub solution iteratively to each element in the list. That is, for each element in the list, test if there exists two other elements using our sub solution that, when summed together with the current element, equal the required value. The Python implementation looks something like this:
def betterApproach(numbers, value): numbers.sort() for z in range(0, len(numbers)): x = 0 y = len(numbers) - 1 while (x <= y): sumValue = numbers[x] + numbers[y] + numbers[z] if (sumValue == value): print 'Value found: %s' % str((numbers[x], numbers[y], numbers[z])) return elif (sumValue < value): x += 1 else: y -= 1
The sub solution has a runtime of O(n), assuming the list is presorted, and applying the sub solution to each element in the list nets us another n of time, bringing the total runtime of the algorithm to O(n2).
Writing this solution in Haskell was not easy, but this is my first venture into serious functional programming.
betterApproach :: [Int] -> Int -> (Int, Int, Int)
betterApproach zs sumVal = head $ betterHelper1 (sort zs)
where betterHelper1 :: [Int] -> [(Int, Int, Int)]
betterHelper1 (w:ws) = (betterHelper2 0 w (w:ws) (reverse (w:ws))) ++ betterHelper1 ws
betterHelper1 [] = []
betterHelper2 :: Int -> Int -> [Int] -> [Int] -> [(Int, Int, Int)]
betterHelper2 c z (x:xs) (y:ys)
| c >= (length zs - 1) = []
| sum [x, y, z] == sumVal = [(z, x, y)]
| sum [x, y, z] > sumVal = betterHelper2 (c + 1) z (x:xs) ys
| sum [x, y, z] < sumVal = betterHelper2 (c + 1) z xs (y:ys)
betterHelper2 _ _ _ _ = []
What we’re doing here is defining two helper functions. The first helper is pretty simple, recursing over the list and appending any results from the second helper to the end of a list. The second helper is a little more complex.
It takes four parameters, a depth counter, the current z value, and two lists of numbers. To emulate traversal from both ends of a list, I pass in a copy of the list reversed as the second list. I traverse each list head first, using guards to conditionally return a result based on the comparison of the sum of the three values with the required value. When we recurse through the list, depending on if the value is greater or less than the required value I will pop the front element of one of the lists off and discard it.
As a performance enhancement, and to prevent repetitive evaluation, we squash the recursive branch when the number of pops has exceeded the size of the list of random numbers, which is the functional equivalent of detecting when the pointers overlap.
Evaluating Performance
Predictably the Haskell implementation far outpaced the Python implementation. I turned on multicore support when running benchmarks, which gives Haskell a slightly unfair advantage, but since concurrency is one of the core strengths of the language I felt it was fair.
The following benchmarks were run using a small Python script over iterations of 10-100 random data sets of 10,000 or 100,000 elements ranging from 1 to 10,000,000. Three numbers were chosen at random to be summed together as the target value.
| Naive Algorithm 10,000 pts |
Better Algorithm 100,000 pts |
|
|---|---|---|
| Python | 27127ms | 1972ms |
| Haskell | 384ms | 682ms |
As you can see, Haskell really shines when evaluating the naive approach, which parallelizes easily. When using the more efficient implementation, Python and Haskell both can handle much larger data sets, and the performance gap narrows a little, although Haskell still comes out on top.
If you can find a more efficient implementation please let me know! I’ll be the first to admit my Haskell skills aren’t that great at this point in time. Feel free to write in the comments, or send me an e-mail.