<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Moving Forward</title>
	<atom:link href="http://andrewbrobinson.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://andrewbrobinson.com</link>
	<description>Homepage of Andrew Robinson</description>
	<lastBuildDate>Thu, 16 Feb 2012 06:10:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Generating SVN Statistics</title>
		<link>http://andrewbrobinson.com/2012/02/16/generating-svn-statistics/</link>
		<comments>http://andrewbrobinson.com/2012/02/16/generating-svn-statistics/#comments</comments>
		<pubDate>Thu, 16 Feb 2012 06:10:56 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=747</guid>
		<description><![CDATA[Recently I became very interested in generating some statistics from a SVN repo. In our research group we have a repository for all the currently in progress papers, which are written in LaTeX, and doing some rudimentary reporting on the number of committed lines by author sounded like a fun way to gamify the process [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I became very interested in generating some statistics from an SVN repo. In our research group we have a repository for all the papers currently in progress, which are written in LaTeX, and doing some rudimentary reporting on the number of committed lines by author sounded like a fun way to gamify the process of writing. You can see one of the highlights of this reporting below. As would be expected of a graduate student research lab, a large number of commits happen late at night, with a large void during business hours. </p>
<p><img src="http://andrewbrobinson.com/wp-content/uploads/2012/02/activity_time.png" alt="" title="activity_time" width="600" height="370" class="aligncenter size-full wp-image-751" /></p>
<p>I found a great tool to generate some statistics from SVN repos, appropriately called <a href="http://statsvn.org/">StatSVN</a>. It&#8217;s decent out of the box, but lacks some customizability and automation.</p>
<p>The way it works by default is you invoke it as shown below, and it uses a generated output file from SVN, along with the path to a checked out local repo, to generate a pile of HTML reports and figures tallying various commit statistics. It automatically invokes subversion, and requests the diffs between commits, storing data in a local cache file. </p>
<pre class="brush: plain; title: ;">
java -jar statsvn.jar papers/logfile.log papers -include &quot;**/*.tex&quot; -config-file config.txt
</pre>
<p>This works pretty well, but to really create some fun statistics we need to work a little harder. I wanted to filter out some of the larger bulk-commits that don&#8217;t accurately reflect actual work, and I wanted to customize the generated report. Naturally I fired up vim and started writing some Python&#8230;</p>
<h2>Filtering Out Certain Revisions</h2>
<p>The first problem was that this repository is pretty new, and a lot of the first commits involved setting up templates and doing other administrative tasks. I want to collect statistics on who produced the most content, not who can push the metaphorical broom hardest in cleaning up templates and moving directories around, so I needed a method to filter out certain commits. StatSVN works by first parsing an exported svn log file containing a list of commits. What I found is that if you remove the log entry for a commit, StatSVN simply ignores that commit.</p>
<h4>A Sample Log Entry from the SVN Log</h4>
<pre class="brush: xml; title: ;">
&lt;logentry revision=&quot;172&quot;&gt;
&lt;author&gt;androbin&lt;/author&gt;
&lt;date&gt;2012-02-15T19:06:10.225746Z&lt;/date&gt;
&lt;paths&gt;
&lt;path kind=&quot;file&quot; action=&quot;M&quot;&gt;/papers/mobicom12-audio/tex/design.tex&lt;/path&gt;
&lt;/paths&gt;
&lt;msg&gt;Fixed broken paper by updating design.tex&lt;/msg&gt;
&lt;/logentry&gt;
</pre>
<h4>Python Code to Perform an Update and Generate the Log</h4>
<pre class="brush: python; title: ;">
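import os

# Bring the working copy up to date before exporting the log
print('Updating SVN repo')
os.system('cd papers; svn up')

# Export the full verbose history as XML for StatSVN to parse
print('Running XML export from SVN repo')
os.system('cd papers; svn log -v --xml &gt; logfile.log')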
</pre>
<p>Before removing anything, we update the repository, which I&#8217;ve checked out into a directory called <code>papers/</code>, and generate a fresh log file. Next, using <code>lxml</code>, we load the log file and an exclude list, and perform the deletion.</p>
<h4>Removing Revisions from Statistics based on Number</h4>
<pre class="brush: python; title: ;">
import lxml.etree as le

# One revision number per line; blank lines are skipped
with open('exclude-list.txt', 'r') as f:
    listToExclude = [line.strip() for line in f if line.strip()]

print('Exclude list: %s' % listToExclude)

doc = le.parse('papers/logfile.log')
for pat in listToExclude:
    # Drop every log entry whose revision attribute matches
    for elt in doc.findall('logentry[@revision=\'' + pat + '\']'):
        print('Removing element...')
        elt.getparent().remove(elt)

print('Writing file back to disk...')
doc.write('papers/logfile.log')
</pre>
<p><code>exclude-list.txt</code> simply consists of revision numbers, separated by newlines.</p>
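<p>Picking the revisions to exclude can also be partly automated. As a minimal sketch (not part of the original script), the helper below ranks commits by how many paths they touched, which tends to surface bulk commits. The function name <code>largest_commits</code> and the <code>top</code> parameter are made up for illustration; it assumes the XML layout produced by <code>svn log -v --xml</code>, as in the sample entry above, and uses the standard library parser.</p>
<pre class="brush: python; title: ;">
import xml.etree.ElementTree as ET

def largest_commits(log_xml, top=5):
    # Hypothetical helper: rank revisions by how many paths each commit
    # touched, biggest first, and return the revision numbers of the top few.
    root = ET.fromstring(log_xml)
    counts = []
    for entry in root.iter('logentry'):
        touched = len(entry.findall('paths/path'))
        counts.append((touched, entry.get('revision')))
    # The sort is stable, so commits with equal counts keep their log order
    counts.sort(key=lambda pair: pair[0], reverse=True)
    return [revision for touched, revision in counts[:top]]
</pre>
<p>Reading <code>papers/logfile.log</code> into a string and passing it to <code>largest_commits</code> gives a short list of candidates to review by hand before adding them to <code>exclude-list.txt</code>.</p>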
<p>After we&#8217;ve modified the logfile we invoke the statistics generation program manually.</p>
<h4>Invoking StatSVN</h4>
<pre class="brush: python; title: ;">
print('Invoking graph generation software...')
os.system('java -jar statsvn.jar papers/logfile.log papers -include &quot;**/*.tex&quot; -config-file config.txt')
</pre>
<p>Of interest here is the fact that we&#8217;ve passed it a configuration file. I&#8217;ve identified three key graphs I&#8217;d like to include in my final report, and resizing them to fit the spaces I&#8217;ve allocated for them is a little challenging, so I&#8217;ve used StatSVN&#8217;s config file support to resize them and to increase the plot <code>lineStroke</code> so the lines are more readable.</p>
<h4>StatSVN Config File</h4>
<pre class="brush: plain; title: ;">
chart.loc_per_author.lineStroke=4
chart.loc_per_author.width=600
chart.loc_per_author.height=300

chart.activity_time.width=600
chart.activity_day.width=600
chart.activity_time.height=370
chart.activity_day.height=408
</pre>
<h2>Making an Aggregate Report</h2>
<p>So now I&#8217;ve filtered out all the commits I don&#8217;t care about, but I&#8217;m not that happy with the default reports. My goal is to load these stats on a display-case monitor, and none of the default reports are attractive enough, or contain the right information, to make the cut. The approach I decided to take here was to use BeautifulSoup to extract the information I wanted from each of the reports, and then composite it into one report using a template file. This works really well in practice: since the report software&#8217;s format won&#8217;t change, BeautifulSoup has no problems selecting the elements of interest.</p>
<h4>HTML Template for the Final Report</h4>
<pre class="brush: xml; title: ;">
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Group Dangerzone Paper Log&lt;/title&gt;
&lt;link rel=&quot;stylesheet&quot; href=&quot;ocss.css&quot; type=&quot;text/css&quot;&gt;
&lt;/head&gt;

&lt;body&gt;
&lt;h1&gt;Dangerzone Paper Commit Log&lt;/h1&gt;
&lt;table width=&quot;100%&quot;&gt;
&lt;tr&gt;
    &lt;td valign=&quot;top&quot; width=&quot;70%&quot;&gt;
    &lt;table width=&quot;100%&quot;&gt;
        &lt;tr&gt;
            &lt;td valign=&quot;top&quot;&gt;[A]&lt;/td&gt;
            &lt;td align=&quot;right&quot;&gt;&lt;img src=&quot;loc_per_author.png&quot; /&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/table&gt;
    &lt;br&gt;&lt;br&gt;&lt;br&gt;
    &lt;table width=&quot;100%&quot;&gt;
        &lt;tr&gt;
            &lt;td valign=&quot;top&quot;&gt;&lt;img src=&quot;activity_time.png&quot; /&gt;&lt;/td&gt;
            &lt;td&gt;&lt;img src=&quot;activity_day.png&quot; /&gt;&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/table&gt;
    &lt;h2&gt;Commit Message Tag Cloud&lt;/h2&gt;
    [T]
    &lt;/td&gt;
    &lt;td&gt;
        [C]
    &lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>In the template shown above we use placeholders <code>[C]</code>, <code>[T]</code>, and <code>[A]</code> for the commit log, tag cloud, and list of author contributions by percentage, respectively. The Python script below extracts those elements from the generated reports and pushes them into the template before writing it to <code>output.html</code>.</p>
<h4>Making a Pretty Report</h4>
<pre class="brush: python; title: ;">

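# Assumption: BeautifulSoup 4; the original post does not show its import
from bs4 import BeautifulSoup

print('Generating output HTML...')

def getSoup(fileName):
    # Parse one of the HTML reports generated by StatSVN
    with open(fileName, 'r') as f:
        return BeautifulSoup(f.read(), 'html.parser')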

template = ''
with open('template.html', 'r') as f:
    template = f.read()

developers = getSoup('developers.html')
index = getSoup('index.html')
clog = getSoup('commitlog.html')

authorTable = developers.html.body.table
template = template.replace('[A]', str(authorTable))

tagCloud = index.html.body.findAll('div')[2].p
template = template.replace('[T]', str(tagCloud))

commitList = clog.html.body.findAll('dl')[1]
# Keep only the first 24 child nodes by stripping extras from the end
for i in range(24, len(commitList.contents)):
    commitList.contents[len(commitList.contents) - 1].extract()
template = template.replace('[C]', str(commitList))

with open('output.html', 'w') as f:
    f.write(template)
</pre>
<h2>The End Result</h2>
<p>This whole script is saved in a file, set to run with a cron job every half-hour, and a line is added to the template file to cause the browser to refresh the page every so often. The finished product is shown below.</p>
<p><a href="http://andrewbrobinson.com/wp-content/uploads/2012/02/Screen-Shot-2012-02-16-at-1.03.23-AM.png"><img src="http://andrewbrobinson.com/wp-content/uploads/2012/02/Screen-Shot-2012-02-16-at-1.03.23-AM.png" alt="" title="Screen Shot 2012-02-16 at 1.03.23 AM" width="650" class="aligncenter size-full wp-image-765" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/02/16/generating-svn-statistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RAID &#8211; Overcoming Limits of Annoying Spinning Things</title>
		<link>http://andrewbrobinson.com/2012/01/28/raid-a-technology-to-overcome-limitations-of-annoying-spinning-disks/</link>
		<comments>http://andrewbrobinson.com/2012/01/28/raid-a-technology-to-overcome-limitations-of-annoying-spinning-disks/#comments</comments>
		<pubDate>Sat, 28 Jan 2012 20:15:12 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=718</guid>
		<description><![CDATA[The story behind the development of RAID is really interesting. It&#8217;s a technology that was the consequence of a unique time in computing history, when disks were just beginning their evolution from the large, washing-machine sized units of commercial use to the small units found in a modern personal computer. As usual with progress like [...]]]></description>
			<content:encoded><![CDATA[<p>The story behind the development of RAID is really interesting. It&#8217;s a technology that was the consequence of a unique time in computing history, when disks were just beginning their evolution from the large, washing-machine sized units of commercial use to the small units found in a modern personal computer. As usual with progress like this, a number of people all within a similar window of time independently started looking at these small personal hard drives, did some math and realized that if some reliability issues could be addressed they could actually totally demolish the single large disks in almost every possible metric. Large numbers of personal hard drives in an array could easily beat single large disks in performance, cost, and reliability if some key insights were made in how to store data across them.</p>
<p>Recently I gave a talk on the original <a href="http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf">RAID paper</a> by David Patterson et al. While a lot of the ideas in this paper weren&#8217;t new at the time of publication, it worked to create a common language and platform to open discussions of spanning disk arrays. It prompted a lot of industry work, and is a definite cornerstone of the world of storage as we know it today.</p>
<blockquote><p>Every 2 Days We Create As Much Information As We Did Up To 2003</p></blockquote>
<p style="text-align: right;">- Eric Schmidt (2010)</p>
<p>Nowadays it&#8217;s interesting to see how well the ideas in the RAID paper have held up. It&#8217;s obvious that small, inexpensive disks won out over their larger counterparts, but whether RAID is still practical in today&#8217;s world is less clear. The amount of data we generate is amazing, and storage technologies have scaled to sizes that would have been hard to imagine a few years ago. RAID used to be the de facto standard for spanning drives, but I don&#8217;t think that&#8217;s true anymore. Ars Technica <a href="http://arstechnica.com/business/news/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data.ars">published an article</a> a few days ago detailing some of the technology behind the data centers driving the big cloud operations out there today. This architecture isn&#8217;t RAID, but at the same time I think it captures a lot of the key ideas in RAID.</p>
<p>Feel free to use the slides in any sort of derivative work or as part of a presentation. </p>
<p style="text-align: center;">
<object type='application/x-shockwave-flash' wmode='opaque' data='http://static.slideshare.net/swf/ssplayer2.swf?id=11282545&doc=eecs582-raid-120126152018-phpapp01' width='500' height='410'><param name='movie' value='http://static.slideshare.net/swf/ssplayer2.swf?id=11282545&doc=eecs582-raid-120126152018-phpapp01' /><param name='allowFullScreen' value='true' /></object></p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/28/raid-a-technology-to-overcome-limitations-of-annoying-spinning-disks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why are event-driven servers so great?</title>
		<link>http://andrewbrobinson.com/2012/01/27/why-are-event-driven-servers-so-great/</link>
		<comments>http://andrewbrobinson.com/2012/01/27/why-are-event-driven-servers-so-great/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 21:46:54 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=688</guid>
		<description><![CDATA[Recently there has been a huge surge in event-driven servers. With the introduction and wide-spread adoption of Node.js as a Javascript based application server, and nginx, a HTTP proxying server one has to wonder what it is about event-driven architecture that works so well. These servers are touted as literal silver bullets for devops, promising [...]]]></description>
			<content:encoded><![CDATA[<p>Recently there has been a <a href="http://news.netcraft.com/archives/2011/10/06/october-2011-web-server-survey.html">huge surge</a> in event-driven servers. With the introduction and widespread adoption of <a href="http://nodejs.org">Node.js</a> as a JavaScript-based application server, and <a href="http://nginx.org/">nginx</a>, an HTTP proxying server, one has to wonder what it is about event-driven architecture that works so well. These servers are touted as silver bullets for devops, promising massive gains in performance and concurrency with no changes in hardware, and amazingly enough, for a lot of workloads they do indeed deliver.</p>
<p>Let&#8217;s take a closer look at event-driven architecture, how it&#8217;s different than traditional concurrency models, and what the future of these servers might look like.</p>
<h3>The Old Way</h3>
<p>Traditionally servers like Apache have used the single child per connection model. When a user connects to the server a child process is spawned, and handles the connection. Each connection gets a separate thread and child, and as the request is processed data is returned. As the request blocks on things like database reads and web service requests the child process waits. </p>
<p><img  class="comic" src="http://andrewbrobinson.com/wp-content/uploads/2012/01/apache1.jpg" alt="" title="apache" width="500" height="407" class="aligncenter size-full wp-image-712" /></p>
<p>This works pretty well for small workloads, but it really doesn&#8217;t scale well. The going gets tough when the number of requests gets too large. Apache will quickly hit the maximum number of child processes, and everything gets slow. Each request has its own thread, and when using PHP the amount of memory required by each process can be quite large. The typical PHP runtime can take up to 64MB. </p>
<p>There&#8217;s also a number of reliability problems associated with this model. With a misconfigured server it&#8217;s super easy to launch a denial of service attack against Apache. A large number of simultaneous requests can quickly exhaust the available resources, and typical workloads can face really serious bottlenecks. </p>
<p>The fact of the matter is that operating systems aren&#8217;t designed to handle workloads associated with web traffic. The traditional threading model assumes that a small number of intensive operations will be required for an application to run. Linux was designed with the idea of a handful of users executing multithreaded programs that juggled simple operations, such as background file writing and UI presentation. What it wasn&#8217;t designed to do was handle thousands of simultaneous connections, where the constraints don&#8217;t come from the system itself, but from waiting for database queries to return and remote procedure calls to execute. </p>
<p>Forking and threading are pretty heavy operations. Creating a thread adds overhead and requires allocating an entirely new stack and execution context for each child. Additionally the context switches are pretty brutal, and there&#8217;s an open question of whether the CPU scheduling model fits well with what a typical web-server workload looks like. </p>
<p>All of this basically sums to the fact that operating systems don&#8217;t provide, out of the box, an abstraction that makes sense for handling highly concurrent workloads. We realized this pretty quickly once the internet started taking off, and started looking for a solution. </p>
<h3>The New Way</h3>
<p>Recognizing that the observed workloads are dramatically different than what was expected, a solution was proposed. The observation was made that the web workload consists of a lot of waiting. Apache, despite spawning a bunch of child processes, consuming gobs of memory, typically just sits around waiting for other tasks to complete. This observation led to a lot of head scratching, and someone had an interesting idea.</p>
<p>Since so much of the web workload consisted of waiting, it was proposed that we abandon the idea of child processes altogether. Instead of spawning a thread for each request, all the requests would be managed by a single thread, and this thread would be called the event loop. This event loop would gracefully hop between all the active connections and fire off asynchronous requests to storage and database servers, and when these requests return, additional events would be pushed onto the queue to be handled. </p>
<p><img class="comic" src="http://andrewbrobinson.com/wp-content/uploads/2012/01/node1.jpg" alt="" title="node" width="650" height="360" class="aligncenter size-full wp-image-707" /></p>
<p>This is actually a really cool idea, and it solves a lot of problems. All of a sudden the number of concurrent requests handled isn&#8217;t bounded as tightly. Sure, there&#8217;s some overhead in maintaining a list of open TCP connections, but memory requirements aren&#8217;t ballooning out of control anymore because so much of the runtime is now shared.</p>
<p>Node.js and Nginx both use this approach to build applications that scale to a super large number of connections. Everything happens inside an event loop, and multiple connections are handled gracefully. </p>
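<p>To make the idea concrete, here is a minimal sketch of the same pattern in Python rather than Node.js, using the standard <code>asyncio</code> module (this needs Python 3.7 or newer). The names <code>handle_request</code>, <code>serve</code>, and <code>run</code> are made up for illustration, and the sleep is only a stand-in for a database query or remote call. All the handlers share one thread, and each one hands control back to the event loop while it waits.</p>
<pre class="brush: python; title: ;">
import asyncio

async def handle_request(request_id, wait_seconds):
    # Stand-in for a database query or remote call: the handler yields to
    # the event loop while it waits instead of blocking a whole thread.
    await asyncio.sleep(wait_seconds)
    return 'response %d' % request_id

async def serve(request_count):
    # Start every handler on the same thread; gather returns the results
    # in the order the handlers were passed in.
    handlers = [handle_request(i, 0.01) for i in range(request_count)]
    return await asyncio.gather(*handlers)

def run(request_count):
    # Create an event loop, run all the requests on it, then shut it down
    return asyncio.run(serve(request_count))
</pre>
<p>Because the waits overlap, running a hundred of these simulated requests takes about as long as running one, without a hundred threads or child processes.</p>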
<h3>Limitations</h3>
<p>Event loops are not all roses and butterflies, however. Looking specifically at Node.js, there are some thorns as well. The most glaring omission from Node.js is a multithreaded implementation. It seems like event loop techniques are uniquely suited to be made multithreaded. The intuitive thought is that since events are pretty independent of each other, it shouldn&#8217;t be difficult to parallelize. </p>
<p>Theoretically that&#8217;s true, but there are some technical reasons Node hasn&#8217;t become multithreaded, as well as an interesting argument I&#8217;ll explore in a second. Node is based on V8, developed by Google. V8 is a high-performance JavaScript engine, and it works remarkably well, but it was not designed to be multithreaded. JavaScript executes on a single thread, and this makes a lot of sense in the Chrome browser. Adding multithreading would be pretty tough; the architecture just wasn&#8217;t designed with it in mind. </p>
<h3>What does the future look like?</h3>
<p>So there&#8217;s also a really interesting argument against adding threading to Node. With the evolution of nginx as a reverse proxy server for HTTP, allowing for distribution of loads across many separate running instances, the authors of Node would tell you that the best way to multithread Node is to fork into separate processes. </p>
<p>At first glance this might seem like an argument constructed to explain away an implementation defect, but I think there&#8217;s a much more interesting story here. They really are advocating that a logical server should be a program that runs best on a single-core machine, with a small memory footprint, and well-defined limitations that allow for predictable performance. In contrast, Apache originally tried to manage concurrency and threading within the process, gobbling up all available resources. In Node, we don&#8217;t even bother, eschewing complexity in return for a really elegant, scalable implementation. </p>
<p>Event-driven architectures are a large step away from threads being the basic unit of concurrency, to a model where a CPU itself serves as the basic unit. Predictions of CPUs with hundreds of cores becoming commonplace keep rolling in, and tossing a bunch of cores onto a die is really one of the best ways to keep Moore&#8217;s law alive into the coming decade.</p>
<p>Interestingly enough, cloud-based computing platforms sell units of computation that match these requirements. It&#8217;s obvious that a single cloud instance is well suited to running a single Node.js server, and that scaling horizontally should involve spinning up identical servers with separate Node instances, and load balancing. </p>
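<p>The load balancing half of that picture can be very simple. As a rough sketch in Python, assuming a fixed list of identical backend instances (the function names and the addresses in the usage note are hypothetical), a round-robin balancer just hands each new request to the next instance in the list. In a real deployment a reverse proxy such as nginx does this job.</p>
<pre class="brush: python; title: ;">
import itertools

def make_balancer(backends):
    # Cycle through the backends forever, handing out one per request
    pool = itertools.cycle(backends)

    def pick_backend():
        return next(pool)

    return pick_backend
</pre>
<p>With two instances, <code>make_balancer(['127.0.0.1:3000', '127.0.0.1:3001'])</code> returns a function that alternates between the two addresses on every call. Adding capacity means adding another address to the list.</p>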
<p>Web workloads are really different than that of typical user applications. They require systems to support a high number of concurrent users with relatively simple CPU and memory requirements. They also are highly compartmentalized. We&#8217;re lucky that the web workload requirements fit so nicely with the capabilities of large distributed server farms. They are mostly composed of high numbers of completely orthogonal requests, with very loose concurrency models.</p>
<h4>A New Operating System?</h4>
<p>I think that the event-driven model is here to stay. It&#8217;s part of a larger trend to address the impedance mismatch between existing servers and the demand of the web workload, and to break apart large monolithic servers (Apache) into small scalable pieces (nginx, Node.js, and fastphp servers). The next big shift I think will be a recombination of these pieces. By breaking them apart we allow for an exploration of what works, by combining them together, along different boundaries, we can build high-performance servers that shed some of the overhead introduced by ill-suited legacy systems. </p>
<p>Diverging from the topic of event-driven servers, let&#8217;s take a look at what the future of servers that handle web workloads looks like. If we settle on a virtual CPU as the basic unit of concurrency, with a single Node.js instance running on each virtual server, I think the next obvious place for optimization is the operating system itself. It&#8217;s obvious that the threading and concurrency abstractions presented by modern operating systems are mismatched to the demands of the web workload, and tighter integration of the server with the kernel could yield even better performance by removing impedance caused by inefficiencies in the system call interfaces. </p>
<p>One could imagine a streamlined operating system that eliminates much of the overhead, and contains additional tweaks to file and network drivers to achieve even better performance. With paravirtualization we&#8217;re starting to see the emergence of a common hardware interface for the virtual kernels of the future that allows for much of the complexity of the operating system to be removed. This paves the way for development of microkernels, highly tuned to run a Node-like server. </p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/27/why-are-event-driven-servers-so-great/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Installing a WordPress LEMP Stack in Under an Hour</title>
		<link>http://andrewbrobinson.com/2012/01/26/installing-a-wordpress-lemp-stack-in-under-an-hour/</link>
		<comments>http://andrewbrobinson.com/2012/01/26/installing-a-wordpress-lemp-stack-in-under-an-hour/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 21:06:11 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=681</guid>
		<description><![CDATA[Good news, I&#8217;ve moved everything to a LEMP based stack! Previously this blog, as well as a number of small personal projects had been hosted using the downright terrible shared hosting service provided by HostGator. I&#8217;ve never been a fan of HostGator, they treat their customers poorly and I&#8217;ve had numerous support incidents while hosting [...]]]></description>
			<content:encoded><![CDATA[<p>Good news, I&#8217;ve moved everything to a LEMP based stack!</p>
<p>Previously this blog, as well as a number of small personal projects had been hosted using the downright terrible shared hosting service provided by HostGator. I&#8217;ve never been a fan of HostGator, they treat their customers poorly and I&#8217;ve had numerous support incidents while hosting my site with them. It&#8217;s been clear for a long time that I needed a change.</p>
<p>With everyone moving to new technology like <a href="http://nodejs.org">Node.js</a> and <a href="http://nginx.org/en/">Nginx</a> it only makes sense that I should move to a virtualized server. My next project will most likely involve some of these new technologies, and hosting non web-based services. I took some time and evaluated a lot of the VPS solutions out there currently and the word on the street is that Linode offers some of the best service, features, and reliability out there. Cheaper solutions exist, but for $20 a month you essentially get a complete self-managed server. </p>
<h3>The Scary World of Unmanaged Hosting</h3>
<p>The downside of a Linode instance is that the management of your site is suddenly in your own hands, and you are responsible for the security and maintenance of your server. This really isn&#8217;t a problem in modern-day systems: Linux is pretty secure out of the box, and servers are designed with a minimal footprint. </p>
<h3>Why Linode?</h3>
<p>Linode is not the cheapest VPS out there. You can find better deals, but what you won&#8217;t be able to find is a better community and collection of technical information. They have done a wonderful job providing ample documentation to get up and running, and cover all the common questions that one has when starting to host their own server in the real world.</p>
<p>The people behind Linode know their technology. Linode is built on top of <a href="http://xen.org">Xen</a>, the industry leading virtualization technology you&#8217;ll find powering many popular cloud services (EC2, Rackspace, etc) with a really nice web-based control panel for managing instances. </p>
<p>Additionally for development you&#8217;ll find a number of prebuilt <i>StackScripts</i>. These are really handy for quickly trying out a new technology. Allowing for user-submitted StackScripts makes it easy to try out new technologies if you have a spare node. I didn&#8217;t use a StackScript, only because installing the tools by hand generally gives you a pretty good feel for where they are located and how they are configured. </p>
<h3>Let&#8217;s Get Started!</h3>
<p>So let&#8217;s get started. I really didn&#8217;t do anything novel here, so I&#8217;ll be concise:</p>
<ul>
<li><b>Order a Linode</b> &#8211; I ordered the $20 a month Linode, it should work for now. I set it up with Ubuntu 11.10 out of the box. It was set up instantly and with a few quick clicks I had a server up and running with SSH access.</li>
<li><b>Install the Stack</b> &#8211; <a href="http://library.linode.com/lemp-guides/ubuntu-11.10-oneiric">This</a> wonderful tutorial documents the process really well. I installed nginx, PHP, and MySQL with apt-get; unless you&#8217;re planning on doing development work on these tools, I&#8217;m not sure what advantages there are to compiling from source.</li>
<li><b>Backup and Move WordPress</b> &#8211; Moving a large application like WordPress is surprisingly simple. I took the easy approach and exported a database copy in the form of an SQL script to recreate the database, and did a file-by-file copy of the WordPress site. I uploaded everything using SFTP (FTP is kind of old school; I wouldn&#8217;t recommend it) and quickly was up and running.</li>
<li><b>Update the DNS Records</b> &#8211; Linode offers a complete DNS solution, and provides nameservers for hosting domains. Surprisingly, by far the most time intensive part of this process was waiting for GoDaddy&#8217;s domain configuration page to load. I&#8217;m planning on moving my domain away from GoDaddy, and this has added some fuel to the fire.</li>
</ul>
<h3>And the results&#8230;</h3>
<p>I had always suspected that HostGator had oversold its service to an unacceptable degree. I often experienced phantom delays and timeouts, with their support team being unable to turn up any problems. After switching I&#8217;ve noticed that the response time for my site has improved significantly, and the variance has also fallen. This means that visitors will get a much better experience, which makes me happy.</p>
<p>I&#8217;ve already started looking at new projects now that I have a nice permanent piece of computation hooked up with a large pipe to the internet at large. I would recommend something like this to any developer who needs more capabilities than shared hosting can provide. While there&#8217;s a slight overhead required to manage and update services, I think the flexibility and performance you gain are totally worth the effort. </p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/26/installing-a-wordpress-lemp-stack-in-under-an-hour/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Machine Learning in Haskell &#8211; Linear Regression</title>
		<link>http://andrewbrobinson.com/2012/01/22/machine-learning-in-haskell-linear-regression/</link>
		<comments>http://andrewbrobinson.com/2012/01/22/machine-learning-in-haskell-linear-regression/#comments</comments>
		<pubDate>Sun, 22 Jan 2012 16:06:29 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[Haskell]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=624</guid>
		<description><![CDATA[One of the classic techniques in machine learning is linear regression. This approach models a function using a linear relationship between one or more input variables and the output set. It&#8217;s used in a ton of different situations, from building classifiers for large e-commerce sites to suggest new products, to fitting a line to a [...]]]></description>
			<content:encoded><![CDATA[<p>One of the classic techniques in machine learning is linear regression. This approach models a function using a linear relationship between one or more input variables and the output set. It&#8217;s used in a ton of different situations, from building classifiers for large e-commerce sites to suggest new products, to fitting a line to a data set in Excel.</p>
<p>Today we&#8217;ll implement this algorithm in Haskell as both an exercise in using the hmatrix library, and as a practical endeavor in using a functional language for a useful purpose. </p>
<p>Linear regression is implemented using linear algebra: given a training data set, we solve for the vector of weights that minimizes the squared error between the model&#8217;s predictions and the known outputs.</p>
<h3>Overview of Linear Regression</h3>
<p>We define a linear equation that accepts an arbitrary number of variables.</p>
<p style="text-align: center;">
<img src='http://s0.wp.com/latex.php?latex=%5Cmathbf%7By%7D+%3D+%5Cmathbf%7BX%7D%5Cmathbf%7Bw%7D+&#038;bg=fff&#038;fg=1c1c1c&#038;s=0' alt='&#92;mathbf{y} = &#92;mathbf{X}&#92;mathbf{w} ' title='&#92;mathbf{y} = &#92;mathbf{X}&#92;mathbf{w} ' class='latex' />
</p>
<p>We set up a matrix of X values with a column of ones prepended to allow for an offset term.</p>
<p style="text-align: center;">
<img src='http://s0.wp.com/latex.php?latex=%5Cmathbf%7BX%7D+%3D+%5Cbegin%7Bbmatrix%7D++1+%26+x_%7B1%2C1%7D+%26+%5Ccdots+%26+x_%7B1%2Cn%7D+%5C%5C++1+%26+%5Cvdots+%26+%5Cvdots+%26+%5Cvdots+%5C%5C++1+%26+x_%7Bm%2C1%7D+%26+%5Ccdots+%26+x_%7Bm%2Cn%7D++%5Cend%7Bbmatrix%7D+&#038;bg=fff&#038;fg=1c1c1c&#038;s=0' alt='&#92;mathbf{X} = &#92;begin{bmatrix}  1 &amp; x_{1,1} &amp; &#92;cdots &amp; x_{1,n} &#92;&#92;  1 &amp; &#92;vdots &amp; &#92;vdots &amp; &#92;vdots &#92;&#92;  1 &amp; x_{m,1} &amp; &#92;cdots &amp; x_{m,n}  &#92;end{bmatrix} ' title='&#92;mathbf{X} = &#92;begin{bmatrix}  1 &amp; x_{1,1} &amp; &#92;cdots &amp; x_{1,n} &#92;&#92;  1 &amp; &#92;vdots &amp; &#92;vdots &amp; &#92;vdots &#92;&#92;  1 &amp; x_{m,1} &amp; &#92;cdots &amp; x_{m,n}  &#92;end{bmatrix} ' class='latex' />
</p>
<p>Our matrix is set up with m rows of data, and n columns of independent variables (plus the leading column of ones). This represents a set of input data. </p>
<p style="text-align: center;">
<img src='http://s0.wp.com/latex.php?latex=%5Cmathbf%7Bw%7D+%3D+%5Cbegin%7Bbmatrix%7D++w_%7B0%7D+%5C%5C++%5Cvdots+%5C%5C++w_%7Bn%7D++%5Cend%7Bbmatrix%7D+&#038;bg=fff&#038;fg=1c1c1c&#038;s=0' alt='&#92;mathbf{w} = &#92;begin{bmatrix}  w_{0} &#92;&#92;  &#92;vdots &#92;&#92;  w_{n}  &#92;end{bmatrix} ' title='&#92;mathbf{w} = &#92;begin{bmatrix}  w_{0} &#92;&#92;  &#92;vdots &#92;&#92;  w_{n}  &#92;end{bmatrix} ' class='latex' />
</p>
<p>You&#8217;ll notice that when we multiply the data set X by w, we produce a column vector of output values, one y value for each row of independent variables. </p>
<p>This defines a set of equations, but we still have to determine the value of the weight matrix. To do this we take a set of data, called the training set, and plug it into the input data matrix, along with the output values. For each row we subtract the estimated value, obtained via the weight matrix, from the actual output value and square the difference. We then sum these squared errors, take the derivative of the sum with respect to the weights, set it equal to zero, and solve for the weight matrix. </p>
<p>All these steps are a little complex. If you&#8217;d like to learn more, the <a href="http://en.wikipedia.org/wiki/Linear_regression">Wikipedia article</a> on linear regression is pretty well done and gives a lot of information on the statistical theory behind regression. In the end you arrive at the following equation to determine the weight matrix:</p>
<p style="text-align: center;">
<img src='http://s0.wp.com/latex.php?latex=%5Cmathbf%7B%5Chat%7Bw%7D%7D+%3D+%28X%5ETX%29%5E%7B-1%7DX%5ETy+&#038;bg=fff&#038;fg=1c1c1c&#038;s=0' alt='&#92;mathbf{&#92;hat{w}} = (X^TX)^{-1}X^Ty ' title='&#92;mathbf{&#92;hat{w}} = (X^TX)^{-1}X^Ty ' class='latex' />
</p>
<h3>Implementation in Haskell</h3>
<p>Implementing these functions in Haskell is actually pretty simple. We make use of <a href="http://hackage.haskell.org/package/hmatrix-0.13.0.0">hmatrix</a>, an excellent matrix library for Haskell built on top of BLAS, LAPACK, and GSL. </p>
<p>Installing hmatrix in Mac OS is a little tricky. We have to install the GNU Scientific Library development libs first, but luckily MacPorts can help us out:</p>
<pre class="brush: plain; title: ;">
sudo port install gsl-devel
cabal install hmatrix
</pre>
<p>After that we&#8217;ll go ahead and load in some data. The authors of hmatrix provide a <a href="http://perception.inf.um.es/hmatrix/hmatrix.pdf">wonderful manual</a> to get up and running with the library and a number of functions have been built to allow us to quickly build matrices up.</p>
<p>In this case we&#8217;ll use the loadMatrix IO function, which will pull in some input data from files. We format our data files as space-separated rows of numbers:</p>
<h4>test.txt</h4>
<pre class="brush: plain; title: ;">
0 0.660691332817078
0.1 0.754551916894156
0.2 0.925818388603147
0.3 0.904216776317371
0.4 0.754324606651347
0.5 0.572540852930199
0.6 0.226045290129906
0.7 0.135596809334937
0.8 0.075248476906341
0.9 0.186604404237266
1 0.546356452677449
</pre>
<h4>train.txt</h4>
<pre class="brush: plain; title: ;">
0 0.465670943260193
0.1 0.799516151425082
0.2 0.894981363854821
0.3 0.836263760453476
0.4 0.749666819566768
0.5 0.491481010373484
0.6 0.138236347335107
0.7 0.101170288570109
0.8 0.170054617567581
0.9 0.319242745425697
1 0.455485771272382
</pre>
<p>The input parameter is on the left, with the function output on the right. We have both a test and training set of data so we can evaluate the performance of our regression algorithm. Loading these files in Haskell is a cinch. </p>
<pre class="brush: haskell; title: ;">
    dat1 &lt;-  loadMatrix &quot;train.txt&quot;
    dat2 &lt;- loadMatrix &quot;test.txt&quot;
    let [x_in, y_in] = toColumns dat1
    let [x_in_test, y_in_test] = toColumns dat2
</pre>
<p>Next we need to add some dimensionality to our input variable. A cool trick commonly used in linear regression is to allow fitting of higher-order polynomials by adding additional columns to our input variable data set, and computing higher order values of the input variable for them. In our case we&#8217;ll use a power series from 0 to 2, to allow for a quadratic fit:</p>
<pre class="brush: haskell; title: ;">
    let x = fromColumns $ map (x_in^) [0..2]
    let x_test = fromColumns $ map (x_in_test^) [0..2]
</pre>
<p>After performing this our X matrix will look something like this:</p>
<pre class="brush: plain; title: ;">
11x3
1.000  0.000  0.000
1.000  0.100  0.010
1.000  0.200  0.040
1.000  0.300  0.090
1.000  0.400  0.160
1.000  0.500  0.250
1.000  0.600  0.360
1.000  0.700  0.490
1.000  0.800  0.640
1.000  0.900  0.810
1.000  1.000  1.000
</pre>
<p>The first column is the input value raised to the 0th power, which gives us the ability to compute an offset in our weight matrix, the second column is the input value raised to the first power, and the third column is that value raised to the second power. </p>
<p>Next we&#8217;ll compute the weight matrix, using the equation defined previously. Using the reference sheet in hmatrix&#8217;s manual proved to be useful here to look up your basic matrix manipulation functions:</p>
<pre class="brush: haskell; title: ;">
    let w = (pinv $ (ctrans x) &lt;&gt; x) &lt;&gt; (ctrans x) &lt;&gt; y_in
</pre>
<p>Breaking this apart we use a couple functions from hmatrix.</p>
<ul>
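<li><b><code>pinv</code></b> &#8211; This generates the Moore-Penrose pseudo-inverse of a matrix for us. For an invertible matrix such as X^T X here, it is the same as the ordinary inverse. </li>
<li><b><code>ctrans</code></b> &#8211; The conjugate transpose of a matrix. For a real-valued matrix like ours this is just the transpose, which swaps the rows and columns.</li>
<li><b><code>&lt;&gt;</code></b> &#8211; The matrix product of two matrices (or of a matrix and a vector).</li>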
</ul>
<p>The function operates over the entire training set, and computes a weight matrix that minimizes the error given the dimensional constraints of the input matrix.</p>
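<p>For readers who want to see the same computation without hmatrix, here is a minimal sketch in plain Python. The helper names (<code>design_matrix</code>, <code>solve</code>, <code>fit</code>) are made up for this illustration, and instead of forming the inverse explicitly it solves the equivalent system (X^T X) w = X^T y by Gaussian elimination:</p>
<pre class="brush: python; title: ;">
def design_matrix(xs, degree):
    # Each row is [x^0, x^1, ..., x^degree]; the x^0 column supplies the offset term.
    return [[x ** p for p in range(degree + 1)] for x in xs]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain row-by-column matrix product.
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix [a | b].
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(xs, ys, degree):
    # Solve (X^T X) w = X^T y for the weight vector w.
    x = design_matrix(xs, degree)
    xt = transpose(x)
    xtx = matmul(xt, x)
    xty = [sum(p * q for p, q in zip(row, ys)) for row in xt]
    return solve(xtx, xty)

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
print(fit(xs, ys, 2))
</pre>
<p>The sample data is generated from y = 1 + 2x + 3x^2 with no noise, so the recovered weights should come out very close to 1, 2 and 3.</p>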
<p>Finally we can evaluate our function on both the test set and training set:</p>
<pre class="brush: haskell; title: ;">
    let train_y = x &lt;&gt; w
    let test_y = x_test &lt;&gt; w

    putStrLn &quot;Training set root-mean-square value:&quot;
    print $ sqrt . sum $ toList ((y_in - train_y) ^ 2)
    putStrLn &quot;Test set root-mean-square value:&quot;
    print $ sqrt . sum $ toList ((y_in_test - test_y) ^ 2)
</pre>
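<p>It&#8217;s super easy to use the weight matrix: multiplying an input set of x values against the weight matrix generates the output values. Note that the code above takes the square root of the summed squared errors without dividing by the number of samples, so the figure it prints is really the Euclidean norm of the residuals rather than a true root-mean-square. In our case we&#8217;ll get output that looks something like this:</p>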
<pre class="brush: plain; title: ;">
Training set root-mean-square value:
0.7082342551441069
Test set root-mean-square value:
0.7088158510569752
</pre>
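<p>Dividing by the sample count before taking the square root makes the error figure comparable across data sets of different sizes. A minimal Python sketch of both quantities follows; the function names are purely illustrative:</p>
<pre class="brush: python; title: ;">
import math

def root_sum_square(actual, predicted):
    # Euclidean norm of the residuals: sqrt(sum of squared errors).
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)))

def root_mean_square(actual, predicted):
    # Same quantity normalised by the number of samples.
    return root_sum_square(actual, predicted) / math.sqrt(len(actual))

actual = [0.0, 0.0, 0.0, 0.0]
predicted = [1.0, 1.0, 1.0, 1.0]
print(root_sum_square(actual, predicted))
print(root_mean_square(actual, predicted))
</pre>
<p>With four samples that are each off by exactly 1, the first value is 2 while the root-mean-square is 1, which is the typical per-sample error.</p>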
<h3>Evaluation</h3>
<p>The data set was generated from a sinusoidal function with a slight amount of Gaussian noise added, so a quadratic fit is not the best approximation. A 3rd order fit will generate a very nice root-mean-square value, and a <code>sin(x)</code> kernel function will generate an almost perfect fit. </p>
<p>The approach does generalize well though. hmatrix is a wonderful library, and allows you to perform a lot of statistical operations in Haskell for which you would typically reach for a program like MATLAB. </p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/22/machine-learning-in-haskell-linear-regression/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Mounting Linux Volumes with SFTP in Mac OS</title>
		<link>http://andrewbrobinson.com/2012/01/22/mounting-linux-volumes-with-sftp-in-mac-os/</link>
		<comments>http://andrewbrobinson.com/2012/01/22/mounting-linux-volumes-with-sftp-in-mac-os/#comments</comments>
		<pubDate>Sun, 22 Jan 2012 00:40:53 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Mac OS]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=626</guid>
		<description><![CDATA[Mounting Linux directories remotely using SFTP is incredibly handy. It allows you to easily move files back and forth between systems, and easily test and develop websites and other hosted services. It used to be super-easy to setup, MacFuse was the de facto way to install things, but has recently had some growing pains when [...]]]></description>
			<content:encoded><![CDATA[<p>Mounting Linux directories remotely using SFTP is incredibly handy. It allows you to easily move files back and forth between systems, and easily test and develop websites and other hosted services. It used to be super-easy to set up: <a href="http://code.google.com/p/macfuse/">MacFuse</a> was the de facto way to install things, but it has recently had some growing pains when it comes to being supported in Mac OS Lion. It is no longer actively maintained by Google and just doesn&#8217;t work very well. </p>
<p>There have been a number of groups that have picked up the gauntlet and continued support for MacFuse, but instructions for installation by the end user aren&#8217;t really clear. MacFuse is designed to be a module to build file systems on top of, so it&#8217;s generally not documented too well. </p>
<p>After a little bit of fumbling, I&#8217;ve found a procedure that works.</p>
<ul>
<li><b>Install everything with MacPorts</b> &#8211; You&#8217;ll need <a href="http://www.macports.org/">MacPorts</a>, the open-source package management system for Mac OS, installed for this to work.
<p>We&#8217;ll be installing <a href="http://fuse4x.org/">Fuse4X</a>, which is a continued effort to develop on the MacFuse codebase. </p>
<pre class="brush: plain; title: ;">
sudo port install fuse4x sshfs
</pre>
</li>
<li><b>Make a mount point and test it out</b> &#8211; Go ahead and create a directory in your home directory, and try to create an sshfs mount.
<pre class="brush: plain; title: ;">
mkdir ~/remote
sshfs andrew@someserver.com:/home/andrew/ ~/remote -oauto_cache,reconnect,defer_permissions,negative_vncache
cd ~/remote; ls
</pre>
<p>Substitute your own username and server hostname in the command above. If all goes well, the mount should simply complete. I recommend having SSH public key authentication turned on, or else you&#8217;ll have to enter a password every time you&#8217;d like to mount this volume.</p>
<p>You&#8217;ll be able to browse the directory in Finder just like a normal directory. However, it appears as a volume within a directory, and Finder masks the file name. No worries though, it&#8217;ll still work for applications that reference the path.</p>
</li>
<li><b>Make it happen every time you log in</b> &#8211; Finally we can use a login hook to make this happen every time. Toss the mount command in a shell script:
<pre class="brush: plain; title: ;">
#!/bin/bash
sshfs andrew@someserver.com:/home/andrew/ ~/remote -oauto_cache,reconnect,defer_permissions,negative_vncache
</pre>
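<p>Make the script executable with <code>chmod 755</code> (there is no need for 777, which would leave it writable by every user on the machine) and run the following command to add the script to your login hook:</p>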
<pre class="brush: plain; title: ;">
sudo defaults write com.apple.loginwindow LoginHook /path/to/script
</pre>
</li>
</ul>
<p>Now every time you log in, as long as you&#8217;re connected to the internet, you&#8217;ll have access to your files!</p>
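<p>If you&#8217;d rather not remount a volume that is already attached, the login script can check first. Here is a small Python sketch of that idea; <code>build_sshfs_command</code> and <code>mount_if_needed</code> are names invented for this example, and it assumes <code>sshfs</code> is on the PATH:</p>
<pre class="brush: python; title: ;">
import os
import subprocess

# Same mount options as the sshfs command shown earlier.
SSHFS_OPTIONS = 'auto_cache,reconnect,defer_permissions,negative_vncache'

def build_sshfs_command(remote, mount_point):
    # Returns the argument list for sshfs without running anything.
    return ['sshfs', remote, mount_point, '-o' + SSHFS_OPTIONS]

def mount_if_needed(remote, mount_point):
    # Only call sshfs when nothing is mounted at the target directory yet.
    mount_point = os.path.expanduser(mount_point)
    if os.path.ismount(mount_point):
        return False
    subprocess.check_call(build_sshfs_command(remote, mount_point))
    return True

print(build_sshfs_command('andrew@someserver.com:/home/andrew/', '/tmp/remote'))
</pre>
<p>Calling <code>mount_if_needed('andrew@someserver.com:/home/andrew/', '~/remote')</code> would then mount the volume only when it is not already there.</p>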
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/22/mounting-linux-volumes-with-sftp-in-mac-os/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Effectively using Socket.io</title>
		<link>http://andrewbrobinson.com/2012/01/18/effectively-using-socket-io/</link>
		<comments>http://andrewbrobinson.com/2012/01/18/effectively-using-socket-io/#comments</comments>
		<pubDate>Wed, 18 Jan 2012 07:38:18 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=608</guid>
		<description><![CDATA[Socket.io is one of the new technologies on the block when it comes to interactive realtime client-server messaging. It&#8217;s a library, that true to its name, tries to embody the traditional properties of sockets. It handles a lot of the low-level details required to establish fast bidirectional communication between a client and a server, and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://socket.io">Socket.io</a> is one of the new technologies on the block when it comes to interactive realtime client-server messaging. It&#8217;s a library that, true to its name, tries to embody the traditional properties of sockets. It handles a lot of the low-level details required to establish fast bidirectional communication between a client and a server, and makes implicit guarantees of reliable, sequential transmission in the bold statement found on their home page of &#8216;100% care-free realtime&#8217;.</p>
<p>Like most of these new-fangled web technologies the homepage is intentionally sparse, with a trendy &#8216;Fork me on GitHub&#8217; link in the top corner, a few code snippets showing how amazing it is, and a three question FAQ section. This is the brave new world of technologies. They move so fast there&#8217;s little time for the developer to spend properly documenting. The trend nowadays is to throw the entire source up on GitHub, along with a few examples, and toss it to the masses.</p>
<p>Unfortunately, this gives the developer little guidance on its proper usage or implementation. The main page gives a deceptively simple implementation, but to really effectively use this technology, you need to do a little bit more work. </p>
<h3>Do Not Trust Socket.io</h3>
<p>On foreign policy Ronald Reagan had a very concise policy, summed up by his famous quote, &#8220;Trust, but verify.&#8221; The same principle applies to socket.io. You must not, at any point in time, assume that socket.io will provide a reliable communication tunnel between your client and server. </p>
<p>This seems a little odd. After all, some of their extremely sparse documentation mentions how socket.io has great error handling facilities, and there are even mechanisms in place to ensure reliable transport. If socket.io has these built-in already, then why bother to reimplement them?</p>
<p>Well, socket.io doesn&#8217;t know anything about your application. You are building something with very specific requirements. It might need to be super-responsive, where a second long delay (or longer) could make or break the experience, or it might need to be super-dependable, with a requirement that every single message needs to be passed without fault, but speed isn&#8217;t quite so important. Your application&#8217;s transport requirements are unique, and there&#8217;s just no way that what&#8217;s built into socket.io can fit your needs appropriately.</p>
<p>To properly fulfill the unique requirements of your application, you&#8217;ll find yourself implementing algorithms and procedures to ensure receipt of packets, and to ensure the connection channel stays open. </p>
<p>In addition to meeting the unique requirements of your application, another reason to implement your own verification of delivery is that nowhere is it guaranteed that socket.io is implemented correctly, or will handle all the errors a websocket could face. By handling things at the application level, you gain a level of safety. It&#8217;s very probable that socket.io might not catch an edge condition, or will silently fail for a long time before delivering an error, at which point you might not be sure what messages have been successfully delivered. </p>
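<p>As a rough, language-neutral illustration of what application-level delivery checking can look like, here is a small Python sketch. Nothing in it is part of socket.io&#8217;s API: <code>ReliableSender</code> and <code>transport_send</code> are hypothetical names, and the transport is just any callable that accepts a message.</p>
<pre class="brush: python; title: ;">
class ReliableSender:
    def __init__(self, transport_send, timeout=2.0):
        self.transport_send = transport_send  # hypothetical: any callable taking a dict
        self.timeout = timeout                # seconds to wait for an acknowledgement
        self.next_seq = 0
        self.pending = {}                     # seq -> (message, time it was last sent)

    def send(self, payload, now):
        # Tag every message with a sequence number and remember it until acknowledged.
        message = {'seq': self.next_seq, 'payload': payload}
        self.pending[self.next_seq] = (message, now)
        self.next_seq += 1
        self.transport_send(message)
        return message['seq']

    def on_ack(self, seq):
        # The other side confirmed receipt, so the message can be forgotten.
        self.pending.pop(seq, None)

    def resend_overdue(self, now):
        # Re-send anything that has gone unacknowledged for at least the timeout.
        resent = []
        for seq, (message, sent_at) in sorted(self.pending.items()):
            if now - sent_at &gt;= self.timeout:
                self.pending[seq] = (message, now)
                self.transport_send(message)
                resent.append(seq)
        return resent
</pre>
<p>The receiving side would echo back each <code>seq</code> it has processed. Anything not acknowledged within the timeout is sent again, so the receiver should also be prepared to ignore duplicates.</p>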
<h3>Why use it at all then?</h3>
<p>If using socket.io requires so much work on the developer&#8217;s part, it might be tempting to ditch it altogether, and use traditional websockets, or some other technology. I wouldn&#8217;t recommend this at all. The folks behind the project have done a very fantastic job supporting a large number of browsers and transports. You will not be able to put together, in a reasonable amount of time, an implementation anywhere near as clean as these guys have. One area they have really nailed is the feature-detection and abstracting the interface above the specific transport, such that an app can use websockets, flash plugins, or long-polling pretty transparently depending on what the browser might support. It&#8217;s truly fantastic work. Additionally, handling many clients is done pretty elegantly, as is the entire API. </p>
<h3>Why do they bother in the first place?</h3>
<p>So, with this argument, one must wonder why the guys behind socket.io bother to implement any sort of guarantees at all. If all of this should be done at the application level, why bother with transport level implementations? I think there are a couple of reasons. </p>
<h4>Rapid Prototyping</h4>
<p>The most obvious is that by building these features into socket.io, it enables rapid development of prototype applications, and does a reasonably good job of delivering on the promises for a broad variety of use cases. For times when a quick prototype is needed, the developer isn&#8217;t often thinking about guaranteeing the delivery of messages; they are usually more focused on actually making the prototype work at all. Having these guarantees built in makes their life a little easier at this stage.</p>
<h4>Internal Book-keeping</h4>
<p>Another very useful reason for features specifically like heartbeats is that it helps socket.io maintain internal state. Taking a look at the Node.js implementation you&#8217;ll realize that socket.io does actually have to maintain a few bytes of state for each client. Because it&#8217;s transport agnostic, and some of the transports have no concept of a session, while others don&#8217;t really detect sudden disconnects, a heartbeat is a great way to check if a session is still active, and if it&#8217;s not, clean up associated state. You can use the disconnect events in a similar fashion if you wish, but don&#8217;t rely on them being triggered. </p>
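<p>The bookkeeping idea can be sketched in a few lines of Python. This is not socket.io&#8217;s actual implementation; <code>SessionTable</code> and its methods are hypothetical names used only to show how a heartbeat timestamp lets stale per-client state be cleaned up:</p>
<pre class="brush: python; title: ;">
class SessionTable:
    def __init__(self, max_silence=30.0):
        self.max_silence = max_silence  # seconds without a heartbeat before cleanup
        self.last_seen = {}             # client id -> time of the latest heartbeat

    def heartbeat(self, client_id, now):
        # Record that the client is still alive.
        self.last_seen[client_id] = now

    def expire(self, now):
        # Drop every client that has been silent for longer than max_silence.
        stale = [cid for cid, seen in self.last_seen.items()
                 if now - seen &gt; self.max_silence]
        for cid in stale:
            del self.last_seen[cid]
        return sorted(stale)
</pre>
<p>A server would call <code>heartbeat</code> whenever a client checks in and run <code>expire</code> periodically, releasing whatever state it keeps for the returned ids.</p>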
<h3>The End-to-End Argument</h3>
<p>Finally, I&#8217;ll finish off this post by mentioning that the point made here is a pretty direct application of the <a href="http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf">end-to-end argument</a>, originally made by Saltzer, Reed, and Clark in 1984 and one of the classic ideas in systems design. I&#8217;d recommend everyone take a read of this paper; the ideas presented will help guide you in implementing any system.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/18/effectively-using-socket-io/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Collection &#8211; Getting Data out of a Tektronix Oscilloscope</title>
		<link>http://andrewbrobinson.com/2012/01/16/data-collection-getting-data-out-of-devices/</link>
		<comments>http://andrewbrobinson.com/2012/01/16/data-collection-getting-data-out-of-devices/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 18:05:36 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Data Collection]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=565</guid>
		<description><![CDATA[Continuing from where we left off left time, because I cannot possibly cover the full range of equipment out there, I&#8217;ll be doing a case study on the MSO/DPO2000 Mixed Signal Oscilloscope from Tektronix, a wonderful scope, with a terrible interface. Unfortunately the set of people who design professional lab equipment and the set of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://andrewbrobinson.com/wp-content/uploads/2012/01/designers.jpg" alt="" title="designers" width="434" height="327" class="alignright size-full wp-image-557 comic" /><br />
<a href="http://andrewbrobinson.com/2012/01/13/collecting-analyzing-and-plotting-data-from-lab-bench-equipment/">Continuing from where we left off last time</a>, because I cannot possibly cover the full range of equipment out there, I&#8217;ll be doing a case study on the MSO/DPO2000 Mixed Signal Oscilloscope from Tektronix, a wonderful scope, with a terrible interface.</p>
<p>Unfortunately the set of people who design professional lab equipment and the set of people who know a thing or two about computer science are essentially disjoint. This means that there almost certainly won&#8217;t be an easy way to get data from a given piece of lab equipment with a web interface. </p>
<h3>Where does the plug go?</h3>
<p>The first challenge in getting data out of a device is going to be finding a physical entry port. Our scope was lucky enough to have an Ethernet port, one that actually connects to an IP network no less. This is particularly convenient. I would recommend that if your instrument has an Ethernet port that you both connect it to the network, and understand how to push commands and pull data from it. Ethernet is truly a universal interface, and worth the effort.</p>
<p>If your device doesn&#8217;t have an Ethernet port, there are other options. A lot of devices have a flash drive, which is great if you&#8217;re the type of person who enjoys waiting and repeatedly replugging small USB dongles. The GPIB interface is present on a lot of equipment, but it&#8217;s old and boring, implemented as an 8-bit parallel bus, and requires expensive interfaces to use with a PC. Serial ports are great, but force you to sit next to the piece of equipment and are generally slower than Ethernet.</p>
<h3>How do I talk to this thing?</h3>
<p>The next challenge is how to successfully talk to this device. The naive approach at this point in time is to consult the device-maker&#8217;s website and look for some surely available documentation that clearly spells out how to use the well thought out interface. This is a sure way to waste time; it is doubtful that the device-maker will be of any help. All documentation will be poorly organized, and chapters on advanced techniques such as communication will be incomplete or left out entirely. </p>
<p>Luckily, since we&#8217;re using Ethernet we have the narrow waist of the internet to help us along. Almost all devices with Ethernet ports support a web interface over HTTP as the primary means of interaction. Typing the Tektronix scope&#8217;s IP address into your browser will direct you to&#8230; a nice blank white page. The web interface is incompatible with Google Chrome. Opening it in Firefox, you&#8217;ll find a rather primitive interface available, with plenty of malformed HTML. Let&#8217;s thumb over to the data tab and take a look.</p>
<p><img src="http://andrewbrobinson.com/wp-content/uploads/2012/01/Screen-Shot-2012-01-12-at-1.33.20-PM.png" alt="" title="Screen Shot 2012-01-12 at 1.33.20 PM" width="600" class="aligncenter size-full wp-image-572" /></p>
<p>This looks entirely intuitive! Prodding a little further we notice that we actually can get data from the instrument, and we do this using the &#8216;waveform transform from the instrument&#8217; section. This looks like a simple POST request, and does indeed return a set of comma separated values. Things are looking up for a split second! Taking a peek at the HTML code ends our momentary celebration. The Tektronix interface is actually a pile of iframes, and we notice some really weird JavaScript action going on: </p>
<pre class="brush: jscript; title: ;">
// read the value of the selection made in the &quot;command&quot; listbox
// check for a selection beginning with the letter D
// if found set the &quot;command1&quot; listbox to the only applicable choice
function onSetSource(form, text, index) {
    try {
        //check to see if the 15th letter is a 'd' indicating digital
        //if so, INTERNAL is not an applicable choice so remove it
        if (&quot;d&quot; == text.substring(15,16)) {
            for (var i=0; i&lt;form.command1.options.length; i++) {
                if (form.command1.options[i].text == &quot;INTERNAL&quot;) {
                    form.command1.options[i] = null;
                }
            }
        }
        else {
            //if the listbox was previously shortened by the 'digital' selection
            //add the INTERNAL option back into the listbox
            if ((form.command1.options.length == 1) &amp;&amp;
                (form.command1.options[0].text != &quot;INTERNAL&quot;)) {
                var oOption = document.createElement(&quot;option&quot;);
                oOption.text = &quot;INTERNAL&quot;;
                oOption.value = &quot;save:waveform:fileformat internal&quot;;
                form.command1.options.add(oOption);
            }
        }
        SetFileExt(form,
        form.command1.options[form.command1.selectedIndex].text);
        form.WFMFILENAME.value = form.command.options[index].text;
    }

    catch (e) {
        alert('An unknown error has occurred. Please try again.');
    }
}
</pre>
<p>That&#8217;s a little odd. It&#8217;s called every time the source combo box is changed, and it looks like we&#8217;re building a request in memory using JavaScript. That&#8217;s a little untraditional, but let&#8217;s press on. The simplest way to reverse engineer this is to simply duplicate a request, so let&#8217;s capture what the inside of the POST request looks like:</p>
<pre class="brush: plain; title: ;">
Content-Type: text/plain
Content-Length: 202

WFMFILENAME=CH1
WFMFILEEXT=csv
command=select:control ch1
command1=save:waveform:fileformat spreadsheet
command2=:data:resolution reduced;:save:waveform:spreadsheet:resolution reduced
wfmsend=Get
</pre>
<p>Great, a POST request, but not quite. Traditionally POST requests are a series of name value pairs, separated by ampersands. It seems like the engineers at Tektronix knew a thing or two more than the W3C and went ahead and improvised their own POST request formatting scheme.</p>
<p>Anyway, it&#8217;s not standards-compliant, but we can work around it. With a little effort I came up with some code in Python that seems to work:</p>
<pre class="brush: python; title: ;">
import urllib2  # Python 2 standard library

def getData(channel, ip):
	url = 'http://' + ip + '/data/Tek_CH' + str(channel) + '_Wfm.csv'

	# The format of this payload comes from a Wireshark capture from the scope.
	# It's not a standard POST request.
	data = &quot;&quot;&quot;WFMFILENAME=CH&quot;&quot;&quot; + str(channel) + &quot;&quot;&quot;\r
WFMFILEEXT=csv\r
command=select:control ch&quot;&quot;&quot; + str(channel) + &quot;&quot;&quot;\r
command1=save:waveform:fileformat spreadsheet\r
command2=:data:resolution full;:save:waveform:spreadsheet:resolution full\r
wfmsend=Get\r\n&quot;&quot;&quot;

	req = urllib2.Request(url, data)
	response = urllib2.urlopen(req)
	rawData = response.read()

	# Split the CSV text into rows of fields, then drop the first 16 lines and
	# the last 3, which are not sample rows in this scope's output.
	lines = map(lambda x: x.strip().split(','), rawData.split('\n'))
	return lines[16:-3]
</pre>
<p>Well, alright, that&#8217;s not too bad. We have to specify the channel we want data from in three separate locations for some odd reason, but it does work. I&#8217;ve found that if you don&#8217;t, you&#8217;ll get inconsistent data out of the scope. We also repeat twice in the <code>command2</code> line that we want full resolution. I&#8217;m not sure why the scope requires so much duplication, but since we&#8217;ve wrapped it in Python it&#8217;s reasonably modular and after perfecting it one can forget promptly about how much of a kludge it is. </p>
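<p>One way to keep the duplication in a single place is to build the request body from a list of fields. The sketch below is written for Python 3, and <code>build_payload</code> is a made-up helper name rather than anything from the scope or from the script linked later in this post:</p>
<pre class="brush: python; title: ;">
def build_payload(channel, resolution='full'):
    # One name=value pair per line, each terminated by CRLF, as in the capture.
    fields = [
        ('WFMFILENAME', 'CH%d' % channel),
        ('WFMFILEEXT', 'csv'),
        ('command', 'select:control ch%d' % channel),
        ('command1', 'save:waveform:fileformat spreadsheet'),
        ('command2', ':data:resolution %s;:save:waveform:spreadsheet:resolution %s'
                     % (resolution, resolution)),
        ('wfmsend', 'Get'),
    ]
    return ''.join('%s=%s\r\n' % field for field in fields)

payload = build_payload(1, 'reduced')
print(len(payload))
</pre>
<p>For channel 1 at reduced resolution the body comes out at 202 bytes, which matches the Content-Length in the captured request. In Python 3 the string would need to be encoded to bytes (for example with <code>.encode('ascii')</code>) before being handed to <code>urllib.request</code>.</p>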
<h3>Putting it all together</h3>
<p>Finally I put this all together in a simple script that takes some data from the user. You can find it on GitHub <a href="https://github.com/ab500/labTools/blob/master/collectScopeData.py">here</a>.</p>
<h3>Next Time</h3>
<p>We&#8217;ve gotten some data from our scope now, and it&#8217;s not pretty, or plotted, or really much of anything, but it&#8217;ll do. Next time I&#8217;ll follow up with some processing techniques and workflow management that make it possible to take this data and prepare it for plotting. </p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/16/data-collection-getting-data-out-of-devices/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using SQLite in Python &#8211; Building FlashCarder</title>
		<link>http://andrewbrobinson.com/2012/01/15/using-sqlite-in-python-building-flashcarder/</link>
		<comments>http://andrewbrobinson.com/2012/01/15/using-sqlite-in-python-building-flashcarder/#comments</comments>
		<pubDate>Sun, 15 Jan 2012 04:02:00 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=528</guid>
		<description><![CDATA[There&#8217;s a grey area in data storage when writing small applications where you find yourself stuck between the limitations of flat file storage, and the complications imposed by a fully featured database engine. When writing small scripts and utilities it sometimes becomes necessary to store persistent data, and support advanced querying methods, but running a [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s a grey area in data storage when writing small applications where you find yourself stuck between the limitations of flat file storage, and the complications imposed by a fully featured database engine. When writing small scripts and utilities it sometimes becomes necessary to store persistent data, and support advanced querying methods, but running a database server is simply overkill. </p>
<p>For situations like these a file-based database such as <a href="http://sqlite.org/">SQLite</a> seems like an appropriate fit. While flat file storage might work, by using a prebuilt relational engine you&#8217;ll gain a much more refined data access model, including support for transactions, concurrency, and built-in support for relational querying.</p>
<p>I&#8217;ve set out to build an application from ground up, integrating some of the best practices available to support well-built, but minimalistic data access. First, let&#8217;s briefly talk about the application we&#8217;ll be building today.</p>
<h3>FlashCarder &#8211; For people who can&#8217;t remember stuff</h3>
<p>As long as I can remember, I&#8217;ve had a bad memory (heh). FlashCarder is a small application I wrote to help me. It basically amounts to a flash card system usable from the terminal. You run the application and create a couple of data sets, each consisting of a list of questions and a list of tuples holding the answers to those questions. The application then picks a tuple at random, reads one of its answers back to you, and asks you the remaining questions. For example, we could create a set called &#8220;Inventions&#8221; with the following questions:</p>
<ul>
<li>What is the name of the inventor?</li>
<li>What is the invention?</li>
</ul>
<p>And the following answer tuples:</p>
<ul>
<li>(Benjamin Franklin, Lightning Rod)</li>
<li>(Nikola Tesla, AC Generators)</li>
<li>(Thomas Edison, Smear Campaigns)</li>
</ul>
<p>Running the program and selecting this data set would yield a session like this:</p>
<pre class="brush: plain; title: ;">
The name of the inventor is Nikola Tesla.

What is the invention?
(user presses Enter)
The invention is AC Generators.
</pre>
<p>Pretty simple stuff, but actually really useful in practice, and just technically complex enough to serve as an interesting example.</p>
<p>The full source to this application is available on GitHub <a href="https://github.com/ab500/flashCarder">here</a>! Since I use it day to day, I&#8217;ll continue to update it as I add features.</p>
<h3>Getting Started</h3>
<p>Rather than try to describe what SQLite is, I&#8217;ll copy what the <a href="http://www.sqlite.org/index.html">SQLite Homepage</a> has to say:</p>
<blockquote><p>
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
</p></blockquote>
<p>You&#8217;ll find SQLite in a lot of odd places. A ton of software out there uses it as a persistent data store, and they do this because it&#8217;s well-written, has a clean interface, and is in the public domain. It&#8217;s super useful.</p>
<p>I&#8217;ll be using the standard version of Python 2.7 that comes with Mac OS. Since Python 2.5 a version of the sqlite3 library has come standard with Python, so there&#8217;s no need to install anything. You&#8217;re literally just an import statement away from using it.</p>
<p>There&#8217;s a lot of good documentation out there. Here&#8217;s a short list of reference material that should help you out a lot while working to understand the DB-API 2.0 interface, SQLite, and how everything fits together:</p>
<ul>
<li><a href="http://www.python.org/dev/peps/pep-0249">PEP 249 &#8211; The DB-API2.0 Specification</a></li>
<li><a href="http://docs.python.org/library/sqlite3.html">SQLite3 Library Documentation</a></li>
<li><a href="http://code.google.com/p/pysqlite/source/browse/#hg%2Fsrc">SQLite3 Source Repository</a></li>
<li><a href="http://www.sqlite.org/fileformat2.html">The SQLite File Format</a> &#8211; I highly recommend taking a look through this one! SQLite is really well designed, and compact enough to the point where it&#8217;s pretty easy to wrap your head around.</li>
</ul>
<p>I&#8217;d recommend checking out all of these links, not only for the purposes of understanding SQLite, but as good examples of a very well done design. I&#8217;m going to assume a basic familiarity with SQL syntax, and really focus on some techniques that make working with SQLite in Python a lot easier. There are a lot of good tutorials on the basics out there, but once you know how to query a database there&#8217;s not a lot of material to guide you in building a robust database application.</p>
<h3>Data Design and Creating Tables Automatically</h3>
<p>Let&#8217;s take a look at how we should build this application. Since our target is rather small applications, we really need to think about how the database will be created in the first place. In a large scale application it&#8217;s typical to put up with long installation procedures and other riffraff to get up and running, but I don&#8217;t think it&#8217;s appropriate in this situation. It&#8217;s completely likely that someone will simply copy our script to a new directory, or delete the database file, and as such we need to make it a little more robust.</p>
<p>To this end I&#8217;ll advocate having your application check the database every time it starts up, and build any missing tables from scratch. I encapsulate all data access in a class, and the initialization member looks something like this:</p>
<pre class="brush: python; title: ;">
import sqlite3

class DataInterface(object):
	def __init__(self):
		self._conn = sqlite3.connect(&quot;faceMaker.db&quot;)

		# This allows us to access columns by their name
		self._conn.row_factory = sqlite3.Row
		# Return TEXT values as plain byte strings (str), no need for unicode here
		self._conn.text_factory = str
		try:
			self.initDatabase()
		except Exception:
			print 'Something has gone wrong initializing the database'
			quit()

	def initDatabase(self):
		c = self._conn.cursor()
		with open('database.sql', 'r') as f:
			buildUpQuery = f.read()
		# Create tables if they don't already exist
		c.executescript(buildUpQuery)
</pre>
<p>We do a couple of things when this class is first instantiated. First we set some parameters on the connection to make life a little easier. I highly recommend setting the row_factory as I have. This gives you the ability to access row data via the column names, rather than having to use indexes that really don&#8217;t have a firm correlation to the data. The sqlite3 module will automatically create the database file if it doesn&#8217;t exist. </p>
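<p>To make the row_factory point concrete, here&#8217;s a minimal sketch. It uses an in-memory database and a cut-down copy of the sets table purely for illustration, so nothing here touches the real application file:</p>
<pre class="brush: python; title: ;">
import sqlite3

# In-memory database, so this sketch leaves no file behind
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row

conn.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')
conn.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))

row = conn.execute('SELECT id, name FROM sets').fetchone()

# The same value is reachable by column name or by position
name_by_column = row['name']
name_by_index = row[1]
column_names = row.keys()
</pre>
<p>Reading row['name'] keeps working if the column order in the SELECT changes later, which is exactly where positional indexes tend to bite.</p>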
<p>After setting up the connection, we load up an external file and execute it. This file contains a couple SQL statements to build up our database:</p>
<pre class="brush: sql; title: ;">
PRAGMA foreign_keys = ON;

-- While prototyping, uncomment these statements to drop
-- the tables if they already exist.

--DROP TABLE IF EXISTS set_member_questions;
--DROP TABLE IF EXISTS set_questions;
--DROP TABLE IF EXISTS set_members;
--DROP TABLE IF EXISTS sets;

CREATE TABLE IF NOT EXISTS sets(
	id INTEGER PRIMARY KEY,
	name CHAR(255));

CREATE TABLE IF NOT EXISTS set_questions(
	id INTEGER PRIMARY KEY,
	set_id INTEGER NOT NULL,
	question_phrase TEXT,
	FOREIGN KEY(set_id) REFERENCES sets(id));

CREATE TABLE IF NOT EXISTS set_members(
	id INTEGER PRIMARY KEY,
	set_id INTEGER NOT NULL,
	FOREIGN KEY(set_id) REFERENCES sets(id));

CREATE TABLE IF NOT EXISTS set_member_questions(
	id INTEGER PRIMARY KEY,
	set_id INTEGER NOT NULL,
	question_id INTEGER NOT NULL,
	member_id INTEGER NOT NULL,
	answer TEXT,
	FOREIGN KEY(member_id) REFERENCES set_members(id),
	FOREIGN KEY(question_id) REFERENCES set_questions(id),
	FOREIGN KEY(set_id) REFERENCES sets(id));
</pre>
<p>The first line of this file enables foreign keys for the SQLite engine. While in an ideal world we program carefully enough to avoid foreign key violations, reality isn&#8217;t always so convenient. We also create a few tables. Here&#8217;s a super-brief description of each table.</p>
<ul>
<li><b>sets</b> &#8211; Contains the name of each set and a unique identifier. From the example I gave in the introduction, an entry in this table would be &#8220;Inventions&#8221;.</li>
<li><b>set_questions</b> &#8211; Contains the questions required for each member of the set. We store the questions in a statement form, so simple string manipulation can produce decent questions. For example, for the question &#8220;What is the invention?&#8221; we would only store the phrase &#8220;the invention&#8221;, so we can easily concatenate it with other phrases to form questions (What is the invention?), statements (The invention), and answers (The invention is a light bulb).</li>
<li><b>set_members</b> &#8211; A joining table that gives each answer tuple a unique identifier and allows joins from set_questions to set_member_questions.</li>
<li><b>set_member_questions</b> &#8211; This table contains one entry per answer to each of the questions, for every member added to the set. For our example (Benjamin Franklin, Lightning Rod) there would be an entry for Benjamin Franklin, linking this answer to its question_id and to the set_member for this answer tuple, and a similar entry for Lightning Rod.</li>
</ul>
<p>The DROP statements exist for debugging purposes when tweaking the structure of the tables. I&#8217;ve commented them out for production use. </p>
<p>Keeping the SQL database creation script in a separate file is an interesting design decision. I&#8217;d argue for integrating it into a Python file for a production deployment, but for development work this works out to be a really clean way to do things. It&#8217;s handy to keep the database schema open in a text editor while writing the Python code to talk to the database. </p>
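<p>If you&#8217;re curious what the foreign key pragma actually buys you, here&#8217;s a small sketch. It uses an in-memory database with just two of the tables, and assumes the SQLite library underneath is new enough to enforce foreign keys (3.6.19 or later):</p>
<pre class="brush: python; title: ;">
import sqlite3

conn = sqlite3.connect(':memory:')
# Enforcement is off by default and has to be enabled per connection
conn.execute('PRAGMA foreign_keys = ON')
conn.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')
conn.execute('''CREATE TABLE set_members(
	id INTEGER PRIMARY KEY,
	set_id INTEGER NOT NULL,
	FOREIGN KEY(set_id) REFERENCES sets(id))''')

conn.execute('INSERT INTO sets(id, name) VALUES (1, ?)', ('Inventions',))
conn.execute('INSERT INTO set_members(set_id) VALUES (1)')  # parent row exists

try:
	# There is no set with id 99, so the engine refuses this row
	conn.execute('INSERT INTO set_members(set_id) VALUES (99)')
	rejected = False
except sqlite3.IntegrityError:
	rejected = True

member_count = conn.execute('SELECT COUNT(*) FROM set_members').fetchall()[0][0]
</pre>
<p>Without the pragma the second insert would go through silently and leave an orphaned member behind.</p>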
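<p>For the production-style alternative, one possible sketch is to hold the schema in a module-level string and feed it to executescript. The SCHEMA constant and the init_database helper are names I made up for this example, and only two of the tables are shown:</p>
<pre class="brush: python; title: ;">
import sqlite3

# Hypothetical constant holding a shortened copy of the schema
SCHEMA = '''
CREATE TABLE IF NOT EXISTS sets(
	id INTEGER PRIMARY KEY,
	name CHAR(255));

CREATE TABLE IF NOT EXISTS set_questions(
	id INTEGER PRIMARY KEY,
	set_id INTEGER NOT NULL,
	question_phrase TEXT,
	FOREIGN KEY(set_id) REFERENCES sets(id));
'''

def init_database(conn):
	# Safe to call on every start-up thanks to IF NOT EXISTS
	conn.executescript(SCHEMA)

conn = sqlite3.connect(':memory:')
init_database(conn)
init_database(conn)  # a second run changes nothing

tables = [r[0] for r in conn.execute(
	'SELECT name FROM sqlite_master WHERE type = ? ORDER BY name', ('table',))]
</pre>
<p>The trade-off is the one described above: one less file to ship, at the cost of not having the schema open on its own in an editor.</p>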
<p>Spending some time working on the layout of the tables is never a bad thing. I spent a long time working out the details of the schema, and what the class interfaces would look like, and as a result the implementation turned out really clean.</p>
<h3>Querying Techniques &#038; Fun Language Features</h3>
<p>With a database in existence now I&#8217;d like to focus on constructing queries. The DB-API 2.0 interface provides a mechanism to write queries that&#8217;s reasonably expressive. It&#8217;s very similar to the formatting approach taken when building strings with the &#8216;%&#8217; operator. We create a string with placeholders and pass a tuple of values to the module, which binds them into the query for us.</p>
<pre class="brush: python; title: ;">
cursor.execute(&quot;&quot;&quot;INSERT INTO set_member_questions
	(set_id, member_id, question_id, answer)
	VALUES (?, ?, ?, ?)&quot;&quot;&quot;,
	(self._sid, newMemberId, key, value))
</pre>
<p>In an interesting design decision, the DB-API 2.0 doesn&#8217;t actually specify what character should be used as the placeholder. In SQLite it&#8217;s a question mark, but you&#8217;ll see things like &#8216;%s&#8217; in MySQLdb. This is something to keep in mind when writing queries. If you&#8217;ve gotten used to a different DB-API module, it&#8217;s likely you&#8217;ll find minor differences in implementation like this quite frequently. The DB-API interface runs pretty close to the database engine, and inherits a lot of the nuances as a result. </p>
<p>Remember that in Python, to create a tuple with a single value, you must end it with a trailing comma. Failure to do so will cause exceptions when passing data into the execute function:</p>
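<p>Each DB-API module advertises its placeholder style through a paramstyle attribute, so you can check rather than guess. A quick sketch against an in-memory database, with made-up set names:</p>
<pre class="brush: python; title: ;">
import sqlite3

# The DB-API asks every module to declare its placeholder style here
style = sqlite3.paramstyle

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')

# Positional placeholders take a sequence of values...
conn.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))
# ...and sqlite3 also accepts named placeholders fed from a dictionary
conn.execute('INSERT INTO sets(name) VALUES (:name)', {'name': 'Capitals'})

names = [r[0] for r in conn.execute('SELECT name FROM sets ORDER BY id')]
</pre>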
<pre class="brush: python; title: ;">
#This is correct and creates a tuple
(&quot;someDataHere&quot;,)
#This... doesn't.
(&quot;someDataHere&quot;)
</pre>
<p>At this point I have to mention the customary warning when it comes to building query strings: if at all possible avoid manually constructing them. Escaping SQL queries can be non-intuitive at times, as proven over and over again. While the type of applications I&#8217;m targeting don&#8217;t require a lot of security, you should always write robust code and use best practices. There&#8217;s a reason why the execute command provides a facility to accept parameters. </p>
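<p>Here&#8217;s a small sketch of why the parameter facility matters. The hostile value below is invented for the example; because it is bound as a parameter it is stored as plain data, and the table it tries to drop survives:</p>
<pre class="brush: python; title: ;">
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')

# A made-up name that would break out of a hand-built string literal
hostile = 'x\'); DROP TABLE sets; --'

# Bound as a parameter, it is never parsed as SQL
conn.execute('INSERT INTO sets(name) VALUES (?)', (hostile,))

stored = conn.execute('SELECT name FROM sets').fetchall()[0][0]
table_count = conn.execute(
	'SELECT COUNT(*) FROM sqlite_master WHERE name = ?',
	('sets',)).fetchall()[0][0]
</pre>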
<h4>Handling Transactions</h4>
<p>One thing that I found rather sparsely documented was how transaction support was implemented in the sqlite3 module. It&#8217;s important to pay attention to transactions, and ensure that when you&#8217;re updating the database checks are in place to ensure data integrity if the application is interrupted during a write. While the foreign key constraints, and SQLite&#8217;s built in atomic operations, will prevent a good deal of abuse, there are scenarios where multi-row updates need to happen atomically.</p>
<p>By default the sqlite3 module handles transactions transparently. It opens one behind the scenes before any statement that modifies data (INSERT, UPDATE, DELETE, or REPLACE), and in Python 2.7 it commits implicitly before any other non-query statement, such as CREATE TABLE. Nothing is committed for you when the connection is closed, though, so uncommitted changes are lost at that point. If you desire immediate commits, setting the isolation_level property of the connection to None puts it in autocommit mode, and other values give you greater control over the transaction model. Assuming we keep the default, there are two ways to execute a transaction really cleanly. </p>
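<p>To see the implicit transaction at work, here&#8217;s a rough sketch using a throwaway database file and three connections. The file location and the variable names are just for this example. A second connection can&#8217;t see the writer&#8217;s row until commit is called, while a connection opened with isolation_level=None commits each statement immediately:</p>
<pre class="brush: python; title: ;">
import os
import sqlite3
import tempfile

# Throwaway file; an in-memory database can't be shared between connections
path = os.path.join(tempfile.mkdtemp(), 'demo.db')
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')
writer.commit()

def count_sets():
	return reader.execute('SELECT COUNT(*) FROM sets').fetchall()[0][0]

# The module opens a transaction behind the scenes before this INSERT
writer.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))
before_commit = count_sets()

writer.commit()
after_commit = count_sets()

# isolation_level=None switches a connection to autocommit mode
auto = sqlite3.connect(path, isolation_level=None)
auto.execute('INSERT INTO sets(name) VALUES (?)', ('Capitals',))
after_autocommit = count_sets()
</pre>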
<h4>Data Cursor Handling and Explicit Commits</h4>
<p>This is generally the old-school way to manage things. Grabbing a data cursor from the connection object, executing a couple of statements, and explicitly calling a commit. Something like this:</p>
<pre class="brush: python; title: ;">
c = self._conn.cursor()
c.execute('DELETE FROM set_members WHERE set_id = ?'
	, (setO._sid,))
c.execute('DELETE FROM set_questions WHERE set_id = ?'
	, (setO._sid, ))
c.execute('DELETE FROM sets WHERE id = ?', (setO._sid, ))
self._conn.commit()
</pre>
<p>Well, it certainly works, but it doesn&#8217;t have that new shiny feel us programmers enjoy so much, and it kind of sucks at exception handling. Without a proper exception handler and logic to call the rollback() function to discard changes, an unexpected exception upon executing one of those statements will shift the code execution path to god knows where. Since <a href="http://www.python.org/dev/peps/pep-0343/">PEP 343</a> and Python 2.6 we&#8217;ve had a nifty new language feature, the with block, that allows us to automatically manage context and handle this for us:</p>
<pre class="brush: python; title: ;">
with self._conn as c:
	c.execute('DELETE FROM set_members WHERE set_id = ?'
		, (setO._sid,))
	c.execute('DELETE FROM set_questions WHERE set_id = ?'
		, (setO._sid, ))
	c.execute('DELETE FROM sets WHERE id = ?', (setO._sid, ))
</pre>
<p>Wow! That&#8217;s nifty. The with block is really useful for handling resource management: everything inside the block is committed at the end. For those interested, you can implement context managers in your own classes, and the PEP covers implementation nicely. Note that this really isn&#8217;t a replacement for error handling. You still must use try/except blocks to handle errors, since the exception is re-raised after the rollback. All the with block does is guarantee that the database transaction will be rolled back if an exception does occur, and committed otherwise. </p>
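<p>Here&#8217;s a small sketch of that rollback guarantee. The UNIQUE constraint on name isn&#8217;t part of the FlashCarder schema; I&#8217;ve added it here only as an easy way to force a failure halfway through a block:</p>
<pre class="brush: python; title: ;">
import sqlite3

conn = sqlite3.connect(':memory:')
# UNIQUE is an assumption for this demo, used to provoke an error
conn.execute(
	'CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255) UNIQUE)')

with conn:
	conn.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))

try:
	with conn:
		conn.execute('INSERT INTO sets(name) VALUES (?)', ('Capitals',))
		# Duplicate name: this raises, and the whole block is rolled back
		conn.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))
except sqlite3.IntegrityError:
	failed = True
else:
	failed = False

names = [r[0] for r in conn.execute('SELECT name FROM sets ORDER BY id')]
</pre>
<p>The &#8220;Capitals&#8221; row never makes it into the table, even though its own INSERT succeeded, and the exception still reaches our except clause.</p>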
<h4>Connections and Concurrency</h4>
<p>Python has never been strong at concurrency, and the sqlite3 module is no exception. It flat-out does not support concurrent use of a single connection object; by default a connection refuses to be used from any thread other than the one that created it. The recommended solution is to either wrap all your database calls in locking mechanisms, ensuring that only one operation is occurring at once, or open separate connections. SQLite supports file locking out of the box, so if you create multiple connections to the database file, to the SQLite engine it&#8217;ll appear that multiple separate applications are accessing the database, and it will handle concurrency as such. It&#8217;s not the most elegant solution out there, but it&#8217;ll work!</p>
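<p>A quick sketch of both halves of that advice, using a throwaway database file and two hypothetical worker functions. The threads run one after the other to keep the outcome predictable. The first worker tries to reuse the main thread&#8217;s connection, and the second opens its own:</p>
<pre class="brush: python; title: ;">
import os
import sqlite3
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), 'demo.db')
main_conn = sqlite3.connect(path)
main_conn.execute('CREATE TABLE sets(id INTEGER PRIMARY KEY, name CHAR(255))')
main_conn.commit()

results = {}

def reuse_main_connection():
	# With default settings a connection is tied to its creating thread
	try:
		main_conn.execute('SELECT COUNT(*) FROM sets')
		results['shared'] = 'ok'
	except sqlite3.ProgrammingError:
		results['shared'] = 'refused'

def open_own_connection():
	# A fresh connection per thread looks like a separate application
	conn = sqlite3.connect(path)
	conn.execute('INSERT INTO sets(name) VALUES (?)', ('Inventions',))
	conn.commit()
	conn.close()
	results['separate'] = 'ok'

for target in (reuse_main_connection, open_own_connection):
	worker = threading.Thread(target=target)
	worker.start()
	worker.join()

count = main_conn.execute('SELECT COUNT(*) FROM sets').fetchall()[0][0]
</pre>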
<h4>Let&#8217;s Build a Generator</h4>
<p>Finally, I think creating a generator from an SQL query is a dandy idea. For part of the code, I query the database for the list of filled out questions for each member of a set. I took this approach to allow for the addition of questions, while removing the need to go back and fill out the answer to the question for all previous members, which could take a substantial amount of time. We can actually delay the querying of the database for extra data until it&#8217;s needed by using a simple generator:</p>
<pre class="brush: python; title: ;">
	def listMembers(self):
		c = self._conn.cursor()
		c.execute('SELECT id FROM set_members WHERE set_id = ?',
			(self._sid, ))
		for r in c:
			cRow = self._conn.cursor()
			cRow.execute(&quot;&quot;&quot;SELECT set_member_questions.answer as answer,
				set_questions.question_phrase as question, set_questions.id
				as qid FROM set_member_questions, set_questions
				WHERE set_member_questions.question_id =
				set_questions.id
				AND set_member_questions.member_id = ?&quot;&quot;&quot;, (r['id'], ))
			yield (r['id'], cRow.fetchall())
</pre>
<p>This, in a functional spirit, will act as an iterator with delayed evaluation. While a discussion of generators is best saved for another day, the idea is that you can yield an element and preserve the state of the function, stack and execution, until the next() function is called. The yield keyword acts as a nice coating of syntactic sugar for this pattern. The benefit is that we delay execution of future SQL statements until they are needed by the main application.</p>
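<p>You can watch the delayed evaluation happen with a stripped-down version of the same idea. This sketch uses two simplified tables in an in-memory database, and a queried list that exists only to record when the per-member query actually runs:</p>
<pre class="brush: python; title: ;">
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row
conn.executescript('''
	CREATE TABLE set_members(id INTEGER PRIMARY KEY, set_id INTEGER NOT NULL);
	CREATE TABLE set_member_questions(
		id INTEGER PRIMARY KEY,
		member_id INTEGER NOT NULL,
		answer TEXT);
	INSERT INTO set_members(id, set_id) VALUES (1, 1);
	INSERT INTO set_members(id, set_id) VALUES (2, 1);
	INSERT INTO set_member_questions(member_id, answer) VALUES (1, 'Nikola Tesla');
	INSERT INTO set_member_questions(member_id, answer) VALUES (1, 'AC Generators');
	INSERT INTO set_member_questions(member_id, answer) VALUES (2, 'Benjamin Franklin');
''')

queried = []  # records which members have had their answers fetched

def list_members(set_id):
	members = conn.execute(
		'SELECT id FROM set_members WHERE set_id = ? ORDER BY id', (set_id,))
	for member in members:
		queried.append(member['id'])
		answers = conn.execute(
			'''SELECT answer FROM set_member_questions
			WHERE member_id = ? ORDER BY id''', (member['id'],)).fetchall()
		yield (member['id'], [a['answer'] for a in answers])

gen = list_members(1)
none_yet = list(queried)     # calling the function runs no queries at all
first = next(gen)
after_first = list(queried)  # only the first member has been looked up
</pre>
<p>If the caller stops after the first member, the query for the second one is never issued.</p>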
<h3>Encapsulation</h3>
<p>Finally, let&#8217;s take a look at encapsulating all this data access nonsense in some classes. It&#8217;s tough to strike a good balance between abstraction and ease of implementation. I would argue that a good deal of your time as a programmer should be spent worrying about where exactly to draw the boundaries between interfaces, and what those interfaces should look like. When working on a project of scale, it&#8217;s super important to create interfaces that hide just enough of the implementation to lift the mental model required to effectively use it above the implementation level, while staying close enough to the resource to not introduce unnecessary code (the YAGNI acronym from extreme programming comes to mind).</p>
<p>We&#8217;re dealing with a file-based database designed to add a thin layer of persistence to our application, some would argue that we don&#8217;t need any encapsulation. I would argue that creating a very minimal set of classes designed to give you all the data access methods you need to effectively add and remove data objects is of great value. You will very rarely find yourself in a situation where a request comes down the pipe to remove a feature, and short, one-off scripts have a way of becoming unwieldy hacks as more and more features are tacked on the end. By paying a small price up front, we get a lot of benefits, and the code produced is much cleaner. </p>
<p>I call this really basic approach ORM without the bullshit. I won&#8217;t go into too many details, but I encourage you to take a look at the source code to get a feel for where I think the line should be drawn between too much and too little when it comes to the weight of the interfaces to the data. </p>
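<p>To give a flavor of the weight I have in mind, here&#8217;s a purely illustrative sketch. The SetStore class and its methods are invented for this example and are not the classes in the FlashCarder source. The point is only that callers deal in sets and never see a line of SQL:</p>
<pre class="brush: python; title: ;">
import sqlite3

class SetStore(object):
	'''Hypothetical thin wrapper: callers see sets, never SQL.'''

	def __init__(self, path=':memory:'):
		self._conn = sqlite3.connect(path)
		self._conn.execute(
			'CREATE TABLE IF NOT EXISTS sets(id INTEGER PRIMARY KEY, name CHAR(255))')

	def add_set(self, name):
		# The with block commits on success and rolls back on error
		with self._conn:
			cur = self._conn.execute(
				'INSERT INTO sets(name) VALUES (?)', (name,))
		return cur.lastrowid

	def list_sets(self):
		rows = self._conn.execute('SELECT id, name FROM sets ORDER BY id')
		return [(r[0], r[1]) for r in rows]

	def delete_set(self, set_id):
		with self._conn:
			self._conn.execute('DELETE FROM sets WHERE id = ?', (set_id,))

store = SetStore()
first_id = store.add_set('Inventions')
second_id = store.add_set('Capitals')
store.delete_set(first_id)
remaining = store.list_sets()
</pre>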
<h3>Get the Source!</h3>
<p>Anyone can write code to query a SQL database, but I think it takes a lot of practice to write clean code. I didn&#8217;t focus on the rudimentary aspects of SQLite data access in this article because they are echoed with a great amplitude across the net in many other tutorials. Instead, I hope that I&#8217;ve captured some of the elements of style that come together to create code that is maintainable and elegantly handles some of the real world issues that crop up when using databases. </p>
<p>The source, as I mentioned, is available on my <a href="https://github.com/ab500/flashCarder">GitHub page</a>! It includes the full front-end menu system to this application, and the data access layer I&#8217;ve discussed here.</p>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/15/using-sqlite-in-python-building-flashcarder/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Collecting, Analyzing, and Plotting Data from Lab Bench Equipment</title>
		<link>http://andrewbrobinson.com/2012/01/13/collecting-analyzing-and-plotting-data-from-lab-bench-equipment/</link>
		<comments>http://andrewbrobinson.com/2012/01/13/collecting-analyzing-and-plotting-data-from-lab-bench-equipment/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 01:29:44 +0000</pubDate>
		<dc:creator>Andrew Robinson</dc:creator>
				<category><![CDATA[Data Collection]]></category>
		<category><![CDATA[Research]]></category>

		<guid isPermaLink="false">http://andrewbrobinson.com/?p=530</guid>
		<description><![CDATA[In our lab we have a very nice Tektronix scope. It has many fancy dials and more features than you can shake a stick at. It is truly a great piece of technology, and makes our lives much easier when we&#8217;re working on analyzing and troubleshooting circuits. Unfortunately, and I&#8217;ve found this holds true for [...]]]></description>
			<content:encoded><![CDATA[<p>In our lab we have a very nice Tektronix scope. It has many fancy dials and more features than you can shake a stick at. It is truly a great piece of technology, and makes our lives much easier when we&#8217;re working on analyzing and troubleshooting circuits.</p>
<p><img src="http://andrewbrobinson.com/wp-content/uploads/2012/01/scarycScopee.jpg" alt="" title="scarycScopee" width="528" height="330" class="aligncenter size-full wp-image-551 comic" /></p>
<p>Unfortunately, and I&#8217;ve found this holds true for almost every piece of lab equipment I&#8217;ve used, when it comes to actually collecting data from the device for plotting in paper-quality graphs everything falls to pieces. What was previously your 4-channel best friend, now becomes your mortal enemy, and you must do battle with this device to achieve great research success. </p>
<p>One of the biggest struggles I&#8217;ve had as I start my academic career has been creating an efficient pipeline for data collection and plotting. As a computer scientist, it&#8217;s in my nature to avoid solving the specific problem at hand, and instead spend my time trying to solve the general class of problems that it belongs to. To that end I&#8217;ve started to amass a small collection of data processing tools that I&#8217;ve found useful, and some thoughts I&#8217;ve had on what works well, and what doesn&#8217;t work so well, that I will share with you as we embark down this path. I&#8217;ll do this in a couple different articles, as to avoid information overload, and segment things at appropriate boundaries. </p>
<h3>Data Collection?</h3>
<p><img src="http://andrewbrobinson.com/wp-content/uploads/2012/01/confusedData-215x300.jpg" alt="" title="confusedData" width="215" height="300" class="alignleft size-medium wp-image-556 comic" /><br />
In their most basic form data sets are nothing more than big piles of numbers. We collect and worship these numbers because we feel that by looking at them the right way we&#8217;ll be able to draw some sort of meaningful conclusion from them, or that they will perhaps help us prove some hypothesis, or demonstrate the validity of a concept. By themselves data sets have almost no meaning or value. A big pile of numbers can represent almost anything, and prove almost nothing, and this makes keeping track of the context one of the most important aspects of data collection. While I can create a reasonably useful pipeline for plotting data, if I don&#8217;t remember why or to what ends I took a data set I might as well throw it away, as it is of no use to me. </p>
<p>Data sets have a tendency of feeling rather vague and ill-defined at times. Almost immediately after taking a data set we begin to forget why we took it, and entropy starts to set in. To combat this gradual increase in entropy you must document everything you can. Write memos in the folders containing your data that border on obsessive-compulsive, and try to capture the entire essence of your work.</p>
<p>Your goal in data collection is to tell a story. You want to collect the data you need, filter out any irrelevant data, and crystallize the essence of your message by displaying the data in a way meaningful to your reader. There are essentially three pieces to the data collection puzzle I&#8217;ll cover.</p>
<ul>
<li><b>Getting Data Out of Devices</b> &#8211; The art of retrieving data from devices that are designed with seemingly intentional malice. </li>
<li><b>Processing that Data</b> &#8211; The science of taking the pile of wrongly formatted, poorly scaled, and generally inconvenient data that you finally manage to coax out of the device, and formatting it such that it might be useful.</li>
<li><b>Plotting It</b> &#8211; The black magic behind gnuplot and others, a brief overview of the esoteric incantations required to transform your data set into a story.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://andrewbrobinson.com/2012/01/13/collecting-analyzing-and-plotting-data-from-lab-bench-equipment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

