Wednesday, August 21, 2013

Zero to GHC development in an Ubuntu VM in 19 lines of Bash

If you're like me, you want to contribute to GHC and even have a few small tasks in mind, but since you're an occasional contributor you don't pull, build, and validate it every day. Thus, when you come back to it periodically, getting it to validate again can be a chore. For example, it most recently failed on my RHEL 6 work machine and my OS X 10.8 laptop, but worked on my Ubuntu machine.

Given the complexity and fragility of the process, it would be good for us to have more standardized (hermetic) environments so that people can (1) get started easily and (2) reproduce each other's build outcomes.

The idea of using a VM image for this purpose was discussed back in 2009. But it seems not to have happened. To get things moving in this direction, I came up with a script that bootstraps a working GHC development repository inside a VirtualBox VM running Ubuntu 12.04 LTS (either server or desktop worked for me).

The following Bash script is all there is to it:
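(What follows is a sketch of its steps; the package list and repository URL come from the GHC wiki's Ubuntu instructions and may need adjusting for your release.)

    #!/bin/bash
    set -e  # stop at the first failure

    # Build prerequisites for GHC on Ubuntu 12.04.
    sudo apt-get update
    sudo apt-get install -y git autoconf automake libtool make gcc g++ \
        libgmp-dev libncurses5-dev python ghc happy alex

    # Fetch the main repository plus all of its sub-repositories.
    git clone http://git.haskell.org/ghc.git   # or the GitHub mirror
    cd ghc
    ./sync-all --testsuite get

    # Standard bootstrap / configure / build cycle.
    perl boot
    ./configure
    make -j4

    # Run the full validation (build plus testsuite).
    sh validate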

Once I figure out how to shrink an Ubuntu VM down to a reasonable size, I'll post a VM for download or publish an Amazon Machine Image (AMI). Or, if someone knows Chef and would like to help me convert the above into a Chef recipe, that would be even better than a Bash script.

Sunday, August 18, 2013

Revisiting Google Drive a year later: Still not ready for intensive use. Benchmarked against AeroFS and BitTorrent Sync.

Last year, I eagerly anticipated the release of Google Drive.  I had complained a lot about my experiences with other synchronization software, and I fully expected Google to knock this one out of the park.  It's an application that should really play to Google's strengths: systems software, storage, and scaling distributed systems.

In Spring 2012, I made the leap and moved all my personal cloud storage over to Google, but I ran into too many technical problems and gave up.  (Some of these problems I'll detail below.)  Now, a year later, I wanted to check in again and see how things are improving.

I'm afraid I have to report that there are still these major deal breakers for me, and perhaps for other "power users":
  • Scalability isn't there.  For example, if you try the simple benchmark of adding a folder with thousands of very small files (see the snippet after this list), you'll see that maximum throughput is a few files per second.
  • Getting stuck ("Unable to sync") seems common.
  • Symlinks are still ignored.
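To reproduce the scalability test, something like the following is enough (a sketch; point it at wherever your sync client watches):

    # Drop a few thousand tiny files into the synced folder, then watch
    # the client's progress to estimate throughput in files per second.
    mkdir -p "$HOME/Google Drive/bench"
    for i in $(seq 1 5000); do
        echo "file $i" > "$HOME/Google Drive/bench/file-$i.txt"
    done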
It's surprising to me when syncing solutions do not aggregate metadata for changed files before communicating over the wire (as rsync does).  The Google Drive API seems to encourage per-file remote operations.  I've heard there is some support for batching, but I'm not sure whether that is specific to certain Google APIs or generic across them.  It would sure help here.
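For what it's worth, Google's APIs expose a generic batch endpoint that packs several requests into one HTTP round trip; a sketch of batching two Drive metadata requests looks like this (the file IDs and access token are placeholders, and whether the sync client could use this for bulk uploads is exactly the open question):

    # Two Drive API v2 requests in a single multipart/mixed POST.
    { printf -- '--END\nContent-Type: application/http\n\nGET /drive/v2/files/FILE_ID_1\n\n'
      printf -- '--END\nContent-Type: application/http\n\nGET /drive/v2/files/FILE_ID_2\n\n--END--\n'
    } | curl -s https://www.googleapis.com/batch \
          -H "Authorization: Bearer $ACCESS_TOKEN" \
          -H "Content-Type: multipart/mixed; boundary=END" \
          --data-binary @-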

Of course, these services all work great for storing small numbers of medium-sized files.  Maybe there's no desire or need to support scaling and more intensive use?  Yet I think even non-techie users may end up with large numbers of small files, even if they don't create them directly (e.g. inside my Aperture library).  For myself, I ultimately want something closer to a distributed file system.  For example, I like to edit files within a git checkout locally on my laptop and have them synced to a server where I run the code.  This requires three things:
  • Cross platform -- Linux/Mac in my case.
  • Low latency -- file edits should appear quickly on the other side.
  • Equally good treatment of large numbers of small files and small numbers of large files.
Alas, in spite of the massive increase in the number of cloud-based directory synchronization options, none seem to meet all three of these criteria.  Still.  I'll go through a list of disqualifying points at the end of this post.

The end result is that I still use the same solution I did ten years ago: I run "unison -repeat 2" to link working copies on different machines.  The only thing missing is convenient file-system watching via inotify (i.e. OS-driven notification of changes rather than scanning).  This is the killer feature that many of the newer cloud offerings have over unison, and it is the key to low latency as well as to the always-on usage model that Dropbox-style systems employ.  Unison has rudimentary support for integrating with a file-system watcher, and I've sporadically had that functionality working, but it was fragile and hard to set up the last time I tried it.
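For reference, the whole setup fits in a small unison profile (a sketch; the paths and host name below are made up, so substitute your own):

    # ~/.unison/code.prf -- sync a git checkout between a laptop and a server.
    root = /home/me/work/project
    root = ssh://devserver//home/me/work/project
    # Re-scan and propagate changes every 2 seconds.
    repeat = 2
    # Run without interactive prompts.
    batch = true
    # Resolve conflicts in favor of the newer copy.
    prefer = newer

Running "unison code" then keeps the two replicas in sync; the experimental "-repeat watch" mode is the file-watcher integration mentioned above.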