Data Science VM is a set of scripts that automatically launches and configures a data science system on your computer or in the cloud in half an hour or less, across Linux, Windows, and OSX.
In my experience, any new machine for a serious project takes 3-5 days to set up. During my first semester at MIT, I spent weeks installing MediaCloud (it’s easier now, I hear). I lost around 3 days on each of three occasions: when my laptop was stolen in March of 2012, when my MacBook Pro died just before my thesis deadline, and when I started a summer internship at Microsoft. Setup time is also a major problem during hack days; I’ve attended too many events that end just as the participants finish setting up their machines.
What It Includes
Data Science VM is a set of scripts that automatically launches and configures a new virtual machine locally or on Amazon EC2. Inspired by MIT StarCluster, it contains some basic tools for statistics, natural language processing, and data analysis:
- Python NLTK, with full language pack downloads
- The Python fork of the Stanford Core NLP Library, with a Web API
- Vim
- … make suggestions, and I’ll add them (I’m planning to add rvm and ruby)
Now, whenever I need to set up a new laptop (Windows, OSX, or Linux), I only need to install Vagrant and VirtualBox (or VMware) and run “vagrant up”. Within a half hour, the core tools I need for basic data analysis work are ready to go. Instructions and more are on the project’s GitHub page.
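If you want to script that last step, here is a minimal Python sketch of the workflow. It is not part of Data Science VM. It assumes you are using the VirtualBox provider, that the `vagrant` and `VBoxManage` commands are on your PATH (on Windows, VirtualBox may not add `VBoxManage` automatically), and that `project_dir` points at a checkout containing the project's Vagrantfile. The function names `missing_tools` and `launch_vm` are made up for illustration.

```python
import shutil
import subprocess


def missing_tools(tools=("vagrant", "VBoxManage")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def launch_vm(project_dir="."):
    """Check prerequisites, then run `vagrant up` inside project_dir.

    Assumption: project_dir contains a Vagrantfile.
    """
    missing = missing_tools()
    if missing:
        raise RuntimeError("Install these first: " + ", ".join(missing))
    # `vagrant up` reads the Vagrantfile in the working directory,
    # creates the VM if needed, and runs its provisioning scripts.
    subprocess.run(["vagrant", "up"], cwd=project_dir, check=True)
```

Calling `launch_vm()` from the directory that holds the Vagrantfile does the same thing as typing “vagrant up” there, with an early and more readable failure if one of the prerequisites is missing.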
About Vagrant
Vagrant is an awesome system for auto-configuring virtual machines across multiple services. My scripts offer custom configurations for popular data science packages, including web access for RStudio, IPython Notebook, and the Stanford Core NLP library.
Making it Better
Data Science VM was first tested at the Mozilla Festival session I co-hosted on Measuring News: Tracking Content and Engagement. I’m planning to keep maintaining it, so do send bug reports and feedback. This is also hardly the only way to set up a new data science machine. What’s your favourite approach? Link to it in the comments.