Project Website, Part One: Migrating Data

Now that I've explained the state of play on our Project Website application, it's time to roll up my sleeves and get started.
rails -d mysql
For those of you who haven't done this in a while, the -d mysql option is needed because Rails 2.0.2 and up use SQLite3 as the default database.
I suspect that everybody who has worked with Rails for a while has their own way of building up from the basic app skeleton to get to a useful structure. Here's mine:
First step, source control.
Source control comes on board immediately. I'm using git for this project, because that's what all the cool kids are using these days, and also because it has a lot of features that are useful in my specific situation, most notably being able to manage source control when off-line and without a central server. This is my first fully git-ified project (I've been using git as a front end to Subversion for about a month), so I'm pretty sure I'll mess something up along the way.
You can see the details of basic git setup other places on the web, but I did want to mention the contents of my .gitignore file.
log/*
doc/*
tmp/*
config/database.yml
data/*
.DS_Store
db/schema.rb
vendor/rails/.git/*
legacy/data/data.sql
Most of this is pretty Rails standard -- the log, doc, and tmp directories don't need to be stored. It's considered good practice to keep the database.yml file out of source control but include a template for creating one. The data directory is something I use for custom rcov and other metric tasks, and .DS_Store is a Mac OS X file (the pattern blocks the file in any subdirectory).
I go back and forth on whether it's a good idea to put the schema.rb file in source control or not -- DHH has mentioned that it's a useful shortcut for getting a new developer up to speed. On the other hand, it is a derived file, and I generally try to avoid putting derived files in source control.
I'm not sure whether you actually have to block subproject .git files here (it seems like a useful thing to do when using git-svn, but I'm still kind of feeling my way around here). I've put the old Python code in the legacy/ directory, so I'm also blocking a derived file in that directory.
Step two, getting Rails on board
With git on board, the next piece of basic setup is getting the right Rails version in the vendor/rails directory -- always put Rails in that directory so as to not be dependent on what happens to be installed on a deployment system. I'll be frank here -- what with the Rails move to git, it's not completely clear to me what the best-practice way is to keep Rails and plugins up to date within my git repository.
For the moment, that's not a huge problem. I acquire Rails via git, then use git to check out v2.1.0, which contains the 2.1.0 release source:
git clone git://github.com/rails/rails.git vendor/rails
cd vendor/rails
git checkout v2.1.0
I'm not completely sure about how the main git repository interacts with the internal git repository -- I notice that when I add Rails (git add vendor/rails) git seems to bring the entire directory as a single entity, rather than adding all the individual files. Anyway, this should make it easy for me to move to edge rails, just by pulling the current git repository and moving the branch back to the git master branch.
Step three, plugins and gems
Before I start coding, I add a few gems and plugins that I use in writing tests. Gems were installed via sudo gem install (actually, most of them were already on my machine), then added to config/environment.rb, then moved to the vendor/gems directory via the new Rake task rake gems:unpack:dependencies. The gems and plugins I start out with are targeted at testing or development.
* shoulda -- testing framework of choice.
* andand -- allows for nice workaround of the common case of thing.child && thing.child.name, allowing you to write thing.child.andand.name
* flexmock -- mock object framework of choice. In the case of this and shoulda, it's not as much a case of radical differences in functionality as it is just comfort with the basic API.
* quiet backtrace -- handy little utility that filters out the Ruby and Rails core lines from backtraces in test results, leaving just the lines that reference your code. Very helpful when working with autotest.
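As a rough illustration of what andand buys you, here is a toy version of the idea in plain Ruby. This is not the gem's actual code: the toy_andand method, the NilProxy class, and the Thing and Child structs are all made up for this sketch.

```ruby
# Toy stand-in for the andand idea; NOT the gem's real implementation.
# NilProxy answers nil to any message sent to it.
class NilProxy < BasicObject
  def method_missing(_name, *_args, &_block)
    nil
  end
end

class Object
  # For any non-nil receiver, just hand the receiver back.
  def toy_andand
    self
  end
end

class NilClass
  # For nil, hand back a proxy so the next call in the chain yields nil.
  def toy_andand
    ::NilProxy.new
  end
end

# Hypothetical models for the example.
Child = Struct.new(:name)
Thing = Struct.new(:child)

with_child    = Thing.new(Child.new("widget"))
without_child = Thing.new(nil)

with_child.child.toy_andand.name     # the child's name
without_child.child.toy_andand.name  # nil, rather than a NoMethodError
```

The point is only that the nil check moves out of the calling code and into the chain itself.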
Data Migration
The first bit of code I'm going to write for this site is the data input. There are a couple of reasons for this. One is that having real data in the system will make it easier for me to visualize the real system and easier for the client to usefully comment on it. Also, in this particular application, inputting bulk data is a major piece of the admin functionality, and getting it in early will give me more time to refine it later.
Anyway, the data here is provided as a list of products and it comes to the client as either a CSV or XML file. For various boring historical reasons, we use the CSV version. The products are attached to brands via a text ID, and an individual brand may have more than one ID associated with it.
The data model is straightforward: we've got Products, Brands, and BrandDesignations (which, in retrospect, is overly verbose; it probably should be BrandAliases). For the moment, the only piece of data in the Brand is the brand name, and the only pieces of data in the Designation are the alias and the brand it is associated with. Brands have a one-to-many association with Products, and Brands have a one-to-many association with Designations. Products have a lot of other data fields that are less important at this moment. I create basic database migrations for each of these. By now, I've set up a database.yml.template file that will go into source control and also the database.yml file that is actually used by the system.
I like to structure data migrations by placing the bulk of the code in the ActiveRecord model being created and then writing some wrapper tasks in Rake. The combination is easily testable, runnable from the command line, but still can be embedded in the application itself if needed.
The important thing about a data migration task is that it be idempotent, which is one of my favorite fancy words to drop into a technical discussion. (Come to think of it, that'd be an interesting interview question...). An idempotent operation leaves things in the same state whether it's applied once or many times. For a data migration, this means that applying the migration multiple times will not result in duplicate database rows. The easy way to accomplish this is to have the rake task empty the table before the migration. It's only slightly more complicated to check for duplicates before adding them to the database. In this case, however, it's an explicit client requirement that the new product file completely replace the old products, so that makes the choice simple. The basic Rake tasks look like this:
desc "add the brand list"
task :create_brands => [:environment, :clear_brands] do
  INITIAL_BRANDS.each do |name|
    Brand.create(:name => name)
  end
end

# Empty the brands table first so the load can be re-run safely.
task :clear_brands => :environment do
  Brand.delete_all
end
In this particular case, I already know the initial list of brands and brand aliases, so I've put them in the Rake file as literal constants (such as INITIAL_BRANDS).
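To make the idempotency point concrete outside of Rails, here is a small sketch that uses a plain Ruby array as a stand-in for a database table. The method names (append_all, replace_all, add_missing) are invented for the illustration and don't appear in the project.

```ruby
# A plain array stands in for a database table (an assumption for this sketch).

# NOT idempotent: every run adds the rows again.
def append_all(table, rows)
  table.concat(rows)
end

# Idempotent by clearing first, like the clear-then-create Rake tasks.
def replace_all(table, rows)
  table.clear
  table.concat(rows)
end

# Idempotent by checking for duplicates before adding.
def add_missing(table, rows)
  rows.each { |row| table << row unless table.include?(row) }
  table
end

brands = []
2.times { replace_all(brands, ["Acme", "Globex"]) }
brands.size  # still two rows after two runs
```

Running append_all twice against the same table would leave four rows, which is exactly the duplicate-row problem the clear step avoids.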
There's not much to the brand tasks; there's a little bit more in the products tasks:
desc "add products"
task :create_products => [:environment, :clear_products] do
  FasterCSV.foreach("legacy/data/marsanweb.csv") do |row|
    next if row[0] == 'Brand_ID'  # skip the header line
    Product.create_from_input(*row)
  end
end

desc "remove products"
task :clear_products => :environment do
  Product.delete_all
end

desc "add brands, designations, and products"
task :create_all => [:create_brands, :create_designations, :create_products]
The first line of the CSV file is a header, so I somewhat brute-force-ly skip over lines that look like headers. I prefer the API of FasterCSV to the normal Ruby CSV module -- I think it's a little bit easier to read and work with.
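As an aside on the header check: FasterCSV went on to become Ruby's standard csv library in Ruby 1.9, and it can consume the header row itself through its headers option. Here is a minimal sketch, assuming a modern Ruby; the sample data and every column name other than Brand_ID are made up for the example.

```ruby
require 'csv'

# Made-up sample data; only the Brand_ID header comes from the real file.
SAMPLE_CSV = <<~DATA
  Brand_ID,Name,Price
  ACME,Anvil,$17.99
  ACME,Rocket Skates,$42.50
DATA

# Returns each data row as a plain array, with the header row left out.
def product_rows(csv_text)
  CSV.parse(csv_text, headers: true).map(&:fields)
end

product_rows(SAMPLE_CSV).each do |row|
  puts row.inspect
end
```

With headers turned on, each row is a CSV::Row rather than an Array, which is why the sketch calls fields before handing the values on.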
The Brand and Product code that converts the CSV lines to ActiveRecords has a couple of points of interest. The product code largely takes the array, creates an ActiveRecord object and fills the attributes one by one. A quirk is that pricing is in the CSV file as a string of the form "$17.99", so it needs to be converted:
def dollar_string(string)
  # Drop the leading "$" and convert the rest to a float.
  string[1..-1].to_f
end
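If floating-point prices ever become a worry, one alternative is to store whole cents as an integer. The sketch below is not what the project does; price_in_cents is a hypothetical helper, and it assumes prices look like "$17.99", "$5", or "5.5" with no thousands separators.

```ruby
# Hypothetical alternative to dollar_string: parse a price into integer cents.
# Raises on anything that doesn't look like an optional "$" plus digits.
def price_in_cents(string)
  match = string.strip.match(/\A\$?(\d+)(?:\.(\d{1,2}))?\z/)
  raise ArgumentError, "unrecognized price: #{string.inspect}" unless match

  dollars = match[1].to_i
  # Pad "5" to "50" so that "$5.5" means fifty cents, not five.
  cents = (match[2] || "0").ljust(2, "0").to_i
  dollars * 100 + cents
end

price_in_cents("$17.99")  # 1799
```

Raising on malformed input also surfaces bad rows during the import instead of silently storing 0.0.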
Also, I implemented a single finder to get brands via their names or their aliases. This is in the Brand class:
class << self
  attr_accessor :cache

  def find_by_designation(designation)
    @cache ||= {}
    @cache[designation] ||= begin
      find_by_name(designation) ||
        BrandDesignation.find_by_designation(designation).andand.brand ||
        raise("No brand for #{designation}")
    end
  end
end
So, I put this in the singleton class (class << self) because it's manipulating a class-level variable for the cache, and I find that easier to manage if it's in the singleton class and I can treat it as an instance variable. Inside the method itself, there's a search in the brands table, followed by a search in the designations table if the first search fails. It raises an exception if both fail, although I suspect that later on I'll just want it to return nil. The result is cached so the lookup doesn't have to go back to the database.
(And here's a nice use of andand to protect against a malformed BrandDesignation without a Brand).
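The caching pattern can be tried without a database. In this sketch, two hashes stand in for the brands and designations tables, and a counter records how often the "database" is actually consulted. BrandLookup, NAMES, ALIASES, slow_lookup, and db_hits are all invented for the illustration.

```ruby
# Stand-alone sketch of the class-level cache; hashes replace the real tables.
class BrandLookup
  NAMES   = { "Acme" => "Acme" }.freeze  # stand-in for the brands table
  ALIASES = { "ACM"  => "Acme" }.freeze  # stand-in for the designations table

  class << self
    attr_reader :db_hits

    def find_by_designation(designation)
      @cache ||= {}
      @cache[designation] ||= slow_lookup(designation)
    end

    private

    # Pretend database query: name first, then alias, else raise.
    def slow_lookup(designation)
      @db_hits = (@db_hits || 0) + 1
      NAMES[designation] || ALIASES[designation] ||
        raise(ArgumentError, "No brand for #{designation}")
    end
  end
end

BrandLookup.find_by_designation("ACM")
BrandLookup.find_by_designation("ACM")  # served from the cache
BrandLookup.db_hits                     # 1
```

One caveat with ||= as a cache: a nil or false result is never stored, so if the finder is later changed to return nil for unknown designations, those misses will hit the database every time.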
And that's the migration code. There are unit tests to go along with it, but I'll talk about test setups later on when the tests get a little more complex.
Next time on Project Website:
I integrate an Ajax admin page using jQuery. Watch me as I fumble and learn, so you don't have to. (Fumble, that is. You're still going to learn.)
Topics: Project Website, Ruby on Rails
Comments: 1 so far

Hi,
I’m pretty sure schema.rb should be in source control, as it should be used to create any new databases instead of running migrations from scratch. This becomes more apparent once you get quite a few migrations.
Regarding Git, I think git-submodule is what you’re looking for to maintain vendor/rails.
Comment by Andrew, Thursday, July 10, 2008 @ 5:31 am