A Day of Data Munging

Today I tackled a data munging problem that had been festering on my To Do list for several weeks. The problem was blocking other, more pleasant work, and was contributing to my foul mood.

The problem involved reworking 121 HTML documents to hoist content out of a table and then discard the table structure along with some other content. Editing the documents by hand would have been both exhausting and error-prone (25 documents? Maybe. 121? No thanks). There were enough slight variations in the documents to make writing editor macros tricky. The macros could have been run in multiple passes, but that’s error-prone too with that many documents.

This was essentially a tree rewriting problem, so XSLT would seem to be a good tool to reach for. Unfortunately, the documents were pre-XHTML and not well-formed. (Many of the files had passed through Adobe GoLive, which produces some… uh, interesting HTML.) Was it worth taking the time to convert to XHTML? Tempting, but not in this case.

I ended up using Perl and HTML::TokeParser to tokenize the documents. A simple “keep tokens until we see a <table> tag, discard it, and keep discarding until we see… and then keep tokens until…” state machine was quick to code up and test, and ran so fast that at first I thought there had to be a bug. Converting the tokens back to HTML let me clean up some of the damage that GoLive had done.
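
For anyone curious about the shape of such a state machine, here is a minimal sketch. It is not the Perl and HTML::TokeParser code described above: it uses Python's standard-library html.parser instead, and it assumes the simplest possible rule, namely "keep every token outside a table, discard the table and everything inside it." The names TableStripper and strip_tables are made up for illustration, and the real job had more states than this.

```python
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Hypothetical sketch: keep tokens outside <table>...</table>, drop the rest."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through untouched
        super().__init__(convert_charrefs=False)
        self.out = []    # kept tokens, already serialized back to HTML
        self.depth = 0   # > 0 while inside a table; a counter copes with nesting

    def _keep(self, text):
        # The whole state machine: emit only while outside every table.
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "table":          # tag names arrive lowercased
            self.depth += 1
        else:
            self._keep(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit once, without a spurious end tag.
        self._keep(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth = max(0, self.depth - 1)  # tolerate a stray </table>
        else:
            self._keep(f"</{tag}>")

    def handle_data(self, data):
        self._keep(data)

    def handle_entityref(self, name):
        self._keep(f"&{name};")

    def handle_charref(self, name):
        self._keep(f"&#{name};")

    def handle_comment(self, data):
        self._keep(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._keep(f"<!{decl}>")


def strip_tables(html):
    """Return html with every table, and everything inside it, removed."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling strip_tables on the text of each file and writing the result back out would be the whole driver loop. A tokenizer suits this kind of job because it does not insist on well-formed input: it hands back start tags, end tags, and text in document order, and the state machine only has to decide which ones to keep.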

Perl has long been my tool of choice for problems like this, though lately I’ve been spending most of my evening programming time with Ruby. It’ll be interesting to see if or when Ruby supplants Perl for problems like this one. My knowledge of Ruby’s libraries outside of those needed for Rails work is still meager, and my bag of Perl tricks is pretty big. I’m guessing it’ll be another year.