I had fun repurposing one of the example projects from Automate the Boring Stuff with Python yesterday. The project was designed to download local copies of all of the XKCD comics. I wanted to grab a local copy of Mookie’s Dominic Deegan, since it’s an obscure comic that hasn’t been updated in years and I didn’t want it to vanish into the internet aether.
Of course, since this is a very old webcomic from a creator who is much less HTML-savvy than Randall, I had a few fun challenges. I needed to pick apart the format of the web pages to find a good way to parse the Beautiful Soup tag objects (Beautiful Soup being a fabulous Python utility you can turn on a webpage address to get back a data structure containing all of the HTML). I had to find a good way to pick out just the tags I wanted, and I needed to figure out that on some early pages, Mookie did multi-line comics as multiple image files slapped together in HTML, and account for that. There wasn’t a clean way of handling start and end URLs here, so I had to find a way around that, too.
What I found most interesting and fun about all of these problems was the degree to which “Is this a good enough solution?” was my guideline. For the start and end points, I could just set the second and second-to-last comics as the boundaries of the run, and manually grab the first and last comics myself. For the multi-part comics, I really didn’t want to have to identify them as special cases, so I restructured the code to apply the comic download function to every tag that matched the criteria.
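To make the multi-part trick concrete, here is a rough sketch of the idea rather than my actual script. It uses the standard library's HTMLParser instead of Beautiful Soup so it runs without installing anything, and the page layout (a div with class "comic" wrapping the images), the example.com URLs, and the names ComicImageFinder and comic_image_urls are all made up for illustration. The point is that a single-image page and a multi-image page go through exactly the same code path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ComicImageFinder(HTMLParser):
    """Collect the src of every <img> inside a <div class="comic">.

    The "comic" class is an assumed page layout, not the real site's markup.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many divs deep we are inside the comic div
        self.sources = []   # image paths found so far, in page order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the comic div, or track a div nested inside it.
            if self.depth or "comic" in classes:
                self.depth += 1
        elif tag == "img" and self.depth and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def comic_image_urls(page_html, page_url):
    """Return absolute URLs for every comic image on the page."""
    finder = ComicImageFinder()
    finder.feed(page_html)
    return [urljoin(page_url, src) for src in finder.sources]


# A one-image page and a multi-image page, both invented for this example.
SINGLE = '<div class="comic"><img src="/comics/0001.gif"></div>'
MULTI = (
    '<div class="comic"><img src="/comics/0002a.gif"><br>'
    '<img src="/comics/0002b.gif"></div>'
    '<div class="ad"><img src="/ads/banner.gif"></div>'
)

for page in (SINGLE, MULTI):
    for url in comic_image_urls(page, "http://example.com/archive/"):
        print(url)  # a real script would download and save the image here
```

With Beautiful Soup the same shape falls out of looping over everything a selector returns instead of grabbing only the first match, which is roughly what I ended up doing.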
The project is now done, more or less. It’s not perfect by any means; if I wanted to refresh the archive I now have, I’d still have a few manual steps I’d need to do. But since I’ve done what I set out to do with this program, I don’t need to improve it any.
The worrying thing, though, is that I see this exact same pattern at work in basically all of the production software I work on. There are huge areas of good-enough that could be refactored and re-done to handle the edge cases and avoid special procedures, but people there treat runs-every-day production code like toy run-once projects.
On one hand, I feel a little bit guilty about leaving my project in a good-enough state. On the other, I guess this means I’m comfortable enough in Python to be writing disposable utilities in it, and treating them as utilities rather than mini study projects.
But however I look at it, I did have a good time spending a few hours digging into both general and specific problems, and I now have what I wanted to have (and a good framework for setting up future web-scraping if I want it), so I consider it time well spent.