Easy readers

At THATCamp ’08 I learned how to draw a smiley face with a few geometric programming commands.

Dan Chudnov demonstrated how to download Processing, a Java-based environment intended for designers, visual artists, students, and others who want to create something without being full-time professional programmers. Dan’s purpose was to show librarians, scholars, artists, and free-range humanists that getting started with simple programming isn’t as hard as people sometimes think. You don’t have to be a computer scientist or statistician to develop skills that can be directly useful to you. Dan posted a version of what he was demonstrating with the tag “learn2code.”
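
For anyone curious, the whole demo really does fit in a few lines. Here is my approximation from memory of that kind of smiley-face sketch (a reconstruction, not Dan’s actual code):

    // A smiley face in a handful of geometric commands
    // (my approximation of the demo, not the original code).
    size(200, 200);
    background(255);
    fill(255, 220, 0);
    ellipse(100, 100, 160, 160);   // head
    fill(0);
    ellipse(70, 80, 16, 24);       // left eye
    ellipse(130, 80, 16, 24);      // right eye
    noFill();
    strokeWeight(4);
    arc(100, 115, 80, 50, 0, PI);  // smile: the lower half of an ellipse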

I’m not a trained programmer; I wasn’t new to programming altogether, but I was new to Processing, and for a while I didn’t have much reason or time to do more with it. But last winter I found myself highly motivated to spend some of my spare time making sense of tens of thousands of pages of text images from the Internet Archive that were, for my purposes, undifferentiated. The raw, uncorrected OCR was not much help. I wanted to be able to visually scan all of them, start reading some of them, and begin to make some quick, non-exhaustive indexes in preparation for what is now a more intensive full-text grant-funded digitization effort (which I will also be glad to talk about, but that’s another story). I wanted to find out things that just weren’t practical to learn at the scale of dozens of reels of microfilm.

Processing has turned out to be perfect for this. It’s not just good for cartoon faces or for artistic and complex data visualizations (though it is excellent for those). It is well suited to bootstrapping little scraps of knowledge into quick cycles of gratifying incremental improvements. I ended up cobbling together a half-dozen relatively simple throwaway tools, highly customized to the particular reading and indexing I wanted to do: minimizing keystrokes, maximizing what I could get from the imperfect information available to me, and efficiently recording what I wanted to record while scanning through the material.
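
To give a concrete sense of what I mean by a throwaway tool, here is a stripped-down sketch in the spirit of the ones I wrote (the file names are placeholders, and the real versions accumulated more shortcuts): it pages through a list of images with the arrow keys and records a one-keystroke note about the current page.

    // Bare-bones page reader: arrow keys to page, digit keys to tag.
    // "pages.txt" (one image filename per line) and the CSV name are
    // placeholders for whatever your own material looks like.
    PImage page;
    String[] files;
    int index = 0;
    PrintWriter log;

    void setup() {
      size(800, 1000);
      files = loadStrings("pages.txt");
      log = createWriter("index-notes.csv");
      page = loadImage(files[index]);
    }

    void draw() {
      background(255);
      image(page, 0, 0, width, height);
    }

    void keyPressed() {
      if (key == CODED) {
        if (keyCode == RIGHT && index < files.length - 1) { index++; page = loadImage(files[index]); }
        if (keyCode == LEFT  && index > 0)                { index--; page = loadImage(files[index]); }
      } else if (key >= '1' && key <= '9') {
        // one keystroke logs the page filename plus a category digit
        log.println(files[index] + "," + key);
        log.flush();
      }
    }

Everything beyond that (jumping to a page, zooming, richer note fields) is a matter of adding a few lines at a time, which is exactly the kind of incremental cycle I mean.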

Having spent plenty of hours with the clicks, screeches, and blurs of microfilm readers, I can say that being able to fix up your own glorified (silent) virtual microfilm reader with random access is a wonderful thing. (It’s also nice that the images are never reversed because the person before you didn’t rewind to the proper spool.) And immensely better than PDF, too.

At THATCamp I would be glad to demonstrate, and would be interested in talking shop more generally about small quasi-artisanal skills, tools, and tips that help get stuff done — the kind of thing that Bill Turkel and his colleagues have written up in The Programming Historian, but perhaps even more preliminary. How do you get structured information out of a PDF or word processing document, say, and into a database or spreadsheet? Lots of “traditional humanists,” scholars and librarians, face this kind of problem. Maybe sometimes student labor can be applied, or professional programmers can help, if the task warrants and resources permit. But there is a lot of work that is big enough to be discouragingly inefficient with what may pass for standard methods (whether note cards or word processing tools), and small enough not to be worth the effort of seeking funding or navigating bureaucracy. There are many people in the humanities who would benefit from understanding the possibilities of computationally assisted grunt work. Like artificial lighting, some tools just make it easier to read what in principle you could have found some other way to read anyway. But the conditions of work can have a considerable influence on what actually gets done.
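
To make the PDF-to-spreadsheet question concrete: even a first sketch can do a crude version of that extraction. Something like the following (the file names and the pattern are placeholder examples) reads a plain-text export line by line and writes anything shaped like “Field: value” out as CSV rows, ready for a spreadsheet or database import:

    // Toy extraction: pull "Field: value" lines from a plain-text
    // export (e.g. a PDF saved as text) into CSV. File names and the
    // pattern are placeholders; real fields containing commas would
    // need quoting.
    String[] lines = loadStrings("document.txt");
    PrintWriter out = createWriter("extracted.csv");
    out.println("line,field,value");
    for (int i = 0; i < lines.length; i++) {
      String[] m = match(lines[i], "^(\\w+):\\s*(.+)$");
      if (m != null) {
        out.println((i + 1) + "," + m[1] + "," + m[2]);
      }
    }
    out.flush();
    out.close();

Crude, but it is the difference between retyping and not retyping.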

More abstractly and speculatively, it would be interesting to talk about efficiencies of reading and scale. Digital tools are far from the first to address and exacerbate the problem that there is far more to be read and mapped out than any single person can cope with in a lifetime. Economies of effort and attention in relation to intellectual and social benefit have long shaped what questions can be asked and do get asked, and to some extent what questions can even be imagined. Digital tools can change these economies, although not in deterministically progressive ways. Particular digital applications and practices have all too often introduced obfuscations and inefficiencies that limit what questions can plausibly be asked at least as much as microfilm does. Which is why discussions even of low-level operational methods, and their consequences, can be of value. And where better than THATCamp?

9 Responses to “Easy readers”

  1. Frédéric Clavert Says:

    Hello,

    I think your idea is a good one and would like to participate in such a session.

    Your post makes me think of what could be called “the high cost of entering the digital world” for non-digital humanists: the little things you have to learn to be efficient with digital tools – even “simple” ones like word processors – cost too much (in terms of learning time) in the eyes of many humanists and historians, and that’s often why they are skeptical about digital humanities.

    Frédéric

  2. ghbrett Says:

    While not programming per se, I have had a long-term interest in the notion of general-purpose tools: how one might use a spreadsheet as a flat-file database, for example, or a text file to store boilerplate. I admit that these tricks were more interesting in the mid-1980s, before folks were using them a lot, but I believe there are still ways to think about how to extend or tailor Google Docs, or SurveyMonkey, or WordPress, or Atlassian’s Confluence, or possibly even Twitter. So I’m interested in learning, discussing, and even collecting ideas about what the new general-purpose tools are and how they can better serve the Humanities as well as Research & Education.

    — George

  3. Sterling Fluharty Says:

    It would be really interesting if this became a guide or resource for digital humanities hacks, especially since Bill Turkel retired his blog.

  4. Peter Jones Says:

    I would definitely be interested in this discussion for several reasons:

    1) I’m also in between the “programmer” and neophyte stages. Thus Processing sounds interesting and useful.

    2) Almost all scholars will be doing this sort of reading online (just think about Google Books). Easily created tools to make this work better sound like a great investment (I think Zotero broadly fits this definition; look at its success).

    3) My next job involves reading many documents online, and anything that can make this more efficient and successful using OCR or metadata sounds like a great use of digital humanities tools.

    I’m not sure about others’ interests, but this topic certainly looks useful to me.

  5. Musebrarian Says:

    I’m also interested in learning more about Processing, and in things like Flare (flare.prefuse.org/) and the JavaScript InfoVis Toolkit (thejit.org/).

    It would be nice to have a discussion about the relative merits of different approaches & languages for this kind of work – especially as it relates to the kinds of humanities materials we are working with.

    For the Collection Dashboard, we’re also exploring existing online APIs and services that help with some of this work. My colleague Piotr Adamczyk has built a nice site that explores tools like Yahoo! Pipes, Many Eyes, and Google charts and mapping (museumpipes.wordpress.com).

  6. Karin Dalziel Says:

    I too am a beginning programmer; I know just enough to be painfully aware of what I don’t know. Processing is the next thing I’d like to learn, mostly because I’d like to get a lot more into visualizing data.

  7. Douglas Says:

    I’m pretty much in the same situation as Karin — I know what I want to do, and I know (roughly) how it can be done, but I don’t know how I can actually do it (in particular with respect to data visualization). I’ll definitely take a look at Processing; I’ve also been experimenting with SEASR (seasr.org/), which lets a non-programmer string together pre-made text processing and data visualization modules. I’m still not very good at it though 😉

    I really like the idea of learning about these tools and their uses (and how they can be productively hacked), and also of building a community-developed resource so we can continue to share our expertise and experimentation.

  8. thowe Says:

    I’d also love to learn more about these tools, like Processing and prefuse, especially since I am a non-programmer. Like Karin and Douglas, I know what I want to do, but not how! I can figure out, logically, what might work and what kinds of steps would be needed, but the how of using these tools on my own is, I fear, beyond me. Visualizing text in ways other than word/tag clouds, trees, and so on is also deeply mysterious to me. I know that prefuse can be used to work with text, but I have absolutely no idea how!

  9. THATCamp » Blog Archive Says:

    […] grasp and retain content. This strikes me as similar, as well, to the kinds of issues raised by Douglas Knox–using scale and format to retrieve “structured information.” Do the […]