Blog

David Staley discussing his digital installation, Syncretism

Friday, July 3rd, 2009 | Dave

As mentioned in his previous blog post, David Staley displayed a digital installation in the Showcase center during THATCamp. Here’s video of David discussing his work in greater detail:

Session Notes: Libraries and Web 2.0

Wednesday, July 1st, 2009 | Vika Zafrin

These are the notes from the first breakout session I attended, Libraries and Web 2.0. People attending included “straight-up” librarians, digital humanists, and even a programmer at NCSA. Let’s see if I can capture what we talked about.

The European Navigator was originally intended to show people what the EU is, in general. But then teachers started using it in classrooms, with great success, and later began asking for specific documents to be added. The site talks about historical events, has interviews and “special files”, has a section devoted to education, and one for different European organizations. The interface is intricate yet easy to use, and uploaded documents (some of them scanned) are well captioned.

Teachers are asking for more on pedagogy/education, but the site’s maintainers feel they don’t have the skills to oblige. [vz: So are teachers willing to contribute content?] The site is having some technical problems: the back end was based on an Access database exported into SQL (exporting is painful! quality control of exports takes a lot of time), and the front end is Flash (slow); they’ll be changing that. It’s made as a browser, which means a navigator within a navigator (which, Frederic Clavert says, is bad, because it doesn’t lend itself to Web 2.0 tool addition; vz: plus accessibility is pretty much shot, and they haven’t created special accessibility tools), and they have to ask users to contribute content, which ends up being too biased.

They do know who their audience is: they did a study of their users in 2008. That’s a great and important thing to do, for libraries.

They’re migrating to the Alfresco repository, which seems to be popular around the room. They want annotation tools, comment tools, a comment rating engine, maybe a wiki, but ultimately aren’t sure what web 2.0 tools they’ll want. They’re obliged to have moderators of their own to moderate illegal stuff (racist comments, for example), but for the most part it seems that the community will be able to self-regulate. Researchers who are able to prove that they’re researchers will automatically have a higher ranking, and they’re thinking of a reputation-economy classification of users, where users who aren’t Researchers From Institutions but contribute good stuff will be able to advance in ranking. But this latter feature is a bit on the backburner, and (vz) I don’t actually think that’s a good thing. Starting out from a position of a default hierarchy that privileges the academe is actively bad for a site that purports to be for Europe as a whole, and will detract from participation by people who aren’t already in some kind of sanctioned system. On the other hand, part of ENA’s mission is specifically to be more open to researchers. They’re aware of the potential loss of users, and have thought about maybe having two different websites, but that’s also segregation, and they don’t think it’s a good solution. It’s a hard one.

On to the Library of Congress, Dan Chudnov speaking. They have two social-media projects: a Flickr project that’s inaugurating Flickr Commons, and YouTube, where LC has its own channel. YouTube users tend to be less serious/substantial in their responses to videos than Flickr users are, so while LC’s Flickr account allows (and gets great) comments, their YouTube channel just doesn’t allow comments at all.

They’ve also launched the World Digital Library, alongside which Dan presented the Europeana site. They are available in seven and six languages, respectively (impressive!). WDL has multi-lingual query faceting; almost all functionality is JavaScript-based and static, and comes out of Akamai, with whom LC has partnered, so the site is really really stable: on the day they launched, they had 35 million requests per hour and didn’t go down. Take-away: static HTML works really well for servability and reliability and distributability. Following straightforward web standards also helps.

Good suggestion for Flickr Commons (and perhaps Flickr itself?): comment rating. There seems to be pushback on that; I wonder why? It would be a very useful feature, and people would be free to ignore it.

Dan Chudnov: the web is made of links, but of course we have more. Authority records, different viewers for the big interconnected web, MARC/item records from those, but nobody knows that. More importantly, Google won’t find it without screenscraping. What do you do about it? Especially when you have LC and other libraries having information on the same subject that isn’t at all interconnected?

Linked data, and its four tenets: use URIs as names for things; use HTTP URIs; provide useful information; include links to other URIs. This is a great set of principles to follow; then maybe we can interoperate. Break down your concepts into pages. Use the rel attribute, and embed information in what HTML already offers. So: to do web 2.0 better, maybe we should do web 1.0 more completely.
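
As a rough illustration of the "embed information in what HTML already offers" idea, here is a minimal Python sketch that pulls rel-annotated links out of a page using only the standard library. The sample page, its example.org URLs, and the names RelLinkParser and extract_rel_links are all invented for illustration; nothing like this was shown in the session.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect (rel, href) pairs from <a> and <link> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        rel, href = attr_map.get("rel"), attr_map.get("href")
        if rel and href:
            # A rel attribute may hold several space-separated values.
            for value in rel.split():
                self.links.append((value, href))


def extract_rel_links(html_text):
    """Return every (rel, href) pair found in the given HTML string."""
    parser = RelLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical page: the URLs are placeholders, not real resources.
SAMPLE_PAGE = """
<html><head>
<link rel="alternate" type="application/rdf+xml" href="http://example.org/item/1.rdf">
</head><body>
<a rel="next" href="http://example.org/item/2">Next item</a>
<a href="http://example.org/about">About</a>
</body></html>
"""

if __name__ == "__main__":
    for rel, href in extract_rel_links(SAMPLE_PAGE):
        print(rel, href)
```

The point of the sketch is only that a page which declares its relationships in plain HTML can be read by a dozen lines of code, with no screenscraping heuristics needed.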

One site that enacts this is Chronicling America. Hundreds of newspapers from all over the country. Really great HTML usage under the hood; so now we have a model! And no “we don’t know how to do basic HTML metadata” excuse for us.

Raymond Yee raises a basic point: what is Web 2.0? These are the basic principles: it’s collective intelligence; the web improves as more users provide input. Raymond is particularly interested in its remixability and decomposability, and in making things linkable.

So, again, takeaways: follow Web 1.0 standards; link to other objects and make sure you can link your own objects; perhaps don’t make people get a thousand accounts, so maybe interoperate with OpenID or something else that is likely to stick around? Use encodings that are machine-friendly and machine-readable: RDF, JSON, XML, METS, OpenSearch, etc. Also, view other people’s source! And maybe annotate your source, and make sure you have clearly formatted source code?

There’s got to be a more or less central place to share success stories and best practices. Maybe Library Success? Let’s try that and see what happens.

(Edited to add: please comment to supplement this post with more information, whether we talked about it in the session or not; I’ll make a more comprehensive document out of it and post it to Library Success.)

Digital training session at 9am

Sunday, June 28th, 2009 | Amanda French

So @GeorgeOnline (whose last name I simply MUST discover) has set up several platforms at teachinghumanities.org, and we semi-agreed over Twitter that it’d be fun to use the 9am “Digital Training” session to build it out a bit. Gee, anyone have a laptop they can bring?

Do please let us know via Twitter or comments on this post whether you’d like to use the session for that purpose; far be it from me to curtail conversation, especially the extraordinarily stimulating sort of conversation that has so far been the hallmark of THATCamp.

Six degrees of Thomas Kuhn

Saturday, June 27th, 2009 | shermandorn

The recent PLoS ONE article on interdisciplinary connections in science made me wish instantly for a way to map citation links between individuals at my institution.

From Bollen J, Van de Sompel H, Hagberg A, Bettencourt L, Chute R, et al. 2009 Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE 4(3): e4803. doi:10.1371/journal.pone.0004803

So the authors of the article looked for connections among huge areas and journals. In practice, interdisciplinary collaboration is helped tremendously by individualized matchmaking. The clickstream data for Bollen et al is one example of “linkage” but there are others: Google Scholar can probably help connect scholars at individual institutions by the sources they use in common. The title is a misnomer: trying to follow sequential citations to find the grand-grand-grand-grand-grandciters of Thomas Kuhn would be overkill and impractical. First-level citation overlaps would identify individuals who share either substantive or methodological understandings.
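
To make "first-level citation overlaps" concrete, here is a minimal Python sketch that compares the sets of sources cited by each pair of scholars. Everything in it is an assumption for illustration: the scholar labels and citation strings are made up, and the data is typed in by hand rather than fetched from Google Scholar or any other service.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of all cited sources that two scholars have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def citation_overlaps(cited_by_scholar, min_shared=1):
    """Return (scholar_a, scholar_b, shared_sources, jaccard) rows, largest overlap first."""
    rows = []
    for name_a, name_b in combinations(sorted(cited_by_scholar), 2):
        refs_a, refs_b = cited_by_scholar[name_a], cited_by_scholar[name_b]
        shared = refs_a & refs_b
        if len(shared) >= min_shared:
            rows.append((name_a, name_b, sorted(shared), jaccard(refs_a, refs_b)))
    # Rank by number of shared sources, then by Jaccard similarity.
    rows.sort(key=lambda row: (-len(row[2]), -row[3]))
    return rows


# Hypothetical data: labels and citations are placeholders, not real records.
SAMPLE_CITATIONS = {
    "historian": {"Kuhn 1962", "Foucault 1966", "Latour 1987"},
    "sociologist": {"Kuhn 1962", "Latour 1987", "Merton 1973"},
    "physicist": {"Kuhn 1962", "Feynman 1965"},
}

if __name__ == "__main__":
    for a, b, shared, score in citation_overlaps(SAMPLE_CITATIONS):
        print(f"{a} / {b}: {len(shared)} shared ({score:.2f}) {shared}")
```

A matchmaking tool would presumably want something smarter than raw set overlap (everybody cites Kuhn), but even this much would surface pairs worth introducing to each other.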

I thought this was impossible until one fellow camper told me at the end of the day that there is a Google Scholar API available to academics. Woohoo! Is any enterprising programmer interested? Or someone who works at a DH center interested in getting this started? Or someone….

Incidentally, I suspect that there are many possible data sources (six degrees of Twitter @ refs?) and ways of working the practical uses of this (seeing detailed overlaps for two specified individuals, or identifying summary overlaps for groups of individuals at a university, in an organization, attending a conference, etc.).

And, finally, … yes, to answer the logical question by those at the last session today in the GMU Research I auditorium, the Bollen piece is catalogued in visualcomplexity.com.

Context & Connections notes

Saturday, June 27th, 2009 | shermandorn

The following are my raw notes on the Saturday morning “Context and Connections” session. Assume that unless otherwise noted, they are paraphrases of comments made by participants (either labeled or not). It began with a note that this was about making connections with and adding context to historical document collections (e.g., The Papers of Thomas Jefferson with Monticello/Jefferson Foundation, on the UVA Press Rotunda site), but this is about both research and teaching. The problem in the classroom: students often USE digital archives but do not interact with them in terms of mashups (or scholars, with contribution).

Someone suggested this is sort of like Thomas Jefferson’s FB page: who were his friends, etc.

Montpelier/Madison foundation has a hierarchical set of keywords and two separate databases for names that may not interact.

Problem of places/data sets that do not talk to each other (e.g., LoC has largest set of Jefferson papers, but limited (and difficult-to-read) image sets).

So if there’s a suite of tools, is there one appropriate for both archivist/research community and for students?

MIT Media Lab’s Hugo Liu has an older project that simulated “what would they think?” AI.

Web forces textual connections (links). E.g., Wikipedia keyword linkages. It is not required to rely on a folksonomy; could have a multi-level tagging system (by persona).

How much text-mining (by computer) and how much is intensity of analysis/interpretive-focused? LoC project on Civil War letters is on the second end of the spectrum.

From library/archive world: WordPress has hierarchical categories AND (nonhierarchical) tags

Someone asked about a tag suggestion system. Someone noted that one existed with delicious.

Another person: Try Open Vocab. That does move it into the semantic

What to do with “rough piles” of tags, etc. If the tags accrete, we will want to analyze who tags how, and how that changes depending on context (and time).

“That sounds like scholarship.”

Conversation. “That sounds like scholarship.”

Tags aren’t enough. Conversation isn’t enough. I want both.
We want a person behind that tag.

The Old Bailey is working on this problem: the largest holding of information on dead proletarians in the world, and how do we make connections among sparse information (e.g., Mary arrested as prostitute, with place of work, date, and pimp)?

We need a Friendster of the Dead.

Maybe a way of figuring out by context who wrote (or context of writing).

[Sherman]: Like quantitative ways of guessing authors of individual Federalist Papers, except less well defined

Archivists have to do that all the time — “what did this word mean”? Time and place contexts

A question of how much preprocessing is required…

We need a way of mapping concepts across time. There’s only so much computationally that you can do. A social-networking peer review structure, so that experts winnow out the connections that a program suggests.

That’s a task good for socializing students — give them a range of potential connections, make them winnow the set and justify the judgments.

As a scholar, I need computers to suggest connections that I will judge by reading the sources.

Library (archival collection) no longer provides X that scholars use. There needs to be a conversation/collaboration.

Philologists on disambiguation: that’s a tool I can use.

Toolbuilding is where these connections will be made: with Zotero and Omeka, I spend as much time talking with archivists/librarians as with scholars.

Does anyone know about the Virtual International Authority File?

There are standards for marking up documents in public format? Will that standardization translate to what we do online, which is much more loose and free among digital humanists?

Back channel link to Historical Event Markup and Linking (HEML) project.

The “related pages” links for Google sometimes work for documents.

You don’t know why something is coming up as similar, and that’s a personal disambiguation process (reflection).

Discussion about extending core function of Text Encoding Initiative.

Discussion around www.kulttuurisampo.fi/ about intensity of work, selection of projects, etc.

DBpedia: controlled-vocabulary connection analysis for Wikipedia from infoboxes on articles, but the software is open-source (and could be applied to any MediaWiki site).

Keep an eye on the IMLS website! – there is a project proposal to use TEI for other projects.

More on Libraries of Early America

Saturday, June 27th, 2009 | endrina tay

So I didn’t have time to ask the two questions I had for everyone during my Dork Shorts session on the Libraries of Early America, so here they are …

1. I’m very keen to hear what folks think about how this sort of data might be used by scholars and for teaching.

2. What kinds of visualizations would folks be interested in experimenting with using such bibliographical data, e.g. date/place of publication, publishers, content, etc.?
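
As one possible starting point for the second question, here is a minimal Python sketch that tallies records by place and decade of publication and prints crude text bars. The record layout, field names, and sample entries are assumptions made up for illustration; they are not the actual LibraryThing data or export format.

```python
from collections import Counter

# Hypothetical records: titles, places and years are placeholders only.
SAMPLE_BOOKS = [
    {"title": "Book A", "place": "London", "year": 1751},
    {"title": "Book B", "place": "Paris", "year": 1758},
    {"title": "Book C", "place": "London", "year": 1764},
    {"title": "Book D", "place": "Philadelphia", "year": 1769},
    {"title": "Book E", "place": "London", "year": None},  # undated item
]


def decade_of(year):
    """Map a year such as 1764 to the start of its decade, 1760."""
    return (year // 10) * 10


def tally(books):
    """Count books per place and per decade, skipping missing values."""
    by_place = Counter(b["place"] for b in books if b.get("place"))
    by_decade = Counter(decade_of(b["year"]) for b in books if b.get("year"))
    return by_place, by_decade


def text_bars(counter):
    """Render a Counter as text bars, most frequent first."""
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [f"{key:<14}{'#' * count} {count}" for key, count in ordered]


if __name__ == "__main__":
    places, decades = tally(SAMPLE_BOOKS)
    print("\n".join(text_bars(places)))
    print("\n".join(text_bars(decades)))
```

The same tallies could feed a map or a timeline; the text bars are only there to keep the sketch free of plotting libraries.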

The links again:

Thomas Jefferson’s library on LibraryThing

Subject tag cloud | Author tag cloud

Examples of interesting overlaps:

Books shared with John Adams

Books shared with George Washington

The list of Collections in the pipeline is here. This is a subset of the larger Legacy Libraries or “I See Dead People’s Books” project.

Crowdsourcing – The Session

Saturday, June 27th, 2009 | Lisa Grimm

Being a semi-liveblog of our first session of the day – please annotate as you see fit (and apologies if I left anything or anyone out).

Attendees: Andy Ashton, Laurie Kahn-Leavitt, Tim Brixius, Tad Suiter, Susan Chun, Josh Greenberg, Lisa Grimm, Jim Smith, Dan Cohen

Lisa: Kickoff with brief explanation of upcoming project needing crowdsourcing.

Susan: Interested in access points to large-scale collections – machine-generated keywords from transcriptions/translations, etc. Finding the content the user is most likely to engage with.

Josh: Landing page of the site to steer the crowd to certain areas – flickr commons?

Susan: Discovery skillset? Asking users – ‘what are you interested in?’ Discipline-specific, multi-lingual vocabularies could be generated?

Josh: Getting more general: moving beyond the monoculture – what is the crowd? Layers of interest; Figuring out role – lightweight applications tailored to particular communities. NYPL historical maps project example – can we crowdsource the rectification of maps? Fits well w/community dynamics, but the information is useful elsewhere. Who are the user communities?

Laurie: Relation between face to face contact and building a crowdsource community? Susan & Josh’s projects have large in-person component.

Defining the need for crowdsourcing – what is the goal? Josh likes notion of hitting multiple birds with one stone. What is the crowd’s motivation? How can we squeeze as many different goals as possible out of one project?

Tad: issue of credentialing – power of big numbers.

Jim: Expert vs. non-expert – research suggests amateurs are very capable in certain circumstances.

Susan: Dating street scenes using car enthusiasts – effective, but key is in credentialing.

Andy: The problem of the 3% of information that isn’t good – the 97% that’s correct goes by the wayside. Cultural skepticism over crowdsourcing, but acceptance of getting obscure information wherever possible (e.g. ancient texts). Looking into crowdsourcing for text encoding. Data curation and quality control issues to be determined. Interested to see Drexel project results next year?

Susan: Human evaluation of results of crowdsourcing – tools available from the project there. (Yay!)

Jim: Transitive trust model – if I trust Alice, can I trust Bob?

Josh: Citizen journalism, e.g. Daily Kos – self-policing crowd. Relies on issues of scale, but not just ‘work being done that’s not by you.’ Cultural argument about expertise/authority – ‘crowd’ meaning the unwashed vs. the experts.

Susan: Long tail is critical – large numbers of new access points. How to encourage and make valuable?

Tad: Translations: ‘they’re all wrong!’ (Great point).

Andy: Depth, precision & granularity over breadth

Jim: Unpacking the digital humanities piece – leveling effect. Providing an environment for the community, not just a presentation.

Josh: Using metrics to ‘score’ the crowd.

Tad: Wikipedia example – some interested in only one thing, some all over.

Josh: Difference between crowdsource activity as work vs. play. Treating it as a game – how to cultivate that behavior?

Susan: Fun model relies on scale.

Josh: MIT PuzzleHunt example; how to create a game where the rules generate that depth?

Susan: Validation models problematic – still requires experts to authorize.

Tad: Is PuzzleHunt work, rather than play?

Andy: NITLE Predictions Market – great example of crowdsourcing as play.

Dan: Still hasn’t gotten the scale of InTrade, etc. – how to recruit the crowd remains the problem. Flickr participation seems wide, but not deep.

Josh: Compel to do job because they have to, do Amazon Mechanical Turk model and pay or get deeper into unpacking the motivations between amateur and expert communities.

Susan: Work on motivation in their project determined that invited users tagged at a very much higher rate vs. those who have just jumped in.

Susan: Paying on Mechanical Turk not as painful as it might be – many doing tons of work for about $25.

Josh: So many ways to configure crowdsourcing model – pay per action? Per piece? Standards & practices don’t exist yet.

Susan: We’ve talked a lot about them, but there are still relatively few public crowdsourcing projects.

Dan: Google averse to crowdsourcing (GoogleBooks example) – they would rather wait for a better algorithm (via DH09).

Susan: But they have scale!

Dan: Data trumps people for them.

Andy: Image recognition – it’s data, but beyond the capabilities now.

Dan: Third option: wait five years – example of Google’s OCR. Google has the $$ to re-key all Google Books, but they are not doing it.

Josh: Google believes that hyperlinks are votes –

Dan: Latent crowdsourcing, not outright

Susan: Translation tools largely based on the average – our spaces don’t fit that model

Tad: Algorithm model gives strong incentive to proprietary information – you have everything invested in protecting your information, not openness.

Dan: OpenLibrary wiki-izing their catalog, vs. Google approach. Seems purely an engineering decision.

Andy: Approach informed by a larger corporate strategy – keeping information in the Google wrapper. Institutional OPACs almost always averse to crowdsourcing as well. What is the motivating factor there?

Josh: Boundary drawing to reinforce professional expertise and presumption that the public doesn’t know what it’s doing.

Andy: Retrieval interfaces horrible in library software – why keep best metadata locked away.

Sending around link to Women Physicians…

Susan: different views for different communities – work with dotSub for translation.

Dan: Other examples of good crowdsourced projects?

Susan: Examples of a service model?

Josh: Terms of service? Making sure that the data is usable long-term to avoid the mistakes of the past. Intellectual property remains owned by person doing the work, license granted to NYPL allowing NYPL to pass along license to others. Can’t go back to the crowd to ask for permission later. Getting users to agree at signup key. Rights and policies side of things should appear on blog in future.

Jim: Group coding from Texas A&M moved into a crowdsourcing model – future trust model ‘model’

Please continue to add examples of projects (and of course correct any ways I’ve wildly misquoted you).

It would be great to have some crowdsourcing case studies – e.g., use flickr for project x, a different approach is better for project y…

Museum Content–A Couple of Ideas

Saturday, June 27th, 2009 | schun

Posting to the THATCamp blog *so* late has allowed me to change the focus of my proposed session and to consider my very most recent project. For reference (and perhaps post-conference follow-up), I’m posting a description of my original THATCamp proposal, in addition to some thoughts about a possible session about searching of museum records:

My original proposal involved a project called “The Qualities of Enduring Publications” that I developed at The Metropolitan Museum of Art during the financial crisis that followed the 9/11 attacks. Faced with a deficit budget resulting from severely diminished attendance, the museum planned to implement radical budget cuts, including significant cutbacks in publishing. In light of these cutbacks, I was interested in examining the essential nature of the publications (for 2002, read: books and print journals) that the discipline was producing and reading, and in thinking about what gives an art history publication enduring value. The question was examined through a personal prism, in a series of small workshops (ca. 10 participants each) at the Met and at museums around the country. Participants came to the workshop having selected one or two publications that had had enduring value for them in their professional lives–books that they had consulted regularly, had cited frequently, or had used as models for their own publications. A few minutes at the start of the workshop were spent sharing the books, after which I (as workshop chair) began the discussion, which centered around a series of simple scripted questions, to which answers were recorded for later analysis. The questions asked whether titles had been selected for (for example) the fidelity of the reproductions, for the lucidity of the prose, for the multiplicity of voices, for the well-researched bibliography, and so on. The workshops were fascinating, not just for the results they produced (the publications most valued by art historians had relatively little in common with the gigantic multi-authored exhibition catalogues produced by museums during that time frame), but also for the lively conversation and debate that they engendered amongst museum authors and future authors.

I have recently been encouraged to expand the workshop scope to include participants and titles from all humanities disciplines, as well as to consider the impact of electronic publishing and distribution on an individual’s choices. Staging the new version of the workshop will require the recruitment of workshop chairs from across the country and throughout the humanities, and the drafting of a series of additional questions about the ways in which electronic publishing might impact a participant’s thinking about his or her enduring publications. I had hoped to use THATCamp as an opportunity to identify potential workshop chairs in humanities disciplines other than art history, to examine the existing workshop discussion template and to work on the questions to be added on e-publishing, and to think about ways to analyze a (much larger) body of responses, perhaps considering some bibliometric analysis techniques.

Though I’m still interested in speaking informally with ANY THATCamp participant who might be interested in participating in the expanded “Qualities of Enduring Publications” workshops, I’m actually focused right now on a newer project for which some preliminary discussion is needed to seed the project wiki. Along with colleagues at ARTstor and the Museum Computer Network, I’ll be organizing a team that will examine the user behaviors (particularly search) in repositories that aggregate museum records. The project, which will take place during the six weeks before the Museum Computer Network conference in November, 2009, will involve analysis of the data logs of ARTstor, the museum community’s key scholarly resource for aggregated museum records, as well as logs from other libraries of museum collection information, including (we hope) CAMIO and AMICA. A group of recruited participants will consider the logs, which will be released about six weeks before the November conference, articulate questions that might be answered by interrogating the data, and write and run queries. We’ll also think about ways to establish and express some useful ways to query and analyze an individual museum’s search logs, and will use these methods to look at the logs of participants’ museums, as a baseline for comparison with the ARTstor, CAMIO, and AMICA records. At an all-day meeting during MCN, we’ll gather to examine the preliminary results; discuss, modify, and re-run the queries; and work together to formulate some conclusions. In the eight weeks after the meeting, ARTstor staff and/or graduate student volunteers will produce a draft white paper, which will circulate to the meeting participants before being released to the community at large.
Although the project is limited in scope (we have not yet figured out how to get any useful information about how users of Google look for museum content), we hope that it will help museums to begin to think about how their content is accessed by users in the networked environment using real evidence; at present, very little quantitative information about user behaviors (including which terms/types of terms are used to search, whether searches are successful, which objects are sought) is available. Results could have lasting impact on museum practice, as organizations prioritize digitization and cataloguing activities, and consider what content to contribute to networked information resources. I hope that a discussion at THATCamp might provide some seed content for the project wiki, which will serve as the nexus of discussion about what questions we will ask, and about what methods will be used to answer them.
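
To give a flavor of the kind of query that might be run against a search log, here is a minimal Python sketch that counts search terms and the share of searches returning nothing. The two-column CSV layout (query, result_count) and the sample rows are assumptions invented for illustration; the actual ARTstor, CAMIO, and AMICA log formats are not described here and will certainly differ.

```python
import csv
import io
from collections import Counter

# Hypothetical log extract: the columns and rows are placeholders only.
SAMPLE_LOG = """query,result_count
rembrandt self portrait,42
greek vase,0
Rembrandt etching,17
ming vase,5
,3
"""


def summarize_search_log(csv_text):
    """Return (term_counts, total_searches, zero_result_rate) for a CSV log."""
    term_counts = Counter()
    total = 0
    failed = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = row["query"].strip().lower()
        if not query:
            continue  # ignore rows with an empty search box
        total += 1
        if int(row["result_count"]) == 0:
            failed += 1
        term_counts.update(query.split())
    zero_result_rate = failed / total if total else 0.0
    return term_counts, total, zero_result_rate


if __name__ == "__main__":
    terms, total, rate = summarize_search_log(SAMPLE_LOG)
    print(f"{total} searches, {rate:.0%} with no results")
    for term, count in terms.most_common(5):
        print(term, count)
```

Treating "zero results" as an unsuccessful search is itself a simplification; deciding what counts as success is exactly the sort of question the project wiki is meant to hash out.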

ICONCLASS demo/discussion

Friday, June 26th, 2009 | eposthumus

Vying for a spot on the ‘absolutely last-minute proposal postings’ roster, here’s mine:

A demo and discussion of the ICONCLASS multilingual subject classification system (www.iconclass.nl)

This system might be known to students of Art History and hard-core classification library science geeks, but it has applicability to other fields in cultural heritage. Originally conceived for use by Art History Prof Henri van de Waal in the Netherlands, it has matured over the past 40 years and is in use internationally. Over the past few years we have made several new digital editions and software tools, and have applied it to diverse other fields including textual content. In the near future we will be making a brand-new ‘illustrated’ version public, and hope to also make it a Linked Data node.

A session showing what it is and how to use it, or a more advanced discussion on thematic classification is possible, depending on feedback.

Mapping Literature

Friday, June 26th, 2009 | barbarahui

Apologies for the extremely last-minute post, which I’m writing on the plane en route to THATCamp!

In a nutshell, what I’d like to discuss in this session is the mapping of literature. By this I mean not only strictly geographical mappings (i.e. cartographical/GIS representations of space and place) but also perhaps more abstract and conceptual mappings that don’t lend themselves so well to mathematical geospatial mash-ups.

How can we (and do we already) play with the possibilities of existing technology to create great DH tools to read literature spatially?

I’ll first demo my Litmap project and hopefully that’ll serve as a springboard for discussion. You can read more about Litmap and look at it ahead of time here.

Very much looking forward to a discussion with all of the great people who are going to be there!