Comments on: Zotero and Semantic Search

By: Sterling Fluharty

Sterling Fluharty — Mon, 15 Jun 2009 02:21:55 +0000

Thanks for the lengthy reply. I don’t hear traditional historians clamoring for software that can analyze text. So maybe the development of text mining is one area where digital history will have to pursue its own agenda. I am getting ready to write a text mining post, so stay posted. One of the things we need to think about is designing programs that classify and describe historical objects and documents with some level of human supervision. This will help to reduce a fair amount of the workload, if we can pull it off.

By: Arden Kirkland

Arden Kirkland — Thu, 11 Jun 2009 17:33:18 +0000

Between this post and the post about Standards, I think we have quite a mammoth session to look forward to. Perhaps we need to break this down somehow into multiple sessions?

Anyway – my responses:

-The digital history I have followed “simply facilitates the research agenda of traditional history.”
I’d be very interested to see examples of projects where “digital history has its own agenda.”

-There is absolutely a need for controlled vocabularies to be not only “non-proprietary” but also inclusive in other ways, created as a result of collaboration between multiple institutions so that diverse collections can be represented. I could share the different, not very strategic, ways the costume history community is responding to this need, and I’d be interested to hear about how other fields are responding.

-In my field, my concern is not so much for “presentist ontologies” as it is for “art historicist” ontologies – paintings, for example, are typically catalogued by visual resources librarians for use in art history databases, and rarely mention details of costume portrayed therein. To see what I mean, take a look at this great slideshow about user contributed metadata. This relates to the point that rnelson2 makes in his post about Standards, about how ontologies can exclude alternate interpretations.

-Your discussion of Zotero vs. Amazon vs. Pandora is very thought-provoking. Even though I’m a devoted Pandora user, I hadn’t really thought about it as a model for other kinds of searches. I’ve been saying for years that academic sites should use Amazon as a model, in terms of its intuitive interface. However, I agree that Pandora is a wonderful example of “how the semantic web trumps web 2.0.” Looking at the list of attributes used to catalog a song for Pandora, it reminds me of the complexity involved with cataloging costume, and provides an appropriate solution in terms of the layering of attributes (as opposed to the either/or of some ontologies).

However, while we may aspire to a Pandora-type model, I don’t know how realistic it is for other kinds of fields. Pandora is a commercial model for a very economically viable field with a wide audience – the time spent cataloging results in financial profit. My field is somewhat the opposite of this – small, specialized, underfunded, overworked. Even if a large, popular museum collection was able to come up with the funding to pay catalogers to pursue this approach, their collections are so large that they would either only catalog highlights of their collection, or it would take forever! A small collection like mine would take less time, but would be much less likely to be funded.

The big question is, for our field, would it really be worth it? For the kind of searching, the kind of work we’re doing, would the research coming out justify the intensive labor in? My instinct is no, but the only way to know for sure would be to test it – and a costume collection would be a great collection to test this on, in general. Costume is one of the most complex kinds of material culture artifacts you can study, with multiple layers of physical construction and multiple layers of social interpretation. If we could really define a list of attributes to define costume, I’m convinced we could do it with any kind of object (“If I can make it there, I’ll make it anywhere”). But I’m not convinced that doing so would substantially add to our wisdom.

By: Sterling Fluharty

Sterling Fluharty — Sat, 30 May 2009 16:41:25 +0000

Rick: You are absolutely right that end users, almost without exception, do not have to pay for access to the metadata they download into their Zotero libraries. But let me suggest to you that there may be other ways of measuring the cost involved. For example, have you followed any of the discussion of the Record Use Policy that OCLC WorldCat announced in November 2008? I think you will find that OCLC asserts some rights over its metadata. It says its metadata is a shared community asset and that people outside of the OCLC Cooperative who use the metadata in certain ways should provide a fair return. In fact, OCLC might say that Zotero needs to sign an agreement with it in order to re-use and transfer data obtained directly or indirectly from OCLC databases. I suspect that you and I hold similar views on wanting information to be as free as possible. Still, I don’t think Zotero can afford to ignore the intellectual property rights and policies that many, if not most, data providers may be asserting. I think it would be wonderful if litigation could be avoided and compromises or understandings could be reached. Perhaps OCLC could be assured or persuaded that Zotero’s aggregation and analysis of OCLC metadata actually provides a fair return to members of the OCLC cooperative.

Sorry for taking potshots on your comments about catalogers. It sounds like we both have some respect for what they do.

It sounds to me like you are saying that few companies or individuals will want to develop ontologies because they are time- and resource-intensive. I won’t dispute that this can be intellectually challenging work. But you might be surprised to find that the WorldCat search [su: terminology or (su: abstracting and su: indexing) or ((su: subject and su: headings))], which doesn’t necessarily even include traditional thesauri, yields 53,699 results. I take this as a sign of a rich tradition of ontology work in the United States, United Kingdom, and elsewhere.

It also appears to me that you are suggesting that companies or individuals may not find it worth their time to develop ontologies since they can expect few readers of works in their specialized domain. Within the world of print, I agree with you that this is certainly a concern and constraint for those who contemplate indexing works in their fields. But the Internet is a game changer, in my opinion. Suddenly works that had tiny audiences in print now have vastly larger audiences on the web. If the semantic web is to succeed on a massive scale, I think its designers will have to engage both this new-found audience and the kinds of ontologies that have been worked out in the last hundred years or so.

I think I hear you saying that Zotero won’t need ontologies because once the number of users who sync their libraries reaches critical mass, the linked data between citations/records and within its database will provide the data necessary for running a recommendation engine. Like you, I believe this aggregation will certainly increase the quality of recommendations. However, I think some caution is in order. Some web semanticists have envisioned that Wikipedia, as an aggregation of the world’s knowledge, would provide them with the means for defining semantic relationships and measuring similarity between various items, objects, or things. Over the last year or so, though, systematic study of certain domains within Wikipedia has revealed that it is far from being comprehensive or complete. To remedy these gaps, content specialists will have to flesh out the ontologies within their particular domains. This is in large part the reason why VoCamp was started in 2008. So I hope Zotero developers and users will think critically about whether it would be worth their time to join this larger movement within the semantic web and develop ontologies. And I can’t help but wonder if historians are uniquely situated to tell us how the lessons of previous ontology work can be applied to problems in the present.

By: Bruce D'Arcus

Bruce D'Arcus — Sat, 30 May 2009 14:48:15 +0000

Oops; the comments don’t allow embedded markup; the first paragraph above is of course a quote.

By: Bruce D'Arcus

Bruce D'Arcus — Sat, 30 May 2009 14:47:48 +0000

It envisions linking identical information and objects on different web sites by treating this as essentially a problem of matching equivalent items in relational databases.

Relational databases have no necessary connection to this. All I’m saying is that if you treat subjects as things, with independent existence, you can attach other information to them: different language-dependent labels, connections to other things, etc., etc. *That* is what is at the heart of the notion of linked data.
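A minimal sketch of what that could look like, in Python. The URIs, labels, and the `Subject`, `LABEL_INDEX`, and `resolve` names are all made up for illustration (they are not identifiers from any real service); the only point is that identity hangs on the URI while labels and connections are attached to it.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """A subject treated as a thing: identified by a URI, not by a string."""
    uri: str
    labels: dict = field(default_factory=dict)   # language code -> label
    related: set = field(default_factory=set)    # URIs of connected subjects


# Hypothetical URIs under example.org; not real identifiers from any service.
COSTUME = Subject(
    uri="http://example.org/subjects/costume",
    labels={"en": "Costume", "de": "Kostüm"},
    related={"http://example.org/subjects/textiles"},
)
TEXTILES = Subject(
    uri="http://example.org/subjects/textiles",
    labels={"en": "Textiles", "de": "Textilien"},
)

SUBJECTS = [COSTUME, TEXTILES]

# Index every (language, lowercased label) pair back to the subject's URI.
LABEL_INDEX = {
    (lang, label.lower()): s.uri
    for s in SUBJECTS
    for lang, label in s.labels.items()
}


def resolve(label, lang):
    """Return the URI for a label in a given language, or None if unknown."""
    return LABEL_INDEX.get((lang, label.strip().lower()))


# As plain strings the two tags differ; as URIs they are the same thing.
print("Costume" == "Kostüm")                                # False
print(resolve("Costume", "en") == resolve("Kostüm", "de"))  # True
```

Nothing here depends on a relational database; a plain dictionary is enough to show the idea.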

By: Rick

Rick — Sat, 30 May 2009 03:46:41 +0000

Zotero users have been taking metadata from proprietary databases at no cost already. And some metadata has been given to and/or is being harvested for the Linked Data project. Finally, some proprietary database holders have already licensed their databases to customers under rather open terms (though this is certainly subject to change).

No offense intended to catalogers. Most do have it a little easier: the Library of Congress has a “mere” 32M books. An ontology for these makes a bit more sense.

By “academic literature,” I mostly meant journal articles. There are orders of magnitude more of these out there & they are also much more esoteric. While I pointed out some companies have made a go at this, normalization of keywords to an ontology would be very difficult: PACS alone has 8,000 or so different classifications. More works + more keywords + fewer readers of the average work = suffering for any noble catalogers who took this task on.

By: Sterling Fluharty

Sterling Fluharty — Sat, 30 May 2009 00:30:32 +0000

Govind: Thanks for sharing a commercial application. I was thinking you might want to add craigslist into your apartment search results.

Rick: Good to hear from you. I think you are right: historians, including me, make for poor prognosticators. I suppose I will have to try harder to tame my feelings for the future of Zotero. 😉

It sounds like you enjoy working with Zotero. I guess we have that in common. I like your theory about the quality of and potential for linking between citation information. I am not so sure, though, that metadata from proprietary databases comes at no cost.

I agree with you that we shouldn’t be biased against tags and terms that are wild. After all, not every keyword has the chance to become domesticated. I try to be very open minded and accepting of ontologies, regardless of their level of formality or where they come from.

I think it would be wonderful if you were right about disambiguation. In my estimation, it is human nature for us to believe that we can easily classify things into categories and determine whether they are related, similar, equivalent, opposite, etc. Perhaps computers can help humble and heal us, myself included, of our hubris.

We may find that we are able to outsource much of the work of ontology derivation to savvy computer algorithms. And I promise to not tell library catalogers what you just said about their profession.

By: Rick

Rick — Fri, 29 May 2009 23:28:49 +0000

While there is some great rhetoric here, I think that much is dubious speculation. Even if Zotero does not immediately live up to its potential, I think you'd be hard pressed to claim that they're really "about to hit another roadblock." In many ways, citation information has been better linked & more structured than information for music/movies. The academic literature is connected by citations. Unlike movies, where people must manually change IMDB to add links, these citations are author-created and inherent to the medium. Authors and publishers (as well as some aggregators, such as ISI) also often assign works keywords. Unlike the music "genes" in Pandora, these also come for free. While there are some keyword ontologies (PACS is notable & most publishers have lists of keywords), author-selected keywords and tags are wild & I don't think that is a bad thing. A recommendation system doesn't need to rely on their being chosen from some small list or even on having them linked together publicly (although that would be useful, see delicious's 'related' tags). With the data both on citation-linkages & what people with similar references also have (and, for that matter, content from similar (usually very subject-specific) journals), this mess can definitely be sorted out to make recommendations. (It also is not a problem if Zotero users delete or don't carry-over automatic tags: the rest of the rich citation information can be used to verify that two different user items probably refer to the same web source, so you can use the information captured by other users.)

As an aside: I do think the music genome project is cool & listen to Pandora regularly. But the scaling sucks. allmusic has data on 15M tracks, while only 0.5M have been cataloged in the MusicGenome. So they have less than 5% of what is in allmusic (and who knows what allmusic is missing). It takes around five listens to categorize a single song. This process would be horrendous in academic literature, where there are fewer subject-matter experts & each reading takes a considerable amount of time & there's much more content (PubMed, alone, has 20M articles).
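To make the recommendation idea concrete (shared references across libraries, plus using citation metadata to decide that two user items probably refer to the same source), here is a rough sketch in Python. It is only an illustration under stated assumptions: the record fields, the `item_key` and `recommend` helpers, and the sample libraries are all invented, and none of this reflects how Zotero actually stores or matches items.

```python
from collections import Counter


def item_key(record):
    """Build a matching key for a citation record (hypothetical fields).

    Prefer a DOI when present; otherwise fall back to the title with
    punctuation and case stripped, plus the year.
    """
    doi = record.get("doi")
    if doi:
        return "doi:" + doi.strip().lower()
    title = "".join(ch for ch in record.get("title", "").lower() if ch.isalnum())
    return f"title:{title}:{record.get('year', '')}"


def recommend(user, libraries, top_n=3):
    """Suggest items held by users whose libraries overlap with `user`'s.

    `libraries` maps a user name to a set of item keys. Each candidate item
    is scored by the summed overlap of the libraries that contain it.
    """
    mine = libraries[user]
    scores = Counter()
    for other, theirs in libraries.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared references, so no evidence of similar interests
        for key in theirs - mine:
            scores[key] += overlap
    return [key for key, _ in scores.most_common(top_n)]


# Two differently typed records that should collapse to one key.
a = {"title": "Zotero and Semantic Search", "year": 2009}
b = {"title": "zotero and semantic search.", "year": 2009}
print(item_key(a) == item_key(b))  # True

# Invented sample libraries, already reduced to item keys.
libraries = {
    "ann": {"a", "b"},
    "ben": {"a", "b", "c"},
    "cy": {"a", "d"},
    "dee": {"e"},
}
print(recommend("ann", libraries))  # ['c', 'd']
```

No controlled vocabulary is involved: the only inputs are which items sit together in which libraries.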

By: Govind Kabra

Govind Kabra — Fri, 29 May 2009 21:08:44 +0000

It is certain that understanding the structured data on the Web will enable novel semantic search applications. At Cazoodle, we use our technology for large-scale Web data integration to build vertical applications in new domains, including apartment rentals, online shopping, and local events.

www.cazoodle.com

By: Sterling Fluharty

Sterling Fluharty — Fri, 29 May 2009 19:28:03 +0000

Bruce & patrickmj: Thanks for the thoughtful comments. I am still learning about URIs and Linked Data. I agree they offer some intriguing possibilities for the architecture of the semantic web. I think Zotero has already started down this path by using an ISBN as a URI when it "locates" a book in a library catalog. As I see it, this paradigm comes out of computer science. It envisions linking identical information and objects on different web sites by treating this as essentially a problem of matching equivalent items in relational databases. I appreciate and respect that position, but I think it is ultimately limited. If you take a linguistic perspective, however, this situation looks very different. Tags are no longer dumb or useless. Instead they become valuable semantic context for deep approaches to word sense disambiguation. This is part of what I was trying to say about blending information frameworks with conceptual domains. And I definitely second your idea of including LoC developers in this conversation at THATCamp.

briancroxall: I am glad to see your interest. How, why, and whether we should trust tags are some interesting questions. The answers can be both scientific and subjective. Let me just say it would be fascinating if Zotero had an open API where users could run queries to see the variety of tags that are employed for a discrete item that appears in the libraries of multiple users. This could reveal to us a semantic dimension of the wisdom of the crowd. It could also help to define semantic relations between items and serve as a catalyst for a recommendation engine within Zotero.
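As a rough illustration of the kind of query I have in mind, here is a small Python sketch. It does not use any real Zotero API (the wish above is precisely that one existed); the `libraries` data, the item identifiers, and the `tags_by_item` and `jaccard` helpers are all invented for the example. It collects the tags that different users gave the same item, then uses tag overlap as one crude measure of how related two items are.

```python
from collections import Counter, defaultdict

# Invented data: user -> {item id -> tags that user applied to the item}.
libraries = {
    "alice": {"isbn:0001": ["costume", "18th century"], "isbn:0002": ["textiles"]},
    "bob": {"isbn:0001": ["Costume", "fashion"], "isbn:0002": ["textiles", "fashion"]},
    "carol": {"isbn:0001": ["fashion"], "isbn:0003": ["ontology"]},
}


def tags_by_item(libs):
    """Count how often each (lowercased) tag is applied to each item."""
    counts = defaultdict(Counter)
    for items in libs.values():
        for item_id, tags in items.items():
            counts[item_id].update(t.strip().lower() for t in tags)
    return counts


def jaccard(a, b):
    """Share of distinct tags two items have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


counts = tags_by_item(libraries)

# The variety of tags the crowd applied to one item.
print(sorted(counts["isbn:0001"].items()))
# [('18th century', 1), ('costume', 2), ('fashion', 2)]

# Tag overlap as a crude stand-in for a semantic relation between items.
print(jaccard(counts["isbn:0001"], counts["isbn:0002"]))  # 0.25
print(jaccard(counts["isbn:0001"], counts["isbn:0003"]))  # 0.0
```

Jaccard overlap is just one simple choice; it treats tags as plain strings and does nothing about word sense, so it is a starting point for discussion and not an answer to the disambiguation problem.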

By: patrickmj

patrickmj — Fri, 29 May 2009 14:20:08 +0000

Bruce,

Ah! looks like I was mulling over while you were writing! Ed Summers will be here, so there will definitely be a great discussion!

By: patrickmj

patrickmj — Fri, 29 May 2009 14:17:11 +0000

It’s probably worth hearing about where things stand with connecting Zotero with the bibliography ontology. There was a recent flurry of discussion on the bibo: list about this, and so I see some very strong progress there. When that’s done, I think we’re just talking about the bigger picture of developing the linked open data apps to do this kind of recommending.

By: briancroxall

briancroxall — Fri, 29 May 2009 14:08:33 +0000

Sterling, I don’t know that I have much to add at this point except to say that you seem to have hit on a subtle and important thing to think about as far as how tagging should be trusted to do all the work for us. I look forward to this discussion.

By: Bruce D'Arcus

Bruce D'Arcus — Fri, 29 May 2009 13:44:42 +0000

The problem with tags is they are dumb strings. This is why imported controlled tags aren't very useful. If, OTOH, they were URIs that could be linked with other URIs, things get much more interesting. I think it would be a really good thought experiment to consider how Zotero could usefully build on nascent linked data sources like the new Library of Congress id service. It would be a really good idea to do this during the meeting, as there's likely to be one or more of the LoC developers responsible for this service in attendance.