Zotero and Semantic Search

Here is my original proposal for THATCamp, which I hoped would fit in with session ideas from the rest of you:

I would like to discuss theoretical issues in digital history in a way that is accessible and understandable to beginning digital humanists.  This is probably the common thread running through my interests and research.  I really wonder, for instance, whether digital history has its own research agenda or whether it simply facilitates the research agenda of traditional academic history.  I believe that Zotero will need a good theory for its subject indexing before it can launch a recommendation service.  Are any digital historians planning on producing any non-proprietary controlled vocabularies?  We need to have a good discussion of what the semantic web means for digital history.  Are we going to sit on our hands while information scientists hardwire the Internet with presentist ontologies?  Can digital historians create algorithmic definitions for historical context that formally describe the concepts, terms, and the relationships that prevailed in particular times and places?  What do digital historians hope to accomplish with text mining?  Are we going to pursue automatic summarization, categorization, clustering, concept extraction, entity relation, and sentiment analysis?  What methods from other disciplines should we consider when pursuing text mining?  What should be our stance on the attempt to reduce the “reading” of texts to computational algorithms and mathematical operations?  Will the programmers among us be switching over to parallel programming as chip manufacturers begin producing massively multi-core processors?  How prepared will we be to exploit the full capabilities of high-performance computing once it arrives on personal computers in the next few years?

Here is a post that just went up at my blog that addresses some of these issues and questions:

Zotero and Semantic Search

The good news is that Zotero 2.0 has arrived.  This long-awaited version allows a user to share her or his database/library of notes and citations with others and to collaborate on research in groups.  This will be a tremendous help to scholars who are coauthoring papers.  It also has a lot of potential for teaching research methods to students and facilitating their group projects.

The bad news is that I think Zotero is about to hit another roadblock.  The development roadmap says version 2.1 “will offer users a recommendation service for books, articles, and other scholarly resources based on the content in your Zotero library.”  This could mean simply that Zotero will aggregate all of the user libraries, identify overlap and similarity between them, and then offer items to users that would fit well within their library.  This would be similar to how Facebook compares my list of friends with those of other people in order to recommend to me likely friends with whom I already have a lot of friends in common.  If this was all there was to the process of a recommendation system in Zotero, then I think Zotero would meet its goal.  But if Zotero is to live up to its promise to enable users to discover relevant sources, then I think there is still a lot of work to be done.
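
To make the simple version concrete, here is a minimal sketch of overlap-based recommendation.  It is only an illustration under my own assumptions, not a description of how Zotero does or will do this: the function name `recommend_by_overlap`, the user names, and the item identifiers are all invented, and I have picked Jaccard similarity as one plausible way to weigh how much two libraries have in common.

```python
from collections import Counter


def recommend_by_overlap(libraries, user, top_n=3):
    """Suggest items held by users whose libraries overlap with `user`'s.

    `libraries` maps a (hypothetical) user name to a set of item identifiers.
    """
    mine = libraries[user]
    scores = Counter()
    for other, items in libraries.items():
        if other == user:
            continue
        shared = len(mine & items)
        if shared == 0:
            continue  # no common items, so this library tells us nothing
        # Jaccard similarity: shared items divided by all distinct items.
        weight = shared / len(mine | items)
        # Every item the other user has that we lack gets that weight.
        for item in items - mine:
            scores[item] += weight
    return [item for item, _ in scores.most_common(top_n)]


# Invented toy data: three users and their libraries.
libraries = {
    "ana": {"book_a", "book_b", "book_c"},
    "ben": {"book_a", "book_b", "book_d"},
    "cy": {"book_e"},
}

print(recommend_by_overlap(libraries, "ana"))
```

Notice that nothing in this sketch looks at what the items are about.  The recommendation comes entirely from who else holds them.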

This may seem like a distinction without a difference.  My point is a subtle one and hopefully some more examples will illustrate what I am trying to say. But first let’s define the semantic web.  According to Wikipedia, “The semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing, and combining information on the web.”  Zotero fulfils this vision when it captures citation information from web sites and makes it available for sharing and editing.  Amazon does something similar with its recommendation service.  It keeps track of what books people purchase, identifies patterns in these buying behaviors, and then recommends books that customers will probably like.  Zotero developers have considered using a similar system to run Zotero’s recommendation engine.  These are examples of the wisdom of the crowd in the world of web 2.0 at its best.

Unfortunately, there are limits to how much you can accomplish through crowdsourcing.  Netflix figured this out recently and is offering $1 million to whoever can “improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.”  The programming teams in the lead have, through trial and error, figured out that they needed to extract rich content from sites like the Internet Movie Database in order to refine their algorithms and predictive models.  This is kind of like predicting the weather; the more variables you can include in your calculations, the better your prediction will be.  However, in the case of movies the concepts for classifying movies are somewhat subjective.  Without realizing it, these prize-seeking programmers have been developing an ontology for movies.  (That may be a new word for you–according to Wikipedia, “an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.”)  Netflix is essentially purchasing a structured vocabulary and matching software that will allow it to vastly improve the accuracy of its recommendation engine when it comes to predicting what movies its customers will like.

One company that has taken ontologies quite seriously is Pandora, a “personalized internet radio service that helps you find new music based on your old and current favorites.”  The company tackled head-on the problem of semantically representing music by creating the Music Genome Project, which categorizes every song in terms of nearly 400 attributes.  And here is where the paradigm shift becomes evident.  Rather than aggregating and mining the musical preferences of groups of people, like what Amazon does with its sales data on books, Pandora defines similarity between songs in terms of conceptual overlap.  In other words, two songs are related to one another in the world of Pandora because they share a whole bunch of attributes–not because similar people listen to similar music.  (I told you this would be a subtle distinction.)  This is an example of how the semantic web trumps web 2.0.
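
Here is a minimal sketch of the attribute-based approach, for contrast with recommendations drawn from who owns what.  I do not know how Pandora actually computes similarity, so treat this as an assumption-laden toy: the attribute names are invented rather than taken from the Music Genome Project, and the functions `attribute_similarity` and `most_similar` are hypothetical names of my own.

```python
def attribute_similarity(a, b):
    """Jaccard overlap between two sets of descriptive attributes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(catalog, seed, top_n=2):
    """Rank every other entry in `catalog` by attribute overlap with `seed`."""
    ranked = sorted(
        (name for name in catalog if name != seed),
        key=lambda name: attribute_similarity(catalog[seed], catalog[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Invented attributes for three invented songs.
songs = {
    "song_x": {"acoustic guitar", "minor key", "female vocal"},
    "song_y": {"acoustic guitar", "minor key", "male vocal"},
    "song_z": {"synthesizer", "major key", "male vocal"},
}

print(most_similar(songs, "song_x", top_n=1))
```

No listener data appears anywhere in this sketch.  Two songs come out as related only because someone described them with overlapping terms, which is exactly why the quality of the vocabulary matters so much.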

Now let’s return to our discussion of Zotero.  As mentioned earlier, the envisioned recommendation engine for Zotero has been compared to Amazon’s recommendation engine.  The ability of users to add custom tags underscores how Zotero was influenced by web 2.0 models.  Apparently Zotero developers looked forward to the day when the “data pool” in Zotero would reach critical mass and enable the recommendation system to predict what items users would want to add to their library.  As we have seen, these models have inherent limitations.  They make recommendations on the basis of shared information, rather than on the basis of similarity between concepts.  I think at some level Zotero developers sensed this problem.  That is probably why they designed Zotero to capture terms from controlled vocabularies as part of the metadata it downloaded from online databases.  Unfortunately, though, some users and developers have said that imported tags, such as subject headings from library catalogs, are pretty much useless in Zotero.  Furthermore, the fact that Zotero comes with a button for turning off “automatic” tags, and that some translators are sloppy or fail to capture subject headings, suggests that most users would rather avoid using these terms from controlled vocabularies.

And so the problem with Zotero is that its users and developers generally resist incorporating ontologies into their libraries (item types and item relations/functions are notable exceptions).  That may sound like a very abstract thing to say.  So let me provide you with some concrete examples of what this would look like.  The first is a challenge I would like to issue to the Zotero developers.  It has been said that Zotero would allow a group of historians to collaboratively build a library on “a topic lacking a chapter in the Guide to Historical Literature.”  My “bibliographer test” is a slight variation on this: 1) pick any section in this bibliographic guide, 2) enter all but one of the books in the given bibliography into Zotero, and 3) program Zotero’s recommendation engine so that, in the majority of cases, it can identify the item missing from the library.  Similarly, I would like to see us develop algorithms for “related records searches.”  You may think this is impossible, but this capability already exists in the Web of Science database.  And as we have already seen, Netflix and Pandora provide examples of the kind of semantic work it takes to make these types of searches feasible.
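
To show what I mean by the “bibliographer test,” here is a minimal sketch of how it could be scored.  Everything in it is an assumption for illustration: the item identifiers and subject terms are invented, `recommend_by_subject` is a deliberately naive stand-in for a real recommendation engine, and I hold out each item in turn rather than a single one so that the score is less dependent on luck.

```python
def bibliographer_test(bibliographies, candidates, recommend):
    """Fraction of trials in which `recommend` recovers the held-out item.

    `bibliographies` maps a section name to the set of items it lists.
    `candidates` is the full set of items the engine may choose from.
    `recommend(library, pool)` returns one item from `pool`.
    """
    hits = 0
    trials = 0
    for items in bibliographies.values():
        for held_out in sorted(items):
            library = items - {held_out}   # all but one of the books
            pool = candidates - library    # everything not already entered
            if recommend(library, pool) == held_out:
                hits += 1
            trials += 1
    return hits / trials if trials else 0.0


# Invented subject terms for six invented items.
subjects = {
    "b1": {"reconstruction", "south"},
    "b2": {"reconstruction", "freedmen"},
    "b3": {"reconstruction", "south", "freedmen"},
    "b4": {"railroads", "west"},
    "b5": {"railroads", "labor"},
    "b6": {"railroads", "west", "labor"},
}
bibliographies = {
    "section_1": {"b1", "b2", "b3"},
    "section_2": {"b4", "b5", "b6"},
}


def recommend_by_subject(library, pool):
    """Naive stand-in: pick the candidate sharing the most subject terms."""
    library_terms = set().union(*(subjects[item] for item in library))
    return max(sorted(pool), key=lambda item: len(subjects[item] & library_terms))


score = bibliographer_test(bibliographies, set(subjects), recommend_by_subject)
print(score)
```

An engine passes the test as I stated it when the score is above 0.5.  The toy data here is far cleaner than any real bibliography, so the interesting question is how far the score falls once real subject headings and real libraries are involved.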

After reading this post, you may feel that Zotero has been heading down the wrong path.  I prefer to think of Zotero as having made some amazing progress over the last three years.  And I think the genesis of the ideas it needs is already in place.  In my estimation, we need to think more expansively about what it means to carry out semantic searches with Zotero.  It also seems to me that we need to think more carefully about balancing the benefits of web 2.0 with the sophistication of the semantic web.  I will be excited to see what the developers come up with.  And maybe if I work more on my programming skills, I can help with writing the code.  As I see it, this will be an exciting opportunity for carrying out theoretical research in the digital humanities.

14 Responses to “Zotero and Semantic Search”

  1. Bruce D'Arcus Says:

    The problem with tags is they are dumb strings. This is why imported controlled tags aren’t very useful. If, OTOH, they were URIs that could be linked with other URIs, things get much more interesting.

    I think it would be a really good thought experiment to consider how Zotero could usefully build on nascent linked data sources like the new Library of Congress id service. It would be a really good idea to do this during the meeting, as there’s likely to be one or more of the LoC developers responsible for this service in attendance.

  2. briancroxall Says:

    Sterling, I don’t know that I have much to add at this point except to say that you seem to have hit on a subtle and important thing to think about as far as how tagging should be trusted to do all the work for us. I look forward to this discussion.

  3. patrickmj Says:

    It’s probably worth hearing about where things stand with connecting Zotero with the bibliography ontology. There was a recent flurry of discussion on the bibo: list about this, and so I see some very strong progress there. When that’s done, I think we’re just talking about the bigger picture of developing the linked open data apps to do this kind of recommending.

  4. patrickmj Says:

    Bruce,

    Ah! looks like I was mulling over while you were writing! Ed Summers will be here, so there will definitely be a great discussion!

  5. Sterling Fluharty Says:

    Bruce & patrickmj: Thanks for the thoughtful comment. I am still learning about URIs and Linked Data. I agree they offer some intriguing possibilities for the architecture of the semantic web. I think Zotero has already started down this path by using an ISBN as a URI when it “locates” a book in a library catalog. As I see it, this paradigm comes out of computer science. It envisions linking identical information and objects on different web sites by treating this as essentially a problem of matching equivalent items in relational databases. I appreciate and respect that position, but I think it is ultimately limited. If you take a linguistic perspective, however, this situation looks very different. Tags are no longer dumb or useless. Instead they become valuable semantic context for deep approaches to word sense disambiguation. This is part of what I was trying to say about blending information frameworks with conceptual domains. And I definitely second your idea of including LoC developers in this conversation at THATCamp.

    briancroxall: I am glad to see your interest. How, why, and whether we should trust tags are some interesting questions. The answers can be both scientific and subjective. Let me just say it would be fascinating if Zotero had an open API where users could run queries to see the variety of tags that are employed for a discrete item that appears in the libraries of multiple users. This could reveal to us a semantic dimension of the wisdom of the crowd. It could also help to define semantic relations between items and serve as a catalyst for a recommendation engine within Zotero.

  6. Govind Kabra Says:

    It is certain that understanding the structured data on the Web will enable novel semantic search applications. At Cazoodle, we use our technology for large-scale Web data integration to build vertical applications in new domains, including apartment rentals, online shopping, and local events.

    www.cazoodle.com

  7. Rick Says:

    While there is some great rhetoric here, I think that much is dubious speculation. Even if Zotero does not immediately live up to its potential, I think you’d be hard pressed to claim that they’re really “about to hit another roadblock.” In many ways, citation information has been better linked & more structured than information for music/movies. The academic literature is connected by citations. Unlike movies, where people must manually change IMDB to add links, these citations are author-created and inherent to the medium. Authors and publishers (as well as some aggregators, such as ISI) also often assign works keywords. Unlike the music “genes” in Pandora, these also come for free.

    While there are some keyword ontologies (PACS is notable & most publishers have lists of keywords), author-selected keywords and tags are wild & I don’t think that is a bad thing. A recommendation system doesn’t need them to be chosen from some small list or even to be linked together publicly (although that would be useful, see delicious’s ‘related’ tags). With the data both on citation-linkages & what people with similar references also have (and, for that matter, content from similar (usually very subject-specific) journals), this mess can definitely be sorted out to make recommendations.

    (It also is not a problem if Zotero users delete or don’t carry-over automatic tags: the rest of the rich citation information can be used to verify that two different user items probably refer to the same web source, so you can use the information captured by other users.)

    As an aside: I do think the Music Genome Project is cool & listen to Pandora regularly. But the scaling sucks. allmusic has data on 15M tracks, while only 0.5M have been cataloged in the Music Genome. So they have less than 5% of what is in allmusic (and who knows what allmusic is missing). It takes around five listens to categorize a single song. This process would be horrendous in academic literature, where there are fewer subject-matter experts & each reading takes a considerable amount of time & there’s much more content (PubMed, alone, has 20M articles).

  8. Sterling Fluharty Says:

    Govind: Thanks for sharing a commercial application. I was thinking you might want to add craigslist into your apartment search results.

    Rick: Good to hear from you. I think you are right: historians, including me, make for poor prognosticators. I suppose I will have to try harder to tame my feelings for the future of Zotero. 😉

    It sounds like you enjoy working with Zotero. I guess we have that in common. I like your theory about the quality of and potential for linking between citation information. I am not so sure, though, that metadata from proprietary databases comes at no cost.

    I agree with you that we shouldn’t be biased against tags and terms that are wild. After all, not every keyword has the chance to become domesticated. I try to be very open minded and accepting of ontologies, regardless of their level of formality or where they come from.

    I think it would be wonderful if you were right about disambiguation. In my estimation, it is human nature for us to believe that we can easily classify things into categories and determine whether they are related, similar, equivalent, opposite, etc. Perhaps computers can help humble and heal us, myself included, of our hubris.

    We may find that we are able to outsource much of the work of ontology derivation to savvy computer algorithms. And I promise to not tell library catalogers what you just said about their profession.

  9. Rick Says:

    Zotero users have been taking metadata from proprietary databases at no cost already. And some metadata has been given to and/or is being harvested for the Linked Data project. Finally, some proprietary database holders have already licensed their databases to customers under rather open terms (though this is certainly subject to change).

    No offense intended to catalogers. Most do have it a little easier: the Library of Congress has a “mere” 32M books. An ontology for these makes a bit more sense.

    By “academic literature,” I mostly meant journal articles. There are orders of magnitude more of these out there & they are also much more esoteric. While I pointed out some companies have made a go at this, normalization of keywords to an ontology would be very difficult: PACS alone has 8,000 or so different classifications. More works + more keywords + fewer readers of the average work = suffering for any noble catalogers who took this task on.

  10. Bruce D'Arcus Says:

    It envisions linking identical information and objects on different web sites by treating this as essentially a problem of matching equivalent items in relational databases.

    Relational databases have no necessary connection to this. All I’m saying is that if you treat subjects as things, with independent existence, you can attach other information to them: different language-dependent labels, connections to other things, etc., etc. *That* is what is at the heart of the notion of linked data.

  11. Bruce D'Arcus Says:

    Oops; the comments don’t allow embedded markup; the first paragraph above is of course a quote.

  12. Sterling Fluharty Says:

    Rick: You are absolutely right that end users, almost without exception, do not have to pay for access to the metadata they download into their Zotero libraries. But let me suggest to you that there may be other ways of measuring the cost involved. For example, have you followed any of the discussion of the Record Use Policy that OCLC WorldCat announced in November 2008? I think you will find that OCLC asserts some rights over its metadata. It says its metadata is a shared community asset and that people outside of the OCLC Cooperative who use the metadata in certain ways should provide a fair return. In fact, OCLC might say that Zotero needs to sign an agreement with it in order to re-use and transfer data obtained directly or indirectly from OCLC databases. I suspect that you and I hold similar views on wanting information to be as free as possible. Still, I don’t think Zotero can afford to ignore the intellectual property rights and policies that many, if not most, data providers may be asserting. I think it would be wonderful if litigation could be avoided and compromises or understandings could be reached. Perhaps OCLC could be assured or persuaded that Zotero’s aggregation and analysis of OCLC metadata actually provides a fair return to members of the OCLC cooperative.

    Sorry for taking potshots on your comments about catalogers. It sounds like we both have some respect for what they do.

    It sounds to me like you are saying that few companies or individuals will want to develop ontologies because they are time- and resource-intensive. I won’t dispute that this can be intellectually challenging work. But you might be surprised to find that the WorldCat search [su: terminology or (su: abstracting and su: indexing) or ((su: subject and su: headings))], which doesn’t necessarily even include traditional thesauri, yields 53,699 results. I take this as a sign of a rich tradition of ontology work in the United States, United Kingdom, and elsewhere.

    It also appears to me that you are suggesting that companies or individuals may not find it worth their time to develop ontologies since they can expect few readers of works in their specialized domain. Within the world of print, I agree with you that this is certainly a concern and constraint for those who contemplate indexing works in their fields. But the Internet is a game changer, in my opinion. Suddenly works that had tiny audiences in print now have vastly larger audiences on the web. If the semantic web is to succeed on a massive scale, I think its designers will have to engage both this new-found audience and the kinds of ontologies that have been worked out in the last hundred years or so.

    I think I hear you saying that Zotero won’t need ontologies because once the number of users who sync their libraries reaches critical mass, the linked data between citations/records and within its database will provide the data necessary for running a recommendation engine. Like you, I believe this aggregation will certainly increase the quality of recommendations. However, I think some caution is in order. Some web semanticists have envisioned that Wikipedia, as an aggregation of the world’s knowledge, would provide them with the means for defining semantic relationships and measuring similarity between various items, objects, or things. Over the last year or so, though, systematic study of certain domains within Wikipedia has revealed that it is far from being comprehensive or complete. To remedy these gaps, content specialists will have to flesh out the ontologies within their particular domains. This is in large part the reason why VoCamp was started in 2008. So I hope Zotero developers and users will think critically about whether it would be worth their time to join this larger movement within the semantic web and develop ontologies. And I can’t help but wonder if historians are uniquely situated to tell us how the lessons of previous ontology work can be applied to problems in the present.

  13. Arden Kirkland Says:

    Between this post and the post about Standards, I think we have quite a mammoth session to look forward to. Perhaps we need to break this down somehow into multiple sessions?

    Anyway – my responses:

    -The digital history I have followed “simply facilitates the research agenda of traditional history.”
    I’d be very interested to see examples of projects where “digital history has its own agenda.”

    -There is absolutely a need for controlled vocabularies to be not only “non-proprietary” but also inclusive in other ways, created as a result of collaboration between multiple institutions so that diverse collections can be represented. I could share the different, not very strategic, ways the costume history community is responding to this need, and I’d be interested to hear about how other fields are responding.

    -In my field, my concern is not so much for “presentist ontologies” as it is for “art historicist” ontologies – paintings, for example, are typically catalogued by visual resources librarians for use in art history databases, and rarely mention details of costume portrayed therein. To see what I mean, take a look at this great slideshow about user contributed metadata. This relates to the point that rnelson2 makes in his post about Standards, about how ontologies can exclude alternate interpretations.

    -Your discussion of Zotero vs. Amazon vs. Pandora is very thought-provoking. Even though I’m a devoted Pandora user, I hadn’t really thought about it as a model for other kinds of searches. I’ve been saying for years that academic sites should use Amazon as a model, in terms of its intuitive interface. However, I agree that Pandora is a wonderful example of “how the semantic web trumps web 2.0.” Looking at the list of attributes used to catalog a song for Pandora, it reminds me of the complexity involved with cataloging costume, and provides an appropriate solution in terms of the layering of attributes (as opposed to the either/or of some ontologies).

    However, while we may aspire to a Pandora-type model, I don’t know how realistic it is for other kinds of fields. Pandora is a commercial model for a very economically viable field with a wide audience – the time spent cataloging results in financial profit. My field is somewhat the opposite of this – small, specialized, underfunded, overworked. Even if a large, popular museum collection was able to come up with the funding to pay catalogers to pursue this approach, their collections are so large that they would either only catalog highlights of their collection, or it would take forever! A small collection like mine would take less time, but would be much less likely to be funded.

    The big question is, for our field, would it really be worth it? For the kind of searching, the kind of work we’re doing, would the research coming out justify the intensive labor in? My instinct is no, but the only way to know for sure would be to test it – and a costume collection would be a great collection to test this on, in general. Costume is one of the most complex kinds of material culture artifacts you can study, with multiple layers of physical construction and multiple layers of social interpretation. If we could really define a list of attributes to define costume, I’m convinced we could do it with any kind of object (“If I can make it there, I’ll make it anywhere”). But I’m not convinced that doing so would substantially add to our wisdom.

  14. Sterling Fluharty Says:

    Thanks for the lengthy reply. I don’t hear traditional historians clamoring for software that can analyze text. So maybe the development of text mining is one area where digital history will have to pursue its own agenda. I am getting ready to write a text mining post, so stay posted. One of the things we need to think about is designing programs that classify and describe historical objects and documents with some level of human supervision. This will help to reduce a fair amount of the workload, if we can pull it off.