Zotero and Semantic Search
THATCamp CHNM 2009 (The Humanities And Technology Camp), Fri, 29 May 2009
http://chnm2009.thatcamp.org/05/29/zotero-and-semantic-search/

Here is my original proposal for THATCamp, which I hoped would fit in with session ideas from the rest of you:

I would like to discuss theoretical issues in digital history in a way that is accessible and understandable to beginning digital humanists.  This is probably the common thread running through my interests and research.  I really wonder, for instance, whether digital history has its own research agenda or whether it simply facilitates the research agenda of traditional academic history.  I believe that Zotero will need a good theory for its subject indexing before it can launch a recommendation service.  Are any digital historians planning on producing any non-proprietary controlled vocabularies?  We need to have a good discussion of what the semantic web means for digital history.  Are we going to sit on our hands while information scientists hardwire the Internet with presentist ontologies?  Can digital historians create algorithmic definitions for historical context that formally describe the concepts, terms, and the relationships that prevailed in particular times and places?  What do digital historians hope to accomplish with text mining?  Are we going to pursue automatic summarization, categorization, clustering, concept extraction, entity relation, and sentiment analysis?  What methods from other disciplines should we consider when pursuing text mining?  What should be our stance on the attempt to reduce the “reading” of texts to computational algorithms and mathematical operations?  Will the programmers among us be switching over to parallel programming as chip manufacturers begin producing massively multi-core processors?  How prepared will we be to exploit the full capabilities of high-performance computing once it arrives on personal computers in the next few years?

Here is a post that just went up at my blog that addresses some of these issues and questions:

Zotero and Semantic Search

The good news is that Zotero 2.0 has arrived.  This long-awaited version allows a user to share her or his database/library of notes and citations with others and to collaborate on research in groups.  This will be a tremendous help to scholars who are coauthoring papers.  It also has a lot of potential for teaching research methods to students and facilitating their group projects.

The bad news is that I think Zotero is about to hit another roadblock.  The development roadmap says version 2.1 “will offer users a recommendation service for books, articles, and other scholarly resources based on the content in your Zotero library.”  This could mean simply that Zotero will aggregate all of the user libraries, identify overlap and similarity between them, and then offer items to users that would fit well within their library.  This would be similar to how Facebook compares my list of friends with those of other people in order to recommend to me likely friends with whom I already have a lot of friends in common.  If this were all there was to the process of a recommendation system in Zotero, then I think Zotero would meet its goal.  But if Zotero is to live up to its promise to enable users to discover relevant sources, then I think there is still a lot of work to be done.
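To make the "aggregate and compare libraries" approach concrete, here is a minimal sketch of overlap-based recommendation. It is an illustration only, not Zotero's actual design: the function name `recommend_by_overlap`, the user names, and the book titles are all hypothetical.

```python
from collections import Counter


def recommend_by_overlap(libraries, user, top_n=3):
    """Rank items the user lacks by how much the owning libraries overlap with theirs."""
    own = libraries[user]
    scores = Counter()
    for other, items in libraries.items():
        if other == user:
            continue
        shared = len(own & items)  # number of items both libraries hold
        if shared == 0:
            continue  # libraries with nothing in common contribute nothing
        for item in items - own:
            scores[item] += shared  # weight each candidate by the overlap
    # Highest score first; ties are broken alphabetically so results are stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical user libraries, each a set of item titles.
libraries = {
    "ann": {"Book A", "Book B", "Book C"},
    "ben": {"Book A", "Book B", "Book D"},
    "cat": {"Book B", "Book C", "Book D", "Book E"},
    "dan": {"Book F"},
}

print(recommend_by_overlap(libraries, "ann"))
```

Note that nothing in this sketch looks at what the books are about. A library that overlaps with no one else's, like the hypothetical "dan" above, gets no recommendations at all.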

This may seem like a distinction without a difference.  My point is a subtle one and hopefully some more examples will illustrate what I am trying to say. But first let’s define the semantic web.  According to Wikipedia, “The semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing, and combining information on the web.”  Zotero fulfils this vision when it captures citation information from web sites and makes it available for sharing and editing.  Amazon does something similar with its recommendation service.  It keeps track of what books people purchase, identifies patterns in these buying behaviors, and then recommends books that customers will probably like.  Zotero developers have considered using a similar system to run Zotero’s recommendation engine.  These are examples of the wisdom of the crowd in the world of web 2.0 at its best.

Unfortunately, there are limits to how much you can accomplish through crowdsourcing.  Netflix figured this out recently and is offering $1 million to whoever can “improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.”  The programming teams in the lead have, through trial and error, figured out that they needed to extract rich content from sites like the Internet Movie Database in order to refine their algorithms and predictive models.  This is kind of like predicting the weather; the more variables you can include in your calculations, the better your prediction will be.  However, in the case of movies, the concepts used for classification are somewhat subjective.  Without realizing it, these prize-seeking programmers have been developing an ontology for movies.  (That may be a new word for you–according to Wikipedia, “an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.”)  Netflix is essentially purchasing a structured vocabulary and matching software that will allow it to vastly improve the accuracy of its recommendation engine when it comes to predicting what movies its customers will like.

One company that has taken ontologies quite seriously is Pandora, a “personalized internet radio service that helps you find new music based on your old and current favorites.”  The company tackled head-on the problem of semantically representing music by creating the Music Genome Project, which categorizes every song in terms of nearly 400 attributes.  And here is where the paradigm shift becomes evident.  Rather than aggregating and mining the musical preferences of groups of people, like what Amazon does with its sales data on books, Pandora defines similarity between songs in terms of conceptual overlap.  In other words, two songs are related to one another in the world of Pandora because they share a whole bunch of attributes–not because similar people listen to similar music.  (I told you this would be a subtle distinction.)  This is an example of how the semantic web trumps web 2.0.
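The attribute-based notion of similarity can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Pandora's actual method: it uses Jaccard similarity over sets of attributes, and the song names and attribute labels are hypothetical rather than drawn from the Music Genome Project.

```python
def jaccard(a, b):
    """Fraction of attributes two items share, from 0.0 (none) to 1.0 (all)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(catalog, seed, top_n=2):
    """Rank every other item in the catalog by attribute overlap with the seed."""
    seed_attrs = catalog[seed]
    scored = [
        (jaccard(seed_attrs, attrs), title)
        for title, attrs in catalog.items()
        if title != seed
    ]
    # Most similar first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, round(score, 2)) for score, title in scored[:top_n]]


# Hypothetical catalog: each song is described only by its attributes.
songs = {
    "Song 1": {"minor key", "acoustic guitar", "slow tempo", "female vocal"},
    "Song 2": {"minor key", "acoustic guitar", "slow tempo", "male vocal"},
    "Song 3": {"major key", "synthesizer", "fast tempo", "female vocal"},
}

print(most_similar(songs, "Song 1"))
```

No listener data appears anywhere in this sketch. Two songs come out as related because their descriptions overlap, which is the distinction the paragraph above draws.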

Now let’s return to our discussion of Zotero.  As mentioned earlier, the envisioned recommendation engine for Zotero has been compared to Amazon’s recommendation engine.  The ability of users to add custom tags underscores how Zotero was influenced by web 2.0 models.  Apparently Zotero developers looked forward to the day when the “data pool” in Zotero would reach critical mass and enable the recommendation system to predict what items users would want to add to their library.  As we have seen, these models have inherent limitations.  They make recommendations on the basis of shared information, rather than on the basis of similarity between concepts.  I think at some level Zotero developers sensed this problem.  That is probably why they designed Zotero to capture terms from controlled vocabularies as part of the metadata it downloaded from online databases.  Unfortunately, though, some users and developers have said that imported tags, such as subject headings from library catalogs, are pretty much useless in Zotero.  Furthermore, the fact that Zotero comes with a button for turning off “automatic” tags, and that some translators are sloppy or fail to capture subject headings, suggests that most users would rather avoid using these terms from controlled vocabularies.

And so the problem with Zotero is that its users and developers generally resist incorporating ontologies into their libraries (item types and item relations/functions are notable exceptions).  That may sound like a very abstract thing to say.  So let me provide you with some concrete examples of what this would look like.  The first is a challenge I would like to issue to the Zotero developers.  It has been said that Zotero would allow a group of historians to collaboratively build a library on “a topic lacking a chapter in the Guide to Historical Literature.”  My “bibliographer test” is a slight variation on this: 1) pick any section in this bibliographic guide, 2) enter all but one of the books in the given bibliography into Zotero, and 3) program Zotero’s recommendation engine so that, in the majority of cases, it can identify the item missing from the library.  Similarly, I would like to see us develop algorithms for “related records searches.”  You may think this is impossible, but this capability already exists in the Web of Science database.  And as we have already seen, Netflix and Pandora provide examples of the kind of semantic work it takes to make these types of searches feasible.
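The three steps of the “bibliographer test” amount to a leave-one-out evaluation, which can be sketched as a small harness. Everything here is an assumption made for illustration: the function names, the titles, the subject headings, and the stand-in recommender that scores candidates by shared subject headings. It is not a Zotero API and not a proposal for how the real engine should rank items.

```python
from collections import Counter


def bibliographer_test(bibliography, candidates, recommend):
    """Hold out each item in turn; return how often `recommend` ranks it first."""
    hits = 0
    for held_out in bibliography:
        library = [item for item in bibliography if item != held_out]
        pool = [item for item in candidates if item not in library]
        ranked = recommend(library, pool)
        if ranked and ranked[0] == held_out:
            hits += 1
    return hits / len(bibliography)


def make_subject_recommender(subjects):
    """Stand-in recommender: score candidates by subject headings shared with the library."""
    def recommend(library, pool):
        library_terms = Counter(
            term for title in library for term in subjects[title]
        )

        def score(title):
            return sum(library_terms[term] for term in subjects[title])

        # Best score first; ties are broken alphabetically by title.
        return sorted(pool, key=lambda title: (-score(title), title))
    return recommend


# Hypothetical titles with hypothetical subject headings.
subjects = {
    "Title A": {"reformation", "germany"},
    "Title B": {"reformation", "england"},
    "Title C": {"reformation", "printing"},
    "Title X": {"railways", "england"},
    "Title Y": {"printing", "china"},
}
bibliography = ["Title A", "Title B", "Title C"]

rate = bibliographer_test(
    bibliography, list(subjects), make_subject_recommender(subjects)
)
print(rate > 0.5)  # passes the test if the missing item is found in most cases
```

The harness is indifferent to how `recommend` works, so an overlap-based engine and a concept-based engine could be scored against the same section of a bibliographic guide.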

After reading this post, you may feel that Zotero has been heading down the wrong path.  I prefer to think of Zotero as having made some amazing progress over the last three years.  And I think the genesis of the ideas it needs is already in place.  In my estimation, we need to think more expansively about what it means to carry out semantic searches with Zotero.  It also seems to me that we need to think more carefully about balancing the benefits of web 2.0 with the sophistication of the semantic web.  I will be excited to see what the developers come up with.  And maybe if I work more on my programming skills, I can help with writing the code.  As I see it, this will be an exciting opportunity for carrying out theoretical research in the digital humanities.
