ontologies – THATCamp CHNM 2009 http://chnm2009.thatcamp.org The Humanities And Technology Camp Mon, 06 Aug 2012 18:37:51 +0000 en-US hourly 1 https://wordpress.org/?v=4.9.12 Standards http://chnm2009.thatcamp.org/06/08/standards/ http://chnm2009.thatcamp.org/06/08/standards/#comments Mon, 08 Jun 2009 21:22:08 +0000 http://thatcamp.org/?p=103

Here’s my original proposal for THATCamp. The question and issues I’m interested in entertaining dovetail nicely, I think, with those that have been raised by Sterling Fluharty in his two posts.


The panel at last year’s THATCamp that I found the most interesting was the one on “Time.” We had a great discussion about treating historical events as data, and a number of us expressed interest in what an events microformat/standard might look like. I’d be interested in continuing that conversation at this year’s THATCamp. I know Jeremy Boggs has done some work on this, and I’m interested in developing such a microformat so that we can expose more of the data in our History Engine for others to use and mashup.

While I’d like to talk about that particular task, I’d also be interested in discussing a related but more abstract question that might be of interest to more THATCampers. Standards make sense when dealing with discrete, structured, and relatively simple kinds of data (e.g. bibliographic citations, locations), but I’m wondering if much of the evidence we deal with as humanists requires so much individual interpretation to turn into structured data that the development of interoperability standards might not make much sense. I’m intrigued by the possibility of producing data models that represent complex historical and cultural processes (e.g. representing locations and time in a way that respects and reflects a Native American tribe’s sense of time and space, etc.). An historical event doesn’t seem nearly that complicated, but even with it I wonder if as humanists we might not want a single standard but instead want researchers to develop their own idiosyncratic data models that reflect their own interpretation of how historical and cultural processes work. I’m obviously torn between the possibilities afforded by interoperability standards and a desire for interpretive variety that defies standardization.


In his first post, Sterling thoughtfully championed the potential offered by “controlled vocabularies” and “the semantic web.” I too am intrigued by the possibilities that ontologies, both modest and ambitious, offer, say, to find similar texts (or other kinds of evidence), to make predictions, to uncover patterns. (As an aside, but on a related subject, I’d be in favor of having another session on text mining at this year’s THATCamp if anyone else is interested.) Sterling posed a question in his proposal: “Can digital historians create algorithmic definitions for historical context that formally describe the concepts, terms, and the relationships that prevailed in particular times and places?” I’m intrigued by that ambitious enterprise, but as my proposal suggests I’m cautious and skeptical for a couple of reasons. First, I’m dubious that most of what we study and analyze as humanists can be fit into anything resembling an adequate ontology. The things we study–e.g. religious belief, cultural expression, personal identity, social conflict, historical causation, etc., etc.–are so complex, so heterogeneous, so plastic and contingent that I have a hard time envisioning how they can be translated into and treated as structured data. As I suggested in my proposal, even something as modest as an “historical event” may be too complex and subjective to be the object of a microformat. Having said that, I’m intrigued by the potential that data models offer to consider quantities of evidence that defy conventional methods, that are so large that they can only be treated computationally. I’m sure that the development of ambitious data models will lead to interesting insights and help produce novel and valuable arguments. But–and this brings me to my second reservation–those models or ontologies are, of course, themselves products of interpretation. In fact they are interpretations–informed, thoughtful (hopefully) definitions of historical, cultural relationships.
There’s nothing wrong with that. But adherence to “controlled” vocabularies or established “semantic” rules or any standard, while unquestionably valuable in terms of promoting interoperability and collaboration, defines and delimits interpretation and interpretative possibility. I’m anti-standards in that respect. When we start talking about anything remotely complex–which includes almost everything substantive we study as humanists–I hope we see different digital humanists develop their own idiosyncratic, creative data models that lead to idiosyncratic, creative, original, thoughtful, and challenging arguments.

All of which is to say that I second Sterling in suggesting a session on the opportunities and drawbacks of standards, data models, and ontologies in historical and humanistic research.

]]>
http://chnm2009.thatcamp.org/06/08/standards/feed/ 7
Zotero and Semantic Search http://chnm2009.thatcamp.org/05/29/zotero-and-semantic-search/ http://chnm2009.thatcamp.org/05/29/zotero-and-semantic-search/#comments Fri, 29 May 2009 08:26:37 +0000 http://thatcamp.org/?p=62

Here is my original proposal for THATCamp, which I hoped would fit in with session ideas from the rest of you:

I would like to discuss theoretical issues in digital history in a way that is accessible and understandable to beginning digital humanists.  This is probably the common thread running through my interests and research.  I really wonder, for instance, whether digital history has its own research agenda or whether it simply facilitates the research agenda of traditional academic history.  I believe that Zotero will need a good theory for its subject indexing before it can launch a recommendation service.  Are any digital historians planning on producing any non-proprietary controlled vocabularies?  We need to have a good discussion of what the semantic web means for digital history.  Are we going to sit on our hands while information scientists hardwire the Internet with presentist ontologies?  Can digital historians create algorithmic definitions for historical context that formally describe the concepts, terms, and the relationships that prevailed in particular times and places?  What do digital historians hope to accomplish with text mining?  Are we going to pursue automatic summarization, categorization, clustering, concept extraction, entity relation, and sentiment analysis?  What methods from other disciplines should we consider when pursuing text mining?  What should be our stance on the attempt to reduce the “reading” of texts to computational algorithms and mathematical operations?  Will the programmers among us be switching over to parallel programming as chip manufacturers begin producing massively multi-core processors?  How prepared will we be to exploit the full capabilities of high-performance computing once it arrives on personal computers in the next few years?

Here is a post that just went up at my blog that addresses some of these issues and questions:

Zotero and Semantic Search

The good news is that Zotero 2.0 has arrived.  This long-awaited version allows a user to share her or his database/library of notes and citations with others and to collaborate on research in groups.  This will be a tremendous help to scholars who are coauthoring papers.  It also has a lot of potential for teaching research methods to students and facilitating their group projects.

The bad news is that I think Zotero is about to hit another roadblock.  The development roadmap says version 2.1 “will offer users a recommendation service for books, articles, and other scholarly resources based on the content in your Zotero library.”  This could mean simply that Zotero will aggregate all of the user libraries, identify overlap and similarity between them, and then offer items to users that would fit well within their library.  This would be similar to how Facebook compares my list of friends with those of other people in order to recommend to me likely friends with whom I already have a lot of friends in common.  If this was all there was to the process of a recommendation system in Zotero, then I think Zotero would meet its goal.  But if Zotero is to live up to its promise to enable users to discover relevant sources, then I think there is still a lot of work to be done.
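To make the "aggregate the libraries and look for overlap" idea concrete, here is a minimal sketch in Python. It is not Zotero's code or its planned design; the function name `recommend_by_overlap`, the user names, and the book titles are all made up for illustration, and the scoring rule (count the items two libraries share) is just one simple assumption about how overlap might be measured.

```python
from collections import Counter


def recommend_by_overlap(libraries, user, top_n=3):
    """Suggest items held by users whose libraries overlap with `user`'s.

    `libraries` maps a (hypothetical) user name to a set of item titles.
    Each candidate item is scored by how many items its owner shares
    with `user`, summed over all owners.
    """
    mine = libraries[user]
    scores = Counter()
    for other, items in libraries.items():
        if other == user:
            continue
        shared = len(mine & items)  # items the two libraries have in common
        if shared == 0:
            continue  # no overlap, so this library tells us nothing
        for item in items - mine:
            scores[item] += shared
    # Highest score first; ties broken alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical libraries, purely for illustration.
libraries = {
    "ann": {"Book A", "Book B", "Book C"},
    "ben": {"Book A", "Book B", "Book D"},
    "cal": {"Book A", "Book E"},
    "dee": {"Book F"},
}

print(recommend_by_overlap(libraries, "ann"))
```

Notice that nothing in this sketch knows what any of the books are about. It only knows who owns what, which is exactly the limitation I want to draw attention to.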

This may seem like a distinction without a difference.  My point is a subtle one and hopefully some more examples will illustrate what I am trying to say. But first let’s define the semantic web.  According to Wikipedia, “The semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing, and combining information on the web.”  Zotero fulfils this vision when it captures citation information from web sites and makes it available for sharing and editing.  Amazon does something similar with its recommendation service.  It keeps track of what books people purchase, identifies patterns in these buying behaviors, and then recommends books that customers will probably like.  Zotero developers have considered using a similar system to run Zotero’s recommendation engine.  These are examples of the wisdom of the crowd in the world of web 2.0 at its best.

Unfortunately, there are limits to how much you can accomplish through crowdsourcing.  Netflix figured this out recently and is offering $1 million to whoever can “improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.”  The programming teams in the lead have, through trial and error, figured out that they needed to extract rich content from sites like the Internet Movie Database in order to refine their algorithms and predictive models.  This is kind of like predicting the weather; the more variables you can include in your calculations, the better your prediction will be.  However, in the case of movies the concepts for classifying movies are somewhat subjective.  Without realizing it, these prize-seeking programmers have been developing an ontology for movies.  (That may be a new word for you–according to Wikipedia, “an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.”)  Netflix is essentially purchasing a structured vocabulary and matching software that will allow it to vastly improve the accuracy of its recommendation engine when it comes to predicting what movies its customers will like.

One company that has taken ontologies quite seriously is Pandora, a “personalized internet radio service that helps you find new music based on your old and current favorites.”  The company tackled head-on the problem of semantically representing music by creating the Music Genome Project, which categorizes every song in terms of nearly 400 attributes.  And here is where the paradigm shift becomes evident.  Rather than aggregating and mining the musical preferences of groups of people, like what Amazon does with its sales data on books, Pandora defines similarity between songs in terms of conceptual overlap.  In other words, two songs are related to one another in the world of Pandora because they share a whole bunch of attributes–not because similar people listen to similar music.  (I told you this would be a subtle distinction.)  This is an example of how the semantic web trumps web 2.0.
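The distinction may be easier to see in code. Below is a rough sketch, in Python, of similarity defined by conceptual overlap. It is not Pandora's actual algorithm, which I have not seen; I am assuming that each song is simply tagged with a set of attributes and that similarity is the Jaccard ratio (shared attributes divided by all attributes). The song names and attributes are invented.

```python
def attribute_similarity(a, b):
    """Jaccard similarity: shared attributes divided by all attributes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(catalog, seed, top_n=2):
    """Rank every other song in `catalog` by attribute overlap with `seed`."""
    seed_attrs = catalog[seed]
    scored = [
        (title, attribute_similarity(seed_attrs, attrs))
        for title, attrs in catalog.items()
        if title != seed
    ]
    # Most similar first; ties broken alphabetically so output is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical catalog: each song is described by a handful of attributes.
catalog = {
    "Song X": {"acoustic guitar", "minor key", "female vocal", "slow tempo"},
    "Song Y": {"acoustic guitar", "minor key", "male vocal", "slow tempo"},
    "Song Z": {"synthesizer", "major key", "fast tempo"},
}

for title, score in most_similar(catalog, "Song X"):
    print(f"{title}: {score:.2f}")
```

No listener data appears anywhere in this sketch. Two songs come out as related only because someone described them with the same vocabulary, which is why the quality of that vocabulary matters so much.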

Now let’s return to our discussion of Zotero.  As mentioned earlier, the envisioned recommendation engine for Zotero has been compared to Amazon’s recommendation engine.  The ability of users to add custom tags underscores how Zotero was influenced by web 2.0 models.  Apparently Zotero developers looked forward to the day when the “data pool” in Zotero would reach critical mass and enable the recommendation system to predict what items users would want to add to their library.  As we have seen, these models have inherent limitations.  They make recommendations on the basis of shared information, rather than on the basis of similarity between concepts.  I think at some level Zotero developers sensed this problem.  That is probably why they designed Zotero to capture terms from controlled vocabularies as part of the metadata it downloaded from online databases.  Unfortunately, though, some users and developers have said that imported tags, such as subject headings from library catalogs, are pretty much useless in Zotero.  Furthermore, the fact that Zotero comes with a button for turning off “automatic” tags, and that some translators are sloppy or fail to capture subject headings, suggests that most users would rather avoid using these terms from controlled vocabularies.

And so the problem with Zotero is that its users and developers generally resist incorporating ontologies into their libraries (item types and item relations/functions are notable exceptions).  That may sound like a very abstract thing to say.  So let me provide you with some concrete examples of what this would look like.  The first is a challenge I would like to issue to the Zotero developers.  It has been said that Zotero would allow a group of historians to collaboratively build a library on “a topic lacking a chapter in the Guide to Historical Literature.”  My “bibliographer test” is a slight variation on this: 1) pick any section in this bibliographic guide, 2) enter all but one of the books in the given bibliography into Zotero, and 3) program Zotero’s recommendation engine so that, in the majority of cases, it can identify the item missing from the library.  Similarly, I would like to see us develop algorithms for “related records searches.”  You may think this is impossible, but this capability already exists in the Web of Science database.  And as we have already seen, Netflix and Pandora provide examples of the kind of semantic work it takes to make these types of searches feasible.
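For anyone who wants to try the “bibliographer test,” here is one way the three steps might be wired up as a leave-one-out check in Python. Everything in it is an assumption on my part: the titles and subject headings are invented, the `recommend` function is a stand-in that matches subject headings (it is not Zotero's recommendation engine), and I have added a pool of unrelated “distractor” titles so that the recommender has something to get wrong.

```python
def jaccard(a, b):
    """Shared subject headings divided by all subject headings."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def recommend(library, candidates, subjects):
    """Pick the candidate whose headings best match the library as a whole."""
    profile = set()
    for title in library:
        profile |= subjects[title]
    pool = sorted(c for c in candidates if c not in library)
    return max(pool, key=lambda c: jaccard(subjects[c], profile))


def bibliographer_test(bibliography, distractors, subjects):
    """Hold out each title in turn; return the share of holes correctly filled."""
    hits = 0
    for held_out in bibliography:
        library = bibliography - {held_out}
        guess = recommend(library, bibliography | distractors, subjects)
        hits += guess == held_out
    return hits / len(bibliography)


# Hypothetical titles and subject headings, for illustration only.
subjects = {
    "Title 1": {"reconstruction", "southern states", "politics"},
    "Title 2": {"reconstruction", "freedmen", "labor"},
    "Title 3": {"reconstruction", "southern states", "freedmen"},
    "Title 4": {"gilded age", "railroads", "labor"},
    "Title 5": {"new deal", "politics"},
}
bibliography = {"Title 1", "Title 2", "Title 3"}
distractors = {"Title 4", "Title 5"}

rate = bibliographer_test(bibliography, distractors, subjects)
print(f"missing item identified in {rate:.0%} of cases")
```

A rate above one half would meet the “majority of cases” bar. A toy example like this passes easily, of course; the real test is whether any vocabulary holds up across a full section of the Guide.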

After reading this post, you may feel that Zotero has been heading down the wrong path.  I prefer to think of Zotero as having made some amazing progress over the last three years.  And I think the genesis of the ideas it needs is already in place.  In my estimation, we need to think more expansively about what it means to carry out semantic searches with Zotero.  It also seems to me that we need to think more carefully about balancing the benefits of web 2.0 with the sophistication of the semantic web.  I will be excited to see what the developers come up with.  And maybe if I work more on my programming skills, I can help with writing the code.  As I see it, this will be an exciting opportunity for carrying out theoretical research in the digital humanities.

]]>
http://chnm2009.thatcamp.org/05/29/zotero-and-semantic-search/feed/ 14