Crowdsourcing – The Session

Being a semi-liveblog of our first session of the day – please annotate as you see fit (and apologies if I left anything or anyone out).

Attendees: Andy Ashton, Laurie Kahn-Leavitt, Tim Brixius, Tad Suiter, Susan Chun, Josh Greenberg, Lisa Grimm, Jim Smith, Dan Cohen

Lisa: Kickoff with brief explanation of upcoming project needing crowdsourcing.

Susan: Interested in access points to large-scale collections – machine-generated keywords from transcriptions/translations, etc. Finding the content the user is most likely to engage with.
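
A rough sketch of what ‘machine-generated keywords from transcriptions’ could look like – TF-IDF-style term extraction, assuming scikit-learn and a couple of made-up transcription strings; candidate terms would still need review before becoming access points:

```python
# Hedged sketch: machine-generated keywords from transcriptions via TF-IDF.
# Assumes scikit-learn is available; the transcription strings are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

transcriptions = [
    "Letter describing the 1906 San Francisco earthquake and fire",
    "Diary entry on nursing practice at a field hospital, 1918",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(transcriptions)
terms = vectorizer.get_feature_names_out()

# Top-scoring terms per document become candidate access points for review.
for row in tfidf.toarray():
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print([term for term, score in top if score > 0])
```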

Josh: Landing page of the site to steer the crowd to certain areas – Flickr Commons?

Susan: Discovery skillset? Asking users – ‘what are you interested in?’ Discipline-specific, multi-lingual vocabularies could be generated?

Josh: Getting more general: moving beyond the monoculture – what is the crowd? Layers of interest; figuring out the crowd’s role – lightweight applications tailored to particular communities. NYPL historical maps project example – can we crowdsource the rectification of maps? Fits well w/community dynamics, but the information is useful elsewhere. Who are the user communities?
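
A rough sketch of the rectification step itself – fitting an affine transform from crowd-supplied control points; numpy only, and the control-point values are made up for illustration:

```python
# Hedged sketch of map rectification: fit an affine transform from
# crowd-supplied control points (map pixel -> real-world lon/lat).
import numpy as np

# Each pair: (pixel_x, pixel_y) on the scanned map, (lon, lat) on the ground.
pixels = np.array([[120, 340], [980, 310], [150, 1020], [990, 1000]], float)
world  = np.array([[-74.01, 40.71], [-73.95, 40.71],
                   [-74.01, 40.68], [-73.95, 40.68]], float)

# Solve world = A @ [px, py, 1] in the least-squares sense.
design = np.hstack([pixels, np.ones((len(pixels), 1))])
coeffs, *_ = np.linalg.lstsq(design, world, rcond=None)

def to_world(px, py):
    """Map a pixel coordinate on the scan to an estimated lon/lat."""
    return np.array([px, py, 1.0]) @ coeffs

print(to_world(500, 700))
```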

Laurie: Relation between face-to-face contact and building a crowdsourced community? Susan & Josh’s projects have a large in-person component.

Defining the need for crowdsourcing – what is the goal? Josh likes the notion of hitting multiple birds with one stone. What is the crowd’s motivation? How can we squeeze as many different goals as possible out of one project?

Tad: issue of credentialing – power of big numbers.

Jim: Expert vs. non-expert – research suggests amateurs are very capable in certain circumstances.

Susan: Dating street scenes using car enthusiasts – effective, but the key is in credentialing.

Andy: The problem of the 3% of information that isn’t good – the 97% that’s correct goes by the wayside. Cultural skepticism over crowdsourcing, but acceptance of getting obscure information wherever possible (e.g. ancient texts). Looking into crowdsourcing for text encoding. Data curation and quality control issues to be determined. Interested to see Drexel project results next year?

Susan: Human evaluation of results of crowdsourcing – tools available from the project there. (Yay!)

Jim: Transitive trust model – if I trust Alice, can I trust Bob?
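
A toy sketch of Jim’s transitive-trust question – trust decaying along each hop of a chain; the graph, weights, and decay factor are illustrative assumptions, not a real model:

```python
# Hedged sketch of a transitive trust model: trust decays along each hop,
# so "I trust Alice, Alice trusts Bob" yields weaker (but nonzero) trust in Bob.
# The graph, weights, and decay factor are illustrative assumptions.
DECAY = 0.5
trust_edges = {          # direct trust statements, 0..1
    "me": {"alice": 0.9},
    "alice": {"bob": 0.8},
}

def trust(source, target, seen=None):
    """Best trust path from source to target, decayed per hop."""
    seen = seen or {source}
    direct = trust_edges.get(source, {})
    if target in direct:
        return direct[target]
    best = 0.0
    for mid, weight in direct.items():
        if mid not in seen:
            best = max(best, weight * DECAY * trust(mid, target, seen | {mid}))
    return best

print(trust("me", "bob"))   # 0.9 * 0.5 * 0.8 = 0.36
```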

Josh: Citizen journalism, e.g. Daily Kos – self-policing crowd. Relies on issues of scale, but not just ‘work being done that’s not by you.’ Cultural argument about expertise/authority – ‘crowd’ meaning the unwashed vs. the experts.

Susan: The long tail is critical – large numbers of new access points. How to encourage it and make it valuable?

Tad: Translations: ‘they’re all wrong!’ (Great point).

Andy: Depth, precision & granularity over breadth

Jim: Unpacking the digital humanities piece – leveling effect. Providing an environment for the community, not just a presentation.

Josh: Using metrics to ‘score’ the crowd.

Tad: Wikipedia example – some interested in only one thing, some all over.

Josh: Difference between crowdsource activity as work vs. play. Treating it as a game – how to cultivate that behavior?

Susan: Fun model relies on scale.

Josh: MIT PuzzleHunt example; how to create a game where the rules generate that depth?

Susan: Validation models problematic – still requires experts to authorize.

Tad: Is PuzzleHunt work, rather than play?

Andy: NITLE Predictions Market – great example of crowdsourcing as play.

Dan: Still hasn’t gotten the scale of InTrade, etc. – how to recruit the crowd remains the problem. Flickr participation seems wide, but not deep.

Josh: Options: compel people to do the job because they have to, follow the Amazon Mechanical Turk model and pay them, or dig deeper into unpacking the motivations of amateur and expert communities.

Susan: Work on motivation in their project determined that invited users tagged at a much higher rate than those who just jumped in.

Susan: Paying on Mechanical Turk not as painful as it might be – many doing tons of work for about $25.

Josh: So many ways to configure crowdsourcing model – pay per action? Per piece? Standards & practices don’t exist yet.

Susan: We’ve talked a lot about them, but there are still relatively few public crowdsourcing projects.

Dan: Google averse to crowdsourcing (Google Books example) – they would rather wait for a better algorithm (via DH09).

Susan: But they have scale!

Dan: Data trumps people for them.

Andy: Image recognition – it’s data, but beyond current capabilities.

Dan: Third option: wait five years – example of Google’s OCR. Google has the $$ to re-key all Google Books, but they are not doing it.

Josh: Google believes that hyperlinks are votes –

Dan: Latent crowdsourcing, not outright
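
A toy sketch of the ‘hyperlinks are votes’ / latent crowdsourcing intuition – a PageRank-style power iteration over a made-up three-page link graph (not Google’s actual algorithm):

```python
# Hedged sketch of "hyperlinks are votes": toy PageRank-style power iteration
# over a made-up three-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new[target] += damping * share   # each link "votes" for its target
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```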

Susan: Translation tools largely based on the average – our spaces don’t fit that model

Tad: The algorithm model gives a strong incentive to keep information proprietary – you have everything invested in protecting your information, not openness.

Dan: OpenLibrary wiki-izing their catalog, vs. Google approach. Seems purely an engineering decision.

Andy: Approach informed by a larger corporate strategy – keeping information in the Google wrapper. Institutional OPACs almost always averse to crowdsourcing as well. What is the motivating factor there?

Josh: Boundary drawing to reinforce professional expertise and presumption that the public doesn’t know what it’s doing.

Andy: Retrieval interfaces are horrible in library software – why keep the best metadata locked away?

Sending around link to Women Physicians…

Susan: Different views for different communities – work with dotSub for translation.

Dan: Other examples of good crowdsourced projects?

Susan: Examples of a service model?

Josh: Terms of service? Making sure that the data is usable long-term to avoid the mistakes of the past. Intellectual property remains owned by the person doing the work, with a license granted to NYPL allowing NYPL to pass that license along to others. Can’t go back to the crowd to ask for permission later. Getting users to agree at signup is key. The rights and policies side of things should appear on the blog in the future.

Jim: Group coding from Texas A&M moved into a crowdsourcing model – a possible ‘model’ for future trust models.

Please continue to add examples of projects (and of course correct any ways I’ve wildly misquoted you).

It would be great to have some crowdsourcing case studies – e.g., use Flickr for project x, a different approach is better for project y…