Some real (bizarre) constructive interference

Paul Kwiat’s group at UIUC is doing experiments in counterfactual computing, which use constructive interference (among other phenomena).

Kwiat explains that:

In a sense, it is the possibility that the algorithm could run which prevents the algorithm from running…

which sounds dangerously like post-modern jargon. Should we issue a Sokal alert? No, physics is just outrunning our ability to parody it (again).

Of course none of this is actually relevant to our project. Maybe we should be finding ways to use the internet for counterfactual computing…

Update: John Holbo at Crooked Timber has a somewhat more extreme reaction to this development…

Update: Wait, there’s more!  Sean Carroll at Cosmic Variance explains it all in terms of salad, steak, and sleeping puppies (no, really!  I told you physics was beyond parody).  As a bonus he tells you all you need to know about quantum mechanics, and notes that “the rest is just some equations to make it look like science.”

Folksonomies, taxonomies and population thinking

In preparation for a talk on “Science as social practice” I came across a quote by Ernst Mayr (via Three Toed Sloth):

The assumptions of population thinking are diametrically opposed to those of the typologist. …. For the typologist, the type (eidos) is real and the variation an illusion, while for the populationist the type… is an abstraction and only the variation is real.

This seems to me to perfectly capture the issues that divide advocates of folksonomies and taxonomies. Taxonomies attempt (and, as advocates acknowledge, always fail) to find a complete and sharp set of “types” with which to classify their domain. Folksonomies rely on the population of users to create a population of tags that can adequately organize a population of content items.

Perhaps the biggest differences come in managing change. Folksonomies are inherently squishy and messy, and accommodate change much more comfortably, at the cost of never providing the benefits of crisp classification. They are much more successful online because they leverage the power and fluidity of digital environments.

Taxonomies are inherently sharp-edged and therefore brittle. They tend to break rather than change gracefully. Furthermore, because they typically require a substantial investment of human effort in construction, training, labeling, etc., change is expensive and painful. On the other hand, sometimes we find these prices worth paying. Searching a physical library where the books were informally tagged by users (perhaps using post-its) would be a nightmare.

We are working on the population side of this dichotomy. Our difference from existing folksonomies (e.g. del.icio.us) is that current sites provide only an extensional definition of their tags, while our technology will create an intensional definition (though squishier than the standard philosophical idea of intension). This lets us provide significantly more complete support for the real diversity and complexity of the user population than existing sites. Let me explain.

On existing sites, the meaning of a tag is simply the set of things that have been given that tag. You want to know whether a tag is appropriate in a given case? Look at what else has been given that tag, and decide whether the thing you were planning to tag is “like” the others. This is a classic case of extensional definition.
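To make the idea concrete, here is a minimal sketch of an extensional tag store. The function names and the sample items are invented for illustration and are not taken from any existing site; the point is only that the tag's "meaning" is nothing more than the set of things carrying it.

```python
from collections import defaultdict

# Extensional definition: a tag "means" exactly the set of items that carry it.
# (Hypothetical structure; names and items are made up for illustration.)
tag_extension = defaultdict(set)

def apply_tag(tag, item):
    """Record that someone gave `item` this tag."""
    tag_extension[tag].add(item)

def examples_of(tag):
    """The items a person would browse to judge whether a new item is 'like' the others."""
    return sorted(tag_extension.get(tag, set()))
```

Nothing here can say whether a new, untagged item belongs under the tag; a person has to look at the examples and decide.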

Our initial tagging system will create intensional definitions of tags, based on a given user’s choices. Specifically, it will summarize a population of tagging decisions by a classifier. Initially we’ll use a simple Bayesian filter, like today’s spam filters. If this doesn’t work well enough, we’ll use other technologies; there are a lot to choose from (Support Vector Machines, Latent Semantic Indexing, etc.). But we think simple methods will work well enough.
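As a rough illustration of what "summarizing tagging decisions by a classifier" could look like, here is a simplified naive Bayes sketch. It is an assumption-laden toy, not our actual engine: the class name `TagDefinition`, the add-one smoothing, and the choice to score only the tokens present in an item are all illustrative choices.

```python
import math
from collections import Counter

class TagDefinition:
    """One user's learned definition of one tag (hypothetical sketch)."""

    def __init__(self):
        # For tagged (True) and untagged (False) items: how many items
        # contained each token, and how many items were seen in total.
        self.token_counts = {True: Counter(), False: Counter()}
        self.item_counts = {True: 0, False: 0}

    def train(self, tokens, has_tag):
        """Record one decision: the user did or did not give this item the tag."""
        self.token_counts[has_tag].update(set(tokens))
        self.item_counts[has_tag] += 1

    def probability(self, tokens):
        """Estimate how likely the user is to apply this tag to a new item."""
        # Prior odds, with add-one smoothing (an assumption of this sketch).
        log_odds = math.log((self.item_counts[True] + 1) /
                            (self.item_counts[False] + 1))
        for tok in set(tokens):
            p_tagged = (self.token_counts[True][tok] + 1) / (self.item_counts[True] + 2)
            p_untagged = (self.token_counts[False][tok] + 1) / (self.item_counts[False] + 2)
            log_odds += math.log(p_tagged / p_untagged)
        # Clamp so the exponential below cannot overflow on long items.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Each user would own one such object per tag. When the system guesses wrong, calling `train` again with the user's correction is all it takes to adjust that definition.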

So what does this do for our users? There are two benefits that we think are important. First, since our system has an intensional definition of what a given user means by a tag, it can classify new items for that user. If it is wrong, the user can correct it and thus improve that definition.

Second, since the system has these intensional definitions for each user, it can compare them and find similar tag definitions. Note that it makes no difference how a given user spelled their tag. If I say “SF”, you say “Science Fiction”, and someone else says “Space opera”, the system judges the similarity of the intensional definitions and ignores the way their names are spelled. In our environment, similarity is a relation on the intensional definitions, not on the tag spellings.
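Here is one hedged sketch of how such a comparison might work. It assumes each user's tag definition can be summarized as token counts over the items they tagged, and it uses cosine similarity as the measure; both are illustrative assumptions, and the function names are made up. Note that the tag's spelling never enters the computation.

```python
import math
from collections import Counter

def tag_profile(tagged_items):
    """Summarize the items one user gave one tag as per-token item counts."""
    profile = Counter()
    for tokens in tagged_items:
        profile.update(set(tokens))
    return profile

def similarity(profile_a, profile_b):
    """Cosine similarity of two tag profiles: 0.0 means no shared tokens."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With profiles like these, my “SF” and your “Science Fiction” would score as close neighbors if we tagged similar content, while two users who both typed “SF” for unrelated material would not.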

So how does all this relate to population thinking? Systems with shared extensional definitions of tag meaning or shared moderation values implicitly push their users toward a single, shared, “ideal” perspective. Tags and moderation values tend to be viewed in terms of “consensus meaning”, and sites like Slashdot have explicit meta-moderation to enforce this consensus!

In contrast, by generating intensional definitions we explicitly accept (and celebrate!) that the users of this technology will be a diverse population, each with different interests and ways of classifying the world. Our goal is to accommodate the real diversity of the user population, while also giving people ways to adopt each other’s perspectives, and to collaborate when they want to.

Update: Tim Spalding of LibraryThing has some interesting meditations on how to compute intensional meanings from an extensional tag pool.  (He doesn’t use those terms, but he needs some sort of intensional definition to make the similarity judgements he’s aiming for.)  He specifically comments on the messy nature of tagging and the need to use statistical methods to extract cleaner information from the tags.

However, Tim is still effectively throwing away a lot of information. A given user might have a very focused meaning for some of the tags he mentions, such as “WWI” or “memoir”. But since LibraryThing is looking only at the total pool of tags spelled the same, rather than the meaning of those tags within each user’s library, that focused meaning gets washed out. This is exactly the effect we are trying to avoid. There are two key differences. First, Tim is computing intensional definitions over the whole tag pool, while we are computing them for each user separately. Second, Tim is using only the tags and the identities of the tagged items, while we are also using the “fine structure” of the tagged items (the tokens they contain). Of course, for books this fine structure is often not available.

Paying by choosing — learning user interests to enhance site value

In his blog EconoMeta, Adam Marsh examines the “deal” between websites and readers: content is free and advertising funds the site. Marsh points out that selling advertising depends on accurately targeting an audience, and an audience forms when content serves the interests of readers. Thus advertising and serving content to readers have a tight bidirectional relationship. That relationship depends, in both directions, upon knowing audience interest.

Audience interest can be approached from two ends: group and individual. Google starts with the group; it sorts search hits by aggregating everyone’s opinions (as expressed by their links). At Peerworks we are working from the opposite end, developing technology to help each individual teach the system what he or she is personally interested in. We do this by aggregating information about what a reader has found interesting in the past.

Knowing a reader’s preferences lets a site show them interesting content, but it also lets the site show them more precisely targeted ads, presumably generating more income and quite possibly making the user happier.

A more profitable site can offer better content. But for this circle to be virtuous, a website must provide readers with ways to understand and control the information that is collected about their interests, and must assure them that information cannot be shared or sold without their permission.

Brief status

Peerworks will build a portable, open source automated tagging engine that can be used by existing systems such as Scoop, Slashcode, Drupal and Plone, and can also form the basis for new systems.

We’ve developed a custom feed aggregator for testing, and have collected lots of feeds and feed items for moderation. We are now working on the moderation UI. This will be used by a team of moderators to create a stable body of tagged items to use for testing and tuning individual tagging algorithms under development.

Initially we expect to use an algorithm very close to the SpamBayes classifier. There are existing examples of how to do this; Ben Kamens’ analysis is helpful, and Laird Breyer’s library dbacl uses Bayesian classifiers to assign tags.

Once our system can learn individual tagging styles well enough to make users comfortable, we’ll be looking for existing sites that would like to integrate it. Our further development will be guided by the needs of these initial partner sites.

If you’d like to know more, leave a comment with your questions or drop me a line: jed (at) peerworks.org