Folksonomies, taxonomies and population thinking

In preparation for a talk on “Science as social practice” I came across a quote by Ernst Mayr (via Three Toed Sloth):

The assumptions of population thinking are diametrically opposed to those of the typologist. …. For the typologist, the type (eidos) is real and the variation an illusion, while for the populationist the type… is an abstraction and only the variation is real.

This seems to me to perfectly capture the issues that divide advocates of folksonomies and taxonomies. Taxonomies attempt (and, as advocates acknowledge, always fail) to find a complete and sharp set of “types” with which to classify their domain. Folksonomies rely on the population of users to create a population of tags that can adequately organize a population of content items.

Perhaps the biggest differences come in managing change. Folksonomies are inherently squishy and messy, and accommodate change much more comfortably, at the cost of never providing the benefits of crisp classification. They are much more successful online because they leverage the power and fluidity of digital environments.

Taxonomies are inherently sharp-edged and therefore brittle. They tend to break rather than change gracefully. Furthermore, because they typically require a substantial investment of human effort in construction, training, labeling, etc., change is expensive and painful. On the other hand, sometimes we find these prices worth paying. Searching a physical library where the books were informally tagged by users (perhaps using post-its) would be a nightmare.

We are working on the population side of this dichotomy. Our difference from existing folksonomies (e.g. del.icio.us) is that current sites provide only an extensional definition of their tags, while our technology will create an intensional definition (though squishier than the standard philosophical idea of intension). This lets us provide significantly more complete support for the real diversity and complexity of the user population than existing sites. Let me explain.

On existing sites, the meaning of a tag is simply the set of things that have been given that tag. You want to know whether a tag is appropriate in a given case? Look at what else has been given that tag, and decide whether the thing you were planning to tag is “like” the others. This is a classic case of extensional definition.
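To make the idea concrete, here is a minimal sketch in Python of an extensional tag definition. The item names and the `extension` helper are made up for illustration and do not reflect how any particular site stores its data.

```python
from collections import defaultdict

# Hypothetical (item, tag) pairs, as a shared tagging site might record them.
taggings = [
    ("dune", "sf"),
    ("neuromancer", "sf"),
    ("goodbye-to-all-that", "memoir"),
    ("goodbye-to-all-that", "wwi"),
]

def extension(taggings):
    """Map each tag to the set of items that carry it."""
    ext = defaultdict(set)
    for item, tag in taggings:
        ext[tag].add(item)
    return dict(ext)

ext = extension(taggings)
print(sorted(ext["sf"]))  # the whole "meaning" of the tag is this list of items
```

Nothing in this structure says why an item belongs under a tag. Deciding whether a new item fits is left entirely to the person looking at the list.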

Our initial tagging system will create intensional definitions of tags, based on a given user’s choices. Specifically, it will summarize a population of tagging decisions by a classifier. Initially we’ll use a simple Bayesian filter, like today’s spam filters. If this doesn’t work well enough, we’ll use other technologies; there are a lot to choose from (Support Vector Machines, Latent Semantic Indexing, etc.). But we think simple methods will work well enough.
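As a rough illustration of what such a filter could look like, here is a small naive Bayes sketch in Python for a single user and a single tag. The class name `TagFilter`, the add-one smoothing, and the toy tokens are all assumptions made for the example, not a description of the actual system.

```python
import math
from collections import Counter

class TagFilter:
    """Toy naive Bayes filter for one user's tag (illustrative only)."""

    def __init__(self):
        # Token counts and document counts, split by "has the tag" or not.
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, tokens, tagged):
        """Record one tagging decision: the item's tokens and the user's choice."""
        self.counts[tagged].update(tokens)
        self.docs[tagged] += 1

    def log_odds(self, tokens):
        """Positive means the item looks like what this user files under the tag."""
        vocab = len(set(self.counts[True]) | set(self.counts[False])) or 1
        total = {label: sum(c.values()) for label, c in self.counts.items()}
        # Add-one smoothing on the prior and on every token likelihood.
        score = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for tok in tokens:
            p_yes = (self.counts[True][tok] + 1) / (total[True] + vocab)
            p_no = (self.counts[False][tok] + 1) / (total[False] + vocab)
            score += math.log(p_yes / p_no)
        return score

sf = TagFilter()
sf.train(["space", "alien", "ship"], tagged=True)
sf.train(["robot", "space"], tagged=True)
sf.train(["recipe", "butter"], tagged=False)
sf.train(["garden", "soil"], tagged=False)

print(sf.log_odds(["space", "robot"]) > 0)   # looks like this user's tag
print(sf.log_odds(["butter", "soil"]) > 0)   # does not
```

Because the model is nothing more than counts, feeding it one more labeled example updates the definition immediately.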

So what does this do for our users? There are two benefits that we think are important. First, since our system has an intensional definition of what a given user means by a tag, it can classify new items for that user. If it is wrong the user can correct it and thus improve that definition.

Second, since the system has these intensional definitions for each user, it can compare them and find similar tag definitions. Note that it makes no difference how a given user spelled their tag. If I say “SF”, you say “Science Fiction”, and someone else says “Space opera”, the system judges the similarity of the intensional definitions and ignores the way their names are spelled. In our environment, similarity is a relation on the intensional definitions, not on the tag spellings.
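One simple way to compare definitions without looking at tag names is sketched below in Python. It summarizes each user's tag as token counts and uses cosine similarity. Both choices are assumptions for the sake of the example (the comparison could equally be made on classifier parameters), and the users and tokens are invented.

```python
import math
from collections import Counter

def profile(docs):
    """Summarize the items one user filed under a tag as token counts."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts

def cosine(a, b):
    """Cosine similarity between two token-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical users; the tag spellings appear only in these comments.
mine = profile([["space", "alien", "ship"], ["robot", "space"]])     # my "SF"
yours = profile([["space", "ship", "empire"], ["alien", "robot"]])   # your "Science Fiction"
theirs = profile([["bridge", "fog", "city"], ["tram", "fog"]])       # someone's "SF", meaning the city

print(cosine(mine, yours) > cosine(mine, theirs))
```

Here the two differently spelled tags come out as similar, while the identically spelled one does not, because the spelling never enters the computation.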

So how does all this relate to population thinking? Systems with shared extensional definitions of tag meaning or shared moderation values implicitly push their users toward a single, shared, “ideal” perspective. Tags and moderation values tend to be viewed in terms of “consensus meaning”, and sites like Slashdot have explicit meta-moderation to enforce this consensus!

In contrast, by generating intensional definitions we explicitly accept (and celebrate!) that the users of this technology will be a diverse population, each with different interests and ways of classifying the world. Our goal is to accommodate the real diversity of the user population, while also giving people ways to adopt each other’s perspectives, and to collaborate when they want to.

Update: Tim Spalding of LibraryThing has some interesting meditations on how to compute intensional meanings from an extensional tag pool. (He doesn’t use those terms, but he needs some sort of intensional definition to make the similarity judgements he’s aiming for.) He specifically comments on the messy nature of tagging and the need to use statistical methods to extract cleaner information from the tags.

However, Tim is still effectively throwing away a lot of information. A given user might have a very focused meaning for some of the tags he mentions, e.g. “WWI” or “memoir”. But since LibraryThing is looking only at the total pool of tags spelled the same, rather than the meaning of those tags within each user’s library, that focused meaning gets washed out. This is exactly the effect we are trying to avoid. There are two key differences. First, Tim is computing intensional definitions over the whole tag pool, while we are computing them for each user separately. Second, Tim is using only the tags and the identities of the tagged items, while we are also using the “fine structure” of the tagged items (the tokens they contain). Of course, for books this fine structure is often not available.
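The washing-out effect is easy to see with a toy example. The Python sketch below uses invented users and tokens and a hypothetical `profiles` helper. It builds token profiles for the tag “memoir” twice, once pooled by spelling and once per user, and is not a model of LibraryThing's actual computation.

```python
from collections import Counter, defaultdict

# Hypothetical data: (user, tag, tokens of the tagged item).
taggings = [
    ("ann", "memoir", ["trench", "somme", "officer"]),
    ("ann", "memoir", ["trench", "gas", "armistice"]),
    ("bob", "memoir", ["childhood", "farm", "mother"]),
    ("cat", "memoir", ["band", "tour", "studio"]),
    ("cat", "memoir", ["kitchen", "chef", "paris"]),
]

def profiles(taggings, key):
    """Accumulate token counts, grouped by whatever `key` returns."""
    out = defaultdict(Counter)
    for user, tag, tokens in taggings:
        out[key(user, tag)].update(tokens)
    return out

def share(counter, token):
    """Fraction of a profile's tokens that are `token`."""
    total = sum(counter.values())
    return counter[token] / total if total else 0.0

pooled = profiles(taggings, key=lambda user, tag: tag)            # by spelling only
per_user = profiles(taggings, key=lambda user, tag: (user, tag))  # one profile per user

print(share(per_user[("ann", "memoir")], "trench"))  # Ann's focused meaning
print(share(pooled["memoir"], "trench"))             # diluted in the pool
```

Ann uses “memoir” only for war memoirs, so “trench” is prominent in her profile. In the pooled profile it is one token among many unrelated ones.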