Reply to Craig Hubley (7)

Responding to Craig’s point (7):

Mediawiki has an irreplaceable base of content, and if there is going to be a testbed, it really has to be in that format. It’s relatively easy to build front-ends to mediawiki content that enrich it (nationmaster.com, wordiq.com, wikinfo.org are all good examples, and metaweb.com was exploring doing it also). But far more important, there are one million articles there in English alone, including cross-links to the same concept in other languages, disambiguation pages listing all meanings of a word, and deep heavy linking that makes it easy to tell in what sense a word is being used (simply by comparing occurrence on pages that we are following, to occurrence on the parallel Wikipedia article). This is open content (GFDL) and thus available for any such use. There’s no other comparable source for collectively understood undisputed definitions of relatedness of terms/words/phrases. A thesaurus can complement it easily, but a thesaurus doesn’t tell you the exact translation of the name of a particular region of Europe into nineteen languages.

There are two very interesting issues here: (1) How we can support wikis and Mediawiki in particular, and (2) How we can leverage large content repositories and Wikipedia in particular. I’ll address them in that order.

Supporting Wikis: Our thoughts about tagging have mostly been about item streams, like blogs, comments in discussions, RSS feeds, etc. The need for filtering is very apparent in those cases, and that was our original motivation. In some sense the material on a Wiki (and Wikipedia in particular) is typically “pre-filtered”. However, since we are helping users to tag based on their interests, and compilations like Wikipedia are so large that users might be unable to find the pages most related to their interests, it seems that our technology could well be relevant. I’m hoping that we can find partners who want to explore this by building a tagging front-end to Wikipedia (as Craig suggests) or to other substantial compilations.

Leveraging existing compilations, taxonomies, etc.: As Craig points out, existing compilations such as Wikipedia already encode an enormous amount of “world knowledge” and expert judgement about a huge range of topics. It seems likely that we can mine such compilations to improve our services in various ways. Craig’s example of thesaurus-like support is highly relevant.
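To make Craig’s suggestion a little more concrete, here is a minimal sketch of the comparison he describes: score a page we are following against the Wikipedia article for each candidate sense of a word, and pick the closest. This is only an illustration under stated assumptions, not our implementation. The function names, the choice of cosine similarity over raw word counts, and the toy article texts are all my own inventions for the example; a real system would work from the actual article text and links.

```python
import re
from collections import Counter
from math import sqrt


def term_counts(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_sense(page_text, sense_articles):
    """Return the title of the candidate article closest to the page,
    together with the score for every candidate.

    sense_articles maps an article title to its text (hypothetical data
    here; in practice it would come from a disambiguation page).
    """
    page = term_counts(page_text)
    scores = {title: cosine(page, term_counts(body))
              for title, body in sense_articles.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins for two Wikipedia articles (invented text, for illustration only).
SENSES = {
    "Java (island)": "Java is an island of Indonesia with volcanoes, "
                     "rice fields and a large population",
    "Java (programming language)": "Java is a programming language whose code "
                                   "is compiled to run on a virtual machine",
}

if __name__ == "__main__":
    page = "We compiled the code and ran the program on the virtual machine"
    title, scores = best_sense(page, SENSES)
    print(title)
```

Raw word overlap is about the simplest measure one could try. Whether it, or anything more elaborate, actually helps is exactly the kind of question that has to be settled by experiment.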

The experience of statistical document retrieval, automatic summarization, and various other kinds of statistical text processing provides some cautionary results, however. It has proven difficult to improve the performance of such systems by integrating thesauruses and other knowledge sources. Intuition is not a reliable guide in this area and experiments are always required, no matter how reasonable an approach may appear.

I see this as an important direction for exploration once we have an initial system working well enough to generate value for users and test sites. I hope that we can attract researchers with a variety of interests in statistical learning, text classification, collaborative filtering, etc. I also hope to attract a variety of open source developers. Both groups (preferably working together) can leverage our working code and any data our users are willing to share. This seems like the kind of idea that is likely to attract their interest.

One Response to “Reply to Craig Hubley (7)”

  1. April 9th, 2006 | 7:54 pm

    “It seems likely that we can mine such compilations to improve our services in various ways”, yes, specifically, categories are just tags (once you permit more than one to be applied), and Wikipedia has categories, and you can push them one way or another just by participating in editing articles via that one portal. So it’s not really a scheme “from above” in the sense Britannica’s or Compton’s or Roget’s are, it’s something that you have influence in if you participate - a participatory and deliberative sort of semi-direct democracy, where you take the action to correct the categories yourself, and if they stick, they stick. I think it isn’t just a consideration for later, to figure out how to use these things. Partially because you end up having to make data format choices early, and choosing a bad format (like the “wikistyle” nonsense pbwiki uses, which doesn’t even permit one to write proper English [using square brackets or uppercase] in its textbook way) could kill the whole project. Mediawiki’s [[open link]] format is acceptable mostly because double square brackets never occur in English (though double-parenthesis does, making tikiwiki also useless) and because mediawiki already dealt with the Unicode problems.

    “Intuition is not a reliable guide in this area and experiments are always required, no matter how reasonable an approach may appear.” Totally true. And a very good reason not to reinvent wheels. Mediawiki already figured out how to write textbook English and many other languages, and not clash with tagging conventions or punctuation conventions, and that really must be the starting point. Yahoo360, by aligning itself with pbwiki, has probably killed itself. These early choices can be fatal.

    For other aspects of category and how category is signalled in robust wikis, read this.
