Brief status

Peerworks will build an automated tagging engine — portable and open source — that can be used by existing systems such as Scoop, Slashcode, Drupal and Plone, but also form the basis for new systems.

We’ve developed a custom feed aggregator for testing, and have collected lots of feeds and feed items for moderation. We are now working on the moderation UI. This will be used by a team of moderators to create a stable body of tagged items to use for testing and tuning individual tagging algorithms under development.

Initially we expect to use an algorithm very close to the SpamBayes classifier. There are existing examples of how to do this; Ben Kamens’ analysis is helpful, and Laird Breyer’s library dbacl uses Bayesian classifiers to assign tags.
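
As a rough illustration of what assigning tags with a Bayesian classifier can look like, here is a minimal Python sketch. It is plain naive Bayes with add-one smoothing, one binary model per tag, which is simpler than SpamBayes's actual scoring and is not the Peerworks implementation. The names `tokenize` and `TagClassifier` and the sample texts are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real feed-item tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TagClassifier:
    """One binary naive Bayes model per tag: 'has this tag' vs. 'does not'."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, has_tag):
        """Record one moderated item and whether the moderator applied the tag."""
        self.doc_counts[has_tag] += 1
        self.word_counts[has_tag].update(tokenize(text))

    def probability(self, text):
        """Return P(tag | text) under naive Bayes with add-one smoothing."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        return exp[True] / (exp[True] + exp[False])


if __name__ == "__main__":
    clf = TagClassifier()  # hypothetical model for a "python" tag
    clf.train("new python release adds pattern matching", True)
    clf.train("city council votes on transit budget", False)
    print(clf.probability("python release notes"))
```

In this sketch each moderator (or each user) would get their own set of per-tag models, trained only on their own decisions, which is one simple way to read "learning individual tagging styles".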

Once our system can learn individual tagging styles well enough to make users comfortable, we’ll be looking for existing sites that would like to integrate it. Our further development will be guided by the needs of these initial partner sites.

If you’d like to know more, leave a comment with your questions or drop me a line: jed (at) peerworks.org

One Response to “Brief status”

  1. April 5th, 2006 | 10:07 pm

    I’d offer a few cautions, first:

    1. Any means to “learn individual tagging style” implies use of past tagging decisions, and respect for those - which suggests that the algorithm’s recommendations (similar to Amazon’s, I’d presume) constantly pull the user towards their past style, and could inhibit them learning new styles or even adopting new tags. Treating tagging patterns as a deliberate choice, or “style”, suggests too strongly that they are informed decisions as opposed to evolving habits. So I’d change this language and “track individual tagging habits” rather than “learn” a “style” - avoid implying that something’s worth learning, or was chosen deliberately, or has anything to do with user individuality. I think tagging is new enough and hard enough to master, that any user who wants to make good use of new tools, will be learning new habits, discarding old ones, and will rarely want simply to continue their current “style” and have software “learn” it. I suspect it’s far more likely that they’ll want to pick one or more “masters” doing exemplary tagging and prefer to follow the tag styles they recommend.

    2. The influences on how we look up tags and use them are just as complex as the influences on how we use nouns or adjectives. In addition to our own “style” or “habits”, there are “masters” (like dictionaries, or exemplary opinion leaders) whom we’ll prefer to copy because we wish to have powers like theirs to make distinctions and discernments “just like they make”. A good example any software engineer will understand is the use of particular APIs: no one really wants to be first to learn them, and only if they get good reviews or become indispensable to employment do we want to take on the overhead of learning them at all, let alone mastering them. Taxonomy is high investment.

    So, consider that the direction in which we want to modify our own habits or style is a key factor in tagging. Tag choices have at least a direction in time: there’s more distinction in this area over time, less distinction in that, adoption of standard terms here, abandonment of non-functional legacy terms there. This suggests we need at least the time to be associated with a tag’s correlation with a given user in some given task - and that we might expire the older data over time, and allow users to express preferences to tag “more like Clay Shirky”, “less like Craig Hubley”, “less like Microsoft”, or “more like the Communist Party of China”. Yes, it’s political - categories and taxonomies always have been, always will be.

    3. Because they’re political, the influence of our politics on our tags probably has to be (if not explicit) easy to determine by correlation. Someone who’s using tags like “Peak Oil” or “culture of life” at all is clearly part of some group or movement. Mention of “seal hunt” as a specific or distinct topic suggests concern with it as an issue. Varying terms like “monetary reform”, “capital base”, “reserve rules”, “Bretton Woods”, “dollar hegemony”, all suggest different angles on the same problem, some of which (“reform”) suggest action should be taken. The people tagging may NOT all want to find each other, but the reader DOES want to find all the angles on these issues and therefore would PREFER that tags aggregate in certain ways. For someone “on the left”, perhaps “abortion” aggregates with “women’s rights”, while “on the right” it aggregates with “culture of life”. There must be respect for these choices, and there must be ways to keep these aggregations (“redirects” in wiki-speak) under the control of the user, or a user-chosen, user-trusted agent. At the highest level of abstraction, I’d simply choose metaphors I wished to reinforce or move towards, and those I wished to abandon, and let aggregation occur as a function of those choices. At least, it might decide which of a long list of hits to drop off, or set some ordering choices. That would be no more insidious than what Google is now doing, for its own reasons (not mine):

    4. Commercial enterprises, including search engines especially, are influencing our use and choice of words deliberately, and if nothing else there must be ways to undo their influence on what we see and read, say by de-weighting tags they weight for commercial reasons. Already Google puts “link farm” hits lower on its lists, and as technology improves they’ll go lower still. That’s fine, but what happens when mention of technology or business offerings that compete with Google’s is lower on their lists, and Google is in 40 businesses, including the one you’re trying to research? Microsoft isn’t the only Microsoft out there waiting to become Microsoft. A general solution to the problem of commercial agents is probably less difficult than the solution to giving users control of political influence or aggregation, but actually finding the commercial weightings and censorings on our searches may be much more difficult than finding the political influences on our metaphors (which tend to be written large in the newspapers).

    5. A REST-type approach, where only a very few verbs (GET, PUT, REFRESH, DELETE, POST) are specialized, seems wise here. That is, a parallel protocol to “http://”, like “tag://”, would be a better way to update tags (and taglike structures, including the categorization of sexually-explicit materials, assignments of academic trustworthiness, medical credentials, etc., which all could be part of this protocol) than doing it over HTTP. That fits very neatly into a service-oriented architecture, and it leaves it up to the service how to record or store the tags, or even whether to process the tagging at all (it might not). I see no reason not to accept anonymous tagging, without making any acknowledgement - spam can be recognized and undone without prior restraint (as in wikis), and if someone wants to know if their tagging was counted, all they need do is compare before- and after-the-fact weightings.

    6. Whatever architecture is chosen, it has to find the path of least resistance through these constraints and be dead simple. HTTP works for this reason, IP works for this reason, and DOS worked for this reason. We aren’t looking to solve all of the above problems; we’re looking to anticipate them and ensure it is possible to solve them in future without having to undo any architectural choices. Put in any extra commands that don’t need to be there, and their implementation will lag, usage will fall off, optimization will cease, and a protocol without them will catch on instead. Missing key commands or verbs that did need to be in the protocol, however, cannot be fixed later on. It’s a question of finding exactly and only the right set.

    Which is to say, it’s a typical software architecture problem.

    7. Mediawiki has an irreplaceable base of content, and if there is going to be a testbed, it really has to be in that format. It’s relatively easy to build front-ends to mediawiki content that enrich it (nationmaster.com, wordiq.com, wikinfo.org are all good examples, and metaweb.com was exploring doing it also). But far more important, there’s one million articles there in English alone, including cross-links to the same concept in other languages, disambiguation pages listing all meanings of a word, deep heavy linking to make it easy to tell in what sense a word is being used (simply by comparing occurrence on pages that we are following, to occurrence on the parallel Wikipedia article). This is open content (GFDL) and thus available for any such use. There’s no other comparable source for collectively understood undisputed definitions of relatedness of terms/words/phrases. A thesaurus can complement it easily, but, a thesaurus doesn’t tell you the exact translation of the name of a particular region of Europe into nineteen languages.

    8. Usage varies more by geography than anything else, and the need for discernment of a geographic kind tends to vary not by who we are, but where we are. I need much more detail about a place in California, if I’m actually in California looking for it, than if I’m just asking about it from London, England. I think there must be room here for aggregations to vary and be finer or looser grained based on how “far” we are (in space or in time) from making a decision. Trying to pretend knowledge is spaceless and timeless has a name: scholasticism. And in a word, it’s crap. Temporal and spatial database considerations need to be there from the beginning, or this will not be useful to guide situated effective action in the real world. In other words, no one will want to read it on a BlackBerry, and that’s where most of the interesting “driving problems” are (note the syzygy: that’s “driving” as per Fred Brooks’ “so hard you solve other problems by solving it”, and also as per “moving around in the real world”). If space and time tags are wrong on the first cut, you simply cannot fix them at all later - they’ll lack integrity. As a simple example of the problems, “Toronto” was a very different place in 1997 than in 1998 (the entire 416 area code merged into a new “megacity”, whereas it’d only been one of five cities and one borough beforehand). If you have a tag that says “Toronto” on something about North York prior to 1998, it’s about a neighbouring city. If it’s from 1998 or later, it’s about the same city. This matters much more if you are heading into that city than if you are considering just visiting it next year. And, if time has to be on the tagging to determine how our tagging habits/style change, then we’re very exposed to semantic errors, e.g. failing to put a time zone on a date, leaving something which can’t be resolved to an actual time span, other than stochastically.

    9. Finally, there are sociosemantic web considerations. For that I’d suggest the book “Ambient Findability” or (much better) the UN 1993 State of the Future Report from the American Committee for the United Nations University, which covers this in great depth and deals with the various ways Delphi methods, semantic webs, and sociological and psychographic categories (e.g. “Islamist”, “socialist”, “feminist”) tend to determine aggregation, order, credential, etc. It’s a surprisingly operational description.

    Or read this http://openpolitics.ca/sociosemantic+web
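
Points 2 and 8 of the comment above ask for a time to be attached to each tagging decision, for older data to fade, for users to weight some sources "more" or "less", and for time zones to be explicit. The following Python sketch shows one possible reading of that, under stated assumptions: the function name `decayed_tag_weights`, the 90-day half-life, the exponential decay, and the `influence` multipliers are all invented for illustration and are not part of any Peerworks design.

```python
from datetime import datetime, timedelta, timezone


def decayed_tag_weights(events, now, half_life_days=90.0, influence=None):
    """Score tags from (source, tag, when) events, favoring recent ones.

    events: iterable of (source, tag, when) with timezone-aware datetimes.
    influence: optional {source: multiplier}; above 1 means "tag more like
    this source", below 1 means "less like this source".
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    influence = influence or {}
    weights = {}
    for source, tag, when in events:
        if when.tzinfo is None:
            # A naive timestamp cannot be resolved to an actual instant.
            raise ValueError(f"timestamp for {tag!r} has no time zone")
        age_days = max((now - when).total_seconds() / 86400.0, 0.0)
        # Each half-life that passes halves the event's contribution.
        decay = 0.5 ** (age_days / half_life_days)
        weights[tag] = weights.get(tag, 0.0) + decay * influence.get(source, 1.0)
    return weights


if __name__ == "__main__":
    now = datetime(2006, 4, 5, tzinfo=timezone.utc)
    events = [
        ("me", "legacy-term", now - timedelta(days=180)),
        ("me", "standard-term", now),
        ("mentor", "standard-term", now - timedelta(days=90)),  # hypothetical source
    ]
    print(decayed_tag_weights(events, now, influence={"mentor": 2.0}))
```

Rejecting timestamps without a time zone at write time is a deliberate choice in this sketch: it reflects the comment's warning that space and time data which are wrong on the first cut are hard to repair later.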
