Current status

It’s been quite a while since our last update, and we’ve made a lot of progress. I apologize for the jargon from the statistical learning world; I hope to come back and add links for it. In the meantime, Google and Wikipedia are your friends. If people are curious about more detailed technical questions, I’d be happy to try to fill in the picture.

To summarize our current goal: we’re working on individual content classification, that is, auto-tagging or individualized tagging. Each user creates and assigns their own tags, and we train classifiers to assign tags for them. We have an initial implementation of the classifiers and are improving them using cross-validation. Testing and tuning classifiers with cross-validation needs a lot of tagged content, and RSS feeds provide a lot of diverse, freely available content of the right general sort. So we built a feed aggregator and have been accumulating a big corpus. Our technology has many more uses than just tagging blog posts, but blog posts are a potentially promising application domain.

We’ve carved out a smaller “clean” corpus (about 7000 items that meet various criteria) and multiple people are now tagging it. At the same time we’re adding new feeds to the collection process to improve item diversity. We can easily carve out new corpuses using different criteria, but of course it is a big effort to get them fully tagged. We have a cross-validation testing and reporting framework set up. We’re gradually enhancing it as we better understand our requirements.
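
For readers who haven’t met cross-validation, here is a minimal sketch of the k-fold idea in Python. It is not our framework: the names `k_fold_indices` and `cross_validate`, the `(text, has_tag)` item shape, and the `train`/`predict` callables are all assumptions made for illustration.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical item shape: the item's text plus whether the user gave it the tag.
Item = Tuple[str, bool]


def k_fold_indices(n_items: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the item indices and deal them into k nearly equal folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    items: Sequence[Item],
    train: Callable[[List[Item]], Any],
    predict: Callable[[Any, str], bool],
    k: int = 5,
    seed: int = 0,
) -> List[float]:
    """Train on k-1 folds, test on the held-out fold, and return each fold's accuracy."""
    accuracies = []
    for held_out in k_fold_indices(len(items), k, seed):
        held = set(held_out)
        training_items = [items[i] for i in range(len(items)) if i not in held]
        model = train(training_items)
        correct = sum(predict(model, items[i][0]) == items[i][1] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies
```

Every item is held out exactly once, so the per-fold accuracies together give an estimate of how a classifier behaves on content it was not trained on. That is also why tuning this way needs a lot of tagged items.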

Our classifiers are naive Bayes, more or less the SpamBayes variant and by no means pure naive Bayes. They already work reasonably well on smaller corpuses, and we think fairly simple enhancements will get them working “adequately” (we haven’t established definite criteria yet). The main area where we may need help in this phase, if the straightforward enhancements aren’t adequate, is stronger feature extraction. Right now we are just using simple tokenizing. We may also need help with ongoing performance work, since the classification is compute-intensive enough to be an issue.
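
To make “naive Bayes with simple tokenizing” concrete, here is a minimal sketch of a per-tag classifier in Python. It is textbook multinomial naive Bayes with add-one smoothing, not the SpamBayes variant described above and not our actual code; the `tokenize` function and the `TagClassifier` class are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Simple tokenizing: lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TagClassifier:
    """Binary naive Bayes for one user's tag: does this item get the tag or not?"""

    def __init__(self) -> None:
        self.token_counts = {True: Counter(), False: Counter()}
        self.item_counts = {True: 0, False: 0}

    def train(self, text: str, has_tag: bool) -> None:
        """Record one tagged (or untagged) example."""
        self.item_counts[has_tag] += 1
        self.token_counts[has_tag].update(tokenize(text))

    def log_odds(self, text: str) -> float:
        """Log of P(tag | text) / P(no tag | text) under the naive independence assumption."""
        vocab = set(self.token_counts[True]) | set(self.token_counts[False])
        totals = {label: sum(self.token_counts[label].values()) for label in (True, False)}
        # Smoothed prior odds from how many items had the tag versus not.
        score = math.log((self.item_counts[True] + 1) / (self.item_counts[False] + 1))
        for token in tokenize(text):
            if token not in vocab:
                continue  # tokens never seen in training carry no evidence
            p_tagged = (self.token_counts[True][token] + 1) / (totals[True] + len(vocab))
            p_untagged = (self.token_counts[False][token] + 1) / (totals[False] + len(vocab))
            score += math.log(p_tagged / p_untagged)
        return score

    def predict(self, text: str) -> bool:
        return self.log_odds(text) > 0
```

In this sketch each user would have one such classifier per tag, trained on the items they tagged and the ones they left untagged. Stronger feature extraction would mean replacing `tokenize` with something richer while leaving the rest alone.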

Unless we hit major rocks, which looks unlikely at this point, we expect to have our individual classifiers good enough to be usable in the first quarter of 2007. If we need fancier features, that would delay things, but quite likely by no more than a month or two.

We’ll initially put up a limited-access demo, but we’ll probably expand that to general access as soon as we’re confident we can handle a reasonable number of users reliably.

Once we’ve got adequate classifiers and start getting a significant amount of user data, we’ll want to model the structure of the semantic space across users. Modeling the semantic space exceeds our “cookbook” level understanding of statistical modeling technology, so when we get near this point we will very much need people with a lot more expertise.

A simple example of how we’d like to use the semantic information is tiling the semantic space with clusters of users who have sufficiently similar interests. Then we could let new users pick an initial cluster, so they get a set of classifiers “pre-built” and can just tune and extend the set, rather than having to build one from scratch. Of course these clusters would change over time as the user population evolves. Trickier examples are:
(1) collaborative training, that is, learning from other users’ similar classifiers, and
(2) identifying sufficiently strong “interest groups”, meaning clusters of users with enough in common to perhaps enjoy shared discussions, etc.
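
As a rough illustration of the simple clustering case only, here is a sketch in Python that groups users whose interest profiles are similar enough. Everything in it is an assumption: we haven’t decided how to represent a user’s interests, so the token-to-weight `Profile`, the cosine similarity measure, the greedy single-pass strategy, and the `threshold` value are placeholders rather than a design.

```python
import math
from typing import Dict, List

# Hypothetical representation: token -> weight summarizing what a user tags.
Profile = Dict[str, float]


def cosine(a: Profile, b: Profile) -> float:
    """Cosine similarity between two sparse profiles; 0.0 if either is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_users(profiles: Dict[str, Profile], threshold: float = 0.5) -> List[List[str]]:
    """Greedy single pass: a user joins the first cluster whose seed user is similar enough."""
    clusters: List[List[str]] = []
    for user in sorted(profiles):  # sorted so the result is deterministic
        for cluster in clusters:
            seed = cluster[0]
            if cosine(profiles[user], profiles[seed]) >= threshold:
                cluster.append(user)
                break
        else:
            clusters.append([user])  # nothing close enough: start a new cluster
    return clusters
```

A new user who picks a cluster could then start from classifiers built from that cluster’s members. A real model of the semantic space would need to be much more careful than this, which is exactly where we expect to need outside expertise.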

There are probably a lot of exciting things to do with the user data that are yet to be imagined. We are also interested in contributing data (with appropriate privacy protection) to research projects. Get in touch with us if you have ideas about how to use this technology, if you are interested in related research questions, or if you want to work for us on the statistical learning side (there’s a job ad on peerworks.org).