Current status

It’s been quite a while since we’ve provided an update, and we’ve made a lot of progress. I apologize for the jargon from the statistical learning world; I hope to come back and provide links for it. In the meantime, Google and Wikipedia are your friends. Conversely, if people are curious about more detailed technical questions I’d be happy to try to fill the picture in.

To summarize our current goal, we’re working on individual content classification, that is, auto-tagging or individualized tagging. Each user creates and assigns their own tags, and we train classifiers to assign tags for them. We have an initial implementation of classifiers, and are improving them using cross-validation. Testing and tuning classifiers with cross-validation needs a lot of tagged content, and RSS feeds provide a lot of diverse, freely available content of the right general sort. So we built a feed aggregator and have been accumulating a big corpus. Our technology has many more uses than just tagging blog posts, but that is actually a potentially promising application domain.

We’ve carved out a smaller “clean” corpus (about 7000 items that meet various criteria) and multiple people are now tagging it. At the same time we’re adding new feeds to the collection process to improve item diversity. We can easily carve out new corpuses using different criteria, but of course it is a big effort to get them fully tagged. We have a cross-validation testing and reporting framework set up. We’re gradually enhancing it as we better understand our requirements.
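For readers who haven't met cross-validation, here is a minimal sketch of the idea in Python. It is an illustration only, not our actual testing framework: the names `k_fold_splits` and `cross_validate`, and the convention that a trainer returns a prediction function, are assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, bool]  # (item text, whether the user applied the tag)


def k_fold_splits(n_items: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    items: Sequence[Item],
    train: Callable[[Sequence[Item]], Callable[[str], bool]],
    k: int = 5,
) -> float:
    """Return mean accuracy over k folds; each fold is held out exactly once."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    accuracies = []
    for held_out in k_fold_splits(len(items), k):
        held = set(held_out)
        # Train only on items outside the held-out fold.
        training = [items[i] for i in range(len(items)) if i not in held]
        predict = train(training)
        correct = sum(predict(items[i][0]) == items[i][1] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)
```

Every tagged item is used for testing exactly once and for training in the other rounds, which is why the approach needs a large, fully tagged corpus.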

Our classifiers are naive Bayes: more or less the SpamBayes variant, by no means pure naive Bayes. However, they work reasonably well already (on smaller corpuses) and we think that by fairly simple enhancements we can get them working “adequately” (we haven’t established definite criteria yet). The main area where we may need help in this phase, if the straightforward enhancements aren’t adequate, is stronger feature extraction. Right now we are just using simple tokenizing. We may also need help with ongoing performance work, since the classification is compute-intensive enough to be an issue.
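To make the jargon a little more concrete, here is a rough sketch of a per-tag classifier built on simple tokenizing. It is plain textbook naive Bayes with Laplace smoothing, scoring only the tokens present in an item; it is not the SpamBayes variant we actually use, and the names `tokenize` and `TagClassifier` are invented for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Simple tokenizing: lowercase runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TagClassifier:
    """Plain two-class naive Bayes for one user's tag (illustrative only)."""

    def __init__(self) -> None:
        self.token_counts = {True: Counter(), False: Counter()}
        self.item_counts = {True: 0, False: 0}

    def train(self, text: str, tagged: bool) -> None:
        self.item_counts[tagged] += 1
        # Count each token once per item (presence, not frequency).
        self.token_counts[tagged].update(set(tokenize(text)))

    def log_odds(self, text: str) -> float:
        """Log odds that the tag applies; positive means 'tagged' is more likely."""
        n_pos, n_neg = self.item_counts[True], self.item_counts[False]
        # Laplace-smoothed class prior.
        score = math.log((n_pos + 1) / (n_neg + 1))
        for token in set(tokenize(text)):
            # Smoothed probability that an item of each class contains the token.
            p_pos = (self.token_counts[True][token] + 1) / (n_pos + 2)
            p_neg = (self.token_counts[False][token] + 1) / (n_neg + 2)
            score += math.log(p_pos / p_neg)
        return score

    def predict(self, text: str) -> bool:
        return self.log_odds(text) > 0.0
```

Each user would get one such classifier per tag, trained on the items they did and did not give that tag. Stronger feature extraction would replace `tokenize` and leave the rest alone.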

Unless we hit major rocks, which looks unlikely at this point, we expect to have our individual classifiers good enough to be usable in the first quarter of 2007. If we need fancier features, that would delay things, but quite likely not by more than a month or two.

We’ll initially put up a limited access demo, but we’ll probably expand that to general access as soon as we’re confident we can handle a reasonable number of users reliably.

Once we’ve got adequate classifiers and start getting a significant amount of user data, we’ll want to model the structure of the semantic space across users. Modeling the semantic space exceeds our “cookbook” level understanding of statistical modeling technology, so when we get near this point we very much need people with a lot more expertise.

A simple example of how we’d like to use the semantic information is tiling the semantic space with clusters of users who have sufficiently similar interests. Then we could let new users pick an initial cluster, so they get a set of classifiers “pre-built” and can just tune and extend the set, rather than having to build one from scratch. Of course these clusters would change over time as the user population evolves. Trickier examples are
(1) collaborative training — learning from other users’ similar classifiers and
(2) identifying sufficiently strong “interest groups” — clusters of users with enough in common to perhaps enjoy shared discussions, etc.
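As a toy illustration of the simple case (tiling the space with clusters of similar users), the sketch below groups users greedily by the overlap of their interests. Everything here is an assumption made for the example: representing a user's interests as a set of characteristic terms, the Jaccard measure, the 0.5 threshold, and the names `jaccard` and `greedy_clusters`. Real modeling of the semantic space would need something far more careful.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_clusters(interests: dict, threshold: float = 0.5) -> list:
    """Assign each user to the first cluster whose seed user is similar enough."""
    clusters = []  # each cluster is a list of user names; cluster[0] is its seed
    for user in sorted(interests):
        for cluster in clusters:
            if jaccard(interests[user], interests[cluster[0]]) >= threshold:
                cluster.append(user)
                break
        else:
            # No existing cluster was close enough, so this user seeds a new one.
            clusters.append([user])
    return clusters
```

A new user who picks a cluster could then start from the classifiers of that cluster's members instead of an empty set.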

There are probably a lot of exciting things to do with the user data that are yet to be imagined. We are also interested in contributing data (with appropriate privacy protection) to research projects. Get in touch with us if you have ideas about how to use this technology, if you are interested in related research questions, or if you want to work for us on the statistical learning side (there’s a job ad on peerworks.org).

Series of six responses to Craig Hubley

Craig Hubley recently wrote a long comment on our initial “Brief status” post, which raised many interesting and relevant issues. I’ve now responded in a series of six posts, quoting most of Craig’s comment and adding my thoughts. Since I wrote them “forward”, you’ll have to scroll down, or end up reading them “backward”. Happily, each stands fairly well on its own, so the order you choose is not so crucial, but now you have been warned.

Reply to Craig Hubley (8 and 9)

Taking these points in reverse order, in (9) Craig refers to interesting resources about which I know little or nothing, so I’ll just quote him without comment:

Finally are sociosemantic web considerations. For that I’d suggest the book “Ambient Findability” or (much better) the UN 1993 State of the Future Report from the American Committee for the United Nations University, which covers this in great depth and deals with the various ways delphi methods, semantic webs, sociological and psychographic categories (e.g. “Islamist”, “socialist”, “feminist”) tend to determine aggregation, order, credential, etc. It’s a surprisingly operational description.

Or read this.

Craig’s point (8) raises a number of issues on which I can comment briefly:

Usage varies more by geography than anything else, and the need for discernment of a geographic kind tends to vary not by who we are, but where we are. I need much more detail about a place in California, if I’m actually in California looking for it, than if I’m just asking about it from London, England. I think there must be room here for aggregations to vary and be finer or looser grained based on how “far” we are (in space or in time) from making a decision. Trying to pretend knowledge is spaceless and timeless has a name: scholasticism. And in a word, it’s crap. Temporal and spatial database considerations need to be there from the beginnings, or this will not be useful to guide situated effective action in the real world. In other words, no one will want to read it on a BlackBerry, and that’s where most of the interesting “driving problems” (note the sysygy, that’s “driving” as per Fred Brooks’ “so hard you solve other problems by solving it”, and also as per “moving around in the real world”). If space and time tags are wrong on the first cut, you simply cannot fix them at all later - they’ll lack integrity. As a simple example of the problems, “Toronto” was a very different place in 1998 than in 1999 (the entire 416 area code merged into a new “megacity” whereas it’d only been one of five cities and one borough beforehand). If you have a tag that says “Toronto” on something about North York prior to 1998, it’s about a neighbouring city. If it’s from 1999, it’s about the same city. This matters much more if you are heading into that city than if you are considering just visiting it next year. And, if time has to be on the tagging to determine how our tagging habits/style changes, then, we’re very exposed to semantic errors, e.g. failing to put time zone on date, leaving something which can’t be resolved to an actual time span, other than stochastically.

I don’t think we have a general way to solve the problems Craig refers to, and I’m not sure we can really address them at all. However, in one respect we may be able to offer help. I would tag items “Bangkok” as a potential tourist, and our system, if it works well enough, will reflect that in building my classifier. Someone else, living in Thailand, might tag items “Bangkok” because they are analyzing real estate deals in several cities. Then, in building a social landscape, my tag should be grouped with those of other tourists to Southeast Asia, whereas theirs should be grouped with those of other Thai real estate investors.

This is an ambitious goal, and it may well be beyond our grasp. But it indicates our approach toward problems like the one that Craig brings up. Our users are ultimately the ones who decide what distinctions are important to them. They know more about the subtle structure of the world than we can ever hope to capture in a system of categories or attributes. Our goal is to create an environment in which they can make their most important knowledge and values available to the system, and through it to other users, with as little overhead and distraction as possible. Much of that knowledge and many of those values are tacit, and often could not be made explicit with any amount of effort. So to minimize the burden on users, and to engage these tacit sources of knowledge and values, we have to learn from each user’s actions without requiring them to explain themselves.

The example of Bangkok illustrates this tacit component of tagging. Neither I as a tourist, nor the Thai resident as a real estate investor, would be able to explain how we use the term “Bangkok”, but our actual use reflects our interests and knowledge. If the system can learn from that, it will help us find the material we want to see, and potentially help us understand where we fit in a larger social context, and find others whose interests mesh with ours.

To the extent that distinctions of era, locale, etc. form a vital background to the users’ way of understanding some topic, the system should capture that, and ultimately help the community reflect on this initially tacit aspect of their diverse perspectives.

Reply to Craig Hubley (7)

Responding to Craig’s point (7):

Mediawiki has an irreplaceable base of content, and if there is going to be a testbed, it really has to be in that format. It’s relatively easy to build front-ends to mediawiki content that enrich it (nationmaster.com, wordiq.com, wikinfo.org are all good examples, and metaweb.com was exploring doing it also). But far more important, there’s one million articles there in English alone, including cross-links to the same concept in other languages, disambiguation pages listing all meanings of a word, deep heavy linking to make it easy to tell in what sense a word is being used (simply by comparing occurrence on pages that we are following, to occurrence on the parallel Wikipedia article). This is open content (GFDL) and thus available for any such use. There’s no other comparable source for collectively understood undisputed definitions of relatedness of terms/words/phrases. A thesaurus can complement it easily, but, a thesaurus doesn’t tell you the exact translation of the name of a particular region of Europe into nineteen languages.

There are two very interesting issues here: (1) How we can support wikis and Mediawiki in particular, and (2) How we can leverage large content repositories and Wikipedia in particular. I’ll address them in that order.

Supporting Wikis: Our thoughts about tagging have mostly been about item streams, like blogs, comments in discussions, RSS feeds, etc. The need for filtering is very apparent in those cases, and that was our original motivation. In some sense the material on a Wiki (and Wikipedia in particular) is typically “pre-filtered”. However, since we are helping users to tag based on their interests, and large compilations like Wikipedia are so big that users might be unable to find the pages most related to their interests, it seems that our technology could well be relevant. I’m hoping that we can find partners who want to explore this by building a tagging front-end to Wikipedia (as Craig suggests) or to other substantial compilations.

Leveraging existing compilations, taxonomies, etc.: As Craig points out, existing compilations such as Wikipedia already encode an enormous amount of “world knowledge” and expert judgement about a huge range of topics. It seems likely that we can mine such compilations to improve our services in various ways. Craig’s example of thesaurus-like support is highly relevant.

The experience of statistical document retrieval, automatic summarization, and various other kinds of statistical text processing provides some cautionary results, however. It has proven difficult to improve the performance of such systems by integrating thesauruses and other knowledge sources. Intuition is not a reliable guide in this area and experiments are always required, no matter how reasonable an approach may appear.

I see this as an important direction for exploration once we have an initial system working well enough to generate value for users and test sites. I hope that we can attract researchers with a variety of interests in statistical learning, text classification, collaborative filtering, etc. I also hope to attract a variety of open source developers. Both groups (preferably working together) can leverage our working code and any data our users are willing to share. This seems like the kind of idea that is likely to attract their interest.

Reply to Craig Hubley (4, 5 & 6)

I don’t have any comments on Craig’s point (4) about search engines, since I don’t know how our work will affect them.

Regarding our software architecture, and how it ties into existing systems (as discussed in Craig’s points 5 and 6):

We very much agree with Craig’s statement: “Whatever architecture is chosen, it has to find the path of least resistance through [existing mechanisms] and be dead simple.” Our goal is to come up with a back end tagging library that can be shoehorned into the widest range of existing (and future) front ends with the minimum disruption. Each front end will have to provide some UI to allow users to apply tags to items, and we’ll have to have a SQL database to store our tag info. Other than that, we can have a very narrow relationship to the host environment.
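To give a feel for how narrow that relationship could be, here is a hypothetical sketch of the storage side using Python's built-in `sqlite3` module. The class name `TagStore`, the `taggings` table, and its columns are all invented for illustration; the real embedding will be designed with our test sites.

```python
import sqlite3


class TagStore:
    """Hypothetical narrow interface between a host front end and the tag back end."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS taggings ("
            " user_id TEXT NOT NULL,"
            " item_id TEXT NOT NULL,"
            " tag TEXT NOT NULL,"
            " PRIMARY KEY (user_id, item_id, tag))"
        )

    def apply_tag(self, user_id: str, item_id: str, tag: str) -> None:
        """Record that a user applied a tag; repeating the call is harmless."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR IGNORE INTO taggings VALUES (?, ?, ?)",
                (user_id, item_id, tag),
            )

    def tags_for(self, user_id: str, item_id: str) -> list:
        """Return the tags this user has applied to this item, sorted by name."""
        rows = self.db.execute(
            "SELECT tag FROM taggings WHERE user_id = ? AND item_id = ?"
            " ORDER BY tag",
            (user_id, item_id),
        )
        return [row[0] for row in rows]
```

In this sketch the host front end only needs to call `apply_tag` from its own UI and read tags back with `tags_for`. Tags are stored per user, so two users spelling a tag the same way never interfere with each other.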

We’re planning to design the specific embedding in consultation with our first few test sites. Of course to implement and test the library we’ll have to embed it in our own feed aggregation environment, but since we control all the pieces, that doesn’t raise the same architectural issues. However, our initial experience with our own environment will give us a basis for collaborating on the design with our test sites.

Craig mentions the possibility of standard APIs or protocols for tag manipulation. We will probably get to these, but I’m most comfortable trying to write standard APIs once we’ve built a few diverse working implementations. Otherwise we’re trying to design based on our fantasies about how people will use this, which are likely to be wrong.

There is another domain where standardization will be even more important, but will probably also take longer. When multiple sites are using this kind of individual tagging, it will be very helpful for them to have a way to operate within a shared social landscape, let individuals extend their profile across sites, and so forth. Interaction between sites requires careful design, since it raises lots of issues about user privacy, security, control by each site of how much information it shares, etc. Also, the protocol for interaction has to be a standard, since it will be used by multiple different developers, and has to be implemented consistently to provide interoperability.

While this cross-site interaction has the potential to generate a great deal of value, we have to realistically defer this design until we have enough experience and enough different developers involved to get it right.

Reply to Craig Hubley (3)

Continuing with my response to Craig’s comment, his third point:

Because they’re political, the influence of our politics on our tags probably has to be (if not explicit) easy to determine by correlation. Someone who’s using tags like “Peak Oil” or “culture of life” at all, is clearly part of some group or movement. Mention of “seal hunt” as a specific or distinct topic suggests concern with it as an issue. Varying terms like “monetary reform”, “capital base”, “reserve rules”, “Bretton Woods”, “dollar hegemony”, all suggest different angles on the same problem, some of which (“reform”) suggest action should be taken. The people tagging may NOT all want to find each other, but the reader DOES want to find all the angles on these issues and therefore would PREFER that tags aggregate in certain ways. For someone “on the left”, perhaps “abortion” aggregates with “women’s rights”, while “on the right” it aggregates with “culture of life”. There must be respect for these choices, and there must be ways to keep these aggregations (“redirects” in wiki-speak) under the control of the user, or a user-chosen, user-trusted, agent. At the highest level of abstraction, I’d simply choose metaphors I wished to reinforce or move towards, and those I wished to abandon, and let aggregation occur as a function of those choices. At least, it might decide which of a long list of hits to drop off, or set some ordering choices. That would be no more insidious than what google is now doing, for its own reasons (not mine).

This point raises a number of technical issues about our approach. They are worth discussing, but realistically, we don’t yet know how we’re going to handle them, so any response at this point is speculative. But hey, speculation is fun!

First, we definitely plan to give each user the ability to make their own individual decisions about how to aggregate issues. On the other hand, that does not require them to make all the decisions themselves; we certainly plan to let them use the aggregation done by others.

Second, as far as possible we’d like the system to implicitly acquire each user’s current preferences, rather than making users “explain” what they want or why they want it. The approach we’ve adopted is to let users attach tags, and then try to learn the attributes of the content that are statistically common to the way that user uses a given tag. In some sense this should let us describe the user’s current “rule” for applying that tag. The “rule”, of course, can change gradually or abruptly over time, and we should be able to track those changes.

Now let’s consider how to achieve Craig’s design goals within this framework. First, we very likely can find the collection of people who have similar “rules”. For example, if one person uses the tag “Peak oil” and another uses the tag “Energy crisis”, and they have tagged different collections of articles for some reason, but their implicit “rules” are very similar, we can recognize that. Our ability to group people together doesn’t depend on how they spell their tags, or which specific items they have tagged, but on the statistical similarity of their tag use.

On the other hand, we don’t currently have any plans to analyze the terms used in the tags themselves. So if one person used the tag “women’s choice” and another used the tag “baby killers” but they had very similar patterns in using these tags, we couldn’t detect that they had opposite feelings about the material. We would just see them as having similar “rules”. I think current computational linguistics doesn’t give us any way to analyze tag names accurately enough to avoid this limitation.

Because we start with the individual user I think we can have some confidence that how things are aggregated will remain under their control. Each user determines the interpretation of their tags. If others use tags that are spelled the same, that won’t change how your tags are interpreted. (Note that this is not at all true of existing shared tagging, which may lead to some confusion.)

The harder question in our approach is how to let users group together, share tags, and influence each other’s tagging “rules”. Because we can find users with similar tagging we can help them to group together if they want. At this point it is less clear how to show users the “landscape” of other users with similar tagging patterns, or how best to give them control over their connections to other users in that landscape. I think once we have the user base, tagging data, and technology to work on those questions, the really interesting part will begin.

Reply to Craig Hubley (2)

Continuing my response to Craig’s comment, his second point was:

The influences on how we look up tags and use them are just as complex as the influences on how we use nouns or adjectives. In addition to our own “style” or “habits”, there are “masters” (like dictionaries, or exemplary opinion leaders) who we’ll prefer to copy because we wish to have powers like theirs to make distinctions and discernments “just like they make”. A good example any software engineer will understand is the use of particular APIs: no one really wants to be first to learn them, and only if they get good reviews or become indispensible to employment, do we want to take the overhead of learning them at all, let alone mastering them. Taxonomy is high investment.

So, consider that the direction in which we are desiring to modify our own habits or style, is a key factor in tagging: tag choices have at least a direction in time: there’s more distinction in this area over time, less distinction in that, adoption of standard terms here, abandonment of non-functional legacy terms there. This suggests we need at least the time to be associated with a tag’s correlation with a given user in some given task - and that we might extinct the older data over time, and allow users to express preferences to tag “more like Clay Shirky”, “less like Craig Hubley”, “less like Microsoft”, or “more like the Communist Party of China”. Yes its political - categories and taxonomies always have been, always will be.

Craig’s point about “masters” is an interesting one and brings up a central issue related to the discussion of taxonomies vs. folksonomies. His example of dictionaries illustrates it nicely. Written language existed for thousands of years before we had dictionaries. Presumably writers learned how to use words in a literary way mainly by imitating others, especially those regarded as particularly exemplary. Dictionaries were only possible because there was an existing consensus on the meaning of words, along with a substantial historical trail showing the evolution of usage. So this consensus was historically prior to explicit definitions.

APIs (software definitions of values and operations that others can use, for those who are not developers), and before them mathematical and some scientific terminology, take the opposite path. Someone defines the meaning of a term (such as “set” in mathematics, or “force” in physics) and others have to learn the definition as given, ultimately creating a consensus if the term (and the theory it helps to express) is useful enough. Terms evolve to some extent, but basically the definition precedes the consensus. But note that these terms get defined within a highly organized community that already operates under a broader consensus on what is important, and how to define and solve problems. Of course this consensus evolves, but at each step most of it survives.

The implication of these examples is that if we have a community with a strong consensus on how to do things, then putting forward new terms and frameworks by definition can be efficient (though this requires high investment by adopters, as Craig notes). However while a community is forming and negotiating a consensus, commitments have to be soft and largely implicit, and creating new frameworks by definition is not likely to work very well. Right now, our focus is on these early phases of consensus formation, where many potential participants have different, more or less incongruent ways of organizing the material, but where they share enough common concerns so that they want to work together. We aim to create systems that help these participants move toward an effective consensus with the least possible stress and friction.

One interesting question this raises is how we can detect when a community consensus is solid enough to start introducing explicit frameworks. However we don’t really need to answer this — people are going to keep trying to define frameworks, and when the consensus is solid enough, these attempts will gain traction.

As discussed in my commentary on Craig’s first point, I agree that we need to let users’ previous actions diminish in influence over the present as they fade into history. I also agree that we need to let each user view tags as influenced by the choices of other designated users. Craig’s additional suggestion that users could choose for this influence to be negative (“less like Microsoft”) is probably a good one. All of this needs empirical feedback, and the discussion helps us to figure out what possibilities to try, and what feedback to look for.

Reply to Craig Hubley (1)

Craig Hubley wrote a long comment on our initial “Brief status” post, and I’m going to write several posts responding to his specific points. His first point was:

Any means to “learn individual tagging style” implies use of past tagging decisions, and respect for those - which suggests that the algorithm’s recommendations (similar to Amazon’s, I’d presume) constantly pull the user towards their past style, and could inhibit them learning new styles or even adopting new tags. Treating tagging patterns as a deliberate choice, or “style”, suggests too strongly that they are informed decisions as opposed to evolving habits. So I’d change this language and “track individual tagging habits” rather than “learn” a “style” - avoid implying that something’s worth learning, or was chosen deliberately, or has anything to do with user individuality. I think tagging is new enough and hard enough to master, that any user who wants to make good use of new tools, will be learning new habits, discarding old ones, and will rarely want simply to continue their current “style” and have software “learn” it. I suspect it’s far more likely that they’ll want to pick one or more “masters” doing exemplary tagging and prefer to follow the tag styles they recommend.

This raises a number of intertwined issues.

First, Craig is absolutely correct that each user’s tagging will change over time. In addition, the stream of content the user is tagging will change over time. We have to be careful not to nail down any decisions based on a user’s prior actions. (I’m avoiding the word “choices” here, since we don’t really know how much the user made a “choice”.) This is a crucial difference between rules and statistical learning — rules require prior thought and tend to be hard to change. Classifiers created by statistical learning are softer and track current actions, so as the user and/or the content changes, the classifier will change.

Second, many of the questions Craig is raising are ones we want to explore empirically. How stable a user’s preferences are, how happy users are with the classifier judgements, etc. are very much questions we want to test. Within broad limits we can tune the learning to match user preferences (how fast it forgets, how sharp the thresholds are, etc.)
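One simple way to get tunable forgetting, sketched below, is to shrink all existing token counts by a constant factor each time a new tagged item arrives. This is an assumption about how it could be done, not a description of our code; the class name `DecayingCounts` and the `decay` parameter are made up for the example.

```python
from collections import defaultdict


class DecayingCounts:
    """Token counts whose older contributions shrink by a constant factor per step."""

    def __init__(self, decay: float = 0.9) -> None:
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay  # 1.0 never forgets; smaller values forget faster
        self.counts = defaultdict(float)

    def observe(self, tokens) -> None:
        """Age all existing counts, then add the tokens of the newest tagged item."""
        for token in self.counts:
            self.counts[token] *= self.decay
        for token in set(tokens):
            self.counts[token] += 1.0

    def weight(self, token: str) -> float:
        """Current (decayed) weight of a token; unseen tokens weigh nothing."""
        return self.counts.get(token, 0.0)
```

With `decay=0.5`, an item tagged two steps ago counts a quarter as much as one tagged just now, so a classifier built on these counts drifts toward the user's current habits.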

Finally, Craig’s point about taking advantage of other users’ exemplary tagging is potentially very interesting. We have to get basic tagging working, but our next step after that is to let users benefit from tagging decisions by others. There are many ways to do this. The easiest (for us) is to let users see the content from another’s point of view, presumably that of the sort of “master” Craig mentions. We can do this as soon as tagging is working. A more complex option is to figure out which users are “near” each other based on similar tagging patterns. This could automatically let new users get the benefit of other users’ more experienced tagging.

Again, all these more advanced options need empirical testing and user feedback. I’m sure in the process we’ll learn a lot about both what is technically feasible (and cost effective) and, more important, how users experience the system and what they find appealing.

Some real (bizarre) constructive interference

Paul Kwiat’s group at UIUC is doing experiments in counterfactual computing, which use constructive interference (among other phenomena).

Kwiat explains that:

In a sense, it is the possibility that the algorithm could run which prevents the algorithm from running…

which sounds dangerously like post-modern jargon. Should we issue a Sokal alert? No, physics is just outrunning our ability to parody it (again).

Of course none of this is actually relevant to our project. Maybe we should be finding ways to use the internet for counterfactual computing…

Update: John Holbo at Crooked Timber has a somewhat more extreme reaction to this development…

Update: Wait, there’s more!  Sean Carroll at Cosmic Variance explains it all in terms of salad, steak, and sleeping puppies (no, really!  I told you physics was beyond parody).  As a bonus he tells you all you need to know about quantum mechanics, and notes that “the rest is just some equations to make it look like science.”

Folksonomies, taxonomies and population thinking

In preparation for a talk on “Science as social practice” I came across a quote by Ernst Mayr (via Three Toed Sloth):

The assumptions of population thinking are diametrically opposed to those of the typologist. …. For the typologist, the type (eidos) is real and the variation an illusion, while for the populationist the type… is an abstraction and only the variation is real.

This seems to me to perfectly capture the issues that divide advocates of folksonomies and taxonomies. Taxonomies attempt (and, as advocates acknowledge, always fail) to find a complete and sharp set of “types” with which to classify their domain. Folksonomies rely on the population of users to create a population of tags that can adequately organize a population of content items.

Perhaps the biggest differences come in managing change. Folksonomies are inherently squishy and messy, and accommodate change much more comfortably, at the cost of never providing the benefits of crisp classification. They are much more successful online because they leverage the power and fluidity of digital environments.

Taxonomies are inherently sharp edged and therefore brittle. They tend to break rather than change gracefully. Furthermore, because they typically require a substantial investment of human effort in construction, training, labeling, etc. change is expensive and painful. On the other hand, sometimes we find these prices worth paying. Searching a physical library where the books were informally tagged by users (perhaps using post-its) would be a nightmare.

We are working on the population side of this dichotomy. Our difference from existing folksonomies (e.g. del.icio.us) is that current sites provide only an extensional definition of their tags, while our technology will create an intensional definition (though squishier than the standard philosophical idea of intension). This lets us provide significantly more complete support for the real diversity and complexity of the user population than existing sites. Let me explain.

On existing sites, the meaning of a tag is simply the set of things that have been given that tag. You want to know whether a tag is appropriate in a given case? Look at what else has been given that tag, and decide whether the thing you were planning to tag is “like” the others. This is a classic case of extensional definition.

Our initial tagging system will create intensional definitions of tags, based on a given user’s choices. Specifically, it will summarize a population of tagging decisions by a classifier. Initially we’ll use a simple bayesian filter, like today’s spam filters. If this doesn’t work well enough, we’ll use other technologies — there are a lot to choose from (Support Vector Machines, Latent Semantic Indexing, etc.). But we think simple methods will work well enough.

So what does this do for our users? There are two benefits that we think are important. First, since our system has an intensional definition of what a given user means by a tag, it can classify new items for that user. If it is wrong the user can correct it and thus improve that definition.

Second, since the system has these intensional definitions for each user, it can compare them and find similar tag definitions. Note that it makes no difference how a given user spelled their tag. If I say “SF”, you say “Science Fiction”, and someone else says “Space opera”, the system judges the similarity of the intensional definitions and ignores the way their names are spelled. In our environment, similarity is a relation on the intensional definitions, not on the tag spellings.
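Here is a small sketch of what comparing definitions rather than spellings could look like. As a stand-in for a real classifier, each tag is summarized by the token counts of the items a user gave it, and two summaries are compared by cosine similarity. The functions `tag_profile` and `cosine_similarity`, and the choice of cosine, are assumptions for illustration only.

```python
import math
from collections import Counter


def tag_profile(tagged_texts) -> Counter:
    """Summarize the items given a tag as token counts (a crude intensional definition)."""
    profile = Counter()
    for text in tagged_texts:
        profile.update(text.lower().split())
    return profile


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Note that the tag names never enter the computation: my “SF” profile and your “Science Fiction” profile are compared purely on the content we each tagged.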

So how does all this relate to population thinking? Systems with shared extensional definitions of tag meaning or shared moderation values implicitly push their users toward a single, shared, “ideal” perspective. Tags and moderation values tend to be viewed in terms of “consensus meaning”, and sites like Slashdot have explicit meta-moderation to enforce this consensus!

In contrast, by generating intensional definitions we explicitly accept (and celebrate!) that the users of this technology will be a diverse population, each with different interests and ways of classifying the world. Our goal is to accommodate the real diversity of the user population, while also giving people ways to adopt each other’s perspectives, and to collaborate when they want to.

Update: Tim Spalding of LibraryThing has some interesting meditations on how to compute intensional meanings from an extensional tag pool.  (He doesn’t use those terms, but he needs some sort of intensional definition to make the similarity judgements he’s aiming for.)  He specifically comments on the messy nature of tagging and the need to use statistical methods to extract cleaner information from the tags.

However Tim is still effectively throwing away a lot of information. A given user might have a very focused meaning for some of the tags he mentions — e.g. “WWI” or “memoir”. But since LibraryThing is looking only at the total pool of tags spelled the same, rather than the meaning of those tags within each user’s library, that focused meaning gets washed out. This is exactly the effect we are trying to avoid. The key differences are that Tim is computing intensional definitions over the whole tag pool, while we are computing them for each user separately, and Tim is using only the tags and the identities of the tagged items, while we are also using the “fine structure” of the tagged items (the tokens they contain).  Of course for books this fine structure is often not available.
