April 8, 2006
Reply to Craig Hubley (2)
Continuing my response to Craig’s comment, his second point was:
The influences on how we look up tags and use them are just as complex as the influences on how we use nouns or adjectives. In addition to our own “style” or “habits”, there are “masters” (like dictionaries, or exemplary opinion leaders) who we’ll prefer to copy because we wish to have powers like theirs to make distinctions and discernments “just like they make”. A good example any software engineer will understand is the use of particular APIs: no one really wants to be first to learn them, and only if they get good reviews or become indispensable to employment, do we want to take the overhead of learning them at all, let alone mastering them. Taxonomy is high investment.
So, consider that the direction in which we are desiring to modify our own habits or style, is a key factor in tagging: tag choices have at least a direction in time: there’s more distinction in this area over time, less distinction in that, adoption of standard terms here, abandonment of non-functional legacy terms there. This suggests we need at least the time to be associated with a tag’s correlation with a given user in some given task - and that we might extinct the older data over time, and allow users to express preferences to tag “more like Clay Shirky”, “less like Craig Hubley”, “less like Microsoft”, or “more like the Communist Party of China”. Yes, it’s political - categories and taxonomies always have been, always will be.
Craig’s point about “masters” is an interesting one and brings up a central issue related to the discussion of taxonomies vs. folksonomies. His example of dictionaries illustrates it nicely. Written language existed for thousands of years before we had dictionaries. Presumably writers learned how to use words in a literary way mainly by imitating others, especially those regarded as particularly exemplary. Dictionaries were only possible because there was an existing consensus on the meaning of words, along with a substantial historical trail showing the evolution of usage. So this consensus was historically prior to explicit definitions.
APIs (software definitions of values and operations that others can use, for those who are not developers), and before them mathematical and some scientific terminology, take the opposite path. Someone defines the meaning of a term (such as “set” in mathematics, or “force” in physics) and others have to learn the definition as given, ultimately creating a consensus if the term, and the theory it helps to express, is useful enough. Terms evolve to some extent, but basically the definition precedes the consensus. But note that these terms get defined within a highly organized community that already operates under a broader consensus on what is important, and how to define and solve problems. Of course this consensus evolves, but at each step most of it survives.
The implication of these examples is that if we have a community with a strong consensus on how to do things, then putting forward new terms and frameworks by definition can be efficient (though this requires high investment by adopters, as Craig notes). However while a community is forming and negotiating a consensus, commitments have to be soft and largely implicit, and creating new frameworks by definition is not likely to work very well. Right now, our focus is on these early phases of consensus formation, where many potential participants have different, more or less incongruent ways of organizing the material, but where they share enough common concerns so that they want to work together. We aim to create systems that help these participants move toward an effective consensus with the least possible stress and friction.
One interesting question this raises is how we can detect when a community consensus is solid enough to start introducing explicit frameworks. However we don’t really need to answer this — people are going to keep trying to define frameworks, and when the consensus is solid enough, these attempts will gain traction.
As discussed in my commentary on Craig’s first point, I agree that we need to let users’ previous actions diminish in influence over the present as they fade into history. I also agree that we need to let each user view tags as influenced by the choices of other designated users. Craig’s additional suggestion that users could choose for this influence to be negative (“less like Microsoft”) is probably a good one. All of this needs empirical feedback, and the discussion helps us to figure out what possibilities to try, and what feedback to look for.
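To make the two ideas concrete, here is a minimal sketch, not a description of any existing system. It assumes each tagging action is recorded with a user and a timestamp, that old actions fade with an exponential half-life, and that the viewer assigns each designated user a signed weight (positive for “more like”, negative for “less like”). The names `TagEvent`, `tag_scores`, `influence` and `half_life_days` are all hypothetical, as is the choice of exponential decay.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TagEvent:
    """One tagging action (hypothetical record layout)."""
    user: str
    tag: str
    when: datetime  # timezone-aware


def tag_scores(events, now, influence, half_life_days=180.0):
    """Score tags for one viewer.

    Each event contributes influence[user] * 0.5 ** (age / half_life).
    Users absent from `influence` contribute nothing; a negative weight
    pushes a tag's score down ("less like ...").
    """
    scores = defaultdict(float)
    for event in events:
        weight = influence.get(event.user, 0.0)
        if weight == 0.0:
            continue
        age_days = max((now - event.when).total_seconds() / 86400.0, 0.0)
        scores[event.tag] += weight * 0.5 ** (age_days / half_life_days)
    return dict(scores)


if __name__ == "__main__":
    now = datetime(2006, 4, 8, tzinfo=timezone.utc)
    events = [
        TagEvent("clay", "folksonomy", now - timedelta(days=30)),
        TagEvent("vendor", "taxonomy", now - timedelta(days=400)),
    ]
    # Assumed preferences: more like "clay", less like "vendor".
    print(tag_scores(events, now, {"clay": 1.0, "vendor": -1.0}))
```

Whether a half-life is the right shape for the fade, and what the weights should be, is exactly the kind of thing that needs the empirical feedback mentioned above.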
Filed by Jed Harris at 1:05 pm under Status & plans
You can’t grow beans or tomatoes without tying them up. If you let them run along the ground, you’ll get radically fewer beans and tomatoes, and the bugs and squirrels (and in my case deer) will eat them. There’s got to be something solid. To take up the maximum space with the minimum materials, a tent must have flexible but strong and solid poles.
“APIs (software definitions of values and operations that others can use, for those who are not developers), and before them mathematical and some scientific terminology, take the opposite path” to the data. To give users flexibility, one must take it away from developers. Developers, the best of whom become architects, and the best of those can battle it out in the bloody field of ontology, are supposed to know what can safely be made solid, and what has to be left flexible. A good example is the various ways of dealing with time and space. Geospatial data and temporal data require standards, and extremely strict ones, and many shorthands used by people will fail when they are picked up by systems without human-like brains. To an individual human living in one place, the date “April 5, 2006” really exists, and it starts when they wake up and ends when they go to sleep, even if they go to sleep after midnight. To the global financial markets, there’s no such concept. There’s “April 5, 2006 EDT” or “April 5, 2006 UTC”, which are specific 24-hour periods. If you want to prevent databases exploding because things are created on April 6 and then other things are created AFTER that on April 5, then you *must* get this temporal convention right from the beginning. You do users no favour if you allow them to vary the ways they talk about time, to the degree that it destroys the integrity of the database for automatic processing purposes. For the same reason, acronyms like “USA” or “UK”, which did not denote the same collection of places in 1950 as they did in 1910, cannot be acceptable as an assertion of “where” something occurred. In 1910 Dublin was in the UK, and Guam was not part of the USA.
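The April 5 / April 6 problem can be shown in a few lines. This is only a sketch using Python's standard `datetime` module. The helper names `day_window` and `parse_stamp` are made up for illustration, and EDT is modelled as a fixed UTC-4 offset, which is an assumption that holds for this date but not for a general time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT as a fixed UTC-4 offset, good enough for April 2006.
EDT = timezone(timedelta(hours=-4), "EDT")


def day_window(year, month, day, tz):
    """Return the specific 24-hour period a calendar date names in `tz`."""
    start = datetime(year, month, day, tzinfo=tz)
    return start, start + timedelta(days=1)


def parse_stamp(text):
    """Parse an ISO 8601 time stamp, refusing any that omit the UTC offset."""
    stamp = datetime.fromisoformat(text)
    if stamp.tzinfo is None or stamp.utcoffset() is None:
        raise ValueError(f"time stamp without a zone is not data: {text!r}")
    return stamp.astimezone(timezone.utc)


if __name__ == "__main__":
    # Created first, labelled April 6 (UTC).
    first = parse_stamp("2006-04-06T01:00:00+00:00")
    # Created an hour later, but labelled April 5 (EDT).
    second = parse_stamp("2006-04-05T22:00:00-04:00")
    print(first < second)  # ordering is only safe after normalising to UTC

    start, end = day_window(2006, 4, 5, EDT)
    print(start.astimezone(timezone.utc), end.astimezone(timezone.utc))
```

Rejecting zone-less stamps at the door is a design choice, not the only one; the point is that the convention has to be fixed before any data is stored.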
API developers know these things and they think hard about problems like how to deal with the many ways space, time, and personal identity are expressed. They are not like theoretical physicists or mathematicians, just defining meaning into existence. Yes, “others have to learn the definition as given, ultimately creating a consensus if the term — and the theory it helps to express — is useful enough” in all these cases, but an API can be extended, augmented and modified - that’s what open source is for. However it also opens the possibility that some idiot (and I use this term advisedly) will “improve” the API by “making it easier to state the date” by omitting the time zone, thus resulting in non-data floating around pretending it is a meaningful data token. Hmm.
“Terms evolve to some extent, but basically the definition precedes the consensus” in anything to do with software, including mathematics. Yes, “these terms get defined within a highly organized community that already operates under a broader consensus on what is important, and how to define and solve problems. Of course this consensus evolves, but at each step most of it survives.” And it has to survive. Or else we have no dictionary.
Term identification, amplification, aggregation and deprecation is a basic process in software architecture. If the project is not robust enough to evolve APIs sensibly, it isn’t a project, it’s just a bunch of hackers doing something that they try to make all run on the same computer. So the API is the central artifact being evolved between the more flexible (application) software and the less flexible (OS or boot loaded) software; it directly parallels the protocol that works between hosts. The protocol and the API have to use the same terms and mean the same things when they do.
“The implication of these examples is that if we have a community with a strong consensus on how to do things, then putting forward new terms and frameworks by definition can be efficient (though this requires high investment by adopters, as Craig notes).” Because of the high investment, you have to get as close as possible to the correct API on the first pass, and you don’t get this by concentrating on something else (like oh say the reliability of data storage by piling in SQL, etc.).
“However while a community is forming and negotiating a consensus, commitments have to be soft and largely implicit, and creating new frameworks by definition is not likely to work very well” if they are truly new. But most of them aren’t. We have a rich literature on temporal algebra and on geospatial data, and there are standards in these areas which our time stamps and GPS output are going to meet (eventually). As for people, we’ve got pretty much every different way of naming each important person in history in sources like Wikipedia, in Unicode yet, and translated where necessary. There’s very little “new” in the universe of categories and tags.
At the University of Toronto’s Robarts Library, they specialize in library science cataloging schemes. There are 1100+ of these on file, for all purposes from small research libraries in specific fields to general purpose systems like the Dewey Decimal system. They will be filing Wikipedia’s category scheme in there about now. Unlike all the others it’s GFDL (open content) and continually being updated. So start there, and look hard at the toughest geospatial and temporal problems to see where the solutions are. The API will need to have those problems solved, or solving other problems just won’t occur.
Something like “World War II” doesn’t have fixed start or end dates, though some textbooks will say it does; those have the POV of a specific country or group. But it must be possible to say somehow that it does, or specify a rule that if it happened at that time, and is relevant to certain fields, that it must have that tag. Something like “Bangkok” also has formal definitions, a set of city limits that’s changed over time, but must somehow be statable (if not stated) somewhere. I am not one of those who thinks that all terms are utterly mutable, and that things like “set” in mathematics are wholly made up. They have roots in the cognitive system: we have ways to recognize a set of similar objects, e.g. raindrops, berries, blades of grass. To exist at all we have to treat them similarly and develop methods of dealing with them as classes or categories.
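One way to make such a definition “statable” is to attach each date range to the source that asserts it, and let the rule apply only for sources the user trusts. The sketch below is an illustration under assumptions: `ScopedDefinition`, `tags_for` and the two source names are hypothetical, and the date ranges are two commonly cited conventions used as sample data, not a ruling on which is right.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScopedDefinition:
    """A tag's date range as asserted by one source (one POV)."""
    tag: str
    source: str
    start: date
    end: Optional[date]  # inclusive; None means still open

    def covers(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def tags_for(day, definitions, trusted_sources):
    """Tags that must apply to `day` according to the trusted sources."""
    return sorted(
        {d.tag for d in definitions
         if d.source in trusted_sources and d.covers(day)}
    )


# Sample data only: hypothetical source names, commonly cited date ranges.
DEFINITIONS = [
    ScopedDefinition("World War II", "textbook-a", date(1939, 9, 1), date(1945, 9, 2)),
    ScopedDefinition("World War II", "textbook-b", date(1937, 7, 7), date(1945, 9, 2)),
]

if __name__ == "__main__":
    day = date(1938, 1, 1)
    print(tags_for(day, DEFINITIONS, {"textbook-a"}))
    print(tags_for(day, DEFINITIONS, {"textbook-b"}))
```

The same shape would fit a place like “Bangkok” if the date range were paired with a boundary instead of just a tag.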
There’s a lot written about these processes in the cognitive science of mathematics, and it’s more solid than you’d think reading all the other nonsense written that’s called “philosophy of mathematics”.