Supplementary information for I.R. manuscript "Term based comparison metrics for controlled and uncontrolled indexing languages" by Benjamin M. Good and Joseph T. Tennis

Abstract

Introduction. We define a collection of metrics for describing and comparing sets of terms in controlled and uncontrolled indexing languages and then show how these metrics can be used to characterize a set of languages spanning folksonomies, ontologies, and thesauri.

Method. Metrics for term set characterization and comparison were identified and programs for their computation implemented. These programs were then used to identify descriptive features of term sets from twenty-two different indexing languages and to measure the direct overlap between the terms.

Analysis. The computed data was analyzed using manual and automated techniques including visualization, clustering and factor analysis. Distinct subsets of the metrics were sought that could be used to distinguish between the uncontrolled languages produced by social tagging systems (folksonomies) and the controlled languages produced using professional labour.

Results. The metrics proved sufficient to differentiate between instances of different languages and to enable the identification of term-set patterns associated with indexing languages produced by different kinds of information system. In particular, distinct groups of term-set features appear to distinguish folksonomies from the other languages.

Conclusions. The metrics organized here and embodied in freely available programs provide an empirical lens useful in beginning to understand the relationships that hold between different controlled and uncontrolled indexing languages.

Acknowledgements

Thanks to Philip Ogren for providing the code related to the compositionality measurements. See his papers about the implications of compositionality in the Gene Ontology [in PSB 2005 and PSB 2006].