Preferential Text Classification: Learning Algorithms and Evaluation Measures

by Fabio Aiolli, Riccardo Cardin, Fabrizio Sebastiani and Alessandro Sperduti

Researchers from ISTI-CNR, Pisa, and from the Department of Pure and Applied Mathematics at the University of Padova are explicitly attacking the document classification problem of distinguishing primary from secondary classes, using 'preferential learning' technology.

In many contexts in which textual documents are labelled with thematic classes, a distinction is made between the primary and secondary classes to which a given document belongs. The primary classes of a document represent the topic(s) that are central to the document, or that the document is mainly about. The secondary classes instead represent topics that are somehow touched upon, albeit peripherally, and do not represent the main thrust of the document.

This distinction has been neglected in text classification (TC) research. We contend that it is important and deserves to be explicitly tackled since, in most contexts in which the distinction is made, the degree of importance of a misclassification can depend on whether it involves a primary or a secondary class. For instance, when a patent application is submitted to the European Patent Office (EPO), a primary class from the International Patent Classification (IPC) scheme is attached to the application, and that class determines the expert examiner who will be in charge of evaluating the application. Secondary classes are attached only for the purpose of identifying related prior art, since the appointed examiner will need to determine the novelty of the proposed invention against existing patents classified under either the primary or any of the secondary classes. Thus, for the purposes of the EPO, failing to recognize the true primary class of a document is a more serious mistake than failing to recognize a true secondary class. Similar considerations apply to other scenarios in which the distinction is made.

To address this distinction explicitly, we introduce preferential text classification, a task which we define as the attribution to a textual document d of a partial ordering among the set of classes C. This partial ordering specifies whether or not a given class 'applies more than' (or 'is preferred to') another class for the document. In particular, we focus on a special case of preferential TC, namely the case in which each document is associated with a 'three-layered' partial order. This consists of a top layer of one or more primary classes, each of which is preferred to those in a middle layer of secondary classes, which are in turn each preferred to those in a bottom layer of 'non-classes' (ie classes that do not apply at all to the document).
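To make the three-layered partial order concrete, the following minimal Python sketch enumerates the pairwise preferences it implies for a single document. The function name preference_pairs and the toy class labels are illustrative assumptions, not part of the authors' system, and the sketch assumes that the primary and secondary sets are disjoint.

```python
from itertools import product


def preference_pairs(primary, secondary, all_classes):
    """Return the set of (preferred, less_preferred) class pairs for one document.

    Hypothetical helper: assumes primary and secondary classes are disjoint
    and that every class outside both sets is a 'non-class'.
    """
    primary, secondary = set(primary), set(secondary)
    non_classes = set(all_classes) - primary - secondary

    pairs = set(product(primary, secondary))        # primary over secondary
    pairs |= set(product(primary, non_classes))     # primary over non-class
    pairs |= set(product(secondary, non_classes))   # secondary over non-class
    return pairs


# Toy example: four classes, one primary ("A") and one secondary ("B").
classes = ["A", "B", "C", "D"]
pairs = preference_pairs(primary=["A"], secondary=["B"], all_classes=classes)
```

Note that classes within the same layer (here "C" and "D") are left unordered, which is what makes the ordering partial rather than total.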

The original contribution of our work is twofold. First, we propose an evaluation measure for preferential TC, in which different kinds of misclassifications involving either primary or secondary classes have a different impact on effectiveness. Second, we attack preferential TC by using a learning model, dubbed the Generalized Preference Learning Model, that was explicitly devised for learning from training data expressed in preferential form, ie in the form "class c' is preferred to class c'' for document d". This model allows us to draw a fine distinction between primary and secondary classes in both the testing and learning phases, thus making use of the different importance of primary and secondary classes to which a training document belongs. Experiments run on WIPO-alpha, a well-known benchmark dataset consisting of manually classified patents, show that the Generalized Preference Learning Model outperforms standard (ie non-preferential) state-of-the-art learning approaches.
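The idea that different kinds of misclassification should weigh differently can be illustrated with a simple weighted pairwise error. The sketch below is not the evaluation measure proposed in the paper; it is a generic stand-in, and the function names, the weights and the toy scores are all assumptions chosen for illustration. It counts each violated preference, weighted by the pair of layers involved.

```python
def layer_of(c, primary, secondary):
    """Return 0 for a primary class, 1 for a secondary class, 2 for a non-class."""
    if c in primary:
        return 0
    if c in secondary:
        return 1
    return 2


def weighted_pairwise_error(scores, primary, secondary, weights):
    """Weighted fraction of violated preferences (illustrative, not the paper's measure).

    scores  : dict mapping each class to a classifier score for one document
    weights : dict mapping (higher_layer, lower_layer) to a penalty (assumed values)
    """
    primary, secondary = set(primary), set(secondary)
    total = 0.0
    violated = 0.0
    for a in scores:
        for b in scores:
            la = layer_of(a, primary, secondary)
            lb = layer_of(b, primary, secondary)
            if la < lb:  # a should be preferred to b
                w = weights[(la, lb)]
                total += w
                if scores[a] <= scores[b]:  # preference not respected
                    violated += w
    return violated / total if total else 0.0


# Assumed weights: confusing a primary class with a non-class costs the most.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.0}
scores = {"A": 0.9, "B": 0.2, "C": 0.5, "D": 0.1}
error = weighted_pairwise_error(scores, primary={"A"}, secondary={"B"}, weights=weights)
```

In this toy case only the preference of the secondary class "B" over the non-class "C" is violated, so the error stays small; ranking the primary class "A" below a non-class would be penalised twice as heavily under the assumed weights.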

Link:
http://www.isti.cnr.it/People/F.Sebastiani/Publications/IRJ08b.pdf

Please contact:
Fabrizio Sebastiani
ISTI-CNR, Italy
Tel: +39 050 3152 892
E-mail: fabrizio.sebastiani@isti.cnr.it