by Giacomo Berardi, Andrea Esuli and Fabrizio Sebastiani
Researchers from ISTI-CNR, Pisa, have addressed the problem of optimizing the work of human editors who proofcheck the results of an automatic text classifier with the goal of improving the accuracy of the automatically classified document set.
Suppose an organization needs to classify a set of texts under a given classification scheme, and suppose that this set is too large to be classified manually, so that resorting to some form of automated text classification (TC) is the only viable option. Suppose also that the organization has strict accuracy standards, so that the level of accuracy that can be obtained via state-of-the-art TC technology is not sufficient. In this case, the most plausible strategy to follow is to classify the texts by means of an automatic classifier (which we assume here to be generated via supervised learning), and then to have a human editor proofcheck the results of the automatic classification, correcting misclassifications where appropriate.
The human editor will obviously inspect only a subset of the automatically classified texts, since it would otherwise make no sense to have an initial automated classification phase. A software system could actively support the human editor by ranking the automatically classified documents, after the classification phase has ended and before the inspection begins, in such a way that, if the human editor inspects the documents starting from the top of the ranking and working down the list, the expected increase in classification accuracy that derives from this inspection is maximized. We call this scenario “semi-automated text classification” (SATC).
A common-sense ranking method for SATC could consist in ranking the automatically classified texts in ascending order of the confidence scores generated by the classifier, so that the top-ranked documents are the ones that the classifier has classified with the lowest confidence [1]. The rationale is that an increase in accuracy can derive only from inspecting misclassified documents, and that a good ranking method is simply one that top-ranks the documents with the highest probability of misclassification, which (in the absence of other information) we may take to be the texts that the classifier has classified with the lowest confidence.
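The common-sense baseline can be illustrated with a minimal sketch. It assumes, purely for illustration, that the classifier returns a signed score per document whose sign is the decision and whose absolute value is the confidence; the function name and the example scores are hypothetical.

```python
def rank_by_lowest_confidence(scores):
    """Return document indices ordered from least to most confident decision.

    Assumption: each score is signed, with the sign giving the binary
    decision and the absolute value giving the classifier's confidence.
    """
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))


# Hypothetical scores for five automatically classified documents.
scores = [2.3, -0.1, 0.8, -1.7, 0.05]
ranking = rank_by_lowest_confidence(scores)
print(ranking)  # [4, 1, 2, 3, 0]
```

The editor would inspect document 4 first, since its score is closest to the decision threshold, and document 0 last.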
We have recently shown [2] that this strategy is, in general, suboptimal. Simply stated, the reason is that, when we deal with imbalanced TC problems (as most TC problems indeed are [3]) and, as a consequence, choose an evaluation measure, such as F1, that caters for this imbalance, the improvements in effectiveness that derive from correcting a false positive or a false negative may not be the same.
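A small worked example makes the asymmetry concrete. The contingency counts below are invented for illustration, and the convention of returning 1.0 when the denominator is zero is an assumption of this sketch.

```python
def f1(tp, fp, fn):
    """F1 computed from contingency-table counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 1.0 when there is nothing to find and nothing was found.
    return 2 * tp / denom if denom else 1.0


# Hypothetical counts for one imbalanced class.
tp, fp, fn = 10, 30, 5
base = f1(tp, fp, fn)

# Correcting a false positive turns it into a true negative: FP decreases by one.
gain_fp = f1(tp, fp - 1, fn) - base
# Correcting a false negative turns it into a true positive: FN decreases, TP increases.
gain_fn = f1(tp + 1, fp, fn - 1) - base

print(f"F1 before inspection: {base:.4f}")
print(f"gain from correcting one false positive: {gain_fp:.4f}")
print(f"gain from correcting one false negative: {gain_fn:.4f}")
```

With these counts, correcting a single false negative improves F1 by roughly four times as much as correcting a single false positive, so two documents with the same probability of misclassification need not be equally worth inspecting.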
We have devised a ranking method for SATC that combines, via utility theory, (i) information on the probability that the document is misclassified, and (ii) information on the gain in overall accuracy that would derive from proofchecking it.
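The idea of weighting the probability of misclassification by the gain from a correction can be sketched as follows. This is a simplified illustration rather than the published method: it assumes that a misclassification probability is already available for each document, that an estimate of the contingency table is given, and that the gains are computed once from that estimate. All names and numbers are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; 1.0 by assumed convention if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def rank_by_expected_utility(docs, tp, fp, fn):
    """Rank documents by P(misclassified) times the F1 gain of correcting them.

    docs: list of (doc_id, predicted_positive, p_error) tuples, where p_error
    is an assumed estimate of the probability that the decision is wrong.
    tp, fp, fn: assumed estimate of the contingency table before inspection.
    """
    base = f1(tp, fp, fn)
    # A wrong positive decision is a false positive; correcting it removes one FP.
    gain_fp = f1(tp, fp - 1, fn) - base if fp > 0 else 0.0
    # A wrong negative decision is a false negative; correcting it adds one TP.
    gain_fn = f1(tp + 1, fp, fn - 1) - base if fn > 0 else 0.0

    def utility(doc):
        _, predicted_positive, p_error = doc
        return p_error * (gain_fp if predicted_positive else gain_fn)

    return [doc[0] for doc in sorted(docs, key=utility, reverse=True)]


docs = [
    ("a", True, 0.40),   # assigned to the class, quite likely wrong
    ("b", False, 0.20),  # not assigned, moderately likely wrong
    ("c", True, 0.10),
    ("d", False, 0.05),
]
ranking = rank_by_expected_utility(docs, tp=10, fp=30, fn=5)
print(ranking)  # ['b', 'a', 'd', 'c']
```

Ranking by probability of misclassification alone would give a, b, c, d. Once the larger gain from correcting a false negative is taken into account, document b overtakes a, and d overtakes c.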
We have also proposed a new evaluation measure for SATC, called Expected Normalized Error Reduction (ENER). Since different users will inspect the ranked list down to different “inspection depths”, ENER takes a probability distribution over inspection depths as a parameter. ENER then measures the expected value (over this probability distribution) of the reduction in error that inspecting a ranked list down to a given depth would bring about.
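The notion of an expected error reduction over inspection depths can be sketched as below. This is a simplified stand-in, not the published definition of ENER: error is taken to be 1 - F1, the reduction is expressed relative to the initial error, and any further normalization used by the actual measure is not reproduced. The depth distribution and the two rankings are invented.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; 1.0 by assumed convention if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def error_reductions(ranked, tp, fp, fn):
    """Relative reduction in error (1 - F1) after inspecting the top n documents.

    ranked: true status of each document in ranked order, one of
    "FP", "FN" or None (correctly classified). Returns a list indexed by n.
    """
    initial_error = 1.0 - f1(tp, fp, fn)
    reductions = [0.0]
    for status in ranked:
        if status == "FP":    # corrected false positive becomes a true negative
            fp -= 1
        elif status == "FN":  # corrected false negative becomes a true positive
            fn -= 1
            tp += 1
        error = 1.0 - f1(tp, fp, fn)
        reductions.append((initial_error - error) / initial_error if initial_error else 0.0)
    return reductions


def expected_error_reduction(ranked, tp, fp, fn, depth_probs):
    """Expected relative error reduction under a distribution over inspection depths."""
    reductions = error_reductions(ranked, tp, fp, fn)
    return sum(p * reductions[depth] for depth, p in depth_probs.items())


# Hypothetical test set of four documents: one TP, one FP, one FN, one TN.
depth_probs = {1: 0.5, 2: 0.3, 4: 0.2}  # assumed probability of stopping at each depth
good = expected_error_reduction(["FN", "FP", None, None], 1, 1, 1, depth_probs)
poor = expected_error_reduction([None, None, "FP", "FN"], 1, 1, 1, depth_probs)
print(round(good, 3), round(poor, 3))
```

A ranking that places the misclassified documents at the top scores much higher than one that places them at the bottom, because most of the probability mass lies on shallow inspection depths.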
We have used ENER as the evaluation measure for our experiments, which we have run on a standard text classification dataset. The results show that, with respect to the common-sense baseline method mentioned above, our utility-theoretic ranking method is substantially more effective, with computed improvements ranging from +16% to +138%.
The approach we present is extremely general, since it applies straightforwardly to cases in which evaluation measures different from F1 are used; multivariate and non-linear evaluation measures can be handled too, provided they can be computed from a standard contingency table. By using our method, it is also easy to dynamically provide the human editor with an estimate of how accurate the classified set has become as a result of the proofchecking activity.
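Because the gains depend only on the contingency table, swapping in a different evaluation measure amounts to swapping one function. The sketch below, with hypothetical helper names and invented counts, compares plain accuracy with F2, under the assumption that the table contains at least one false positive and one false negative.

```python
def correction_gains(measure, tp, fp, fn, tn):
    """Gain in `measure` from correcting one false positive and one false negative.

    measure: any function of the four contingency-table cells.
    Assumes fp >= 1 and fn >= 1.
    """
    base = measure(tp, fp, fn, tn)
    gain_fp = measure(tp, fp - 1, fn, tn + 1) - base  # FP becomes TN
    gain_fn = measure(tp + 1, fp, fn - 1, tn) - base  # FN becomes TP
    return gain_fp, gain_fn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f_beta(beta):
    """Build an F-beta measure; returns 1.0 by assumed convention on a zero denominator."""
    b2 = beta * beta

    def measure(tp, fp, fn, tn):
        denom = (1 + b2) * tp + b2 * fn + fp
        return (1 + b2) * tp / denom if denom else 1.0

    return measure


# Hypothetical counts for one imbalanced class in a set of 1000 documents.
acc_gains = correction_gains(accuracy, tp=10, fp=30, fn=5, tn=955)
f2_gains = correction_gains(f_beta(2.0), tp=10, fp=30, fn=5, tn=955)
print("accuracy gains (FP, FN):", acc_gains)
print("F2 gains (FP, FN):", f2_gains)
```

Under accuracy the two kinds of correction are worth the same, so ranking by probability of misclassification alone would suffice; under F2 the gain from correcting a false negative is far larger, and the two rankings diverge.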
[1] A. Esuli and F. Sebastiani: “Active Learning Strategies for Multi-Label Text Classification”, in proc. of ECIR 2009, Toulouse, FR, pp. 102-113
[2] G. Berardi, A. Esuli, and F. Sebastiani: “A Utility-Theoretic Ranking Method for Semi-Automated Text Classification”, in proc. of ACM SIGIR 2012, Portland, US, pp. 961-970
[3] H. He and E. Garcia: “Learning from imbalanced data”, IEEE Transactions on Knowledge and Data Engineering, 21(9), 2009, pp. 1263-1284
Fabrizio Sebastiani, ISTI-CNR, Italy
Tel: +39 050 3152 892