by Andrea Esuli, Diego Marcheggiani and Fabrizio Sebastiani
Researchers from ISTI-CNR, Pisa, aim to effectively and efficiently extract information from free-text mammography reports, as a step towards the automatic transformation of unstructured medical documentation into structured data.
Information Extraction is the discipline concerned with the extraction of natural language expressions from free text, where these expressions instantiate concepts of interest in a given domain. For instance, given a corpus of job announcements, one may want to extract from each announcement the natural language expressions that describe the nature of the job, annual salary, job location, etc. Put another way, information extraction may be seen as the activity of populating a structured information repository (such as a relational database, where “job”, “annual salary” and “job location” count as attributes) from an unstructured information source such as a corpus of free text. Another example of information extraction is searching free text for named entities, ie, names of persons, locations, geopolitical organizations, and the like.
An application of great interest is extracting information from free-text medical reports, such as radiology reports. These reports are unstructured in nature, since they are written in free text by medical personnel. Applying information extraction to them would be beneficial: extracting key data and converting them into a structured (eg, tabular) format would help provide patients with electronic medical records that, aside from improving interoperability among different medical information systems, could be used in a variety of applications, from patient care to epidemiology and clinical studies.
In recent months we have worked on a system for automatically extracting information from mammography reports written in Italian. There are two main approaches to designing an information extraction system. One is the rule-based approach, which consists of writing a set of rules that relate natural language patterns to the concepts to be extracted from the text. This approach, while potentially effective, is costly, since the rules must be written jointly by a domain expert (say, an expert radiologist) and a natural language engineer, which requires a great deal of human effort. We have followed an alternative approach, which is based on machine learning. According to this approach, a general-purpose learning algorithm learns to relate natural language patterns to the concepts to be instantiated, using a set of manually annotated free texts, ie texts in which the instances of the concepts of interest have been marked by a domain expert. The advantage of this approach is that the human effort required to annotate the texts needed to train the system is much smaller than that needed to manually write the extraction rules.
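To make the notion of "manually annotated free texts" concrete, the following minimal sketch shows one common way of turning expert-marked spans into per-token training labels. The B/I/O labelling scheme, the function name `spans_to_bio`, the English example sentence and the concept name `FollowupTherapies` are all illustrative assumptions; the article does not state how annotations are encoded in the actual system.

```python
def spans_to_bio(tokens, spans):
    """Convert expert-marked spans into one label per token.

    tokens: list of token strings.
    spans:  list of (start, end, concept) tuples over token indices,
            with `end` exclusive. Spans are assumed not to overlap.
    """
    labels = ["O"] * len(tokens)          # "O" = outside any concept
    for start, end, concept in spans:
        labels[start] = "B-" + concept    # first token of the instance
        for i in range(start + 1, end):
            labels[i] = "I-" + concept    # remaining tokens of the instance
    return labels


# Hypothetical annotated sentence (not taken from the real corpus).
tokens = ["Follow-up", "mammography", "recommended", "in", "12", "months", "."]
spans = [(0, 6, "FollowupTherapies")]
bio_labels = spans_to_bio(tokens, spans)

for token, label in zip(tokens, bio_labels):
    print(f"{token}\t{label}")
```

Pairs of token sequences and label sequences of this kind are what a sequence learner would be trained on under this assumed encoding.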
Figure 1: A mammographic report automatically annotated according to the nine concepts of interest.
The system we have built uses “conditional random fields” (CRFs) as a learning technique. CRFs were explicitly devised for managing data of a sequential nature, such as text, and have given good results on text-related tasks such as named entity extraction and part-of-speech tagging.
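A linear-chain CRF scores a whole label sequence by combining per-token scores with scores for transitions between adjacent labels, which is why it suits sequential data. The sketch below illustrates only the decoding step (Viterbi search for the best-scoring label sequence) on a toy example. The scores are set by hand rather than learned, and the labels, the concept name `Size` and the example tokens are hypothetical; this is not the system described in the article.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list (one entry per token) of dicts label -> score.
    transitions: dict (previous_label, label) -> score; missing pairs score 0.
    """
    # best[i][y]: best score of any label sequence ending with y at token i
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []  # back[i][y]: best previous label when token i+1 gets label y
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + em.get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-Size", "I-Size"]
tokens = ["nodule", "of", "12", "mm"]          # hypothetical example
emissions = [
    {"O": 1.0},
    {"O": 1.0},
    {"B-Size": 2.0, "O": 0.5},
    {"I-Size": 1.0, "O": 1.2},                 # token score alone prefers "O"
]
transitions = {
    ("B-Size", "I-Size"): 1.0,                 # continuing an instance is likely
    ("I-Size", "I-Size"): 0.5,
    ("O", "I-Size"): -5.0,                     # an instance cannot start with I-
}
decoded = viterbi(emissions, transitions, labels)
print(list(zip(tokens, decoded)))
```

In this toy example the last token would be labelled "O" if tokens were classified independently; the transition scores are what make the decoder keep it inside the `Size` instance.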
We have tested our system on a set of 405 mammography reports written (in Italian) by medical personnel of the Radiology Institute of Policlinico Umberto I of Rome, and manually annotated by two expert radiologists of the same institution according to nine concepts of interest (eg, “Followup Therapies”). 336 reports were annotated by one radiologist only, while the other 59 were independently annotated by both radiologists. The presence of reports annotated by both radiologists has allowed us to directly compare the accuracy of our system with human accuracy, by comparing the agreement between the system’s and the radiologist’s annotations with the agreement between the annotations of the two radiologists. Our experiments, run using 10-fold cross-validation, have shown that our system obtains near-human performance: the agreement between system and human, measured by the standard “macroaveraged F1” measure, turned out to be 0.776, while the agreement between the two experts was 0.794 (higher values are better, since 0 and 1 indicate perfect disagreement and perfect agreement, respectively). These results are especially encouraging because no specialized lexical resource was used in the experiments, since no such resource exists for the radiology / mammography domain in Italian.
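As an illustration of the agreement measure, the sketch below computes a macroaveraged F1 between two annotations of the same token sequence: F1 is computed separately for each concept and the per-concept values are then averaged with equal weight. Token-level matching, the function names, the concept names and the toy label sequences are assumptions made for the example; the article does not specify the exact granularity at which the measure was computed.

```python
def f1_for_concept(reference, other, concept):
    """Token-level F1 for one concept between two label sequences."""
    pairs = list(zip(reference, other))
    tp = sum(1 for r, o in pairs if r == concept and o == concept)
    fp = sum(1 for r, o in pairs if r != concept and o == concept)
    fn = sum(1 for r, o in pairs if r == concept and o != concept)
    denominator = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the concept never occurs.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(reference, other, concepts):
    """Unweighted mean of the per-concept F1 values."""
    return sum(f1_for_concept(reference, other, c) for c in concepts) / len(concepts)


# Hypothetical annotations of the same six tokens by two annotators.
concepts = ["Size", "Followup"]
annotator_a = ["Size", "Size", "O", "Followup", "Followup", "O"]
annotator_b = ["Size", "O", "O", "Followup", "Followup", "Followup"]

agreement = macro_f1(annotator_a, annotator_b, concepts)
print(round(agreement, 3))
```

F1 is unchanged when the two annotations swap roles (false positives and false negatives simply trade places), so the same function can measure system-versus-human and human-versus-human agreement alike.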