by Mihály Héder

By studying and improving the process of digital content creation, we can develop more intelligent applications. One promising step is to acquire a machine representation of the content as early as possible: while the user is still typing and is available to answer questions.

The Internet has triggered a major revolution in everyday communication. The basic goal has not changed (to reach other people and to share our ideas), but everything else is different.

Computers, as mediators, have been incorporated into communication channels: when users need information, they now turn directly to computers and only indirectly to other users. Clearly, the better computers understand the content, the more useful and convenient they will become. However, to achieve some sort of understanding, they first need at least 1) an adequate machine representation of the content and 2) background knowledge. In this article, I will present a means of achieving the former.

In general, the life cycle of digital content can be divided into three stages: 1) the creation and formatting of the content, carried out by the user; 2) the attempt to create a machine representation of the content, carried out by indexers, information extractors and text mining software (typical components of a search engine); and finally, 3) the consumption of the content by other users (see Figure 1).

Figure 1: The average life cycle of the digital content.

We have to face the fact that the second step of this process is very hard to carry out in a fully automated manner, as it would require unassisted natural language understanding to achieve the best results. Fortunately, when the content is yet to be written, we have an alternative option.

As part of a project at SZTAKI, we are trying to merge the first two stages of the content life cycle described above. Our experimental software processes content on the fly while the user is still typing, and tries to ask the user relevant questions in everyday language. The aim of these questions is to clarify what the machine representation of the text should be (see Figure 2). While there is a multitude of semantic annotation tools, we think that the dialogue-assisted input method makes our solution rather unique. This same property is also the key to enabling lay users to create semantic annotations.
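The dialogue step can be illustrated with a minimal sketch. It is not the project's actual code: the `Candidate` structure, the `make_question` and `resolve` functions and the 0.3 ambiguity threshold are all hypothetical names and values chosen for illustration. The idea shown is only that a question is asked when the analysers cannot decide between interpretations, and that the user's answer selects the machine representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One possible machine interpretation of a phrase (hypothetical structure)."""
    phrase: str
    label: str    # e.g. "city" or "person"
    score: float  # confidence reported by some analyser, between 0 and 1


def make_question(candidates: List[Candidate]) -> Optional[str]:
    """Return an everyday-language question, or None if no question is needed."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    # Ask only when the two best interpretations are close to each other.
    # The 0.3 threshold is an assumption made for this sketch.
    if ranked[0].score - ranked[1].score > 0.3:
        return None
    options = " or a ".join(c.label for c in ranked[:2])
    return f'Is "{ranked[0].phrase}" a {options}?'


def resolve(candidates: List[Candidate], answer: str) -> Optional[Candidate]:
    """Pick the candidate named by the user's answer; None means none applies."""
    for candidate in candidates:
        if candidate.label == answer:
            return candidate
    return None


if __name__ == "__main__":
    found = [Candidate("Paris", "city", 0.55), Candidate("Paris", "person", 0.45)]
    print(make_question(found))
    print(resolve(found, "person"))
```

In this sketch an unambiguous phrase produces no question at all, which keeps the number of interruptions low while the user is typing.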

Figure 2: The proposed method of content creation.

On the client side, our system consists of a rich text editor built around TinyMCE, which is capable of visually annotating the content and presenting questions and suggestions to the user. On the server side, we have a UIMA-compatible text processing system that relies on various UIMA Analysis Engines. One of these is Magyarlanc, a tokenizer, lemmatizer and part-of-speech (POS) analyser for Hungarian. We have also integrated a language recognizer, an English POS tagger, a named entity recognizer, and Hitec3, a hierarchical document categorizer.
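The server-side arrangement, in which several analysis components enrich the same document one after another, can be sketched as follows. This is a plain Python illustration and not the UIMA API: the `Document` record, the two toy engines and `run_pipeline` are hypothetical stand-ins for real components such as a tokenizer or a named entity recognizer.

```python
import re
from typing import Callable, Dict, List

# A shared record that each stage enriches. It loosely plays the role of a
# common analysis structure; it is NOT the UIMA API.
Document = Dict[str, object]
Engine = Callable[[Document], None]


def tokenizer(doc: Document) -> None:
    """Split the text into word tokens (naive stand-in for a real tokenizer)."""
    doc["tokens"] = re.findall(r"\w+", doc["text"])


def capitalised_name_spotter(doc: Document) -> None:
    """Mark capitalised tokens as entity candidates (toy heuristic only)."""
    doc["entities"] = [t for t in doc["tokens"] if t[:1].isupper()]


def run_pipeline(text: str, engines: List[Engine]) -> Document:
    """Run every engine in order over one shared document record."""
    doc: Document = {"text": text}
    for engine in engines:
        engine(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("the user visited Budapest",
                          [tokenizer, capitalised_name_spotter])
    print(result["tokens"])
    print(result["entities"])
```

Because every engine reads and writes the same record, a new analyser can be added to the list without changing the others, which is the property that makes such a pipeline easy to extend.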

We have two applications of this system and we plan to develop more. One is a complaint letter analyser, which is being developed with the help of a corpus of 888 letters written to the Hungarian Ministry of Justice. These letters address diverse issues and are in many cases unclear and unstructured, so fully automatic processing has not been effective. We were able to improve the quality of their processing with semi-automatic matching of scripts and frames, supported by dialogues with the user. For more information, see the links at the end of this article.

The other application is the “Sztakipedia” project, which aims to develop a Wikipedia editor. The tool will have an intuitive web interface with a rich text editor that supports the user with suggestions for improving the quality of the edited article.

Some of the vital aspects of this content creation process are already working. With the help of a Hitec3 instance trained on the entire content of the Hungarian Wikipedia, the software is able to suggest categories. It can also suggest “InfoBoxes”, which are table-like descriptions of the properties of common entities such as “people” and “cities”. Furthermore, the software can offer a number of links to other Wiki articles by analysing the words in the article currently being edited. In addition, the user can ask for links related to a given phrase; in this case, our tool starts a search in the background using the Wiki API and formulates suggestions from the search results.
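The phrase-based link suggestion can be sketched as below, assuming that the “Wiki API” is the MediaWiki action API with its `list=search` query module. The endpoint URL, the function names and the decision to emit piped wiki links are assumptions made for this example, not details taken from the project. The sketch only builds the request URL and parses a response body, so it runs without network access.

```python
import json
from typing import List
from urllib.parse import urlencode

# Assumed endpoint: the Hungarian Wikipedia's MediaWiki action API.
API_ENDPOINT = "https://hu.wikipedia.org/w/api.php"


def build_search_url(phrase: str, limit: int = 5) -> str:
    """Build a full-text search request for the given phrase."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": phrase,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def suggestions_from_response(body: str, phrase: str) -> List[str]:
    """Turn a JSON search response into wiki link markup suggestions."""
    hits = json.loads(body).get("query", {}).get("search", [])
    # Each suggestion links the user's phrase to one of the found articles.
    return [f"[[{hit['title']}|{phrase}]]" for hit in hits]


if __name__ == "__main__":
    print(build_search_url("Duna"))
    # A hand-written sample in the shape of a search response.
    sample = json.dumps({"query": {"search": [{"title": "Duna"}]}})
    print(suggestions_from_response(sample, "Duna"))
```

The suggestions are returned as a list so that the editor can present them to the user, who remains free to reject all of them.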

Although these functions are very promising, we still have many problems to overcome. Probably the hardest one is that, while our tool is able to generate proper wiki markup, this format itself does not support every type of semantic annotation that we need. We have to consider other questions as well: for example, whether it is important to store meta-information about the user's answers to our questions and suggestions, or what to do with “negative” answers, such as when a given InfoBox is not applicable to a given article.

We hope that we can start public tests in 2011 with this novel tool. Further details can be found at the project website.

Complaint letter project:
Sztakipedia project:

Please contact:
Mihály Héder
SZTAKI, Hungary
Tel: +36 1 279 6027
