by Wojciech Jaworski
A vast amount of knowledge is contained in large collections of unstructured or weakly structured text documents, which started to emerge soon after the invention of writing. We are developing a methodology that allows users to search not only for information localized in specific documents but also for knowledge spread across an entire document collection.
Our recent research has focused on the Sumerian Economic Text Corpus from the Ur III period. The Sumerians lived in lower Mesopotamia (modern Iraq) from prehistoric times until the late 3rd millennium BC. Sumer was the first highly developed urban civilization, and it used cuneiform script. During the reign of the 3rd dynasty of Ur (2100 BC-2000 BC), whose power extended across present-day Iraq and western Iran, the state introduced a centrally planned economy with an extensive bureaucratic structure.
Civil servants used clay tablets to record data about agriculture and factory production, worker salaries, summaries of executed work, distribution of commodities, goods and animals, lists of sacrificed animals, travel allowances and other economic information.
Archaeologists have excavated about 100 000 tablets from this period. A corpus of over 45 000 tablets is available electronically, stored in the form of Latin transliteration (i.e. documents are represented using the Latin alphabet and each cuneiform sign is replaced by its reading). For our studies, we have selected a subcorpus of 11 891 documents concerning the distribution of domestic animals. This subcorpus consists of circa 850 000 Sumerian signs, each representing either a word or a syllable.
Figure 1 presents the contents of a typical Sumerian document. This document reports the transfer of lambs from three people to ab-ba-sa6-ga, an official of the Ur III state. The transfer took place on the 23rd day of the month sze-kin-ku5 in the year when the high priest of the goddess Inanna was elevated to office. The third line of the document is ambiguous: we cannot determine whether ga is part of a name or part of an animal description.
Figure 1: An example of a transliterated cuneiform tablet from Ur III.
Economic documents are an essential source of information about ancient Sumer. The corpus contains crucial information about the economic, social and political history of the state, as well as its political system and administrative structure. Sources of this type provide the most complete information about daily life in that period.
Owing to the large number of documents, the task of finding relevant ones is intractable for human readers. On the other hand, classical information retrieval techniques fail when confronted with the Sumerian language. Our search engine, dedicated to Sumerian economic documents, offers a solution to these problems. However, vital information is spread across a vast number of simple documents. In order to extract it, we must convert document contents into a machine-readable format.
We take advantage of the fact that the Ur III Economic Text Corpus has a restricted subject matter, which allows us to represent the structure of the information contained in the documents by means of an ontology. This ontology represents the domain knowledge: it splits the set of objects described in documents into categories (types) and it determines the relationships between these categories.
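As a rough illustration of what such an ontology might look like in code, the following minimal sketch stores categories in a parent map and relationships as typed triples. All category and relation names here (Transfer, Official, supplier and so on) are hypothetical placeholders chosen for the example, not the ontology actually used for the corpus.

```python
# Hypothetical categories; each maps to its parent category (None for the root).
CATEGORIES = {
    "Entity": None,
    "Person": "Entity",
    "Official": "Person",
    "Animal": "Entity",
    "Date": "Entity",
    "Transfer": "Entity",
}

# Hypothetical relationships between categories: (name, domain, range).
RELATIONS = [
    ("supplier", "Transfer", "Person"),
    ("receiver", "Transfer", "Official"),
    ("commodity", "Transfer", "Animal"),
    ("date", "Transfer", "Date"),
]


def is_a(category, ancestor):
    """Return True if `category` equals `ancestor` or descends from it."""
    while category is not None:
        if category == ancestor:
            return True
        category = CATEGORIES[category]
    return False


def relation_accepts(relation, subject_cat, object_cat):
    """Check whether a relation may link objects of the two given categories."""
    for name, domain, range_ in RELATIONS:
        if name == relation:
            return is_a(subject_cat, domain) and is_a(object_cat, range_)
    return False
```

With these definitions, `relation_accepts("supplier", "Transfer", "Official")` holds because an Official is a Person, while `relation_accepts("receiver", "Transfer", "Person")` does not, since the assumed receiver must be an Official.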
We process documents by means of a grammar which describes the way in which phrases are constructed from words and other simpler phrases. We assume that syntactic operations constructing compound phrases on the basis of simpler ones correspond to the presentation of complex objects by means of their components. As a result of parsing we obtain a logical formula which represents the contents of the document.
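The idea of building a formula while composing phrases can be sketched with a toy recursive-descent parser. This is only an illustration under strong assumptions: the grammar rules, the English glosses ("lamb", "from", "to") standing in for transliterated words, the placeholder supplier names and the `transfer(...)` predicate are all invented for the example and are not the grammar or formula language used in our system.

```python
# Toy lexicon: English glosses stand in for transliterated Sumerian words.
ANIMALS = {"lamb", "sheep", "goat"}


def parse_quantity(tokens):
    """Rule QUANTITY -> NUMBER ANIMAL. Returns ((count, animal), rest) or (None, tokens)."""
    if len(tokens) >= 2 and tokens[0].isdigit() and tokens[1] in ANIMALS:
        return (int(tokens[0]), tokens[1]), tokens[2:]
    return None, tokens


def parse_delivery(tokens):
    """Rule DELIVERY -> QUANTITY 'from' NAME. The compound object is built from its parts."""
    quantity, rest = parse_quantity(tokens)
    if quantity is None or len(rest) < 2 or rest[0] != "from":
        return None, tokens
    delivery = {"count": quantity[0], "animal": quantity[1], "supplier": rest[1]}
    return delivery, rest[2:]


def parse_document(text):
    """Rule DOCUMENT -> DELIVERY+ 'to' NAME. Returns a conjunction of atoms as a string."""
    tokens = text.split()
    deliveries = []
    while True:
        delivery, remaining = parse_delivery(tokens)
        if delivery is None:
            break
        deliveries.append(delivery)
        tokens = remaining
    if not deliveries or len(tokens) != 2 or tokens[0] != "to":
        raise ValueError("document does not match the toy grammar")
    receiver = tokens[1]
    atoms = [
        f"transfer({d['supplier']}, {receiver}, {d['animal']}, {d['count']})"
        for d in deliveries
    ]
    return " & ".join(atoms)


formula = parse_document("2 lamb from person-1 1 lamb from person-2 to ab-ba-sa6-ga")
# formula == "transfer(person-1, ab-ba-sa6-ga, lamb, 2) & transfer(person-2, ab-ba-sa6-ga, lamb, 1)"
```

Each grammar rule returns the object described by its phrase, so the formula for the whole document is assembled from the objects of its sub-phrases, mirroring the correspondence between syntactic composition and object composition described above.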
Figure 2: Animal flows between the officials in the Ur III kingdom.
The content of all documents constitutes a vast knowledge base of the Sumerian economy. We use it to determine relationships between Sumerian officials in terms of the number of animals that were transferred between them. We represent this information as an animal flow graph. Vertices of the graph represent officials, and the width of each edge is proportional to the number of animals transferred between the two individuals. Figure 2 presents a fragment of the animal flow graph, restricted to edges labelled with animal quantities greater than 900. The complete graph, which covers all extracted transactions, has 2754 vertices and 5275 edges.
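The construction of such a graph can be sketched as a simple aggregation over extracted transactions. The sketch below assumes each transaction is a (supplier, receiver, count) triple and that edges are directed; the record layout, the placeholder official names and the counts are invented for illustration and do not come from the corpus.

```python
from collections import defaultdict

# Hypothetical extracted transactions: (supplier, receiver, number of animals).
transactions = [
    ("official-A", "official-B", 600),
    ("official-A", "official-B", 450),
    ("official-B", "official-C", 120),
    ("official-C", "official-A", 950),
]


def build_flow_graph(records):
    """Sum the animals transferred along each directed (supplier, receiver) edge."""
    flow = defaultdict(int)
    for supplier, receiver, count in records:
        flow[(supplier, receiver)] += count
    return dict(flow)


def heavy_edges(flow, threshold):
    """Keep only the edges whose total number of animals exceeds the threshold."""
    return {edge: total for edge, total in flow.items() if total > threshold}


flow_graph = build_flow_graph(transactions)
selected = heavy_edges(flow_graph, 900)  # same cut-off as the fragment in Figure 2
vertices = {name for edge in flow_graph for name in edge}
```

In this toy data, the two transfers from official-A to official-B are merged into one edge of weight 1050, and the threshold keeps that edge and the 950-animal edge while dropping the 120-animal one.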
Among other possible applications of our knowledge base we explored:
- observation of seasonal economic fluctuations as well as macro-economic changes that happened during the Ur III period;
- detection of documents that describe the same object or event;
- reconstruction of broken documents and determining contents of missing ones.
We are also developing a query language that allows us to retrieve information according to semantic patterns.
In the future, we plan to pursue the ultimate goal of creating a model of the Sumerian economy.
Institute of Informatics, University of Warsaw, Poland