Static Analysis of XML Programs

by Pierre Genevès and Nabil Layaïda

Static analysers for programs that manipulate Extensible Markup Language (XML) data have been successfully designed and implemented based on a new tree logic by the WAM (Web, Adaptation and Multimedia) research team, a joint lab of INRIA and Laboratoire d'Informatique de Grenoble. This is capable of handling XML Path Language (XPath) and XML types such as Document Type Definitions (DTDs) and XML Schemas.

Since its introduction a decade ago, Extensible Markup Language (XML) has gained considerable interest from industry and now plays a central role in modern information system infrastructures. In particular, XML is the key technology for describing and exchanging a wide variety of data on the Web. The essence of XML consists in organising information in tree-tagged structures conforming to some constraints, which are expressed using standard type languages such as DTDs, XML schemas and Relax NG. XML processing can be seen as transforming these structures using tree-oriented query languages such as XPath expressions and XQuery within full-blown transformation languages such as XSLT.

With the ever-increasing information flow in the current Web infrastructure, XML programming is becoming a key factor in realizing the trend in Web services aimed at enhancing machine-to-machine communication. There still exist important obstacles along this path: performance and reliability. Programmers are given two options, Domain-Specific Languages such as XSLT, or general-purpose languages augmented with XML application programming interfaces such as the Document Object Model (DOM). Neither of these alternatives is a satisfactory answer to performance and reliability, nor is there even a trade-off between the two. As a consequence, new paradigms are being proposed and all have the aim of incorporating XML data as first-class constructs in programming languages. The hope is to build a new generation of tools that are capable of taking reliability and performance into account at compile time.

One of the biggest challenges in this line of research is to develop automated and tractable techniques for ensuring static-type safety and optimization of programs. To this end, there is a need to solve some basic reasoning tasks that involve very complex constructions such as XML types (regular tree types) and powerful navigational primitives (XPath queries). In particular, every future compiler of XML programs will have to routinely solve problems such as:

XPath query emptiness in the presence of a schema: if one can decide at compile time that a query is not satisfiable then subsequent bound computations can be avoided
query equivalence, which is important for query reformulation and optimization
path type-checking, for ensuring at compile time that invalid documents can never arise as the output of XML processing code.

All of these problems are known to be computationally heavy (when decidable), and the related algorithms are often tricky.

In the WAM research team we have developed an XML/XPath static analyser based on a new logic of finite trees. This analyser consists in compilers that allow XML types and XPath queries to be translated into this logic, and an optimized logical solver for testing satisfiability of a formula of this logic.

The benefit of these compilers is that they allow one to reduce all the problems listed above, and many others, to logical satisfiability. This approach has a couple of important practical advantages. First of all, one can use the satisfiability algorithm to solve all of these problems. More importantly, one could easily explore new variants of these problems, generated for example by the presence of different kinds of type or schema information, with no need to devise a new algorithm for each variant.

The system has been implemented in Java and uses symbolic techniques (binary decision diagrams) in order to enhance its performance. It is capable of comparing path expressions in the presence of real-world DTDs (such as the W3C SMIL and XHTML language recommendations). The cost ranges from several milliseconds for comparison of XPath queries without tree types, to several seconds for queries under very large, heavily recursive, type constraints, such as the XHTML DTD. These preliminary measurements shed light for the first time on the cost of solving static analysis problems in practice. Furthermore, the analyser generates XML counter-examples that allow program defects to be reproduced independently from the analyser.

The development of these analysers was initiated by the Web Adaptation and Multimedia research team at INRIA, Grenoble - Rhône-Alpes Research Centre, France. The project commenced in October 2005 and the full system will soon be released publicly.

Links:
Project home page: http://wam.inrialpes.fr/xml
WAM team: http://wam.inrialpes.fr

Please contact:
Pierre Genevès, WAM Team, CNRS, France
Tel: +33 4 76 61 53 84
E-mail: Pierre.Genevesinria.fr