Monday, July 19, 2004

[emerging tech] "Managing XML Data" (Web Engineering: The Evolution of New Technologies)

Dateline: China
 
Excerpts from the current issue of Computing in Science & Engineering, special issue on "Web Engineering: The Evolution of New Technologies."  To order this article, click on http://tinyurl.com/6v3cc .  (Note: Formatting has been changed from the original article; however, ordering is consistent.)
 
XML's flexibility makes it a natural format for both exchanging and integrating data from diverse data sources.  In this survey, the authors give an overview of issues in managing XML data, discuss existing solutions, and outline the current technology's open problems and limitations.

A diverse set of factors has fueled the explosion of interest in XML ( http://www.w3.org/TR/REC-xml ): XML's self-describing nature makes it well suited for use in loosely coupled data-exchange systems, and the flexible semistructured data model behind it makes it a natural format for integrating data from various sources.

But much of its success stems from the existence of standard languages for each aspect of XML processing and the rapid emergence of tools for manipulating XML.  Popular tools include parsers such as Xerces ( http://xml.apache.org/xerces-j ), query processors such as Galax ( http://db.bell-labs.com/galax ), and transformation tools such as Xalan ( http://xml.apache.org/xalan-j ).  The development of this standards framework has made XML dialects powerful vehicles for standardization in communities that exchange data.

In this article, we discuss the main problems involved in managing XML data.  Our objective is to clarify potential issues that must be considered when building XML-based applications---in particular, XML solutions' benefits as well as possible pitfalls.  Our intent is not to give an exhaustive review of XML data-management (XDM) literature, XML standards, or a detailed study of commercial products.  Instead, we aim to provide an overview of a representative subset to illustrate how some XDM problems are addressed. 

Because data typically is stored in non-XML database systems, applications must publish data in XML for exchange purposes.  When a target application receives XML data, it can remap and store it in internal data structures or a target database system.  Applications can also access an XML document either through APIs such as the Document Object Model (DOM; http://www.w3.org/DOM ) or query languages.  The applications can directly access the document in native format or, with conversion, from a network stream or non-XML database format.
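As a rough illustration of API-based access, the sketch below walks a small document through the DOM interface in Python's standard library (xml.dom.minidom). The document, its element names, and the helper order_totals are hypothetical and not taken from the article.

```python
from xml.dom.minidom import parseString

# Hypothetical document; element and attribute names are illustrative only.
XML_TEXT = """<orders>
  <order id="1"><customer>Ada</customer><total>12.50</total></order>
  <order id="2"><customer>Lin</customer><total>99.00</total></order>
</orders>"""

def order_totals(xml_text):
    """Return {order id: total} by walking the DOM tree."""
    doc = parseString(xml_text)
    totals = {}
    for order in doc.getElementsByTagName("order"):
        # Each <order> is assumed to contain exactly one <total> child.
        total_node = order.getElementsByTagName("total")[0]
        totals[order.getAttribute("id")] = float(total_node.firstChild.data)
    return totals

print(order_totals(XML_TEXT))
```

The DOM approach loads the whole document into memory before any node can be visited, which is convenient for small documents but is one reason streaming interfaces and query languages exist alongside it.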

In contrast with relational database management systems (RDBMSs), which had a clear initial motivation in supporting online transaction processing (OLTP) scenarios, XML applications' requirements vary widely.  Applications must deal with several different kinds of queries (structured and keyword-based) in different scenarios (with or without transaction support, over stored or streaming data), as well as data with varying characteristics (ordered and unordered, with or without a schema).

Commercial database vendors have also shown significant interest in XDM---support for XML data is present in most RDBMSs.  Examples include IBM's DB2 XML Extender ( http://www4.ibm.com/software/data/db2/extenders/xmlext.html ), Microsoft's support for XML ( http://msdn.microsoft.com/sqlxml/ ), and Oracle's XML DB ( http://otn.oracle.com/tech/xml/xmldb/ ).

In XML, common querying tasks include filtering and selecting values, merging and integrating values from multiple documents, and transforming XML documents.  While XML has enabled the creation of standard data formats within industries and communities, adoption of these standards has led to an enormous and immediate problem of exporting data available in legacy formats to meet newly created standard schemata.  Several publishing languages have been proposed to specify XML views over the legacy data---that is, how to map legacy data (such as tables) into a predefined XML format.
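To make the publishing problem concrete, here is a minimal sketch that maps rows of a relational table into a fixed XML layout and then runs a simple filter-and-select query over the result. It uses only Python's sqlite3 and xml.etree.ElementTree modules; the customer table, the target element names, and both helper functions are assumptions made for illustration and do not represent any of the publishing languages the article surveys.

```python
import sqlite3
import xml.etree.ElementTree as ET

def publish_customers(conn):
    """Map rows of a hypothetical 'customer' table to a <customers> document."""
    root = ET.Element("customers")
    rows = conn.execute("SELECT id, name, city FROM customer ORDER BY id")
    for cid, name, city in rows:
        cust = ET.SubElement(root, "customer", id=str(cid))
        ET.SubElement(cust, "name").text = name
        ET.SubElement(cust, "city").text = city
    return ET.tostring(root, encoding="unicode")

def names_in_city(xml_text, city):
    """Filter customers by city and select their names (ElementTree XPath subset)."""
    root = ET.fromstring(xml_text)
    return [n.text for n in root.findall(f"./customer[city='{city}']/name")]

# Build a small in-memory legacy table to publish.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Ada", "Qingdao"), (2, "Lin", "Menlo Park")])

xml_view = publish_customers(conn)
print(xml_view)
print(names_in_city(xml_view, "Qingdao"))
```

Here the mapping is hard-coded in a function. A declarative publishing language would instead state the same table-to-element correspondence as a view definition, leaving the system free to choose how to evaluate it.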

In this section, we discuss limitations of existing solutions as well as some open problems.  Our discussion is biased toward problems we have encountered in trying to create effective and scalable XDM solutions; it is by no means exhaustive.

Parsing and validating a document against an XML Schema or DTD are CPU-intensive tasks that can be a major bottleneck in XML management.  A recent study of XML parsing and validation performance indicates that acceptable response times and transaction rates over XML data cannot be achieved without significant improvements in XML parsing technology.  It suggests enhancements such as using parallel processing techniques and preparsed binary XML formats as well as better support for incremental parsing and validation.
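One way to picture incremental parsing is an event-driven pass that handles each element as soon as it is complete and then discards it. The sketch below does this with xml.etree.ElementTree.iterparse from Python's standard library. The document shape and the helper sum_totals are hypothetical, and the example covers parsing only (the standard library does not validate against a DTD or XML Schema).

```python
import io
import xml.etree.ElementTree as ET

def sum_totals(stream):
    """Add up <total> values one <order> at a time, without keeping the whole tree."""
    total = 0.0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "order":
            total += float(elem.findtext("total"))
            elem.clear()  # drop the children of the order just processed
    return total

# A file-like object stands in for a large file or network stream.
data = io.BytesIO(
    b"<orders>"
    b"<order><total>12.5</total></order>"
    b"<order><total>99</total></order>"
    b"</orders>"
)
print(sum_totals(data))  # 111.5
```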

By using XML-specific compression techniques, tools such as XMill compare favorably against several generic compressors.  Compression techniques have also been proposed that support direct querying over the compressed data, which, besides saving space, also improve query processing times.
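One idea often associated with XML-specific compressors is to separate a document's structure from its text and to group similar values together before handing them to a general-purpose compressor. The sketch below is a toy version of that idea, not XMill's actual format or algorithm. It ignores attributes and mixed content, and the function names and stream layout are assumptions made for illustration.

```python
import zlib
import xml.etree.ElementTree as ET

def split_containers(xml_text):
    """Separate a document into a structure stream and per-tag text containers."""
    structure = []   # tag names, with "/" marking the end of an element
    containers = {}  # tag -> list of text values found under that tag

    def walk(elem):
        structure.append(elem.tag)
        text = (elem.text or "").strip()
        if text:
            containers.setdefault(elem.tag, []).append(text)
        for child in elem:
            walk(child)
        structure.append("/")

    walk(ET.fromstring(xml_text))
    return structure, containers

def pack(structure, containers):
    """Compress the structure stream followed by each container, sorted by tag."""
    streams = [" ".join(structure)]
    streams += ["\n".join(containers[tag]) for tag in sorted(containers)]
    return zlib.compress("\x00".join(streams).encode("utf-8"))

doc = ("<orders>"
       "<order><customer>Ada</customer><total>12.50</total></order>"
       "<order><customer>Lin</customer><total>99.00</total></order>"
       "</orders>")
structure, containers = split_containers(doc)
print(containers)
print(len(pack(structure, containers)), "bytes after packing")
```

Grouping values by tag places similar strings (all totals, all customer names) next to each other, which tends to help a dictionary-based compressor on large documents. On a document as small as this one, no size advantage should be expected.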

The ability to support updates is becoming increasingly important as XML evolves into a universal data representation format.  Although proposals for defining and implementing updates have emerged, a standard has yet to be defined for an update language.

Three figures & sample code; 23 references.

To request a copy of this article click on: http://tinyurl.com/6kcqw .

 
Cheers,
 
David Scott Lewis
President & Principal Analyst
IT E-Strategies, Inc.
Menlo Park, CA & Qingdao, China
 
http://www.itestrategies.com (current blog postings optimized for MSIE6.x)
 
 