- The structure of an XML document is carried by the document itself, in its label structure.
- As already mentioned, one speaks of self-describing data.
- This is an essential difference with standard databases:
- In a database, one defines the type of data (e.g., a relational schema) before creating instances of this type (e.g., a relational database).
- In semi-structured data (and XML), data may exist with or without a type.
- The “may” (in “it may exist”) is essential.
- Types are not forbidden; they are just not compulsory, and we will spend quite some effort on XML typing.
But in many cases, XML data present the following characteristics:
- the data are irregular: there may be variations of structure to represent the same information (e.g., a date or an address) or unit (prices in dollars or euros); this is typically the case when the data come from many sources;
- parts of the data may be missing, for example because some sources are not answering, or some unexpected extra data (e.g., annotations) may be found;
- the structure of the data is not known a priori, or some work such as parsing has to be performed to discover it (e.g., because the data come from a newly discovered source);
- part of the data may simply be untyped (e.g., plain text).
- Another difference with database typing is that the type of some data may be quite complex.
- In some extreme cases, the size of the type specification may be comparable to, or even greater than, the size of the data itself.
- It may also evolve very rapidly.
- These are some of the reasons why the relational or object database models, which impose too rigid a typing, were not chosen as standards for data exchange on the Web, and a semi-structured data model was chosen instead.
Motivating XML Typing
Perhaps the main difference with typing in relational systems is that typing is not compulsory for XML.
- It is perfectly fine to have an XML document with no prescribed type.
- However, when developing and using software, types are essential for interoperability, consistency, and efficiency.
- Schemas serve to document the interface of software components, and therefore provide a key ingredient for interoperability between programs:
- a program that consumes an XML document of a given type can assume that the program that has generated it has produced a document of that type.
- Similarly to dependencies for the relational model (primary keys, foreign key constraints, etc.), typing an XML document is also useful to protect data against improper updates.
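As a minimal sketch of what checking a document against a type involves, the Python fragment below validates a tree against a toy type description. The `TOY_TYPE` dictionary, its `"1"`/`"*"` occurrence markers, and the `conforms` function are hypothetical names introduced for illustration; they are not DTD or XML Schema syntax, and a real system would rely on a proper schema validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy type, not DTD or XML Schema syntax:
# tag -> ordered list of (child tag, occurrence), occurrence is "1" or "*".
TOY_TYPE = {
    "companies": [("company", "*")],
    "company": [("id", "1"), ("name", "1"), ("address", "1"), ("ceo", "1")],
    "id": [], "name": [], "address": [], "ceo": [],
}

def conforms(elem, types=TOY_TYPE):
    """Return True if elem and all its descendants match the toy type."""
    if elem.tag not in types:
        return False
    children = list(elem)
    pos = 0
    for tag, occurrence in types[elem.tag]:
        if occurrence == "1":
            if pos >= len(children) or children[pos].tag != tag:
                return False
            pos += 1
        else:
            # "*": greedily skip zero or more consecutive children with this tag
            while pos < len(children) and children[pos].tag == tag:
                pos += 1
    if pos != len(children):
        return False  # unexpected extra children
    return all(conforms(child, types) for child in children)

good = ET.fromstring(
    "<companies><company><id>1</id><name>Acme</name>"
    "<address>1 Main St</address><ceo>Ann Lee</ceo></company></companies>"
)
# An "improper update" that dropped the mandatory <ceo> element.
bad = ET.fromstring(
    "<companies><company><id>1</id><name>Acme</name>"
    "<address>1 Main St</address></company></companies>"
)

print(conforms(good), conforms(bad))  # True False
```

A consumer could run such a check once on input and then rely on the structure without defensive tests. Similarly, a store could run it before accepting an update.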
- Suppose that some XML document is very regular, say, it contains a list of companies, with, for each, an ID, a name, an address and the name of its CEO.
- This same information may be stored very compactly, for instance, without repeating the names of elements such as the address for each company.
- Thus, a priori knowledge on the type of the data may help improve its storage.
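The storage argument can be illustrated with a small Python sketch. It assumes a hypothetical sample document (`XML_DOC`) and a known field list (`FIELDS`) standing in for the type. The element names are written once as a header row, and each company then contributes only its values. CSV is used here purely as an example of a compact layout, not as a recommended storage format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: a very regular list of companies.
XML_DOC = """<companies>
  <company><id>1</id><name>Acme</name><address>1 Main St</address><ceo>Ann Lee</ceo></company>
  <company><id>2</id><name>Globex</name><address>2 Oak Ave</address><ceo>Bob Roy</ceo></company>
  <company><id>3</id><name>Initech</name><address>3 Elm Rd</address><ceo>Cy Poe</ceo></company>
</companies>"""

# The a priori type: every company has exactly these children, in this order.
FIELDS = ["id", "name", "address", "ceo"]

def to_compact(xml_text):
    """Store the element names once (header row), then only the values."""
    root = ET.fromstring(xml_text)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    for company in root.findall("company"):
        writer.writerow([company.findtext(f) for f in FIELDS])
    return buf.getvalue()

def from_compact(text):
    """Rebuild the XML tree from the compact form, using the known type."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    root = ET.Element("companies")
    for record in records:
        company = ET.SubElement(root, "company")
        for field, value in zip(header, record):
            ET.SubElement(company, field).text = value
    return root

compact = to_compact(XML_DOC)
restored = from_compact(compact)
print(len(XML_DOC), "characters as XML,", len(compact), "in compact form")
```

The round trip through `from_compact` is only possible because the type is known: without it, the compact rows would not say which element each value belongs to.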
Query Efficiency. Consider the following XQuery query:
- for $b in doc("bib.xml")/bib//* where $b/*/zip = '12345' return $b/title
- Knowing that the document consists of a list of books, and knowing the exact type of book elements, one may be able to rewrite the query as:
- for $b in doc("bib.xml")/bib/book where $b/address/zip = '12345' return $b/title
- which is typically much cheaper to evaluate.
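To make the cost difference concrete, here is a rough Python emulation of the two queries using the standard `xml.etree.ElementTree` module. The sample document `BIB` and the function names are assumptions made for illustration. The emulation only mirrors the navigation steps of the XQuery expressions and is not an XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for bib.xml.
BIB = """<bib>
  <book>
    <title>Data on the Web</title>
    <address><city>Paris</city><zip>12345</zip></address>
  </book>
  <book>
    <title>Foundations of Databases</title>
    <address><city>Rome</city><zip>99999</zip></address>
  </book>
</bib>"""

root = ET.fromstring(BIB)  # root is the <bib> element

def untyped_query(bib):
    """Mimics /bib//* : every descendant element is a candidate for $b."""
    candidates = [e for e in bib.iter() if e is not bib]
    hits = [b for b in candidates
            if any(z.text == "12345" for z in b.findall("*/zip"))]
    return candidates, [t.text for b in hits for t in b.findall("title")]

def typed_query(bib):
    """Mimics /bib/book : the type says only book children can match."""
    candidates = bib.findall("book")
    hits = [b for b in candidates
            if any(z.text == "12345" for z in b.findall("address/zip"))]
    return candidates, [t.text for b in hits for t in b.findall("title")]

slow_candidates, slow_titles = untyped_query(root)
fast_candidates, fast_titles = typed_query(root)
print(len(slow_candidates), len(fast_candidates), slow_titles == fast_titles)
```

On this sample both versions return the same title, but the untyped version tests 10 candidate elements where the typed one tests 2. The gap grows with the depth and size of the document.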
- Note that in the absence of a schema, similar processing is possible by first computing from the document itself a data guide, i.e., a structural summary of all paths from the root in the document.
- There are also other more involved schema inference techniques that allow attaching such an a posteriori schema to a schema-less document.
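A simplified data guide can be sketched in a few lines of Python. Here it is reduced to the set of distinct label paths from the root, which is an assumption made for brevity: actual data guides are usually kept as a tree or graph summary. The `SAMPLE` document and the `data_guide` function are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with no schema attached.
SAMPLE = """<bib>
  <book><title>A</title><address><zip>12345</zip></address></book>
  <book><title>B</title><address><zip>99999</zip></address></book>
</bib>"""

def data_guide(root):
    """Return the set of distinct label paths from the root element."""
    paths = set()

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)  # repeated structures collapse into one path
        for child in elem:
            visit(child, path)

    visit(root, "")
    return paths

guide = data_guide(ET.fromstring(SAMPLE))
# Every place where a zip element can occur, according to the summary.
zip_paths = sorted(p for p in guide if p.endswith("/zip"))
print(sorted(guide))
print(zip_paths)
```

Because the summary shows that `zip` occurs only under `/bib/book/address`, an optimizer could derive the same rewriting as with a declared type, at the price of one pass over the document.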