Popular Posts

Thursday, February 16, 2006

DTDs and XML Schemas

Describing your Data: DTDs and XML Schemas

If you've been developing with XML for even a short period of time, you are likely to have reached the point of wanting to describe your XML data structures. Document Type Definitions (DTDs) and XML Schemas are key technologies in this area.

Although neither are strictly required for XML development, both DTDs and XML Schemas are important parts of the XML toolbox. DTDs have been around for over twenty years as a part of SGML, while XML Schemas are relative newcomers. Though they use very different syntax and take different approaches to the task of describing document structures, both mechanisms definitely occupy the same turf. The W3C seems to be grooming XML Schemas as a replacement for DTDs, but it isn't yet clear that how quickly the transition will be made. DTDs are here-and-now, while XML Schemas, in large part, are for the future.

What DTDs and XML Schemas Do

Document Type Definitions and XML Schemas both provide descriptions of document structures. The emphasis is on making those descriptions readable to automated processors such as parsers, editors, and other XML-based tools. They can also carry information for human consumption, describing what different elements should contain, how they should be used, and what interactions may take place between parts of a document. Although they use very different syntax to achieve this task, they both create documentation.

Perhaps the most important thing DTDs and XML Schemas do is set expectations, using a formal vocabulary and other information to lay ground rules for document structures. Two parsers, given a document and a DTD, should have the same opinions about whether that document is valid, and different schema processors should similarly agree on whether or not a document conforms to the rules in a given schema. XML editing applications can use DTDs and schemas as frameworks, letting users create documents that meet these expectations. Similarly, developers can use DTDs and XML Schemas as a foundation on which to plan transformations from one format to another. By agreeing to a given DTD or schema, a group of developers has accepted a set of rules about document vocabulary and structure. While this doesn't solve all the problems of application development, it does at least mean that independent development of tools that process these documents is a lot easier.

Schemas and DTDs provide a number of additional functions that make contributions to document content:

  • Providing defaults for attributes: in addition to providing constraints on attribute content, DTDs and XML Schemas allow developers to specify default values that should be used if no value was set in the content explicitly.
  • Entity declaration: DTDs and XML Schemas provide for the declaration of parsed entities, which can be referenced from within documents to include content.

Schemas and DTDs may also describe "notations" and "unparsed entities", adding information to documents that applications may use to interpret their content.

Where DTDs and Schemas Come From

The main thrust of development work, initially for XML 1.0 and its DTDs, and now for XML Schemas, is taking place at the World Wide Web Consortium (W3C). However, the W3C is not the only source for schema languages. At least five other schema proposals have been developed and many of them are in actual use -- notably, Microsoft's XML-Data, which is used for its BizTalk initiative. Most of these proposals are feeding into the main W3C-sanctioned development process. The main contenders in the schema arena, including DTDs, are listed below:

  • DTDs - Document Type Definitions were originally developed for XML's predecessor, SGML. They use a very compact syntax and provide document-oriented data typing. XML DTDs are a subset of those available in SGML, and the rules for using XML DTDs provide much of the complexity of XML 1.0. Complete XML DTD support is (or should be) built into all validating XML parsers, and some XML DTD support is built into all XML parsers.

  • XML-Data/XML-Data Reduced - Based on a proposal that Microsoft and others submitted to the W3C even before XML 1.0 was completed, this schema proposal is used in Microsoft's BizTalk framework. XML-Data provides a large set of data types more appropriate to database and program interchange. XML-Data support is built into Microsoft's XML parser.

  • Document Content Description (DCD) - Created in a joint effort between IBM and Microsoft, DCD uses some ideas from XML-Data and some syntax from another W3C project, Resource Description Framework (RDF).

  • Schema for Object-Oriented XML (SOX) - SOX was developed by Veo Systems (now acquired by CommerceOne) and provides functionality like inheritance to XML structures. SOX has gone through multiple versions. The latest is SOX version 2.

  • Document Description Markup Language (DDML) - DDML was developed on the XML-dev mailing list, creating a schema language with a subset of DTD functionality. Development of DDML (which was once known as XSchema) has halted since the W3C Activity began.

Although you can start work with any of the above tools today -- DTDs being widely supported -- when the specification is complete, using the W3C XML Schemas is probably the safest long-term solution. Fortunately, converting among different schema formats isn't especially difficult, and tools are available to help you in the process.

How Schemas Differ from DTDs

The first, and probably most significant, difference between XML Schemas and XML DTDs is that XML Schemas use XML document syntax. While transforming the syntax to XML doesn't automatically improve the quality of the description, it does make those descriptions far more extensible than they were in the original DTD syntax. Declarations can have richer and more complex internal structures than declarations in DTDs, and schema designers can take advantage of XML's containment hierarchies to add extra information where appropriate -- even sophisticated information like documentation. There are a few other benefits from this approach. XML Schemas can be stored along with other XML documents in XML-oriented data stores, referenced, and even styled, using tools like XLink, XPointer, and XSL.

The largest addition XML Schemas provide to the functionality of the descriptions is a vastly improved data typing system. XML Schemas provide data-oriented data types in addition to the more document-oriented data types XML 1.0 DTDs support, making XML more suitable for data interchange applications. Built-in datatypes include strings, booleans, and time values, and the XML Schemas draft provides a mechanism for generating additional data types. Using that system, the draft provides support for all of the XML 1.0 data types (NMTOKENS, IDREFS, etc.) as well as data-specific types like decimal, integer, date, and time. Using XML Schemas, developers can build their own libraries of easily interchanged data types and use them inside schemas or across multiple schemas.

The current draft of XML Schemas also uses a very different style for declaring elements and attributes to DTDs. In addition to declaring elements and attributes individually, developers can create models -- archetypes -- that can be applied to multiple elements and refined if necessary. This provides a lot of the functionality SOX had developed to support object-oriented concepts like inheritance. Archetype development and refinement will probably become the mark of the high-end schema developer, much as the effective use of parameter entities was the mark of the high-end DTD developer. Archetypes should be easier to model and use consistently, however.

XML Schemas also support namespaces, a key feature of the W3C's vision for the future of XML. While it probably wouldn't be impossible to integrate DTDs and namespaces, the W3C has decided to move on, supporting namespaces in its newer developments and not retrofitting XML 1.0. In many cases, provided that namespace-prefixes don't change or simply aren't used, DTD's can work just fine with namespaces, and should be able to interoperate with namespaces and schema processing that relies on namespaces. There will be a few cases, however, where namespaces may force developers to use the newer schemas rather than the older DTDs.

Alternative Approaches

As exciting as XML Schemas are, there have been a few suggestions for very different approaches that also hold promise. Both Rick Jelliffe's Schematron and the Document Structure Description (DSD), from AT&T Labs and the University of Aarhus, look at documents from a more complex perspective than containment, and use tools derived from style languages -- Schematron is based on XSL, while DSD works from CSS -- to examine documents more closely.

Schematron allows developers to ask about the existence and contents of paths through documents rather than specify containment structures, and places great importance on producing human-readable results. Schematron processing, which can use XSL tools, can produce complete reports on the content and structure of documents, rather than a simple yes/no validation with error reporting.

DSD comes from somewhat similar origins, but uses its own vocabulary to create document descriptions rather than building on the XSL processing model. DSD schemas look much more like the W3C's XML Schemas, but support a different set of tests and have a much greater focus on tasks like providing default content for attributes and elements. DSD allows for context-sensitive rules, where the required usage of a given element may change depending on how it is used in a document. Attributes which are optional in one context may be required in another context. Declarations may impose order on some elements but not on others, making it possible to create 'floating' elements. An open-source implementation in C is available, which adds error information to the document as it is processed, giving applications or users a chance to react to the errors.

It isn't clear at this point whether these approaches will be integrated with XML Schemas at some level, or if they'll be useful tools for supplementing or replacing XML Schemas on particular kinds of projects. In any case, both of these projects are worth further investigation.

Planning Around DTDs and Schemas

Transitioning from one technology to another is often difficult, but at least the transition from DTDs to schemas only involves descriptions of documents, requiring only minor changes to the documents themselves. It is uncertain if it's time yet to begin the transition, as the latest public draft of XML Schemas came with a warning on the XML-dev mailing list that there may be significant changes in future drafts. XML Schemas are still far from stable, so probably only the most enthusiastic early adopters should be considering them at this point.

Although XML Schemas may not yet be ready, XML-based projects should be prepared for their eventual arrival. There are several strategies for handling this transition that may be appropriate to different kinds of projects and different developer needs.

  • Develop DTDs with an eye toward future conversion to schemas. Automated tools for converting among schema formats, like Extensibility's XML Authority, are already available and are likely to grow to include the final W3C XML Schemas.

  • Use other schema formats, like XML-Data and SOX. This lets developers take advantage of features like data typing immediately, and conversions from these experimental schema formats to the new XML Schemas shouldn't be prohibitively difficult.

  • Create well-formed documents for now, ignoring DTDs and schemas in their current incarnation. It's not always easy to retrofit a schema onto a set of documents, but it may be appropriate for some cases where the format of existing data sources (like databases) ensures that there's won't be wild variations in structure. When schemas arrive, you can add them to your processing.

  • Ignore DTDs and schemas completely, and only work with well-formed documents. If you don't need structure checking, this may be a perfectly appropriate strategy.

  • Plan to stick to DTDs. They're here now, they'll be here later. If your XML has to be processed by SGML tools, this may be the best route. Keeping your DTDs around, even if you supplement them with equivalent XML Schemas, will preserve interoperability.

There is no single answer for handling this transition that applies to all XML projects. If all your XML work involves documents, DTDs may be a perfectly adequate tool for your needs, and schemas might only be a distraction. If you're trying to manage data interchange between databases of different kinds, the data typing functionality that schemas provide may drive you to use XML-Data or SOX today, and XML Schemas when they arrive.

The Future for DTDs and Schemas

Right now there are too many options for describing your data, but in the future, they will probably slim down to: DTDs, for legacy XML 1.0 applications and integration with SGML; XML Schemas, and plain old well-formed documents for situations where describing document structures is unnecessary or counterproductive. Whatever you do with DTDs and XML Schemas, remember that their usage should be considered a part of document format specification and documentation. Where documentation is important, these tools will be important, both to set expectations and spare applications the task of checking document structures themselves.

The DSD and Schematron approaches will probably receive more attention in future development as well; Schematron is already an easy and useful supplement to both DTD and XML Schema processing. Both of these tools provide functionality that goes beyond anything the W3C has currently released, demonstrating that there are multiple useful approaches to describing document structures. While it seems unlikely that developers will want to create a DTD, an XML Schema, a Schematron schema, and a DSD, all for the same document, they are all important new tools in the XML developer's toolkit.

No comments: