Extensible Markup Language (XML) is a markup language that allows users to structure, describe, and interchange data on the Internet in a clearly defined way. Markup gives meaning to data.

Why XML?

XML has revolutionized the web and numerous software applications. Why is XML so important, and how could a simple markup language have such an impact? There are several reasons, and the most important is that it eliminates many of the problems associated with exchanging information.

Almost every programmer has to deal with reading and writing files. This involves creating a new file format or learning the syntax of an existing one. Initially it seems easy, but the devil is in the details. To ensure smooth functionality, input must be meticulously checked to make sure no undesirable characters are inserted. Different programs claiming to use a certain format may not work exactly the same way. Generally the differences are subtle and lead to bugs which are difficult to find. Such file formats are not robust: minor changes in format can wreak havoc, and there are compatibility problems between different systems.

XML overcomes these problems by enforcing the concept of well-formedness, requiring proper nesting, and allowing validity checks. More importantly, XML is extensible, robust, and easy to use.

XML Syntax

An XML document is a document which completely satisfies the XML 1.0 or XML 1.1 specification. These specifications define a grammar in Extended Backus-Naur Form (EBNF) and a set of well-formedness constraints which a document must satisfy to be accepted as an XML document.

Markup

Markup gives meaning to data. In XML and HTML, tags (text enclosed between < and >) provide a way to mark up data. There are three kinds of markup:

  • Structural Markup: modifies the structure of the displayed content (e.g. <p>, <br>)
  • Stylistic Markup: modifies the appearance of the content (e.g. <i>, <u>, and <font>)
  • Semantic Markup: can be used to understand the content of the document (e.g. <name>, <gender>)

One of the most important reasons behind the widespread use of XML is the user’s ability to define their own semantic markup:

<student>
   <name>Alice</name>
   <gender>Female</gender>
   <major>Psychology</major>
</student>

Just by reading the example above, you can understand that the data concerns a female student named Alice who is majoring in Psychology.
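Because the tag names describe the data, a program can pick the fields out by name as well. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative only, and the record is the student example above written with matching start and end tags.

```python
import xml.etree.ElementTree as ET

# The student record from the text, with every start tag matched by its end tag.
STUDENT_XML = """
<student>
   <name>Alice</name>
   <gender>Female</gender>
   <major>Psychology</major>
</student>
"""

root = ET.fromstring(STUDENT_XML)

# Each child element's tag names the piece of data it holds.
record = {child.tag: child.text for child in root}
print(root.tag, record)
```

The semantic tag names become the dictionary keys, so no separate description of the format is needed to interpret the values.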

Root, Elements, Comments

An XML document is made up of tags and text content called character data (PCDATA). Every XML document is organized into a tree structure. The top-most node of the tree is called the root, and it contains the entire XML document. All other nodes, with the exception of comments, are called elements.

Elements are the fundamental units of XML. An element must have a name, and it can have attributes, a namespace, and content. The following are some examples of elements:

<student>John Smith</student>
<phone></phone>
<undergrad />

Each of the lines above is an element. Anything enclosed in < and > is a tag. The content between the tags is called text or character data. In the first example, <student> is the start tag and </student> is the end tag. The second example does not contain any character data and is therefore called an empty element. The third example is also an empty element, but it is written as a single tag ending in /> instead of a separate start tag and end tag. This form is called an empty-element tag, and it is equivalent to the form used in the second example. Both empty-element tags and start tags can contain attributes.
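To see how a parser treats these three forms, here is a small sketch using Python's standard xml.etree.ElementTree module. The year attribute on the last element is a hypothetical addition made only to illustrate that an empty-element tag can carry attributes.

```python
import xml.etree.ElementTree as ET

samples = [
    "<student>John Smith</student>",  # start tag, character data, end tag
    "<phone></phone>",                # empty element written with two tags
    "<undergrad />",                  # empty element written as one tag
]

# ElementTree reports the text of an empty element as None in both forms.
parsed = [(el.tag, el.text) for el in map(ET.fromstring, samples)]
for tag, text in parsed:
    print(tag, repr(text))

# Hypothetical attribute, not part of the original examples.
attrs = ET.fromstring('<undergrad year="2" />').attrib
print(attrs)
```

The second and third elements come out of the parser looking the same, which reflects that the two notations for an empty element are interchangeable.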

Mixed Content Elements contain both text and child elements.

<book>
New York Times Best Seller
<title>Three Cups of Tea</title>
<publisher>Penguin Books</publisher>
</book>

Mixed content elements are very useful for content-oriented applications such as DocBook, but they are cumbersome for data-oriented applications, especially where machine-to-machine communication is involved.
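The extra handling that mixed content requires can be seen in a short sketch. It assumes Python's standard xml.etree.ElementTree module, which exposes the text that precedes the first child through the parent's text attribute; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
New York Times Best Seller
<title>Three Cups of Tea</title>
<publisher>Penguin Books</publisher>
</book>"""

book = ET.fromstring(BOOK_XML)

# Text that sits directly inside <book>, before its first child element.
leading_text = book.text.strip()

# The child elements are read separately from that loose text.
children = [(child.tag, child.text) for child in book]

print(leading_text)
print(children)
```

A data-oriented consumer has to decide what the loose text means and strip the surrounding whitespace itself, whereas the child elements are self-describing.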

All text content in elements is in Unicode.

Comments are enclosed in <!-- and -->. Elements can have other elements as children (speaking in terms of the tree data structure), but a comment cannot have children.
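The difference between comment nodes and element nodes can be inspected with a DOM parser. This is a minimal sketch using Python's standard xml.dom.minidom module; the sample document and its comment text are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical document: one comment followed by one child element.
doc = minidom.parseString(
    "<student><!-- transfer student --><name>Alice</name></student>"
)
root = doc.documentElement

comment = root.firstChild
name = root.lastChild

is_comment = comment.nodeType == minidom.Node.COMMENT_NODE
comment_has_children = comment.hasChildNodes()  # comments are leaf nodes
name_has_children = name.hasChildNodes()        # <name> holds a text node

print(is_comment, repr(comment.data), comment_has_children, name_has_children)
```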

Well-formed and Valid XML Documents

To be an XML document, a data object needs to be well-formed. A well-formed XML document may additionally be required to be valid.

A well-formed XML document conforms to the syntax rules of the XML 1.0 or XML 1.1 specification. A valid XML document is a well-formed document which also satisfies the constraints defined in its Document Type Definition (DTD). A DTD is a set of rules which outline which tags are allowed, what content those tags may hold, and how the tags relate to each other.
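Well-formedness on its own can be checked by handing the text to any XML parser and seeing whether it is accepted. The sketch below assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags are accepted.
ok = is_well_formed("<student><name>Alice</name></student>")

# Overlapping tags break the nesting rule.
bad_nesting = is_well_formed("<student><name>Alice</student></name>")

# A start tag closed by a different end tag is rejected as well.
bad_end_tag = is_well_formed("<gender>Female</name>")

print(ok, bad_nesting, bad_end_tag)
```

This only tests well-formedness. Whether the document is also valid depends on a DTD or schema, which this check does not consult.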

Checking an XML document against a list of constraints defined in a DTD or Schema is called validation. Validation is optional in XML.
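One consequence of validation being optional is that a non-validating parser accepts a well-formed document even when it breaks the rules of its own DTD. The sketch below illustrates this with Python's standard xml.etree.ElementTree module, which does not validate; the document and its DTD are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The internal DTD says <student> must contain exactly one <name>,
# but the document body contains a <major> instead.
DOC = """<!DOCTYPE student [
<!ELEMENT student (name)>
<!ELEMENT name (#PCDATA)>
]>
<student><major>Psychology</major></student>"""

# No error is raised: the text is well-formed, and the parser does not validate.
root = ET.fromstring(DOC)
child_tags = [child.tag for child in root]
print(root.tag, child_tags)
```

Catching the violation requires a validating parser or a separate validation step.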

Schema languages are used to create lists of constraints called schemas. DTD is itself a schema language, and it is the one defined by the W3C as part of the XML specification. Most XML parsers have DTD support built in.

A DTD focuses on the element structure of a document. It specifies:

  • which elements an XML document can contain
  • what attributes and text an element can contain
  • the order of the child elements in an element

Element Declarations

For an XML document to be valid, every element in the document must be declared.

<!ELEMENT country (#PCDATA)>
<!ELEMENT student (name,major+,gender?)>

In the first example above, the element is named country and it can contain any text data. In the second example, the element named student has three child elements, name, major, and gender, which must appear in that order. The + sign specifies that major must occur one or more times, so a student can have more than one major element within one student element. The ? sign specifies that the gender element is optional. The keyword EMPTY, used in place of a content model (for example <!ELEMENT undergrad EMPTY>), means that the element cannot contain any content.
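The meaning of the content model (name,major+,gender?) can be illustrated by translating it by hand into a regular expression over the sequence of child element names. This is only a sketch of the idea using Python's standard library, not a DTD validator; STUDENT_MODEL and matches_student_model are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (name,major+,gender?):
# one name, then one or more major, then an optional gender.
STUDENT_MODEL = re.compile(r"name,(major,)+(gender,)?")


def matches_student_model(xml_text: str) -> bool:
    """Check the order and count of <student> children against the model."""
    root = ET.fromstring(xml_text)
    # Encode the child names as a string such as "name,major,gender,".
    child_names = "".join(child.tag + "," for child in root)
    return root.tag == "student" and STUDENT_MODEL.fullmatch(child_names) is not None


one_major = matches_student_model(
    "<student><name>Alice</name><major>Psychology</major></student>")
two_majors = matches_student_model(
    "<student><name>Alice</name><major>Psychology</major>"
    "<major>Biology</major><gender>Female</gender></student>")
no_major = matches_student_model(
    "<student><name>Alice</name><gender>Female</gender></student>")
wrong_order = matches_student_model(
    "<student><name>Alice</name><gender>Female</gender>"
    "<major>Psychology</major></student>")

print(one_major, two_majors, no_major, wrong_order)
```

The first two documents satisfy the model. The third fails because + requires at least one major, and the fourth fails because gender appears before major.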