Why are text formats needed?

Text formats are convenient for storing information because they can be created and processed by both programs and humans.

Text files (files in text format) can be opened, read and edited in a wide variety of text editors.

Many programs use text-based configuration files, even if the format contains numbers and binary (yes/no) values.

This makes the programs somewhat more complicated due to the need to convert from text to an internal format and vice versa, but it makes it possible to edit the configuration manually without using a configuration tool in the program itself.
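To make the conversion step concrete, here is a minimal sketch that reads settings from text and converts them to Java types. It assumes the simple key=value format of java.util.Properties rather than XML, and the keys port and verbose are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class TextConfigDemo {

    // Parses configuration text written in the java.util.Properties format.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical settings that a person could type in any text editor.
        String text = "port=8080\nverbose=true\n";
        Properties props = parse(text);

        // Every value arrives as text, so the program converts it itself.
        int port = Integer.parseInt(props.getProperty("port"));
        boolean verbose = Boolean.parseBoolean(props.getProperty("verbose"));

        System.out.println("port = " + port);
        System.out.println("verbose = " + verbose);
    }
}
```

The extra parsing code is the price the program pays for a file that can be edited by hand.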

Where is XML used now?

XML is used in various areas of IT, both for configuration files (with program settings) and for files that transfer data between programs. In Java, one of the most common use cases is configuring Maven, a build automation tool.

Structure of an XML document

The physical and logical structures of an XML document are kept separate. In terms of the physical structure, the document consists of entities that can refer to other entities.

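The root entity is the document entity, which is where a parser starts reading. An entity is a storage unit of the document. Every entity has content, and every entity apart from the document entity (and the external DTD subset) is identified by a name.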

In turn, characters belong to one of two categories: character data or markup.

Markup includes:

  • tags, which denote element boundaries;
  • declarations and processing instructions, including their attributes;
  • entity references;
  • comments;
  • character sequences wrapping CDATA sections.

Logically, the document consists of elements, comments, declarations, entity references, and processing instructions. Markup is used to create all of this structure in a document.

All the constituent parts of a document are divided into a prolog and a root element. The root element is the mandatory, essential part of an XML document, while the prolog may not exist at all. The root element can consist of nested elements, character data, and comments. A document's elements must be nested correctly: any element that starts inside another element must also end inside that element.
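The nesting rule can be checked with the XML parser that ships with the JDK. The sketch below is one possible way to do it; the class and method names are our own, and the parser only checks that the text is well-formed, nothing more.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Returns true if the JDK parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // A parse exception here means the text is not well-formed.
            return false;
        }
    }

    public static void main(String[] args) {
        // <b> starts and ends inside <a>: correct nesting.
        System.out.println(isWellFormed("<a><b>text</b></a>")); // true
        // <b> starts inside <a> but ends outside it: incorrect nesting.
        System.out.println(isWellFormed("<a><b>text</a></b>")); // false
    }
}
```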

Markup symbols

Markup starts with < and ends with >, with one exception: entity references start with & and end with ;.

The < and > (angle brackets) and & (ampersand) symbols play a special role. Angle brackets indicate the boundaries of elements, processing instructions, and some other sequences. The ampersand starts an entity reference, which stands in for other text.

XML declaration

An XML declaration specifies the version of the language used to write the document. The XML specification says to start a document with an XML declaration because the proper interpretation of the document contents depends on the version of the language.

In the first version of the language (1.0), this declaration is optional, but it is mandatory in version 1.1. A missing declaration is assumed to mean version 1.0. The declaration may also contain information about the document encoding.

Example:

<?xml version="1.1" encoding="UTF-8" ?>
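If you are curious how a program sees the declaration, the following sketch reads it back through the DOM API of the JDK. The class name and the <root/> element are assumptions made for the example. Note that the declaration name is written in lowercase here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclarationDemo {

    // Parses the XML text from bytes, the way it would be read from a file.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        Document withDeclaration =
                parse("<?xml version=\"1.1\" encoding=\"UTF-8\" ?><root/>");
        // Version and encoding as stated in the declaration.
        System.out.println(withDeclaration.getXmlVersion());
        System.out.println(withDeclaration.getXmlEncoding());

        // No declaration at all: the parser assumes version 1.0.
        Document withoutDeclaration = parse("<root/>");
        System.out.println(withoutDeclaration.getXmlVersion());
    }
}
```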

Tags

A tag is a markup construct that contains the name of an element. There are start tags and end tags. There are also empty-element tags, which combine a start tag and an end tag.

Examples:

  • Start tag: <tag1>

  • End tag: </tag1>

  • Empty-element tag: <empty_tag1 />

Attributes

Attributes are another part of XML elements. An element can have several attributes, but each attribute name can appear only once in an element. Attributes let us specify more information about an element. Or more accurately, attributes define the properties of elements.

An attribute is always a name-value pair:

name = "value"

Example of an attribute in a tag:

<tag1 name = "value">element</tag1>

The value of an attribute must be wrapped in double quotes (") or single quotes ('). Attributes are used only in start tags and empty-element tags.
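Here is a small sketch of how a Java program might read a tag name, an attribute, and the element content, using the DOM parser from the JDK. The helper parseRoot is our own name, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDemo {

    // Parses the text and returns the root element of the document.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element element = parseRoot("<tag1 name = \"value\">element</tag1>");

        System.out.println(element.getTagName());          // tag1
        System.out.println(element.getAttribute("name"));  // value
        System.out.println(element.getTextContent());      // element

        // Single quotes around the value work the same way.
        Element single = parseRoot("<tag1 name='value'/>");
        System.out.println(single.getAttribute("name"));   // value
    }
}
```

Keep in mind that getAttribute returns an empty string, not null, when the attribute is missing.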

Escaping five special characters (<, >, ', ", &)

The < and & symbols cannot be used as such in character data and attribute values, and > is usually escaped along with them. You need to use special escape sequences to represent them. Special sequences are also used when writing apostrophes and quotation marks inside attribute values:

Symbol Replacement
< &lt;
> &gt;
& &amp;
' &apos;
" &quot;

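Note that the \ character has no special meaning in XML and can be written as is. It has to be doubled (\\) only when the XML text is written inside a Java string literal.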
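The five replacements from the table can be applied with a short helper method. This is only a sketch: the class and method names are our own, and real projects usually rely on an XML library to do the escaping for them.

```java
public class XmlEscape {

    // Replaces the five special characters with their escape sequences.
    static String escape(String text) {
        StringBuilder result = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  result.append("&lt;");   break;
                case '>':  result.append("&gt;");   break;
                case '&':  result.append("&amp;");  break;
                case '\'': result.append("&apos;"); break;
                case '"':  result.append("&quot;"); break;
                default:   result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("1 < 2 & 3 > 2"));    // 1 &lt; 2 &amp; 3 &gt; 2
        System.out.println(escape("say \"hi\""));        // say &quot;hi&quot;
    }
}
```

Each character is handled exactly once, so an ampersand that is produced by a replacement is never escaped a second time.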

CDATA section

A CDATA section is not a logical unit of text. This type of section can occur where XML syntax lets us place character data in the document.

The section starts with <![CDATA[ and ends with ]]>. Character data is placed between these bits of markup, and the <, >, and & symbols can be used in their direct form.
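To see this in action, the sketch below parses an element whose content is a CDATA section and prints the text the program receives. The element name code and the class name are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CdataDemo {

    // Parses the text and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Inside the CDATA section, <, > and & are written directly.
        String xml = "<code><![CDATA[if (a < b && b > c) return;]]></code>";

        // The program gets the characters exactly as they were written.
        System.out.println(rootText(xml)); // if (a < b && b > c) return;
    }
}
```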

Comments

Comments are not considered character data. A comment begins with <!-- and ends with -->. The character sequence -- cannot be used inside a comment. Also, inside a comment, the ampersand character does not denote markup.

Example:

<!-- this is a comment -->
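The following sketch shows that a comment is kept apart from character data when a document is read with the DOM parser from the JDK. The element name note and the class name are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CommentDemo {

    // Parses the text and returns the root element of the document.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element note = parseRoot("<note>visible<!-- this is a comment --></note>");

        // The comment does not become part of the element's text.
        System.out.println(note.getTextContent()); // visible

        // It is stored as a separate node of its own type.
        Node last = note.getLastChild();
        System.out.println(last.getNodeType() == Node.COMMENT_NODE); // true
        System.out.println(last.getNodeValue());
    }
}
```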

Names

In XML, names can contain only Unicode letters, digits, periods, colons, hyphens, and underscores. A name must start with a letter, colon, or underscore. Note that names beginning with the string xml (in any combination of upper and lower case) are reserved for the XML standards, so you should not use them for your own elements and attributes.
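The naming rules listed above can be approximated with a regular expression. Treat this as a simplified sketch of the rules as stated in this lesson: the exact set of characters allowed by the XML specification is somewhat different, and the class and method names are our own.

```java
import java.util.regex.Pattern;

public class XmlNameCheck {

    // First character: letter, underscore, or colon.
    // Other characters: letters, digits, periods, underscores, colons, hyphens.
    private static final Pattern NAME =
            Pattern.compile("[\\p{L}_:][\\p{L}\\p{Nd}._:-]*");

    // Simplified check that follows the rules from the lesson text.
    static boolean isAcceptableName(String name) {
        if (!NAME.matcher(name).matches()) {
            return false;
        }
        // Reject names that start with "xml" in any letter case.
        return !name.regionMatches(true, 0, "xml", 0, 3);
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableName("tag1"));      // true
        System.out.println(isAcceptableName("_my-tag.v2")); // true
        System.out.println(isAcceptableName("1tag"));      // false
        System.out.println(isAcceptableName("XmlData"));   // false
    }
}
```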

Example

Let's look at a Java class and an object of that class. Then we will try to serialize the object in XML format. Class code:


import java.util.List;

public class Book {
   private String title;
   private String author;
   private Integer pageCount;
   private List<String> chapters;

   public Book(String title, String author, Integer pageCount, List<String> chapters) {
       this.title = title;
       this.author = author;
       this.pageCount = pageCount;
       this.chapters = chapters;
   }
   // Getters/setters
}

And the creation of an object:


Book book = new Book("My Favorite Book", "Amigo", 999, Arrays.asList("Chapter 1", "Chapter 2", "Chapter 3", "Chapter 4", "Chapter 5", "Chapter 6"));

Here is an example of a well-formed XML representation of the Java object, which contains four fields, one of which is a collection (see the Java code above):

<Book>
  <title>My Favorite Book</title>
  <author>Amigo</author>
  <pageCount>999</pageCount>
  <chapters>
    <chapters>Chapter 1</chapters>
    <chapters>Chapter 2</chapters>
    <chapters>Chapter 3</chapters>
    <chapters>Chapter 4</chapters>
    <chapters>Chapter 5</chapters>
    <chapters>Chapter 6</chapters>
  </chapters>
</Book>
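One possible way to produce XML of this shape is to build it by hand with the DOM and Transformer classes from the JDK. This is only a sketch: it takes the field values as plain parameters instead of a Book object, the class and method names are our own, and in practice a serialization library would normally do this work. The exact whitespace of the output depends on the JDK version.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BookXmlWriter {

    // Builds a DOM tree with the same element names as the XML shown above.
    static Document toDocument(String title, String author, int pageCount,
                               List<String> chapters) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element book = doc.createElement("Book");
            doc.appendChild(book);
            appendText(doc, book, "title", title);
            appendText(doc, book, "author", author);
            appendText(doc, book, "pageCount", String.valueOf(pageCount));

            // The collection gets a wrapper element with one child per item.
            Element wrapper = doc.createElement("chapters");
            book.appendChild(wrapper);
            for (String chapter : chapters) {
                appendText(doc, wrapper, "chapters", chapter);
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds <name>text</name> to the parent; the text is escaped on output.
    static void appendText(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    // Turns the DOM tree into XML text.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = toDocument("My Favorite Book", "Amigo", 999,
                Arrays.asList("Chapter 1", "Chapter 2", "Chapter 3"));
        System.out.println(serialize(doc));
    }
}
```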

XML schema

An XML schema is a description of the structure of an XML document. The corresponding specification (XML Schema Definition, or XSD) is a W3C recommendation.

XSD was designed to express the rules that an XML document must follow. But the most interesting thing for us is that XSD was designed to be used when developing software that processes XML documents. It lets us check the correctness of an XML document programmatically.
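As a rough illustration of such a programmatic check, the sketch below validates XML text against a schema with the javax.xml.validation package from the JDK. The tiny schema is hypothetical and written only for this example (a Book element with a title and a numeric pageCount), and it is kept in a string rather than in a separate file.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class XsdValidationDemo {

    // Hypothetical schema: <Book> holds a text title and an integer pageCount.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"Book\">"
          + "<xs:complexType><xs:sequence>"
          + "<xs:element name=\"title\" type=\"xs:string\"/>"
          + "<xs:element name=\"pageCount\" type=\"xs:int\"/>"
          + "</xs:sequence></xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";

    // Returns true if the XML text follows the rules of the schema above.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            // validate() throws an exception when a rule is broken.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<Book><title>My Favorite Book</title><pageCount>999</pageCount></Book>";
        String bad = "<Book><title>My Favorite Book</title><pageCount>many</pageCount></Book>";

        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false: pageCount is not a number
    }
}
```

Both documents are well-formed XML. Only the schema lets the program notice that the second one has the wrong kind of value in pageCount.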

Files containing an XML schema have the .xsd extension. Designing an XML schema is beyond the scope of this lesson, so for now just be aware that the possibility exists.