Author
John Selawsky
Senior Java Developer and Tutor at LearningTree

XML in Java

Published in the Java Developer group
Hi! Today we will introduce another data format called XML. This is a very important topic. When working on real Java applications, you will almost certainly encounter XML-related tasks. In Java development, this format is used almost universally (we will find out why below), so I recommend that you don't superficially review this lesson, but rather gain a thorough understanding of everything and also study additional literature/links :) This definitely won't be a waste of time. So, let's start with the easy stuff: the "what" and the "why"!

What is Java XML?

XML stands for eXtensible Markup Language. You may already be familiar with a markup language: have you heard of HTML, which is used to create web pages? :) HTML and XML even have a similar appearance:
HTML 1

<h1>title</h1>
<p>paragraph</p>
<p>paragraph</p>
XML 1

<headline>title</headline>
<paragraph>paragraph</paragraph>
<paragraph>paragraph</paragraph>
HTML 2

<h1>title</h1>
<p>paragraph</p>
<p>paragraph</p>
XML 2

<chief>title</chief>
<paragraph>paragraph</paragraph>
<paragraph>paragraph</paragraph>
In other words, XML is a language for describing data.

Why do you need XML?

XML was originally invented to more conveniently store and send data, including via the Internet. It has several advantages that help you achieve this. First, it is easy to read by both a human and a computer. I think you can easily understand what this XML file describes:

<?xml version="1.0" encoding="UTF-8"?>
<book>
   <title>Harry Potter and the Philosopher’s Stone</title>
   <author>J. K. Rowling</author>
   <year>1997</year>
</book>
A computer also easily understands this format. Second, since the data is stored as plain text, there will be no compatibility problems when we transfer it from one computer to another. It is important to understand that XML is not executable code — it's a data description language. After you describe data using XML, you need to write code (for example, in Java) that can send/receive/process this data.
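To make the "write code that processes this data" step concrete, here is a minimal sketch that reads the book document with the DOM parser that ships with the JDK. The class name `BookReader` and the helper methods are made up for this illustration, and the XML is kept in a string only so the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class BookReader {

    // The book document from the lesson, inlined for the example.
    static final String BOOK_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<book>"
            + "<title>Harry Potter and the Philosopher\u2019s Stone</title>"
            + "<author>J. K. Rowling</author>"
            + "<year>1997</year>"
            + "</book>";

    // Returns the text of the first descendant element with the given tag name.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    // Parses the XML and builds a one-line, human-readable description.
    static String describe(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element book = doc.getDocumentElement(); // the <book> element
        return childText(book, "title")
                + " by " + childText(book, "author")
                + " (" + childText(book, "year") + ")";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(BOOK_XML));
    }
}
```

In a real application the XML would more likely come from a file or a network response than from a string constant.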

How is XML structured?

The main component is tags: these are the things in angle brackets:

<book>
</book>
There are opening tags and closing tags. The closing tag has an additional symbol ("/"), as can be seen in the example above. Each opening tag must have a closing tag. They show where the description of each element in the file begins and ends. Tags can be nested! In our book example, the <book> tag has 3 nested tags: <title>, <author>, and <year>. This isn't limited to one level of nesting: nested tags can have their own nested tags, etc. This structure is called a tag tree. Let's look at this tree using a sample XML file that describes a car dealership:

<?xml version="1.0" encoding="UTF-8"?>
<carstore>
   <car category="truck">
       <model lang="en">Scania R 770</model>
       <year>2005</year>
       <price currency="US dollar">200000.00</price>
   </car>
   <car category="sedan">
       <model lang="en">Ford Focus</model>
       <year>2012</year>
       <price currency="US dollar">20000.00</price>
   </car>
   <car category="sport">
       <model lang="en">Ferrari 360 Spider</model>
       <year>2018</year>
       <price currency="US dollar">150000.00</price>
   </car>
</carstore>
Here we have a top-level tag: <carstore>. It is also called the root element. <carstore> has one kind of child tag, <car>, which appears three times. Each <car>, in turn, has 3 child tags: <model>, <year> and <price>. Each tag can have attributes, which contain additional important information. In our example, the <model> tag has a "lang" attribute, which indicates the language used to record the model name:

<model lang="en">Scania R 770</model>
Here we indicate that the name is written in English. Our <price> tag has a "currency" attribute.

<price currency="US dollar">150000.00</price>
Here we indicate that the car's price is given in US dollars. Thus, XML has a "self-describing" syntax. You can add any information you need to describe the data. Additionally, at the top of the file, you can add a line indicating the XML version and the encoding used to write the data. This is called the "prolog" and it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
We're using XML version 1.0 and UTF-8 encoding. The prolog is optional, but it can come in handy if, for example, your file uses text in different languages. We mentioned that XML means "eXtensible Markup Language", but what does "extensible" mean? It means that the format is well suited to creating new versions of your objects and files. For example, suppose we want to also start selling motorcycles at our car dealership! Our program then needs to support both versions of <carstore>: the old one (without motorcycles) and the new one. Here's our old version:

<?xml version="1.0" encoding="UTF-8"?>
<carstore>
   <car category="truck">
       <model lang="en">Scania R 770</model>
       <year>2005</year>
       <price currency="US dollar">200000.00</price>
   </car>
   <car category="sedan">
       <model lang="en">Ford Focus</model>
       <year>2012</year>
       <price currency="US dollar">20000.00</price>
   </car>
   <car category="sport">
       <model lang="en">Ferrari 360 Spider</model>
       <year>2018</year>
       <price currency="US dollar">150000.00</price>
   </car>
</carstore>
And here's the new expanded one:

<?xml version="1.0" encoding="UTF-8"?>
<carstore>
   <car category="truck">
       <model lang="en">Scania R 770</model>
       <year>2005</year>
       <price currency="US dollar">200000.00</price>
   </car>
   <car category="sedan">
       <model lang="en">Ford Focus</model>
       <year>2012</year>
       <price currency="US dollar">20000.00</price>
   </car>
   <car category="sport">
       <model lang="en">Ferrari 360 Spider</model>
       <year>2018</year>
       <price currency="US dollar">150000.00</price>
   </car>
   <motorcycle>
       <title lang="en">Yamaha YZF-R6</title>
       <year>2018</year>
       <price currency="Russian Ruble">1000000.00</price>
       <owner>Vasia</owner>
   </motorcycle>
   <motorcycle>
       <title lang="en">Harley Davidson Sportster 1200</title>
       <year>2011</year>
       <price currency="Euro">15000.00</price>
       <owner>Petia</owner>
   </motorcycle>
</carstore>
That's how easy and simple it is to add a description of motorcycles to our file :) What's more, we absolutely don't need to have the same child tags for motorcycles as for cars. Please note that motorcycles, unlike cars, have an <owner> element. This does not prevent the computer (or human) from reading the data.
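On the Java side, supporting both versions mostly means not assuming that every child of <carstore> looks the same. The sketch below is one possible way to do that with the JDK's DOM parser: it walks over whatever vehicle elements it finds and treats <owner> as optional. The class `StoreReader`, its methods, and the shortened sample document are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader that accepts both the old and the new <carstore> layout.
class StoreReader {

    // Shortened version of the expanded file: one car, one motorcycle.
    static final String SAMPLE =
            "<carstore>"
            + "<car category=\"truck\">"
            + "<year>2005</year><price currency=\"US dollar\">200000.00</price>"
            + "</car>"
            + "<motorcycle>"
            + "<year>2011</year><price currency=\"Euro\">15000.00</price>"
            + "<owner>Petia</owner>"
            + "</motorcycle>"
            + "</carstore>";

    // Text of the first descendant element with this tag name, or null if there is none.
    static String optionalChild(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? null : found.item(0).getTextContent();
    }

    static List<String> describeVehicles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> lines = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and other non-element nodes
            }
            Element vehicle = (Element) node;
            Element price = (Element) vehicle.getElementsByTagName("price").item(0);
            StringBuilder line = new StringBuilder(vehicle.getTagName())
                    .append(": ").append(price.getTextContent())
                    .append(' ').append(price.getAttribute("currency"));
            String owner = optionalChild(vehicle, "owner"); // only motorcycles have it
            if (owner != null) {
                line.append(", owner ").append(owner);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        describeVehicles(SAMPLE).forEach(System.out::println);
    }
}
```

Because the missing <owner> is handled explicitly, the same method works on the old file, which has no motorcycles at all.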

Differences between XML and HTML

We have already said that XML and HTML are very similar in appearance. That makes it very important to know how they differ. First, they are used for different purposes. HTML is for marking up web pages. For example, while creating a website, you can use HTML to specify: "The menu should be in the upper right corner. It should have such and such buttons". In other words, HTML's job is to display data. XML is for storing and sending information in a form convenient for humans and computers. This format doesn't contain any indication of how this data should be displayed: that depends on the code of the program that reads it. Second, there is a major technical difference. HTML tags are predefined. In other words, an HTML heading (for example, a large caption at the top of the page) can only be created with the <h1></h1> tags (<h2></h2> through <h6></h6> are used for smaller headings). You can't create HTML headings using other tags. XML does not use predefined tags. You can give tags any name you want: <header>, <title>, <idontknow2121>.

Conflict resolution

The freedom that XML provides can lead to some problems. For example, one and the same entity (say, a car) can be used by a program for different purposes. Suppose we have an XML file that describes cars, but our programmers didn't reach a prior agreement among themselves. Now, in addition to data about real cars, we might find data about toy cars in our XML! Moreover, they have the same child elements and attributes. Let's say our program reads in such an XML file. How do we distinguish a real car from a toy car?

<?xml version="1.0" encoding="UTF-8"?>
<carstore>
   <car category="truck">
       <model lang="en">Scania R 770</model>
       <year>2005</year>
       <price currency="US dollar">200000.00</price>
   </car>
   <car category="sedan">
       <model lang="en">Ford Focus</model>
       <year>2012</year>
       <price currency="US dollar">100.00</price>
   </car>
</carstore>
Here prefixes and namespaces will help us. In order to distinguish toy cars from real ones in our program (and indeed any toys from their real counterparts), we introduce two prefixes: "real" and "toy".

<real:car category="truck">
   <model lang="en">Scania R 770</model>
   <year>2005</year>
   <price currency="US dollar">200000.00</price>
</real:car>
<toy:car category="sedan">
   <model lang="en">Ford Focus</model>
   <year>2012</year>
   <price currency="US dollar">100.00</price>
</toy:car>
Now our program will be able to distinguish between the different entities! Everything that has the toy prefix will be treated as toys :) However, we're not done yet. To use prefixes, we need to register each of them as a namespace. Actually, "register" is a strong word :) We simply need to come up with a unique name for each of them. It's like with classes: a class has a short name (Cat) and a fully qualified name that includes all packages (zoo.animals.Cat). A URI is usually used to create a unique namespace name. Sometimes this is done using an Internet address, where the functions of this namespace are described. But it does not have to be a valid Internet address. Very often, projects simply use URI-like strings that help track the namespace hierarchy. Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<carstore xmlns:real="http://testproject.developersgroup1.companyname/department2/namespaces/real"
         xmlns:toy="http://testproject.developersgroup1.companyname/department2/namespaces/toy">
<real:car category="truck">
   <model lang="en">Scania R 770</model>
   <year>2005</year>
   <price currency="US dollar">200000.00</price>
</real:car>
<toy:car category="sedan">
   <model lang="en">Ford Focus</model>
   <year>2012</year>
   <price currency="US dollar">100.00</price>
</toy:car>
</carstore>
Of course, there is no website at "http://testproject.developersgroup1.companyname/department2/namespaces/real". But this string does contain useful information: the developers of Group 1 in Department 2 are responsible for creating the "real" namespace. If we need to introduce new names or discuss possible conflicts, we'll know where to turn. Sometimes developers use a real descriptive web address as a unique namespace name. For example, this may be the case for a large company whose project will be used by millions of people around the world. But this is certainly not always done: Stack Overflow has a discussion on this issue. In general, parsers treat a namespace name as an opaque string and never try to open it, so in practice nothing forces you to use a meaningful URI. Even a random string would work:

xmlns:real="nvjneasiognipni4435t9i4gpojrmeg"
That said, using a URI has several advantages: it is much easier to keep unique, and it tells the reader who is responsible for the namespace.
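In Java, the program tells the two kinds of cars apart by namespace name rather than by prefix. The following is a minimal sketch using the JDK's DOM parser; note that namespace support has to be switched on explicitly. The class `NamespaceDemo`, its method, and the shortened sample document are assumptions for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class NamespaceDemo {

    static final String REAL_NS =
            "http://testproject.developersgroup1.companyname/department2/namespaces/real";
    static final String TOY_NS =
            "http://testproject.developersgroup1.companyname/department2/namespaces/toy";

    // Shortened version of the document from the lesson.
    static final String SAMPLE =
            "<carstore xmlns:real=\"" + REAL_NS + "\" xmlns:toy=\"" + TOY_NS + "\">"
            + "<real:car category=\"truck\">"
            + "<price currency=\"US dollar\">200000.00</price>"
            + "</real:car>"
            + "<toy:car category=\"sedan\">"
            + "<price currency=\"US dollar\">100.00</price>"
            + "</toy:car>"
            + "</carstore>";

    // Counts the <car> elements that belong to the given namespace.
    static int countCars(String xml, String namespaceUri) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Matches on namespace name + local name; the prefix itself is irrelevant.
        return doc.getElementsByTagNameNS(namespaceUri, "car").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("real cars: " + countCars(SAMPLE, REAL_NS));
        System.out.println("toy cars: " + countCars(SAMPLE, TOY_NS));
    }
}
```

Matching on the namespace name means the code keeps working even if someone later renames the "toy" prefix in the file.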

Basic XML standards

XML standards are a set of extensions that add extra functionality to XML files. XML has a lot of standards, but we'll just look at the most important ones and find out what they make possible.
AJAX is one of the most famous XML-related technologies. It lets you change the contents of a web page without reloading it!
XSLT lets you convert XML text to other formats. For example, you can use XSLT to convert XML to HTML! As we have said, the purpose of XML is to describe data, not to display it. But with XSLT we can get around this limitation!
XML DOM lets you retrieve, modify, add, or delete individual elements from an XML file. Here is a small example of how this works. We have a books.xml file:

<bookstore>
   <book category="cooking">
       <title lang="en">Everyday Italian</title>
       <author>Giada De Laurentiis</author>
       <year>2005</year>
       <price>30.00</price>
   </book>
   <book category="children">
       <title lang="en">Harry Potter</title>
       <author>J. K. Rowling</author>
       <year>2005</year>
       <price>29.99</price>
   </book>
</bookstore>
It has two books. Each book has a <title> element. Here we can use JavaScript to get all the book titles from our XML file and display the first one on the page:

<!DOCTYPE html>
<html>
<body>

<p id="demo"></p>

<script>
// Request books.xml asynchronously
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
    // readyState 4 = request finished, status 200 = OK
    if (this.readyState == 4 && this.status == 200) {
        myFunction(this);
    }
};
xhttp.open("GET", "books.xml", true);
xhttp.send();

// Show the text of the first <title> element in the "demo" paragraph
function myFunction(xml) {
    var xmlDoc = xml.responseXML;
    document.getElementById("demo").innerHTML =
        xmlDoc.getElementsByTagName("title")[0].childNodes[0].nodeValue;
}
</script>

</body>
</html>
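Since this lesson is about Java, here is how the XSLT conversion mentioned above might look from Java code, using the `javax.xml.transform` API from the JDK. This is only a sketch: the class `TitlesToHtml`, the tiny stylesheet, and the shortened bookstore document are all invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: turns book titles into an HTML list via XSLT.
class TitlesToHtml {

    // Shortened version of books.xml.
    static final String BOOKS_XML =
            "<bookstore>"
            + "<book category=\"cooking\"><title lang=\"en\">Everyday Italian</title></book>"
            + "<book category=\"children\"><title lang=\"en\">Harry Potter</title></book>"
            + "</bookstore>";

    // Example stylesheet: one <li> per book title.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"html\" indent=\"no\"/>"
            + "<xsl:template match=\"/bookstore\">"
            + "<ul><xsl:for-each select=\"book\">"
            + "<li><xsl:value-of select=\"title\"/></li>"
            + "</xsl:for-each></ul>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the result as a string.
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(BOOKS_XML, STYLESHEET));
    }
}
```

The data stays in XML and the presentation lives entirely in the stylesheet, which is the separation the XSLT paragraph describes.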
DTD ("document type definition") lets you define which elements are allowed in an XML file and in what order. For example, suppose we are working on a bookstore website and that all development teams agree that only the title, author, and year elements should be specified for the book elements in the XML files. But how do we protect ourselves from carelessness? Very easy!

<?xml version="1.0"?>
<!DOCTYPE book [
       <!ELEMENT book (title,author,year)>
       <!ELEMENT title (#PCDATA)>
       <!ELEMENT author (#PCDATA)>
       <!ELEMENT year (#PCDATA)>
       ]>

<book>
   <title>The Lord of The Rings</title>
   <author>John R.R. Tolkien</author>
   <year>1954</year>
</book>
Here we have defined a list of valid child elements for <book>. Try to add a new element there and a validating parser or editor will immediately report an error!

<book>
   <title>The Lord of The Rings</title>
   <author>John R.R. Tolkien</author>
   <year>1954</year>
   <mainhero>Frodo Baggins</mainhero>
</book>
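A DTD only has an effect when something actually validates the document against it. As a minimal sketch, assuming the JDK's built-in DOM parser, validation can be switched on with `setValidating(true)`; the class `DtdCheck` and its constants are hypothetical names for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: checks a document against its internal DTD.
class DtdCheck {

    // The prolog and DTD from the lesson.
    static final String PROLOG =
            "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE book ["
            + "<!ELEMENT book (title,author,year)>"
            + "<!ELEMENT title (#PCDATA)>"
            + "<!ELEMENT author (#PCDATA)>"
            + "<!ELEMENT year (#PCDATA)>"
            + "]>";

    static final String VALID_BOOK =
            "<book><title>The Lord of The Rings</title>"
            + "<author>John R.R. Tolkien</author><year>1954</year></book>";

    static final String INVALID_BOOK =
            "<book><title>The Lord of The Rings</title>"
            + "<author>John R.R. Tolkien</author><year>1954</year>"
            + "<mainhero>Frodo Baggins</mainhero></book>";

    // Returns true only if the document is well-formed and satisfies its DTD.
    static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation is off by default
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Validity problems arrive via error(); rethrow them so parsing stops.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid book: " + isValid(PROLOG + VALID_BOOK));
        System.out.println("book with <mainhero>: " + isValid(PROLOG + INVALID_BOOK));
    }
}
```

The exact wording of the error depends on the parser or editor you use.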
Error! "Element mainhero is not allowed here"
There are many other XML standards. You can familiarize yourself with each of them and try to dig deeper into the code. And with this, our lesson comes to an end. It's time to get back to the tasks! :) Until next time!