1. DOM (Document Object Model)
Sometimes the document structure is unknown in advance or too complex to map onto classes. In other cases, you only need to process specific parts of the data rather than the whole file, or you need to modify the document at runtime: remove a node, add a new element, or restructure part of it. And sometimes an XML file is simply too large to load entirely into memory.
In these situations, two tried-and-true tools, DOM and SAX, are suitable. They let you work directly with XML content without binding to pre-defined Java classes. DOM covers the cases where you need the whole structure at hand; SAX covers the files that are too large for that.
How DOM works
DOM is a way to represent an XML document as an in-memory object tree. Each tag becomes a tree node (Node), and attributes, text values, and even comments are separate objects. After loading the document, you have full access to its structure: you can read, modify, delete, and add elements and attributes.
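To make the "everything is a node" idea concrete, here is a minimal sketch that parses a small inline XML string and lists the type of each direct child of the root. The XML snippet, the class name NodeKindsSketch, and the method describeChildren are made up for illustration; only the standard javax.xml.parsers and org.w3c.dom APIs are used.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class NodeKindsSketch {
    // Returns a short label for every direct child of the root element.
    static List<String> describeChildren(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> kinds = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE: kinds.add("element:" + child.getNodeName()); break;
                case Node.TEXT_NODE:    kinds.add("text"); break;
                case Node.COMMENT_NODE: kinds.add("comment"); break;
                default:                kinds.add("other"); break;
            }
        }
        return kinds;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: one comment, one element, one piece of text
        String xml = "<note><!-- draft --><to>Ivan</to>Hello</note>";
        System.out.println(describeChildren(xml));
    }
}
```

Note that the comment and the loose text "Hello" show up as children alongside the <to> element. Attributes are not part of this child list; they are reached through the element itself (for example, with getAttribute).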
Key DOM classes in Java
- DocumentBuilderFactory — a factory for creating a parser.
- DocumentBuilder — the parser that turns XML into a tree.
- Document — the root object of the tree.
- Element — an XML element (tag).
- NodeList — a list of nodes (for example, all <item> inside <items>).
Example: Reading an XML file with DOM
Suppose we need to read the following XML file:
<contacts>
    <person id="1">
        <name>Ivan</name>
        <email>ivan@example.com</email>
    </person>
    <person id="2">
        <name>Maria</name>
        <email>maria@example.com</email>
    </person>
</contacts>
Code: reading and traversing elements
import javax.xml.parsers.*;
import org.w3c.dom.*;
import java.io.File;
public class DomExample {
    public static void main(String[] args) throws Exception {
        // 1. Create a factory and a parser
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // 2. Load the XML file into memory
        Document doc = builder.parse(new File("contacts.xml"));
        // 3. Get the root element
        Element root = doc.getDocumentElement();
        System.out.println("Root element: " + root.getTagName());
        // 4. Get the list of all <person>
        NodeList persons = root.getElementsByTagName("person");
        for (int i = 0; i < persons.getLength(); i++) {
            Element person = (Element) persons.item(i);
            String id = person.getAttribute("id");
            String name = person.getElementsByTagName("name").item(0).getTextContent();
            String email = person.getElementsByTagName("email").item(0).getTextContent();
            System.out.println("id: " + id + ", name: " + name + ", email: " + email);
        }
    }
}
What’s going on here:
- First, we create a parser and load the XML file.
- We get the root element (contacts).
- We find all <person> elements and iterate over them.
- For each <person>, we read the id attribute and the <name> and <email> elements.
DOM tree diagram for the example
contacts
├── person (id="1")
│   ├── name ("Ivan")
│   └── email ("ivan@example.com")
└── person (id="2")
    ├── name ("Maria")
    └── email ("maria@example.com")
Modifying XML with DOM
DOM lets you not only read but also change XML. For example, let’s add a new person:
// Create a new <person> element
Element newPerson = doc.createElement("person");
newPerson.setAttribute("id", "3");
// <name>
Element name = doc.createElement("name");
name.setTextContent("Sergey");
newPerson.appendChild(name);
// <email>
Element email = doc.createElement("email");
email.setTextContent("sergey@example.com");
newPerson.appendChild(email);
// Append to the root
root.appendChild(newPerson);
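// Save changes to a file.
// This step needs additional imports:
// import javax.xml.transform.*;
// import javax.xml.transform.dom.DOMSource;
// import javax.xml.transform.stream.StreamResult;
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(new File("contacts-updated.xml")));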
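Removing a node follows the same pattern as adding one. The sketch below is one possible way to do it, not the only one: it parses an inline string instead of a file, removes every <person> with a given id, and serializes the result back to a string. The class RemovePersonSketch and the method removePerson are hypothetical names chosen for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.io.StringWriter;

class RemovePersonSketch {
    // Removes all <person> elements whose id attribute equals the given id.
    static String removePerson(String xml, String id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList persons = doc.getElementsByTagName("person");
        // Iterate backwards: this NodeList is live and shrinks as nodes are removed
        for (int i = persons.getLength() - 1; i >= 0; i--) {
            Element person = (Element) persons.item(i);
            if (id.equals(person.getAttribute("id"))) {
                person.getParentNode().removeChild(person);
            }
        }

        // Serialize the modified tree to a string instead of a file
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<contacts>"
                + "<person id=\"1\"><name>Ivan</name></person>"
                + "<person id=\"2\"><name>Maria</name></person>"
                + "</contacts>";
        System.out.println(removePerson(xml, "2"));
    }
}
```

The backwards loop is a deliberate choice: a forward loop over a live NodeList can skip the element that slides into the removed node's position.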
Main pros and cons of DOM
- Pros: convenient for small files, easy to make any changes, you can “walk” the tree in any direction.
- Cons: the entire XML is kept in memory. For large files (hundreds of MB and more), it’s not an option — you’ll run out of memory quickly.
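"Walking the tree in any direction" can be illustrated with a small sketch. It assumes a single-line XML fragment with no whitespace between tags (with indented XML, the previous sibling would typically be a whitespace text node instead). The names TreeWalkSketch and describeNeighbours are invented for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import java.io.StringReader;

class TreeWalkSketch {
    // Starting from the first <email>, looks up, sideways, and down in the tree.
    static String describeNeighbours(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Node email = doc.getElementsByTagName("email").item(0);
        Node parent = email.getParentNode();        // up: the enclosing <person>
        Node previous = email.getPreviousSibling(); // sideways: the <name> before it
        Node text = email.getFirstChild();          // down: the text node inside <email>

        return "parent=" + parent.getNodeName()
                + ", previous=" + previous.getNodeName()
                + ", text=" + text.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<person id=\"1\"><name>Ivan</name><email>ivan@example.com</email></person>";
        System.out.println(describeNeighbours(xml));
    }
}
```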
2. SAX (Simple API for XML)
SAX is an event-driven parser. It doesn’t build a tree; it simply reads the XML left to right and invokes event handlers: “element started,” “text content,” “element ended,” etc. You write your own handler and react to the events you need.
Analogy:
If DOM is like laying all the pages of a planner out on a desk, SAX is like reading the planner page by page and taking notes only when you find the page you need.
Key SAX classes in Java
- SAXParserFactory, SAXParser — the factory and the parser.
- DefaultHandler — the base class for handling events.
- Handler methods: startElement, characters, endElement, startDocument, endDocument.
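To see the order in which these handler methods fire, a small tracing handler can help. The sketch below records each callback as a string; the class EventTraceSketch, the method trace, and the inline XML are assumptions made for the example. The parser is free to deliver text in more than one characters call, so the exact number of "characters" entries is not guaranteed.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class EventTraceSketch {
    // Parses the XML and returns the callbacks in the order they were invoked.
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                events.add("startElement:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("characters:" + new String(ch, start, length));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement:" + qName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // Prints one event per line for a tiny hypothetical document
        for (String event : trace("<a><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```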
Example: Reading an XML file with SAX
Suppose we have the same contacts.xml file. We want to simply print the names and emails of all people.
Code: SAX parser
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.File;
public class SaxExample {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        parser.parse(new File("contacts.xml"), new ContactHandler());
    }
}
class ContactHandler extends DefaultHandler {
    // Buffer for the text of the element currently being read
    private final StringBuilder text = new StringBuilder();
    private String name = "";
    private String email = "";
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Start collecting text for the new element
        text.setLength(0);
        if ("person".equals(qName)) {
            String id = attributes.getValue("id");
            System.out.println("New person, id: " + id);
        }
    }
    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text block, so append instead of assigning
        text.append(ch, start, length);
    }
    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("name".equals(qName)) {
            name = value;
        } else if ("email".equals(qName)) {
            email = value;
        } else if ("person".equals(qName)) {
            System.out.println("Name: " + name + ", email: " + email);
            // Reset so data does not carry over to the next person
            name = "";
            email = "";
        }
        text.setLength(0);
    }
}
Now let’s break down what’s happening in simple terms. When the SAX parser encounters an opening tag, for example <person>, it calls the startElement method. There, we immediately read the id attribute and print it. When text is encountered inside — for example, a name or an email — control goes to the characters method, where we store this text in temporary variables. And when the parser reaches the closing tag </person>, endElement is called. At that moment, we already know the person’s name and email and can print them. After that, the variables are cleared to be ready for the next contact.
The idea is that SAX doesn’t store the entire XML in memory; it works like a streaming “reader”: it goes through the file top to bottom and reacts to events — start tag, text, end tag. This is fast and memory-efficient, especially for large files.
Quick comparison of DOM and SAX
| | DOM | SAX |
|---|---|---|
| Style | Tree (entire structure in memory) | Events (processed “on the fly”) |
| Memory | Loads the entire XML | Minimal memory usage |
| Modifications | Can read/modify/add | Read-only (typically) |
| Data search | Easy to search anywhere | You need to track “where you are” |
| File size | For small and medium-sized files | For very large files |
3. When to use DOM and when to use SAX
DOM is a great fit if:
- The file is small or medium-sized.
- You need to access different parts of the XML repeatedly.
- You need to modify the document structure.
SAX is your choice if:
- The file is huge and cannot be loaded into memory.
- You need to quickly extract only part of the information (for example, find all <item> elements with a specific attribute).
- You need high performance and minimal memory usage.
- You don’t plan to modify the XML, only read it.
Tip:
In real projects, both approaches are often used: DOM — for “human-friendly” settings and small configs, SAX — for processing logs, exports, and large imports.
4. Practical exercise: a small parser for an app
In your practice app (for example, a “contact book”), suppose you need to quickly count how many users have emails on gmail.com.
DOM solution:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File("contacts.xml"));
NodeList emails = doc.getElementsByTagName("email");
int count = 0;
for (int i = 0; i < emails.getLength(); i++) {
    // Trim in case the address is surrounded by spaces or line breaks
    String email = emails.item(i).getTextContent().trim();
    if (email.endsWith("@gmail.com")) {
        count++;
    }
}
System.out.println("Users with gmail.com: " + count);
SAX solution:
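// Usage: SAXParserFactory.newInstance().newSAXParser().parse(new File("contacts.xml"), new GmailCounterHandler());
class GmailCounterHandler extends DefaultHandler {
    // Buffer for the text of the element currently being read
    private final StringBuilder text = new StringBuilder();
    int count = 0;
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Start collecting text for the new element
        text.setLength(0);
    }
    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text block in several chunks, so accumulate them
        text.append(ch, start, length);
    }
    @Override
    public void endElement(String uri, String localName, String qName) {
        // The full text of <email> is only known once its closing tag is reached
        if ("email".equals(qName) && text.toString().trim().endsWith("@gmail.com")) {
            count++;
        }
        text.setLength(0);
    }
    @Override
    public void endDocument() {
        System.out.println("Users with gmail.com: " + count);
    }
}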
5. Details and nuances of working with DOM and SAX
- DOM can eat up all memory if the XML file is very large. If you suddenly see an OutOfMemoryError, it’s probably time to switch to SAX.
- SAX requires careful attention: you need to track which element you’re in and carefully assemble the required data. Sometimes you have to use a stack or additional variables to avoid getting lost in deeply nested structures.
- In SAX, the characters method may be called multiple times for the same text block (especially if the text is long or contains special characters). It’s better to accumulate text in a StringBuilder.
- DOM is great for searching, navigation, and modifying structure, but not for streaming processing.
- If you’re not sure which approach to choose, start with DOM for simplicity; if it becomes “heavy” on memory, rewrite it using SAX.
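Two of the points above, tracking position with a stack and accumulating text in a StringBuilder, can be combined in one handler. The following is a sketch under assumptions: the sample XML has a <name> inside both <company> and <person>, and only the names of people should be collected. The class PathTrackingHandler, its parse helper, and the sample data are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();  // names of the currently open elements
    private final StringBuilder text = new StringBuilder(); // text of the current element
    final List<String> personNames = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append: one text block may arrive in several calls
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only <name> directly inside <contacts>/<person> counts
        if ("contacts/person/name".equals(String.join("/", path))) {
            personNames.add(text.toString().trim());
        }
        path.removeLast();
        text.setLength(0);
    }

    // Convenience helper for the example: parses a string and returns the collected names.
    static List<String> parse(String xml) throws Exception {
        PathTrackingHandler handler = new PathTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.personNames;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<contacts>"
                + "<company><name>Acme</name></company>"
                + "<person id=\"1\"><name>Ivan</name></person>"
                + "</contacts>";
        System.out.println(parse(xml));
    }
}
```

Comparing the full path rather than just the element name is what keeps the company name out of the result.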
6. Common mistakes when working with DOM and SAX
Error #1: Incorrect handling of spaces and line breaks in SAX.
The characters method may be called with chunks of text, including the spaces and line breaks between elements. If you don’t filter out whitespace-only chunks (for example, with .trim().isEmpty()), you may get lots of “empty” calls or assemble text incorrectly.
Error #2: Trying to modify XML with SAX.
SAX is read-only! If you need to change the structure, use DOM.
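Error #3: Not resetting state between elements in SAX.
If variables are not reset in endElement, data can “leak” from one element into the next: for example, a <person> without an <email> would be printed with the previous person’s email.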
Error #4: Using DOM for huge files.
The result is OutOfMemoryError or very slow performance.
Error #5: Incorrect casting in DOM.
In DOM everything is a Node, but to work with attributes and child elements you need to cast to Element. An incorrect cast can lead to a ClassCastException.
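A common way to run into this is getChildNodes(): in an indented document, the whitespace between tags appears as Text nodes, and casting those to Element fails. One possible guard, sketched below with the hypothetical names SafeCastSketch and childElementNames, is to check the node before casting.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SafeCastSketch {
    // Returns the tag names of the root's child elements, skipping non-element nodes.
    static List<String> childElementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Whitespace between tags is a Text node; a blind (Element) cast would throw
            if (child instanceof Element) {
                Element element = (Element) child;
                names.add(element.getTagName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical indented input: line breaks and spaces become Text nodes
        String xml = "<contacts>\n  <person id=\"1\"/>\n  <person id=\"2\"/>\n</contacts>";
        System.out.println(childElementNames(xml));
    }
}
```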