1. The concept of character encoding
Let’s start with the main question: what is an encoding?
Imagine you’ve arrived at an international conference. Everyone speaks their own language, but they all want to understand each other. You need an interpreter who knows that "hello" in English is "privet" in Russian and "hola" in Spanish. In the world of computers, an encoding is that interpreter.
Encoding is a way of representing characters as bytes
A computer is simple: it understands only zeros and ones, that is, bits and bytes. People, however, want to see letters, digits, emoji, and even — oh no! — Chinese characters. For a computer to “write down” a character, we must agree on which sequence of bytes corresponds to each character.
Encoding is a set of rules for turning characters (letters, digits, punctuation, emoji, etc.) into bytes for storage and transmission, and back again: turning bytes into characters for display.
Example: the letter 'A' in different encodings
- In UTF-8, the Cyrillic capital letter A (U+0410) is encoded as two bytes: 0xD0 0x90.
- In Windows-1251, the same letter is a single byte: 0xC0.
- The Latin 'A' is 0x41 in most popular encodings.
If you read a file using the wrong encoding, characters will turn into “mojibake” (garbled symbols or question marks).
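You can verify these byte values yourself in Java. A minimal sketch (the class name is just for illustration): `String.getBytes(Charset)` performs the character-to-byte conversion for a given encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LetterBytes {
    public static void main(String[] args) {
        String cyrillicA = "\u0410"; // Cyrillic capital letter A

        byte[] utf8 = cyrillicA.getBytes(StandardCharsets.UTF_8);
        byte[] cp1251 = cyrillicA.getBytes(Charset.forName("windows-1251"));

        // UTF-8 uses two bytes for this letter: D0 90
        System.out.printf("UTF-8:        %02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
        // Windows-1251 uses a single byte: C0
        System.out.printf("Windows-1251: %02X%n", cp1251[0] & 0xFF);
        // Latin 'A' is the same single byte 0x41 in both
        System.out.printf("Latin 'A':    %02X%n", "A".getBytes(StandardCharsets.UTF_8)[0] & 0xFF);
    }
}
```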
2. Why encodings are needed
Why can’t we just store letters as they are?
Because a computer understands only numbers (zeros and ones). Which number corresponds to which letter — that’s the essence of an encoding.
Example: “Privet” on disk
When you write the word "Privet" (Russian for “hello”) to a file, for the computer it’s just a sequence of bytes. How those bytes are interpreted depends on the encoding.
- If the file is saved in UTF-8, each Cyrillic letter is encoded using two bytes.
- If it’s Windows-1251, each letter is one byte, but the byte values are different.
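The difference is easy to see by counting bytes. A small sketch (again, `getBytes` with an explicit Charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PrivetBytes {
    public static void main(String[] args) {
        String privet = "Привет"; // six Cyrillic letters

        // UTF-8: two bytes per Cyrillic letter -> 12 bytes total
        System.out.println(privet.getBytes(StandardCharsets.UTF_8).length);

        // Windows-1251: one byte per letter -> 6 bytes total
        System.out.println(privet.getBytes(Charset.forName("windows-1251")).length);
    }
}
```

The same six-letter word occupies 12 bytes in one encoding and 6 in the other: the bytes on disk depend entirely on the encoding chosen.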
Where do encodings matter?
- When writing text to a file: so that the bytes can be read correctly later.
- When reading text from a file: to turn bytes back into letters.
- When sending text over a network (for example, HTTP, e‑mail).
- When working with databases: they also need to know which encoding is used to store text.
If you don’t specify an encoding…
It’s like opening a text in an unfamiliar language and trying to read it. Your chances improve if you know what the language is. If you don’t — at best, you won’t understand anything; at worst, you’ll get gibberish.
3. Problems without the right encoding
Mojibake and data loss
The most common complaint from beginners (and not only from beginners): “Why do I see gibberish like "РџСЂРёРІРµС‚" or just question marks instead of "Привет"?”
This happens when a file was written in one encoding and read in another. For example, the file was created on an old Windows system in Windows-1251, and you open it on Linux, which defaults to UTF-8. Or vice versa.
Example
- Saved the file in Windows-1251: the byte for Cyrillic capital letter Pe (U+041F) is 0xCF.
- Opened as UTF-8: the decoder expects each Cyrillic letter to be a two-byte sequence, but the single-byte values from Windows-1251 don’t form valid UTF-8. Everything breaks.
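You can simulate such a mismatch entirely in memory, without touching the disk. A sketch: encode a string with one charset and deliberately decode the bytes with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Привет";

        // The "file on disk": bytes written in UTF-8
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // A program mistakenly decodes those bytes as Windows-1251
        String garbled = new String(utf8Bytes, Charset.forName("windows-1251"));

        System.out.println(garbled); // mojibake instead of "Привет"
    }
}
```

Each two-byte UTF-8 sequence is read as two separate Windows-1251 characters, so the six-letter word turns into twelve garbled ones.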
Data loss
If, when writing, a character is not supported by the chosen encoding (for example, you try to save an emoji in ASCII), it will disappear or be replaced with a question mark. Anything that doesn’t “fit” into the encoding is lost.
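Here is a sketch of that loss in action. `getBytes` silently replaces unmappable characters with '?', and `CharsetEncoder.canEncode` lets you check up front whether an encoding can represent the text:

```java
import java.nio.charset.StandardCharsets;

public class DataLossDemo {
    public static void main(String[] args) {
        String privet = "Привет";

        // Every Cyrillic letter is outside ASCII, so each one
        // is silently replaced with '?' (byte 0x3F)
        byte[] ascii = privet.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // ??????

        // Check in advance whether an encoding can represent the text
        boolean fits = StandardCharsets.US_ASCII.newEncoder().canEncode(privet);
        System.out.println(fits); // false
    }
}
```

Note that the replacement is silent: no exception is thrown, the original letters are simply gone.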
Problems when exchanging files
Files saved in one encoding may display incorrectly on other computers if they use a different default encoding. This often happens when exchanging files between Windows and Linux or when opening old files.
4. Encoding in Java: internal and external representation
Inside the JVM: always Unicode (UTF-16)
In Java, strings (String) inside a program are always stored in Unicode (specifically, UTF-16). This means you can safely assign strings in any language in the world, and Java will handle them just fine.
String hello = "Hello, world! 😀";
In JVM memory, this text is stored as a set of 16‑bit numbers (char), where each character has its own code in the Unicode table.
Fun fact
In Java, a char is 16 bits (2 bytes). But some characters (for example, rare ideographs or emoji) require two char values — these are “surrogate pairs.”
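You can observe a surrogate pair directly: `String.length()` counts 16-bit char units, while `codePointCount` counts actual characters. A short sketch:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // 😀 (U+1F600), outside the 16-bit range

        // One visible character, but two 16-bit char values (a surrogate pair)
        System.out.println(smiley.length());                           // 2
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1

        // The Latin letter 'A' fits in a single char
        System.out.println("A".length()); // 1
    }
}
```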
On input/output: encoding matters!
When you read or write strings to the outside world (files, network), Java must convert the internal representation (UTF-16) into a sequence of bytes. That’s where the encoding is needed.
- If you don’t explicitly specify an encoding, Java uses its default charset. Before Java 18 this was the platform encoding (on Russian Windows it might be Windows-1251, on Linux usually UTF-8); since Java 18 (JEP 400), the default is UTF-8 everywhere.
- Relying on the default is still risky: on an older JVM or a differently configured machine, the result may differ.
Example: reading and writing a file without specifying the encoding
// Bad practice! No encoding specified.
FileReader reader = new FileReader("data.txt");
FileWriter writer = new FileWriter("data.txt");
In this case, Java uses the system encoding. If the file was written on a different system, you’ll get mojibake.
Good practice: always specify the encoding
// Good! The encoding is specified explicitly.
BufferedReader reader = new BufferedReader(
new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8));
BufferedWriter writer = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream("data.txt"), StandardCharsets.UTF_8));
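The `java.nio.file.Files` API (Java 8+) offers a shorter way to do the same thing; its `newBufferedReader`/`newBufferedWriter` methods accept the charset directly (the file name here matches the example above):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioEncodingDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("data.txt");

        // The charset is passed explicitly, same as with the stream wrappers
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write("Привет, мир!");
        }
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}
```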
5. A brief illustration: what happens when working with encodings
Diagram: the path of a string from file to program and back
[File on disk (bytes, encoding X)]
|
V
[Java reads bytes and, using encoding X, turns them into a String (UTF-16)]
|
V
[You work with the string in the program]
|
V
[Java writes the String to bytes using encoding Y]
|
V
[File on disk (bytes, encoding Y)]
If X and Y match, everything is fine. If they differ, problems are possible.
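The happy path of the diagram, where X and Y are the same encoding, can be sketched in a few lines: encoding to bytes and decoding back with the same charset restores the original string exactly.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String original = "Привет, мир!";

        // X: encode the string to bytes with UTF-8 (the "write" step)
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        // Y: decode the bytes back with the SAME encoding (the "read" step)
        String restored = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(original.equals(restored)); // true
    }
}
```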
6. A brief history of encodings (for the curious)
ASCII
ASCII is one of the oldest encodings: seven bits (in practice one byte) per character, covering only the English alphabet, digits, and basic punctuation. Any other alphabets — out of scope.
Windows-1251, ISO-8859-1 and other legacy encodings
These are single-byte encodings for different sets of letters: Cyrillic, Latin, Greek, etc. Everyone picked their own, and chaos ensued.
Unicode and the UTF family
- Unicode is a global table for the world’s characters.
- UTF-8, UTF-16, UTF-32 are different ways to represent Unicode characters in bytes.
- UTF-8 has become the standard for the Web, files, and intersystem exchange.
7. Practice: how encoding affects working with files
Let’s look at a small example of writing and reading strings with different encodings.
Example: writing and reading in different encodings
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        String text = "Hello, world! 😀";

        // Write the file in UTF-8
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("utf8.txt"), StandardCharsets.UTF_8)) {
            writer.write(text);
        }

        // Now try to read it using the wrong encoding
        try (Reader reader = new InputStreamReader(
                new FileInputStream("utf8.txt"), Charset.forName("Windows-1251"))) {
            int c;
            while ((c = reader.read()) != -1) {
                System.out.print((char) c);
            }
        }
        // You’ll see mojibake on the screen!
    }
}
Conclusion: If the encodings don’t match, the text will be corrupted.
8. Encoding and integration with other systems
In real projects, files are often exchanged between different programs, written in different languages and running on different OSes. Each might expect its own encoding. If you don’t agree on this in advance, you’ll get mojibake and hard-to-catch bugs. A typical case: a database stores text in UTF-8, but a program reads the source file as Windows-1251 and loads it into the DB — corrupted characters guaranteed.
9. Common mistakes when working with encodings
Mistake #1: No encoding specified when reading/writing a file.
As a result, the program “works on my machine,” but on a colleague’s machine — mojibake.
Mistake #2: Using the charset-less constructors of FileReader and FileWriter.
They use the default encoding — a trap for beginners. (Since Java 11 these classes also have constructors that accept a Charset; prefer those, or the java.nio.file.Files methods.)
Mistake #3: Wrong encoding of the source file.
If the file was saved in one encoding and read in another, some characters will be distorted or replaced with question marks.
Mistake #4: Character loss when converting between encodings.
If the target encoding doesn’t support all characters (for example, ASCII instead of UTF-8), part of the text will simply disappear.