"Hi, Amigo!"

"Now it's time for another interesting topic: encodings."

"Perhaps you've already heard somewhere that each character has a code (number). That's why the char type can represent both symbols and numbers."

"For example, the code for the letter 'A' in the English alphabet is 65. 'B' is 66, 'C' is 67, and so on. There are unique codes for uppercase letters, lowercase letters, Cyrillic letters, Chinese characters (yeah, lots and lots of codes), numbers, and various symbols. In short, there is a code for practically everything you'd call a character."

"So, every letter and character corresponds to some number?"


"A character can be converted to a number, and a number to a character. Java generally doesn't see a difference between them:"

char c = 'A'; //The code (number) for 'A' is 65
c++; //Now c contains the number 66, which is the code for 'B'
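
"Here's a small sketch of that back-and-forth conversion. The class and method names (CharCodes, codeOf, charOf) are made up for this illustration; only the casts themselves are standard Java:"

```java
final class CharCodes {
    // Returns the numeric code of a character (char widens to int automatically).
    static int codeOf(char c) {
        return c;
    }

    // Returns the character that has the given code (int needs an explicit cast).
    static char charOf(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        char c = 'A';
        System.out.println(codeOf(c));         // 65
        c++;                                   // c now holds 66
        System.out.println(c);                 // B
        System.out.println(charOf(67));        // C
        System.out.println((char) ('a' + 1));  // b
    }
}
```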


"So, an encoding is a set of symbols and their corresponding set of codes. But not just one encoding has been invented—there are quite a few. It wasn't until later that a common universal encoding, Unicode, was invented."

"But no matter how many universal standards are invented, no one is in a hurry to abandon the old ones. And then everything happens just like in this cartoon:"

Character encodings - 1

"Imagine that Vincent and Nick decide to make their own encodings."

"Here's Vincent's encoding:"
Character encodings - 2

"And here's Nick's encoding:"
Character encodings - 3

"They even use the same characters, but the codes for the characters are different."

"When the string 'ABC-123' is written to a file using Vincent's encoding, we get the following set of bytes:"
Character encodings - 4

"And now another program that uses Nick's encoding wants to read the file:"

"Here's what it will read: «345-IJK»."

"And the worst thing is that encodings typically aren't stored anywhere in files, so developers have to guess."

"Well, how do they guess them?"

"That's a different topic. But I want to explain how to work with encodings. As you already know, the size of a char in Java is two bytes. And Java Strings use the Unicode format."

"But Java lets you convert a String into a set of bytes in any encoding that it knows. The String class has special methods for this. Java also has a special Charset class that describes a specific encoding."

1) How do I get a list of all the encodings Java supports?

"There is a special static method called availableCharsets for that. "This method returns a set of pairs (encoding name, object that describes the encoding):"

SortedMap<String,Charset> charsets = Charset.availableCharsets();

"Each encoding has a unique name. Here are some of them: UTF-8, UTF-16, Windows-1251, KOI8-R,…"

2) How do I get the current default encoding?

"There is a special method called defaultCharset for that.

Charset currentCharset = Charset.defaultCharset();
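
"The default matters because the no-argument versions of getBytes and the String constructor rely on it. Here's a minimal sketch that checks this; DefaultCharsetDemo and noArgUsesDefault are invented names. The printed charset name varies by machine and Java version (since JDK 18 the default is UTF-8 unless configured otherwise):"

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetDemo {
    // Compares getBytes() with an explicit getBytes(defaultCharset()).
    static boolean noArgUsesDefault(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        Charset currentCharset = Charset.defaultCharset();
        System.out.println(currentCharset.name()); // machine-dependent
        System.out.println(noArgUsesDefault("Good news, everyone!")); // true
    }
}
```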

3) How do I convert a String to a specific encoding?

"In Java, you can convert a String to a byte array in any encoding that Java knows:"

Method: byte[] getBytes()
Example:
String s = "Good news, everyone!";
byte[] buffer = s.getBytes(); // uses the default encoding

Method: byte[] getBytes(Charset charset)
Example:
String s = "Good news, everyone!";
Charset koi8 = Charset.forName("KOI8-R");
byte[] buffer = s.getBytes(koi8);

Method: byte[] getBytes(String charsetName)
Example:
String s = "Good news, everyone!";
byte[] buffer = s.getBytes("Windows-1251"); // throws UnsupportedEncodingException if the name is unknown
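
"One practical difference between these methods: the version that takes a name throws a checked exception, while the Charset version doesn't. Below is a sketch of both styles. GetBytesDemo, encodeByName and encodeUtf8 are hypothetical names, and wrapping the exception in IllegalArgumentException is just one possible choice:"

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GetBytesDemo {
    // Looks the encoding up by name, so the checked exception has to be handled.
    static byte[] encodeByName(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Uses a Charset constant: no lookup by name and no checked exception.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeUtf8("Hi!")));            // [72, 105, 33]
        System.out.println(Arrays.toString(encodeByName("Hi!", "UTF-8"))); // [72, 105, 33]
    }
}
```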

4) How do I convert a byte array that I read from a file to a String, if I know what its encoding was in the file?

"This is even easier. The String class has a special constructor:"

Method: String(byte[] bytes)
Example:
byte[] buffer = new byte[1000];
String s = new String(buffer); // uses the default encoding

Method: String(byte[] bytes, Charset charset)
Example:
byte[] buffer = new byte[1000];
Charset koi8 = Charset.forName("KOI8-R");
String s = new String(buffer, koi8);

Method: String(byte[] bytes, String charsetName)
Example:
byte[] buffer = new byte[1000];
String s = new String(buffer, "Windows-1251"); // throws UnsupportedEncodingException if the name is unknown
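
"In the examples above the buffer is just an empty placeholder. Here's a sketch of the whole trip with a real file. It assumes a temporary file is fine for the demo; ReadWithCharset, readText and roundTrip are names invented for it:"

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadWithCharset {
    // Reads all bytes of the file and decodes them with the known encoding.
    static String readText(Path file, Charset charset) throws IOException {
        byte[] buffer = Files.readAllBytes(file);
        return new String(buffer, charset);
    }

    // Writes the text to a temporary file in the given encoding and reads it back.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(charset));
                return readText(file, charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Good news, everyone!";
        System.out.println(roundTrip(text, StandardCharsets.UTF_16BE).equals(text)); // true
    }
}
```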

5) How do I convert a byte array from one encoding to another?

"There are many ways. Here's one of the simplest:"

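Charset koi8 = Charset.forName("KOI8-R");
Charset windows1251 = Charset.forName("Windows-1251");

byte[] buffer = new byte[1000];      // bytes that are encoded in KOI8-R
String s = new String(buffer, koi8); // decode them into a String
buffer = s.getBytes(windows1251);    // encode the String again as Windows-1251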
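
"The same two steps can be wrapped in a helper. This is a sketch, not the only way to do it: Transcoder and transcode are made-up names, and it uses UTF-8 and ISO-8859-1 so that the byte values are easy to check. Keep in mind that characters the target encoding can't represent get replaced (typically with '?'):"

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    // Decodes the bytes with the source encoding, then encodes with the target one.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // bytes C3 A9
        byte[] latin1 = transcode(utf8, StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1);                     // byte E9

        System.out.println(utf8.length);   // 2
        System.out.println(latin1.length); // 1
    }
}
```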

"That's what I thought. Thanks for the interesting lesson, Rishi."