
Main types of encodings: UTF-8, UTF-16, ASCII

C# SELF, Level 37, Lesson 1

1. Introduction

We've already established that however clever computers are, they don't inherently know what the letter "A" or the symbol "⌘" is. They understand only zeros and ones, and to turn those into the symbols humans read, they need a translator: an encoding.

The history of encodings is a history of compromises and evolution. At first it was simple, then it got more complicated, and then, finally, a more-or-less universal standard appeared. Let's walk through that timeline.

The beginning of the story

Let's start at the origins. In the previous lesson we already met the ancestor of text encodings: ASCII (pronounced "ASK-ee"). As a reminder, it stands for American Standard Code for Information Interchange. The name alone tells you who it was made for, and why it's "American".

ASCII was developed in the 1960s and became the first widely used character encoding standard. It is a set of 128 characters:

  • Latin letters (uppercase and lowercase): A-Z, a-z
  • Digits: 0-9
  • Punctuation marks: .,!?"' and so on
  • Some control characters: newline, tab, etc.

Each of these 128 characters was encoded in a single byte, using only 7 of the 8 available bits (the highest bit was usually left unused or reserved for parity checking). This is very compact and efficient for English.


Example:
Character 'A' in ASCII is encoded as byte 0x41 (binary 01000001)
Character '!' in ASCII is encoded as byte 0x21 (binary 00100001)
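If you'd like to verify these values yourself, here's a minimal sketch using Encoding.ASCII (the expected output is shown in the comments):

using System;
using System.Text;

byte[] bytes = Encoding.ASCII.GetBytes("A!");
// Print each byte in binary form
foreach (byte b in bytes)
    Console.WriteLine(Convert.ToString(b, 2).PadLeft(8, '0'));
// Output:
// 01000001  ('A' = 0x41)
// 00100001  ('!' = 0x21)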

Limitations:
The main limitation of ASCII is obvious: it's tailored to English. If you want to write something in Russian («Привет»), German («Grüße») or Chinese, ASCII won't help: its table simply doesn't contain those characters. This led to the emergence of many so-called code pages, which extended the 128 ASCII characters to 256 by using that eighth bit. For Russian, for example, there were code pages like CP1251 (Windows Cyrillic), KOI8-R and others. The problem was that these code pages were incompatible with each other: the same byte could mean completely different characters in different code pages! A real Tower of Babel.

Practical usage today:
Plain ASCII is rarely used for general text files nowadays, except for very specific needs or legacy systems. However, its legacy lives on: many modern encodings, as we'll see, are backward compatible with ASCII.

Let's try writing and reading something in ASCII, then add some Russian letters to see what happens.

Create a new console project in JetBrains Rider and call it, say, FileEncodingExplorer.

using System;
using System.IO;
using System.Text;

string file = "ascii.txt";
string asciiText = "Hello, world!";
string cyrillicText = "Привет, мир!";

// Writing in ASCII. The explicit using block disposes (and flushes)
// the writer, releasing the file before we reopen it for reading.
using (var writer = new StreamWriter(file, false, Encoding.ASCII))
{
    writer.WriteLine(asciiText);
    writer.WriteLine(cyrillicText);
}

// Reading from ASCII
using (var reader = new StreamReader(file, Encoding.ASCII))
{
    string content = reader.ReadToEnd();
    Console.WriteLine("File contents:");
    Console.WriteLine(content);
}

Console.WriteLine("\nCyrillic letters turned into '?', because ASCII doesn't support Cyrillic!");

When you run this code, you'll see that the English part reads back fine while the Russian turns into question marks (?). Encoding.ASCII doesn't know how to convert Cyrillic characters to bytes, so StreamWriter replaces every character missing from the ASCII table with a "safe" substitute (by default, ?) as it writes. The reverse can happen too: bytes that encode Russian letters in some other encoding get interpreted by ASCII as entirely different characters. Either way, this demonstrates why it's important to use the correct encoding!
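You can see the substitution directly, without going through a file. A quick sketch (the values in the comments are what the standard replacement fallback produces):

// Encoding.ASCII substitutes '?' (byte 0x3F) for any character
// outside its 128-character table.
byte[] fallback = Encoding.ASCII.GetBytes("я");
Console.WriteLine(fallback[0]);        // 63
Console.WriteLine((char)fallback[0]);  // ?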

2. UTF-8: King of the internet and flexibility

Now we come to one of the most important and arguably the most popular encodings today — UTF-8. This is the encoding that most of the internet, Linux systems and modern applications use.

What is it?
UTF-8 (Unicode Transformation Format, 8-bit) is a Unicode encoding that, among other things, solved the inefficiency of UTF-16 (covered below) for English text. UTF-8 is a variable-length encoding built around a very clever idea:

  • Plain ASCII characters (codes 0 through 127) are encoded as one byte. Best of all, these bytes are identical to their ASCII representation, which makes UTF-8 backward compatible with ASCII.
  • Other characters are encoded using 2 to 4 bytes:
    • Cyrillic — usually 2 bytes.
    • Many European characters with diacritics, Arabic, Hebrew, Greek — 2 bytes.
    • Chinese/Japanese/Korean ideographs — often 3 bytes.
    • Rare characters and some emoji — 4 bytes.

Examples of byte representations in UTF-8:

  • Character 'A' (ASCII): 01000001 (1 byte)
  • Character 'я' (Russian): 11010001 10001111 (2 bytes)
  • Character '€' (Euro): 11100010 10000010 10101100 (3 bytes)
  • Character '😂' (emoji): 11110000 10011111 10011000 10000010 (4 bytes)
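Here's a small sketch to verify these byte patterns yourself (it assumes the usings from the first example, plus System.Linq):

using System.Linq;

foreach (string s in new[] { "A", "я", "€", "😂" })
{
    byte[] bytes = Encoding.UTF8.GetBytes(s);
    string binary = string.Join(" ",
        bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    Console.WriteLine($"{s}: {bytes.Length} byte(s): {binary}");
}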

Why is UTF-8 king?

  1. Efficiency: Very compact for text that contains many ASCII characters (typical for English, source code, config files).
  2. Backward compatibility with ASCII: If you read a UTF-8 file that contains only ASCII characters, you can read it as ASCII and everything will work!
  3. No BOM (usually): Unlike UTF-16, UTF-8 usually doesn't need a BOM. When one does appear (the byte sequence EF BB BF), it's an optional "feature" that sometimes causes issues (for example when parsing some formats or in Linux scripts).

Drawbacks:

  • A variable-length encoding complicates some operations (like jumping straight to the N-th character without scanning from the start). In C# this matters less: once a file is read, the text lives in a string, which is stored as UTF-16 in memory regardless of the file's encoding.

Practical usage:

  • Web pages (HTML, CSS, JavaScript),
  • APIs (JSON, XML),
  • Configuration files,
  • Source code for most programming languages,
  • Linux/Unix systems.

Let's write and read a file in UTF-8.

string file = "utf8.txt";
string text = "Hello, mir! 😀 €";

// Write in UTF-8 (default without BOM)
File.WriteAllText(file, text, Encoding.UTF8);

// Read from UTF-8
string readText = File.ReadAllText(file, Encoding.UTF8);
Console.WriteLine(readText); // Everything reads correctly!

When you run this code and compare file sizes, you'll see that for mixed text utf8.txt is usually smaller than the equivalent UTF-16 file; for pure English text it would be nearly the same size as ASCII (plus the 3-byte BOM that Encoding.UTF8 writes).

3. UTF-16: Unicode for almost everyone

The "Tower of Babel" problem with code pages became a real headache for developers, especially when applications went global. A universal solution was needed. It arrived — Unicode. Unicode is not an encoding itself, but a huge table that assigns a unique numeric code (code point) to every known character.

What is it?
UTF-16 (Unicode Transformation Format, 16-bit) is an encoding that originally assumed every Unicode character would fit in two bytes (16 bits).

  • Most characters (the Basic Multilingual Plane, or BMP: code points up to U+FFFF) are encoded in 2 bytes.
  • Characters outside the BMP use surrogate pairs, taking 4 bytes. So UTF-16 is also variable-length, even though it's often thought of as "2 bytes per character".

Byte order (Endianness) and BOM:

  • Big-Endian (BE): the most significant byte comes first.
  • Little-Endian (LE): the least significant byte comes first.
  • To let the reader program know the order, a BOM (Byte Order Mark) is often placed at the start of the file:
    • For UTF-16 LE: FF FE
    • For UTF-16 BE: FE FF
    In C# and Windows the default is UTF-16 LE.
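A minimal sketch to see the BOM with your own eyes (the file name bom.txt is just an illustration; assumes the earlier usings plus System.Linq):

// Encoding.Unicode is UTF-16 LE and writes a BOM,
// so the first two bytes of the file are FF FE.
File.WriteAllText("bom.txt", "A", Encoding.Unicode);
byte[] raw = File.ReadAllBytes("bom.txt");
Console.WriteLine(string.Join(" ", raw.Select(b => b.ToString("X2"))));
// Output: FF FE 41 00  (BOM, then 'A' as 41 00 in little-endian)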

Advantages:

  • Supports the vast majority of the world's characters.
  • Easy to work with characters within the BMP (fixed 2 bytes).

Drawbacks:

  • Inefficient for English text: each ASCII character takes 2 bytes.
  • The presence of a BOM can cause issues if the reader doesn't expect it.

Practical usage:
UTF-16 is widely used internally by Windows and, for example, by Java and .NET for in-memory string representation. Older versions of Windows Notepad saved files as UTF-16 LE with a BOM when you chose the "Unicode" option.

string file = "utf16.txt";
string text = "Привет, мир! 👋";

// Write in UTF-16 (default Little-Endian, with BOM)
File.WriteAllText(file, text, Encoding.Unicode);

// Read from UTF-16
string readText = File.ReadAllText(file, Encoding.Unicode);
Console.WriteLine(readText); // Everything reads correctly!

Console.WriteLine($"File size: {new FileInfo(file).Length} bytes");

After running it you'll see that all characters display correctly. Note the size, though: English-only text in UTF-16 takes about twice as much space as in ASCII or UTF-8 (for characters in the ASCII range).
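You don't even need files to compare: Encoding.GetByteCount tells you how many bytes a string would take in a given encoding (a quick sketch; the counts exclude any BOM):

string english = "Hello, world!";
Console.WriteLine(Encoding.ASCII.GetByteCount(english));   // 13
Console.WriteLine(Encoding.UTF8.GetByteCount(english));    // 13
Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 26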

4. Summary comparison table of encodings

To systematize our knowledge, let's put key characteristics into a single table.

| Encoding | Min bytes per character | Max bytes per character | Compatibility with ASCII (direct) | Writes BOM (by default in .NET) | Usage examples |
|----------|------------------------|------------------------|-----------------------------------|---------------------------------|----------------|
| ASCII | 1 | 1 | Full | No | Legacy systems, very simple text data, internal protocols |
| UTF-16 | 2 | 4 | No | Yes (Encoding.Unicode) | Internal string representation in Windows and Java; Windows text files |
| UTF-8 | 1 | 4 | Full | Yes with Encoding.UTF8; no when no encoding is specified (modern .NET defaults to BOM-less UTF-8) | Web (HTML, JSON), config files, source code, Linux/Unix |

A small note about Encoding.UTF8 in .NET:
The Encoding.UTF8 property returns a UTF8Encoding that emits a BOM when used with StreamWriter or File.WriteAllText, in .NET Framework and modern .NET alike. What did change in modern .NET (Core/5+): when you don't specify an encoding at all, writers default to UTF-8 without a BOM. To control the behavior explicitly, use new UTF8Encoding(true) for a BOM or new UTF8Encoding(false) for none.
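A hedged sketch of choosing the BOM behavior explicitly instead of relying on defaults (the file names are just illustrations):

// encoderShouldEmitUTF8Identifier controls whether a BOM is written
var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
File.WriteAllText("with_bom.txt", "test", utf8WithBom); // starts with EF BB BF
File.WriteAllText("no_bom.txt", "test", utf8NoBom);     // starts with 74 ('t')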

5. How to specify encoding in C#

As you probably noticed in the examples, to tell StreamReader or StreamWriter which "dictionary" to use, we pass them an object from the System.Text.Encoding class.

System.Text.Encoding provides ready-made options:

  • Encoding.ASCII: for working with ASCII.
  • Encoding.Unicode: UTF-16 LE (with BOM).
  • Encoding.UTF8: UTF-8 (no BOM by default in modern .NET).

Other encodings are available via Encoding.GetEncoding (for example, "windows-1251", "koi8-r"), but the focus now is on Unicode.
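One caveat worth knowing: on modern .NET (Core/5+), legacy code pages such as windows-1251 aren't bundled with the runtime. You need the System.Text.Encoding.CodePages NuGet package and a one-time provider registration; a sketch:

// Register the legacy code-page provider once at startup
// (requires the System.Text.Encoding.CodePages package).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp1251 = Encoding.GetEncoding("windows-1251");
Console.WriteLine(cp1251.EncodingName);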

// Write in UTF-8
using var writer = new StreamWriter("my_file.txt", false, Encoding.UTF8);
writer.WriteLine("Kakoy-to tekst.");

// Read from UTF-16
using var reader = new StreamReader("another_file.txt", Encoding.Unicode);
string content = reader.ReadToEnd();
Console.WriteLine(content);

That's the whole secret! StreamReader and StreamWriter handle all the work of converting characters to bytes and back, using the rules of the chosen encoding.

6. Encoding problems: "mojibake"

We already saw "mojibake" when we tried to write Russian text in ASCII. But what if you wrote a file in one encoding and try to read it in another? That's where the real fun begins!

Imagine a message written in Russian and saved in UTF-8, while your client decides to read it as CP1251. The byte sequences will be interpreted by the wrong table, and instead of «Привет, мир!» you'll get mojibake.

The cause is always the same: a mismatch between the encoding used for writing and the one used for reading. Always use the same encoding on both ends, unless you're intentionally re-encoding.

string file = "mismatch.txt";
string russianText = "Привет, мир!";

// Writing in UTF-8 (correct!)
File.WriteAllText(file, russianText, Encoding.UTF8);

// Wrong reading: trying to read a UTF-8 file as ASCII
string readAsAscii = File.ReadAllText(file, Encoding.ASCII);

Console.WriteLine($"Original: {russianText}");
Console.WriteLine($"Read as ASCII: {readAsAscii}"); // Here's where you'll get mojibake!

Run the program and you'll see how Cyrillic turns into question marks or other meaningless symbols. In the next lesson we'll talk about how to work with encodings more flexibly and avoid such problems, and how to detect the encoding of a file you're reading.
