1. Introduction
Let's start with the main question — why bother with encodings at all? You might think one file, one encoding, and everything should be fine. But in reality it's more interesting. Very quickly you find out that a file can come from anywhere and in a totally unexpected encoding. And your app, of course, expects something else.
Sometimes you need to integrate with an API or system that rigidly requires one specific encoding and nothing else. Sometimes you dig up a file that's ten years old, open it in a modern editor, and get garbled text. Or, for example, you save a CSV report and want to be sure colleagues on Windows, Mac, and even in Excel can read it properly.
In short, re-encoding is neither rare nor exotic. On the contrary, it's a very real and fairly common task that pops up everywhere: exporting from databases, working with archives, integrating with accounting systems, writing automation scripts. The sooner you get comfortable with it, the fewer surprises later.
From old encoding to new
To re-encode a file you need to do two steps:
- Read the file using the source encoding to get strings (or chars).
- Write those strings to a new file, explicitly specifying the target encoding.
Analogy: imagine translating a book from French to Russian. First you must be able to read French (read the text), then express the same meaning in Russian (write the text).
In C# (and .NET) this is done by passing the appropriate Encoding object to the StreamReader (for reading) and StreamWriter (for writing) constructors.
Overview of encodings and the Encoding class
All the "magical" character conversions are performed using the System.Text.Encoding class.
Here's a table with the main encodings and how to get them in C#:
| Encoding | Description | C# constant |
|---|---|---|
| UTF-8 (no BOM) | Universal; the de facto standard today | new UTF8Encoding(false) |
| UTF-8 with BOM | The same, but with a BOM signature | Encoding.UTF8 or new UTF8Encoding(true) |
| UTF-16 (LE/BE) | "Wide char", two bytes per code unit | Encoding.Unicode (LE), Encoding.BigEndianUnicode (BE) |
| ASCII | 7-bit classic | Encoding.ASCII |
| Windows-1251 | Popular for Cyrillic | Encoding.GetEncoding(1251) |
| ISO-8859-1 | Latin-1 (Western European) | Encoding.GetEncoding("iso-8859-1") |
Note: Legacy encodings are obtained via Encoding.GetEncoding, sometimes by numeric code page (1251), sometimes by name ("windows-1251").
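On modern .NET (Core and .NET 5+), code-page encodings like Windows-1251 are not available out of the box: you first have to register a provider from the System.Text.Encoding.CodePages NuGet package. A minimal sketch of the one-time setup:
using System.Text;

// One-time setup, e.g. at program start
// (requires the System.Text.Encoding.CodePages NuGet package)
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var byCodePage = Encoding.GetEncoding(1251);          // lookup by numeric code page
var byName = Encoding.GetEncoding("windows-1251");    // lookup by name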
2. Re-encoding: step-by-step strategy in C#
Let's figure out how to implement re-encoding in practice.
Algorithm
- Open the source file with StreamReader, explicitly specifying the source encoding.
- Read the text (either fully or line-by-line — depends on file size).
- Open the output file with StreamWriter, specifying the desired target encoding.
- Write the text to the output file.
- Don't forget to close streams — use using for safety!
Example: re-encoding Windows-1251 → UTF-8
Say we have a file "input-1251.txt" encoded in Windows-1251, and we want "output-utf8.txt" in "clean" UTF-8 (no BOM).
using System.Text; // Encoding, UTF8Encoding

// On .NET Core / .NET 5+, register the code-page provider once at startup
// (System.Text.Encoding.CodePages package), otherwise GetEncoding(1251) throws
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Open the source with the source encoding
using var reader = new StreamReader("input-1251.txt", Encoding.GetEncoding(1251));
// new UTF8Encoding(false) gives "clean" UTF-8 without a BOM;
// the static Encoding.UTF8 would add a BOM here
using var writer = new StreamWriter("output-utf8.txt", false, new UTF8Encoding(false));
string line;
// Read line-by-line: convenient for large files
while ((line = reader.ReadLine()) != null)
{
    writer.WriteLine(line);
}
Console.WriteLine("The file was successfully re-encoded from Windows-1251 to UTF-8!");
This code ensures each line is correctly converted from one encoding to another.
Schematically it looks like this:
┌────────────────────────────────┐
│ File "input-1251.txt" (1251)   │
└───────────────┬────────────────┘
                ▼
┌────────────────────────────────┐
│ StreamReader (Encoding 1251)   │
└───────────────┬────────────────┘
                ▼
┌────────────────────────────────┐
│ string lines in memory         │
└───────────────┬────────────────┘
                ▼
┌────────────────────────────────┐
│ StreamWriter (Encoding UTF-8)  │
└───────────────┬────────────────┘
                ▼
┌────────────────────────────────┐
│ "output-utf8.txt" (UTF-8)      │
└────────────────────────────────┘
3. Practical example: an encoding converter
Let's make the task a bit more complex: write a simple converter program where the user can choose the source and target files and the encodings.
using System;
using System.IO;
using System.Text;

// Register code-page encodings (System.Text.Encoding.CodePages package)
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Console.WriteLine("Enter path to the source file:");
string inputPath = Console.ReadLine();
Console.WriteLine("Source file encoding (for example, 1251, utf-8):");
string sourceEncodingName = Console.ReadLine();
Console.WriteLine("Enter path to the output file:");
string outputPath = Console.ReadLine();
Console.WriteLine("Encoding for saving (for example, utf-8, 1251):");
string destEncodingName = Console.ReadLine();

Encoding sourceEncoding = ResolveEncoding(sourceEncodingName);
Encoding destEncoding = ResolveEncoding(destEncodingName);

using var reader = new StreamReader(inputPath, sourceEncoding);
using var writer = new StreamWriter(outputPath, false, destEncoding);
string line;
while ((line = reader.ReadLine()) != null)
{
    writer.WriteLine(line);
}
Console.WriteLine("Done! Check the result.");

// Accepts both numeric code pages ("1251") and names ("utf-8");
// Encoding.GetEncoding(string) would throw on the input "1251"
static Encoding ResolveEncoding(string name) =>
    int.TryParse(name, out int codePage)
        ? Encoding.GetEncoding(codePage)
        : Encoding.GetEncoding(name);
Tip: If you're not sure which encoding a file uses, check the documentation or try several options. Sometimes a file's encoding can only be guessed experimentally if no one left a README.
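For the experimental route, one low-tech trick is to print the first line of the file in several candidate encodings and see which one reads as normal text. A rough sketch (the file name mystery.txt is made up for illustration):
using System;
using System.IO;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var candidates = new[] { Encoding.UTF8, Encoding.GetEncoding(1251), Encoding.GetEncoding("iso-8859-1") };
foreach (var enc in candidates)
{
    // detectEncodingFromByteOrderMarks: false so each candidate is honored as-is
    using var reader = new StreamReader("mystery.txt", enc, detectEncodingFromByteOrderMarks: false);
    // The candidate that produces readable text is probably the right one
    Console.WriteLine($"{enc.WebName,-12}: {reader.ReadLine()}");
}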
4. Important nuances: BOM, "invisible" characters and special cases
How to add a BOM when saving UTF-8?
A small but important detail: the static Encoding.UTF8 property actually adds a BOM (Byte Order Mark) when passed to a StreamWriter; it's a StreamWriter created without an explicit encoding that writes UTF-8 without a BOM. If you want to request a BOM explicitly, use:
// true — means "with BOM"
var utf8WithBom = new UTF8Encoding(true);
using var writer = new StreamWriter("with-bom.txt", false, utf8WithBom);
writer.WriteLine("Text with BOM");
The file will now start with three "invisible" bytes: 0xEF, 0xBB, 0xBF.
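You can see those bytes for yourself by dumping the start of the file. A quick sketch:
using System;
using System.IO;

byte[] head = File.ReadAllBytes("with-bom.txt");
// Prints something like "EF-BB-BF-54-...": the BOM, then the text
Console.WriteLine(BitConverter.ToString(head, 0, Math.Min(8, head.Length)));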
When is a BOM evil?
- Some tools that consume CSV or plain text don't expect a BOM and show it as stray "ï»¿" characters at the start of the data. (Excel is the opposite case: it actually relies on the BOM to recognize UTF-8 in CSV; see the table below.)
- On Unix-like systems (Linux) a BOM sometimes breaks file processing: shell scripts and classic utilities treat it as ordinary bytes.
- For JSON files a BOM is almost always evil: many parsers choke on it!
Tip: Make a conscious choice whether you need a BOM, and add it only when necessary.
Stray characters and magic bytes
If a program throws a format error when opening a "re-encoded" file, you may not have accounted for "invisible" characters at the start (like a BOM), or you guessed the encoding wrong (for example, read the file as ASCII when it actually contains Cyrillic in Windows-1251). Always check which encodings your app, colleagues, and servers expect.
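If a receiving parser chokes on a BOM, you can strip it by hand before handing the file over. A sketch, assuming the file is UTF-8 with a BOM (input.json is a made-up name):
using System.IO;

byte[] bytes = File.ReadAllBytes("input.json");
// The UTF-8 BOM is the byte sequence EF BB BF
if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
{
    File.WriteAllBytes("input.json", bytes[3..]); // rewrite the file without the BOM
}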
5. Useful nuances
Working with large files: why line-by-line, not all at once?
You could write simply:
string allText = File.ReadAllText("input.txt", Encoding.GetEncoding(1251));
File.WriteAllText("output.txt", allText, Encoding.UTF8);
This approach works for small and medium files (say, up to 50 MB). But if the file is large, the whole text will be loaded into RAM — and if the file is a couple of GB, you'll be in trouble. So for robustness we use line-by-line reading/writing.
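By the way, the streaming variant doesn't even need an explicit reader loop: File.ReadLines enumerates lines lazily, so memory use stays flat no matter how big the file is. An equivalent sketch under the same assumptions as above:
using System.IO;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using var writer = new StreamWriter("output.txt", false, new UTF8Encoding(false));
// File.ReadLines is lazy: lines are pulled one at a time, never all at once
foreach (string line in File.ReadLines("input.txt", Encoding.GetEncoding(1251)))
{
    writer.WriteLine(line);
}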
Re-encoding between less common encodings
Suppose you suddenly get a file in ISO-8859-1 (old MySQL dumps love that encoding), and you need it in UTF-8.
// Latin-1 (code page 28591) is built into .NET, no provider registration needed
var sourceEnc = Encoding.GetEncoding("iso-8859-1");
// Note: Encoding.UTF8 writes a BOM; pass new UTF8Encoding(false) if you don't want one
var destEnc = Encoding.UTF8;
using var reader = new StreamReader("data-latin.txt", sourceEnc);
using var writer = new StreamWriter("data-unicode.txt", false, destEnc);
string line;
while ((line = reader.ReadLine()) != null)
{
    writer.WriteLine(line);
}
This approach works with any encoding supported by .NET.
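If you're curious exactly which encodings your runtime supports (after registering the code-page provider), you can enumerate them. A quick sketch:
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Lists every encoding the current runtime can provide
foreach (EncodingInfo info in Encoding.GetEncodings())
{
    Console.WriteLine($"{info.CodePage,6}  {info.Name}");
}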
Do NOT re-encode binary files!
The program above is intended exclusively for text files. If a file is binary (for example, an image, an audio file, an archive), trying to "read it as text and write it back" will definitely mangle the bytes and leave the file unusable. Re-encoding simply doesn't apply to binary files: they have no text "encoding" to begin with.
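There is no bulletproof way to tell text from binary, but a common rough heuristic is to look for NUL bytes, which almost never occur in text files. A sketch of the idea (suspect.dat is a made-up name), not a guarantee:
using System;
using System.IO;
using System.Linq;

byte[] sample = File.ReadAllBytes("suspect.dat");
// Heuristic only: a 0x00 byte early in the file usually means binary data.
// Caveat: UTF-16 text also contains 0x00 bytes, so it would be flagged too.
bool looksBinary = sample.Take(8000).Any(b => b == 0x00);
Console.WriteLine(looksBinary ? "Probably binary, don't re-encode" : "Probably text");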
Mapping table: which encodings are useful for what
| Encoding | When to use |
|---|---|
| UTF-8 | General files, web, modern apps |
| UTF-8 with BOM | For compatibility with Excel (CSV) and older Notepad |
| Windows-1251 | For "old-school" local Russian apps |
| ASCII | For files that contain only English text |
| UTF-16 | For Windows internals and other special cases |
Helper "tricks" and tips
How to find out a file's encoding?
- Special editors (Notepad++, Visual Studio Code) often detect and show the encoding.
- If there's no BOM and the file opens without errors but the letters come out garbled, most likely it was opened with the wrong encoding.
Can encoding detection be automated?
- .NET has no "magic wand" that detects any file's encoding with 100% certainty. The usual rule: if there's a BOM, it's clearly a UTF encoding with a BOM; if not, you have to guess from the content or try variants (see the sketch after this list).
What if some lines read fine and others are "garbled"?
- The file may be "composite" (pieces saved in different encodings) or corrupted. Check whether the file was written to by different programs.
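One thing .NET does give you out of the box: StreamReader can sniff BOM-based encodings automatically. A sketch of checking what it detected (unknown.txt is a made-up name):
using System;
using System.IO;
using System.Text;

// detectEncodingFromByteOrderMarks: true makes StreamReader inspect the BOM;
// the encoding passed in (UTF-8 here) is only a fallback when no BOM is found
using var probe = new StreamReader("unknown.txt", Encoding.UTF8, detectEncodingFromByteOrderMarks: true);
probe.Peek(); // force the first read so detection actually happens
Console.WriteLine($"Detected: {probe.CurrentEncoding.WebName}");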
6. Typical mistakes and how to avoid them
To wrap up, let's look at the most common mistakes:
Wrong source encoding. If you think the file is UTF-8 but it's actually Windows-1251 — you'll get gibberish instead of Cyrillic. Verify the source encoding; if unsure open it in Notepad++ or similar editors that show the real encoding.
BOM in the wrong place. Sometimes adding a BOM breaks parsing, sometimes lack of BOM leads to wrong encoding detection on the other side.
Reading the whole file into memory. If the file is large, use line-by-line reading — otherwise the program will "eat" all RAM on a big text dataset.
Writing without specifying encoding. By default StreamWriter uses UTF-8 without BOM. If you want a different variant — explicitly set the encoding.
Wrong use of GetEncoding. If you typo the name, for example, "utf8" instead of "utf-8", you'll get an exception. Use correct names (or codes, e.g. 1251).