Unicode and UTF-8: The Text Encoding Minefield

Introduction — The Illusion of "One Character"

When you start programming, you learn that a string is "a sequence of characters." And most of the time, that model serves you fine. Back when English-speaking developers only dealt with ASCII, it was actually true. One character was one byte, string length equaled character count, and reversing a string was the same as reversing an array.

But the text we handle is not only English. The moment Korean, Japanese, emoji, combining accents, and right-to-left scripts enter the picture, the simple "one character" model collapses. And this collapse happens quietly. No compile error, and most tests still pass. Then a user enters a name with an emoji in it, or a Korean user searches on Linux for a filename created on macOS, and suddenly everything is misaligned.

This post maps out that minefield. We will walk through how bytes, code points, and grapheme clusters are three distinct layers, why "👨‍👩‍👧".length lies, and why the difference between UTF-8 and UTF-16 turns into real production bugs. For those of us who work with Korean and Japanese, this is not somebody else's problem.

Three Layers — Bytes, Code Points, Graphemes

The first thing to internalize is that text is not one layer but three. Almost every Unicode bug starts by conflating these three layers.

  • Byte: the unit of storage and transmission — what actually flows through files and the network. An encoding (like UTF-8) turns code points into bytes.
  • Code point: the number Unicode assigns to each character, written like U+AC00 (가) or U+1F600 (😀). Unicode is a giant dictionary of these numbers.
  • Grapheme cluster: the unit a human perceives as "one character." The trap is that this can be made of several code points.

Take the family emoji. To the human eye it is one character. But behind it sit several code points, and those code points are in turn encoded as several bytes.

What a human sees:  👨‍👩‍👧   (one grapheme = "one character")
Code points:        U+1F468 U+200D U+1F469 U+200D U+1F467  (5 of them)
UTF-8 bytes:        18 bytes  (3 emoji × 4 bytes + 2 ZWJ × 3 bytes)

All three counts differ. One grapheme, five code points, eighteen bytes. And in many languages, the built-in length reports only one of these layers, or none of them.
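As a rough check of these counts, the sketch below measures one string at each layer. The helper name measureText is made up for this post, and it assumes a runtime that provides TextEncoder and Intl.Segmenter (recent browsers and Node.js).

```javascript
// Hypothetical helper: measure one string at every layer.
function measureText(s) {
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    bytes: new TextEncoder().encode(s).length, // UTF-8 bytes
    codeUnits: s.length,                       // UTF-16 code units
    codePoints: [...s].length,                 // Unicode code points
    graphemes: [...seg.segment(s)].length,     // user-perceived characters
  };
}

// man + ZWJ + woman + ZWJ + girl, written with escapes so the source stays unambiguous
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(measureText(family));
// → { bytes: 18, codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Writing the emoji as escapes is a deliberate choice: it keeps the example independent of how an editor or font displays the sequence.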

Why "👨‍👩‍👧".length Lies

Run this in JavaScript.

"👨‍👩‍👧".length        // 8
[..."👨‍👩‍👧"].length   // 5

.length returns 8. There is one grapheme and five code points — so why 8? A JavaScript string's .length counts the number of UTF-16 code units. Not graphemes, not code points. Three of this emoji's five code points (the person emoji) are each represented in UTF-16 as two code units (a surrogate pair), and the two ZWJs are one each, so 3×2 + 2×1 = 8.

In other words, what .length counts is neither "the number of characters a human sees" nor "the number of Unicode characters," but an implementation detail of the encoding underneath. The spread operator [...str] and for...of iterate by code point, so they return 5. And to get the "1" a human expects, you need a grapheme cluster segmenter (like Intl.Segmenter).

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment("👨‍👩‍👧")].length   // 1

The lesson is clear. The question "what is the length of this string?" has more than one answer. You must first decide whether you mean the number of bytes needed for storage, the number of code points, or the number of characters the user counts. If you miss this distinction in UI logic — a tweet character limit, an input-field max length, cursor movement — a bug is guaranteed.

UTF-8 vs UTF-16 vs UTF-32

Unicode is just a "numbering book"; how you turn those numbers into actual bytes is the encoding. Let's compare the three main ones.

Encoding | Code unit size | Bytes per code point | Notes
UTF-8    | 8-bit          | 1–4 bytes (variable) | ASCII-compatible, web standard, space-efficient
UTF-16   | 16-bit         | 2 or 4 bytes         | BMP is 2 bytes, beyond it a surrogate pair
UTF-32   | 32-bit         | always 4 bytes       | fixed width, simple but wasteful

UTF-8 is the de facto standard today. The ASCII range (0–127) is encoded as-is in a single byte, so English text is byte-for-byte identical to ASCII. Above that, Latin extensions, Korean, and emoji grow to 2, 3, and 4 bytes. A single Korean syllable is 3 bytes in UTF-8, and most Japanese characters are 3 bytes too. That is why Korean and Japanese text produces larger files than English.
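To see the 1-to-4-byte growth concretely, here is a small sketch that encodes one code point at a time. utf8Widths is a hypothetical helper name, and the sample characters are chosen for illustration only.

```javascript
// Hypothetical helper: UTF-8 byte count for each code point in a string.
function utf8Widths(s) {
  const encoder = new TextEncoder();
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...s].map(ch => ({ char: ch, bytes: encoder.encode(ch).length }));
}

// A (ASCII), é (U+00E9), 가 (U+AC00), あ (U+3042), 😀 (U+1F600)
const widths = utf8Widths("A\u00E9\uAC00\u3042\u{1F600}");
console.log(widths.map(w => w.bytes)); // → [ 1, 2, 3, 3, 4 ]
```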

UTF-16 is the internal string representation of Java (the JVM), JavaScript, Windows, and C#. Characters in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) are 2 bytes, but anything beyond it (most emoji) is represented as two 16-bit units, a surrogate pair. This is exactly the root of the .length lie we just saw.

UTF-32 represents every code point as 4 bytes, always. Its advantage is that indexing by code point becomes a simple O(1) operation (indexing by grapheme still is not), but it wastes so much space for typical text that it is almost never used for storage or transmission.

Surrogate Pairs — UTF-16's Original Sin

Surrogate pairs deserve a closer look. Unicode originally assumed 16 bits would be enough for every character (65,536 of them). That soon proved insufficient, and the code point space was extended up to U+10FFFF. UTF-16, already locked into 16-bit units, had to represent these extended characters somehow.

The solution is the surrogate pair. The range U+D800–U+DFFF is reserved as a special zone that is "not a character on its own, only meaningful in pairs," and characters beyond the BMP are expressed as a combination of two units from this zone.
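The arithmetic behind the pairing is short enough to sketch. The function name toSurrogatePair is made up for this post; the constants (0x10000, 0xD800, 0xDC00, 10-bit halves) come from the UTF-16 definition.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x1F600);
console.log(highUnit.toString(16), lowUnit.toString(16));          // → d83d de00
console.log(String.fromCharCode(highUnit, lowUnit) === "\u{1F600}"); // → true
```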

😀  =  U+1F600
UTF-16:  0xD83D 0xDE00   (high surrogate + low surrogate)
"only together do these two make one 😀"

The problem is that these two units can be split by accident. In JavaScript, if you slice out the first "character" of an emoji with str.charAt(0) or str[0], you get half a surrogate: a broken character (usually rendered as �).

const s = "😀";
s[0];              // '\uD83D' — half a surrogate, broken character
s.substring(0, 1); // broken the same way
s.codePointAt(0);  // 128512 — the correct code point

This is why, when you slice or index a string, you must respect code point boundaries rather than code unit boundaries. It is a common accident for naive code that truncates a username to 20 "characters" to chop an emoji in half.
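One way to avoid that accident is to truncate by grapheme cluster instead of by code unit. The following is a minimal sketch, assuming Intl.Segmenter is available; truncateGraphemes is a hypothetical name, not a built-in.

```javascript
// Hypothetical helper: keep at most `max` user-perceived characters.
function truncateGraphemes(s, max, locale = "en") {
  const seg = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of seg.segment(s)) {
    if (count === max) break; // stop before adding a partial or extra cluster
    out += segment;
    count += 1;
  }
  return out;
}

const displayName = "ab\u{1F600}cd"; // "ab😀cd"
console.log(displayName.slice(0, 3) === "ab\uD83D");               // → true: slice left half a surrogate
console.log(truncateGraphemes(displayName, 3) === "ab\u{1F600}");  // → true: the emoji survives intact
```

Note that a grapheme limit and a storage limit are different questions: a database column sized in bytes still needs a separate byte-length check.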

Normalization — The é That Looks the Same but Isn't

Now we step on a mine that matters especially to Korean and Japanese users: normalization.

Here is where the trouble starts. Unicode often provides more than one way to represent the same character. Consider the French é.

  • Composed (NFC): U+00E9 — a single code point that is "é" itself.
  • Decomposed (NFD): U+0065 U+0301 — an "e" (U+0065) followed by a combining accent (U+0301).

On screen they both look exactly like é. But at the byte level they are entirely different data. So this happens.

const a = "\u00E9";      // NFC: 1 code point (é)
const b = "e\u0301";     // NFD: 2 code points (e + combining acute accent)
a === b;                 // false !
a.length;                // 1
b.length;                // 2
a.normalize() === b.normalize();  // true (both normalized to NFC)

To the eye it is the same character, yet === fails. This is a common culprit behind the ghost bug where you search a database for a username that "is clearly there but won't come up." If one side is stored as NFC and the other as NFD, the string match fails.

Unicode defines four normalization forms.

Form | Name                        | Method
NFC  | Canonical Composition       | decompose, then recompose as much as possible (the most common storage form)
NFD  | Canonical Decomposition     | decompose as much as possible
NFKC | Compatibility Composition   | normalize compatibility characters too, then compose
NFKD | Compatibility Decomposition | normalize compatibility characters too, then decompose

The practical rule is simple: normalize incoming strings to a single form (usually NFC) before storing them. Then comparison, search, and duplicate detection become stable.
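As a minimal sketch of that rule, the example below funnels every string through one normalizing function before it is stored or compared. toStorageForm and the users map are hypothetical; only String.prototype.normalize is a real built-in.

```javascript
// Hypothetical helper: the single form this application stores and compares.
function toStorageForm(input) {
  return input.normalize("NFC");
}

const composed = "caf\u00E9";    // é as one code point
const decomposed = "cafe\u0301"; // e + combining acute accent

console.log(composed === decomposed);                               // → false
console.log(toStorageForm(composed) === toStorageForm(decomposed)); // → true

// A lookup keyed on the normalized form works whichever form the input arrives in.
const users = new Map([[toStorageForm(decomposed), { id: 1 }]]);
console.log(users.has(toStorageForm(composed)));                    // → true

// The compatibility forms go further: NFKC folds the "ﬁ" ligature (U+FB01) into "fi".
console.log("\uFB01".normalize("NFKC") === "fi");                   // → true
console.log("\uFB01".normalize("NFC") === "\uFB01");                // → true: NFC leaves it alone
```

Because NFKC and NFKD change what the text contains, they tend to suit search keys better than the stored display value.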

The macOS vs Linux é War — Korean Is Especially at Risk

Normalization is not somebody else's problem because different operating systems prefer different forms. In particular, Apple's file system has historically stored filenames in a form close to NFD, while Linux and Windows generally keep filenames exactly as they were given, which in practice usually means NFC.

This difference shows up dramatically in Korean. The syllable "각" can be represented two ways.

"각" (NFC):  U+AC01                     (1 code point, precomposed syllable)
"각" (NFD):  U+1100 U+1161 U+11A8        (ㄱ+ㅏ+ㄱ, decomposed into 3 jamo)

Create a file with a Korean name (for example, "보고서.hwp") on macOS, zip it or commit it to git, then look for that file by name on a Linux server, and it may not show up, because the bytes differ. A pure-ASCII name such as "report_final.hwp" is unaffected, since ASCII text is identical in every normalization form. If you are a Korean developer, you have probably hit "I unzipped a macOS zip on a server and the Korean filenames were broken or unsearchable" at least once. The culprit is often not the encoding but a mismatch in normalization form.

Japanese has the same trap: kana with dakuten and handakuten (が, ぱ, and so on) can have decomposed and composed forms. So when handling filenames, user input, and search keys, it is wise to make it a team rule which normalization form you standardize on.
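The sketch below shows the two forms for a Hangul syllable and a kana, then a filename lookup that tolerates either. findByName and the listing array are hypothetical stand-ins for whatever directory listing your code actually receives.

```javascript
const nfcSyllable = "\uAC01";                    // 각, precomposed
const nfdSyllable = nfcSyllable.normalize("NFD"); // three conjoining jamo

console.log([...nfdSyllable].map(c => "U+" + c.codePointAt(0).toString(16).toUpperCase()));
// → [ 'U+1100', 'U+1161', 'U+11A8' ]

const encoder = new TextEncoder();
console.log(encoder.encode(nfcSyllable).length, encoder.encode(nfdSyllable).length); // → 3 9

// Kana with dakuten: が (U+304C) decomposes to か (U+304B) + combining mark (U+3099).
console.log("\u304C".normalize("NFD") === "\u304B\u3099"); // → true

// Hypothetical helper: compare names in one normalization form on both sides.
function findByName(names, wanted) {
  const key = wanted.normalize("NFC");
  return names.find(n => n.normalize("NFC") === key);
}

const listing = ["\uAC01.txt".normalize("NFD")]; // a name that arrived in decomposed form
console.log(listing.includes("\uAC01.txt"));               // → false: raw comparison misses it
console.log(findByName(listing, "\uAC01.txt") !== undefined); // → true
```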

Emoji and ZWJ — How to Assemble One Character

Earlier we said the family emoji is five code points. The secret to that assembly is the ZWJ (Zero Width Joiner, U+200D). The ZWJ is an invisible "glue" character — a signal to render the emoji before and after it as one.

👨 (man) + ZWJ + 👩 (woman) + ZWJ + 👧 (girl)
  = 👨‍👩‍👧  (renders as one family)

If the font or platform understands this ZWJ sequence, it draws one combined emoji; if not, it shows three people side by side. This is why the same text can look different on different devices.

Add skin-tone modifiers (Fitzpatrick modifiers), flags (a combination of two regional indicator characters), and gender/profession combinations, and it is common for one grapheme to grow to five or six code points. All of this is "one emoji" to the user. If you fail to respect the grapheme cluster when handling strings, you get bugs that chop emoji wrong, count length wrong, or move the cursor only halfway.
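To make those combinations tangible, here is a small sketch that builds each kind of sequence from escapes and counts code points against graphemes. The helper names are made up for this post, and the grapheme counts assume a runtime whose Intl.Segmenter follows current Unicode segmentation rules.

```javascript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const countGraphemes = s => [...graphemeSegmenter.segment(s)].length;
const countCodePoints = s => [...s].length;

const ZWJ = "\u200D";
const people = ["\u{1F468}", "\u{1F469}", "\u{1F467}"]; // man, woman, girl

const familyEmoji = people.join(ZWJ);  // glued with ZWJ
const threePeople = people.join("");   // same emoji, no glue
const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + medium skin-tone modifier
const flagKR = "\u{1F1F0}\u{1F1F7}";   // regional indicators K + R

console.log(countCodePoints(familyEmoji), countGraphemes(familyEmoji)); // → 5 1
console.log(countCodePoints(threePeople), countGraphemes(threePeople)); // → 3 3
console.log(countCodePoints(thumbsUp), countGraphemes(thumbsUp));       // → 2 1
console.log(countCodePoints(flagKR), countGraphemes(flagKR));           // → 2 1
```

Whether the joined sequence is drawn as one glyph is up to the font; the segmentation result does not depend on that.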

String Reversal — The Most Famous Trap

"Reverse a string" is a coding-interview classic. The naive answer looks like this.

function reverse(s) {
  return s.split("").reverse().join("");
}
reverse("hello");   // "olleh"  — works

In ASCII it is perfect. But once Unicode enters, it falls apart.

reverse("😀");        // "\uDE00\uD83D" — surrogate pair reversed and broken → �
reverse("e\u0301");   // decomposed é: the accent detaches from its base letter → "\u0301e"

split("") splits by UTF-16 code unit, so it cuts surrogate pairs in half. For combining characters, the accent ends up attached to the wrong letter. Processing by code point fixes the surrogate problem, but combining characters and ZWJ emoji still break. To truly reverse correctly, you must split by grapheme cluster.

function reverseGraphemes(s, locale = "en") {
  const seg = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...seg.segment(s)].map(x => x.segment).reverse().join("");
}

The lesson of this example is not that reversing strings matters. It is that it shows how often, and how quietly, the naive assumption of "processing character by character" breaks. Slicing, counting, cursor movement, even regex matching — the same trap hides everywhere.
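Regex matching is a quick way to see the same layering. In the sketch below, isOneCodePoint and isOneGrapheme are hypothetical helper names; the u and s flags and Intl.Segmenter are standard.

```javascript
const smile = "\u{1F600}";

// Without the u flag, "." matches a single UTF-16 code unit.
console.log(/^.$/.test(smile));  // → false
// With the u flag, "." matches a whole code point.
console.log(/^.$/u.test(smile)); // → true

// Hypothetical helpers for the two stricter questions.
const isOneCodePoint = s => /^.$/su.test(s);
const isOneGrapheme = s => {
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...seg.segment(s)].length === 1;
};

const decomposedE = "e\u0301"; // e + combining acute accent
console.log(isOneCodePoint(decomposedE)); // → false: two code points
console.log(isOneGrapheme(decomposedE));  // → true: one user-perceived character
```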

A Practical Checklist

Compressing the minefield above into practical rules:

  • Standardize on UTF-8. If files, the DB, HTTP headers, and even source code are all UTF-8, half of your encoding confusion disappears.
  • Decide what "length" means first. Bytes, code points, or graphemes? A UI character limit should be grapheme-based to match user expectations.
  • Normalize input before storing. Usually to NFC. Match search keys and comparison targets to the same form.
  • Never slice strings by code unit. To avoid chopping emoji and combining characters in half, respect code point — ideally grapheme — boundaries.
  • Suspect the normalization form of filenames. Especially in pipelines that cross macOS and Linux, and with Korean and Japanese filenames.
  • Put emoji and combining characters in your tests. If you test with ASCII only, these bugs will never be caught.
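For the last item, one possible starting point is a small corpus of awkward strings plus a check that a function's output contains no orphaned surrogates. Everything here (trickySamples, hasLoneSurrogate, findBreakage) is a hypothetical sketch, not an existing test library, and it only detects one class of damage.

```javascript
// Hypothetical test corpus: one string per failure mode discussed in this post.
const trickySamples = [
  "hello",                                   // ASCII baseline
  "\u{1F600}",                               // surrogate pair
  "e\u0301",                                 // combining accent (decomposed)
  "\uAC01",                                  // precomposed Hangul
  "\u1100\u1161\u11A8",                      // decomposed Hangul
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}", // ZWJ sequence
];

// True if the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(s) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(s);
}

// Run a string function over the corpus and return the inputs whose output is damaged.
function findBreakage(fn) {
  return trickySamples.filter(s => hasLoneSurrogate(fn(s)));
}

const naiveFirstChar = s => s.slice(0, 1);
console.log(findBreakage(naiveFirstChar).length); // → 2 (the emoji and the ZWJ sequence)
console.log(findBreakage(s => s).length);         // → 0 (untouched input is well-formed)
```

Split accents and split ZWJ sequences do not produce lone surrogates, so a fuller suite would also compare grapheme counts before and after.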

Conclusion

Our first intuition that text is "a sequence of characters" was true only in the narrow world of English and ASCII. Real text has three layers — bytes, code points, and graphemes — and emoji, combining characters, and normalization all knock those layers out of alignment. .length lying, an é that looks the same but isn't, and string reversal breaking emoji all come from this misalignment.

For those of us who work with Korean and Japanese, this is especially pressing. We live in a world where the same character becomes different bytes between precomposed and decomposed forms, between macOS and Linux. Fortunately, the principles are simple. Distinguish the three layers, standardize on UTF-8, normalize input, and respect graphemes rather than code units. These four habits alone let you cross most of the text minefield safely.
