UTF-8 Encoding

The encoding that lets you use emojis, Chinese characters, and plain ASCII in the same file.


What is UTF-8?

UTF-8 is a character encoding that can represent every character in the Unicode standard—over 140,000 characters from virtually every writing system on Earth, plus emoji. It's backward compatible with ASCII and has become the dominant encoding on the web.

ASCII:     Hello
UTF-8:     Hello こんにちは مرحبا 🚀

The genius of UTF-8 is that ASCII characters (the first 128) use just one byte, while other characters use 2-4 bytes. This means English text stays compact while still supporting every language.

How It Works

UTF-8 uses a clever variable-length encoding:

Bytes | Bit Pattern                         | Code Point Range    | Characters
1     | 0xxxxxxx                            | U+0000 to U+007F    | ASCII (a-z, 0-9, etc.)
2     | 110xxxxx 10xxxxxx                   | U+0080 to U+07FF    | Latin, Greek, Cyrillic
3     | 1110xxxx 10xxxxxx 10xxxxxx          | U+0800 to U+FFFF    | Chinese, Japanese, Korean
4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 to U+10FFFF | Emoji, rare scripts

The leading bits tell you how many bytes the character uses. Continuation bytes always start with 10.
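That lead-byte rule can be sketched as a small classifier. This is a hypothetical helper for illustration, not a library function:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence uses, judging by its lead byte."""
    if lead_byte < 0b10000000:   # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte < 0b11000000:   # 10xxxxxx: a continuation byte, not a lead
        raise ValueError("continuation byte cannot start a sequence")
    if lead_byte < 0b11100000:   # 110xxxxx: two-byte sequence
        return 2
    if lead_byte < 0b11110000:   # 1110xxxx: three-byte sequence
        return 3
    if lead_byte < 0b11111000:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("invalid lead byte")

utf8_sequence_length(0x41)  # 1 ('A')
utf8_sequence_length(0xE4)  # 3 (first byte of '中')
```

Real decoders also validate the continuation bytes; this only inspects the lead byte.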

Encoding Examples

Character: A
Unicode:   U+0041
UTF-8:     01000001 (1 byte)

Character: é
Unicode:   U+00E9
UTF-8:     11000011 10101001 (2 bytes)

Character: 中
Unicode:   U+4E2D
UTF-8:     11100100 10111000 10101101 (3 bytes)

Character: 🎉
Unicode:   U+1F389
UTF-8:     11110000 10011111 10001110 10001001 (4 bytes)
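You can reproduce all four examples in Python, which encodes strings to UTF-8 bytes directly:

```python
# Print each character's code point and its UTF-8 bytes in binary
for ch in "Aé中🎉":
    utf8 = ch.encode('utf-8')
    bits = ' '.join(f'{b:08b}' for b in utf8)
    print(f"{ch!r}  U+{ord(ch):04X}  {bits}  ({len(utf8)} bytes)")
```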

Why UTF-8 Won

Before UTF-8, we had encoding chaos:

  • ASCII for English (128 characters)
  • ISO-8859-1 for Western European
  • Shift-JIS for Japanese
  • GB2312 for Chinese
  • ...and hundreds more

If you opened a file with the wrong encoding, you got garbage: Ã© instead of é. UTF-8 solved this by supporting everything in one encoding.
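You can produce that garbage deliberately by decoding UTF-8 bytes as Latin-1:

```python
# 'é' is C3 A9 in UTF-8; read each byte as a Latin-1 character and you get mojibake
garbled = "é".encode('utf-8').decode('latin-1')
print(garbled)  # Ã©
```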

Encoding | Pros                                             | Cons
UTF-8    | Universal, ASCII-compatible, web standard        | Variable width complicates string operations
UTF-16   | Fixed width for most chars, Windows/Java default | Not ASCII-compatible, byte-order issues
UTF-32   | Fixed 4 bytes per char, simple indexing          | Wastes space, rarely used
ASCII    | Simple, 1 byte per char                          | English only
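The trade-offs in that comparison are easy to see by encoding the same characters with each scheme (using Python's little-endian codecs so no BOM is prepended):

```python
# Byte cost of the same character under UTF-8, UTF-16, and UTF-32
for ch in ("A", "中", "🎉"):
    sizes = {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(ch, sizes)
# 'A' is 1 byte in UTF-8 but 2 in UTF-16 and 4 in UTF-32;
# '🎉' is 4 bytes in all three, since UTF-16 needs a surrogate pair
```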

Where You'll See This

  • The Web - 98%+ of websites use UTF-8
  • Source code - Most editors default to UTF-8
  • JSON - UTF-8 by specification
  • APIs - REST APIs typically use UTF-8
  • Databases - utf8mb4 in MySQL, UTF8 in PostgreSQL
  • Terminal - Modern terminals are UTF-8 native

Common Gotchas

⚠️ String Length vs Byte Length

"Hello".length is 5 characters and 5 bytes. But "🎉".length returns 2 in JavaScript, because JS strings count UTF-16 units and the emoji is a surrogate pair, even though it's 4 bytes in UTF-8. Character counting is surprisingly hard.

ℹ️ MySQL utf8 vs utf8mb4

MySQL's utf8 only supports characters up to 3 bytes (no emoji!). Use utf8mb4 for full UTF-8 support. Yes, this has bitten everyone.

  • BOM (Byte Order Mark) - EF BB BF at the start of files. Usually unnecessary and can cause issues.
  • Mojibake - Garbled text from the wrong encoding: Ã© is é decoded as Latin-1 instead of UTF-8.
  • Overlong encodings - Security vulnerability where characters are encoded with more bytes than necessary.
  • Invalid sequences - Not all byte sequences are valid UTF-8. Always validate untrusted input.
  • Normalization - é can be one character (U+00E9) or two (e + combining accent). They look identical but aren't equal.
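The normalization pitfall is easy to demonstrate with Python's standard unicodedata module:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)  # False: they render identically but compare unequal
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after NFC normalization
```

This is why text should usually be normalized (commonly to NFC) before comparing or hashing it.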

The BOM Problem

With BOM:    EF BB BF 48 65 6C 6C 6F  (H e l l o)
Without BOM: 48 65 6C 6C 6F           (H e l l o)

The BOM is a zero-width character at the start of a file that signals "this is UTF-8." It's:

  • Unnecessary - UTF-8 doesn't need byte order marking
  • Problematic - Can break shell scripts, PHP files, JSON
  • Recommended to omit - Unless required by specific tools
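If you do run into a BOM, it's just three bytes you can check for and strip. A minimal sketch, assuming nothing beyond the standard codecs module:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

strip_utf8_bom(b'\xef\xbb\xbfHello')  # b'Hello'
```

Python's built-in 'utf-8-sig' codec does the same thing automatically on decode.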

In Code

javascript
// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello 🌍");
// Uint8Array [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]

// Decode UTF-8 bytes to string
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);
// "Hello 🌍"

// Handle emoji length correctly
const emoji = "🎉";
console.log(emoji.length);           // 2 (JS uses UTF-16)
console.log([...emoji].length);      // 1 (spread iterates code points, not UTF-16 units)
python
# Python 3 strings are Unicode by default
text = "Hello 🌍"

# Encode to UTF-8 bytes
utf8_bytes = text.encode('utf-8')
# b'Hello \xf0\x9f\x8c\x8d'

# Decode from UTF-8 bytes
text = utf8_bytes.decode('utf-8')
# "Hello 🌍"

# Character count
len("🎉")  # 1 (Python 3 counts code points correctly)

Detecting Encoding

python
# Check if bytes are valid UTF-8
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
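A related option, when you want to salvage bad input rather than reject it, is the decoder's errors parameter: each invalid byte becomes the U+FFFD replacement character instead of raising.

```python
# 0xFF can never appear in well-formed UTF-8, so it becomes U+FFFD (�)
data = b'ok \xff\xfe!'
print(data.decode('utf-8', errors='replace'))  # 'ok ��!'
```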

Quick Reference

Term           | Meaning
Code point     | A number assigned to a character (U+0041 = 'A')
Grapheme       | What users perceive as a single character
Surrogate pair | Two UTF-16 units encoding one character (e.g. emoji)
BOM            | Byte Order Mark (EF BB BF)
Mojibake       | Garbled text from the wrong encoding
Normalization  | Converting equivalent sequences to a standard form
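The surrogate-pair entry can be made concrete: characters beyond U+FFFF take two 16-bit units in UTF-16, which is exactly why JavaScript reports a length of 2 for a single emoji.

```python
emoji = "🎉"  # U+1F389, above the 16-bit range
utf16 = emoji.encode('utf-16-be')
print(utf16.hex(' '))   # d8 3c df 89: high surrogate D83C, low surrogate DF89
print(len(utf16) // 2)  # 2 sixteen-bit units for one user-perceived character
```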


"UTF-8: Because everyone deserves to see their language on screen, even if JavaScript still can't count emoji properly."