UTF-8 Encoding

The encoding that lets you use emojis, Chinese characters, and plain ASCII in the same file.


What is UTF-8?

UTF-8 is a character encoding that can represent every character in the Unicode standard—over 140,000 characters from virtually every writing system on Earth, plus emoji. It's backward compatible with ASCII and has become the dominant encoding on the web.

ASCII:     Hello
UTF-8:     Hello こんにちは مرحبا 🚀

The genius of UTF-8 is that ASCII characters (the first 128) use just one byte, while other characters use 2-4 bytes. This means English text stays compact while still supporting every language.

How It Works

UTF-8 uses a clever variable-length encoding:

Bytes | Bit Pattern                         | Code Point Range    | Characters
1     | 0xxxxxxx                            | U+0000 to U+007F    | ASCII (a-z, 0-9, etc.)
2     | 110xxxxx 10xxxxxx                   | U+0080 to U+07FF    | Latin, Greek, Cyrillic
3     | 1110xxxx 10xxxxxx 10xxxxxx          | U+0800 to U+FFFF    | Chinese, Japanese, Korean
4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 to U+10FFFF | Emoji, rare scripts

The leading bits tell you how many bytes the character uses. Continuation bytes always start with 10.
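That lead-byte rule can be sketched as a small classifier. This is a hypothetical helper for illustration, not a library function:

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence uses, judging by its lead byte."""
    if lead_byte < 0b10000000:   # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte < 0b11000000:   # 10xxxxxx: a continuation byte, not a lead
        raise ValueError("continuation byte cannot start a sequence")
    if lead_byte < 0b11100000:   # 110xxxxx: two-byte sequence
        return 2
    if lead_byte < 0b11110000:   # 1110xxxx: three-byte sequence
        return 3
    if lead_byte < 0b11111000:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("invalid lead byte")

utf8_sequence_length(0x41)  # 1 ('A')
utf8_sequence_length(0xE4)  # 3 (first byte of '中')
```

Real decoders also validate the continuation bytes; this only inspects the lead byte.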

Encoding Examples

Character: A
Unicode:   U+0041
UTF-8:     01000001 (1 byte)

Character: é
Unicode:   U+00E9
UTF-8:     11000011 10101001 (2 bytes)

Character: 中
Unicode:   U+4E2D
UTF-8:     11100100 10111000 10101101 (3 bytes)

Character: 🎉
Unicode:   U+1F389
UTF-8:     11110000 10011111 10001110 10001001 (4 bytes)
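You can reproduce all four examples in Python, which encodes strings to UTF-8 bytes directly:

```python
# Print each character's code point and its UTF-8 bytes in binary
for ch in "Aé中🎉":
    utf8 = ch.encode('utf-8')
    bits = ' '.join(f'{b:08b}' for b in utf8)
    print(f"{ch!r}  U+{ord(ch):04X}  {bits}  ({len(utf8)} bytes)")
```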

Why UTF-8 Won

Before UTF-8, we had encoding chaos:

  • ASCII for English (128 characters)
  • ISO-8859-1 for Western European
  • Shift-JIS for Japanese
  • GB2312 for Chinese
  • ...and hundreds more

If you opened a file with the wrong encoding, you got garbage: Ã© instead of é. UTF-8 solved this by supporting everything in one encoding.
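You can produce that garbage deliberately by decoding UTF-8 bytes as Latin-1:

```python
# 'é' is C3 A9 in UTF-8; read each byte as a Latin-1 character and you get mojibake
garbled = "é".encode('utf-8').decode('latin-1')
print(garbled)  # Ã©
```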

Encoding | Pros                                             | Cons
UTF-8    | Universal, ASCII-compatible, web standard        | Variable width complicates string operations
UTF-16   | Fixed width for most chars, Windows/Java default | Not ASCII-compatible, byte-order issues
UTF-32   | Fixed 4 bytes per char, simple indexing          | Wastes space, rarely used
ASCII    | Simple, 1 byte per char                          | English only
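The trade-offs in that comparison are easy to see by encoding the same characters with each scheme (using Python's little-endian codecs so no BOM is prepended):

```python
# Byte cost of the same character under UTF-8, UTF-16, and UTF-32
for ch in ("A", "中", "🎉"):
    sizes = {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(ch, sizes)
# 'A' is 1 byte in UTF-8 but 2 in UTF-16 and 4 in UTF-32;
# '🎉' is 4 bytes in all three, since UTF-16 needs a surrogate pair
```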

Where You'll See This

  • The Web - 98%+ of websites use UTF-8
  • Source code - Most editors default to UTF-8
  • JSON - UTF-8 by specification
  • APIs - REST APIs typically use UTF-8
  • Databases - utf8mb4 in MySQL, UTF8 in PostgreSQL
  • Terminal - Modern terminals are UTF-8 native

Common Gotchas

⚠️ String Length vs Byte Length

"Hello".length is 5 characters and 5 bytes. But "🎉".length returns 2 in JavaScript, because JS strings count UTF-16 units and the emoji is a surrogate pair, even though it's 4 bytes in UTF-8. Character counting is surprisingly hard.

ℹ️ MySQL utf8 vs utf8mb4

MySQL's utf8 only supports characters up to 3 bytes (no emoji!). Use utf8mb4 for full UTF-8 support. Yes, this has bitten everyone.

  • BOM (Byte Order Mark) - EF BB BF at the start of files. Usually unnecessary and can cause issues.
  • Mojibake - Garbled text from the wrong encoding: Ã© is é decoded as Latin-1 instead of UTF-8.
  • Overlong encodings - Security vulnerability where characters are encoded with more bytes than necessary.
  • Invalid sequences - Not all byte sequences are valid UTF-8. Always validate untrusted input.
  • Normalization - é can be one character (U+00E9) or two (e + combining accent). They look identical but aren't equal.
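The normalization pitfall is easy to demonstrate with Python's standard unicodedata module:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)  # False: they render identically but compare unequal
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after NFC normalization
```

This is why text should usually be normalized (commonly to NFC) before comparing or hashing it.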

The BOM Problem

With BOM:    EF BB BF 48 65 6C 6C 6F  (H e l l o)
Without BOM: 48 65 6C 6C 6F           (H e l l o)

The BOM is a zero-width character at the start of a file that signals "this is UTF-8." It's:

  • Unnecessary - UTF-8 doesn't need byte order marking
  • Problematic - Can break shell scripts, PHP files, JSON
  • Recommended to omit - Unless required by specific tools
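If you do run into a BOM, it's just three bytes you can check for and strip. A minimal sketch, assuming nothing beyond the standard codecs module:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

strip_utf8_bom(b'\xef\xbb\xbfHello')  # b'Hello'
```

Python's built-in 'utf-8-sig' codec does the same thing automatically on decode.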

In Code

javascript
// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello 🌍");
// Uint8Array [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]

// Decode UTF-8 bytes to string
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);
// "Hello 🌍"

// Handle emoji length correctly
const emoji = "🎉";
console.log(emoji.length);           // 2 (JS uses UTF-16)
console.log([...emoji].length);      // 1 (spread iterates code points, not UTF-16 units)
python
# Python 3 strings are Unicode by default
text = "Hello 🌍"

# Encode to UTF-8 bytes
utf8_bytes = text.encode('utf-8')
# b'Hello \xf0\x9f\x8c\x8d'

# Decode from UTF-8 bytes
text = utf8_bytes.decode('utf-8')
# "Hello 🌍"

# Character count
len("🎉")  # 1 (Python 3 counts code points correctly)

Detecting Encoding

python
# Check if bytes are valid UTF-8
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
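A related option, when you want to salvage bad input rather than reject it, is the decoder's errors parameter: each invalid byte becomes the U+FFFD replacement character instead of raising.

```python
# 0xFF can never appear in well-formed UTF-8, so it becomes U+FFFD (�)
data = b'ok \xff\xfe!'
print(data.decode('utf-8', errors='replace'))  # 'ok ��!'
```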

Quick Reference

Term           | Meaning
Code point     | A number assigned to a character (U+0041 = 'A')
Grapheme       | What users perceive as a single character
Surrogate pair | Two UTF-16 units encoding one character (e.g. emoji)
BOM            | Byte Order Mark (EF BB BF)
Mojibake       | Garbled text from the wrong encoding
Normalization  | Converting equivalent sequences to a standard form
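The surrogate-pair entry can be made concrete: characters beyond U+FFFF take two 16-bit units in UTF-16, which is exactly why JavaScript reports a length of 2 for a single emoji.

```python
emoji = "🎉"  # U+1F389, above the 16-bit range
utf16 = emoji.encode('utf-16-be')
print(utf16.hex(' '))   # d8 3c df 89: high surrogate D83C, low surrogate DF89
print(len(utf16) // 2)  # 2 sixteen-bit units for one user-perceived character
```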


"UTF-8: Because everyone deserves to see their language on screen, even if JavaScript still can't count emoji properly."