What is UTF-8?
UTF-8 is a character encoding that can represent every character in the Unicode standard—over 140,000 characters from virtually every writing system on Earth, plus emoji. It's backward compatible with ASCII and has become the dominant encoding on the web.
ASCII: Hello
UTF-8: Hello こんにちは مرحبا 🚀
The genius of UTF-8 is that ASCII characters (the first 128) use just one byte, while other characters use 2-4 bytes. This means English text stays compact while still supporting every language.
How It Works
UTF-8 uses a clever variable-length encoding:
| Bytes | Bit Pattern | Code Point Range | Characters |
|---|---|---|---|
| 1 | 0xxxxxxx | U+0000 to U+007F | ASCII (a-z, 0-9, etc.) |
| 2 | 110xxxxx 10xxxxxx | U+0080 to U+07FF | Latin, Greek, Cyrillic |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 to U+FFFF | Chinese, Japanese, Korean |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 to U+10FFFF | Emoji, rare scripts |
The leading bits tell you how many bytes the character uses. Continuation bytes always start with 10.
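The bit patterns in the table map mechanically to code. As a minimal sketch (not a production encoder — it skips surrogate-range checks), here is the table translated into a hand-rolled encoder:

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single Unicode code point to UTF-8 by hand, following the table."""
    if code_point <= 0x7F:              # 1 byte:  0xxxxxxx
        return bytes([code_point])
    elif code_point <= 0x7FF:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    elif code_point <= 0xFFFF:          # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    elif code_point <= 0x10FFFF:        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")

# Agrees with Python's built-in encoder:
assert encode_utf8(ord("é")) == "é".encode("utf-8")
assert encode_utf8(ord("中")) == "中".encode("utf-8")
```

Each branch masks off 6 bits at a time into continuation bytes (0x80 | ...) and packs the remaining high bits into the lead byte.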
Encoding Examples
Character: A
Unicode: U+0041
UTF-8: 01000001 (1 byte)
Character: é
Unicode: U+00E9
UTF-8: 11000011 10101001 (2 bytes)
Character: 中
Unicode: U+4E2D
UTF-8: 11100100 10111000 10101101 (3 bytes)
Character: 🎉
Unicode: U+1F389
UTF-8: 11110000 10011111 10001110 10001001 (4 bytes)
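You can reproduce all four examples with Python's built-in encoder — a quick check that the bit patterns above are exactly what gets written to disk:

```python
# Print each character's code point and UTF-8 bytes in binary
for ch in "Aé中🎉":
    utf8 = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in utf8)
    print(f"U+{ord(ch):04X} -> {bits} ({len(utf8)} bytes)")
```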
Why UTF-8 Won
Before UTF-8, we had encoding chaos:
- ASCII for English (128 characters)
- ISO-8859-1 for Western European
- Shift-JIS for Japanese
- GB2312 for Chinese
- ...and hundreds more
If you opened a file with the wrong encoding, you got garbage: Ã© instead of é. UTF-8 solved this by supporting everything in one encoding.
| Encoding | Pros | Cons |
|---|---|---|
| UTF-8 | Universal, ASCII-compatible, web standard | Variable width complicates string operations |
| UTF-16 | 2 bytes for all common (BMP) chars, Windows/Java default | Not ASCII-compatible, byte-order issues, still variable width |
| UTF-32 | Fixed 4 bytes per char, simple indexing | Wastes space, rarely used |
| ASCII | Simple, 1 byte per char | English only |
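The space trade-off in the table is easy to measure. This snippet (sample strings are my own, chosen for illustration) compares how many bytes the same text needs in each Unicode encoding:

```python
# Bytes needed to store the same text in different encodings
samples = {"English": "Hello, world", "Japanese": "こんにちは", "Emoji": "🚀🎉"}
for name, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, sizes)
```

Note that UTF-8 is the most compact for English but not for CJK text, where UTF-16 wins; it dominates anyway because of ASCII compatibility and the web.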
Where You'll See This
- The Web - 98%+ of websites use UTF-8
- Source code - Most editors default to UTF-8
- JSON - UTF-8 by specification
- APIs - REST APIs typically use UTF-8
- Databases - utf8mb4 in MySQL, UTF8 in PostgreSQL
- Terminal - Modern terminals are UTF-8 native
Common Gotchas
"Hello".length is 5 characters and 5 bytes. But "🎉".length returns 2 in JavaScript, because it counts UTF-16 code units and the emoji is a surrogate pair — even though it's 4 bytes in UTF-8. Character counting is surprisingly hard.
MySQL's utf8 only supports 3-byte characters (no emoji!). Use utf8mb4 for full UTF-8 support. Yes, this has bitten everyone.
- BOM (Byte Order Mark) - EF BB BF at the start of files. Usually unnecessary and can cause issues.
- Mojibake - Garbled text from the wrong encoding: Ã© is é's UTF-8 bytes decoded as Latin-1.
- Overlong encodings - Security vulnerability where characters are encoded with more bytes than necessary.
- Invalid sequences - Not all byte sequences are valid UTF-8. Always validate untrusted input.
- Normalization - é can be one character (U+00E9) or two (e + combining accent). They look identical but aren't equal.
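The normalization gotcha is easy to demonstrate with the standard library's unicodedata module. NFC composes the two-code-point form into the single precomposed character so comparisons work:

```python
import unicodedata

precomposed = "\u00e9"    # é as one code point
decomposed = "e\u0301"    # e + combining acute accent

print(precomposed == decomposed)   # False: different code point sequences
print(len(precomposed), len(decomposed))  # 1 2

# Normalize to NFC before comparing user input
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```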
The BOM Problem
With BOM: EF BB BF 48 65 6C 6C 6F (H e l l o)
Without BOM: 48 65 6C 6C 6F (H e l l o)
The BOM is a zero-width character at the start of a file that signals "this is UTF-8." It's:
- Unnecessary - UTF-8 doesn't need byte order marking
- Problematic - Can break shell scripts, PHP files, JSON
- Recommended to omit - Unless required by specific tools
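If you have to read files that may or may not start with a BOM, Python's utf-8-sig codec handles both cases — it strips the BOM if present and behaves like plain UTF-8 otherwise:

```python
raw = b"\xef\xbb\xbfHello"            # file content that begins with a BOM

print(repr(raw.decode("utf-8")))      # '\ufeffHello' - the BOM leaks into the string
print(repr(raw.decode("utf-8-sig")))  # 'Hello' - BOM stripped
```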
In Code
// Encode string to UTF-8 bytes
const encoder = new TextEncoder();
const bytes = encoder.encode("Hello 🌍");
// Uint8Array [72, 101, 108, 108, 111, 32, 240, 159, 140, 141]
// Decode UTF-8 bytes to string
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);
// "Hello 🌍"
// Handle emoji length correctly
const emoji = "🎉";
console.log(emoji.length); // 2 (JS uses UTF-16)
console.log([...emoji].length); // 1 (spread iterates code points, not UTF-16 units)
# Python 3 strings are Unicode by default
text = "Hello 🌍"
# Encode to UTF-8 bytes
utf8_bytes = text.encode('utf-8')
# b'Hello \xf0\x9f\x8c\x8d'
# Decode from UTF-8 bytes
text = utf8_bytes.decode('utf-8')
# "Hello 🌍"
# Character count
len("🎉") # 1 (Python 3 counts code points)
Detecting Encoding
# Check if bytes are valid UTF-8
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
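A quick demonstration of this check against the invalid sequences mentioned earlier — Python's decoder rejects both truncated sequences and overlong encodings (the function is restated here so the snippet runs on its own):

```python
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("café".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc3"))                 # False - truncated 2-byte sequence
print(is_valid_utf8(b"\xc0\x80"))             # False - overlong encoding of NUL
```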
Quick Reference
| Term | Meaning |
|---|---|
| Code point | A number assigned to a character (U+0041 = 'A') |
| Grapheme | What users perceive as a single character |
| Surrogate pair | Two UTF-16 units encoding one character (emoji) |
| BOM | Byte Order Mark (EF BB BF) |
| Mojibake | Garbled text from wrong encoding |
| Normalization | Converting equivalent sequences to a standard form |
"UTF-8: Because everyone deserves to see their language on screen, even if JavaScript still can't count emoji properly."