Universal Standard
Unicode assigns a unique code point in the range U+0000 to U+10FFFF to each of more than 150,000 characters, covering most of the world's writing systems.
Code Points
Each character has a code point, written as U+ followed by four to six hexadecimal digits. ASCII characters keep their values: A = U+0041 (decimal 65).
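As a small illustration of the notation, the following Python sketch converts between characters and code points with the built-in `ord` and `chr`. The helper name `code_point_label` is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+ notation (hypothetical helper)."""
    # ord() returns the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))           # U+0041
print(ord("A"))                        # 65
print(chr(0x41))                       # A
print(code_point_label("\U0001F600"))  # U+1F600 (an emoji outside the 4-digit range)
```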
Variable-Width UTF-8
UTF-8 uses 1 to 4 bytes per code point. ASCII characters take 1 byte, most other scripts take 2 or 3 bytes, and most emoji take 4 bytes.
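To see the variable width in practice, this Python sketch encodes one sample character from each size class and reports how many bytes it occupies. The sample characters and the helper name `utf8_length` are choices made for this example.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))

# One sample per width: Latin letter, accented letter, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ')}")
# U+0041: 1 byte(s) -> 41
# U+00E9: 2 byte(s) -> c3 a9
# U+20AC: 3 byte(s) -> e2 82 ac
# U+1F600: 4 byte(s) -> f0 9f 98 80
```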
Backward Compatible
Unicode's first 128 code points match ASCII exactly, and UTF-8 encodes each of them as the same single byte, so any ASCII text is already valid UTF-8.
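A quick way to check this compatibility is to encode the same ASCII-only string both ways and compare the bytes. The sample string below is arbitrary.

```python
text = "Hello, UTF-8!"  # arbitrary ASCII-only sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)    # True

# Bytes written by an ASCII-only program decode cleanly as UTF-8.
print(ascii_bytes.decode("utf-8"))  # Hello, UTF-8!
```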
| Bytes | Code Point Range | Byte Pattern |
|---|---|---|
| 1 | U+0000 - U+007F | 0xxxxxxx |
| 2 | U+0080 - U+07FF | 110xxxxx 10xxxxxx |
| 3 | U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | U+10000 - U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
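The bit layouts in the table can be applied by hand. The sketch below is a minimal illustrative encoder, not a production one: the function name `utf8_encode` is hypothetical, and it assumes the input is an integer code point. It rejects the surrogate range U+D800 to U+DFFF, which UTF-8 does not encode.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling in the x bits of the table's patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

# Compare against Python's built-in encoder for one code point per row.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp) == chr(cp).encode("utf-8"))
```

Each branch shifts the code point right to pick out 6 bits at a time for the continuation bytes, and the remaining high bits go into the lead byte.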
How Variable-Width Works:
The leading bits of the first byte indicate the length of the sequence: 0 = 1 byte, 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Continuation bytes always start with 10.
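The prefix rules translate into a few bit-mask checks. In this sketch, `sequence_length` and `count_code_points` are hypothetical helper names, and the input is assumed to be well-formed UTF-8. No further validation is performed.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead byte."""
    if lead & 0b10000000 == 0b00000000:  # 0xxxxxxx
        return 1
    if lead & 0b11100000 == 0b11000000:  # 110xxxxx
        return 2
    if lead & 0b11110000 == 0b11100000:  # 1110xxxx
        return 3
    if lead & 0b11111000 == 0b11110000:  # 11110xxx
        return 4
    raise ValueError(f"not a lead byte: {lead:#010b}")

def count_code_points(data: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if b & 0b11000000 != 0b10000000)

data = "A\u00e9\u20ac\U0001F600".encode("utf-8")
print(len(data))                 # 10 bytes in total (1 + 2 + 3 + 4)
print(count_code_points(data))   # 4
print(sequence_length(data[1]))  # 2 (lead byte of U+00E9)
```

Because continuation bytes are always recognizable by their 10 prefix, a reader that starts in the middle of a stream can skip forward to the next lead byte.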