Unicode & UTF-8 Encoding

The universal character set covering the world's writing systems

Universal Standard

Unicode assigns a unique code point, in the range U+0000 to U+10FFFF, to each of more than 150,000 characters drawn from the world's writing systems.

Code Points

Each character has a code point written as U+ followed by four to six hexadecimal digits. ASCII characters keep their values: A = U+0041 (decimal 65).
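As a small illustration of the notation, the sketch below converts between characters and code points using Python's built-in ord and chr. The helper name code_point_label is made up for this example and is not part of any library.

```python
# Convert between characters and their Unicode code points.
# code_point_label is a hypothetical helper written for this example.

def code_point_label(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))           # U+0041
print(ord("A"))                        # 65
print(chr(0x41))                       # A
print(code_point_label("\u20ac"))      # U+20AC  (euro sign)
print(code_point_label("\U0001F600"))  # U+1F600 (grinning face emoji)
```

Code points above U+FFFF simply use five or six hexadecimal digits, as the last line shows.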

Variable-Width UTF-8

UTF-8 uses 1-4 bytes per code point. ASCII characters take 1 byte, most other scripts take 2-3 bytes per character, and most emoji take 4 bytes (a few older symbols, such as U+263A, take 3).
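The byte counts can be checked directly with Python's str.encode. This is only a quick sketch; the helper name utf8_len and the choice of sample characters are assumptions made for the example.

```python
# Measure how many bytes UTF-8 needs for a few sample characters.
# utf8_len is a hypothetical helper written for this example.

def utf8_len(ch: str) -> int:
    """Number of bytes in the UTF-8 encoding of a string."""
    return len(ch.encode("utf-8"))

samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\u263a",      # U+263A, white smiling face (an older 3-byte symbol)
    "\U0001F600",  # U+1F600, grinning face emoji
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```

Note that a single visible character can consist of several code points (for example a letter plus a combining accent), in which case its byte count is the sum over those code points.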

Backward Compatible

Unicode's first 128 code points match ASCII exactly, and UTF-8 encodes each of them as the same single byte, so any ASCII text is automatically valid UTF-8.
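A quick way to see the compatibility is to encode the same ASCII-only string both ways and compare the bytes. The sketch below does that in Python; is_ascii_only is a hypothetical helper named for this example.

```python
# ASCII text produces identical bytes under the ASCII and UTF-8 codecs.

text = "Hello, UTF-8!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)            # True
print(ascii_bytes.decode("utf-8") == text)  # True

def is_ascii_only(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the high bit is clear."""
    return all(b < 0x80 for b in data)

print(is_ascii_only(utf8_bytes))                # True
print(is_ascii_only("\u00e9".encode("utf-8")))  # False
```

The reverse does not hold: UTF-8 text containing any non-ASCII character uses bytes of 0x80 or above, which a strict ASCII decoder rejects.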

UTF-8 Variable-Width Encoding
Bytes   Code Point Range       Byte Pattern
1       U+0000 - U+007F        0xxxxxxx
2       U+0080 - U+07FF        110xxxxx 10xxxxxx
3       U+0800 - U+FFFF        1110xxxx 10xxxxxx 10xxxxxx
4       U+10000 - U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
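The table can be turned into a hand-written encoder by shifting and masking the code point's bits into the x positions. The following is a minimal sketch, not a replacement for the built-in codec; the function name encode_code_point is an assumption for this example, and the surrogate check (U+D800 to U+DFFF, which well-formed UTF-8 never encodes) goes slightly beyond what the table states.

```python
# Encode one code point to UTF-8 by hand, following the byte patterns
# in the table. encode_code_point is a hypothetical helper.

def encode_code_point(cp: int) -> bytes:
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_code_point(0x41).hex(" "))     # 41
print(encode_code_point(0x20AC).hex(" "))   # e2 82 ac
print(encode_code_point(0x1F600).hex(" "))  # f0 9f 98 80
```

Each result can be compared against chr(cp).encode("utf-8") to confirm that the sketch agrees with Python's own encoder.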

How Variable-Width Works:

The leading bits indicate byte count: 0 = 1 byte, 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Continuation bytes always start with 10.
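To make the leading-bit rule concrete, the sketch below reads the length of a sequence from its first byte and counts code points by skipping continuation bytes. The function names are hypothetical, and count_code_points assumes its input is already well-formed UTF-8; it performs no validation.

```python
# Classify UTF-8 bytes by their leading bits.
# All function names here are hypothetical helpers for this example.

def sequence_length(first_byte: int) -> int:
    """Length in bytes of the sequence that starts with first_byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid lead byte")

def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte >> 6 == 0b10

def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by counting lead bytes."""
    return sum(1 for b in data if not is_continuation(b))

data = "A\u20ac\U0001F600".encode("utf-8")  # 1 + 3 + 4 bytes
print(len(data))                 # 8
print(sequence_length(data[0]))  # 1
print(sequence_length(data[1]))  # 3
print(sequence_length(data[4]))  # 4
print(count_code_points(data))   # 3
```

Because continuation bytes are always distinguishable from lead bytes, a decoder that lands in the middle of a stream can skip forward to the next lead byte and resynchronize.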
