Unicode Character Encoding

Universal Standard

Unicode assigns a unique code point in the range U+0000 to U+10FFFF to each of more than 150,000 characters, covering most of the world's writing systems as well as symbols and emoji.

Each character has a code point written as U+ followed by four to six hexadecimal digits, such as U+0041 or U+1F600. ASCII characters keep their values: A = U+0041 (decimal 65).
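As a minimal sketch of the notation, Python's built-in ord() returns a character's code point as an integer, which can then be formatted in U+XXXX style. The helper name code_point_label is an assumption made for this illustration, not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for one character (hypothetical helper).

    The code point is padded to at least four hex digits, as is conventional.
    """
    return f"U+{ord(ch):04X}"


# Escapes are used so the sample characters are unambiguous:
# A, e-acute (U+00E9), euro sign (U+20AC), grinning face emoji (U+1F600).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(ch, ord(ch), code_point_label(ch))
```

The first line printed is "A 65 U+0041", matching the ASCII value quoted above.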

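UTF-8 uses 1 to 4 bytes per code point. ASCII characters take 1 byte; most accented Latin letters and scripts such as Greek, Cyrillic, Hebrew, and Arabic take 2 bytes; most other scripts, including Chinese, Japanese, and Korean, take 3 bytes; most emoji take 4 bytes.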
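The byte counts can be checked directly with Python's str.encode. The sketch below is illustrative only; the function name utf8_length and the sample characters are choices made for this example.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given string (hypothetical helper)."""
    return len(ch.encode("utf-8"))


# (description, character) pairs, written with escapes to avoid ambiguity.
samples = [
    ("Latin capital A", "A"),                # U+0041
    ("Latin e with acute", "\u00e9"),        # U+00E9
    ("Cyrillic ya", "\u044f"),               # U+044F
    ("CJK ideograph 'middle'", "\u4e2d"),    # U+4E2D
    ("Heavy black heart", "\u2764"),         # U+2764, an emoji below U+10000
    ("Grinning face", "\U0001F600"),         # U+1F600
]

for name, ch in samples:
    print(f"{name}: {utf8_length(ch)} byte(s)")
```

The heart symbol is included because it shows that not every emoji needs 4 bytes: code points below U+10000 fit in 3.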

The first 128 Unicode code points match ASCII exactly, and UTF-8 encodes each of them as the same single byte, so any ASCII text is automatically valid UTF-8.
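A quick way to see this compatibility is to encode the same text both ways and compare the bytes. This is a small sketch using only Python's standard codecs; the variable and function names are illustrative.

```python
text = "Hello, world!"

# Encoding pure-ASCII text as ASCII or as UTF-8 yields identical bytes.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes


def ascii_range_matches() -> bool:
    """Check that every byte value 0..127 decodes under UTF-8 to the
    character with the same code point (hypothetical helper)."""
    return all(bytes([i]).decode("utf-8") == chr(i) for i in range(128))


print(same_bytes)
print(ascii_range_matches())
print(ascii_bytes.decode("utf-8"))  # ASCII bytes read back as UTF-8
```

The reverse does not hold: UTF-8 text containing any non-ASCII character cannot be decoded as ASCII.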

UTF-8 Variable-Width Encoding

How Variable-Width Works:

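The leading bits of the first byte indicate how many bytes the sequence uses: 0 = 1 byte, 110 = 2 bytes, 1110 = 3 bytes, 11110 = 4 bytes. Continuation bytes always start with 10. The remaining bits of each byte carry the bits of the code point.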
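The bit patterns above can be turned into a small hand-written encoder. The following is a sketch for illustration, not a replacement for Python's built-in codec; the names encode_utf8 and describe are assumptions made for this example. It prints each character's code point and bytes.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def describe(text: str) -> list[str]:
    """Return one line per character: code point, hex bytes, binary bytes."""
    lines = []
    for ch in text:
        encoded = encode_utf8(ord(ch))
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        bin_bytes = " ".join(f"{b:08b}" for b in encoded)
        lines.append(f"U+{ord(ch):04X}  {hex_bytes}  {bin_bytes}")
    return lines


# A, e-acute, euro sign, grinning face emoji.
for line in describe("A\u00e9\u20ac\U0001F600"):
    print(line)
```

In the binary column, the first byte of each multi-byte sequence begins with 110, 1110, or 11110, and every following byte begins with 10, as described above. Comparing encode_utf8(ord(ch)) with ch.encode("utf-8") is a simple way to check the sketch against the built-in codec.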
