@rkenmi - Misconceptions of ASCII and Unicode

Misconceptions of ASCII and Unicode

Updated on July 21, 2022

Myth: ASCII characters take up one byte

ASCII represents characters using the numbers 0 through 127. Codes 32-126 cover the printable characters used in English (uppercase and lowercase letters, digits, and punctuation), while codes 0-31 (and 127) are reserved control characters.

So this myth is not quite true: every ASCII character can be fully represented with just 7 bits. However, since most computers used 8-bit bytes, many OEM vendors decided to use the spare 8th bit (codes 128-255) for their own creative purposes, well before standards such as Unicode came along.
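A quick sketch in Python (the sample string is an arbitrary choice) shows that every ASCII character sits below code 128, even though each one occupies a full 8-bit byte when encoded:

```python
text = "Hello, World!"

for ch in text:
    # Every ASCII code point fits in 7 bits...
    assert ord(ch) < 128
    # ...but is stored as one 8-bit byte.
    assert len(ch.encode("ascii")) == 1

# Codes 128 and above are outside ASCII entirely;
# "é".encode("ascii") would raise UnicodeEncodeError.
print(max(ord(ch) for ch in text))
```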

Myth: UTF-8 takes up only 8 bits

In Unicode, each character maps to a code point, which is an abstract number rather than a specific byte layout, somewhat like a pointer in programming. For example, the letter H is the code point U+0048.

This is different from a Unicode encoding, which is the process of converting code points into bits and bytes. There are many Unicode encodings, such as UCS-2, which stores every code point in two bytes, or UCS-4, which stores every code point in four bytes. There are also more conservative encodings, such as UTF-7, which tries to map code points into 7-bit units.
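The distinction is easy to see in Python, where the same code point produces different byte sequences under each encoding (UTF-32 shown here as the common stand-in for UCS-4):

```python
h = "\u0048"  # the code point U+0048, i.e. the letter 'H'

print(h.encode("utf-8"))      # 1 byte:  b'H'
print(h.encode("utf-16-be"))  # 2 bytes: b'\x00H'
print(h.encode("utf-32-be"))  # 4 bytes: b'\x00\x00\x00H'
print(h.encode("utf-7"))      # ASCII-range characters pass through: b'H'
```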

So the natural follow-up question is: does UTF-8 really take up only 8 bits? This myth is partially true, and partially false. UTF-8 does indeed map code points 0-127 into a single byte, so ASCII characters (i.e. a-z, A-Z, digits, punctuation) are guaranteed to take one byte. Anything from code point 128 upward, however, is stored using 2, 3, or 4 bytes, depending on the character (the original design even allowed up to 6 bytes, before RFC 3629 capped UTF-8 at 4). This is because Unicode is an abstract code point space with room for over a million characters (ex: U+006F for o).
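The variable-width behavior is easy to verify in Python; the sample characters below are arbitrary picks from each byte-length tier:

```python
# UTF-8 byte length grows with the code point.
samples = {
    "o": 1,   # U+006F, ASCII range (0-127)
    "é": 2,   # U+00E9, Latin-1 supplement
    "あ": 3,  # U+3042, Hiragana
    "😀": 4,  # U+1F600, outside the Basic Multilingual Plane
}

for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s)")
    assert len(encoded) == expected
```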

Myth: UTF-16 takes up only 16 bits

Similar to the above, this is only partially true. UTF-16 maps the 65,536 code points of the Basic Multilingual Plane, including your traditional ASCII characters, to two bytes each. For example, H can be represented by 00 48 (hexadecimal, big-endian) in UTF-16, or 48 00 in little-endian. Characters outside the Basic Multilingual Plane, however, such as many emoji, require a surrogate pair: four bytes, not two.
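Both the endianness and the surrogate-pair case can be checked in Python (the emoji is an arbitrary example of a character outside the BMP):

```python
# 'H' (U+0048) in UTF-16, both byte orders:
print("H".encode("utf-16-be"))  # b'\x00H'
print("H".encode("utf-16-le"))  # b'H\x00'

# A character outside the BMP needs a surrogate pair, i.e. 4 bytes:
print(len("😀".encode("utf-16-be")))  # 4
```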

Myth: ANSI is different from ASCII

This one is actually true, not a myth. "ANSI" (a misnomer in the Windows world, since it was never an ANSI standard) refers to character sets that build on top of ASCII to fill out the characters represented by codes 128-255, such as Windows-1252 and the closely related ISO-8859-1.

These ANSI code pages helped unify the divergent vendor-specific character sets above code 127, and many OEM vendors adopted them for the sake of consistency.

Myth: UTF-8 should always be chosen over UTF-16

UTF-8 can encode ASCII characters in 1 byte, while UTF-16 requires at least two bytes for any character.

Since ASCII characters that fit in a single byte are upsized to two bytes in UTF-16, mostly-ASCII text roughly doubles in size. For this reason, if most of your strings use ASCII characters, UTF-8 is much more space-efficient.

On the flip side, many higher-order characters take a whopping 3 bytes in UTF-8, while they fit in just 2 bytes in UTF-16. Therefore, the answer is: it depends. If your text is dominated by international characters (e.g. CJK text), UTF-16 is a good choice. If your text is mostly English, then UTF-8 is a good choice.

Article Tags:
utf-8, utf-16, ansi, strings, string, Computer Science, unicode, ascii