Computers operate solely on numbers, so they assign numeric values to letters and other characters. Historically, many character encoding systems existed, each limited in language and symbol coverage, particularly for non-Latin scripts such as Japanese. These early systems often conflicted: the same number could represent different characters in different encodings, leading to data corruption and compatibility issues.
Although many text encoding standards have been developed, their evolution falls into roughly three main stages:
Over the years, UTF-8 has become the dominant encoding standard worldwide.
The name “Unicode” itself represents its foundational goals:
Web browsers and text editors predominantly use UTF-8 by default. Web content such as HTML, CSS, and JavaScript files is encoded in UTF-8, with the character set specified in HTTP headers (alongside the MIME type in the `Content-Type` header) or in an HTML meta tag (`<meta charset="utf-8">`) to ensure correct interpretation by browsers.
Despite the widespread use of UTF-8 for file encoding, JavaScript internally operates on UTF-16 for historical reasons. This affects how JavaScript handles strings, particularly characters represented by two UTF-16 code units, such as many emojis.
In JavaScript, characters can be written as escapes in two ways: `\uHHHH` and `\u{H...H}`. In `\uHHHH`, HHHH is a four-hex-digit UTF-16 code unit; in `\u{H...H}`, H…H is a Unicode code point of one to six hex digits. The `\u{H...H}` form was introduced in ECMAScript 6. For example, the emoji 🤗 can be escaped as `\ud83e\udd17` or `\u{1f917}`.
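A quick sketch showing that the two escape forms denote the same character:

```javascript
// Both escapes below produce the emoji 🤗 (U+1F917)
const surrogatePair = '\ud83e\udd17' // two UTF-16 code units
const codePoint = '\u{1f917}'        // one Unicode code point
console.log(surrogatePair === codePoint) // true
```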
Direct indexing (`string[index]`) and the `.charAt()` method return the character at an index based on UTF-16 code units. For characters represented by two code units (surrogate pairs), they return an incomplete character:

```javascript
const string = 'Hi👍' // Character escapes: '\u0048\u0069\ud83d\udc4d'
const h = string[0] // 'H' (the first unit '\u0048'), because 'H' is encoded as a single UTF-16 unit
const i = string[1] // Similarly, 'i' (the second unit '\u0069')
const e = string[2] // Not the expected 👍; instead the third unit '\ud83d', the first half of the emoji's surrogate pair
const emoji = string[2] + string[3] // Concatenating the two units in sequence yields the complete emoji 👍 ('\ud83d\udc4d')
// `.charAt()` behaves the same as direct indexing: it returns the code unit at the position, not necessarily a full character
const e1 = string.charAt(3) // '\udc4d'
```
`.charCodeAt()` behaves like indexing and `.charAt()`, but returns the numeric UTF-16 code unit at the specified index:

```javascript
const string = 'Hi👍' // Character escapes: '\u0048\u0069\ud83d\udc4d'
const e1 = string.charCodeAt(3) // UTF-16 code unit 56397
const e2 = e1.toString(16) // Converted to hexadecimal: 'dc4d'
```
`.codePointAt()` returns the complete Unicode code point of the character that starts at the specified index, correctly handling characters represented by surrogate pairs. Note that the index is still based on UTF-16 code units, not Unicode code points.

```javascript
const string = 'Hi👍' // Character escapes: '\u0048\u0069\ud83d\udc4d'
const e1 = string.codePointAt(2) // 128077, the Unicode code point of 👍, decoded from the surrogate pair '\ud83d\udc4d'
// Indexing into the second unit of 👍 ('\ud83d\udc4d') does not yield the correct code point:
// we get 56397 (hexadecimal 0xdc4d), the value of the lone trailing surrogate
const e5 = string.codePointAt(3)
```
The `.length` property of a string counts UTF-16 code units, which may not match the actual number of characters, especially when the string contains two-unit characters (surrogate pairs):

```javascript
const string = 'Hi👍' // Character escapes: '\u0048\u0069\ud83d\udc4d'
const len = string.length // 4, because '\u0048\u0069\ud83d\udc4d' consists of 4 code units
```
To count characters accurately, especially when strings include emojis or other complex characters, you can use `[...string].length` or a `for...of` loop; both iterate the string by Unicode code points, regardless of whether a character is represented by one or two UTF-16 code units.
```javascript
const string = 'Hi👍' // Character escapes: '\u0048\u0069\ud83d\udc4d'
// Fundamentally, `[...string]` spreads the string into an array via its iterator, which steps by Unicode code points
const len1 = [...string].length // 3
// This loop prints the 3 individual characters: 'H', 'i', and '👍'
for (const str of string) {
  console.log(str)
}
```
Additionally, two static methods convert numeric values back into strings: `String.fromCharCode()` builds a string from a sequence of UTF-16 code units, and `String.fromCodePoint()`, introduced in ECMAScript 6, builds one from Unicode code points.
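A short sketch of the two constructors producing the same emoji from its two representations:

```javascript
// String.fromCharCode takes UTF-16 code units; String.fromCodePoint takes Unicode code points
const fromUnits = String.fromCharCode(0xd83d, 0xdc4d) // '👍' from its surrogate pair
const fromPoint = String.fromCodePoint(0x1f44d)       // '👍' from its code point
console.log(fromUnits === fromPoint) // true
```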
The development and widespread adoption of Unicode has been critical in supporting the diverse range of languages and symbols used around the world, enabling truly global communication. As technology continues to evolve, particularly on the web, Unicode remains central to ongoing internationalization and localization efforts. By standardizing character representation across systems, platforms, and applications, this foundational standard keeps digital communication compatible, inclusive, and accessible to all users, regardless of language.