Count unicode characters in JavaScript

May 29, 2017
javascript

There are several ways to count a JavaScript string in Unicode code points rather than 16-bit code units. What I wanted to know is which of them work in the widest range of environments, including legacy browsers such as IE11, as of May 2017. I did some research, and here is a summary.

Unicode Code Points #

In Unicode, every character is assigned an ID called a code point, in the range 0 to 0x10FFFF. A code point is written as U+ followed by its hexadecimal value, for example U+0041 for A.

In UTF-16, code points from U+0000 to U+FFFF are represented by a single 16-bit code unit. Code points from U+10000 upward, such as 𩸽 (U+29E3D, "hokke" in Japanese, the Arabesque greenling), do not fit in 16 bits, so they are represented by surrogate pairs as described below.
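To make the numbers concrete, here is a small sketch that inspects 𩸽 in a modern engine. It relies on String.prototype.codePointAt, which is ES2015, so it is only an illustration and not something IE11 can run; the variable names are mine.

```javascript
// Illustration only: codePointAt is ES2015, so this needs a modern engine.
const ch = '𩸽';

// The code point, written in the usual U+hex notation.
const codePoint = ch.codePointAt(0);
const label = 'U+' + codePoint.toString(16).toUpperCase();
console.log(label); // U+29E3D

// The two 16-bit code units that UTF-16 uses to store it.
const units = [ch.charCodeAt(0), ch.charCodeAt(1)].map(
  (u) => u.toString(16).toUpperCase()
);
console.log(units); // [ 'D867', 'DE3D' ]
```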

Surrogate Pairs #

In UTF-16, characters whose code points are U+10000 or higher do not fit in a single 16-bit code unit, so they are encoded as 32 bits (two 16-bit code units). Such a pair of code units is called a surrogate pair.

Surrogate Code Point #

The 16-bit values that make up a surrogate pair come from the range U+D800 to U+DFFF. These are called surrogate code points, and no characters are assigned to them.

In addition:

  • U+D800 ~ U+DBFF are called high surrogates (also known as upper or lead surrogates).
  • U+DC00 ~ U+DFFF are called low surrogates (also known as lower or trail surrogates).

A surrogate pair is a high surrogate followed by a low surrogate.
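The arithmetic behind the pairing can be sketched as follows. The helper names toSurrogatePair and fromSurrogatePair are hypothetical, chosen for this example; the code sticks to ES5 syntax so it should also run in old browsers.

```javascript
// Hypothetical helper: split a code point >= U+10000 into its two surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  var offset = codePoint - 0x10000;    // leaves a 20-bit value
  var first = 0xD800 + (offset >> 10); // top 10 bits -> U+D800..U+DBFF
  var second = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> U+DC00..U+DFFF
  return [first, second];
}

// Hypothetical helper: the inverse, from the two surrogates back to a code point.
function fromSurrogatePair(first, second) {
  return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
}

var pair = toSurrogatePair(0x29E3D); // 𩸽
console.log(pair.map(function (u) { return u.toString(16); })); // [ 'd867', 'de3d' ]
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // 29e3d
console.log(String.fromCharCode(pair[0], pair[1])); // 𩸽
```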

.length property #

JavaScript’s str.length does not return the length of the string in Unicode code points, but in UTF-16 code units.

For example, “𩸽” is a surrogate pair, so the length will be 2.

"𩸽".length; // 2

We want to count surrogate pairs such as “𩸽” as 1.

How to count the length in code points #

for of #

ES2015 added for...of, which iterates over a string code point by code point.

for (let c of '𩸽定食') console.log(c)
// 𩸽
// 定
// 食

However, as of May 2017, IE11 does not support it: https://kangax.github.io/compat-table/es6/#test-for..of_loops
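Turning the loop into a counter is straightforward. This is a minimal sketch; countWithForOf is a name made up for this example, and it needs an ES2015 engine.

```javascript
// Hypothetical helper; requires for...of support (so not IE11).
function countWithForOf(str) {
  let count = 0;
  // The string iterator yields one code point per step,
  // so a surrogate pair is visited once, not twice.
  for (const _ch of str) {
    count++;
  }
  return count;
}

console.log(countWithForOf('𩸽定食')); // 3
console.log('𩸽定食'.length); // 4
```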

Spread operator #

Destructuring assignment is also code point aware. With the spread operator, you can quickly turn a string into an array with one element per code point.

[...'𩸽定食']
// ['𩸽', '定', '食']

However, as of May 2017, IE11 does not support it: https://kangax.github.io/compat-table/es6/#test-spread(…)operator
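For counting, the array length is all that is needed. The sketch below shows the spread version, the equivalent Array.from call (also ES2015), and array destructuring, which goes through the same string iterator. The function names are hypothetical.

```javascript
// Hypothetical helpers; all of these need an ES2015 engine.
const countWithSpread = (str) => [...str].length;
const countWithArrayFrom = (str) => Array.from(str).length;

console.log(countWithSpread('𩸽定食')); // 3
console.log(countWithArrayFrom('𩸽定食')); // 3

// Array destructuring also splits by code point.
const [first, ...rest] = '𩸽定食';
console.log(first); // 𩸽
console.log(rest); // [ '定', '食' ]
```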

RegExp unicode flag #

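ES2015 introduced the unicode (u) flag for regular expressions, which makes a pattern match against a sequence of code points instead of 16-bit code units. Note that . does not match line terminators, so the example below only covers every character when the string contains no line breaks.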

'𩸽定食'.match(/./ug);
// ['𩸽', '定', '食']

Again, IE11 does not support it: https://kangax.github.io/compat-table/es6/#test-RegExp_y_and_u_flags
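One thing to watch when counting this way: the dot does not match line terminators, so a multi-line string comes up short. A sketch that uses [\s\S] to match any code point instead (countWithUnicodeRegExp is a hypothetical name, and the u flag needs an ES2015 engine):

```javascript
// Hypothetical helper; the u flag makes [\s\S] match whole code points.
function countWithUnicodeRegExp(str) {
  const matches = str.match(/[\s\S]/gu);
  return matches ? matches.length : 0; // match() returns null for ''
}

const multiline = '𩸽\n定食';
console.log(multiline.match(/./gu).length); // 3 (the newline is skipped)
console.log(countWithUnicodeRegExp(multiline)); // 4
console.log(countWithUnicodeRegExp('')); // 0
```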

Do it by hand with a for loop #

Loop over the string in 16-bit code units and count each surrogate pair as 1.

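function stringLength(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    count++;
    // 16-bit code unit at index i
    const code = str.charCodeAt(i);
    if (0xD800 <= code && code <= 0xDBFF) {
      // High (upper) surrogate: skip the next code unit only when it
      // really is a low (lower) surrogate, so that a lone high
      // surrogate does not swallow the character that follows it.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (0xDC00 <= next && next <= 0xDFFF) {
        i++;
      }
    }
  }
  return count;
}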

Obviously it works with IE11 🎉

Regular expressions (Unicode escape sequences) #

The idea is to replace each surrogate pair with a single non-surrogate character first, and then take the length.

function stringLength(str) {
  // Replace surrogate pair with _
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '_').length;
}

stringLength('𩸽定食'); // 3

If you want to convert the string to an array instead:

function stringToArray(str) {
  return str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/g) || [];
}

stringToArray('𩸽定食');
// ['𩸽', '定', '食']
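As a sanity check, the IE11-friendly approaches can be compared against each other. The sketch below is self-contained and uses ES5 syntax only; the names countByLoop, countByReplace and countByMatch are hypothetical, and countByLoop is a variant of the loop that checks that a low surrogate really follows before skipping it.

```javascript
// Hypothetical names; ES5 syntax only, so this should also run in old browsers.
function countByLoop(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    count++;
    var code = str.charCodeAt(i);
    var next = str.charCodeAt(i + 1); // NaN past the end of the string
    if (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) {
      i++; // skip the second half of the pair
    }
  }
  return count;
}

function countByReplace(str) {
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '_').length;
}

function countByMatch(str) {
  var m = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/g);
  return (m || []).length;
}

var samples = ['', 'abc', '𩸽定食', 'a𩸽b😀'];
samples.forEach(function (s) {
  console.log([countByLoop(s), countByReplace(s), countByMatch(s)]);
});
// [ 0, 0, 0 ]
// [ 3, 3, 3 ]
// [ 3, 3, 3 ]
// [ 4, 4, 4 ]

// The approaches differ on malformed input: the match pattern has no
// alternative for an unpaired surrogate, so it silently drops it.
var lone = 'a\uD800';
console.log([countByLoop(lone), countByReplace(lone), countByMatch(lone)]);
// [ 2, 2, 1 ]
```

For well-formed strings the three agree, so the choice mostly comes down to whether you need the count or the array, and how you want unpaired surrogates to be treated.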

Conclusion #

That’s all. Maybe we should just drop IE11 support.