Reversing a String That Contains Emojis in JavaScript and C++
Update 23/06/19: my solution assumed that every emoji consists of a single code point (rune), which is not true for emojis such as country flags 🇻🇬. Sorry for missing this part.
This blog is inspired by Mathias Bynens’s JavaScript ♥ Unicode. His presentation is much more fun and informative than this short article :)
Reversing a string that contains emojis is a nice toy problem to play with, and the solution also applies to more general problems such as string matching and modification.
Why?
In JavaScript, str.split("").reverse().join("") works well in almost all cases, so why does it break when the string contains emojis?
1. A JavaScript String stores its character codes as a sequence in which each element is a UTF-16 code unit, so each element is at most 0xFFFF.
Quote from ECMAScript® 2015 Language Specification:
A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
2. Many emojis have code points larger than 0xFFFF.
(Take a look at https://unicode.org/emoji/charts/full-emoji-list.html in case you are interested.)
3. JavaScript stores each of these large-valued emojis in two adjacent code units (a surrogate pair), so "🎉1".length gives 3 instead of 2: the string really does take three 16-bit units.
str.split("").reverse().join("") also reverses the order of those adjacent code units, so the pair no longer decodes to the original emoji, and we can only cross our fingers and hope that the end result does not contain weird characters.
You can play with the code below to see what happens.
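As a minimal sketch of the problem (the helper name naiveReverse is made up for this illustration), the snippet below prints the UTF-16 code units of "🎉1" and checks that the naive reversal leaves the surrogate pair in the wrong order:

```javascript
// Naive reversal: split("") cuts the string into UTF-16 code units.
function naiveReverse(str) {
  return str.split("").reverse().join("");
}

const party = "🎉1";

// U+1F389 (🎉) is stored as the surrogate pair 0xD83C 0xDF89,
// so the string takes three code units, not two.
console.log(party.length); // 3
console.log(party.charCodeAt(0).toString(16)); // d83c
console.log(party.charCodeAt(1).toString(16)); // df89

// After the naive reversal the low surrogate comes first,
// so the two units no longer decode to 🎉.
const reversed = naiveReverse(party);
console.log(reversed === "1\uDF89\uD83C"); // true

// Strings made only of code points up to 0xFFFF are unaffected.
console.log(naiveReverse("abc")); // cba
```

How the two lone surrogates are displayed depends on the environment, which is why the result looks "weird" rather than failing loudly.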
Solve the problem with JavaScript’s String iterator
The characters (instead of the UTF-16 units) in the text can be separated, e.g., we can parse “🎉” and “1” out of “🎉1”.
Luckily, we don’t have to do the parsing ourselves: if we use JavaScript’s String iterator to iterate through the value, it yields the expected characters, one code point at a time. We can take advantage of this through for...of or Array.from, like the example below. The details are in the String iterator’s specification.
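A small sketch of this approach, assuming a helper named reverseByCodePoint (a name chosen here for illustration): Array.from and for...of both go through the String iterator, so a surrogate pair stays together as one element.

```javascript
// Reverse by code point: Array.from uses the String iterator,
// which yields a whole surrogate pair as a single element.
function reverseByCodePoint(str) {
  return Array.from(str).reverse().join("");
}

console.log(reverseByCodePoint("🎉1")); // 1🎉

// for...of uses the same iterator.
for (const ch of "🎉1") {
  console.log(ch, ch.length); // "🎉 2", then "1 1"
}

// Caveat from the update at the top: a flag such as 🇻🇬 is made of two
// code points (U+1F1FB U+1F1EC), so this approach still splits it.
const flag = "\u{1F1FB}\u{1F1EC}";
console.log(Array.from(flag).length); // 2
```

This handles any emoji that is a single code point. Emojis built from several code points, such as flags, would need splitting by grapheme cluster instead, which this sketch does not attempt.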
In C++
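In some cases we don’t have nice built-in tools like the ones above and have to parse the characters ourselves. Say we are trying to parse the emojis out of UTF-8 encoded text (the table below shows the rules of UTF-8 encoding).
Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000 to U+007F       0xxxxxxx
U+0080 to U+07FF       110xxxxx   10xxxxxx
U+0800 to U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 to U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx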
Since the first byte of every UTF-8 encoded character indicates the “width” of the character in bytes, we can follow the steps below.
Given a UTF-8 encoded string S:
1. If the size of S is 0, exit.
2. Parse the width W from S[0].
3. Read the first W bytes of S and store/use them as one character.
4. Remove the first W bytes from S, then go back to step 1.
You can play with the code below:
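The steps are not tied to C++. As a rough sketch, here they are in JavaScript, working on a Uint8Array of UTF-8 bytes. A C++ version would walk the bytes of a std::string in the same way. The function names utf8Width, splitUtf8 and reverseUtf8 are made up for this illustration, and the code assumes well-formed UTF-8 input (it does not validate continuation bytes).

```javascript
// Step 2: read the width (in bytes) of a character from its leading byte.
function utf8Width(lead) {
  if ((lead & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((lead & 0xf8) === 0xf0) return 4; // 11110xxx
  throw new RangeError("invalid UTF-8 leading byte: " + lead);
}

// Steps 1 to 4: cut the byte sequence into one slice per character.
// An index is advanced instead of physically removing bytes from S.
function splitUtf8(bytes) {
  const chars = [];
  let i = 0;
  while (i < bytes.length) {          // step 1: stop when nothing is left
    const w = utf8Width(bytes[i]);    // step 2
    chars.push(bytes.subarray(i, i + w)); // step 3
    i += w;                           // step 4
  }
  return chars;
}

// Reverse the characters while keeping the bytes of each one in order.
function reverseUtf8(bytes) {
  const out = new Uint8Array(bytes.length);
  let pos = bytes.length;
  for (const ch of splitUtf8(bytes)) {
    pos -= ch.length;
    out.set(ch, pos);
  }
  return out;
}

const input = new TextEncoder().encode("🎉1");
console.log(input.length); // 5 (four bytes for 🎉, one for "1")
console.log(new TextDecoder().decode(reverseUtf8(input))); // 1🎉
```

Like the String iterator approach, this keeps each code point intact but still splits emojis that are made of several code points.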