UTF8 and UTF16 in JavaScript
9 Dec, 2011. Written by Tom Roggero
I was wondering what the fuck the bitwise operators were doing in the UTF-16 decode function of Byte Counter by Mathias Bynens. Because reading RFC documentation is so boring, I decided to ask Some directly (the guy who made that function). Everything started on StackOverflow.
The reason you don't find anything about it in RFC 3492 is that it doesn't say anything about JavaScript.
A simple answer to your question: the code extracts two 10-bit values and returns a 21-bit value.
Before I describe in detail what the line does, I will explain why it's there.
Internally, JavaScript stores characters as 16-bit integers. The punycode implementation I made (based on the example in the RFC) needs an array of 32-bit integers to be able to use the full range of Unicode (but since JavaScript doesn't have integers, it gets an array of double-precision 64-bit IEEE 754 floats).
To be able to store code points outside of the BMP (Basic Multilingual Plane) in a JavaScript string, UTF-16 is used. When the code point is outside of the BMP, it uses surrogate pairs to store a 20-bit value in two words with 10 bits each.
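To see surrogate pairs in practice, here is a small sketch (not part of Some's answer). It assumes U+1D306 as the sample code point and uses only the built-in `length` and `charCodeAt`; the variable names are made up for illustration.

```javascript
// U+1D306 lies outside the BMP, so the string stores it as two 16-bit code units.
const tetragram = '\uD834\uDF06';

// length counts 16-bit code units, not code points.
const unitCount = tetragram.length;

// charCodeAt returns the individual 16-bit units.
const high = tetragram.charCodeAt(0); // high surrogate, in the range 0xD800-0xDBFF
const low = tetragram.charCodeAt(1);  // low surrogate, in the range 0xDC00-0xDFFF

// A BMP character needs only one unit.
const bmpUnitCount = 'A'.length;

console.log(unitCount, high.toString(16), low.toString(16), bmpUnitCount);
// prints: 2 d834 df06 1
```

So a single character outside the BMP shows up as two "characters" to any code that walks the string unit by unit, which is why the decoder has to merge them.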
Now to the code:
value = ((value & 0x3FF) << 10) + (extra & 0x3FF) + 0x10000;
- Before this line is executed, the variable "value" is assigned the "high surrogate", which carries the most significant 10 bits (and has a value between 0xD800 and 0xDBFF).
- The variable "extra" is assigned the "low surrogate", which carries the least significant 10 bits (and has a value between 0xDC00 and 0xDFFF).
Starting with this piece:
(value & 0x3FF)
- Only the lower 10 bits are wanted, so everything else is masked away with a bitwise AND. The result is a 10-bit value between 0x000 and 0x3FF.
<< 10
- This shifts the value to the left by 10 bits, which puts those 10 bits back in their place as the upper half of the 20-bit value. That's another way to say "multiply by 1024".
(extra & 0x3FF)
- Again, only the lower 10 bits are wanted, and everything else is masked away.
+ 0x10000
- This is the final conversion step. Since everything in the BMP (0x0000-0xFFFF) is accessed without surrogate pairs, there is no need to be able to access it with surrogates. The 20-bit value is therefore offset by 0x10000, so the result lands in the range 0x10000-0x10FFFF.
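The steps above can be put together in a short sketch. This is not Some's actual punycode code; `decodeSurrogatePair` and `toCodePoints` are hypothetical names, and the surrounding loop is only an assumption about how the line might be used to turn a string into an array of code points.

```javascript
// Hypothetical helper: combine a surrogate pair into one code point,
// following the same steps as the line discussed above.
function decodeSurrogatePair(high, low) {
  const upperTen = (high & 0x3FF) << 10; // lower 10 bits of the high surrogate, moved to bits 10-19
  const lowerTen = low & 0x3FF;          // lower 10 bits of the low surrogate, kept in bits 0-9
  return upperTen + lowerTen + 0x10000;  // offset so the result starts right after the BMP
}

// Hypothetical helper: turn a string into an array of code points,
// merging each valid surrogate pair into a single number.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let value = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (value >= 0xD800 && value <= 0xDBFF && i + 1 < str.length) {
      const extra = str.charCodeAt(i + 1);
      if (extra >= 0xDC00 && extra <= 0xDFFF) {
        value = decodeSurrogatePair(value, extra);
        i++; // skip the low surrogate, it has been consumed
      }
    }
    out.push(value);
  }
  return out;
}

// Worked example: 0xD834 & 0x3FF = 0x034, shifted left by 10 = 0xD000;
// 0xDF06 & 0x3FF = 0x306; 0xD000 + 0x306 + 0x10000 = 0x1D306.
const codePoint = decodeSurrogatePair(0xD834, 0xDF06);
console.log(codePoint.toString(16)); // prints: 1d306

const points = toCodePoints('a\uD834\uDF06b');
console.log(points.map((n) => n.toString(16)).join(' ')); // prints: 61 1d306 62
```

The smallest pair (0xD800, 0xDC00) gives 0x10000 and the largest pair (0xDBFF, 0xDFFF) gives 0x10FFFF, which matches the range of code points outside the BMP.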
By the way, thank you so much, Some.