Wednesday, 19 November 2014

JavaScript – Get the byte size of Unicode characters

Recently I had to submit input fields containing 2-byte or 3-byte characters such as Chinese, Japanese, or Korean, which caused a database error because multi-byte characters increase the byte length of the string beyond its character count.

So, it was necessary to check the byte size of the string in JavaScript before submitting the form.
Use the function below to get the byte count of an input string.

function countUtf8Bytes(str)
{
  var count = 0, decimal_code_point;
  for (var i = 0; i < str.length; i++)
  {
    // charCodeAt() returns the decimal code point of the character at index i
    decimal_code_point = str.charCodeAt(i);
    // 1 byte, plus 1 if the code point is above 127, plus 1 more if above 2047
    count = count + (1) + (decimal_code_point > 127) + (decimal_code_point > 2047);
  }
  return count;
}


How does it work?

Before trying anything, I think we should know how it works, so that we can fix any exceptional cases found in the future.

If you want to know how this function works, then please go ahead and read the explanation below. First, we should know what Unicode characters are.

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet.

UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding capable of encoding all possible characters, which are called code points, in Unicode.
Every character in Unicode is assigned a magic number by the Unicode Consortium, which is written like this: U+0639. U+0639 is the Arabic letter Ain. The English letter A is U+0041. A number such as U+0041 is called a code point.

UTF-8 is a system for storing a string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6).
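You can verify these byte counts directly in a modern JavaScript engine using the TextEncoder API (not available in 2014-era browsers, so treat this as a sanity check rather than part of the original technique):

```javascript
// TextEncoder encodes a string to its UTF-8 bytes,
// so the length of the result is the UTF-8 byte count.
var encoder = new TextEncoder();
console.log(encoder.encode("A").length); // 1 byte  (U+0041, code point below 128)
console.log(encoder.encode("ع").length); // 2 bytes (U+0639, code point below 2048)
console.log(encoder.encode("€").length); // 3 bytes (U+20AC, code point below 65536)
```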

Take a look at the following table to understand how code points are stored in memory.

Table 1


From the above table, if we convert the hexadecimal code points to decimal, we can find the decimal code point ranges as follows –

Bits of code point | First code point | Last code point | Bytes in sequence | First decimal code point | Last decimal code point
 7 | U+0000  | U+007F   | 1 |     0 |     127
11 | U+0080  | U+07FF   | 2 |   128 |    2047
16 | U+0800  | U+FFFF   | 3 |  2048 |   65535
21 | U+10000 | U+1FFFFF | 4 | 65536 | 2097151

So, from the above table we conclude that if the decimal code point of any character is in the range of –

  •  0 - 127, it is a 1-byte character.
  •  128 - 2047, it is a 2-byte character.
  •  2048 - 65535, it is a 3-byte character.
  •  65536 - 2097151, it is a 4-byte character.
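These ranges translate directly into a lookup function. The helper below (a name made up here for illustration) returns the UTF-8 byte count for a single decimal code point:

```javascript
// Returns how many UTF-8 bytes are needed to store one Unicode code point,
// given its decimal value (per the ranges listed above).
function bytesForCodePoint(decimalCodePoint) {
  if (decimalCodePoint <= 127)   return 1;
  if (decimalCodePoint <= 2047)  return 2;
  if (decimalCodePoint <= 65535) return 3;
  return 4; // 65536 - 2097151
}

console.log(bytesForCodePoint(36));     // 1 ($, U+0024)
console.log(bytesForCodePoint(1593));   // 2 (ع, U+0639)
console.log(bytesForCodePoint(150370)); // 4 (𤭢, U+24B62)
```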
Table 2


As given in the above table, the Unicode code point for the character $ is U+0024, whose binary code point is 0100100 and whose decimal code point is 36 (just convert hex 0024 to decimal). So, it is a 1-byte character.

Similarly, in the fourth row there is a Chinese character with code point U+24B62, whose decimal code point is 150370. As it falls in the range 65536 – 2097151, it is a 4-byte character.

Now, you might wonder why we calculate the decimal code point and check its range. Look at the JavaScript function already written above: we used the charCodeAt() JavaScript function to get code points. This function, charCodeAt(), actually returns the decimal code point of the character.

So, in the below line of our JavaScript function, we check the range and add a byte each time the character's code point exceeds a range boundary.

count= count + (1)+(decimal_code_point>127)+(decimal_code_point>2047);
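This one-liner works because JavaScript coerces booleans to numbers when adding: true becomes 1 and false becomes 0, so each comparison the code point exceeds adds one more byte. A quick demonstration:

```javascript
var decimal_code_point = 1593; // ع (U+0639), a 2-byte character
// (1593 > 127) is true -> 1, (1593 > 2047) is false -> 0
console.log((1) + (decimal_code_point > 127) + (decimal_code_point > 2047)); // 2
```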

This code only handles characters up to 3 bytes.

Calculating the size of 4-byte characters is not possible with this method because of the way the charCodeAt() function works.

The charCodeAt() method returns the numeric Unicode value of the code unit at the given index (so for code points above 0xFFFF it returns only part of the character). Unicode code points range from 0 to 1114111 (0x10FFFF).

Note that charCodeAt() always returns a value less than 65536. This is because higher code points are represented by a pair of (lower-valued) "surrogate" pseudo-characters which together comprise the real character. Because of this, in order to examine or reproduce the full character for code points 65536 and above, it is necessary to retrieve not only charCodeAt(i) but also charCodeAt(i+1), as if examining a string with two letters.
  
So, if we check the size of 4-byte characters using charCodeAt(), the result is wrong: for characters above code point 65536, this function returns two code units, and adding up their sizes based on the values charCodeAt() returns gives the wrong total.

For example –
For the Chinese character 𤭢 (U+24B62), charCodeAt() returns two code units, 55378 and 57186. Each of these falls in the 3-byte range, so adding their sizes gives a total of 6, which is wrong: as we know from Table 1 above, 𤭢 (U+24B62) is a 4-byte character.
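You can observe this yourself. In an ES6-capable engine, codePointAt() returns the real code point, while charCodeAt() returns the individual surrogates:

```javascript
var s = "𤭢"; // U+24B62, stored as a surrogate pair in JavaScript strings
console.log(s.length);         // 2 (two UTF-16 code units, not one character)
console.log(s.charCodeAt(0));  // 55378 (high surrogate)
console.log(s.charCodeAt(1));  // 57186 (low surrogate)
console.log(s.codePointAt(0)); // 150370 (the real code point, ES6+)
```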
So, here I am leaving it up to you to modify the above code to account for 4-byte characters in the calculation.
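If you want a starting point, here is one possible sketch (not the only way, and the function name is made up here): it detects high surrogates and counts each pair as a single 4-byte character.

```javascript
// Counts UTF-8 bytes, treating surrogate pairs as single 4-byte characters.
// High surrogates occupy the code unit range 0xD800 - 0xDBFF.
function countUtf8BytesFull(str) {
  var count = 0, code;
  for (var i = 0; i < str.length; i++) {
    code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: the real code point is >= 65536, so 4 bytes,
      // and the next code unit (the low surrogate) belongs to the same character.
      count += 4;
      i++;
    } else {
      count += 1 + (code > 127) + (code > 2047);
    }
  }
  return count;
}

console.log(countUtf8BytesFull("𤭢"));  // 4
console.log(countUtf8BytesFull("A€")); // 4 (1 + 3)
```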

Best of Luck !!!
