Recently I had to submit input fields containing 2-byte or 3-byte characters, such as Chinese, Japanese, or Korean, which was causing a database error because multi-byte characters increase the byte length of the string. So it was necessary to check the UTF-8 byte size of the string in JavaScript before submitting the form.
Use the below function to get the byte count of an input string.
function countUtf8Bytes(str) {
    var count = 0, decimal_code_point, i;
    for (i = 0; i < str.length; i++) {
        decimal_code_point = str.charCodeAt(i);
        // Every character costs at least 1 byte; each comparison below
        // coerces to 1 (true) or 0 (false) and adds an extra byte when
        // the code point crosses into a higher range.
        count = count + (1) + (decimal_code_point > 127) + (decimal_code_point > 2047);
    }
    return count;
}
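For example, a quick sanity check (the sample characters are just illustrative):
countUtf8Bytes("A");    // 1 (U+0041, code point 65)
countUtf8Bytes("é");    // 2 (U+00E9, code point 233)
countUtf8Bytes("中");   // 3 (U+4E2D, code point 20013)
countUtf8Bytes("A中");  // 4 (1 + 3)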
How does it work?
Before relying on this function, I think we should know how it works, so that we can fix any exceptional case found in the future. If you want to know how this function works, please go ahead and read the explanation below.
First, we should know what Unicode characters are. Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet.
UTF-8 (U from Universal Character Set + Transformation Format) is a character encoding capable of encoding all possible characters, which are called code points, in Unicode.
Every character in Unicode is assigned a magic number by the Unicode consortium, which is written like this: U+0639. U+0639 is the Arabic letter Ain; the English letter A is U+0041. A number such as U+0041 is called a code point.
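In JavaScript you can see this mapping directly (a small illustration; charCodeAt() is discussed in more detail later):
var cp = "A".charCodeAt(0); // 65, the decimal value of the code point U+0041
cp.toString(16);            // "41", the hex digits after the "U+"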
UTF-8 is a system for storing a string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or, in fact, up to 6 bytes.
Take a look at the following table to understand how the code points are stored in memory.
Table 1
From the above table, if we convert the hex code points to decimal, we get the following decimal code point ranges:
Bits of code point | First code point | Last code point | Bytes in sequence | First decimal code point | Last decimal code point
7                  | U+0000           | U+007F          | 1                 | 0                        | 127
11                 | U+0080           | U+07FF          | 2                 | 128                      | 2047
16                 | U+0800           | U+FFFF          | 3                 | 2048                     | 65535
21                 | U+10000          | U+1FFFFF        | 4                 | 65536                    | 2097151
So, from the above table, we conclude that if the decimal code point of a character is in the range of:
- 0 - 127, then it is a 1-byte character.
- 128 - 2047, then it is a 2-byte character.
- 2048 - 65535, then it is a 3-byte character.
- 65536 - 2097151, then it is a 4-byte character.
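These ranges translate directly into code. Here is a minimal sketch (the function name utf8BytesForCodePoint is mine, not part of the original script):
function utf8BytesForCodePoint(decimal_code_point) {
    if (decimal_code_point <= 127) return 1;   // U+0000 - U+007F
    if (decimal_code_point <= 2047) return 2;  // U+0080 - U+07FF
    if (decimal_code_point <= 65535) return 3; // U+0800 - U+FFFF
    return 4;                                  // U+10000 - U+1FFFFF
}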
Table 2
As given in the above table, the Unicode code point for the character $ is U+0024, whose binary code point is 0100100 and whose decimal code point is 36 (just convert hex 0024 to decimal). So it is a 1-byte character.
Similarly, in the fourth row there is a Chinese character with code point U+24B62, whose decimal code point is 150370. As it falls in the range 65536 - 2097151, it is a 4-byte character.
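You can do these hex-to-decimal conversions in JavaScript itself:
parseInt("0024", 16);  // 36     -> 1-byte character
parseInt("24B62", 16); // 150370 -> 4-byte character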
Now you might be wondering why we are calculating the decimal code point and checking its range. Look at the JavaScript function already written above: we used the charCodeAt() JavaScript function to get code points, and charCodeAt() actually returns the decimal code point of the character.
So, in the below line of our JavaScript function, we are checking the range and adding a byte whenever a character exceeds a range boundary. This works because JavaScript coerces true to 1 and false to 0 in arithmetic.
count = count + (1) + (decimal_code_point > 127) + (decimal_code_point > 2047);
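For instance, evaluating that expression step by step for the character 中:
var cp = "中".charCodeAt(0);   // 20013
1 + (cp > 127) + (cp > 2047);  // 1 + true + true -> 1 + 1 + 1 = 3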
This code only accounts for characters up to 3 bytes. Calculating the size of 4-byte characters is not possible with this method because of how the charCodeAt() function works.
The charCodeAt() method returns the numeric Unicode value of the character at the given index (except for Unicode code points > 0x10000). Unicode code points range from 0 to 1114111 (0x10FFFF).
Note that charCodeAt() will always return a value that is less than 65536. This is because the higher code points are represented by a pair of (lower valued) "surrogate" pseudo-characters which are used to comprise the real character. Because of this, in order to examine or reproduce the full character for individual characters of value 65536 and above, it is necessary to retrieve not only charCodeAt(i), but also charCodeAt(i+1) (as if examining/reproducing a string with two letters).
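One common way to combine the two halves back into the real code point uses the surrogate ranges 0xD800-0xDBFF (high) and 0xDC00-0xDFFF (low). Here is a sketch (the helper name fullCodePointAt is mine; newer engines provide the equivalent String.prototype.codePointAt()):
function fullCodePointAt(str, i) {
    var hi = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one code point.
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
        var lo = str.charCodeAt(i + 1);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        }
    }
    return hi; // not a surrogate pair; the value is the code point itself
}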
So, checking the size of 4-byte characters using charCodeAt() gives a wrong answer: for characters above code point 65535 the function returns two code units, and adding up byte sizes computed from those two values is wrong.
For example, for the Simplified Chinese character 𤭢 (U+24B62), charCodeAt() returns the two values 55378 and 57186. Each of these falls in the 3-byte range, so adding their sizes gives a total of 6, which is wrong: as Table 1 above shows, 𤭢 (U+24B62) is a 4-byte character.
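You can reproduce this directly (the string literal spells out the surrogate pair, so it runs even in older engines):
var s = "\uD852\uDF62"; // 𤭢 (U+24B62); note that s.length is 2, not 1
s.charCodeAt(0);        // 55378 (0xD852, high surrogate)
s.charCodeAt(1);        // 57186 (0xDF62, low surrogate)
countUtf8Bytes(s);      // 6, but the true UTF-8 size is 4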
So, here I am leaving it up to you to modify the above code to take 4-byte characters into account.
Best of Luck !!!
Following are the sites I referred to in coming to this conclusion.