Friday 24 December 2010

JavaScript: validating UTF-8 string lengths in the browser

Let's take a JavaScript string: "€100". This is going to be sent from a browser input box and stored in a web server's database. The database is using the UTF-8 encoding and the constraint on the column is CHAR(4), which for this example we'll assume is measured in bytes. Spot the problem?

From the ECMA specification:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.

As a UTF-16 string, this data has a length of 4 code units (the 16-bit code unit sequence is 20AC 0031 0030 0030). Encoded as UTF-8, the same string has a length of 6. A UTF-8 code unit is 8 bits (one byte) wide, and the encoded form is E2 82 AC 31 30 30 (the first three bytes are the euro symbol).
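Both lengths can be checked from a console. This is a minimal sketch that assumes the environment provides TextEncoder (modern browsers and Node.js do, though it did not exist when this post was written); the variable names are only illustrative.

```javascript
var text = "\u20AC100"; // "€100"

// UTF-16 code units, as seen by String.prototype.length and charCodeAt
var codeUnits = [];
for (var i = 0; i < text.length; i++) {
  codeUnits.push(text.charCodeAt(i).toString(16).toUpperCase());
}

// UTF-8 bytes; assumes TextEncoder is available (it always encodes as UTF-8)
var bytes = Array.from(new TextEncoder().encode(text), function (b) {
  return b.toString(16).toUpperCase();
});

console.log(text.length, codeUnits.join(" ")); // 4 20AC 31 30 30
console.log(bytes.length, bytes.join(" "));    // 6 E2 82 AC 31 30 30
```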

It would be best if this problem were caught before the data was sent to the server (you still need to validate input on the server, of course). Many languages and platforms have rich encoding libraries. By comparison, the standard JavaScript library is very lean. The ECMA standard does mandate some encoding functionality for URIs (encodeURIComponent, for example) from which we might be able to hack a solution. However, it isn't rocket science to calculate the length of the string in UTF-8 directly, as we can demonstrate:

Input text: €100

UTF-16 length: 4

UTF-8 length: 6
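For comparison, the URI-based hack mentioned above might look something like the sketch below. It relies on encodeURIComponent percent-encoding the UTF-8 form of the string, so every escaped byte appears as %XX while unescaped characters are single-byte ASCII. The function name is made up for illustration, and note that encodeURIComponent throws a URIError when the string contains an unpaired surrogate.

```javascript
/**
 * Counts UTF-8 bytes by way of percent-encoding (hypothetical helper name).
 * str - a string
 * return - the length in bytes as an integer
 */
function utf8ByteCountViaUri(str) {
  // Collapse every %XX escape to one placeholder character, then count.
  return encodeURIComponent(str).replace(/%[0-9A-F]{2}/g, "x").length;
}

console.log(utf8ByteCountViaUri("$100"));      // 4
console.log(utf8ByteCountViaUri("\u20AC100")); // 6
```

This is compact, but it builds a throwaway string up to nine times longer than the input, which is one reason to prefer counting directly.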

The difficulty here is in communicating this information to the end user. Few people are going to understand why "$100" is valid while "€100" is not.

One work-around could be to triple the column size in the database (or cut the allowed input to a third). Each UTF-16 code unit needs at most three bytes in UTF-8, so this allows a degree of user-interface consistency. Such an approach may not always be practical. Note: I'm ignoring cases like combining character sequences.
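The factor of three is the worst case per UTF-16 code unit: a code point in the Basic Multilingual Plane needs at most three UTF-8 bytes, and a supplementary code point needs four bytes but already occupies two code units. A rough sketch of a check built on that bound follows; the function name and the maxBytes parameter are illustrative assumptions, not part of the code further down.

```javascript
/**
 * Conservative check: true only when str is guaranteed to fit in
 * maxBytes of UTF-8, without actually encoding it (hypothetical helper).
 */
function fitsWorstCase(str, maxBytes) {
  return str.length * 3 <= maxBytes;
}

console.log(fitsWorstCase("\u20AC100", 12)); // true: 4 code units * 3 = 12
console.log(fitsWorstCase("\u20AC100", 4));  // false
console.log(fitsWorstCase("$100", 4));       // false as well, so users see one rule
```

The trade-off is that strings which would actually fit, such as "$100" in four bytes, are rejected too.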

The Code

The utf8ByteCount function below returns the length of a string when encoded as UTF-8.

/**
 * codePoint - an integer containing a Unicode code point
 * return - the number of bytes required to store the code point in UTF-8
 */
function utf8Len(codePoint) {
  if(codePoint >= 0xD800 && codePoint <= 0xDFFF)
    throw new Error("Illegal argument: "+codePoint);
  if(codePoint < 0) throw new Error("Illegal argument: "+codePoint);
  if(codePoint <= 0x7F) return 1;
  if(codePoint <= 0x7FF) return 2;
  if(codePoint <= 0xFFFF) return 3;
  // RFC 3629 limits UTF-8 to U+10FFFF, so four bytes is the maximum
  if(codePoint <= 0x10FFFF) return 4;
  throw new Error("Illegal argument: "+codePoint);
}

function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

/**
 * Transforms UTF-16 surrogate pairs to a code point.
 * See RFC 2781
 */
function toCodepoint(highCodeUnit, lowCodeUnit) {
  if(!isHighSurrogate(highCodeUnit)) throw new Error("Illegal argument: "+highCodeUnit);
  if(!isLowSurrogate(lowCodeUnit)) throw new Error("Illegal argument: "+lowCodeUnit);
  highCodeUnit = (0x3FF & highCodeUnit) << 10;
  var u = highCodeUnit | (0x3FF & lowCodeUnit);
  return u + 0x10000;
}

/**
 * Counts the length in bytes of a string when encoded as UTF-8.
 * str - a string
 * return - the length as an integer
 */
function utf8ByteCount(str) {
  var count = 0;
  for(var i=0; i<str.length; i++) {
    var ch = str.charCodeAt(i);
    if(isHighSurrogate(ch)) {
      var high = ch;
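      // charCodeAt returns NaN past the end of the string; toCodepoint
      // then throws, as it does for any unpaired high surrogate
      var low = str.charCodeAt(++i);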
      count += utf8Len(toCodepoint(high, low));
    } else {
      count += utf8Len(ch);
    }
  }
  return count;
}
