Counting Characters
Counting Characters - It's harder than you think! This page discusses issues when counting the number of characters in user input. It is most relevant to WTextArea.
Modern browsers allow developers to restrict the maximum number of characters the user may enter into a textarea by setting the maxlength attribute.
For example, here is a simple HTML snippet that creates a form with a single textarea allowing up to 200 characters:
<form action="" method="post">
<textarea maxlength="200"></textarea>
<input type="submit"/>
</form>
The image below shows how this simple page looks when loaded in a web browser.
A common scenario is for a user to copy text from another application into a textarea. For example, assume the user wants to copy text from Microsoft Word and paste it into the textarea above (with its limit of 200 characters).
The image below shows the MS Word document we are copying from.
Note that, according to MS Word, the selected text is exactly 200 characters long.
When we paste the above 200 characters into our simple HTML form and press the submit button here is what we see:
According to the browser we have entered 203 characters. How can this be? MS Word wouldn't lie to us, right? The browser wouldn't lie either, right?
Let's ask a simple text editor how many characters there are. The image below shows the same text pasted into Atom:
What the?!? The text editor says we have 206 characters!
In summary we have copied text from MS Word and pasted into a web browser and a text editor. Each of these reports the following character counts:
- MS Word: 200
- Web Browser: 203
- Text Editor: 206
To find who is telling the truth we must examine the raw bytes that make up this text using a hex editor, shown below:
We can clearly see twenty rows of ten characters followed by one row of six: (20 x 10) + 6 = 206 characters.
- MS Word: 200 <- LIE
- Web Browser: 203 <- LIE
- Text Editor: 206 <- TRUTH
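The hex editor view can be approximated in a few lines of Python. This is only a sketch: `hex_rows` is a hypothetical helper written for this page, and the sample text is a short stand-in rather than the 206-character example above. It lays the raw bytes out in rows of ten so that line breaks are visible as byte values.

```python
def hex_rows(data, width=10):
    """Format bytes as rows of two-digit uppercase hex values, `width` bytes per row."""
    return [
        " ".join(f"{b:02X}" for b in data[i:i + width])
        for i in range(0, len(data), width)
    ]


# Hypothetical sample: two short words separated by a Windows-style line break.
sample = "Hi\r\nyou".encode("ascii")

rows = hex_rows(sample)
for row in rows:
    # The line break shows up as the byte pair 0D 0A.
    print(row)

# Total number of bytes in the sample, line break included.
print(len(sample))
```

For plain ASCII text like this, one byte corresponds to one character, which is why counting bytes in the hex editor gives the true character count.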
In the scenario above MS Word and the Web Browser are counting newline characters differently. Only the text editor is counting them for what they are.
When you hit Enter in MS Word it adds two characters to create the newline; these are shown in the hex editor as 0D 0A. In English 0D is called "Carriage Return" (CR) and 0A is called "Line Feed" (LF); we can abbreviate this combination as CRLF.
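When counting characters MS Word is completely ignoring these character sequences, while the web browser is counting each of them as if it were only one character. The sample text contains three newlines, which accounts for the three totals: 200 + (3 x 0) = 200 in MS Word, 200 + (3 x 1) = 203 in the browser and 200 + (3 x 2) = 206 in the text editor. As we have proven above, the first two are wrong.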
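The three counting behaviours described above can be sketched in Python. The function names are hypothetical and the text is a stand-in (200 visible ASCII characters separated by three CRLF pairs), not the document used in the screenshots.

```python
import re

# Matches any one line break: CRLF, a lone CR, or a lone LF.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def count_ignoring_breaks(text):
    """Count as MS Word does in the example: line breaks contribute nothing."""
    return len(LINE_BREAK.sub("", text))


def count_breaks_as_one(text):
    """Count as the browser does in the example: each line break is one character."""
    return len(LINE_BREAK.sub("\n", text))


def count_breaks_as_crlf(text):
    """Count what is really there once every line break is a CR LF pair."""
    return len(LINE_BREAK.sub("\r\n", text))


# Stand-in text: four runs of 50 characters joined by three CRLF line breaks.
text = "\r\n".join(["x" * 50] * 4)

print(count_ignoring_breaks(text))  # MS Word style
print(count_breaks_as_one(text))    # browser style
print(count_breaks_as_crlf(text))   # text editor style
```

The only difference between the three functions is what each line break is replaced with before the length is taken, which is the whole of the disagreement between the three applications.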
Unix, Linux and Mac OS X do not use CRLF to represent a newline but instead simply use LF (long ago MacOS used CR). Does this solve the problem? Unfortunately NO!
Even if the user input contains LF for newlines the server will receive CRLF combinations. This is explained in the HTML5 Spec:
Finally, there is the value, as used in form submission and other processing models in this specification. It is normalised so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs ...
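The normalisation the spec describes can be imitated as follows. This is an illustrative sketch, not browser or WComponents code; `normalise_line_breaks` is a hypothetical name, and the sample string is made up.

```python
import re


def normalise_line_breaks(value):
    """Rewrite every line break (CRLF, lone CR, or lone LF) as a CR LF pair."""
    # CRLF is listed first so an existing pair is matched as one unit
    # and is not expanded a second time.
    return re.sub(r"\r\n|\r|\n", "\r\n", value)


# Text as it might be typed on Unix or Linux: LF only.
unix_style = "line one\nline two\nline three"

submitted = normalise_line_breaks(unix_style)

# The submitted value is longer by one character per line break.
print(len(unix_style), len(submitted))
```

Because text that already uses CRLF is left unchanged, applying the function twice gives the same result as applying it once.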
When the user submits a page with a textarea, WComponents (server side) counts the actual characters. In our example above this means we would count 206 characters.
In some cases this can result in confusion: the user enters a value that does not exceed the maximum length as reported in the browser, but it is rejected as too long by WComponents on the server. As discussed above, the browser is lying to the user and WComponents is telling the truth.
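The mismatch can be predicted with a little arithmetic. The sketch below is not the WComponents API; the function names are hypothetical, and it assumes the browser counts each line break as a single LF, as in the example on this page.

```python
def browser_length(value):
    """Length as the browser in the example reports it.

    Assumes `value` uses a single LF for every line break.
    """
    return len(value)


def server_length(value):
    """Length after submission, when each LF has become a CR LF pair."""
    # One extra character per line break.
    return len(value) + value.count("\n")


def accepted_by_server(value, max_length):
    """True if the submitted value fits within `max_length` actual characters."""
    return server_length(value) <= max_length


# Stand-in input: 197 visible characters plus 3 line breaks.
value = "\n".join(["x" * 49, "x" * 49, "x" * 49, "x" * 50])

print(browser_length(value))           # what the user sees counted
print(server_length(value))            # what the server counts
print(accepted_by_server(value, 200))  # whether a limit of 200 is met
```

Here the browser reports a value that sits exactly on the limit of 200, while the server sees three more characters and rejects it. Input with no line breaks is counted identically on both sides.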