Counting Characters
Modern browsers allow developers to restrict the maximum number of characters the user may enter into a [textarea](http://www.w3.org/TR/html5/forms.html#attr-textarea-maxlength) by setting the maxlength attribute.
For example, here is a simple HTML snippet that creates a form with a single textarea limited to 200 characters:
<form action="" method="post">
<textarea maxlength="200"></textarea>
<input type="submit"/>
</form>
The image below shows how this simple page looks when loaded in a web browser.
A common scenario is for a user to copy text from another application into a textarea. In this example we'll assume the user wants to copy from Microsoft Word and paste into the textarea example above (with a limit of 200 characters).
The image below shows the MS Word document we are copying from.
Note that, according to MS Word, the selected text is exactly 200 characters long.
When we paste the above 200 characters into our simple HTML form and press the submit button, here is what we see:
According to the browser we have entered 203 characters. How can this be? MS Word wouldn't lie to us, right? The browser wouldn't lie either, right? But someone is clearly not telling the truth.
Let's ask a simple text editor how many characters there are. The image below shows the same text pasted into Atom:
What the?!? The text editor says we have 206 characters!
In summary, we have copied text from Word and pasted it into a web browser and a text editor. The three applications report the following character counts:
- MS Word: 200
- Web Browser: 203
- Text Editor: 206
To find out who is telling the truth, we must examine the raw bytes that make up this text using a hex editor, shown below:
We can clearly see twenty rows of ten characters followed by one row of six: 20 × 10 + 6 = 206 characters.
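The hex editor view can be approximated in code. The sketch below is only an illustration, not the tool used above: `hexRows` is a hypothetical helper and the sample string is made up (it is not the 200-character Word text). It encodes a string as UTF-8 and lists the bytes ten per row, so a Windows-style newline shows up as the pair `0d 0a`.

```javascript
// Hypothetical helper: format the UTF-8 bytes of a string as rows of hex pairs.
function hexRows(text, perRow = 10) {
  const bytes = new TextEncoder().encode(text); // UTF-8 encoded bytes
  const rows = [];
  for (let i = 0; i < bytes.length; i += perRow) {
    // Array.from (not Uint8Array.map) so the mapped values can be strings.
    const row = Array.from(bytes.slice(i, i + perRow), (b) =>
      b.toString(16).padStart(2, "0")
    );
    rows.push(row.join(" "));
  }
  return rows;
}

// Made-up sample: two short lines separated by a Windows-style newline.
const sample = "Hello\r\nWorld";
const rows = hexRows(sample);
console.log(rows.join("\n"));
// 48 65 6c 6c 6f 0d 0a 57 6f 72
// 6c 64
console.log("total bytes:", new TextEncoder().encode(sample).length); // 12
```

The ten visible letters plus the two newline bytes give twelve bytes in total, which is the same kind of row-by-row count used on the hex editor screenshot.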
- MS Word: 200 <- LIE
- Web Browser: 203 <- LIE
- Text Editor: 206 <- TRUTH
In the scenario above, MS Word and the web browser are counting newline characters differently.
When you hit Enter in MS Word it adds two characters to create the newline. These are shown in the hex editor as 0D 0A. In English, 0D is called carriage return and 0A is called line feed.
When counting characters, MS Word completely ignores these character sequences, while the web browser counts each sequence as if it were a single character. As we have proven above, both are wrong.
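The three counting conventions can be reproduced with a few lines of JavaScript. This is a minimal sketch under stated assumptions: the sample string is made up (it is not the original Word text), the variable names are illustrative, and the text is plain ASCII so that `length` (which counts UTF-16 code units) matches the number of characters.

```javascript
// Made-up sample with three Windows-style newlines (CR LF); not the original Word text.
const text = "one\r\ntwo\r\nthree\r\nfour";

// Every character counts, including both halves of each CR LF pair
// (the convention the text editor followed).
const rawCount = text.length;

// Each CR LF pair counts as a single character
// (the convention the web browser followed).
const pairAsOneCount = text.replace(/\r\n/g, "\n").length;

// Newline sequences are ignored entirely
// (the convention MS Word followed).
const ignoreNewlinesCount = text.replace(/\r\n/g, "").length;

console.log({ rawCount, pairAsOneCount, ignoreNewlinesCount });
// { rawCount: 21, pairAsOneCount: 18, ignoreNewlinesCount: 15 }
```

With three newlines in the sample, the counts are 15, 18 and 21. They differ by 3 and by 6, the same gaps that separate 200, 203 and 206 in the scenario above.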