Counting Characters

marksreeves edited this page Nov 15, 2016 · 22 revisions

Counting Characters - It's harder than you think! This page discusses issues when counting the number of characters in user input. It is most relevant to WTextArea.

TextArea and MaxLength

Modern browsers allow developers to restrict the maximum number of characters the user may enter into a [textarea](https://html.spec.whatwg.org/multipage/forms.html#attr-textarea-maxlength) by setting the maxlength attribute.

For example, here is a simple HTML snippet which creates a form with a single textarea that allows up to 200 characters:

<form action="" method="post">
	<textarea maxlength="200"></textarea>
	<input type="submit"/>
</form>

The image below shows how this simple page looks when loaded in a web browser.

HTML textarea and submit button

Client Side Issues

A common scenario is for a user to copy text from another application into a textarea. For example, assume the user wants to copy text from Microsoft Word and paste it into the textarea example above (with a limit of 200 characters).

The count according to MS Word

The image below shows the MS Word document we are copying from.

Selected text in MS Word showing 200 characters are selected

Note that, according to MS Word, the selected text is exactly 200 characters long.

The count according to Web Browsers

When we paste the above 200 characters into our simple HTML form and press the submit button, here is what we see:

HTML validation error reporting we have entered 203 characters

According to the browser we have entered 203 characters. How can this be? MS Word wouldn't lie to us, right? The browser wouldn't lie either, right?

The count according to Text Editors

Let's ask a simple text editor how many characters there are. The image below shows the same text pasted into Atom:

Text editor reporting the sample text is actually 206 characters

What the?!? The text editor says we have 206 characters!

The true character count

In summary, we have copied text from MS Word and pasted it into a web browser and a text editor. Each of the three reports a different character count:

  • MS Word: 200
  • Web Browser: 203
  • Text Editor: 206

To find out who is telling the truth we must examine the raw bytes that make up this text using a hex editor, shown below:

Hex editor proving the sample text is 206 characters

We can clearly see twenty rows of ten characters followed by one row of six: 20 × 10 + 6 = 206 characters.

  • MS Word: 200 <- LIE
  • Web Browser: 203 <- LIE
  • Text Editor: 206 <- TRUTH
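
If you do not have a hex editor handy, a similar view can be produced with a few lines of code. The sketch below is only an illustration: the function name hex_rows and the sample text are made up for this example, and the text is assumed to be plain ASCII so that one byte equals one character.

```python
def hex_rows(data: bytes, width: int = 10) -> list[str]:
    """Format bytes as rows of two-digit hex values, `width` bytes per row."""
    return [
        " ".join(f"{b:02X}" for b in data[i:i + width])
        for i in range(0, len(data), width)
    ]


# Hypothetical sample: two short lines, each ended by CRLF (0D 0A).
sample = "Line one\r\nLine two\r\n".encode("ascii")

for row in hex_rows(sample):
    print(row)
# 4C 69 6E 65 20 6F 6E 65 0D 0A
# 4C 69 6E 65 20 74 77 6F 0D 0A

print(len(sample))  # 20
```

Each line ends in 0D 0A, so every newline costs two bytes.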

What's Going On?

In the scenario above MS Word and the Web Browser are counting newline characters differently. Only the text editor is counting them for what they are.

When you hit Enter in MS Word it adds two characters to create the newline. These are shown in the hex editor as 0D 0A. In English, 0D is called "Carriage Return" (CR) and 0A is called "Line Feed" (LF). We can abbreviate this combination as CRLF.

When counting characters MS Word is completely ignoring these character sequences while the web browser is counting them as if they were only one character. As we have proven above, both are wrong.
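
The three counting behaviours described above can be imitated in a few lines. This is a minimal sketch, not the actual logic of MS Word or any browser: the function names are invented for illustration, and the sample is synthetic (200 visible characters split across four lines by three CRLF pairs) so that it reproduces the numbers seen earlier.

```python
def count_like_word(text: str) -> int:
    """Ignore CR and LF entirely, as MS Word appears to do."""
    return sum(1 for ch in text if ch not in "\r\n")


def count_like_browser(text: str) -> int:
    """Count each CRLF pair as if it were a single character."""
    return len(text.replace("\r\n", "\n"))


def count_actual(text: str) -> int:
    """Count every character, including both CR and LF."""
    return len(text)


# Synthetic sample: 4 lines of 50 characters joined by 3 CRLF pairs.
sample = "\r\n".join(["x" * 50] * 4)

print(count_like_word(sample))     # 200
print(count_like_browser(sample))  # 203
print(count_actual(sample))        # 206
```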

What about Unix Platforms?

Unix, Linux and Mac OS X do not use CRLF to represent a newline but instead simply use LF (long ago, classic Mac OS used CR). Does this solve the problem? Unfortunately NO!

Even if the user input contains LF for newlines, the server will receive CRLF combinations. This is explained in the HTML5 Spec:

Finally, there is the value, as used in form submission and other processing models in this specification. It is normalised so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs ...
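
The effect of that normalisation can be sketched as follows. This is only an approximation of the behaviour quoted above, not code taken from any browser; the function name normalize_for_submission is hypothetical.

```python
import re


def normalize_for_submission(value: str) -> str:
    """Rewrite every line break (CRLF, lone CR or lone LF) as CRLF."""
    # The alternation tries CRLF first so an existing pair is not doubled.
    return re.sub(r"\r\n|\r|\n", "\r\n", value)


unix_input = "one\ntwo\nthree"  # LF only, as typed on Unix-like platforms
submitted = normalize_for_submission(unix_input)

print(len(unix_input))  # 13
print(len(submitted))   # 15
```

So even text that never contained a CR grows by one character per line break by the time it reaches the server.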

WComponents Character Count

When the user submits a page with a textarea, WComponents (server side) counts actual characters. In our example above this means we would count 206 characters.

In some cases this can result in confusion: the user enters a value that does not exceed the maximum length as reported in the browser, but it is rejected as too long by WComponents on the server. As discussed above, the browser is lying to the user and WComponents is telling the truth.
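
To make the mismatch concrete, here is a rough sketch of how the length seen by the server could be predicted from the value the browser counts. It is not WComponents code (WComponents is a Java framework); the function names and the limit of 205 are assumptions chosen purely for illustration.

```python
def server_side_length(browser_value: str) -> int:
    """Length once every line break has been expanded to CRLF."""
    # Collapse any existing CRLF or CR to LF, then charge two per break.
    collapsed = browser_value.replace("\r\n", "\n").replace("\r", "\n")
    return len(collapsed) + collapsed.count("\n")


def fits_on_server(browser_value: str, max_length: int) -> bool:
    """True if the value would pass a count of actual characters."""
    return server_side_length(browser_value) <= max_length


# Synthetic value: 200 visible characters and 3 line breaks.
value = "\n".join(["x" * 50] * 4)
limit = 205  # hypothetical maximum length

print(len(value) <= limit)           # True  (browser-style count: 203)
print(fits_on_server(value, limit))  # False (actual count: 206)
```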
