-
Notifications
You must be signed in to change notification settings - Fork 245
Custom Syntax Highlighting
If you want to do syntax highlighting for a language not already included with Scintilla you can implement your own syntax highlighting logic.
Before going further I should point out that it's not generally possible to augment an existing, built-in lexer with your own custom styling logic. They are not designed to support that. For the most part you get what you get with the built-in lexers. If you want to deviate from them you'll need to write your own lexer as described below.
The two key APIs for doing syntax highlighting in Scintilla are the StartStyling(int position)
and SetStyling(int length, int style)
methods. StartStyling
is called for the position
where you would like to start styling. Then, for each subsequent range of characters a call to SetStyling
will apply the style
for the length
of characters specified. Each call to SetStyling
advances the position so it's not necessary to call StartStyling
for each style range.
For example, the following snippet will color the first 5 characters of the document in style 1
(red) and the subsequent 2 characters in style 2
(blue):
scintilla.Lexer = Lexer.Null;
scintilla.Styles[1].ForeColor = Color.Red;
scintilla.Styles[2].ForeColor = Color.Blue;
scintilla.StartStyling(0);
scintilla.SetStyling(5, 1);
scintilla.SetStyling(2, 2);
You'll notice that we also set the current lexer to Lexer.Null
to let Scintilla know that we will be doing our own syntax highlighting and not to use a built-in lexer which would interfere with our work.
Following a process similar to that above you could begin to construct logic in your application for doing custom syntax highlighting. You'll quickly find, however, that when the text changes, there is not an automatic process to update your styles. Styled text will continue to remain styled, but new or changed text will have no style. Knowing this, Scintilla provides an easier way to track changes and know when to do styling by handling the StyleNeeded
event.
The StyleNeeded
event can be enabled by setting the current lexer to Lexer.Container
(think "containing application" instead of built-in). Once enabled, the StyleNeeded
event will be raise each time the document needs to be re-styled.
To ensure optimal performance, Scintilla suggests that you only re-style affected areas of the document. Text extending beyond the bottom of the window or text at the top of the document which is already correctly styled doesn't need to be styled again. Scintilla provides APIs for knowing where to begin styling--because of a change in the text or scrolling, and where to end styling--just out of view. This is determined by using the GetEndStyled
method in conjunction with the StyleNeededEventArgs.Position
property like this:
private void scintilla_StyleNeeded(object sender, StyleNeededEventArgs e)
{
var startPos = scintilla.GetEndStyled();
var endPos = e.Position;
// TODO style this range
}
Much better. Now instead of wondering what to style and when to do it, we just handle the StyleNeeded
event, determine the range that needs to be styled, and use that information with our custom logic. There is nothing to prevent you from re-styling text outside this range, but it will likely be unnecessary and waste cycles.
We're now in a position to write some custom lexer / syntax highlighting logic.
This is an advanced topic.
Writing a complete lexer is WAY, WAY beyond the scope of this document. There are entire volumes written on the subject and there is just no way to cover it completely in a few short paragraphs. That being said, I'll see if I can at least point you in the right direction.
Lexer logic usually consists of a big loop which iterates through each character in the document and classifies it. These classifications form the basis of syntax highlighting styles.
Often, characters are significant when they are grouped with other characters to form keywords or strings. These are called lexemes (or tokens). So in addition to looping through each character we're going to need to track some state about whether this character belongs to a group and forms a lexeme. In essence, we need to build a basic state machine.
For the purposes of this example we're going to build a VERY basic C# lexer. Our lexer will identify keywords, numbers, and strings. This should be sufficient to teach the basics without going overboard.
The skeleton of our loop looks like this:
public const int StyleString = 4;
private const int STATE_UNKNOWN = 0;
private const int STATE_STRING = 3;
public void Style(Scintilla scintilla, int startPos, int endPos)
{
var state = STATE_UNKNOWN;
// Start styling
scintilla.StartStyling(startPos);
while (startPos < endPos)
{
var c = (char)scintilla.GetCharAt(startPos);
REPROCESS:
switch (state)
{
case STATE_UNKNOWN:
if (c == '"')
{
// Start of "string"
scintilla.SetStyling(1, StyleString);
state = STATE_STRING;
}
break;
case STATE_STRING:
break;
// etc...
}
startPos++;
}
}
Our method is designed to be called from the StyleNeeded
event as previously discussed. I prefer to write my lexers so that they don't store any variables outside of my Style
method because that allows me to handle multiple StyleNeeded
events from multiple Scintilla windows using a single lexer instance.
Our constants define the possible states our lexer loop can be in. STATE_UNKOWN
means the next character could be anything--a string, a number, whatever. Once we encounter a known character, for example a double-quote (") we style it using SetStyling
and then change the loop state to STATE_STRING
so that the next iteration of the loop can begin looking for the closing double-quote (") character. Once found, the range will get styled and the state set back to STATE_UNKOWN
to begin looking for the next language element.
On each loop iteration we advance the starPos
and get the next character by calling GetCharAt
. For performance or practical reasons you may prefer to get an entire line of text a time. It's entirely up to you how you go about writing your lexer.
Now that you're an expert we'll look at the complete working sample and then break it down:
private CSharpLexer cSharpLexer = new CSharpLexer("class const int namespace partial public static string using void");
private void form_Load(object sender, EventArgs e)
{
scintilla.StyleResetDefault();
scintilla.Styles[Style.Default].Font = "Consolas";
scintilla.Styles[Style.Default].Size = 10;
scintilla.StyleClearAll();
scintilla.Styles[CSharpLexer.StyleDefault].ForeColor = Color.Black;
scintilla.Styles[CSharpLexer.StyleKeyword].ForeColor = Color.Blue;
scintilla.Styles[CSharpLexer.StyleIdentifier].ForeColor = Color.Teal;
scintilla.Styles[CSharpLexer.StyleNumber].ForeColor = Color.Purple;
scintilla.Styles[CSharpLexer.StyleString].ForeColor = Color.Red;
scintilla.Lexer = Lexer.Container;
}
private void scintilla_StyleNeeded(object sender, StyleNeededEventArgs e)
{
var startPos = scintilla.GetEndStyled();
var endPos = e.Position;
cSharpLexer.Style(scintilla, startPos, endPos);
}
public class CSharpLexer
{
public const int StyleDefault = 0;
public const int StyleKeyword = 1;
public const int StyleIdentifier = 2;
public const int StyleNumber = 3;
public const int StyleString = 4;
private const int STATE_UNKNOWN = 0;
private const int STATE_IDENTIFIER = 1;
private const int STATE_NUMBER = 2;
private const int STATE_STRING = 3;
private HashSet<string> keywords;
public void Style(Scintilla scintilla, int startPos, int endPos)
{
// Back up to the line start
var line = scintilla.LineFromPosition(startPos);
startPos = scintilla.Lines[line].Position;
var length = 0;
var state = STATE_UNKNOWN;
// Start styling
scintilla.StartStyling(startPos);
while (startPos < endPos)
{
var c = (char)scintilla.GetCharAt(startPos);
REPROCESS:
switch (state)
{
case STATE_UNKNOWN:
if (c == '"')
{
// Start of "string"
scintilla.SetStyling(1, StyleString);
state = STATE_STRING;
}
else if (Char.IsDigit(c))
{
state = STATE_NUMBER;
goto REPROCESS;
}
else if (Char.IsLetter(c))
{
state = STATE_IDENTIFIER;
goto REPROCESS;
}
else
{
// Everything else
scintilla.SetStyling(1, StyleDefault);
}
break;
case STATE_STRING:
if (c == '"')
{
length++;
scintilla.SetStyling(length, StyleString);
length = 0;
state = STATE_UNKNOWN;
}
else
{
length++;
}
break;
case STATE_NUMBER:
if (Char.IsDigit(c) || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F') || c == 'x')
{
length++;
}
else
{
scintilla.SetStyling(length, StyleNumber);
length = 0;
state = STATE_UNKNOWN;
goto REPROCESS;
}
break;
case STATE_IDENTIFIER:
if (Char.IsLetterOrDigit(c))
{
length++;
}
else
{
var style = StyleIdentifier;
var identifier = scintilla.GetTextRange(startPos - length, length);
if (keywords.Contains(identifier))
style = StyleKeyword;
scintilla.SetStyling(length, style);
length = 0;
state = STATE_UNKNOWN;
goto REPROCESS;
}
break;
}
startPos++;
}
}
public CSharpLexer(string keywords)
{
// Put keywords in a HashSet
var list = Regex.Split(keywords ?? string.Empty, @"\s+").Where(l => !string.IsNullOrEmpty(l));
this.keywords = new HashSet<string>(list);
}
}
If we did everything correct you should get rudimentary syntax highlighting of C# strings, numbers, and keywords.
The first code block is typical boilerplate for configuring styles. We new-up a CSharpLexer
instance with some keywords and pass control to its Style
method in the StyleNeeded
event. As a matter of practice it's usually a good idea to allow keywords to be configured separately from your lexer logic so you can easily add keywords when the language evolves without rewriting your lexer logic. This is how lexers built-in to Scintilla work.
Once the CSharpLexer.Style
method has control we execute our loop. Scintilla expects us to always style entire lines at a time so we first make sure our startPos
is at the beginning of a line.
We iterate through each character and start classifying; double-quotes are strings, digits are numbers, letters are keywords or identifiers, and set the state and styling appropriately.
To track a range of characters we increment the length
variable. This is best illustrated by how we capture keywords in the STATE_IDENTIFIER
state. When we've detected the end of an identifier or keyword sequence we use the length
we've been tracking to get the entire word and check it against our list of possible keywords. The length
then gets reset along with the state
and the whole sequence starts over.
You may notice the REPROCESS
label and goto
(yes, goto) statements. This is a little trick I picked-up from studying the http-parser used in the Node.js project. By jumping back to the top of the switch statement we can reprocess a character without advancing to the next character. This allows us to avoid having to 'peek' at the next character in the document. For example, if we wanted to know where a number sequence ends we could either peek ahead of the current character to see if the next character is non-digit (i.e. whitespace) or we could just wait until we've actually hit the non-digit character, complete the number sequence we're tracking, and then reprocess the current unknown character all within a single loop iteration.