Problems with regions #93
Comments
@lgolding status on this? |
@michaelcfanning I have not worked on this issue at all. |
@kupsch @michaelcfanning These comments are incomplete. Pasting in what I have so far... Thanks for looking at this so carefully! Here's some feedback to what you wrote:
|
Phone call with @kupsch and @michaelcfanning on 5/11: we settled on this proposal:
|
I broke the LS discussion into a separate CSD.1 issue (which we might punt): #169, "Decide how to handle uncommon line break characters". |
The default value for endLine is not fully specified. If endLine is absent and startLine and endColumn are present, then the default endLine value is startLine. The text does not say what it should be if endColumn is absent. It is implied later that it should be startLine. I think that it should be (startLine + 1). This makes it the whole line (or to the end of the line if startColumn is present) and is what most people would expect. The text should also mention that if both startLine and charLength are present, then any derived endLine and endColumn position must match the position computed from charLength. To make the case of both endLine and endColumn being absent meaningful, the default endColumn value should be (startColumn + charLength), as it is not possible to fulfill the invariant any other way. As an alternative, the default values for endLine and endColumn could be undefined if charLength is present. We should discuss the use of half-open ranges (inclusive of the start, exclusive of the end position), probably on Wednesday. Although the half-open representation of the region has some nice properties, there is an impedance mismatch between this and how tools represent ranges, which is as closed regions. I have not seen any tools use the half-open representation (all used closed ranges). This may be misinterpreted by someone looking at SARIF. |
@kupsch You are right that we need to specify We went with half-open so we could use BTW, these changes in semantics are going to make writing the v1 => v2 converter very delicate. |
FYI @michaelcfanning @kupsch @lukecartey Here is the outcome of our discussion on Friday 5/18. I include a couple of points that we didn't state explicitly, but that I think are obvious (like #3).
|
line definition
The definition of line (sec 1.2 of wd2) should state the line is 1-based to be consistent with the definition of column.
regions
The description of a text region (sec 3.20.2 in wd02) could be improved: concepts such as how to map line and column numbers to positions are not described; matters are complicated by overloading properties that are based on other properties (length); the defaults do not match what I expect (specifying just startLine should mean the whole line, not an empty region at column 0); and not all regions are representable (such as a single character).
I think the text below fixes these problems, unifies text and binary regions, and allows the start and end locations to be described by multiple means simultaneously:
A file is a sequence of bytes. Locations in a file can be specified by a byte offset that is 0-based. A text file is a file that encodes a sequence of characters. Depending on the encoding, each character may be encoded by a fixed or variable number of bytes. Locations in a text file are 1-based and can logically be described by a character offset, or by a line and column number.
To locate line and column positions, the text SHALL first be decoded using the file's encoding, so that multibyte encoded sequences are decoded to character code points, and these values are then used for all further computations. Any initial metacharacters (or bytes), such as Unicode BOM sequences, SHALL be discarded from further line and column computation. Further alterations, such as normalization, SHALL NOT be made to the character sequence; each character (code point) is a separate character for determining column and line positions. This has the property that if there is a lossless conversion from one encoding to another, the line and column positions do not change (unless normalization occurs during the conversion). Note: due to combining characters and other formatting characters, the number of glyphs displayed may not match the number of character code points.
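As a rough illustration of the decoding step, here is a minimal Python sketch. The function name `decode_for_positions` is hypothetical, and the sketch assumes the only encoding metadata to discard is a leading U+FEFF left over after decoding. No normalization is applied, so a base letter followed by a combining mark still counts as two characters.

```python
def decode_for_positions(data: bytes, encoding: str = "utf-8") -> str:
    """Decode file bytes to code points for line/column computation.

    Hypothetical helper: drops one leading U+FEFF (BOM) if it survives
    decoding, and deliberately performs no Unicode normalization.
    """
    text = data.decode(encoding)
    # Python's "utf-8" codec keeps the BOM as U+FEFF, so remove it here.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```

With this sketch, a UTF-8 file that starts with a BOM and a file without one yield the same code point sequence, so their line and column positions agree.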
To calculate line and column positions for each character, the property lineEndSeqs is an array of character sequences that separate lines. A lineEndSeqs property MAY be present in the following objects (the first one found in the order given is used, otherwise the default value): File Object, Run Object, or Sarif Object. The following document gives guidance on values: http://unicode.org/standard/reports/tr13/tr13-5.html. The default lineEndSeqs SHALL be: CRLF (U+000D U+000A), CR (U+000D), LF (U+000A), and NEL (U+0085). The default lineEndSeqs does not include LS (U+2028) and PS (U+2029).
A file is a sequence of lines, each of which is a sequence of characters, separated by the longest matching line end sequence. The first character in a file has the line and column position 1. A line is a sequence of characters that is terminated by an end of line sequence (optionally for the last line), and each character is assigned a sequential column number starting at 1. The first character of a line is defined to have column position 1. Each subsequent character has a column position of 1 plus the column position of the previous character, and this proceeds until the last character of an end of line sequence or the end of the file is reached. The end of line sequence character(s) are assigned column numbers. The character after an end of line sequence (if present) begins the next line, increments the line position by 1, and resets the column position to 1.
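The numbering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `line_column_positions` is hypothetical, the input is already decoded text with any BOM removed, and `DEFAULT_LINE_END_SEQS` mirrors the proposed default lineEndSeqs.

```python
# Proposed defaults: CRLF, CR, LF, NEL (LS and PS are deliberately excluded).
DEFAULT_LINE_END_SEQS = ["\r\n", "\r", "\n", "\u0085"]


def line_column_positions(text, line_end_seqs=DEFAULT_LINE_END_SEQS):
    """Return a 1-based (line, column) pair for every code point in text."""
    # Try longer sequences first so that CRLF wins over a lone CR.
    seqs = sorted(line_end_seqs, key=len, reverse=True)
    positions = []
    line, column = 1, 1
    i = 0
    while i < len(text):
        match = next((s for s in seqs if text.startswith(s, i)), None)
        if match is None:
            positions.append((line, column))
            column += 1
            i += 1
        else:
            # Line end characters still receive column numbers.
            for _ in match:
                positions.append((line, column))
                column += 1
            i += len(match)
            # The next character starts a new line at column 1.
            line += 1
            column = 1
    return positions
```

For the text `"ab\r\ncd"`, the CR and LF get columns 3 and 4 on line 1, and `c` lands at line 2, column 1.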
The line length of a line is the maximal column position within the line that is not part of an end of line sequence. The complete line length of a line is the maximal column assigned, including any end of line sequence characters.
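To make the two lengths concrete, here is a small Python sketch. The name `line_lengths` is hypothetical, the input is assumed to be decoded text, and the line end sequences are assumed to be the proposed defaults. It also assumes that a trailing line end sequence does not start an extra empty line, which matches the "optionally for the last line" wording.

```python
DEFAULT_LINE_END_SEQS = ["\r\n", "\r", "\n", "\u0085"]


def line_lengths(text, line_end_seqs=DEFAULT_LINE_END_SEQS):
    """Return (line_length, complete_line_length) for each line in text."""
    seqs = sorted(line_end_seqs, key=len, reverse=True)  # longest match first
    result = []
    content = 0  # characters seen on the current line, excluding line ends
    i = 0
    while i < len(text):
        match = next((s for s in seqs if text.startswith(s, i)), None)
        if match is None:
            content += 1
            i += 1
        else:
            # The complete length also counts the line end characters.
            result.append((content, content + len(match)))
            content = 0
            i += len(match)
    # A final line without a terminator has equal lengths.
    if content > 0:
        result.append((content, content))
    return result
```

For `"ab\r\ncde"` this gives a line length of 2 and a complete line length of 4 for line 1, and 3 for both on line 2.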
An alternative to a line and column based location is startCharOffset (the number of characters from the beginning of the file, with the first character being in position 1 after discarding encoding metadata) or startByteOffset (the number of bytes from the beginning of the file, with the first byte being in position 0, counting all bytes of the file). A position relative to the start can be computed from the properties charLength (include this many decoded characters starting at the initial location) or byteLength (include this many bytes starting at the initial location).
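The relationship between the two offsets can be sketched for the UTF-8 case. The function name `char_offset_to_byte_offset` is hypothetical, and the sketch assumes the file is UTF-8 and that the only encoding metadata is an optional UTF-8 BOM. The character offset is 1-based and skips the BOM; the byte offset is 0-based and counts every byte, BOM included.

```python
import codecs


def char_offset_to_byte_offset(data: bytes, char_offset: int) -> int:
    """Map a 1-based character offset to a 0-based byte offset (UTF-8 only)."""
    # The BOM is skipped by character offsets but counted by byte offsets.
    bom_len = len(codecs.BOM_UTF8) if data.startswith(codecs.BOM_UTF8) else 0
    text = data[bom_len:].decode("utf-8")
    if not 1 <= char_offset <= len(text):
        raise ValueError("char_offset is outside the file")
    # Bytes used by every character that precedes the target character.
    preceding = text[: char_offset - 1]
    return bom_len + len(preceding.encode("utf-8"))
```

For a file containing a BOM followed by `héllo`, character offset 1 maps to byte offset 3, and character offset 3 maps to byte offset 6 because `é` occupies two bytes.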
A non-empty region SHALL consist of 1 or more characters or bytes. An empty text region consists of 0 characters or bytes, and is represented by a length (charLength or byteLength) of 0. The location occurs immediately before the start position.
If the startLine property is present, then the startColumn property SHALL be assigned a default value if not present. If the startLine property is present and the region is non-empty, then the endLine and endColumn properties SHALL be assigned default values if not present. The default values are as follows:
A text region is a contiguous region of a file that is specified by a start and end position in the file. The start position of the text region is specified by (startLine, startColumn), (startCharOffset), or (startByteOffset). The end position of the text region is specified by (endLine, endColumn), (charLength), or (byteLength). The start or end position MAY be specified by multiple means, and if so they SHALL all resolve to the same location in the file. If the file is a text file, the (startLine, startColumn, endLine, endColumn) properties SHOULD be specified unless the default values are correct.
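One way to check that a start position given by two means resolves to the same location is sketched below in Python. Both function names are hypothetical, the text is assumed to be decoded already, and the line end sequences are assumed to be the proposed defaults. The sketch does not validate that the column lies within the line.

```python
DEFAULT_LINE_END_SEQS = ["\r\n", "\r", "\n", "\u0085"]


def line_column_to_char_offset(text, line, column,
                               line_end_seqs=DEFAULT_LINE_END_SEQS):
    """Convert a 1-based (line, column) to a 1-based character offset."""
    seqs = sorted(line_end_seqs, key=len, reverse=True)  # longest match first
    current_line = 1
    i = 0  # 0-based index of the first character of current_line
    while current_line < line and i < len(text):
        match = next((s for s in seqs if text.startswith(s, i)), None)
        if match is None:
            i += 1
        else:
            i += len(match)
            current_line += 1
    if current_line != line:
        raise ValueError("line is past the end of the text")
    # Column 1 is the character at index i, whose 1-based offset is i + 1.
    return i + column


def start_positions_agree(text, start_line, start_column, start_char_offset):
    """True if (startLine, startColumn) and startCharOffset coincide."""
    return line_column_to_char_offset(text, start_line, start_column) == start_char_offset
```

For `"ab\r\ncd"`, line 2 column 1 corresponds to character offset 5, so a region giving both would be consistent only with startCharOffset 5.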
The startLine and endLine properties SHALL have the range of 1 to the number of lines in the file, and startLine <= endLine for non-empty text regions.
The startColumn and endColumn properties SHALL have the range of 1 to the complete line length of line startLine inclusive for non-empty text regions.
The startCharOffset property SHALL have the range of 1 to the number of characters in the file for a non-empty region, and the value 0 for an empty region.
The startByteOffset property SHALL have the range of 1 to the number of bytes in the file for a non-empty region, and the value 0 for an empty region.
If the startLine and endLine property values are equal, then the additional inequality startColumn <= endColumn SHALL be true.
For an empty region, the startLine, startColumn, startCharOffset, and startByteOffset properties SHALL be allowed to have the range for a non-empty text region, with the maximal value extended by 1 to represent a position after the last element of the file or line.
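The range constraints for a non-empty, line and column based region can be sketched as a small Python validator. The function name `check_region` and its error strings are hypothetical. It uses the closed-range reading of this proposal and takes the complete line length of each line as input. One assumption to flag: it checks endColumn against the complete line length of line endLine, whereas the text above names only line startLine.

```python
def check_region(complete_line_lengths, start_line, start_column,
                 end_line, end_column):
    """Return a list of violated constraints for a non-empty text region.

    complete_line_lengths[k] is the complete line length of line k + 1.
    """
    errors = []
    line_count = len(complete_line_lengths)
    if not 1 <= start_line <= line_count:
        errors.append("startLine out of range")
    if not 1 <= end_line <= line_count:
        errors.append("endLine out of range")
    if errors:
        # Column checks need valid line numbers, so stop here.
        return errors
    if start_line > end_line:
        errors.append("startLine > endLine")
    if not 1 <= start_column <= complete_line_lengths[start_line - 1]:
        errors.append("startColumn out of range")
    # Assumption: endColumn is bounded by the line it sits on (endLine).
    if not 1 <= end_column <= complete_line_lengths[end_line - 1]:
        errors.append("endColumn out of range")
    if start_line == end_line and start_column > end_column:
        errors.append("startColumn > endColumn on a single line")
    return errors
```

An empty list means the region passes these checks; empty regions, with their ranges extended by 1, would need a separate code path.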