Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with regions #93

Closed
kupsch opened this issue Feb 13, 2018 · 9 comments
Closed

Problems with regions #93

kupsch opened this issue Feb 13, 2018 · 9 comments

Comments

@kupsch
Copy link

kupsch commented Feb 13, 2018

line definition

The definition of line (sec 1.2 of wd2) should state the line is 1-based to be consistent with the definition of column.

regions

The description of a text region's (sec 3.20.2 in wd02) could be improved as concepts such as how to map line and column number to positions are not described, complicated by overloading properties that are based on other properties (length), defaults do not match what I expect (specifying just startLine, should be the whole line, not an empty region at column 0), not all regions are representable (such as a single character).

The text below I think fixes these problems, unifies text and binary regions, and allows the start and end locations to simultaneously be described by multiple means:


A file is a sequence of bytes. Locations in a file can be specified by a byte offset that is 0 based. A text file is a file that encodes a sequence of characters. Depending on the encoding, each character may be encoded by a fixed or variable or fixed number of bytes. Locations in a text file are 1 based and can logically be described by a character offset, or a line and column number.

To locate line and column positions, the text SHALL first be decoded using the file's encoding, so that multibyte encoded sequences are decoded to characters code points, and these values are then used for all further computations. Any initial metacharacter (or bytes) SHALL be discarded from further line and column computation such as Unicode BOM sequences. Further alterations SHALL NOT be made to the character sequence such as normalization; each character (code point) is a separate character for determining column and line positions. This has the property that if there is a lossless conversion from one encoding to another, the line and column positions do not change (unless normalization occurs during the conversion). Note: due to combining characters and other formatting characters, the number of glyphs displayed may not match the number of character code points.

To calculate line and column positions for each character, the property lineEndSeqs is an array of character sequences that seperate lines. There MAY exist a lineEndSeqs property present in the following objects (the first one found in the order given is used, otherwise the default value): File Object, Run Object, or Sarif Object. The following document gives guidance on values: http://unicode.org/standard/reports/tr13/tr13-5.html. The default lineEndSeqs SHALL be: CRLF (U+000D U+000A), CR (U+000D), LF (U+000A), and NEL (U+0085). The default lineEndSeqs does not include LS (U+2028) and PS (U+2029).

A file is a sequence of lines where each is a sequence of characters seperated by the longest matching line end sequence. The first character in a file has the line and column position 1. A line is a sequence of characters that is terminated by an end of line sequence (optionally for the last line) and each character is assigned a sequential column number starting at 1. The first character of a line is defined to have column position 1. Each subsequent character has a column position of 1 plus the column position of the previous character and proceed until the last character of an end of line sequence or the end of the file is reached. The end of line sequence character(s) are assigned column numbers. The character after an end of line sequence (if present) begins the next row, increments the line position by 1, and resets the column position to 1.

The line length of a line is the maximal column position within the line that is not part of a end of line sequence. The complete line length of a line is maximal column assigned including any end of line sequence characters.

An alternative to a line and column based location is startCharOffset (number of character from the beginning of the file with the first character being in position 1 after discarding encoding metadata) or startByteOffset (number of character from the beginning of the file with the first character being in position 0 and includes all bytes of the file). A relative position to the start can be computed from the properties charLength (include this many decoded characters starting at the initial location), or byteLength (include this many bytes starting at the initial location).

A non-empty region SHALL consist of 1 or more characters or bytes. An empty text region consists of 0 characters or bytes, and is represented by a length (charLength or byteLength) of 0. The location occurs immediately before the start position.

If the startLine property is present, then the startColumn property SHALL be assigned a default values if not present. If the startLine property is present and the region is not-empty, then the endLine and endColumn properties SHALL be assigned a default values if not present. The default values are as follows:

  • startColumn: 1
  • endLine: the value of startLine
  • endColumn: the line length of the line endLine

A text region is a contiguous region of a file that is specified by a start and end position in the file. The start position of the text region is specified by (startLine, startColumn), (startCharOffset) or (startByteOffset). The end position of the text region is specified by (endLine, endColumn), (charLength), or (byteLength). The start or end position MAY be specified by multiple means, and if so they SHALL all resolve to the same location in the file. If the file is a text file the (startLine, startColumn, endLine, endColumn) SHOULD be specified unless the default values are correct.

The startLine and endLine properties SHALL have the range of 1 to the number of lines in the file, and startLine <= endLine for non-empty text regions.

The startColumn and end Column property SHALL have the range of 1 to the complete line length of line startLine inclusive for non-empty text regions.

The startCharOffset property SHALL have the range of 1 to the number of characters in the file for a non-empty region, and the value 0 for a empty region.

The startByteOffset property SHALL have the range of 1 to the number of bytes in the file for a non-empty region, and the value 0 for a empty region.

If the startLine and endLine property values are equal, then the addition inequality startColumn <= endColumn SHALL be true.

For an empty region, the range of the startLine, startColumn, startCharOffset, and startByteOffset SHALL be allowed have the range for a non-empty text region with the maximal value being extended by 1 to represent a position after the last element of the file or line.

@michaelcfanning
Copy link
Contributor

@lgolding and @kupsch to follow up offline on this and update here.

@michaelcfanning
Copy link
Contributor

@lgolding status on this?

@ghost
Copy link

ghost commented Mar 28, 2018

@michaelcfanning I have not worked on this issue at all.

@ghost
Copy link

ghost commented Apr 3, 2018

@kupsch @michaelcfanning These comments are incomplete. Pasting in what I have so far...

Thanks for looking at this so carefully! Here's some feedback to what you wrote:

  • "The definition of line (sec 1.2 of wd2) should state the line is 1-based to be consistent with the definition of column."

    The Terminology section defines column is an index ("1-based index of a character within a line"); whereas it defines line as a sequence of characters ("contiguous sequence of characters, starting either at the beginning of a file or immediately after a newline sequence...").

    I could change the term column to column number and I could add a corresponding definition of line number ("1-based index of a line within a file"). There would still be an asymmetry (there would be no definition for just plain column), but I think that's ok because I don't think "column" (as a set of characters) is a useful concept in SARIF.

  • "Not all regions are representable, such as a single character":

    That's not correct. The region { "startLine": 12, "startColumn": 13, "endColumn": 14 } represents the character at line 12 column 13.

  • "Locations in a text file are 1-based"

    I agree for line and column numbers. I don't agree that a character offset should be 1-based.

    UPDATE: See my comment below for more thoughts on this.

  • "There MAY exist a lineEndSeqs property present in the following objects (the first one found in the order given is used, otherwise the default value): File Object, Run Object, or Sarif Object"

    I don't want to put anything on the sarifLog object. It's just a version number and an array of runs. The only reason it exists at all (as opposed to having run be the top-level object) is to make it convenient to ship a collection of runs over the wire as a unit. Sometimes I wonder if it was all a dreadful mistake and run should have been at the top.

  • "A file is a sequence of lines..."

    I think the existing definition for text file is fine: "[a] file considered as a sequence of characters organized into lines and columns"

    We might start this paragraph as follows, making use of that definition of line from the Terminology section:

    Adjacent lines in a text file are delimited by the longest newline matching newline sequence."

  • "An alternative to a line and column based location is startCharOffset..."

    I actually like the idea of completely separating the properties used to describe a text region from those used to describe a binary region. We might even define textRegion and binaryRegion objects and have a region object contain one of each.

  • "The start{Char,Byte}Offset property SHALL have the range of 1 to the number of {characters,bytes}..."

    I strongly disagree with the idea of a 1-based offset. "How far are you from the start of the file?" "1 character" No -- you're right at the beginning of the file, and the offset should be 0.

    In fact, this contradicts what you wrote at the very top: "A file is a sequence of bytes. Locations in a file can be specified by a byte offset that is 0 based." -- which I agree with.

    UPDATE: I now understand that the reason you want a 1-based "character offset" is that in the same way we say "this character is line 1, column 1", you want to be able to say "this is character 1". Still, I think "offset" is a misleading name for that concept. Maybe call it characterPosition instead.

    Having said all that: do we really need "character position" and "character length"? @michaelcfanning, are there tools that designate character ranges this way? If so, do they use a 1-based "character position" or a 0-based "character offset"?

  • "... and the value 0 for a empty region."

    Again, I don't agree with this design choice. We don't need a "sentinel value" of 0 for the offset properties to denote an empty region. We already have the length property for that.

  • "The default lineEndSeqs does not include LS (U+2028) and PS (U+2029)."

    Why does it not include LS (U+2028)? According to http://unicode.org/standard/reports/tr13/tr13-5.html, that's the unambiguous line separator character. Is your concern that it is a line separator rather than a line terminator, and as such would not be considered part of a line?

@ghost ghost changed the title problems with regions Problems with regions Apr 20, 2018
@michaelcfanning michaelcfanning added the CSD.1 Will be fixed in CSD.1. label Apr 27, 2018
@ghost
Copy link

ghost commented May 11, 2018

Phone call with @kupsch and @michaelcfanning on 5/11: we settled on this proposal:

  • Replace offset and length properties with charOffset, charLength, byteOffset, and byteLength.
  • Add a statement that character-based properties are independent of the presence or absence of a BOM.
  • Change the statement that surrogate pairs consist of two characters. They are one character, but some editors get it wrong.
  • We have not decided how to handle LS (U+2028). I'll open a separate issue for that.

@ghost
Copy link

ghost commented May 11, 2018

@michaelcfanning @kupsch

I broke the LS discussion into a separate CSD.1 issue (which we might punt): #169, "Decide how to handle uncommon line break characters".

ghost pushed a commit that referenced this issue May 12, 2018
ghost pushed a commit that referenced this issue May 12, 2018
ghost pushed a commit that referenced this issue May 14, 2018
@kupsch
Copy link
Author

kupsch commented May 15, 2018

The default value for endLine is not fully specified. If endLine is absent and startLine and endColumn are present, then the default endLinevalue is startLine. The text does not say what it should be endColumn is absent. It is implied later that it should be startLine. I think that it should be (startLine + 1). This makes it the whole line (or to the end of the line if startColumn is present) and is what most people would expect.

The text should also mention that if both startLine and charLength are present, then the invariants of any derived endLine and endColumn position must match the position computer by charLength. To make the case of both endLine and endColumn being absent meaningful, the default endColumn value should be (startColumn + charLength) as it is not possible to fulfill the invariant any other way. As an alternative, the default values for endLine and endColumn could be undefined is charLength is present.

We should discuss the use of half-open (inclusive of the start, exclusive of the end position) ranges probably on Wednesday. Although the half-open representation of the region has some nice properties, there is an impedance mismatch between this and how tools representing ranges as closed regions. I have not seen any tools use the half-open representation (all used closed ranges). This may be misinterpreted by someone looking at SARIF.

@ghost
Copy link

ghost commented May 15, 2018

@kupsch You are right that we need to specify endLine when endColumn is absent. Making it startLine + 1 would make the set of lines half-open, which you objected to in the case of columns.

We went with half-open so we could use startColumn == endColumn to represent an insertion point (zero-length region). @michaelcfanning, would using endColumn == startColumn - 1 to represent an insertion point be acceptable? That would require endColumn == -1 for an insertion point at the start of the file -- the only time any of these properties could be negative.

BTW, these changes in semantics are going to make writing the v1 => v2 converter very delicate.

@ghost ghost closed this as completed May 15, 2018
@ghost ghost reopened this May 15, 2018
@ghost
Copy link

ghost commented May 21, 2018

FYI @michaelcfanning @kupsch @lukecartey

Here is the outcome of our discussion on Friday 5/18. I include a couple of points that we didn't state explicitly, but that I think are obvious (like #3).

  1. Column ranges are half-open: they exclude the column specified by endColumn.

  2. If endLine is missing, it defaults to startLine.

  3. There is no default value for startLine.

    EXAMPLE: The region startLine = 12, startColumn = 3, endColumn = 6 consists of the three characters in columns 3 through 5 of line 12.

  4. If startColumn is missing, it defaults to 1.

    EXAMPLE: The region startLine = 12, endColumn = 6 consists of the five characters in columns 1 through 5 of line 12. The implied value of startColumn is 1.

  5. If endColumn is missing, it defaults to one more than the number of characters on the last line of the region (excluding the newline sequence).

    EXAMPLE: Suppose line 12 consists of the characters "abcd\r\n" (the four characters "abcd" followed by the two-character newline sequence "\r\n"). Then the region startLine = 12, startColumn = 2 consists of the three characters "bcd". The implied value of endColumn is 5.

  6. As a result of #4 and #5, a region specified solely by startLine consists of the entire line, excluding the newline sequence.

    EXAMPLE: Suppose line 12 consists of the characters "abcd\r\n". Then the region startLine = 12 consists of the four characters "abcd". The implied value of startColumn is 1 and the implied value of endColumn is 5.

  7. To specify an insertion point, explicitly specify both startColumn and endColumn, on the same line, with equal values.

    EXAMPLE: The region startLine = 12, startColumn = 5, endColumn = 5 specifies an insertion point before column 5 on line 12.

    EXAMPLE: The region startLine = 12, startColumn = 1, endColumn = 1 specifies an insertion point at the beginning of line 12.

    NOTE: Omitting endColumn does not specify an insertion point (endColumn does not default to startColumn, contrary to the current spec).

  8. Line ranges are closed: they include the line specified by endLine.

  9. If a region contains more than one line, it includes the newline sequence for all but the last line.

    EXAMPLE: The region startLine = 12, startColumn = 3, endLine = 14 includes columns 3 through the end of line 12 (including the newline sequence), all of line 13 (including the newline sequence), and all of line 14 except for the newline sequence.

    EXAMPLE: The region startLine = 12, endLine = 14 includes all of lines 12 through 14 except for the newline sequence on line 14.

    EXAMPLE: The region startLine = 12, endLine = 13, endColumn = 1 consists of all of line 12 (including the newline sequence). The "cursor position" is before column 1 on line 13.

ghost pushed a commit that referenced this issue May 23, 2018
ghost pushed a commit that referenced this issue May 24, 2018
ghost pushed a commit that referenced this issue May 25, 2018
ghost pushed a commit that referenced this issue May 31, 2018
@ghost ghost closed this as completed May 31, 2018
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants