Decide how to handle uncommon line break characters #169
Comments
We already partially ducked the issue (in the sense that we avoided telling consumers which line-breaking convention a file uses) by adding this text to §3.2.2, "fileContent.text property":
We could complete the ducking process by adding this text to §3.22.2, Text regions:
Who would consume the proposed property |
I am thinking we are getting a bit out of our charter here. Why are we specifying the producer's behavior with regard to text files? The run.defaultNewlineSequences property is a mechanism that provides a clue to how the producer behaved. Most editors don't provide configuration around this. But the property could be used by a post-processor to recompute line breaks (when it knows the specific editor the file is headed to). |
@michaelcfanning No, I'm not out of charter. The behavior I specified above simply required both the SARIF producer and the SARIF consumers to be aware of the file's line break conventions:
I'm sure we both agree that the producer needs to know what line number it's on. The only real question is how the consumer figures it out. My proposal above just ducked the issue by telling the consumer that it needed to "conform to" the file's line-breaking convention. Your proposal (as I now understand it) is for the producer to give the consumer a hint by populating Note that you're asking more of the producer than I was. You're asking the producer to:
I was just asking for #1. |
@michaelcfanning In addition to |
@michaelcfanning That would be consistent with the pair of properties |
To be clear, my previous two comments are moot if we decide against introducing |
I would add the defaultNewLineSequence and newLineSequence properties (or both could be called newLineSequence). The fundamental problem here is that a viewer (program or human) needs to know the assessment tool's interpretation of line endings, so that it can compute line numbers in the same way as the tool. This is the same reason that a viewer needs to know the tool's interpretation of the file's encoding. These don't have to be "correct," but they do need to be consistent between the producer and the viewer. The alternatives are heuristics or a backchannel (a tightly coupled tool and viewer), so I think it would be useful to have the producer clearly disclose its interpretation. In issue #93, I proposed adding a lineEndSeqs property on the file, run and sarif objects, with the default being CRLF, LF, CR, and NEL. To remove ambiguity, I would propose that there at least be a new property on the run object that is an array of possible line ending sequences, and that the default be [CRLF, LF]. If the property is not specified, this default should produce reasonable results for most uses, provided the tool follows the standard Linux/UNIX or Windows conventions (the only anomaly would be a Windows file with lone LFs, which would be unusual). If a producer considers other characters such as NEL, CR, VT, FF, LS, or PS to end a line, then it SHALL specify the property. Producers SHOULD populate the property if they compute line endings as only CRLF or only LF. Having a property on the file object would generalize this and allow different files to have different line ending interpretations. It would be unusual for a file to be processed multiple times with different sequences, so there is no need to support that, but it could happen, because this isn't really a property of the file but of the run of the tool and of a produced result.
The spec should specify that, to determine whether the character(s) at an offset form a line ending sequence, the processor should proceed as if it tested each element of the array in order for a match and stopped when a match is found; and that the search for the next newline sequence proceeds from the character following the match. As an example, the text "one\r\ntwo" contains one newline sequence (CRLF) with the array [CRLF, CR, LF], and two (CR and LF) with the array [CR, LF, CRLF]. |
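The ordered matching rule described in the comment above can be sketched in a few lines of Python. This is only an illustration of that rule, not spec text; the function name `find_newline_sequences` is hypothetical and does not come from SARIF or any library.

```python
def find_newline_sequences(text, sequences):
    """Return (offset, sequence) pairs for each newline sequence in text.

    At each offset the candidates are tried in array order and the first
    match wins; scanning then resumes at the character after the match.
    """
    matches = []
    i = 0
    while i < len(text):
        for seq in sequences:
            # Skip empty strings so the scan always makes progress.
            if seq and text.startswith(seq, i):
                matches.append((i, seq))
                i += len(seq)
                break
        else:
            # No candidate matched at this offset; move on one character.
            i += 1
    return matches


CRLF, CR, LF = "\r\n", "\r", "\n"
text = "one\r\ntwo"
print(find_newline_sequences(text, [CRLF, CR, LF]))  # [(3, '\r\n')]
print(find_newline_sequences(text, [CR, LF, CRLF]))  # [(3, '\r'), (4, '\n')]
```

The two calls reproduce the example from the comment: the same text yields one line break or two depending only on the order of the array, which is why the order would have to be significant.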
A consumer that performs fixes would use this property. |
This property is approved at the run level. The default is CRLF, LF. |
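A consumer could apply the stated default roughly as follows. This is a minimal sketch: it assumes the run object has been parsed into a Python dict, and it uses `defaultNewlineSequences`, the name proposed in this issue, as a placeholder, since the final property name is not given here. The helper `newline_sequences_for_run` is hypothetical.

```python
# CRLF, LF: the default stated in this thread.
DEFAULT_NEWLINE_SEQUENCES = ["\r\n", "\n"]


def newline_sequences_for_run(run, property_name="defaultNewlineSequences"):
    """Return the producer's declared newline sequences, or the default.

    `property_name` is an assumption: it is the name proposed in this
    issue, not necessarily the name used in the spec.
    """
    value = run.get(property_name)
    if value is None:
        # Property absent: fall back to [CRLF, LF].
        return list(DEFAULT_NEWLINE_SEQUENCES)
    return list(value)


print(newline_sequences_for_run({}))
print(newline_sequences_for_run({"defaultNewlineSequences": ["\r\n", "\n", "\u2028"]}))
```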
Confirmed with @michaelcfanning. The run-level property is named |
We put the property on |
Unicode defines line-breaking characters that are not commonly used, such as NEL (U+0085), LS (U+2028), and PS (U+2029). This can cause problems if a SARIF producer recognizes these characters as line breaks (and so counts them in region.startLine and region.endLine), but the SARIF consumer does not (and so highlights the region incorrectly). Or vice versa: a producer might not recognize them but an editor might, and again, the editor will highlight the wrong region.
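To make the mismatch concrete, here is a small Python sketch in which a hypothetical producer treats NEL, LS, and PS as line breaks while a hypothetical consumer recognizes only CRLF and LF. The helper `line_number` and both sequence lists are illustrative assumptions, not the behavior of any particular tool.

```python
def line_number(text, offset, sequences):
    """1-based line number of `offset`, counting only the given sequences."""
    line, i = 1, 0
    while i < offset:
        for seq in sequences:
            if text.startswith(seq, i):
                line += 1
                i += len(seq)
                break
        else:
            i += 1
    return line


# "first", then LS (U+2028), then "second", then LF, then "third".
text = "first\u2028second\nthird"
offset = text.index("third")

producer_view = line_number(text, offset, ["\r\n", "\n", "\u0085", "\u2028", "\u2029"])
consumer_view = line_number(text, offset, ["\r\n", "\n"])
print(producer_view, consumer_view)  # 3 2
```

The producer would report the word "third" on line 3, while the consumer would look for it on line 2 and highlight the wrong text.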
Is there anything SARIF can do to help? Like defining a property
run.defaultNewlineSequences:string[]
that says what the producer considers to be a newline sequence? @michaelcfanning @kupsch