fix(#13734): Properly escape special characters in CSV output #13735
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master   #13735      +/-   ##
==========================================
- Coverage   77.43%   70.53%    -6.90%
==========================================
  Files         933      850       -83
  Lines       47093    42158     -4935
  Branches     5818     4431     -1387
==========================================
- Hits        36465    29737     -6728
- Misses      10485    12421     +1936
+ Partials      143        0      -143
Flags with carried forward coverage won't be shown.
I agree with the change, please make this change here if you don't mind!
@@ -479,6 +479,7 @@ class CsvResponse(Response):  # pylint: disable=too-many-ancestors
    """

    charset = conf["CSV_EXPORT"].get("encoding", "utf-8")
    default_mimetype = "text/csv"
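For context, the encoding read on the added line comes from the CSV_EXPORT setting. A minimal sketch of what that entry might look like in a superset_config.py, assuming the usual shape (only the "encoding" key is actually read here):

```python
# Hypothetical superset_config.py excerpt. Only the "encoding" key is used by
# the charset lookup in the diff above; anything beyond that is an assumption.
CSV_EXPORT = {"encoding": "utf-8"}
```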
👍
tests/utils/csv_tests.py (Outdated)
def test_df_to_escaped_csv():
    csv_str = "col_a,=col_b\n-10,=cmd|' /C calc'!A0\na,=b"
Can we add another row to assert that both the column headers and the data are escaped?
It is currently testing both the header and body rows. Do you mean break the assert up so that it's easier to read? Happy to do that as well
That would be easier to parse visually, but not a blocker.
@robdiciuccio thanks for taking a look! I think you may have commented on an outdated version, as it looks like I made those changes already. Can you take a look when you have a chance and let me know if those concerns were addressed?
@benjreinhart a few minor questions still pending (PR seems to be current), but overall looks good. I'll create a test env for this PR.
/testenv up
@robdiciuccio Ephemeral environment spinning up at http://35.162.70.159:8080. Credentials are
result = csv.escape_value("-value")
assert result == "'-value"

result = csv.escape_value("=value")
assert result == "'=value"
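For readers skimming the thread, here is a minimal, hypothetical sketch of the leading-single-quote behaviour these assertions exercise. It is not the PR's implementation (which evolves into the regex discussed further down); the helper name simply mirrors the test above.

```python
# Illustrative only: prepend a single quote to values a spreadsheet could
# interpret as a formula. The character list follows this thread's discussion;
# the real implementation in the PR differs.
PROBLEMATIC_PREFIXES = ("=", "+", "-", "@", "|", "%")


def escape_value(value: str) -> str:
    if isinstance(value, str) and value.startswith(PROBLEMATIC_PREFIXES):
        return "'" + value
    return value


assert escape_value("-value") == "'-value"
assert escape_value("=value") == "'=value"
assert escape_value("plain") == "plain"
```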
Doing some testing in the ephemeral env, I'm curious why just the single quote at the beginning rather than wrapping the value?
The defusedcsv package that you reference above states that "Excel will not display the ' character to the user", but this is not the case in my tests, where Excel displays the following: '=10+20
TLDR: it was what I saw recommended the most when googling around.
Longer: I saw a few different recommendations but they all seem to be intrusive in one form or another. The approach I saw most often recommended was the single preceding quote. There is also a tab approach, but that looks invasive to some degree as well.
However, now that I just went back to look this up again, I found a post that shows how even this is not sufficient:
One product in my organization fixed a similar issue with 2 layers of defense. For any CSV cell value that starts with +, -, @, = (as suggested in http://georgemauer.net/2017/10/07/csv-injection.html or OWASP) the fix added (1) a preceding TAB char, (2) single quotes around the cell value. But later we found that adding a single / pair of double quote(s) before the attacker's payload can simply evade the filter to check for the chars +, -, @, =. e.g. if the payload the attacker injects is =2+5+cmd|' /C calc'!A0 the filter will catch it and mitigate the risks. But if the attacker puts the payload ""=2+5+cmd|' /C calc'!A0 the filter won't be able to catch it as it was checking for only values starting with +, -, @, =. The end result will be the same because MS Excel, while rendering the value, simply skips the leading double quotes in CSV values.
Their recommendation is:
prepend a SPACE or TAB or SINGLE QUOTE to ALL CSV values before exporting them to file. This mitigation leaves the CSV file human readable but not executable. DO NOT check for leading +, -, @, =, " and prepend to only those values.
Does that sound like a valid approach or do you have a preferred alternative?
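To make the evasion from the quoted post concrete, here is a small, hypothetical demonstration of why a naive startswith check is not enough: a payload hidden behind a pair of double quotes slips past it, and Excel drops those quotes when rendering the cell.

```python
# Hypothetical demonstration of the double-quote evasion described above.
payloads = [
    "=2+5+cmd|' /C calc'!A0",      # caught by a naive leading-character check
    "\"\"=2+5+cmd|' /C calc'!A0",  # evades it; Excel strips the leading quotes
]
for value in payloads:
    flagged = value.startswith(("+", "-", "@", "="))
    print(f"{value!r} flagged={flagged}")
```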
Alternatives that I see are:
- Alter every value with a preceding space. Intrusive, but not as noticeable as a single quote and, according to that post, it solves the issue in all cases
- Try to be more clever with a regex to account for the quote issue he mentions and hope this captures everything
- Keep the current solution (maybe changed to a preceding space) and ignore the quote issue. Less invasive, but still problematic.
We can also wrap the whole value in single quotes, but then the UX is worse in some spreadsheet software that hides the preceding quote from the user.
Mitigating this is going to be intrusive no matter which solution we choose. I'd recommend 1 above if we want to be covered in all cases, but I'm also happy to go with 2 if we feel confident that the quote issue is the only other case to account for.
Thoughts?
I think option 1 is too intrusive. Option 2 or shipping as-is are viable options. Curious what others think here.
Looks like 2 is doable. I agree with @robdiciuccio that 1 is intrusive.
I updated the implementation to be a bit more clever after some more testing in google sheets and excel. Best I can tell, this gets the job done without being overly intrusive.
Some interesting things I found while testing:
- Many resources say that a leading space will work the same as a leading single quote, but I found that Google Sheets will ignore a leading space and evaluate the command
- Google Sheets seems to hide the leading single quote, while Excel did not
- Pandas seems to remove the leading double quotes
One Q, otherwise LGTM
Compare 2d64572 to 62dd99f
This looks great, Ben! That linked article is super interesting.
# This regex will match if the string starts with:
#
# 1. one of -, @, +, |, =, %
# 2. two double quotes immediately followed by one of -, @, +, |, =, %
# 3. one or more spaces immediately followed by one of -, @, +, |, =, %
#
problematic_chars_re = re.compile(r'^(?:"{2}|\s{1,})(?=[\-@+|=%])|^[\-@+|=%]')
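As a quick sanity check (not part of the PR), the pattern behaves as the comment describes on the payloads discussed in this thread; the surrounding escape helper is assumed to prepend a single quote whenever it matches.

```python
import re

problematic_chars_re = re.compile(r'^(?:"{2}|\s{1,})(?=[\-@+|=%])|^[\-@+|=%]')

samples = [
    "=cmd|' /C calc'!A0",   # case 1: starts with "="           -> matches
    '""=2+5+cmd',           # case 2: double-quote evasion       -> matches
    "  -10",                # case 3: leading spaces before "-"  -> matches
    "plain text",           # no leading trigger                 -> no match
]
for value in samples:
    print(repr(value), bool(problematic_chars_re.match(value)))
```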
Just an FYI, in Python you can write verbose regular expressions with inline comments: https://docs.python.org/3/library/re.html#re.VERBOSE. You can also use + instead of {1,}, but no need to change this.
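For illustration, here is the same pattern rewritten with re.VERBOSE and inline comments as suggested above; it is behaviourally equivalent, not the project's code.

```python
import re

# Equivalent to the pattern in the diff above, just spelled out with
# re.VERBOSE so each branch carries its own comment.
problematic_chars_re = re.compile(
    r"""
    ^(?:"{2}|\s+)     # two double quotes, or one or more whitespace characters,
    (?=[\-@+|=%])     # immediately followed by one of -, @, +, |, =, %
    |                 # or
    ^[\-@+|=%]        # the string starts directly with one of those characters
    """,
    re.VERBOSE,
)
```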
Ephemeral environment shutdown and build artifacts deleted.
…pache#13735)

* fix: Escape csv content during downloads
* Reuse CsvResponse object
* Use correct mimetype for csv responses
* Ensure that headers are also escaped
* Update escaping logic
* master: (56 commits)
  test: Adds tests and storybook to CertifiedIcon component (#13457)
  chore: Moves CheckboxIcons to Checkbox folder (#13459)
  chore: Removes Popover duplication (#13462)
  build(deps): bump elliptic from 6.5.3 to 6.5.4 in /docs (#13527)
  fix: allow spaces in DB names (#13800)
  chore: Update PR template for SIP-59 DB migrations process (#13855)
  Add CODEOWNERS (#13759)
  feat(alerts & reports): Easier to read execution logs (#13752)
  fix: Disallows negative options remaining (#13749)
  Fix broken link (#13861)
  fix(native-filters): add global async query support to native filters (#13837)
  Displays row limit warning with Alert component (#13854)
  fix(errors): Downgrade error on stop query to a warning (#13826)
  fix(alerts and reports): Unify timestamp format on execution log view (#13718)
  fix(sqllab): warning message when rows limited (#13841)
  chore: add success log whenever a connection is working (#13811)
  fix(native-filters): improve loading styles for filter component (#13794)
  chore: update change log with cherry-picks for release 1.1 (#13824)
  feat: added support to configure the default explorer viz (#13610)
  fix(#13734): Properly escape special characters in CSV output (#13735)
  ...
This fixes #13734 by escaping CSV values when downloading data as a CSV.
SUMMARY
Escape leading special characters in CSV output.
TEST PLAN
Some unit tests and manual QA
ADDITIONAL INFORMATION