Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Adil-Iqbal · 2022-08-07T14:08:38Z

Summary

We'd like to add a feature to CsvGenerator that will enable sanitizing string values against CSV Injection.

CsvGenerator.Feature.SANITIZE_STRING_VALUES

The current method of protecting against this vulnerability is not reliable. Specifically, we request that for all string values, all matches that match the regular expression ^[-\=\+@]+ should be replaced by an empty string before the string value is written.

Background

https://www.veracode.com/blog/secure-development/data-extraction-command-execution-csv-injection

Rationale

The existing feature that is most-often used to protect against CSV injection is CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS. There are two problems with this approach.

It unnecessarily augments all string values, regardless of whether those values are malicious.
End users may still remove double quotes programmatically in their spreadsheet application, not realizing that they are exposing themselves to a critical security issue.

Many developers are attempting to address the issue by using reflection to manually "sanitize" the declared string fields on their POJOs prior to serializing them as a CSV row. The risk is that they are doing it improperly, allowing exposure to still exist in some form. For example, one user on stack overflow decided to prepend malicious string with an apostrophe ('), instead of just removing the offending characters entirely.

Clearly, a more targeted feature is needed to protect the end user.

Mitigating the Risk

The risk is mitigated by ensuring that the following characters do not appear at the start of the string:

equals sign (=)
plus sign (+)
minus sign (-)
"At" symbol (@)

Acceptance criteria

Strings should be transformed:

After the string values are trimmed.
Before ALWAYS_QUOTE_STRINGS is applied.
Before string values are written.

The following transformations can be used as a guide for feature.

=foo --> foo
+foo --> foo
-foo --> foo
@foo --> foo
==foo --> foo
++foo --> foo
--foo --> foo
@@foo --> foo
@+foo --> foo
=-foo --> foo
foo --> foo
foo= --> foo=
fo=o --> fo=o
+-=@fo=o= --> fo=o=

The text was updated successfully, but these errors were encountered:

cowtowncoder · 2022-08-07T15:21:11Z

First of all, thank you for reporting the issue. I will have to read this with thought.

But my first thought is that this seems related to some specific use cases; and that the suggestion cannot be a general rule for all usage. It sounds like this would specifically be against injection when loading documents in Excel -- a common use case to be sure, but not at all the only use case. And as such it cannot dictate general processing: it would not make any sense to -- for example -- remove leading minus sign from negative numbers.
Similarly there are performance concerns in various content cleaning aspects to consider (for cases where replacement/removal is not needed).

It would likely make sense to add new CsvWriteFeature to cover use case(s) (SANITIZE_FOR_EXCEL?).
There are probably also ways to add quoting/escaping, although that is another orthogonal feature.

Adil-Iqbal · 2022-08-08T04:18:46Z

You're right. It's certainly application specific. As is the vulnerability. Though, I do feel that "export to csv" like functionality is within the domain of this project.

That said, it seems I didn't consider all the constraints. The negative number like problem seems hard to solve without inspecting the entire string.

On the performance front, perhaps we can qualify which strings we apply the sanitize operation to by checking if the first character is in the set of offending characters. If not, we can omit the sanitize operation.

As for the operation itself, I'll try and think about edge cases and ways to save on time.

On the naming front, SANITIZE_FOR_EXCEL is better than what I originally suggested, but may be too narrow. Formulae are used by every major spreadsheet app, including Google Sheets, OfficeLibre, Zoho Sheets, Apple Numbers on MacOS, etc. All of them use the four characters mentioned above. A better name for the feature might be SANITIZE_FOR_CONSUMER.

cowtowncoder · 2022-08-08T18:21:42Z

Yes, I agree that improvement(s) to allow safe(r) operation are important and useful, I hope it did not sound I suggest otherwise.

While I agree that SANITIZE_FOR_EXCEL or similar is sort of too specific, I think there is sometimes practical consideration of what a useful name is -- I am definitely open for alternatives. But one problem there may be is if different tools have different requirements (which is probably true): is it possible to find one common setting, or is there need for separate ones.
But I could go with alternative names; perhaps something more along lines of "SANITIZE_FORMULAS" or "SANITIZE_FORMULA_CHARACTERS".

Now, one practical question: it seemed to me that:

For characters mentioned, the check only needs to be applied for the first character of a cell value (or possible first non-whitespace character)
Problem only occurs if cell value is NOT quoted

is this true? (2) is esp. relevant since ideally I'd prefer not mutating actual values -- but obviously if it won't make any difference then... something needs to be done.

And come to think of that, it may even be necessary to offer some sort of handlers for deciding how to replace possibly problematic characteres (drop/change).
This is not an easy problem to solve cleanly it seems. :)

pjfanning · 2022-08-29T23:56:43Z

I can't see the security issue here - the security risk is on the consumer side (the Excel user) as opposed to the producer side. The Excel user shouldn't be loading untrusted files.

I'm not aware of any CSV libs that have a feature like this. If you can provide some links to equivalent support in other CSV tools, I'd be interested in seeing them.

If you want to export data using so that Excel users can load it up - why not produce an xlsx file (using Apache POI or something like that)? Then you get to control whether a cell is a text cell or a formula cell.

cowtowncoder added the csv label Aug 7, 2022

cowtowncoder changed the title ~~Critical Security Issue: Add feature to sanitize strings against CSV injection (aka Formula Injection).~~ Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). Aug 30, 2022

oliviertassinari mentioned this issue Jan 16, 2024

[data grid] Sanitize cells with formulas for CSV export used in Excel mui/mui-x#11702

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Adil-Iqbal commented Aug 7, 2022 •

edited

Loading

cowtowncoder commented Aug 7, 2022

Adil-Iqbal commented Aug 8, 2022

cowtowncoder commented Aug 8, 2022

pjfanning commented Aug 29, 2022 •

edited

Loading

Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Comments

Adil-Iqbal commented Aug 7, 2022 • edited Loading

Summary

Background

Rationale

Mitigating the Risk

Acceptance criteria

cowtowncoder commented Aug 7, 2022

Adil-Iqbal commented Aug 8, 2022

cowtowncoder commented Aug 8, 2022

pjfanning commented Aug 29, 2022 • edited Loading

Adil-Iqbal commented Aug 7, 2022 •

edited

Loading

pjfanning commented Aug 29, 2022 •

edited

Loading