Support swapping to disk for large nested archives (#98)
glynternet authored Feb 23, 2022
1 parent 24f2aa0 commit 7f12500
Showing 14 changed files with 610 additions and 201 deletions.
78 changes: 45 additions & 33 deletions README.md
@@ -249,39 +249,44 @@ Usage:
log4j-sniffer crawl <root> [flags]
Flags:
--archive-open-mode string Supported values:
standard - standard file opening will be used. This may cause the filesystem cache to be populated with reads from the archive opens.
directio - direct I/O will be used when opening archives that require sequential reading of their content without being able to skip to file tables at known locations within the file.
For example, "directio" can have an effect on the way that tar-based archives are read but will have no effect on zip-based archives.
Using "directio" will cause the filesystem cache to be skipped where possible. "directio" is not supported on tmpfs filesystems and will cause tmpfs archive files to report an error. (default "standard")
--archives-per-second-rate-limit int The maximum number of archives to scan per second. 0 for unlimited.
--directories-per-second-rate-limit int The maximum number of directories to crawl per second. 0 for unlimited.
--disable-cve-2021-44832-detection Disable detection of CVE-2021-44832 in versions up to 2.17.0
--disable-cve-2021-45105-detection Disable detection of CVE-2021-45105 in versions up to 2.16.0
--disable-detailed-findings Do not print out detailed finding information when not outputting in JSON.
--disable-flagging-jndi-lookup Do not report results that only match on the presence of a JndiLookup class.
Even when disabled results which match other criteria will still report the presence of JndiLookup if relevant.
--disable-unknown-versions Only output issues if the version of log4j can be determined (note that this will cause certain detection mechanisms to be skipped)
--enable-obfuscation-detection Enable applying partial bytecode matching to Jars that appear to be obfuscated. (default true)
--enable-partial-matching-on-all-classes Enable partial bytecode matching to all class files found.
--enable-trace-logging Enables trace logging whilst crawling. disable-detailed-findings must be set to false (the default value) for this flag to have an effect
--file-path-only If true, output will consist of only paths to the files in which CVEs are detected
-h, --help help for crawl
--ignore-dir strings Specify directory pattern to ignore. Use multiple times to supply multiple patterns.
Patterns should be relative to the provided root.
e.g. ignore "^/proc" to ignore "/proc" when using a crawl root of "/"
--json If true, output will be in JSON format
--maximum-average-obfuscated-class-name-length uint32 The maximum average class name length for classes within a Jar to be considered obfuscated. (default 3)
--maximum-average-obfuscated-package-name-length uint32 The maximum average package name length for packages within a Jar to be considered obfuscated. (default 3)
--nested-archive-max-depth uint The maximum depth to recurse into nested archives.
A max depth of 0 will open up an archive on the filesystem but not any nested archives.
--nested-archive-max-size uint The maximum compressed size in bytes of any nested archive that will be unarchived for inspection.
This limit is applied per depth level.
The overall limit to nested archive size unarchived should be controlled
by both the nested-archive-max-size and nested-archive-max-depth. (default 5242880)
--per-archive-timeout duration If this duration is exceeded when inspecting an archive,
an error will be logged and the crawler will move onto the next file. (default 15m0s)
--summary If true, outputs a summary of all operations once program completes (default true)
--archive-open-mode string Supported values:
standard - standard file opening will be used. This may cause the filesystem cache to be populated with reads from the archive opens.
directio - direct I/O will be used when opening archives that require sequential reading of their content without being able to skip to file tables at known locations within the file.
For example, "directio" can have an effect on the way that tar-based archives are read but will have no effect on zip-based archives.
Using "directio" will cause the filesystem cache to be skipped where possible. "directio" is not supported on tmpfs filesystems and will cause tmpfs archive files to report an error. (default "standard")
--archives-per-second-rate-limit int The maximum number of archives to scan per second. 0 for unlimited.
--directories-per-second-rate-limit int The maximum number of directories to crawl per second. 0 for unlimited.
--disable-cve-2021-44832-detection Disable detection of CVE-2021-44832 in versions up to 2.17.0
--disable-cve-2021-45105-detection Disable detection of CVE-2021-45105 in versions up to 2.16.0
--disable-detailed-findings Do not print out detailed finding information when not outputting in JSON.
--disable-flagging-jndi-lookup Do not report results that only match on the presence of a JndiLookup class.
Even when disabled results which match other criteria will still report the presence of JndiLookup if relevant.
--disable-unknown-versions Only output issues if the version of log4j can be determined (note that this will cause certain detection mechanisms to be skipped)
--enable-obfuscation-detection Enable applying partial bytecode matching to Jars that appear to be obfuscated. (default true)
--enable-partial-matching-on-all-classes Enable partial bytecode matching to all class files found.
--enable-trace-logging Enables trace logging whilst crawling. disable-detailed-findings must be set to false (the default value) for this flag to have an effect.
--file-path-only If true, output will consist of only paths to the files in which CVEs are detected
-h, --help help for crawl
--ignore-dir strings Specify directory pattern to ignore. Use multiple times to supply multiple patterns.
Patterns should be relative to the provided root.
e.g. ignore "^/proc" to ignore "/proc" when using a crawl root of "/"
--json If true, output will be in JSON format
--maximum-average-obfuscated-class-name-length int The maximum class name length for a class to be considered obfuscated. (default 3)
--maximum-average-obfuscated-package-name-length int The maximum average package name length for a class to be considered obfuscated. (default 3)
--nested-archive-disk-swap-dir string When nested-archive-disk-swap-max-size is non-zero, this is the directory in which temporary files will be created for writing temporary large nested archives to disk. (default "/tmp")
--nested-archive-disk-swap-max-size uint The maximum size in bytes of disk space that may be used for inspecting nested archives that are over the nested-archive-max-size.
By default no disk swap is allowed, and nested archives will only be inspected if they fit into the configured nested-archive-max-size.
When an archive is encountered that is over the nested-archive-max-size, the archive may be written out to a temporary file so that it can be inspected without a large memory penalty.
If large archives are nested within each other, an archive will be opened only if the accumulated space used for archives on disk would not exceed the configured nested-archive-disk-swap-max-size.
--nested-archive-max-depth uint The maximum depth to recurse into nested archives.
A max depth of 0 will open up an archive on the filesystem but not any nested archives.
--nested-archive-max-size uint The maximum compressed size in bytes of any nested archive that will be unarchived for inspection.
This limit is applied per depth level.
The overall limit to nested archive size unarchived should be controlled
by both the nested-archive-max-size and nested-archive-max-depth. (default 5242880)
--per-archive-timeout duration If this duration is exceeded when inspecting an archive,
an error will be logged and the crawler will move onto the next file. (default 15m0s)
--summary If true, outputs a summary of all operations once program completes (default true)
```

#### Archives
@@ -356,6 +361,8 @@ Flags:
Using "directio" will cause the filesystem cache to be skipped where possible. "directio" is not supported on tmpfs filesystems and will cause tmpfs archive files to report an error. (default "standard")
--archives-per-second-rate-limit int The maximum number of archives to scan per second. 0 for unlimited.
--directories-per-second-rate-limit int The maximum number of directories to crawl per second. 0 for unlimited.
--disable-cve-2021-44832-detection Disable detection of CVE-2021-44832 in versions up to 2.17.0
--disable-cve-2021-45105-detection Disable detection of CVE-2021-45105 in versions up to 2.16.0
--dry-run When true, a line will be output instead of deleting a file. Use --dry-run=false to enable deletion. (default true)
--enable-obfuscation-detection Enable applying partial bytecode matching to Jars that appear to be obfuscated. (default true)
--enable-partial-matching-on-all-classes Enable partial bytecode matching to all class files found.
@@ -405,6 +412,11 @@ Flags:
e.g. ignore "^/proc" to ignore "/proc" when using a crawl root of "/"
--maximum-average-obfuscated-class-name-length int The maximum class name length for a class to be considered obfuscated. (default 3)
--maximum-average-obfuscated-package-name-length int The maximum average package name length for a class to be considered obfuscated. (default 3)
--nested-archive-disk-swap-dir string When nested-archive-disk-swap-max-size is non-zero, this is the directory in which temporary files will be created for writing temporary large nested archives to disk. (default "/tmp")
--nested-archive-disk-swap-max-size uint The maximum size in bytes of disk space that may be used for inspecting nested archives that are over the nested-archive-max-size.
By default no disk swap is allowed, and nested archives will only be inspected if they fit into the configured nested-archive-max-size.
When an archive is encountered that is over the nested-archive-max-size, the archive may be written out to a temporary file so that it can be inspected without a large memory penalty.
If large archives are nested within each other, an archive will be opened only if the accumulated space used for archives on disk would not exceed the configured nested-archive-disk-swap-max-size.
--nested-archive-max-depth uint The maximum depth to recurse into nested archives.
A max depth of 0 will open up an archive on the filesystem but not any nested archives.
--nested-archive-max-size uint The maximum compressed size in bytes of any nested archive that will be unarchived for inspection.
8 changes: 8 additions & 0 deletions changelog/@unreleased/pr-98.v2.yml
@@ -0,0 +1,8 @@
type: feature
feature:
description: |-
Inspecting large nested archives without a large memory impact can now be enabled by setting `--nested-archive-disk-swap-max-size` to a positive non-zero value.
When a nested zip file is encountered that is above the `--nested-archive-max-size`, space will be used on disk to write out the archive temporarily so that it can be inspected. The location that temporary files are written to can be configured using `--nested-archive-disk-swap-dir`, which is set to `/tmp` by default.
links:
- https://github.com/palantir/log4j-sniffer/pull/98
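The behaviour described in this changelog entry can be read as a three-way decision per nested archive: inspect it in memory, swap it to disk, or skip it. The following is a minimal sketch of that decision under stated assumptions, not the code from this commit. The names `swapBudget`, `decide`, `release`, and `inspectMode` are hypothetical; `memLimit` stands in for `--nested-archive-max-size` and `maxSize` for `--nested-archive-disk-swap-max-size`.

```go
package main

import "fmt"

// inspectMode describes how a nested archive would be handled (hypothetical type).
type inspectMode int

const (
	inMemory inspectMode = iota // small enough to buffer in memory
	onDisk                      // written to a temporary file first
	skipped                     // over the memory limit and over the remaining swap budget
)

// swapBudget tracks the disk space used by nested archives currently held on disk.
type swapBudget struct {
	maxSize uint64 // stands in for --nested-archive-disk-swap-max-size
	used    uint64 // accumulated across archives nested within each other
}

// decide picks a mode for a nested archive of the given size.
// memLimit stands in for --nested-archive-max-size.
func (b *swapBudget) decide(size, memLimit uint64) inspectMode {
	if size <= memLimit {
		return inMemory
	}
	// used never exceeds maxSize, so this subtraction cannot underflow.
	if size > b.maxSize-b.used {
		return skipped
	}
	b.used += size
	return onDisk
}

// release returns space to the budget once an on-disk archive has been removed.
func (b *swapBudget) release(size uint64) {
	if size > b.used {
		size = b.used
	}
	b.used -= size
}

func main() {
	budget := &swapBudget{maxSize: 100}
	fmt.Println(budget.decide(10, 20) == inMemory) // fits in memory
	fmt.Println(budget.decide(60, 20) == onDisk)   // swapped, 40 bytes of budget remain
	fmt.Println(budget.decide(50, 20) == skipped)  // over the remaining budget
	budget.release(60)
	fmt.Println(budget.decide(50, 20) == onDisk) // fits again after release
}
```

With `maxSize` left at zero, every archive over `memLimit` is skipped, which matches the documented default of no disk swap.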
9 changes: 9 additions & 0 deletions cmd/flags.go
@@ -33,6 +33,8 @@ type crawlFlags struct {
perArchiveTimeout time.Duration
nestedArchiveMaxDepth uint
nestedArchiveMaxSize uint
nestedArchiveDiskSwapMaxSize uint
nestedArchiveDiskSwapDir string
directoriesCrawledPerSecond int
archivesCrawledPerSecond int
enableObfuscationDetection bool
@@ -58,6 +60,11 @@ an error will be logged and the crawler will move onto the next file.`)
This limit is applied per depth level.
The overall limit to nested archive size unarchived should be controlled
by both the nested-archive-max-size and nested-archive-max-depth.`)
cmd.Flags().UintVar(&flags.nestedArchiveDiskSwapMaxSize, "nested-archive-disk-swap-max-size", 0, `The maximum size in bytes of disk space that may be used for inspecting nested archives that are over the nested-archive-max-size.
By default no disk swap is allowed, and nested archives will only be inspected if they fit into the configured nested-archive-max-size.
When an archive is encountered that is over the nested-archive-max-size, the archive may be written out to a temporary file so that it can be inspected without a large memory penalty.
If large archives are nested within each other, an archive will be opened only if the accumulated space used for archives on disk would not exceed the configured nested-archive-disk-swap-max-size.`)
cmd.Flags().StringVar(&flags.nestedArchiveDiskSwapDir, "nested-archive-disk-swap-dir", "/tmp", `When nested-archive-disk-swap-max-size is non-zero, this is the directory in which temporary files will be created for writing temporary large nested archives to disk.`)
cmd.Flags().UintVar(&flags.nestedArchiveMaxDepth, "nested-archive-max-depth", 0, `The maximum depth to recurse into nested archives.
A max depth of 0 will open up an archive on the filesystem but not any nested archives.`)
cmd.Flags().IntVar(&flags.directoriesCrawledPerSecond, "directories-per-second-rate-limit", 0, `The maximum number of directories to crawl per second. 0 for unlimited.`)
@@ -95,6 +102,8 @@ func createCrawlConfig(root string, flags crawlFlags) (crawler.Config, error) {
ArchiveListTimeout: flags.perArchiveTimeout,
ArchiveMaxDepth: flags.nestedArchiveMaxDepth,
ArchiveMaxSize: flags.nestedArchiveMaxSize,
ArchiveDiskSwapMaxSize: flags.nestedArchiveDiskSwapMaxSize,
ArchiveDiskSwapMaxDir: flags.nestedArchiveDiskSwapDir,
DirectoriesCrawledPerSecond: flags.directoriesCrawledPerSecond,
ArchivesCrawledPerSecond: flags.archivesCrawledPerSecond,
ObfuscatedClassNameAverageLength: obfuscatedClassNameAverageLength,
15 changes: 15 additions & 0 deletions integration_test/crawl_integration_test.go
@@ -197,3 +197,18 @@ func TestTraceLoggingFlag(t *testing.T) {
require.NoError(t, err)
assert.Contains(t, string(output), "[TRACE]")
}

func TestDiskSwapping(t *testing.T) {
cli, err := products.Bin("log4j-sniffer")
require.NoError(t, err)

cmd := exec.Command(cli, "crawl", "../examples/nested_very_deep", "--nested-archive-max-depth", "3", "--nested-archive-max-size", "0", "--nested-archive-disk-swap-max-size", "1")
output, err := cmd.CombinedOutput()
require.NoError(t, err)
assert.Contains(t, string(output), "over remaining space allowed for disk swap")

cmd = exec.Command(cli, "crawl", "../examples/nested_very_deep", "--nested-archive-max-depth", "3", "--nested-archive-max-size", "0", "--nested-archive-disk-swap-max-size", "99999999")
output, err = cmd.CombinedOutput()
require.NoError(t, err)
assert.Contains(t, string(output), "Files affected by CVE-2021-44228 or CVE-2021-45046 or CVE-2021-45105 or CVE-2021-44832 detected: 1 file")
}