By its nature, precomp relies heavily on parsers and file-type recognition code.
This is tedious work, as every parser needs to be written and tuned by hand.
It is also very prone to errors and mismatches, both because each standard has multiple differing implementations and because the streams are often only a fragment of a whole file, leaving the program flying blind.
I believe that robust and accurate universal type detection is not only possible but probably easier to implement than the current system, using the method described here.
The proposed solution correctly assigns file types from 1024-byte file fragments alone, with an accuracy of 98.3%.
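The details of the referenced method aren't reproduced in this issue, but fragment classifiers commonly work from simple statistics of the fragment's bytes. As a rough sketch of the idea (the feature set and the entropy rule of thumb below are my own illustrative assumptions, not the referenced method):

```python
import math
from collections import Counter

def fragment_features(fragment: bytes) -> list:
    """Feature vector for a file fragment: normalized byte histogram
    plus Shannon entropy. A common baseline in fragment-classification
    work; the referenced method may use richer features."""
    n = len(fragment) or 1
    counts = Counter(fragment)
    hist = [counts.get(b, 0) / n for b in range(256)]
    entropy = -sum(p * math.log2(p) for p in hist if p > 0)
    return hist + [entropy]

# Toy illustration: entropy alone already separates text-like data
# from uniformly distributed (compressed/encrypted-looking) data.
text = (b"precomp is a stream precompressor. " * 40)[:1024]
uniform = bytes(range(256)) * 4  # 1024 bytes, flat histogram

print(fragment_features(text)[-1])     # low entropy, text-like
print(fragment_features(uniform)[-1])  # 8.0, maximal for bytes
```

A real detector would feed such vectors from labeled 1024-byte fragments into a trained classifier; the 98.3% figure above presumably comes from a setup of that kind.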
The parsers currently used by precomp are very good in their own right, yet there are a number of future applications where precomp will need more and more detection code, including some open issues: #6, #20, #26, #44 and #86.
There are other applications for quick and correct type detection as well, such as dictionary preprocessors and/or custom compressors for text, exe preprocessing, mm preprocessing, and maybe even fast detection of header-less deflate streams, currently handled by "brute mode".
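For the header-less deflate case, brute-mode detection amounts to trial decompression at candidate offsets. A minimal sketch of that check, assuming Python's zlib (the `min_output` threshold is an illustrative assumption, not precomp's actual logic):

```python
import zlib

def looks_like_raw_deflate(data: bytes, min_output: int = 32) -> bool:
    """Return True if `data` plausibly starts a raw (header-less)
    deflate stream: try to inflate with raw windowing (wbits=-15)
    and require a minimum amount of decoded output."""
    d = zlib.decompressobj(wbits=-15)
    try:
        out = d.decompress(data)
    except zlib.error:
        return False
    return len(out) >= min_output

# A genuine raw deflate stream is accepted...
c = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = c.compress(b"hello precomp " * 100) + c.flush()
print(looks_like_raw_deflate(raw))           # True

# ...while bytes that violate the deflate format are rejected
# (here, a stored block whose NLEN complement check fails).
print(looks_like_raw_deflate(b"\x00" * 64))  # False
```

Trial decompression of this kind can still accept garbage that happens to decode, which is part of why brute mode is expensive; a fast upfront type classifier would narrow down where such trials are worth running.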
There's also the proposed extract switch and the grouping of streams to improve compression.
So it would probably make sense to tackle this before addressing any of the other issues...