Re-consider completely file-based testing? #2152
Replies: 2 comments 1 reply
-
@stefan6419846 / @MartinThoma |
-
My opinion: the only alternative I see is to pay a company, which in any case will ask for input data. Up to now I have seen only one person agreeing to contribute to support, although I'm pretty sure this library is used in business projects. Finally, just for general information: pypdf includes remove_text and remove_images, which should wipe out most information.
-
At the moment, testing in pypdf is de facto file-based only. While this is completely fine and basically tests at the integration level, standalone unit tests might make sense as well to ensure that a single method is working correctly.
Let's take #2147 or #2110 as an example: I have some PDF files which show issues, but I cannot really provide them due to privacy reasons. There are all sorts of PDF generators available, some of them having bugs in some versions, some of them not bothering to generate completely valid PDF files at all; PDF/A might solve this, but you will rarely encounter such files in everyday life. Generating crafted files which show such issues might be doable, but requires a deep understanding of the inner structure of PDF files. Using mocking, one might provide such problematic data more easily, which would allow for non-PDF-based tests of single methods (although this might impose some overhead when mocks have to be adapted during enhancements/refactoring).
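To make the mocking idea more concrete, here is a minimal sketch using only `unittest.mock` from the standard library. The helper `extract_all_text` and the choice of `KeyError` as the failure mode are hypothetical and only serve as an illustration; they are not taken from pypdf itself. The mocked objects merely mimic the `pages` attribute and the `extract_text()` method that pypdf readers and pages expose, so no PDF file is needed.

```python
from unittest import mock


def extract_all_text(reader):
    """Hypothetical helper: join the text of all pages, tolerating broken pages."""
    parts = []
    for page in reader.pages:
        try:
            parts.append(page.extract_text())
        except KeyError:
            # Assumed stand-in for a malformed page (e.g. a missing entry
            # in a page dictionary); such a page contributes no text.
            parts.append("")
    return "\n".join(parts)


def test_extract_all_text_tolerates_broken_page():
    # A page which behaves normally.
    good_page = mock.Mock()
    good_page.extract_text.return_value = "hello"

    # A page which simulates the problematic data without any real file.
    broken_page = mock.Mock()
    broken_page.extract_text.side_effect = KeyError("/Contents")

    # The reader is mocked as well; only the `pages` attribute is used.
    reader = mock.Mock()
    reader.pages = [good_page, broken_page]

    assert extract_all_text(reader) == "hello\n"
    broken_page.extract_text.assert_called_once_with()
```

Such a test can be collected by pytest like any other test. The trade-off mentioned above still applies: the mocks encode assumptions about how the code under test accesses its inputs and have to be adapted whenever that changes.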
In the past, I sometimes used PyMuPDF to mostly anonymize PDF files, but this only works to some extent as well: the error in #2147 is silently fixed in the process, PDFs of scanned images generally cannot really be anonymized, some images to extract might be scans or other private resources (like signatures), and so on.
One approach for me would be to stop reporting such issues and maintain my own patches instead (which would avoid these conflicts), but I generally appreciate the work which goes into such a freely available library and want to support development as far as I am able to.
If someone has realistic alternative approaches which avoid the aforementioned issues, I am open to them as well as to further discussions on this topic to enhance the general contribution experience of pypdf.