Re-consider completely file-based testing? #2152

stefan6419846 · 2023-09-05T09:30:42Z

stefan6419846
Sep 5, 2023
Maintainer

At the moment, testing in pypdf is de facto file-based only. While this is completely fine and basically tests the integration layer, standalone unit tests might make sense as well to ensure that individual methods work correctly.

Let's take #2147 or #2110 as an example: I have some PDF files which show issues, but I cannot really provide them due to privacy reasons. There are all sorts of PDF generators available, some of them having bugs in some versions, some of them not bothering to generate completely valid PDF files at all; PDF/A might solve this, but you usually will not find many such files in everyday life. Generating crafted files which show such issues might be doable, but requires a deep understanding of the inner structure of PDF files. Using mocking, one might provide such problematic data in an easier way, which would allow for non-PDF-based tests of single methods (although this might impose some overhead when having to adapt mocks during enhancements/refactoring).
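To illustrate the mocking idea, here is a minimal sketch using only `unittest.mock` from the standard library. The helper `safe_page_count` is hypothetical and not part of pypdf; it only stands in for "some method under test" that receives a reader-like object. The assumption is that a broken file surfaces as a `KeyError` when the page tree is accessed, which may differ from what a real file triggers.

```python
from unittest import mock


def safe_page_count(reader):
    """Hypothetical method under test (not a pypdf API).

    Returns the number of pages, or 0 if the page tree cannot be resolved.
    """
    try:
        return len(reader.pages)
    except KeyError:
        return 0


def test_broken_page_tree():
    reader = mock.Mock()
    # Assumption: a catalog without /Pages shows up as a KeyError on access.
    type(reader).pages = mock.PropertyMock(side_effect=KeyError("/Pages"))
    assert safe_page_count(reader) == 0


def test_regular_page_tree():
    reader = mock.Mock()
    # Two placeholder objects stand in for two pages.
    reader.pages = [object(), object()]
    assert safe_page_count(reader) == 2
```

No PDF file is involved here, so nothing private has to be shared. The price is the one mentioned above: the mock encodes an assumption about how the problem appears, and it has to be adapted whenever the method under test changes.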

In the past, I sometimes used PyMuPDF to mostly anonymize PDF files, but this only works to some extent as well: the error in #2147 is silently fixed in the process, PDFs of scanned images generally cannot really be anonymized, some images to extract might be scans or other private resources (like signatures), ...

One approach for me would be to stop reporting such issues and maintain my own patches instead (which would avoid these conflicts), but I generally appreciate the work which goes into such a freely available library and want to support development as far as I am able to.

If someone has realistic alternative approaches which avoid the aforementioned issues, I am open to them as well as to further discussions on this topic to enhance the general contribution experience of pypdf.

pubpub-zz · 2023-09-05T17:51:54Z

pubpub-zz
Sep 5, 2023
Maintainer

@stefan6419846 / @MartinThoma
I propose to convert this into a discussion.


pubpub-zz · 2023-09-05T18:03:52Z

pubpub-zz
Sep 5, 2023
Maintainer

My opinion
Without input data, investigation is impossible in almost all cases. Receiving test files through email seems a good option; we are doing our best to preserve privacy, and this seems to be accepted by many people.

The only alternative I see is to pay a company, which in any case will ask for input data, but up to now I've seen only one person willing to contribute to support, although I'm pretty sure this library is used in business projects.
Also, when we receive private data, once the issue is fixed, the test we build uses a manually generated reproduction of the issue to ensure it does not come back.

Finally, just for general information, pypdf includes remove_text and remove_images, which should wipe out most information.

1 reply

stefan6419846 Sep 6, 2023
Maintainer Author

> Receiving test files through email seems a good option; we are doing our best to preserve privacy, and this seems to be accepted by many people.

I did something similar in the past, but still tried to anonymize the corresponding PDF files as much as possible. Privacy just has a high priority in my case, especially when most of the PDF files are not generated by me.

> The only alternative I see is to pay a company, which in any case will ask for input data, but up to now I've seen only one person willing to contribute to support, although I'm pretty sure this library is used in business projects.

My company is using pypdf in such a manner as well. Nevertheless, it is much easier to provide human resources (as long as the knowledge is available and nothing more important has to be done) to support development than actual financial support.

> Finally, just for general information, pypdf includes remove_text and remove_images, which should wipe out most information.

I tried to use them in the past, but PyMuPDF proved to be much more stable, especially considering that these PDFs somehow violate the PDF standard already. Additionally, some stuff like problematic images with personal data cannot be covered by either approach. In a quick test for #2147, running these methods on the file produced another error (#2157).

By the way: Tests like the current one in #2150 show that some of these tests can indeed work without any external PDF.
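As a rough illustration of a test input that needs no external file, the sketch below assembles a tiny one-page PDF in memory with the standard library only. This is not how the test in #2150 is written; `build_minimal_pdf` and `PDF_BYTES` are names made up for this example, and the object layout is the simplest one I would expect a parser to accept.

```python
def build_minimal_pdf(objects):
    """Assemble PDF bytes from raw object bodies (hypothetical helper).

    Object numbers start at 1; the cross-reference table and the
    startxref offset are computed from the actual byte positions.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Catalog, page tree and a single empty page.
PDF_BYTES = build_minimal_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
])
```

Wrapped in `io.BytesIO`, such bytes could be handed to a reader in a test, and single objects could be altered on purpose to mimic a defect seen in a private file. Whether a specific defect can be reproduced this way still depends on understanding what exactly is wrong in the original.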
