Solidify API through single PDF interface with supporting functions. #1604

lababidi · 2023-02-03T18:53:42Z

Explanation

Currently there are multiple classes (PdfReader, PdfWriter, PdfMerger) that are action-oriented. Typically, classes represent objects, not actions. Functions would express the needed actions more succinctly and effectively (pypdf.read(), pypdf.PDF.write(), pypdf.merge()). By the same token, the object being read or written is simply a PDF, which could be a class PDF(). This would make the API clearer and let PDFs be read in, modified through the PDF object, and written out easily. Additionally, it would allow metadata, fonts, etc. to persist.

This would be somewhat similar to the Pandas interface (pd.read_csv, pd.DataFrame). I think that model would work well for PDFs.

I wanted to present this and discuss it before I began prototyping something out. Would love to know your thoughts @MartinThoma et al!

Code Example

import pypdf
from pypdf import PDF

pdf: PDF = pypdf.read("path_to/old.pdf")

page = pypdf.Page("new text")
pdf.add_page(page)

pdf.write("path_to/new.pdf")
# metadata in the new PDF is the same as in the old PDF


merged_pdf: PDF = pypdf.merge(pdf1, pdf2, ..., pdfN)
merged_pdf.write("path_to/merged.pdf")

MartinThoma · 2023-02-03T21:14:15Z

Maintainer

I want to get rid of PdfMerger at some point - @pubpub-zz extended the PdfWriter a lot so that this might be possible.
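For reference, merging with PdfWriter alone can look roughly like the sketch below. It assumes a pypdf version in which PdfWriter has an append() method. The names merge_files and writer_factory are hypothetical and introduced here only for illustration; the factory parameter exists so the function can be exercised without real PDF files.

```python
from typing import Any, Callable, Iterable, Optional


def merge_files(
    paths: Iterable[str],
    output: str,
    writer_factory: Optional[Callable[[], Any]] = None,
) -> Any:
    """Concatenate the PDFs at `paths` into `output` using only a writer.

    `merge_files` and `writer_factory` are illustrative names, not pypdf API.
    """
    if writer_factory is None:
        # Imported lazily so the sketch can be loaded without pypdf installed.
        from pypdf import PdfWriter

        writer_factory = PdfWriter

    writer = writer_factory()
    for path in paths:
        # PdfWriter.append copies the pages of the given file into the writer.
        writer.append(path)
    writer.write(output)
    return writer
```

With pypdf installed, a call such as merge_files(["a.pdf", "b.pdf"], "merged.pdf") would cover the basic PdfMerger use case without a separate merger class.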

I would not add a read(path) function which essentially does the same as PdfReader(path). I simply don't see the benefit of it.

Any breaking change to pypdf has to have a clear benefit. A lot of people use it and I want to avoid breaking changes. We had a lot of breaking changes in 2022 in order to make the interface more consistent / pythonic. Although I think that I communicated those changes pretty well, I still see a lot of people using the old interface or struggling with the switch. I updated code in lots of places (Stack Overflow, other git repositories, writing to people who wrote articles/tutorials). I simply don't want to put that much time into another change like this unless there is a big benefit.

I'm still open to being convinced that there is such a big benefit (or that the change is less complex than I currently think) :-)

2 replies

pubpub-zz Feb 4, 2023
Maintainer

The internal implementations of PdfReader and PdfWriter are completely different: for example, PdfReader can handle multiple generations of an object, but PdfWriter cannot. So the work to merge those will be too important

lababidi Feb 7, 2023
Author

Thanks for the response!

While this may look like a breaking change, I'm not suggesting that we immediately remove any of the current classes or methods. I believe that adding a PDF class could help unify the interface. The PDF could be a new underlying class that is returned and has all the functions of the Writer as well as the Reader.
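A minimal sketch of how such a PDF class could sit on top of the existing classes without removing them is shown below. PDF, read, merge, _default_writer and writer_factory are all hypothetical names taken from or added to the proposal, not existing pypdf API. The sketch assumes that PdfWriter.append() accepts a PdfReader and that add_metadata() accepts the reader's metadata object, and it copies the document information explicitly so that it persists.

```python
from typing import Any, Callable


def _default_writer() -> Any:
    # Imported lazily so the sketch can be loaded without pypdf installed.
    from pypdf import PdfWriter

    return PdfWriter()


class PDF:
    """Hypothetical facade over a writer object (not part of pypdf)."""

    def __init__(self, writer: Any) -> None:
        # The wrapped writer does the real work (a PdfWriter in practice).
        self._writer = writer

    @property
    def pages(self) -> Any:
        return self._writer.pages

    def add_page(self, page: Any) -> Any:
        return self._writer.add_page(page)

    def write(self, path: str) -> None:
        self._writer.write(path)


def read(path: str) -> PDF:
    """Load `path` into a PDF object, carrying its metadata along."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    writer = _default_writer()
    writer.append(reader)  # copy the pages into the writer
    if reader.metadata is not None:
        writer.add_metadata(reader.metadata)  # keep the document info
    return PDF(writer)


def merge(*pdfs: PDF, writer_factory: Callable[[], Any] = _default_writer) -> PDF:
    """Return a new PDF containing the pages of all `pdfs`, in order."""
    writer = writer_factory()
    for pdf in pdfs:
        for page in pdf.pages:
            writer.add_page(page)
    return PDF(writer)
```

This only wraps the writer side; it does not address the round-trip differences or the internal differences between PdfReader and PdfWriter raised elsewhere in this thread.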

One of the concerns I have is that writing a PDF after it has been read results in many changes to the PDF, even when nothing has been changed. This inconsistency is coupled with the fact that the two classes (Reader/Writer) are so different when they really are the same "object" conceptually: a PDF. I also believe this would simplify the code because it would unify the underlying architecture.

"the work to merge those will be too important"
I'm not exactly sure what you mean by this. Do you mean too difficult?

alexwgee · 2023-03-19T22:16:09Z

Hello @lababidi and everyone. Thank you to everyone, especially @MartinThoma, who has really breathed new life into this project. I really like that this library is being brought up to date. I appreciate all of your work and effort.

I am an outsider. I think this is my first comment. So, I don't expect my opinion to count for much. However, I just wanted to say that @lababidi's idea of having a PDF class that retains the state of a PDF object makes a lot of sense to me. In my humble opinion, the concern that @lababidi raises, that writing a PDF after it has been read results in many changes to the PDF even if nothing has really changed, is a valid point theoretically.

On the one hand, I can see where this may be desirable. The user documentation for pypdf / PyPDF2 even says that you can reduce the PDF size by doing this (i.e., by reading and then writing the file).

However, I wonder if there might be use cases where you might want to preserve the PDF state in all but the ways explicitly operated upon by the user.

That's just my two cents. Thank you to everyone who contributes and works on this project, especially @MartinThoma who I can see has put a lot of effort into it.

3 replies

MartinThoma Mar 19, 2023
Maintainer

Don't forget about @pubpub-zz, he is doing way more of the actual development work than I do 💖

MartinThoma Mar 19, 2023
Maintainer

I think the problem with a single pdf class instead of the reader / writer classes is that it's just not that easy to implement.

alexwgee Mar 19, 2023

"Don't forget about @pubpub-zz, he is doing way more of the actual development work than I do 💖"

Thank you @pubpub-zz !

"I think the problem with a single pdf class instead of the reader / writer classes is that it's just not that easy to implement."

That is understandable. Any new feature or functionality has to be practical to implement, test, and maintain, of course.

lababidi · 2023-12-18T03:42:56Z

Author

I would like to resurrect this discussion because I really like pypdf and want to make it even more awesome :)

I'm happy to help design and use already implemented functions. I think that having a PDF class isn't as daunting as one would think if you dig into it. I concur with everything @alexwgee said, and I think something could be done. Again, I'm happy to help implement and design, but I would love to connect first to align.

2 replies

MartinThoma Dec 18, 2023
Maintainer

I think having a Pdf class that would combine PdfReader and PdfWriter in one clean interface would be amazing. I wouldn't immediately get rid of PdfReader / PdfWriter, but have all three available for a while.

So I'm very open to PRs / drafts for this idea!

MartinThoma Dec 18, 2023
Maintainer

As a side-note: Another type of integration that would be awesome is fpdf2. I think the capabilities of fpdf2 within PdfWriter would be amazing, but that's a different topic :-)
