-
Why do you want to do that?
-
I just want to speed up the process. I can already handle multiple files in parallel, but for a single large file with lots of images, I don't see a way to parallelise the workload within that file.
-
I replace the images with compressed versions to shrink the size of the PDF file (sample file: example_094.pdf). I would like to process pages, or at least images, in parallel to reduce processing time.
-
Here is a minimal non-working example of the parallelization attempt:

    from pypdf import PdfReader, PdfWriter
    import os
    from time import time
    from pathlib import Path
    import concurrent.futures


    def process_image(img_obj):
        img_obj.replace(img_obj.image, quality=30)


    def process_page(page):
        # candidate for multiprocessing/multi-threading?
        for img_obj in page.images:
            process_image(img_obj)


    def process_pdf(input_pdf):
        reader = PdfReader(input_pdf, strict=False)
        writer = PdfWriter()
        writer.clone_document_from_reader(reader)

        # candidate for multiprocessing/multi-threading?
        for page in writer.pages:
            process_page(page)

        filename = Path(input_pdf)
        output_pdf = Path(f'./{filename.stem}_processed{filename.suffix}')
        with open(output_pdf, 'wb') as f:
            writer.write(f)
        writer.close()
        return f'{input_pdf} -> {output_pdf}'


    def main():
        start_time = time()
        home = Path(os.path.expanduser("~"))
        input_path = home / 'PATH/TO/PDF/FILES'
        file_list = [entry.path for entry in os.scandir(input_path)
                     if entry.is_file()]

        # One worker process per file; this part already works.
        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [executor.submit(process_pdf, file)
                       for file in file_list]
            for future in concurrent.futures.as_completed(futures):
                print(future.result())

        total_time = time() - start_time
        print(f'Elapsed time: {total_time:.2f}s')


    if __name__ == '__main__':
        main()
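One untested idea for splitting work inside a single file is to hand each worker only a file path and a page range, so that every worker opens the PDF itself and no page object has to cross a process boundary. The sketch below covers only the range-splitting step; `page_ranges` is a hypothetical helper, not part of pypdf, and whether merging the partial outputs afterwards keeps the file small is an open assumption.

```python
def page_ranges(n_pages: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split range(n_pages) into at most n_chunks contiguous (start, stop) ranges.

    Hypothetical helper: each tuple is picklable and could be submitted to a
    worker together with the input path.
    """
    if n_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    base, extra = divmod(n_pages, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


if __name__ == "__main__":
    for start, stop in page_ranges(10, 4):
        print(f"worker handles pages {start}..{stop - 1}")
```

Each worker would then pay the cost of parsing the file again, so this can only help if image recompression dominates the runtime.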
-
Is there a way to process the pages of a single PDF file in parallel (multiprocessing or multithreading)? As far as I know, the page object is not picklable. Is there any other way to use multiple CPU cores when processing one file?
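To illustrate the picklability constraint in general terms: `ProcessPoolExecutor` pickles every argument and return value, so only plain data such as `bytes` can be sent to workers. The sketch below uses only the standard library; `is_picklable`, `shrink`, and `shrink_all` are hypothetical names, and `zlib.compress` is merely a stand-in for the CPU-heavy image step, not real image recompression and not a pypdf API.

```python
import pickle
import zlib
from concurrent.futures import Executor, ProcessPoolExecutor


def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, i.e. could be sent to a worker process."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def shrink(payload: bytes) -> bytes:
    """Stand-in for the CPU-heavy step (assumption: NOT real image recompression)."""
    return zlib.compress(payload, level=9)


def shrink_all(payloads: list[bytes], executor: Executor) -> list[bytes]:
    """Run shrink over all payloads; Executor.map returns results in input order."""
    return list(executor.map(shrink, payloads))


if __name__ == "__main__":
    payloads = [bytes([i]) * 100_000 for i in range(8)]  # fake "image" data
    print(is_picklable(payloads[0]))   # plain bytes can cross process boundaries
    print(is_picklable(lambda x: x))   # lambdas cannot be pickled
    with ProcessPoolExecutor() as pool:
        results = shrink_all(payloads, pool)
    print([len(r) for r in results])
```

The same check can be run against a page or image object to see whether it can be passed to a worker directly, or whether only its raw data can.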