PdfReader.outline how to get the page content by title #2042

Ethan-Chen-plus · 2023-07-26T14:46:39Z

from PyPDF2 import PdfReader as pdf_read

# Format of each bookmark entry:
# {'/Title': 'bookmark name', '/Page': 'target page number', '/Type': 'type'}

directory_str = ''
def bookmark_listhandler(outline):
    # Recursively collect every bookmark title from the nested outline.
    global directory_str
    for message in outline:
        if isinstance(message, dict):
            directory_str += message['/Title'] + '\n'
            # print(message['/Title'])
        else:
            # A nested list holds the children of the preceding bookmark.
            bookmark_listhandler(message)

with open('10.1007@s10058-019-00220-4.pdf', 'rb') as f:
    pdf = pdf_read(f)
    # Retrieve the document's outline; the returned object is a nested list.
    text_outline_list = pdf.outline
    bookmark_listhandler(text_outline_list)
    # print(text_outline_list)

with open('context.txt', 'w', encoding='utf-8') as f:
    f.write(directory_str)
print(directory_str)

I can get the outline (table of contents) of the PDF, but how can I get the content under each title?

We can see the titles, but I can't see the content of each title.

pubpub-zz · 2023-07-26T15:11:10Z

Maintainer

PyPDF2 is obsolete. Please upgrade to pypdf.

Ethan-Chen-plus · 2023-07-27T02:17:55Z

Author

@pubpub-zz OK, thanks. After upgrading, the code looks like this:

from pypdf import PdfReader as pdf_read

# Format of each bookmark entry:
# {'/Title': 'bookmark name', '/Page': 'target page number', '/Type': 'type'}

directory_str = ""


def bookmark_listhandler(outline):
    # Recursively collect every bookmark title from the nested outline.
    global directory_str
    for message in outline:
        if isinstance(message, dict):
            directory_str += message["/Title"] + "\n"
            # print(message['/Title'])
        else:
            # A nested list holds the children of the preceding bookmark.
            bookmark_listhandler(message)


with open("./document/Elsver/ikeda2019.pdf", "rb") as f:
    pdf = pdf_read(f)
    # Retrieve the document's outline; the returned object is a nested list.
    text_outline_list = pdf.outline
    bookmark_listhandler(text_outline_list)
    # print(text_outline_list)

with open("context.txt", "w", encoding="utf-8") as f:
    f.write(directory_str)
print(directory_str)

But I still can't get the content under each title.
If I want a dict that maps each title to its content, what should I do?

pubpub-zz · 2023-07-27T07:59:28Z

Maintainer

If you want to access the raw object, the indirect_reference attribute is there for that:
text_outline_list[0].indirect_reference

Add a call to .get_object() to get access to the object.

Ethan-Chen-plus · 2023-07-30T04:30:35Z

Author

@pubpub-zz Thank you. But what I mean is that I want to get a dict like this:

text_outline_list[0].indirect_reference.get_object() can only get this:

pubpub-zz · 2023-07-30T07:16:09Z

Maintainer

You are looking for text extraction:
text_outline_list["/Title"] : text_outline_list["/Page"].extract_text()
You should put this in a try/except: some outlines may not reference a page.

Ethan-Chen-plus Jul 30, 2023
Author

from pypdf import PdfReader as pdf_read


def bookmark_listhandler(outline):
    # Print each bookmark title followed by the text of its target page.
    for message in outline:
        if isinstance(message, dict):
            try:
                print(message['/Title'])
                # print(message['/Page'])
                print(message['/Page'].extract_text())
                print('*'*20)
            except Exception as e:
                print(e)
        else:
            bookmark_listhandler(message)

with open('10.1007@s10058-019-00220-4.pdf', 'rb') as f:
    pdf = pdf_read(f)
    text_outline_list = pdf.outline
    bookmark_listhandler(text_outline_list)

Error:

'DictionaryObject' object has no attribute 'extract_text'

pubpub-zz Jul 30, 2023
Maintainer

To make this easier, you will also have to pass the pdf reader as a parameter.
Then try:
pdf.pages[pdf.get_page_number(text_outline_list["/Page"])].extract_text()
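
As a rough sketch of how this suggestion could be wired into the recursive traversal, the snippet below builds a dict from each title to the text of the page it points to. `iter_outline_items` and `page_text_by_title` are hypothetical helper names. `get_destination_page_number` is used here instead of `get_page_number`, on the assumption that the installed pypdf provides it; depending on the version, it appears to signal a missing page with `None` or `-1`, so both are checked. Note that this returns the whole page, not only the section under the title.

```python
def iter_outline_items(outline, depth=0):
    """Yield (depth, item) for every bookmark in a nested outline list."""
    for entry in outline:
        if isinstance(entry, list):
            # A nested list holds the children of the preceding bookmark.
            yield from iter_outline_items(entry, depth + 1)
        else:
            yield depth, entry


def page_text_by_title(pdf_path):
    """Return {title: text of the whole page the bookmark points to}."""
    # Imported lazily so that iter_outline_items has no dependency on pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    pages = {}
    for _depth, item in iter_outline_items(reader.outline):
        try:
            number = reader.get_destination_page_number(item)
        except Exception:
            number = None  # some outline items may not reference a page
        if number is None or number < 0:
            continue
        # Duplicate titles overwrite each other in this simple sketch.
        pages[item["/Title"]] = reader.pages[number].extract_text()
    return pages
```

A call such as `page_text_by_title("test.pdf")` would then give one entry per bookmark that resolves to a page.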

Ethan-Chen-plus Jul 31, 2023
Author

pdf.get_page_number(message['/Page']) only gives the page on which each title starts, and that page also includes other content. I only want the text between two titles.
This is my test.pdf:

It only has two pages.
If I use pdf.pages[pdf.get_page_number(text_outline_list["/Page"])].extract_text(), each call returns the whole of page 1 or page 2. However, I just want the content that belongs to the title. For example, for "1 Introduction" I only want "This is the content of Introduction", not the whole of page 1.

pubpub-zz Jul 31, 2023
Maintainer

You are expecting too much from pypdf, but also from the PDF format: outlines only provide positions.
So you will have to analyse the full destination to get the start position (be careful with positions that refer to the whole page), then do the same for the next outline item, and then use visitor functions to extract only the text within that range.
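
The steps described above could be sketched roughly as follows. This is only an illustration, not part of pypdf: `make_range_visitor` and `extract_between` are hypothetical helper names. The sketch assumes that both outline items point to the same page, that their destinations carry a vertical coordinate (for example `/XYZ` destinations, read through the `top` property), and that `tm[5]` of each text fragment is already expressed in page coordinates.

```python
def make_range_visitor(parts, y_top=None, y_bottom=None):
    """Build a visitor for page.extract_text(visitor_text=...).

    Keeps fragments whose vertical position y satisfies y_bottom < y <= y_top
    (PDF y coordinates grow upwards). A bound of None means "no limit".
    """

    def visitor(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical offset of the text matrix (assumed page coordinates)
        if y_top is not None and y > y_top:
            return  # above the current title
        if y_bottom is not None and y <= y_bottom:
            return  # at or below the position of the next title
        parts.append(text)

    return visitor


def extract_between(page, y_top=None, y_bottom=None):
    """Return the text of a pypdf page located between two vertical positions."""
    parts = []
    page.extract_text(visitor_text=make_range_visitor(parts, y_top, y_bottom))
    return "".join(parts)
```

With two consecutive outline items `current` and `following` on the same page, a call might look like `extract_between(pdf.pages[n], current.top, following.top)`, where `n` is the page number obtained as before. Destinations without a usable `/Top` value (such as `/Fit`, which targets the whole page) give no bound, and a section that spans several pages needs one call per page.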

All this should be part of your own code.
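
A cruder, text-only alternative to the position-based approach, offered purely as a sketch: extract the whole page and cut the resulting string at the outline titles. `split_text_by_titles` is a hypothetical helper, and the sample title "2 Method" in the usage note is made up for illustration. It assumes every title appears in the extracted text exactly as it is spelled in the outline, and in reading order, which real PDFs do not guarantee.

```python
def split_text_by_titles(page_text, titles):
    """Map each title found in page_text to the text up to the next found title."""
    # Locate the titles in order; titles that do not occur are skipped.
    found = []
    search_from = 0
    for title in titles:
        index = page_text.find(title, search_from)
        if index != -1:
            found.append((index, title))
            search_from = index + len(title)

    sections = {}
    for n, (index, title) in enumerate(found):
        # A section ends where the next found title starts, or at the end of the text.
        end = found[n + 1][0] if n + 1 < len(found) else len(page_text)
        sections[title] = page_text[index + len(title):end].strip()
    return sections
```

For a page whose extracted text reads "1 Introduction", "This is the content of Introduction", "2 Method", and so on, `split_text_by_titles(text, ["1 Introduction", "2 Method"])` would map "1 Introduction" to "This is the content of Introduction". A title string that also occurs inside body text, or that is extracted with different spacing, will break this heuristic.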
