PdfReader.outline how to get the page content by title #2042
-
PyPDF2 is obsolete. Please upgrade to pypdf.
-
@pubpub-zz OK, thanks. After upgrading, it looks like this:

from pypdf import PdfReader as pdf_read

# Format of each bookmark entry:
# {'/Title': 'bookmark name', '/Page': 'target page', '/Type': 'type'}
directory_str = ""


def bookmark_listhandler(outline):
    """Append every bookmark title to directory_str, recursing into nested lists."""
    global directory_str
    for message in outline:
        if isinstance(message, dict):
            directory_str += message["/Title"] + "\n"
            # print(message['/Title'])
        else:
            bookmark_listhandler(message)


a = []
with open("./document/Elsver/ikeda2019.pdf", "rb") as f:
    pdf = pdf_read(f)
    # Retrieve the document's text outline; the returned object is a nested list
    text_outline_list = pdf.outline
    bookmark_listhandler(text_outline_list)
    a = text_outline_list  # module level, so no `global` statement is needed
    # print(text_outline_list)

with open("context.txt", "w", encoding="utf-8") as f:
    f.write(directory_str)
directory_str
-
If you want to access the raw object, then add a call to …
-
@pubpub-zz Thank you. But what I mean is that I want to get a dict like this:
-
You are looking for text extraction:
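For readers following along, here is a minimal sketch of page-level text extraction. It assumes only that the reader object exposes `pages` and that each page has `extract_text()`, as pypdf's `PdfReader` does. The helper name `extract_page_texts` and the file name in the usage comment are made up for illustration.

```python
def extract_page_texts(reader):
    """Return the extracted text of every page, in page order."""
    # Scanned or image-only pages may yield no text, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


# Usage (requires pypdf to be installed; "example.pdf" is a placeholder):
#   from pypdf import PdfReader
#   texts = extract_page_texts(PdfReader("example.pdf"))
#   print(texts[0])
```

Note that extraction works per page, so it does not know where one outline title ends and the next begins.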
-
I can get the outline (table of contents) of the PDF, but how can I get the content under each title?
We can see the titles, but I can't see the content of each title.
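One possible approach, sketched under assumptions rather than offered as a built-in pypdf feature: resolve each outline entry to a page number with `get_destination_page_number`, then treat the text from that page up to the page where the next entry starts as the entry's content. The function names `flatten_outline` and `content_by_title` are hypothetical. The result is only page-granular, so sections that share a page will overlap, and duplicate titles overwrite each other.

```python
def flatten_outline(outline):
    """Yield outline entries depth-first; nested lists hold child bookmarks."""
    for item in outline:
        if isinstance(item, list):
            yield from flatten_outline(item)
        else:
            yield item


def content_by_title(reader):
    """Map each outline title to the text of the pages it appears to cover."""
    entries = []
    for dest in flatten_outline(reader.outline):
        page_number = reader.get_destination_page_number(dest)
        # Skip entries that do not resolve to a page in this document.
        if page_number is not None and page_number >= 0:
            entries.append((dest["/Title"], page_number))

    result = {}
    last_page = len(reader.pages) - 1
    for i, (title, start) in enumerate(entries):
        # Assumption: a section runs until the page where the next entry starts
        # (inclusive, because two sections can share a page).
        next_start = entries[i + 1][1] if i + 1 < len(entries) else last_page
        end = max(start, next_start)
        result[title] = "\n".join(
            reader.pages[n].extract_text() or "" for n in range(start, end + 1)
        )
    return result


# Usage (requires pypdf to be installed; "example.pdf" is a placeholder):
#   from pypdf import PdfReader
#   for title, text in content_by_title(PdfReader("example.pdf")).items():
#       print(title, len(text))
```

To cut the text exactly at each heading, you would still need to search for the title string inside the extracted page text, which depends on how the PDF was produced.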