How to extract internal links using PyPDF #2911

swathiJayav · 2024-10-18T22:47:21Z

swathiJayav
Oct 18, 2024

I am trying to extract internal links. This is what I have.

    import pypdf

    pdf_path = "document.pdf"  # placeholder: path to your PDF

    internal_links = []
    with open(pdf_path, 'rb') as pdf_file:
        reader = pypdf.PdfReader(pdf_file)
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    annot_obj = annot.get_object()
                    if annot_obj["/Subtype"] == "/Link":
                        if "/Dest" in annot_obj:
                            dest = annot_obj["/Dest"]
                            print(f"page: {page_num} dest: {dest}")

I'm unable to get the page number from dest. I looked in the pypdf codebase and also in the returned object, but can't find a way to correlate it with a page number.

pubpub-zz · 2024-10-19T06:40:28Z

pubpub-zz
Oct 19, 2024
Maintainer

I converted this thread into a discussion, as this is not an issue.

pubpub-zz · 2024-10-19T07:21:18Z

pubpub-zz
Oct 19, 2024
Maintainer

This should give you the page number in your case:

    pg_num = [p.indirect_reference for p in reader.flattened_pages].index(dest)

Be aware that not all links use ["/Dest"] directly: some use ["/A"] with /GoTo actions, and some documents store their destinations in reader.root_object["/Names"]["/Dests"].

swathiJayav Oct 24, 2024
Author

We are on 5.0.1 (latest), and still not able to access reader.root_object["/Names"]["/Dests"]

stefan6419846 Oct 24, 2024
Maintainer

reader.root_object should always exist: https://pypdf.readthedocs.io/en/latest/modules/PdfReader.html#pypdf.PdfReader.root_object

If it does not for your PDF file, or if there are errors, please provide the file for further analysis, ideally together with a stack trace. Otherwise there is not much we can help you with.

swathiJayav Oct 24, 2024
Author

Actually, I am able to access it, but I don't see the key reader.root_object["/Names"]. I'm guessing this is only relevant when pg_num = [p.indirect_reference for p in reader.flattened_pages].index(dest) does not work.

Is the recommended way to find the linked page number to try all three of these approaches until one works? Does this cover all the cases we need to check for?

1. pg_num = [p.indirect_reference for p in reader.flattened_pages].index(dest)
2. Use ["/A"] with /GoTo actions
3. Check reader.root_object["/Names"]["/Dests"]

stefan6419846 Oct 24, 2024
Maintainer

The following should work:

    dest = annot_obj["/Dest"]
    pg_ref = dest[0]
    pg_num = [p.indirect_reference for p in reader.flattened_pages].index(pg_ref)

For the corresponding syntax, see table 149 of the PDF 2.0 specification.

Full code (slightly modified):

    import pypdf

    with pypdf.PdfReader('Boeing.pdf') as reader:
        for page_num, page in enumerate(reader.pages):
            if "/Annots" in page:
                for annot in page["/Annots"]:
                    annot_obj = annot.get_object()
                    if annot_obj["/Subtype"] == "/Link":
                        if "/Dest" in annot_obj:
                            # The first entry of the destination array is the target page reference.
                            dest = annot_obj["/Dest"][0]
                            pg_num = [p.indirect_reference for p in reader.flattened_pages].index(dest)
                            print(f"page: {page_num} pg_num: {pg_num}")

And reader.root_object works fine here:

>>> from pypdf import PdfReader
>>> reader = PdfReader('Boeing.pdf')
>>> reader.root_object
{'/Type': '/Catalog', '/Pages': IndirectObject(3, 0, 134954891051504)}
>>>

Answer selected by swathiJayav
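
One detail the snippets above leave open is what happens when the referenced page is not in the list: `list.index` raises `ValueError`. A minimal sketch of a lookup that returns `None` instead is shown below. The helper name `page_number_of` and the string stand-ins for page references are hypothetical, chosen only so the example runs without a PDF file.

```python
from typing import Any, Optional, Sequence


def page_number_of(page_ref: Any, page_refs: Sequence[Any]) -> Optional[int]:
    """Return the zero-based index of page_ref in page_refs, or None if absent."""
    try:
        return list(page_refs).index(page_ref)
    except ValueError:
        # The destination does not point at a page in this list.
        return None


# Stand-in values. With pypdf the list would presumably be built once, e.g.
# page_refs = [p.indirect_reference for p in reader.pages]
page_refs = ["ref-a", "ref-b", "ref-c"]
print(page_number_of("ref-b", page_refs))  # prints 1
print(page_number_of("ref-x", page_refs))  # prints None
```

Building the list once and reusing it also avoids recreating it for every annotation.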

swathiJayav Oct 24, 2024
Author

I'm able to get it working when /Dest is available, thank you.

But for cases where it's a named link (specified with /GoTo), it's not clear how to correlate this with a page number. In the image below, annot_obj['/A']['/D'] does not match anything in reader.root_object["/Names"]["/Dests"]. Attached is the document as well (taken from the pypdf tests).

pdflatex-outline.pdf

stefan6419846 Oct 25, 2024
Maintainer

Retrieve the mapping of names to pages:

    dests = reader.root_object["/Names"]["/Dests"]
    kids = dests["/Kids"]
    reader.pages[0]  # Populate `reader.flattened_pages`
    _pages = [p.indirect_reference for p in reader.flattened_pages]
    name_to_page_number = {}
    for kid in kids:
        # Each /Names array alternates between a name and its destination.
        names = kid["/Names"]
        name_count = len(names)
        for i in range(0, name_count, 2):
            # The first entry of /D is the reference to the target page.
            page_reference = names[i + 1]["/D"][0]
            name_to_page_number[names[i]] = _pages.index(page_reference)

Handle the annotation:

    if "/A" in annot_obj:
        dest = annot_obj["/A"]["/D"]
        pg_num = name_to_page_number[dest]
        print(dest, pg_num)
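
The mapping code above assumes the name tree has exactly one level of /Kids, which matches the attached document. In the PDF specification a name tree node may instead carry /Names directly or nest /Kids more deeply, and a named destination may be stored either as a dictionary with /D or as the destination array itself. The following is a rough sketch of a recursive walk that covers those shapes. The helpers `_resolve`, `iter_name_tree` and `destination_array` are hypothetical names, and the example runs on plain dicts and lists standing in for pypdf objects.

```python
from typing import Any, Iterator, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_name_tree(node: Any) -> Iterator[Tuple[Any, Any]]:
    """Yield (name, value) pairs from a name tree node and all of its kids."""
    node = _resolve(node)
    names = _resolve(node.get("/Names", []))
    # /Names is a flat array: [name1, value1, name2, value2, ...]
    for i in range(0, len(names) - 1, 2):
        yield names[i], _resolve(names[i + 1])
    for kid in _resolve(node.get("/Kids", [])):
        yield from iter_name_tree(kid)


def destination_array(value: Any) -> Any:
    """Return the destination array, unwrapping a dictionary with /D if needed."""
    value = _resolve(value)
    if isinstance(value, dict):
        return _resolve(value["/D"])
    return value


# Stand-in data: strings play the role of page references.
page_refs = ["ref-a", "ref-b", "ref-c"]
tree = {
    "/Kids": [
        {"/Names": ["sec.1", {"/D": ["ref-a", "/XYZ", 0, 0, None]}]},
        {"/Kids": [{"/Names": ["sec.2", ["ref-c", "/Fit"]]}]},
    ]
}
name_to_page_number = {
    name: page_refs.index(destination_array(value)[0])
    for name, value in iter_name_tree(tree)
}
print(name_to_page_number)  # prints {'sec.1': 0, 'sec.2': 2}
```

With a real file, `tree` would be `reader.root_object["/Names"]["/Dests"]`, under the assumption that the document has that entry at all.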

swathiJayav Oct 26, 2024
Author

This works, thank you so much!

swathiJayav · 2024-10-26T04:56:55Z

swathiJayav
Oct 26, 2024
Author

Final answer:

For non-named links: #2911 (reply in thread)

For named links: #2911 (reply in thread)
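
For readers who want both cases in one place, here is a hedged sketch that tries them in order: a direct /Dest entry first, then a /GoTo action under /A, with the destination being either an explicit array or a name. The function `link_page_number` and its parameters are hypothetical and not part of pypdf. It is written against plain dicts and lists so it runs without a PDF, and it assumes `name_to_page` was built beforehand from the document's named destinations.

```python
from typing import Any, Mapping, Optional, Sequence


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def link_page_number(
    annot: Any, page_refs: Sequence[Any], name_to_page: Mapping[Any, int]
) -> Optional[int]:
    """Return the target page index of an internal link annotation, or None."""
    annot = _resolve(annot)
    if annot.get("/Subtype") != "/Link":
        return None
    dest = _resolve(annot.get("/Dest"))
    if dest is None:
        action = _resolve(annot.get("/A"))
        # Only /GoTo actions are handled here; /URI and other actions are skipped.
        if not isinstance(action, dict) or action.get("/S") != "/GoTo":
            return None
        dest = _resolve(action.get("/D"))
    if isinstance(dest, (str, bytes)):
        # Named destination: look it up in the prepared mapping.
        return name_to_page.get(dest)
    if isinstance(dest, (list, tuple)) and dest:
        # Explicit destination: the first entry is the page reference.
        try:
            return list(page_refs).index(dest[0])
        except ValueError:
            return None
    return None


# Stand-in data: strings play the role of page references.
page_refs = ["p0", "p1", "p2"]
names = {"sec.2": 2}
direct = {"/Subtype": "/Link", "/Dest": ["p1", "/Fit"]}
named = {"/Subtype": "/Link", "/A": {"/S": "/GoTo", "/D": "sec.2"}}
print(link_page_number(direct, page_refs, names))  # prints 1
print(link_page_number(named, page_refs, names))   # prints 2
```

Returning `None` for anything that cannot be resolved keeps external links and broken destinations from raising inside the annotation loop.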
