Get bookmarks from PDF using PyPDF2

shushu
1 min read · Oct 22, 2017


Environment: Python 2.7

Packages: PyPDF2

Goal: obtain bookmarks of a PDF and preserve their relationships

Create a DestinationNode class in order to preserve the structure of the bookmarks

class DestinationNode(object):
    def __init__(self, heading, parent=None):
        self.heading = heading    # bookmark title
        self.parent = parent      # parent node (None for the root)
        self.children = []        # child nodes, in document order

    def add_child(self, child):
        self.children.append(child)

The bookmarks of a PDF file are stored by PyPDF2 in the read-only property outlines.

from PyPDF2 import PdfFileReader
from PyPDF2.pdf import Destination
# Read the PDF file (pdf_file is a path or an open binary file object)
reader = PdfFileReader(pdf_file)
outlines = reader.outlines

The outlines property is a nested list of Destination objects. Each destination has a title property that holds the bookmark text.

If a heading has child headings, it is immediately followed by a list containing the destinations of its sub headings. If it has no sub headings, it is followed directly by the next heading on the same level.

heading1
[sub-heading11, [...], sub-heading12, sub-heading13, ...]
heading2
heading3
[sub-heading31, sub-heading32, [...], sub-heading33, ...]
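To make this layout concrete, here is a minimal sketch that walks such a nested list and records how deep each heading sits. Plain strings stand in for the Destination objects so that it runs without a PDF file, and walk_outline is a hypothetical helper name, not part of PyPDF2.

```python
# Plain strings stand in for Destination objects (an assumption made so
# the sketch runs without PyPDF2 or a PDF file).
sample_outlines = [
    'heading1',
    ['sub-heading11', ['sub-sub-heading111'], 'sub-heading12'],
    'heading2',
    'heading3',
    ['sub-heading31', 'sub-heading32'],
]


def walk_outline(nested, depth=0):
    """Return (depth, title) pairs in document order."""
    pairs = []
    for item in nested:
        if isinstance(item, list):
            # A list holds the sub headings of the heading just before it
            pairs.extend(walk_outline(item, depth + 1))
        else:
            pairs.append((depth, item))
    return pairs


# Print each heading indented by its depth
for depth, title in walk_outline(sample_outlines):
    print('    ' * depth + title)
```

With real outlines, the same walk should apply if the string check is swapped for isinstance(item, Destination) and item.title is used as the title.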

Construct the outline tree

def construct_outline_tree(outlines):
    # Generate the outline tree below parent from a nested outline list
    def generate_tree(parent, nested_dest):
        for idx, obj in enumerate(nested_dest):
            if not isinstance(obj, Destination):
                # Nested lists are handled together with their heading
                continue
            new_node = DestinationNode(obj.title, parent=parent)
            parent.add_child(new_node)
            # A list right after a destination holds its sub headings
            next_idx = idx + 1
            if (next_idx < len(nested_dest)
                    and isinstance(nested_dest[next_idx], list)):
                generate_tree(new_node, nested_dest[next_idx])

    # Create a fake root as an entrance
    root = DestinationNode('Root')
    generate_tree(root, outlines)
    return root

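Now the bookmarks can be accessed through the root node returned by the construct_outline_tree function: each node's children list holds its sub headings, and its parent attribute points back up the tree.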
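As an illustration of reading the tree, the sketch below prints the bookmarks with indentation and rebuilds the path to one bookmark through its parent links. The tree is built by hand as a stand-in for the returned root, DestinationNode is repeated so the example runs on its own, and format_tree and path_to are hypothetical helper names.

```python
class DestinationNode(object):
    """Same node class as above, repeated so the sketch is self-contained."""
    def __init__(self, heading, parent=None):
        self.heading = heading
        self.parent = parent
        self.children = []

    def add_child(self, child):
        self.children.append(child)


def format_tree(node, depth=0):
    """Return one indented line per bookmark below node."""
    lines = []
    for child in node.children:
        lines.append('  ' * depth + child.heading)
        lines.extend(format_tree(child, depth + 1))
    return lines


def path_to(node):
    """List the headings from the top level down to node."""
    path = []
    while node.parent is not None:  # stop at the fake root
        path.append(node.heading)
        node = node.parent
    return list(reversed(path))


# Hand-built stand-in for the tree returned by construct_outline_tree
root = DestinationNode('Root')
chapter = DestinationNode('heading1', parent=root)
root.add_child(chapter)
section = DestinationNode('sub-heading11', parent=chapter)
chapter.add_child(section)
root.add_child(DestinationNode('heading2', parent=root))

print('\n'.join(format_tree(root)))
print(' > '.join(path_to(section)))
```

Keeping a parent reference on every node is what makes path_to possible without searching the tree from the root.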
