Environment: Python 2.7
Packages: PyPDF2
Goal: obtain bookmarks of a PDF and preserve their relationships
Create a DestinationNode class in order to preserve the structure of the bookmarks
class DestinationNode(object):
    # A bookmark heading with links to its parent and child headings
    def __init__(self, heading, parent=None):
        self.heading = heading
        self.parent = parent
        self.children = []

    def add_child(self, child):
        self.children.append(child)
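As a quick illustration of how the class links nodes in both directions, the sketch below builds a two-level tree by hand. The headings are made up, and path_to is a hypothetical helper that is not part of the original code.

```python
class DestinationNode(object):
    """Same class as above, repeated so the sketch runs on its own."""

    def __init__(self, heading, parent=None):
        self.heading = heading
        self.parent = parent
        self.children = []

    def add_child(self, child):
        self.children.append(child)


def path_to(node):
    """Hypothetical helper: headings from the top of the tree down to node."""
    parts = []
    while node is not None:
        parts.append(node.heading)
        node = node.parent
    return list(reversed(parts))


# Made-up headings, used only for illustration.
root = DestinationNode('Root')
chapter = DestinationNode('Chapter 1', parent=root)
root.add_child(chapter)
section = DestinationNode('Section 1.1', parent=chapter)
chapter.add_child(section)

print(path_to(section))
```

Note that passing parent only sets the upward link. The downward link exists only after add_child is called on the parent, so both steps are needed for each node.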
The bookmarks of a PDF file are stored by PyPDF2 in the read-only property outlines.
from PyPDF2 import PdfFileReader
from PyPDF2.pdf import Destination

# Read the pdf file (pdf_file is a path or an open binary file object)
reader = PdfFileReader(pdf_file)
outlines = reader.outlines
The outlines property holds a nested list of Destination objects. Each Destination has a title property that holds the bookmark text.
If a heading has subheadings, it is followed by a list containing the destinations of those subheadings. If it has none, the next item is another heading on the same level.
heading1
[sub-heading11, [...], sub-heading12, sub-heading13, ...]
heading2
heading3
[sub-heading31, sub-heading32, [...], sub-heading33, ...]
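The layout above can be mimicked with plain Python lists to see how nesting encodes depth. In this sketch plain strings stand in for PyPDF2 Destination objects, and flatten_with_depth is a hypothetical helper, not a PyPDF2 function.

```python
def flatten_with_depth(nested, depth=0):
    """Return (depth, heading) pairs for a nested outline-like list.

    Assumption: plain strings stand in for Destination objects, and a
    list holds the children of the heading that precedes it.
    """
    pairs = []
    for item in nested:
        if isinstance(item, list):
            # A list belongs to the previous heading, one level deeper.
            pairs.extend(flatten_with_depth(item, depth + 1))
        else:
            pairs.append((depth, item))
    return pairs


# Hand-written stand-in for reader.outlines.
outlines = [
    'heading1',
    ['sub-heading11', ['sub-sub-heading111'], 'sub-heading12'],
    'heading2',
    'heading3',
    ['sub-heading31', 'sub-heading32'],
]

for depth, heading in flatten_with_depth(outlines):
    print('  ' * depth + heading)
```

This only recovers the depth of each heading. It does not record which heading is the parent of which, which is what the tree built in the next step adds.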
Construct the outline tree
def construct_outline_tree(outlines):
    # Find the destinations that sit directly in this list
    def find_destination(nested_dest):
        destinations = []
        for obj in nested_dest:
            if isinstance(obj, Destination):
                destinations.append(obj)
        return destinations

    # Generate the outline tree below parent
    def generate_tree(parent, nested_dest):
        destinations = find_destination(nested_dest)
        sub_nested_dict = {}
        for dest in destinations:
            new_node = DestinationNode(dest.title, parent=parent)
            parent.add_child(new_node)
            # A list right after a destination holds its subheadings
            next_idx = nested_dest.index(dest) + 1
            if (next_idx < len(nested_dest)
                    and isinstance(nested_dest[next_idx], list)):
                sub_nested_dict[new_node] = nested_dest[next_idx]
        for subroot, sub_nested in sub_nested_dict.iteritems():
            generate_tree(subroot, sub_nested)

    # Create a fake root as an entrance
    root = DestinationNode('Root')
    generate_tree(root, outlines)
    return root
Now the bookmarks can be accessed through the root node returned by the construct_outline_tree function.
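One way to use the returned root is a depth-first walk that prints each heading indented by its level. The sketch below is self-contained: it repeats DestinationNode, builds a small tree by hand in place of a real PDF, and relies on two hypothetical helpers, add_heading and iter_headings, that are not part of the original code.

```python
class DestinationNode(object):
    """Same class as above, repeated so the sketch runs on its own."""

    def __init__(self, heading, parent=None):
        self.heading = heading
        self.parent = parent
        self.children = []

    def add_child(self, child):
        self.children.append(child)


def add_heading(parent, heading):
    """Hypothetical helper: create a node and attach it to parent."""
    node = DestinationNode(heading, parent=parent)
    parent.add_child(node)
    return node


def iter_headings(node, depth=0):
    """Yield (depth, heading) for every descendant of node, depth-first."""
    for child in node.children:
        yield depth, child.heading
        for pair in iter_headings(child, depth + 1):
            yield pair


# Hand-built stand-in for the tree construct_outline_tree would return.
root = DestinationNode('Root')
h1 = add_heading(root, 'heading1')
add_heading(h1, 'sub-heading11')
add_heading(root, 'heading2')

for depth, heading in iter_headings(root):
    print('  ' * depth + heading)
```

The fake root is skipped on purpose, so top-level bookmarks come out at depth 0.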