Moving the Table of Contents in PDF Files
Sometimes I need to move the table of contents generated by Groff to the front of a document. In this post I share a simple approach I take to solving this.
As I’m sure is the case for many of you reading this, I have spent quite a lot of time writing some fairly long and complicated documents. The majority of these documents are technical descriptions of projects I’ve worked on, such as graphics processors, garbage collectors, network processors, and a sadly aborted ISA. One thing that all these documents have in common is that their production looks a lot like how we produce software.
My typesetter of choice is typically troff (pronounced tee-roff), which was part of Bell Labs’ DWB for Unix. Whilst perhaps somewhat anachronistic, troff lives on in the form of GNU’s groff, which still sees quite a bit of activity from its developers.
Quite often the production of these documents can be somewhat involved. It is not uncommon for these productions to have two pipelines: a series of pre-processors that output the language parsed by troff, and a series of post-processors that typically work with the PostScript output by troff. Many of the pre-processors are those that users of troff will be used to: pic, tbl, grap, and so on. Post-processors are usually fewer in number, and often relate to processing graphics and imagery. These days the PostScript output is typically rendered to PDF for distribution, and this rendering is typically performed by Ghostscript.
Typically, generating things like indices and tables of contents happens towards the end of processing the input sources, usually by a series of incantations at the end of the last input. This is often necessary, as the entire input needs to be parsed for us to gather up all the terms and headings. In normal invocations of programs like pdfroff, you can ask a macro package to move the table of contents to another position in the document. However, this is not always possible, especially in more complicated build processes that are broken into stages: with a staged build, the macro package may not be able to relocate a generated table of contents to a different location in the output. For an example of the discussion on this, see the section on positioning the TOC in the documentation for the mom macro package.
Whilst it is possible to relocate the table of contents in the generated PostScript, it is somewhat harder to rework the embedded PDF instructions that are used to produce the final output. So, I typically leave the indices at the end of the output and then perform a final post-processing step on the output PDF files to move the table of contents to just after the cover page.
To move the table of contents I typically use a short shell script with an accompanying AWK file. This script has largely remained unchanged for quite some time now, and primarily makes use of two tools:
- pdfgrep, which is used to find the start of the table of contents.
- pdftk, which is used to rearrange the pages and edit the bookmarks in the PDF.
The script and the sources for a demo PDF file can be found on GitHub.
The repository contains a Makefile that uses groff to create the demonstration PDF that is used in the remainder of this post. Running make in a clone of this repository will create a demo.pdf that demonstrates the table of contents being moved to just after the cover page.
The script starts with the usual shebang to invoke bash and then reads the standard input into a temporary file.
#!/bin/bash
# Read the PDF content from the standard input into a temporary file.
input=$(mktemp)
cat > "$input"
To move the table of contents from the end of a PDF document to just after the first page, the script uses pdftk, which has a rather useful cat command that lets you reorganize the pages of a PDF file using single pages and ranges of pages. For example, if we wanted to output the first page, followed by pages ten through fifteen, then pages two through nine, we could use the following command:
pdftk "$input" cat 1 10-15 2-9 output "$output"
To move the table of contents, the script is therefore going to need to know:
- On what page the table of contents starts.
- On what page the table of contents ends.
We’re also going to need to modify the bookmarks, but we’ll get to that later. To begin with, the script extracts various data from the PDF file using the dump_data command to the pdftk tool. It then splits the data into two files: one that contains lines that start with the text Bookmark, and another which contains all the other lines.
# Extract all the data from the input PDF
pdftk "$input" dump_data > "$input".data
# Extract just the bookmarks from the data into a separate file
grep "^Bookmark" "$input".data > "$input".bookmarks
grep -v "^Bookmark" "$input".data > "$input".not-bookmarks
Let’s take a look at the data that is not the bookmarks data.
InfoBegin
InfoKey: Creator
InfoValue: groff version 1.23.0
InfoBegin
InfoKey: Title
InfoValue: Demo PDF: Demonstrating TOC Relocation
...
NumberOfPages: 5
...
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595 842
PageMediaDimensions: 595 842
...
The data starts with some information keys that give the title of the PDF, the author’s name, and so on. There’s also some information about the page media dimensions. We can also see a NumberOfPages key that gives us the total number of pages in our PDF. The script needs this number to build the page ranges for the cat command of the pdftk tool. The script uses grep to extract the NumberOfPages field from the data and then uses AWK to get just the number.
$ cat "$input".not-bookmarks | grep "NumberOfPages:" | awk '{print $2}'
5
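The same extraction could also be done with AWK alone, since AWK can both match the line and print the field. A small sketch, run here against a stand-in for the dump_data output; the function name page_count is made up for this example and is not part of the original script.

```shell
# Hypothetical helper: print the value of the NumberOfPages key from
# pdftk dump_data output supplied on the standard input.
page_count() {
    awk '/^NumberOfPages:/ { print $2; exit }'
}

# A stand-in for a few lines of the real data.
printf 'InfoBegin\nInfoKey: Title\nNumberOfPages: 5\nPageMediaBegin\n' | page_count
```

Anchoring the pattern with ^ avoids matching the same text if it ever appeared inside another value.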
To find out where the table of contents starts, the script uses pdfgrep to search through the document and find the page on which the title "Table of Contents" is written. To do so, the script calls pdfgrep with the following options:
$ pdfgrep -m 1 -n "Table of Contents" "$input"
5: Table of Contents
The pdfgrep tool outputs the page on which it finds a match to the input, followed by a colon, followed by the matched text. We just want the page number, so we can use cut to extract the text before the colon:
$ pdfgrep -m 1 -n "Table of Contents" "$input" | cut -d: -f 1
5
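One thing to be aware of is that -m 1 stops at the first match, so a passing mention of the phrase in the body text would be picked up before the real heading. If the table of contents sits at the end of the document, one possible variation (an assumption on my part, not what the script does) is to take the last match instead. The sketch below works on simulated pdfgrep output so that it can be run without a PDF; the function name is hypothetical.

```shell
# Hypothetical helper: given pdfgrep -n style output ("page: text") on the
# standard input, print the page number of the last match.
last_match_page() {
    tail -n 1 | cut -d: -f 1
}

# Simulated pdfgrep output: a passing mention on page 2, the heading on page 5.
printf '2: see the Table of Contents\n5: Table of Contents\n' | last_match_page
```

In a real script the input would come from pdfgrep -n "Table of Contents" "$input", without the -m 1 option.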
The script can arrange these commands into a series of three variable assignments so that the values can be used later in our instructions to pdftk.
# Find out how many pages there are in the PDF.
pages=$(grep "NumberOfPages:" "$input".not-bookmarks | awk '{print $2}')
# Find out on which page the "Table of Contents" starts
toc_page=$(pdfgrep -m 1 -n "Table of Contents" "$input" | cut -d: -f 1)
toc_count=$((pages - toc_page + 1))
The script calculates the total number of pages in the table of contents by subtracting the page on which the table of contents heading was found from the number of pages in the PDF, then adding one. This assumes that the table of contents runs to the end of the document. The result is stored in the toc_count variable, which the script will use later.
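The arithmetic relies on pdfgrep actually having found the heading. A minimal sketch of the same calculation with a couple of defensive checks follows; the function name toc_page_count and the error messages are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): given the total number of
# pages and the page on which the TOC starts, print the number of TOC pages.
# Assumes the TOC runs from its first page to the end of the document.
toc_page_count() {
    local pages=$1 toc_page=$2
    # pdfgrep prints nothing when there is no match, so guard against that.
    case $toc_page in
        ''|*[!0-9]*) echo "no table of contents found" >&2; return 1 ;;
    esac
    # The TOC must come after the cover page and within the document.
    if [ "$toc_page" -lt 2 ] || [ "$toc_page" -gt "$pages" ]; then
        echo "unexpected TOC page: $toc_page" >&2
        return 1
    fi
    echo $((pages - toc_page + 1))
}

# The demo PDF has 5 pages and the TOC heading is on page 5.
toc_page_count 5 5
```

For the demo PDF this gives 5 - 5 + 1 = 1, a single page of contents.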
Next the script needs to rearrange the bookmarks that are stored in the PDF. If the script just uses the cat command to pdftk, we run into a problem: the bookmarks (the outline that PDF viewers show) will often be removed. That is fine in this case, as we want the script to change the bookmarks in a couple of ways anyway.
The shell script has already extracted the data from the PDF using the dump_data command to pdftk and stored any lines that start with the text Bookmark into a separate file. Let’s take a look at the contents of that file.
BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: 1.1. Random Thoughts
BookmarkLevel: 2
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: 2. Building the Document
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2.1. The Real Truth
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2.2. Options to Groff
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: Table of Contents
BookmarkLevel: 1
BookmarkPageNumber: 5
We can see that each bookmark starts with a line: BookmarkBegin. Each bookmark then contains a title (BookmarkTitle), the outline level (BookmarkLevel), and the page number (BookmarkPageNumber). The script needs to update these bookmarks so that the page numbers in the BookmarkPageNumber fields are all incremented to point to the new page locations of the corresponding bookmarks. This is because the script will be inserting $toc_count pages just after the cover page, so every bookmark that follows moves back by that many pages.
Something else we can see is that the table of contents has received its own bookmark. The script will get rid of that bookmark, as we’ve moved the table to the start of the document.
To make these adjustments, the script runs the bookmarks through AWK, using a small script to update them. The AWK script sees each line of the bookmarks data in turn. It gathers up all the lines in each bookmark and, if the bookmark is to be retained, writes them all to the standard output.
To invoke the AWK script, the shell script passes in the offset to add to each BookmarkPageNumber, which is the value held in the $toc_count variable. The script first feeds this value to AWK, and then the contents of the bookmarks it split from the PDF data. The AWK script processes this input and the updated bookmark data is written to a new file.
# Renumber all the bookmarks in the PDF, and remove the Table of Contents entry.
(echo "Offset: $toc_count"; cat "$input".bookmarks) | gawk -f toc-data.awk > "$input".bookmarks.modified
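Feeding the offset in as a header line works well enough. For comparison, a common alternative is AWK’s standard -v option, which sets a variable before the program runs. The sketch below is not what the original script does; it only renumbers page numbers, and the function name renumber is made up for the example.

```shell
# Sketch of an alternative (not what the original script does): pass the
# offset with awk's standard -v option instead of an "Offset:" header line.
renumber() {
    awk -v offset="$1" '
        # Add the offset to every bookmark page number.
        /^BookmarkPageNumber: [0-9]+/ { print $1 " " ($2 + offset); next }
        # Pass every other line through unchanged.
        { print }
    '
}

printf 'BookmarkBegin\nBookmarkTitle: 1. Introduction\nBookmarkLevel: 1\nBookmarkPageNumber: 2\n' | renumber 1
```

Note that a -v assignment happens before any BEGIN block runs, so an AWK file that resets offset in BEGIN would have to stop doing so for this approach to work.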
Let’s look now at this AWK script. To start with, the AWK script initialises some variables. The first variable okay will be non-zero if the bookmark can be retained. The second variable offset contains the value that the script wants to add to each BookmarkPageNumber.
# Initialisation of some variables.
BEGIN {
    # The `okay` flag will be non-zero if we're okay to output this bookmark.
    okay = 1
    # The `offset` variable contains the increment we need to add to the page numbers.
    offset = 0
}
As AWK runs through the lines of bookmarks it collects up the lines for each bookmark into an array. This script calls this array bookmark. As each line comes along the script adds it to this array. When AWK arrives at the next bookmark (or the end of the input) it writes out the lines it stored in the bookmark array, so long as this is not a bookmark that it wants to skip (i.e. the table of contents bookmark). To do this writing, the AWK script has an output() function:
# A function that prints out all the bookmark lines (so long as it's okay).
function output() {
    if (okay) {
        for (i = 1; i <= len; i++) {
            print bookmark[i]
        }
    }
}
Now let’s look at the lines that the script wants to match against for the different behaviours it needs to exhibit. Firstly, the script needs to get the offset to add to each bookmark’s page number. This was passed to AWK by the shell script in a line with an Offset: prefix. The AWK script can match against that and then skip the line so it does not make it into the bookmark array.
# Parse the page offset from the input (use `next` to skip this line).
/Offset: [0-9]+/ {
    offset = $2
    next
}
The AWK script needs to know when a new bookmark starts, so it matches against the BookmarkBegin line. If such a line is found then the input is starting a new bookmark, which means AWK needs to write out the lines held in the bookmark array (using our output function), and then reset the length of the bookmark. The script also sets okay back to 1. The AWK script doesn’t use next to skip these lines, so that they get added to the bookmark array.
# Parse the beginning of a bookmark, outputting any previous bookmark.
/BookmarkBegin/ {
    output()
    okay = 1
    len = 0
}
Next the script wants to match against the BookmarkPageNumber lines. These lines need to be added to the bookmark array, but they need to be modified by adding the value held in the offset variable to the page number. As AWK is going to modify the line, it stores the modified line in the bookmark array itself, and then uses next to skip to the next line.
# Match the bookmark page number record, but add the offset to the page number.
/BookmarkPageNumber: [0-9]+/ {
    # Add the line to the 'bookmark' array, but increment the page number by 'offset'
    bookmark[++len] = $1 " " ($2 + offset)
    next
}
The next line that the script wants to match against is the bookmark for the table of contents. AWK needs to elide this bookmark from the output, so when it finds a BookmarkTitle with the name Table of Contents it sets the okay variable to 0 so the output() function won’t write out the bookmark.
# Do not echo a bookmark for the TOC itself.
/BookmarkTitle: Table of Contents/ {
    okay = 0
}
Finally AWK can match against any line that either doesn’t match any of the rules above, or matches a rule that has not used next to skip the line. In this rule the script is just going to store the entire line in the bookmark array.
{
    # For all other lines, just add them to the bookmark
    bookmark[++len] = $0
}
And right at the end, the script needs to match against the end of the input and make sure that it calls the output() function so AWK always outputs the last bookmark.
# At the end of the input, output any remaining lines.
END {
    output()
}
Running this AWK file over the collection of bookmarks from earlier will add one to all the page numbers and remove the table of contents bookmark:
BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 1.1. Random Thoughts
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2. Building the Document
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: 2.1. The Real Truth
BookmarkLevel: 2
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: 2.2. Options to Groff
BookmarkLevel: 2
BookmarkPageNumber: 4
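To see the rules working together, the fragments can be assembled into one program and run against a couple of sample bookmarks. This is only a sketch: it inlines the AWK program in a shell function (renumber_bookmarks is a made-up name) instead of reading it from toc-data.awk, and it calls plain awk rather than gawk, on the assumption that nothing in the rules is specific to gawk.

```shell
# The AWK rules from the post, inlined in a shell function for a quick check.
renumber_bookmarks() {
    awk '
        BEGIN { okay = 1; offset = 0 }
        # Print the collected lines unless this bookmark is being dropped.
        function output(    i) {
            if (okay) for (i = 1; i <= len; i++) print bookmark[i]
        }
        /Offset: [0-9]+/ { offset = $2; next }
        /BookmarkBegin/ { output(); okay = 1; len = 0 }
        /BookmarkPageNumber: [0-9]+/ { bookmark[++len] = $1 " " ($2 + offset); next }
        /BookmarkTitle: Table of Contents/ { okay = 0 }
        { bookmark[++len] = $0 }
        END { output() }
    '
}

# An offset of 1, one ordinary bookmark, and the TOC bookmark.
renumber_bookmarks <<'EOF'
Offset: 1
BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: Table of Contents
BookmarkLevel: 1
BookmarkPageNumber: 5
EOF
```

Only the Introduction bookmark should come out, with its page number raised from 2 to 3.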
Now that we’ve taken a look at the AWK file that modifies the bookmarks we can go back to the shell script. The shell script takes the output of AWK and the non-bookmark data taken from the PDF and concatenates them together into a single file:
# Combine the modified bookmarks with the rest of the extracted PDF data.
cat "$input".not-bookmarks "$input".bookmarks.modified > "$input".data.modified
At this point the script can use pdftk with the cat command to rearrange the pages in the PDF. The script runs pdftk as follows:
# Rearrange the pages in the input PDF to move the TOC to immediately after the first page.
pdftk "$input" cat 1 $toc_page-$pages 2-$((toc_page - 1)) output "$input".rearranged
The arguments to cat are as follows:
- The number 1, telling pdftk to copy over the cover page unchanged.
- The page range $toc_page-$pages, which includes the first page of the table of contents and the last page of the document. Recall that we’re assuming the table of contents extends until the end of the PDF file.
- The rest of the PDF file, starting on page 2 and extending until the page before the table of contents.
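The three arguments can be sketched as a small function that prints the ranges for a given page count and table of contents page. The function name toc_cat_ranges is hypothetical, and the extra branch covers a case the original script does not handle: when the table of contents already starts on page 2, the last range would come out as 2-1, which is not a range of body pages.

```shell
# Hypothetical helper (not in the original script): print the page arguments
# for `pdftk ... cat`, given the page count and the first page of the TOC.
# Assumes the TOC runs to the end of the document, as the script does.
toc_cat_ranges() {
    local pages=$1 toc_page=$2
    if [ "$toc_page" -gt 2 ]; then
        # Cover page, then the TOC, then the body that used to precede it.
        echo "1 $toc_page-$pages 2-$((toc_page - 1))"
    else
        # The TOC already follows the cover page, so keep the order as it is.
        echo "1-$pages"
    fi
}

toc_cat_ranges 5 5    # the demo PDF: 5 pages, TOC on page 5
```

For the demo PDF this prints 1 5-5 2-4: the cover page, the single page of contents, then the three body pages.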
Next the script needs to take the modified PDF data, which includes the renumbered bookmarks, and update the information in the PDF. The script can use the update_info command to pdftk to do this.
# Now we need to update the bookmarks in the rearranged PDF.
pdftk "$input".rearranged update_info "$input".data.modified output "$input".output
With all the work completed, the script is now able to clean up after itself and output the rearranged PDF file to the standard output.
# Clean up the temporary files.
rm "$input"
rm "$input".data
rm "$input".data.modified
rm "$input".bookmarks
rm "$input".bookmarks.modified
rm "$input".not-bookmarks
rm "$input".rearranged
# Output the rearranged PDF, then remove the last temporary file.
cat "$input".output
rm "$input".output
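Removing each file by name works, but nothing is cleaned up if a command fails part-way through. One alternative, sketched below and not taken from the original script, is to keep every intermediate file in a single temporary directory and remove it with an EXIT trap. The function name and file names are illustrative, the cp is a stand-in for the real pdftk steps, and the directory name is printed to the standard error only so the behaviour can be checked.

```shell
#!/bin/bash
# Sketch of an alternative cleanup strategy (not the original script's).
# The function body runs in a subshell (note the parentheses), so the EXIT
# trap fires as soon as the function finishes.
process_in_workdir() (
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT
    # Read the standard input into the working directory.
    cat > "$workdir/input"
    # Stand-in for the real processing steps.
    cp "$workdir/input" "$workdir/output"
    # For demonstration only: report where the work happened.
    echo "$workdir" >&2
    # Emit the result; the trap then removes the whole directory.
    cat "$workdir/output"
)

printf 'stand-in for PDF data\n' | process_in_workdir 2>/dev/null
```

With this arrangement there is a single thing to remove, however many intermediate files the script creates.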
You can find these scripts and an example groff source file on GitHub.