Moving the Table of Contents in PDF Files

Sometimes I need to move the table of contents generated by Groff to the front of a document. In this post I share a simple approach I take to solving this.

4 Jul 2024

Pdf Groff Awk Bash

As I’m sure is the case for many of you reading this, I have spent quite a lot of time writing some fairly long and complicated documents. The majority of these documents are technical descriptions of projects I’ve worked on, such as graphics processors, garbage collectors, network processors, and a sadly aborted ISA. One thing that all these documents have in common is that their production looks a lot like how we produce software.

My typesetter of choice is typically troff (pronounced tee-roff), which was part of Bell Labs’ DWB for Unix. Whilst perhaps somewhat anachronistic, troff lives on in the form of GNU’s groff, which still sees quite a bit of activity from its developers.

Quite often the production of these documents can be somewhat involved. It is not uncommon for these productions to have two pipelines: a series of pre-processors that output the language parsed by troff, and a series of post-processors that typically work with the PostScript output by troff. Many of the pre-processors are those that users of troff will be used to: pic, tbl, grap, and so on. Post-processors are usually fewer in number, and often relate to processing graphics and imagery. The PostScript output is these days typically rendered to PDF for distribution, and this rendering is typically performed by Ghostscript.

Processing input sources into PDFs via PostScript

Typically, generating things like indices and tables of contents happens towards the end of processing the input sources, usually by a series of incantations at the end of the last input. This is often necessary, as the entire input needs to be parsed for us to gather up all the terms and headings. In normal invocations of programs like pdfroff, you can ask a macro package to move the table of contents to another position in the document. However, this is not always possible: when a more complicated build process is broken into stages, the macro package may be unable to relocate a generated table of contents to a different location in the output. As an example of the discussion on this, see the section on positioning the TOC in the documentation for the mom macro package.

Whilst it is possible to relocate the table of contents in the generated PostScript, it is somewhat harder to rework the embedded PDF instructions that are used to produce the final output. So, I typically leave the indices at the end of the output and then perform a final post-processing step on the output PDF files to move the table of contents to just after the cover page.

To move the table of contents I typically use a short shell script with an accompanying AWK file. This script has largely remained unchanged for quite some time now, and primarily makes use of two tools:

pdfgrep is used to find the start of the table of contents.
pdftk is used to rearrange the pages and edit the bookmarks in the PDF.

The script and the sources for a demo PDF file can be found on GitHub. The repository contains a Makefile that uses groff to create the demonstration PDF that is used in the remainder of this post. Running make in a clone of this repository will create a demo.pdf that demonstrates the table of contents being moved just after the cover page.

toc-rel: A script that moves a table of contents in a PDF

The script starts with the usual shebang to invoke bash and then reads the standard input into a temporary file.

#!/bin/bash

# Read the PDF content from the standard input into a temporary file.
input=$(mktemp)
cat > "$input"

To move the table of contents from the end of a PDF document to just after the first page, the script uses pdftk, which has a rather useful cat command that lets you reorganize the pages of a PDF file using single pages and ranges of pages. For example, if we wanted to output the first page, followed by pages ten through fifteen, then pages two through nine, we could use the following command:

pdftk "$input" cat 1 10-15 2-9 output "$output"

To move the table of contents, the script is therefore going to need to know:

On what page the table of contents starts,
On what page the table of contents ends.

We’re also going to need to modify the bookmarks, but we’ll get to that later. To begin with, the script extracts various data from the PDF file using pdftk’s dump_data command. It then splits the data into two files: one that contains the lines that start with the text Bookmark, and another that contains all the other lines.

# Extract all the data from the input PDF
pdftk "$input" dump_data > "$input".data

# Extract just the bookmarks from the data into a separate file
grep "^Bookmark" "$input".data > "$input".bookmarks
grep -v "^Bookmark" "$input".data > "$input".not-bookmarks

Let’s take a look at the data that is not the bookmarks data.

InfoBegin
InfoKey: Creator
InfoValue: groff version 1.23.0
InfoBegin
InfoKey: Title
InfoValue: Demo PDF: Demonstrating TOC Relocation
...
NumberOfPages: 5
...
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595 842
PageMediaDimensions: 595 842
...

The data starts with some information keys that give the title of the PDF, the author’s name, and so on. There’s also some information about the page media dimensions. We can also see a NumberOfPages key that gives us the total number of pages in our PDF. The script needs this number to build the page ranges for pdftk’s cat command. The script uses grep to extract the NumberOfPages field from the data and then uses AWK to get just the number.

$ cat "$input".not-bookmarks | grep "NumberOfPages:" | awk '{print $2}'
5
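
The grep and AWK pair can also be folded into a single AWK pattern. The following is a minimal sketch rather than what the script itself does; the function name page_count is hypothetical, and the sample input is a shortened copy of the data shown above.

```shell
#!/bin/bash

# Hypothetical helper: print the value of the NumberOfPages field found in
# `pdftk dump_data` output supplied on the standard input.
page_count() {
  awk '/^NumberOfPages:/ { print $2; exit }'
}

# Sample data, shortened from the dump shown earlier.
sample='InfoBegin
InfoKey: Creator
InfoValue: groff version 1.23.0
NumberOfPages: 5
PageMediaBegin
PageMediaNumber: 1'

printf '%s\n' "$sample" | page_count   # prints 5
```

Anchoring the pattern with ^ means a title or other value that happens to contain the text NumberOfPages: is not picked up by mistake.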

To find out where the table of contents starts, the script uses pdfgrep to search through the document and find the page on which the title "Table of Contents" is written. To do so, the script calls pdfgrep with the following options:

$ pdfgrep -m 1 -n "Table of Contents" "$input"
5: Table of Contents

The pdfgrep tool outputs the page on which it finds a match to the input, followed by a colon, followed by the matched text. We just want the page number, so we can use cut to extract the text before the colon:

$ pdfgrep -m 1 -n "Table of Contents" "$input" | cut -d: -f 1
5
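
If the matching line is already held in a shell variable, bash parameter expansion can do the same job as cut without starting another process. This is only an alternative sketch; the script itself uses cut, and the function name page_of_match is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: given one line of `pdfgrep -n` output, print the page
# number, which is everything before the first colon.
page_of_match() {
  local line=$1
  # ${line%%:*} removes the longest suffix that starts with a colon.
  printf '%s\n' "${line%%:*}"
}

page_of_match "5: Table of Contents"   # prints 5
```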

Note

For simplicity, we’re going to assume that the table of contents is going to be on the last few pages of the document. Of course, if you needed to use some other indicator, you could use a similar invocation of pdfgrep.

The script can arrange these commands into a series of three variable assignments so that the values can be used later in our instructions to pdftk.

# Find out how many pages there are in the PDF.
pages=$(grep "NumberOfPages:" "$input".not-bookmarks | awk '{print $2}')

# Find out on which page the "Table of Contents" starts
toc_page=$(pdfgrep -m 1 -n "Table of Contents" "$input" | cut -d: -f 1)
toc_count=$((pages - toc_page + 1))

The script calculates the number of pages in the table of contents by subtracting the page on which the table of contents heading was found from the number of pages in the PDF, then adding one. In the demo that is 5 - 5 + 1 = 1. The result is stored in the toc_count variable, which the script will use later.
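
pdfgrep prints nothing when it finds no match, in which case toc_page would be empty and the arithmetic would quietly produce the wrong answer. A guard along the following lines could catch that. It is a sketch only: the function name toc_page_count is hypothetical, and it keeps the assumption that the table of contents runs to the last page of the document.

```shell
#!/bin/bash

# Hypothetical helper: print the number of pages occupied by the table of
# contents, assuming it runs from toc_page to the last page of the document.
# Returns non-zero if either argument is not a positive integer, or if
# toc_page lies beyond the end of the document.
toc_page_count() {
  local pages=$1 toc_page=$2 value
  for value in "$pages" "$toc_page"; do
    case $value in
      # Reject empty values, zero or leading zeros, and anything non-numeric.
      ''|0*|*[!0-9]*) return 1 ;;
    esac
  done
  [ "$toc_page" -le "$pages" ] || return 1
  echo $((pages - toc_page + 1))
}

toc_page_count 5 5     # the demo document: prints 1
toc_page_count 12 10   # prints 3
```

The shell script could then stop early with a message when the function fails, rather than handing a malformed page range to pdftk.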

Next the script needs to rearrange the bookmarks that are stored in the PDF. If the script just uses pdftk’s cat command, we run into a problem: the bookmarks (the outline shown in PDF viewers) will often be removed. That is fine in this case, as we want the script to change the bookmarks in a couple of ways anyway.

The shell script has already extracted the data from the PDF using the dump_data command to pdftk and stored any lines that start with the text Bookmark into a separate file. Let’s take a look at the contents of that file.

BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: 1.1. Random Thoughts
BookmarkLevel: 2
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: 2. Building the Document
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2.1. The Real Truth
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2.2. Options to Groff
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: Table of Contents
BookmarkLevel: 1
BookmarkPageNumber: 5

We can see that each bookmark starts with a line: BookmarkBegin. Each bookmark then contains a title (BookmarkTitle), the outline level (BookmarkLevel), and the page number (BookmarkPageNumber). The script needs to update these bookmarks so that the page numbers in the BookmarkPageNumber fields are all incremented to point to the new page locations of the corresponding bookmarks. This is because the script will be inserting $toc_count pages at the start of the document, so the bookmarks will need to be adjusted as well.

Something else we can see is that the table of contents has received its own bookmark. The script will get rid of that bookmark, as we’ve moved the table to the start of the document.

To make these adjustments, the script runs the bookmarks through AWK using a small script that updates them. The AWK script sees each line of the bookmarks data in turn. It gathers up all the lines in each bookmark and, if the bookmark is to be retained, writes them all to the standard output.

To invoke the AWK script, the shell script passes in the offset to add to each BookmarkPageNumber, which is the value held in the $toc_count variable. The script first feeds this value to AWK, and then the contents of the bookmarks it split from the PDF data. The AWK script processes this input and the updated bookmark data is written to a new file.

# Renumber all the bookmarks in the PDF, and remove the Table of Contents entry.
(echo "Offset: $toc_count"; cat "$input".bookmarks) | gawk -f toc-data.awk > "$input".bookmarks.modified

Let’s look now at this AWK script. To start with, the AWK script initialises some variables. The first variable okay will be non-zero if the bookmark can be retained. The second variable offset contains the value that the script wants to add to each BookmarkPageNumber.

# Initialisation of some variables.
BEGIN {
  # The `okay` flag will be non-zero if we're okay to output this bookmark.
  okay = 1
  # The `offset` variable contains the increment we need to add to the page numbers.
  offset = 0
}

As AWK runs through the lines of bookmarks it collects up the lines for each bookmark into an array. This script calls this array bookmark. As each line comes along the script adds it to this array. When AWK arrives at the next bookmark (or the end of the input) it writes out the lines it stored in the bookmark array, so long as this is not a bookmark that it wants to skip (i.e. the table of contents bookmark). To do this writing, the AWK script has an output() function:

# A function that prints out all the bookmark lines (so long as it's okay).
function output() {
  if (okay) {
    for (i = 1; i <= len; i++) {
      print bookmark[i]
    }
  }
}

Now let’s look at the lines that the script wants to match against for the different behaviours it needs to exhibit. Firstly, the script needs to get the offset to add to each bookmark’s page number. This was passed to AWK by the shell script in a line with an Offset: prefix. The AWK script can match against that and then skip the line so it does not make it into the bookmark array.

# Parse the page offset from the input (use `next` to skip this line).
/Offset: [0-9]+/ {
  offset = $2
  next
}

The AWK script needs to know when a new bookmark starts, so it matches against the BookmarkBegin line. If such a line is found then the input is starting a new bookmark, which means AWK needs to write out the lines held in the bookmark array (using our output function), and then reset the length of the bookmark. The script also sets okay back to 1. The AWK script doesn’t use next to skip these lines, so that they get added to the bookmark array.

# Parse the beginning of a bookmark, outputting any previous bookmark.
/BookmarkBegin/ {
  output()
  okay = 1
  len = 0
}

Next the script wants to match against the BookmarkPageNumber lines. These lines need to be added to the bookmark array, but they need to be modified by adding the value held in the offset variable to the page number. As AWK is going to modify the line, it stores the modified line in the bookmark array itself, and then uses next to skip to the next line.

# Match the bookmark page number record, but add the offset to the page number.
/BookmarkPageNumber: [0-9]+/ {
  # Add the line to the 'bookmark' array, but increment the page number by 'offset'
  bookmark[++len] = $1 " " ($2 + offset)
  next
}

The next line that the script wants to match against is the bookmark for the table of contents. AWK needs to elide this bookmark from the output, so when it finds a BookmarkTitle with the name Table of Contents it sets the okay variable to 0 so the output() function won’t write out the bookmark.

# Do not echo a bookmark for the TOC itself.
/BookmarkTitle: Table of Contents/ {
  okay = 0
}

Finally AWK can match against any line that either doesn’t match all the rules above, or matches a rule that has not used next to skip the line. In this rule the script is just going to store the entire line in the bookmark array.

{
  # For all other lines, just add them to the bookmark
  bookmark[++len] = $0
}

And right at the end, the script needs to match against the end of the input and make sure that it calls the output() function so AWK always outputs the last bookmark.

# At the end of the input, output any remaining lines.
END {
  output()
}

Running this AWK file over the collection of bookmarks from earlier will add one to all the page numbers (the offset for the demo, whose table of contents occupies a single page) and remove the table of contents bookmark:

BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 1.1. Random Thoughts
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: 2. Building the Document
BookmarkLevel: 1
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: 2.1. The Real Truth
BookmarkLevel: 2
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: 2.2. Options to Groff
BookmarkLevel: 2
BookmarkPageNumber: 4
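
For comparison, the offset can also be handed to AWK with its -v option rather than as a leading Offset: line. The following is a condensed sketch of the same renumbering, not the toc-data.awk file from the repository: it buffers each bookmark in a string instead of an array, and the function name shift_bookmarks and its two arguments are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: renumber the bookmarks read from the standard input by
# the offset given as the first argument, and drop the bookmark whose title
# is given as the second argument.
shift_bookmarks() {
  awk -v offset="$1" -v skip="BookmarkTitle: $2" '
    # Print the buffered bookmark, unless it was marked to be dropped.
    function emit() { if (keep) printf "%s", record }
    BEGIN { keep = 1 }
    /^BookmarkBegin/ { emit(); record = ""; keep = 1 }
    $0 == skip { keep = 0 }
    /^BookmarkPageNumber: [0-9]+/ { $0 = $1 " " ($2 + offset) }
    { record = record $0 "\n" }
    END { emit() }
  '
}

# Two bookmarks taken from the demo data shown earlier.
sample='BookmarkBegin
BookmarkTitle: 1. Introduction
BookmarkLevel: 1
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: Table of Contents
BookmarkLevel: 1
BookmarkPageNumber: 5'

# Prints only the Introduction bookmark, now pointing at page 3.
printf '%s\n' "$sample" | shift_bookmarks 1 "Table of Contents"
```

Passing the values with -v keeps the input stream purely bookmark data, at the cost of AWK interpreting any backslash escapes in the assigned strings.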

Now that we’ve taken a look at the AWK file that modifies the bookmarks we can go back to the shell script. The shell script takes the output of AWK and the non-bookmark data taken from the PDF and concatenates them together into a single file:

# Combine the modified bookmarks with the rest of the extracted PDF data.
cat "$input".not-bookmarks "$input".bookmarks.modified > "$input".data.modified

At this point the script can use pdftk with the cat command to rearrange the pages in the PDF. The script runs pdftk as follows:

# Rearrange the pages in the input PDF to move the TOC to immediately after the first page.
pdftk "$input" cat 1 $toc_page-$pages 2-$((toc_page - 1)) output "$input".rearranged

The arguments to cat are as follows:

The number 1, telling pdftk to copy over the cover page unchanged.
The page range $toc_page-$pages, which runs from the first page of the table of contents to the last page of the document. Recall that we’re assuming the table of contents extends until the end of the PDF file.
The page range 2-$((toc_page - 1)), which is the rest of the PDF file, starting on page 2 and extending until the page before the table of contents.
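
One case these arguments do not cover is a table of contents that already starts on page 2: the last range would become 2-1, and as far as pdftk’s range syntax goes a descending range selects pages in reverse order. The sketch below builds the arguments with that case handled. The function name page_ranges is hypothetical, and both arguments are assumed to be valid page numbers.

```shell
#!/bin/bash

# Hypothetical helper: print the page arguments for `pdftk ... cat`, given the
# total number of pages and the page on which the table of contents starts.
page_ranges() {
  local pages=$1 toc_page=$2
  if [ "$toc_page" -le 2 ]; then
    # The contents already sit at the front, so keep the page order as it is.
    echo "1-$pages"
  else
    # Cover page, then the contents, then the body of the document.
    echo "1 $toc_page-$pages 2-$((toc_page - 1))"
  fi
}

page_ranges 5 5     # the demo document: prints "1 5-5 2-4"
page_ranges 15 10   # prints "1 10-15 2-9"
page_ranges 3 2     # prints "1-3"

# Possible use in the script; the substitution is left unquoted on purpose so
# that the shell splits it into separate arguments:
#   pdftk "$input" cat $(page_ranges "$pages" "$toc_page") output "$input".rearranged
```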

Next the script needs to take the modified PDF data, which includes the renumbered bookmarks, and update the information in the PDF. The script can use pdftk’s update_info command to do this.

# Now we need to update the bookmarks in the rearranged PDF.
pdftk "$input".rearranged update_info "$input".data.modified output "$input".output

With all the work completed, the script is now able to clean up after itself and output the rearranged PDF file to the standard output.

# Clean up the temporary files.
rm "$input"
rm "$input".data
rm "$input".data.modified
rm "$input".bookmarks
rm "$input".bookmarks.modified
rm "$input".not-bookmarks
rm "$input".rearranged

# Output the rearranged PDF, then remove the last temporary file.
cat "$input".output
rm "$input".output
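
A list of rm commands has to be kept in step with every temporary file the script creates, and it is skipped entirely if the script exits early. One alternative, sketched here on the assumption that all the scratch files live in a single directory created with mktemp -d, is to register the cleanup with trap so that it runs however the work ends. The names with_workdir and demo_step are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: run a command with a private scratch directory that is
# removed when the command finishes, whether it succeeds or fails. The path
# of the directory is passed to the command as its last argument.
with_workdir() {
  (
    workdir=$(mktemp -d) || exit 1
    # The EXIT trap fires when this subshell ends, for any reason.
    trap 'rm -rf "$workdir"' EXIT
    "$@" "$workdir"
  )
}

# A stand-in for the real work: create a file in the scratch directory and
# report which directory was used.
demo_step() {
  touch "$1/input.data"
  echo "$1"
}

used=$(with_workdir demo_step)
[ -d "$used" ] || echo "scratch directory removed"
```

With this arrangement the intermediate files would be written as "$workdir"/data, "$workdir"/bookmarks, and so on, and no individual rm lines are needed.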

You can find these scripts and an example groff source file on GitHub.