Optimizing Asciidoctor-Generated PDFs

[Image: Das Oel, an illustration of people pressing seeds to extract oil.]

After about 8 months of writing, the first drafts of the book are nearing completion — and boy are they massive.

The full 600+ page PDF straight from asciidoctor-pdf is ~611 MB, and I wanted to see what I could do about that.

I suspect that the files include native-sized images, rather than images scaled down to screen (96 dpi) or print (300 dpi) resolutions.

A side note: I’m not sure if there’s a command out there that could list every object in a PDF file as a tree structure, including its size in bytes, but that would be incredibly helpful. Even better if, for detected images, it could also show the effective dpi of each image at the dimensions it’s displayed at within the file.
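(For the image half of that wish, poppler’s pdfimages comes close, as far as I can tell: its -list mode prints each embedded image’s dimensions, encoding, compressed size, and the effective ppi at the size it’s placed on the page. A rough sketch, assuming poppler-utils is installed; the file name is just a placeholder:)

# list every embedded image with its width/height, size, and effective x-/y-ppi
pdfimages -list book.pdf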

In any case, there are a few options for optimizing PDFs generated by Prawn, the Ruby PDF library that asciidoctor-pdf uses under the hood.

Methods

Here are some numbers I saw when trying a few different methods.

Method #1 uses Ghostscript to optimize the PDF; it downsamples image content to 300 dpi.

Method #2 uses hexapdf to optimize the PDF. It does not rescale image content, but it garbage-collects the data structures that make up the file, removing unreachable and unused objects. Optionally, it can also compress the page contents, text streams, and so on.

Commands

These commands are suggested by the asciidoctor-pdf project, which I’m using to render my book:

optimize-pdf.sh A.pdf

(The optimize-pdf.sh script is located in the asciidoctor-pdf GitHub repository.)

It expands to:

"$GS" -q -dNOPAUSE -dBATCH -dSAFER -dNOOUTERSAVE \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  -dPrinted=false \
  -dCannotEmbedFontPolicy=/Warning \
  -dDownsampleColorImages=$DOWNSAMPLE_IMAGES \
  -dColorImageResolution=$IMAGE_DPI \
  -dDownsampleGrayImages=$DOWNSAMPLE_IMAGES \
  -dGrayImageResolution=$IMAGE_DPI \
  -dDownsampleMonoImages=$DOWNSAMPLE_IMAGES \
  -dMonoImageResolution=$IMAGE_DPI \
  -sOutputFile="$FILE_OPTIMIZED" \
  "$FILE" $FILE_PDFMARK

The other commands are:

hexapdf optimize --force A.pdf A-no-compress.pdf

And:

hexapdf optimize --force --compress A.pdf A-compress.pdf

(You have to install hexapdf first by running gem install hexapdf.)

Results

I ran the above commands on two sample PDF files, let’s call them A.pdf (a ~20 page excerpt) and B.pdf (the full book) — and here’s what happened:

                       A.pdf                       B.pdf
Original               15,078,643 bytes            611,086,363 bytes
Ghostscript            1,801,272 bytes (1.4s)      236,357,225 bytes (1m36.8s)
hexapdf (no compress)  14,243,384 bytes (13.2s)    552,383,221 bytes (4.4s)
hexapdf (compress)     14,241,612 bytes (13.7s)    552,264,201 bytes (28.2s)

There’s some weird stuff in these results.

Ghostscript performs far and away the best on both files, but the amount of time it takes varies substantially. With the smaller excerpt, it completes its optimization about 10x faster than hexapdf and produces a file nearly 8x smaller than hexapdf’s output.

But when processing the much larger, book-length PDF, Ghostscript takes a lot longer to crunch things down. That makes sense: the book has a ton of high-resolution images in it, and each one has to be decoded and downsampled.

What makes no sense, though, is that the no-compress run of hexapdf takes less time on the much larger file than it did on the excerpt. I’m not really sure what it’s doing there.

There’s almost no difference in size between the no-compress and compress versions of the hexapdf output, so the extra processing time it adds isn’t worth it. The help text for hexapdf optimize (hexapdf help optimize) mentions this as well.

Conclusion

Sticking with Ghostscript makes the most sense if you expect the PDF to be read on devices up to about 300 dpi, which covers pretty much every tablet ever made.
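(And if a copy only ever needs to look good on lower-density screens, it should be possible to shrink things further by lowering the three *ImageResolution values in the Ghostscript call above, say to 150. I haven’t measured this variant, so treat it as a sketch:)

# untested variant of the call above, targeting 150 dpi instead of 300
gs -q -dNOPAUSE -dBATCH -dSAFER -dNOOUTERSAVE \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  -dPrinted=false \
  -dCannotEmbedFontPolicy=/Warning \
  -dDownsampleColorImages=true \
  -dColorImageResolution=150 \
  -dDownsampleGrayImages=true \
  -dGrayImageResolution=150 \
  -dDownsampleMonoImages=true \
  -dMonoImageResolution=150 \
  -sOutputFile=B-screen.pdf \
  B.pdf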