After about 8 months of writing, the first drafts of the book are nearing completion — and boy are they massive.
The full 600+ page PDF straight from asciidoctor-pdf is ~611MB in size, and I wanted to see what I could do about this.
I suspect that the files include native-sized images, rather than images scaled down to screen (96dpi) or print (300dpi) resolutions.
A side note: I’m not sure if there’s a command out there that could list every object in a PDF file as a tree structure, including its size in bytes, but that would be incredibly helpful. Even more so if, for detected images, it could show the effective dpi of the image at the dimensions it’s displayed at within the file.
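Lacking such a tool, a crude approximation is easy to script. The sketch below (my own addition, not a real PDF parser) scans a file’s raw bytes for "N G obj … endobj" spans and reports each object’s approximate size. It won’t see objects packed inside compressed object streams, and it says nothing about image dpi, but it’s usually enough to spot the heaviest objects.

```python
import re

def list_pdf_objects(data):
    # Scan raw PDF bytes for "N G obj ... endobj" spans and return
    # (object number, approximate size in bytes) pairs. This is a crude
    # byte-level scan, not a real parser: it misses objects stored
    # inside compressed object streams and ignores image dpi entirely.
    pattern = re.compile(rb"(\d+)\s+(\d+)\s+obj\b.*?endobj", re.DOTALL)
    return [(int(m.group(1)), m.end() - m.start())
            for m in pattern.finditer(data)]

# Usage: point it at the generated book and sort by size to find the
# heaviest objects (usually the images):
# objects = list_pdf_objects(open("book.pdf", "rb").read())
# for num, size in sorted(objects, key=lambda t: -t[1])[:20]:
#     print(num, size)
```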
In any case, there are a few options for optimizing PDFs generated by the Prawn PDF library for Ruby, which is used by Asciidoctor as its underlying PDF library.
Here are some numbers I saw when trying a few different methods.
Method #1 uses Ghostscript to optimize the PDF, which rescales image content to 300 dpi resolution.
Method #2 uses hexapdf to optimize the PDF, which does not rescale image content, but garbage collects the data structures that make up the file to remove unreachable and unused information. Optionally, it can compress the page contents, text, etc.
These commands are suggested by the asciidoctor-pdf project, which I’m using to render my book:
(The optimize-pdf.sh script is located in the asciidoctor-pdf GitHub repository.)
It expands to:
"$GS" -q -dNOPAUSE -dBATCH -dSAFER -dNOOUTERSAVE \ -sDEVICE=pdfwrite \ -dPDFSETTINGS=/prepress \ -dPrinted=false \ -dCannotEmbedFontPolicy=/Warning \ -dDownsampleColorImages=$DOWNSAMPLE_IMAGES \ -dColorImageResolution=$IMAGE_DPI \ -dDownsampleGrayImages=$DOWNSAMPLE_IMAGES \ -dGrayImageResolution=$IMAGE_DPI \ -dDownsampleMonoImages=$DOWNSAMPLE_IMAGES \ -dMonoImageResolution=$IMAGE_DPI \ -sOutputFile="$FILE_OPTIMIZED" \ "$FILE" $FILE_PDFMARK
The other commands are:
hexapdf optimize --force A.pdf A-no-compress.pdf
hexapdf optimize --force --compress A.pdf A-compress.pdf
(You have to install hexapdf first by running gem install hexapdf.)
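To get comparable numbers, I timed each command and checked the size of the output it produced. A tiny harness along these lines (a sketch of the approach, not anything shipped by either project) does the job:

```python
import os
import subprocess
import time

def run_and_measure(cmd, outfile):
    # Run one optimizer invocation, then report the wall-clock time it
    # took and the size in bytes of the file it produced.
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(outfile)

# Usage, with one of the hexapdf invocations from above:
# print(run_and_measure(
#     ["hexapdf", "optimize", "--force", "A.pdf", "A-no-compress.pdf"],
#     "A-no-compress.pdf"))
```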
I ran the above commands on two sample PDF files, which I’ll call A.pdf (a ~20-page excerpt) and B.pdf (the full book) — and here’s what happened:
There’s some weird stuff in these results.
Ghostscript performs far and away the best with both files, but the amount of time it takes to do so varies substantially. With the smaller excerpt, it completes its optimization 10x faster than hexapdf, and produces a file about 7x smaller.
But when processing the much larger, book-length PDF, Ghostscript takes a lot longer to crunch things down. This makes sense: the book has a ton of high-resolution images in it.
What makes no sense, though, is how the no-compress version of hexapdf takes less time to optimize the much larger file than it did for the excerpt. I’m not really sure what it’s doing there.
There’s almost no difference in size between the no-compress and compress versions of the hexapdf output, so for the amount of processing time it adds, it’s not worth it. This is also mentioned in the hexapdf help optimize command-line help text.
Sticking to Ghostscript makes the most sense if you’re expecting the PDF to be read on any device up to about 300dpi, which includes pretty much every tablet ever made.
Featured image: “Das Öl”, Örtels Lesebuch, 32 Bilder, 2te Abtg. (Source: Library of Congress)