PDFs should be reproducible
Some workflows involve creating PDFs from arbitrary source data. It can be useful to create a one-to-one relationship between the source and the result, for change tracking and reproducibility. In other words: if the source data doesn't change then the built PDF doesn't change and is byte-for-byte identical.
It's also a good idea to remove metadata for privacy reasons, but that's a separate topic.
The first step is a general metadata-removal pass, which removes the fields that change on every build, like ModDate, with an added bonus of nuking Author, and even PTEX.Fullbanner (added by pdfTeX).
exiftool -overwrite_original -all= file.pdf
Unfortunately this is not enough. The warning from exiftool is worth paying attention to:
ExifTool PDF edits are reversible. Deleted tags may be recovered!
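One rough way to see what this warning means is to search the raw bytes of the file for the tag names. The sketch below assumes the old Info dictionary is stored uncompressed, which is not guaranteed, and leftover_tags is a hypothetical helper name, not part of exiftool.

```shell
# Hypothetical helper: list metadata tag names that still appear
# in the raw bytes of a PDF.
#   -a  treat the binary file as text
#   -o  print only the matching part of each line
#   -E  use extended regular expressions
leftover_tags() {
    grep -a -o -E '/(Author|ModDate|PTEX\.Fullbanner)' "$1" | sort -u
}
```

Calling leftover_tags file.pdf after the exiftool pass prints one line per tag name found. Any output means those names are still present in the bytes of the file, even though a PDF viewer no longer shows them.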
The second step is to linearize the PDF. Linearization is a PDF optimisation that enables the PDF to be streamed one page at a time by reorganising its internals so that each page is self-contained. (Non-linearized PDFs spread the information associated with each page across the entire file.)
This has the desired side effect of fully removing the metadata: when the optimisation process encounters a metadata tag followed by an instruction to hide that tag, it knows it can omit the tag entirely.
qpdf --linearize --replace-input file.pdf
Unfortunately, once again, this is not enough. One tag remains that is not considered metadata but is still regenerated each time the PDF is built: the ID tag.
#Remove the ID tag
The ID tag is optional when encryption is not used, but the recommendation is to keep it for maximum compatibility. If it can't be removed, then at least it can be set to a static value.
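Either use something arbitrary like all 0s, or generate an ID, for example by MD5 hashing a phrase. Keep the new ID the same length as the one it replaces, so that the byte offsets recorded inside the file stay valid. Bear in mind that reusing the same ID everywhere creates a relationship between all generated PDFs.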
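As a minimal sketch of the hashing option, assuming GNU coreutils md5sum is available: the phrase is a made-up example, and any fixed string would do.

```shell
# Hypothetical phrase; replace it with any fixed string.
PHRASE='reproducible pdf'

# printf avoids the trailing newline that echo would add to the hashed input.
# md5sum prints "<hash>  -", so cut keeps only the first field.
ID=$(printf '%s' "$PHRASE" | md5sum | cut -d ' ' -f 1)

echo "$ID"
```

An MD5 digest is 32 hexadecimal characters, the same length as the all-zero ID.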
For the last transformation nothing more is needed than a simple line of sed:
ID=00000000000000000000000000000000
sed -r -i "s|/ID \[<[0-9a-f]+><[0-9a-f]+>]|/ID [<$ID><$ID>]|" file.pdf
It may look like there is a chance that this replacement matches something in the text contents, but it won't: linearization will already have removed all plain text representations.
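To try the substitution without a real PDF, it can be run against a stand-in trailer line. This is only a sketch: the two hex strings are made-up sample values, GNU sed is assumed (for -r and -i), and a real PDF carries far more around the ID than this one line.

```shell
# Stand-in for the part of a PDF trailer that carries the ID.
sample=$(mktemp)
printf '%s\n' 'trailer << /Size 12 /ID [<0123456789abcdef0123456789abcdef><fedcba9876543210fedcba9876543210>] >>' > "$sample"
size_before=$(wc -c < "$sample")

# 32 zeros, matching the length of the sample IDs.
ID=$(printf '%032d' 0)
sed -r -i "s|/ID \[<[0-9a-f]+><[0-9a-f]+>]|/ID [<$ID><$ID>]|" "$sample"

size_after=$(wc -c < "$sample")
cat "$sample"

# An unchanged size means byte offsets recorded elsewhere in a real PDF
# would still point at the right places.
if [ "$size_before" -eq "$size_after" ]; then
    echo "size unchanged"
fi
```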
#Put it all together
Combine the commands above into a post-processing script and run it each time your PDF is generated, and there should be no change to the result if the source data didn't change.
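A post-processing script along these lines might look like the sketch below. The function names make_reproducible and same_build are hypothetical. The three commands are the ones used in the steps above, and GNU sed is assumed.

```shell
# Static ID; it should be the same length as the ID it replaces.
ID=00000000000000000000000000000000

# Run the three passes in order on the PDF given as $1.
# The && chain stops at the first step that returns a non-zero status.
make_reproducible() {
    exiftool -overwrite_original -all= "$1" &&
    qpdf --linearize --replace-input "$1" &&
    sed -r -i "s|/ID \[<[0-9a-f]+><[0-9a-f]+>]|/ID [<$ID><$ID>]|" "$1"
}

# Succeeds only when two builds are byte-for-byte identical.
same_build() {
    cmp -s "$1" "$2"
}
```

After building the same source twice and calling make_reproducible on each result, same_build first.pdf second.pdf should succeed. If it fails, something other than the metadata and the ID differs between builds.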