PDFs should be reproducible
Some workflows involve creating PDFs from arbitrary source data. It can be useful to create a one-to-one relationship between the source and the result, for change tracking and reproducibility. In other words: if the source data doesn't change then the built PDF doesn't change and is byte-for-byte identical.
It's also a good idea to remove metadata for privacy reasons, but that's a separate topic.
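The byte-for-byte requirement is easy to check by building the same source twice and comparing the results. Here is a minimal sketch; same_bytes is a made-up helper name and the file names in the usage line are placeholders.

```shell
# Sketch: report whether two builds of the same source are byte-for-byte identical.
# same_bytes is a hypothetical helper name; cmp is a standard POSIX tool.
same_bytes() {
    # cmp -s prints nothing and exits 0 only when the two files are identical
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Usage (hypothetical file names): same_bytes build1/file.pdf build2/file.pdf
```

Comparing checksums (for example with sha256sum) would work just as well; cmp is used here only because it needs no extra parsing.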
#Remove metadata
The first step is a general metadata-removal pass. It removes the fields that change on every build, such as CreationDate and ModDate, with the added bonus of nuking Producer, Author, and even PTEX.Fullbanner (added by pdfTeX).
exiftool -overwrite_original -all= file.pdf
Unfortunately this is not enough. The warning that exiftool prints is worth paying attention to:
ExifTool PDF edits are reversible. Deleted tags may be recovered!
In an old topic on the ExifTool forums, ExifTool's author explains that removing the data is harder than expected and recommends linearizing the PDF after "removing" the metadata with ExifTool.
#Linearize
Linearization is a PDF optimisation. It enables the PDF to be streamed one page at a time by reorganising its internals so that each page is self-contained. (Non-linearized PDFs store information associated with each page across the entire file.)
This has the desired side effect of fully removing the metadata: when the optimisation process encounters a metadata tag followed by an instruction to hide that tag, it knows it can omit the tag entirely.
qpdf --linearize --replace-input file.pdf
Unfortunately, once again, this is not enough. One tag remains that is not considered metadata but is still regenerated each time the PDF is built: ID.
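To see the value in question, the ID entry can be pulled straight out of the file. This is a rough sketch: show_pdf_id is a made-up name, and it assumes a grep that supports -a and -o and a file that writes the entry as /ID [<hex><hex>], the same shape the sed command in this post relies on.

```shell
# Sketch: print the /ID entries found in a PDF (show_pdf_id is a hypothetical name).
# -a treats the binary file as text, -o prints only the matching part of each line.
show_pdf_id() {
    grep -a -o '/ID \[<[0-9a-fA-F]*><[0-9a-fA-F]*>]' "$1"
}

# Usage (hypothetical file name): show_pdf_id file.pdf
```

Running it on two builds of the same source should show the differing values. A file may print more than one line if the entry is stored in more than one place.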
#Remove the ID tag
The ID tag is optional when encryption is not used, but the recommendation is to keep it for maximum compatibility. Since it shouldn't be removed, it can at least be set to a static value.
Either use something arbitrary like 0s, or generate an ID, for example by MD5 hashing a phrase. Bear in mind that using the same ID everywhere creates a link between all the PDFs generated with it.
For this last transformation, nothing more is needed than a single line of sed.
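# Static replacement ID: 32 hex digits (16 bytes), the usual length of a PDF ID
ID=00000000000000000000000000000000
# Overwrite both halves of the /ID entry in place (GNU sed: -r extended regex, -i edit in place)
sed -r -i "s|/ID \[<[0-9a-f]+><[0-9a-f]+>]|/ID [<$ID><$ID>]|" file.pdf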
It may look as though this replacement could match something in the text contents, but it won't: after the qpdf pass the page contents are normally stored in compressed streams, so no plain text representation of them is left to match.
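For the "hash a phrase" option, something like the following could work. It is only a sketch: make_id and set_pdf_id are made-up names, the phrase is arbitrary, and it assumes GNU md5sum and GNU sed. The LC_ALL=C prefix and the uppercase hex range are my own precautions, not part of the one-liner above.

```shell
# Sketch: derive a fixed ID from a phrase and write it into a PDF.
# make_id and set_pdf_id are hypothetical helper names.
make_id() {
    # MD5 of the phrase, printed as 32 lowercase hex digits
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

set_pdf_id() {
    # $1 = path to the PDF, $2 = hex ID to write into both halves of /ID
    # LC_ALL=C asks sed to treat the file as raw bytes, not locale-encoded text
    LC_ALL=C sed -r -i "s|/ID \[<[0-9a-fA-F]+><[0-9a-fA-F]+>]|/ID [<$2><$2>]|" "$1"
}

# Usage (hypothetical phrase and file name):
#   set_pdf_id file.pdf "$(make_id 'my build pipeline')"
```

An MD5 hash is 32 hex digits, so it should be the same length as the ID it replaces as long as the existing ID is also 16 bytes. That matters because a replacement of a different length would shift every byte offset after it in the file.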
#Put it all together
Combine the commands above into a post-processing script and run it each time your PDF is generated, and there should be no change to the result if the source data didn't change.
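As a rough sketch, the post-processing step could be wrapped in a single shell function. The name make_reproducible and the file names in the usage lines are placeholders, error handling is left out for brevity, and exiftool, qpdf and GNU sed are assumed to be installed.

```shell
# Sketch of the post-processing step; make_reproducible is a hypothetical name.
make_reproducible() {
    pdf=$1
    id=00000000000000000000000000000000

    # 1. Strip metadata (reversible on its own, hence step 2)
    exiftool -overwrite_original -all= "$pdf"
    # 2. Linearize, which also drops the "deleted" metadata for good
    qpdf --linearize --replace-input "$pdf"
    # 3. Pin the ID to a static value
    sed -r -i "s|/ID \[<[0-9a-f]+><[0-9a-f]+>]|/ID [<$id><$id>]|" "$pdf"
}

# Usage (hypothetical file name):
#   make_reproducible file.pdf
#   sha256sum file.pdf    # should stay the same across rebuilds of unchanged source
```

The order matters: the ID has to be replaced last, because the qpdf step writes a fresh one.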