Compare PDF Content With Ruby
I am in the process of writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want it to have is that it should run pdflatex iteratively until the PDF converges (as it should, I guess).
The idea is to compare the PDF generated in one iteration against the one from the previous iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).
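A minimal sketch of such a loop, to make the idea concrete. The method name `run_until_stable`, its keyword arguments, the `max_runs` limit and the file name `doc.tex` are made up for illustration; the build step and the fingerprint are passed in as callables so that the comparison strategy can be swapped later.

```ruby
require 'digest'

# Calls `build` repeatedly until `fingerprint` returns the same value
# after two consecutive runs. Returns the number of runs that were
# needed, or nil if the output did not settle within `max_runs`.
# Both callables are hypothetical placeholders supplied by the caller.
def run_until_stable(build:, fingerprint:, max_runs: 5)
  previous = nil
  1.upto(max_runs) do |run|
    build.call
    current = fingerprint.call
    return run if current == previous
    previous = current
  end
  nil
end

# Possible usage, assuming pdflatex is installed and doc.tex exists:
#
#   run_until_stable(
#     build:       -> { system('pdflatex', '-interaction=nonstopmode', 'doc.tex') },
#     fingerprint: -> { Digest::MD5.file('doc.pdf').hexdigest }
#   )
```

The `max_runs` cap is there so that a document that never settles does not loop forever.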
The problem now is that this never converges. The (hopefully) culprit is the PDF's timestamp, which pdflatex sets at least to the second. Since runs of pdflatex typically take longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints are appreciated.
What can I do about this? My basic ideas so far:
- Use a library capable of doing the job
- Strip the metadata away and hash only the PDF content
- Overwrite the timestamps with a fixed value before comparing
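The third idea could look roughly like the sketch below. It assumes that the dates are stored uncompressed as `/CreationDate (D:...)` and `/ModDate (D:...)` entries and that the trailer carries an `/ID [<hex> <hex>]` entry that also changes between runs; whether that holds depends on the PDF producer. The method name `normalized_md5` and the placeholder values are made up for illustration.

```ruby
require 'digest'

# Patterns for the entries assumed to change between otherwise
# identical runs. These are assumptions about the producer's output,
# not a complete treatment of PDF metadata.
DATE_ENTRY = %r{/(CreationDate|ModDate)\s*\(D:[^)]*\)}
ID_ENTRY   = %r{/ID\s*\[\s*<\h*>\s*<\h*>\s*\]}

# Hashes the file after replacing the volatile entries with fixed
# placeholder values. The file on disk is left untouched.
def normalized_md5(path)
  data = File.binread(path)
  data = data.gsub(DATE_ENTRY) { "/#{Regexp.last_match(1)} (D:19700101000000Z)" }
  data = data.gsub(ID_ENTRY, '/ID [<00> <00>]')
  Digest::MD5.hexdigest(data)
end
```

Working on a copy in memory keeps the generated PDF itself intact, so the real timestamps survive in the final document.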
Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Ones that use only Ruby are preferred, but using external software is perfectly acceptable.
By the way, I do not know exactly how PDFs are encoded, but I suspect that merely comparing the contained text won't work for me, since only graphics or links might change in later iterations.
Possibly related:
- How to compare two PDF files? (messy, text-based or proprietary solutions)
- Functional PDF testing (uses a Java library; not clear whether it does the job)
This is not a bullet-proof solution, but it works for me:
cat file.pdf | grep -a -v "/CreationDate\|/ModDate\|/ID" | md5sum
or, in Ruby:
`cat file.pdf | grep -a -v "/CreationDate\\|/ModDate\\|/ID" | md5sum`.strip
This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ. YMMV, depending on your PDF creator. Use diff -a file1.pdf file2.pdf to find out which other lines you need to drop.
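If shelling out is not wanted, the same idea can be sketched in plain Ruby. This is a rough equivalent rather than a drop-in replacement: the digest is not guaranteed to equal the one printed by the md5sum pipeline, and the method name `pdf_fingerprint` and the constant `VOLATILE` are made up for illustration. As with the grep version, the list of keys is an assumption that may need extending for other PDF creators.

```ruby
require 'digest'

# Lines containing any of these keys are skipped, mirroring the
# grep -v pattern. Extend the list if diff -a shows other lines.
VOLATILE = %r{/CreationDate|/ModDate|/ID}

# Returns the MD5 hex digest of the file with the volatile lines left out.
def pdf_fingerprint(path)
  digest = Digest::MD5.new
  # Read in binary mode, since PDFs contain arbitrary bytes.
  File.open(path, 'rb') do |file|
    file.each_line do |line|
      digest << line unless line.match?(VOLATILE)
    end
  end
  digest.hexdigest
end
```

Two runs can then be compared with pdf_fingerprint('old.pdf') == pdf_fingerprint('new.pdf'), where the file names are placeholders.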