Compare PDF Content With Ruby


I am in the process of writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want it to have is that it should run pdflatex iteratively until the PDF converges (as it should, I guess).

The idea is to compare the PDF generated in one iteration against the one from the previous iteration using their fingerprints. In particular, I currently use Digest::MD5.file(.).
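
For illustration, the loop described above might look roughly like the following. This is only a minimal sketch: the helper name run_until_stable, the max_runs limit, and the file name paper.tex in the usage comment are assumptions made for the example, not part of the actual script.

```ruby
require "digest"

# Hypothetical helper: call the block repeatedly until it returns the same
# fingerprint twice in a row. The block is expected to perform one build
# step and return a fingerprint of the result.
# Returns the number of runs needed, or nil if max_runs was not enough.
def run_until_stable(max_runs: 5)
  previous = nil
  max_runs.times do |i|
    current = yield
    return i + 1 if current == previous
    previous = current
  end
  nil
end

# Usage sketch (not executed here; paper.tex is a placeholder name):
#   runs = run_until_stable do
#     system("pdflatex", "-interaction=nonstopmode", "paper.tex") or raise "pdflatex failed"
#     Digest::MD5.file("paper.pdf").hexdigest
#   end
#   puts runs ? "converged after #{runs} runs" : "did not converge"
```

With a plain MD5 of the whole file as the fingerprint, a loop like this only ends through the max_runs limit whenever the file contains anything that changes on every run.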

The problem is that this never converges. A (hopefully the) culprit is the PDF's timestamp, which pdflatex sets with a resolution of at least one second. Since runs of pdflatex typically take longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints are appreciated.

What can I do about this? My basic ideas so far:

  • Use a library capable of doing the job
  • Strip the metadata away and hash only the PDF content
  • Overwrite the timestamps with a fixed value before comparing

Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Ones that only use Ruby are preferred, but using external software is perfectly acceptable.

By the way, I do not know exactly how PDF is encoded, but I suspect that merely comparing the contained text won't work for me, since only graphics or links might change in later iterations.


This is not a bullet-proof solution, but it works for me:

cat file.pdf | grep -a -v "/CreationDate\|/ModDate\|/ID" | md5sum

or, in Ruby:

`cat file.pdf | grep -a -v "/CreationDate\\|/ModDate\\|/ID" | md5sum`.strip

This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ.
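
If shelling out is not desired, the same idea can probably be done in plain Ruby. The following is a minimal sketch, not a tested replacement for the pipeline: the names VOLATILE_KEYS, stable_pdf_digest_of and stable_pdf_digest are made up for this example, and the digest is only meant to be stable across runs, not identical to what md5sum prints for the grep output.

```ruby
require "digest"

# Lines containing any of these keys are dropped before hashing
# (the same list as in the grep call).
VOLATILE_KEYS = %r{/CreationDate|/ModDate|/ID}

# Hash the given PDF bytes, ignoring every line that mentions a volatile key.
def stable_pdf_digest_of(data)
  kept = data.b.each_line.reject { |line| line.match?(VOLATILE_KEYS) }
  Digest::MD5.hexdigest(kept.join)
end

# Convenience wrapper: read the file in binary mode and hash it as above.
def stable_pdf_digest(path)
  stable_pdf_digest_of(File.binread(path))
end

# Usage sketch (file names are placeholders):
#   converged = stable_pdf_digest("old.pdf") == stable_pdf_digest("new.pdf")
```

Reading in binary mode matters here, because a PDF usually contains bytes that are not valid text in any particular encoding.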

YMMV, depending on your PDF creator. Use diff -a file1.pdf file2.pdf to find out which other lines you need to drop.
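
A rough Ruby stand-in for that inspection step could look like the sketch below. The helper name differing_lines is hypothetical, and the comparison is purely positional, so unlike diff -a it does not realign after an inserted or deleted line. It is only useful when both files have the same line structure, which is the situation assumed here.

```ruby
# Hypothetical helper: pair up the lines of two PDFs by position and return
# [line_number, line_from_a, line_from_b] for every pair that differs.
def differing_lines(data_a, data_b)
  lines_a = data_a.b.lines
  lines_b = data_b.b.lines
  count = [lines_a.size, lines_b.size].max
  (0...count)
    .reject { |i| lines_a[i] == lines_b[i] }
    .map { |i| [i + 1, lines_a[i], lines_b[i]] }
end

# Usage sketch (file names are placeholders):
#   differing_lines(File.binread("file1.pdf"), File.binread("file2.pdf")).each do |no, a, b|
#     puts "line #{no}:", "  #{a.inspect}", "  #{b.inspect}"
#   end
```

Printing the lines with inspect keeps binary content from garbling the terminal.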

