Compare PDF Content With Ruby

- January 15, 2014

I am in the process of writing a Ruby script/app that helps me compile LaTeX to (at least) PDF. One feature I want it to have is that it should run pdflatex iteratively until the PDF converges (as it should, I guess).

The idea is to compare the PDF generated in one iteration against the one from the former iteration using fingerprints. In particular, I use Digest::MD5.file(...).
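The loop described above could be sketched roughly as follows. The helper name run_until_stable and its max_runs parameter are made up for illustration; the build step is passed in as a block, so the sketch does not depend on how pdflatex is actually invoked.

```ruby
require "digest"

# Hypothetical helper: calls the given block (one build step, e.g. a pdflatex
# run) until the MD5 fingerprint of `path` stops changing or `max_runs` is
# reached. Returns the number of runs if the file converged, nil otherwise.
def run_until_stable(path, max_runs: 5)
  previous = nil
  1.upto(max_runs) do |run|
    yield
    current = Digest::MD5.file(path).hexdigest
    return run if current == previous
    previous = current
  end
  nil
end

# Possible usage (assumes pdflatex is on the PATH and paper.tex exists):
# run_until_stable("paper.pdf") do
#   system("pdflatex", "-interaction=nonstopmode", "paper.tex")
# end
```

With a plain file digest like this, the loop only terminates early if two consecutive runs produce byte-identical output.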

The problem is that this never converges. The (hopefully only) culprit is the PDF's timestamp, which pdflatex sets with at least one-second resolution. Since runs of pdflatex typically take longer than one second, the result keeps changing. That is, I expect the PDFs to be equal up to the timestamp(s) after some point. This assumption might be wrong; hints are appreciated.

What can I do about this? My basic ideas so far:

Use a library capable of doing the job
Strip the metadata away and hash only the PDF content
Overwrite the timestamps with a fixed value before comparing
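The third idea might look something like the sketch below. It assumes the /CreationDate, /ModDate and /ID entries appear uncompressed in the file, which may not hold for every PDF producer. The method name normalized_pdf_digest and the placeholder values are invented for this example.

```ruby
require "digest"

# Arbitrary placeholder date written over the real timestamps (an assumption).
FIXED_DATE = "D:19700101000000Z"

# Hypothetical helper: reads the PDF as raw bytes, overwrites the volatile
# entries with fixed values, and hashes the result. The file on disk is
# left untouched.
def normalized_pdf_digest(path)
  data = File.binread(path)
  # Replace the date strings, e.g. /CreationDate (D:20140115120000+01'00')
  data = data.gsub(%r{/(CreationDate|ModDate)\s*\([^)]*\)}, "/\\1 (#{FIXED_DATE})")
  # Replace the document ID array, e.g. /ID [<AB12...> <CD34...>]
  data = data.gsub(%r{/ID\s*\[[^\]]*\]}, "/ID [<00> <00>]")
  Digest::MD5.hexdigest(data)
end
```

Two PDFs that differ only in those entries should then yield the same digest, while any other difference still changes it.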

Do you have more ideas or even solutions? Solutions should only use free software that runs on Linux. Solutions that use only Ruby are preferred, but using external software is perfectly acceptable.

By the way, I do not know exactly how PDFs are encoded, but I suspect that merely comparing the contained text won't work for me, since only graphics or links might change in later iterations.

Possibly related:

How to compare two PDF files? (messy, text-based or proprietary solutions)
Functional PDF testing (uses a Java library; not clear whether it does the job)

This is not a bullet-proof solution, but it works for me:

cat file.pdf | grep -a -v "/CreationDate\|/ModDate\|/ID" | md5sum

Or, in Ruby:

`cat file.pdf | grep -a -v "/CreationDate\\|/ModDate\\|/ID" | md5sum`.strip

This computes the PDF's hash after dropping the lines that cause supposedly identical PDFs to differ.

YMMV, depending on your PDF creator. Use diff -a file1.pdf file2.pdf to find out which other lines you need to drop.
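For a variant that does not shell out, the same line-dropping idea could be sketched in plain Ruby as below. The names VOLATILE_KEYS and stable_pdf_digest are made up for this example. The digest is not guaranteed to equal the md5sum output of the pipeline above; it is only meant to be stable across runs in the same way.

```ruby
require "digest"

# Keys whose lines are dropped before hashing; extend this list after
# inspecting your own files with diff -a.
VOLATILE_KEYS = ["/CreationDate", "/ModDate", "/ID"].freeze

# Hypothetical helper: hashes a PDF line by line in binary mode, skipping
# every line that contains one of the volatile keys.
def stable_pdf_digest(path)
  digest = Digest::MD5.new
  File.open(path, "rb") do |file|
    file.each_line do |line|
      next if VOLATILE_KEYS.any? { |key| line.include?(key) }
      digest << line
    end
  end
  digest.hexdigest
end
```

Like the grep version, this drops any line that happens to contain one of the keys, including lines inside binary streams, so it is just as far from bullet-proof.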
