[JHOVE] Canadiana.org Introduction, and some questions about PDF files.

Russell McOrmond Russell.McOrmond at canadiana.ca
Mon Apr 3 19:18:34 BST 2017


  I'm the lead systems person at Canadiana.org, a Canadian charity that our
members created to digitise, preserve, and provide access to Canada's
documentary heritage. http://www.canadiana.ca/about

  In our Trustworthy Digital Repository
http://www.canadiana.ca/trustworthy-digital-repository we have tens of
millions of files, with some of the earliest scanned and OCR'd in the late
1990s.  As part of the CRL certification
http://www.canadiana.ca/tdr-certification we added a few checks to files
being added to our TDR (ImageMagick's identify for our JPEG, JPEG2000 and
TIFF images, as well as our PDF files).  While we wanted to adopt something
more robust such as JHOVE, we weren't able to allocate the resources to
implement it at the time.

  On March 24 I downloaded the latest JHOVE (the generated XML files
indicate release="1.16.5") and set it to generate an XML report for each of
our files.

  I expected to find problems with our earliest files, but was surprised to
find issues reported for our most recent ones.  We would like to have JHOVE
file identification/validation as part of our ingest process, only adding
new files which are recognised.  Before we can do this we need to work out
the compatibility issues.
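
  A minimal Python sketch of such an ingest gate, assuming the JHOVE XML
report layout shown in the output later in this message (repInfo, status and
message elements in the http://hul.harvard.edu/ois/xml/ns/jhove namespace).
The function names and the policy of accepting only "Well-Formed and valid"
are illustrative assumptions, not part of JHOVE:

```python
import xml.etree.ElementTree as ET

# Namespace taken from the JHOVE XML output quoted in this message.
JHOVE_NS = {"j": "http://hul.harvard.edu/ois/xml/ns/jhove"}


def jhove_statuses(xml_text):
    """Map each repInfo uri to (status, [error messages]) from a JHOVE XML report."""
    root = ET.fromstring(xml_text)
    results = {}
    for rep in root.findall("j:repInfo", JHOVE_NS):
        status = rep.findtext("j:status", default="", namespaces=JHOVE_NS).strip()
        errors = [
            " ".join((msg.text or "").split())  # collapse wrapped whitespace
            for msg in rep.findall("j:messages/j:message", JHOVE_NS)
            if msg.get("severity") == "error"
        ]
        results[rep.get("uri")] = (status, errors)
    return results


def accept_for_ingest(status):
    """Hypothetical policy: only fully valid files are added to the repository."""
    return status == "Well-Formed and valid"


# Trimmed-down report modelled on the pdftk result quoted in this message.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove" name="Jhove">
 <repInfo uri="MississaugaNews_2-pdftk.pdf">
  <format>PDF</format>
  <status>Not well-formed</status>
  <messages>
   <message offset="4143726197" severity="error">Missing startxref keyword
or value</message>
  </messages>
 </repInfo>
</jhove>"""

if __name__ == "__main__":
    for uri, (status, errors) in jhove_statuses(SAMPLE).items():
        verdict = "accept" if accept_for_ingest(status) else "reject"
        print(uri, status, verdict, errors)
```

  Reading the status element rather than searching the raw text keeps the
check independent of how the report happens to be line-wrapped.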


* Issues with ABBYY Recognition Server generated PDFs.

  For one of our projects we had ABBYY generate multi-page PDF files from
the JPEG images we scanned.  I posted a comment to an issue on GitHub with
some of the details.
https://github.com/openpreserve/jhove/issues/115#issuecomment-290946128

  When we were doing the OCR we had a few problems.  The ABBYY version we
have is only 32-bit, and thus can't use all of the RAM we have on the
machine.


  I thought one way to ensure we had fewer problems was to have ABBYY
generate single-page PDF files, and then use
https://poppler.freedesktop.org/ or PDFtk to create the multi-page PDF
file.  Unfortunately this hasn't worked well, and I don't know what the
source of the problem is.

  I took a sample directory containing 1533 JPEG images and the 1533 PDF
files which ABBYY generated.  If I run JHOVE on that directory, it reports
that all files are "Well-Formed and valid" and of the correct type.

  If I try to join the PDF files into a multi-page PDF I get validation
errors:

cihm at quark:/opt/wip/Temp/rwm$ (find MississaugaNews_2 -name '*.pdf' | sort ; echo MississaugaNews_2-pdfunite.pdf) | xargs pdfunite
cihm at quark:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2-pdfunite.pdf
Tagged:         no
Form:           none
Pages:          1533
Encrypted:      no
Page size:      733.45 x 1486.1 pts
Page rot:       0
File size:      4143807494 bytes
Optimized:      no
PDF version:    1.4
cihm at quark:/opt/wip/Temp/rwm$ identify MississaugaNews_2-pdfunite.pdf
MississaugaNews_2-pdfunite.pdf[0] PDF 733x1486 733x1486+0+0 16-bit Bilevel DirectClass 137KB 3.810u 0:03.820
MississaugaNews_2-pdfunite.pdf[1] PDF 783x1429 783x1429+0+0 16-bit Bilevel DirectClass 137KB 3.840u 0:03.850
MississaugaNews_2-pdfunite.pdf[2] PDF 733x1367 733x1367+0+0 16-bit Bilevel DirectClass 137KB 3.830u 0:03.839
... (cut, as showing every page is not very interesting) ...
MississaugaNews_2-pdfunite.pdf[1531] PDF 733x1356 733x1356+0+0 16-bit Bilevel DirectClass 137KB 0.060u 0:00.070
MississaugaNews_2-pdfunite.pdf[1532] PDF 733x1359 733x1359+0+0 16-bit Bilevel DirectClass 137KB 0.060u 0:00.070
cihm at quark:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2-pdfunite.pdf
java.lang.ArrayIndexOutOfBoundsException: 79961
at edu.harvard.hul.ois.jhove.module.PdfModule.getObject(PdfModule.java:2398)
at edu.harvard.hul.ois.jhove.module.PdfModule.resolveIndirectObject(PdfModule.java:2377)
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(PdfModule.java:1344)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(PdfModule.java:521)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(JhoveBase.java:803)
at edu.harvard.hul.ois.jhove.JhoveBase.process(JhoveBase.java:588)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(JhoveBase.java:455)
at Jhove.main(Jhove.java:292)
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd"
 name="Jhove" release="1.16.5" date="2017-03-20">
 <date>2017-04-03T12:08:45-04:00</date>
 <repInfo uri="MississaugaNews_2-pdfunite.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-03T11:31:47-04:00</lastModified>
  <size>4143807494</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="4143550391" severity="error">66166</message>
   <message offset="0" severity="error">No document catalog dictionary</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>


cihm at quark:/opt/wip/Temp/rwm$ (find MississaugaNews_2 -name '*.pdf' | sort ; echo cat ; echo output ; echo MississaugaNews_2-pdftk.pdf) | xargs pdftk
cihm at quark:/opt/wip/Temp/rwm$ pdfinfo MississaugaNews_2-pdftk.pdf
Creator:        pdftk 2.01 - www.pdftk.com
Producer:       itext-paulo-155 (itextpdf.sf.net-lowagie.com)
CreationDate:   Mon Apr  3 11:38:20 2017
ModDate:        Mon Apr  3 11:38:20 2017
Tagged:         no
Form:           none
Pages:          1533
Encrypted:      no
Page size:      733.45 x 1486.1 pts
Page rot:       0
File size:      4143726198 bytes
Optimized:      no
PDF version:    1.4
cihm at quark:/opt/wip/Temp/rwm$ /opt/jhove/jhove -m pdf-hul -h xml MississaugaNews_2-pdftk.pdf
<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns="http://hul.harvard.edu/ois/xml/ns/jhove"
 xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/jhove http://hul.harvard.edu/ois/xml/xsd/jhove/1.6/jhove.xsd"
 name="Jhove" release="1.16.5" date="2017-03-20">
 <date>2017-04-03T12:08:49-04:00</date>
 <repInfo uri="MississaugaNews_2-pdftk.pdf">
  <reportingModule release="1.8" date="2017-03-14">PDF-hul</reportingModule>
  <lastModified>2017-04-03T11:38:57-04:00</lastModified>
  <size>4143726198</size>
  <format>PDF</format>
  <status>Not well-formed</status>
  <sigMatch>
  <module>PDF-hul</module>
  </sigMatch>
  <messages>
   <message offset="4143726197" severity="error">Missing startxref keyword or value</message>
  </messages>
  <mimeType>application/pdf</mimeType>
 </repInfo>
</jhove>
cihm at quark:/opt/wip/Temp/rwm$


  If it is helpful, the pdfunite generated PDF file can be downloaded from
http://pub.canadiana.ca/view/omcn.MississaugaNews_2

  If this is a bug in pdfunite and/or pdftk I'd like to know more details
so I can file a bug report, but I was surprised to find that JHOVE reported
errors with both.
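
  For the "Missing startxref keyword or value" message, one quick
cross-check is to look at the end of the file directly.  A well-formed PDF
ends with the startxref keyword, the byte offset of the cross-reference
section, and an %%EOF marker.  The sketch below is a minimal Python
illustration of that check only; the helper names (read_tail,
find_startxref) and the 1024-byte tail size are assumptions for
illustration, and it does not reproduce JHOVE's own parsing:

```python
import os
import re

# "startxref", the decimal offset, then "%%EOF" at the very end of the file.
_TAIL_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")


def find_startxref(tail):
    """Return the cross-reference offset declared in the file tail, or None."""
    match = _TAIL_RE.search(tail)
    return int(match.group(1)) if match else None


def read_tail(path, nbytes=1024):
    """Read the last nbytes of a file without loading the whole file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(max(0, size - nbytes))
        return handle.read()


if __name__ == "__main__":
    # Synthetic tail for illustration; the offset 12345 is made up.
    good = b"trailer\n<< /Size 10 >>\nstartxref\n12345\n%%EOF\n"
    bad = b"trailer\n<< /Size 10 >>\n%%EOF\n"
    print(find_startxref(good))
    print(find_startxref(bad))
```

  Running find_startxref(read_tail(path)) on the pdfunite and pdftk outputs
would show whether the trailer is present and which offset it declares,
which may help decide where a bug report belongs.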

-- 
System Administration and software developer,
Canadiana.org   http://www.canadiana.ca