PDF-1.5 Image insertion - Pandoc should use correct /MediaBox dimensions/ratio when exporting to .docx, or output warning

#10902

sladen opened about 1 month ago
Label: bug

For PDF-1.5 files with a compressed trailer (cross-reference stream) and compressed object streams (/ObjStm), other PDF software, including qpdf and pdfinfo, reads the correct page dimensions from /MediaBox:

$ pdfinfo 'x.pdf' | grep pts
Page size: 596 x 842 pts (A4)

However, when using Pandoc and outputting to .docx:

$ echo '![Some PDF Image](x.pdf)' > x.md && pandoc x.md -o x.docx

the image dimensions are detected incorrectly, and the image is inserted into the .docx with the wrong aspect ratio and size.
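One way to check what actually ended up in the .docx is to read the image extent back out of the archive. The following is a minimal sketch, not Pandoc's own logic: it assumes the image extent is stored as a `wp:extent` element with `cx`/`cy` attributes in `word/document.xml`, measured in EMUs (914400 per inch, so 12700 per PDF point). The function name `image_extents_pt` is hypothetical, and the in-memory archive is a synthetic stand-in, not output produced by Pandoc.

```python
import io
import re
import zipfile

EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def image_extents_pt(docx_bytes):
    """Return (width, height) in points for each wp:extent in a .docx."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    extents = []
    for tag in re.findall(r"<wp:extent\b[^>]*>", xml):
        cx = int(re.search(r'\bcx="(\d+)"', tag).group(1))
        cy = int(re.search(r'\bcy="(\d+)"', tag).group(1))
        extents.append((cx / EMU_PER_POINT, cy / EMU_PER_POINT))
    return extents


# Synthetic stand-in for a .docx whose image matches the 596 x 842 pt page.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr(
        "word/document.xml",
        '<w:document><wp:extent cx="7569200" cy="10693400"/></w:document>',
    )
extents = image_extents_pt(buf.getvalue())
print(extents)
```

For a real file, something like `image_extents_pt(open("x.docx", "rb").read())` could be compared against the `pdfinfo` page size above.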

Manually walking through the PDF objects using qpdf shows a compressed cross-reference (xref) stream and compressed object streams (/ObjStm) in use. A compressed object stream packs multiple objects into a single /FlateDecode-compressed stream:

$ for object in trailer 18 40 1 2 ; do qpdf --show-object=$object x.pdf -- ; done
<< /Filter /FlateDecode /Info 39 0 R /Length 155 /Root 40 0 R /Size 43 /Type /XRef /W [ 1 2 2 ] >>
% Object is stream. Dictionary:
<< /Filter /FlateDecode /First 17 /Length 41 0 R /N 3 /Type /ObjStm >>
<< /Pages 1 0 R /Type /Catalog >>
<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>
<< /Contents 4 0 R /Group << /CS /DeviceRGB /I true /S /Transparency /Type /Group >> /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Resources 3 0 R /Type /Page >>
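To illustrate what "unpacking" an object stream involves, here is a minimal sketch in Python. It is not how Pandoc or qpdf do it. It assumes the stream uses only /FlateDecode, that the first `/First` bytes of the decoded data hold `/N` pairs of "object number, offset", and that offsets are relative to `/First` and in increasing order. The function name `unpack_object_stream` is hypothetical, and the stream is synthetic: it is rebuilt from dictionaries like those in the dump above, so its `/First` value differs from the real file's 17.

```python
import zlib


def unpack_object_stream(compressed, n, first):
    """Split a /Type /ObjStm stream into {object number: raw object bytes}."""
    data = zlib.decompress(compressed)  # /Filter /FlateDecode
    # Header: n pairs of integers "objnum offset", offsets relative to `first`.
    header = [int(tok) for tok in data[:first].split()]
    pairs = [(header[2 * i], header[2 * i + 1]) for i in range(n)]
    objects = {}
    for i, (num, off) in enumerate(pairs):
        # Each object runs up to the next object's offset (or end of data).
        end = pairs[i + 1][1] if i + 1 < n else len(data) - first
        objects[num] = data[first + off:first + end].strip()
    return objects


# Synthetic object stream modelled on objects 40, 1 and 2 from the dump.
bodies = {
    40: b"<< /Pages 1 0 R /Type /Catalog >>",
    1: b"<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>",
    2: b"<< /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Type /Page >>",
}
offsets, payload = [], b""
for num, body in bodies.items():
    offsets.append((num, len(payload)))
    payload += body + b"\n"
header = b" ".join(b"%d %d" % pair for pair in offsets) + b"\n"
stream = zlib.compress(header + payload)

objs = unpack_object_stream(stream, n=len(bodies), first=len(header))
print(objs[2])
```

A reader that only scans the raw file bytes for `/MediaBox` would never see the page dictionary, because it exists only inside the decompressed stream.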

The /Page object with its /MediaBox [left bottom right top] is there, but it can only be obtained by fully walking, parsing, and unpacking the PDF file.
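Once the page dictionary is available, deriving the dimensions is simple arithmetic on the four /MediaBox numbers. The sketch below assumes the dictionary has already been decoded to bytes and carries a direct /MediaBox entry; inheritance from /Parent is deliberately not handled. The function name `media_box_size` is hypothetical.

```python
import re

NUMBER = rb"([-+]?\d*\.?\d+)"
MEDIA_BOX = re.compile(
    rb"/MediaBox\s*\[\s*" + rb"\s+".join([NUMBER] * 4) + rb"\s*\]"
)


def media_box_size(page_dict):
    """Return (width, height) in points from a page dictionary's /MediaBox."""
    match = MEDIA_BOX.search(page_dict)
    if match is None:
        # Assumption: a missing entry means it is inherited from /Parent.
        raise ValueError("no direct /MediaBox entry in this dictionary")
    left, bottom, right, top = (float(g) for g in match.groups())
    # abs() tolerates boxes whose corners are given in the opposite order.
    return abs(right - left), abs(top - bottom)


# Page dictionary as printed by qpdf for object 2 in this issue.
page = (
    b"<< /Contents 4 0 R /MediaBox [ 0 0 596 842 ] /Parent 1 0 R "
    b"/Resources 3 0 R /Type /Page >>"
)
width, height = media_box_size(page)
print(width, height, width / height)
```

The resulting width and height match the 596 x 842 pts that `pdfinfo` reports for this file.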

The PDF-1.5 files in question are not particularly exotic: they were printed from Firefox, via Cairo:

obj 39 0 << /CreationDate (D:20240611224431+02'00) /Creator (Mozilla Firefox 126.0) /Producer (cairo 1.17.4 \(https://cairographics.org\)) >>

(See previous issue "Wrong image size for pdf images in docx" #4322; newly opened here at the request of @jgm.)