PDF-1.5 Image insertion - Pandoc should use correct /MediaBox dimensions/ratio when exporting to .docx, or output warning
For PDF-1.5 files with a compressed trailer and compressed object streams (/ObjStm), other PDF software, including qpdf and pdfinfo, uses the correct page dimensions read from /MediaBox:
$ pdfinfo 'x.pdf' | grep pts
Page size: 596 x 842 pts (A4)
However, when using Pandoc and outputting to .docx:
$ echo '' > x.md && pandoc x.md -o x.docx
the image dimensions are detected incorrectly, and the image is inserted into the .docx with the wrong aspect ratio and size.
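For reference, the size and aspect ratio the inserted image should have can be derived from the page size that pdfinfo reports above (596 x 842 pts). The following is a minimal sketch, assuming the page size is written as a /MediaBox array [0 0 596 842] and that the page uses the PDF default unit of 1/72 inch; the function names are hypothetical and not part of Pandoc.

```python
POINTS_PER_INCH = 72  # PDF default user space unit: 1 pt = 1/72 inch


def media_box_size(media_box):
    """Return (width_pt, height_pt) for a /MediaBox [left bottom right top]."""
    left, bottom, right, top = media_box
    # abs() tolerates boxes whose corners are listed in the opposite order.
    return abs(right - left), abs(top - bottom)


def describe(media_box):
    """Collect the expected dimensions and aspect ratio of a page."""
    width_pt, height_pt = media_box_size(media_box)
    return {
        "width_pt": width_pt,
        "height_pt": height_pt,
        "width_in": width_pt / POINTS_PER_INCH,
        "height_in": height_pt / POINTS_PER_INCH,
        "aspect_ratio": width_pt / height_pt,
    }


# Assumed input: the A4 page size reported by pdfinfo, as a /MediaBox array.
info = describe([0, 0, 596, 842])
print(info)
```

An image whose width-to-height ratio in the .docx differs noticeably from `aspect_ratio` here was not sized from the /MediaBox.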
Manually walking through the PDF objects using qpdf shows a compressed cross-reference (xref) stream and compressed object streams (/ObjStm) in use. A compressed object stream holds multiple objects in a single /FlateDecode-compressed stream:
$ for object in trailer 18 40 1 2 ; do qpdf --show-object=$object x.pdf -- ; done
<< /Filter /FlateDecode /Info 39 0 R /Length 155 /Root 40 0 R /Size 43 /Type /XRef /W [ 1 2 2 ] >>
% Object is stream. Dictionary:
<< /Filter /FlateDecode /First 17 /Length 41 0 R /N 3 /Type /ObjStm >>
<< /Pages 1 0 R /Type /Catalog >>
<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>
<< /Contents 4 0 R /Group << /CS /DeviceRGB /I true /S /Transparency /Type /Group >> /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Resources 3 0 R /Type /Page >>
The /Page object with its /MediaBox [left bottom right top] is there, but it is only obtainable by (fully) walking, parsing, and unpacking the PDF file.
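To illustrate the unpacking step, here is a rough sketch of how one object stream could be split into its objects and searched for /MediaBox. It is not Pandoc's code. It assumes a plain /FlateDecode stream without predictors, offsets listed in increasing order, and a /MediaBox written directly in the /Page dictionary (not inherited from /Pages or given as an indirect reference). The demo stream is synthetic, built from the dictionaries shown in the qpdf output above, and all function names are hypothetical.

```python
import re
import zlib


def unpack_object_stream(raw_stream, n, first):
    """Split a FlateDecode-compressed /ObjStm payload into {object number: body}.

    n and first are the /N and /First entries of the stream dictionary.
    """
    data = zlib.decompress(raw_stream)
    # The first `first` bytes hold n pairs: object number, offset relative to /First.
    header = data[:first].split()
    numbers = [int(tok) for tok in header[0:2 * n:2]]
    offsets = [int(tok) for tok in header[1:2 * n:2]]
    # Each object ends where the next one starts; the last runs to the end.
    ends = offsets[1:] + [len(data) - first]
    return {
        num: data[first + start:first + end].strip()
        for num, start, end in zip(numbers, offsets, ends)
    }


MEDIA_BOX = re.compile(rb"/MediaBox\s*\[\s*([^\]]+)\]")


def find_media_box(body):
    """Return the /MediaBox numbers in an object body, or None if absent."""
    match = MEDIA_BOX.search(body)
    if match is None:
        return None
    return [float(tok) for tok in match.group(1).split()]


def build_object_stream(objects):
    """Build a synthetic compressed object stream for the demo below."""
    body = b""
    pairs = []
    for num, obj in objects.items():
        pairs.append(f"{num} {len(body)}")
        body += obj + b"\n"
    header = (" ".join(pairs) + "\n").encode("ascii")
    return zlib.compress(header + body), len(objects), len(header)


# Synthetic stand-in for the object stream (object 18) in x.pdf.
raw, n, first = build_object_stream({
    40: b"<< /Pages 1 0 R /Type /Catalog >>",
    1: b"<< /Count 2 /Kids [ 2 0 R 12 0 R ] /Type /Pages >>",
    2: b"<< /MediaBox [ 0 0 596 842 ] /Parent 1 0 R /Type /Page >>",
})
unpacked = unpack_object_stream(raw, n, first)
box = find_media_box(unpacked[2])
print(box)
```

A scan of the raw file bytes for /MediaBox finds nothing in such a file, because the /Page dictionary only exists inside the compressed stream.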
The PDF-1.5 files in question are not particularly exotic; they were printed from Firefox via cairo:
obj 39 0 << /CreationDate (D:20240611224431+02'00) /Creator (Mozilla Firefox 126.0) /Producer (cairo 1.17.4 \(https://cairographics.org\)) >>
(See the previous issue "Wrong image size for pdf images in docx" #4322; opened here as a new issue at the request of @jgm.)