CVE-2022-44109: GitHub - ldenoue/pdftojson: using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.
pdftojson commit 94204bb was discovered to contain a stack overflow via the component Stream::makeFilter(char*, Stream*, Object*, int).
Using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.
On macOS, you might need to specify the libpng and libfreetype locations, e.g.:
./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/ --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/
After building, you will find the pdftojson binary inside the directory xpdf/pdftojson. Run it as:
pdftojson <input.pdf> <output.json>
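To call the tool from a script, the command line above can be wrapped in a small helper. This is only a sketch: the default binary path `xpdf/pdftojson/pdftojson` is an assumption based on the build directory mentioned above, the function names and the 60-second timeout are hypothetical, and the only interface relied on is the two positional arguments shown in the usage line.

```python
import subprocess
from pathlib import Path


def build_command(binary, pdf_path, json_path):
    """Return the argument list for: pdftojson <input.pdf> <output.json>."""
    return [str(binary), str(pdf_path), str(json_path)]


def convert(pdf_path, json_path, binary="xpdf/pdftojson/pdftojson", timeout=60):
    """Run pdftojson on one file and return the path of the JSON output.

    The binary location and the timeout are assumptions for illustration.
    Raises subprocess.CalledProcessError if the tool exits with a non-zero
    status (including a crash), and subprocess.TimeoutExpired if it hangs.
    """
    subprocess.run(build_command(binary, pdf_path, json_path),
                   check=True, timeout=timeout)
    return Path(json_path)
```

Running the converter as a separate process with `check=True` and a timeout means a crash or hang on a malformed PDF surfaces as a Python exception rather than taking down the calling program.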
File format
The JSON produced looks like:
[
  {
    "pages":14,
    "number":1,
    "width":612,
    "height":792,
    "text":[
      [115,162,41,14,0,"What "],
      …
    ]
  },
  {
    "pages":14,
    "number":2,
    "width":612,
    "height":792,
    "text":[
      [115,162,41,14,0,"Here "],
      …
    ]
  },
  …
];
For each page, the text array contains one entry per word, in the form: [top,left,width,height,0,text]
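The layout described above can be read back with a few lines of Python. The following is a minimal sketch, not part of pdftojson: the names `Word`, `load_pages` and `iter_words` are hypothetical, the field order is taken from the description above, and the fifth field (shown as 0) is ignored because its meaning is not documented here. Stripping a trailing semicolon is an assumption made in case the output file ends with the `;` shown in the example.

```python
import json
from typing import Iterator, List, NamedTuple


class Word(NamedTuple):
    page: int      # value of the page's "number" field
    top: float
    left: float
    width: float
    height: float
    text: str


def load_pages(raw: str) -> List[dict]:
    """Parse pdftojson output, tolerating a trailing ';' (an assumption)."""
    return json.loads(raw.strip().rstrip(";"))


def iter_words(pages: List[dict]) -> Iterator[Word]:
    """Yield one Word per entry of each page's "text" array."""
    for page in pages:
        for top, left, width, height, _unused, text in page["text"]:
            yield Word(page["number"], top, left, width, height, text)


# Shortened version of the example above, without the elided entries.
SAMPLE = """
[
  {"pages":14, "number":1, "width":612, "height":792,
   "text":[[115,162,41,14,0,"What "]]},
  {"pages":14, "number":2, "width":612, "height":792,
   "text":[[115,162,41,14,0,"Here "]]}
];
"""

if __name__ == "__main__":
    for word in iter_words(load_pages(SAMPLE)):
        print(word.page, repr(word.text), word.top, word.left)
```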