The trick is to use System.Reflection to expose hidden (private) properties of the PDFBox Page object. The program creates one image for each page of a PDF, computes word locations (if the PDF has been OCR'ed), then ...
Are there any open-source tools for creating PDF files from Word documents? A program called GhostWord lets you click an icon in a Word toolbar to generate PDF files.
To recap, when a Tesseract PDF (3.0x or 4.x) is run through Ghostscript, the OCR layer will be mangled. The output of Ghostscript's pdfwrite device (gs -sDEVICE=pdfwrite -o out.pdf in.pdf) will display spaces between every ...
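
To reproduce the Ghostscript step from a script, the command quoted above could be wrapped roughly as follows. This is only a minimal sketch: the function names build_pdfwrite_command and run_pdfwrite are made up for illustration, and it assumes the Ghostscript executable is called gs and is on the PATH (on Windows it usually has a different name). It only runs the conversion; it does not check or repair the OCR layer.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdfwrite_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Return the argument list for: gs -sDEVICE=pdfwrite -o <dst> <src>."""
    # Same flags as the command quoted in the text; nothing extra is added.
    return [gs_binary, "-sDEVICE=pdfwrite", "-o", dst, src]


def run_pdfwrite(src: str, dst: str, gs_binary: str = "gs") -> Optional[int]:
    """Run Ghostscript's pdfwrite device on src, writing dst.

    Returns the Ghostscript exit code, or None when the executable
    (assumed to be named `gs`) cannot be found on the PATH.
    """
    if shutil.which(gs_binary) is None:
        return None
    completed = subprocess.run(
        build_pdfwrite_command(src, dst, gs_binary),
        check=False,  # leave exit-code handling to the caller
    )
    return completed.returncode


if __name__ == "__main__":
    # File names taken from the command in the text.
    print(build_pdfwrite_command("in.pdf", "out.pdf"))
```

Keeping the command construction separate from the subprocess call makes it easy to print or log the exact arguments before comparing the text layer of in.pdf and out.pdf by hand.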