Calculating Text Regions of Individual Words from an Existing PDF
Here's what I am trying to do; I would like to know whether this is possible in iText.
I'm not interested in constructing PDFs, only in deconstructing existing PDFs to analyze the content and positions of the words on the page.
Rather than the boundary of all the text on the page, I want the boundary information for each word, so that I can generate some XML for another program I wrote.
Something like this...
<word id="0" x="0" y="0" width="8" height="4">The</word>
<word id="1" x="12" y="0" width="7" height="4">fox</word>
<word id="2" x="22" y="0" width="7" height="4">was</word>
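To make the target concrete, here is a minimal Java sketch of the grouping step I have in mind. It does not use iText at all: the `Glyph` record is a hypothetical stand-in for whatever per-character position data a parser could supply, and `groupIntoWords`, `toXml`, and the `maxGap` threshold are illustrative names of my own, not library API. It also assumes the glyphs arrive in reading order on a single line.

```java
import java.util.ArrayList;
import java.util.List;

final class WordBoxes {
    // Hypothetical per-character box; real values would come from the PDF parser.
    record Glyph(char ch, float x, float y, float width, float height) {}

    // One word plus the union of its glyph boxes.
    record Word(String text, float x, float y, float width, float height) {}

    // Splits on whitespace glyphs, or when the horizontal gap exceeds maxGap.
    static List<Word> groupIntoWords(List<Glyph> glyphs, float maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        float left = 0, right = 0, bottom = 0, top = 0;
        for (Glyph g : glyphs) {
            boolean isSpace = Character.isWhitespace(g.ch());
            boolean gapTooWide = text.length() > 0 && g.x() - right > maxGap;
            if ((isSpace || gapTooWide) && text.length() > 0) {
                words.add(new Word(text.toString(), left, bottom, right - left, top - bottom));
                text.setLength(0);
            }
            if (isSpace) {
                continue; // whitespace only separates words, it is never emitted
            }
            if (text.length() == 0) {
                // First glyph of a new word: start a fresh bounding box.
                left = g.x();
                bottom = g.y();
                right = g.x() + g.width();
                top = g.y() + g.height();
            } else {
                // Grow the bounding box to cover this glyph.
                left = Math.min(left, g.x());
                bottom = Math.min(bottom, g.y());
                right = Math.max(right, g.x() + g.width());
                top = Math.max(top, g.y() + g.height());
            }
            text.append(g.ch());
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), left, bottom, right - left, top - bottom));
        }
        return words;
    }

    // Escapes the characters that are not allowed in XML element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one <word> element per line, with coordinates rounded to integers.
    static String toXml(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            Word w = words.get(i);
            sb.append(String.format(
                "<word id=\"%d\" x=\"%d\" y=\"%d\" width=\"%d\" height=\"%d\">%s</word>\n",
                i, Math.round(w.x()), Math.round(w.y()),
                Math.round(w.width()), Math.round(w.height()), escape(w.text())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph positions, purely for illustration.
        List<Glyph> glyphs = List.of(
            new Glyph('T', 0f, 0f, 3f, 4f),
            new Glyph('h', 3f, 0f, 2.5f, 4f),
            new Glyph('e', 5.5f, 0f, 2.5f, 4f),
            new Glyph(' ', 8f, 0f, 4f, 4f),
            new Glyph('f', 12f, 0f, 2f, 4f),
            new Glyph('o', 14f, 0f, 2.5f, 4f),
            new Glyph('x', 16.5f, 0f, 2.5f, 4f));
        System.out.print(toXml(groupIntoWords(glyphs, 1.0f)));
    }
}
```

The part I don't know how to do is filling in the glyph positions from a real PDF; the sketch only covers what would happen after that.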
I know I can do it for a region of text, as shown in Chapter 15 of the iText in Action book, but I really do want it for each individual word so that I can generate invisible yet clickable hotspots over what will end up being just a plain image.
Is this possible with iText, and if so, how would I accomplish it? I should note that some of the PDFs don't contain normal text but an image with text that was OCR'd in Acrobat Pro. I'm not sure what the difference is between normal PDF text and Acrobat OCR'd PDF text.
Thanks, great book!