Calculating Text Regions of Individual Words from Existing PDF

Submitted by kalani96746 on Sun, 07/15/2012 - 20:08

Hi,

Here's what I am trying to do; I would appreciate knowing whether this is possible in iText.
I'm not interested in constructing PDFs, only in deconstructing existing PDFs to analyze the content and positions of words on the page.

Rather than the boundary of all text on the page, I want the boundary info for each word so that I can generate some XML for another program I wrote.

Something like this:
<word id="0" x="0" y="0" width="8" height="4">The</word>
<word id="1" x="12" y="0" width="7" height="4">fox</word>
<word id="2" x="22" y="0" width="7" height="4">was</word>

I know I can do it for a region of text, as shown in Chapter 15 of the iText in Action book, but I really do want it for each individual word so I can generate invisible yet clickable hotspots over what will end up being just a plain image.

Is this possible with iText, and if so, how would I accomplish something like this? I should note that some of the PDFs aren't normal text but an image with text OCR'd by Acrobat Pro. I'm not sure what the difference is between normal PDF text and Acrobat OCR'd PDF text.

Thanks, great book!

kb

The concept of a word doesn't exist in PDF

Submitted by Bruno Lowagie on Mon, 07/16/2012 - 12:44.

See http://article.gmane.org/gmane.comp.java.lib.itext.general/62518
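
Since a PDF only positions glyphs or short runs of text, word boundaries have to be inferred from geometry. The following is a minimal sketch of one way to do that in plain Java, not the method from the linked post and not iText API: it assumes you have already collected one rectangle per glyph (or text chunk) in reading order, and it merges neighbours on the same line whose horizontal gap is small. The `WordBoxes` class, the `Box` type and the `maxGap` threshold are all hypothetical names chosen for illustration; how the rectangles are obtained from the PDF is outside the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBoxes {

    /** One positioned piece of text. Hypothetical type, not part of iText. */
    static final class Box {
        final String text;
        final float x, y, width, height;

        Box(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Merges glyph boxes into word boxes. Assumes the input is in reading order.
     * A new word starts at a whitespace glyph, at a line change, or when the
     * horizontal gap to the previous glyph exceeds maxGap.
     */
    static List<Box> groupIntoWords(List<Box> glyphs, float maxGap) {
        List<Box> words = new ArrayList<>();
        Box current = null;
        for (Box g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // An explicit space character ends the current word.
                if (current != null) {
                    words.add(current);
                    current = null;
                }
                continue;
            }
            if (current != null) {
                boolean sameLine = Math.abs(g.y - current.y) <= 0.5f * current.height;
                float gap = g.x - (current.x + current.width);
                if (sameLine && gap <= maxGap) {
                    // Grow the current word's rectangle to cover this glyph.
                    float bottom = Math.min(current.y, g.y);
                    float top = Math.max(current.y + current.height, g.y + g.height);
                    float right = Math.max(current.x + current.width, g.x + g.width);
                    current = new Box(current.text + g.text, current.x, bottom,
                            right - current.x, top - bottom);
                    continue;
                }
                words.add(current);
            }
            current = g;
        }
        if (current != null) {
            words.add(current);
        }
        return words;
    }

    /** Writes one <word> element per box, in the format used in the question. */
    static String toXml(List<Box> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            Box w = words.get(i);
            String escaped = w.text.replace("&", "&amp;")
                    .replace("<", "&lt;").replace(">", "&gt;");
            sb.append(String.format(Locale.ROOT,
                    "<word id=\"%d\" x=\"%.1f\" y=\"%.1f\" width=\"%.1f\" height=\"%.1f\">%s</word>%n",
                    i, w.x, w.y, w.width, w.height, escaped));
        }
        return sb.toString();
    }
}
```

A suitable `maxGap` depends on the font size and on how the PDF was produced, so it would need tuning per document; a fraction of the glyph height is a common starting point. OCR'd text layers often place whole words or lines as single chunks, in which case the grouping step may have little to do and splitting on spaces inside a chunk may be needed instead.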
