Can I extract/parse information from pdf

Submitted by Anonymous Newbie on Wed, 02/09/2011 - 10:29

Can I use iText to extract/parse information from pdf document like text, tables as well as form field values?


Maybe...

Submitted by Bruno Lowagie on Wed, 02/09/2011 - 10:44.

Yes, you can use iText to parse a PDF for information.

iText allows you to examine every object in the Carousel Object System (see chapter 13 of iText in Action). This means that you can search for values such as the reason why a PDF was signed, the content of an annotation, the titles of the bookmarks, etc. You can also get the different content streams, but a content stream consists of PDF syntax (Adobe Imaging Model), so you'll need further processing to extract the actual text.
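
To give an idea of what "further processing" means, here is a toy sketch in plain Java. It is not how iText does it. The class name ContentStreamSketch and the sample stream are made up for illustration. The sketch only collects literal strings written as (...) in a content stream fragment, and it assumes the bytes of each string map directly to readable characters, which is often not true for real PDFs (font encodings, hex strings and text positioning are all ignored).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of iText.
class ContentStreamSketch {

    /** Collects the literal strings "(...)" found in a content stream fragment, in order. */
    static List<String> literalStrings(String stream) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            if (stream.charAt(i) != '(') {
                i++;
                continue;
            }
            StringBuilder sb = new StringBuilder();
            int depth = 1; // PDF literal strings may contain balanced parentheses
            i++;           // skip the opening parenthesis
            while (i < stream.length() && depth > 0) {
                char c = stream.charAt(i++);
                if (c == '\\' && i < stream.length()) {
                    // Keeps the escaped character as-is: enough for \( \) and \\,
                    // but escapes such as \n or octal codes are NOT decoded here.
                    sb.append(stream.charAt(i++));
                } else if (c == '(') {
                    depth++;
                    sb.append(c);
                } else if (c == ')') {
                    depth--;
                    if (depth > 0) {
                        sb.append(c);
                    }
                } else {
                    sb.append(c);
                }
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragment: one Tj operator and one TJ array with a kerning value.
        String stream = "BT\n/F1 12 Tf\n72 712 Td\n(Hello) Tj\n[(Wor) -20 (ld)] TJ\nET";
        System.out.println(String.join("", literalStrings(stream))); // prints HelloWorld
    }
}
```

Note that the word "World" arrives in two pieces with a number in between. Deciding whether such pieces belong to the same word, or where the white space goes, is exactly the kind of work a real text extractor has to do.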

Yes, you can use iText to process these content streams.

You can extract text from a content stream, but for ordinary PDFs, the result will be plain text (without any structure). If there's a table on the page, that table won't be recognized as such. You'll get the content and some white space, but that's not a tabular structure! Only if you have a tagged PDF can you obtain an XML file (see chapter 15 of iText in Action). If the PDF contains tags that are recognized as table tags, this will be reflected in the XML.
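
Once you have such an XML file, getting the rows and cells back is ordinary XML processing. The sketch below uses only the DOM parser that ships with Java. The class name TaggedTableSketch and the sample XML are made up; Table, TR, TH and TD are the standard PDF structure types for tables, but the XML you actually get from a given tagged PDF may use different names or nesting, so treat this as an assumption to check against your own file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of iText.
class TaggedTableSketch {

    /** Returns one list of cell texts per TR element found in the XML. */
    static List<List<String>> rows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<List<String>> table = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int r = 0; r < trs.getLength(); r++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(r).getChildNodes();
            for (int c = 0; c < children.getLength(); c++) {
                if (children.item(c) instanceof Element) {
                    Element cell = (Element) children.item(c);
                    String name = cell.getTagName();
                    // Only direct TH/TD children count as cells of this row.
                    if (name.equals("TD") || name.equals("TH")) {
                        cells.add(cell.getTextContent().trim());
                    }
                }
            }
            table.add(cells);
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up sample of what XML from a tagged PDF might look like.
        String xml = "<Document><Table>"
                + "<TR><TH>Item</TH><TH>Price</TH></TR>"
                + "<TR><TD>Pen</TD><TD>2.50</TD></TR>"
                + "</Table></Document>";
        System.out.println(rows(xml)); // prints [[Item, Price], [Pen, 2.50]]
    }
}
```

Without the tags there is nothing for this kind of code to hold on to: the same page would just give you "Item Price Pen 2.50" with some white space.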

Yes, you can use iText to get form field values.

If the PDF is using AcroForm technology, you need the getField() method. For XFA, you can also use this method if there's an AcroForm counterpart, but you can also extract the data XML (which is your only option for dynamic XFA forms). If the PDF looks like a form (to the human eye) but doesn't contain any AcroForm or XFA technology (for instance because the form has been flattened), you can't extract form fields. You can get the plain text, but not the field names, because from a technical point of view there are no fields.

No, you can't get any information from a PDF file that isn't there.

As described above: you can't get fields from a PDF that looks like a form if the PDF isn't a form from a technical point of view, and you can't get a table from a PDF that looks like a table if the tabular structure (using tags) is missing inside the PDF.

Content © 2010 1T3XT BVBA