I have an application that extracts headings from PDF files. The documents that the application is supposed to work with all have a more or less coherent structure and formatting. In fact, being able to tell whether a text chunk is bold is very important.
I am looking for a way to extract text together with its link (anchor) information using iText. For example, the PDF content is "You can visit our website, XYZ, and do something", where XYZ is a clickable link. The output when extracting this content should be: "You can visit our website, XYZ (www.google.com) and do something".
We explored several APIs, including Tika, PDFBox and iText, to extract page numbers from a PDF file, but none of them met this requirement. In iText we tried PdfPageLabels.getPageLabels(reader), but the behavior of this method is not consistent.
When extracting the font name from a PDF, I get some junk characters followed by a plus sign, and then the font name with the font style. I want to remove the junk characters. I get them only for a few PDF files, for example: MMLPEO+RemingtonNoiseless
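
For what it's worth, the prefix looks like a font subset tag: when only some glyphs of a font are embedded, the PDF format prefixes the font name with six uppercase letters and a plus sign. Below is a minimal sketch, in plain Java with no iText calls, that strips such a tag from a name you have already extracted. The class name `FontNames` and the method `stripSubsetPrefix` are made up for illustration, and the sketch assumes the prefix always has exactly that six-letter form.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class FontNames {
    // Assumption: a subset tag is exactly six uppercase ASCII letters
    // followed by '+', at the very start of the font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Returns the font name without a leading subset tag; names that do
    // not start with such a tag are returned unchanged.
    static String stripSubsetPrefix(String fontName) {
        if (fontName == null) {
            return null;
        }
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    public static void main(String[] args) {
        // prints "RemingtonNoiseless"
        System.out.println(stripSubsetPrefix("MMLPEO+RemingtonNoiseless"));
        // prints "Helvetica-Bold" (no tag, so nothing is removed)
        System.out.println(stripSubsetPrefix("Helvetica-Bold"));
    }
}
```

Anchoring the pattern at the start of the string means a plus sign that appears later in a font name is left alone, and fully embedded fonts, which carry no tag, pass through unchanged.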