Extract text and images from PDF pages

Extracting text and images from PDF pages for additional processing is a common requirement in many software projects. The XFINIUM.PDF library can extract text, images, and vector graphics from PDF files at various levels, from low-level PDF operators to high-level visual objects.

The main class for extracting text, images, and vector graphics from a PDF page is the PdfContentExtractor class. The page from which content is extracted is passed as a parameter to the PdfContentExtractor constructor.
The following methods for extracting content are available:

ExtractText

Extracts the text from a PDF page as a string object.
The context parameter affects performance when extracting text from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
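To make the caching idea concrete, here is a minimal, library-independent sketch in Python. It is not XFINIUM.PDF code: `ExtractionContext`, `parse_font`, and the page dictionaries are hypothetical stand-ins. It only illustrates the general pattern of parsing an object that several pages share (here, a font) once instead of once per page.

```python
class ExtractionContext:
    """Hypothetical cache for objects shared between pages (e.g. fonts)."""

    def __init__(self):
        self.fonts = {}        # font id -> parsed font
        self.parse_count = 0   # how many times a font was actually parsed

    def get_font(self, font_id):
        # Parse the font only the first time it is requested.
        if font_id not in self.fonts:
            self.fonts[font_id] = parse_font(font_id)
            self.parse_count += 1
        return self.fonts[font_id]


def parse_font(font_id):
    # Stand-in for an expensive step such as decoding a font program.
    return {"id": font_id}


def extract_text(page, context):
    # Each text run looks up its font through the shared context.
    for font_id, _ in page["runs"]:
        context.get_font(font_id)
    return " ".join(text for _, text in page["runs"])


# Three pages that all use the same font "F1".
pages = [{"runs": [("F1", f"page {n}")]} for n in range(1, 4)]

context = ExtractionContext()
texts = [extract_text(page, context) for page in pages]
print(texts)
print(context.parse_count)
```

With one context shared across the loop, the font is parsed a single time; creating a fresh context per page would parse it three times.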

The code below extracts the text from a PDF file:
C#:

VB.NET:

ExtractTextFragments

Extracts the text from a PDF page as a collection of text fragment objects. A text fragment is a piece of text painted by a single ‘show text’ operator (for example, Tj). The text can be a single letter, a word, or an entire phrase, depending on the application that generated the PDF file.
A text fragment object carries several pieces of information: the text being shown, the name of the font used to display the text, the font size, the positions of the fragment’s 4 corners (the fragment can be rotated and skewed, so it cannot always be represented as a rectangle), the pen and brush used to style the text, and a collection of glyphs describing each glyph that composes the text.
The context parameter affects performance when extracting text fragments from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
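The reason a fragment is described by 4 corners rather than a rectangle can be shown with a small, library-independent Python sketch. It is not XFINIUM.PDF code; `transform_corners` and `is_axis_aligned` are hypothetical helpers. The sketch maps a box through a PDF-style transformation matrix and checks whether the result still has only horizontal and vertical edges.

```python
import math


def transform_corners(width, height, matrix):
    """Map the corners of a width x height box through a PDF-style matrix
    (a, b, c, d, e, f): x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    box = [(0, 0), (width, 0), (width, height), (0, height)]
    return [(a * x + c * y + e, b * x + d * y + f) for x, y in box]


def is_axis_aligned(corners, tol=1e-9):
    # True only if every edge of the quadrilateral is horizontal or vertical.
    edges = zip(corners, corners[1:] + corners[:1])
    return all(abs(x1 - x2) < tol or abs(y1 - y2) < tol
               for (x1, y1), (x2, y2) in edges)


# A 100 x 12 fragment placed upright at (50, 700).
upright = transform_corners(100, 12, (1, 0, 0, 1, 50, 700))

# The same fragment rotated by 30 degrees around the same origin.
angle = math.radians(30)
rotation = (math.cos(angle), math.sin(angle),
            -math.sin(angle), math.cos(angle), 50, 700)
rotated = transform_corners(100, 12, rotation)

print(is_axis_aligned(upright))   # the upright box fits a rectangle
print(is_axis_aligned(rotated))   # the rotated box does not
```

An x, y, width, height rectangle would be enough for the upright case, but only the four corner points describe the rotated one.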

The code below shows how to extract the text fragments from a page and highlight them:

C#:

VB.NET:

ExtractWords

Extracts the text from a PDF page as a collection of word objects. Each word object consists of the text of the word and the collection of text fragments that were combined to create it.
The context parameter affects performance when extracting words from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
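The relationship between words and text fragments can be sketched in a few lines of library-independent Python. This is not the library's algorithm, which is not described here and most likely also takes fragment positions into account. The hypothetical `group_into_words` function simply splits at whitespace and records which fragments (by index) contributed to each word.

```python
def group_into_words(fragments):
    """Return a list of (word_text, fragment_indexes) pairs."""
    words = []
    text, parts = "", []
    for index, fragment in enumerate(fragments):
        for ch in fragment:
            if ch.isspace():
                # Whitespace ends the current word, if there is one.
                if text:
                    words.append((text, parts))
                text, parts = "", []
            else:
                text += ch
                # Record each fragment once per word it contributes to.
                if not parts or parts[-1] != index:
                    parts.append(index)
    if text:
        words.append((text, parts))
    return words


# Fragments as a PDF producer might emit them: split mid-word.
fragments = ["Extr", "act", "ing te", "xt"]
words = group_into_words(fragments)
print(words)
```

Here the third fragment contributes to both words, which is why a word keeps a collection of fragments rather than a single one.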

ExtractImages

Extracts the images from a PDF page. The method parses the page content and returns a collection of visual images, where each visual image represents one drawing instance of an image object. For example, if a page contains a single image object in its resources but that image is drawn 5 times on the page, the method returns a collection of 5 visual image objects. Each visual image object specifies the position of the image’s 4 corners (the image can be rotated and skewed, so it cannot always be represented as a rectangle), its vertical and horizontal resolution, the image size in pixels, the image colorspace, and the bits per component.
The includeImageData parameter specifies how the image data is handled. If true, the images are decoded and the actual image data is included in the image object, but the method takes longer to complete. If false, the images are not decoded and the method executes faster.
Set this parameter to true if you need to save the images to external storage. Set it to false if you only need information about the image, such as its position on the page, size, or resolution.
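To show why each drawing instance has its own resolution, here is a short, library-independent Python sketch. It is not XFINIUM.PDF code: the `resolution` helper and the dictionaries are hypothetical, and the sketch assumes upright images and the default PDF unit of 1/72 inch. The same pixel data drawn at two display sizes yields two different resolutions.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def resolution(pixel_width, pixel_height, display_width, display_height):
    """Return (horizontal, vertical) resolution in dpi for one drawing
    of an image, given its pixel size and its displayed size in points."""
    dpi_x = pixel_width / (display_width / POINTS_PER_INCH)
    dpi_y = pixel_height / (display_height / POINTS_PER_INCH)
    return dpi_x, dpi_y


# One 600 x 300 pixel image resource, drawn twice at different sizes.
image = {"pixel_width": 600, "pixel_height": 300}
draws = [
    {"display_width": 144, "display_height": 72},   # 2 in x 1 in
    {"display_width": 288, "display_height": 144},  # 4 in x 2 in
]

visual_images = [
    resolution(image["pixel_width"], image["pixel_height"],
               d["display_width"], d["display_height"])
    for d in draws
]
print(len(visual_images))
print(visual_images)
```

One image resource produces two entries, one per drawing, at 300 dpi and 150 dpi respectively.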

The code below shows how to extract information about the images displayed on a page:

C#:

VB.NET:

ExtractVisualObjects

Extracts the content of the page as a collection of visual objects. A visual object can be a path, a text fragment, an image, a shading, or a form XObject.
The method parameters let you control the result:
– includeImageData – if true, the images are decoded and the actual image data is included in the image object, but the method takes longer to complete. If false, the images are not decoded and the method executes faster. Set this parameter to true if you need to save the images to external storage. Set it to false if you only need information about the image, such as its position on the page, size, or resolution.
– keepGraphicContainers – the page content can be extracted as a flat list of visual objects or as a grouped list where the grouping item is a form XObject. If true, the form XObjects are extracted as standalone objects and their content appears in a separate collection as a child of the form XObject. If false, the form XObjects do not appear in the result collection and their content is included directly in the page content.
– context – a PdfContentExtractionContext object. This parameter affects performance when extracting content from multiple pages of the same document. It acts as a cache for objects shared between pages, which speeds up the extraction process.
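The difference between the flat and the grouped result can be illustrated with a library-independent Python sketch. It is not XFINIUM.PDF code: the `collect` function and the dictionary layout are hypothetical, chosen only to mirror the behavior described for keepGraphicContainers.

```python
def collect(objects, keep_graphic_containers):
    """Return visual objects either grouped by form XObject or flattened."""
    result = []
    for obj in objects:
        if obj["type"] != "form":
            result.append(obj)
        elif keep_graphic_containers:
            # Keep the form as a standalone object with its own children.
            children = collect(obj["children"], keep_graphic_containers)
            result.append({"type": "form", "children": children})
        else:
            # Drop the form and inline its content into the parent list.
            result.extend(collect(obj["children"], keep_graphic_containers))
    return result


page_content = [
    {"type": "text"},
    {"type": "form", "children": [{"type": "path"}, {"type": "image"}]},
    {"type": "path"},
]

flat = [o["type"] for o in collect(page_content, False)]
grouped = collect(page_content, True)
print(flat)
print([o["type"] for o in grouped])
print([o["type"] for o in grouped[1]["children"]])
```

The flat list is convenient for searching everything on the page, while the grouped list preserves which objects belong to which form XObject.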

ExtractOptionalContentGroup

Extracts the content of the specified optional content group as a reusable drawing object. The returned PdfPageOptionalContent object can later be drawn on a page using the DrawFormXObject method of the page graphics.

ExtractContentStreamOperators

Extracts the content of a PDF page as a collection of content stream operators. Each operator in the page content stream is represented by an operator object. The collection of operators and their operands can be used for low-level custom analysis of the page content.
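To show what a list of operators and operands looks like, here is a deliberately simplified, library-independent Python sketch. It is not XFINIUM.PDF code and not a full content stream parser: the hypothetical `parse_operators` function splits on whitespace, so it does not handle strings containing spaces, arrays, or dictionaries. In a content stream, operands precede the operator that consumes them.

```python
def is_operand(token):
    # Names start with "/", strings with "(", numbers with a digit,
    # a sign, or a decimal point. Everything else is an operator.
    first = token[0]
    return first in "/(+-." or first.isdigit()


def parse_operators(content):
    """Return a list of (operator, operands) pairs in stream order."""
    operators, operands = [], []
    for token in content.split():
        if is_operand(token):
            operands.append(token)
        else:
            # An operator consumes the operands collected before it.
            operators.append((token, operands))
            operands = []
    return operators


content = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET"
for operator, operands in parse_operators(content):
    print(operator, operands)
```

A custom analysis could then, for instance, pick out every Tj entry to see the raw strings being shown, or every Tf entry to list the fonts selected on the page.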
