Pricing & ordering

Product: W-30-09-02-WALF
Description: Windows Server 2000/2003/2008; Apple Mac OS X Server PPC/Intel; Linux x86/IA-64/x86_64/EM64T; FreeBSD on x86
Price: £600.00

Product: W-30-09-02-SIH
Description: Sun Solaris 7-10 on x86/sparc; IBM AIX 4/5L; HP-UX 10.20/11i on PA-RISC/IA-64
Price: £1,340.00

Product: W-30-09-02-DS
Description: Windows 2000/XP; Apple Mac OS X PPC/Intel
Price: £120.00

Log in to build your order.

Greatstone offers the option of purchasing online by credit card, or submitting a Purchase Order on credit terms.

Delivery of all orders will be by electronic download from our secure website, and will be configured to your requirements. Your software will normally be available for you to use within 12 hours of submitting your order.

If you choose the Purchase Order option, we require a valid order number to quote on our invoice, and the name of the authorising manager in case we need to follow up on payment. For your first order, we require payment by bank transfer or cheque within 7 days of our invoice. All subsequent orders will be invoiced on 30 day terms.

What is PDFlib TET?

The PDFlib Text Extraction Toolkit (TET) is a developer product for reliably extracting text and raster images from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Raster images are extracted in common raster formats. TET optionally converts PDF documents to an XML-based format called TETML which contains text and metadata as well as resource information.

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.

With PDFlib TET you can:

Implement the PDF indexer for a search engine

Repurpose the text and images in PDFs

Convert the contents of PDFs to other formats

Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)



Supported PDF input

PDFlib TET supports all PDF versions up to Acrobat 9 (including RC4 and AES encryption). TET can extract Chinese, Japanese, and Korean text. All CJK encodings are recognized; horizontal and vertical writing modes are supported.

Protected documents can be indexed while at the same time respecting access permissions and permission controls.



Unicode

Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:

TET converts all text contents to Unicode. In C and other languages without native Unicode strings, the text is returned as UTF-8 or UTF-16; in Unicode-capable programming languages it is returned as native strings.

Ligatures and other multi-character glyphs are decomposed into a sequence of the corresponding Unicode characters.

Vendor-specific Unicode assignments (PUA characters) are identified, and mapped to characters in the common Unicode area if possible.

Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.

TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.
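
The Unicode steps listed above can be illustrated with a small sketch in plain Python. This is not TET code and does not use the TET API. It assumes that compatibility normalization (NFKC) from the standard `unicodedata` module is an acceptable stand-in for ligature decomposition, that U+FFFD is the replacement character, and that `pua_map` is a hypothetical user-supplied table for private-use characters. Note that NFKC also folds other compatibility forms (for example full-width letters), so it is broader than the behaviour described here.

```python
import unicodedata

# Assumed default; the text above only says the replacement character is configurable.
REPLACEMENT = "\ufffd"


def is_private_use(ch):
    """True for code points in a Unicode Private Use Area (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


def normalize_text(text, pua_map=None, replacement=REPLACEMENT):
    """Decompose ligatures and map private-use characters.

    pua_map is a hypothetical table from PUA characters to ordinary text.
    PUA characters missing from the table become the replacement character.
    """
    pua_map = pua_map or {}
    out = []
    for ch in text:
        if is_private_use(ch):
            out.append(pua_map.get(ch, replacement))
        else:
            # NFKC expands compatibility ligatures such as U+FB01 into "fi".
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)


if __name__ == "__main__":
    print(normalize_text("\ufb01le"))                    # ligature decomposed
    print(normalize_text("a\ue000b", {"\ue000": "fi"}))  # PUA character mapped
```

Keeping the PUA table as an argument mirrors the idea that vendor-specific assignments vary from one document producer to another.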



Content analysis and word identification

TET includes advanced content analysis algorithms:

Patented algorithm for determining word boundaries which is required to retrieve proper words

Recombine the parts of hyphenated words

Remove duplicate instances of text, e.g. shadow and artificially bolded text

Recombine paragraphs in reading order

Reorder text which is scattered over the page
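
Two of the steps above, removing shadow duplicates and recombining hyphenated words, can be sketched with simple heuristics. The code below is an illustration only, not TET's patented algorithm. The `Glyph` type, the position tolerance, and the rule "join when a line ends in a hyphen and the next starts with a lowercase letter" are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical glyph record: a character and its position on the page."""
    char: str
    x: float
    y: float


def drop_shadow_duplicates(glyphs: List[Glyph], tolerance: float = 1.0) -> List[Glyph]:
    """Keep only the first of several identical glyphs drawn at almost the same spot."""
    kept: List[Glyph] = []
    for g in glyphs:
        duplicate = any(
            g.char == k.char
            and abs(g.x - k.x) <= tolerance
            and abs(g.y - k.y) <= tolerance
            for k in kept
        )
        if not duplicate:
            kept.append(g)
    return kept


def dehyphenate(lines: List[str]) -> str:
    """Join lines into one string, merging words split by a line-end hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        split_word = (
            len(text) > 1
            and text.endswith("-")
            and text[-2].isalpha()
            and line[0].islower()
        )
        if split_word:
            text = text[:-1] + line   # drop the hyphen, no space
        elif text:
            text += " " + line
        else:
            text = line
    return text


if __name__ == "__main__":
    shadowed = [Glyph("A", 10.0, 20.0), Glyph("A", 10.4, 19.7), Glyph("B", 18.0, 20.0)]
    print("".join(g.char for g in drop_shadow_duplicates(shadowed)))
    print(dehyphenate(["The extrac-", "tion toolkit"]))
```

A heuristic this simple would also merge genuine compounds that happen to break at the hyphen (such as "well-" followed by "known"), which is one reason production tools use more elaborate analysis.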



Page Layout and Table Detection

The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.
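
To show why column detection improves text ordering, here is a minimal gap-based sketch. It is not the method TET uses. It assumes each text line is already known as a hypothetical `(left, right, y, text)` tuple in PDF coordinates (y grows upward), and that a horizontal gap of at least `min_gap` points separates columns.

```python
def group_into_columns(lines, min_gap=20.0):
    """Group text lines into columns and return the text of each column top to bottom.

    lines: iterable of (left, right, y, text) tuples (hypothetical input format).
    """
    columns = []
    for left, right, y, text in sorted(lines):
        if columns and left - columns[-1]["right"] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            col = columns[-1]
            col["right"] = max(col["right"], right)
        else:
            # A wide horizontal gap starts a new column.
            col = {"right": right, "lines": []}
            columns.append(col)
        col["lines"].append((y, text))
    # Larger y is higher on the page, so sort descending for reading order.
    return [
        [text for _, text in sorted(col["lines"], key=lambda item: -item[0])]
        for col in columns
    ]


if __name__ == "__main__":
    page = [
        (50, 250, 700, "L1"), (300, 500, 700, "R1"),
        (50, 240, 680, "L2"), (300, 480, 680, "R2"),
    ]
    print(group_into_columns(page))
```

Reading the same page strictly by y position would interleave the two columns (L1, R1, L2, R2), which is the ordering problem that column analysis avoids.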



Text Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
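
Including or excluding page areas amounts to a bounding-box test on each piece of text. The sketch below shows the idea in plain Python. The `Box` and `Word` types and the `select_words` function are hypothetical names for this example, not part of the TET API, and the coordinates assume the usual PDF convention of an origin at the lower left corner.

```python
from typing import Iterable, List, NamedTuple, Optional


class Box(NamedTuple):
    """Rectangle given by its lower left and upper right corners."""
    llx: float
    lly: float
    urx: float
    ury: float


class Word(NamedTuple):
    text: str
    box: Box


def contains(area: Box, box: Box) -> bool:
    """True if box lies completely inside area."""
    return (area.llx <= box.llx and area.lly <= box.lly
            and box.urx <= area.urx and box.ury <= area.ury)


def select_words(words: Iterable[Word],
                 include: Optional[Box] = None,
                 exclude: Iterable[Box] = ()) -> List[Word]:
    """Keep words inside `include` (if given) that are not inside any `exclude` area."""
    exclude = list(exclude)
    selected = []
    for word in words:
        if include is not None and not contains(include, word.box):
            continue
        if any(contains(area, word.box) for area in exclude):
            continue
        selected.append(word)
    return selected


if __name__ == "__main__":
    words = [
        Word("Header", Box(50, 800, 150, 812)),
        Word("Body", Box(50, 400, 100, 412)),
        Word("3", Box(290, 20, 300, 32)),
    ]
    header = Box(0, 780, 595, 842)   # top strip of an A4 page
    footer = Box(0, 0, 595, 50)      # bottom strip
    print([w.text for w in select_words(words, exclude=[header, footer])])
```

Requiring full containment is a design choice for this sketch; a word that straddles an area boundary is treated as outside the area.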



Image Extraction

Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined into larger images to facilitate repurposing. Image fidelity is preserved because no downsampling or color space conversion occurs.



PDF Analysis

The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and much more (see separate datasheet for the pCOS product).



Repair Mode

Various kinds of damaged PDF documents are detected and automatically repaired if possible.



Configuration Options for problematic PDF

TET contains special handling and workarounds for various kinds of PDF where the text cannot be extracted correctly with other products. In addition, it includes various configuration features to improve processing of problem documents:

Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.

PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It works as a plugin for Adobe Acrobat. This plugin is freely available for Mac and Windows.

Embedded fonts are analyzed to find additional hints which are useful for Unicode mapping. External font files or system fonts are used to improve text extraction results if a font is not embedded.



Document Domains

PDF documents may contain text in places other than the page contents. While most applications deal with the page contents only, other document domains are relevant in many situations. TET extracts the text from all of the following document domains:

page contents

predefined and custom document info entries

XMP metadata on document and image level

bookmarks

file attachments and PDF collections/portfolios (processed recursively)

form fields

comments (annotations)

general PDF properties, such as page count and conformance to standards like PDF/A or PDF/X



XMP Metadata

TET supports XMP metadata in several ways:

Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.

TETML output contains XMP document and image metadata if present in the PDF.

Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.



TETML represents PDF Contents as XML

TET optionally represents the PDF contents in an XML flavor called TETML which contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.

TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.
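
Because TETML is XML, it can also be processed with a general-purpose XML parser instead of XSLT. The sketch below uses Python's standard `xml.etree.ElementTree` on a simplified sample. The element names (`Document`, `Page`, `Para`, `Word`, `Text`) are only loosely modelled on TETML and the sample omits namespaces, so treat them as assumptions and check the schema shipped with TET before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real TETML follows the schema shipped with TET.
SAMPLE = """<Document>
  <Page number="1">
    <Para><Word><Text>Hello</Text></Word><Word><Text>PDF</Text></Word></Para>
  </Page>
  <Page number="2">
    <Para><Word><Text>World</Text></Word></Para>
  </Page>
</Document>"""


def words_by_page(xml_text):
    """Return a dict mapping page number to the list of word texts on that page."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("Page"):
        number = int(page.get("number"))
        pages[number] = [t.text for t in page.iter("Text") if t.text]
    return pages


if __name__ == "__main__":
    for number, words in sorted(words_by_page(SAMPLE).items()):
        print(number, " ".join(words))
```

A per-page word list like this is a convenient starting point for the indexing and conversion tasks mentioned earlier.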



TET Connectors

TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:

The TET Plugin for Adobe Acrobat is a free utility for extracting text and images from PDF. It offers better functionality than Acrobat’s built-in tools, and can be used to evaluate TET interactively.

TET connector for the Lucene Search Engine

TET connector for the Solr Search Server

TET connector for Oracle Text

TET connector for MediaWiki

TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes them available to search and retrieval software on Windows (see separate datasheet for details).



TET Cookbook

The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.

Product features
implement a search engine for processing PDF
extract text from PDF, e.g. to store it in a database
convert the text content of PDF pages to XML for processing with other tools
process PDFs based on their contents
Copyright © 2012 Greatstone - All rights reserved