PDFlib TET, 2021-03 PDFlib GmbH www.pdflib.com
Geometry
TET provides precise metrics for the text, such as the position on
the page, glyph widths, and text direction. Specific areas on the
page can be excluded or included in the text extraction, e.g. to
ignore headers and footers or margins.
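As a rough illustration of area-based filtering, the sketch below keeps only words whose bounding boxes lie inside an include rectangle, which drops header and footer text. It is plain Python and not the TET API: the `Box` type, the `inside` and `filter_words` helpers, and the sample coordinates are all hypothetical. In TET the areas are applied during extraction itself.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF points, origin at the lower left."""
    llx: float
    lly: float
    urx: float
    ury: float


def inside(inner: Box, outer: Box) -> bool:
    """Return True if `inner` lies completely within `outer`."""
    return (inner.llx >= outer.llx and inner.lly >= outer.lly
            and inner.urx <= outer.urx and inner.ury <= outer.ury)


def filter_words(words, include: Box):
    """Keep the text of every (text, box) pair whose box is inside `include`."""
    return [text for text, box in words if inside(box, include)]


# Hypothetical words on an A4 page (595 x 842 points).
words = [
    ("Header", Box(50, 800, 120, 815)),
    ("Body", Box(50, 400, 90, 415)),
    ("Footer", Box(50, 20, 110, 35)),
]

# Include area that leaves out a strip at the top and bottom of the page.
body_area = Box(0, 50, 595, 790)
body_words = filter_words(words, body_area)
print(body_words)
```

An exclude area works the same way with the test inverted.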
Text Color
TET analyzes color information in the PDF page description and
returns precise color information for each glyph. This can be used,
for example, to identify headings or other highlighted text. Optionally,
the advanced color spaces Separation and DeviceN can be
extracted in a simpler alternate color space.
Image Extraction
Images on PDF pages can be extracted as TIFF, JPEG, JBIG2 or JPEG
2000 files. Precise geometric information (position, size, and
angles) is reported for each image. Fragmented images are combined
into larger images to facilitate repurposing. Image fidelity is
guaranteed since no downsampling or color conversion occurs. This
ensures the highest possible image quality.
Ignore Artifacts in Tagged PDF
In Tagged PDF, especially PDF/UA, irrelevant content may be tagged
as Artifact, e.g. headers and footers. TET optionally ignores Artifact
text and images.
PDF Analysis with the pCOS Interface
The TET library includes the pCOS interface for querying details
about a PDF document, such as document info and XMP metadata,
font lists, page size, and much more (see separate pCOS datasheet).
Unicode Postprocessing
TET supports various Unicode postprocessing steps which can be
used to improve the extracted text:
> Foldings preserve, remove or replace characters, e.g. remove
punctuation or characters from irrelevant scripts.
> Decompositions replace a character with an equivalent sequence
of one or more other characters, e.g. replace narrow, wide or
vertical Japanese characters or Latin superscript variants (e.g. ª)
with their respective standard counterparts.
> Text can be converted to all Unicode normalization forms, e.g.
emit NFC form to meet the requirements for Web text or a database.
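The three kinds of postprocessing can be approximated with Python's standard `unicodedata` module. This is only a sketch of the concepts, not TET's implementation: the function names are made up for illustration, and NFKC compatibility normalization is used here as a stand-in for decompositions, which TET controls separately.

```python
import unicodedata


def fold_punctuation(text: str) -> str:
    """Folding example: drop every character in a Unicode punctuation category."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def to_standard_forms(text: str) -> str:
    """Decomposition stand-in: NFKC maps wide, narrow and superscript
    variants to their standard counterparts."""
    return unicodedata.normalize("NFKC", text)


def to_nfc(text: str) -> str:
    """Normalization example: compose base letters and combining marks."""
    return unicodedata.normalize("NFC", text)


folded = fold_punctuation("Hello, world!")

# Fullwidth "PDF", halfwidth katakana KA, and the superscript-style U+00AA.
standard = to_standard_forms("\uff30\uff24\uff26 \uff76 \u00aa")

# "e" followed by a combining acute accent becomes a single code point.
composed = to_nfc("e\u0301")

print(folded, standard, composed)
```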
Document Domains
PDF documents may contain text in places other than the page
contents. While most applications deal with the page contents
only, in many situations other document domains may be relevant
as well. TET extracts the text from all document domains:
> page contents
> predefined and custom document info entries
> XMP metadata on document and image level
> bookmarks
> file attachments and PDF portfolios are processed recursively
> form fields
> comments (annotations)
> general PDF properties can be queried, such as page count,
conformance to standards like PDF/A or PDF/X, etc.
XMP Metadata
TET supports XMP metadata in several ways:
> Using the integrated pCOS interface, XMP metadata for the
document, individual pages, images, or other parts of the
document can be extracted programmatically.
> TETML output contains XMP document and image metadata.
> Images extracted in the TIFF or JPEG formats contain XMP image
metadata.
TETML represents PDF Contents as XML
TET optionally represents the PDF contents in an XML flavor called
TETML. It contains a variety of PDF information in a form which
can be processed with common XML tools. TETML contains the
text plus optionally font and position information, resource details
(fonts, images, colorspaces), and metadata.
TETML also includes interactive elements such as form fields, annotations,
bookmarks, etc. It can even be used to analyze JavaScript
or color space details, ICC profiles or output intents.
TETML can be processed with XSLT stylesheets, e.g. to apply filters
or to convert TETML to other formats. Sample XSLT stylesheets for
processing TETML are included in the TET distribution.
The following fragment shows TETML output with glyph details:
<Word>
  <Text>PDFlib</Text>
  <Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
    <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
    <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
    <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
    <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
    <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
    <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
  </Box>
</Word>
TETML can include information about word and paragraph grouping
as well as about tables and lists, image placement, and annotations,
along with geometric information for these elements.
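Because TETML is ordinary XML, a fragment like the Word element shown in this section can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a copy of that fragment. It assumes the element and attribute names exactly as printed here; a complete TETML file may declare an XML namespace and wrap words in further elements, so the lookups would need adjusting for real output.

```python
import xml.etree.ElementTree as ET

# Copy of the Word fragment from this section (no namespace, as printed).
FRAGMENT = """
<Word>
  <Text>PDFlib</Text>
  <Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
    <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
    <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
    <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
    <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
    <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
    <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
  </Box>
</Word>
"""

word = ET.fromstring(FRAGMENT)
text = word.findtext("Text")
box = word.find("Box")
glyphs = box.findall("Glyph")

# Rebuild the word from its glyphs and compare the two width measures.
glyph_text = "".join(g.text for g in glyphs)
box_width = float(box.get("urx")) - float(box.get("llx"))
advance_sum = sum(float(g.get("width")) for g in glyphs)

print(f"word={text!r} glyphs={glyph_text!r}")
print(f"box width={box_width:.2f} sum of glyph widths={advance_sum:.2f}")
```

The two widths differ slightly because the attribute values in the fragment are rounded to two decimals.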
TET Connectors
TET connectors interface TET with other software. They make PDF
text extraction available for various software environments:
> TET connector for the Lucene Search Engine
> TET connector for the Solr Search Server
> TET connector for the Apache TIKA toolkit
> TET connector for Oracle Text
> TET connector for MediaWiki
> TET PDF IFilter for Microsoft products is available as a separate
product. It extracts text and metadata from PDF documents and
makes them available to search and retrieval software on Windows
(see separate datasheet for details).