● Slight movements of text position.
● Slight changes to boilerplate text.
● Addition of new boilerplate text or new key-value pairs.
Strictly speaking, such a collection is NOT a single template, but a number of closely
related templates (e.g. yearly versions of the same form). If the differences are small, the
tool can overcome them, but if the differences are large (especially changes to the text or
order of key-value pairs), the results might be disappointing. If your documents have large
differences, you might want to partition them into collections of similar documents, run
“Extract structured data” on each of these subset collections separately, and then combine
your downloaded results.
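The "combine your downloaded results" step can be sketched in Python. This is only an illustration under stated assumptions: it assumes each subset collection's results were saved as a CSV file with a header row, and the function name combine_results, the added "source_file" column, and the file names in the usage note are all hypothetical, not part of the tool.

```python
import csv
from pathlib import Path


def combine_results(csv_paths, output_path):
    """Merge several downloaded result files into one CSV.

    Assumption: each input is a CSV with a header row. The output
    columns are the union of all input headers, in first-seen order.
    A hypothetical "source_file" column records which subset
    collection each row came from.
    """
    fieldnames = ["source_file"]
    rows = []
    for path in map(Path, csv_paths):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Collect any column names not seen in earlier files.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                rows.append({"source_file": path.name, **row})

    with Path(output_path).open("w", newline="", encoding="utf-8") as handle:
        # restval="" leaves a blank cell where a subset lacks a column.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

For example, combine_results(["forms_2019.csv", "forms_2020.csv"], "combined.csv") would write one file in which keys that exist in only one yearly version are left blank for the other. Taking the union of columns, rather than the intersection, keeps key-value pairs that were added in later versions of the form.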
When in doubt...
Just give it a try. Sometimes, if documents aren’t really of the same template, the tool will
return garbage. But sometimes you may be surprised to get some good results. This is especially
true if there’s a segment of the documents (e.g. the first page) that adheres to these rules
while the rest doesn’t. You may be able to get good extraction results from the “good”
segment.
Examples (from across the web)
● Country factbook dataset: Toy dataset built for demos. Includes key-value pairs,
tables, and repeated sections. Great for learning how to use the tool.
● Medical Examination Report Form: Despite its complexity, if the form is fixed then it’s
a valid template. Some sections (e.g. “Testing”) may have a column layout, but as
long as it is fixed across documents, that’s still valid. Note that support for extracting
checkboxes is not yet available in the Alpha release.
● Hearing Aid Compatibility Status Certifications (FCC Form 855): Demonstrates how
key-value pairs may be subtle, how optional sections are allowed, and how the
document flows across pages.
● Charlotte-Mecklenburg Police Department Incident Report: A great example of a
non-fixed form template. Note the many key-value pairs, and the repeated sections
(“Property”).
● Bill Receipt: Somewhat problematic, since the top key-value pairs are split into two
distinct columns. If they are always consistent (e.g. “Phone” on the left is always on
the same row as “Date” on the right) then it’s OK. Otherwise, extraction from the
header fields might be unreliable.