Definition
We define a “search platform” as a complete search engine system that can serve as a multi-purpose information aggregation, access, and analysis platform in addition to meeting classic enterprise or Web search needs. Such a search platform, also referred to as a “unified information access” (UIA) platform, encompasses all core data management functions, though with an NLP/indexing twist. These functions include the following (a brief sketch follows the list):
• Data capture (crawlers, connectors & APIs)
• Data storage (cached copies of source content and the index itself)
• Data processing (NLP and index construction and maintenance)
• Data access (human and machine IR, faceted navigation and dashboard analytics)
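To make these four functions concrete, the following minimal Python sketch models each one in a few lines. It is an illustrative toy, not a real product API: the class and method names are invented for this example, simple tokenization stands in for full NLP processing, and a small in-memory inverted index stands in for a production-grade index.

from collections import defaultdict

# Toy model of a search DMS; all names are illustrative, not a product API.
class SearchPlatform:
    def __init__(self):
        self.store = {}                    # data storage: cached copies of source content
        self.index = defaultdict(set)      # data storage: the index itself

    def capture(self, doc_id, text):
        # Data capture: a crawler or connector would call this per document.
        self.store[doc_id] = text
        # Data processing: a trivial stand-in for NLP and index construction.
        for token in text.lower().split():
            self.index[token].add(doc_id)

    def search(self, term):
        # Data access: keyword lookup against the inverted index.
        return sorted(self.index.get(term.lower(), set()))

platform = SearchPlatform()
platform.capture("doc1", "Big Data search platform")
platform.capture("doc2", "NoSQL data storage")
print(platform.search("data"))             # -> ['doc1', 'doc2']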
A search system is therefore a DMS like its NoSQL and NewSQL
counterparts, and it achieves massive scalability in much the
same way, i.e., through distributed architectures, parallel
processing, column-oriented data models, etc. However, it is
the semantic capabilities and high usability of search-based
DMS that make them ideal complements to (and in some cases,
alternatives to) NoSQL and NewSQL systems.
First, a search DMS enables full-text search of any NoSQL, NewSQL, or large volume “Old”SQL system (a highly valuable contribution in and of itself). Second, it brings industrial automation to the task of meaningfully structuring data (a must-have for extracting value from Big Data) either for direct use or as a source for another system. A search platform can (see the sketch after this list):
• Effectively structure large volume unstructured content
• Enrich data of any kind with meanings and relationships not reflected in source systems
• Aggregate heterogeneous, multi-source content (unstructured and/or structured) into a meaningful whole
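As a sketch of the aggregation idea, the Python fragment below normalizes records from three kinds of sources (structured, semi-structured, and unstructured) into one common document shape that a single index could then ingest. All field names and sample data are invented for illustration.

# Hedged sketch: mapping heterogeneous sources onto one document schema.
# Field names and sample records are illustrative only.

def from_database_row(row):
    # Structured source: a relational or NoSQL record.
    return {"id": f"db-{row['pk']}", "title": row["name"], "body": row["description"]}

def from_web_log(line):
    # Semi-structured source: one Web log line.
    timestamp, _, message = line.partition(" ")
    return {"id": f"log-{timestamp}", "title": "log entry", "body": message}

def from_file(path, text):
    # Unstructured source: a document on a file server.
    return {"id": f"file-{path}", "title": path, "body": text}

documents = [
    from_database_row({"pk": 42, "name": "Widget", "description": "A sample part"}),
    from_web_log("2012-06-01T12:00:00 GET /widgets/42"),
    from_file("specs/widget.txt", "Widget tolerances and materials"),
]
# Every record now shares the same schema and can be indexed uniformly.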
To structure unstructured data, a search platform runs content through NLP processors that successively break it down, analyze it, and enrich it with structural and semantic attributes and values. Take the processing of an HTML page, for example. First, in text-centric processing (see the section on Crawlers in Data Capture & Preprocessing), a crawler captures basic structural information about the page, like page size, file type, and URL, and transmits it along with the page text to an indexer.
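A minimal stand-in for this capture step, using only the Python standard library, might look like the following. The metadata fields shown (URL, file type, page size) come straight from the description above; everything else, including the function names, is assumed for the example.

from html.parser import HTMLParser

# Hedged sketch of text-centric capture: strip the markup, keep the text,
# and record basic structural metadata for the indexer.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def capture(url, raw_html):
    extractor = TextExtractor()
    extractor.feed(raw_html)
    return {
        "url": url,                                   # where the page was found
        "file_type": "text/html",                     # basic structural info
        "page_size": len(raw_html.encode("utf-8")),   # size in bytes
        "text": " ".join(extractor.chunks),           # payload for the indexer
    }

page = capture("http://example.com/guide",
               "<html><body><h1>Big Data</h1><p>A practical guide.</p></body></html>")
print(page["file_type"], page["page_size"], page["text"])
# -> text/html 68 Big Data A practical guide.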
The indexer complements this baseline information with the results of semantic analysis to create a holistic “document” to be indexed. At a minimum, this analysis includes determining what language the text is written in, then parsing the content for indexable keywords (and ultimately phrases), identifying along the way the grammatical form of each keyword and its possible grammatical and semantic variants. More sophisticated indexers may then analyze the text to identify synonyms and related terms, to flag known people, places or things (using standard or custom lists), to determine the general subject matter treated, to decide whether the overall tone is positive or negative, etc. Business rules may be used to guide the analysis and to perform various types of ETL-style data transformations, including extracting only a select set of attributes in order to distill Big Data down into a pertinent and manipulable subset.
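The sketch below compresses these enrichment steps into a toy Python function. Every component is a deliberately crude stand-in: the language is hard-coded where a real detector would run, the keyword handling is plain tokenization, and the entity and tone word lists are invented sample data.

# Hedged sketch of indexer-side enrichment; all lists are toy sample data.
KNOWN_ENTITIES = {"dassault", "paris"}                 # custom people/places/things list
POSITIVE, NEGATIVE = {"great", "fast"}, {"slow", "bad"}

def enrich(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return {
        "language": "en",                              # stand-in for language detection
        "keywords": tokens,                            # stand-in for parsing/lemmatization
        "entities": [t for t in tokens if t in KNOWN_ENTITIES],
        "tone": sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens),
    }

doc = enrich("Dassault builds great, fast search tools.")
print(doc["entities"], doc["tone"])                    # -> ['dassault'] 2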
Once this structured version of a previously unstructured document has been created, semantic technologies can be used to identify links between it and other documents, whether the other documents are derived from structured sources like databases, semi-structured sources like Web logs, or other unstructured sources like file servers. In this way, you can build a unified, meaningfully organized Big Data collection from any number or type of source systems, and you can further search, explore and analyze this information along any axis of interest (products, people, events, etc.).
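To illustrate what “any axis of interest” means in practice, the following sketch computes facet counts over a small unified collection. The documents and attribute names are invented; the point is only that once documents share enriched attributes, any attribute can serve as an axis for exploration.

from collections import Counter

# Toy unified collection; sources and attributes are illustrative only.
docs = [
    {"id": 1, "source": "database",    "product": "Widget", "person": "Alice"},
    {"id": 2, "source": "web_log",     "product": "Widget", "person": "Bob"},
    {"id": 3, "source": "file_server", "product": "Gadget", "person": "Alice"},
]

def facet(documents, axis):
    # Count how many documents carry each value along the chosen axis.
    return Counter(d[axis] for d in documents if axis in d)

print(facet(docs, "product"))   # Counter({'Widget': 2, 'Gadget': 1})
print(facet(docs, "person"))    # Counter({'Alice': 2, 'Bob': 1})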
When your target user is a business user and not an expert programmer or statistician, the search foundation provides a singular advantage: no other technology is as effective as search at making Big Data meaningful and accessible to ordinary human users.
Tools like natural language search, faceted navigation and data
visualization provide users of all skill levels with an instantly
familiar way of exploring and analyzing Big Data.
That is to say, they allow a user to launch any search or analytical task the same way they launch a search on the Web: by entering a phrase or a few keywords in a text box. They also enable a user to conduct iterative exploratory analytics simply by clicking on (traversing) dynamic data clusters (represented as text menus or in visual forms like charts or graphs).
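A faceted “click” boils down to adding one more filter to the current result set. The short sketch below models that drill-down loop; as before, the data and function names are invented for illustration.

# Hedged sketch: each facet click adds a filter, narrowing the results.
docs = [
    {"id": 1, "product": "Widget", "person": "Alice"},
    {"id": 2, "product": "Widget", "person": "Bob"},
    {"id": 3, "product": "Gadget", "person": "Alice"},
]

def refine(documents, **filters):
    return [d for d in documents if all(d.get(k) == v for k, v in filters.items())]

results = refine(docs, product="Widget")      # user clicks the "Widget" facet
results = refine(results, person="Alice")     # then clicks the "Alice" facet
print([d["id"] for d in results])             # -> [1]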
This ease of use plus the sheer responsiveness of search
platforms encourages iterative exploration: if users get instant
answers to questions they ask in their own way, they are
enticed to query and explore further. If questions are difficult
to formulate and/or answers are sluggish in coming, users will look elsewhere or give up their quest altogether.
Search platforms are responsive because they are optimized
for fast query processing against large volumes of data
(read operations), and because most of the calculations they
use to produce dashboard analytics and ad hoc drilling are
automatically executed as part of routine indexing processes:
the results are there waiting to be exploited with no processing
overhead (CloudView extends analytic possibilities with high-