In the idealized
version of the semantic web, humans don’t deal with metadata; it’s all done under
the hood to process information that’s delivered fully baked to us, the knowledge primates.
How far are we from that vision? What does “fully baked” mean? Is it even
possible?
If you recall, this issue
ties back into the fundamental questions I raised in one of my previous posts.
More specifically, it ties into the second question: “Who is primarily going to
process the labels?” (with labels referring to metadata).
The potential answers
were straightforward:
- Machines (then we need rigorous
standards)
- Humans (then we don’t need standards that are as
rigorous; tagging will do)
- Both – most likely (then we have two sets
of metadata standards… e.g. RDF/OWL and del.icio.us-type tags… and machines
progressively work their way up to turn their machine labels into labels that
are more useful for us humans)
Most players out there
in the semantic web and knowledge management fields are focused on trying to
get machines to process more of the metadata than they used to. The key
enablers are, on the one hand, information standards and, on the other, smart
(and less smart) algorithms. Most players tend to put more effort into one enabler
or the other as a way to tackle the problem of automating metadata processing.
On the one hand, we
have what I’ll identify as the data homogenizers:
those players are driven by standardization trends in data and metadata.
Because it’s much easier to process well-defined and recognizable data types, they
tend to focus on information that has limited conceptual content and little need
for disambiguation and context, such as times, places, people, and organizations.
RDF together with OWL standards constitute the technological foundation of
choice for this school of thought. The goal is to create labels that can enable
machines to process all that information directly, without human input. So those
labels are mostly leveraged by machines using simple agents to make
associations at the data level within and across the online world,
organizations, and private individual domains. Building on that, the hope is to
tackle data of increasingly higher complexity through standardization, and the vision
is to cover a larger fraction of the information building blocks and then find a
way up towards higher levels of the semantic pyramid (see previous post).
Data homogenizers are getting
the most press in the semantic web world currently, due to the readiness of the
enabling technologies, mainly RDF, OWL, and SPARQL. As mentioned above, this
stream is highly dependent on the capacity to standardize data formats. That is
a huge limitation. Initiatives to extend this standardization to new formats, beyond
the most basic ones listed above such as locations and names, have to be confined
to either very basic data or very specific universes (e.g. a format that is used by
libraries but not beyond).
The data homogenizer
ecosystem will keep expanding at an interesting pace, and will bring great
developments and “mash-ups”, but likely not the wow feeling needed to kick off
a gold rush on the user side. Freebase boasts
that it can let you search for Jennifer Connelly films with actors who have
appeared in a Steven Spielberg movie. Not the type of search I run more
than half a dozen times a year. Twine can recognize people, organizations, and dates,
and use that to twine data together. It’s a great step forward, but still falls
short of solving a critical market problem. And for the part that holds the highest user
value, offering content and connection recommendations by creating metadata for
complex information sequences, it still relies heavily on human
tagging, which brings only incremental value over existing processes such as exploring
expert bookmarks through tags in del.icio.us.
Problem is, those
much-heralded data homogenizers are the least semantic part of the so-called semantic web. They detect things
like people’s names and locations on a page, but do not surface what a
document, a paragraph or a sentence is about. As such, their value resides
mainly in complex search queries and pooling things together. Imagine a database
that would match exact entries across documents to connect data together
automatically. It would do well if you’re looking for Jennifer Connelly films
with actors who have appeared in a Steven Spielberg movie. Not if you’d like to
find what dances were in vogue across Latin America in the 50s and compile a summary of those dance techniques, a query that requires a broader range of metadata and analytical processing than currently enabled by RDF, OWL and SPARQL.
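To make the "exact match" point concrete, here is a minimal sketch of that kind of database in Python. It is not how Freebase or any RDF store is actually implemented; the triples, the predicate names (`directed_by`, `stars`) and the function names are all made up for illustration, and the films and people are placeholders rather than real data.

```python
# Hypothetical, made-up (subject, predicate, object) triples. Placeholders only.
TRIPLES = [
    ("Film A", "directed_by", "Director X"),
    ("Film A", "stars", "Actor 1"),
    ("Film B", "stars", "Actor 1"),
    ("Film B", "stars", "Actor 2"),
    ("Film C", "stars", "Actor 2"),
    ("Film C", "stars", "Actor 3"),
    ("Film D", "directed_by", "Director X"),
    ("Film D", "stars", "Actor 3"),
    ("Film E", "stars", "Actor 2"),
    ("Film E", "stars", "Actor 4"),
]


def subjects(predicate, obj):
    """All subjects linked to `obj` through `predicate` (exact string match)."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def objects(subject, predicate):
    """All objects linked from `subject` through `predicate` (exact string match)."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def films_with_costars_directed_by(star, director):
    """Films starring `star` where a co-star has appeared in a film by `director`."""
    # Everyone who has appeared in at least one of the director's films.
    alumni = set()
    for film in subjects("directed_by", director):
        alumni |= objects(film, "stars")

    matches = set()
    for film in subjects("stars", star):
        costars = objects(film, "stars") - {star}
        if costars & alumni:  # at least one co-star worked with the director
            matches.add(film)
    return sorted(matches)


print(films_with_costars_directed_by("Actor 2", "Director X"))
```

The whole thing is set intersection over identical strings. Ask for `"actor 2"` in lower case and the query comes back empty, and nothing in the data says what any of the films is about, which is why the Latin American dance question is out of reach for this approach.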
Discovering some relevant
new information is a big problem and the data homogenizers help with that. But making
sense of the knowledge out there, and helping with things like cutting through information
overload by delivering just the insights I really care about, are problems of much
larger proportions. Organizing my emails automatically, figuring out what part
of a webpage is really about Google Earth API capabilities, aggregating several
documents into one intelligently, that’s what I’d expect a semantically-enabled
application to do.
So let’s look at the second
group, and let’s call those the meaning
makers. One example of such players would be Endeca, which offers automated content
navigation. A number of emerging applications tackling the problem of
autotagging also qualify. Those attempt to extract the meaning from groups of
documents, individual documents, and structured fragments of documents (the 3 upper levels
of the semantic pyramid).
They don’t go down to the building-block level to try and identify an author, a
location, or a date, because that’s of limited help in
extracting the broader meaning. The technologies they use tend to be much more
mathematically complex and often mimic the process human brains go through
when tagging information. This explains why they have a history of not working very well in
practice.
Until now, the
problem for these players has been either relying too much on humans for the conceptual
interpretation task, which creates an adoption barrier, or providing
relatively poor results when relying too heavily on algorithms. The semantic
agents indeed tend to be weak compared to human reasoning. That’s a key reason why
people-powered applications like Technorati or del.icio.us continue to exist,
frankly. A second problem lies in the output on the other side of the pipe: for
now, because of the complexity and lack of standardization of the generated metadata,
a lot of that metadata still appears as tags for human users to process. In
other words, there is little that machines can do with the extracted metadata
apart from feeding users content carrying similar tags. For instance, if a
paragraph refers to Hinduism in a document on India, my algorithm will tag it
accordingly, and then I may be able to bring up that paragraph as part of a
search on Hinduism. Generally, the machine value-add process pretty much stops
here. Beyond content retrieval, it won’t use the Hinduism tag much. But the
point is, it could, and soon it likely will.
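The Hinduism example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the keyword lists in `CONCEPT_KEYWORDS` are a crude stand-in for whatever concept-extraction algorithm a real meaning maker would use, and the sample paragraphs and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical keyword lists standing in for a real concept-extraction algorithm.
CONCEPT_KEYWORDS = {
    "Hinduism": {"hinduism", "hindu"},
    "Geography": {"river", "mountain", "coast"},
}

# Made-up sample paragraphs from an imaginary document on India.
DOCUMENT = [
    "The country has a long coast and several major river systems.",
    "Hinduism shapes many of the festivals described in this section.",
    "The economy has grown quickly in recent years.",
]


def tag_paragraph(text):
    """Return the concept tags whose keywords appear in the paragraph."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return {concept for concept, keywords in CONCEPT_KEYWORDS.items()
            if words & keywords}


def build_index(paragraphs):
    """Map each tag to the positions of the paragraphs that carry it."""
    index = defaultdict(list)
    for position, text in enumerate(paragraphs):
        for tag in tag_paragraph(text):
            index[tag].append(position)
    return index


index = build_index(DOCUMENT)

# Retrieval is the only thing the machine does with the tag.
hits = [DOCUMENT[i] for i in index.get("Hinduism", [])]
print(hits)
```

Everything after `build_index` is a lookup: the tag brings the paragraph back, and then a human has to read it. Using the tag for anything richer, such as relating it to other concepts, is exactly the step that is still missing.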
The capability these
applications develop is that of automatically adding intelligent metadata to
relatively unstructured pools of information. Everywhere information lives, such unstructured
data tends to abound and prevail, so it is preferable and more scalable to explore smart
solutions that will make sense of it, rather than to ask that this data be structured
using standards that by nature can only be local, transient, and prejudiced. Reasoning
capabilities give these applications a long-term edge in covering all levels of the
metadata pyramid in great depth. Once you’re capable of recognizing, extracting
and playing with the concepts in a scientific white paper, little remains in
the way of recognizing dates, locations, and people.
In practice, as
highlighted previously, this is a much harder endeavor than standardizing data
and then adding simple inference agents to process relationships. From a purely
pragmatic angle, data homogenizers deliver,
whereas the meaning makers tend to get
bogged down in execution complexity, which clever players reduce by
constraining themselves to B2B verticals. As a result, the market is directing
increasing resources to the homogenizers, as a look at the different funding rounds of
Radar Networks, Metaweb or Hakia suffices to demonstrate.
Looking ahead, it’s
going to be interesting to see both sides wrestle to define and control the heart of the semantic web market. The data homogenizers are ready and their power
is growing fast. They have the advantage of a pragmatic, systematic approach
that’s bearing immediate fruit. The meaning makers have years of development behind them and the potential to
deliver much more value to users, but they have only a meagre trail of commercially
successful innovations to show for it, as their idealistic penchant led them to try too much
too soon, too often.
From a pure market
slicing-and-dicing perspective, we could decouple both types of applications since the
benefits they provide are quite different; but already, as metadata is deployed
and progressively creeps across all levels and types of information and, more importantly,
as delivering real user benefits requires both cleaner data and smarter agents,
those two schools of thought are starting to converge.
Until algorithms are
able to overlay and process relevant metadata at all levels of the semantic pyramid,
humans will continue to be huge direct contributors and direct processors
of metadata, across most Semantic Web plays. For a player in that ecosystem, striking the right balance between
automation and the need for human input, and leveraging any resulting metadata
to maximize information processing and enrichment right out of the gate, will
be critical.
Somehow, I was really looking
forward to being spoon-fed enhanced information by the machines... Ultimately, I expect it to take us to our next evolutionary step. Yet it seems now that we’ll have to content
ourselves with remaining knowledge primates for yet another little while.