
December 04, 2007

Creating metadata: a task for humans or machines?

Today I’d like to tackle the first “fundamental” question I raised in my November 23rd post: who will label all this data?

A big driver of “who” will label the data is “what” is to be labeled. There is not just one kind of data out there, and for the purpose of metadata creation I’d distinguish between at least four types of data: basic building blocks (e.g. sentences in a text document), structured fragments of documents (e.g. a paragraph), self-standing documents (e.g. a speech), and groups of documents (e.g. a set of conference speeches). I have synthesized this in the slide below.

[Slide: metadata types]


Each one of these data types calls for a different type of metadata. Metadata for documents and groups of documents is mostly going to be used as is to organize these documents and return search results to a human user. This metadata needs to be provided in a synthesized format, usually in the form of a few keywords or expressions. Standardization of this metadata can remain relatively limited, as machines only need to match these text strings in a mostly straightforward manner.

On the other hand, metadata for what I dubbed ‘building blocks’, the most basic structured unit in a document, will be highly standardized in order to be processed by algorithms, which will weave blocks together by relying on metadata and, if all goes well, turn all this into ‘intelligent’ answers. This metadata therefore is purely designed for machine use.

Metadata for ‘structured fragments’ lies in between that for documents and for 'building blocks', as it can be leveraged for direct human use or for machine processing, depending on the need. Generally, however, I’d see it more aimed at human use, due to the lack of standardization of the underlying data (the computer will likely need to go down one level and still process the metadata for building blocks to make sense of it all.)

So are machines better equipped than humans to create that lower-level metadata for machine use? Just looking at the sheer volume of metadata to be created, one would hope so. Indeed, the volume of metadata to be generated grows as the level of the data it relates to gets lower: labeling each sentence in a document will generate a much larger volume of metadata than tagging the document as a whole. See the slide below for illustration.

[Slide: metadata volumes]
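
To make the volume argument concrete, here is a rough sketch that counts how many items would need a label at each of the four levels for a tiny, made-up set of speeches. The splitting rules (blank lines for paragraphs, end punctuation for sentences) and the sample texts are assumptions for illustration only, not a real segmentation method.

```python
import re

def count_units(documents):
    """Count how many items would need a label at each data level.

    Assumption: paragraphs are separated by blank lines and sentences
    end with '.', '!' or '?'. Real text needs a proper segmenter.
    """
    paragraphs = [p for doc in documents for p in doc.split("\n\n") if p.strip()]
    sentences = [
        s
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p.strip())
        if s
    ]
    return {
        "group": 1,
        "documents": len(documents),
        "fragments": len(paragraphs),
        "building_blocks": len(sentences),
    }

# Made-up sample data: two short "conference speeches".
speeches = [
    "Welcome everyone. Today we talk about RDF.\n\nMetadata matters. It helps search.",
    "Thank you all. This is the closing talk.",
]

counts = count_units(speeches)
for level, n in counts.items():
    print(f"{level}: {n}")
```

Even on this toy input the count grows at every step down, which is the pattern the slide illustrates.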



Unfortunately, one problem remains with algorithms: accuracy. How accurate are the metadata-weaving algorithms today? Overall, not very. To be accurate, algorithms need to focus on a very small part of the problem: for instance, recognizing addresses, people or events in a document and generating RDF metadata for them.
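
As an illustration of what such a narrowly focused algorithm might look like, the sketch below recognizes only email addresses (a stand-in for the address, people and event recognizers mentioned above) and emits one N-Triples line per address. The regular expression is deliberately simple, and the `example.org` document URI and `mentionsEmail` predicate are hypothetical names invented for this example, not part of any real vocabulary.

```python
import re

# Simplified pattern: good enough for a sketch, not a full address grammar.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical predicate, made up for this example.
MENTIONS = "http://example.org/vocab#mentionsEmail"

def email_triples(doc_uri, text):
    """Return one N-Triples line per distinct email address found in text."""
    found = sorted(set(EMAIL.findall(text)))
    return [f"<{doc_uri}> <{MENTIONS}> <mailto:{addr}> ." for addr in found]

text = "Contact alice@example.org or bob@example.org. Alice again: alice@example.org"
triples = email_triples("http://example.org/docs/1", text)
for line in triples:
    print(line)
```

The recognizer is reliable only because its scope is tiny; widening it to people or events is where accuracy starts to suffer.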

But algorithms are improving fast, so I expect machines to progressively climb up the metadata food chain. They may not even do so in the anticipated order: algorithms may emerge that tag whole documents accurately before they can accurately overlay metadata on things like sentences.

How fast will all of that metadata automation happen?

Here is where I part ways with many out there…

A lot of folks in the space seem to ask themselves, optimistically, how best to automate the task of building metadata, and not really how much of the task can be automated within their relevant timeframe. They work on replacing user input as much as possible with mathematical models, and expect those models to be ready in six months or a year, when most likely they will require another 5 or 10 years of effort to get anywhere practical - if they get there at all. By focusing instead on building systems that best stimulate, aggregate and synthesize user inputs (ironically, meta-systems!), they could deliver a working solution within a year or so, and then build on that potential success to gradually increase the level of automation in their application.

In sum, I suggest here that solutions that (intelligently) incorporate more human input will perform better over time. We need a healthier balance between human input and automatic metadata production. Given the poor performance of current metadata applications, focusing on algorithms that enhance the collection of user input and learn from it, rather than extract metadata from the data itself in isolation, is a better investment of one’s time.
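
A minimal sketch of the "aggregate and synthesize" step, assuming user input arrives as free-form tags per user: normalize the tags, count how many users proposed each one, and keep those that reach a chosen share of users. The function name, the 50% threshold and the sample data are all assumptions for illustration, not a description of any existing system.

```python
from collections import Counter

def aggregate_tags(user_tags, min_share=0.5):
    """Keep tags proposed by at least min_share of the users.

    user_tags maps a user id to the list of tags that user gave an item.
    Tags are lowercased and stripped so 'RDF' and 'rdf ' count as one tag,
    and each user counts at most once per tag.
    """
    votes = Counter()
    for tags in user_tags.values():
        votes.update({t.strip().lower() for t in tags})
    n_users = len(user_tags)
    return sorted(tag for tag, n in votes.items() if n / n_users >= min_share)

# Made-up input from three users tagging the same document.
inputs = {
    "user_a": ["RDF", "metadata", "semantic web"],
    "user_b": ["rdf", "Metadata"],
    "user_c": ["metadata", "tagging"],
}

consensus = aggregate_tags(inputs)
print(consensus)
```

A threshold vote is about the simplest synthesis rule available; a learning algorithm could later replace it without changing how the input is collected.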


Will the gap between human performance and machine performance remain wide enough to justify investing in the collection of human input for years to come? That is a multibillion-dollar question, but I’d bet that it will, because metadata is likely to become increasingly user-driven, dynamic and volatile, in line with user needs and mental frameworks that change ever faster. As long as the ultimate consumer of all that metadata remains human, algorithms will need our inputs. So building capabilities in the “wisdom of crowds” area today can only help position you better in the space tomorrow.

Of course, it can be said almost with certainty that at some point, user input collection will be fully automated and transparent, and machines will create metadata with higher accuracy and speed than would be possible through human processing. As of today though, no algorithm I have come across has proved capable of applying metadata accurately and comprehensively without extensive human input. We seem to be years away from intelligent systems that will "get it". And guess what? Getting to those systems will require the same thing as trying to do without them: a focus on better ways to stimulate, aggregate and synthesize user inputs!

In a future post I’ll attempt to look into (1) which "users" will provide those inputs (programmers, experts, mainstream users?) and (2) how user input can be collected and integrated into metadata-generating solutions.

    About me

    • I am Greg Boutin, founder of Growthroute Ventures. Acting as an outsourced executive, I help tech ventures develop solutions, go to market, sell, scale and raise their investor appeal and valuation. Managing information is a top interest for me; I am featured monthly on the semantic web gang podcasts, speak at events like the web 3.0 conference, write articles, and always work on a start-up concept or two.
    Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
