Saturday 3 October 2009

New data search initiatives to cut out Google’s noise

When Google launched in 1998 it immediately caught the attention of serious Web users, and it has never looked back. But there’s a lot of garbage online. Do a search and you must wade through dozens of crappy links to find the valuable nuggets you need. Google’s algorithms work well at finding pages, but not well enough at filtering them. Refine your search and you might get more focused results, but this all takes time.

GoogleNews helps, especially with its by-date sorter, which lets you find articles published in a certain month of a certain year. But this only shows published news items. Often the information you want is not even published online because it never made it into the final article.

Last year, two initiatives were launched that have now joined forces to help provide journalists – and anyone – with better search. The websites do not say this, but at the Online News Association conference in San Francisco, which ends today, Aron Pilhofer, editor of Interactive News Technologies at The New York Times, described the initiative as a “scalpel” compared to Google’s “blunt instrument”.

“Google is blunt tool for data searching for docs usable for journalists. DocumentCloud is scalpel,” tweeted Adam Glenn, who describes himself as a digital media consultant, journalist, and educator.

DocumentCloud is a searchable repository of journalistic “original source documents”, according to the website. You know what I mean. These are the documents that journalists use to make their news stories. The raw material. The bulk that they process. “Think of it as a card catalog for primary source documents,” the website advises.

Documents do not have to be handed over for inclusion. Those who want to have their documents included in the index can still host them on their own sites. Who will be able to contribute? According to Pilhofer:

“The repository will be open for anyone to read from, but not to contribute to. It will be limited to news organizations, bloggers and watchdog groups whose mission includes publishing source documents as a means of better informing the public about issues of the day.”

By restricting the number of people and organisations that can submit documents for inclusion, DocumentCloud erects a filter between the viewer and the Web. This is where the ‘scalpel’ comes in.

DocumentCloud was set up by people from ProPublica, “an independent, non-profit newsroom that produces investigative journalism in the public interest”, and The New York Times. Last month it announced a list of 24 partners who would beta-test the software. All are American entities.

"Readers will be able to search documents on DocumentCloud and then will be pointed to the documents themselves on contributing organisations' websites," says Pilhofer.

The initiative received a grant of $US719,000 from the John S. and James L. Knight Foundation in June.

Last month, it was announced that DocumentCloud would partner with a Thomson Reuters technology venture called OpenCalais, which adds meta-tags to documents. This collaboration serves to further sharpen the scalpel.

The second technology is a tagging service that adds value, it would seem, by identifying what it calls ‘entities’ in text submitted for processing. An automated Web service is also available, so documents can be processed without troublesome pointing and clicking.

“A great tool from Reuters Calais that automatically retrieves entities and relationships from a text,” tweeted a representative of thisislike, a web-based service that creates visual associations between things.

Ari Tenhunen, a graphic designer and web solution developer who also attended the conference, called it a “semantic engine”. Tom Tague, VP of platform strategy for OpenCalais, talks about “natural language processing”. A useful page on InformationWeek’s website contains a video run-through. Tague says:

“It’s about taking unstructured content, like news stories, and adding structure to them, [and] extracting meaning from those stories. [E]ssentially it is turning them into data so that you can start to do the things we know how to do with data, like put them in rows and columns or merge stories from multiple sources. And do that without human intervention.”
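To make the “rows and columns” idea concrete, here is a minimal sketch in Python. It is not how OpenCalais works: the real service uses natural language processing, whereas this toy version assumes a small hand-made lookup table (KNOWN_ENTITIES) and plain substring matching. The story ids, sample sentences and function names are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real semantic engine.
# A service like OpenCalais would find entities with natural language
# processing rather than a fixed list like this one.
KNOWN_ENTITIES = {
    "ProPublica": "Organization",
    "The New York Times": "Organization",
    "Aron Pilhofer": "Person",
    "San Francisco": "City",
}


def extract_rows(story_id, text):
    """Return one (story_id, entity, entity_type) row per known entity found."""
    rows = []
    for name, entity_type in KNOWN_ENTITIES.items():
        if name in text:
            rows.append((story_id, name, entity_type))
    return rows


def merge_by_entity(rows):
    """Group story ids by entity, so stories from different sources line up."""
    index = defaultdict(set)
    for story_id, name, _entity_type in rows:
        index[name].add(story_id)
    return {name: sorted(ids) for name, ids in index.items()}


# Two invented stories, as if they came from two different sources.
stories = {
    "story-a": "Aron Pilhofer of The New York Times spoke in San Francisco.",
    "story-b": "DocumentCloud was set up by people from ProPublica and The New York Times.",
}

rows = []
for story_id, text in stories.items():
    rows.extend(extract_rows(story_id, text))

merged = merge_by_entity(rows)
for name, ids in merged.items():
    print(name, ids)
```

Once each story has been reduced to rows like these, finding every document that mentions the same organisation is a simple lookup rather than a full-text search.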

So the combination of DocumentCloud and OpenCalais results in a high-powered search engine based solely on qualified material. That is a far cry from the unmediated protoplasm that currently exists on the Web, which we must contend with daily, spending time scanning and sifting, when we use ‘traditional’ search engines such as Google.

OpenCalais saves time and money because it automates the process of tagging. Instead of getting knowledge workers – whose time costs a lot of money – to tag articles prior to consumption, OpenCalais runs semantic tagging over documents and spits out a set of results.

But questions remain.

“Is it possible somebody finally got semantic tagging right?” tweeted Christopher Groskopf, a professional archivist and interested observer who was not at the conference.

InformationWeek was also curious enough to ask how the information would be exposed. How would a person ‘see’ the information generated by the tagging engine? OpenCalais has developed an interface for WordPress blogs, for example, that allows a blogger to automatically generate a set of tags that can be included in a tag cloud.

This saves some time, but it’s not earth-shattering. “How to process mountains of blog text and convert it to data & tags,” tweeted Steve Brown, CEO of 3banana, a mobile semantic data capture startup.
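For readers wondering what happens between a list of tags and a tag cloud, here is a rough sketch in Python. It is not the OpenCalais WordPress interface, whose internals I have not seen. It simply assumes each post already carries a list of tags, counts how often each tag appears, and scales the counts to font sizes. The function name, the sample tags and the 10 to 30 point range are my own assumptions.

```python
from collections import Counter


def tag_cloud(tags_per_post, min_size=10, max_size=30):
    """Map each tag to a font size proportional to how often it is used."""
    counts = Counter(tag for tags in tags_per_post for tag in tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for tag, count in counts.items():
        if span == 0:
            # Every tag is equally common, so give them all the same size.
            sizes[tag] = max_size
        else:
            # Linear scaling between min_size and max_size.
            sizes[tag] = min_size + (count - low) * (max_size - min_size) // span
    return sizes


# Invented tags for three posts, as an automated tagger might produce them.
posts = [
    ["DocumentCloud", "journalism"],
    ["OpenCalais", "journalism", "search"],
    ["journalism", "search"],
]

cloud = tag_cloud(posts)
for tag, size in sorted(cloud.items(), key=lambda item: -item[1]):
    print(f"{tag}: {size}pt")
```

The most frequent tag gets the largest type and the rarest the smallest, which is all a tag cloud really is.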

Tague says they have 15,000 “registered users”, ie entities that have asked to use the tagging service. That’s a lot of source material, and it’s not just news organisations that have signed up to tag their data. Essentially this service – if it is freely available and if a lot of organisations choose to have their documents published on it – will rival such content aggregators as GoogleNews and Factiva.

Combine the tagging engine with the massive volume of documents that will be assembled by DocumentCloud and journalists will have an incredibly useful, reliable, intelligently-tagged set of information to use when researching articles.

The reaction at ONA? “It's wickedly cool.”
