

Google Big Dataset: Wikilinks Corpus

A few days ago, while browsing material on data mining and machine learning, I learned that Google had released a large dataset called the Wikilinks Corpus, which contains 40 million mentions of 3 million entities. What do “mention” and “entity” mean here? Consider these three excerpts:

Apple is also rumored to be working on a new, less expensive version of the iPhone, which retains the 4 inch retina display of the iPhone 5, but which also is housed in a different casing and effects.

There’s no doubt that Apple is at the center of technology’s largest revolution ever and that longtime shareholders have been handsomely rewarded, with more than 1,000% gains.

“You can make a lovely smoothie with baobab powder, apple juice, some natural yoghurt and a handful of blueberries.”

The word “apple” is mentioned 3 times across these excerpts and refers to 2 entities: a giant IT company and a fruit. A person reading these 3 excerpts on a website or in a newspaper can easily tell which meaning is intended each time, but it is difficult for a machine to distinguish them.

Cross-document coreference resolution, a technique that automates this manual process of telling entities apart, is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. In other words, it classifies mentions according to the entities to which they refer. In the example above, it should assign the first two mentions of “Apple” to the company and the third to the fruit.
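To make the expected output of this task concrete, here is a minimal sketch in Python. It assumes the mentions are already labeled; the document ids and entity labels are invented for illustration, and a real resolver would have to predict the labels rather than read them.

```python
from collections import defaultdict

# Hypothetical labeled mentions: (document id, mention text, entity label).
# The ids and labels are made up for this illustration.
mentions = [
    ("doc1", "Apple", "Apple (company)"),
    ("doc2", "Apple", "Apple (company)"),
    ("doc3", "apple", "apple (fruit)"),
]


def group_by_entity(labeled_mentions):
    """Group (doc_id, text) mentions into one set per entity."""
    clusters = defaultdict(set)
    for doc_id, text, entity in labeled_mentions:
        clusters[entity].add((doc_id, text))
    return dict(clusters)


clusters = group_by_entity(mentions)
for entity, members in sorted(clusters.items()):
    print(entity, sorted(members))
```

Each set in `clusters` is one entity, and its members are the mentions from different documents that refer to it.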

In practice, generating large, organic labeled datasets for training and testing cross-document coreference models has been challenging. The Google Wikilinks Corpus addresses this problem. Its method is to find hyperlinks to Wikipedia in a web crawl and treat the anchor text of each link as a mention, with the linked Wikipedia page identifying the entity. Besides providing large-scale labeled data without manual effort, the corpus covers many styles of text and includes different entities that share the same mention text.
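The idea of using anchor text as a mention and the link target as its label can be sketched with Python's standard library. This is only an illustration of the principle, not the code Google used; the class name `WikiLinkParser` and the sample HTML are my own assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class WikiLinkParser(HTMLParser):
    """Collect (anchor text, Wikipedia page title) pairs from HTML."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._title = None  # title of the Wikipedia link being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        host = parsed.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and parsed.path.startswith("/wiki/"):
            self._title = unquote(parsed.path[len("/wiki/"):])
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a Wikipedia link.
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.pairs.append(("".join(self._text).strip(), self._title))
            self._title = None


# Made-up sample page with one Wikipedia link and one unrelated link.
html = (
    '<p>Street art by <a href="http://en.wikipedia.org/wiki/Banksy">Banksy</a> '
    'and <a href="http://example.com/">another link</a>.</p>'
)
parser = WikiLinkParser()
parser.feed(html)
print(parser.pairs)
```

Links that do not point to Wikipedia are ignored, so only text that comes with an entity label ends up in `pairs`.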

[Figure: two web pages containing “Banksy” links that point to the same Wikipedia page]

The figure above shows an example: two web pages each mention “Banksy”, and because both links point to the same Wikipedia page, the two mentions likely refer to the same entity. With the Wikilinks Corpus, you can write code to download the web pages listed in the dataset, find the relevant links in those pages, and extract the context around each link. The dataset files can be downloaded from: http://code.google.com/p/wiki-links/downloads/list. If you want to learn more about unstructured big data sets, this one is a good place to start.
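As a rough sketch of the last step, extracting the context around a link, the function below takes a few words on each side of a mention in plain text. The function name `context_window`, the `width` parameter and the sample sentence are assumptions of mine, and the sketch presumes the HTML has already been reduced to text and the mention's character offsets are known.

```python
def context_window(text, start, end, width=5):
    """Return up to `width` words on each side of the mention text[start:end]."""
    left_words = text[:start].split()
    right_words = text[end:].split()
    left = left_words[-width:] if width > 0 else []
    right = right_words[:width]
    return " ".join(left), text[start:end], " ".join(right)


# Made-up sample sentence, used only to exercise the function.
page_text = "A new stencil by Banksy appeared on the wall overnight."
start = page_text.index("Banksy")
left, mention, right = context_window(
    page_text, start, start + len("Banksy"), width=3
)
print(left, "|", mention, "|", right)
```

A word-based window is the simplest choice; a sentence-based or character-based window could be swapped in depending on what the model needs.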


Kent Jiang

I currently work at Perficient China GDC in Hangzhou as a Lead Technical Consultant. I have 8 years of experience in the IT industry across Java, CRM and BI technologies. My technical interests include business analytics, project planning, MDM and quality assurance.
