The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download to help researchers and companies exploit the wealth of information available on the Web.
More and more websites embed structured data describing products, people, organizations, places, and events in their HTML pages using markup standards such as RDFa, Microdata, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project has published four data set releases, extracted from the 2014, 2013, 2012, and 2010 Common Crawl corpora. It provides the extracted data for download and publishes statistics about the deployment of the different formats.
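
To make the idea of embedded structured data concrete, here is a minimal sketch of what Microdata markup looks like and how its properties could be read from a page. It is not the project's extraction framework: the class name `MicrodataSketch`, the helper `extract_microdata`, and the sample HTML are all hypothetical, and the parser assumes that each `itemprop` value is either plain text without nested elements or a `content` attribute.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collects item types and (itemprop, value) pairs from simple Microdata.

    Assumption: property values are plain text without nested tags,
    or are given in a `content` attribute (as on <meta> elements).
    """

    def __init__(self):
        super().__init__()
        self.item_types = []    # values of itemtype attributes
        self.properties = []    # (property name, value) pairs
        self._open_prop = None  # itemprop whose text is being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        prop = attrs.get("itemprop")
        if prop and "itemscope" not in attrs:
            if "content" in attrs:
                # Value carried by the attribute, e.g. <meta itemprop=... content=...>
                self.properties.append((prop, attrs["content"]))
            else:
                # Value is the element's text; collect it until the end tag
                self._open_prop = prop
                self._buffer = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._open_prop is not None:
            value = "".join(self._buffer).strip()
            self.properties.append((self._open_prop, value))
            self._open_prop = None


def extract_microdata(html):
    """Return (item_types, properties) found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.item_types, parser.properties


# Hypothetical product page fragment using schema.org vocabulary
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <meta itemprop="priceCurrency" content="EUR">
  <span itemprop="price">19.99</span>
</div>
"""

if __name__ == "__main__":
    types, props = extract_microdata(SAMPLE_HTML)
    print(types)
    print(props)
```

A real extractor has to handle nested items, multiple syntaxes (RDFa and Microformats as well as Microdata), and malformed HTML, so this sketch only illustrates the kind of markup involved.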