Wednesday 5 June 2013

Web Scraping Evolved: APIs for Turning Webpage Content into Valuable Data

This guest post comes from Marc Mezzacca, founder of NextGen Shopping, a company dedicated to creating innovative shopping mashups. Marc’s latest venture is CouponFollow, a social-media-based coupon code website that uses the Twitter API.

While adoption of semantic standards is increasing, most of the web still consists of unstructured data. Search engines, like Google, remain focused on looking at the HTML markup for clues to create richer results for their users. The creation of schema.org and similar movements has improved the ability to draw valuable content from webpages.

But even with semantic standards in place, structured data requires a parser to extract information and convert it into a data-interchange format, namely JSON or XML. Many libraries exist for this, in several popular programming languages. But be warned, most of these parser libraries are far from polished. Most are community-produced, and thus may not be complete or up to date, as standards are ever changing. On the flip side, website owners who don’t fully follow semantic rules can break the parser. And of course there are sites which contain no structured data formatting at all. This inconsistency causes problems for intelligent data harvesting, and can be a roadblock for new business ideas and startups.
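To make the parsing step concrete, here is a minimal sketch in Python that pulls schema.org microdata (`itemprop` attributes) out of a page and emits JSON. It uses only the standard library. The class name `ItempropParser`, the function `microdata_to_json`, and the sample markup are made up for illustration, and the sketch ignores nested items and repeated properties, which a real parser would have to handle.

```python
import json
from html.parser import HTMLParser


class ItempropParser(HTMLParser):
    """Collect schema.org microdata `itemprop` values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.properties = {}   # itemprop name -> extracted value
        self._current = None   # itemprop whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        # Some tags carry the value in an attribute rather than in text.
        if "content" in attrs:
            self.properties[name] = attrs["content"]
        elif tag == "img" and "src" in attrs:
            self.properties[name] = attrs["src"]
        else:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        if self._current is not None:
            value = self.properties[self._current].strip()
            self.properties[self._current] = value
            self._current = None


def microdata_to_json(html):
    """Parse the HTML and return the collected properties as a JSON string."""
    parser = ItempropParser()
    parser.feed(html)
    return json.dumps(parser.properties, sort_keys=True)


# Sample markup invented for this example.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Article">
  <h1 itemprop="headline">Web Scraping Evolved</h1>
  <span itemprop="author">Marc Mezzacca</span>
  <meta itemprop="datePublished" content="2012-09-13">
  <img itemprop="image" src="cover.png">
</div>
"""

if __name__ == "__main__":
    print(microdata_to_json(SAMPLE))
```

Even this small example shows why such parsers are fragile: a page that omits the `itemprop` attributes, or nests them in unexpected ways, yields an empty or misleading result.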

Several companies offer API services that make sense of this unstructured data, helping to remove these roadblocks. For example, AlchemyAPI offers a suite of data extraction APIs including a Structured Content Scraping API, which enables structured data to be extracted based on both visual and structural traits. Another company, DiffBot, is also taking care of the “dirty work” in the cloud, allowing entrepreneurs and developers to focus on their business instead of the semantics involved in parsing. DiffBot stands out because of its approach. Instead of reading the page as a computer would, it looks at the page visually, as a human would. It first classifies the type of webpage (e.g. article, blog post, product) and then extracts what visually appears to be the relevant data for that page type (e.g. article title, most relevant image).
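The two-step idea described above, classify the page and then run an extractor specific to that page type, can be sketched as a toy pipeline. This is not DiffBot's method or API: DiffBot works from the visual rendering of the page, while this sketch swaps in crude markup heuristics and regular expressions purely to show the control flow. Every function name, rule, and sample page here is an assumption made for illustration.

```python
import re


def classify_page(html):
    """Toy classifier: guess the page type from crude markup cues."""
    lowered = html.lower()
    if 'class="price"' in lowered or "add to cart" in lowered:
        return "product"
    if "<article" in lowered or lowered.count("<p") >= 3:
        return "article"
    return "other"


def first_match(pattern, html):
    """Return the first capture group of `pattern` in `html`, or None."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def extract_article(html):
    """Fields that matter for an article page."""
    return {
        "title": first_match(r"<h1[^>]*>(.*?)</h1>", html),
        "image": first_match(r'<img[^>]*src="([^"]+)"', html),
    }


def extract_product(html):
    """Fields that matter for a product page."""
    return {
        "name": first_match(r"<h1[^>]*>(.*?)</h1>", html),
        "price": first_match(r'class="price"[^>]*>(.*?)<', html),
    }


# Step 2 is chosen by the result of step 1.
EXTRACTORS = {"article": extract_article, "product": extract_product}


def analyze(html):
    """Classify the page, then run the extractor for that page type."""
    page_type = classify_page(html)
    extractor = EXTRACTORS.get(page_type)
    fields = extractor(html) if extractor else {}
    return {"type": page_type, "fields": fields}


# Sample pages invented for this example.
ARTICLE = ('<article><h1>Web Scraping Evolved</h1>'
           '<img src="lead.png"><p>Text</p></article>')
PRODUCT = ('<div><h1>Blue Widget</h1><span class="price">$9.99</span>'
           '<button>Add to cart</button></div>')

if __name__ == "__main__":
    print(analyze(ARTICLE))
    print(analyze(PRODUCT))
```

Separating classification from extraction means each extractor only has to know about one kind of page, and supporting a new page type amounts to adding a classifier rule and one more entry in `EXTRACTORS`.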

Currently their website lists APIs for Page Classification (check out their infographic) and for parsing Article-type webpages. Much of the web, including discussion boards, events, and e-commerce data, remains open as potential future API offerings, and it will be interesting to see which they go after next.

You can test drive the Article API on their website and see the extraction results instantly, as shown below for this article:


Source: http://blog.programmableweb.com/2012/09/13/web-scraping-evolved-apis-for-turning-webpage-content-into-valuable-data/
