Web scraping – the Jekyll and Hyde of Web 2.0? (part 1)

Two media reports this week have highlighted the way in which social media and web 2.0 applications can use “web scraping” in very different ways and for very different outcomes.

Web scraping is a specialised software-based process used to extract data from websites, where it is commonly displayed using HTML or similar mark-up languages. When displayed in this way, it is difficult for conventional software trying to “read” the data to make a distinction between relevant information and the surrounding “noise” of the formatting, a situation complicated by the fact that websites may display similar information using very different layouts.
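As a rough illustration of the problem, the sketch below uses Python's standard html.parser module to pull a price out of two differently laid-out fragments of HTML. The fragments, the class name "price" and the function names are all invented for this example and do not come from any real website or scraping tool.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect the text found inside any element marked class="price".

    A simplified sketch: it assumes every opening tag inside the price
    element has a matching closing tag (no bare <br> and the like).
    """

    def __init__(self):
        super().__init__()
        self._depth = 0      # how deep we are inside a price element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            self._depth += 1                 # nested formatting tag
        elif ("class", "price") in attrs:
            self._depth = 1                  # entering a price element

    def handle_endtag(self, tag):
        if self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in one fragment of HTML."""
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices


# Two hypothetical shops showing the same information in different layouts.
SHOP_A = '<div class="item"><h2>Kettle</h2><span class="price">$39.95</span></div>'
SHOP_B = '<table><tr><td>Kettle</td><td class="price"><b>$42.00</b></td></tr></table>'

print(extract_prices(SHOP_A))
print(extract_prices(SHOP_B))
```

The same extractor copes with both layouts only because both happen to label the price in the same way. A site that uses a different label, or none at all, would need its own rules, which is why scrapers tend to be written site by site.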

After the data has been “scraped” from the relevant websites it is collated in a database or some other systematic framework and put to new purposes, possibly not ones that were envisaged by the original creators of the information. The most common examples are the websites which provide on-demand price information for a product selected by the user, based on real-time comparisons between the web pages of various online shops.
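The collation step of a price comparison might look something like the sketch below, which assumes the price strings have already been scraped. The shop names and prices are made up, and a real comparison site would also have to deal with currencies, postage and stale data.

```python
from decimal import Decimal


def parse_price(text):
    """Turn a scraped price string such as '$1,299.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def cheapest(scraped):
    """Return (shop, price) for the lowest of the scraped prices."""
    prices = {shop: parse_price(text) for shop, text in scraped.items()}
    shop = min(prices, key=prices.get)
    return shop, prices[shop]


# Hypothetical results of scraping three online shops for the same product.
scraped = {
    "shop-a.example": "$39.95",
    "shop-b.example": "$42.00",
    "shop-c.example": " $38.50 ",
}

print(cheapest(scraped))
```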

Web scraping can be viewed as a specific form of data scraping, or the extraction of data from the human-readable output of any computer, but for the purposes of these posts I will regard the two terms as synonymous.

First, a positive example of web scraping at work. The not-for-profit organisation OpenAustralia, which has already made a name for itself in making information about the federal parliament, parliamentary debates and individual MPs more easily accessible, has just released a new app for iPhones and Android smartphones. The app gives users the ability to locate neighbouring development proposals that may affect a property just by pointing their mobile phones at the property.

Planningalerts phone app.

The app is an extension of OpenAustralia’s already-successful web-based planning alerts service. This allows users to log the address of a property and be informed by email of development proposals within a 200 metre, 800 metre or 2 km radius, provided the property is located within the boundaries of a council that publishes this information online in a format the software is able to scrape.
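How the service itself measures distance is not described in the reports, but a radius check of this kind can be sketched with the standard haversine formula, assuming each proposal has already been given a latitude and longitude. The coordinates and reference numbers below are invented.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius of the Earth in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(home, proposals, radius_m):
    """Return the proposals no further than radius_m from the home address."""
    return [p for p in proposals
            if distance_m(home[0], home[1], p["lat"], p["lon"]) <= radius_m]


# Hypothetical property and development proposals.
HOME = (-33.8700, 151.2100)
PROPOSALS = [
    {"ref": "DA-1", "lat": -33.8710, "lon": 151.2100},  # roughly 110 m away
    {"ref": "DA-2", "lat": -33.8750, "lon": 151.2100},  # roughly 560 m away
    {"ref": "DA-3", "lat": -33.8850, "lon": 151.2100},  # roughly 1.7 km away
]

for radius in (200, 800, 2000):
    print(radius, [p["ref"] for p in within_radius(HOME, PROPOSALS, radius)])
```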

This is not without its complications. In an interview with the Sydney Morning Herald, OpenAustralia founder Matthew Landauer describes the process as “very painful and error prone”:

“The program clicks on links, fills out forms to do searches, and then when the program finds the web page with the development application it has to extract the unstructured information on the page and turn it into structured information.”
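The last step Landauer describes, turning unstructured text into structured information, can be sketched as follows. This is not OpenAustralia's code: the page text, the field labels and the function name are assumptions made for the example, and real council pages are far less tidy.

```python
import re

# Hypothetical text taken from a council's development application page.
PAGE_TEXT = """
Application No: DA/2010/0123
Address: 12 Example Street, Sometown NSW
Description: Two storey dwelling
"""

# A label made of letters and spaces, then a colon, then the value.
LINE_PATTERN = re.compile(r"\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def to_record(text):
    """Turn 'Label: value' lines into a dictionary keyed by the label."""
    record = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            key = match.group(1).lower().replace(" ", "_")
            record[key] = match.group(2)
    return record


print(to_record(PAGE_TEXT))
```

Every council that lays out its pages differently would need its own pattern, which goes some way to explaining why the process is described as painful and error prone.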

Only 85 of the more than 650 councils in Australia are covered by the software, but these include many of the larger ones, and others are being added to the system by “crowdsourcing”: members of the community add the details of the websites of councils that start to place their development application (DA) information online.

So far, so good. There would be widespread agreement that this is an appropriate “repurpose” of information which is publicly available anyway. Some councils may be nervous about the way in which the OpenAustralia phone and online applications raise the bar in terms of who gets notified about DAs, but hopefully most will take a positive attitude and cooperate with this initiative by making information available in a more standardised, software-readable format. In turn, this and other eGov initiatives may lead to councils and other levels of government making a wider range of data available in computer-accessible formats.

However, web scraping can be used in different and less benign ways. In my next post I will look at a new application which raises major questions about the use of web scraping and related techniques in relation to social media websites.
