Is web scraping part of data mining?

Investigative journalism

Note: The options presented here reflect the state of affairs as of March 2015. This page is therefore updated regularly.

*****

"Scraping" means something like "scrape" or "scratch" in English. The umbrella term summarizes techniques with which data is extracted from websites or documents - or, to stick with the literal translation, "scraped out" for further processing.

The Canadian data journalist Glen McGregor describes scraping as a powerful tool for accessing electronic data for stories that could not otherwise be told (see http://j-source.ca/article/web-scraping-how-journalists-get-their-own-data).

Scraping can be done in several ways. The most powerful option is to write a program yourself that systematically searches web pages for the information you want. However, this requires mastery of a programming language such as "Python", "Ruby" or "Perl".
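What such a self-written scraper can look like is illustrated by the following minimal sketch in Python. The web address and the table structure are placeholders, and the use of the third-party libraries "requests" and "BeautifulSoup" is merely one common choice, not a requirement:

# Minimal scraping sketch (assumptions: the libraries "requests" and
# "beautifulsoup4" are installed; the URL and the table structure are
# hypothetical placeholders).
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.org/results"  # hypothetical page containing an HTML table

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Collect every row of the first table on the page.
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# Save the result as a CSV file for further processing in a spreadsheet program.
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(f"Saved {len(rows)} rows to results.csv")

The same pattern - download a page, pick out the elements you need, write them to a data file - underlies virtually every scraper, regardless of the language it is written in.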

Scraping with special tools

If you lack this ability, you can of course work with a programmer - or you can use one of the many available tools. There are numerous scraping extensions for various web browsers, as well as browser-based online applications and stand-alone programs. To use these effectively, however, you should have at least a basic understanding of how websites are structured and be willing to familiarize yourself with the tools as far as necessary.

In addition, you should be aware that full control over your data can only be guaranteed with your own scripts. Most of the available programs store the data they collect on the (rented) servers of their vendors.

Since the information being collected is usually already available online, this is not necessarily a problem. But it should be kept in mind, especially when working with sensitive data sets.

Example: Extracting tables from PDFs with ScraperWiki

The English-language platform "ScraperWiki" (scraperwiki.com) provides a number of online tools for obtaining, processing and downloading data. To use these tools, you must register and log in on the website. A basic account is free, and the free package has also been expanded for journalists: https://wordpress.scraperwiki.com/solutions/data-journalism/.

A very useful ScraperWiki tool extracts tables from PDF files.
For this example, we use the 2013 report published by the World Anti-Doping Agency (WADA), which contains the test results for the individual sports. The report is available on the agency's website at the following URL:

https://wada-main-prod.s3.amazonaws.com/resources/files/WADA-2013-Anti-Doping-Testing-Figures-SPORT-REPORT.pdf

A first look shows that the report contains numerous facts, explanations and tables across 27 pages. However, these are "trapped" in the PDF file and cannot easily be used for further analysis and visualization: the PDF format was originally designed only for printing and displaying documents, and its structure is not intended for extracting data.

For further processing, however, we need the tables with the test results for the Olympic sports in a "more data-friendly" format such as XLS or CSV.

The following steps are necessary for this:
After you have registered and logged in at scraperwiki.com, you first select the option "Create a new dataset". In the selection menu that then opens, click on "Extract from PDFs".

First, a PDF file must be selected from which the tables are to be extracted. It can be available online, but it can also be on your own computer. In this example, the link to the WADA report is copied into the field. Then click on "Extract Tables" and wait a moment.

A table view of the document is now visible on the screen under the “View in a table” tab. The required tables are on pages 3, 4, and 5.
With "Download as a Spreadsheet", the required pages can be downloaded as an Excel file and further processed and merged in a spreadsheet program.

Alternative: "Tabula": http://tabula.technology

is free software for extracting tables and PDF files. However, it is still in the experimental status!
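Tabula can also be driven from a script. As a sketch - assuming the Python wrapper "tabula-py" is installed, which is not mentioned above and additionally requires a Java runtime - the tables on pages 3 to 5 of the WADA report could be pulled out like this:

# Sketch using the tabula-py wrapper around Tabula
# (assumptions: tabula-py is installed and a Java runtime is available).
import tabula

URL = ("https://wada-main-prod.s3.amazonaws.com/resources/files/"
       "WADA-2013-Anti-Doping-Testing-Figures-SPORT-REPORT.pdf")

# Read the tables on pages 3-5 into a list of pandas DataFrames ...
tables = tabula.read_pdf(URL, pages="3-5", multiple_tables=True)
print(f"Found {len(tables)} tables")

# ... or convert them directly into a CSV file for the spreadsheet program.
tabula.convert_into(URL, "wada-2013-olympic-sports.csv",
                    output_format="csv", pages="3-5")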

Another software tip: scrape websites with import.io

Import.io is a free program for extracting data records from websites. It is operated entirely through a graphical user interface that resembles a web browser. Within the software, small programs known as "crawlers" are created; these can be "trained" on the structure of a website in order to filter out the desired records and export them as a table. Crawlers are saved and can be run again and again to obtain an updated version of the required data.

The software can be downloaded for free here: www.import.io. To use it, you need to register or log in with an existing social media account.

The following link provides a series of introductory training videos (webinars) that explain how to use the software using various case studies: https://import.io/webinars (in English).

Other alternatives:

 

"Data Mining" (from the English: "Data" = "Data"; Mining = "Mining") is used when, from large amounts of data (so to speak, "mountains" of data that are essential for a story or graphic information) be digged out ". Data mining makes it possible to recognize patterns in the data or to uncover connections. These can then be used for further research (cf. ALISHANI, A / van der KAA, HAJ 2013: Journalistic Data Mining. [Online, cited June 27, 2013] - URL: http://mystudy.uvt.nl/it10 .vakzicht? taal = n & pfac = FGW & vakcode = 822033).

In connection with the buzzword "big data", data mining could become more important in the future. A definition can be found at the Schleswig-Holstein data protection center:

"Big Data" stands for particularly large amounts of data that are collected, made available and evaluated via the Internet or otherwise. Once extracted from the original survey context, the data can be used for any purpose, e.g. B. to recognize statistical trends, to gain scientific knowledge, to make political, economic or other decisions, possibly also decisions related to individual people. This results in completely new opportunities for social, economic and scientific knowledge that can help improve living conditions in our complex world. ”(Data Protection Center 2013)

Software: SPSS, R (statistics); Tableau (analysis and visualization), which can be used free of charge in its "Public" version.

Data mining with forensic software

To sort, analyze and evaluate particularly large numbers of files on a data carrier or web server, some investigative data journalists rely on forensic programs of the kind also used in criminal investigations.

A prominent example of such successful data mining is the journalism project "Offshore Leaks" mentioned above, which exposed tax evasion by companies and private individuals in so-called tax havens. According to Sebastian Mondial, a German journalist involved in the project, the research was based on a hard drive holding over 260 gigabytes of data spread across 2.5 million documents - e-mails, spreadsheets and PDF files, among others - that had been leaked to the International Consortium of Investigative Journalists (ICIJ). Using the forensic software "Nuix" (www.nuix.com), this vast mountain of data was filtered and evaluated by an international team of journalists.
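Nuix is commercial forensic software, and its inner workings are far more sophisticated than any short example can show. Purely to illustrate the basic idea - narrowing a huge collection of documents down to the few worth reading - here is a deliberately simple Python sketch with hypothetical keywords and folder names:

# Toy illustration of filtering a large document collection by keywords.
# This is not how Nuix works; it only shows the basic idea of narrowing
# many files down to the ones worth reading.
from pathlib import Path

KEYWORDS = {"trust", "nominee", "offshore"}   # hypothetical search terms
FOLDER = Path("leak-data")                    # hypothetical folder of text files

hits = []
for path in FOLDER.rglob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    matched = [kw for kw in KEYWORDS if kw in text]
    if matched:
        hits.append((path, matched))

# List the most promising documents first (most keyword matches).
for path, matched in sorted(hits, key=lambda h: len(h[1]), reverse=True):
    print(path, "->", ", ".join(matched))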

 

Some websites offer developers special programming interfaces, known as APIs (Application Programming Interface). Data journalists can make use of these, for example, to answer specific questions about structures or actors in social networks: via the Twitter API, for instance, you can find out which Twitter users are involved in certain topics, how actively they take part in discussions, and where they are located geographically.

A common data format for API queries is JSON. Google (https://developers.google.com) and Facebook (https://developers.facebook.com/docs/graph-api) also offer APIs for developers.
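What such an API query typically looks like is shown by the following sketch, which fetches a JSON response and picks out a few fields. The endpoint and field names are purely hypothetical placeholders; real interfaces such as Twitter's additionally require registration and authentication keys:

# Generic sketch of an API query that returns JSON
# (assumption: the endpoint and field names are hypothetical placeholders;
# real APIs usually require an API key as well).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.org/search"   # hypothetical API endpoint
params = urlencode({"q": "doping", "count": 10})

with urlopen(f"{BASE_URL}?{params}", timeout=30) as response:
    data = json.load(response)

# Pick out the fields of interest from every returned record.
for item in data.get("results", []):
    print(item.get("user"), item.get("created_at"), item.get("text"))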

In the United States, government institutions also provide APIs for developers. An overview of the various offerings is available on a central portal: www.data.gov.

In Germany, a beta version of an official data portal operated by the Federal Ministry of the Interior has been online for some time. According to the information on the site, its metadata catalog can also be queried via an API: www.govdata.de/hilfe

******

Editor's note (JL): Patrick RÖSING works as a data journalist and visualizer at stern online. He examined this subject in depth in his 2013 master's thesis, "Data journalism in regional and local media: nine recommendations for action", which can be downloaded for free.