Data collection, organization and analysis: a brief overview

Photo: Kalyeena Makortoff

Gianluca De Martino and Andrea Nelson Mauro – journalists for Dataninja – shared practical information on the main tools and methods available to collect, organize, enrich and analyze data during one of the first panels of the 10th edition of the International Journalism Festival on Wednesday 6 April, 2016. They provided hands-on examples to give participants an introductory grounding in data management.

Dataninja is a data-driven team founded in 2012. It is “a sort of agency for content production – we publish articles and so on. But everything starts with data – data analysis, visualization with data and also enhancing the data,” said De Martino. “Over the years our team has basically adapted to the requests of the market. We’re also trying to spread a better awareness as far as data culture is concerned, and to do this in many different ways,” De Martino added, citing as an example Confiscati Bene – a project, published last December, based on cooperation between state officials and citizens and aimed at the “re-use of buildings and other assets seized from the mafia.” He also discussed another team project called Dataninja School, an e-learning platform offering users the opportunity to share data skills.

During the session De Martino and Nelson Mauro showed attendees examples of tools for collecting, cleaning and analysing data – three essential elements of data management, according to De Martino. Nelson Mauro also commented on the difficulties in data journalism – data availability, for example. “The difficulty lies in the fact that the data sets are not always completely organized, so we might have to extract the tables from other types of files,” said the journalist, and he presented some examples of how to scrape worksheets. Nelson Mauro also presented two tools for scraping a PDF file – Tabula and ScraperWiki. “I prefer Tabula because it’s also possible to extract just parts of tables. ScraperWiki is quite similar but doesn’t allow you to select one specific table to be scraped,” Nelson Mauro noted. He also presented Data Miner – another tool for extracting HTML tables and converting them into CSV or Excel files.
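The kind of extraction Data Miner performs – pulling an HTML table out of a web page and writing it out as CSV – can also be scripted. Below is a minimal sketch using only the Python standard library; the sample table and its values are invented for illustration and are not from the panel.

```python
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the rows of any <table> markup fed to it."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a new row
        elif tag in ("td", "th"):
            self._cell = []         # start collecting a cell's text

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html: str) -> str:
    """Parse the HTML and return its table rows as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()

# Hypothetical sample page with one table.
html = """<table>
  <tr><th>Region</th><th>Assets</th></tr>
  <tr><td>Sicilia</td><td>120</td></tr>
  <tr><td>Calabria</td><td>85</td></tr>
</table>"""

print(html_table_to_csv(html))
```

Point-and-click tools remain faster for one-off jobs, but a script like this can be re-run whenever the source page updates.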

Additionally, the two Dataninja journalists examined and discussed one more tool available to data users: OpenRefine, which makes it possible to work on large amounts of data, build a preliminary analysis of it, and ultimately clean it.
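One of OpenRefine's best-known cleaning features is key-collision clustering, which groups variant spellings of the same value. The sketch below reimplements the idea of its "fingerprint" method in plain Python – trim, lowercase, strip accents and punctuation, then sort the unique tokens so that variants collide on the same key. The sample names are invented for illustration.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Normalize a string into a clustering key: trim, lowercase,
    drop accents and punctuation, sort the unique tokens."""
    v = value.strip().lower()
    v = unicodedata.normalize("NFKD", v)
    v = v.encode("ascii", "ignore").decode("ascii")
    v = re.sub(r"[^\w\s]", "", v)
    return " ".join(sorted(set(v.split())))

def cluster(values):
    """Group values whose fingerprints collide; return only the
    groups that actually contain more than one variant."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical messy column of municipality names.
names = ["Comune di Milano", "comune di  milano",
         "MILANO, Comune di", "Roma Capitale"]
print(cluster(names))
```

All three "Milano" variants share the fingerprint "comune di milano" and are grouped together, while "Roma Capitale" stays apart – the same behaviour OpenRefine's clustering panel surfaces interactively.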