Data driven journalism
Data-driven journalism is a journalistic process based on analyzing and filtering large data sets for the purpose of creating a new story. Data-driven journalism deals with open data that is freely available online and analyzed with open source tools. Data-driven journalism strives to reach new levels of service for the public, helping consumers, managers and politicians to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role relevant for society in a new way.

Definitions

According to information architect and multimedia journalist Mirko Lorenz, data-driven journalism is a workflow that consists of the following elements: digging deep into data by scraping, cleansing and structuring it, filtering by mining for specific information, visualizing and making a story. This process can be extended to provide information results that cater to individual interests and the broader public.

Data journalism trainer and writer Paul Bradshaw describes the process of data-driven journalism in a similar manner: data must be found, which may require specialized skills like MySQL or Python, then interrogated, for which understanding of jargon and statistics is necessary, and finally visualized and mashed with the aid of open source tools.

Reporting based on data

Telling stories based on the data is the primary goal. The findings from data can be transformed into any form of journalistic writing. Visualizations can be used to create a clear understanding of a complex situation. Furthermore, elements of storytelling can be used to illustrate what the findings actually mean, from the perspective of someone who is affected by a development. This connection between data and story can be viewed as a "new arc" trying to span the gap between developments that are relevant, but poorly understood, to a story that is verifiable, trustworthy, relevant and easy to remember.

Data-driven journalism and the value of trust

Based on the perspective of looking deeper into facts and drivers of events, there is a suggested change in media strategies: in this view the idea is to move "from attention to trust". The creation of attention, which has been a pillar of media business models, has lost its relevance because reports of new events are often distributed faster via new platforms such as Twitter than through traditional media channels. On the other hand, trust can be understood as a scarce resource. While distributing information is much easier and faster via the web, the abundance of offerings creates costs to verify and check the content of any story, and these costs create an opportunity. The view to transform media companies into trusted data hubs has been described in an article cross-published in February 2011 on Owni.eu and Nieman Lab.

Process of data-driven journalism

The process to transform raw data into stories is akin to a refinement and transformation. The main goal is to extract information recipients can act upon. The task of a data journalist is to extract what is hidden. This approach can be applied to almost any context, such as finances, health, environment or other areas of public interest.

Inverted pyramid of data journalism

In 2011, Paul Bradshaw introduced a model he called "The Inverted Pyramid of Data Journalism".

Steps of the process

In order to achieve this, the process should be split up into several steps. While the steps leading to results can differ, a basic distinction can be made by looking at six phases:
  1. Find: Searching for data on the web
  2. Clean: Process to filter and transform data, preparation for visualization
  3. Visualize: Displaying the pattern, either as a static or animated visual
  4. Publish: Integrating the visuals, attaching data to stories
  5. Distribute: Enabling access on a variety of devices, such as the web, tablets and mobile
  6. Measure: Tracking usage of data stories over time and across the spectrum of uses.

Find data

Data can be obtained directly from governmental databases such as data.gov, data.gov.uk and the World Bank Data API, but also by placing Freedom of Information requests to government agencies; some requests are made and aggregated on websites like the UK's What Do They Know. While there is a worldwide trend towards opening data, there are national differences as to what extent that information is freely available in usable formats. If the data is in a webpage, scrapers are used to generate a spreadsheet. Examples of scrapers are ScraperWiki, the Firefox plugin OutWit Hub and Needlebase. In other cases OCR software can be used to get data from PDFs.
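
As a minimal sketch of what a scraper does, the following Python example turns an HTML table into spreadsheet rows using only the standard library. The class name TableScraper, the sample markup and the figures in it are made up for illustration; a real page would have to be downloaded first and would usually need more robust handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the figures are invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Spending</th></tr>
  <tr><td>North</td><td>1200</td></tr>
  <tr><td>South</td><td>950</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)

# Write the rows as CSV, a format any spreadsheet program can open.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(scraper.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting CSV text can be saved to a file and opened in a spreadsheet for the cleaning step.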

How to analyze unfamiliar data: circle, dive, and riff

Clean data

Usually data is not in a format that is easy to visualize. For example, there may be too many data points, or the rows and columns may need to be sorted differently. Another issue is that, once investigated, many datasets need to be cleaned, structured and transformed. Various tools, several of them open source, such as Google Refine, Data Wrangler and Google Spreadsheets, allow uploading, extracting or formatting data.
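
The kind of cleaning described above can also be scripted. The following Python sketch is one possible approach, not the method of any of the tools named: it trims stray whitespace, unifies capitalization, converts formatted numbers, drops incomplete and duplicate rows, and sorts the result. The column names and values are invented for illustration.

```python
import csv
import io

# Hypothetical raw export; names and figures are invented for illustration.
RAW_CSV = """region,spending
 North ,"1,200"
south,950
South,950
East,
"""


def clean_rows(raw_text):
    """Return de-duplicated (region, spending) pairs with tidy values."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        region = row["region"].strip().title()
        amount = row["spending"].replace(",", "").strip()
        if not region or not amount:
            continue  # skip rows with missing values
        record = (region, int(amount))
        if record in seen:
            continue  # skip rows that duplicate an earlier one
        seen.add(record)
        cleaned.append(record)
    # Sort from largest to smallest so the pattern is easier to chart.
    return sorted(cleaned, key=lambda item: item[1], reverse=True)


rows = clean_rows(RAW_CSV)
print(rows)
```

Silently dropping incomplete rows is a design choice that suits a sketch; in real reporting the skipped rows should be logged and checked, since missing values can themselves be part of the story.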

Visualize data

To visualize data in the form of graphs and charts, applications such as Many Eyes or Tableau Public are available. Yahoo! Pipes and Open Heat Map are examples of tools that enable the creation of maps based on data spreadsheets. The number of options and platforms is expanding. Some new offerings provide options to search, display and embed data, an example being Timetric.

To create meaningful and relevant visualizations, journalists use a growing number of tools. There are by now several descriptions of what to look for and how to do it. The most notable published articles are:


As of 2011, the use of HTML5 libraries based on the canvas tag is gaining in popularity. There are numerous libraries that make it possible to graph data in a growing variety of forms. One example is RGraph. As of 2011 there is also a growing list of JavaScript libraries for visualizing data.
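
The tools and libraries above do the drawing for the journalist, but the underlying idea is simply to scale values to shapes. The following Python sketch illustrates that idea by producing a horizontal bar chart as an SVG document. It does not use any of the tools named above; the function name bar_chart_svg, its parameters and the data are assumptions made for this example.

```python
from xml.sax.saxutils import escape

# Hypothetical cleaned data; figures are invented for illustration.
DATA = [("North", 1200), ("South", 950), ("East", 400)]


def bar_chart_svg(data, width=400, bar_height=20, gap=10, label_width=80):
    """Return an SVG document with one horizontal bar per (label, value)."""
    largest = max(value for _, value in data)
    height = len(data) * (bar_height + gap)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for index, (label, value) in enumerate(data):
        y = index * (bar_height + gap)
        # Scale each bar relative to the largest value in the data.
        bar_width = round((width - label_width) * value / largest)
        parts.append(
            f'<text x="0" y="{y + bar_height - 5}">{escape(label)}</text>'
        )
        parts.append(
            f'<rect x="{label_width}" y="{y}" '
            f'width="{bar_width}" height="{bar_height}" fill="steelblue"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = bar_chart_svg(DATA)
print(svg)
```

Saving the output to a file with the extension .svg and opening it in a web browser should display the chart.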

Publish data story

There are different options to publish data and visualizations. A basic approach is to attach the data to single stories, similar to embedding web videos. More advanced concepts make it possible to create single dossiers, e.g. to display a number of visualizations, articles and links to the data on one page. Often such specials have to be coded individually, as many content management systems are designed to display single posts based on the date of publication.

Distribute data

Providing access to existing data is another phase, which is gaining importance. Think of the sites as "marketplaces" (commercial or not), where datasets can be found easily by others.
Especially if the insights for an article were gained from open data, journalists should provide a link to the data they used for others to investigate (potentially starting another cycle of interrogation, leading to new insights).

Providing access to data and enabling groups to discuss what information could be extracted is the main idea behind Buzzdata, a site using the concepts of social media such as sharing and following to create a community for data investigations.

Other platforms (which can be used both to gather and to distribute data):

Measuring the impact of data stories

A final step of the process is to measure how often a dataset or visualization is viewed.

In the context of data-driven journalism the extent of such tracking, such as collecting user data or any other information that could be used for marketing reasons or other uses beyond the control of the user, should be viewed as problematic. One newer, non-intrusive option to measure usage is a lightweight tracker called PixelPing. The tracker is the result of a project by ProPublica and DocumentCloud. There is a corresponding backend solution to collect the data. The software is open source and can be downloaded via GitHub.
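
The general idea of non-intrusive counting can be sketched in a few lines of Python. This is not PixelPing's code or API; it is a hypothetical illustration in which each page embeds a tiny image whose URL carries a story key, and the server only increments a counter for that key. The function name record_hit, the parameter name key and the URLs are assumptions.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# View counts per story key; nothing else is recorded.
hits = Counter()


def record_hit(request_url):
    """Count one view for the story named in the `key` query parameter.

    Only the story key is kept; nothing about the visitor is stored.
    """
    query = parse_qs(urlparse(request_url).query)
    keys = query.get("key")
    if keys:
        hits[keys[0]] += 1


# Hypothetical requests, as a browser would send them when loading the image.
for url in [
    "/pixel.gif?key=war-logs-map",
    "/pixel.gif?key=war-logs-map",
    "/pixel.gif?key=budget-chart",
    "/pixel.gif",  # no key: ignored
]:
    record_hit(url)

print(hits.most_common())
```

Because the counter holds only story keys and totals, it answers the question of how often a visualization was viewed without collecting data about individual readers.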

Examples

There is a growing list of examples of how data-driven journalism can be applied:


Other prominent uses of data-driven journalism are related to the release by whistle-blower organization WikiLeaks of the Afghan War Diary, a compendium of 91,000 secret military reports covering the war in Afghanistan from 2004 to 2009. Three international news publications, namely The Guardian, The New York Times and Der Spiegel, dedicated extensive sections to the documents. The Guardian's reporting included an interactive map pointing out the type, location and casualties caused by 16,000 IED attacks; The New York Times published a selection of reports that permits rolling over underlined text to reveal explanations of military terms; and Der Spiegel provided hybrid visualizations (containing both graphs and maps) on topics like the number of deaths related to insurgent bomb attacks.

Tutorials & Tools: How to get started with data-driven journalism

There is an open and helpful community of data journalists willing to share how raw data can be turned into good stories. Below is a selected list of posts worth reading.

Tutorials and How To's

  • Simon Rogers: How to: get to grips with data journalism (2011)
  • Simon Rogers: Wikileaks' Afghanistan war logs: how our datajournalism operation worked (2010)
  • Paul Bradshaw: How to be a data journalist (2010)

Tools

For each phase there are now different tools. Each tool might have certain features that are helpful to solve a specific problem working with data. Beginners interested in data-driven journalism should explore the options in order to have a stack of tools ready on their computers to work with data. From these resources journalists should step by step gain experience and create their own "tool stack" to become productive.

See also

  • Database journalism
  • Computational journalism
  • Open data
  • Open source
  • Open knowledge
  • Freedom of information legislation
  • Information visualization


List of newspaper data desks

  • Data Desk, The Los Angeles Times
  • Datablog, The Guardian
  • http://www.texastribune.org/library/data/, The Texas Tribune


List of organizations interested in data driven journalism

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 