This image shows blurred computer code. (Courtesy of qubodup)

My position at inewsource as an “investigative data reporter” is new, but the newsroom’s work with data has been a source of pride for years. inewsource even built data journalism into its mission statement.

But what do we mean by “data journalism”?

First, what is data? The word can bring to mind all varieties of bar, line and pie chart. If you’re a real nerd, you might think of scatterplots or Sankey diagrams. But all those are really just representations of data. 

Data is information. That’s it. And data comes in many forms. It can be written into a spreadsheet that opens in Microsoft Excel, or it can be a stack of documents. Even the works of William Shakespeare can be viewed as data — you can count the prevalence of the words in his plays and plot them on a chart.
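To make the Shakespeare idea concrete, here is a minimal sketch in Python of counting how often each word appears in a piece of text. The function name `word_counts` and the one-line sample are my own choices for illustration; a real analysis would load the full text of a play and would probably handle spelling variants and stage directions more carefully.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each word appears, ignoring case and punctuation."""
    # Lowercase everything, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A single line from Hamlet stands in for a full play here.
sample = "To be, or not to be, that is the question"
counts = word_counts(sample)

# The most frequent words could then be handed to a charting tool.
top_words = counts.most_common(2)
print(top_words)
```

The resulting counts are the data; a bar chart of the most common words would be one way to represent them.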

A graph showing text analysis of words from Shakespeare plays. (Courtesy of Michael Clark)

Journalists use data constantly — even reporters who don’t have “data” in their titles — and in every stage of the story process. 

Data can be the foundation of a story — a reason to start making phone calls or knocking on doors. Did Candidate A raise more money than Candidate B this quarter? Are homicide rates rising or falling in a specific geographic area? A tipster says something shocking — “People are getting trapped in elevators” — but is that really true? 

If we think data can help answer a question, we go after it. Local, state and national governments gather and maintain huge amounts of data. Journalists can download data sets through publicly available sources, such as San Diego’s open data portal or the Federal Election Commission. We can also request data under federal and state open records laws, a process that can take weeks or months — even years. (I have records requests pending with the Indiana Department of Correction dating back to August 2020. Governments don’t always hand over public records without a fight.)

Journalists can also build their own data sets if waiting on the government will take too long, or if the data doesn’t exist in a form that’s useful. We can use a technique called “scraping,” which involves programming a computer to search webpages and automatically enter information into a spreadsheet. We can compile documents from courts or other government agencies and read through them, either manually or with a computer, to pull out the important parts. Or we can create surveys and send them out to a targeted group. 
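As a rough illustration of scraping, the sketch below reads the rows of an HTML table and writes them out in CSV form, which a spreadsheet program can open. It uses only Python's standard library. The sample page and its dollar figures are invented for this example, and the names `TableScraper`, `scrape_table` and `rows_to_csv` are hypothetical. A real scraper would first download the page (for instance with `urllib.request.urlopen`) and would need to be adapted to how that particular site is laid out.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Format rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample page; the figures are placeholders, not real data.
SAMPLE_PAGE = """
<table>
  <tr><th>Candidate</th><th>Raised</th></tr>
  <tr><td>Candidate A</td><td>1200</td></tr>
  <tr><td>Candidate B</td><td>950</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_PAGE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Running the same script on a schedule and saving each result is one way to watch a page for changes over time.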

Once we have the data we need, we analyze it in hopes of answering our questions, keeping in mind that data is only as reliable as the people who create it. (It’s often riddled with typos, inconsistencies and duplicates, too.) The reporting process can include several calls just to understand how information is gathered and stored, and to determine how reliable our findings might be. 
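One small piece of that cleanup can be sketched in code. The example below, written under my own assumptions, treats two records as duplicates when they differ only in capitalization or spacing. The records are placeholder names, and `normalize` and `drop_duplicates` are hypothetical helpers. It does not catch true typos such as a misspelled name, which usually take closer inspection or a phone call to sort out.

```python
def normalize(value):
    """Collapse extra whitespace and ignore case so near-identical entries match."""
    return " ".join(value.split()).casefold()


def drop_duplicates(records, key_fields):
    """Keep the first record for each normalized combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(normalize(record[field]) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Placeholder records; the second repeats the first with messy formatting.
records = [
    {"name": "Jane Doe", "city": "San Diego"},
    {"name": "jane  doe ", "city": "san diego"},
    {"name": "John Roe", "city": "Chula Vista"},
]

cleaned = drop_duplicates(records, ["name", "city"])
print(len(records), "records in,", len(cleaned), "records out")
```

Keeping the original records untouched and writing down each cleaning rule also makes it easier for someone else to retrace the analysis.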

In 2020, inewsource analyzed mortality data and found that pandemic-related deaths weren’t always captured in the county’s official totals and were concentrated in certain groups: “More than a third of the people excluded from the county’s death total are Hispanic, and deaths are increasing faster at home than in hospitals.” 

That finding was just a starting point, and the journalists reached out to officials, experts and advocates to add context and perspective. They also created infographics, or data visualizations, to display their findings. 

Data can also come in later to add context to something we’re already reporting. After months of digging into unethical human research at a San Diego Veterans Affairs health care office, inewsource journalists realized that whistleblower complaints in the case were basically ignored. They wanted to know how often that happened, so they wrote a story based on a data set they built themselves.

Stories can be written using what we gather and analyze, but simply allowing the audience to view and interact with data can be extremely valuable. The New York Times’s COVID data tracking efforts have been hugely important during the pandemic and helped the outlet win a Pulitzer Prize. The investigative nonprofit ProPublica has created tools that allow you to search for specific information, such as doctors who take payments from drug and device companies. 

And sometimes, data work lurks behind the scenes. When the Indiana government declined to release information on prisoner deaths early in the pandemic, I relied on people inside the facilities to feed me information. I soon realized the corrections agency was updating its website, marking prisoners as “Deceased,” so I started scraping that site to identify new deaths. From there, I tracked down family members and other prisoners to discuss the medical care those men received – or didn’t.

At inewsource, we try to be as transparent as possible about our methods to avoid mistakes and ensure that our audience views our work as credible. Sometimes, we even post a separate story explaining how we completed a project. We want our work to be reproducible: Someone else should be able to start with the same data and reach the same conclusions. 

With that said, this is my first post at inewsource, and I’m looking for stories and data to dig into. What questions do you have about Southern California? Email me at jakeharper@inewsource.org.
