Newspaper Databases: All That’s Not Fit for Research

They say newspapers are the first draft of history. They capture and disseminate noteworthy events as they unfold, and they are used by succeeding generations to make sense of a nation’s history and identity. Although this chronicling is increasingly occurring with dizzying speed on the web, newspapers, especially the paper editions captured in databases, will remain a fundamental resource for scholarly and other types of research for the foreseeable future.

Limitations of Full-Image Databases

There are, however, a number of problems with using newspaper databases for research. In his article “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010,” Prof. Ian Milligan identifies one such issue: the shortcomings of optical character recognition (OCR) in databases of scanned microfilm. In a related post, he notes that keyword searches in databases that contain digitized, full-image versions of newspapers often result in incomplete retrieval of articles, due to the nature of the scanned material, the speed with which these databases were created, and the technological limitations. Consequently, research results can be problematic:

[H]yphenations are not covered (problematic in smaller columns, where Woodwork might be hyphenated as Wood-work across two lines), if microfilm streaks obscure a letter, if it was slightly tilted, or if the OCR just plain misses a character.* 

Prof. Milligan likens using these databases uncritically for historical research to “using a volume of the Canadian Historical Review with 10% or so of the pages ripped out.” While he recognizes that these databases are indispensable tools, he urges researchers to be aware of their limitations and to identify how they dealt with them.
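
To make the hyphenation problem in the quoted passage concrete, here is a minimal sketch in Python. It is an illustration only: the sample text, the function names, and the clean-up rule are my own assumptions, not a description of how any particular newspaper database indexes or searches its scans.

```python
import re

# Hypothetical OCR output for a narrow newspaper column (invented sample text):
# "Woodwork" is split across two lines, and one later occurrence has a
# misread character ("w0odwork").
ocr_text = (
    "The guild praised the fine Wood-\n"
    "work on display at the fair.\n"
    "Local w0odwork classes begin Monday.\n"
)

def naive_search(text: str, keyword: str) -> int:
    """Count exact, case-insensitive occurrences of the keyword."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))

def dehyphenate(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

# Exact search on the raw OCR text.
print(naive_search(ocr_text, "woodwork"))
# The same search after rejoining line-break hyphens.
print(naive_search(dehyphenate(ocr_text), "woodwork"))
```

In this toy sample the raw search finds nothing, and rejoining the line-break hyphen recovers one of the two mentions. The misread character still hides the other, and the same rule would wrongly merge genuine hyphenated compounds that happen to fall at the end of a line, which is one reason results from such searches should be treated as incomplete.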

Not Just a Database Issue

As a former researcher at Canada’s “newspaper of record,” I have additional concerns about relying on newspapers and newspaper databases for research. Despite the best efforts of reporters, editors, researchers, and archivists, news articles have long been replete with inaccuracies and omissions. The reasons are numerous and have to do with both structural and human shortcomings, including the fast pace of news production, the lack of access to sources and resources, the lack of space, human error, editorial bias, and editorial decision-making regarding which corrections are worth appending. Once news articles make it into databases, other problems arise: graphics are not rendered in text-based electronic databases, for instance, and the databases themselves have technical shortcomings in search and display.

Add to these the continuing economic pressures facing news organizations, which have necessitated deep cuts at many newspapers, and using newspapers as research sources has become increasingly problematic. In the seemingly endless rounds of layoffs since the start of the Great Recession, copyeditors, researchers, and enhancers/archivists — the guardians of accuracy, clarity, and order — have been the worst hit, while reporters and editors are being expected to do more with less. Errors, omissions, bias, and inconsequential content are now baked into the newspaper product, and this will have deep consequences for future scholarship and research.

All this to say that cautionary notes like Prof. Milligan’s are welcome and necessary, and researchers should always, always cross-reference research results with multiple and varied sources.

*It is my understanding that an upgraded database for The Globe and Mail’s Canada’s Heritage from 1844, one that will address some of these shortcomings, is in the works. For example, it will use higher-quality OCR and will search and identify articles as a whole, even across pages.

Photo source: Jon S, Flickr