Newspaper Databases: All That’s Not Fit for Research

They say newspapers are the first draft of history. They capture and disseminate noteworthy events as they unfold, and they are used by succeeding generations to make sense of a nation’s history and identity. Although this chronicling increasingly happens at dizzying speed on the web, newspapers, especially the paper editions captured in databases, will remain a fundamental resource for scholarly and other types of research for the foreseeable future.

Limitations of Full-Image Databases

There are, however, a number of problems with using newspaper databases for research. In his article “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010,” Prof. Ian Milligan identifies one such issue: the shortcomings of optical character recognition (OCR) in databases of scanned microfilm. In a related post, he notes that keyword searches in databases that contain digitized, full-image versions of newspapers often result in incomplete retrieval of articles, due to the nature of the scanned material, the speed with which these databases were created, and the technological limitations. Consequently, research results can be problematic:

[H]yphenations are not covered (problematic in smaller columns, where Woodwork might be hyphenated as Wood-work across two lines), if microfilm streaks obscure a letter, if it was slightly tilted, or if the OCR just plain misses a character.* 

Prof. Milligan likens using these databases uncritically for historical research to “using a volume of the Canadian Historical Review with 10% or so of the pages ripped out.” While he recognizes that these databases are indispensable tools, he urges researchers to be aware of their limitations and to identify how they dealt with them.
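To make the hyphenation problem concrete, here is a minimal Python sketch. It is an illustration only, not a description of how any particular newspaper database works: the sample text and the function names `dehyphenate` and `contains_keyword` are hypothetical. It shows how a plain keyword search misses a word split across two lines, and how joining line-end hyphens before searching recovers it.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at a line break ('Wood-\\nwork' -> 'Woodwork').

    Assumption: every line-end hyphen is a soft hyphen added by typesetting.
    Genuinely hyphenated compounds that happen to break at the hyphen
    would be merged incorrectly by this simple rule.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def contains_keyword(text: str, keyword: str) -> bool:
    """Case-insensitive substring search, standing in for a keyword query."""
    return keyword.lower() in text.lower()


# Hypothetical OCR output from a narrow newspaper column.
ocr_column = "The local Wood-\nwork shop reopened on Monday."

naive_hit = contains_keyword(ocr_column, "woodwork")
repaired_hit = contains_keyword(dehyphenate(ocr_column), "woodwork")

print(naive_hit)     # False: the raw text contains "Wood-\nwork", not "Woodwork"
print(repaired_hit)  # True: the split word is rejoined before searching
```

A repair like this only addresses line-end hyphenation. Streaked microfilm, tilted pages, and characters the OCR misses outright call for other measures, such as fuzzy matching or checking results against the page images.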

Not Just a Database Issue

As a former researcher at Canada’s “newspaper of record,” I have additional concerns about relying on newspapers and newspaper databases for research. Despite the best efforts of reporters, editors, researchers, and archivists, news articles have long been replete with inaccuracies and omissions. The reasons are numerous and stem from both structural and human shortcomings: the fast pace of news production, limited access to sources and resources, lack of space, human error, editorial bias, and editorial decisions about which corrections are worth appending, among others. Once news articles make it into databases, other problems arise: graphics are not rendered in text-based electronic databases, and the databases themselves have technical shortcomings in search and display.

Add to these the continuing economic pressures facing news organizations, which have necessitated deep cuts at many newspapers, and using newspapers as research sources becomes increasingly problematic. In the seemingly endless rounds of layoffs since the start of the Great Recession, copyeditors, researchers, and enhancers/archivists (the guardians of accuracy, clarity, and order) have been the worst hit, while reporters and editors are expected to do more with less. Errors, omissions, bias, and inconsequential content are now baked into the newspaper product, and this will have deep consequences for future scholarship and research.

All this is to say that cautionary notes like Prof. Milligan’s are welcome and necessary, and researchers should always, always cross-reference research results with multiple and varied sources.

*It is my understanding that an upgraded version of The Globe and Mail’s Canada’s Heritage from 1844 database is in the works and will address some of these shortcomings. For example, it will use higher-quality OCR and will search and identify articles as a whole, even across pages.

Photo source: Jon S, Flickr

The Ethics of Social Media Cyber-Sleuthing

Without a doubt, social media and social networking sites like Facebook, Twitter, LinkedIn, and countless others have become indispensable tools in conducting background investigations, due diligence, employment pre-screening, and other types of investigations. Pursuit Magazine recently ran a good two-part series that not only pointed to some lesser-known social media sites but also discussed the importance of adequately capturing and presenting the information found on these sites.

The articles also highlighted some ethical and legal issues around gathering such information, advising, for example, against using shady techniques like pretexting and password cracking to gain access to protected material. Additionally, in Canada, a number of laws – notably human rights and privacy laws – govern the types of information that may be gathered on social media and elsewhere, the methods used for gathering the information, and the decisions made based on the information.

To stay on the side of the law, it is crucial for organizations and investigators to exercise caution when researching, collecting, and disclosing personal information about individuals. The Information and Privacy Commissioner of British Columbia has released some guidelines for social media background checks (PDF), identifying some pitfalls and issues to keep in mind:

  • Accuracy of information (Is it the right profile? Was the profile created by the individual himself or herself? Is the information current?)
  • Collecting irrelevant or too much information
  • Over-reliance on consent

Exercising good judgment when trawling social media sites isn’t just a matter of law and ethics; it can also save the organization from embarrassment, a lesson that the Toronto Star learned the hard way when it published false allegations against an Ontario MPP based on an old Facebook photo. The newspaper issued a rare front-page apology, citing an “egregious lapse” of standards.

Photo source: Jason Howie, Flickr

Are You a Skilled Googler?

Most of us think we’re great Googlers. And it’s a testament to Google’s strength as a mostly reliable search engine that we do usually find what we’re looking for with a few simple keywords. But beyond the quick factual search, things can get tricky, and as a number of studies have shown, most of us miss good information on the open web due to our limited search skills (and here it’s worth noting that less than 10% of online information is actually available on the open web via search engines; the other 90% resides on the deep or invisible web).

There are a number of ways to improve your search skills. While Google appears simple and intuitive on the surface, its power can best be harnessed with some training, and Google provides a number of online training guides to help improve the search skills of its users. Two self-paced courses have been developed for power searching and advanced power searching, and this course, geared to students and their teachers, provides lesson plans and trivia challenges. Also available are webinars that guide the user through a variety of tools and techniques to find higher quality sources more easily.

But no matter how advanced a Googler you become, you’ll be missing a lot of good information if you rely solely on Google. Other search engines such as Bing and DuckDuckGo index the web differently and have different ways of prioritizing results. (See this slide deck from Karen Blakeman of RBA Information Services for some alternatives to Google.) And as mentioned before, only a small fraction of online information is indexed through search engines; countless specialized databases and indexes provide high-quality material that won’t appear in search engine results.

By the way, Google has come up with a fun way to put your Google search skills to the test. A Google a Day is a daily puzzle that can be solved by using clever search skills on Google.