Reviewing academic literature has become much more efficient over the past 20 years. When I was in grade school, libraries still used card catalogs: drawers of note cards that indexed the library’s contents by author, title, or subject. By the time I finished high school, most libraries had converted their physical card catalogs to electronic ones. And when I was finishing my undergraduate degree in 2004, electronic databases of many academic journals had been established and were becoming more and more user-friendly. Now, as I work on my PhD, I can easily search massive online databases of academic journals with a few keywords and get full-text PDFs of articles in just a few seconds. It’s as easy as a Google search, and, in fact, one of the online databases I use is Google Scholar. I occasionally go to the library to pick up a book, but I don’t think I’ve looked at a hard copy of an academic journal since starting the PhD. I even try to avoid hard copies of books, because Kindle versions can be annotated electronically and then easily integrated into my academic workflow.
Over the past couple of decades, searching for literature has gotten much faster and many sources have become available in digital form, but the literature review process itself has remained basically the same: select a topic, search for sources by keyword, review the sources, and repeat.
In the next few decades, advances in computer science may allow researchers to interact with literature in new ways. The two possibilities I outline below are close to being available now, but they’re not easy to implement, and there’s plenty of room for them to improve.
First, researchers should be able to identify the key articles on a given topic without doing the legwork of sorting through current articles and tracing their citations back to the most influential pieces. An application could crawl academic literature databases for certain terms, pull down the articles that match, parse their citation lists, and construct a network map of sources on the topic. That network map could be used to identify particularly influential pieces of literature, and by taking each source’s publication date into account, the application could flag sources that mark turning points in a given discourse.
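To make this concrete, here’s a rough sketch of what the network-mapping step might look like once the crawler has extracted citation pairs and publication years. It’s written in Python with the networkx library, and the citation data is made up purely for illustration:

```python
# A minimal sketch of the network-mapping step, assuming the crawler has
# already produced citation pairs and publication years (toy data here).
import networkx as nx

# (citing_id, cited_id) pairs pulled from parsed reference lists -- hypothetical data
citations = [("smith2001", "jones1990"), ("lee2003", "jones1990"), ("lee2003", "smith2001")]
pub_year = {"jones1990": 1990, "smith2001": 2001, "lee2003": 2003}

g = nx.DiGraph()
g.add_edges_from(citations)

# How often each source is cited within the crawled set.
influence = {node: g.in_degree(node) for node in g.nodes}

# Rank candidate "turning point" articles: heavily cited sources, oldest first.
ranked = sorted(influence, key=lambda n: (-influence[n], pub_year.get(n, 9999)))
for node in ranked:
    print(node, pub_year.get(node), influence[node])
```

A real version would use a more sophisticated centrality measure than raw citation counts, but the basic idea is the same: once the citations are structured as a graph, spotting influential sources is straightforward.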
This is possible now, but no one has bundled the existing pieces into a well-engineered, user-friendly package. Crawling is nothing new in computer science; Google has been using web crawlers to build an index of web pages since its inception. Here’s an early paper by Google’s founders describing their plans for Google in 1998. (Here are the conference proceedings.)

To create a network map of sources, the program needs two things for each source: the list of works it cites and the list of works that cite it. There are many ways to get an article’s citation list automatically. Some academic literature databases, like Google Scholar, provide sources’ citation lists directly. Alternatively, the program could retrieve the full text of the article as a PDF, perform optical character recognition (OCR) on the PDF, and then parse the citations out of the resulting plain text. Google maintains a free OCR engine called Tesseract that can be used with Perl modules like PDF::OCR2 to pull plain text out of PDFs automatically. The plain-text reference lists could then be split into individual citations (e.g., with regular expressions) and fed back into the crawler to repeat the process. Some databases already provide the list of sources that cite a given source; Web of Knowledge, for example, offers a cited reference search that shows which articles have cited a given source. And there’s already at least one Perl module (SNA::Network) that can do the network analysis. Most, if not all, of the pieces for such an application already exist.
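To give a flavor of the OCR-and-parse step, here’s a rough sketch. I mentioned Perl modules above, but the same Tesseract engine can be reached from Python via the pdf2image and pytesseract packages; the file name and the citation-matching regular expression below are placeholders, not a working citation parser:

```python
# A Python sketch of the OCR-and-parse step described above; the post mentions
# Perl's PDF::OCR2, but pdf2image + pytesseract wrap the same Tesseract engine.
import re
from pdf2image import convert_from_path  # renders PDF pages as images (requires poppler)
import pytesseract                        # Python wrapper around the Tesseract OCR engine

def pdf_to_text(path):
    """OCR every page of a PDF and return the combined plain text."""
    pages = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

def parse_citations(text):
    """Crude placeholder: grab lines after 'References' that look like
    'Author, A. (1998). Title ...' -- a real parser would need far more care."""
    references = text.split("References")[-1]
    pattern = re.compile(r"^[A-Z][^\n]+\(\d{4}\)[^\n]+$", re.MULTILINE)
    return pattern.findall(references)

text = pdf_to_text("some_article.pdf")    # hypothetical file name
for citation in parse_citations(text):
    print(citation)
```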
Imagine identifying a theory of interest and, within minutes, having the foundational article for that theory along with the major branches of discourse stemming from it. Perhaps disciplinary researchers wouldn’t find much value in this, because researchers in a field already know which articles are in its canon. As an interdisciplinary researcher, I think a tool like this would be incredibly exciting. And even disciplinary researchers might discover that particular articles are more influential than they previously thought.
Second, researchers should be able to automatically code articles to identify patterns in the academic literature. If researchers can specify the themes they’re looking for, a computer program can do the work of finding articles and coding them according to the researchers’ coding schemes. Such programs have already been used in biology to extract information on biodiversity from scientific articles. There’s a nice summary article here, and I recommend looking at its summary figure. Information on biodiversity is well suited to this approach because biological classification has a clear taxonomy (e.g., species, genus, and so forth). But researchers are interested in many other themes that could be specified just as explicitly. For example, I’m currently researching participatory approaches in international development projects, and I’m coding the literature according to a taxonomy of participatory approaches that I’ve developed. Sophisticated programs like those used in biology could automatically find and code the literature according to my own taxonomy.
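As a toy illustration, here’s what rule-based coding against a small taxonomy might look like in Python. The categories and keywords are invented placeholders, not my actual coding scheme:

```python
# A bare-bones sketch of rule-based coding against a home-grown taxonomy.
# The categories and keywords below are invented placeholders, not the
# taxonomy from my own research.
taxonomy = {
    "consultation":      ["focus group", "survey", "public hearing"],
    "collaboration":     ["co-design", "joint decision", "stakeholder committee"],
    "community_control": ["community-led", "local ownership", "self-mobilization"],
}

def code_article(text):
    """Return the taxonomy codes whose keywords appear in the article text."""
    lowered = text.lower()
    return [code for code, keywords in taxonomy.items()
            if any(keyword in lowered for keyword in keywords)]

print(code_article("The project used focus groups and a stakeholder committee."))
# ['consultation', 'collaboration']
```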
There are several ways to “teach” a program how to code text. The researcher could define rules for interpreting the text (called hand-crafted rules). The researcher could also code a subset of articles with the scheme and then feed the hand-coded articles into the program, which then “learns” the coding scheme from them (called machine learning). Currently, developing this kind of system is very labor-intensive. But imagine what will be possible in 10 or 20 years, when both computational power and computers’ ability to process natural language have increased dramatically. Perhaps a researcher could code 10 articles on his or her own, feed them into a program, and then let the program search for additional articles, code them automatically, and deliver a summary of the results.
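For instance, here’s a toy sketch of the machine-learning route using scikit-learn. The hand-coded examples and labels are made up, and a real system would need far more training data (and probably multi-label handling):

```python
# A toy sketch of the machine-learning route: train on a handful of hand-coded
# abstracts, then predict codes for new ones. The texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

hand_coded_texts = [
    "Villagers were consulted through surveys and public hearings.",
    "The community designed and managed the irrigation scheme itself.",
    "Donors set the agenda; residents attended information sessions only.",
]
hand_coded_labels = ["consultation", "community_control", "tokenism"]

# Turn the hand-coded articles into term-frequency features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(hand_coded_texts)

# "Teach" the classifier the coding scheme from the hand-coded examples.
classifier = MultinomialNB()
classifier.fit(X_train, hand_coded_labels)

# Code a new, unseen article automatically.
new_articles = ["Residents were surveyed before the project began."]
X_new = vectorizer.transform(new_articles)
print(classifier.predict(X_new))
```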
The programs I’ve described here are better suited to some research questions than others, and for some they may be totally worthless. It’s easy to imagine, for instance, that people in the humanities would get much less use out of them. I don’t think it will ever be possible (or desirable) to remove people from the research process, but it may be possible to make certain parts of it much more efficient.