Monday, February 23, 2015

Fixing Sente's Export to BibTeX

I recommended Sente as a reference manager and tool for reviewing literature in a previous post. And Sente is great for both of those tasks (assuming you have a Mac). However, Sente does a poor job when exporting a library to BibTeX format—something you would want to do if you were writing an article using LaTeX. For example, Sente leaves the title field blank for web pages. Fortunately, I have found that I can solve these kinds of problems with a bit of coding.

Outline of the Solution

Sente accurately exports data to its own SenteXML format, so my solution uses the following steps:

  1. From Sente, export the reference library to SenteXML format.
  2. From Sente, export the reference library to BibTeX format.
  3. Using a script,
    1. Read in the SenteXML file.
    2. Read in the BibTeX file.
    3. Loop through each entry in the BibTeX file and
      1. Check if the entry is a web page;
      2. If it is a web page, retrieve the title from the same entry in the SenteXML file and save the new title for the BibTeX entry;
      3. Save the BibTeX entry.

First, note that you must start with a library of references in Sente and then export your library to the two formats listed in steps 1 and 2. Second, note that this procedure could be used to check and modify any field, but in this blog post, I address missing titles for web pages.
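
The loop in step 3 can be sketched as follows. This is a minimal illustration in Python, not the perl script from this post. The entries are plain dictionaries, and the entry type "webpage", the sample citation keys, and the function names is_web_page and fill_missing_titles are all assumptions made for the example. Parsing the two exported files is left out because their exact layout is not described here.

```python
def is_web_page(entry):
    """Hypothetical test: assume web pages carry the entry type 'webpage'."""
    return entry["type"].lower() == "webpage"


def fill_missing_titles(bib_entries, sente_titles):
    """Return new BibTeX entries with web-page titles taken from Sente data.

    bib_entries  -- list of dicts: {"key": ..., "type": ..., "fields": {...}}
    sente_titles -- dict mapping citation key -> title from the SenteXML export
    """
    repaired = []
    for entry in bib_entries:
        fields = dict(entry["fields"])  # copy so the input is not modified
        # Steps 3.3.1 and 3.3.2: only web pages get their title replaced.
        if is_web_page(entry) and entry["key"] in sente_titles:
            fields["title"] = sente_titles[entry["key"]]
        # Step 3.3.3: keep every entry, changed or not.
        repaired.append(
            {"key": entry["key"], "type": entry["type"], "fields": fields}
        )
    return repaired


# Made-up sample data standing in for the two exported files.
bib_entries = [
    {"key": "Smith2014", "type": "webpage",
     "fields": {"title": "", "url": "http://example.com"}},
    {"key": "Jones2010", "type": "book",
     "fields": {"title": "A Book"}},
]
sente_titles = {"Smith2014": "An Example Page", "Jones2010": "Ignored"}

repaired = fill_missing_titles(bib_entries, sente_titles)
for entry in repaired:
    print(entry["key"], "->", entry["fields"]["title"])
```

Matching entries by citation key is itself an assumption: it only works if both exports identify a reference by the same key.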

Solution Details

Getting Perl

I implemented the solution as a perl script. If you don’t have perl, I recommend installing it with perlbrew, especially on a Mac or a Linux machine, because perlbrew leaves the version of perl that ships with your operating system untouched. Some system utilities and other applications may rely on that default installation and could break if you modify it. See perlbrew’s web page for installation instructions, but you can likely install it with the following command:

\curl -L http://install.perlbrew.pl | bash

Then, to install version 5.16.0 of perl, which is the version assumed in the rest of this post (run perlbrew available to list newer releases), enter the following command:

perlbrew install perl-5.16.0

Setting Up Perl

The script uses a few external perl modules, which need to be installed. Assuming that you’re using perlbrew with version 5.16.0 of perl, enter each of the following lines in the terminal, waiting for each command to finish before entering the next:

perlbrew exec --with perl-5.16.0 cpanm XML::Simple
perlbrew exec --with perl-5.16.0 cpanm Text::BibTeX
perlbrew exec --with perl-5.16.0 cpanm Getopt::Long

Using the Script

You can retrieve the script and some example files from the GitHub repository, which you can also download in full as a ZIP file.

To use the script, put all three files (the script, the SenteXML file, and the original BibTeX file) in the same folder. If you’re using your own files instead of the example files, either rename them to the default file names used in the script or specify their names with command-line arguments (see below). The default file names are the ones used in the example files:

  • SenteXML file: references.xml
  • Original BibTeX file: references.bib
  • Updated BibTeX file: references_new.bib

Next, open a terminal and run the script with the following command (again, assuming that you have used perlbrew to install version 5.16.0 of perl):

perl5.16.0 repair.pl

The script will generate a file called references_new.bib, which contains the same references as the original file but with corrected titles for web page entries.

The script also accepts the following command line arguments:


--sente-file  FILENAME     Name of the SenteXML file
--bib-infile  FILENAME     Name of the original BibTeX file
--bib-outfile FILENAME     Name of the updated BibTeX file

For example, if you want to save the updated file as bibliography.bib instead of references_new.bib, enter the following command in the terminal:

perl5.16.0 repair.pl --bib-outfile bibliography.bib

You could also directly overwrite the old BibTeX with the updated one:

perl5.16.0 repair.pl --bib-outfile references.bib

But be careful because the old BibTeX file will be gone forever.
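
The option handling described above can be sketched like this. It is a Python illustration using the standard argparse module, not the perl script itself (which uses Getopt::Long). The option names and default file names come from this post; the function names build_parser, parse_options, and overwrites_input are made up for the example.

```python
import argparse
import os


def build_parser():
    """Build a parser with the three options and defaults listed in the post."""
    parser = argparse.ArgumentParser(
        description="Repair web page titles in a BibTeX export."
    )
    parser.add_argument("--sente-file", default="references.xml",
                        metavar="FILENAME", help="Name of the SenteXML file")
    parser.add_argument("--bib-infile", default="references.bib",
                        metavar="FILENAME",
                        help="Name of the original BibTeX file")
    parser.add_argument("--bib-outfile", default="references_new.bib",
                        metavar="FILENAME",
                        help="Name of the updated BibTeX file")
    return parser


def parse_options(argv=None):
    """Parse argv (a list of strings); None means the real command line."""
    return build_parser().parse_args(argv)


def overwrites_input(options):
    """Hypothetical safety check: True if the output would replace the input."""
    return (os.path.abspath(options.bib_infile)
            == os.path.abspath(options.bib_outfile))


# With no arguments, the default file names apply.
defaults = parse_options([])
# Overriding only the output file leaves the other defaults in place.
custom = parse_options(["--bib-outfile", "bibliography.bib"])

print(defaults.bib_outfile)
print(custom.bib_outfile)
print(overwrites_input(parse_options(["--bib-outfile", "references.bib"])))
```

A check like overwrites_input is one way to ask for confirmation before the original BibTeX file is replaced.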

Next Steps

The script is a straightforward procedural script that loops through entries in a BibTeX file and makes changes according to some simple instructions. It’s not sophisticated, and for a simple task such as this, it doesn’t need to be. But there are many other ways in which an author might want to fix up a BibTeX file.

When I wrote my dissertation, I used a more elaborate version of this script to do all of the following:

  • convert titles to headline-style capitalization according to the Chicago Manual of Style,
  • correctly indicate the translator(s) of sources,
  • correctly alphabetize institutional authors with hyphens in their names (e.g., “UN-HABITAT”),
  • clean up the edition field, which inconsistently contained ordinal numbers, “ed.”, and “edition” in my reference library,
  • clean up US state names (e.g., by removing internal periods), and
  • insert missing titles for laws and statutes.

With these tasks, it might make sense to create some more general methods—for example, a general method that retrieves the contents of a field (such as the title field), sends it to a regex, and then updates the field with the result.
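
Such a general method might look like the sketch below. It is written in Python for illustration and is not taken from my dissertation script. Entries are simplified to flat dictionaries of fields, and the names update_field, strip_periods, and clean_edition, as well as the exact cleanup rules and sample data, are assumptions chosen for the example.

```python
import re


def update_field(entries, field, transform):
    """Apply transform to the given field of every entry that has it."""
    for entry in entries:
        if field in entry:
            entry[field] = transform(entry[field])
    return entries


def strip_periods(text):
    """Remove all periods, e.g. 'D.C.' becomes 'DC'."""
    return re.sub(r"\.", "", text)


def clean_edition(text):
    """Drop a trailing 'ed.', 'ed', or 'edition' and keep the rest."""
    return re.sub(r"\s*\b(?:edition|ed\.?)\s*$", "", text,
                  flags=re.IGNORECASE)


# Made-up sample entries.
entries = [
    {"key": "Smith2014", "edition": "2nd ed.", "address": "Washington, D.C."},
    {"key": "Jones2010", "edition": "3rd edition"},
]

update_field(entries, "edition", clean_edition)
update_field(entries, "address", strip_periods)

for entry in entries:
    print(entry)
```

Because each cleanup rule is an ordinary function, adding a new task means writing one more small transform rather than another loop over the file.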

Advantages of Scripting

In general, I found that using perl to fix up the bibliographic database was very efficient. I did not want to manually update the BibTeX database because it was very large (over 2,000 entries) and because I was using Sente as the actual reference manager. If I had updated the BibTeX file manually, I would have had to re-export to BibTeX every entry that I modified in the Sente library, overwrite the old versions in the BibTeX file, and then manually update those entries in the BibTeX file as needed (e.g., entering titles for websites). That would have been far too much manual work because I was constantly adding new entries to my Sente reference library and occasionally updating older entries.

This approach is also more flexible. If I were to manually update my entire database so that (for example) all titles conform to Chicago-style headline capitalization, I would have to edit everything again to use those same entries in a document that needs to conform to a different style guide. By implementing these changes with a script, I left myself a relatively easy way to change the formatting for different documents.