Monday, October 20, 2014

Academic Workflow for the Ages (Part 2)

Over two years ago, I wrote a post about an academic workflow, mostly for literature review. I did not focus on writing, although I did recommend that people use Scrivener to draft their documents (dissertation chapters, journal articles, and so forth). In this post, I discuss an alternative way to draft documents, which I think is much better.

Pandoc

Natural Writing

I recommend that people write their documents in pandoc markdown. Pandoc’s author describes it as “your swiss-army knife” for documents because it can convert between many document formats (e.g., HTML to MS Word).

Markdown is a syntax for plain-text documents that aims to be readable in plain text but also have enough structure that it can be parsed and translated into other formats. (Markdown’s original authors aimed to generate HTML documents from markdown.) For example, the following is a list in a markdown document:

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

Even in a plain-text file, it’s clear that this is a list, so it’s easy to read this document in plain text and understand the intended formatting. Additionally, because the syntax is very simple, it’s easy to just write your thoughts, arguments, and whatever else you need to without pausing to deal with formatting. If I wanted to make a list in an e-mail, I would write it exactly as written above. At least for lists, there’s nothing new to learn, and in general, writing in markdown is very natural. Markdown achieves a tremendous separation between content and formatting. Pandoc converts the list above into the HTML shown below.

<p>Here is a list written in markdown:</p>
<ul>
<li>Here is an item in the list.</li>
<li>Here is another item in the list.</li>
<li>Here is the final item in the list.</li>
</ul> 

When writing in pandoc markdown, you can also use HTML comments to make notes to yourself and keep them right next to the material they refer to. For example, you could write an outline of a journal article you’re drafting and enclose it in an HTML comment, so it’s excluded from the output (for most formats) but still at the top of your own markdown document.

<!-- This is an HTML comment. --> 

<!-- 
Comments can also 
span 
several lines. 
--> 

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

And because pandoc markdown is stored in plain-text files, you can use a variety of text editors to write them. Personally, I prefer a minimalist writing environment because there are fewer distractions. MS Word, for example, has so many formatting tools immediately available that it’s tempting to write something and then immediately fix up how it looks. With a plain-text editor, there are virtually no distractions. My favorite at the moment is TextWrangler on Mac with the font set to display 24pt Helvetica. It’s free and simple, and it does what I need it to.

More Features

Pandoc expands traditional markdown with new features, such as different kinds of lists, different kinds of tables, in-text citations and reference lists, footnotes, and metadata. For example, you can make a table with the following pandoc markdown:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

If you want to cite a source, you can write the following:

Blah blah [@smith04; @doe99].

smith04 and doe99 are BibTeX keys in a BibTeX database.[1] Pandoc will replace them with proper in-text citations and generate a reference list at the end of the document. You can even use Zotero’s citation styles, so there are a lot of options for automatic formatting, including Chicago, APA, MLA, and formats for many academic journals.[2]
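To make the citation setup a little more concrete, here is a minimal Python sketch that assembles the pandoc arguments for a document containing citations. The helper name and the file names (paper.md, refs.bib, apa.csl) are hypothetical. The --bibliography and --csl options are pandoc's own. The --citeproc flag is an assumption about your version: recent pandoc releases use it to switch on citation processing, while older releases relied on a separate pandoc-citeproc filter, so check the readme for the version you have installed.

```python
def citation_command(source, output, bibliography, csl=None):
    """Build the pandoc argument list for a document that contains citations."""
    # --citeproc assumes a recent pandoc release (see the note above).
    command = ["pandoc", "--citeproc", "--bibliography=" + bibliography]
    if csl is not None:
        # A CSL style file (the kind Zotero uses) sets the citation format.
        command.append("--csl=" + csl)
    command += ["-o", output, source]
    return command


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(citation_command("paper.md", "paper.html", "refs.bib", "apa.csl")))
```

The function only builds the argument list. To actually run the conversion, you could pass that list to subprocess.run from Python's standard library, or simply type the printed command in a terminal.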

For all the features and how to use them, spend some time on pandoc’s readme page, and try them out for yourself. Note: The table and citation examples in this section were copied verbatim from pandoc’s readme.

More Formats

Pandoc can convert markdown into formats other than HTML, including LaTeX. Pandoc converts the markdown list above into the following LaTeX:

Here is a list written in markdown:

\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
  Here is an item in the list.
\item
  Here is another item in the list.
\item
  Here is the final item in the list.
\end{itemize}

And as I mentioned above, pandoc can also convert markdown to an MS Word document (*.docx file). Do you see where this is going? If you write your documents in pandoc markdown, you can easily convert them to other formats as needed. Lists, tables, and citations will all appear correctly in a variety of formats. If you’re a PhD student and your advisors want to review your work in MS Word (usually because of its handy track changes feature and comments feature), you can just take the current version of your markdown file, convert it to MS Word, and send it out. And when it’s time to make a final version, you can still convert your markdown to LaTeX and apply whatever template you want to (your school’s template, a journal’s article template, or your own custom template). You don’t have to change your source file at all; just apply a new template to get the formatting you need.
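If you convert the same source file to several formats regularly, it may be worth scripting. The following is a minimal Python sketch of that idea, not a polished build tool. The helper name, the extension table, and the template file name school.latex are all hypothetical; the -r, -w, -o, and --template options are pandoc's.

```python
from pathlib import Path

# Output formats mentioned in this post, mapped to the file extension for each.
EXTENSIONS = {"html": ".html", "latex": ".tex", "docx": ".docx"}


def pandoc_command(source, writer, template=None):
    """Build the pandoc argument list that converts `source` to `writer` format."""
    # Name the output after the source file, swapping only the extension.
    output = str(Path(source).with_suffix(EXTENSIONS[writer]))
    command = ["pandoc", "-r", "markdown", "-w", writer, "-o", output]
    if template is not None:
        # --template applies to text-based writers such as LaTeX and HTML.
        command.append("--template=" + template)
    command.append(source)
    return command


# One source file, two outputs: a draft for reviewers and a templated final version.
for arguments in (
    pandoc_command("temp.md", "docx"),
    pandoc_command("temp.md", "latex", template="school.latex"),
):
    print(" ".join(arguments))
```

The markdown source never changes between the two calls; only the writer and the template do. Passing either list to subprocess.run would perform the conversion, assuming pandoc is installed.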

A Simple Example

Returning to the list example above, if I have a markdown file named temp.md with the following contents:

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

I can open a terminal and convert it to HTML with the following command:

pandoc -r markdown -w html -o temp.html temp.md

Pandoc will create an HTML file called temp.html with the following contents:

<p>Here is a list written in markdown:</p>
<ul>
<li>Here is an item in the list.</li>
<li>Here is another item in the list.</li>
<li>Here is the final item in the list.</li>
</ul>

Reproducible Statistics

But wait. There’s more!

If your work involves statistics and you’re familiar with R, you can write something called R Markdown. R Markdown files can contain both markdown and R code (and R code that writes markdown). So you can write an R Markdown file with your entire statistical analysis that writes out fresh statistics, tables, and figures every time you process the file. If you find an error in your work or you want to update the way a figure looks, just change the code in the R Markdown file and reprocess it. Everything else stays the same, and your changes are included.

I won’t get into the details of setting up an R Markdown file here, but basically, you create a file with the *.rmd extension and write in your markdown and R code. Then, you “make” the file by running R on it. There are different ways to do this. You could call R from the command line with Rscript, for example. In any case, the result will be a pandoc markdown file (with a *.md extension), which you can then convert to other formats (e.g., MS Word or LaTeX) as you would any other pandoc markdown file. In addition, you can easily share your R code with anyone who wants to check your work.

Disadvantages

A Command-Line Tool

For some people, the fact that pandoc is a command-line tool will be a disadvantage. Some graphical user interfaces (GUIs) for pandoc are listed here. I haven’t tried them, so I can’t make any recommendations.

The LaTeX Writer

One of the footnotes in this post discusses my complaints about the way pandoc handles citations. Another problem is the way pandoc writes out tables in LaTeX. Pandoc automatically writes out tables as longtables in LaTeX and sometimes inserts minipages in the middle of tables. For example, the following table is written in pandoc markdown and saved in a file called temp.md:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

Pandoc will convert the document to LaTeX with the following terminal command:

pandoc -r markdown -w latex -o temp.tex temp.md

The above command creates a file called temp.tex with the following contents (although I added some line breaks for formatting):

\begin{longtable}[c]{@{}clrl@{}}
\toprule\addlinespace
\begin{minipage}[b]{0.15\columnwidth}\centering
Centered Header
\end{minipage} & %
\begin{minipage}[b]{0.10\columnwidth}\raggedright
Default Aligned
\end{minipage} & %
\begin{minipage}[b]{0.20\columnwidth}\raggedleft
Right Aligned
\end{minipage} & %
\begin{minipage}[b]{0.31\columnwidth}\raggedright
Left Aligned
\end{minipage}
\\\addlinespace
\midrule\endhead
\begin{minipage}[t]{0.15\columnwidth}\centering
First
\end{minipage} & %
\begin{minipage}[t]{0.10\columnwidth}\raggedright
row
\end{minipage} & %
\begin{minipage}[t]{0.20\columnwidth}\raggedleft
12.0
\end{minipage} & %
\begin{minipage}[t]{0.31\columnwidth}\raggedright
Example of a row that spans multiple lines.
\end{minipage}
\\\addlinespace
\begin{minipage}[t]{0.15\columnwidth}\centering
Second
\end{minipage} & %
\begin{minipage}[t]{0.10\columnwidth}\raggedright
row
\end{minipage} & %
\begin{minipage}[t]{0.20\columnwidth}\raggedleft
5.0
\end{minipage} & %
\begin{minipage}[t]{0.31\columnwidth}\raggedright
Here's another one. Note the blank line between rows.
\end{minipage}
\\\addlinespace
\bottomrule
\addlinespace
\caption{Here's the caption. It, too, may span multiple lines.}
\end{longtable}

Notice that it’s a longtable containing minipages. This looks to me like a bit of madness. My understanding is that longtables are used by default because if they’re not, tables that are longer than a single page will not appear correctly. And I assume that authors complained about this problem in the past, which led to the current setup. However, longtables could at least be implemented more elegantly with the tabu package. In general, it seems that more LaTeX formatting should be left to the LaTeX template (which can be changed easily by authors) than to the pandoc writer (which cannot be changed easily by authors).

Git

Another benefit of writing documents in markdown—or really any plain-text format—is the ease of using version control. Version control systems, such as git, track changes to files, and with websites like GitHub, it’s possible to keep a remote backup of your files and their changes. Version control is especially useful on long projects, such as books or dissertations, where you may want to keep separate chapters in separate files. And if you’re collaborating with tech-savvy people, you can all work on the same set of files and track changes collectively.

I won’t give a tutorial on git here, and unfortunately, there’s a bit of a learning curve. But here are some useful resources:

The Setup

In summary, the setup I recommend is writing documents in either pandoc markdown or R Markdown (depending on whether they contain statistics) and using git to track changes to documents. This setup works very well with long works, such as books and dissertations, where it makes sense to separate chapters into individual files.


Notes:


  1. You need to tell pandoc the name of the BibTeX file when you execute the pandoc command on the command line, and the file needs to be in a folder where pandoc can find it (i.e., in the same folder as the markdown file or in the folder indicated by the --data-dir flag). Again, see the readme for details.

  2. I actually have some gripes with the syntax for pandoc’s in-text citations. It’s convenient for quick documents, but for a formal academic work, pandoc’s citation commands are somewhat lacking, and BibLaTeX’s in-text citation commands are much more powerful. This is a tricky situation because while you can write BibLaTeX macros directly in pandoc markdown, pandoc won’t use them unless it’s converting that document to LaTeX. So if you want to create an MS Word document instead of a PDF, you’ll lose your citations. Personally, my solution was to write a parser in Perl that could find pandoc citations and replace them with BibLaTeX citations, so I could still output LaTeX documents containing BibLaTeX formatting. But that might not be a solution for everyone. If there’s enough interest, I will polish and publish my Perl code for others to use.
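To give a sense of what such a conversion involves, here is a minimal sketch in Python. It is not the Perl parser described in note 2, and it handles only the simplest case: a bracketed group containing nothing but citation keys, as in the example from pandoc's readme. The function name is hypothetical, and the choice of BibLaTeX's \autocite as the target command is an assumption; citations with page locators or prefixes are deliberately left untouched.

```python
import re

# Matches a bracketed group made up only of citation keys, e.g. [@smith04; @doe99].
CITATION_GROUP = re.compile(r"\[(@[\w:.-]+(?:\s*;\s*@[\w:.-]+)*)\]")


def to_biblatex(text):
    """Replace simple pandoc citation groups with BibLaTeX \\autocite commands."""

    def replace(match):
        # Split the group on semicolons and drop the leading @ from each key.
        keys = [key.strip().lstrip("@") for key in match.group(1).split(";")]
        return "\\autocite{" + ",".join(keys) + "}"

    return CITATION_GROUP.sub(replace, text)


print(to_biblatex("Blah blah [@smith04; @doe99]."))
```

A real converter would also need to handle locators, prefixes and suffixes, and author-suppressed citations, which is where most of the work lies.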
