Monday, October 20, 2014

Academic Workflow for the Ages (Part 2)

Over two years ago, I wrote a post about an academic workflow, mostly for literature review. I did not focus on writing, although I did recommend that people use Scrivener to draft their documents (dissertation chapters, journal articles, and so forth). In this post, I discuss an alternate way to draft documents, which I think is much better.

Pandoc

Natural Writing

I recommend that people write their documents in pandoc markdown. Pandoc’s author describes it as “your swiss-army knife” for documents because it can convert between many document formats (e.g., HTML to MS Word).

Markdown is a syntax for plain-text documents that aims to be readable in plain text but also have enough structure that it can be parsed and translated into other formats. (Markdown’s original authors aimed to generate HTML documents from markdown.) For example, the following is a list in a markdown document:

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

Even in a plain-text file, it’s clear that this is a list, so it’s easy to read this document in plain text and understand the intended formatting. Additionally, because the syntax is very simple, it’s easy to just write your thoughts, arguments, and whatever else you need to without pausing to deal with formatting. If I wanted to make a list in an e-mail, I would write it exactly as written above. At least for lists, there’s nothing new to learn, and in general, writing in markdown is very natural. Markdown achieves a tremendous separation between content and formatting. Pandoc converts the list above into the HTML shown below.

<p>Here is a list written in markdown:</p>
<ul>
<li>Here is an item in the list.</li>
<li>Here is another item in the list.</li>
<li>Here is the final item in the list.</li>
</ul> 

When writing in pandoc markdown, you can also use HTML comments to make notes to yourself and keep them right next to the material they refer to. For example, you could write an outline of a journal article you’re drafting and enclose it in an HTML comment, so it’s excluded from the output (for most formats) but still at the top of your own markdown document.

<!-- This is an HTML comment. --> 

<!-- 
Comments can also 
span 
several lines. 
--> 

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

And because pandoc markdown is stored in plain-text files, you can use a variety of text editors to write them. Personally, I prefer a minimalist writing environment because there are fewer distractions. MS Word, for example, has so many formatting tools immediately available that it’s tempting to write something and then immediately fix up how it looks. With a plain-text editor, there are virtually no distractions. My favorite at the moment is TextWrangler on Mac with the font set to display 24pt Helvetica. It’s free and simple, and it does what I need it to.

More Features

Pandoc expands traditional markdown with new features, such as different kinds of lists, different kinds of tables, in-text citations and reference lists, footnotes, and metadata. For example, you can make a table with the following pandoc markdown:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

If you want to cite a source, you can write the following:

Blah blah [@smith04; @doe99].

smith04 and doe99 are BibTeX keys in a BibTeX database (see note 1). Pandoc will replace them with proper in-text citations and generate a reference list at the end of the document. You can even use Zotero’s citation styles, so there are a lot of options for automatic formatting, including Chicago, APA, MLA, and formats for many academic journals (see note 2).

For all the features and how to use them, spend some time on pandoc’s readme page, and try them out for yourself. Note: The table and citation examples in this section were copied verbatim from pandoc’s readme.

More Formats

Pandoc can convert markdown into formats other than HTML, including LaTeX. Pandoc converts the markdown list above into the following LaTeX:

Here is a list written in markdown:

\begin{itemize}
\itemsep1pt\parskip0pt\parsep0pt
\item
  Here is an item in the list.
\item
  Here is another item in the list.
\item
  Here is the final item in the list.
\end{itemize}

And as I mentioned above, pandoc can also convert markdown to an MS Word document (*.docx file). Do you see where this is going? If you write your documents in pandoc markdown, you can easily convert them to other formats as needed. Lists, tables, and citations will all appear correctly in a variety of formats. If you’re a PhD student and your advisors want to review your work in MS Word (usually because of its handy track changes feature and comments feature), you can just take the current version of your markdown file, convert it to MS Word, and send it out. And when it’s time to make a final version, you can still convert your markdown to LaTeX and apply whatever template you want to (your school’s template, a journal’s article template, or your own custom template). You don’t have to change your source file at all; just apply a new template to get the formatting you need.
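One way to apply your own formatting, sketched here under a few assumptions, is to wrap pandoc's LaTeX output in a preamble you control. Without the -s (--standalone) flag, pandoc writes only a document fragment, like the itemize example above, and a wrapper file can pull that fragment in with \input. The file name list.tex is hypothetical, and the \tightlist fallback is only a guard for newer pandoc versions that emit that macro in lists; it is harmless if the fragment never uses it.

```latex
% wrapper.tex: a hand-written preamble around a pandoc-generated fragment.
% Assumes the fragment was saved as list.tex (hypothetical name), e.g. with
%   pandoc -r markdown -w latex -o list.tex temp.md
\documentclass[12pt]{article}

% Newer pandoc versions emit \tightlist inside lists; define it here
% in case the fragment uses it.
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}

\begin{document}
\input{list.tex}
\end{document}
```

Swapping the preamble for a school's or a journal's changes the formatting while the markdown source and the generated fragment stay untouched.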

A Simple Example

Returning to the list example above, if I have a markdown file named temp.md with the following contents:

Here is a list written in markdown: 

* Here is an item in the list.
* Here is another item in the list. 
* Here is the final item in the list. 

I can open a terminal and convert it to HTML with the following command:

pandoc -r markdown -w html -o temp.html temp.md

Pandoc will create an HTML file called temp.html with the following contents:

<p>Here is a list written in markdown:</p>
<ul>
<li>Here is an item in the list.</li>
<li>Here is another item in the list.</li>
<li>Here is the final item in the list.</li>
</ul>

Reproducible Statistics

But wait. There’s more!

If your work involves statistics and you’re familiar with R, you can write something called R Markdown. R Markdown files can contain both markdown and R code (and R code that writes markdown). So you can write an R Markdown file with your entire statistical analysis that writes out fresh statistics, tables, and figures every time you process the file. If you find an error in your work or you want to update the way a figure looks, just change the code in the R Markdown file and reprocess it. Everything else stays the same, and your changes are included.

I won’t get into the details of setting up an R Markdown file here, but basically, you create a file with the *.rmd extension and write in your markdown and R code. Then, you “make” the file by running R on it. There are different ways to do this. You could call R from the command line with Rscript, for example. In any case, the result will be a pandoc markdown file (with a *.md extension), which you can then convert to other formats (e.g., MS Word or LaTeX) as you would any other pandoc markdown file. In addition, you can easily share your R code with anyone who wants to check your work.

Disadvantages

A Command-Line Tool

For some people, the fact that pandoc is a command-line tool will be a disadvantage. Some graphical user interfaces (GUIs) for pandoc are listed here. I haven’t tried them, so I can’t make any recommendations.

The LaTeX Writer

One of the footnotes in this post discusses my complaints about the way pandoc handles citations. Another problem is the way pandoc writes out tables in LaTeX. Pandoc automatically writes out tables as longtables in LaTeX and sometimes inserts minipages in the middle of tables. For example, the following table is written in pandoc markdown and saved in a file called temp.md:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

Pandoc will convert the document to LaTeX with the following terminal command:

pandoc -r markdown -w latex -o temp.tex temp.md

The above command creates a file called temp.tex with the following contents (although I added some line breaks for formatting):

\begin{longtable}[c]{@{}clrl@{}}
\toprule\addlinespace
\begin{minipage}[b]{0.15\columnwidth}\centering
Centered Header
\end{minipage} & %
\begin{minipage}[b]{0.10\columnwidth}\raggedright
Default Aligned
\end{minipage} & %
\begin{minipage}[b]{0.20\columnwidth}\raggedleft
Right Aligned
\end{minipage} & %
\begin{minipage}[b]{0.31\columnwidth}\raggedright
Left Aligned
\end{minipage}
\\\addlinespace
\midrule\endhead
\begin{minipage}[t]{0.15\columnwidth}\centering
First
\end{minipage} & %
\begin{minipage}[t]{0.10\columnwidth}\raggedright
row
\end{minipage} & %
\begin{minipage}[t]{0.20\columnwidth}\raggedleft
12.0
\end{minipage} & %
\begin{minipage}[t]{0.31\columnwidth}\raggedright
Example of a row that spans multiple lines.
\end{minipage}
\\\addlinespace
\begin{minipage}[t]{0.15\columnwidth}\centering
Second
\end{minipage} & %
\begin{minipage}[t]{0.10\columnwidth}\raggedright
row
\end{minipage} & %
\begin{minipage}[t]{0.20\columnwidth}\raggedleft
5.0
\end{minipage} & %
\begin{minipage}[t]{0.31\columnwidth}\raggedright
Here's another one. Note the blank line between rows.
\end{minipage}
\\\addlinespace
\bottomrule
\addlinespace
\caption{Here's the caption. It, too, may span multiple lines.}
\end{longtable}

Notice that it’s a longtable containing minipages. This looks to me like a bit of madness. My understanding is that longtables are used by default because if they’re not, tables that are longer than a single page will not appear correctly. And I assume that authors complained about this problem in the past, leading to the current setup. However, longtables could at least be implemented more elegantly with the tabu package. In general, it seems that more LaTeX formatting should be left to the LaTeX template (which can be changed easily by authors) than to the pandoc writer (which cannot be changed easily by authors).
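For comparison, here is a hand-written sketch of the same table that uses paragraph columns instead of minipages. It is not what pandoc produces, and it does not use tabu; it assumes the array, booktabs, and longtable packages, and the column widths are illustrative guesses rather than values taken from pandoc.

```latex
\documentclass{article}
\usepackage{array}     % >{...} column modifiers and \arraybackslash
\usepackage{booktabs}  % \toprule, \midrule, \bottomrule, \addlinespace
\usepackage{longtable} % tables that can break across pages

\begin{document}

\begin{longtable}{@{}
    >{\centering\arraybackslash}p{0.15\columnwidth}
    >{\raggedright\arraybackslash}p{0.15\columnwidth}
    >{\raggedleft\arraybackslash}p{0.15\columnwidth}
    >{\raggedright\arraybackslash}p{0.35\columnwidth}
  @{}}
\caption{Here's the caption. It, too, may span multiple lines.}\\
\toprule
Centered Header & Default Aligned & Right Aligned & Left Aligned \\
\midrule
\endfirsthead
% Header repeated on continuation pages, without the caption
\toprule
Centered Header & Default Aligned & Right Aligned & Left Aligned \\
\midrule
\endhead
First  & row & 12.0 & Example of a row that spans multiple lines. \\
\addlinespace
Second & row & 5.0  & Here's another one. Note the blank line between rows. \\
\bottomrule
\end{longtable}

\end{document}
```

Because the alignment and width live in the column specification, each cell is just its text, which is closer to what an author would write by hand.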

Git

Another benefit of writing documents in markdown—or really any plain-text format—is the ease of using version control. Version control systems, such as git, track changes to files, and with websites like GitHub, it’s possible to keep a remote backup of your files and their changes. Version control is especially useful on long projects, such as books or dissertations, where you may want to keep separate chapters in separate files. And if you’re collaborating with tech-savvy people, you can all work on the same set of files and track changes collectively.

I won’t give a tutorial on git here, and unfortunately, there’s a bit of a learning curve. But here are some useful resources:

The Setup

In summary, the setup I recommend is writing documents in either pandoc markdown or R markdown (depending on whether they contain statistics) and using git to track changes to documents. This setup works very well with long works, such as books and dissertations, where it makes sense to separate chapters into individual files.


Notes:


  1. You need to tell pandoc the name of the BibTeX file when you execute the pandoc command on the command line, and the file needs to be in a folder where pandoc can find it (i.e., in the same folder as the markdown file or in the folder indicated by the --data-dir flag). Again, see the readme for details.

  2. I actually have some gripes with the syntax for pandoc’s in-text citations. It’s convenient for quick documents, but for a formal academic work, pandoc’s citation commands are somewhat lacking, and BibLaTeX’s in-text citation commands are much more powerful. This is a tricky situation because while you can write BibLaTeX macros directly in pandoc markdown, pandoc won’t use them unless it’s converting that document to LaTeX. So if you want to create an MS Word document instead of a PDF, you’ll lose your citations. Personally, my solution was to write a parser in perl that could find pandoc citations and replace them with BibLaTeX citations, so I could still output LaTeX documents containing BibLaTeX formatting. But that might not be a solution for everyone. If there’s enough interest, I will polish and publish my perl code for others to use.
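As a rough illustration of the BibLaTeX commands mentioned in note 2, the sketch below shows a few common citation forms. It assumes a hypothetical refs.bib that defines the keys smith04 and doe99, the authoryear style, and the biber backend; the page range is a placeholder, not a real reference.

```latex
\documentclass{article}
% Assumes a hypothetical refs.bib that defines the keys smith04 and doe99.
\usepackage[style=authoryear, backend=biber]{biblatex}
\addbibresource{refs.bib}

\begin{document}

% Parenthetical citation of two sources, like [@smith04; @doe99] in pandoc
Blah blah \parencite{smith04,doe99}.

% Author as part of the sentence, year in parentheses
\textcite{smith04} argues blah blah.

% Prenote and (placeholder) page range around a single citation
Blah blah \parencite[see][45--50]{doe99}.

% Author name or year on its own
\citeauthor{doe99} revisited this in \citeyear{doe99}.

\printbibliography

\end{document}
```

Compiling this takes a LaTeX run, a biber run, and another LaTeX run.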

Monday, October 13, 2014

Arizona State University Dissertation/Thesis Template in LaTeX

Arizona State University (ASU) is one of the largest universities in the US. It must have tens of thousands of graduate students in attendance at any given time, many of whom need to write and submit a thesis or dissertation. All theses and dissertations need to follow the formatting guidelines of ASU’s Graduate College (latest revision [July 2013] available here).

So I was surprised to see that ASU’s current LaTeX template is fairly basic. By writing this, I do not mean to criticize its author, who (as I understand the situation) was a graduate student who created a template that worked reasonably well and decided to make it available to others. ASU then adopted this student’s work as the template it would officially distribute to students. But as far as I know, the student who created the template did not invest a great deal of time or effort into it, and as a result, the template is rather shallow. It has the correct margins, and the table of contents will come out more or less correct, but what if you want to include appendices, for example? Or use biblatex for citations instead of natbib?

I did the formatting for my PhD dissertation on my own and created a new LaTeX template, which is available on GitHub here. (If you’re not familiar with git, you can grab everything simply by clicking the “Download ZIP” button to get the template and supporting files in a ZIP archive. Or just click here.)

Sample title page of dissertation template

The biggest (and, in my opinion, the most beneficial) difference between the official template and this new one is that the new template uses the memoir document class. The memoir document class is designed for formatting book-length works. For example, it has commands for indicating divisions between front matter, main matter, and back matter and adjusts formatting accordingly. So it’s a natural choice for formatting theses and dissertations, which are book-length works. memoir is also a very large document class that natively supports many features without having to load other packages. It can natively format footnotes and endnotes, for example, and the table of contents can be highly customized using only memoir commands. The template I created definitely loads other packages, but I would guess that memoir is probably the most complete document class out there. And finally, memoir has excellent documentation, which is currently over 600 pages long. If there’s some confusing code in the template I created or if someone wants to add a new feature to their own thesis/dissertation, there’s a better chance that the documentation for memoir will provide the answer than for other document classes.
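To give a sense of the division commands, here is a bare-bones memoir skeleton. It is only a sketch and not the template itself; the class options and chapter titles are placeholders.

```latex
\documentclass[12pt,oneside]{memoir}

\begin{document}

\frontmatter            % roman page numbers; chapters are unnumbered
\tableofcontents
\chapter{Preface}
Preface text.

\mainmatter             % arabic page numbers; chapters are numbered
\chapter{Introduction}
Body text.

\appendix               % later chapters are lettered as appendices
\chapter{Survey Instrument}
Appendix text.

\backmatter             % chapters are unnumbered again
\chapter{Biographical Sketch}
Sketch text.

\end{document}
```

Each division command changes page and chapter numbering, so the same \chapter command serves the whole document.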

Some of the other improvements over the official template include the following:

  • Includes all required and optional sections, including a copyright page, dedication, acknowledgements, preface, endnotes, and biographical sketch.
  • Correct formatting for main matter (chapters) and back matter (appendices), which makes it easy to organize your entire document.
  • For the typesetting engine, works with either pdftex or xetex. (xetex makes it easy to use any of the approved fonts.)
  • For references, works with natbib and biblatex. (biblatex makes it easy to use Chicago, MLA, and APA style references.)
  • Better separation of content and formatting. For example, write your table captions however you want and they will appear correctly in the list of tables. This arrangement makes it much easier to produce another, better-looking version of your dissertation/thesis to share with colleagues.
  • Internal document references work. For example, clicking on an in-text citation jumps down to that citation in the references list.
  • Bookmarks work, so there is a navigation side menu in the PDF that contains the major document elements (e.g., table of contents and each chapter heading), so the PDF is easier to navigate.
  • Writes PDF metadata (including the title, name, and keywords) automatically.
  • Uses the memoir document class, so it is easier to change formatting and create a book-length work in general.

There were challenges to getting all these features working together. Strangely enough, one of the most difficult challenges was getting chapter-level and part-level headings to appear uppercased in the table of contents. It turns out that the typical commands for uppercasing text in the table of contents conflict with the hyperref package. (I’ve written a separate post on my solution here.) But overall, I think I’ve found reasonably elegant solutions for implementing the formatting requirements in ASU’s style guide.

I have intentionally not created a style file, yet. In my experience, troubleshooting a document with a custom style file leads to headaches because it requires hunting through the style file and the preamble to figure out where problems are. I think it’s better to have all the potentially problematic code in one lengthy preamble. If there is enough interest in either a style file or packaging everything in a class, I will create them, but at least initially, I am just making a simple template file available to everyone.

Formatting a dissertation or thesis is often one of the less pleasant parts of the graduate student experience. It’s the last thing students need to do before they’re finally done with an often long and difficult graduate school experience, and formatting is usually tedious and time-consuming. Hopefully, this template can take some of the pain out of that experience for ASU graduate students.

Again, the template is available on GitHub here. You can grab everything simply by clicking the “Download ZIP” button to get the template and supporting files in a ZIP archive.

Monday, October 6, 2014

Uppercasing in a "memoir" Table of Contents with "hyperref"

memoir is one of my favorite document classes in LaTeX, but uppercased headings in a memoir table of contents conflict with the hyperref package. Uppercasing headings in the table of contents in a memoir document can be achieved in at least two basic ways. The first uses the \cftKfont font commands to send formatting instructions to the table of contents. The K in \cftKfont stands for the heading level (see note 1), so to modify the chapter headings, for example, you would use the \cftchapterfont macro as shown below:

% Uppercase chapter headings in TOC
\renewcommand*{\cftchapterfont}%     
  {\normalfont\MakeUppercase}

Because \MakeUppercase is sort of brutal in the way that it uppercases, it can cause errors. The memoir documentation recommends using another macro \MakeTextUppercase, which would be used in the following way to uppercase chapter headings in the table of contents:

% Uppercase chapter headings in TOC
\renewcommand*{\cftchapterfont}%     
  {\normalfont\MakeTextUppercase}

These methods are simple, but—as noted above—they both conflict with the hyperref package.

This conflict is aggravating because hyperref makes PDFs better in many ways. For example, hyperref will automatically make internal document links active. So if you have citations throughout your work, those in-text citations will turn into links that jump down to their entries in the list of citations at the end of the document. Similarly, when the document refers to a table or figure (e.g., “see table 2 for more information”), the number becomes an internal document link that will take the reader to the table or figure. And entries in the table of contents also become internal document links. So there are a lot of benefits to simply using the following line in a preamble:

\usepackage{hyperref}

With specific commands, hyperref can do a lot more, such as write out PDF metadata and use another package, called bookmark, to create a PDF navigation bar.
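As a minimal sketch of those extra features, the preamble below sets PDF metadata with \hypersetup and loads the bookmark package. The title, author, and keywords are placeholders, and this is not the configuration used in the dissertation template.

```latex
\documentclass{memoir}

\usepackage{hyperref}
\usepackage{bookmark}   % loaded after hyperref; manages the PDF bookmarks

% Placeholder metadata written into the PDF's document properties
\hypersetup{
  pdftitle    = {A Placeholder Dissertation Title},
  pdfauthor   = {A. Student},
  pdfkeywords = {placeholder, keywords},
}

\begin{document}

\tableofcontents

\chapter{Introduction}\label{ch:intro}
See chapter~\ref{ch:intro}. % the number becomes an internal link

\end{document}
```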

The conflict between these two uppercasing commands and hyperref is a known problem, so memoir provides another method, \settocpreprocessor, for sending uppercase formatting to headings in the table of contents (see p. 158 of the memoir documentation). The code below shows how to use this macro to uppercase part and chapter headings in the table of contents:

\makeatletter
\settocpreprocessor{part}{%
    \let\tempf@rtoc\f@rtoc%
    \def\f@rtoc{%
      \texorpdfstring{\MakeTextUppercase{%
        \tempf@rtoc}%
      }{\tempf@rtoc}%
    }% 
}
\settocpreprocessor{chapter}{%
    \let\tempf@rtoc\f@rtoc% 
    \def\f@rtoc{%
      \texorpdfstring{\MakeTextUppercase{%
        \tempf@rtoc}%
      }{\tempf@rtoc}%
    }% 
}
\makeatother

This code will uppercase the table-of-contents entries created by \part and \chapter, but it does not uppercase other part-level or chapter-level entries in the table of contents. The difference is important because entries such as the “List of tables” won’t be uppercased by these commands. Doesn’t it make more sense for entries with the same level to have the same formatting? Users can get around this problem by manually uppercasing table-of-contents entries, such as for the name of the list of tables:

% Manually uppercase name of the list of tables
\renewcommand{\listtablename}
  {LIST OF TABLES}

But users would need to go back and manually change this and other commands if they want to produce a new version of their document without uppercased part-level and chapter-level headings. This approach violates the supposed strength of LaTeX—separating content from formatting. Instead, users need a way to uppercase part-level headings and chapter-level headings in memoir documents.

Fortunately, I have discovered a way to produce uppercase part-level and chapter-level headings by patching certain macros in the memoir document class. In the preamble, enter the following:

% Load 'etoolbox' to get the '\patchcmd' macro
\usepackage{etoolbox}

% Patch the command that writes part-level entries
%   to the table of contents, so they are in 
%   'normalfont' and uppercase
\makeatletter% 
\patchcmd{\l@part}%                     
    {\cftpartfont {#1}}%                
    {\normalfont \texorpdfstring{%      
      \uppercase{#1}}{{#1}} }%
    {\typeout{Success: Patch %
      'l@part' to uppercase %
      part-level headings in the %
      table of contents.}}%
    {\typeout{Fail: Patch %
      'l@part' to uppercase % 
      part-level headings in the %
      table of contents.}}%
\makeatother%

% Patch the command that writes chapter-level 
%   entries to the table of contents, so they are 
%   in 'normalfont' and uppercase
\makeatletter% 
\patchcmd{\l@chapapp}%                  
    {\cftchapterfont {#1}}%             
    {\normalfont \texorpdfstring{%      
      \uppercase{#1}}{{#1}} }%
    {\typeout{Success: Patch %
      'l@chapapp' to uppercase %
      chapter-level headings in the %
      table of contents.}}%
    {\typeout{Fail: Patch %
      'l@chapapp' to uppercase %
      chapter-level headings in the %
      table of contents.}}%
\makeatother%

% Load 'hyperref' to get the \texorpdfstring
%   command
\usepackage{hyperref}

This solution uses a similar approach to the one provided in the memoir documentation with the \settocpreprocessor macro in that both use the \texorpdfstring macro from the hyperref package. The \texorpdfstring macro lets users send different text depending on whether the text will be typeset by LaTeX or written into the PDF as a plain string. Chapter headings are an example of content that is both typeset by LaTeX (in the table of contents) and written into the PDF as plain strings (in the PDF bookmarks), so \texorpdfstring is a great way to avoid these kinds of errors.
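For readers who have not used \texorpdfstring before, here is a small standalone sketch that is separate from the patch above. The chapter title is a made-up example; the first argument is what LaTeX typesets, and the second is the plain text that goes into the PDF bookmark.

```latex
\documentclass{memoir}
\usepackage{hyperref}

\begin{document}

\tableofcontents

% First argument: typeset by LaTeX in the heading and table of contents.
% Second argument: plain text written to the PDF bookmark.
\chapter{Estimating \texorpdfstring{$\beta$}{beta} Coefficients}
Text.

\end{document}
```

Math mode cannot appear in a bookmark, so the second argument supplies a plain-text stand-in, which is the same idea the patch uses for uppercased headings.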


Notes:


  1. See table 9.3 on p. 150 of the memoir documentation for the other values of K.