This post describes how Mac users can quickly create PDFs with searchable text out of Google Book chapters (comparable software is also available on PCs). Being able to do so is especially useful for people using the academic workflow described in an earlier post. Here’s what you need:
- Web browser (e.g., Firefox) with the Greasemonkey and FlashGot extensions (or equivalents for a browser other than Firefox)
- PDF printer (e.g., included in Preview)
- OCR engine (e.g., ABBYY included in DevonThink Pro Office)
First, install the extensions in the web browser. I’m not sure what the equivalent of FlashGot is in other browsers, but the equivalents of Greasemonkey for other browsers are:
- Chrome: Tampermonkey
- Safari: SIMBL / GreaseKit (instructions)
Install the Google Book Downloader Greasemonkey script in your web browser: just click the “Install” button on the page with the script. A dialog box with instructions will pop up.
Next, search for the book on Google Books. Let’s pretend I’m interested in The White Man’s Burden by William Easterly. (This is a great book on international development and recommended for anyone interested in the topic.) If you have the Google Book Downloader script installed, you’ll see a button in the left column named “Download this book”.
If you click the “Download this book” button, you’ll see another set of buttons in its place.
You can use the drop-down menus to select the page range that you want to download. I suspect that Google detects when you load too many pages from a single book, so downloading the whole book at once will probably not work. You can enter the page range for a single chapter at a time, though. For The White Man’s Burden, the first chapter runs from “PT13” to “PT48”.
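It helps to know how many pages a range covers, so that you can check the number of downloaded images later. Here is a minimal sketch in Python, assuming that page IDs are a letter prefix followed by a number and that the pages in a chapter are numbered consecutively (I have only seen this in the example above; the function name `expand_page_range` is my own):

```python
from __future__ import annotations

import re


def expand_page_range(first_id: str, last_id: str) -> list[str]:
    """Return every page ID from first_id to last_id, inclusive.

    Assumption: IDs look like "PT13", i.e. letters followed by digits,
    and both ends of the range share the same letter prefix.
    """
    pattern = re.compile(r"([A-Za-z]+)(\d+)")
    first = pattern.fullmatch(first_id)
    last = pattern.fullmatch(last_id)
    if first is None or last is None:
        raise ValueError("page IDs must be letters followed by digits")
    if first.group(1) != last.group(1):
        raise ValueError("both page IDs must share the same prefix")

    start, end = int(first.group(2)), int(last.group(2))
    if start > end:
        raise ValueError("the first page must not come after the last page")

    prefix = first.group(1)
    return [f"{prefix}{number}" for number in range(start, end + 1)]


if __name__ == "__main__":
    chapter_one = expand_page_range("PT13", "PT48")
    # Prints: 36 PT13 PT48
    print(len(chapter_one), chapter_one[0], chapter_one[-1])
```

For the first chapter of The White Man’s Burden, that means you should end up with 36 page images.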
Then, if you click “Get Download Links”, the Greasemonkey script will go to work and create links for each page of the book. An easy way to grab all the image files is to right-click and select “FlashGot All”. FlashGot will pop up with a dialog box asking where to download the image files. After you set the download folder, FlashGot will pull down all the files.
After downloading, all the images and a bunch of other files from Google Books will be on your computer. You can ignore or delete the extra files. If you’re using a Mac, you can use Preview to open and print all the image files into a single PDF. Select all the image files and press ⌘ + O.
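If you would rather not pick out the image files by hand, a short script can separate them from the extras. This is only a sketch: it assumes the page images are PNG or JPEG files (I haven’t checked which formats Google Books actually serves) and recognizes them by the first few bytes of each file instead of trusting the file name. The function names are my own.

```python
from __future__ import annotations

import sys
from pathlib import Path

# Leading bytes ("magic numbers") of the two formats assumed here.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def detect_image_type(path: Path) -> str | None:
    """Return "png" or "jpeg" if the file starts with a known signature."""
    with path.open("rb") as handle:
        header = handle.read(8)
    for signature, name in IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def split_downloads(folder: Path) -> tuple[list[Path], list[Path]]:
    """Split the files in a folder into (images, extras), sorted by name."""
    images: list[Path] = []
    extras: list[Path] = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        if detect_image_type(path) is not None:
            images.append(path)
        else:
            extras.append(path)
    return images, extras


if __name__ == "__main__":
    # Pass the download folder as the first argument; defaults to the
    # current folder.
    folder = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    images, extras = split_downloads(folder)
    print(f"{len(images)} image files, {len(extras)} other files")
```

The script only reports what it finds; it doesn’t delete anything, so you can look over the list of extras before removing them yourself.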
Preview will open with all the image files. It would be a good idea to check through the image files to make sure that there are no duplicates and that all the pages are in order.
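Checking by eye works for a short chapter, but a script can do a first pass for you. The sketch below makes two assumptions of my own: that duplicate pages are byte-for-byte identical files, and that the file names contain the page numbers, so that sorting by those numbers gives the right page order. Neither is guaranteed, so treat the result as a hint, not a replacement for looking through the pages.

```python
from __future__ import annotations

import hashlib
import re
from collections import defaultdict
from pathlib import Path


def natural_key(name: str) -> list:
    """Sort key that orders "page2" before "page10"."""
    parts = re.split(r"(\d+)", name)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


def find_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Group files whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path)
    return [group for group in by_digest.values() if len(group) > 1]


def pages_in_order(paths: list[Path]) -> list[Path]:
    """Return the files sorted by the numbers in their names."""
    return sorted(paths, key=lambda path: natural_key(path.name))


if __name__ == "__main__":
    # Hypothetical usage: check every file in the current folder.
    files = [path for path in Path(".").iterdir() if path.is_file()]
    for group in find_duplicates(files):
        print("Identical files:", ", ".join(path.name for path in group))
    for path in pages_in_order(files):
        print(path.name)
```

A plain alphabetical sort would put “page10” before “page2”, which is why the sort key compares the numbers in the file names as numbers.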
Then, you can print all the images into one PDF. You can select all the image files by pressing ⌘ + A and then print them by pressing ⌥ + ⌘ + P. If you want to use menus, you can click “Edit”, and then “Select All”; next, click the “File” menu, and then “Print Selected Images ...”.
Preview will pop up with a dialog box. You can select the button in the lower-left corner labelled “PDF” and then select “Save as PDF ...” from the drop-down menu. Preview will produce a nice PDF of all the pages from the Google Book, but at this point, the pages are just images. You can’t search the text in the PDF, select text in it, or copy and paste text from it. In other words, as a collection of images, the PDF is not much use to an academic researcher.
Finally, use an OCR engine on the PDF. If you have DevonThink Pro Office 2.0 as recommended in an earlier post, you can use its OCR engine, called ABBYY. You could also get ABBYY separately, although in my opinion you may as well pay another $50 and get DevonThink, too (see the note at the end of this post).
To use OCR on the PDF with DevonThink, click the “File” menu, then “Import”, and then “Images (with OCR) ...”. Then, ABBYY will run OCR on the PDF. This step may take a while, so be patient.
When OCR finishes, select the resulting file in DevonThink. Click the “File” menu, then “Export”, and then “Files and Folders ...”. Or you can use the keyboard shortcut: ⌥ + ⌘ + E. DevonThink will pop up a dialog box asking where you want to put the file.
After OCR, you can search the text in the PDF, select text, and use copy and paste to pull text out of the PDF. If you’re using Sente as recommended in an earlier post, you can import it like any other journal article and annotate it.
Notes:
There are also Perl modules that can OCR images and PDFs. I have experimented with the PDF::OCR2 module and successfully extracted the text into a separate file, but I haven’t taken it further to see if it’s possible to run OCR on a PDF and keep the results in a PDF with the same appearance. If this can be done, it would be possible to OCR PDFs for free instead of paying upwards of $100 for OCR software.
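One free route that might answer this question is the open-source Tesseract engine, which is not one of the tools used in this post. Its command-line program can write a PDF that keeps the page image and adds a searchable text layer. The Python sketch below is only a starting point and assumes that the `tesseract` command is installed and on your path; the function names are my own, and I haven’t compared the OCR quality with ABBYY.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def tesseract_pdf_command(image: Path, output_base: Path,
                          language: str = "eng") -> list[str]:
    """Build the command: tesseract IMAGE OUTPUT_BASE -l LANGUAGE pdf.

    Tesseract adds the ".pdf" extension to OUTPUT_BASE itself.
    """
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


def ocr_page_to_pdf(image: Path, output_base: Path,
                    language: str = "eng") -> Path:
    """Run Tesseract on one page image and return the path of the new PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract command was not found on the path")
    subprocess.run(tesseract_pdf_command(image, output_base, language),
                   check=True)
    return output_base.with_name(output_base.name + ".pdf")


if __name__ == "__main__":
    # Hypothetical file name, shown without running anything.
    print(" ".join(tesseract_pdf_command(Path("page1.png"), Path("page1"))))
```

This produces one PDF per page image, so the pages would still have to be combined into a single file afterwards.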