PDF Data Conversion Tools

Have you ever found useful data in a PDF or text-based document and wondered whether there was a simple way to transfer it into a spreadsheet format for statistical analysis? This question was recently posed to the staff here at DISC, and we were able to locate three software tools that can be used for this purpose. What follows is a brief review of each tool, along with a comparison of which may be most suitable for particular user needs. (Please see Table 1 at the bottom for a quick glance at the core features of these three tools.)

ABBYY FineReader 14 (Standard and Corporate versions)

The standard version of ABBYY FineReader 14 allows users to edit a PDF, to make comments and collaborate with others within the PDF editor, and to convert PDFs, other documents, and scans into a variety of formats (e.g. Word, Excel, PowerPoint). The conversion feature can also combine multiple files into one converted document. The PDF editor allows users to add images and text, to draw, to redact data, to add pages, and much more. ABBYY FineReader 14 also includes an OCR (Optical Character Recognition) editor, which allows users to customize and verify recognized text, to recognize unusual characters and fonts, and to select specific areas of a document to recognize, among other things. Within this editor, users can compare an original document (e.g. the PDF version of a data table) with the OCR'd version, correct any text that was not recognized properly, and then save the corrected document in the desired format (e.g. an Excel spreadsheet).

The corporate version of ABBYY FineReader 14 contains these features along with two additional capabilities: automated conversion of documents and a side-by-side comparison of documents for the purpose of detecting differences. The comparison feature does not offer editing capabilities, however, and seems to be more useful for comparing differences in content than differences in structure.

ABBYY offers versions of FineReader 14 for both Windows and Mac, as well as many other related software products. Pricing begins at $200 for the standard version and $399 for the corporate version; licensing is perpetual and upgrades for registered users are available. A 30-day free trial option is also offered.

PDFTables

PDFTables is a proprietary software tool that allows users to upload PDF files with tabular data, to preview them on a web page, and then to convert and download the preview in Excel, CSV, or XML format. Both cloud-based and on-premises service models are offered, and a free trial option is available so that users can convert PDF files immediately via the website (https://pdftables.com/). Pricing is based on the number of pages converted. The free trial option includes 25 pages, with another 50 made available upon completion of a free registration. PDFTables also includes an API (application programming interface), which allows users to automate PDF data extraction.
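To give a sense of what automating extraction through the API might look like, here is a minimal Python sketch using only the standard library. The endpoint, the `key` and `format` query parameters, the format names, and the `f` upload field are our assumptions based on PDFTables' public documentation, so please check the current documentation before relying on them. The function names (`build_conversion_url`, `build_multipart_body`, `convert_pdf`) are hypothetical and not part of any PDFTables library.

```python
import uuid
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: endpoint and format names as described in PDFTables' public docs.
API_ENDPOINT = "https://pdftables.com/api"
SUPPORTED_FORMATS = {"csv", "xlsx-single", "xlsx-multiple", "xml", "html"}


def build_conversion_url(api_key, output_format="csv"):
    """Return the request URL for one conversion (hypothetical helper)."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    query = urlencode({"key": api_key, "format": output_format})
    return f"{API_ENDPOINT}?{query}"


def build_multipart_body(filename, payload):
    """Wrap the PDF bytes in a multipart/form-data body.

    Assumption: the upload field is named "f".
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="f"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return boundary, head + payload + tail


def convert_pdf(pdf_path, out_path, api_key, output_format="csv"):
    """Upload one PDF and save the converted result (uses page credits)."""
    pdf = Path(pdf_path)
    boundary, body = build_multipart_body(pdf.name, pdf.read_bytes())
    request = Request(
        build_conversion_url(api_key, output_format),
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
```

With a valid API key, a call such as `convert_pdf("table.pdf", "table.csv", "YOUR_API_KEY")` would send the file and write the response to disk. The file names and key here are placeholders, and each call counts against your page allowance.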

Tabula

Tabula is a free, open-source software tool that was created specifically for extracting a data table from a PDF file and converting it into a spreadsheet-compatible format (e.g. CSV, TSV, JSON). The application can be downloaded from https://tabula.technology/, and the interface is accessible via a browser tab that opens each time the application is run. It contains a PDF viewer that displays an imported PDF file, from which the data can be selected, previewed, and exported into the desired file format. Tabula is compatible with Windows, Mac, and Linux. It is designed only for text-based PDF files, not scanned documents or other document types. The application interface displays a history of files that the user has previously imported, and there is also an option to save custom selections as a template for future use.

The following examples display the varying capabilities of these three conversion tools:

PDF file with data table: Page11852 WhigOCRed

An example of the PDF file (which contains tabular data and has undergone optical character recognition using another program) converted into Excel spreadsheet format using ABBYY FineReader 14: ABBYY-Page11852 WhigOCRed

An example of the same file converted using Tabula: Tabula-Page11852 WhigOCRed

And, finally, using PDFTables: PDFtables- Page11852 WhigOCRed

When comparing these examples, it is clear that the levels of precision and clarity vary depending on which tool is used, with ABBYY FineReader 14 being the most accurate and Tabula the least. More manual editing will be required within a spreadsheet after conversion when using PDFTables and Tabula. The PDF viewer within Tabula is also not nearly as powerful as the one within ABBYY FineReader 14 when the same document is opened in both programs.

However, we did note that Tabula achieved greater detail and precision when smaller portions of the data table within the PDF were selected for conversion at a given time. It was also somewhat beneficial to perform optical character recognition on the PDF file prior to conversion. Tabula and PDFTables do not contain this capability, so you will need to use another program first to do this. In addition, neither tool offers the side-by-side document comparison options that are found in ABBYY FineReader 14, and PDFTables is only able to convert one page of a file at a time.

Overall, it seems that Tabula may be more appropriate when converting PDF-based data that is displayed in a simple format, when converting multiple files into the same basic format, when cost may be an issue, and/or when one needs to convert data into spreadsheet format only infrequently. PDFTables may also be useful when the volume and complexity of the tabular data are not an issue. On the other hand, ABBYY FineReader 14 seems to be the better tool if you are under time constraints in your work, need to convert a great deal of tabular data, and/or need to convert data that is presented in a more complex format.
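Some of the manual editing mentioned above can be scripted once a table has been exported as CSV. The following is a minimal sketch, assuming the typical debris is stray whitespace, blank rows, and columns that are empty in every row. It does not reflect how any of the three tools work internally, the function names are hypothetical, and the sample values in the usage note are made up for illustration.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace, drop blank rows, and drop columns that are empty in every row."""
    stripped = [[cell.strip() for cell in row] for row in rows]
    non_empty = [row for row in stripped if any(row)]
    if not non_empty:
        return []
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in non_empty)
    padded = [row + [""] * (width - len(row)) for row in non_empty]
    # Keep only the columns that contain at least one value.
    keep = [i for i in range(width) if any(row[i] for row in padded)]
    return [[row[i] for i in keep] for row in padded]


def clean_csv_text(text):
    """Parse CSV text, clean it, and return the cleaned CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(clean_rows(rows))
    return out.getvalue()
```

For example, `clean_csv_text(" Year ,,Votes \n,,\n1852 ,, 1200\n")` returns `"Year,Votes\n1852,1200\n"`. Misread characters and merged cells will still need to be checked by eye against the original PDF.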

Table 1.

Software (Developed By)
  • ABBYY FineReader 14 (ABBYY)
  • PDFTables (The Sensible Code Company)
  • Tabula (journalists Manuel Aristarán, Mike Tigas, Jeremy B. Merrill, and Jason Das)

Capabilities

ABBYY FineReader 14:
  • Converts PDFs, scans, and other documents into a variety of formats (e.g. Excel, PowerPoint).
  • Combines multiple files into one converted document.
  • Contains both PDF and OCR (Optical Character Recognition) editors.
  • Corporate version also includes automated document conversion capabilities and a side-by-side document comparison feature.

PDFTables:
  • Windows, Mac, and Linux compatible.
  • Converts PDFs into Excel, CSV, or XML format (CSV and XML formats are offered through the on-premises service model).
  • Has an API which allows users to automate data extraction.
  • Offers both cloud-based and on-premises service models.

Tabula:
  • Converts data tables located within a PDF file into spreadsheet-compatible format (e.g. CSV, TSV, JSON).
  • Allows for selective and/or partial conversion of data tables.
  • Windows, Mac, and Linux compatible.
  • Keeps a file history of previously imported PDFs.
  • Offers an option to save custom selections as templates for future use.

Limitations

ABBYY FineReader 14:
  • Side-by-side document comparison feature in the corporate version does not include editing capabilities.
  • Requires minimal manual editing after conversion.

PDFTables:
  • Does not include optical character recognition (OCR).
  • Can only convert one page at a time.
  • Requires some manual editing after conversion.

Tabula:
  • Does not include optical character recognition (OCR).
  • Only designed for text-based PDF files, not scanned documents and/or other document types.
  • Requires considerable manual editing after conversion.

Cost

ABBYY FineReader 14:
  • $200 for the standard version, $399 for the corporate version; perpetual licensing and upgrades are available for registered users.
  • Free 30-day trial option is available.

PDFTables:
  • Pricing is structured according to the number of pages used (e.g. $30 for 1,000 pages).
  • Free 25-page credit trial option is available; an additional 50 pages are given with a free registration.

Tabula:
  • Free, open-source tool.
