Belgian Municipal Accounts for 16 Governments, 1870-1933 – November 1, 2018

The Data and Information Services Center (DISC) has added yet another research study to its online archive this year:

Belgian Municipal Accounts for 16 Governments, 1870-1933 is a study that contains information and data about the receipts and expenditures of 16 municipal governments in Belgium during the period 1870-1933. All of the information on receipts and expenditures was taken from various volumes of the ‘Annuaire Statistique de la Belgique’. While this source provides information about both the budget (intended receipts and expenditures) and accounts (actual receipts and expenditures) of municipal governments, only the latter is included in the data file.** Information about political parties, city council seats, and college seats is also provided.

This study was initially deposited at the Data and Program Library Service (which later became DISC) in 1979 by its principal investigator, Michael T. Aiken. Since then, the data files and documentation have been available by request only.

As of November 2018, users are now able to access this study documentation (e.g. codebook, data files) online; a free registration is required in order to download data files. The codebook is available in PDF format. The dataset is available in raw ASCII, Stata, and SPSS file formats, with the corresponding command files and data dictionaries.

**Please note that data is missing for all cities from 1914 to 1918.
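For readers planning to work with the raw ASCII version, the sketch below shows one way a fixed-width file could be read with Python's standard library. The column names and positions are hypothetical placeholders; the real layout has to be taken from the study's codebook and data dictionaries. The helper that lists absent years is only an illustration of how a gap such as 1914 to 1918 might be detected.

```python
# Hypothetical layout: (name, start, end) with 0-based, end-exclusive offsets.
# The real positions must be taken from the study's codebook.
LAYOUT = [
    ("city_id", 0, 2),
    ("year", 2, 6),
    ("receipts", 6, 16),
    ("expenditures", 16, 26),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict; blank fields become None."""
    record = {}
    for name, start, end in layout:
        raw = line[start:end].strip()
        record[name] = int(raw) if raw else None
    return record


def read_records(lines, layout=LAYOUT):
    """Parse every non-empty line of a raw ASCII file."""
    return [parse_record(line, layout) for line in lines if line.strip()]


def missing_years(records, first=1870, last=1933):
    """Return the years in [first, last] for which no record is present."""
    seen = {record["year"] for record in records}
    return [year for year in range(first, last + 1) if year not in seen]
```

With an open file handle, `read_records(handle)` would return one dictionary per line, and `missing_years(records)` would list any years in the study period that have no rows.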


ICPSR Studies Are Searchable in the UW-Madison Library Catalog

Cataloging and Metadata Services in the General Library System (GLS) added 10,000 records from the Inter-University Consortium for Political and Social Research (ICPSR) to the UW-Madison Library Catalog on October 16, 2018. Adding this metadata expands patrons’ search and discovery to studies held in one of the world’s major social science data archives. When you search the GLS catalog (https://www.library.wisc.edu/), studies from ICPSR will be displayed on your results page. You can follow the Online Access link to ICPSR to access a study. In Advanced Search, you can specify ICPSR in the Series field to search only ICPSR’s holdings in the campus library catalog. Please contact DISC if you have any questions about ICPSR’s collections.


PDF Data Conversion Tools

Have you ever found useful data in a PDF or text-based document and wondered whether there was a simple way to transfer it into a spreadsheet format for statistical analysis? This question was recently posed to the staff here at DISC, and we located three software tools that can be used for this purpose. What follows is a brief review of each tool, along with a comparison of which may be most suitable for particular user needs. (See Table 1 at the bottom for a quick overview of the core features of the three tools.)

ABBYY FineReader 14 (Standard and Corporate versions)

The standard version of ABBYY FineReader 14 allows users to edit a PDF, to make comments and collaborate with others within the PDF editor, and to convert PDFs, other documents, and scans into a variety of formats (e.g. Word, Excel, PowerPoint). The conversion feature can also combine multiple files into one converted document. The PDF editor allows users to add images and text, to draw, to redact data, to add pages, and much more. ABBYY FineReader 14 also includes an OCR (Optical Character Recognition) editor, which allows users to customize and verify recognized text, to recognize unusual characters and fonts, and to select certain areas of a document to recognize, among other things. Within this editor, users can compare an original document (e.g. the PDF version of a data table) with the OCR’d version, correct any text that was not recognized properly, and then save the corrected document in the desired format (e.g. an Excel spreadsheet).

The corporate version of ABBYY FineReader 14 contains these features along with additional capabilities, including automated conversion of documents and a side-by-side comparison of documents for the purpose of detecting differences. The comparison feature does not offer editing capabilities, however, and seems to be more useful for comparing differences in content than in structure.

ABBYY offers versions of FineReader 14 for both Windows and Mac, as well as many other related software products. Pricing begins at $200 for the standard version and $399 for the corporate version; licensing is perpetual and upgrades for registered users are available. A 30-day free trial option is also offered.

PDFTables

PDFTables is a proprietary software tool that allows users to upload PDF files with tabular data, to preview them on a web page, and then to convert and download the preview in Excel, CSV, or XML format. Both cloud-based and on-premises service models are offered, and a free trial option is available so that users can convert PDF files immediately via the website (https://pdftables.com/). Pricing is based on the number of pages converted. The free trial includes 25 pages, with another 50 made available upon completion of a free registration. PDFTables also includes an API (application programming interface), which allows users to automate PDF data extraction.
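As a rough illustration of what automating extraction through the API might look like, the sketch below builds an upload request using only Python's standard library. The endpoint, the `key` and `format` query parameters, and the `f` form field follow our recollection of the PDFTables documentation and should be treated as assumptions to verify against the current docs. The function names are our own, and the API key is a placeholder.

```python
import urllib.parse
import urllib.request
import uuid

# Assumed endpoint; confirm it in the current PDFTables documentation.
API_URL = "https://pdftables.com/api"


def build_conversion_request(pdf_bytes, api_key, fmt="csv", filename="table.pdf"):
    """Build (but do not send) a multipart POST request for one PDF."""
    query = urllib.parse.urlencode({"key": api_key, "format": fmt})
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # "f" is the assumed name of the file field.
        f'Content-Disposition: form-data; name="f"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{API_URL}?{query}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def convert_pdf(pdf_path, out_path, api_key, fmt="csv"):
    """Upload a PDF and save the converted table (needs network access and page credits)."""
    with open(pdf_path, "rb") as handle:
        request = build_conversion_request(handle.read(), api_key, fmt)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A call such as `convert_pdf("table.pdf", "table.csv", "YOUR_API_KEY")` would then use up page credits in the same way as a conversion made through the website.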

Tabula

Tabula is a free, open-source software tool that was created specifically for extracting a data table from a PDF file and converting it into a spreadsheet-compatible format (e.g. CSV, TSV, JSON). The application can be downloaded from https://tabula.technology/, and the interface is accessible via a browser tab that opens each time the application is run. It contains a PDF viewer that displays an imported PDF file, from which the data can be selected, previewed, and exported into the desired file format. Tabula is compatible with Windows, Mac, and Linux. It is designed only for text-based PDF files, not for scanned documents or other document types. The application interface displays a history of files that the user has previously imported, and there is also an option to save custom selections as a template for future use.

The following examples display the varying capabilities of these three conversion tools:

PDF file with data table: Page11852 WhigOCRed

An example of the PDF file (which contains tabular data and has undergone optical character recognition using another program) converted into Excel spreadsheet format using ABBYY FineReader 14: ABBYY-Page11852 WhigOCRed

An example of the same file converted using Tabula: Tabula-Page11852 WhigOCRed

And, finally, using PDFTables: PDFtables- Page11852 WhigOCRed

Comparing these examples shows that precision and clarity vary with the tool used: ABBYY FineReader 14 was the most accurate and Tabula the least. More manual editing is required in a spreadsheet after conversion with PDFTables and Tabula. The PDF viewer in Tabula is also not nearly as powerful as the one in ABBYY FineReader 14 when the same document is opened in both programs. However, we did note that Tabula achieved greater detail and precision when smaller portions of the data table were selected for conversion at a time. It was also somewhat beneficial to perform optical character recognition on the PDF file before conversion. Tabula and PDFTables do not have this capability, so another program must be used first. In addition, neither tool offers the side-by-side document comparison found in the corporate version of ABBYY FineReader 14, and PDFTables can convert only one page of a file at a time.

Overall, Tabula may be more appropriate when converting PDF-based data that is displayed in a simple format, when converting multiple files into the same basic format, when cost is an issue, and/or when data needs to be converted into spreadsheet format only infrequently. PDFTables may also be useful when the volume and complexity of the tabular data are not an issue. ABBYY FineReader 14, on the other hand, seems to be the better tool if you are under time constraints, need to convert a great deal of tabular data, and/or need to convert data that is presented in a more complex format.
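Some of the manual clean-up described above can be scripted. The following is a minimal sketch, assuming the table was exported in several smaller CSV pieces that share one header row; the function names and the clean-up rules (trimming whitespace, dropping empty rows, and reading values such as "1,204" as integers) are our own illustrative choices, not features of any of the three tools.

```python
import csv
import io


def clean_cell(text):
    """Trim whitespace and turn strings like '1,204' into integers."""
    text = text.strip()
    digits = text.replace(",", "")
    if digits.isdigit():
        return int(digits)
    return text


def merge_exports(csv_texts):
    """Combine several partial CSV exports that share one header row."""
    merged = []
    header = None
    for text in csv_texts:
        rows = [
            [clean_cell(cell) for cell in row]
            for row in csv.reader(io.StringIO(text))
            if any(cell.strip() for cell in row)  # drop fully empty rows
        ]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
        # Skip the header row when a later export repeats it.
        start = 1 if rows[0] == header else 0
        merged.extend(rows[start:])
    return merged
```

The merged rows can then be written back out with `csv.writer` and checked against the original PDF, which is still advisable because OCR errors inside a cell cannot be detected this way.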

Table 1. Core features of the three tools

ABBYY FineReader 14 (developed by ABBYY)

Capabilities
  • Converts PDFs, scans, and other documents into a variety of formats (e.g. Excel, PowerPoint).
  • Combines multiple files into one converted document.
  • Contains both PDF and OCR (Optical Character Recognition) editors.
  • Corporate version also includes automated document conversion capabilities and a side-by-side document comparison feature.
Limitations
  • Side-by-side document comparison feature in the corporate version does not include editing capabilities.
  • Requires minimal manual editing after conversion.
Cost
  • $200 for the standard version, $399 for the corporate version; perpetual licensing and upgrades are available for registered users.
  • Free 30-day trial option is available.

PDFTables (developed by The Sensible Code Company)

Capabilities
  • Windows, Mac, and Linux compatible.
  • Converts PDFs into Excel, CSV, or XML format (CSV and XML formats are offered through the on-premises service model).
  • Has an API which allows users to automate data extraction.
  • Offers both cloud-based and on-premises service models.
Limitations
  • Does not include optical character recognition (OCR).
  • Can only convert one page at a time.
  • Requires some manual editing after conversion.
Cost
  • Pricing is structured according to the number of pages used (e.g. $30 for 1,000 pages).
  • Free 25-page credit trial option is available; an additional 50 pages are given with a free registration.

Tabula (developed by journalists Manuel Aristarán, Mike Tigas, Jeremy B. Merrill, and Jason Das)

Capabilities
  • Converts data tables located within a PDF file into spreadsheet-compatible format (e.g. CSV, TSV, JSON).
  • Allows for selective and/or partial conversion of data tables.
  • Windows, Mac, and Linux compatible.
  • Keeps a file history of previously imported PDFs.
  • Offers an option to save custom selections as templates for future use.
Limitations
  • Does not include optical character recognition (OCR).
  • Only designed for text-based PDF files, not scanned documents and/or other document types.
  • Requires considerable manual editing after conversion.
Cost
  • Free, open-source tool.

United Nations Population Fund Releases Annual Population Report – October 17, 2018

Today, the United Nations Population Fund released its annual “State of the World Population” report (.pdf format, 152p.). The theme of this year’s report is “The Power of Choice: Reproductive Rights and the Demographic Transition.” The full report can be found at:

https://www.unfpa.org/swop-2018


A Tale of Two Data Projects: Curation at the Qualitative Data Repository

The Qualitative Data Repository (QDR) at Syracuse University will host a free webinar on how to curate and share qualitative research data on Wednesday, November 7th, at 12pm (EST). Attendees will learn about two data projects deposited with QDR, following them from initial contact, through a variety of curation steps, to their eventual publication. The webinar will also cover the challenges posed by sharing and publishing qualitative data. Please register at https://www.eventbrite.com/e/a-tale-of-two-data-projects-curation-at-the-qualitative-data-repository-tickets-51458200864.


Financial Characteristics of Cities in the United States, 1905-1930 – September 5, 2018

The following research study is now available in the online archive of the Data and Information Services Center (DISC):

Financial Characteristics of Cities in the United States, 1905-1930 (Christopher Curran, Principal Investigator)

Data was collected in the summer of 1969 from volumes of the following U.S. Bureau of the Census publications: Statistics of Cities Having a Population Over 30,000 and Financial Statistics of Cities Having a Population Over 30,000. Statistics were then compiled in order to describe the pattern of financial transactions in U.S. cities for the period 1905-1930, as part of an examination of regional and population differences.

Variables in this research study include: population; non-government costs; interest charges; public service enterprise payments; general government expenses; protection expenses; health expenses; sanitation expenses; highway expenses; charity, hospital, and correction expenses; education expenses; recreation expenses; miscellaneous expenses; general government outlays; health outlays; sanitation outlays; charity, hospital, and correction outlays; education outlays; recreation outlays; miscellaneous outlays; public service enterprise outlays; non-revenue receipts; general property taxes; special taxes; poll taxes; business taxes; special assessments; fines, forfeits, and escheats; subventions and grants; highway privileges; rent revenue; interest revenue; miscellaneous revenues; and public service enterprise revenue.

This study was initially deposited at the Data and Program Library Service (which later became DISC) in 1979 by Christopher Curran, an economics professor at Emory University in Atlanta, Georgia. Since then, data files and documentation for the study have been available by request only.

As of September 2018, users are now able to access all study documentation (e.g. codebook, data files) online; a free registration is required in order to download data files. The codebook is available in PDF format. The dataset is available in raw ASCII, Stata, and SPSS file formats, with the corresponding command files and data dictionaries.


ICPSR Data Fair 2018 – Data: Powered By You

This virtual conference is scheduled for October 1-5. It features over 20 webinars covering topics such as data transparency, data activism, sharing data, data in community, and more. Attendees need to register for each webinar individually. The Data Fair schedule can be found on the event website:
https://www.icpsr.umich.edu/icpsrweb/content/membership/datafair/index.html


NIH Funding Longevity by Gender

This recent article (http://www.pnas.org/content/early/2018/07/10/1800615115), published in the Proceedings of the National Academy of Sciences, examined National Institutes of Health (NIH) research grantees in the biomedical sciences from 1991 to 2010 and found that only 31% of the principal investigators were women. Michael Lauer, Deputy Director for Extramural Research at NIH, summarized some of the findings in his blog:

https://nexus.od.nih.gov/all/2018/09/07/funding-longevity-by-gender-among-nih-supported-investigators/


Social Science Computing Cooperative Offers Training Classes

The Social Science Computing Cooperative (SSCC) offers classes on statistical packages (R, SAS, Stata, Mplus, and NVivo). Please visit the SSCC training page, https://ssc.wisc.edu/sscc_jsp/training/index.jsp, to learn more about these classes. All SSCC training classes are held in person and require registration. Materials for many classes are available online in the SSCC’s Statistical Computing Knowledge Base, https://ssc.wisc.edu/sscc/pubs/stat.htm.


New US Bureau of Labor Statistics Report on Women’s Earnings – August 21, 2018

The US Bureau of Labor Statistics has recently released its annual report on women’s earnings, Highlights of Women’s Earnings in 2017 (BLS Report 1075, August 2018, .pdf, HTML, and Excel format, 153p.). The report uses data from the Current Population Survey (CPS) to estimate the earnings of women in the United States. “The weekly and monthly earnings estimates in this report reflect information collected from one-fourth of the households in the monthly survey and averaged for the calendar year.” The full report is available at:

https://www.bls.gov/opub/reports/womens-earnings/2017/home.htm
