An Introduction to Using Data at DISC

Introduction: Data Types

Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.

There are many issues related to the topic of numeric data. Perhaps the most basic is that of data type: primary versus secondary data. DISC is, of course, a library of data sets for use in secondary analysis. In addition to making available data from a wide variety of sources (for example, government, private companies, and individual researchers), we also archive data -- that is, preserve and distribute data -- that are produced locally by University faculty and staff.

Data for which the researcher designs the survey instrument (questionnaire), administers the survey, collects the data, and enters the data into a database are considered primary data. The researcher can be an individual or an organization, such as a news service or a research institute. Primary data are expensive to produce, in both time and money. An editor of a major city newspaper estimated that a single local public opinion poll costs as much as a full-time reporter's salary for a year. Polls are so expensive to conduct, in fact, that the major news services have based their election exit poll reporting on data collected by a single organization. Understanding the process by which data are created is important for anyone doing quantitative research. We include a section on primary data in this series of documents for that reason.

Data that are used by someone other than the person who collected them are referred to as secondary data. While this category occasionally includes data that are hand-entered from print material (a necessity sometimes), it generally refers to the use of published numeric data. These data can be obtained from a number of sources, including established archives, private companies, or directly from principal investigators. The benefit of using secondary data is that you invest neither the time nor the money required to collect them. The trade-off, though, is that you have no control over how the instrument is designed and implemented, what questions are asked, how the data are collected, or how carefully they are cleaned and documented.

There are other serious trade-offs between primary and secondary data. One of these is the issue of anomalies: when you are using data you have collected yourself, you have a clear understanding of how those data should appear (in a frequency table, for example), and you can more quickly spot and correct errors. If you are using someone else's data, you won't necessarily know all of the subtleties that were involved in making coding decisions and in inputting the data, and errors in the data may be difficult to resolve. When using secondary data, it is wise to begin with a careful analysis of the frequencies of all variables that you will use.
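
As a rough illustration of that first pass over the frequencies, the sketch below tabulates each variable and lists any values that the codebook does not document. It is only a minimal example under stated assumptions: the records, the variable names (sex, party), the codes, and the helper functions are all hypothetical and do not come from any DISC data set.

```python
from collections import Counter

# Hypothetical survey records; variable names and codes are invented.
records = [
    {"sex": 1, "party": 2},
    {"sex": 2, "party": 1},
    {"sex": 1, "party": 9},
    {"sex": 7, "party": 2},  # 7 is not a documented code for "sex"
    {"sex": 2, "party": 1},
]

# Hypothetical codebook: the valid codes listed for each variable.
codebook = {"sex": {1, 2}, "party": {1, 2, 3, 9}}


def frequencies(rows, variable):
    """Count how often each value of `variable` occurs in `rows`."""
    return Counter(row[variable] for row in rows)


def undocumented_codes(rows, variable, valid_codes):
    """Return, in sorted order, observed values missing from the codebook."""
    return sorted(set(frequencies(rows, variable)) - valid_codes)


# Print a frequency table and any undocumented codes for every variable.
for variable, valid_codes in codebook.items():
    table = frequencies(records, variable)
    print(variable, dict(sorted(table.items())))
    print("  undocumented codes:", undocumented_codes(records, variable, valid_codes))
```

Here the value 7 for sex would be flagged because the assumed codebook lists only 1 and 2. With real secondary data, a value like this cannot be corrected by guesswork; it has to be checked against the documentation or raised with the distributor of the data.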
