Introduction: Data Types
Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.
There are many issues related to the topic of numeric data. Perhaps the most basic is that of data type: primary versus secondary data. DISC is, of course, a library of data sets for use in secondary analysis. In addition to making available data from a wide variety of sources (for example, government, private companies, and individual researchers), we also archive data -- that is, preserve and distribute data -- that are produced locally by University faculty and staff.
Data for which the researcher designs the survey instrument (questionnaire),
administers the survey, collects the data, and enters the data into
a database are considered primary data. The researcher can
be an individual or an organization, such as a news service or a research
institute. Primary data are very expensive to produce, in both time and money.
An editor of a major city newspaper estimated that he could pay a full-time
reporter's salary for a year for the same amount as it would take
to execute a single local public opinion poll. Polls are so expensive
to conduct, in fact, that the major news services have based their
exit poll reporting during elections on data collected by a single
organization. Understanding the process by which data are created
is important for anyone doing quantitative research. We include a section
on primary data in this series of documents for that reason.
Data that are used by someone other than the person who collected them
are referred to as secondary data. While this category occasionally
includes data that are hand-entered from print material (a necessity
sometimes), it generally refers to the use of published numeric data.
These data can be obtained from a number of sources including established
archives, private companies, or directly from principal investigators.
The benefit of using secondary data is that you invest neither the time
nor the money needed to collect them. The trade-off, though,
is that you have no control over how the instrument is designed
and implemented, what questions are asked, how the data are collected,
or how carefully they are cleaned and documented.
There are other serious trade-offs between primary and secondary data.
One of these is the issue of anomalies; when you are using data you
have collected yourself, you have a clear understanding of how those
data should appear (in a frequency table, for example). You are more
quickly able to spot and correct errors. If you are using someone else's
data, you won't necessarily know all of the subtleties that were involved
in making coding decisions and in inputting the data, and errors in the
data may be difficult to resolve. When using secondary data, it is wise
to begin with a careful analysis of the frequencies of all variables
that you will use.
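As a rough illustration of that first step, the sketch below tabulates
frequencies for each variable and flags values that the documentation
does not list. It is only one possible approach: the records, the
variable names, the codes, and the codebook are all invented for this
example, and the helper functions frequencies and undocumented_values
are hypothetical names, not part of any statistical package.

```python
from collections import Counter

# Hypothetical survey records; the variable names and codes are invented
# for illustration and do not come from any real data set.
records = [
    {"sex": 1, "region": 2},
    {"sex": 2, "region": 1},
    {"sex": 2, "region": 9},
    {"sex": 1, "region": 2},
    {"sex": 7, "region": 3},
]

# Hypothetical codebook: the values each variable is documented to take.
codebook = {"sex": {1, 2}, "region": {1, 2, 3, 4, 9}}


def frequencies(rows, variable):
    """Count how often each value of `variable` occurs in `rows`."""
    return Counter(row[variable] for row in rows)


def undocumented_values(rows, variable, valid):
    """List observed values of `variable` that the codebook does not mention."""
    return [value for value in frequencies(rows, variable) if value not in valid]


# Print a frequency table per variable, then any values needing follow-up.
for variable, valid in codebook.items():
    print(variable, dict(frequencies(records, variable)))
    print("  not in codebook:", undocumented_values(records, variable, valid))
```

A value that appears in the frequency table but not in the codebook is
not necessarily an error. It may be an undocumented missing-data code,
which is exactly the kind of coding decision worth resolving before any
analysis begins.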