An Introduction to Using Data at DPLS

What are Data?

Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.

In the context of data libraries and archives, 'data' means computer-readable data. We acquire, store and disseminate data for secondary research. This means that data collected for a primary purpose are then made available for research by other individuals or groups. This research may seek to replicate analyses already carried out by the primary researchers in order to verify, extend, or elaborate upon the original results, or to analyse the data from an entirely different perspective. Censuses and large surveys carried out by governments for their own policy purposes are particularly rich sources of data for further exploration.

For most, the "Introduction to Data" occurs in the context of an introductory or advanced course in statistical analysis. Typically, the data you have used have been preformatted by a TA or RA for use with a particular statistical package such as SPSS, SAS or STATA. Once you embark upon your own research, you will rapidly discover that few datasets come prepared for immediate access by your favorite application. In fact, you will find that getting your data into an appropriate form for analysis is often more involved than the most complex statistical analyses you will employ!

Using Secondary Data

Usually your research design will be such that you will not have to collect your own data but can test your hypotheses using data that already exist among the wealth of data available in the public realm. These data might be small, simple, micro-level data such as a public opinion poll, or a survey of social or political attitudes. Or they may be more extensive and complex data, such as the Current Population Surveys or the Panel Study of Income Dynamics.
Alternatively, many macro-level data sets (geographically aggregated data such as the County Business Patterns or the International Financial Statistics) are also available. Regardless, the challenge with secondary data is to assure yourself that the data appropriately address your research question, so that you are not caught in the dilemma of altering your hypothesis to fit the data. The questions you will need to ask yourself when evaluating secondary data sources include the appropriateness of the study's unit of analysis and sampling, the variables and their values, and the levels of measurement.

Finding data

At the Data and Program Library our primary source of data is the Inter-university Consortium for Political and Social Research (ICPSR), located in Ann Arbor, Michigan. They have a searchable database of their extensive collection, which is available freely to all University of Wisconsin students, staff and faculty. The DPLS also obtains data from other sources such as the U.S. Government, international organizations, and private vendors. You can browse all of the holdings at DPLS via our Online Catalog, and you can search for data on the Internet yourself by using the DPLS Internet Crossroads.

Up until the early 1990s most data were available only on electronic storage media such as magnetic tape. Data had to be accessed via a mainframe computer, and a great amount of technological knowledge was required to use them: knowledge not only of a statistical nature but also general computer knowledge. More recently the desktop computing revolution has brought data directly to the researcher's desktop: mainframe technology has been replaced by the P.C., CD-ROMs, FTP, and the World Wide Web. This revolution has not only been technological in nature; many more people can now use data without having to devote tremendous amounts of time to mastering all of the technical knowledge that once was mandatory.
The one thing that hasn't changed is the task of locating the data that will suit your research need. In most cases this will be a time-consuming and meticulous process. There are many tools available to assist you in this process, including online and printed resources, books and articles, and, of course, the knowledge and experience of others.

Once you find the dataset that you want, you can contact DPLS staff to help you gain access to it. Many of our datasets, including all those that come from the ICPSR, can be distributed via Anonymous FTP. Some data, specifically those on commercially produced CD-ROMs, must be used within the data library. Also be aware that every step of finding the right data has potential gotchas. For this reason it is prudent to give yourself plenty of time not only to locate your data but to review all of the associated information (including technical documentation and codebooks).

Accessing the Data

As a starting point, assume that you have identified a dataset containing information you would like to analyze. These data consist of a number of measured attributes, called variables, each describing a set of observations. In the case of a survey, the observations are typically individual respondents and the variables are responses to questions about attitudes, behaviors and traits.

Before getting started, it is essential that you understand the technical language of data. The most basic concepts are the record and the field. A record is simply a line of the data file. In some cases, the record contains all of the information about an observation, but as is noted below that isn't always true. A field is a column or columns containing the data for a specific variable. This assumes, of course, that your data are in a fixed column format, meaning that the values for a particular variable are in the same column(s) for all observations, which is true for 90% of all data files you will encounter.
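To make the record and field vocabulary concrete, here is a minimal Python sketch that is not part of the original guide. It treats one line of a fixed column file as a record and slices fields out of it by column position. The record is the first 20 columns of the Euro-barometer 14 line quoted later in this guide, and the two positions used (columns 1 to 4, and column 5) follow that study's codebook entries. The helper name `field` is hypothetical.

```python
def field(record: str, start: int, width: int) -> str:
    """Return the field that begins at 1-based column `start` and spans `width` columns."""
    # Codebooks count columns from 1; Python strings are indexed from 0.
    return record[start - 1 : start - 1 + width]


# One record is one line of the raw data file (first 20 columns only).
record = "79582001000010103207"

study_number = field(record, start=1, width=4)  # columns 1-4
edition = field(record, start=5, width=1)       # column 5

print(study_number, edition)
```

Because every record in a fixed column file uses the same positions, the same `start` and `width` values can be applied to every line of the file.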
The alternative is a variable format, in which the variables are found in a specified sequence within the file but are not in the same locations for each observation. The relationship between records and fields constitutes a "data structure". The four most common structures are as follows:
Typically, you won't want all of the variables in a file, and in some instances you won't want all of the observations. The process of reducing the number of variables is called an extraction; reducing the number of observations is called subsetting a dataset. Generally, once a suitable dataset has been identified you will need to subset those cases and/or extract those variables which you will want to save for further processing.

On many of our proprietary CD-ROMs, extraction software (of varying quality) is included. In general it will allow you to point and click on the variables you want. Your data will be extracted as a raw ASCII file, a spreadsheet, or a statistical package system file. The DPLS produces on-line Users Guides for many of these products.

In the case of ICPSR data, the place to start is with a CODEBOOK, which is more or less a manual describing a particular study or data collection. While the content and format of codebooks vary considerably between data collections, the typical codebook contains the following information:
Most codebooks have at least two major sections: the data dictionary, which lists the variables and column locations, and the data collection instrument. In a number of cases, there is a section describing how to read the codebook! You will also need to use a statistical package such as SAS or SPSS to access the supplied raw data and create your extract. With many ICPSR data sets, SAS and SPSS command files are included, along with an electronic codebook, to facilitate this process.

The following is an example of an ICPSR dataset titled The Euro-barometer 14: Trust in the European Community, October 1980. The study consists of a single raw ASCII data file, a codebook file, and SPSS and SAS command files. The first line of data begins like this:

795820010000101032078001233133113240002002120000131013221120030420117720003

Each line of data takes 106 columns, so there isn't room for them all on a single line of this page. Each line of data is a single observation (or record). The first 25 records of data look like this. There are a total of 9994 records in the entire data set (this file specification information can be obtained via the ICPSR or the DPLS web sites). We will add a scale with these data even though we already have a codebook, and we will only use the first 60 columns of data. Now we can compare the data to the documentation, or the codebook.

1___5___10___15___20___25___30___35___40___45___50___55___60
795820010000101032078001233133113240002002120000131013221120

Let's look at the first few pages of the machine readable codebook. We will snip out the initial introductory material from the document, but you should get into the practice of reading this material very closely. In it you will find out about the population from which the sample was drawn, and whether the respondents were selected by random sampling, by cluster random sampling or by a proportional random sampling process.
In the first two instances, each respondent will need to be weighted, whereas in the latter, they will not need to be. A variety of other important information will also be found in the introductory material.

In the codebook, the first variable, named VAR0001, is documented like this:

VAR 0001 ICPSR STUDY NUMBER-7958 NO MISSING DATA CODES
REF 0001 LOC 1 WIDTH 4 DK 1 COL 3- 6
ICPSR STUDY NUMBER-7958

Here the first variable identifies the dataset by the ICPSR number. This is important. These data are collected for the European Union, and they originate from the Zentralarchiv für Sozialforschung (ZA) at the University of Köln. If these data came from ZA, the documentation from ICPSR might be of little use. This underscores an important point: it is always essential that you determine that the codebook matches the data; a codebook for a different version of the data can leave the data unusable.

Since we are using data that are formatted with one record per case, the first variable starts in column 1 and is 4 columns wide. There are no missing data codes. If we were using a dataset with more than one record per case, an important piece of information that we would need to account for would be the record number (or type), often identified with DK for deck number.

The second variable, starting in column 5 and only one column wide, identifies the edition, which for this dataset is the second.

VAR 0002 ICPSR EDITION NUMBER-2 NO MISSING DATA CODES
REF 0002 LOC 5 WIDTH 1 DK 1 COL 7
ICPSR EDITION NUMBER
--------------------
THE NUMBER IDENTIFYING THE RELEASE EDITION OF THIS DATASET.
2. WINTER, 1983 RELEASE

If we look at the line of data above, we see that the first five digits are 7958 and 2.

In another example of a typical codebook entry (this time from the American National Election Survey) we see the following:
VAR 0062 R INTREST-POL CAMPGN MD=0 OR GE 8
REF 0062 LOC 151 WIDTH 1
In this interview I will be talking with you about the
recent elections, as well as a number of other things.
First, I have some questions about the political campaigns
that took place this election year.
Q.A1. Some people don't pay much attention to political
campaigns. How about you? Would you say that you were
VERY MUCH INTERESTED, SOMEWHAT INTERESTED, or NOT MUCH
INTERESTED in following the political campaigns this year?
----------------------------------------------------------
304 1. VERY MUCH INTERESTED
635 3. SOMEWHAT INTERESTED
419 5. NOT MUCH INTERESTED
8. DK
1 9. NA
1126 0. INAP, 1992 cross section
This entry shows the actual survey text as well as the coded values, their labels, and frequencies, plus missing data information in the MD field. Other data sets have different types of codebooks, and you will find with experience that there really isn't a "standard" for documenting data. For example, an entry from the 1990 Census Public Microdata looks quite different from the ICPSR codebooks described above:
Since these are hierarchical data, there is a RECTYPE indicator, in this case Person Record. The "BEGIN" column contains the starting column location of the variable and "SIZE" indicates the width of the variable. The lines beginning with "V" contain the coded values of the variable and a description of these codes. In almost all cases, the codebook will contain the information you will need to begin thinking about and writing syntax to extract the variables and cases you need from the raw data. The basic pieces of information that you will need are:
The time-consuming task, especially with huge, comprehensive studies such as the Panel Study of Income Dynamics, is to read through the long lists of variables, track down their descriptions in the codebook and keep track of their column locations until you have to write your extraction program (or, if you are lucky, to edit an already existing program).

Once you've decided which variables you need, you'll need to decide upon an application to use. The application should be one that is suited to the types of analyses that you expect to conduct; however, there isn't always an obvious choice. Some applications are well-suited to specific types of analysis but are poor choices as "extraction engines" because they either don't handle data manipulation efficiently or cannot work with complex data structures. Most people use either SPSS or SAS for extracting data because both have very robust data manipulation capabilities. After you extract your data you are ready to write a statistical package program to read and analyze them.
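The extraction and subsetting steps described in this guide can also be sketched in a general-purpose language. The Python example below is an illustration under stated assumptions, not the SPSS or SAS syntax the guide recommends. The layout entries for VAR0001 and VAR0002 follow the Euro-barometer codebook entries quoted above; the variable `Q1`, its column, and the five records are invented for the example, and the missing-data rule is modelled on the "MD=0 OR GE 8" notation from the election study entry.

```python
# Layout copied from a codebook: name -> (1-based start column, width).
# Q1 is an invented variable used only for illustration.
LAYOUT = {
    "VAR0001": (1, 4),  # ICPSR study number
    "VAR0002": (5, 1),  # ICPSR edition number
    "Q1": (6, 1),       # hypothetical survey item
}

# Invented fixed-column records, not real survey data.
RAW_LINES = [
    "795821",
    "795823",
    "795829",
    "795820",
    "795825",
]


def extract(lines, layout, keep):
    """Extraction: read only the variables named in `keep` from each record."""
    rows = []
    for line in lines:
        row = {}
        for name in keep:
            start, width = layout[name]
            row[name] = line[start - 1 : start - 1 + width]
        rows.append(row)
    return rows


def is_valid(code: str) -> bool:
    """Apply a missing-data rule of the form MD=0 OR GE 8."""
    value = int(code)
    return value != 0 and value < 8


def subset(rows, variable):
    """Subsetting: keep only observations with a valid code for `variable`."""
    return [row for row in rows if is_valid(row[variable])]


extracted = extract(RAW_LINES, LAYOUT, keep=["VAR0001", "Q1"])
valid = subset(extracted, "Q1")
print(len(extracted), len(valid))
print([row["Q1"] for row in valid])
```

Keeping the layout in one table mirrors the bookkeeping the codebook asks of you: once the start column and width of each variable are written down, extraction is a matter of choosing names and subsetting is a matter of choosing a rule.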
©2008 Data and Information Services Center, University of Wisconsin-Madison.