This set of documents covers some of the issues involved in the use of numeric data, both in general and specifically at the Data and Information Services Center (DISC).
Topics include how data is created, how to locate suitable data for research, how to acquire the information necessary to work competently with the data, and where to look to find further resources.
Creating Data
Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.
There have been many books written and courses taught on conducting surveys. The purpose of this document is to provide very basic information on the processes involved in creating data.
If you find you need to collect your own data, be prepared to spend a lot of time designing and testing your data collection instrument, and collecting, coding, and entering your data. You will need to consider many issues, such as the phrasing and order of your questions, the sampling procedure you will use, and how you will code your data. The example used here is borrowed from SAS: a fitness survey that tests the effects of different forms of exercise on measures of weight and the heart. The population for our example consists of overweight people, and each subject is surveyed only once. A real fitness study would test participants a number of times and would therefore have a baseline and several follow-up surveys.
Click here to see the sample survey.
Once all your surveys have been completed and returned to you, you will have to enter the data into a file. You might choose a program with data entry facilities, such as dBase or SPSS, which lets you set up an interactive template that greatly facilitates the input process and provides a higher level of accuracy than a basic text editor such as EDIT in DOS, EDT on a VAX running VMS, or emacs or vi on Unix. With a text editor, you write the data directly into your file, line by line. A word of warning, however: the error rate with a text editor is very high. It is easy to lose your place, easy to make typing errors, and very difficult to spot anomalies in what you have entered. If at all possible, use a data entry program.
The answers from the first three questionnaires will look like this:
01  Amy       39  67   9  160  159  74  60  37  37  S  5
02  Diana     22  66  15  154  134  72  62  29  16  C  1
03  Nina      33  63  11  134  140  80  70  26  26  S  5
We can compare these data to the survey, and reading across, we see the variables are:
- Case Id
Personal Data
- Name
- Age
- Height
- Number of years overweight
Fitness Measures
- Starting weight
- Ending weight
- Starting pulse rate
- Ending pulse rate
- Starting skinfold
- Ending skinfold
Exercise Measures
- Form of Exercise
- Enjoyment Level
Click here to see the complete data file.
The next step is to devise a codebook, which will name the variables and give the column locations for each. Thus the first three records of data with their column locations look like this:
1___5___10___15___20___25___30___35___40___45___50___55___60
01  Amy       39  67   9  160  159  74  60  37  37  S  5
02  Diana     22  66  15  154  134  72  62  29  16  C  1
03  Nina      33  63  11  134  140  80  70  26  26  S  5
Now we can take the list of variables and create the codebook. You can assign any name to a variable, but with most statistical packages the name must be eight characters or fewer. Names can be mnemonic (e.g., CaseID, Name), sequential (e.g., var01, var02), or a combination as appropriate (e.g., caseid, name, var03, var04). The important thing is to name your variables consistently so that you can find them quickly later.
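The naming rules above can be checked mechanically. The following is a minimal Python sketch, not part of any statistical package: the helper names `sequential_names` and `invalid_names` are hypothetical, and the character rules (start with a letter, then letters, digits, or underscores, compared without regard to case) are assumptions that vary between packages.

```python
import re

MAX_NAME_LENGTH = 8  # the limit the text cites for most statistical packages


def sequential_names(prefix, count):
    """Return names such as var01, var02, ... padded to a common width."""
    width = max(2, len(str(count)))
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]


def invalid_names(names):
    """Return the names that are too long, malformed, or duplicated."""
    # Assumption: a name starts with a letter and then uses only
    # letters, digits, or underscores.
    pattern = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
    seen = set()
    bad = []
    for name in names:
        key = name.lower()  # assumption: names are compared case-insensitively
        if len(name) > MAX_NAME_LENGTH or not pattern.fullmatch(name) or key in seen:
            bad.append(name)
        seen.add(key)
    return bad


# The combined scheme used in this document: two mnemonics plus var01..var11.
names = ["caseid", "name"] + sequential_names("var", 11)
print(names)
print(invalid_names(names))                               # nothing to report
print(invalid_names(["startweight", "var01", "VAR01"]))   # too long, duplicate
```

Zero-padding the sequential names (var01 rather than var1) keeps them in order when sorted alphabetically, which helps with finding variables later.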
After each variable name, record the beginning and ending column of that variable. Most of the variables have no preset range of values. The last two do, so for them your codebook should list the valid responses and what they mean.
In the example below, a combination of names and numbers is used.
- caseid..( 1- 2) Case Id
Personal Data
- name....( 5-14) Name
- var01...(15-16) Age
- var02...(19-20) Height (in inches)
- var03...(23-24) Number of years overweight
Fitness Measures
- var04...(27-29) Starting weight
- var05...(32-34) Ending weight
- var06...(37-38) Starting pulse rate
- var07...(41-42) Ending pulse rate
- var08...(45-46) Starting skinfold
- var09...(49-50) Ending skinfold
Exercise Measures
- var10......(53) Form of Exercise
  - S = Swimming
  - C = Cycling
  - W = Walking
- var11......(56) Enjoyment Level
  - 1 through 5 (1 = low, 5 = high)
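To illustrate how a program uses the column locations, here is a minimal Python sketch that slices the three sample records according to the codebook. It stands in for the statistical package's own input statement. The names `CODEBOOK`, `LINES`, and `parse_record` are hypothetical, and the sample lines assume the data were entered exactly in the columns the codebook gives.

```python
# Codebook entries: (variable name, first column, last column).
# Columns are 1-based and inclusive, as in the codebook above.
CODEBOOK = [
    ("caseid", 1, 2),
    ("name", 5, 14),
    ("var01", 15, 16),
    ("var02", 19, 20),
    ("var03", 23, 24),
    ("var04", 27, 29),
    ("var05", 32, 34),
    ("var06", 37, 38),
    ("var07", 41, 42),
    ("var08", 45, 46),
    ("var09", 49, 50),
    ("var10", 53, 53),
    ("var11", 56, 56),
]

# The first three records, assumed to sit in the codebook's columns.
LINES = [
    "01  Amy       39  67   9  160  159  74  60  37  37  S  5",
    "02  Diana     22  66  15  154  134  72  62  29  16  C  1",
    "03  Nina      33  63  11  134  140  80  70  26  26  S  5",
]


def parse_record(line, codebook=CODEBOOK):
    """Slice one fixed-width line into a dict of stripped text fields."""
    # Python slices are 0-based and end-exclusive, so the column range
    # (start, end) becomes line[start - 1:end].
    return {name: line[start - 1:end].strip() for name, start, end in codebook}


records = [parse_record(line) for line in LINES]
for record in records:
    print(record["caseid"], record["name"], record["var04"], record["var10"])
```

Every field comes back as text at this stage. Deciding which fields to convert to numbers is a separate step.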
You will notice that two of your variables, name and var10, contain letters rather than numbers. These are commonly called "string" or "alphanumeric" variables. Each statistical package has its own syntax for indicating that a variable holds characters rather than numbers. SPSS uses an a in parentheses, (a), after the column locations, and SAS uses a dollar sign, $, after the variable name. In fact, any nominal-level variable can be treated as a string, whether its values are numeric or alphanumeric. In our example, we could treat caseid as a string, since we are not interested in any summary statistics for it such as mean, median, or mode. Treating all nominal-level variables as strings can save your computer a lot of time when calculating statistics on your entire dataset.
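The same distinction can be sketched in Python: declare which variables are strings, convert the rest to numbers, and check the coded variables against the valid responses in the codebook. This is an illustrative sketch only. `STRING_VARS`, `VALID_CODES`, `convert`, and `invalid_codes` are hypothetical names, and treating caseid as a string is a choice, as discussed above.

```python
# Variables kept as strings. "name" and "var10" contain letters; "caseid"
# looks numeric but is nominal, so it is kept as text here (a choice).
STRING_VARS = {"caseid", "name", "var10"}

# Valid responses from the codebook for the last two variables.
VALID_CODES = {
    "var10": {"S", "C", "W"},   # Swimming, Cycling, Walking
    "var11": {1, 2, 3, 4, 5},   # enjoyment level, 1 = low, 5 = high
}


def convert(record):
    """Turn a record of text fields into typed values."""
    return {
        name: text if name in STRING_VARS else int(text)
        for name, text in record.items()
    }


def invalid_codes(record):
    """Return the coded variables whose value is not listed in the codebook."""
    return [name for name, valid in VALID_CODES.items() if record[name] not in valid]


# The first sample record, as text fields read from the data file.
raw = {
    "caseid": "01", "name": "Amy", "var01": "39", "var02": "67",
    "var03": "9", "var04": "160", "var05": "159", "var06": "74",
    "var07": "60", "var08": "37", "var09": "37", "var10": "S", "var11": "5",
}

amy = convert(raw)
print(amy["caseid"])                  # stays "01", leading zero intact
print(amy["var04"] - amy["var05"])    # arithmetic works on numeric fields
print(invalid_codes(amy))             # no problems in this record

# A deliberately mistyped record shows what the check reports.
mistyped = convert(dict(raw, var10="X", var11="7"))
print(invalid_codes(mistyped))
```

Keeping caseid as a string preserves its leading zero, which would be lost if it were converted to a number.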
Now you are ready to take your data and write a statistical package program to read and analyze them.
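As a rough idea of what such a program does, the following Python sketch computes the mean weight change for each form of exercise from the three sample records. In practice a statistical package such as SAS or SPSS would do this. The function name `mean_weight_change` is hypothetical, and three records are far too few to support any conclusion.

```python
from collections import defaultdict
from statistics import mean

# The three sample records, already read and converted; only the variables
# needed here are shown (var04 = starting weight, var05 = ending weight,
# var10 = form of exercise).
records = [
    {"name": "Amy", "var04": 160, "var05": 159, "var10": "S"},
    {"name": "Diana", "var04": 154, "var05": 134, "var10": "C"},
    {"name": "Nina", "var04": 134, "var05": 140, "var10": "S"},
]


def mean_weight_change(records):
    """Return the mean of (ending - starting weight) per form of exercise."""
    changes = defaultdict(list)
    for record in records:
        changes[record["var10"]].append(record["var05"] - record["var04"])
    return {form: mean(values) for form, values in changes.items()}


result = mean_weight_change(records)
for form, change in result.items():
    print(form, change)
```

A negative value means the group lost weight on average, and a positive value means it gained.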