An Introduction to Using Data at DISC

This set of documents covers some of the issues involved in the use of numeric data, both in general and specifically at the Data and Information Services Center (DISC).

Topics include how data is created, how to locate suitable data for research, how to acquire the information necessary to work competently with the data, and where to look to find further resources. 

Creating Data

Modified from a text by Gregory Haley, Head, Electronic Data Service, Columbia University.

There have been many books written and courses taught on conducting surveys. The purpose of this document is to provide very basic information on the processes involved in creating data.

If you find you need to collect your own data, be prepared to spend a lot of time designing and testing your data collection instrument, and collecting, coding, and entering your data. You will need to consider many issues, such as the phrasing and order of your questions, the sampling procedure you will use, and how you will code your data. The example used here is borrowed from SAS: a fitness survey that tests the effects of different forms of exercise on measures of weight and the heart. The population for our example consists of overweight people, and the survey measures each subject only once. A real fitness study would test participants a number of times and would therefore have a baseline and several follow-up surveys.

Once all your surveys have been completed and returned to you, you will have to enter the data into a file. You might choose a data entry program, such as dBase or SPSS, which lets you set up an interactive template that greatly facilitates input and gives a higher level of accuracy than a basic editing program such as edit in DOS, edt on the VMS VAX, or emacs or vi on Unix. With a basic editor, you write the data directly into your file, line by line. A word of warning, however: the degree of error in this method of data entry is very high. It is easy to lose your place, easy to make typing errors, and very difficult to find anomalies in your data entry patterns. If at all possible, it is best to use a data entry program.

The answers from the first three questionnaires will look like this:

01  Amy       39  67   9  160  159  74  60  37  37  S  5
02  Diana     22  66  15  154  134  72  62  29  16  C  1
03  Nina      33  63  11  134  140  80  70  26  26  S  5

We can compare these data to the survey, and reading across, we see the variables are:

  1. Case Id

    Personal Data

  2. Name
  3. Age
  4. Height
  5. Number of years overweight

    Fitness Measures

  6. Starting weight
  7. Ending weight
  8. Starting pulse rate
  9. Ending pulse rate
  10. Starting skinfold
  11. Ending skinfold

    Exercise Measures

  12. Form of Exercise
  13. Enjoyment Level
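
Because every record should hold thirteen values, hand-typed lines can be screened for some of the slips described earlier. The following is only a minimal sketch in Python, not part of the original survey materials. It assumes that values are separated by spaces and that names contain no spaces; the fourth record is a hypothetical bad line in which the letter O was typed in place of a zero.

```python
# The three sample records from the text, plus one hypothetical line
# (not from the survey) containing a typical data entry slip.
RECORDS = [
    "01  Amy       39  67   9  160  159  74  60  37  37  S  5",
    "02  Diana     22  66  15  154  134  72  62  29  16  C  1",
    "03  Nina      33  63  11  134  140  80  70  26  26  S  5",
    "04  Pat       41  70   6  18O  175  78  66  30  28  W  4",  # hypothetical
]

EXPECTED_FIELDS = 13
# 0-based positions of the fields that hold text: name and form of exercise.
TEXT_FIELDS = {1, 11}


def check_record(line):
    """Return a list of problems found in one space-separated record."""
    fields = line.split()
    if len(fields) != EXPECTED_FIELDS:
        # With a missing or extra field the positions no longer line up,
        # so report the count and stop.
        return [f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"]
    problems = []
    for position, value in enumerate(fields):
        if position not in TEXT_FIELDS and not value.isdigit():
            problems.append(f"field {position + 1} is not a number: {value!r}")
    return problems


for line in RECORDS:
    for problem in check_record(line):
        print(line[:2], problem)
```

A check like this catches only mechanical errors. A mistyped but plausible number, such as 150 entered for 160, passes unnoticed, which is one reason a data entry program is still preferable.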

The next step is to devise a codebook, which will name the variables and give the column locations for each. Thus the first three records of data with their column locations look like this:

         1         2         3         4         5
12345678901234567890123456789012345678901234567890123456
01  Amy       39  67   9  160  159  74  60  37  37  S  5
02  Diana     22  66  15  154  134  72  62  29  16  C  1
03  Nina      33  63  11  134  140  80  70  26  26  S  5

Now we can take the list of variables and create the codebook. You can assign any name to a variable, but with most statistical packages the variable name must be eight characters or fewer. Names can be mnemonic (e.g., CaseID, Name), a sequential list (e.g., var01, var02), or a combination as appropriate (e.g., caseid, name, var03, var04). The important thing is to name your variables in a consistent manner so that you can find them quickly later.

After each variable name, you will want to record the beginning and ending column of each variable. Only the last two variables have a pre-set range of values. For these two, your codebook should indicate the valid responses and what they mean.

In the example below, a combination of names and numbers is used.

  1. caseid..( 1- 2) Case Id

    Personal Data

  2. name....( 5-14) Name
  3. var01...(15-16) Age
  4. var02...(19-20) Height (in inches)
  5. var03...(23-24) Number of years overweight

    Fitness Measures

  6. var04...(27-29) Starting weight
  7. var05...(32-34) Ending weight
  8. var06...(37-38) Starting pulse rate
  9. var07...(41-42) Ending pulse rate
  10. var08...(45-46) Starting skinfold
  11. var09...(49-50) Ending skinfold

    Exercise Measures

  12. var10......(53) Form of Exercise
    • S = Swimming
    • C = Cycling
    • W = Walking
  13. var11......(56) Enjoyment Level
    • 1 through 5 (1 = low, 5 = high)
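
To show how the column locations in a codebook are used, here is a minimal sketch in Python that slices one record into named variables. It is an illustration only, not the syntax of any statistical package. The dictionary CODEBOOK and the function read_record are names chosen for this example, and the sketch assumes the spacing of the data line has been preserved exactly.

```python
# Codebook from the text: variable name -> (first column, last column),
# using 1-based, inclusive column numbers.
CODEBOOK = {
    "caseid": (1, 2),
    "name": (5, 14),
    "var01": (15, 16),
    "var02": (19, 20),
    "var03": (23, 24),
    "var04": (27, 29),
    "var05": (32, 34),
    "var06": (37, 38),
    "var07": (41, 42),
    "var08": (45, 46),
    "var09": (49, 50),
    "var10": (53, 53),
    "var11": (56, 56),
}


def read_record(line):
    """Slice one fixed-column record into a dict of stripped strings."""
    record = {}
    for variable, (start, end) in CODEBOOK.items():
        # Python slices are 0-based and exclude the end index,
        # so columns start..end become line[start - 1:end].
        record[variable] = line[start - 1:end].strip()
    return record


first = read_record("01  Amy       39  67   9  160  159  74  60  37  37  S  5")
print(first["name"], first["var03"], first["var10"])
```

Reading by column position rather than by splitting on spaces means a value may be blank or contain spaces without shifting the fields that follow it.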

You will notice that two of your variables use letters rather than numbers. These are commonly called "string" or "alphanumeric" variables. Each statistical package has its own syntax for indicating that a variable holds letters rather than numbers: SPSS uses an a in parentheses, (a), after the column locations, and SAS uses a dollar sign, $, after the variable name. In fact, any nominal-level variable can be treated as a string, whether its values are numbers or letters. In our example, we could treat CaseID as a string, since we are not interested in summary statistics such as its mean, median, or mode. Treating all nominal-level variables as strings can save your computer a lot of time when calculating statistics on your entire dataset.
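
The distinction between string and numeric variables can be illustrated with a short Python sketch. This is an assumption-laden example, not SPSS or SAS syntax: the set STRING_VARIABLES and the function convert are hypothetical names, and only a few of the thirteen variables are shown to keep the example short.

```python
from statistics import mean

# Variables kept as strings: identifiers, names, and codes (nominal level).
STRING_VARIABLES = {"caseid", "name", "var10"}

# A subset of the three sample records; every value starts out as text.
raw_records = [
    {"caseid": "01", "name": "Amy", "var01": "39", "var10": "S", "var11": "5"},
    {"caseid": "02", "name": "Diana", "var01": "22", "var10": "C", "var11": "1"},
    {"caseid": "03", "name": "Nina", "var01": "33", "var10": "S", "var11": "5"},
]


def convert(record):
    """Convert each value to an integer unless its variable is a string."""
    return {
        name: value if name in STRING_VARIABLES else int(value)
        for name, value in record.items()
    }


records = [convert(record) for record in raw_records]

# A summary statistic makes sense for age (var01) but not for caseid.
mean_age = mean(record["var01"] for record in records)
print(round(mean_age, 1))
```

Keeping caseid as a string also preserves its leading zero, which would be lost if "01" were converted to the number 1.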

Now you are ready to take your data and write a statistical package program to read and analyze them.
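
As a rough idea of what such a program does, the sketch below uses Python in place of a statistical package to compute the average weight change for each form of exercise. It is only an illustration: the values are taken from the three sample records, the function name mean_weight_change is chosen for this example, and three cases are far too few to support any conclusion.

```python
from collections import defaultdict
from statistics import mean

# (form of exercise, starting weight, ending weight) for the three
# sample records shown earlier.
SAMPLE = [
    ("S", 160, 159),
    ("C", 154, 134),
    ("S", 134, 140),
]


def mean_weight_change(rows):
    """Return the average (ending - starting) weight per form of exercise."""
    changes = defaultdict(list)
    for form, start, end in rows:
        changes[form].append(end - start)
    return {form: mean(values) for form, values in changes.items()}


result = mean_weight_change(SAMPLE)
for form, change in result.items():
    print(form, change)
```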
