December 1995

Spiders, Robots and Crawlers, Oh My!

Many World Wide Web (WWW) sites have emerged to provide tools for content searching. Some of these sites consist of searchable indexes created by automated web search programs (Harvest, Excite); others are actually subject lists created by humans (Yahoo, BUBL Subject Tree). (BUBL Subject Tree is no longer in existence, but try BUBL Link at http://bubl.ac.uk/link/, as of 7/10/98.) Meta-sites allow access to many individual search tools, sometimes allowing the user to channel a single keyword through several search engines (SavvySearch, All-in-One).

The quality of these sites, in both content and presentation style, varies widely. Using them successfully has become an art in itself, and keeping up with their growing numbers is now almost impossible. The burden of quality control rests with the user's understanding of the tool: caveat emptor! Some guidelines may help:

First, it is important to know what information is indexed by a search tool, in order to get reasonable returns or "hits". For instance, Open Text does full-text indexing, while World Wide Web Worm indexes only words tagged as document titles. Other questions to ask are: how frequently is the index updated; how much of the WWW is covered (Lycos claims "90% of the Web"); and what kinds of Internet sites are included in the index (WWW only, or FTP, Gopher, News, and Telnet sites as well)?

Second, the design of the search engine definitely affects the comprehensiveness, precision, and relevance of returns. Often, there is a trade-off involved. WebCrawler, for example, generally returns a large number of hits for a simple keyword search, but many of them may not be relevant. Yahoo returns extensive subject lists in response to a general topic search--a great launchpad for browsing Web resources, but frustrating if something specific is needed. Lycos may work well if precision is important, but does not allow phrase searching. InfoSeek does well at both recall and precision--however, a subscription fee is charged for use. (Some commercial tools use advertising instead.)
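The trade-off between recall and precision can be made concrete with a little arithmetic. The sketch below is an illustration only: the document identifiers and both result sets are invented, and are not measurements of any search tool named here. Precision is the share of returned hits that are relevant; recall is the share of all relevant documents that were returned.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: set of document ids a search tool returned (hypothetical)
    relevant: set of document ids that truly answer the query (hypothetical)
    """
    hits = returned & relevant  # relevant documents that were actually found
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents truly answer the query.
relevant = {"d1", "d2", "d3", "d4"}

# A broad engine returns many hits, most of them off-topic.
broad = {"d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9", "d10", "d11"}
# A narrow engine returns few hits, all of them on-topic.
narrow = {"d1", "d2"}

broad_scores = precision_recall(broad, relevant)
narrow_scores = precision_recall(narrow, relevant)
print("broad  (precision, recall):", broad_scores)
print("narrow (precision, recall):", narrow_scores)
```

In this made-up case the broad engine finds more of the relevant documents but buries them among irrelevant hits, while the narrow engine returns only relevant hits but misses half of what exists.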

Personal preference plays a role as well. Search forms can be simple and non-transparent, or complex but customizable. EINet Galaxy uses string searching; WWW Worm allows Boolean operators (and, or, not). Excite offers ranked listings based on weighted comparisons of terms in documents. Open Text has a flexible search form and is fast. Output varies as well. More sophisticated search tools produce detailed entries containing abstracts, outlines, and frequently used words. However, these abstracts are machine generated, and shouldn't be taken as measures of quality.
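To show how Boolean searching differs from a ranked listing, here is a minimal sketch. It is not the method used by WWW Worm, Excite, or any other tool mentioned above: the three "pages", the function names, and the crude scoring rule (count how often the query terms appear) are all assumptions made for illustration.

```python
from collections import Counter

# Three tiny made-up pages; a real index covers vastly more.
PAGES = {
    "page1": "census data census tables for wisconsin",
    "page2": "labor force data from the current population survey",
    "page3": "wisconsin library catalog",
}


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = {}
    for name, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(name)
    return index


def boolean_and(index, a, b):
    """Pages containing both a and b."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Pages containing a or b (or both)."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, a, b):
    """Pages containing a but not b."""
    return index.get(a, set()) - index.get(b, set())


def ranked(pages, terms):
    """List matching pages, highest count of query terms first."""
    scores = {}
    for name, text in pages.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score:
            scores[name] = score
    # Sort by descending score, then by name so ties come out in a fixed order.
    return sorted(scores, key=lambda n: (-scores[n], n))


INDEX = build_index(PAGES)
print(boolean_and(INDEX, "census", "wisconsin"))
print(ranked(PAGES, ["census", "data"]))
```

A Boolean query returns an unordered set of pages that either match or do not; a ranked query orders the matches, so the page that mentions the terms most often comes first.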

For even more information about web searchers, see the DPLS hypertext document on the subject, prepared for a conference of the Association of Public Data Users (APDU) this fall.

Z39.50 Gateways

Another interesting type of search tool is a Z39.50 gateway to a library's online public access catalog (OPAC). Z39.50 is a standard communications protocol which allows the user to search the online catalogs of several libraries, using a common interface. For instance, a person could perform a search on Madcat, then perform another search on the University of Minnesota's OPAC, without learning new commands.

The Library of Congress has a Z39.50 gateway site (http://lcweb.loc.gov/z3950/) allowing access to the Library of Congress catalog. There are also links to the Z39.50 gateways of several universities, including Madison. Though you can link to many OPACs, there is no mechanism to perform a search on multiple OPACs--one query can't be submitted to both Madison and Minneapolis at the same time.

The amount of information returned is limited as well. The Madison gateway won't tell you in which campus library a title is located. However, for purposes such as interlibrary loan, these gateways provide a quick and simple means for locating titles.

Selected Recent Acquisitions

  • World Fertility Data [CD-ROM] (QP-510-001)
  • Bureau of Health Professions Area Resource Files (A.R.F.), 1940-1994 (QG-014-002)
  • Prospects: the Congressionally-Mandated Study of Educational Growth and Opportunity, 1991-1992. [CD-ROM] (QD-026-001)
  • The German Socio-Economic Panel (GSOEP), Waves 1-11 [CD-ROM] (CA-511-001)
  • International Social Survey Program, 1985-1992 [CD-ROM] (SA-520-003 ~ SA-520-008)
  • Medicare Current Beneficiary Survey, 1991 and 1992 (QG-034-001, QG-034-002)
  • World Data 1995: World Bank Indicators on CD-ROM (CB-528-001)

Statistical Agencies at Risk

As DPLS News goes to print, the U.S. FY96 budget battle is still being waged. Several statistical agencies are marked for funding cuts, elimination, or consolidation, although differing versions are still floating around Congress. The Horn Bill, H.R. 2521, known as the Statistical Consolidation Act, would create a "Federal Statistical Service," combining the functions of the Census Bureau, the Bureau of Labor Statistics and the Bureau of Economic Analysis, advised by a non-government council of "experts". The Department of Commerce Dismantling Act (H.R. 1756) calls for moving the Census Bureau from the Commerce Dept. to the Dept. of the Treasury, places the BEA under the Federal Reserve, and axes the Economic Development Administration. The Senate version (S. 929) places all Commerce functions under the Dept. of Labor. We will attempt an update in the next issue.

School Choice Data

by Christopher A. Thorn

As soon as the Milwaukee Parental Choice Program is archived at the DPLS, it will be available to users through the World Wide Web. We plan to make both a replication dataset and raw data available to other researchers. The study includes:

I. Enrollment Data
II. Survey Data - Demographic and Attitudinal
III. Test Data (Iowa Test of Basic Skills - Reading and Math scores).

One element of the enrollment data is the Student Record Data Base for Choice and Milwaukee Public School (MPS) Control Group Students from 1990-1994. This information is provided by MPS. These data give descriptive information on MPS and Choice students and their year-to-year public school enrollment status. The Choice Applied Status File 1990-1994 is compiled from the study's master database. This data set contains program status for students who applied to Choice. We also include a "school descriptor" file which includes X-Y coordinates and MPS school characteristics compiled from the MPS published document "School Report Card."

The survey files contain several different survey types. The first is the MPS Control Group survey from 1991. The comparable Choice survey file is the Wave 1 survey. The Wave 1 (Fall) Surveys were done in years 1990-1994. Each time, all who applied to Choice in the survey year were surveyed. In order to look at the effects of the Choice schools on parental involvement and attitudes, we also resurveyed in the Spring. Wave 2 (Spring) Surveys were done in years 1991-1995. Each time, all who had ever been accepted to Choice and all non-accepted Choice applicants from the survey year were surveyed. In response to unexpectedly high attrition from the Choice program, we added an attrition survey in the Summer after the second year. The Attrition Surveys were done in years 1992-1995. All students who had been enrolled in Choice in the prior year but had not re-enrolled in the next year were surveyed.

The test data contain all MPS test records from the MPS Control Group as well as all tests taken by Choice applicants both prior to and after any Choice enrollment. The Choice enrolled students were tested every year. Their private school scores are also included in the test data sets. A number of interesting groups can be studied with these data. We have looked at the performance of Choice applicants who were not accepted into the program to look at the effects of an attempted exit from the public school system. One could also examine the scores of attriters from Choice or the effect of different types of public schools (magnet, P5, etc.) on test scores.

Methodology Conference

The International Sociological Association is sponsoring the fourth International Social Science Methodology conference at the University of Essex in the U.K., July 1-5, 1996. The conference theme is "Data Connection: Data Construction, Cumulativity and Theory." Abstracts are invited and due by January 20, 1996. The conference URL is http://www.essex.ac.uk/essex96. (This link is no longer active, as of 7/10/98.)

Internet Corner

Labor Force Statistics from the CPS

Enjoy it while it lasts, now that it's back from furlough. (Yes, government web sites were also shut down last month.) In our humble opinion, the Bureau of Labor Statistics has done a real nice job with this site, which serves data and information on the labor force aspects of the Current Population Survey (a Census Bureau project which is commissioned by BLS). Headed by icons for Overview, FAQ, and Contacts, the site breaks down into News Releases, Data, Publications, and Related Programs. Under "Data, Most Requested Series," one is presented with a long list of headings as general as "Unemployment Rate - Civilian Labor Force", and as specific as "Civilian Employment-Population Ratio 20 Yrs. & Over White Female", etc. At the bottom of the page is a form in which one can select years and formats for the output file. Point your browser to http://stats.bls.gov/cpshome.htm .

IPUMS Update

The Integrated Public Use Microdata Series (IPUMS) is now publicly available from the Social History Research Lab at the University of Minnesota, http://www.hist.umn.edu . This project consists of 23 high-precision samples of the U.S. Census, taken from the census years 1850-1990. The samples from these eleven census years are combined into a single database, with uniform codes and integrated documentation. The data and documentation are free; the IPUMS staff will sell a hard copy version of the user guide for $30.00.

The IPUMS project staff have also created "small" and "tiny" random extracts of the Integrated Public Use Microdata Series for all census years. According to the IPUMS staff, the "small" samples contain about 20,000 cases, and should be "sufficient for many applications". The "tiny" samples contain about 2,000 cases, and are useful for "quick diagnostics and perhaps some instructional purposes." As of November 10, 1995, these extracts are not available via the IPUMS web page. They are accessible via anonymous FTP at torgo.hist.umn.edu (use your email address as the password), in the directories ipums/data/small and ipums/data/tiny.
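For readers who would rather script the transfer than type FTP commands, the anonymous login described above might be sketched as follows, using the FTP client in Python's standard library. The host and directory names are the ones given in the text; the function names and the example email address are hypothetical, and nothing is contacted until list_extracts is actually called.

```python
from ftplib import FTP

HOST = "torgo.hist.umn.edu"  # host named in the text
DIRECTORIES = ["ipums/data/small", "ipums/data/tiny"]  # paths named in the text


def anonymous_credentials(email):
    """Anonymous FTP convention: user 'anonymous', email address as password."""
    return "anonymous", email


def list_extracts(email, host=HOST, directories=DIRECTORIES):
    """Log in anonymously and return the file names found in each directory.

    Hypothetical helper: it assumes the server is reachable and still
    laid out as described in the text.
    """
    user, password = anonymous_credentials(email)
    listing = {}
    with FTP(host) as ftp:  # opens the connection, closes it on exit
        ftp.login(user=user, passwd=password)
        for directory in directories:
            listing[directory] = ftp.nlst(directory)
    return listing


# Example call (placeholder address; uncomment to attempt a connection):
# print(list_extracts("user@example.edu"))
```

Once a file name is known, the same connection could fetch it with the library's retrbinary method.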

AIDS Data Animation Project

Using mortality data from the National Center for Health Statistics, the AIDS Data Animation Project has created a web site which provides still frame and animated maps of regional United States AIDS mortality trends. The maps depict weekly AIDS mortality rates from 1981-1993. Documentation is available at the site as well. The URL is: http://www.ciesin.org/datasets/cdc-nci/cdc-nci.htm . The still frame images are in GIF format; the animated maps require an MPEG viewer and may be quite slow to appear. File sizes are listed for each mapping.

Geographical Information Systems (GIS)

The following web site is under development, and its maintainers are requesting comments for improvements: http://www.cyntax.com.au/gis. (This site is no longer able to support this service, checked on 7/10/98.) It provides links to several GIS and related sites. Data-providing sites include the U.S. Geological Survey's Declassified Satellite Photographs, U.S. Census TIGER Data, and the Environmental Systems Research Institute. Some sites referenced in the "Data" portion of this site do not provide access to data, but are descriptions of GIS products such as CD-ROMs for sale, along with links to various technical documents and utilities for the PC ARC/INFO software.

