Abstracts for Wednesday, 26 May 2004
Plenary
The Future of Social Science Data Archiving in the United States: A Discussion with Five Archive Directors
Chair: Margaret O. Adams
Myron Gutmann (ICPSR)
Kenneth Bollen (The Odum Institute, University of North Carolina)
Michael Carlson (National Archives and Records Administration)
Copeland Young (The Murray Center, Radcliffe College)
Richard Rockwell (The Roper Center, University of Connecticut)
A1: The Diverse World of Digital Libraries
Chair: Tess Trost
Multimedia Oral History Database
Zoltan Lux (Institute for the History of the 1956 Hungarian Revolution)
The Oral History Archive at the 1956 Institute in Budapest contains about a thousand life interviews. A good two-thirds of these tie in with the 1956 Hungarian Revolution, as they contain recollections by participants or their children. They vary in length between 50 and several thousand pages. Each was made as a sound recording.
Thanks to a successful competitive funding application, work began in 2003 on digitizing the recordings and the texts. The purpose is to preserve the interviews in digital form. Since the existing database management system cannot store large files, we began by devising a data-archiving system based on Oracle. This was not intended merely to provide efficient data storage. It also set out to meet the standards of international practice (DDI) and to support later development of a more efficient content-based search facility. Much of the material is confidential and cannot be published, so great care had to be taken in devising a system of graded access entitlements.
Special attention has been paid to enabling interconnection with other databases, for which we are seeking partners. At present, a provisional test version is accessible at the URL: server2001.rev.hu/oha/index.html.
Economic Growth Center Digital Library: Creating Access to Statistical Sources Not Born Digital
Ann Green (Yale University)
Julie Linden (Yale University)
The Economic Growth Center Digital Library (ssrs.yale.edu/egcdl), funded by the Andrew W. Mellon Foundation, digitizes and makes accessible a selection of Mexican state statistical abstracts from Yale University Library's Economic Growth Center Library Collection.
In a departure from most digital libraries, which concentrate on images or texts, EGCDL focuses on statistical tables. This project addresses issues and challenges unique to statistical materials, such as:
- Evaluating whether common digitization practices and standards, generally developed for images and text, are ideally suited to statistically-intensive documents.
- Automating metadata production for thousands of PDF and Excel files.
Detailed table-level metadata records will be created in XML according to the Data Documentation Initiative (DDI) specification, including the DDI aggregate data extension. In addition to a user interface that presents the PDF versions of the statistical abstracts along with individual tables from the series, a selection of tables and metadata also will be presented in the Nesstar system. This will allow users to browse lists of tables by topic, state, and year, and to search across the entire collection for specific individual tables.
We also address the long-term preservation of the digital materials produced in this project and their relationship to the original printed source materials.
A Digital Library in a Multilingual Environment
Cor van der Meer (Fryske Akademy)
Mercator is a network of three research and documentation centres dealing with the regional and minority languages which are spoken by more than forty million citizens of the European Union. The Mercator-Education centres started with a pilot project for the creation of a digital library on European Minority Languages with text, image and sound. The project is financed by the Royal Dutch Academy of Scientists. The pilot will take one year and will be carried out with Frisian digital material. The aim is to develop a digital library to index, classify and catalogue scientific sources concerning European minority languages and knowledge of and from academic researchers. The content will cover linguistics, sociolinguistics, literature, media, legislation, education, culture history and language policy. The pilot will serve as a development model for partners in other European linguistic communities.
This presentation will focus on some of the crucial issues of this project, such as the creation of a user profile and content plan. Also very important is the set of requirements and functionalities, which should help to decide which types of applications or tools are necessary. Formats and standards have to be chosen for metadata, the repository, the publications, etc. Organisation is a crucial aspect, but practical matters such as 'flexible' user interfaces are also very important.
A2: Pulling It All Together: Strategies in Data Preparation
Chair: San Cannon
Separating Our Concerns: Evaluating the Use of Apache's Cocoon Project to Efficiently Manage Data Tasks at the Minnesota Population Center
William C. Block (Minnesota Population Center)
Each year more and more social science data is made available to researchers, and with it comes an ever-surging demand for easy access to data over the web. Such demands place a substantial technical burden on data providers, who must constantly prepare and update datasets, websites, and related documentation. At the Minnesota Population Center such activities are carried out by a small army of workers, including research staff, programmers, and web designers, who specialize in various aspects of the data preparation and dissemination process.
As the size and complexity of our data projects have grown, so has our need to accomplish our tasks efficiently. Getting research staff (who do most of the data preparation), programmers (who often do "back end" processing work), and web designers (responsible for the "front end" look and feel of our projects) to work efficiently together can be a major challenge. This presentation will describe our evaluation of Apache's Cocoon Project to "separate the concerns" of our research staff, programmers, and designers so that each group can work independently yet in parallel to achieve its tasks efficiently.
Mixing It: Preparing Qual+Quant Data Collections for Dissemination: Experiences from the UK Data Archive
Louise Corti (UK Data Archive, University of Essex)
In this paper I will provide an overview of some of the challenges faced by the UK Data Archive in accessioning and processing mixed methods collections, i.e., those comprising quantitative and qualitative data. I will discuss issues pertaining to: data preparation including issues of case-linkage and anonymity; descriptive/cataloguing requirements; documentation or user guide preparation; and staff skill and training requirements. Two case studies will be used to illuminate the problems and solutions.
Practical Viability of Multiple Imputation as a Tool for Disclosure Protection for Large Scale Recurring Surveys
Pat Doyle (U.S. Census Bureau)
The literature often cites potential threats to the continued viability of microdata products arising from the increased availability of administrative data in the public domain and the decreased barriers to access by individuals not skilled in data processing. Yet demand for such products continues to rise as research and public policy demands on data become more sophisticated and require more in-depth analysis of the complexities of modern society. If the threat becomes real and the demand for microdata continues, the social science community will need an alternative to the traditional microdata products.
Current research toward replacements for public use microdata files includes, among other options, proposals to disseminate analytically valid synthetic microdata. To date the research has focused on the methodology and on experiments designed to determine validity of the approach. There is another area of research needed to determine whether such methods can gain acceptance as a production tool by the data producers and the data users in the statistical community. In particular, producers need to understand what they can do to ensure users will have faith in the quality of the estimates derived from synthetic data.
This presentation solicits feedback on the concept of disseminating synthetic data generated from a multiple imputation synthesizing methodology currently under development.
A3: Collaboration among Data Providers: Strength in Numbers
Chair: Ernie Boyko
Data Curation and Digital Preservation: A View from the UK: Parts 1 and 2
Peter Burnhill (EDINA National Data Centre and University Data Library)
Robin Rice (EDINA National Data Centre and University Data Library)
The Digital Curation Centre (DCC; www.dcc.ac.uk) has been established and funded by the UK government to provide leadership to the academic community on the related problems of scientific data curation and the long-term digital preservation of scholarly output. The funders awarded the bid to a consortium of four UK institutions, led by the University of Edinburgh, to provide a range of services for the initial three years of the centre's funding. The other partners are the University of Glasgow's Humanities Advanced Technology and Information Institute, the Council for the Central Laboratory of the Research Councils at Rutherford Appleton and Daresbury Laboratories, and the UK Office for Library and Information Networking at Bath. Each site will contribute a different expertise to the Centre, which is currently in the set-up phase of its operation.
This paper will describe how a widely distributed partnership is being managed to achieve several 'proper tensions:'
- between the needs of the hard sciences, which represent one end of the continuum, and the needs of the soft disciplines of the social sciences and humanities at the other end;
- between the need for cutting edge research which will improve the state of knowledge about preservation and database curation, and the need for quick development of tools tuned to the immediate needs of the users; and
- among a vast array of international standards efforts and preservation tools developed under hugely disparate circumstances, all of which will be in competition for certification or publicity by the Centre, to be rubber-stamped (or not) as deserving adoption by communities of practice.
Peter Burnhill, Director (Phase One) of the DCC during the set-up phase, will outline some of the drivers behind the decision to set up the Centre, the strategy being adopted to engage such a diverse range of communities, and the approach being taken to make an organisation from four partner institutions, drawing upon experience gained in setting up the EDINA National Data Centre nine years ago.
Robin Rice, Phase One Project Coordinator, will describe what the social sciences have both to offer and to learn from the other disciplines in the emerging fields of data curation and digital preservation, with a focus on the current state of the art and the challenges ahead.
Archiving Historical Research Data
Hans Jorgen Marker (Dansk Data Arkiv)
Historical research increasingly uses and produces data, and archiving these data is relevant for the same reasons that archiving social science data is relevant. Archiving historical research data also raises some issues that are less pressing when dealing with social science data.
Discussions among the institutions that archive historical data were quite lively 10 to 15 years ago, but until recently there has been a period of silence. Some of the archives involved are trying to remedy that, because cooperation is as relevant when archiving historical data as when archiving social science data. Indeed, Social Science Data Archives and History Data Archives share a broad area of common problems, and many of the Social Science Data Archives are, to a greater or lesser extent, custodians of historical research data.
This presentation will point to some relevant areas of cooperation.
B1: Assessing User Needs and Data Services
Chair: Donna Tolson
Thinking Strategically: Development of a Library Data Services Plan
Katherine McNeill-Harman (Massachusetts Institute of Technology)
Over the past several years, developments in technology and research have changed the ways in which libraries and their users interact with social science data. Moreover, the integrated and interdisciplinary nature of data requires collaboration among departments and organizations, as well as with providers of data related to GIS and scientific applications. These increasing and changing demands on the part of users present challenges for institutions in allocating their limited resources. In order to plan strategically to meet these needs, the MIT Libraries conducted a project to create a three-year Data Services Plan. The plan contains goals for reference, instruction, collection development, personnel, facilities, computing, evaluation, and implementation. This presentation will describe the process of creating the Data Services Plan, including user studies, staff input, and research among peers in the social science data community. Additionally, it will discuss challenges faced, the development of priorities, and strategies for implementation.
Data Services Awareness and Use Survey: Assessing Secondary Data Needs at the University of Tennessee
Eleanor J. Read (University of Tennessee)
In recent years, the University of Tennessee has been striving to increase awareness and use of data services provided by the Libraries. A major move in that direction was hiring, for the first time, a data services librarian who could provide more specialized and proactive service to campus researchers. After three years with this new arrangement, we decided to conduct a survey to learn more about our secondary data users and to gauge the effectiveness of our various promotional activities. This session will describe the process used to gather information from faculty and graduate students in a variety of departments about the use of secondary data in their research, and about their awareness and use of the Libraries' Data Services. The results of the survey, completed by about 375 respondents, will be used to help plan future services and target groups that are potential data users.
Building the Statistical Knowledge Network: A Progress Report
Carol Hert (Syracuse University)
Finding and using statistics can be challenging because such information is located in multiple places and exists in large volumes. Efforts such as FedStats (www.fedstats.gov) address the challenge by providing gateways. Our project takes these efforts further by proposing the Statistical Knowledge Network (SKN).
We envision a seamless network, where users have transparent access to varied statistical information. The SKN would enable people to find statistics without having to know particular sources, and provide context for understanding and use.
Over the last 4 years, we have been developing the SKN: developing a suite of tools for end-users, conceptualizing the architecture, and conducting user studies. In this presentation, we present a status report on our work to date and our future directions.
Acknowledgments: Other contributors to this work are Gary Marchionini and Stephanie W. Haas of the University of North Carolina-Chapel Hill, and Ben Shneiderman and Catherine Plaisant, of the University of Maryland-College Park. This material is based upon work supported by the National Science Foundation (NSF) under Grant EIA 0131824. Project information is available at http://ils.unc.edu/govstat.
B2: Peopling the Grid: A Panel Discussion of Institutional Solutions to Data Exchange
There is a lot of talk right now about the Grid and its application to social science research. What often seems to be lost in the glitter of Grid-talk are the underlying institutional relationships that are a prerequisite for all fancy techno-fixes to function. It is no good talking about using the Grid to pull together disparate data housed in multiple international locations for simultaneous distributed analysis if there are not data access and exchange agreements and systems that permit this kind of movement of data. This panel discussion will seek to examine more closely different models for data exchange and access arrangements among collaborating institutions.
Melanie Wright and Lucy Bell (UKDA, University of Essex, UK) will investigate the solution of centralised authentication regimes by discussing the UKDA's experience with the implementation of Athens Single Sign On, both positive and negative. Keith Cole (ESDS International, MIMAS, University of Manchester, UK) will explore the role of licensing and the experience of ESDS International in negotiating unconventional countrywide dissemination rights. Reto Hadorn (Swiss Information and Data Archive Service for the Social Sciences (SIDOS)) will present the efforts of CESSDA to evolve its trans-border data exchange agreement towards a model of exchange of user rights rather than data itself.
It is hoped a lively and free-ranging discussion of the issues will follow the brief presentations.
One-Stop-Shop: UKDA Implementation of Athens
Melanie F. Wright (UK Data Archive, University of Essex)
Lucy Bell (UK Data Archive, University of Essex)
International Data Licensing: The Experience of ESDS International
Keith John Cole (MIMAS)
The CESSDA Transborder Data Agreement
Reto Hadorn (Swiss Information and Data Archive Service for the Social Sciences (SIDOS))
B3: The DDI Expert Committee: Who We Are, Where We're Going, and What It Means to You
Chair: Mary Vardigan
James A. Jacobs (University of California, San Diego)
Ilona Einowski (University of California, Berkeley)
Wendy Thomas (Minnesota Population Center)
In October 2003, the new DDI Expert Committee met for the first time, organized itself into working groups, and began the process of moving DDI development forward within the structure of the DDI Alliance. While from the outside this may appear to be an internal reorganization, what it means for the technical and conceptual advancement of the DDI is impressive. First, it means we have a broader group of people providing input and energy into the process. Second, it means we can move on a number of development fronts: structural reform to keep current with new technical implementation options; expanding the coverage of the DDI to deal with additional types of files and expanded information descriptions; and a new and better focus on you, the DDI user, and what you need to make sense of and apply the DDI to your work. Finally, it means we can provide you with realistic expectations of what will be accomplished and made available to you over the next year.
In this session, speakers from the three major Working Groups of the DDI Expert Committee--Structural Reform, Substantive Content, and Usability and Outreach--will provide an integrated overview of the Committee's current and future work.
C1: Reinventing a Data Archive in the 21st Century: Process Improvement at ICPSR
Chair: Myron Gutmann
Peggy Overcashier (ICPSR)
Cole Whiteman (ICPSR)
Halliman Winsborough (University of Wisconsin)
In 2003, as ICPSR began its fifth decade of service to the social science research community, the organization undertook a major review of its processes and practices, focusing particularly on the "data pipeline" process, to determine ways to improve study tracking, efficiency, and services.
To that end, an internal Process Improvement Committee (PIC) was established. Consulting with an expert on organizational workflow, this group charted the current pipeline process, from data acquisition through final public release, and recommended a set of minor improvements to the system. Taking its charge to the next level, the group also crafted guidelines for an "ideal" process incorporating technological innovation and a greater emphasis on standards.
Following submission of the final PIC report in October 2003, an External Process Review Committee was constituted to provide an impartial view and guidance in moving forward. This external committee met with staff in Ann Arbor and then submitted its own report with additional recommendations. ICPSR is now engaged in implementing recommendations from both reports.
This session will feature ICPSR Director Myron Gutmann as chair and representatives from the internal and external review committees. Discussion will focus on the changes that ICPSR is making in how it acquires, processes, distributes, and preserves data; the impact of those changes; and the change process itself.
C2: Mapping the Past with GIS
Chair: Steve Citron-Pousty
Counting Cows and Cabbages: Web-based Extraction and Delivery of Geo-referenced Data
Stuart Macdonald (Edinburgh University Data Library)
As we move towards a 'common geographic framework' for a range of data, the concept of 'walking across' geo-spatial resources as diverse as population censuses, digital mapping data, historic statistical data, and digital boundary data is becoming a reality, with the potential for introducing or removing 'layers' of geo-referenced data to suit the sophisticated needs of end-users. To use such data, users must be able to find it and ascertain its quality and suitability; hence the need for robust metadata with appropriate geographic tagging.
The Agricultural Data Service (AgDS), as part of Edinburgh University Data Library, supplies geo-referenced data, derived from Agricultural Censuses from 1969, on the distribution of agricultural activity in Great Britain. For any year the data are collected for groups of farm holdings and made available as grid square estimates at various resolutions based on the British National Grid.
This paper will describe the evolution from a command-line driven extraction and delivery service, to an online, web-based service complete with geo-interface allowing data visualisation and end-user interaction. Such a mechanism and resource forms part of a 'common geographic framework' that allows diverse geo-referenced data to be located by standardised common themes.
Effort Towards a Dutch Historical Geographic Information System
Luuk Schreven (Netherlands Institute for Scientific Information Services (NIWI))
The history department of the Netherlands Institute for Scientific Information Services (NIWI) has recently started a Historical GIS project. The project will be set up as a pilot that first of all focuses on the Dutch censuses that were held between 1795 and 1971. More historical datasets will become available through this GIS in the future if this pilot is successful. Within the project we will focus on a geographical level that is below the municipality. The least aggregated data available in the census records concern districts and neighbourhoods. This presentation will address the basic principles of our GIS project and the progress made thus far.
NHGIS: The Bonus Materials
Wendy Thomas (Minnesota Population Center)
The goal of the National Historical Geographic Information System (NHGIS) is to collect, describe, and provide access to U.S. aggregate data going back to 1790 and to create the boundary files for counties and tracts back to their inception. Our approach has always been to integrate over 300 file descriptions and millions of data item descriptions through the DDI metadata description. In doing this integration, we have created a wide range of auxiliary files describing:
- geographic entities and their relationships over time
- cross-walks between various coding systems over time
- legal name changes of geographic entities
- geographic hierarchies and their relationships to each other
- DDI instances of standard variables (ready to cut-edit-and-paste)
- and more.
For data users and particularly for data archivists and metadata creators these are truly bonus materials. The files are all ASCII fixed format and come with DDI compliant metadata. The modular approach of NHGIS lets you benefit from our work without tying you to the NHGIS system itself. This presentation will show you those materials currently available and what we're working on in the future. Hopefully our work will allow you to save time and increase the benefit of the NHGIS project to the research world.
C3: Data Management Infrastructures: Advances in Processing and Dissemination
Chair: Chuck Humphrey
UIS RUSSIA Technologies for Social Sciences Research Network
Tatyana Yudina (Moscow State University, UIS RUSSIA)
The UIS RUSSIA (University Information System RUSSIA), www.cir.ru, has operated since 2000 as a freely accessible Internet-based collective digital library for research and education in the social sciences. The system maintains holdings of social domain data and documents obtained from primary sources: government, non-governmental organizations and private holders. Currently the system integrates 1.5+ million documents from 60+ collections.
Users' increasing demand for additional holdings and the numerous high-quality resources maintained inside the research community have led the UIS RUSSIA team to develop a distributed network of high-quality holdings among participating organizations. The team is sharing the technology created with other participants ready to adopt the software to process their holdings and make the metadata available for the UIS RUSSIA search engine.
Cooperation has started with several journals, online sites and other resources. A user may search across these virtually integrated collections and download full text documents from a holder's server. This approach is particularly appropriate for some partners whose information cannot be held on remote servers due to its status or commercial interests. Support and troubleshooting are provided by the UIS RUSSIA team. The presentation will discuss the progress of this project.
New User Interface for Managing the Archiving Process in FSD
Jouni Sivonen (Finnish Social Science Data Archive)
In the beginning of its operations, the Finnish Social Science Data Archive (FSD) started to use a simple Access97 database for managing the archiving process of data. This database, called Tiipii, was developed gradually as the routine procedures for archiving were being established.
In 2002 FSD started a new project called Tiipii2, aiming to replace the old interface and database with a more user-friendly graphical user interface (GUI) and a new relational database. At the moment the GUI is at the testing stage. It will be used to control the data archiving process and to handle internal and external information services. The project has been implemented with open source tools. The paper presents the system, which consists of 1) a PostgreSQL database on a Linux platform, 2) Java code using J2SE, and 3) a CORBA architecture using JacORB, which is a free Java implementation of the OMG's CORBA standard.
Getting Wired: Caffeinating Microdata Production at the Minnesota Population Center with Java
Marcus Peterson (Minnesota Population Center)
Preparing new Integrated Public Use Microdata Series (IPUMS) datasets for public release can be a time-consuming and painstaking process. Even after the digitization and harmonization of a given dataset, considerable work is still required in disseminating the data and its supplementary documentation to the public. To expedite the turnaround of new and often disparate datasets, the Minnesota Population Center (MPC) has developed a suite of Java-based utilities for generating viewable microdata documentation. Powered by centralized metadata, these tools comprise a generalized application programmer interface (API) for documenting frequencies, coding schemes, and overall IPUMS variable design. This Java API employs object-oriented principles to minimize dataset-specific programming and to ensure the rapid deployment of new data. Furthermore, the IPUMS API provides the core of the newly redesigned web-based data dissemination system. These recent programming advances will enable MPC researchers to process and release new IPUMS data with increased accuracy, efficiency, and speed.