I seem to have gotten into trouble by not being clear about what I would hope social survey data might contain. Here are some proposed standards:
- everything, including questions and response sets, shall be kept in machine-readable form
- it shall be possible to reconstruct the complete survey procedures and instruments from the stored data without human intervention
- if this information is maintained within a statistical package or in a markup language, programs to translate it into a simple, open, human- and machine-readable form such as CSV shall be available, together with their inverses, so that the translations can be verified
- routine regression tests shall be made by translating the simple machine-readable files back into the archive or working format, so that changes to the statistical packages or the markup language that affect the data can be discovered
- whether intended for distribution or not, raw data shall be kept in this way, in addition to any corrected data that has been prepared for general use
- meta-data describing all the data, including questionnaire and codebook data, shall be maintained as data according to these same standards
- a description of filters used to subset the data into private, academic and public releases shall be maintained as data according to these standards
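To make the translate-and-verify idea concrete, here is a minimal sketch in Python of a translator, its inverse, and the round-trip check that a routine regression test would run. Everything named here is an assumption for illustration: the list-of-dicts "working format" merely stands in for whatever a statistical package or markup language would hold, and the field names and sample questions are invented.

```python
import csv
import io

# Hypothetical working format: a list of dicts, standing in for the
# internal representation of a statistical package or markup language.

def to_csv(records, fieldnames):
    """Translate the working format into plain CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def from_csv(text):
    """Inverse translation: plain CSV text back into the working format."""
    return list(csv.DictReader(io.StringIO(text)))

def round_trip_ok(records, fieldnames):
    """Regression check: translating out and back must reproduce the data."""
    return from_csv(to_csv(records, fieldnames)) == records

# Invented sample of questionnaire meta-data, kept as data like everything else.
FIELDS = ["question_id", "text", "responses"]
questions = [
    {"question_id": "Q1", "text": "How satisfied are you, overall?",
     "responses": "1=low;5=high"},
    {"question_id": "Q2", "text": "Age at last birthday",
     "responses": "integer"},
]

if __name__ == "__main__":
    print("round trip ok:", round_trip_ok(questions, FIELDS))
```

Run after every upgrade of the package or markup tooling, a check of this kind would flag any change that silently alters the stored data.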
The total effect of these standards is to produce a collection of data and meta-data which can be processed automatically, without human intervention, by relatively simple programs whose operation can be checked and debugged easily. The purpose of these proposed standards is to support social technology, the automated use of this data, instead of merely social science, although the social technology to be developed will also provide tools for the scientific researcher.
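As one illustration of the kind of simple, checkable program meant here, the release filters could themselves be stored as a small CSV table and applied mechanically. This is only a sketch under stated assumptions: the variable names are invented, and the rule that each release includes everything allowed in the more open ones (public within academic within private) is my assumption, not part of the proposal.

```python
import csv
import io

# Hypothetical filter table, kept as data: one row per variable, naming
# the most open release in which that variable may appear.
FILTER_CSV = """variable,release
respondent_id,private
postcode,academic
age_group,public
q1_answer,public
"""

# Assumed ordering: a release also contains everything permitted
# at the more open levels.
LEVELS = {"public": 0, "academic": 1, "private": 2}

def allowed_variables(filter_text, release):
    """Return the variables that may appear in the given release."""
    rows = csv.DictReader(io.StringIO(filter_text))
    return [r["variable"] for r in rows
            if LEVELS[r["release"]] <= LEVELS[release]]

def subset(records, filter_text, release):
    """Drop every variable not permitted in the given release."""
    keep = allowed_variables(filter_text, release)
    return [{k: rec[k] for k in keep} for rec in records]

if __name__ == "__main__":
    raw = [{"respondent_id": "0001", "postcode": "AB1",
            "age_group": "30-39", "q1_answer": "4"}]
    print(subset(raw, FILTER_CSV, "public"))
```

Because the filter is a table rather than code buried in a package, it can be archived, translated and regression-tested in the same way as the survey data.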