The Importance of Data Documentation for Survey Data Harmonization

by Marta Kołczyńska, The Ohio State University and Polish Academy of Sciences

Data, according to the United Nations Statistical Commission, are “the physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means” (UNSC 2000: 6). In other words, for information to qualify as data, it needs to be usable. Usable survey data depend on the availability and high quality of documentation.

Survey documentation refers to information on when, where, how and by whom the study was conducted, including information on the type of sampling, the size of the sample, the response rate, the preparation of the questionnaire and other instruments, as well as pretesting and fieldwork control. In the Internet age, this information should accompany the survey data set in the form of one or more documents electronically available for viewing and downloading.

The main goal of any statistical analysis using survey data is to draw inferences about the target population. The precondition is that the survey sample is representative of the population. Representativeness can be approached in different ways and met to different degrees.

The researcher ultimately has to decide whether a given survey sample is sufficiently representative to address their research problem. This decision requires knowledge about sampling, including the sampling scheme, the sampling frame and, where applicable, details of stratification or other methods. Additional aspects of the survey process, such as response rates and fieldwork control, are also important to review in order to assess survey data quality.

In the case of cross-national studies, it is also advisable to review the survey tools, typically questionnaires, and the process of their creation, including what translation procedure was applied and whether the questionnaires were pretested. Best practices for translation are debated in the field of survey methodology (see e.g. Harkness, Pennell, and Schoua-Glusberg 2004; Harkness, Villar, and Edwards 2010). However, the consensus is that high-quality translation is a prerequisite for comparability of data collected in different linguistic and cultural contexts. Information on the translation procedure must be provided in the survey documentation for a given country.

Pretesting is not only a way of validating the translation to avoid information loss or changes in the meaning of the basic concepts; it is also a way to assess the degree to which the questionnaire meets the criteria of acculturation (i.e. to what extent it fits the mindset of potential respondents). If information about pretesting is lacking or inadequate, then, justifiably, researchers have lower confidence in the data.

Similarly, high-quality surveys usually perform some kind of fieldwork control that typically consists of a personal visit or phone call to back-check the previously collected data. Regardless of the method, fieldwork control is generally beneficial because it improves interviewers’ performance. Again, if there was no fieldwork control or information about it is not provided in the survey documentation, researchers worry about the quality of those data.

Documentation – at least in the case of surveys – is an integral part of the data. Information about sampling, response rate, translation of the questionnaire, pretesting and fieldwork control cannot be found in the numerical data recorded in computer files, but it is important for interpretation of these data. In the case of comparative studies, variations in documentation quality within and across international projects should be recorded as survey-quality indicators.

Working within the Harmonization Project makes this point clear. In searching through the documentation of the 22 international survey projects listed in Table 1 in this Newsletter, my colleagues and I have found wide variation in the standards of documentation accompanying each data set. So far, we have created five variables describing the data documentation of all 1726 national surveys: (1) response rate, that is, whether this information is provided or not, (2) the numerical value of the response rate, if given, (3) whether there is any indication of efforts to control the quality of the questionnaire translation, (4) whether there is any indication of questionnaire pretesting, and (5) whether there is any indication of fieldwork control (Schoene and Kołczyńska 2014). With the exception of the numerical value of the response rate, all variables are dummies (1 = yes, 0 = otherwise). The distribution of these variables differentiates national surveys enough to claim that surveys from the selected international projects are of varying quality.
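
To make the coding scheme concrete, the five documentation variables could be represented roughly as in the Python sketch below. This is only an illustration under stated assumptions, not the Harmonization Project's actual code: the class, function, and field names are hypothetical, and the two example records are invented and do not describe any real survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyDocumentation:
    """What a coder found in one national survey's documentation.

    All field names are hypothetical and chosen for illustration only.
    """
    survey_id: str
    response_rate: Optional[float]  # in percent; None if not reported
    translation_checked: bool       # any indication of translation quality control
    pretested: bool                 # any indication of questionnaire pretesting
    fieldwork_controlled: bool      # any indication of fieldwork control


def code_documentation(doc: SurveyDocumentation) -> dict:
    """Return the five documentation variables for one national survey.

    Four of them are dummies (1 = yes, 0 = otherwise); the response rate
    value stays numeric and is None when the documentation omits it.
    """
    return {
        "survey_id": doc.survey_id,
        "rr_reported": int(doc.response_rate is not None),
        "rr_value": doc.response_rate,
        "translation_control": int(doc.translation_checked),
        "pretesting": int(doc.pretested),
        "fieldwork_control": int(doc.fieldwork_controlled),
    }


def share_documented(rows: list, variable: str) -> float:
    """Share of surveys for which a given dummy variable equals 1."""
    return sum(row[variable] for row in rows) / len(rows)


# Invented example records, for illustration only.
coded = [
    code_documentation(SurveyDocumentation("survey_A", 62.5, True, False, True)),
    code_documentation(SurveyDocumentation("survey_B", None, False, False, False)),
]

for name in ("rr_reported", "translation_control", "pretesting", "fieldwork_control"):
    print(name, share_documented(coded, name))
```

Keeping "response rate reported" separate from the response rate value means that a missing value is itself recorded as information about documentation quality rather than being silently dropped.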

We aim to build documentation quality controls into statistical analyses of the Harmonization Project database, to check empirically the consequences of weak documentation standards in cross-national projects. In doing so, we hope to contribute to the discussions about how to increase confidence in extant cross-national survey data.

Marta Kołczyńska is a PhD student at the Department of Sociology, The Ohio State University, and a research assistant in the Harmonization Project.