The staff of the Wisconsin Longitudinal Study told me proudly that they have more than 14 thousand variables. In a sense that is true, but I am going to work with them in a way that makes the number more like 100 thousand. What the scientists who run social surveys call a variable is not really a variable in the mathematical sense. One should imagine not a single column of codes like 1, male; 2, female; 3, refused. Instead, a mathematician might think of these as three columns of a matrix. The fact that only one answer appears in the set of three columns does not necessarily change this. Rather, the numbers in each column can be read as probabilities: a zero in one column means certainty that the attribute does not apply to the individual, and a one in another column means certainty that it does.
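To make the column view concrete, here is a minimal sketch (not code from the WLS, and the codes are hypothetical) of expanding one coded survey variable into one indicator column per response category:

```python
# Sketch: turn one coded survey variable into indicator columns.
# Hypothetical codes: 1 = male, 2 = female, 3 = refused.

CODES = [1, 2, 3]  # the possible response codes for this question

def to_indicator_columns(responses):
    """Expand a list of codes into one 0/1 column per code.

    Each row has exactly one 1; the values can be read as
    probabilities that the attribute applies (0 or 1 = certainty).
    """
    return [[1.0 if r == c else 0.0 for c in CODES] for r in responses]

rows = to_indicator_columns([1, 2, 2, 3])
# rows[0] == [1.0, 0.0, 0.0] -> certain the first code applies
```

One question with three response codes thus becomes three mathematical variables, which is how 14 thousand survey variables can grow to something like 100 thousand columns.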
Now, as we all know, there are few certainties in this world. People make mistakes, they lie, their responses are coded incorrectly. So various data-massage operations can be performed, as described in an earlier post. The results can still be interpreted as probabilities, but they will be more realistic. This, together with the other operations discussed in that earlier post, is part of data correction. A social scientist might argue against this level of data correction, but it has great value, especially when done automatically for the purposes of social technology.
I will have no access to computers that can do this level of correction, or make proper use of a survey anywhere near the size of the WLS, but I will be able to use some of it. I am still not sure how to select which columns to use. A tentative plan is to collect the columns representing the questions and responses of most significance, such as those for gender, age, race, religion, and education. Given these, it should be possible to build predictors for the other columns. When a predictor is quite accurate, it corresponds to a column that gives us little information. When a predictor fails almost completely, the column carries important new information. Repeating this process over and over, I should be able to locate perhaps as many as 4,000 columns which my machine will be capable of processing.
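The selection idea can be sketched as follows; the method and names here are my own (a simple majority-vote predictor within groups defined by a core column), not a description of how it will actually be done:

```python
# Sketch of the selection idea: predict each candidate column from a
# "core" column. High accuracy means the candidate adds little
# information; low accuracy means it is worth keeping.

from collections import Counter, defaultdict

def prediction_accuracy(core, candidate):
    """Accuracy of predicting `candidate` from `core` by taking the
    majority candidate value seen within each core group."""
    groups = defaultdict(Counter)
    for c, v in zip(core, candidate):
        groups[c][v] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(core)

core = [0, 0, 1, 1]          # e.g. a gender column
redundant = [5, 5, 7, 7]     # fully determined by core
informative = [1, 2, 1, 2]   # core predicts it no better than chance

prediction_accuracy(core, redundant)    # 1.0 -> little new information
prediction_accuracy(core, informative)  # 0.5 -> worth keeping
```

In practice one would predict from all the core columns at once and with a less naive model, but the scoring principle is the same: rank candidate columns by how badly they are predicted, and keep the hardest ones.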
It may be that this can be done iteratively: after each processing step, set aside the less useful columns as sufficiently processed, then seek replacements for them from the remaining pool. I am not sure whether this will work. Even if it does, the amount of processing may be excessive.
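The iterative variant might look something like this sketch, where every name and the threshold are hypothetical; the pool is processed in machine-sized batches, highly predictable columns are archived, and their slots are refilled:

```python
# Sketch of the iterative variant (all names and values hypothetical):
# process the column pool in batches the machine can hold; archive
# columns that are highly predictable, retain the rest.

def select_iteratively(pool, score, capacity, threshold=0.9):
    """pool: column names; score: name -> predictability in [0, 1].
    Returns (retained, archived) column-name lists."""
    pool = list(pool)
    retained, archived = [], []
    while pool:
        batch, pool = pool[:capacity], pool[capacity:]  # fill the machine
        for name in batch:
            if score(name) > threshold:
                archived.append(name)   # predictable: little new information
            else:
                retained.append(name)   # hard to predict: keep it
    return retained, archived

scores = {"age": 0.2, "zip": 0.95, "income": 0.5, "gender_dup": 0.99}
retained, archived = select_iteratively(scores, scores.get, capacity=2)
# retained == ["age", "income"], archived == ["zip", "gender_dup"]
```

The worry about excessive processing is visible here: every batch requires re-scoring its columns against the current core, so the cost grows with both the pool size and the number of passes.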
This is something I will be working on for quite a while, but at least I know enough about the variables to make a good start. — dpw