I have worked out a complete approach for the dataset chosen to start with, the Wisconsin Longitudinal Study. The first part is a method I have used before to get coordinates for rows, that is, for individual people, in this kind of dataset. Rows are compared using the full list of variables, almost 14 thousand of them: every variable on which two rows match exactly is counted, and that count is the similarity between the two rows. The first step is to find the two most distant rows, meaning the pair with the fewest matches, by comparing each row with all others. Further extremes are then added one at a time by comparing each row with all of the extremes extracted so far, starting with the first two; the row that is farthest from all of the existing extremes becomes the next one. This process slows down considerably as each new extreme is added. Each row's similarity to each member of the final set of extremes then serves as its coordinates, one coordinate per extreme. In my experience, at least 10 coordinates must be extracted for a dataset of this size.
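The extreme-finding procedure can be sketched roughly as follows. This is a minimal illustration on toy data, not the code I actually run. The function names (`similarity`, `find_extremes`, `coordinates`) are made up for the example, and it assumes one particular reading of "farthest from all of the existing extremes": the row whose highest similarity to any existing extreme is lowest. Other readings, such as lowest total similarity, would need a different selection rule.

```python
from itertools import combinations

def similarity(a, b):
    """Count the variables on which two rows match exactly."""
    return sum(1 for x, y in zip(a, b) if x == y)

def find_extremes(rows, k):
    """Return the indices of k extreme rows (hypothetical helper).

    Starts from the least similar pair of rows, then repeatedly adds the
    row whose highest similarity to any existing extreme is lowest.
    That selection rule is an assumption, not a confirmed detail.
    """
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    # Step 1: compare every row with every other and keep the least similar pair.
    first, second = min(
        combinations(range(len(rows)), 2),
        key=lambda pair: similarity(rows[pair[0]], rows[pair[1]]),
    )
    extremes = [first, second]
    # closest[i] is the highest similarity between row i and any extreme so far.
    closest = [
        max(similarity(r, rows[first]), similarity(r, rows[second])) for r in rows
    ]
    # Step 2: add extremes one at a time.
    while len(extremes) < k:
        candidates = [i for i in range(len(rows)) if i not in extremes]
        new = min(candidates, key=lambda i: closest[i])
        extremes.append(new)
        closest = [max(c, similarity(r, rows[new])) for c, r in zip(closest, rows)]
    return extremes

def coordinates(rows, extremes):
    """One coordinate per extreme: the row's similarity to that extreme."""
    return [[similarity(r, rows[e]) for e in extremes] for r in rows]

# Toy data: five people, four survey variables each.
rows = [
    ["a", "a", "a", "a"],
    ["b", "b", "b", "b"],
    ["a", "a", "b", "b"],
    ["a", "b", "c", "c"],
    ["a", "a", "a", "b"],
]
extremes = find_extremes(rows, 3)
coords = coordinates(rows, extremes)
```

On this toy data rows 0 and 1 share no responses, so they form the starting pair, and row 3 is added next because it matches each of them on only one variable. The sketch keeps a running maximum per row so that each added extreme costs one pass over the rows; the initial all-pairs comparison is still quadratic in the number of rows.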
The next step involves what I call microvariables, which are simple boolean variables derived from the variables defined in the survey. A survey variable will have several possible responses, and each possible response defines a microvariable: a column of booleans, one bit per row, set where that row gave that response. A set of coordinates for each column can then be defined as the mean of the row coordinates over all rows whose bit is set in that column. Thus each microvariable has its own set of coordinates, equal in number to the row coordinates. It is important to have these coordinates because they make the microvariables comparable, where the raw boolean columns are not; those are apples and oranges.
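A small sketch of the microvariable step, again on toy data and with invented names (`microvariable_coordinates`, `toy_rows`, `toy_coords`). It assumes row coordinates are already available as one list of numbers per row. Rather than building each boolean column explicitly, it treats each (variable index, response) pair as the microvariable and averages the coordinates of the rows that gave that response, which should be equivalent to averaging over the set bits.

```python
from collections import defaultdict

def microvariable_coordinates(rows, row_coords):
    """Map (variable index, response) to the mean coordinates of matching rows.

    Each key stands for one microvariable, i.e. one boolean column whose
    bit is set for exactly the rows that gave that response.
    """
    sums = {}
    counts = defaultdict(int)
    for row, coords in zip(rows, row_coords):
        for var, response in enumerate(row):
            key = (var, response)
            if key not in sums:
                sums[key] = [0.0] * len(coords)
            acc = sums[key]
            # Accumulate this row's coordinates into the microvariable's total.
            for d, c in enumerate(coords):
                acc[d] += c
            counts[key] += 1
    # Divide each total by the number of rows with the bit set.
    return {key: [s / counts[key] for s in acc] for key, acc in sums.items()}

# Toy data: three people, two survey variables, two coordinates per row.
toy_rows = [
    ["yes", "low"],
    ["no", "low"],
    ["yes", "high"],
]
toy_coords = [[4, 0], [0, 4], [2, 2]]
micro = microvariable_coordinates(toy_rows, toy_coords)
```

Here the microvariable for response "yes" on variable 0 is set for the first and third rows, so its coordinates are the mean of `[4, 0]` and `[2, 2]`. Every microvariable ends up with the same number of coordinates as the rows, which is what makes them comparable to one another.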
This is a fairly simple process, all of which I have done before on other data, so it should not be too hard for me to implement — dpw