NCES EDAT as a Data Source
June 29th, 2010
Just a quick note: some of the data which I wanted and could not get from the ICPSR, such as the Education Longitudinal Study of 2002, is available at http://nces.ed.gov/edat/ – I will continue to look for data sources other than the better-known but clearly obstructionist ICPSR. I have managed to find on my old disks some data which the ICPSR used to make available before they cracked down on it. I wish I had updates for it since, for example, my GSS data is years old. If I can find newer copies somewhere, I’ll let you know. — dpw
WLS SAS Catalog or Command Files and Bit Arrays
June 20th, 2010
I was a bit worried about getting the formatting data out of the SAS Catalog or Command files in the Wisconsin Longitudinal Study. These are ASCII files, and they are needed to read the CSV data files, but they are designed to be read by SAS, not by some program I might write. The answer turns out to be surprisingly simple. I used an ordinary text editor to strip off the first few lines and all the format lines at the end, then saved the result as a variable and value file, the whole of which is in one single format, like this:
value SEXRSP /* sex of respondent */
1 = 'male'
2 = 'female' ;
Then I did the opposite, saving only the format lines, which map each variable to a format name that may differ from the variable name:
format DEATYR DEATYR.;
format GROUP91 GROUP9A.;
These two new files can be read easily by a program, and that program should be easy to write.
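A reader for the two trimmed files might look something like the sketch below. It is only a sketch under assumptions: the function names `parse_value_blocks` and `parse_format_lines` are made up for illustration, the sample text is just the few lines quoted above, and the real WLS files may well contain constructs (value ranges, multi-line labels) that these simple patterns do not handle.

```python
import re

# Patterns for the two trimmed files. They accept straight or curly quotes.
VALUE_HEADER = re.compile(r"^\s*value\s+(\$?\w+)\s*(?:/\*\s*(.*?)\s*\*/)?", re.IGNORECASE)
VALUE_LINE = re.compile(r"^\s*(\S+)\s*=\s*['\"‘’](.*?)['\"‘’]\s*(;?)\s*$")
FORMAT_LINE = re.compile(r"^\s*format\s+(\w+)\s+(\$?\w+)\.\s*;", re.IGNORECASE)


def parse_value_blocks(text):
    """Return {format_name: {"label": str, "values": {code: label}}}."""
    formats = {}
    current = None
    for line in text.splitlines():
        header = VALUE_HEADER.match(line)
        if header:
            current = {"label": header.group(2) or "", "values": {}}
            formats[header.group(1)] = current
            continue
        entry = VALUE_LINE.match(line)
        if entry and current is not None:
            code, label, terminator = entry.groups()
            current["values"][code] = label
            if terminator:  # a trailing ";" closes the value block
                current = None
    return formats


def parse_format_lines(text):
    """Return {variable_name: format_name} from the 'format' lines."""
    mapping = {}
    for line in text.splitlines():
        match = FORMAT_LINE.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


# Sample input: only the lines quoted in this post, not a real WLS file.
SAMPLE_VALUES = """value SEXRSP /* sex of respondent */
1 = 'male'
2 = 'female' ;
"""
SAMPLE_FORMATS = """format DEATYR DEATYR.;
format GROUP91 GROUP9A.;
"""

value_formats = parse_value_blocks(SAMPLE_VALUES)
variable_formats = parse_format_lines(SAMPLE_FORMATS)
print(value_formats["SEXRSP"]["values"])
print(variable_formats)
```

Codes are kept as strings here, since that is how they arrive when the CSV file is read; converting them to integers would be a separate decision.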
It seems that the Python package bitarray will work. It provides only a one-dimensional array of bits, but as many of these as necessary can be put in a list, which gives something like a two-dimensional array of bits. I need the columns, not the rows, to be the one-dimensional arrays, so I will have to transpose the data. Did you know that you can transpose a two-dimensional list of lists in a single line of Python code, using the map and zip functions? I don’t know yet whether it will work for bit arrays, and especially don’t know whether it will work for something huge, but I’ll try. — dpw
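For the curious, the one-line transpose looks roughly like this in Python 3. The helper name `transpose` and the tiny sample are illustrative only, and plain lists of booleans stand in for bit arrays so that the sketch runs without any third-party package. Presumably each resulting column could then be packed with `bitarray(column)`, but that step is an assumption and is not tested here.

```python
def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    # zip(*rows) groups the i-th item of every row together;
    # map(list, ...) turns each resulting tuple back into a list.
    return list(map(list, zip(*rows)))


# Two respondents (rows) with three boolean answers each.
rows = [
    [True, False, True],
    [False, False, True],
]
columns = transpose(rows)
print(columns)  # [[True, False], [False, False], [True, True]]
```

Note that `zip(*rows)` passes every row as a separate argument, so the whole table has to be in memory at once. That may be the limiting factor for something huge.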
Row Coordinates, Comparing Microvariables
June 19th, 2010
I have worked out a complete approach for dealing with the current dataset, the one chosen to start with, the Wisconsin Longitudinal Study. First is a method I have used before to get coordinates for rows, that is to say for individual people, in this kind of dataset. One uses the full list of variables, almost 14 thousand of them, to compare rows. Every variable on which two rows match exactly is counted, and the count gives the similarity between those rows. The first step is to find the two most distant rows, by comparing each row with all others. Then further extremes are added, by comparing each row with all of the extremes extracted so far, starting with the first two. The row that is farthest from all of the existing extremes becomes a new extreme. This process slows down considerably as each new extreme is added. Each row’s similarity to every member of the final set of extremes then gives its coordinates. In my experience, at least 10 coordinates must be extracted for a dataset of this size.
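The procedure can be sketched in a few lines of Python. This is a toy illustration, not the actual implementation: the function names are invented, the four sample rows are made up, and "farthest from all of the existing extremes" is interpreted here as the row whose greatest similarity to any existing extreme is smallest, which is one reasonable reading among several.

```python
from itertools import combinations


def similarity(row_a, row_b):
    """Count the variables on which two rows match exactly."""
    return sum(1 for a, b in zip(row_a, row_b) if a == b)


def find_extremes(rows, n_extremes):
    """Return the indices of n_extremes mutually distant rows."""
    # Step 1: the two least similar rows, found by comparing every pair.
    first, second = min(
        combinations(range(len(rows)), 2),
        key=lambda pair: similarity(rows[pair[0]], rows[pair[1]]),
    )
    extremes = [first, second]
    # Step 2: repeatedly add the row least similar to its closest extreme
    # (an assumed interpretation of "farthest from all existing extremes").
    while len(extremes) < n_extremes:
        candidates = [i for i in range(len(rows)) if i not in extremes]
        if not candidates:
            break
        new_extreme = min(
            candidates,
            key=lambda i: max(similarity(rows[i], rows[e]) for e in extremes),
        )
        extremes.append(new_extreme)
    return extremes


def row_coordinates(rows, extremes):
    """One coordinate per extreme: the row's similarity to that extreme."""
    return [[similarity(row, rows[e]) for e in extremes] for row in rows]


# Made-up data: four respondents, four variables.
rows = [
    [1, 1, 1, 1],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [1, 2, 1, 2],
]
extremes = find_extremes(rows, 3)
coords = row_coordinates(rows, extremes)
print(extremes)  # [0, 2, 3]
print(coords)    # [[4, 0, 2], [3, 1, 3], [0, 4, 2], [2, 2, 4]]
```

The first step compares every pair of rows, and each later step compares every remaining row with every extreme found so far, which is why the process slows down as extremes are added.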
The next step involves what I call microvariables, which are simple boolean variables based on the variables defined in the survey. A variable will have several possible responses. Each possible response defines a microvariable, a column of booleans, single bits. A set of coordinates for each column can then be defined as the mean of the row coordinates over all rows whose bit is set in that column. Thus, each microvariable will have a set of coordinates, equal in number to the number of row coordinates. It is important to have these coordinates for the microvariables because the coordinates are comparable with one another, whereas the boolean columns themselves are not; those are apples and oranges.
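A small Python sketch of the idea follows. The names `microvariables` and `microvariable_coordinates` are invented for illustration, and both the responses and the row coordinates below are made-up numbers, not WLS data.

```python
def microvariables(variable_name, responses):
    """Build one boolean column per distinct response to a variable."""
    return {
        (variable_name, value): [response == value for response in responses]
        for value in sorted(set(responses))
    }


def microvariable_coordinates(bits, row_coords):
    """Mean of the row coordinates over the rows whose bit is set."""
    selected = [coords for bit, coords in zip(bits, row_coords) if bit]
    if not selected:
        return None  # no respondent gave this response
    return [sum(axis) / len(selected) for axis in zip(*selected)]


# Made-up data: four respondents, one survey variable, three row coordinates.
responses = [1, 2, 1, 2]
row_coords = [
    [4, 0, 2],
    [3, 1, 3],
    [0, 4, 2],
    [2, 2, 4],
]

columns = microvariables("SEXRSP", responses)
micro_coords = {
    key: microvariable_coordinates(bits, row_coords)
    for key, bits in columns.items()
}
print(micro_coords)
# {('SEXRSP', 1): [2.0, 2.0, 2.0], ('SEXRSP', 2): [2.5, 1.5, 3.5]}
```

Every microvariable ends up with the same number of coordinates as the rows have, so any two of them can be compared directly, whatever survey questions they came from.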
This is a fairly simple process, all of which I have done before on other data, so it should not be too hard for me to implement. — dpw
Finding Compatibility Data in WLS
June 18th, 2010
I am still looking for better sources of the information most important to me, but I have had some luck with the Wisconsin Longitudinal Study. Just a bit, but some. One question, repeated on different instruments, asks how close the respondent feels to his or her spouse. A very few other questions refer to the amount of sexual pleasure with a spouse. Taken together, they do give some very limited indications of compatibility. What makes this useful is that there are a number of questions asked of the respondent about their spouse, and a few questions in which the spouse gives some data about himself or herself. It’s not a lot of information, but it is some, and since it is a longitudinal study, there is some indication of how this changes over time. It will take me a while to make use of the data and I sure wish there was more, but there is enough to get started.
I have had much more luck locating information about jobs. The WLS asked a lot about occupation and job satisfaction. I’ll work with that data as I find time, but for now I am anxious to see if I really can get anything useful out of the limited interpersonal data. — dpw
A mathematical diversion
June 16th, 2010
Oops. This just happens sometimes. I get involved in writing about something mathematical and that part of my mind gets trapped there. On another blog I am writing a novel, an online chapter-by-chapter novel, with daily installments. Today I was trying to deal with a concept that was mathematical enough to occupy that part of my mind until it became too late in the day to work on software development or this blog. For the curious, see Chapter 14 at http://SocialTechNovel.SocialTechnology.ca/ (the previous few chapters talk about advanced social tech hardware, by the way). I’ll get back to software and this blog tomorrow. — dpw
Request for Available Dataset Information
June 15th, 2010
I don’t know who might be following this or might stumble across it, but it does get some notice from somebody, since a mistake gets drawn to my attention quickly enough. I may have driven people away by making too many assertions and not enough requests for anything. This may be because I have too much of some things, too much social survey data, for instance. There is a mountain of it. Over the years I’ve collected some, but never found quite the right data. So let me just ask –
Does anyone know of an education dataset with raw test answers in it? Not scores, actual answers. With the questions and correct answers, too, of course. It would be especially nice if some other information about the students were included. Has anyone got some material of their own, not a published dataset, but just test answers from one or more tests they have administered themselves?
I am always looking for compatibility data. I’m not comfortable with trying to derive this information from more general surveys. I can do that using marriage-success data, but that is a stretch.
Lots of what I want to do involves using data which just doesn’t exist, never having been collected, to the best of my knowledge. But there really is a mountain of data out there, and who knows what might have been collected. Let me just fantasize about something I have never found, that might exist –
I would like a dataset which includes individual characteristics, interpersonal compatibility information, occupation and job history information, and answers to questions of fact.
I can find nothing with all of this, but I do have a tiny bit of information from electoral studies, in which people stated something about themselves and were then asked what they knew about the candidates. Since some of this is factual information which can be checked (candidates, party, incumbency, age, views on various topics), it is possible to see factual errors, almost like the answers to test questions. That, combined with information from the respondent about himself or herself, is definitely useful, and some occupation information sometimes goes with that, but there is nothing whatsoever about interpersonal compatibility.
I wrote before about patching together the data from many sources to get what I want. I am sure I will still have to do that, but it doesn’t hurt to ask. Please help if you know of any available datasets that I might use. — dpw
