NCES EDAT as a Data Source

June 29th, 2010

Just a quick note: some of the data which I wanted and could not get from the ICPSR, such as the Education Longitudinal Study of 2002, is available at http://nces.ed.gov/edat/. I will continue to look for data sources other than the better known but clearly obstructionist ICPSR. I have managed to find on my old disks some data which the ICPSR used to make available before they cracked down on it. I wish I had updates for it since, for example, my GSS data is years old. If I can find it somewhere, I’ll let you know. – dpw

Top Down and Bottom Up

June 23rd, 2010

I think that for a while I should alternate posts, because there are two general ways to look at this software development project. One is from the top down, the other from the bottom up.

At the lowest level, I have small bits of code, ready to be extended and to be embedded in larger units. At the highest level I have a general overview of the role of social technology in society and how software can help accomplish the goals to be set forth.

Here is a bottom up sketch, for example. I have some code for doing a very robust clustering and using the results to generate coordinates for whatever items are being clustered. The items are usually represented by rows. I put a serial number at the beginning of each row, which is often different from the unique ID assigned to whatever the row represents. I keep a separate file for row data with comment characters used to insert comment lines between rows.

The data to be processed is sometimes sent through a filter to remove comment lines, but a better way is to preserve them by adding them to the end of the following line, where they can be read into a string variable. Each line of data can be treated as one row of a matrix, with a possibly null string at the end, to be ignored in processing.
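
A minimal sketch of that second approach, assuming `#` as the comment character and whitespace-separated values (neither is stated above, and `read_rows` is a made-up name). Each comment line is held back and attached to the data line that follows it, so every row ends with a possibly empty string:

```python
def read_rows(lines, comment_char="#"):
    """Return (values, comment) pairs, one per data line.

    Comment lines are not thrown away: they are attached to the
    data line that follows them. The comment may be an empty string.
    """
    rows = []
    pending = []  # comment lines seen since the last data line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(comment_char):
            pending.append(line[len(comment_char):].strip())
            continue
        values = line.split()  # the first value is the serial number
        rows.append((values, " ".join(pending)))
        pending = []
    return rows


sample = [
    "# first respondent",
    "1 0 1 1",
    "2 1 0 0",
    "# checked by hand",
    "# twice",
    "3 1 1 0",
]
rows = read_rows(sample)
```

Processing code can then use only the first element of each pair and ignore the comment string.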

At the top end, looking down, I see social technology applying to individuals and to situations involving individuals. An example of the latter is the placing of individuals in jobs. An employer can be asked to specify the tasks that need doing, with preferences for full or part-time employees to do them. When employees will work in teams, the existing team members can be asked to give information about themselves.

Then when all this data is collected, the software can recommend people who can do the tasks as required, while fitting well into the existing team. Or a whole new team of compatible people can be assembled to perform the tasks associated with a project.

To do all this, the software will use questionnaire responses or a dialogue with the users (individuals, employers, other team members, etc.) to produce coded data, not unlike social survey data.

From the bottom up again, converting coded data into matrix data involves turning the list of variables and valid answers into rows and columns of a bit matrix – except for data already in numerical format, such as income data. The resulting numerical or binary matrix then undergoes column clustering using robust clustering algorithms which use the most distant rows as starting points.
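
The conversion of coded answers into a bit matrix might look something like the sketch below. The codebook layout, the variable names and the function name `build_bit_matrix` are all made up for illustration, and numerical variables such as income are left out:

```python
def build_bit_matrix(codebook, records):
    """Turn coded answers into rows of bits.

    codebook: list of (variable, [valid answers]) pairs.
    records:  list of dicts mapping variable -> coded answer.
    Each (variable, answer) pair becomes one column.
    """
    columns = [(var, ans) for var, answers in codebook for ans in answers]
    matrix = [
        [1 if rec.get(var) == ans else 0 for var, ans in columns]
        for rec in records
    ]
    return columns, matrix


# Hypothetical codebook and responses
codebook = [("SEX", [1, 2]), ("MARRIED", [1, 2, 3])]
records = [
    {"SEX": 1, "MARRIED": 3},
    {"SEX": 2, "MARRIED": 1},
    {"SEX": 2},  # MARRIED not answered
]
columns, matrix = build_bit_matrix(codebook, records)
```

A missing answer simply leaves all of that variable's bits at zero, which is one possible choice among several.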

The cluster data can be converted into coordinate data by representing the points in a cluster by their distance from the global centre, usually a heavily populated point at the intersection of the various clusters. Points in the cluster that are further from the global centre than the cluster centre is get a negative value. Points near the cluster centre are assigned a near-zero value, and points closer to the global centre get a positive value, usually +1 at the global centre itself. This can all be automated. Once data is in a coded matrix form, it is easy to do all the rest. The problem is to get it there.
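
No formula is given above, so the following is only one guess at a rule that produces those signs: scale a point's distance from the global centre by the cluster centre's distance from the global centre, and subtract from 1. The Euclidean distance and the function names are assumptions.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_coordinate(point, cluster_centre, global_centre):
    """Assumed rule: +1 at the global centre, 0 at the cluster
    centre's distance from it, negative further out."""
    scale = euclid(cluster_centre, global_centre)
    if scale == 0:
        raise ValueError("cluster centre coincides with the global centre")
    return 1.0 - euclid(point, global_centre) / scale


global_centre = (0.0, 0.0)
cluster_centre = (4.0, 0.0)
at_global = cluster_coordinate((0.0, 0.0), cluster_centre, global_centre)
at_cluster = cluster_coordinate((4.0, 0.0), cluster_centre, global_centre)
beyond = cluster_coordinate((8.0, 0.0), cluster_centre, global_centre)
```

Each cluster would contribute one such coordinate per point.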

Looking down from the top again, it should always be possible to add collections of data to some secure overall collection. Eventually the use of social technology will lead to the ever growing amassing of good data. In the interim, social survey data can be used.

Looking down from up high, we can think of there being some kind of data grinder or data monster which can be fed crude data and extract the information from it. The ideal social survey would be one which produced a set of files which include all publicly available raw data, plus enough information to reconstruct the entire survey from scratch, all in machine readable form.

It should be possible to download a big compressed (e.g. zipped) datafile, representing everything which the survey organizers decided to release. It should be possible to feed that one file into the data monster with no further human intervention.

The results would be useful for every kind of social technology:

– the interpersonal matching of individuals

– matching people to jobs, including both finding jobs for people and finding people to perform various tasks

– finding educational opportunities for people and helping educational institutions select students

And so on – various applications are listed elsewhere. This is not intended as more than a survey of what the software development problem looks like from the top down and from the bottom up. More on both, probably in alternating posts. — dpw

WLS SAS Catalog or Command Files and Bit Arrays

June 20th, 2010

I was a bit worried about getting the formatting data out of the SAS Catalog or Command files in the Wisconsin Longitudinal Study.  These are ASCII files, and necessary to read the CSV data files, but they are designed to be read by SAS, not by some program I might write.   The answer to this is surprisingly simple.  I used an ordinary text editor to strip off the first few lines and all the format lines at the end, then saved the file as a variable and value file,  the whole of which is in one single format, like this:

value SEXRSP /* sex of respondent */
      1 = 'male'
      2 = 'female' ;

Then I did the opposite, saving only the format lines, which map variables to possibly new names:

format    DEATYR DEATYR.;
format   GROUP91 GROUP9A.;

These two new files are easily readable by a program which will be easily writeable.
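
A rough sketch of such a program, based only on the two excerpts shown above. It assumes one code per line, a label in straight or curly quotes, and an optional comment on the `value` line; real catalog files may well contain cases it does not handle. The function names are made up.

```python
import re

QUOTE = "['\"\u2018\u2019]"  # straight or curly quotes
VALUE_START = re.compile(r"^\s*value\s+(\$?\w+)\s*(?:/\*\s*(.*?)\s*\*/)?", re.IGNORECASE)
CODE_LINE = re.compile(r"^\s*(\S+)\s*=\s*" + QUOTE + r"(.*?)" + QUOTE + r"\s*;?\s*$")
FORMAT_LINE = re.compile(r"^\s*format\s+(\w+)\s+(\$?\w+)\.\s*;", re.IGNORECASE)


def parse_values(lines):
    """Map each format name to its comment and its code -> label table."""
    formats = {}
    current = None
    for line in lines:
        start = VALUE_START.match(line)
        if start:
            current = {"label": start.group(2) or "", "codes": {}}
            formats[start.group(1)] = current
            continue
        code = CODE_LINE.match(line)
        if code and current is not None:
            current["codes"][code.group(1)] = code.group(2)
    return formats


def parse_formats(lines):
    """Map each variable name to the name of the format it uses."""
    mapping = {}
    for line in lines:
        match = FORMAT_LINE.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


value_text = """value SEXRSP /* sex of respondent */
      1 = 'male'
      2 = 'female' ;
"""
format_text = """format    DEATYR DEATYR.;
format   GROUP91 GROUP9A.;
"""
values = parse_values(value_text.splitlines())
formats = parse_formats(format_text.splitlines())
```

Codes are kept as strings here, since nothing above says they are always numeric.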

It seems that the Python package Bitarray will work.  It returns only a one dimensional array of bits, but as many as necessary can be put in a list.   This will create something like a two dimensional array of bits.  I  need columns to be in the one dimensional arrays, so I will have to transpose the data.  Did you know that you can transpose a two dimensional list of lists in a single line of Python code, using the map and zip functions?  I don’t know if it will work for bit arrays yet, and especially don’t know if it will work for something huge, but I’ll try.  — dpw
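
For reference, the one-line transpose looks like this in current Python, shown on plain lists of 0 and 1 rather than on Bitarray objects. Whether it is practical for a very large dataset is not tested here; `zip(*rows)` needs every row in memory at once.

```python
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]

# zip(*rows) pairs up the first items of every row, then the second
# items, and so on; map(list, ...) turns each tuple back into a list.
columns = list(map(list, zip(*rows)))
```

Applying the same line to `columns` gives back the original rows.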

Two Dimensional Array of Bits

June 19th, 2010

Some small changes – I am not doing exactly what I said yesterday.

I wrote about using variable responses as they were recorded, which would probably work, but I don’t quite trust it, since it is not quite clear how many coordinates would be needed to fully represent the rows (individual people). And it seems that adequately representing those values will take up more memory than using what I call microvariables, which are columns of single bits.

The best way of representing the whole dataset seems to be as a single two dimensional array of bits. Each variable is replaced by several columns of single bits, each representing one possible value of the variable. Each individual person will be represented by one row of bits.

I have written about this before, but now think it the only way to go.

It is hard to find a nice way to do this. Pascal provides lovely arrays of booleans, but each one actually takes up one byte, not a single bit. I do miss VAX Pascal, with its Packed Array of Boolean data type, in which each boolean occupied just a single bit of memory.
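
For comparison, the packing can be done by hand in plain Python with a `bytearray`, eight bits to a byte. This is only an illustrative sketch with a made-up class name, and it would be far slower than a compiled package:

```python
class BitMatrix:
    """Fixed-size two dimensional array of bits, stored row after row."""

    def __init__(self, n_rows, n_cols):
        self.n_rows = n_rows
        self.n_cols = n_cols
        # round up to a whole number of bytes
        self._bytes = bytearray((n_rows * n_cols + 7) // 8)

    def _locate(self, row, col):
        """Return (byte index, bit index within that byte)."""
        if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
            raise IndexError("bit index out of range")
        return divmod(row * self.n_cols + col, 8)

    def set(self, row, col, value=True):
        byte, bit = self._locate(row, col)
        if value:
            self._bytes[byte] |= 1 << bit
        else:
            self._bytes[byte] &= ~(1 << bit) & 0xFF

    def get(self, row, col):
        byte, bit = self._locate(row, col)
        return (self._bytes[byte] >> bit) & 1

    def nbytes(self):
        return len(self._bytes)


m = BitMatrix(3, 5)  # 15 bits fit in 2 bytes
m.set(0, 0)
m.set(2, 4)
m.set(0, 0, False)   # clear the first bit again
```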

Python does have a nice package, of course, but I run on a 64-bit machine and don’t quite trust the 64-bit version of the compiled package, which is only at version 0.3.5 anyway. If anybody has experience with this package, Bitarray, I would like to know about it. See http://pypi.python.org/pypi/bitarray/ for information. It is available as a precompiled binary for Windows at http://www.lfd.uci.edu/~gohlke/pythonlibs/, a very nice page and the best way to access all the well-known packages (and a few good ones that are not so well known).

Anyway, using big bit arrays, I think I can guarantee that 16 coordinates would be enough to represent all possible rows of bit data. There will be fewer, I think, though I haven’t actually looked for duplicates.

Whether I trust the Python package or not, it seems to be the thing to use, so I will. I’ll report my results as I go along. — dpw

Row Coordinates, Comparing Microvariables

June 19th, 2010

I have worked out a complete approach for dealing with the current dataset, the one chosen to start with, the Wisconsin Longitudinal Study.   First is a method I have used before to get coordinates for rows, that is to say for individual people, in this kind of dataset.   One uses the full list of variables, almost 14 thousand of them, to compare rows.   Every variable which exactly matches is counted, then the count determines similarity between rows.   The first step is to find the most distant two rows, by comparing each row with all others.   Then add extremes, by comparing each row with all extracted extremes, starting with the first two.  The row that is the farthest from all of the existing extremes is a new one.  This process slows down considerably as each new extreme is added.   Similarity to the final set of extremes is used to give coordinates.   For a dataset of this size at least 10 coordinates must be extracted, in my experience.
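
The procedure can be sketched on a toy dataset as follows. Two points are assumptions, since the description leaves them open: "farthest from all of the existing extremes" is read as "largest minimum distance to any existing extreme", and ties go to the first row found. The function names are made up, and at least two rows are required.

```python
from itertools import combinations


def distance(a, b):
    """Number of variables on which two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_extremes(rows, n_extremes):
    """Return the indices of the rows chosen as extremes."""
    # Step 1: the two most distant rows, comparing every pair.
    pairs = combinations(range(len(rows)), 2)
    i, j = max(pairs, key=lambda p: distance(rows[p[0]], rows[p[1]]))
    extremes = [i, j]
    # Step 2: keep adding the row farthest from the extremes so far.
    while len(extremes) < min(n_extremes, len(rows)):
        candidates = [k for k in range(len(rows)) if k not in extremes]
        best = max(
            candidates,
            key=lambda k: min(distance(rows[k], rows[e]) for e in extremes),
        )
        extremes.append(best)
    return extremes


def row_coordinates(rows, extremes):
    """One coordinate per extreme: the count of matching variables."""
    n_vars = len(rows[0])
    return [[n_vars - distance(row, rows[e]) for e in extremes] for row in rows]


rows = [
    [1, 1, 1, 1],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [1, 2, 1, 2],
    [2, 2, 1, 1],
]
extremes = find_extremes(rows, 3)
coords = row_coordinates(rows, extremes)
```

The all-pairs first step is quadratic in the number of rows, and each later step compares every remaining row with every extreme, which fits the slowdown described above.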

The next step involves what I call microvariables, which are simple boolean variables based on the variables defined in the survey.   A variable will have several responses.  Each possible response defines a microvariable, a column of booleans, single bits.   A set of coordinates for each column can then be defined as the mean of the row coordinates for all bits set in this column.  Thus, each microvariable will have a set of coordinates, equal in number to the number of row coordinates.   It is important to have these coordinates for the microvariables, because they are comparable, where the boolean columns are not — they are apples and oranges. 
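
As a small illustration of that averaging step, assuming the bit matrix is a list of rows and the row coordinates are already known (the data and the function name are invented):

```python
def microvariable_coordinates(bit_matrix, row_coords):
    """For each column, average the coordinates of the rows whose bit is set.

    Returns None for a column in which no bit is set.
    """
    n_cols = len(bit_matrix[0])
    n_dims = len(row_coords[0])
    result = []
    for c in range(n_cols):
        members = [row_coords[r] for r, bits in enumerate(bit_matrix) if bits[c]]
        if not members:
            result.append(None)
            continue
        result.append(
            [sum(m[d] for m in members) / len(members) for d in range(n_dims)]
        )
    return result


bit_matrix = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
row_coords = [
    [4.0, 0.0],
    [2.0, 2.0],
    [0.0, 4.0],
]
micro = microvariable_coordinates(bit_matrix, row_coords)
```

Each microvariable ends up with as many coordinates as the rows have, so columns can be compared with one another directly.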

This is a fairly simple process, all of which I have done before on other data, so it should not be too hard for me to implement. – dpw

Finding Compatibility Data in WLS

June 18th, 2010

I am still looking for better sources of the information most important to me, but I have had some luck with the Wisconsin Longitudinal Study. Just a bit, but some. One question, repeated on different instruments, asks how close the respondent feels to his or her spouse. A very few other questions refer to the amount of sexual pleasure with a spouse. Taken together, they do give some very limited indications of compatibility. What makes this useful is that there are a number of questions asked of the respondent about their spouse, and a few questions in which the spouse gives some data about himself or herself. It’s not a lot of information, but it is some, and since it is a longitudinal study, there is some indication of how this changes over time. It will take me a while to make use of the data and I sure wish there was more, but there is enough to get started.

I have had much more luck locating information about jobs.  The WLS asked a lot about occupation and job satisfaction.  I’ll work with that data as I find time, but for now I am anxious to see if I really can get anything useful out of the limited interpersonal data.  — dpw

A mathematical diversion

June 16th, 2010

Oops. This just happens sometimes. I get involved in writing about something mathematical and that part of my mind gets trapped there. On another blog I am writing a novel, an online chapter-by-chapter novel, with daily installments. Today I was trying to deal with a concept that was mathematical enough to occupy that part of my mind until it became too late in the day to work on software development or this blog. For the curious, see Chapter 14, at http://SocialTechNovel.SocialTechnology.ca/ (the previous few chapters talk about advanced social tech hardware, by the way). I’ll get back to software and this blog tomorrow. – dpw

Request for Available Dataset Information

June 15th, 2010

I don’t know who might be following this or stumble across it, but it does get some notice from somebody, since a mistake gets drawn to my attention quickly enough.   I may have driven people away by making too many assertions and not enough requests for anything.  This may be because I have too much of some things, too much social survey data, for instance.   There is a mountain of it.   Over the years I’ve collected some, but never found quite the right data.  So let me just ask –

Does anyone know of an education dataset with raw test answers in it?  Not scores, actual answers.  With the questions and correct answers, too, of course.   That would be especially nice if some other information about the students were contained.    Has anyone got some of their own material, not a published dataset, but just test answers from one or more of the tests they have administered themselves?

I am always looking for compatibility data.   I’m not comfortable with trying to derive this information from more general surveys.   I can do that using marriage-success data, but that is a stretch. 

Lots of what I want to do involves using data which just doesn’t exist, never having been collected, to the best of my knowledge.  But there really is a mountain of data out there, and who knows what might have been collected.    Let me just fantasize about something I have never found, that might exist –

I would like a dataset which includes individual characteristics, interpersonal compatibility information, occupation and job history information, and answers to questions of fact.  

I can find nothing with all of this, but I do have a tiny bit of information from electoral studies, in which people stated something about themselves and were then asked what they knew about the candidates. Since some of this is factual information which can be checked (candidates, party, incumbency, age, views on various topics), it is possible to see factual errors, almost like the answers to test questions. That, combined with information from the user about himself or herself, is definitely useful, and some occupation information sometimes goes with that, but there is nothing whatsoever about interpersonal compatibility.

I wrote before about patching together the data from many sources to get what I want.  I am sure I will still have to do that, but it doesn’t hurt to ask.  Please help if you know of any available datasets that I might use.  — dpw
