Request for Available Dataset Information

I don’t know who might be following this or stumble across it, but it does get some notice from somebody, since a mistake gets drawn to my attention quickly enough.   I may have driven people away by making too many assertions and not enough requests for anything.  This may be because I have too much of some things, too much social survey data, for instance.   There is a mountain of it.   Over the years I’ve collected some, but never found quite the right data.  So let me just ask –

Does anyone know of an education dataset with raw test answers in it?  Not scores, actual answers.  With the questions and correct answers, too, of course.   That would be especially nice if some other information about the students were contained.    Has anyone got some of their own material, not a published dataset, but just test answers from one or more of the tests they have administered themselves?

I am always looking for compatibility data.   I’m not comfortable with trying to derive this information from more general surveys.   I can do that using marriage-success data, but that is a stretch. 

Lots of what I want to do involves using data which just doesn’t exist, never having been collected, to the best of my knowledge.  But there really is a mountain of data out there, and who knows what might have been collected.    Let me just fantasize about something I have never found, that might exist –

I would like a dataset which includes individual characteristics, interpersonal compatibility information, occupation and job history information, and answers to questions of fact.  

I can find nothing with all of this, but I do have a tiny bit of information from electoral studies, in which people stated something about themselves and were then asked what they knew about the candidates.   Since some of this is factual information which can be checked (candidates, party, incumbency, age, views on various topics), it is possible to see factual errors, almost like the answers to test questions.   That, combined with information from the respondent about himself or herself, is definitely useful, and some occupation information sometimes goes with that, but nothing whatsoever about interpersonal compatibility.

I wrote before about patching together the data from many sources to get what I want.  I am sure I will still have to do that, but it doesn’t hurt to ask.  Please help if you know of any available datasets that I might use.  — dpw

Posted in Uncategorized | Tagged , , | Leave a comment

Variables in Social Survey Data

The staff of the Wisconsin Longitudinal Study told me proudly that they had more than 14 thousand variables.   In a sense that is true, but I am going to work with them in a way that makes the number more like 100 thousand.    What the scientists who do social surveys consider a variable is not really a variable in the mathematical sense.   One should imagine not a column of codes like:  1, male; 2, female; 3, refused.  Instead, a mathematician might think of these as three columns of a matrix.  The fact that only one answer appears in the set of three columns does not necessarily change this.  Instead the numbers in each column may represent probability.   A zero in one column means certainty that this attribute does not apply to the individual; a one in another column represents certainty that this attribute does apply.
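
A minimal sketch in Python of the expansion described above, assuming the three codes from the example.  The function name and the sample responses are made up for illustration; nothing here comes from the WLS files themselves.

```python
# Hypothetical code list for one survey "variable", mirroring the example above.
CODES = {1: "male", 2: "female", 3: "refused"}

def expand_variable(responses, codes):
    """Turn one column of codes into one 0/1 column per possible answer."""
    labels = [codes[c] for c in sorted(codes)]
    rows = []
    for response in responses:
        # 1.0 means certainty the attribute applies, 0.0 certainty it does not.
        rows.append([1.0 if codes[response] == label else 0.0 for label in labels])
    return labels, rows

labels, rows = expand_variable([1, 2, 2, 3], CODES)
for row in rows:
    print(dict(zip(labels, row)))
```

Each row sums to one, so later correction steps can replace the zeros and ones with less extreme values and still leave something that reads as a set of probabilities.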

Now as we all know, there are few certainties in this world.  People make mistakes, they lie, their response is coded incorrectly.  So various data massage operations can be performed, as described in an earlier post.   The results can still be interpreted as probabilities, but will be more realistic.   Doing this and the other operations discussed in that earlier post is part of data correction.   The social scientist may argue against this level of data correction, but it has great value, especially when done automatically for the purposes of social technology.

I will have no access to computers which can do this level of correction or make proper use of a survey anywhere near the size of the WLS, but I will be able to use some of it.   I am still not sure how to select what to use.  A tentative plan is to collect the columns representing the questions and responses of most significance, such as those for gender, age, race, religion and education.   Given these, it should be possible to make predictors for the other columns.  When a predictor is quite accurate, it corresponds to a column which gives us little information.  When a predictor fails almost completely, then it corresponds to important new information.   Repeating this process over and over, I should be able to locate perhaps as many as 4,000 columns which my machine will be capable of processing.
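
One way the selection step might be sketched in Python.  The predictor here is only a stand-in I have assumed for illustration: it guesses the most common value of a column among respondents who share the same base answers, and it is scored on the same rows it was built from, so it will look too good when there are many base columns.  The column names and rows are invented.

```python
from collections import Counter, defaultdict

def predictability(rows, base_cols, target_col):
    """In-sample accuracy of guessing the target column from the base columns."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[c] for c in base_cols)
        groups[key][row[target_col]] += 1
    # Within each group the best guess is the most common target value.
    correct = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return correct / len(rows)

def rank_candidates(rows, base_cols, candidates):
    """Return candidate columns, least predictable (most informative) first."""
    scored = [(predictability(rows, base_cols, c), c) for c in candidates]
    return [c for score, c in sorted(scored)]

# Invented toy data: "owns_tie" follows gender exactly, "likes_jazz" does not.
rows = [
    {"gender": "m", "educ": "hs", "owns_tie": "y", "likes_jazz": "y"},
    {"gender": "m", "educ": "hs", "owns_tie": "y", "likes_jazz": "n"},
    {"gender": "f", "educ": "hs", "owns_tie": "n", "likes_jazz": "y"},
    {"gender": "f", "educ": "hs", "owns_tie": "n", "likes_jazz": "n"},
]
BASE = ["gender", "educ"]
ranking = rank_candidates(rows, BASE, ["owns_tie", "likes_jazz"])
print(ranking)
```

A column at the front of the ranking is one the base columns cannot predict, which is the kind worth keeping.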

It may be that this can be done iteratively, storing as sufficiently processed the less useful columns after each processing step, then seeking replacements.  I am not sure if this will work or not.  Even if it does, the amount of processing may be excessive.

This is something I will be working on for quite a while, but at least I know enough about the variables to make a good start.  — dpw


Update on Using WLS Data

I think we will be able to use the data from the Wisconsin Longitudinal Study (WLS) without heroic measures like dealing with an enormous XML file.  It seems that the Comma Separated Value (CSV) files (which are actually tab separated, not comma separated) can be combined with the catalog files in the SAS distribution to produce something adequate for our purposes.   The catalog files are less than full codebooks, but are somewhat descriptive of the data.
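
Reading a tab-separated file that calls itself CSV is straightforward with Python's standard csv module.  A small sketch, assuming the file has a header row; the column names and values below are placeholders, not real WLS variables.

```python
import csv
import io

# Stand-in for a data file; the variable names and values are made up.
sample = "respondent_id\tvar_a\tvar_b\n1001\t1\t39\n1002\t2\t38\n"

def read_tab_separated(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

records = read_tab_separated(io.StringIO(sample))
print(records[0]["var_a"])
```

For a file on disk, the handle would come from open(path, newline="") instead of io.StringIO.  All values arrive as strings, so any conversion to numbers is a separate step.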

Reading over the variable status and description documentation and small parts of the enormous cross-reference tables, I am disturbed to find that many variables are actually constructed ones, combinations of two or more distinct questions.  Nevertheless, what is available will do for prototyping and testing.

Once again I have a wish list.  I would still like raw data and easy access to the actual questions asked, but this seems a generic problem with social surveys, as far as I can tell.  I have downloaded and spent some time on several of them, and except for some smaller studies like simple election studies, I could find nothing available which just tells us the basic facts, “Here is the question, exactly as asked”, and “Here is the answer received.”    Just the facts, ma’am, just the facts.   I recognize that this is difficult for in-person and telephone interviews, where the temptation to prompt the respondent may be overwhelming, but still, a single question should be asked, recorded verbatim, the result recorded, and that question should become a single variable.

I’ll write more about this in later posts, but for now I have work to do, making use of what is available.  For the WLS data set covering 1957 to 2007, that is 12988 variables, with data obtained from 10317 respondents, a very impressive collection indeed.  — dpw


Proposed Standards for Social Surveys

I seem to have gotten into trouble by not being clear about what I would hope social survey data might contain.   Here are some proposed standards:

  • everything, including questions and response sets, shall be kept in machine-readable form
  • it shall be possible to reconstruct the entire survey procedures and instruments from the stored data without human intervention
  • if this information is maintained within a statistical package or in a markup language, programs to translate it into a simple, open, human- and machine-readable form like CSV shall be available, with their inverses, so that the translations can be verified
  • routine regression tests shall be made by translating the simple machine-readable files back into the archive or working format, so that changes to the statistical packages or the markup language which affect the data can be discovered
  • whether intended for distribution or not, raw data shall be kept in this way, in addition to any corrected data that has been prepared for general use
  • meta-data to describe all the data including questionnaire and codebook data shall be maintained as data according to these same standards
  • a description of filters used to subset the data into private, academic and public releases shall be maintained as data according to these standards
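
The translate-and-translate-back test in the standards above can be sketched in a few lines of Python.  This is only an illustration under assumed conditions: the "archive format" is stood in for by a header and a list of rows in memory, and the codebook entries are invented.  A real test would put a statistical package's export and import in place of the two helper functions.

```python
import csv
import io

def to_csv_text(header, rows):
    """Translate in-memory rows to CSV text (the simple open format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def from_csv_text(text):
    """The inverse translation: CSV text back to a header and rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, [row for row in reader]

def round_trip_ok(header, rows):
    """True if nothing is lost or altered going out to CSV and back."""
    text = to_csv_text(header, rows)
    header2, rows2 = from_csv_text(text)
    return header2 == header and rows2 == rows and to_csv_text(header2, rows2) == text

# Invented codebook entries, including commas and quotes inside the question text.
header = ["variable", "question", "responses"]
rows = [
    ["q1", 'Do you agree, "on the whole"?', "1=yes;2=no;3=refused"],
    ["q2", "In what year were you born?", "four-digit year"],
]
print(round_trip_ok(header, rows))
```

Run routinely, a check like this would flag any change in a package or markup tool that silently alters the stored questions or codes.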

The total effect of these standards is to produce a collection of data and meta-data which can be processed automatically, without human intervention, by relatively simple programs whose operation can be checked and debugged easily.    The purpose of these proposed standards is to support social technology, the automated use of this data, instead of merely social science, although the social technology to be developed will also provide tools for the scientific researcher.


Oops, yes, WLS codebooks were digitized, though not in a form I can use

I have been well upbraided and justly so by Professor Robert M. Hauser for my comments about the WLS codebooks not being available in digital form.  What I should have said is that they are not in any simple digital form that anyone can use, such as the CSV format in which their actual response data is made available.  

I do have a problem with using the codebook data, which is not in any easy-to-use format, but I should have been more careful how I said it.   What I meant by my comment was that the PDF and HTML files are not in an easily machine-readable format, and there was not much I could do with the codebooks available on the site in those files except print them out.

I did ignore the existence of large statistical packages that would give me access to the codebook data along with the rest of the data.   I can’t afford to spend a lot of money on a package whose only purpose would be to translate the codebook data into the kind of simple digital format I could use directly.
 
I did indeed mention one of those packages, Stata, in the immediately previous post, where I noted my hope that if the freely available stats package R could import Stata files,  I might be able to get what I wanted, since from R it is easy enough to export it in a format that I can read and manipulate using Python.  
 
I should have spelled out my frustration at finding only PDF and HTML files, useless as data for my intended Python program, though I did that  more clearly in the previous post, which did outline that plan for getting codebook data from Stata files via R.

Even that now seems futile, because the R which I used to use will not run on the only machine I have now, a 64-bit one.  If anyone knows of an inexpensive package which will run on a 64-bit AMD machine and read the files which are available, please let me know.

My real problem is trying to start a project which has no budget for big statistical packages like SAS, SPSS or Stata.  This would have been so simple to do if they had made codebook data available in something as simple as the CSV format used for the tabular data itself.   As it stands, I find myself sitting here with a powerful computer and a good programming language which runs on it, but no way of accessing the codebook data.
 
That was the source of the rather severe frustration which led me to say the wrong thing about the WLS codebooks, for which I am very sorry.   Yes, they are in a digital format, though not one useful to me.    — dpw


If only the WLS people had Digitized their Codebooks

After once again downloading the Wisconsin Longitudinal Study datasets and related documentation, I looked it all over, pleased that it had been updated in the years since I last dealt with it in 2002.  But as then, I found the usual stumbling block.  They had no digitized codebook to offer for the main study, only for a few minor ones.   This is such a problem, and one reason why we will have to do a lot of work by hand, something we cannot continue to do if we want advanced social technology.  I am trying to remember all the other datasets I downloaded and did some work on over the years.  I am sure I do remember one with a somewhat useful digitized codebook.   What was that?  Well, I’ll look for it.  But if only the WLS had made it easy …   Of course codebooks would be nice, but actual questionnaires would be better.   I seem to remember some election study which had them.  It’s just on the tip of my tongue.  Probably archived on one of my CDs.    I hope.   Anyway, maybe one of these days the right people will realize the importance of digitizing everything, so that using the data can be fully automated.  — dpw


Data to Prototype With

In case anyone else wants to play about with the same data I’ll be using to prototype with, the site to go to is   http://www.ssc.wisc.edu/wlsresearch/ – home of the famous Wisconsin Longitudinal Study, the WLS.   Their data is wonderful stuff, believe me.   I don’t like the way social surveys are done, but I think this is the best of a poor lot, at least.    Very good for our purposes.  I have downloaded both the Comma Separated Value (CSV) format data and the Stata data, hoping that the R documentation is correct in saying that R can import Stata data files.   I don’t actually want to use R, though it is a great package, not to mention free, but from R it is easy enough to export the data in a format that I can read and manipulate using Python.  I am more concerned about the variable information in the Stata files than the actual user response data, which I could easily read using the CSV files.

All this is for prototyping, doing things with the data that I have mentioned in earlier blog posts.   I will only use a subset of the data, since dealing with more than a little of that massive data mountain (37 megabytes in zipped CSV) would be too much for one person, and I am only prototyping anyway.  On the other hand, I do want the prototype to be capable of expansion and refinement into something real, so I’ll try to avoid limiting the data capacity.

I am still going to be posting requirements analysis and design information from time to time, but I need to get more contact with the data and do a bit of coding in order to — well, to keep from being too bored, frankly.   Dealing with data is fun, coding is fun, the paperwork is not.   Not fun, but important.  Don’t think I am minimizing its importance.

The last time I dealt with data from the WLS was 2002, and the survey had not reached 2000 yet.  Now the whole dataset covers the test group from 1957 to 2007, which is a lot better, and more supplementary studies have been added. 

Well, this is something to keep me busy, and will take a while.  If anyone wants to get involved, please, get involved.  — dpw


AI or Not AI?

The basic answer is Not AI.  In Chapter Six, today’s installment of the online novel Social Tech High, at http://SocialTechNovel.SocialTechnology.ca/  I illustrate some user dialogue.  This does look at first glance as if it were Artificial Intelligence.  As envisioned, it will indeed have something in common with the low-tech AI toys associated with the classical MIT school of AI, but it is not intended to ever be anything like an actual artificial intelligence.   It is really just a natural language interface to a large database of questionnaire data.   Answers provided by the user change the user’s profile slightly and raise the estimated values assigned to questions in that database.  The question with the highest probability of providing useful information is asked next.   This is the basis for the user dialogue, which will continue as long as the user cares to interact with the system.   As envisioned, a user may be presented with spontaneous suggestions, but will more commonly ask for them.
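
A rough Python sketch of choosing the next question.  The scoring rule is an assumption made for illustration: a question counts as most useful when the system is least able to predict this user's answer, measured by Shannon entropy of the predicted answer probabilities.  The questions and probabilities are invented.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; higher means the answer is harder to predict."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def next_question(predicted, asked):
    """Pick the unasked question whose answer is currently least predictable."""
    candidates = {q: entropy(p) for q, p in predicted.items() if q not in asked}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Invented predictions of one user's answers, given the profile so far.
predicted = {
    "Do you enjoy large parties?": [0.5, 0.5],
    "Do you own a computer?": [0.95, 0.05],
    "Which season do you prefer?": [0.25, 0.25, 0.25, 0.25],
}
first = next_question(predicted, asked=set())
second = next_question(predicted, asked={first})
print(first)
print(second)
```

After each answer the profile changes, the predictions are recomputed, and the selection runs again; that recomputation is left out of the sketch.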

Making this work requires a large database of questions and answers, as discussed in previous posts.   Only some of those can come from existing social survey data.   New questions will have to be added, and this will require a cooperative user base, receptive to new questions.    The willingness of users to be “beta-testers” of new questions is something that can be a new “meta-question”, randomly added as an initial question to various users, perhaps those meeting some profile suggested by the question designer.  It can be seen that adding questions to the database is analogous to developing software.  In a way it is developing software.

Could the whole database and the software which drives it become a kind of AI?   That is not impossible, but there must be some way to prevent the dialogue from degenerating into the kind of nasty feedback loop which occurred when pseudo-psychiatrist Eliza met pseudo-paranoid PARRY.    A memory for past questions and responses would need to be added.   Internally, the programs would have to ask themselves, “Have I been asked this question recently?”, “Is my best current answer the same as it was recently?”, “Should I invoke one of the standard methods of diverting the conversation in another direction?”, “Which would be the best one?” or some other questions to prevent feedback-driven oscillations.
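
The memory described above might look something like this in Python.  It is a sketch only: the class, the window of five exchanges and the stock diversions are all assumptions made for illustration, and choosing the best diversion is reduced to taking turns.

```python
from collections import deque

class DialogueMemory:
    """Remembers the last few (question, answer) pairs."""
    def __init__(self, size=5):
        self.recent = deque(maxlen=size)

    def is_repeat(self, question, answer):
        return (question, answer) in self.recent

    def record(self, question, answer):
        self.recent.append((question, answer))

def respond(memory, question, best_answer, diversions):
    """Give the best answer, unless that would repeat a recent exchange."""
    if memory.is_repeat(question, best_answer):
        # Same question, same answer as recently: change the subject instead.
        reply = diversions[0]
        diversions.rotate(-1)
        return reply
    memory.record(question, best_answer)
    return best_answer

memory = DialogueMemory()
diversions = deque(["Let's talk about something else.", "What did you do last weekend?"])
replies = [respond(memory, "How are you?", "Fine.", diversions) for _ in range(3)]
print(replies)
```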

Would that indeed be a practical way of implementing an AI, even if one is not required for the envisioned social technology?   Maybe.    I welcome others’ opinions, but again, I do want to focus on the practical immediate problems, giving people tools to help them optimize their own social environments.  — dpw


Data Merging, Large and Small

On the large scale, organizations with large databases for social use can merge their data, as hinted at in Chapter One of my online novel http://SocialTechNovel.SocialTechnology.ca/ and on a much smaller scale, individuals can merge data.   In either case the data must be kept private and secure.  Let us for the moment consider only those who have a computer and Internet connection, and look at individual merging of data.  The hardest thing to do is keep technically savvy individuals from downloading large amounts of data, cracking whatever encryption it has, then using it to access personal information about the users.

The problem is closely related to the armaments race, as described by William L. Shirer in The Rise and Fall of the Third Reich.  In this classic book, Shirer analyzed the activities of the German armaments firm of Krupp, then came up with a general rule.  In the race between better armour and better guns, those who buy the guns will eventually win.   Armour is expensive and always slightly out of date, and it is a sitting target.  Between increases in the strength of armour, what has been installed is always vulnerable to better guns.  Thus the idea of burying likely targets deep in hardened bunkers eventually lost out to the increases in the power and accuracy of nuclear weapons.

Does this apply in the computer age?  Probably not.  It is easy to switch from a 128 bit encryption key to a 256 bit key, but creating programs to crack a longer key is much harder.   Each additional bit doubles the number of candidate keys to be considered, so the work of a brute-force search grows exponentially with key length.  The work of encoding and decoding grows far more slowly as the key gets longer, and for the common symmetric ciphers the length of the encoded document does not depend on key length at all.  So the old rule about armour versus weapons does not apply in cryptography.
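
The arithmetic behind the key-length argument can be checked directly in Python.  The guessing rate below is an arbitrary figure assumed for illustration, not a claim about any real attacker, and the search is assumed to be plain brute force over every key.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def candidate_keys(bits):
    """Number of distinct keys of the given length."""
    return 2 ** bits

def years_to_search(bits, guesses_per_second):
    """Time to try every key once at a fixed, assumed guessing rate."""
    return candidate_keys(bits) / guesses_per_second / SECONDS_PER_YEAR

# Doubling the key length squares the number of candidate keys.
ratio = candidate_keys(256) // candidate_keys(128)
print(ratio == candidate_keys(128))

# Hypothetical attacker trying a million million keys per second.
HYPOTHETICAL_RATE = 1e12
print(years_to_search(128, HYPOTHETICAL_RATE))
print(years_to_search(256, HYPOTHETICAL_RATE))
```

The defender's side of the bargain is one extra line of configuration; the attacker's side is multiplied by a number with dozens of digits.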

But practical social technology cannot depend on people typing in passwords for access to their own encrypted data.  What is needed is something automatic.   And that must depend on a mechanism for passing data along trusted connections.   This is a hard problem to solve.    It may be compared to the problem of keeping actual keys made out of steel or brass, for the houses of a neighbour or friend.  “We have to go out of town, it’s an emergency.  Please feed the cat.  You have a key.”  But a burglar who breaks into one house may then find a key to another, perhaps two or more others.  He can then enter those houses, looking for other keys.   There may be a whole network of people trusting only their friends or neighbours, all vulnerable to a single break-in.   Information networks make the problem more severe, unless people use only difficult-to-guess passwords, which they have nevertheless memorized, never trusting that password to another person.   That is not good enough.

This is not as difficult a problem as it might appear.  What people want advanced social technology to do is make suggestions.  How those suggestions originate is of less concern than the verification of them.     A message-passing protocol can help.    A message containing an encrypted trap-door encryption key can be passed along from one computer to another to another, at each step growing by the addition of routing information.  The receiving machine can send back not just a return message but an encrypted check message: “You apparently passed along a message with this check-code.  Can you verify that?”    This would give each computer a lot of extra work to do, but could make the system more reliable.   In essence this would treat each communications channel as protected by armour, protected by encryption.  Not only would the message be encrypted, so would the channel.  Cloaked in this armour, the individual communications channels could resist attacks from outside.
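
The check-code part of this idea can be sketched with Python's standard hmac and hashlib modules.  Much is assumed here for illustration: the check-code is taken to be an HMAC-SHA256 over the message as received, each relay is taken to hold a secret key of its own, and the encryption of the message and of the channel is left out entirely.  The relay name, key and routing format are invented.

```python
import hashlib
import hmac

def check_code(key: bytes, message: bytes) -> str:
    """Keyed digest of a message; hard to forge without the key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

class Relay:
    """One computer in the chain; remembers check-codes of what it forwarded."""
    def __init__(self, name, key):
        self.name = name
        self.key = key
        self.forwarded = set()

    def forward(self, message):
        """Record a check-code, add routing information, pass the message on."""
        code = check_code(self.key, message)
        self.forwarded.add(code)
        routed = message + b"|via:" + self.name.encode()
        return routed, code

    def verify(self, code):
        """Answer the question: did you pass along a message with this code?"""
        return any(hmac.compare_digest(code, seen) for seen in self.forwarded)

relay = Relay("node-a", b"hypothetical-key")
routed, code = relay.forward(b"suggestion")
print(routed)
print(relay.verify(code))
print(relay.verify(check_code(b"some-other-key", b"suggestion")))
```

A code made with any other key fails the check, so a forged claim that a relay passed along a message can be caught by asking that relay.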

I do not claim to be an expert on data privacy or security, but I feel the likelihood of sufficiently secure channels and messages is high.  Their encrypted armour may need to be increasingly powerful, but to break it would require a very disproportionate amount of decoding.   More on the use of this in automated social technology Monday.  Sunday is my father’s 90th birthday party, and I must be there.  — dpw
