Why all that stuff about data?

I am afraid that my last two posts about massaging and combining data were far from clear.  What is it all for?

There is a fundamental concept, easy to state. We will often want to match two people, for a platonic friendship or for an intimate sexual one, probably a romantic one. In doing this it would be very nice to have a lot of empirical data about couples. There is very little available, but suppose we did have it. Then we could say that a person who will be a good match for you is someone very much like those who are good matches for people similar to you. If A and B are similar, X is a good match for A, and Y is similar to X, then Y is likely a good match for B. This one inference is not sufficient, but it is evidence. If we found several people like you and knew that each had good matches who were in some ways similar, then another person similar to them would much more likely be a match for you.
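The A, B, X, Y inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: profiles are taken to be lists of numeric answers, similarity is taken to be cosine similarity, and the names `similarity`, `match_score`, `couples` and the sample numbers are all hypothetical, not drawn from any real couple data.

```python
import math

def similarity(p, q):
    """Cosine similarity between two equal-length numeric profiles (an assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def match_score(person, candidate, known_couples):
    """Average the evidence from every known couple (A, X).

    A couple counts for a lot only when the person resembles A
    and the candidate resembles X at the same time.
    """
    evidence = [similarity(person, a) * similarity(candidate, x)
                for a, x in known_couples]
    return sum(evidence) / len(evidence) if evidence else 0.0

# Hypothetical data: each list holds one person's answers on some survey scale.
couples = [([5, 1, 4], [1, 5, 2]),
           ([4, 2, 5], [2, 4, 1])]
you = [5, 1, 5]
candidates = {"Y1": [1, 5, 1], "Y2": [5, 1, 5]}

ranked = sorted(candidates,
                key=lambda name: match_score(you, candidates[name], couples),
                reverse=True)
print(ranked)
```

Each couple adds one piece of evidence, so the score firms up as more couples resembling you are found, which is the point of the paragraph above: one inference is weak, several together are much stronger.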

This is a matter of probability, of course, of inductive inference, not logical deduction.  But it can work.  All that is needed is that elusive data about couples.  Soon we will be able to collect that data for ourselves, but to get started we need enough data that we can offer people something useful.   That data is very hard to find.  Couple studies are rare and inadequate.  But by sufficient work on the data we do have, we can make estimates.

There are a number of questions we would like to have the answers to, questions rarely asked, or asked in data unavailable to us, or both. Having asked a woman about her life, we would have preferred that the investigators had also asked about her husband and the success of her marriage. A very small number of these questions were asked in various surveys. The idea of making use of all that social survey data might therefore seem to be a poor one. But after much study, I have concluded that the information is actually there, buried in the mountain of social survey data, from which only a tiny bit of good ore has been mined. The problem is to extract it.

The term ‘data mining’ has become popular. I think it a good one. Let me push the analogy a bit. In one of my novels, I wrote about fully automated, self-reproducing factories that make factories. Most importantly, I wrote about these becoming a part of a large space ship, which, in the book, flew to an asteroid and mined it for material to reproduce itself. Then there were two self-reproducing space ships, which would multiply exponentially, gradually turning the asteroid belt into a fleet of space ships. In this book I contrasted this approach with the more traditional one in which asteroid mining ships descended on an asteroid, mined the metals out of it, and left the worthless remains behind, just as miners on Earth do. What I noted was that an asteroid miner would be less like an Earthbound strip miner than the Earthbound meat packing company which “uses every part of the pig but the squeal”. Given the virtually limitless power of the sun, every bit of an asteroid could be mined, and most valuable of all would be just the parts traditional science fiction imagines to be left behind, the bare rock. Rock usually contains silicon and oxygen, both vital materials, either in combination or separated. It may contain calcium instead of silicon, but that is also a useful material, a lightweight metal. What part of an asteroid is not of some use? Asteroid mining is essentially just the separation of components, not the mining of good from bad.

Similarly, what part of a mass of social survey data is not useful?   It is all information.  What part of it is useless information?   Prove to me that any of it is useless.  Please.

Given all this data, is data mining just the extraction of good information from bad?   Not at all.  It is just the separation of data.  What we want to know is in there.  More is in there, useful to somebody.  Indeed, social technology will eventually find uses for all of it.   That was the main reason for the last two, rather obtuse posts.  They were about collecting, massaging and combining data, prior to data separation, preparing for the data mining process — dpw

Posted in Uncategorized | Tagged , , , , | Leave a comment

Combining Data Sources

This will be discussed primarily in terms of working with existing social survey data, but will apply just as much to newly collected data, once we get past a single collection instrument. As noted before, existing social survey data will be very useful.   How?  

It is easier to understand this in terms of a machine-generated dialogue system than in terms of a questionnaire-based system. Given a mass of processed social survey data, preferably from a multitude of data sources, it is not hard to find key questions, ones which correspond to principal factors in question space. One such key question is, “Gender, male, female, no-answer”. Note: “no-answer” is data too. Knowing this question is important, our question generation software can ask it. From that answer, other questions immediately arise. If the user is male, the best next question is almost certainly age. That may be so for a woman as well, but almost as important would be “Have you given birth to any children?” or some variant of that. Given the answers to those questions, others will arise, the exact order and need for them depending on the individual and on previous answers. So to generate questions, we look at existing social survey data, if for no other reason, though there are many other reasons, of course.

Combining data from existing data sources (or, eventually, from our own) is done this way. Two questionnaires (survey instruments) will have some overlap. Most will at least ask the user’s gender. Most will ask something about age or date of birth, probably just expecting an age range. Whether asked or not, actual birthdate is rarely made available to users of the datasets. Two different question and answer sets with some overlap can be considered supersets of a smaller set given to a larger population. With many overlapping questionnaires, a lot of overlap can be used to estimate questions missing from each set. Let us suppose that instrument A asked a person’s income range, and instrument B did not. But suppose that both questionnaires asked about age, gender, home location, employment record, work location. Then the whole column of data about income range, missing from instrument B because the question was never asked, can be treated as missing data and estimated using the methods given in the last post.
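Here is a rough sketch of the instrument A and instrument B example, assuming records are plain dictionaries and that "similar" means close on the shared questions. The records, the function names and the distance scale are all made up for illustration. The estimate here is a single nearest-neighbour vote, not the full iterative method.

```python
from collections import Counter

# Hypothetical records. Instrument A asked income range; instrument B did not.
instrument_a = [
    {"age": 25, "gender": "f", "employed": 1, "income": "low"},
    {"age": 27, "gender": "m", "employed": 1, "income": "low"},
    {"age": 45, "gender": "f", "employed": 1, "income": "high"},
    {"age": 48, "gender": "m", "employed": 1, "income": "high"},
    {"age": 50, "gender": "f", "employed": 1, "income": "high"},
]
instrument_b = [
    {"age": 26, "gender": "f", "employed": 1},
    {"age": 47, "gender": "m", "employed": 1},
]

def distance(r, s, shared):
    """Crude distance over the shared questions only."""
    total = 0.0
    for q in shared:
        if isinstance(r[q], (int, float)) and isinstance(s[q], (int, float)):
            total += abs(r[q] - s[q]) / 10.0  # arbitrary scale, an assumption
        else:
            total += 0.0 if r[q] == s[q] else 1.0
    return total

def estimate_missing_column(source, target, question, k=3):
    """Fill `question` in every target row from the k most similar source rows."""
    shared = [q for q in target[0] if q in source[0] and q != question]
    filled = []
    for row in target:
        nearest = sorted(source, key=lambda s: distance(row, s, shared))[:k]
        mode = Counter(s[question] for s in nearest).most_common(1)[0][0]
        # Flag the value so later passes know it is an estimate, not an answer.
        filled.append({**row, question: mode, question + "_estimated": True})
    return filled

combined = estimate_missing_column(instrument_a, instrument_b, "income")
for row in combined:
    print(row)
```

Flagging estimated values matters because they are first guesses, to be thrown away and re-estimated as more instruments are brought in.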

As discussed in that post, missing data should be “recovered” or estimated iteratively. From all the datasets which overlap more or less in different areas, we can crudely estimate columns of data missing from each one of them, columns full of holes because of unasked questions. It is possible in this way to provide estimates for the whole grand superset of all questions, the set including all of the questions asked on any one of them. For scientific study, this would be outrageous, but for technology, it is an appropriate thing to do. It is much more valid, appropriate and useful if done iteratively. Essentially we are going to be reconstructing a big structure from little pieces of it.

 Think of a statue or pot in an archeological dig.  Let us say that it exists in pieces, some of which may be missing, some of which may be extraneous.   Gradually, piece by piece, we reconstruct the pot.  But don’t use a very permanent glue, because once it is reconstructed (first guess) you may need to remove and replace some pieces, moving them around, discarding them as extraneous, perhaps manufacturing ones whose size and shape can only be seen once the first attempted reconstruction is done. 

Gradually, iteratively over time, the amount of reconstructed data can be enormous, even if most of it is apparently extrapolation at first.   As we extrapolate to fill in gaps from just two datasets, then move on, bringing in more and more, soon we are interpolating between well-understood points.   This is a complicated process, which I will need to describe in more detail, and will at a later date.   As I said in my earlier post, we can and must automate this, so as to do technology, not science, though the technology will also provide tools for the scientists.   Again, I have done some limited work with rather crude research-only software that I’ve written,  but I am quite sure it can all be made to work.   It will require much more than just expanding on what I have already done, but that is a start  — dpw


Data Massage

This is something we can practice on with existing social survey data, but in fact what we do will not be practice.  This data will prove invaluable.  How good it will be will depend on how limited the vision of the people who created the  survey instruments (questionnaires) was, and how much access to raw data we can have.  As I wrote on my main blog, http://www.SocialTechnology.ca/wordpress,  raw data, raw data, raw data.  Sing the data miner’s lament with me “Oh my deepest data mine, lost forever data mine.” Once the data has been truly thrown away, it is gone. When the teacher throws out the test pages, to the incinerator or landfill, so much of what we want is gone. 

But inevitably data in otherwise well-collected datasets will be missing.   And what we have will often need much massage, especially linearization.

One of the first things to do is to fill in holes in the data. The basic method is this: where a column of data representing one user is missing a datum, find several similar columns which do have the answer to that question, then take the mean, median or mode, whichever seems best, filling in the hole from the others. Fill in all the holes this way, or as many as possible, then iterate. Where there was a hole that was filled in, throw away that guess, and fill in the hole again, using the improved dataset which has had many other holes filled in. Do this for all the holes, throwing away and filling in with improved estimates. That is one iteration. Make several passes through the data. The process is likely to converge to some fairly reliable estimates for all the missing data. Since much of this data is multiple choice, individual columns of choices should be orthogonal; this can be used to check and fix the process as necessary. All of this can be automated, and must be. We are talking about technology here, not science. We need to do something with this data, not just spend months examining it, diddling with it and writing dissertations about it.
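The fill-and-iterate loop can be sketched as follows. This is a minimal version under several assumptions: each row is one respondent, answers are numeric, `None` marks a hole, similarity is mean squared difference, and the fill value is the mean of the `k` nearest rows that really answered the question. The function name `fill_holes` and the sample data are hypothetical, and the orthogonality check is not included.

```python
def fill_holes(rows, k=2, passes=5):
    """Iteratively estimate missing answers (None) from the k most similar rows."""
    holes = [(i, j) for i, row in enumerate(rows)
             for j, value in enumerate(row) if value is None]
    data = [list(row) for row in rows]  # working copy; the input is left untouched

    def distance(r, s, skip):
        # Mean squared gap over questions both rows currently have values for,
        # ignoring the question being estimated.
        gaps = [(a - b) ** 2 for j, (a, b) in enumerate(zip(r, s))
                if j != skip and a is not None and b is not None]
        return sum(gaps) / len(gaps) if gaps else float("inf")

    for _ in range(passes):
        improved = [list(row) for row in data]
        for i, j in holes:
            # Donors are rows that really answered question j.
            donors = [n for n, row in enumerate(rows) if row[j] is not None]
            # Similarity is judged on the current dataset, earlier estimates included.
            donors.sort(key=lambda n: distance(data[i], data[n], j))
            nearest = donors[:k]
            if nearest:
                # Throw away the old guess and replace it with the donors' mean.
                improved[i][j] = sum(rows[n][j] for n in nearest) / len(nearest)
        data = improved
    return data

# Hypothetical answers from six respondents to three questions.
survey = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [9.0, 8.0, 7.0],
    [9.0, None, 7.0],
    [1.1, 2.1, 3.2],
    [8.8, 7.9, 7.1],
]
filled = fill_holes(survey)
print(filled[1], filled[3])
```

On later passes the distances are computed with the earlier estimates in place, so the choice of neighbours can change from pass to pass. A fixed number of passes is used here for simplicity; a real version would stop when the estimates stop moving.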

Another important form of data massage is linearization. We eventually want to use linear algebra on the data, especially factor analysis or principal components analysis, which are more or less but not exactly the same thing. To do that we want linearized data. A way of getting this is to assume that it is already linear, select each column of data in turn and try to estimate it using lots of other columns. Often the original column of data can survive linear prediction, but sometimes it will be revealed as the logarithm or cubic function of what is estimated using the other columns of data. In those cases a linearization function is obvious, and the column of data can be changed into one more useful for linear algebra. This can be done over and over, using the linearized columns as they are created to help re-estimate other columns. Eventually we will have transformed the dataset into one very suitable for PCA or other forms of analysis. Note that linearization functions must be recorded, so they can be used in undoing what has been done. For example, a data simplification method can be this: linearize the data, perform PCA, throw away the lowest weight factors, rotate the data back, undoing the PCA, then delinearize the data, producing a simplified and actually corrected version of the original data. Note that this process can also be automated, and must be, again because we are doing technology, not science.
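The "choose a linearization and record how to undo it" step might look something like this. It is only a sketch: the list `estimate` stands in for the linear prediction made from the other columns, which is not computed here, and the candidate transforms, the use of Pearson correlation to pick among them, and all names are assumptions made for illustration. The PCA step is left out.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

def cube_root(v):
    """Real cube root that also works for negative values."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

# Candidate linearizations as (forward, inverse) pairs; the inverse is what
# must be recorded so the change can be undone later.
TRANSFORMS = {
    "identity": (lambda v: v, lambda v: v),
    "exp": (math.exp, math.log),                 # for a column that is the log of the estimate
    "cube_root": (cube_root, lambda v: v ** 3),  # for a column that is the cube of the estimate
}

def choose_linearization(column, estimate):
    """Return the name of the transform that makes `column` most linear in `estimate`."""
    def fit(name):
        forward = TRANSFORMS[name][0]
        return abs(pearson([forward(v) for v in column], estimate))
    return max(TRANSFORMS, key=fit)

# Hypothetical column that is secretly the cube of what the other columns predict.
estimate = [1.0, 2.0, 3.0, 4.0, 5.0]
column = [e ** 3 for e in estimate]

name = choose_linearization(column, estimate)
forward, inverse = TRANSFORMS[name]
linearized = [forward(v) for v in column]
restored = [inverse(v) for v in linearized]  # delinearize using the recorded inverse
print(name)
```

Storing the transform's name with the column is enough to delinearize later, after PCA and the rotation back have been done on the linearized data.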

Note that the correction through such a simplification process can improve the estimates made of missing data and can correct the data in other ways, though it cannot eliminate systematic errors resulting from poorly designed questionnaires which have a bad response set (response bias), in which for example, people are asked for intimate details of their sex lives and rarely answer correctly unless presented with equally appealing or unappealing choices.     For the social sciences, such poorly designed questionnaires are a disgrace, but can still be useful, and much is made over using the biased data.   For technology we need other methods.    I will go into that in another post, but basically the method is to use internal consistency checks to find the most reliable responders, then use their answers as a clue to biases in the responses to some questions.   This is harder to do,  but I believe it can be done well enough, and can and must also be automated.  

So, there is a quick survey of data massage for social technology, as distinct from social science.   Writing software to do all of this is not as difficult as it seems, from my own limited experience with admittedly somewhat crude software which I’ve worked on over the years.  — dpw

P.S.  The second chapter of my novel about social technology is out, at http://SocialTechNovel.SocialTechnology.ca/


Ah, software. New, revised tool selections, again.

Belay that! Hold all tool selection ideas before this. I wish I could use TikiWiki, but it is too template driven, hence slow, and has a wiki syntax incompatible with the popular Wikipedia, which is driven by the very nice MediaWiki. I like MediaWiki, just like it, best of all. Sorry, that’s just the way it is. And for blogs, I like WordPress, which is surely the best known and best supported of all. Just the ease of uploading and installing plugins alone would make it powerful. WordPress has CMS capabilities, but I am using Joomla, with some reservations. I may regret that, and it is not too late to put everything under WordPress. I will try using WordPress only, with no other CMS, on a new subdomain created for the specific purpose of writing a novel about Social Technology. The novel itself will be a blog, but I may add pages about the characters, settings and ideas. A writer’s notebook.

Now to the more serious matter at hand, what I am now calling the Advanced Social Technology Software Development Project. No attempt to make that a nice-sounding acronym, sorry. I am just not going to blog any code or publish in any other way for quite a while. I love to write code and find anything else boring, but there is just no way to handle a big project like this without lots of other work before any coding is done. I’ll try to put up a blog post every day about software development, not mentioning tools very often, except when I need to write about tools we need to incorporate in whatever software we write. I say we, though I modestly claim the ability to do it all myself, given time, because it is software we urgently need and it is too big a project for one man. What I hope to do is produce a good set of analysis and design documents, be prepared to have the design thrown out the window by anyone who comes to help, then and only then write some code myself. Of the two programming languages I have already chosen, Python for the client-side and PHP for the server-side, I’d prefer to do the Python programming, on and for my own desktop machine, first, later setting it up to interact with the server. But I am open to suggestions. Anyone with ideas, please let me know. – dpw


Probable Use of TikiWiki

I think I will probably end up using TikiWiki for the Software Development Site, even though I have other wiki and content management systems installed. It is an all-in-one solution that has a built-in wiki, a blog, content management, and even social networking. I have been asked to help other non-profit organizations set up their software, which I have agreed to do. I think what will probably happen is that I will offer them space on my site, and install a copy of TikiWiki for them to use. It will take some configuration, but should be easy enough to manage, despite its eccentricities. They will have to provide their own content, of course, but that is what matters, not the software. Not yet. What we will be developing here will be a whole new generation of software, and that will matter. One of the key requirements will be making content transferable, so that people using TikiWiki or some other system will be able to migrate their content to the new system without loss.


Index Page for Social Technology Software Development

Social Software Development

This site is for the development of new software. It complements the main Social Technology site.

What is Social Technology?

Briefly, Social Technology is technology for social uses or technology with a social basis. Facebook is a conspicuous example.

The Status of Social Technology Software Today
As explained in many pages on the main site, the important thing is to make the right social connections, to the best available people, jobs, and other aspects of a person’s Social Environment. It is the right connections which must be sought, not simply a large number of connections. Indeed, having a large number of connections is actually harmful: it decreases the signal to noise ratio. Much of current social technology is actually harmful, I believe, rather like surgery in the days before the mechanisms of disease were understood.

As the WWW has become almost a necessity in many people’s lives, it has also become part of the problem. As search engines like Google have become extremely popular they have also become part of the problem. They may increase the gross or total amount of communication in a person’s life, but often interfere with what can be called the net, meaning profitable, communication, the amount of information that is actually absorbed and used. It was actually easier to reach out and find people to discuss things with before there were more people than content on the net, so it was easier to catch people. Now a vast amount of the valuable information on web pages is lost in a sea of information which is even beyond Google’s ability to index. I use the play on words Net Net-Bandwidth as opposed to Gross Net Bandwidth to describe the continually shrinking signal to noise ratio on the Internet and more importantly on the Social Network.

How much useful information flows between people? I argue that it is much much much less than it could or should be. Bandwidth has a great deal to do with interpersonal compatibility, though that is not the only consideration. I believe that most people spend most of their time and effort communicating poorly with incompatible people. I have written a lot about the harm this does to society and the benefits of doing something about it. I feel this should not require such extensive explanation, but even with it people do not seem to get the point.

How Can We Measure Progress?

We need to be able to measure progress towards the goals of increasing net net-bandwidth and increasing the signal to noise ratio. We could do this if we developed a way of estimating the amount of data passed and absorbed in ongoing conversations which are seen as meaningful by both parties, that is, meaningful enough to reply to. Most e-mail that is written goes into the void and is not answered, or receives only the most token answers. Most web pages are rarely visited and the visits generate no e-mails, no conversations, nothing to show that the page was ever looked at.

Primitive tools like hit counters are not reliable indicators; it is the consequences of people reading a page that matter. There is no real technology for measuring all this yet, but clearly most net activity has nothing to do with meaningful information transfer, or social bandwidth: the actual amount of information composed in person by people, then read and absorbed by others. But this could be measured, or approximately measured. That kind of measuring tool would itself be social technology.
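As a very crude first approximation, gross and net bandwidth could be computed from a message log, counting a message as "absorbed" only if it drew a reply. The log format, the names and the numbers below are hypothetical, and a reply is only a stand-in for information actually being read and used.

```python
# Hypothetical log entries: (sender, recipient, bytes composed by a person, answered?)
messages = [
    ("ann", "bob", 1200, True),
    ("bob", "ann", 900, True),
    ("ann", "cat", 3000, False),
    ("dan", "ann", 500, False),
]

def bandwidth(log):
    """Return (gross, net, ratio), where net counts only messages that were answered."""
    gross = sum(size for _, _, size, _ in log)
    net = sum(size for _, _, size, answered in log if answered)
    ratio = net / gross if gross else 0.0
    return gross, net, ratio

gross, net, ratio = bandwidth(messages)
print(gross, net, ratio)
```

Tracked over time, the ratio would give one rough number for whether net net-bandwidth is rising or falling.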

Proposed Software
The software to be developed here shall include:

  • social networking capabilities, like Elgg
  • Content Management System capability
  • bandwidth measuring capabilities
  • software development support
  • interpersonal matching capabilities for networking
  • career modelling and matching capabilities


Initial Requirements

 

This is the first of many requirements analysis pages, a very informal one.

Though informal, I shall try to use the word “shall” in my initial formulation of requirements.

  • The software shall support requirements analysis, at various levels of formality
  • The software shall provide support for software development using one or more models of the development process
  • The software shall support work done by one or more individuals to produce something protected by copyright, based on various copyright templates
  • The software shall support collaborative work by teams of individuals
  • The software shall support collaborative work by people registered as developers or working anonymously.

 These requirements are for the software development process, and are aimed at collaborative open-source development, while also supporting private work under copyright, for commercial purposes.

 This software development site should itself be managed with the kind of software just described, but nothing appropriate seems to exist. In its absence, various Content Management Systems (CMS) will be used.

 A typical CMS is aimed at file management, first of all, including file creation and editing. Some of these systems have vast numbers of bells and whistles. They can do almost everything. But not quite. None are as fully capable as an actual computer programming language. The need to remedy this fact produces several other requirements.

  • The software shall have all the functionality of a complete programming language.
  • The software shall accept as input files of program source code and shall provide an interpreter for them
  • The software shall be able to generate program source code representing the current state of the system, which can be loaded back into the interpreter to recreate that state
  • The software shall provide editors with as much functionality as current advanced programming editors, for writing and editing program source code.
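 The third requirement above, generating source code that recreates the current state, can be illustrated with a toy round trip in Python. This is a sketch only: it assumes the state is a flat dictionary of simple values whose `repr` is valid Python, and `dump_state` and `load_state` are hypothetical names, not part of any existing system.

```python
def dump_state(state):
    """Emit Python source that rebuilds `state` when executed."""
    lines = ["# generated snapshot of the system state"]
    for name, value in sorted(state.items()):
        lines.append(f"{name} = {value!r}")
    return "\n".join(lines) + "\n"

def load_state(source):
    """Run the generated source in the interpreter and return the recreated state."""
    namespace = {}
    # Empty builtins keep the snapshot to plain assignments; this is NOT a
    # security sandbox, so only load source you generated yourself.
    exec(source, {"__builtins__": {}}, namespace)
    return namespace

# Hypothetical state of a small site.
state = {"pages": ["Home", "Requirements"], "version": 3, "owner": "dpw"}
source = dump_state(state)
restored = load_state(source)
print(source)
```

 Because the snapshot is ordinary source code, it can be read, edited and put under version control like any other program file.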

 These requirements specify something like an Integrated Development Environment (IDE). Indeed, the software will include what we usually think of as an IDE, though it will also support work at the requirements analysis and design levels.

 These requirements are only the beginning.

 There are many more, some of which will involve tying these IDE parts of the software with the profiles of individuals working on a project, and with the profiles of a company or institution hosting the project. There must also be ways of developing and using a profile based on the kind of project — how it compares with other projects. This information must at least set various defaults, so the user does not face a steep learning curve. Requirements for these aspects of the system will be posted later.

 To sum up, then, this project will be required to have the functionality of a CMS, include the capabilities of a full programming language, provide an IDE for that language, and make work within the CMS and IDE as easy as possible using individual and project profiles to provide defaults wherever these are possible and desirable.

 More to come – dpw.
