Some small changes – I am not doing exactly what I said yesterday.

I wrote about using variable responses as they were recorded, which would probably work, but I don’t quite trust it, since it is not clear how many coordinates would be needed to fully represent the rows (individual people). It also seems that adequately representing those values would take up more memory than using what I call microvariables, which are columns of single bits.

The best way of representing the whole dataset seems to be as a single two-dimensional array of bits. Each variable is replaced by several columns of single bits, one column for each possible value of the variable. Each individual person will be represented by one row of bits.
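To make the idea concrete, here is a minimal sketch in Python of turning one person's answers into a row of microvariables. The variable names and values (`sex`, `region`) are made up for illustration and are not from my dataset, and the row is held as a plain list of 0/1 integers for readability rather than as packed bits.

```python
# Hypothetical variables and their possible values (illustration only).
VARIABLES = {
    "sex": ["F", "M"],
    "region": ["north", "south", "east", "west"],
}

def column_layout(variables):
    """Assign one bit column to every (variable, value) pair."""
    layout = {}
    for name, values in variables.items():
        for value in values:
            layout[(name, value)] = len(layout)
    return layout

def encode_row(person, layout):
    """Return a list of 0/1 bits, one bit set per answered variable."""
    bits = [0] * len(layout)
    for name, value in person.items():
        bits[layout[(name, value)]] = 1
    return bits

layout = column_layout(VARIABLES)
row = encode_row({"sex": "M", "region": "east"}, layout)
print(row)  # [0, 1, 0, 0, 1, 0]
```

Two variables with two and four possible values become six bit columns, and each person sets exactly one bit per variable answered.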

I have written about this before, but now think it the only way to go.

It is hard to find a nice way to do this. Pascal provides lovely arrays of booleans, but each element actually takes up one byte, not a single bit. I do miss VAX Pascal, with its Packed Array of Boolean data type, in which each element occupied just a single bit of memory.

Python does have a nice package, of course, but I run on a 64-bit machine and don’t quite trust the 64-bit version of the compiled package, which is only at version 0.3.5 anyway. If anybody has experience with this package, Bitarray, I would like to know about it. See http://pypi.python.org/pypi/bitarray/ for information. It is available as a precompiled binary for Windows at http://www.lfd.uci.edu/~gohlke/pythonlibs/, a very nice page and the best way to access all the well-known packages (and a few good ones that are not so well known).
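For anyone who shares my hesitation about the compiled package, the packed layout can also be sketched with nothing but the standard library's `bytearray`. This is only an illustration of the storage scheme, not the Bitarray package's API; the class name `PackedBits` and its methods are my own inventions, and it will be much slower than a compiled extension.

```python
class PackedBits:
    """Fixed-size two-dimensional bit array, stored eight bits per byte."""

    def __init__(self, n_rows, n_cols):
        self.n_rows = n_rows
        self.n_cols = n_cols
        # Round up so a partial final byte still gets allocated.
        self.data = bytearray((n_rows * n_cols + 7) // 8)

    def _locate(self, row, col):
        """Return (byte index, bit mask) for the bit at (row, col)."""
        if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
            raise IndexError("bit index out of range")
        index = row * self.n_cols + col
        return index >> 3, 1 << (index & 7)

    def get(self, row, col):
        byte, mask = self._locate(row, col)
        return bool(self.data[byte] & mask)

    def set(self, row, col, value=True):
        byte, mask = self._locate(row, col)
        if value:
            self.data[byte] |= mask
        else:
            self.data[byte] &= ~mask & 0xFF

# Hypothetical sizes: 1000 people, 50 microvariables.
bits = PackedBits(1000, 50)
bits.set(3, 7)
print(bits.get(3, 7), len(bits.data))  # True 6250
```

With these made-up sizes the 50,000 bits fit in 6,250 bytes, where one byte per boolean would need 50,000.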

Anyway, using big bit arrays, I think I can guarantee that 16 coordinates would be enough to represent all possible rows of bit data. There will be fewer, I think, though I haven’t actually looked for duplicates.
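Looking for duplicates should be cheap once the rows are bits. A minimal sketch, assuming each row is available as a sequence of 0/1 values (the three sample rows here are invented, not real data):

```python
from collections import Counter

def count_distinct_rows(rows):
    """Return (number of distinct rows, Counter of row -> occurrences)."""
    counts = Counter(tuple(row) for row in rows)
    return len(counts), counts

# Invented sample: the first and third person gave identical answers.
rows = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
]
distinct, counts = count_distinct_rows(rows)
print(distinct)  # 2
```

Rows are converted to tuples so they can be used as dictionary keys; the counter then says both how many distinct rows there are and how often each one occurs.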

Whether I trust the Python package or not, it seems to be the thing to use, so I will. I’ll report my results as I go along. — dpw