Introduction to SOM/LVQ_PAK 3.0

This document describes new features and changes to the packages since the 2.x versions. If you are not familiar with the packages already, you might want to take a look at the package documentation first.

The new packages work just like the older versions if you don't use any of the new features. The file format has changed slightly, but old data files should still work: the only change is the option to ignore some vector components. Some of the new features work only in a UNIX-like environment because they require the UNIX popen() function call. The new features are:

Reading and writing compressed files

To read or write compressed files, just put the suffix .gz at the end of the filename. The file is automatically uncompressed or compressed as it is being read or written. The SOM/LVQ programs use gzip for compressing and uncompressing. They can also read files compressed with the regular UNIX compress command (since gzip can decompress them). The commands used for compressing and decompressing can be changed with command line options or at compile time.

Example: with vsom, to use a compressed data file for teaching:

vsom -din data.dat.gz ...
Since compressing and uncompressing use the popen() function to run gzip, compressed files won't work in environments that lack it (for example MS-DOS).

Using standard input/output

To use standard input or output, use a minus sign ('-') as the filename. Data is then read from stdin or written to stdout. For example, to read teaching data from stdin with vsom:
vsom -din - ...

Piped commands

If you use a filename that starts with the UNIX pipe character ('|'), the rest of the filename is executed as a command. If the file is opened for writing, the output of the SOM program is piped to the command as its standard input. Likewise, when the file is opened for reading, the output of the command is read by the SOM program.

For example:

   vsom -cin "|randinit ..." ... 
Vsom would start the program randinit when it wants to read the initial codebook. However, the same thing could be done with:
   randinit ... | vsom -cin - ...
This feature is useful when saving snapshots. The reading and writing of compressed files is actually a special case of this feature (the same restrictions about popen() apply).
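The three filename conventions described so far ('-', a leading '|', and the .gz suffix) can be summarized in a small Python sketch. This is only an illustration, not the packages' actual C code: the function name classify and its return values are hypothetical, and only the two default command templates are taken from this document. Nothing is executed here; the function merely reports what would be opened.

```python
DEFAULT_COMPRESS = "gzip -9 -c >%s"    # default compress command from this document
DEFAULT_UNCOMPRESS = "gzip -d -c %s"   # default uncompress command from this document


def classify(filename, mode="r",
             compress_cmd=DEFAULT_COMPRESS,
             uncompress_cmd=DEFAULT_UNCOMPRESS):
    """Return (kind, detail) describing how a filename would be opened.

    Hypothetical helper: the real programs do this in C and use popen().
    """
    if filename == "-":
        # A lone minus means standard input or standard output.
        return ("stdio", "stdout" if mode == "w" else "stdin")
    if filename.startswith("|"):
        # The rest of the name is a command to run through a pipe.
        return ("pipe", filename[1:].strip())
    if filename.endswith(".gz"):
        # Compressed files are a special case of a piped command:
        # the filename is substituted into the command template.
        template = compress_cmd if mode == "w" else uncompress_cmd
        return ("pipe", template % filename)
    return ("file", filename)


print(classify("data.dat.gz"))
print(classify("out.cod.gz", mode="w"))
```

Seen this way, a compressed file is just a piped command built from a template, which is why the popen() restrictions apply to both.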

Buffered input of data files

Buffered input means that the whole data set doesn't have to be held in memory all the time. SOM/LVQ can be set, for example, to hold at most 10000 lines of data in memory at one time. When those 10000 data vectors have been used, the next 10000 data vectors are loaded over the old ones (the amount of memory needed is the space for 10000 vectors, not for the whole file). Buffered reading is transparent to the user, and mostly also to the programmer, and it also works with compressed files and piped commands.

Note that when the whole file has been read once and we want to read it again, the file has to be rewound (for regular files) or the uncompressing command has to be rerun. This is done automatically and the user doesn't usually have to worry about it, but it imposes some restrictions on the input file: if the source is a pipe, it can't be rewound. Regular files, compressed files and standard input (if it is a file) work. Using a pipe works fine if you don't have to rewind it, i.e. the data doesn't end or the number of iterations is smaller than the number of data vectors.

Most programs support the buffered reading of data files. It is activated with the command line option -buffer followed by the maximum number of data vectors to be kept in memory. For example:

vsom ... -buffer 10000
would read the input data file 10000 lines at a time.
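The idea behind buffered reading and rewinding can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the packages' implementation: the function names read_buffered and passes_over are hypothetical, every non-empty line is treated as one data vector, and any header line a real data file may carry is ignored.

```python
import io


def read_buffered(stream, buffer_size):
    """Yield lists of at most buffer_size non-empty lines from stream.

    Each yielded list replaces the previous one, so only buffer_size
    lines need to be held in memory at a time.
    """
    chunk = []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        chunk.append(line)
        if len(chunk) == buffer_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly shorter, buffer


def passes_over(stream, buffer_size, passes):
    """Read the whole stream several times, rewinding before each pass."""
    for _ in range(passes):
        # Rewinding works for regular files; a real pipe is not seekable
        # and this call would raise an error, mirroring the restriction
        # described above.
        stream.seek(0)
        yield from read_buffered(stream, buffer_size)


data = io.StringIO("1 2\n3 4\n5 6\n7 8\n9 10\n")
for chunk in read_buffered(data, 2):
    print(chunk)
```

With five vectors and a buffer of two, the sketch yields buffers of two, two and one vectors.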

Leaving out components of data vectors

If you want some components of some data vectors to be ignored in calculations, mark those components with 'x' (replace the numerical value). For example, a part of your 5-dimensional data file might look like this:
1.1 2.0 0.5 4.0 5.5
1.3 6.0 x   2.9 x
1.9 1.5 0.1 0.3 x 
The components marked with x are ignored when vector distances are calculated, when the winner is searched for, when codebook vectors are adapted, and when labeling. An ignored component is not used in distance calculations, and the corresponding component of the codebook vector is not adapted. The string that marks a component to be ignored can be changed with a command line option or set at compile time.
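How masked components might be handled can be sketched in Python like this. It is a minimal illustration, not the packages' code: the function names are hypothetical, the squared Euclidean distance is assumed as the distance measure, and the adaptation step is the plain update c + alpha * (s - c) without any neighborhood function.

```python
def parse_vector(line, mask_str="x"):
    """Parse one data line; masked components become None."""
    return [None if tok == mask_str else float(tok) for tok in line.split()]


def masked_sq_distance(sample, codebook_vector):
    """Squared Euclidean distance over the unmasked components only."""
    return sum((s - c) ** 2
               for s, c in zip(sample, codebook_vector)
               if s is not None)


def adapt(sample, codebook_vector, alpha):
    """Move the codebook vector toward the sample.

    Components that are masked in the sample are left unchanged.
    """
    return [c if s is None else c + alpha * (s - c)
            for s, c in zip(sample, codebook_vector)]


sample = parse_vector("1.3 6.0 x   2.9 x")
codebook = [1.3, 5.0, 9.0, 2.9, 9.0]
print(masked_sq_distance(sample, codebook))
print(adapt(sample, codebook, 0.5))
```

In this example only the second component differs among the unmasked ones, so it alone contributes to the distance, and the third and fifth codebook components keep their values after adaptation.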

Snapshots of codebook

The vsom and lvq training programs can save snapshots of the codebook during teaching. The interval between snapshots is specified with the option -snapinterval. The snapshot filename can be specified with the option -snapfile. If no filename is given, the name of the output code file is used. The filename is actually passed to sprintf(3) as the format string, and the number of iterations so far is passed as the next argument. For example:
   vsom -snapinterval 10000 -snapfile "ex.%d.cod" ...
gives you a snapshot file every 10000 iterations, with names ex.10000.cod, ex.20000.cod, ex.30000.cod, etc.

Another example:

   ./vsom -din ex.dat -cin ex2.cod -cout ex.cod -rlen 10000 \
          -alpha 0.02 -radius 3 \
          -snapfile "|./vcal -din ex_fts.dat -cin - -cout foo.%d.cod" \
          -snapinterval 1000
This command would teach the map file ex2.cod with data from the file ex.dat for 10000 iterations. The trained codebook is saved in the file ex.cod. Every 1000 iterations the codebook is piped to vcal, which labels the codebook units with data from ex_fts.dat. The labeled codebooks are saved in files foo.1000.cod, foo.2000.cod, etc.
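The way the snapshot name is formed from the format string can be sketched in Python, whose % operator follows printf-style conventions for %d. The helper names are hypothetical. One difference from C is handled explicitly: sprintf(3) ignores an unused extra argument, whereas Python raises an error, so a template without a conversion is returned unchanged here.

```python
def snapshot_name(template, iteration):
    """Expand a -snapfile template with the current iteration count."""
    if "%" not in template:
        # No conversion in the template: sprintf would ignore the
        # extra argument, so the name is used as it is.
        return template
    return template % iteration


def is_snapshot_iteration(iteration, snapinterval):
    """True when a snapshot is due after this many iterations."""
    return iteration > 0 and iteration % snapinterval == 0


for n in range(1, 30001):
    if is_snapshot_iteration(n, 10000):
        print(snapshot_name("ex.%d.cod", n))
```

The same expansion applies to piped snapshot commands, since the whole -snapfile string, including the leading '|', is the format string.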

Randomized entry order

By default the data vectors are used in the order they appear in the data file. To use them in random order use the -rand option followed by a seed for the random number generator. For example -rand 10 would initialize the random number generator with the seed value of 10. Seed 0 initializes the random number generator with the current time.

With random entry order, the order of the vectors is randomized each time data is read from the file or reused. If the whole data file is loaded into memory (buffered loading is not used), the whole set is randomized. If buffered loading is used, the randomization is done only within the part of the file that is currently loaded into memory.
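The interaction between random entry order and buffering can be sketched in Python as below. This is an illustration only: the function names are hypothetical, and Python's random module stands in for whatever random number generator the packages use, so the actual orderings will differ.

```python
import random
import time


def make_rng(seed):
    """Create a generator; seed 0 means 'seed from the current time'."""
    return random.Random(int(time.time()) if seed == 0 else seed)


def shuffled_chunks(vectors, buffer_size, rng):
    """Yield the data one buffer at a time, shuffled within each buffer.

    With buffer_size equal to len(vectors) the whole set is shuffled;
    with a smaller buffer, vectors never move between buffers.
    """
    for start in range(0, len(vectors), buffer_size):
        chunk = vectors[start:start + buffer_size]  # copy of one buffer
        rng.shuffle(chunk)
        yield chunk


data = list(range(10))
for chunk in shuffled_chunks(data, 4, make_rng(10)):
    print(chunk)
```

With a buffer of four, each printed buffer is a permutation of its own four (or fewer) vectors, which shows why buffered loading gives only a local randomization.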

Common options

These options are common to all programs:
-mask_str string
Sets the string that indicates a vector component to be ignored in data files. For example,
         vsom ... -mask_str "*" ...
would ignore components that are marked with '*' instead of 'x'. Longer strings can also be used.
Displays the version number of the SOM/LVQ code library and compile date.
-compress_cmd command
Sets the command used to compress files. Default: gzip -9 -c >%s
-uncompress_cmd command
Sets the command used to decompress files. Default: gzip -d -c %s

Environment variables

Some defaults can be set with environment variables:
Sets the command used to compress files. Default: gzip -9 -c >%s
Sets the command used to decompress files. Default: gzip -d -c %s
Sets the string that indicates a vector component to be ignored. See also the -mask_str option under Common options.