Data file handling


Opening, loading and saving files

Loading and saving datafiles and codebooks is straightforward if you don't need anything fancy. You need only three functions: open_entries to open a data or codebook file, close_entries to close a file you opened earlier, and save_entries to save codebook entries to a file. A fourth function, save_entries_wcomments, also saves data but lets you put your own headers or comments at the start of the file, after the default headers.
struct entries *open_entries(char *name)
Opens a data file. Returns a pointer to a ready-to-use entries-structure. After opening the file you can set the buffering mode with set_buffer, or with set_teach_params along with other options. Note that no data is loaded from the file at this point. The data is loaded when it is first requested, usually by rewind_entries. Returns NULL on error.

void close_entries(struct entries *entries)
Deallocates an entries-file. Frees memory allocated for entry-list and closes the file associated with entries if there is one open.

int save_entries_wcomments(struct entries *codes, char *out_code_file, char *comments)
Saves data to a file with optional comment or header lines. Codes is the data to save, out_code_file is the file name and comments is a pointer to a string containing the comments, or NULL if there are no comments. Returns a non-zero value on error.

int save_entries(struct entries *codes, char *out_code_file)
Saves data to a file. This is actually a macro that calls save_entries_wcomments with the comments as NULL.

int label_not_needed(int level)
Controls whether labels are needed for every data entry or not. Setting level to nonzero means that labels are not necessary. This function sets the default used by open_entries.

The following functions are lower level functions that are used mainly by the functions above and when going through a data file. In most cases you don't have to call any of these directly. Two of them, however, are worth knowing: write_header writes the header of an entries-structure to a file, and write_entry writes a single data_entry to a file. They come in handy when you want to make a program that acts as a filter. See the source code for the program visual for an example of how to use them.
struct data_entry *load_entry(struct entries *entr, struct data_entry *entry)
Loads one data_entry from file associated with entr. If entry is non-NULL, the old data_entry is reused, otherwise a new entry is allocated. Returns NULL on error.

int write_header(struct file_info *fi, struct entries *codes)
Writes the default header information to a file.

int write_entry(struct file_info *fi, struct entries *entr, struct data_entry *entry)
Writes one data entry (entry) to file fi from entr.

struct entries *open_data_file(char *name)
Opens a data file for reading. Returns a pointer to entries-structure or NULL on error. If name is NULL, just allocates the data structure but doesn't open any file.

int read_headers(struct entries *entries)
Reads the header information from the file and sets the entries variables accordingly. Returns a non-zero value on error.

int skip_headers(struct file_info *fi)
Skips over headers of a file. Used when a file is re-opened. Returns a non-zero value on error.

int rewind_datafile(struct file_info *fi)
Goes to the point in the file where the first data entry begins. You don't usually have to call this yourself because rewind_entries and next_entry do it automatically. Returns 0 on success, error code otherwise.

struct entries *read_entries(struct entries *entries)
Reads data from file to memory. If LOADMODE_ALL is used the whole file is loaded into memory at once and the file is closed. If buffered loading (LOADMODE_BUFFER) is used at most N data vectors are read into memory at one time. The buffer size N is given in the entries->buffer field. In both cases if there are any previous entries in the entries structure, they are overwritten and the memory space allocated for them is reused. Returns NULL on error, otherwise returns entries.
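
To make the difference between the two load modes concrete, here is a minimal, self-contained sketch of the idea. It is not the library's implementation: chunk_source, chunk_entries and chunk_read are hypothetical names, the "file" is just an array in memory, and closing the file in the load-everything case is left out. It only illustrates that a buffer of 0 loads everything at once, that a buffer of N loads at most N values per call, and that previously loaded values are overwritten while the storage is reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a data file: an array plus a read position. */
struct chunk_source {
  const float *values;
  size_t count;  /* total number of values in the "file" */
  size_t pos;    /* index of the next value to be read */
};

/* Hypothetical stand-in for the in-memory entries (not struct entries). */
struct chunk_entries {
  float *loaded;    /* storage that is reused from one read to the next */
  size_t capacity;  /* number of values that fit in loaded */
  size_t n_loaded;  /* number of values currently in memory */
  long buffer;      /* 0 = load everything, N > 0 = at most N per read */
};

/* Read the next batch, overwriting whatever was loaded before.
   Returns e on success (n_loaded is 0 at end of input) and NULL if
   memory could not be allocated. */
struct chunk_entries *chunk_read(struct chunk_entries *e,
                                 struct chunk_source *src)
{
  size_t remaining, want, i;

  assert(e != NULL && src != NULL && e->buffer >= 0);
  remaining = src->count - src->pos;
  want = remaining;
  if (e->buffer > 0 && (size_t)e->buffer < remaining)
    want = (size_t)e->buffer;

  if (want > e->capacity) {  /* grow only when the old block is too small */
    float *tmp = realloc(e->loaded, want * sizeof *tmp);
    if (tmp == NULL)
      return NULL;
    e->loaded = tmp;
    e->capacity = want;
  }
  for (i = 0; i < want; i++)
    e->loaded[i] = src->values[src->pos + i];
  src->pos += want;
  e->n_loaded = want;
  return e;
}
```

In this sketch, calling chunk_read repeatedly with buffer set to 2 on a five-value source yields batches of 2, 2 and 1 values in the same memory block, while buffer set to 0 yields all five in one call.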

Going through the data

A very common operation is going through all the vectors in a data or code file, possibly many times, as when teaching a map with vsom. Two functions are provided to do this: rewind_entries, which gets you the first vector from a datafile, and next_entry, which gives you each following vector. These functions handle buffered reading, compressed files, etc. automatically, so you don't have to worry about them. A typical use could look like this:
  struct data_entry *dtmp;
  eptr p;

  /* ... */

  dtmp = rewind_entries(data, &p); /* go to start of data, initialize p */
  while (dtmp != NULL) {

    /* whatever needs to be done */

    dtmp = next_entry(&p); /* get next entry */
  }
Here data is an entries-structure containing your data.

Eptr is used to store the current position in the datafile and it is defined as:

struct entry_ptr {
  struct data_entry *current;
  struct entries *parent;
  long index;
};
typedef struct entry_ptr eptr;
Current is a pointer to the current vector, parent is a pointer to the entries with which this pointer is associated and index is the number of the current entry in the datafile (first is 0). You can have multiple eptrs for the same datafile as long as you aren't using buffered input. Also, eptrs can be copied from one eptr to another.

When the end of the datafile is reached, next_entry returns NULL. If you want to read the data again, just call rewind_entries again and then go on as before.
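
The cursor behaviour described above can be tried out without the library using the toy model below. This is only a sketch under stated assumptions: toy_entry, toy_entries, toy_eptr, toy_rewind, toy_next and toy_sum_passes are hypothetical stand-ins, the data is a plain in-memory linked list, and no file or buffer handling is modelled. What it shows is how a cursor with current, parent and index fields is initialized by a rewind, advanced until NULL is returned, and rewound again for another pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the library's definitions. */
struct toy_entry {
  float value;             /* made-up payload */
  struct toy_entry *next;  /* next entry in the list, NULL at the end */
};

struct toy_entries {
  struct toy_entry *first; /* head of the in-memory entry list */
};

/* Mirrors the three fields of the eptr structure shown above. */
struct toy_eptr {
  struct toy_entry *current;
  struct toy_entries *parent;
  long index;
};

/* Point the cursor at the first entry and return it (NULL if empty). */
struct toy_entry *toy_rewind(struct toy_entries *entries, struct toy_eptr *p)
{
  assert(entries != NULL && p != NULL);
  p->parent = entries;
  p->current = entries->first;
  p->index = 0;
  return p->current;
}

/* Advance the cursor; returns NULL once the end of the list is reached. */
struct toy_entry *toy_next(struct toy_eptr *p)
{
  assert(p != NULL);
  if (p->current == NULL)
    return NULL;
  p->current = p->current->next;
  if (p->current != NULL)
    p->index++;
  return p->current;
}

/* Sum all payloads over several full passes, rewinding before each one. */
float toy_sum_passes(struct toy_entries *entries, int passes)
{
  struct toy_eptr p;
  struct toy_entry *e;
  float sum = 0.0f;
  int i;

  for (i = 0; i < passes; i++)
    for (e = toy_rewind(entries, &p); e != NULL; e = toy_next(&p))
      sum += e->value;
  return sum;
}
```

Because the toy cursor is a plain struct, assigning one toy_eptr to another copies the position, which is the same property the text above notes for eptrs.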

struct data_entry *rewind_entries(struct entries *entries, eptr *ptr)
Go to the first entry in entries list. Returns pointer to first data_entry and initializes the entry pointer (ptr) structure. The ptr pointer should be used with next_entry to get the following data vectors from the datafile. Loads data from file if it hasn't been loaded yet and rewinds file if we are using buffered reading.

struct data_entry *next_entry(eptr *ptr)
Get next entry from the entries table. Ptr is the entry pointer obtained from rewind_entries earlier. Returns NULL when the end of the table or the end of the file is reached. If loadmode is buffered, loads more data from file when needed.

Manipulating data

These functions are used to allocate, free or copy individual data_entrys or whole codebooks or files.
struct entries *alloc_entries(void)
Allocate and initialize an entries-structure. Returns NULL on error.

struct entries *copy_entries(struct entries *entr)
Creates a new entries structure with the same parameters as the original. Doesn't copy data_entrys.

struct data_entry *init_entry(struct entries *entr, struct data_entry *entry)
Initialize and possibly allocate room for a data_entry that is to be in entr. If entry is NULL, a new entry is allocated and initialized. If entry is a pointer to an old entry, the old entry is re-initialized. Returns NULL on error.

void free_entry(struct data_entry *entry)
Deallocates a data_entry.

void free_entrys(struct data_entry *data)
Free a list of entries starting with data.

struct data_entry *copy_entry(struct entries *entries, struct data_entry *data)
Make a copy of data.
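
init_entry (and load_entry earlier) follow an allocate-or-reuse convention: pass NULL to get a new entry, or pass an old entry to have it re-initialized. The sketch below shows that convention in isolation. It is an illustration only: sketch_entry, sketch_init_entry, sketch_copy_entry and SKETCH_DIM are hypothetical, the real data_entry has a different layout, and the choice to return a copy that is detached from any list is an assumption of this sketch, not something the library is documented to do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_DIM 4  /* hypothetical fixed vector length for this sketch */

/* Hypothetical stand-in for a data entry (not struct data_entry). */
struct sketch_entry {
  float points[SKETCH_DIM];
  struct sketch_entry *next;
};

/* If entry is NULL, allocate a fresh one; otherwise reuse the caller's.
   Either way the entry is reset to a known state. Returns NULL only
   when allocation fails. */
struct sketch_entry *sketch_init_entry(struct sketch_entry *entry)
{
  if (entry == NULL) {
    entry = malloc(sizeof *entry);
    if (entry == NULL)
      return NULL;
  }
  memset(entry->points, 0, sizeof entry->points);
  entry->next = NULL;
  return entry;
}

/* Return a newly allocated copy of src, or NULL if allocation fails.
   The copy is not linked into any list (an assumption of this sketch). */
struct sketch_entry *sketch_copy_entry(const struct sketch_entry *src)
{
  struct sketch_entry *copy;

  assert(src != NULL);
  copy = sketch_init_entry(NULL);
  if (copy != NULL)
    memcpy(copy->points, src->points, sizeof copy->points);
  return copy;
}
```

Reusing an entry this way avoids one allocation per vector in a read loop, which is presumably why the same convention appears in load_entry.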

Other functions

void set_buffer(struct entries *entries, long buffer)
Sets the buffer size of an entries-file. A nonzero buffer selects LOADMODE_BUFFER; a buffer of 0 selects LOADMODE_ALL.

int set_teach_params(struct teach_params *params, struct entries *codes, struct entries *data, long dbuffer)
Sets values in the teaching parameter structure params based on values given in the codebook (codes) and data (data) files. Dbuffer is the data file's buffer length, or 0 if no buffering is wanted. By default sets topology and map type from codes and selects winner, distance and adaptation functions that use Euclidean distance. The teach_params-structure is described in more detail in the Customizing SOM/LVQ_PAK document.