Generic Analyzer

From csml-wiki.northwestern.edu

Overview

The "generic analyzer" (GA) is a program available on all local machines, allowing rapid statistical analysis of typical simulation data.

Data files are expected to be arranged in columns. For each column, the GA reports the average together with its standard deviation and correlation coefficient. As a rule of thumb, if the correlation coefficient is above 0.1, the data are considered correlated and the standard deviation is underestimated. This is typical for simulation data that are sampled at intervals shorter than the autocorrelation time. To address this, the GA groups the data in blocks and makes another pass. Blocking decreases the number of data points and generally reduces the correlation coefficient.
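
The blocking idea can be sketched in a few lines of Python. This is only an illustration, not the GA's actual code: it assumes that the reported "standard deviation" is the standard error of the mean, that the "correlation coefficient" is the lag-1 autocorrelation, and that each pass averages neighbouring pairs. None of these details is confirmed on this page, and the function names are hypothetical.

```python
import math

def column_stats(values):
    """Return (mean, standard error of the mean, lag-1 correlation coefficient).

    Assumption: these are the three quantities the GA prints per column.
    Requires at least two values.
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    # Error of the mean, valid only if the samples are uncorrelated.
    stderr = math.sqrt(sum_sq / (n - 1) / n)
    # Correlation between each sample and its successor.
    lagged = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    corr = lagged / sum_sq if sum_sq > 0 else 0.0
    return mean, stderr, corr

def block_once(values):
    """Average neighbouring pairs, halving the number of data points.

    A trailing unpaired value is dropped.
    """
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]
```

Calling column_stats, then block_once, then column_stats again mimics one extra pass: the average is unchanged when the number of samples is even, while the error estimate typically grows and the correlation coefficient typically shrinks.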

General usage

 generic_analyzer [options] filename [columns]

filename is a plain text file. Lines starting with a '#' are ignored (use this feature to include column descriptions and other information in your simulation data). If filename ends with '.gz', the file is assumed to be compressed with gzip and is decompressed by the GA on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. As a result, no additional disk space is required and no time is spent recompressing the data after the analysis.
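
For readers who want to load the same kind of file in their own scripts, the input conventions above can be mimicked in Python roughly as follows. This is a sketch, not the GA's reader: read_columns is a hypothetical name, and skipping blank lines and splitting on whitespace are assumptions not stated on this page.

```python
import gzip

def read_columns(filename):
    """Read a column-formatted text file into a list of rows of floats.

    Files ending in '.gz' are decompressed while reading; nothing is
    written to disk. Lines starting with '#' are skipped.
    """
    opener = gzip.open if filename.endswith(".gz") else open
    rows = []
    with opener(filename, "rt") as handle:
        for line in handle:
            # Assumption: blank lines are skipped as well as comment lines.
            if line.startswith("#") or not line.strip():
                continue
            rows.append([float(field) for field in line.split()])
    return rows
```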

If the number of columns is not provided, GA will determine it from the first non-comment line in the file. Otherwise, only the number of columns indicated will be analyzed. Also note the -c option below.

Options

  • -a n
    By default, GA allows up to 10^7 samples (note that this number is reduced for files that contain large numbers of samples; see Memory considerations). This option multiplies the maximum number of entries by n.
  • -c n
    Ignore the first n columns on each line. If the number of columns is specified explicitly on the command line, those columns are counted starting after the ignored columns.
  • -i m
    Discard the first m samples. Note that comment lines are not counted as samples. This option is normally used to exclude simulation data taken before the system has equilibrated.
  • -o filename
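
The combined effect of -c, -i and an explicit column count can be illustrated on data already held in memory. The sketch below is a Python stand-in for the selection described above, not the GA's implementation; select_data and its parameter names are hypothetical.

```python
def select_data(rows, skip_columns=0, skip_samples=0, n_columns=None):
    """Apply the selection described for -c, -i and the [columns] argument.

    rows          list of samples, each a list of column values
                  (comment lines already removed)
    skip_columns  like -c n: ignore the first n columns of every line
    skip_samples  like -i m: discard the first m samples
    n_columns     explicit column count, counted after the ignored columns;
                  None keeps all remaining columns
    """
    kept = rows[skip_samples:]
    end = None if n_columns is None else skip_columns + n_columns
    return [row[skip_columns:end] for row in kept]
```

For example, with skip_columns=1 and n_columns=2, the second and third columns of each line are analyzed and any further columns are left out.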

Interpreting GA output

When reading the output for a given column, start at the top and go down until you reach the first estimate for which the correlation coefficient is less than 0.1. This line provides the proper estimate of the standard deviation. To help you quickly locate this line, it is marked by a '<' at the end. Note that the correlation time can be different for different columns. The summary printed at the end of the output provides the average of each column along with the correct standard deviation and the number of independent samples. If the data in a certain column are not decorrelated even at the coarsest blocking level, a warning is issued.
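
The reading rule above amounts to a simple top-to-bottom scan. The Python sketch below shows that rule applied to a list of per-level estimates; the tuple layout (block size, standard deviation, correlation coefficient) and the name pick_estimate are assumptions made for illustration and do not describe the GA's actual output format.

```python
def pick_estimate(levels, threshold=0.1):
    """Return the first level whose correlation coefficient is below threshold.

    levels is ordered from the finest blocking level (top of the output)
    to the coarsest. Returns None if no level is decorrelated, which is
    the case in which the GA issues a warning.
    """
    for level in levels:
        block_size, std_dev, corr = level
        if corr < threshold:
            return level
    return None

# Made-up numbers for illustration only, not real GA output.
levels = [(1, 0.010, 0.62), (2, 0.013, 0.35), (4, 0.016, 0.08), (8, 0.017, 0.02)]
print(pick_estimate(levels))  # (4, 0.016, 0.08)
```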

Memory considerations

Special usage notes

Algorithm