Generic Analyzer

From csml-wiki.northwestern.edu

Overview

The "generic analyzer" (GA) is a program available on all local machines, allowing rapid statistical analysis of typical simulation data. It is particularly useful for processing output from Monte Carlo and molecular dynamics simulations.

Data files are expected to be arranged in columns. For each column, GA reports the average, its standard deviation, and the correlation coefficient. As a rule of thumb, if the correlation coefficient is above 0.1, the data are considered correlated and the standard deviation is underestimated. This is typical for simulation data that are sampled more often than once per autocorrelation time. To address this, GA groups the data in blocks and makes another pass. Each such pass decreases the number of data points, which generally reduces the correlation coefficient.
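
The blocking idea described above can be illustrated with a short Python sketch. The algorithm GA actually uses is not documented on this page, so everything here is an assumption made for illustration: the lag-1 definition of the correlation coefficient, the pairwise (factor of two) blocking, the min_blocks cutoff, and all function names.

```python
import math

def column_stats(x):
    """Mean, standard error of the mean, and lag-1 correlation coefficient."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    # Covariance of neighbouring samples, normalized by the sample variance.
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    corr = cov / var if var > 0 else 0.0
    return mean, math.sqrt(var / n), corr

def block_once(x):
    """Average neighbouring pairs, halving the number of data points."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def blocking_analysis(x, threshold=0.1, min_blocks=8):
    """Collect (mean, error, corr, n_points) per blocking level.

    Stops at the first level whose |corr| is below the threshold, or when
    fewer than min_blocks points remain (hypothetical cutoff).
    """
    levels = []
    while len(x) >= min_blocks:
        mean, err, corr = column_stats(x)
        levels.append((mean, err, corr, len(x)))
        if abs(corr) < threshold:
            break
        x = block_once(x)
    return levels
```

Calling blocking_analysis on one column of strongly correlated data should show the error estimate growing from level to level while the correlation coefficient drops, which is the behaviour the rule of thumb relies on.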

General usage

 generic_analyzer [options] filename [columns]

filename is a plain text file. Lines starting with a '#' are ignored (use this feature to insert column descriptions and other information into your simulation data). If filename ends with '.gz', the file is assumed to be compressed with gzip and is decompressed by GA on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. As a result, no additional disk space is required, and no time is spent compressing the data again after the analysis.

If the number of columns is not provided, GA will determine it from the first non-comment line in the file. Otherwise, only the number of columns indicated will be analyzed. Also note the -c option below.
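
As an illustration of the input format, the following Python sketch writes a data file with one '#' comment line followed by one sample per line. It is not part of GA; the function name write_ga_input and the choice of single spaces as column separators are assumptions (the page only states that data are arranged in columns).

```python
import gzip

def write_ga_input(path, header, rows):
    """Write rows of numbers as space-separated columns.

    The first line is a '#' comment listing the column names. A path
    ending in '.gz' is written gzip-compressed.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt") as fh:
        fh.write("# " + "  ".join(header) + "\n")
        for row in rows:
            # One sample per line, no blank lines in between.
            fh.write(" ".join(f"{v:.10g}" for v in row) + "\n")
```

For example, write_ga_input("run1.dat.gz", ["energy", "pressure"], samples) would produce a compressed file with two data columns, where samples is any iterable of two-element rows.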

Options

  • -a n
    By default, GA allows up to 10^7 samples (this number is reduced for files that contain large numbers of columns; see Memory considerations below). This option multiplies the maximum number of entries by n.
  • -c n
    Ignore the first n columns on each line. If the number of columns is specified explicitly on the command line, these columns are counted after the ignored columns.
  • -i m
    Discard the first m samples. Note that comment lines are not counted as samples. This option is normally employed to exclude simulation data taken when a system is not equilibrated yet.
  • -o filename
    Redirect the data normally written to ana.dat (see below) to filename.

Interpreting GA output

When reading the output for a given column, start at the top and go down until you reach the first estimate for which the correlation coefficient is less than 0.1. This line provides the proper estimate of the standard deviation. To help you quickly locate this line, it is marked by a '<' at the end. Note that the correlation time can be different for different columns. The summary printed at the end of the output provides the average of each column along with the correct standard deviation and the number of independent samples. If the data in a certain column are not decorrelated even at the coarsest blocking level, a warning is issued.

The screen output of GA can be redirected to a file, but note that all output is written to stderr. When using the bash shell, redirection is achieved via

 generic_analyzer file 2> redirected_output

GA also writes a compact version of the analysis results to a file. For n columns, this file contains a single line of the format 'A1 S1 A2 S2 ... An Sn', where Ai is the average of column i and Si the corresponding standard deviation (error of the mean). By default the file is called 'ana.dat', but this can be changed via the -o option. Note that a second invocation of GA will not overwrite the file, but instead append another line to it. This format is useful for plotting results in gnuplot, via a command of the form

 plot "ana.dat" using 0:1:2 with err

or even

 plot "ana.dat" using 0:1:2 with err, "" using 0:3:4 with err, "" using 0:5:6 with err

which will plot the average values of three columns (and their standard deviations) as a function of the line number in 'ana.dat', where each line number represents a different invocation of GA. Note that the standard deviations written into 'ana.dat' are those corresponding to the first decorrelated set of samples for each column (if no decorrelated set exists, the standard error obtained for the coarsest blocking level is used).
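
For processing outside gnuplot, the 'A1 S1 A2 S2 ... An Sn' format can be read with a few lines of Python. This is a minimal sketch based only on the format described above; the function name read_ana is hypothetical, and the check for an odd number of fields is an added safeguard, not something GA is known to require.

```python
def read_ana(path="ana.dat"):
    """Return one list of (average, error) pairs per line of the file.

    Each line corresponds to one invocation of GA and holds the fields
    A1 S1 A2 S2 ... An Sn.
    """
    runs = []
    with open(path) as fh:
        for line in fh:
            values = [float(tok) for tok in line.split()]
            if not values:
                continue  # tolerate an empty line in this reader
            if len(values) % 2:
                raise ValueError(f"odd number of fields: {line!r}")
            # Pair each average with the error that follows it.
            runs.append(list(zip(values[0::2], values[1::2])))
    return runs
```

With this, read_ana()[k][i] gives the average and standard deviation of column i+1 from the (k+1)-th invocation.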

Memory considerations

By default, GA allows up to 10^7 samples. If the file contains more samples, use the -a option to allocate memory for an integer multiple of this limit. However, after the number of columns has been determined (either specified explicitly on the command line or automatically from the first sample), the number of entries will be reduced to limit the total memory consumption to 2 GB. If the -a option is used, this limit will be increased to an integer multiple of 2 GB.

The total memory requirements can also be reduced by analyzing a file in multiple passes, selecting a subset of the columns in each pass. For example, for a file with 1000 columns, the first 600 columns can be analyzed via

 generic_analyzer file 600

and the remaining 400 columns can be analyzed via

 generic_analyzer -c 600 file 400

(Note that the argument 400 is optional in the second invocation, but it would be required if the file were analyzed in more than two passes.)

The maximum number of columns handled by GA is 2048, so files with more columns must always be processed in multiple passes, following the same approach.
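
The multi-pass approach can be automated with a small helper that builds the command line for each pass. The sketch below is an illustration, not part of GA: the function name ga_passes and its per_pass parameter are hypothetical, and only the -c option and the column-count argument described on this page are used.

```python
def ga_passes(filename, total_columns, per_pass=2048):
    """Build one generic_analyzer argument list per pass over the columns."""
    if not 1 <= per_pass <= 2048:
        raise ValueError("per_pass must be between 1 and 2048")
    commands = []
    for skip in range(0, total_columns, per_pass):
        # Number of columns analyzed in this pass.
        count = min(per_pass, total_columns - skip)
        cmd = ["generic_analyzer"]
        if skip:
            cmd += ["-c", str(skip)]  # ignore columns handled by earlier passes
        cmd += [filename, str(count)]
        commands.append(cmd)
    return commands
```

For the example above, ga_passes("file", 1000, per_pass=600) reproduces the two invocations shown. Each argument list can be passed to subprocess.run. Keep in mind that every invocation appends its own line to 'ana.dat' unless -o is used to select a different file.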

Special usage notes

  • To check whether non-equilibrated data at the beginning of a file is affecting the calculated averages, run GA again on the same data file, omitting a fraction of the data via the -i option. If this leads to a statistically significant change in the averages, it may indicate that the first part of the file contains samples that are not equilibrated.
  • All lines (except comment lines) must contain at least the number of columns specified on the command line, or the number of columns automatically determined from the first non-comment line. If a line contains more columns, a warning is issued, but the analysis proceeds. On the other hand, if any line contains fewer columns than this number, the analysis is aborted.
  • Blank lines are not permitted, not even at the end of a file.
  • Although GA tries to detect corruptions in a data file, limitations in the scanf() function make it ignore certain non-numerical input. Specifically, an entry that starts with a number followed by a spurious character is truncated. For example, '24j3' is read as '24'. Scientific notation, however, is read correctly; that is, 5.3e2 is read as 530.
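
Because of these restrictions, it can be useful to check a data file before running GA on it. The following Python sketch flags blank lines, lines with too few or too many columns, and entries such as '24j3' that are not complete numbers. It is an independent illustration, not GA's own validation: the function name check_ga_file is hypothetical, Python's float() is used as the test for a valid number (an assumption, since it is not identical to what scanf() accepts), and the -c option is not modelled.

```python
import gzip

def check_ga_file(path, columns=None):
    """Return a list of (line_number, message) problems found in the file."""
    problems = []
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.startswith("#"):
                continue  # comment lines are ignored
            fields = line.split()
            if not fields:
                problems.append((lineno, "blank line"))
                continue
            if columns is None:
                columns = len(fields)  # taken from the first data line
            if len(fields) < columns:
                problems.append((lineno, "too few columns"))
            elif len(fields) > columns:
                problems.append((lineno, "extra columns"))
            for tok in fields[:columns]:
                try:
                    float(tok)
                except ValueError:
                    problems.append((lineno, f"non-numeric entry {tok!r}"))
    return problems
```

An empty list means that none of these problems were found.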

Algorithm

(coming soon)

Download binary versions (Linux and OS X)

The current version is dated 03/28/2014. It is strongly recommended that you upgrade from any earlier version. You can download binary versions of this program here, but note that it was created for internal lab use; we cannot provide any support.