Generic Analyzer: Difference between revisions

From csml-wiki.northwestern.edu

Revision as of 13:02, 15 May 2014

Overview

The "generic analyzer" (GA) is a program available on all local machines, allowing rapid statistical analysis of typical simulation data.

Data files are expected to be arranged in columns. For each column, GA reports the average along with its standard deviation and correlation coefficient. As a rule of thumb, if the correlation coefficient is above 0.1, the data are considered correlated and the standard deviation is underestimated. This is typical for simulation data sampled at intervals shorter than the autocorrelation time. To address this, GA groups the data in blocks and makes another pass. Blocking decreases the number of data points, which generally reduces the correlation coefficient.
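The blocking idea can be illustrated with a minimal Python sketch. This is not GA's actual code: the function names, the doubling of the block size on each pass, the use of the lag-1 autocorrelation as the "correlation coefficient", and the reporting of the standard deviation of the mean are all assumptions made for illustration. Only the 0.1 threshold is taken from the text above.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def lag1_correlation(values):
    """Lag-1 autocorrelation coefficient (assumed meaning of 'correlation coefficient')."""
    m = mean(values)
    var = sum((v - m) ** 2 for v in values)
    if var == 0.0:
        return 0.0
    cov = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return cov / var


def standard_error(values):
    """Standard deviation of the mean, valid only if the entries are independent."""
    m = mean(values)
    n = len(values)
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)


def block_once(values):
    """Average neighbouring pairs, halving the number of data points."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]


def blocking_analysis(values, threshold=0.1, min_blocks=8):
    """Return all blocking levels and the first one with correlation below threshold.

    Each level is a tuple (points, average, standard error, correlation).
    The second return value is None if no level is decorrelated.
    """
    levels = []
    data = list(values)
    while len(data) >= min_blocks:
        levels.append((len(data), mean(data), standard_error(data), lag1_correlation(data)))
        data = block_once(data)
    chosen = next((level for level in levels if level[3] < threshold), None)
    return levels, chosen


if __name__ == "__main__":
    # Hypothetical test data: a strongly correlated AR(1) series.
    random.seed(1)
    x, series = 0.0, []
    for _ in range(16384):
        x = 0.9 * x + random.gauss(0.0, 1.0)
        series.append(x)
    levels, chosen = blocking_analysis(series)
    for level in levels:
        print(level, "<" if level is chosen else "")
```

For correlated input, the error estimate at the unblocked level is too small, and it grows as the blocks become longer than the correlation time.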

General usage

 generic_analyzer [options] filename [columns]

filename is a plain text file. Lines starting with a '#' are ignored (use this feature to include column descriptions and other information in your simulation data). If filename ends with '.gz', the file is assumed to be compressed with gzip and is decompressed by GA on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. As a result, no additional disk space is required and no time is spent recompressing the data after the analysis.

If the number of columns is not provided, GA will determine it from the first non-comment line in the file. Otherwise, only the number of columns indicated will be analyzed. Also note the -c option below.
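The input conventions described here can be mimicked in a few lines of Python, which may be useful for checking a data file before passing it to GA. This is a sketch only, not GA's own parser: the function name read_columns and its parameters are hypothetical, whitespace-separated fields are assumed, and skipping blank lines is an assumption. The skip_columns and discard parameters are meant to mirror the -c and -i options listed under Options.

```python
import gzip


def read_columns(path, ncolumns=None, skip_columns=0, discard=0):
    """Read a column-oriented text file into a list of columns (lists of floats).

    Lines starting with '#' are ignored; files ending in '.gz' are
    decompressed in memory while reading.
    """
    opener = gzip.open if path.endswith(".gz") else open
    columns = None
    seen = 0
    with opener(path, "rt") as handle:
        for line in handle:
            # Comment lines are not samples; blank lines are skipped (assumption).
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()[skip_columns:]
            if columns is None:
                # Column count comes from the first non-comment line unless given.
                if ncolumns is None:
                    ncolumns = len(fields)
                columns = [[] for _ in range(ncolumns)]
            seen += 1
            if seen <= discard:
                continue  # sample dropped, e.g. taken before equilibration
            # zip stops after ncolumns fields, so extra columns are ignored.
            for column, field in zip(columns, fields):
                column.append(float(field))
    return columns if columns is not None else []
```

Calling read_columns("data.txt.gz", ncolumns=2, skip_columns=1, discard=100) would, under these assumptions, return the second and third columns of the file without its first 100 samples.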

Options

  • -a n
    By default, GA allows up to 10^7 samples (but note that this number is reduced for files that contain large numbers of columns, see Memory considerations below). This option multiplies the maximum number of entries by n.
  • -c n
    Ignore the first n columns on each line. If the number of columns is specified explicitly on the command line, these columns are counted after the ignored columns.
  • -i m
    Discard the first m samples. Note that comment lines are not counted as samples. This option is normally employed to exclude simulation data taken when a system is not equilibrated yet.
  • -o filename

Interpreting GA output

When reading the output for a given column, start at the top and go down until you reach the first estimate for which the correlation coefficient is less than 0.1. This line provides the proper estimate of the standard deviation. To help you quickly locate this line, it is marked by a '<' at the end. Note that the correlation time can be different for different columns. The summary printed at the end of the output provides the average of each column along with the correct standard deviation and the number of independent samples. If the data in a certain column are not decorrelated even at the coarsest blocking level, a warning is issued.

Memory considerations

By default, GA allows up to 10^7 samples. If the file contains more samples, use the -a option to allocate memory for an integer multiple of this limit. However, after the number of columns has been determined (either specified explicitly on the command line or automatically from the first sample), the number of entries will be reduced to limit the total memory consumption to 2 GB. If the -a option is used, this limit will be increased to an integer multiple of 2 GB.

The total memory requirements can also be reduced by analyzing a file in multiple passes, selecting a subset of the columns in each pass. For example, for a file with 1000 columns, the first 600 columns can be analyzed via

generic_analyzer file 600

and the remaining 400 columns can be analyzed via

generic_analyzer -c 600 file 400

(Note that the argument 400 is optional in the second invocation, but it would be required if the file were analyzed in more than two passes.)
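For more than two passes, the command lines follow a simple pattern: each pass skips the columns already analyzed and names the number of columns to analyze next. The Python helper below is a hypothetical convenience, not part of GA; it only generates command strings using the -c option and the optional column count described above.

```python
def plan_passes(filename, total_columns, columns_per_pass):
    """Return one generic_analyzer command line per pass over the columns."""
    commands = []
    for skipped in range(0, total_columns, columns_per_pass):
        # The last pass may cover fewer columns than the others.
        count = min(columns_per_pass, total_columns - skipped)
        options = f"-c {skipped} " if skipped else ""
        commands.append(f"generic_analyzer {options}{filename} {count}")
    return commands


for command in plan_passes("file", 1000, 400):
    print(command)
# prints:
# generic_analyzer file 400
# generic_analyzer -c 400 file 400
# generic_analyzer -c 800 file 200
```

In the three-pass case, the column count on the second command is what keeps that pass from running through to the last column.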

Special usage notes

Algorithm

Download binary versions (Linux and OS X)

Note: The current version is dated 03/28/2014. Earlier versions are not supported.