Generic Analyzer

From csml-wiki.northwestern.edu
Latest revision as of 09:09, 14 August 2021

Overview

The "generic analyzer" (GA) is a command-line driven program that allows rapid statistical analysis of typical simulation data. It is particularly useful for processing output from Monte Carlo and molecular dynamics simulations.

Data files are expected to be arranged in columns, and GA will provide the average of each column, along with its standard deviation and correlation coefficient. As a rule of thumb, if the correlation coefficient is above 0.1 (i.e., more than 10% correlation), the data are considered correlated and the standard deviation is underestimated. This is typical for simulation data when sampling takes place more often than the autocorrelation time. To address this, GA will group the data in blocks and make another pass. The number of data points is now decreased, generally reducing the correlation coefficient. This procedure is repeated with iteratively larger block sizes.
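The blocking procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GA's actual implementation: the source does not say how GA defines the correlation coefficient or how it grows the blocks, so the sketch assumes the lag-1 autocorrelation of consecutive samples and a doubling of the block size at every level. All function names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def standard_error(xs):
    """Standard deviation of the mean, treating the samples as independent."""
    m = mean(xs)
    n = len(xs)
    variance = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(variance / n)


def lag1_correlation(xs):
    """Correlation between consecutive samples (assumed definition, see lead-in)."""
    m = mean(xs)
    total = sum((x - m) ** 2 for x in xs)
    if total == 0.0:
        return 0.0
    cov = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    return cov / total


def block_once(xs):
    """Average adjacent pairs; a trailing unpaired sample is dropped."""
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]


def blocking_analysis(xs, threshold=0.1, min_blocks=8):
    """Return (levels, chosen).

    Each level is a tuple (block_size, n_blocks, average, std_error, correlation).
    `chosen` is the first level whose correlation is below `threshold`,
    or None if the data never decorrelate.
    """
    levels = []
    data = list(xs)
    block_size = 1
    while len(data) >= min_blocks:
        levels.append((block_size, len(data), mean(data),
                       standard_error(data), lag1_correlation(data)))
        data = block_once(data)
        block_size *= 2
    chosen = next((lv for lv in levels if lv[4] < threshold), None)
    return levels, chosen
```

For strongly correlated input, the standard error reported at block size 1 is too small; it grows with the block size until the blocks become effectively independent, which is the level that the 0.1 rule of thumb is meant to pick out.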

History

In the late 1980s/early 1990s, Henk Blöte at Delft University of Technology incorporated techniques for the careful analysis of statistical uncertainties and correlations in Monte Carlo codes for spin models (notably used in a large-scale study of the 3D Ising universality class). To use this analysis method for Monte Carlo data of polymer blends (simulated via the bond-fluctuation model), Erik Luijten implemented it in C in 1999 (the original subroutines were written in Fortran). This code was combined with a function for binning energy and density data for use in the multiple-histogram reweighting technique. After adapting the code again in 2000 for the analysis of the critical point of the Restricted Primitive Model electrolyte, he decided that it would be more efficient to incorporate the routines in a general data analysis program, which became the "generic analyzer" in January 2002. Since then, the code has been extended to allow testing of inconsistencies in input files, permit a variable number of columns, offer flexible memory allocation, etc. (see Options below).

General usage

 generic_analyzer [options] filename [columns]

filename is a plain text file. Lines starting with a '#' will be ignored (use this feature to insert column descriptions and other information into your simulation data). If filename ends with '.gz' the file is assumed to be compressed with gzip and will be decompressed by GA on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. This has the advantage that no additional disk space is required and that no additional time is required to compress the data again after the analysis.

Typically, the user does not specify the number of columns, and GA determines it from the first non-comment line in the file. If the number of columns is specified, only that many columns will be analyzed. Also note the -c option below.

To read from standard input instead of a file, specify STDIN as the filename.

Options

  • -a n
    By default, GA allows up to 10^7 samples (but note that this number is reduced for files that contain large numbers of columns, see Memory considerations below). This option multiplies the maximum number of entries by n. For example, -a 10 will permit up to 10^8 samples.
  • -c n
    Ignore the first n columns on each line (c stands for 'column'). If the number of columns is specified explicitly on the command-line, these columns are counted after the n ignored columns.
  • -i m
    Discard the first m samples (i stands for 'ignore'). Note that comment lines are not counted as samples. This option is normally employed to exclude simulation data taken when a system is not equilibrated yet.
  • -o filename
    Redirect the data normally written to ana.dat (see below) to filename.

Interpreting GA output

When reading the output for a given column, start at the top, and go down until you have reached the first estimate for which the correlation coefficient is less than 0.1. This line provides the proper estimate of the standard deviation. To help you quickly locate this line, it is marked by a '<' at the end. Note that the correlation time can be different for different columns. The summary printed at the end of the output provides the average of each column along with the correct standard deviation and the number of independent samples. If the data in a certain column are not decorrelated even at the most coarse blocking level, a warning is issued.

The screen output of GA can be redirected to a file, but it must be noted that all output is written to stderr. When using the bash shell, redirection is achieved via

 generic_analyzer file 2> redirected_output

GA also writes a compact version of the analysis results to a file. For n columns, this file contains a single line of the format 'A1 S1 A2 S2 ... An Sn', where Ai is the average of column i and Si the corresponding standard deviation (error of the mean). By default the file is called 'ana.dat', but this can be changed via the -o option. Note that a second invocation of GA will not overwrite the file, but instead append another line to it. This format is useful for plotting results in gnuplot, via a command of the form

 plot "ana.dat" using 0:1:2 with err

or even

 plot "ana.dat" using 0:1:2 with err, "" using 0:3:4, "" using 0:5:6

which will plot the average values of three columns (and their standard deviation) as a function of the line number in 'ana.dat', where each line number represents a different invocation of GA. Note that the standard deviations written into 'ana.dat' are those corresponding to the first decorrelated set of samples for each column (if no decorrelated set exists, the standard error obtained for the most coarse blocking level will be used).
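Outside gnuplot, the same file is easy to read programmatically. The following Python sketch parses the 'A1 S1 A2 S2 ... An Sn' lines into (average, standard deviation) pairs. The function names are hypothetical, and the sketch assumes that the fields are separated by whitespace, which the description above suggests but does not state explicitly.

```python
def parse_ana_line(line):
    """Split one line of the compact output into (average, std_dev) pairs."""
    values = [float(token) for token in line.split()]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of fields (average/error pairs)")
    return list(zip(values[0::2], values[1::2]))


def read_ana(path="ana.dat"):
    """Return one list of (average, std_dev) pairs per GA invocation."""
    with open(path) as handle:
        return [parse_ana_line(line) for line in handle if line.strip()]
```

Since every invocation of GA appends one line, the list returned by read_ana is indexed by invocation first and by column second.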

Memory considerations

By default, GA allows up to 10^7 samples (lines). If the file contains more samples, use the -a option to allocate memory for an integer multiple of this limit. However, after the number of columns has been determined (either specified explicitly on the command line or automatically from the first sample), the number of entries will be reduced to limit the total memory consumption to 2 GB. If the -a option is used, this limit will be increased to an integer multiple of 2 GB.

The total memory requirements can also be reduced by analyzing a file in multiple passes, selecting a subset of the columns in each pass. For example, for a file with 1000 columns, the first 600 columns can be analyzed via

 generic_analyzer file 600

and the remaining 400 columns can be analyzed via

 generic_analyzer -c 600 file 400

(Note that the argument 400 is redundant in the second invocation, but would be relevant if the file had to be analyzed in more than two passes.)

The maximum number of columns handled by GA is 2048, so a file with more columns must always be processed in multiple passes, following the same approach.
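For files that need many passes, the command lines can be generated rather than typed by hand. The Python sketch below splits a given number of columns into consecutive passes and builds the corresponding argument lists, using only the -c option and the trailing column count described above. The helper names are hypothetical, and the sketch only constructs the commands; it does not run GA.

```python
def column_passes(total_columns, columns_per_pass):
    """Return (skipped, count) pairs that cover all columns in order."""
    if total_columns <= 0 or columns_per_pass <= 0:
        raise ValueError("column counts must be positive")
    passes = []
    skipped = 0
    while skipped < total_columns:
        count = min(columns_per_pass, total_columns - skipped)
        passes.append((skipped, count))
        skipped += count
    return passes


def pass_commands(filename, total_columns, columns_per_pass,
                  program="generic_analyzer"):
    """Build one argument list per pass, e.g. for subprocess.run()."""
    commands = []
    for skipped, count in column_passes(total_columns, columns_per_pass):
        args = [program]
        if skipped:
            # Skip the columns already handled by earlier passes.
            args += ["-c", str(skipped)]
        args += [filename, str(count)]
        commands.append(args)
    return commands
```

For the example above, pass_commands("file", 1000, 600) reproduces the two invocations shown. Note that each pass appends its own line to 'ana.dat' unless a different file is selected with the -o option.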

Special usage notes

  • To check whether non-equilibrated data at the beginning of a file is affecting the calculated averages, run GA again on the same data file, omitting a fraction of the data via the -i option. If this leads to a statistically significant change in the averages, it may indicate that the first part of the file contains samples that are not equilibrated.
  • All lines (except comment lines) must contain at least the number of columns specified on the command-line, or the number of columns automatically determined from the first non-comment line. If a line contains more columns, a warning is issued, but the analysis proceeds. On the other hand, if any line contains fewer columns than this number, the analysis is aborted.
  • Blank lines are not permitted, not even at the end of a file.
  • Although GA tries to detect corruptions in a data file, limitations of the scanf() function make it ignore certain non-numerical input. Specifically, an entry that starts with a number followed by a spurious character is truncated. For example, '24j3' is read as '24' (however, scientific notation is parsed correctly; e.g., '5.3e2' is read as '530').
  • The ability to read from stdin makes it possible to process simulation data and directly pipe the results into GA. For example:
      awk 'program' unprocessed_data | generic_analyzer STDIN
    (where 'program' is a placeholder for an awk script that processes the original data into a stream of columns)

  • Some programs create data files with MS-DOS line endings, and certain email programs convert data files sent as attachments to include such line endings. GA recognizes this format and reads it properly.
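Several of the notes above (no blank lines, a minimum number of columns per line, silently truncated entries such as '24j3') can be checked before GA is run. The Python sketch below is a hypothetical pre-flight check, not part of GA: it reports blank lines, lines with too few or too many columns, and entries that are not plain decimal or scientific-notation numbers. The accepted number format is an assumption chosen to be strict; special values such as 'inf' or 'nan' are reported as non-numeric.

```python
import re

# Plain decimal or scientific notation, e.g. 24, -3.5, .5, 5.3e2
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def check_lines(lines, columns=None):
    """Return a list of (line_number, message) for suspicious input lines.

    If `columns` is None, it is taken from the first non-comment line,
    mirroring the default behavior described above.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # tolerate MS-DOS line endings
        if line.startswith("#"):
            continue  # comment line
        fields = line.split()
        if not fields:
            problems.append((lineno, "blank line"))
            continue
        if columns is None:
            columns = len(fields)
        if len(fields) < columns:
            problems.append((lineno, "too few columns"))
        elif len(fields) > columns:
            problems.append((lineno, "extra columns"))
        for token in fields[:columns]:
            if not _NUMBER.fullmatch(token):
                problems.append((lineno, f"non-numeric entry {token!r}"))
    return problems
```

A typical call is check_lines(open("data.txt")), where 'data.txt' stands for an uncompressed data file; an empty result means that none of the problems listed above was found.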

Algorithm

(coming soon)

Download (Linux and OS X)

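The current version is dated 06/24/2014. It is strongly recommended that you upgrade from any earlier version. You can download binary versions of this program here, but note that they were created for internal lab use; we cannot provide any support.

  • Linux executable (recommended version; compiled on OpenSuSE 13.1, 64-bit): https://pergamon.ms.northwestern.edu/Download/Generic_analyzer/ga.zip
  • Linux executable (try if regular version does not work; static binary that should be suitable for most x86_64 Linux installations): https://pergamon.ms.northwestern.edu/Download/Generic_analyzer/ga_static.zip
  • OS X executable (compiled on OS X 10.9.3, also works on OS X 10.10): https://pergamon.ms.northwestern.edu/Download/Generic_analyzer/ga_osx.zip
  • Windows executable (tested on Windows 10): https://pergamon.ms.northwestern.edu/Download/Generic_analyzer/ga_windows.zip
    Note: since a standard installation of Windows does not provide gzip, you must install gzip.exe (http://ariadne.ms.northwestern.edu/Download/Generic_analyzer/gzip.zip) in a folder listed in your %PATH% if you wish to read compressed files.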

Copyright

You are free to download this program and use it for your research. However, please note that the copyright for this code is retained by Erik Luijten, 1999-2018.