Autocorrelation: Difference between revisions
mNo edit summary |
|||
(32 intermediate revisions by the same user not shown) | |||
Line 5: | Line 5: | ||
=== General usage === |
=== General usage === |
||
autocorrelation [options] ''filename'' |
autocorrelation [options] ''filename'' ''starttime'' ''maxtime'' |
||
<tt>filename</tt> is a plain text file. Lines starting with a '#' will be ignored (use this feature to insert column descriptions and other information into your simulation data). If <tt>filename</tt> ends with '<tt>.gz</tt>' the file is assumed to be compressed with [http://en.wikipedia.org/wiki/Gzip gzip] and will be decompressed on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. This has the advantage that no additional disk space is required and that no additional time is required to compress the data again after the analysis. |
<tt>filename</tt> is a plain text file. Lines starting with a '#' will be ignored (use this feature to insert column descriptions and other information into your simulation data). If <tt>filename</tt> ends with '<tt>.gz</tt>' the file is assumed to be compressed with [http://en.wikipedia.org/wiki/Gzip gzip] and will be decompressed on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. This has the advantage that no additional disk space is required and that no additional time is required to compress the data again after the analysis. |
||
Line 12: | Line 12: | ||
When redirecting the output, note that the '''autocorrelation function''' for each column is written to standard output, whereas the '''autocorelation time''' (and all other information) is written to standard error. See [[#Special usage notes|usage notes]] below. |
When redirecting the output, note that the '''autocorrelation function''' for each column is written to standard output, whereas the '''autocorelation time''' (and all other information) is written to standard error. See [[#Special usage notes|usage notes]] below. |
||
''starttime'' and ''maxtime'' are mandatory arguments. ''starttime'' is used to discard initial samples. Normally it is simply set to 0. ''maxtime'' sets the upper bound for the integral of the autocorrelation function. |
|||
=== Options === |
=== Options === |
||
* <tt id="a-option">-a</tt><br>Compute the autocorrelation time by integrating the absolute autocorrelation function. This is useful to eliminate cancellations due to anticorelations (which would result in an underestimation of the correlation time), but also will enhance noise in the tail of the autocorrelation function (where the function fluctuates around zero; normally such fluctuations cancel out). |
* <tt id="a-option">-a</tt><br>Compute the autocorrelation time by integrating the absolute autocorrelation function. This is useful to eliminate cancellations due to anticorelations (which would result in an underestimation of the correlation time), but also will enhance noise in the tail of the autocorrelation function (where the function fluctuates around zero; normally such fluctuations cancel out). |
||
* <tt id="c-option">-c n:m</tt><br>Compute the cross correlation of columns <tt>n</tt> and <tt>m</tt>. Note: columns are numbered from 0. |
|||
* <tt id="c-option">-c</tt> |
|||
* <tt id="e-option">-e</tt><br>Disable truncating the data set when using FFT-based calculation (which in turn is enabled via the -f option). Normally, use of -f truncates the data set to a multiple of 256 (for more than 10240 samples) or a multiple of 16384 (for more than 10<sup>6</sup> samples). Note that disabling this truncation carries a significant speed penalty. However, it makes the results identical to those obtained without using the FFT. |
* <tt id="e-option">-e</tt><br>Disable truncating the data set when using FFT-based calculation (which in turn is enabled via the -f option). Normally, use of -f truncates the data set to a multiple of 256 (for more than 10240 samples) or a multiple of 16384 (for more than 10<sup>6</sup> samples). Note that disabling this truncation carries a significant speed penalty (and is primarily used for debugging purposes). However, it makes the results identical to those obtained without using the FFT. (If the size of the original data set is already a multiple of 256 (or 16384), then the results with and without FFT are identical even without this option.) |
||
* <tt id="f-option">-f</tt><br> |
* <tt id="f-option">-f</tt><br>Employ a Fast Fourier Transform to greatly reduce the computational effort, by analyzing the data as a convolution in the frequency domain. This works for the autocorrelation function ([http://mathworld.wolfram.com/Wiener-KhinchinTheorem.html Wiener-Khinchin Theorem]) as well as for the cross correlation (option <tt>-c</tt>) ([http://mathworld.wolfram.com/Cross-CorrelationTheorem.html Cross-Correlation Theorem]) and the mean squared displacement (option <tt>-m</tt>). Note that this requires computing the function over the entire time domain; therefore, for very rapidly decaying functions, where ''maxtime'' can be kept small, conventional evaluation might be faster. |
||
* <tt id="m-option">-m</tt><br>Calculate the mean squared displacement (MSD). See [[#Algorithm|algorithm]] below for the relation between the MSD and the autocorrelation function. |
|||
* <tt id="m-option">-m</tt> |
|||
* <tt id="o-option">-o</tt> ''filename''<br>Write autocorrelation function to ''filename''. |
* <tt id="o-option">-o</tt> ''filename''<br>Write autocorrelation function to ''filename''. |
||
* <tt id="u-option">-u</tt><br>Do not normalize the autocorrelation function. Without this option, the autocorelation function C(t) is normalized by C(0). Note that this option disables calculation of the autocorrelation time (see [[#Algorithm|algorithm]] below). |
|||
* <tt id="u-option">-u</tt> |
|||
* <tt id="x-option">-x</tt><br>Only compute the cross term in the autocorrelation function, i.e., the first term in C(t) (see [[#Algorithm|algorithm]] below). |
|||
* <tt id="x-option">-x</tt> |
|||
=== Interpreting |
=== Interpreting output === |
||
Topics (coming soon): |
|||
* choosing maxtime |
|||
* anticorrelations |
|||
* negative time cross correlations |
|||
* ... |
|||
=== Special usage notes === |
=== Special usage notes === |
||
Line 31: | Line 40: | ||
<li>Note that, per standard GNU style, options can be combined. Thus, for example, '<tt>-a -e -f</tt>' can be specified as '<tt>-aef</tt>'.</li> |
<li>Note that, per standard GNU style, options can be combined. Thus, for example, '<tt>-a -e -f</tt>' can be specified as '<tt>-aef</tt>'.</li> |
||
<li>(more tips coming soon)</li> |
|||
</ul> |
</ul> |
||
=== Algorithm === |
=== Algorithm === |
||
For a discrete time series <math>a_1, \ldots, a_N</math>, the '''autocovariance''' is given by |
|||
<math>C(t) = \frac{1}{N-t} \sum_{j=1}^{N-t} a_j a_{j+t} - |
|||
\left(\frac{1}{N-t} \sum_{j=1}^{N-t} a_j \right) |
|||
\left(\frac{1}{N-t} \sum_{j=1}^{N-t} a_{j+t} \right) |
|||
</math>, |
|||
which can be evaluated for <math>0 \leq t < N</math>. The '''autocorrelation''' (or autocorrelation ''function'') is the autocovariance normalized by the variance of <math>a</math>. If we assume time invariance, this variance is <math>C(0)</math>, so that the autocorrelation becomes <math>C(t)/C(0)</math>. (The special case <math>C(0)=0</math>, which arises for a constant time series, is caught by '''AC'''.) The time difference <math>t</math> is also called the ''lag time.'' By default, the '''AC''' code computes the autocorrelation; the autocovariance is obtained via the [[#u-option|<tt>-u</tt>]] option. |
|||
If the autocorrelation function decays exponentially, <math>C(t) = A \exp(-t/\tau)</math>, then <math>\tau</math> can be found as |
|||
<math>\tau = \int_0^\infty [C(t)/C(0)] dt</math>. |
|||
This concept is generalized by integrating the autocorrelation function irrespective of its functional form. Thus, the autocorrelation time is computed as |
|||
<math>\tau = \sum_{i=0}^{\mathrm{maxtime}} [C(i)/C(0)]</math>. |
|||
Some more nomenclature: Sometimes the autocorrelation refers to the ''cross term'' <math>\langle a_j a_{j+t}\rangle</math> (where the brackets indicate averaging over <math>j</math>), which is the first term in <math>C(t)</math>. In that case, <math>C(t)/C(0)</math> is referred to as the '''autocorrelation coefficient''' or autocorrelation coefficient ''function.'' To compute the cross term in '''AC''', specify the [[#x-option|<tt>-x</tt>]] option (or [[#u-option|<tt>-u</tt>]] [[#x-option|<tt>-x</tt>]] to suppress normalization). |
|||
For two time series <math>a_1, \ldots, a_N</math> and <math>b_1, \ldots, b_N</math>, the '''covariance''' is defined as |
|||
<math>X(t) = \frac{1}{N-t} \sum_{j=1}^{N-t} a_j b_{j+t} - |
|||
\left(\frac{1}{N-t} \sum_{j=1}^{N-t} a_j \right) |
|||
\left(\frac{1}{N-t} \sum_{j=1}^{N-t} b_{j+t} \right) |
|||
</math>. |
|||
The '''cross correlation''' is then <math>X(t)/X(0)</math>. '''AC''' computes cross correlations via the [[#c-option|<tt>-c</tt>]] option. Moreover, as for the autocorrelation, it is possible to suppress normalization via [[#u-option|<tt>-u</tt>]], and to only compute the first term of the covariance via [[#x-option|<tt>-x</tt>]]. |
|||
Finally, if the variable <math>a_i</math> is a time-dependent coordinate <math>x</math>, then the '''mean squared displacement''' (MSD) for a time interval (time difference) ''t'' is defined as |
|||
<math>M(t) = \frac{1}{N-t} \sum_{j=1}^{N-t} \left( x_{j+t} - x_j \right)^2</math>, |
|||
which can be written as |
|||
<math>M(t) = \frac{1}{N-t} \sum_{j=1}^{N-t} \left( x_j \right)^2 |
|||
+ \frac{1}{N-t} \sum_{j=1}^{N-t} \left( x_{j+t} \right)^2 |
|||
- \frac{2}{N-t} \sum_{j=1}^{N-t} x_j x_{j+t} |
|||
</math>. |
|||
This can be computed by '''AC''' via the [[#m-option|<tt>-m</tt>]] option. |
|||
Evaluation of the first two terms for <math>t=0,\ldots,N-1</math> requires <math>\mathcal{O}(N)</math> operations. The third term we recognize as as proportional to the first term (or "cross term") in <math>C(t)</math>. This also implies that we can accelerate calculation of the MSD via FFT (use options [[#m-option|<tt>-m</tt>]] [[#f-option|<tt>-f</tt>]]), and evaluate the function for the entire time domain at cost <math>N \log N</math>. |
|||
=== Download binary versions (Linux and OS X) === |
=== Download binary versions (Linux and OS X) === |
||
The current version is |
The current version is 3.93, dated December 2013. It is strongly recommend that you upgrade from any earlier version. You can download binary versions of this program here, but note that this was created for internal lab use - we cannot provide any support. |
||
(binaries coming soon) |
(binaries coming soon, pending minor improvements) |
Latest revision as of 10:11, 31 October 2015
Overview
The "autocorrelation" (AC) program is based upon the generic analyzer. It computes the integrated autocorrelation time for time series of data, on a per column basis. It is particularly useful for processing output from Monte Carlo and molecular dynamics simulations. The program is available on all local machines.
The autocorrelation program was originally written in 2002/2003 by Erik Luijten for efficiency comparisons involving the Geometric Cluster Algorithm.
General usage
autocorrelation [options] filename starttime maxtime
filename is a plain text file. Lines starting with a '#' will be ignored (use this feature to insert column descriptions and other information into your simulation data). If filename ends with '.gz' the file is assumed to be compressed with gzip and will be decompressed on the fly. Note that this happens in memory; no decompressed version of the file is written to disk. This has the advantage that no additional disk space is required and that no additional time is required to compress the data again after the analysis.
To read from standard input instead of a file, specify STDIN as the filename.
When redirecting the output, note that the autocorrelation function for each column is written to standard output, whereas the autocorelation time (and all other information) is written to standard error. See usage notes below.
starttime and maxtime are mandatory arguments. starttime is used to discard initial samples. Normally it is simply set to 0. maxtime sets the upper bound for the integral of the autocorrelation function.
Options
- -a
Compute the autocorrelation time by integrating the absolute autocorrelation function. This is useful to eliminate cancellations due to anticorelations (which would result in an underestimation of the correlation time), but also will enhance noise in the tail of the autocorrelation function (where the function fluctuates around zero; normally such fluctuations cancel out). - -c n:m
Compute the cross correlation of columns n and m. Note: columns are numbered from 0. - -e
Disable truncating the data set when using FFT-based calculation (which in turn is enabled via the -f option). Normally, use of -f truncates the data set to a multiple of 256 (for more than 10240 samples) or a multiple of 16384 (for more than 106 samples). Note that disabling this truncation carries a significant speed penalty (and is primarily used for debugging purposes). However, it makes the results identical to those obtained without using the FFT. (If the size of the original data set is already a multiple of 256 (or 16384), then the results with and without FFT are identical even without this option.) - -f
Employ a Fast Fourier Transform to greatly reduce the computational effort, by analyzing the data as a convolution in the frequency domain. This works for the autocorrelation function (Wiener-Khinchin Theorem) as well as for the cross correlation (option -c) (Cross-Correlation Theorem) and the mean squared displacement (option -m). Note that this requires computing the function over the entire time domain; therefore, for very rapidly decaying functions, where maxtime can be kept small, conventional evaluation might be faster. - -m
Calculate the mean squared displacement (MSD). See algorithm below for the relation between the MSD and the autocorrelation function. - -o filename
Write autocorrelation function to filename. - -u
Do not normalize the autocorrelation function. Without this option, the autocorelation function C(t) is normalized by C(0). Note that this option disables calculation of the autocorrelation time (see algorithm below). - -x
Only compute the cross term in the autocorrelation function, i.e., the first term in C(t) (see algorithm below).
Interpreting output
Topics (coming soon):
- choosing maxtime
- anticorrelations
- negative time cross correlations
- ...
Special usage notes
- Note that, per standard GNU style, options can be combined. Thus, for example, '-a -e -f' can be specified as '-aef'.
- (more tips coming soon)
Algorithm
For a discrete time series , the autocovariance is given by
,
which can be evaluated for . The autocorrelation (or autocorrelation function) is the autocovariance normalized by the variance of . If we assume time invariance, this variance is , so that the autocorrelation becomes . (The special case , which arises for a constant time series, is caught by AC.) The time difference is also called the lag time. By default, the AC code computes the autocorrelation; the autocovariance is obtained via the -u option.
If the autocorrelation function decays exponentially, , then can be found as
.
This concept is generalized by integrating the autocorrelation function irrespective of its functional form. Thus, the autocorrelation time is computed as
.
Some more nomenclature: Sometimes the autocorrelation refers to the cross term (where the brackets indicate averaging over ), which is the first term in . In that case, is referred to as the autocorrelation coefficient or autocorrelation coefficient function. To compute the cross term in AC, specify the -x option (or -u -x to suppress normalization).
For two time series and , the covariance is defined as
.
The cross correlation is then . AC computes cross correlations via the -c option. Moreover, as for the autocorrelation, it is possible to suppress normalization via -u, and to only compute the first term of the covariance via -x.
Finally, if the variable is a time-dependent coordinate , then the mean squared displacement (MSD) for a time interval (time difference) t is defined as
,
which can be written as
.
This can be computed by AC via the -m option.
Evaluation of the first two terms for requires operations. The third term we recognize as as proportional to the first term (or "cross term") in . This also implies that we can accelerate calculation of the MSD via FFT (use options -m -f), and evaluate the function for the entire time domain at cost .
Download binary versions (Linux and OS X)
The current version is 3.93, dated December 2013. It is strongly recommend that you upgrade from any earlier version. You can download binary versions of this program here, but note that this was created for internal lab use - we cannot provide any support.
(binaries coming soon, pending minor improvements)