Module « scipy.stats »
Signature de la fonction bootstrap
def bootstrap(data, statistic, *, vectorized=True, paired=False, axis=0, confidence_level=0.95, n_resamples=9999, batch=None, method='BCa', random_state=None)
Description
bootstrap.__doc__
Compute a two-sided bootstrap confidence interval of a statistic.
When `method` is ``'percentile'``, a bootstrap confidence interval is
computed according to the following procedure.
1. Resample the data: for each sample in `data` and for each of
`n_resamples`, take a random sample of the original sample
(with replacement) of the same size as the original sample.
2. Compute the bootstrap distribution of the statistic: for each set of
resamples, compute the test statistic.
3. Determine the confidence interval: find the interval of the bootstrap
distribution that is
- symmetric about the median and
- contains `confidence_level` of the resampled statistic values.
While the ``'percentile'`` method is the most intuitive, it is rarely
used in practice. Two more common methods are available, ``'basic'``
('reverse percentile') and ``'BCa'`` ('bias-corrected and accelerated');
they differ in how step 3 is performed.
If the samples in `data` are taken at random from their respective
distributions :math:`n` times, the confidence interval returned by
`bootstrap` will contain the true value of the statistic for those
distributions approximately `confidence_level`:math:`\, \times \, n` times.
Parameters
----------
data : sequence of array-like
Each element of data is a sample from an underlying distribution.
statistic : callable
Statistic for which the confidence interval is to be calculated.
`statistic` must be a callable that accepts ``len(data)`` samples
as separate arguments and returns the resulting statistic.
If `vectorized` is set ``True``,
`statistic` must also accept a keyword argument `axis` and be
vectorized to compute the statistic along the provided `axis`.
vectorized : bool, default: ``True``
If `vectorized` is set ``False``, `statistic` will not be passed
keyword argument `axis`, and is assumed to calculate the statistic
only for 1D samples.
paired : bool, default: ``False``
Whether the statistic treats corresponding elements of the samples
in `data` as paired.
axis : int, default: ``0``
The axis of the samples in `data` along which the `statistic` is
calculated.
confidence_level : float, default: ``0.95``
The confidence level of the confidence interval.
n_resamples : int, default: ``9999``
The number of resamples performed to form the bootstrap distribution
of the statistic.
batch : int, optional
The number of resamples to process in each vectorized call to
`statistic`. Memory usage is O(`batch`*``n``), where ``n`` is the
sample size. Default is ``None``, in which case ``batch = n_resamples``
(or ``batch = max(n_resamples, n)`` for ``method='BCa'``).
method : {'percentile', 'basic', 'bca'}, default: ``'BCa'``
Whether to return the 'percentile' bootstrap confidence interval
(``'percentile'``), the 'reverse' or the bias-corrected and accelerated
bootstrap confidence interval (``'BCa'``).
Note that only ``'percentile'`` and ``'basic'`` support multi-sample
statistics at this time.
random_state : {None, int, `numpy.random.Generator`,
`numpy.random.RandomState`}, optional
If `seed` is ``None`` (or `np.random`), the `numpy.random.RandomState`
singleton is used.
If `seed` is an int, a new ``RandomState`` instance is used,
seeded with `seed`.
If `seed` is already a ``Generator`` or ``RandomState`` instance then
that instance is used.
Pseudorandom number generator state used to generate resamples.
Returns
-------
res : BootstrapResult
An object with attributes:
confidence_interval : ConfidenceInterval
The bootstrap confidence interval as an instance of
`collections.namedtuple` with attributes `low` and `high`.
standard_error : float or ndarray
The bootstrap standard error, that is, the sample standard
deviation of the bootstrap distribution
References
----------
.. [1] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap,
Chapman & Hall/CRC, Boca Raton, FL, USA (1993)
.. [2] Nathaniel E. Helwig, "Bootstrap Confidence Intervals",
http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf
.. [3] Bootstrapping (statistics), Wikipedia,
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Examples
--------
Suppose we have sampled data from an unknown distribution.
>>> import numpy as np
>>> rng = np.random.default_rng()
>>> from scipy.stats import norm
>>> dist = norm(loc=2, scale=4) # our "unknown" distribution
>>> data = dist.rvs(size=100, random_state=rng)
We are interested int the standard deviation of the distribution.
>>> std_true = dist.std() # the true value of the statistic
>>> print(std_true)
4.0
>>> std_sample = np.std(data) # the sample statistic
>>> print(std_sample)
3.9460644295563863
We can calculate a 90% confidence interval of the statistic using
`bootstrap`.
>>> from scipy.stats import bootstrap
>>> data = (data,) # samples must be in a sequence
>>> res = bootstrap(data, np.std, confidence_level=0.9,
... random_state=rng)
>>> print(res.confidence_interval)
ConfidenceInterval(low=3.57655333533867, high=4.382043696342881)
If we sample from the distribution 1000 times and form a bootstrap
confidence interval for each sample, the confidence interval
contains the true value of the statistic approximately 900 times.
>>> n_trials = 1000
>>> ci_contains_true_std = 0
>>> for i in range(n_trials):
... data = (dist.rvs(size=100, random_state=rng),)
... ci = bootstrap(data, np.std, confidence_level=0.9, n_resamples=1000,
... random_state=rng).confidence_interval
... if ci[0] < std_true < ci[1]:
... ci_contains_true_std += 1
>>> print(ci_contains_true_std)
875
Rather than writing a loop, we can also determine the confidence intervals
for all 1000 samples at once.
>>> data = (dist.rvs(size=(n_trials, 100), random_state=rng),)
>>> res = bootstrap(data, np.std, axis=-1, confidence_level=0.9,
... n_resamples=1000, random_state=rng)
>>> ci_l, ci_u = res.confidence_interval
Here, `ci_l` and `ci_u` contain the confidence interval for each of the
``n_trials = 1000`` samples.
>>> print(ci_l[995:])
[3.77729695 3.75090233 3.45829131 3.34078217 3.48072829]
>>> print(ci_u[995:])
[4.88316666 4.86924034 4.32032996 4.2822427 4.59360598]
And again, approximately 90% contain the true value, ``std_true = 4``.
>>> print(np.sum((ci_l < std_true) & (std_true < ci_u)))
900
`bootstrap` can also be used to estimate confidence intervals of
multi-sample statistics, including those calculated by hypothesis
tests. `scipy.stats.mood` perform's Mood's test for equal scale parameters,
and it returns two outputs: a statistic, and a p-value. To get a
confidence interval for the test statistic, we first wrap
`scipy.stats.mood` in a function that accepts two sample arguments,
accepts an `axis` keyword argument, and returns only the statistic.
>>> from scipy.stats import mood
>>> def my_statistic(sample1, sample2, axis):
... statistic, _ = mood(sample1, sample2, axis=-1)
... return statistic
Here, we use the 'percentile' method with the default 95% confidence level.
>>> sample1 = norm.rvs(scale=1, size=100, random_state=rng)
>>> sample2 = norm.rvs(scale=2, size=100, random_state=rng)
>>> data = (sample1, sample2)
>>> res = bootstrap(data, my_statistic, method='basic', random_state=rng)
>>> print(mood(sample1, sample2)[0]) # element 0 is the statistic
-5.521109549096542
>>> print(res.confidence_interval)
ConfidenceInterval(low=-7.255994487314675, high=-4.016202624747605)
The bootstrap estimate of the standard error is also available.
>>> print(res.standard_error)
0.8344963846318795
Paired-sample statistics work, too. For example, consider the Pearson
correlation coefficient.
>>> from scipy.stats import pearsonr
>>> n = 100
>>> x = np.linspace(0, 10, n)
>>> y = x + rng.uniform(size=n)
>>> print(pearsonr(x, y)[0]) # element 0 is the statistic
0.9962357936065914
We wrap `pearsonr` so that it returns only the statistic.
>>> def my_statistic(x, y):
... return pearsonr(x, y)[0]
We call `bootstrap` using ``paired=True``.
Also, since ``my_statistic`` isn't vectorized to calculate the statistic
along a given axis, we pass in ``vectorized=False``.
>>> res = bootstrap((x, y), my_statistic, vectorized=False, paired=True,
... random_state=rng)
>>> print(res.confidence_interval)
ConfidenceInterval(low=0.9950085825848624, high=0.9971212407917498)
Améliorations / Corrections
Vous avez des améliorations (ou des corrections) à proposer pour ce document : je vous remerçie par avance de m'en faire part, cela m'aide à améliorer le site.
Emplacement :
Description des améliorations :