Module « scipy.stats »

Function monte_carlo_test - module scipy.stats

Signature of the function monte_carlo_test

def monte_carlo_test(data, rvs, statistic, *, vectorized=None, n_resamples=9999, batch=None, alternative='two-sided', axis=0) 

Description

help(scipy.stats.monte_carlo_test)

Perform a Monte Carlo hypothesis test.

`data` contains a sample or a sequence of one or more samples. `rvs`
specifies the distribution(s) of the sample(s) in `data` under the null
hypothesis. The value of `statistic` for the given `data` is compared
against a Monte Carlo null distribution: the value of the statistic for
each of `n_resamples` sets of samples generated using `rvs`. This gives
the p-value, the probability of observing such an extreme value of the
test statistic under the null hypothesis.
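
To make the mechanics concrete, here is a rough sketch of the idea in plain
NumPy. It is an illustration only, not SciPy's actual implementation; the
normal null distribution, the skewness statistic, and the resample count are
stand-ins that mirror the parameters documented below.

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> rvs = lambda size: stats.norm.rvs(size=size, random_state=rng)  # null distribution
>>> statistic = lambda sample: stats.skew(sample)                   # test statistic
>>> x = stats.skewnorm.rvs(a=1, size=50, random_state=rng)          # observed sample
>>> observed = statistic(x)
>>> # Monte Carlo null distribution: one statistic value per generated sample
>>> null = np.array([statistic(rvs(size=x.size)) for _ in range(9999)])
>>> # one-sided ('greater') p-value: fraction of the null distribution >= observed
>>> pvalue = np.mean(null >= observed)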

Parameters
----------
data : array-like or sequence of array-like
    An array or sequence of arrays of observations.
rvs : callable or tuple of callables
    A callable or sequence of callables that generates random variates
    under the null hypothesis. Each element of `rvs` must be a callable
    that accepts keyword argument ``size`` (e.g. ``rvs(size=(m, n))``) and
    returns an N-d array sample of that shape. If `rvs` is a sequence, the
    number of callables in `rvs` must match the number of samples in
    `data`, i.e. ``len(rvs) == len(data)``. If `rvs` is a single callable,
    `data` is treated as a single sample.
statistic : callable
    Statistic for which the p-value of the hypothesis test is to be
    calculated. `statistic` must be a callable that accepts a sample
    (e.g. ``statistic(sample)``) or ``len(rvs)`` separate samples (e.g.
    ``statistic(sample1, sample2)`` if `rvs` contains two callables and
    `data` contains two samples) and returns the resulting statistic.
    If `vectorized` is set ``True``, `statistic` must also accept a keyword
    argument `axis` and be vectorized to compute the statistic along the
    provided `axis` of the samples in `data`.
vectorized : bool, optional
    If `vectorized` is set ``False``, `statistic` will not be passed
    keyword argument `axis` and is expected to calculate the statistic
    only for 1D samples. If ``True``, `statistic` will be passed keyword
    argument `axis` and is expected to calculate the statistic along `axis`
    when passed ND sample arrays. If ``None`` (default), `vectorized`
    will be set ``True`` if ``axis`` is a parameter of `statistic`. Use of
    a vectorized statistic typically reduces computation time; a short sketch
    contrasting the two forms follows this parameter list.
n_resamples : int, default: 9999
    Number of samples drawn from each of the callables of `rvs`.
    Equivalently, the number of statistic values under the null hypothesis
    used as the Monte Carlo null distribution.
batch : int, optional
    The number of Monte Carlo samples to process in each call to
    `statistic`. Memory usage is O( `batch` * ``sample.size[axis]`` ). Default
    is ``None``, in which case `batch` equals `n_resamples`.
alternative : {'two-sided', 'less', 'greater'}
    The alternative hypothesis for which the p-value is calculated.
    For each alternative, the p-value is defined as follows.

    - ``'greater'`` : the percentage of the null distribution that is
      greater than or equal to the observed value of the test statistic.
    - ``'less'`` : the percentage of the null distribution that is
      less than or equal to the observed value of the test statistic.
    - ``'two-sided'`` : twice the smaller of the p-values above.

axis : int, default: 0
    The axis of `data` (or each sample within `data`) over which to
    calculate the statistic.
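
The sketch below, referenced under `vectorized` above, contrasts the two ways
a statistic can be written. It is an editorial illustration (the skewness
statistic and the normal null are stand-ins), not part of the SciPy
documentation.

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.default_rng()
>>> x = stats.skewnorm.rvs(a=1, size=50, random_state=rng)
>>> rvs = lambda size: stats.norm.rvs(size=size, random_state=rng)
>>> # non-vectorized form: called once per 1-D Monte Carlo sample
>>> def statistic_1d(sample):
...     return stats.skew(sample)
>>> res = stats.monte_carlo_test(x, rvs, statistic_1d, vectorized=False)
>>> # vectorized form: receives N-d sample arrays plus an `axis` keyword
>>> def statistic_nd(sample, axis):
...     return stats.skew(sample, axis=axis)
>>> res = stats.monte_carlo_test(x, rvs, statistic_nd, vectorized=True)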

Returns
-------
res : MonteCarloTestResult
    An object with attributes:

    statistic : float or ndarray
        The test statistic of the observed `data`.
    pvalue : float or ndarray
        The p-value for the given alternative.
    null_distribution : ndarray
        The values of the test statistic generated under the null
        hypothesis.

.. warning::
    The p-value is calculated by counting the elements of the null
    distribution that are as extreme or more extreme than the observed
    value of the statistic. Due to the use of finite precision arithmetic,
    some statistic functions return numerically distinct values when the
    theoretical values would be exactly equal. In some cases, this could
    lead to a large error in the calculated p-value. `monte_carlo_test`
    guards against this by considering elements in the null distribution
    that are "close" (within a relative tolerance of 100 times the
    floating point epsilon of inexact dtypes) to the observed
    value of the test statistic as equal to the observed value of the
    test statistic. However, the user is advised to inspect the null
    distribution to assess whether this method of comparison is
    appropriate, and if not, calculate the p-value manually.
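
For readers who do want to compute the p-value manually from
``null_distribution``, a minimal sketch following the definitions given under
`alternative` might look like the following; the tolerance handling shown is
an illustrative assumption that mirrors the warning above, not SciPy's exact
rule.

>>> import numpy as np
>>> # `res` is the MonteCarloTestResult returned by monte_carlo_test
>>> null, observed = res.null_distribution, res.statistic
>>> # illustrative tolerance: treat values "close" to the observed statistic as equal
>>> gamma = np.abs(100 * np.finfo(null.dtype).eps * observed)
>>> p_greater = np.mean(null >= observed - gamma)
>>> p_less = np.mean(null <= observed + gamma)
>>> p_two_sided = min(2 * min(p_greater, p_less), 1.0)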

References
----------

.. [1] B. Phipson and G. K. Smyth. "Permutation P-values Should Never Be
   Zero: Calculating Exact P-values When Permutations Are Randomly Drawn."
   Statistical Applications in Genetics and Molecular Biology 9.1 (2010).

Examples
--------

Suppose we wish to test whether a small sample has been drawn from a normal
distribution. We decide that we will use the skew of the sample as a
test statistic, and we will consider a p-value of 0.05 to be statistically
significant.

>>> import numpy as np
>>> from scipy import stats
>>> def statistic(x, axis):
...     return stats.skew(x, axis)

After collecting our data, we calculate the observed value of the test
statistic.

>>> rng = np.random.default_rng()
>>> x = stats.skewnorm.rvs(a=1, size=50, random_state=rng)
>>> statistic(x, axis=0)
0.12457412450240658

To determine the probability of observing such an extreme value of the
skewness by chance if the sample were drawn from the normal distribution,
we can perform a Monte Carlo hypothesis test. The test will draw many
samples at random from the normal distribution, calculate the skewness
of each sample, and compare our original skewness against this
distribution to determine an approximate p-value.

>>> from scipy.stats import monte_carlo_test
>>> # because our statistic is vectorized, we pass `vectorized=True`
>>> rvs = lambda size: stats.norm.rvs(size=size, random_state=rng)
>>> res = monte_carlo_test(x, rvs, statistic, vectorized=True)
>>> print(res.statistic)
0.12457412450240658
>>> print(res.pvalue)
0.7012

Under the null hypothesis, the probability of observing such an extreme value
of the test statistic (the default alternative is ``'two-sided'``) is ~70%.
This is greater than
our chosen threshold of 5%, so we cannot consider this to be significant
evidence against the null hypothesis.

Note that this p-value essentially matches that of
`scipy.stats.skewtest`, which relies on an asymptotic distribution of a
test statistic based on the sample skewness.

>>> stats.skewtest(x).pvalue
0.6892046027110614

This asymptotic approximation is not valid for small sample sizes, but
`monte_carlo_test` can be used with samples of any size.

>>> x = stats.skewnorm.rvs(a=1, size=7, random_state=rng)
>>> # stats.skewtest(x) would produce an error due to small sample
>>> res = monte_carlo_test(x, rvs, statistic, vectorized=True)

The Monte Carlo distribution of the test statistic is provided for
further investigation.

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.hist(res.null_distribution, bins=50)
>>> ax.set_title("Monte Carlo distribution of test statistic")
>>> ax.set_xlabel("Value of Statistic")
>>> ax.set_ylabel("Frequency")
>>> plt.show()
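
The example above uses a single sample. As described under the `rvs` and
`statistic` parameters, `monte_carlo_test` also accepts a sequence of samples
together with a matching tuple of callables. The following two-sample sketch
is an illustrative addition (the difference-of-means statistic and the common
normal null are assumptions, not taken from the SciPy documentation):

>>> x = stats.norm.rvs(size=30, random_state=rng)
>>> y = stats.norm.rvs(size=40, random_state=rng)
>>> rvs = (lambda size: stats.norm.rvs(size=size, random_state=rng),
...        lambda size: stats.norm.rvs(size=size, random_state=rng))
>>> def statistic(sample1, sample2, axis):
...     # difference of sample means along `axis`
...     return np.mean(sample1, axis=axis) - np.mean(sample2, axis=axis)
>>> # both samples are drawn from the null here, so a large p-value is expected
>>> res = monte_carlo_test((x, y), rvs, statistic, vectorized=True)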


