Vous êtes un professionnel et vous avez besoin d'une formation ?
Deep Learning avec Python
et Keras et Tensorflow
Voir le programme détaillé
Module « scipy.stats »
Signature de la fonction pearsonr
def pearsonr(x, y, *, alternative='two-sided', method=None, axis=0)
Description
help(scipy.stats.pearsonr)
Pearson correlation coefficient and p-value for testing non-correlation.
The Pearson correlation coefficient [1]_ measures the linear relationship
between two datasets. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear relationship.
Positive correlations imply that as x increases, so does y. Negative
correlations imply that as x increases, y decreases.
This function also performs a test of the null hypothesis that the
distributions underlying the samples are uncorrelated and normally
distributed. (See Kowalski [3]_
for a discussion of the effects of non-normality of the input on the
distribution of the correlation coefficient.)
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets.
Parameters
----------
x : array_like
Input array.
y : array_like
Input array.
axis : int or None, default
Axis along which to perform the calculation. Default is 0.
If None, ravel both arrays before performing the calculation.
.. versionadded:: 1.13.0
alternative : {'two-sided', 'greater', 'less'}, optional
Defines the alternative hypothesis. Default is 'two-sided'.
The following options are available:
* 'two-sided': the correlation is nonzero
* 'less': the correlation is negative (less than zero)
* 'greater': the correlation is positive (greater than zero)
.. versionadded:: 1.9.0
method : ResamplingMethod, optional
Defines the method used to compute the p-value. If `method` is an
instance of `PermutationMethod`/`MonteCarloMethod`, the p-value is
computed using
`scipy.stats.permutation_test`/`scipy.stats.monte_carlo_test` with the
provided configuration options and other appropriate settings.
Otherwise, the p-value is computed as documented in the notes.
.. versionadded:: 1.11.0
Returns
-------
result : `~scipy.stats._result_classes.PearsonRResult`
An object with the following attributes:
statistic : float
Pearson product-moment correlation coefficient.
pvalue : float
The p-value associated with the chosen alternative.
The object has the following method:
confidence_interval(confidence_level, method)
This computes the confidence interval of the correlation
coefficient `statistic` for the given confidence level.
The confidence interval is returned in a ``namedtuple`` with
fields `low` and `high`. If `method` is not provided, the
confidence interval is computed using the Fisher transformation
[1]_. If `method` is an instance of `BootstrapMethod`, the
confidence interval is computed using `scipy.stats.bootstrap` with
the provided configuration options and other appropriate settings.
In some cases, confidence limits may be NaN due to a degenerate
resample, and this is typical for very small samples (~6
observations).
Raises
------
ValueError
If `x` and `y` do not have length at least 2.
Warns
-----
`~scipy.stats.ConstantInputWarning`
Raised if an input is a constant array. The correlation coefficient
is not defined in this case, so ``np.nan`` is returned.
`~scipy.stats.NearConstantInputWarning`
Raised if an input is "nearly" constant. The array ``x`` is considered
nearly constant if ``norm(x - mean(x)) < 1e-13 * abs(mean(x))``.
Numerical errors in the calculation ``x - mean(x)`` in this case might
result in an inaccurate calculation of r.
See Also
--------
spearmanr : Spearman rank-order correlation coefficient.
kendalltau : Kendall's tau, a correlation measure for ordinal data.
Notes
-----
The correlation coefficient is calculated as follows:
.. math::
r = \frac{\sum (x - m_x) (y - m_y)}
{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}
where :math:`m_x` is the mean of the vector x and :math:`m_y` is
the mean of the vector y.
Under the assumption that x and y are drawn from
independent normal distributions (so the population correlation coefficient
is 0), the probability density function of the sample correlation
coefficient r is ([1]_, [2]_):
.. math::
f(r) = \frac{{(1-r^2)}^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}
where n is the number of samples, and B is the beta function. This
is sometimes referred to as the exact distribution of r. This is
the distribution that is used in `pearsonr` to compute the p-value when
the `method` parameter is left at its default value (None).
The distribution is a beta distribution on the interval [-1, 1],
with equal shape parameters a = b = n/2 - 1. In terms of SciPy's
implementation of the beta distribution, the distribution of r is::
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The default p-value returned by `pearsonr` is a two-sided p-value. For a
given sample with correlation coefficient r, the p-value is
the probability that abs(r') of a random sample x' and y' drawn from
the population with zero correlation would be greater than or equal
to abs(r). In terms of the object ``dist`` shown above, the p-value
for a given r and length n can be computed as::
p = 2*dist.cdf(-abs(r))
When n is 2, the above continuous distribution is not well-defined.
One can interpret the limit of the beta distribution as the shape
parameters a and b approach a = b = 0 as a discrete distribution with
equal probability masses at r = 1 and r = -1. More directly, one
can observe that, given the data x = [x1, x2] and y = [y1, y2], and
assuming x1 != x2 and y1 != y2, the only possible values for r are 1
and -1. Because abs(r') for any sample x' and y' with length 2 will
be 1, the two-sided p-value for a sample of length 2 is always 1.
For backwards compatibility, the object that is returned also behaves
like a tuple of length two that holds the statistic and the p-value.
References
----------
.. [1] "Pearson correlation coefficient", Wikipedia,
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
.. [2] Student, "Probable error of a correlation coefficient",
Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.
.. [3] C. J. Kowalski, "On the Effects of Non-Normality on the Distribution
of the Sample Product-Moment Correlation Coefficient"
Journal of the Royal Statistical Society. Series C (Applied
Statistics), Vol. 21, No. 1 (1972), pp. 1-12.
Examples
--------
>>> import numpy as np
>>> from scipy import stats
>>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
>>> res = stats.pearsonr(x, y)
>>> res
PearsonRResult(statistic=-0.828503883588428, pvalue=0.021280260007523286)
To perform an exact permutation version of the test:
>>> rng = np.random.default_rng(7796654889291491997)
>>> method = stats.PermutationMethod(n_resamples=np.inf, random_state=rng)
>>> stats.pearsonr(x, y, method=method)
PearsonRResult(statistic=-0.828503883588428, pvalue=0.028174603174603175)
To perform the test under the null hypothesis that the data were drawn from
*uniform* distributions:
>>> method = stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform))
>>> stats.pearsonr(x, y, method=method)
PearsonRResult(statistic=-0.828503883588428, pvalue=0.0188)
To produce an asymptotic 90% confidence interval:
>>> res.confidence_interval(confidence_level=0.9)
ConfidenceInterval(low=-0.9644331982722841, high=-0.3460237473272273)
And for a bootstrap confidence interval:
>>> method = stats.BootstrapMethod(method='BCa', rng=rng)
>>> res.confidence_interval(confidence_level=0.9, method=method)
ConfidenceInterval(low=-0.9983163756488651, high=-0.22771001702132443) # may vary
If N-dimensional arrays are provided, multiple tests are performed in a
single call according to the same conventions as most `scipy.stats` functions:
>>> rng = np.random.default_rng(2348246935601934321)
>>> x = rng.standard_normal((8, 15))
>>> y = rng.standard_normal((8, 15))
>>> stats.pearsonr(x, y, axis=0).statistic.shape # between corresponding columns
(15,)
>>> stats.pearsonr(x, y, axis=1).statistic.shape # between corresponding rows
(8,)
To perform all pairwise comparisons between slices of the arrays,
use standard NumPy broadcasting techniques. For instance, to compute the
correlation between all pairs of rows:
>>> stats.pearsonr(x[:, np.newaxis, :], y, axis=-1).statistic.shape
(8, 8)
There is a linear dependence between x and y if y = a + b*x + e, where
a,b are constants and e is a random error term, assumed to be independent
of x. For simplicity, assume that x is standard normal, a=0, b=1 and let
e follow a normal distribution with mean zero and standard deviation s>0.
>>> rng = np.random.default_rng()
>>> s = 0.5
>>> x = stats.norm.rvs(size=500, random_state=rng)
>>> e = stats.norm.rvs(scale=s, size=500, random_state=rng)
>>> y = x + e
>>> stats.pearsonr(x, y).statistic
0.9001942438244763
This should be close to the exact value given by
>>> 1/np.sqrt(1 + s**2)
0.8944271909999159
For s=0.5, we observe a high level of correlation. In general, a large
variance of the noise reduces the correlation, while the correlation
approaches one as the variance of the error goes to zero.
It is important to keep in mind that no correlation does not imply
independence unless (x, y) is jointly normal. Correlation can even be zero
when there is a very simple dependence structure: if X follows a
standard normal distribution, let y = abs(x). Note that the correlation
between x and y is zero. Indeed, since the expectation of x is zero,
cov(x, y) = E[x*y]. By definition, this equals E[x*abs(x)] which is zero
by symmetry. The following lines of code illustrate this observation:
>>> y = np.abs(x)
>>> stats.pearsonr(x, y)
PearsonRResult(statistic=-0.05444919272687482, pvalue=0.22422294836207743)
A non-zero correlation coefficient can be misleading. For example, if X has
a standard normal distribution, define y = x if x < 0 and y = 0 otherwise.
A simple calculation shows that corr(x, y) = sqrt(2/Pi) = 0.797...,
implying a high level of correlation:
>>> y = np.where(x < 0, x, 0)
>>> stats.pearsonr(x, y)
PearsonRResult(statistic=0.861985781588, pvalue=4.813432002751103e-149)
This is unintuitive since there is no dependence of x and y if x is larger
than zero which happens in about half of the cases if we sample x and y.
Vous êtes un professionnel et vous avez besoin d'une formation ?
RAG (Retrieval-Augmented Generation)et Fine Tuning d'un LLM
Voir le programme détaillé
Améliorations / Corrections
Vous avez des améliorations (ou des corrections) à proposer pour ce document : je vous remerçie par avance de m'en faire part, cela m'aide à améliorer le site.
Emplacement :
Description des améliorations :