Analysing One Sample - "Translation of the scipy.stats Documentation"

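Here, we set the shape parameter of the t distribution, which in this case is the degrees of freedom, to 10.
Using size=1000 means that our sample consists of 1000 independently drawn (pseudo) random numbers.
Since we did not specify loc and scale, they take their default values of 0 and 1.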
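
The call that this paragraph describes is not shown in this section. A minimal sketch of what it would look like is given below, written for Python 3. The seed is an assumption added only for reproducibility; the seed behind the numbers quoted in this section is not given, so the values will differ.

```python
import numpy as np
from scipy import stats

# Assumption: arbitrary seed, used only to make this sketch reproducible.
np.random.seed(12345)

# Shape parameter (degrees of freedom) 10, 1000 independent pseudo-random draws.
# loc and scale are not passed, so they keep their defaults of 0 and 1.
x = stats.t.rvs(10, size=1000)

print(type(x), x.shape)
```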

x is a numpy array, so we can call its array methods directly.


>>> print x.max(), x.min()  # equivalent to np.max(x), np.min(x)
5.26327732981 -3.78975572422
>>> print x.mean(), x.var() # equivalent to np.mean(x), np.var(x)
0.0140610663985 1.28899386208

How do the properties of the sample compare with those of the distribution itself?

>>> m, v, s, k = stats.t.stats(10, moments='mvsk')
>>> n, (smin, smax), sm, sv, ss, sk = stats.describe(x)

>>> print 'distribution:',
distribution:
>>> sstr = 'mean = %6.4f, variance = %6.4f, skew = %6.4f, kurtosis = %6.4f'
>>> print sstr %(m, v, s ,k)
mean = 0.0000, variance = 1.2500, skew = 0.0000, kurtosis = 1.0000
>>> print 'sample:      ',
sample:
>>> print sstr %(sm, sv, ss, sk)
mean = 0.0141, variance = 1.2903, skew = 0.2165, kurtosis = 1.0556

Note: stats.describe uses the unbiased estimator for the variance, while np.var uses the biased estimator.
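
The difference between the two estimators is only the divisor: n for the biased one and n - 1 for the unbiased one. The following sketch (Python 3, with its own illustrative sample and an arbitrary seed) shows how the ddof argument of np.var reproduces the value reported by stats.describe.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
sample = stats.t.rvs(10, size=1000, random_state=rng)

biased = np.var(sample)                      # divides by n (ddof=0, the NumPy default)
unbiased = np.var(sample, ddof=1)            # divides by n - 1
described = stats.describe(sample).variance  # describe uses ddof=1 by default

n = sample.size
print(biased, unbiased, described)
print(np.isclose(unbiased, described))              # True
print(np.isclose(unbiased, biased * n / (n - 1)))   # True
```

The same factor n / (n - 1) links the two variances quoted above: 1.28899 * 1000 / 999 is approximately 1.2903.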

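We can use the t-test to test whether the mean of our sample differs in a statistically significant way
from a given mean (here, the theoretical mean).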

>>> print 't-statistic = %6.3f pvalue = %6.4f' %  stats.ttest_1samp(x, m)
t-statistic =  0.391 pvalue = 0.6955

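The p-value is about 0.7. This means that at a significance level (the probability of a type I error) of,
for example, 10%, we cannot reject the hypothesis that the sample mean is equal to 0,
the theoretical mean of the standard t distribution.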

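As an exercise, we can also calculate the t-test directly, without using the provided function,
and get the same answer:

>>> tt = (sm-m)/np.sqrt(sv/float(n))  # t-statistic for mean
>>> pval = stats.t.sf(np.abs(tt), n-1)*2  # two-sided pvalue = Prob(abs(t)>tt)
>>> print 't-statistic = %6.3f pvalue = %6.4f' % (tt, pval)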
t-statistic =  0.391 pvalue = 0.6955

The Kolmogorov-Smirnov test (KS test) can be used to test the hypothesis that the sample comes from the standard t distribution.
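
A minimal sketch of such a test in Python 3 follows. The sample is regenerated with an arbitrary seed, so the statistic and p-value will not match any figures quoted elsewhere in this section. The shape parameter of the t distribution is passed through the args argument of stats.kstest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = stats.t.rvs(10, size=1000, random_state=rng)

# The reference distribution is given by name; its shape parameter
# (10 degrees of freedom) is passed through args.
res = stats.kstest(x, 't', args=(10,))
print('KS-statistic D = %6.3f pvalue = %6.4f' % (res.statistic, res.pvalue))
```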

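In real applications we do not know the underlying distribution, so we can also run the same test
against the standard normal distribution:

>>> print 'KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(x,'norm')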

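However, the standard normal distribution has a variance of 1, while our sample has a variance of 1.29.
If we standardize our sample and test it against the normal distribution, the p-value is again large enough
that we cannot reject the hypothesis that the sample came from the normal distribution.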

>>> d, pval = stats.kstest((x-x.mean())/x.std(), 'norm')
>>> print 'KS-statistic D = %6.3f pvalue = %6.4f' % (d, pval)
KS-statistic D =  0.032 pvalue = 0.2402

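Note: The KS test assumes that we test against a distribution with given parameters. In the last case
we estimated the mean and variance from the sample, so this assumption is violated. The distribution of the
test statistic, on which the p-value is based, is then not correct, and the p-value is biased.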
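
One way to see this effect is a small simulation. The sketch below (Python 3; the seed, sample size, and number of repetitions are arbitrary assumptions) draws samples that really are normal, standardizes each with its own estimated mean and standard deviation, and counts how often the KS test rejects at a nominal 5% level. With correctly specified parameters the rejection rate would be close to 5%; with estimated parameters it typically falls well below that.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n_rep, n_obs, alpha = 300, 200, 0.05

rejections = 0
for _ in range(n_rep):
    z = rng.standard_normal(n_obs)     # the data really are normal
    z_std = (z - z.mean()) / z.std()   # parameters estimated from the same data
    if stats.kstest(z_std, 'norm').pvalue < alpha:
        rejections += 1

rate = rejections / n_rep
print('rejection rate at nominal %.0f%% level: %.3f' % (alpha * 100, rate))
```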

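Finally, we can check the upper tail of the distribution. We can use the percent point function ppf,
which is the inverse of the cdf function, to obtain the critical values, or, more directly,
we can use the inverse of the survival function (isf).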

>>> crit01, crit05, crit10 = stats.t.ppf([1-0.01, 1-0.05, 1-0.10], 10)
>>> print 'critical values from ppf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f'% (crit01, crit05, crit10)
critical values from ppf at 1%, 5% and 10%   2.7638   1.8125   1.3722
>>> print 'critical values from isf at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f'% tuple(stats.t.isf([0.01,0.05,0.10],10))
critical values from isf at 1%, 5% and 10%   2.7638   1.8125   1.3722
>>> freq01 = np.sum(x>crit01) / float(n) * 100
>>> freq05 = np.sum(x>crit05) / float(n) * 100
>>> freq10 = np.sum(x>crit10) / float(n) * 100
>>> print 'sample %%-frequency at 1%%, 5%% and 10%% tail %8.4f %8.4f %8.4f'% (freq01, freq05, freq10)
sample %-frequency at 1%, 5% and 10% tail   1.4000   5.8000  10.5000

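In all three cases, our sample has a heavier upper tail than the underlying distribution: the observed
frequency to the right of each theoretical critical value is higher than the theoretical probability.
We can check a larger sample to see whether we get a closer match. In the following case the empirical
frequency is quite close to the theoretical probability, but if we repeat this several times,
the fluctuations remain of about this size.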

>>> freq05l = np.sum(stats.t.rvs(10, size=10000) > crit05) / 10000.0 * 100
>>> print 'larger sample %%-frequency at 5%% tail %8.4f'% freq05l
larger sample %-frequency at 5% tail   4.8000
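
The size of these fluctuations can be gauged with a short simulation. The sketch below (Python 3; the seed, the number of repetitions, and the variable names are assumptions) redraws the larger sample several times and compares the spread of the tail frequencies with the binomial standard error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
crit05 = stats.t.isf(0.05, 10)   # 5% upper-tail critical value of t(10)

n_obs, n_rep = 10000, 20
freqs = np.array([
    np.mean(stats.t.rvs(10, size=n_obs, random_state=rng) > crit05) * 100
    for _ in range(n_rep)
])

# Binomial standard error of the tail frequency, in percentage points.
se = np.sqrt(0.05 * 0.95 / n_obs) * 100
print('min %.2f  max %.2f  mean %.2f  (theoretical s.e. %.2f)'
      % (freqs.min(), freqs.max(), freqs.mean(), se))
```

With 10000 observations the standard error is about 0.22 percentage points, so a single result such as 4.8% is well within the expected range around 5%.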

We can also compare it with the tail of the normal distribution, which has much less weight in the tails:

>>> print 'tail prob. of normal at 1%%, 5%% and 10%% %8.4f %8.4f %8.4f'% \
...       tuple(stats.norm.sf([crit01, crit05, crit10])*100)
tail prob. of normal at 1%, 5% and 10%   0.2857   3.4957   8.5003

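The chi-square test can be used to test whether, for a finite number of bins, the observed frequencies
differ significantly from the probabilities of the hypothesized distribution.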
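
A minimal sketch of such a binned test in Python 3 follows. The choice of bin edges (quantiles of the t distribution with 10 degrees of freedom) and the seed are assumptions made for this illustration, and the sample is regenerated here, so the printed values are not the ones for the sample analysed in this section.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
x = stats.t.rvs(10, size=1000, random_state=rng)

# Assumed binning: edges at these quantiles of the t distribution with 10 df.
quantiles = [0.0, 0.01, 0.05, 0.1, 1 - 0.10, 1 - 0.05, 1 - 0.01, 1.0]
crit = stats.t.ppf(quantiles, 10)          # first and last edges are -inf and +inf
n_sample = x.size
freqcount = np.histogram(x, bins=crit)[0]  # observed count in each bin

tprob = np.diff(quantiles)                 # bin probabilities under t(10)
nprob = np.diff(stats.norm.cdf(crit))      # bin probabilities under N(0, 1)

# Expected counts are the bin probabilities times the sample size.
tch, tpval = stats.chisquare(freqcount, tprob * n_sample)
nch, npval = stats.chisquare(freqcount, nprob * n_sample)
print('chisquare for t:      chi2 = %6.3f pvalue = %6.4f' % (tch, tpval))
print('chisquare for normal: chi2 = %6.3f pvalue = %6.4f' % (nch, npval))
```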

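When this test is run against the two standard distributions, the standard normal distribution is clearly
rejected, while the standard t distribution cannot be rejected. Because the variance of our sample differs
from that of both standard distributions, we can first use the fit method to estimate scale and location,
and then repeat the test with the probabilities of the fitted distributions.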

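>>> # crit (bin edges), freqcount (observed counts per bin) and n_sample (x.size)
>>> # are the quantities of the binned chi-square test described above
>>> tdof, tloc, tscale = stats.t.fit(x)
>>> nloc, nscale = stats.norm.fit(x)
>>> tprob = np.diff(stats.t.cdf(crit, tdof, loc=tloc, scale=tscale))
>>> nprob = np.diff(stats.norm.cdf(crit, loc=nloc, scale=nscale))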
>>> tch, tpval = stats.chisquare(freqcount, tprob*n_sample)
>>> nch, npval = stats.chisquare(freqcount, nprob*n_sample)
>>> print 'chisquare for t:      chi2 = %6.3f pvalue = %6.4f' % (tch, tpval)
chisquare for t:      chi2 =  1.577 pvalue = 0.9542
>>> print 'chisquare for normal: chi2 = %6.3f pvalue = %6.4f' % (nch, npval)
chisquare for normal: chi2 = 11.084 pvalue = 0.0858

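Taking the estimated parameters into account, the p-value for the normal distribution is 0.0858, so we can
still reject normality at the 10% level, though not at the 5% level. With a p-value of 0.95,
we again clearly cannot reject the t distribution.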

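Because the normal distribution is the most common distribution in statistics, there are many methods
available to test whether a sample could have been drawn from a normal distribution.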

First, we can test whether the skewness and kurtosis of our sample differ significantly from those of a normal distribution.

>>> print 'normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(x)
normal skewtest teststat =  2.785 pvalue = 0.0054
>>> print 'normal kurtosistest teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(x)
normal kurtosistest teststat =  4.757 pvalue = 0.0000

These two tests are combined in the normality test:

>>> print 'normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(x)
normaltest teststat = 30.379 pvalue = 0.0000

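In all three tests the p-values are very low, so we can reject the hypothesis that our sample has the
skewness and kurtosis of the normal distribution.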

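Skewness and kurtosis do not change when the sample is shifted and rescaled, so we get exactly the same
results if we test the standardized sample.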

>>> print 'normaltest teststat = %6.3f pvalue = %6.4f' % \
...                      stats.normaltest((x-x.mean())/x.std())
normaltest teststat = 30.379 pvalue = 0.0000

Because normality is rejected so strongly, we can check whether normaltest gives reasonable results for other cases.

>>> print 'normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(stats.t.rvs(10, size=100))
normaltest teststat =  4.698 pvalue = 0.0955
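>>> print 'normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(stats.norm.rvs(size=1000))
normaltest teststat =  0.613 pvalue = 0.7361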