Published May 11, 2024 (author: 理依波)
1 Coin flipping example
2 Interpretation
3 Misunderstandings
4 Problems
5 See also
6 References
7 Further reading
8 External links
Coin flipping example
Main article: Checking whether a coin is fair
For example, an experiment is performed to determine whether a coin flip is fair (50% chance, each, of landing heads or tails)
or unfairly biased (> 50% chance of one of the outcomes).
Suppose that the experimental results show the coin turning up heads 14 times out of 20 total flips. The p-value of this result
would be the chance of a fair coin landing on heads at least 14 times out of 20 flips. The probability that 20 flips of a fair coin
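would result in 14 or more heads can be computed from binomial coefficients as
$$\Pr(\geq 14\ \text{heads}) = \frac{1}{2^{20}}\left[\binom{20}{14} + \binom{20}{15} + \cdots + \binom{20}{20}\right] = \frac{60460}{1048576} \approx 0.058.$$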
This probability is the (one-sided) p-value. It measures the chance that a fair coin would give a result at least this extreme.
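The one-sided calculation can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function name `one_sided_p_value` is chosen here for illustration and does not come from the article.

```python
from math import comb


def one_sided_p_value(heads: int, flips: int) -> float:
    """Probability that a fair coin gives at least `heads` heads in `flips` flips."""
    # Count the equally likely sequences with `heads` or more heads ...
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    # ... and divide by the total number of sequences, 2**flips.
    return favourable / 2 ** flips


p_one_sided = one_sided_p_value(14, 20)
print(f"{p_one_sided:.4f}")  # 0.0577
```

The result, 60460/1048576, is about 0.058.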
Interpretation
Traditionally, one rejects the null hypothesis if the p-value is smaller than or equal to the significance level,[3] often
represented by the Greek letter α (alpha). (Greek α is also used for the Type I error rate; the connection is that a hypothesis
test that rejects the null hypothesis for all samples that have a p-value less than or equal to α will have a Type I error rate
of at most α, and exactly α when the test statistic is continuous.) A significance level of 0.05 would deem as extraordinary any
result that is within the most extreme 5% of all possible results under the null hypothesis. In this case a p-value less than or
equal to 0.05 would result in the rejection of the null hypothesis at the 5% (significance) level.
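The link between the significance level and the Type I error rate can be made concrete for the one-sided coin test. The sketch below assumes 20 flips, α = 0.05, and rejection whenever the one-sided p-value is at most α; all names are illustrative. Because the number of heads is discrete, the achieved Type I error rate comes out below α rather than equal to it.

```python
from math import comb

FLIPS = 20    # number of flips, as in the coin example
ALPHA = 0.05  # significance level


def pmf(k: int) -> float:
    """Probability of exactly k heads in FLIPS flips of a fair coin."""
    return comb(FLIPS, k) / 2 ** FLIPS


def one_sided_p_value(k: int) -> float:
    """Probability of k or more heads under the null hypothesis."""
    return sum(pmf(j) for j in range(k, FLIPS + 1))


# Outcomes for which the test rejects the null hypothesis.
rejection_region = [k for k in range(FLIPS + 1) if one_sided_p_value(k) <= ALPHA]

# Probability, under the null hypothesis, of landing in that region.
type_i_error_rate = sum(pmf(k) for k in rejection_region)

print(rejection_region)            # [15, 16, 17, 18, 19, 20]
print(f"{type_i_error_rate:.4f}")  # 0.0207
```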
When we ask whether a given coin is fair, often we are interested in the deviation of our result from the equality of numbers of
heads and tails. In this case, the deviation can be in either direction, favoring either heads or tails. Thus, in this example of 14
heads and 6 tails, we may want to calculate the probability of getting a result deviating by at least 4 from parity in either
direction (two-sided test). This is the probability of getting at least 14 heads or at least 14 tails. As the binomial distribution is
symmetrical for a fair coin, the two-sided p-value is simply twice the above calculated single-sided p-value; i.e., the two-sided
p-value is 0.115.
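As a check on the two-sided figure, the sketch below sums the probabilities of every outcome that deviates from parity at least as much as the observed one. It does not rely on the doubling shortcut, although for a fair coin the two give the same result. The function name is illustrative.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Probability that a fair coin deviates from parity at least as much as observed."""
    # Work with 2*k - flips (heads minus tails) to stay in integer arithmetic.
    observed = abs(2 * heads - flips)
    extreme = [k for k in range(flips + 1) if abs(2 * k - flips) >= observed]
    return sum(comb(flips, k) for k in extreme) / 2 ** flips


p_two_sided = two_sided_p_value(14, 20)
print(f"{p_two_sided:.3f}")  # 0.115
```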
In the above example we thus have:
null hypothesis (H₀): fair coin; P(heads) = 0.5
observation O: 14 heads out of 20 flips; and
p-value of observation O given H₀ = Prob(≥ 14 heads or ≥ 14 tails) = 0.115.
The calculated p-value exceeds 0.05, so the observation is consistent with the null hypothesis — that the observed result of
14 heads out of 20 flips can be ascribed to chance alone — as it falls within the range of what would happen 95% of the time
were the coin in fact fair. In our example, we fail to reject the null hypothesis at the 5% level. Although the coin did not fall
evenly, the deviation from expected outcome is small enough to be consistent with chance.
However, had one more head been obtained, the resulting p-value (two-tailed) would have been 0.0414 (4.14%). This time
the null hypothesis – that the observed result of 15 heads out of 20 flips can be ascribed to chance alone – is rejected when
using a 5% cut-off.
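The comparison between 14 and 15 heads can be reproduced with a short sketch. It assumes the decision rule stated earlier (reject when the p-value is at most the significance level); the names are illustrative.

```python
from math import comb

ALPHA = 0.05  # the 5% cut-off used in the text
FLIPS = 20


def two_sided_p_value(heads: int, flips: int) -> float:
    """Two-sided p-value for a fair coin, summing all outcomes at least as extreme."""
    observed = abs(2 * heads - flips)
    extreme = (k for k in range(flips + 1) if abs(2 * k - flips) >= observed)
    return sum(comb(flips, k) for k in extreme) / 2 ** flips


decisions = {}
for heads in (14, 15):
    p = two_sided_p_value(heads, FLIPS)
    decisions[heads] = "reject H0" if p <= ALPHA else "fail to reject H0"
    print(f"{heads} heads: p = {p:.4f} -> {decisions[heads]}")
# 14 heads: p = 0.1153 -> fail to reject H0
# 15 heads: p = 0.0414 -> reject H0
```

A single additional head moves the result across the cut-off, which illustrates how sharp the dichotomy imposed by a fixed significance level is.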
To understand both the original purpose of the p-value p and the reasons p is so often misinterpreted, it helps to know that p
constitutes the main result of statistical significance testing (not to be confused with hypothesis testing), popularized by
Ronald A. Fisher. Fisher promoted this testing as a method of statistical inference. To call this testing inferential is
misleading, however, since inference makes statements about general hypotheses based on observed data, such as the
post-experimental probability a hypothesis is true. As explained above, p is instead a statement about data assuming the null
hypothesis; consequently, indiscriminately considering p as an inferential result can lead to confusion, including many of the
misinterpretations noted in the next section.
On the other hand, Bayesian inference, the main alternative to significance testing, generates probabilistic statements about
hypotheses based on data (and a priori estimates), and therefore truly constitutes inference. Bayesian methods can, for
instance, calculate the probability that the null hypothesis H₀ above is true assuming an a priori estimate of the probability
that a coin is unfair. Since a priori we would be quite surprised that a coin could consistently give 75% heads, a Bayesian
analysis would find the null hypothesis (that the coin is fair) quite probable even if a test gave 15 heads out of 20 tries (which
as we saw above is considered a "significant" result at the 5% level according to its p-value).
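A rough sketch of such a Bayesian calculation follows. It rests on assumptions that are not in the article: a prior probability of 0.99 that the coin is fair, and a single alternative hypothesis in which the coin gives heads 75% of the time. Both values are hypothetical and chosen only for illustration; a different prior or a different alternative changes the result.

```python
from math import comb

PRIOR_FAIR = 0.99      # hypothetical prior probability that the coin is fair
BIASED_P_HEADS = 0.75  # hypothetical single alternative: 75% heads


def binomial_likelihood(heads: int, flips: int, p_heads: float) -> float:
    """Probability of exactly `heads` heads in `flips` flips given P(heads)."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)


def posterior_fair(heads: int, flips: int) -> float:
    """Posterior probability of the fair-coin hypothesis via Bayes' theorem."""
    like_fair = binomial_likelihood(heads, flips, 0.5)
    like_biased = binomial_likelihood(heads, flips, BIASED_P_HEADS)
    evidence = PRIOR_FAIR * like_fair + (1 - PRIOR_FAIR) * like_biased
    return PRIOR_FAIR * like_fair / evidence


posterior = posterior_fair(15, 20)
print(f"{posterior:.2f}")
```

Under these assumed inputs the posterior probability that the coin is fair is roughly 0.88, even though the two-sided p-value for 15 heads out of 20 is below 0.05.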
Strictly speaking, then, p is a statement about data rather than about any hypothesis, and hence it is not inferential. This
raises the question, though, of how science has been able to advance using significance testing. The reason is that, in many
situations, p approximates some useful post-experimental probabilities about hypotheses, such as the post-experimental
probability of the null hypothesis. When this approximation holds, it could help a researcher to judge the post-experimental
plausibility of a hypothesis.[4][5][6][7] Even so, this approximation does not eliminate the need for caution in interpreting p
inferentially, as shown in the Jeffreys–Lindley paradox mentioned below.
Misunderstandings
Comparing the p-value to a significance level yields one of two results: either the null hypothesis is
rejected, or the null hypothesis cannot be rejected at that significance level (which however does not imply that the null
hypothesis is true). A small p-value that indicates statistical significance does not indicate that an alternative hypothesis is
ipso facto correct.
Despite the ubiquity of p-value tests, this particular test for statistical significance has come under heavy criticism due both to
its inherent shortcomings and the potential for misinterpretation.
There are several common misunderstandings about p-values.[8][9]
1. The p-value is not the probability that the null hypothesis is true.
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and
classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very
close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the
results more easily). This is the Jeffreys–Lindley paradox.
2. The p-value is not the probability that a finding is "merely a fluke."
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently
cannot also be used to gauge the probability of that assumption being true. The actual meaning is different: the
p-value is the chance of obtaining results at least as extreme as those observed if the null hypothesis is true.
3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called
prosecutor's fallacy.
4. The p-value is not the probability that a replicating experiment would not yield the same conclusion.
5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
6. The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data
are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
(However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given
level, and allows readers to decide for themselves whether to consider the results significant.)
7. The p-value does not indicate the size or importance of the observed effect (compare with effect size). The two do vary
together, however: the larger the effect, the smaller the sample size required to get a significant p-value.
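The last point, that a p-value reflects sample size as well as effect size, can be illustrated with a sketch. It compares the article's 14 heads out of 20 with a hypothetical larger experiment of 140 heads out of 200, which has the same observed proportion of 70%. The larger experiment is an assumption made for illustration and is not data from the article.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Two-sided p-value for a fair coin, summing all outcomes at least as extreme."""
    observed = abs(2 * heads - flips)
    extreme = (k for k in range(flips + 1) if abs(2 * k - flips) >= observed)
    return sum(comb(flips, k) for k in extreme) / 2 ** flips


p_small_sample = two_sided_p_value(14, 20)    # 70% heads in 20 flips
p_large_sample = two_sided_p_value(140, 200)  # 70% heads in 200 flips (hypothetical)

print(f"n = 20:  p = {p_small_sample:.4f}")
print(f"n = 200: p = {p_large_sample:.2e}")
```

The observed effect is identical in both cases, yet only the larger sample gives a significant p-value, so the p-value alone says nothing about how large the effect is.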
Problems
Main article: Statistical hypothesis testing#Controversy
Critics of p-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary
choice of level (often set at 0.05).[10] If significance testing is applied to hypotheses that are known to be false in advance, an
insignificant result will simply reflect an insufficient sample size. Another problem is that the definition of "more extreme" data
depends on the intentions of the investigator; for example, the situation in which the investigator flips the coin 100 times has
a set of extreme data that is different from the situation in which the investigator continues to flip the coin until 50 heads are
achieved.[11]
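The dependence on the investigator's intentions can be sketched with the earlier data of 14 heads and 6 tails. The two designs below are hypothetical variants chosen for illustration: in the first the investigator planned exactly 20 flips, and in the second the investigator planned to flip until the 6th tail appeared. The data are the same, but "at least as extreme" refers to a different set of outcomes, so the one-sided p-values differ.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 14, 6  # the observed data in both designs


def p_fixed_number_of_flips(heads: int, tails: int) -> Fraction:
    """Design A: flips fixed in advance; extreme means `heads` or more heads."""
    flips = heads + tails
    return Fraction(sum(comb(flips, k) for k in range(heads, flips + 1)), 2 ** flips)


def p_flip_until_tails(heads: int, tails: int) -> Fraction:
    """Design B: flip until the `tails`-th tail; extreme means `heads` or more heads first."""
    # P(exactly h heads before the tails-th tail) = C(h + tails - 1, tails - 1) / 2**(h + tails)
    fewer_heads = sum(
        Fraction(comb(h + tails - 1, tails - 1), 2 ** (h + tails)) for h in range(heads)
    )
    return 1 - fewer_heads


p_design_a = p_fixed_number_of_flips(HEADS, TAILS)
p_design_b = p_flip_until_tails(HEADS, TAILS)

print(f"fixed 20 flips:      p = {float(p_design_a):.4f}")  # 0.0577
print(f"flip until 6th tail: p = {float(p_design_b):.4f}")  # 0.0318
```

With a 5% cut-off, the same 14 heads and 6 tails would be non-significant under the first design and significant under the second.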
As noted above, the p-value p is the main result of statistical significance testing. Fisher proposed p as an informal measure
of evidence against the null hypothesis. He called on researchers to combine p in the mind with other types of evidence for and