
Why bother with Bayesian t-tests?

Fintan Costello

School of Computer Science and Informatics,

University College Dublin

and

Paul Watts

Department of Theoretical Physics,

National University of Ireland Maynooth

November 7, 2022

Abstract

Given the well-known and fundamental problems with hypothesis testing via clas-

sical (point-form) signiﬁcance tests, there has been a general move to alternative

approaches, often focused on the Bayesian t-test. We show that the Bayesian t-test

approach does not address the observed problems with classical signiﬁcance testing,

that Bayesian and classical t-tests are mathematically equivalent and linearly related

in order of magnitude (so that the Bayesian t-test provides no further information

beyond that given by point-form signiﬁcance tests), and that Bayesian t-tests are

subject to serious risks of misinterpretation, in some cases more problematic than

seen for classical tests (with, for example, a negative sample mean in an experiment

giving strong Bayesian t-test evidence in favour of a positive population mean). We

do not suggest a return to the classical, point-form signiﬁcance approach to hypothe-

sis testing. Instead we argue for an alternative distributional approach to signiﬁcance

testing, which addresses the observed problems with classical hypothesis testing and

provides a natural link between the Bayesian and frequentist approaches.

Keywords: Hypothesis Testing; Signiﬁcance; Replication

arXiv:2211.02613v1 [math.ST] 4 Nov 2022

1 Introduction

It is clear that classical or point-form signiﬁcance testing has serious problems: many sta-

tistically signiﬁcant experimental results fail to occur reliably in replications (e.g. Camerer

et al., 2018; Open Science Collaboration et al., 2015; Klein et al., 2018, 2014), the chance

of getting a statistically signiﬁcant p-value increases with sample size, irrespective of the

presence or absence of a true eﬀect (Thompson, 1998) and point-form null hypotheses are

always false (and to quote Cohen, 2016: “if the null hypothesis is always false, what’s the

big deal about rejecting it?”). In an attempt to address these problems various researchers

have argued for a move to Bayesian hypothesis testing approaches, with a particular focus

on generalisations of Jeffreys's Bayesian t-test, which involves a Bayes Factor comparison with a nested, point-form null hypothesis (Jeffreys, 1948; Gönen et al., 2005; Fox and Dimmic, 2006; Rouder et al., 2009; Wang and Liu, 2016; Schmalz et al., 2021). In this

paper we show that this Bayesian t-test approach does not, in fact, address any of these

problems with classical null hypothesis testing. Instead, the Bayesian t-test involves com-

parison against a point-form null which we know is always false (so what’s the big deal

about getting evidence against it?); the form of the Bayesian t-test means that the probability

of getting Bayesian evidence against the null increases with sample size, irrespective of the

presence or absence of a true eﬀect; and the Bayesian t-test gives results which are simply a

linear transformation of those obtained in classical signiﬁcance tests and so are necessarily

subject to the same problems of replication and reliability as seen in classical tests. We

also show that the Bayesian t-test is subject to serious risks of misinterpretation, arguably

more problematic than those seen for classical tests. We demonstrate these points in detail

below, beginning with a derivation of the general Bayesian t-test and next showing that

these problems all hold with this general form (and so hold for all speciﬁc instantiations).

We then argue that researchers should move to hypothesis testing relative to distributional

rather than point-form nulls, and show that the distributional approach does not suﬀer

from any of these issues and further, naturally relates the Bayesian and frequentist ap-

proaches to hypothesis testing. We conclude by brieﬂy giving our view of the roles that

Bayesian and frequentist statistical tools can play in scientiﬁc research.

2 The Bayesian t-test

The Bayesian t-test is a speciﬁc type of Bayes Factor test, originally developed by Jeﬀreys

(1948) and extended in various different ways by a range of other researchers. The Bayes Factor

$$BF_{10} = \frac{p(y|H_1)}{p(y|H_0)}$$

is the ratio of the likelihood of observed data $y$ under hypothesis $H_1$ to its likelihood under $H_0$: the higher the value of $BF_{10}$, the more Bayesian evidence our data $y$ gives in favour of $H_1$ and against $H_0$. More specifically, hypotheses or models $H_0$ and $H_1$ are taken to be probability distributions with some parameter values fixed and some parameters following specified prior distributions, and $p(y|H)$ is estimated in terms of the density of the $H$ distribution integrated over those priors (the marginal likelihood). A Bayesian t-test is a Bayes Factor test where the observed data takes the form of a $t$ statistic, and where the hypothesis $H_0$ is the null hypothesis used in the classical t-test. Various forms of Bayesian t-test have been proposed in the literature: while each uses a different form of alternative hypothesis $H_1$, all have the same statistic $t$ and the same null hypothesis $H_0$.

Here we present the Bayesian t-test in the context of a one-sample test, but with a generic structure which covers all possible forms of alternative hypothesis $H_1$. We assume an experiment involving $N$ measurements or observations $X$ with observed sample mean $\bar{X}$, degrees of freedom $\nu$ and sample variance

$$S^2 = \frac{1}{\nu}\sum (X - \bar{X})^2 .$$

We assume that observations $X$ follow a normal distribution

$$X \sim N(\mu, \sigma^2)$$

for unknown parameters $\mu$ and $\sigma$, which means that $\nu S^2/\sigma^2$ follows a $\chi^2$ distribution with $\nu$ degrees of freedom. We let

$$d = \frac{\bar{X}}{S}$$

represent the sample effect in this experiment and define the variable

$$t = \frac{\bar{X}}{S}\sqrt{N} = d\sqrt{N}.$$
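As a small illustration of these definitions, the sketch below computes the sample effect $d$ and the statistic $t$ for a one-sample test against $\mu = 0$. The function name `sample_effect_and_t` and the toy data are illustrative assumptions, not part of the paper's analysis.

```python
import math
import statistics

def sample_effect_and_t(xs):
    """Return (d, t) for a one-sample test of mu = 0.

    d = mean / S, where S is the sample standard deviation with
    nu = N - 1 degrees of freedom, and t = d * sqrt(N).
    """
    n = len(xs)
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # divides the sum of squares by nu = n - 1
    d = mean / s
    t = d * math.sqrt(n)
    return d, t

# Toy data, chosen only to make the arithmetic easy to follow.
d, t = sample_effect_and_t([1.0, 2.0, 3.0, 4.0, 5.0])
print(d, t)
```

Here the sample mean is 3 and $S^2 = 2.5$, so $t = 3\sqrt{5}/\sqrt{2.5} = 3\sqrt{2}$.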

Our null hypothesis is that $\mu = 0$: that observations $X$ follow the normal distribution

$$X \sim N(0, \sigma^2)$$

with unknown variance $\sigma^2$ (the null hypothesis in a classical t-test), so hypothesis

$$H_0 : \bar{X} \sim N(0, \sigma^2/N) \tag{1}$$

and so the variable

$$\frac{\bar{X}/\sqrt{\sigma^2/N}}{\sqrt{(\nu S^2/\sigma^2)/\nu}} = t$$

follows a T distribution with $\nu$ degrees of freedom.

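We take the alternative hypothesis $H_1$ to have parameters $\sigma$, $m$ and $\sigma_\mu$ such that observations $X$ follow the Normal distribution

$$X \sim N(\mu, \sigma^2)$$

and $\mu$ itself follows the Normal distribution

$$\mu \sim N(m, \sigma_\mu^2)$$

so that for given values of $\sigma$, $\sigma_\mu$ and $m$ we see that $\bar{X}$ follows the distribution

$$\bar{X} \sim \int_{-\infty}^{\infty} N(\mu, \sigma^2/N)\, N(\mu|m, \sigma_\mu^2)\, d\mu = N(m, \sigma^2/N + \sigma_\mu^2) = N\left(m, (\sigma^2/N)\left[1 + N\sigma_\mu^2/\sigma^2\right]\right).$$

Defining $\sigma_\delta = \sigma_\mu/\sigma$ (so that $\sigma_\delta^2$ is the variance of the effect size $\delta = \mu/\sigma$) this means that our hypothesis

$$H_1 : \bar{X} \sim N\left(m, (\sigma^2/N)\left[1 + N\sigma_\delta^2\right]\right) \tag{2}$$

so that for fixed values of $m$ and $\sigma_\delta$ the variable

$$\frac{t}{\sqrt{1 + N\sigma_\delta^2}} = \frac{\dfrac{\bar{X} - m}{\sqrt{(\sigma^2/N)[1 + N\sigma_\delta^2]}} + \dfrac{m}{\sqrt{(\sigma^2/N)[1 + N\sigma_\delta^2]}}}{\sqrt{(\nu S^2/\sigma^2)/\nu}}$$

follows a non-central T distribution with $\nu$ degrees of freedom and non-centrality parameter

$$\frac{m}{\sqrt{(\sigma^2/N)[1 + N\sigma_\delta^2]}} = \frac{\delta}{\sqrt{1/N + \sigma_\delta^2}}$$

(where $\delta$ here denotes the hypothesised effect size $m/\sigma$).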
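The marginal variance in Equation 2 can be checked by simulation. The sketch below draws $\mu$ from its prior, draws $N$ observations around that $\mu$, and compares the variance of the resulting sample means with $\sigma^2/N + \sigma_\mu^2$. The parameter values, the seed and the function name are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

def simulate_sample_means(m, sigma, sigma_mu, n, reps, seed=0):
    """Simulate sample means under the alternative model.

    For each replication: mu ~ N(m, sigma_mu^2), then n observations
    X ~ N(mu, sigma^2); the sample mean of those observations is kept.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        mu = rng.gauss(m, sigma_mu)
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(xs) / n)
    return means

# Hypothetical values, for illustration only.
m, sigma, sigma_mu, n = 0.5, 2.0, 1.0, 10
means = simulate_sample_means(m, sigma, sigma_mu, n, reps=20000)

predicted_var = sigma**2 / n + sigma_mu**2   # variance in Equation 2
simulated_var = statistics.variance(means)
print(predicted_var, simulated_var)
```

With these values the predicted variance is 1.4, of which only 0.4 shrinks as $N$ grows; the simulated value should land close to the prediction, up to Monte Carlo error.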

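Letting $T_\nu$ represent the standard (central) T distribution with $\nu$ degrees of freedom and $T_\nu(\theta)$ represent the non-central T distribution with $\nu$ degrees of freedom and non-centrality parameter $\theta$, we can thus express our two hypotheses $H_0$ and $H_1$ in equivalent forms as

$$H_0 : t \sim T_\nu \tag{3}$$

and

$$H_1 : \frac{t}{\sqrt{1 + N\sigma_\delta^2}} \sim T_\nu\left(\frac{\delta}{\sqrt{1/N + \sigma_\delta^2}}\right) \tag{4}$$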

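In the Bayesian approach the likelihoods $p(t|H_0)$ and $p(t|H_1)$ are taken to be equal to the density of these distributions at the values given in $H_0$ and $H_1$. Taking $f_\nu(t)$ to be the density of the standard T distribution at $t$ and $f_\nu(t; \theta)$ to be the density of the non-central T with parameter $\theta$, the density of $H_0$ at $t$ is

$$p(t|H_0) = f_\nu(t)$$

and the density of $H_1$ at $t$ is

$$p(t|H_1) = \frac{1}{\sqrt{1 + N\sigma_\delta^2}}\, f_\nu\left(\frac{t}{\sqrt{1 + N\sigma_\delta^2}};\ \frac{\delta}{\sqrt{1/N + \sigma_\delta^2}}\right)$$

giving

$$BF_{10} = \frac{\dfrac{1}{\sqrt{1 + N\sigma_\delta^2}}\, f_\nu\left(\dfrac{t}{\sqrt{1 + N\sigma_\delta^2}};\ \dfrac{\delta}{\sqrt{1/N + \sigma_\delta^2}}\right)}{f_\nu(t)} \tag{5}$$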

Equation 5 represents, for example, a one-sample instantiation of the Bayesian t-test of Gönen et al. (Gönen et al., 2005; Gronau et al., 2019) and, taking $\delta = 0$ and adding an inverse-$\chi^2(1)$ prior on $\sigma_\delta^2$, represents the one-sample JZS Bayesian t-test of Rouder et al. (2009). Other forms of the Bayesian t-test are produced by assuming different prior distributions for the parameters $\delta$ and $\sigma_\delta$ or by expanding the hypotheses in various ways (Jeffreys's original formulation, for example, involves splitting H into three component hypotheses); all approaches, however, take some analog of these $H_0$ and $H_1$ distributions as their starting point, and so this presentation characterises the general Bayesian t-test.

A core distinction between different forms of Bayesian t-test concerns the choice of value or prior for $\delta$ (and so for $m$, the mean for the distribution of $\mu$ in the alternative hypothesis $H_1$). Default or local tests assume that $\delta$ is either equal to 0 (a delta distribution) or has a mean of 0. This implies that $m$ is also equal to or has a mean of 0, and so in these tests both $H_0$ and $H_1$ assume the same mean for $\bar{X}$. Informed or non-local tests, by contrast, assume that $\delta$ is equal to or distributed around some non-zero value chosen on the basis of prior knowledge in some way, so that $\bar{X}$ is assumed to have a different mean in $H_1$ than in $H_0$. In the next section we discuss various problems of interpretation that arise with default or local tests.

2.1 Problems of interpretation: default tests

Two points are immediately evident from this general presentation of the Bayesian t-test. First, distribution $H_0$ is the null hypothesis distribution that underlies the classical or point-form t-test (which also assumes that $\bar{X}$ is normally distributed around a mean of 0 with variance $\sigma^2/N$). The point-form null hypothesis, however, is always false, and so it is not clear what is to be gained by testing against it. Second, for default tests with $\delta = 0$ (and so $m = 0$) hypotheses $H_0$ and $H_1$ differ only in the variance they assign to $\bar{X}$ (compare Equation 1 to Equation 2 with $m = 0$). Here there is a serious risk of misinterpretation, arising because researchers commonly take default Bayesian t-test results in favour of $H_1$ as giving evidence that the population mean differs from 0 (that is, evidence of a significant effect). This is clearly incorrect: in a default test $H_1$ assumes that the population mean is 0, and we cannot take evidence in favour of $H_1$ as evidence against this assumption.

2.2 Bayesian evidence but no real eﬀect

For any fixed value of $\sigma$, the variance of $\bar{X}$ in $H_0$ falls with rising $N$ to a limit of 0, since that variance is $\sigma^2/N$. This means that the probability of getting any value $\bar{X} \neq 0$ under $H_0$ similarly falls to 0 with rising $N$; and so, for any positive value $y$ and any sample effect $d \neq 0$, there exists some $N_y$ such that $p(t|H_0) = p(d\sqrt{N}|H_0) < y$ for all $N \geq N_y$.

For any fixed values of $\sigma$, $m$ and $\sigma_\mu$, however, the variance of $\bar{X}$ in $H_1$ falls with rising $N$ to a limit of $\sigma_\mu^2$ (since that variance is $\sigma^2/N + \sigma_\mu^2$). This means that for any sample effect $d \neq 0$ there will exist some value $N_0$ such that $p(d\sqrt{N}|H_1) > p(d\sqrt{N}|H_0)$ holds for all $N \geq N_0$. Given this we see that

$$\lim_{N\to\infty} BF_{10} = \lim_{N\to\infty} \frac{p(t|H_1)}{p(t|H_0)} = \infty$$

necessarily holds for any value $\bar{X} \neq 0$ (and hence any $t \neq 0$), and any required level of evidence in favour of the alternative hypothesis in a Bayesian t-test can be obtained with large enough sample size $N$, irrespective of the presence or absence of a true effect, for both default and informed tests and irrespective of the choice of priors.
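The growth of the Bayes Factor with $N$ can be seen numerically for a default test ($\delta = 0$). The sketch below evaluates the log of Equation 5 for a fixed, small sample effect $d$ and increasing $N$; because both densities are central T densities, their normalising constants cancel and only the kernel ratio is needed. The choice $d = 0.05$, the prior scale $\sigma_\delta = 1$ and the function name are illustrative assumptions.

```python
import math

def log_bf10_default(d, n, sigma_delta=1.0):
    """log BF10 from Equation 5 with delta = 0 and nu = n - 1.

    BF10 = (1/sqrt(s)) * f_nu(t / sqrt(s)) / f_nu(t), with s = 1 + n * sigma_delta^2.
    The T normalising constants cancel, leaving a ratio of (1 + x^2/nu) kernels.
    """
    nu = n - 1
    t = d * math.sqrt(n)
    s = 1.0 + n * sigma_delta**2
    log_density_ratio = 0.5 * (nu + 1) * (
        math.log1p(t * t / nu) - math.log1p(t * t / (s * nu))
    )
    return log_density_ratio - 0.5 * math.log(s)

# Fixed sample effect, increasing sample size.
for n in (100, 1000, 10000, 100000):
    print(n, round(log_bf10_default(0.05, n), 2))
```

For small $N$ the log Bayes Factor is negative (the data favour the null), but once $N$ is large enough the $t^2/2$ term dominates the $\log\sqrt{1 + N\sigma_\delta^2}$ penalty and the evidence for $H_1$ grows without bound, as the argument above describes.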

2.3 Problems of interpretation: informed tests

Comparing Equations 1 and 2 we see that for informed tests with $m \neq 0$, $H_0$ and $H_1$ differ both in their assumed means and in their models of variance for $\bar{X}$. This means that evidence against $H_0$ and in favour of $H_1$ may arise as a consequence of this difference in variance alone; again, this leads to a serious risk of misinterpretation, where researchers may assume that evidence in favour of $H_1$ indicates that the population mean is closer to or more consistent with the alternative mean $m$ than the null mean 0. This is not the case.

It may be useful to give a concrete example of the problem. Suppose we have a one-sample experiment with sample size $N = 50$ (and so $\nu = 49$) and that in our null hypothesis we assume $\delta = 0$ (there is no effect) and in our alternative hypothesis we assume $\delta = 0.5$ (there is a medium-sized positive effect). For our Bayesian analysis, we make the standard choice of a simple unit-information prior for $\delta$ of $\sigma_\delta = 1$. Suppose we observe a medium-sized negative effect in our experiment of $d = -0.5$ (so that $t = -0.5\sqrt{50}$). Then applying Equation 5 we have a Bayesian t-test comparing $H_1$ to $H_0$:

$$BF_{10} = \frac{\dfrac{1}{\sqrt{1 + N\sigma_\delta^2}}\, f_\nu\left(\dfrac{t}{\sqrt{1 + N\sigma_\delta^2}};\ \dfrac{\delta}{\sqrt{1/N + \sigma_\delta^2}}\right)}{f_\nu(t)} = \frac{\dfrac{1}{\sqrt{1 + 50}}\, f_{49}\left(\dfrac{-0.5\sqrt{50}}{\sqrt{1 + 50}};\ \dfrac{0.5}{\sqrt{1 + 1/50}}\right)}{f_{49}\left(-0.5\sqrt{50}\right)} \approx 25$$

This is strong Bayesian evidence in favour of the alternative hypothesis $H_1$, and if we mistakenly assume that $BF_{10}$ gives evidence about the hypothesised effect $\delta$ (as opposed to the effect-plus-variance model $H_1$), we will be led to the nonsensical conclusion that observing a medium-sized negative effect in our experiment gives us strong Bayesian evidence in favour of a medium-sized positive effect.
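The value of roughly 25 in this example can be reproduced with a short, self-contained sketch. It uses only the standard library: the non-central T density is obtained by numerically integrating over $w = \sqrt{\chi^2_\nu/\nu}$ (given $w$, the ratio $(Z + \theta)/w$ is Normal with mean $\theta/w$ and standard deviation $1/w$). The function names, the Simpson grid and the integration limit are assumptions of this sketch, not the authors' R implementation.

```python
import math

def t_pdf(x, nu):
    """Density of the central T distribution with nu degrees of freedom."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - 0.5 * (nu + 1) * math.log1p(x * x / nu))

def nct_pdf(x, nu, theta, steps=4000, upper=4.0):
    """Density of the non-central T, by Simpson integration over w = sqrt(chi2_nu / nu)."""
    log_k = math.log(2.0) + (nu / 2) * math.log(nu / 2) - math.lgamma(nu / 2)

    def integrand(w):
        if w <= 0.0:
            return 0.0
        log_h = log_k + (nu - 1) * math.log(w) - nu * w * w / 2  # log density of w
        z = x * w - theta
        phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        return w * phi * math.exp(log_h)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3

def bf10(t, n, delta, sigma_delta):
    """Bayes Factor of Equation 5 for a one-sample test (nu = n - 1)."""
    nu = n - 1
    scale = math.sqrt(1 + n * sigma_delta**2)
    theta = delta / math.sqrt(1 / n + sigma_delta**2)
    return nct_pdf(t / scale, nu, theta) / scale / t_pdf(t, nu)

# The example from the text: N = 50, d = -0.5, delta = 0.5, sigma_delta = 1.
n, d = 50, -0.5
print(bf10(d * math.sqrt(n), n, delta=0.5, sigma_delta=1.0))
```

The numerator density is evaluated about one standard deviation below the centre of a wide distribution, while the denominator is evaluated about 3.5 standard deviations into the tail of a narrow one; that difference in spread, not the sign of the effect, is what drives the ratio.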

Note that we pick on this one-sample instantiation of the Gönen et al. (2005) t-test here only because of its clarity and simplicity of presentation: the general problem (of negative results giving apparently strong evidence in favour of a positive hypothesis) applies for all informed or non-local Bayesian t-tests, and arises, as before, because the variance of $H_0$ falls to 0 with rising $N$ while the variance of $H_1$ does not.

2.4 Default Bayesian t-tests and classical t-tests are equivalent

Our last point involves the relationship between the Bayesian and classical t-tests. It has

long been observed that Bayesian t-test evidence in favour of the alternative and classical

point-form evidence against the null are essentially equivalent (to quote Jeﬀreys: “As a

matter of fact I have applied my signiﬁcance tests to numerous applications that have also

been worked out by Fisher’s, and have not yet found a disagreement in the actual decisions

reached”; cited in Ly et al., 2016); here we explain why this relationship holds.

We first note that for large $\nu$ we have $T_\nu \approx \Phi$ (the standard T distribution is well approximated by the standard Normal distribution) and that for the standard Normal distribution the Mills ratio

$$M(x) = \frac{\Phi(-|x|)}{\phi(x)}$$

(the ratio of the cumulative Normal function at $-|x|$ to the probability density at $x$) has the well-known asymptotic approximation

$$M(x) \approx \frac{1}{|x|} \tag{6}$$

which is relatively accurate for $|x| > 3$ (e.g. Small, 2010, p. 43). This means that the classical $p$ value for a given $t$ is approximated by

$$p = 2T_\nu(-|t|) \approx 2\Phi(-|t|) \approx \frac{2\phi(t)}{|t|}$$

and substituting the expression for the standard Normal density

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

and taking the log gives

$$\log(1/p) \approx \frac{t^2}{2} + \log(|t|) + \log\sqrt{\pi/2}.$$
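The quality of these approximations is easy to inspect numerically. The sketch below compares the exact Mills ratio with $1/|x|$, and the exact $\log(1/p)$ (using the Normal approximation $p = 2\Phi(-|t|)$) with the three-term expression above. Helper names are illustrative assumptions.

```python
import math

def norm_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def mills(x):
    """Mills ratio M(x) = Phi(-|x|) / phi(x)."""
    return norm_cdf(-abs(x)) / norm_pdf(x)

def log_inv_p_approx(t):
    """Approximation log(1/p) ~ t^2/2 + log|t| + log sqrt(pi/2)."""
    return t * t / 2 + math.log(abs(t)) + math.log(math.sqrt(math.pi / 2))

for t in (3.0, 4.0, 5.0):
    p = 2 * norm_cdf(-abs(t))
    print(t, mills(t), 1 / abs(t), math.log(1 / p), log_inv_p_approx(t))
```

The relative error of $1/|x|$ shrinks as $|x|$ grows, and on the log scale the discrepancy is small compared with the $t^2/2$ term that dominates $\log(1/p)$.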



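For a default Bayesian t-test with $\delta = 0$ we have

$$BF_{10} = \frac{\dfrac{1}{\sqrt{1 + N\sigma_\delta^2}}\, f_\nu\left(\dfrac{t}{\sqrt{1 + N\sigma_\delta^2}}\right)}{f_\nu(t)} \approx \frac{\dfrac{1}{\sqrt{1 + N\sigma_\delta^2}}\, \phi\left(\dfrac{t}{\sqrt{1 + N\sigma_\delta^2}}\right)}{\phi(t)}$$

and again substituting and taking the log gives

$$\log(BF_{10}) \approx \frac{t^2}{2}\left[1 - \frac{1}{1 + N\sigma_\delta^2}\right] - \log\sqrt{1 + N\sigma_\delta^2} \approx \frac{t^2}{2} - \log\sqrt{1 + N\sigma_\delta^2}$$

and thus

$$\log(BF_{10}) \approx \log(1/p) - \log(|t|) - \log\sqrt{\pi(1 + N\sigma_\delta^2)/2}.$$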
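A quick numerical check of this last relation, under the same large-$\nu$ Normal approximation used in the derivation, is sketched below. The two functions compute the left-hand and right-hand sides separately; the sample size $N = 1000$, the prior scale $\sigma_\delta = 1$ and the function names are arbitrary choices made for illustration.

```python
import math

def log_bf10_normal(t, n, sigma_delta=1.0):
    """Normal-approximation log BF10 for a default test (delta = 0)."""
    s = 1 + n * sigma_delta**2
    return 0.5 * t * t * (1 - 1 / s) - 0.5 * math.log(s)

def log_bf10_from_p(t, n, sigma_delta=1.0):
    """log(1/p) - log|t| - log sqrt(pi * (1 + N sigma_delta^2) / 2), with p = 2 Phi(-|t|)."""
    p = math.erfc(abs(t) / math.sqrt(2))  # equals 2 * Phi(-|t|)
    s = 1 + n * sigma_delta**2
    return math.log(1 / p) - math.log(abs(t)) - 0.5 * math.log(math.pi * s / 2)

for t in (3.0, 4.0, 5.0, 6.0):
    print(t, log_bf10_normal(t, 1000), log_bf10_from_p(t, 1000))
```

The two columns agree to within about a tenth of a log unit over this range, with the gap narrowing as $|t|$ increases, which is the behaviour the Mills ratio approximation predicts.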



It is clear that large changes in the value of $t$ cause large changes in the value of $t^2$ but much smaller changes in $\log|t|$. This means that if we have a set of experiments with approximately the same sample size $N$ (so that changes in the $\log\sqrt{\pi(1 + N\sigma_\delta^2)/2}$ term across experiments are small) then we expect

$$\log BF_{10} \approx \log 1/p + C$$

to hold across a given set of experiments for some constant

$$C = -\left\langle \log\left(|t|\sqrt{\pi(1 + N\sigma_\delta^2)/2}\right) \right\rangle$$

where $\langle x \rangle$ indicates the average value of $x$ in those experiments. This tells us that the Bayesian t-test $BF_{10}$ and the point-form significance $p$ are equivalent, at least in terms of order of magnitude. Our main concern when considering statistical significance (or Bayesian evidence) is in the order of magnitude of our result rather than its exact value: in this context the Bayesian $BF_{10}$ and the classical $p$-value convey the same information, and the two tests are essentially the same.

2.5 Testing the equivalence between $p$ and $BF_{10}$

We tested this predicted relationship between Bayesian and classical t-tests using data

from the ﬁrst Many Labs replication project (Klein et al., 2014). This involved the replica-

tion of 16 diﬀerent experimental tasks investigating a variety of classic and contemporary

psychological eﬀects covering a range of diﬀerent topics. Each experiment was originally

published in the cognitive or social psychology literature, and was replicated by researchers

in around 36 diﬀerent sites. Of these 16 tasks, 11 involved independent t-tests: we down-

loaded the data on all experimental replications of these 11 tasks (396 experiments in total)

and used the standard R t.test function (R Core Team, 2021) to calculate the t-test p and

the ttest.tstat function (from the BayesFactor package, Morey and Rouder, 2021) to calculate the Bayesian t-test $BF_{10}$ for each of these experiments. The R script for this analysis

is available online (see Supplementary Materials).

This particular form of Bayesian t-test is a default test assuming an alternative hypothesis $H_1$ with $\delta = 0$ and with effect sizes distributed normally around $\delta$ with variance $\sigma_\delta^2$ which itself follows an inverse $\chi^2$ distribution with 1 degree of freedom. Under this prior $\sigma_\delta$ is distributed around 1: this prior is therefore equivalent to, though slightly less informative than, the unit information prior $\sigma_\delta = 1$ we used in our earlier example.

Since the t-tests in this dataset were all independent two-sample tests with $N_1$ samples in one group and $N_2$ in the other, we took the effective sample size in each experiment to be

$$N_{\mathrm{eff}} = \frac{1}{1/N_1 + 1/N_2}$$

and calculated the value

$$-\log\left(|t|\sqrt{\pi(1 + N_{\mathrm{eff}})/2}\right)$$

for each experiment in this dataset, and took $C$ to be the mean of these values, giving $C = -2.81$ for these experiments. Our prediction is that $\log(BF_{10})$ and $\log(1/p)$ will have a linear relation in these experiments, with a slope of 1 and an intercept of $C$. To test this prediction we took the $p$ and $BF_{10}$ values for each individual experiment and calculated the best-fitting $\log(BF_{10})$ vs $\log(1/p)$ line relating these values. The best-fitting line had a slope of 1.02 and an intercept of $-2.81 \pm 0.015$ (see Figure 1): the predicted value $C = -2.81$ fell within this (quite narrow) interval, confirming the predicted relationship.
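The mechanics of this check (effective sample size, predicted constant $C$, and a least-squares line through $\log(BF_{10})$ against $\log(1/p)$) can be sketched as follows. This is not the Many Labs analysis: the $t$ values are a synthetic grid, the group sizes are made up, and both $p$ and $BF_{10}$ are computed under the Normal approximation, so the numbers it prints illustrate the procedure only.

```python
import math

def effective_n(n1, n2):
    """Effective sample size for a two-sample test: 1 / (1/N1 + 1/N2)."""
    return 1 / (1 / n1 + 1 / n2)

def log_inv_p(t):
    """log(1/p) with p = 2 * Phi(-|t|) (Normal approximation)."""
    return -math.log(math.erfc(abs(t) / math.sqrt(2)))

def log_bf10_normal(t, n_eff, sigma_delta=1.0):
    """Normal-approximation log BF10 for a default test."""
    s = 1 + n_eff * sigma_delta**2
    return 0.5 * t * t * (1 - 1 / s) - 0.5 * math.log(s)

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic inputs (hypothetical): t from 2.0 to 8.0, two groups of 100.
ts = [2.0 + 0.5 * i for i in range(13)]
n_eff = effective_n(100, 100)

xs = [log_inv_p(t) for t in ts]
ys = [log_bf10_normal(t, n_eff) for t in ts]
slope, intercept = fit_line(xs, ys)

# Predicted constant C: mean of -log(|t| * sqrt(pi * (1 + N_eff) / 2)).
predicted_c = sum(-math.log(abs(t) * math.sqrt(math.pi * (1 + n_eff) / 2)) for t in ts) / len(ts)
print(slope, intercept, predicted_c)
```

On this synthetic grid the fitted slope comes out a little below 1, because the $\log|t|$ term is not exactly constant across such a wide range of $t$.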

Figure 1: Scatterplot of $\log(BF_{10})$ vs $\log(1/p)$ for the 396 t-test experiments in the Many

Labs 1 dataset, with the line of best ﬁt. The ﬁt is extremely good, with the line accounting

for more than 99% of the variance in values; the slope is almost exactly the predicted value

(1.02 vs 1), and the intercept of −2.81 matches the predicted C = −2.81 value exactly.

3 Distributional null hypothesis testing

We've given a general characterisation of the Bayesian t-test and shown that this general form of the test, and so all specific instantiations, suffer from a series of problems: all compare an alternative $H_1$ against a null $H_0$ that we already know to be false; all give increasing evidence for $H_1$ irrespective of the presence or absence of any real effect; none give specific evidence about the population mean but instead give evidence about the variance of that mean; and (under a series of approximations) all are essentially equivalent to the classical t-test against a point-form null, providing no further information.

These problems arise from the use of the classical null hypothesis as $H_0$ in the Bayesian t-test approach, and from the fact that the two hypotheses $H_0$ and $H_1$ being compared differ in both their model of variance and (for informed tests) in their assumed mean. Given these problems it seems unlikely that a move to Bayesian rather than classical hypothesis testing against the point-form null hypothesis will in any way address the problems with reliability and replication that we see in scientific research. As an alternative, we suggest that researchers consider Fisherian evidential testing against a single null hypothesis, but with a distributional rather than a point-form null. This is an approach where the statistical model is that observations $X$ follow the Normal distribution

$$X \sim N(\mu, \sigma^2)$$

for unknown $\sigma^2$ and where $\mu$ itself follows the Normal distribution

$$\mu \sim N(m, \sigma_\mu^2)$$

and where the null hypothesis is $m = 0$. We have recently proposed a distributional null hypothesis testing model following this approach which takes $\sigma_\mu^2$ to represent the variance in experimental means across replications of a given experiment. The null is not always false in this model; evidence against the null in this model does not rise with sample size irrespective of the presence or absence of a real effect; and further, when the between-experiment variance of means is obtained from sample data, this model estimates the probability of replication of results in a way which reliably matches observed rates of experimental replication (for a detailed presentation, see Costello and Watts, 2022).

This distributional approach depends on a parameter $b = \sigma_\mu^2/\sigma^2$ representing the ratio of between-experiment variance in means to within-experiment variance in individual responses. While this parameter is mathematically identical to the effect size variance $\sigma_\delta^2$ used in the derivation of the Bayesian t-test given above, it has a different meaning: where $\sigma_\delta^2$ represents prior uncertainty about the effect (and so is subjective in nature), $b$ represents the relative variation in experimental means across different experiments and so is estimated from sample data (just as within-experiment variation $\sigma$ is estimated from sample data). Further, where in a Bayesian t-test it is natural to choose an uninformative prior for $\sigma_\delta^2$, in the distributional approach the choice of value for $b$ represents a trade-off between Type I and Type II error: a high value for $b$ means high assumed between-experiment variance and so low Type I error (but high Type II error), while a low assumed value for $b$ means low between-experiment variance and so high Type I error (but low Type II error).

This distributional null approach can also be applied to the comparison of null and alternative hypotheses; in this approach these two hypotheses are

$$H_0 : \frac{t}{\sqrt{1 + bN}} \sim T_\nu$$

and

$$H_1 : \frac{t}{\sqrt{1 + bN}} \sim T_\nu\left(\frac{\delta}{\sqrt{1/N + b}}\right)$$

and the Bayes Factor ratio for the alternative hypothesis $H_1$ against the null $H_0$ is the ratio of densities of these two distributions

$$BF_{10} = \frac{\dfrac{1}{\sqrt{1 + bN}}\, f_\nu\left(\dfrac{-|t|}{\sqrt{1 + bN}};\ \dfrac{\delta}{\sqrt{1/N + b}}\right)}{\dfrac{1}{\sqrt{1 + bN}}\, f_\nu\left(\dfrac{-|t|}{\sqrt{1 + bN}}\right)}$$

and since both hypotheses $H_0$ and $H_1$ necessarily assume the same variance for $\bar{X}$ but different means, Bayesian evidence in favour of $H_1$ indicates that the observed data is more consistent with the mean in $H_1$ than the mean in $H_0$. We can illustrate this using the same one-sample experiment described earlier with sample size $N = 50$ (and so $\nu = 49$), a null hypothesis $\delta = 0$ (there is no effect), an alternative hypothesis $\delta = 0.5$ (there is a medium-sized positive effect) and assuming, purely for comparison purposes, a unit-information value of $b = 1$. This gives

$$BF_{10} = \frac{f_\nu\left(\dfrac{-|t|}{\sqrt{1 + bN}};\ \dfrac{\delta}{\sqrt{1/N + b}}\right)}{f_\nu\left(\dfrac{-|t|}{\sqrt{1 + bN}}\right)} = \frac{f_{49}\left(\dfrac{-0.5\sqrt{50}}{\sqrt{1 + 50}};\ \dfrac{0.5}{\sqrt{1 + 1/50}}\right)}{f_{49}\left(\dfrac{-0.5\sqrt{50}}{\sqrt{1 + 50}}\right)} = 0.69$$

and the test gives weak evidence in favour of $H_0$, which is just as we would expect given that the observed result $d = -0.5$ is not strongly consistent with either $H_0$ or $H_1$, but is slightly more consistent with $H_0$.
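The distributional Bayes Factor for this example can be checked with the standard library alone. As in the sketch for the classical Bayesian t-test, the non-central T density is computed here by numerical integration over $w = \sqrt{\chi^2_\nu/\nu}$; that integration scheme, the grid size and all function names are assumptions of this sketch rather than the authors' implementation.

```python
import math

def t_pdf(x, nu):
    """Density of the central T distribution with nu degrees of freedom."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - 0.5 * (nu + 1) * math.log1p(x * x / nu))

def nct_pdf(x, nu, theta, steps=4000, upper=4.0):
    """Density of the non-central T, by Simpson integration over w = sqrt(chi2_nu / nu)."""
    log_k = math.log(2.0) + (nu / 2) * math.log(nu / 2) - math.lgamma(nu / 2)

    def integrand(w):
        if w <= 0.0:
            return 0.0
        log_h = log_k + (nu - 1) * math.log(w) - nu * w * w / 2  # log density of w
        z = x * w - theta
        return w * math.exp(-0.5 * z * z + log_h) / math.sqrt(2 * math.pi)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3

def distributional_bf10(t, n, delta, b):
    """Ratio of the H1 and H0 densities at -|t| / sqrt(1 + b N); the shared prefactor cancels."""
    nu = n - 1
    x = -abs(t) / math.sqrt(1 + b * n)
    theta = delta / math.sqrt(1 / n + b)
    return nct_pdf(x, nu, theta) / t_pdf(x, nu)

# The example from the text: N = 50, d = -0.5, delta = 0.5, b = 1.
n, d = 50, -0.5
print(distributional_bf10(d * math.sqrt(n), n, delta=0.5, b=1.0))
```

Because numerator and denominator are now evaluated at the same rescaled point, the ratio reflects only the shift in mean between the two hypotheses.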

In the distributional approach the significance of a given result $t$ relative to $H_0$ is

$$\mathrm{sig}(t|H_0) = 2T_\nu\left(\frac{-|t|}{\sqrt{1 + bN}}\right)$$

while its significance relative to an alternative hypothesis of some effect size $\delta$ is

$$\mathrm{sig}(t|H_1) = 2T_\nu\left(\frac{-|t|}{\sqrt{1 + bN}};\ \frac{\delta}{\sqrt{1/N + b}}\right)$$

and there is a linear relationship between these measures of significance and the Bayes Factor measures of evidence. To see this we approximate both these T distributions with corresponding Normal distributions (a rough approximation since it takes the Normal to approximate the non-central T) giving

$$\mathrm{sig}(t|H_0) \approx 2\Phi\left(\frac{-|d|}{\sqrt{1/N + b}}\right)$$

and

$$\mathrm{sig}(t|H_1) \approx 2\Phi\left(\frac{-|d - \delta|}{\sqrt{1/N + b}}\right)$$

and relating to the Mills ratio gives

$$BF_{10} \approx \frac{\mathrm{sig}(t|H_1)}{\mathrm{sig}(t|H_0)} \cdot \frac{M\left(\dfrac{-|d|}{\sqrt{1/N + b}}\right)}{M\left(\dfrac{-|d - \delta|}{\sqrt{1/N + b}}\right)}.$$

We are primarily concerned here with results $d$ which are close to 0 or to $\delta$ (giving evidence in favour of one hypothesis or the other), and so have the Mills ratio argument $x$ approaching 0 for one or other hypothesis. The approximation in Equation (6) diverges as $x \to 0$, however; and so, since $M(0) = \sqrt{\pi/2}$, we use the modified approximation

$$M(x) \approx \frac{1}{\sqrt{2/\pi} + |x|}$$

which is relatively close to $M(x)$ for $|x| < 3$ and which asymptotically approaches Equation (6) (and so $M(x)$) as $x \to \infty$. Given this we see that the Bayes Factor and the ratio of distributional significance have the approximate relationship

$$BF_{10} \approx \frac{\mathrm{sig}(t|H_1)}{\mathrm{sig}(t|H_0)} \cdot \frac{\sqrt{2b/\pi} + |d - \delta|}{\sqrt{2b/\pi} + |d|}$$

and the Bayes Factor measure of relative evidence for $H_1$ over $H_0$ given by result $t$ is, to a first approximation, simply a linear transformation of the distributional significance ratio for $t$ under $H_0$ and $H_1$.
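The approximate relationship between the distributional Bayes Factor and the significance ratio can be checked numerically under the Normal approximation used in this derivation. In the sketch below, `compare` returns the Normal-approximation Bayes Factor alongside the value predicted from the two significance levels and the modified Mills ratio factor; the function names and the choice of example values (those of the running example) are assumptions made for illustration.

```python
import math

def norm_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def mills_modified(x):
    """Modified Mills ratio approximation 1 / (sqrt(2/pi) + |x|)."""
    return 1 / (math.sqrt(2 / math.pi) + abs(x))

def compare(d, delta, n, b):
    """Return (Normal-approximation BF10, BF10 predicted from the significance ratio)."""
    s = math.sqrt(1 / n + b)
    x0, x1 = d / s, (d - delta) / s
    bf_normal = norm_pdf(x1) / norm_pdf(x0)
    sig0 = 2 * norm_cdf(-abs(x0))
    sig1 = 2 * norm_cdf(-abs(x1))
    c = math.sqrt(2 * b / math.pi)
    predicted = (sig1 / sig0) * (c + abs(d - delta)) / (c + abs(d))
    return bf_normal, predicted

bf, approx = compare(d=-0.5, delta=0.5, n=50, b=1.0)
print(bf, approx)
```

For the running example the two values differ by a few percent, which gives a sense of how rough the "first approximation" is at moderate $N$.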

4 Discussion

Our focus so far has been on the relationship between Bayesian and frequentist approaches

to hypothesis testing in a particularly simple situation: the t test. Here we brieﬂy discuss

the relationship between these two approaches more generally. We take as our starting point

an account of the Bayesian/frequentist distinction as given in a recent primer on Bayesian

statistics:

The key diﬀerence between Bayesian and frequentist inference is that frequen-

tists do not consider probability statements about the unknown parameters to be

useful. Instead, the unknown parameters are considered to be ﬁxed; the likeli-

hood is the conditional probability distribution p(y|θ) of the data (y), given ﬁxed

parameters (θ). In Bayesian inference, unknown parameters are referred to as

random variables in order to make probability statements about them. The (ob-

served) data are treated as ﬁxed, whereas the parameter values are varied; the

likelihood is a function of θ for the ﬁxed data y.

(van de Schoot et al., 2021, p. 7)

We expand on this account by noting that in both frequentist and Bayesian approaches we have some theory of the generative process producing data $y$. This theory gives us two things: first, an overall statistical model $M$ with some set of independent parameters $\theta$ such that $y$ is assumed to follow the distribution $y \sim M(\theta)$; and second, a list $H$ containing, for each parameter $\theta_i$, a particular selected value $H_i$ for that parameter (with some associated uncertainty or variance in that value). The variances associated with parameter values in $H$ allow us to distinguish between fixed and free parameters in our theory. A parameter value is fixed by theory, in this view, when our theory requires a specific value for $H_i$ so that any change to that value would necessarily require us to abandon the theory: the variance of a fixed parameter is thus necessarily 0 in this theory. If a parameter is not fixed it is free, and its value must be estimated from data in some way, so that any such estimated value, and any change in that value, remains consistent with our theory (and such that the current best estimate, and its variance, is given by $H_i$).

Both frequentist and Bayesian approaches typically assume the overall model $M$ is fixed, and consider either testing or updating values of the parameter values $H$ ("the" hypothesis). Frequentist inference considers $p(y|H)$: the probability distribution for data $y$ conditional on $\theta = H$ (on the assumption that the parameters are as described in $H$). Bayesian inference considers $p(H_{\mathrm{new}}|y, H)$: the updated parameter descriptions $H_{\mathrm{new}}$, conditional on data $y$ and on the prior values $H$. This common structure means that both forms of statistical inference fall within a single unified framework defined by $M$ and $H$: any Bayesian prior $H$ can be tested via the frequentist inference $p(y|H)$ and any frequentist hypothesis $H$ about $\theta$ can be updated via the Bayesian inference $p(H_{\mathrm{new}}|y, H)$. Indeed both forms of inference can be applied to the same data $y$, by asking whether $y$ is consistent with $H$ and, if not, updating to produce a more consistent description $H_{\mathrm{new}}$ and then asking whether $y$ is consistent with this new description (these are the prior and posterior predictive checks commonly recommended in standard Bayesian workflows, even though these checks involve a frequentist hypothesis test; see e.g. Gelman et al., 2020; Schad et al., 2021). Note that for parameter values $H_i$ with variance of 0 (fixed by theory) this updating process will never cause any change in $H_i$. This means that if $p(y|H_{\mathrm{new}})$ is less than some significance criterion $\alpha$, we can conclude that data $y$ is inconsistent with our overall theory in some way: that updating to produce a set of parameter values consistent with the data would either require us to change some values that are fixed in that theory, to abandon our prior estimates $H$ for some or all of those values (which by assumption were consistent with that theory) or to abandon our statistical model $M$.
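A toy version of this test-then-update cycle is sketched below for a single Normal mean. It is a deliberately simplified stand-in for the framework described above: the within-sample standard deviation is assumed known, the description $H$ is reduced to a (mean, variance) pair, the update is the textbook conjugate Normal update, and all names and numbers are hypothetical.

```python
import math

def predictive_p_value(ybar, n, sigma, h_mean, h_var):
    """Two-sided p-value for a sample mean under the description H = (h_mean, h_var).

    Under H the sample mean is Normal with mean h_mean and variance h_var + sigma^2 / n.
    """
    sd = math.sqrt(h_var + sigma**2 / n)
    z = abs(ybar - h_mean) / sd
    return math.erfc(z / math.sqrt(2))

def update(ybar, n, sigma, h_mean, h_var):
    """Conjugate Normal update of H; a parameter fixed by theory (variance 0) never moves."""
    if h_var == 0:
        return h_mean, 0.0
    data_var = sigma**2 / n
    new_var = 1 / (1 / h_var + 1 / data_var)
    new_mean = new_var * (h_mean / h_var + ybar / data_var)
    return new_mean, new_var

# Hypothetical numbers: observed mean 0.8 from n = 25 observations with sigma = 1.
ybar, n, sigma = 0.8, 25, 1.0
h = (0.0, 0.25)                       # free parameter: prior mean 0, prior variance 0.25
p_prior = predictive_p_value(ybar, n, sigma, *h)
h_new = update(ybar, n, sigma, *h)
p_post = predictive_p_value(ybar, n, sigma, *h_new)
print(p_prior, h_new, p_post)

h_fixed = update(ybar, n, sigma, 0.0, 0.0)   # fixed by theory: unchanged by the data
print(h_fixed)
```

The first check asks whether the data are consistent with $H$, the update produces $H_{\mathrm{new}}$, and the second check asks the same question of $H_{\mathrm{new}}$; with the variance set to 0 the update leaves the value untouched, so any remaining inconsistency has to be attributed to the theory itself.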

In this unified framework, any form of hypothesis testing (that is, any situation where we ask whether data $y$ is consistent with some $H$) will necessarily involve the frequentist inference $p(y|H)$ in some way. It should not be surprising, therefore, that the default Bayesian t-test and the classical t-test are essentially equivalent; the equivalence arises because both depend on inferences of the form $p(y|H)$.

SUPPLEMENTARY MATERIAL

The R script used in this paper is available at https://osf.io/qajvu; it automatically downloads the Many Labs 1 dataset, carries out the analysis, and generates Figure 1.

References

Camerer, C. F., A. Dreber, F. Holzmeister, T.-H. Ho, J. Huber, M. Johannesson, M. Kirch-

ler, G. Nave, B. A. Nosek, T. Pfeiﬀer, et al. (2018). Evaluating the Replicability of Social

Science Experiments in Nature and Science Between 2010 and 2015. Nature Human Be-

haviour 2 (9), 637–644.

Costello, F. and P. Watts (2022). How to Tell When a Result Will Replicate: Signiﬁcance

and Replication in Distributional Null Hypothesis Tests. Submitted.

Fox, R. J. and M. W. Dimmic (2006). A Two-Sample Bayesian T-Test for Microarray

Data. BMC Bioinformatics 7 (1), 1–11.

Gelman, A., A. Vehtari, D. Simpson, C. C. Margossian, B. Carpenter, Y. Yao, L. Kennedy, J. Gabry, P.-C. Bürkner, and M. Modrák (2020). Bayesian Workflow. arXiv Preprint arXiv:2011.01808.

Gönen, M., W. O. Johnson, Y. Lu, and P. H. Westfall (2005). The Bayesian Two-Sample T Test. The American Statistician 59 (3), 252–257.

Gronau, Q. F., A. Ly, and E.-J. Wagenmakers (2019). Informed Bayesian T-Tests. The

American Statistician.

Jeﬀreys, H. (1948). The Theory of Probability. OUP Oxford.

Klein, R. A., K. A. Ratliff, M. Vianello, R. B. Adams Jr, Š. Bahník, M. J. Bernstein, K. Bocian, M. J. Brandt, B. Brooks, C. C. Brumbaugh, et al. (2014). Investigating Variation in Replicability. Social Psychology 45 (3), 142–152.

Klein, R. A., M. Vianello, F. Hasselman, B. G. Adams, R. B. Adams Jr, S. Alper, M. Aveyard, J. R. Axt, M. T. Babalola, Š. Bahník, et al. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science 1 (4), 443–490.

Ly, A., J. Verhagen, and E.-J. Wagenmakers (2016). Harold Jeffreys's Default Bayes Factor Hypothesis Tests: Explanation, Extension, and Application in Psychology. Journal of Mathematical Psychology 72, 19–32.

Morey, R. D. and J. N. Rouder (2021). BayesFactor: Computation of Bayes Factors for

Common Designs. R package version 0.9.12-4.3.

Open Science Collaboration et al. (2015). Estimating the Reproducibility of Psychological

Science. Science 349 (6251), aac4716.

R Core Team (2021). R: A Language and Environment for Statistical Computing. Vienna,

Austria: R Foundation for Statistical Computing.

Rouder, J. N., P. L. Speckman, D. Sun, R. D. Morey, and G. Iverson (2009). Bayesian

T Tests for Accepting and Rejecting the Null Hypothesis. Psychonomic Bulletin &

Review 16 (2), 225–237.

Schad, D. J., M. Betancourt, and S. Vasishth (2021). Toward a Principled Bayesian Work-

ﬂow in Cognitive Science. Psychological Methods 26 (1), 103.

Schmalz, X., J. Biurrun Manresa, and L. Zhang (2021). What Is a Bayes Factor? Psycho-

logical Methods.

Small, C. G. (2010). Expansions and Asymptotics for Statistics. Chapman and Hall/CRC.

Thompson, B. (1998). In Praise of Brilliance: Where That Praise Really Belongs. American

Psychologist 53 (7), 799–800.

van de Schoot, R., S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse, M. Vannucci, A. Gelman, D. Veen, J. Willemsen, et al. (2021). Bayesian Statistics and Modelling. Nature Reviews Methods Primers 1 (1), 1–26.

Wang, M. and G. Liu (2016). A Simple Two-Sample Bayesian T-Test for Hypothesis

Testing. The American Statistician 70 (2), 195–201.