Today is the last chance to get the course at 10% off:
Use the code LAUNCHWEEK: courses.decisionacademy.io/courses/intr...
Expires today
6/6
@alxndrmlk
"The Causal Guy" http://causalpython.io Author || Advisor || Educator Host at http://CausalBanditsPodcast.com Causal ML Tutor @ Uni of Oxford CausalSky: https://bsky.app/profile/did:plc:imz3rf35poonl7yxt7bogui4/feed/aaamrclcu3tfa
In the "Intro to Biostatistics" course we released last week with Justin Bélair, we share our perspective that data visualization already involves modeling decisions that can impact how we and our audience perceive the problems we address and the solutions we craft.
5/
If we're primarily interested in the central part of the distribution, and only care about approximate values, perhaps not.
But the "tail behavior" of this distribution will be dramatically different from a normal.
4/
It's neat and conveys a clear message about the distribution: it's rather symmetrical and fits nicely with the overlaid theoretical normal PDF.
But when we increase the number of bins (the bottom figure), we see clearly that the distribution is not symmetric.
Does it matter?
It depends.
3/
What you see in the figure below are two histograms representing one dataset.
The only difference?
The number of bins.
I designed the top histogram to resemble some of the histograms I encountered while reading scientific publications.
2/
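A minimal sketch of the binning effect described in this thread, assuming a hypothetical right-skewed gamma sample (not the dataset behind the original figure). It bins the same data coarsely and finely with `numpy.histogram` and reports the sample skewness, which does not depend on the bin count at all. The sample size, gamma parameters, and bin counts are illustrative assumptions.

```python
import numpy as np

# Hypothetical right-skewed sample; NOT the data from the original figure.
rng = np.random.default_rng(0)
data = rng.gamma(shape=20.0, scale=1.0, size=5_000)


def sample_skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


# Same data, two binning choices (bin counts are arbitrary assumptions).
coarse_counts, coarse_edges = np.histogram(data, bins=6)
fine_counts, fine_edges = np.histogram(data, bins=60)

print("coarse bins:", coarse_counts)
print("fine bins  :", fine_counts)
print("sample skewness:", round(sample_skewness(data), 3))
```

Comparing the two count arrays against a bin-free summary such as the skewness is one way to check whether a tidy-looking coarse histogram is hiding asymmetry.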
"Just Plot the Data"
Which of the two datasets is normally distributed?
Plotting the data might seem like the most assumption-free way to examine its properties.
But is that really the case?
1/
#StatSky #Biostatistics #EconSky #EpiSky
- New Interrupted Time Series module in CausalPy + a tutorial
--------
We'll start sending today's issue at 9am PT / 12pm ET / 6pm CET (Sunday)
Join us at: causalpython.io (it's free!)
2/2
What a Week!
We literally had too many topics to fit into one newsletter this week.
Here's what we picked:
- Alberto D. Horner reviews the brand new book by Quentin Gallea, PhD
- David Rohde on why policies are stochastic in reinforcement learning
1/
#CausalSky #StatSky #EconSky #EpiSky #MLSky
The correct link is here: courses.decisionacademy.io/courses/intr...
:)
6/5
Apparently, Popper liked the idea, although he did not fully understand it, because -- paraphrasing his own words -- he was "not very good at statistics"
We're celebrating the Launch Week with 10% off for our Bsky friends!
Join us here: courses.decisionacademy.io/courses/intr...
5/5
That's why we teach about it in our new "Intro to Biostatistics" course with Justin Bélair
PS: I heard many opinions that Popper either did not know about or knew about but did not like the idea of hypothesis testing. I learned from Deborah Mayo that this is not true.
4/
Our measurements are often imperfect (or "noisy")
So are our samples - they typically don't describe the population they come from perfectly.
Without understanding the idea of falsificationism, it's very difficult to make sense of many of the modern statistical frameworks.
3/
...one black swan disproves this claim.
In science, we're often in a more challenging situation than this.
Why?
Because we're not operating in the space of pure logical statements, but rather probabilistic ones.
We need probability to quantify the uncertainty:
2/
Modern Science Relies on the Idea That Disproving Is Easier Than Proving.
The idea was proposed by Karl Popper nearly a century ago.
We call it "falsificationism"
A million white swans don't prove all swans are white.
Neither does a trillion.
But...
1/
#StatSky #EpiSky #Biostatistics #EconSky
To celebrate Launch Week, we have a 10% discount for anyone reading this post (and for your friends and family as well -- share it with them, it's on us) with the code LAUNCHWEEK
Join us here: courses.decisionacademy.io/courses/intr...
It expires on Monday, March 9.
5/5
So we built it.
This new version is a self-paced version of our best-selling live cohort course.
And all coding is done in R, because...
No, just kidding: all coding is done in R *and* Python, because you should pick which language you want to use, not us.
4/
- Second, we understand the importance of the causal perspective. We believe it should be discussed explicitly from Day 1 in any course on statistics
We haven't found a course that does both of these things.
3/
- First, we believe that statistics shouldn't be taught as a set of unrelated procedures and simplified decision rules, but rather as a way of thinking
2/
Last Friday, We Launched "Intro to Biostatistics" with Justin Bélair
Two ideas inspired us to build it:
1/
#StatSky #Biostatistics #EpiSky
So if p_control = 0.10 and p_treatment = 0.12, the SDs are 0.300 vs 0.325, so the Neyman-optimal split would be ~52:48.
Not that different from 50:50.
Have you ever considered that a 50:50 split might not be optimal for your setting?
7/7
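The arithmetic in this post can be reproduced in a few lines. This is a minimal sketch, not the code from the linked notebook; the function names `bernoulli_sd` and `neyman_treatment_share` are hypothetical helpers introduced for illustration.

```python
import math


def bernoulli_sd(p):
    """Standard deviation of a Bernoulli outcome with success rate p."""
    return math.sqrt(p * (1.0 - p))


def neyman_treatment_share(sd_treatment, sd_control):
    """Share of units assigned to treatment: SD_t / (SD_t + SD_c)."""
    return sd_treatment / (sd_treatment + sd_control)


# Conversion rates taken from the post.
sd_control = bernoulli_sd(0.10)
sd_treatment = bernoulli_sd(0.12)
share = neyman_treatment_share(sd_treatment, sd_control)

print(round(sd_control, 3), round(sd_treatment, 3), round(share, 2))
```

Because p(1-p) changes slowly around these rates, the two SDs are close and the resulting share stays near one half.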
Notebook: github.com/AlxndrMlk/ca...
Heuristically, if we're only interested in binary conversions and the expected uplift is small, sticking to a 50:50 split can be a reasonable default.
Why?
Variance of a Bernoulli = p(1-p), which is a smooth function that changes slowly across realistic conversion rates.
6/
If we observe variance mismatch, we can use the Neyman allocation rule to find the approximate optimal split:
SD_t / (SD_t + SD_c)
As you can see in the plot below, it does a pretty good job of approximating the optimum.
5/
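One way to sanity-check the Neyman rule is to compare it against a brute-force search. The sketch below assumes the goal is to minimize the variance of the difference in means, SD_t^2 / n_t + SD_c^2 / n_c, for a fixed total sample size. The SDs and sample size are hypothetical values chosen for illustration, not estimates from the post or its notebook.

```python
def diff_in_means_variance(share_treatment, sd_treatment, sd_control, n_total):
    """Variance of (mean_t - mean_c) when share_treatment of units get treatment."""
    n_t = share_treatment * n_total
    n_c = (1.0 - share_treatment) * n_total
    return sd_treatment**2 / n_t + sd_control**2 / n_c


# Hypothetical pilot estimates: treatment outcomes are noisier than control.
sd_t, sd_c, n_total = 30.0, 20.0, 10_000

# Brute-force search over treatment shares 0.01, 0.02, ..., 0.99.
shares = [i / 100 for i in range(1, 100)]
best_share = min(
    shares, key=lambda w: diff_in_means_variance(w, sd_t, sd_c, n_total)
)

# Neyman allocation rule from the post.
neyman_share = sd_t / (sd_t + sd_c)

print(best_share, neyman_share)
```

In this setup the grid search and the closed-form rule agree, and both give a lower variance than an even split.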
The intuition behind this is simple: the higher the variance, the more observations we need to confidently describe the distribution.
In practice, before we decide on the exact split, it might be good to run a pilot study to get variance estimates for both treatment and control conditions.
4/
This pattern results in variance inflation in the treatment group compared to control
The 50:50 split becomes suboptimal
What does this mean in practice?
Statistical power could be increased by allocating more than 50% of units to the higher variance group (in our example - the treatment group)
3/
That the outcome variances are equal between the treatment and control groups
Imagine you're testing a new AI assistant in your online store
You're measuring revenue per visitor
Some people love it and their order value goes up significantly
Others don't like it and spend much less than previously
2/
The Optimal Split for an A/B Test Is 50:50
Unless...
I recently saw a post explaining why the optimal split between treatment and control groups should be 50:50.
The optimal split is indeed 50:50, but only under one assumption:
1/
#CausalSky #StatSky #ABTest #EpiSky #MLSky #EconSky
We'll start sending today's issue at 9am PT / 12pm ET / 6pm CET (Sunday)
Subscribe here (it's free): causalpython.io
3/3
...a new paper by Matteo Ceriscioli and Karthika Mohan
- 5 upcoming causal events you don't want to miss (online & in-person)
- Justin Bélair's causally-aware "Intro to Biostatistics" is now live on Decision Academy
2/
You Train Your Robot in August, It Trashes Your Garden in September.
In today's issue of Causal Python Weekly:
- Causal POMDPs (Partially Observed Markov Decision Processes): Planning when the world changes - a review of...
1/
#CausalSky #MLSky #AISky #EconSky #StatSky
Day 3: From Struggle to Structure
- Debrief: common pitfalls and blind spots
- A step-by-step reasoning framework
- Where statistical tests actually enter the process
Register here: decisionacademy.io#the-challenge
We start today at 9am PT / 12pm ET / 6pm CET
5/5