Graham's intuition implicitly assumes the two distributions are equal.
As I noted in a different comment here, you can pretty easily fix Graham's test: compute min(accepted A) and min(accepted B) instead of the means. In your example, the minima of the accepted distributions would both work out to be 80%.
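To make that concrete, here's a minimal sketch of the min-based variant in Python. The scores are invented for illustration; only the 80% floor comes from the example above:

    # Hypothetical quality scores for accepted applicants in two groups.
    # Both groups face the same 0.80 bar, but group A happens to contain
    # more strong applicants, so its mean is higher.
    accepted_a = [0.80, 0.85, 0.91, 0.97]
    accepted_b = [0.80, 0.82, 0.88]

    # Graham's original test compares means, which differ here even
    # though the acceptance cutoff is identical (no bias):
    print(sum(accepted_a) / len(accepted_a))  # 0.8825
    print(sum(accepted_b) / len(accepted_b))  # ~0.8333

    # The proposed fix compares minima instead; under an unbiased
    # cutoff both sit at the same threshold:
    print(min(accepted_a), min(accepted_b))   # 0.8 0.8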
This assumes that the accepted samples from A and B are the same size. A larger sample will tend to have a lower minimum under many real-world distributions; a sample of one will have its minimum equal to its maximum.
Another reason the use of mins here is not helpful: adding one equally awful accepted candidate to both group A and group B would erase whatever bias the test had detected, which is not what we want the test to indicate.
The idea by PG is a rough rule of thumb and breaks down trivially: suppose VC fund X were to accept all candidates, but group A was worse than group B. The test would then falsely imply that the fund was biased.
It's unfortunate the idea was dressed up in statistical persiflage, because it isn't rigorous -- it's a rough guideline. Making it rigorous would be very hard: either the abilities of the candidate populations would have to be measured very closely (unrealistic), or a more scientific experiment conducted (an A/B test where candidates from each group are included or excluded opposite to the prior decision, which would need big groups).
Compute min(accepted A) and min(accepted B) instead of the means.
Dude, your comments are normally smarter than this. Yeah, you can easily fix Graham's test -- all you need are some numbers that do not exist and that we cannot measure.
We're talking about VCs evaluating founders. That does not, and cannot, get reduced to a numerical score. And even if VCs did use some sort of scoring rubric, we would still not know whether there was unfairness in the way they assigned the scores, or unfairness in the selection process. It would just be punting the problem down a layer. PG's central claim -- that a third party can detect the bias/unfairness in the funding process just using math -- is false.
You can only know if the process is biased/unfair if you have deep qualitative understanding of the process.
A charitable interpretation of what he or she said is this: don't evaluate bias by looking at outcomes of the average applicant, look at the outcomes of the borderline applicants. Even if there is no perfect way to define or measure the minimum acceptable applicant, I think it is reasonable to identify whether applicants were borderline or not.
Isn't that, by the way, what YC has been saying for years in their rejection letters? "We're always surprised by how many of the last companies to make it wind up being the most successful"? Something like that.
A charitable interpretation of what he or she said is this: don't evaluate bias by looking at outcomes of the average applicant, look at the outcomes of the borderline applicants.
That is fine; that is what he was saying. The point is that his solution is completely impractical for the original goal of finding an objective, statistically valid way of measuring whether bias exists. "Borderline" cannot be measured objectively, only by subjective rubric scoring. And when you measure only the borderline candidates, you have reduced an already way-too-small sample even further.
PG and I are assuming a measurable outcome, which the selection process is explicitly supposed to predict.
I made no claims about practicality -- right now all I have is a little bit of measure theory showing that pg's algo is, in principle, fixable. I fully agree that the first-round capital data he cites is inadequate (and also wrong, due to the unjustified exclusion of Uber, which they explicitly note would alter the results).
My concrete claim: PG's idea for a statistical test is solid, I can (and shortly will) prove a toy version works, and given enough work one can probably cook up a practical version for some problems.
"Your idea isn't 100% perfect right out of the gate" is a very unfair criticism. Are we supposed to nurture every idea in complete secrecy until it is perfect?
OK I missed that you meant "easily fixed" in the strictly mathematical sense, not in the practical, real-world application sense.
With statistics on human affairs, 99% of the hard part is not the math; it is applying that math to complicated, heterogeneous, and difficult-to-measure underlying phenomena. And in most cases, statistics alone will never give you a straight answer; the best they can do is supplement and confirm qualitative observations. Failing to recognize this is how you get all those unending media reports about how X is bad for your health. PG's post was at the level of one of those junk health-news articles.
And because human affairs are hard, we should criticize anyone who dares to voice an idea they haven't fully figured out yet.
This idea that statistics can only confirm and supplement "qualitative observations" (i.e., my priors) is completely unscientific and anti-intellectual. If that's true, forget stats -- let's just write down the one permitted belief on a piece of paper and not waste resources on science. Science is really boring when only one answer is possible.
This idea that statistics can only confirm and supplement "qualitative observations" (i.e., my priors) is completely unscientific and anti-intellectual.
Since when is investing in startups a science? What is anti-intellectual, what is anti-science, is to use the wrong tool for the job. Human affairs are not a science in the way that physics is a science. Statistics are far, far more fraught because there are so many variables in play, phenomena are hard to quantify, each case is so heterogeneous, etc. You cannot use statistics in human affairs without also having a very good observational understanding of what is actually going on; otherwise you will end up in all sorts of trouble.
So the PG estimator is clearly problematic. I agree that the yummfajitas (YM) estimator looks to be consistent. In this case though, we're dealing with (small) finite sample sizes, so we need to come up with some sort of test statistic. What would the YM test be here? It seems tricky since you are dealing with a conditional distribution based on left-censored data. I'm also not aware of any difference-of-minimums test, though I am happy to be educated if there is one!
I don't know of something to refer to, but I don't think the statistics are too hard. The test statistic would be exactly min(sample1) and min(sample2).
Suppose the cutoff sample is distributed according to f(x)H(x-C), where H is the Heaviside step function. Then the probability of the minimum of a sample of size N exceeding C+e by random chance, under the null hypothesis, is p = (1 - \int_C^{C+e} f(x) dx)^N.
So now you have a frequentist hypothesis test. If you make reasonable assumptions on f(x) (non-vanishing near C, quantified somehow), it's even nice and non-parametric.
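Here's a quick numerical sketch of that test in Python. The uniform choice of f is an assumption for illustration only; the formula itself doesn't depend on it:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_min_above(cutoff_mass, n):
        # P(min of n i.i.d. draws exceeds C+e under the null),
        # where cutoff_mass = \int_C^{C+e} f(x) dx.
        return (1.0 - cutoff_mass) ** n

    # Toy check with f uniform on [C, C+1], so \int_C^{C+e} f dx = e.
    C, e, n = 0.0, 0.1, 20
    analytic = p_min_above(e, n)            # 0.9^20, about 0.12

    # Monte Carlo confirmation: how often does the min of n
    # uniform draws land above C+e?
    draws = rng.uniform(C, C + 1.0, size=(100_000, n))
    empirical = (draws.min(axis=1) > C + e).mean()
    print(analytic, empirical)              # both roughly 0.12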
Does that assume both samples are identically distributed and the only difference is the cutoff? If it does, then couldn't we just continue to do a difference of means test and still be consistent? If it doesn't, how do you handle identifying the cutoff minima and the two different distributions in a frequentist way?
The only assumption I need is that P_{f,g}([C,C+d]) >= h(d) > 0 for some monotonic function h(d). This is exactly what the p-value formula requires.
I.e., for any d, there is a finite probability of finding an A or a B in [C,C+d]. I don't actually care what the shapes of f or g are at all beyond this - as long as this probability exists and is bounded below (in whatever class of functions f and g might be drawn from), it's all fine.
Sorry, I'm confused here. A p-value makes an implicit assumption that your null hypothesis is a known N(0,1); that may be throwing me off a bit. I get that you want to look at the likelihood function, which is just one minus the CDF over the given interval. I'm just not clear on how you can get around f and g being arbitrarily parameterized functions of a given class. Are you assuming we know the class and something about f?
A null hypothesis is just a specific thing you are trying to disprove. In this case, it's simply that the min of both distributions is identical.
I am assuming we know exactly one thing about the class the measures f and g come from: for every function in that class, \int_C^{C+d} f(x) dx >= h(d) for some monotonic function h(d).
The p-value is then bounded in terms of h(d): since \int_C^{C+d} f(x) dx >= h(d), the formula above gives p <= (1 - h(d))^N.
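As a sanity check on that bound, here's a tiny Python sketch; the lower-bound function h(d) = d/2 is made up purely for illustration:

    # If every admissible f puts at least h(d) mass in [C, C+d], then
    # under the null p = (1 - \int_C^{C+d} f(x) dx)^N <= (1 - h(d))^N.
    def p_bound(h_of_d, n):
        return (1.0 - h_of_d) ** n

    # Hypothetical h(d) = d/2, an observed min-gap of d = 0.2, 30 samples:
    d, n = 0.2, 30
    print(p_bound(d / 2, n))  # about 0.042, below the usual 5% level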
You can only rarely calculate min(accepted A) in the actual world. In this example, the college never sees the distributions; it only knows whether each student passed or failed.