Why is the Biden Election Day vote data nearly Gaussian ? #31
Comments
I don't understand why that's potentially suspicious. Which distribution did you expect to see? |
@testes-t Benford |
Nope, not Benford. Benford is to do with the leading digits in the vote counts, whereas these are the vote counts themselves. |
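To make the distinction concrete, here is a minimal Python sketch of what a first-digit Benford check looks at. The vote counts are hypothetical, made up for illustration, and the helper names are not from this repo. Only the leading digit of each count enters the comparison, not the shape of the count distribution itself.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Benford's Law probability that the leading digit equals `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit_frequencies(counts):
    """Observed share of each leading digit among the positive counts."""
    digits = [int(str(c)[0]) for c in counts if c > 0]
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical precinct vote counts (illustration only, not real data).
example_counts = [112, 98, 135, 87, 203, 121, 144, 76, 158, 109]
observed = leading_digit_frequencies(example_counts)

# Print digit, observed share, and Benford's expected share side by side.
for d in range(1, 10):
    print(d, round(observed[d], 2), round(benford_expected(d), 3))
```

Counts that cluster between roughly 100 and 200, as in this toy list, pile up on the leading digit 1 regardless of whether the counts themselves look Gaussian or heavy-tailed.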
@testes-t @frycast If the voting data is, by its natural inclination, normally distributed, why should only the Biden data be this way, and only for the Election Day data? Notice that the Biden data is more heavy-tailed (i.e. has a long tail to the right), which is barely visible in the Election Day data. However, the Biden data is almost bi-modal, which is another indicator of potential fraud.

Most natural-world data like this is heavy-tailed. Why? Think about issue #17. Here, it is argued that Biden is a highly preferred candidate. That's fine. But why would every single voter have the exact same individual preference? In pure math terms, why would the prior on the preference (p) be highly peaked? Sure, people can behave in similar ways, but, in the aggregate, I argue voters have both their own individual preferences and cluster into diverse peer groups with similar trends. Mathematically, one might say that this broadens the prior on the individual preferences, giving the fatter-tailed distributions that then behave Benford-like, even in these small, bounded data sets.

If you ask an election forensics researcher (like Mebane) why the Biden voting data isn't always Benford, and why there are so many deviations from Benford, he would argue that this is evidence of some kind of effective election strategy by Biden. That is, this is a kind of herding behavior, where everyone in the sample is behaving (nearly) exactly the same way. He believes this kind of induced herding behavior is widespread, and he has some simulations he put together to argue for this. He has a talk on this (https://www.youtube.com/watch?v=zkx_eO0PvXU).

So if this is the case in the Milwaukee data that @frycast highlighted, that could be. But to really convince ourselves of this, we should find a way to test this hypothesis on other metrics in the data, such as the total voting distributions (and the voter turnout data, etc).
Here, it seems odd to me that, even given this underlying induced herding behavior, we would still see a nearly perfect Gaussian distribution, and only for the Election Day data. This is especially odd to me since we know that, at least in PA, the Trump turnout was near to over 100% (with many unaffiliated voters), whereas the Biden turnout was down ~75%. So why would the Biden election data show signatures of an incredibly successful election strategy whereas the Trump data is more 'natural'? |
I have no idea, but it's possible that Republicans are less collectivist. |
@testes-t. Something like that could be the case. That is, say election workers for the Biden campaign went all around the city (or called during Covid), ensuring people would vote that day, and enough so that they met some kind of internal quota on "how many people called, how many people said yes, etc". That is, they went out and physically brought them in on Election Day. This kind of collective action might generate such perfect data. But it could also be evidence of wide-scale collusion, where every polling place ensured they hit a minimum number of about 100 Biden voters no matter what, with a little random variation thrown in to hide the fraud. |
@charlesmartin14 I don't use Python for these tests, but I imagine it's just as easy. |
@andrewzigerelli The issue I have is not that the data would satisfy a p-value or other test like this, but that, structurally, the data is far more Gaussian than heavy-tailed. That is, if we compare the Biden Election Day vote data to different distributions (i.e. using a non-parametric K-S test), we would find that the Gaussian is the best fit.

Now it could be that I am imagining things, and that the tiny little tail we see (from 200 to 400) is just small because the data set sample is so small.

This plot shows ~half the data (votes > 100), on a log10 scale. You can see that little 'tail' at the end, above 200, which is about 3.5% of the data set. (In our work we call that a 'heavy-tailed finger'.)

The Trump data is structurally completely different. Here is a comparative plot on the same scale (density, not total votes).

I have looked at hundreds of these kinds of plots in my own research on heavy-tailed phenomena, and this is just a case that really caught my attention. I'm happy to do the test you asked for, and some deeper analysis, and I'll try to get to it after work tonight. |
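The kind of comparison described in this comment could be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the analysis that was actually run: the `votes` sample is synthetic (drawn from a Gaussian on purpose), the lognormal stands in for "a heavy-tailed candidate", and the helper names are hypothetical. Because the parameters are fitted from the same sample, the K-S statistic is only useful here for ranking candidate fits, not for reading off a standard p-value.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, cdf):
    """One-sample K-S statistic: largest gap between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check the gap just after and just before the step at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def fitted_normal_cdf(sample):
    """CDF of a Gaussian with the sample's mean and standard deviation."""
    return NormalDist(mean(sample), stdev(sample)).cdf

def fitted_lognormal_cdf(sample):
    """CDF of a lognormal fitted by matching mean and stdev of log(x)."""
    logs = [math.log(x) for x in sample]
    dist = NormalDist(mean(logs), stdev(logs))
    return lambda x: dist.cdf(math.log(x))

# Synthetic stand-in for per-precinct vote counts (NOT real election data).
rng = random.Random(0)
votes = [max(1, round(rng.gauss(120, 25))) for _ in range(500)]

d_normal = ks_statistic(votes, fitted_normal_cdf(votes))
d_lognormal = ks_statistic(votes, fitted_lognormal_cdf(votes))
print("K-S distance to fitted Gaussian: ", round(d_normal, 4))
print("K-S distance to fitted lognormal:", round(d_lognormal, 4))
```

The candidate with the smaller distance is the better fit in the K-S sense; replacing `votes` with the real per-precinct counts would be the actual test.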
These two distributions look "structurally completely different" in the same way that a Poisson distribution with lambda=1 looks completely different from one which has lambda=4. The left-hand tail emerges as lambda gets larger -- or in our case as vote share grows -- and it begins to be well approximated by a gaussian. The most recent plot here masks the larger mean value, which is probably an important feature of the data. Note that I am not claiming these are or "should" be poisson, or approach gaussian. I just don't see this as something that is "suspicious", or to be characterized as an anomaly. I especially do not like the use of the word anomaly in the issues here, because (to me) that word implies that one understands the expected behavior of the distributions, and then sees a deviation from that expected behavior. The repo started from an implied assumption that we should expect Benford's Law, which I think got everything off on the wrong track. (Of course, it also didn't help that probably almost everyone here came in with a "Bayesian prior" of whether they believe there was fraud.) Since I am back working now, I'm not sure I've kept up with every reply to every issue. But it did look like there were some promising efforts to figure out what the expected behavior is. I'll also drop this one thought here, since I might not be back for a while. Another cloud hanging over everything is the use of just 3 counties (or more when I wasn't looking?), out of the more than 3,000 counties in the U.S. If the original repo author seriously cherry-picked these datasets, then there is an awful lot of effort spent fitting models to datasets that could themselves be weird due to sampling variation alone. |
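The Poisson comparison in this comment can be checked numerically. A Poisson distribution with rate lambda has skewness 1/sqrt(lambda), so the asymmetry shrinks as the mean grows and the shape approaches a Gaussian. The sketch below only illustrates that point; like the comment, it does not claim the vote counts are Poisson.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events for a Poisson with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_skewness(lam):
    """Skewness of a Poisson distribution: 1 / sqrt(lam)."""
    return 1 / math.sqrt(lam)

# Compare the two shapes mentioned in the comment, value by value.
for k in range(9):
    print(k, round(poisson_pmf(k, 1), 3), round(poisson_pmf(k, 4), 3))

# The skewness falls toward 0 (the Gaussian value) as the rate grows.
for lam in (1, 4, 100):
    print("lambda =", lam, "skewness =", round(poisson_skewness(lam), 3))
```

With lambda=1 the mass sits against zero with no left-hand tail; with lambda=4 a left-hand tail is already visible.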
@MechanicalTim These are great questions. And, like you, I'll come back after work to re-visit |
@MechanicalTim I have been analyzing counties outside of the "suspect" ones (Miami-Dade FL and Cuyahoga OH for example), and at least when it comes to Benford 2nd digit, both major parties conform quite well. It is difficult to collect data from many counties at the precinct level, as they simply don't supply such information much of the time. If anyone is able to source this data, I would love to run it through my tool. |
@snex. Can we get this additional data checked into the repo here as we collect it ? Thanks |
It is currently in my own repo over at https://github.com/snex/election_results_benford. Feel free to pull it in here. I am using XML files where available so you may have to convert those. |
We know a few things about the data generating process. We know the data are discrete. We also know that the number of votes can't drop below 0. There are (mostly) two sets of counts happening in each ward: the number of Trump supporters arriving and the number of Biden supporters arriving. That suggests we approximately have two Poisson processes in each ward, with different rates between wards. So the whole county could be modelled by a set of independent but non-identically distributed tuples of Poisson random variables. There are other possibilities, of course. This may not be the best choice. The above suggests it's unsurprising that we are seeing Poisson-like behaviour overall in the final distribution of counts. |
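A quick simulation of this model, as a sketch with made-up numbers: the ward rates below are hypothetical (uniform between 40 and 300), and the sampler is Knuth's multiplication method because the standard library has no Poisson generator. One thing it shows is that independent Poisson counts with different rates per ward are much more spread out than a single Poisson. By the law of total variance, the variance is E[rate] + Var(rate) rather than just the mean.

```python
import math
import random
from statistics import mean, pvariance

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def dispersion(counts):
    """Variance-to-mean ratio; close to 1 for a single Poisson distribution."""
    return pvariance(counts) / mean(counts)

rng = random.Random(42)
n_wards = 2000

# Every ward shares one rate: a plain Poisson sample.
same_rate = [sample_poisson(170, rng) for _ in range(n_wards)]

# Each ward has its own (hypothetical) rate: a mixture of Poissons.
ward_rates = [rng.uniform(40, 300) for _ in range(n_wards)]
mixed_rate = [sample_poisson(lam, rng) for lam in ward_rates]

print("dispersion, identical rates:", round(dispersion(same_rate), 2))
print("dispersion, varying rates:  ", round(dispersion(mixed_rate), 2))
```

So the county-wide histogram under this model depends heavily on how the rates vary between wards, which would have to be estimated from the data.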
@charlesmartin14 if it is true that this herding behaviour effect is playing a role, then a simple explanation could be the massive impact of conditions related to the pandemic. |
Sorry guys, I'm busy for the next few hours; I'm on CA time. But, just briefly, here's a 5 min workup of The Donald's Election Day data. A Poisson distribution would have a tail that decays at least exponentially; this data does not. The data D appears to be best described by a Truncated Power Law, with power law exponent ~3.6. (I'm doing this quickly over a break; if I am in error, please repeat the analysis and do the actual Poisson.) |
@frycast, I was using the Poisson distribution mostly as an example of a distribution that can look very different depending on its parameters. But I'm hesitant to think of the election as a Poisson (or even Poisson-like) process, mainly because I cannot wrap my head around what the analogue of "arrival over a time or spatial interval" is (voting doesn't seem to fit that "interval" concept to me), and also because of other Poisson assumptions like independence of individual events, and so on. EDIT: clarify what I meant by "arrival" |
This data, with the power law exponent between 2 and 4, is consistent with a scale-invariant generating process (subject to finite-size effects), or classical herding behavior. I can go into detail later, but this is exactly what I mean by structurally different: the tail behavior is different. And it is exactly this kind of heavy-tailed, scale-invariant behavior that Benford's Law is testing for. That's why Trump's data kinda looks Benford, and Biden's does not (on the first digit test). |
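For anyone repeating the tail analysis, a minimal sketch of a power-law exponent estimate follows. It uses the standard maximum-likelihood estimator for a continuous, untruncated power law, alpha = 1 + n / sum(ln(x_i / x_min)), which is simpler than the truncated power law fit mentioned in the comment and is not necessarily what was used there. The data is synthetic, and `x_min=100.0` is an arbitrary assumption.

```python
import math
import random

def power_law_alpha_mle(sample, x_min):
    """MLE of the exponent for a continuous power law on x >= x_min."""
    tail = [x for x in sample if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1 + len(tail) / log_sum

def sample_power_law(alpha, x_min, n, rng):
    """Inverse-transform sampling from p(x) proportional to x**(-alpha), x >= x_min."""
    return [x_min * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

# Synthetic tail with a known exponent, to check that the estimator recovers it.
rng = random.Random(1)
synthetic = sample_power_law(alpha=3.6, x_min=100.0, n=5000, rng=rng)
alpha_hat = power_law_alpha_mle(synthetic, x_min=100.0)
print("estimated exponent:", round(alpha_hat, 2))
```

On real vote counts the result is sensitive to the choice of `x_min` and to the small number of points in the tail, so the estimate should be read with wide error bars.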
Ok, so back to this @frycast @MechanicalTim
This is my point about Biden's Election Day voting data in this thread. Trump is clearly not Poisson or Poisson-like (not sure what that means, but I take it to mean having a large peak at low values and an exponentially decaying tail). But Biden's data may in fact be Poisson or Poisson-like (analysis to come, although I encourage others to do this too).
Maybe, but I don't think the Biden Election Day data is displaying this kind of herding behavior, unless, as pointed out elsewhere (I forgot which issue), the Democrats' behavior is just oddly collectivist. |
@andrewzigerelli Here are the QQ plots. Great suggestion thanks |
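For reference, the points behind a normal QQ plot can be computed with the standard library alone. This is a sketch: the sample is hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Pairs of (theoretical normal quantile, standardized sample value)."""
    xs = sorted(sample)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std_normal = NormalDist()
    return [
        (std_normal.inv_cdf((i + 0.5) / n), (x - m) / s)
        for i, x in enumerate(xs)
    ]

# Hypothetical counts with a small right tail (illustration only).
example = [88, 95, 101, 104, 110, 113, 118, 121, 127, 135, 160, 240]
for theoretical, observed in normal_qq_points(example):
    print(round(theoretical, 2), round(observed, 2))
```

Points close to the diagonal indicate Gaussian-like behavior; a heavy right tail shows up as the last few points rising well above it.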
@panicfarm That's a good observation. And why are there these large outlier peaks? Are these statistical fluctuations or structural anomalies? Not sure. I didn't check Trump's data for 2016, but I assume it is a Truncated Power Law like 2020, not Gaussian.

With regard to using Benford's Law, I think we need to generalize the approach to the constraints on the data we have in voting. I suspect that Benford's Law data is (almost always*) heavy-tailed, but heavy-tailed data is not always Benford, especially in cases like this, where there may be a soft lower bound on the data and the sample sizes are small. So, from Benford, we conjecture that naturally occurring data is heavy-tailed, but with finite-size effects.

If we want to detect fraud or other unusual patterns, we want to look for the signatures of finite-size heavy-tailed behavior. I think we want to look at both the statistics of these large fluctuations near the mean and the shape of the tail. That is, are the deviations from normality within the bounds of the central limit theorem for data of this size, or are they way outside? And, of course, does this tell us anything, or raise any suspicions, about the true voting patterns? |
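One rough way to ask whether deviations from normality are larger than expected for a sample of this size is to compare the sample skewness and excess kurtosis with their approximate standard errors under normality, sqrt(6/n) and sqrt(24/n). The sketch below assumes that large-sample approximation and uses hypothetical counts; it is only one of several possible checks, and the function names are made up for illustration.

```python
import math
from statistics import mean

def skewness_and_excess_kurtosis(sample):
    """Moment-based skewness and excess kurtosis (both 0 for a Gaussian)."""
    n = len(sample)
    m = mean(sample)
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n
    return m3 / m2**1.5, m4 / m2**2 - 3

def normality_z_scores(sample):
    """Skewness and excess kurtosis in units of their approximate standard errors."""
    n = len(sample)
    skew, kurt = skewness_and_excess_kurtosis(sample)
    return skew / math.sqrt(6 / n), kurt / math.sqrt(24 / n)

# Hypothetical counts with a small right tail (illustration only).
example = [88, 95, 101, 104, 110, 113, 118, 121, 127, 135, 160, 240]
print(skewness_and_excess_kurtosis(example))
print(normality_z_scores(example))
```

Values within about two standard errors of zero are compatible with sampling noise around a Gaussian; much larger values point to real skew or heavy tails.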
That could kinda work if you are considering a normal voting scenario where people vote in person, although factors like voting before or after work would result in non-Poisson clusters. However, the disputed votes are postal votes, which invalidates use of a Poisson model for those votes. |
@robscovell-ts The issue I am raising in this thread is that I don't see why real voting data like this would ever be Poisson (or Gaussian). It should be heavy-tailed, even if it's not Benford. Compare the Biden 2020 Election Day (not postal) data with the Hillary 2016 data. Here, the tails of the distribution (say > 400) are very similar; Trump's and Hillary's data strongly overlap. Of course, many of the voting rules in PA were changed for 2020, so who knows how that affects the comparison. This is what I think we need to check, and for different cases. |
Here's the data for Trump vs Biden winning in their own districts.

Trump data in the Trump-winning districts; Biden data in the Biden districts
I think this is consistent with the data in #17 presented by @markr-github, where the Biden data is Benford-like in the Trump-winning districts. Notice that the Trump data, however, does have a long tail in the Trump-winning districts, despite it clearly being non-Benford. It does not look Gaussian or Poisson to me.

Election Day data for both candidates in Biden-winning districts
As expected, the Trump data is Benford-like (or maybe Poisson; I have not checked carefully, but I doubt it).

Election Day data for both candidates in Trump-winning districts
Now we see the Biden non-Benford data (which seems weird in the Trump-winning districts), and we do see a little bit of tail in the Biden data. This is the data that looks suspicious. (I encourage others to double check these results to confirm.) |
I have a working hypothesis now for why this data might be Gaussian in case of fraud. |
The weirdest thing to me about all of this is that the Election Day vote distribution for Biden is almost perfectly Normal, with a slight right skew.
Whereas the Trump data, being heavy-tailed, just looks more like real-world data to me.
Any thoughts?