In this post I want to go back to the basics of statistics, but with an advanced spin on things. By "advanced spin" I mean advanced in terms of both the mathematics and the computational techniques. The topic I will dive into is:
- Estimating a single parameter value from a distribution and then quantifying the uncertainty in the estimate.
In general I will take two approaches to quantifying the uncertainty in the estimate: the first is frequentist and the second is Bayesian. I was originally inspired by Jake VanderPlas' post and, I admit, I am not very seasoned in using Bayesian methods. That's why I'll be sticking to a simple example: estimating the mean rate, 𝜆, of a Poisson distribution from sampled data. The Poisson distribution for the various 𝜆 values we wish to estimate is shown below:
From the computational perspective, I wanted to do something different and decided to write the probability distribution for generating the data in Scala, but then use it from Python. Why did I do this? Well, I like Scala and enjoyed the challenge of writing a Poisson distribution using a functional approach. I also wanted to learn more about Py4J, which can be used to work with functions and objects in the JVM from Python. Apache Spark actually uses Py4J in PySpark to write Python wrappers for its Scala API. I've used both PySpark and Spark in Scala extensively in the past, and doing this project gave me an opportunity to understand much better how PySpark works.
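For readers who want to experiment without a JVM, here is a minimal pure-Python sketch of a Poisson probability mass function and sampler. It is not the Scala implementation described above. It assumes Knuth's multiplication method for sampling, and the names `poisson_pmf` and `poisson_sample` are made up for illustration.

```python
import math
import random


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam) with lam > 0, computed in log space."""
    if k < 0:
        return 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate using Knuth's multiplication method.

    Suitable for small to moderate lam; exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Keep multiplying uniforms until the running product drops below exp(-lam).
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    data = [poisson_sample(3.0, rng) for _ in range(10_000)]
    print("sample mean:", sum(data) / len(data))
```

The sample mean printed at the end should land close to the rate passed to `poisson_sample`, since the mean of a Poisson distribution equals 𝜆.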
In this post I cover maximum likelihood estimators (MLE) and Bayesian point estimators. The MLE in this case is simple, and I show how to quantify the uncertainty in the estimate using confidence intervals from the Fisher information. I use PyMC3 to calculate two Bayesian estimators and the credible interval. PyMC3 makes it easy to use MCMC methods to calculate and visualize posterior distributions for the parameter of interest, as shown below:
One can also show that in the limit of large data, Bayesian estimators and maximum likelihood estimators converge to the same thing! This is called the Bernstein-von Mises theorem.
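To make the two approaches concrete, here is a rough sketch using only the Python standard library. For the frequentist side it uses the fact that the Poisson MLE is the sample mean and that the Fisher information of n draws is n/𝜆, which gives a Wald confidence interval. For the Bayesian side it does not use PyMC3 or MCMC. Instead it assumes a conjugate Gamma prior, whose hyperparameters are arbitrary choices for illustration, so the posterior is available in closed form. All function names and the simulated rate are hypothetical.

```python
import math
import random
from statistics import NormalDist


def mle_with_wald_interval(counts, level=0.95):
    """MLE of a Poisson rate plus a Wald interval from the Fisher information.

    The log-likelihood is sum(x) * log(lam) - n * lam + const, so the MLE is
    the sample mean. The Fisher information of n draws is n / lam, which gives
    the asymptotic standard error sqrt(lam_hat / n).
    """
    n = len(counts)
    lam_hat = sum(counts) / n
    std_err = math.sqrt(lam_hat / n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return lam_hat, (lam_hat - z * std_err, lam_hat + z * std_err)


def gamma_posterior(counts, prior_shape=1.0, prior_rate=1.0):
    """Posterior mean and standard deviation under a Gamma(shape, rate) prior.

    ASSUMPTION: the conjugate prior and its hyperparameters are illustrative.
    The posterior is Gamma(shape + sum(x), rate + n).
    """
    shape = prior_shape + sum(counts)
    rate = prior_rate + len(counts)
    return shape / rate, math.sqrt(shape) / rate


def draw_poisson(lam, rng):
    """Knuth's method; used here only to simulate demo data."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


if __name__ == "__main__":
    rng = random.Random(1)
    true_rate = 5.0  # hypothetical value used only for the simulation
    for n in (10, 100, 10_000):
        counts = [draw_poisson(true_rate, rng) for _ in range(n)]
        lam_hat, (lo, hi) = mle_with_wald_interval(counts)
        post_mean, post_sd = gamma_posterior(counts)
        print(f"n={n:>6}  MLE={lam_hat:.3f}  CI=({lo:.3f}, {hi:.3f})  "
              f"posterior mean={post_mean:.3f}  sd={post_sd:.3f}")
```

As n grows, the prior's contribution to the posterior shape and rate becomes negligible, so the posterior mean approaches the sample mean and the posterior standard deviation approaches the MLE's standard error. This is the behaviour the Bernstein-von Mises theorem describes.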
You first need to compile the Scala code and build the uber jar using Maven:
mvn package
Then build the Docker images:
docker compose build
Then start up the containers with:
docker compose up
You can shut down the containers using:
docker compose down