Matthew Bishop, Ashley Gates, Raheem Paxton, Swapna Subbagari
Georgia’s accountability court programs offer an alternative to traditional adjudication and incarceration for non-violent offenders charged with drug-related crimes and DUIs. The state of Georgia contracted with the Carl Vinson Institute of Government at the University of Georgia to estimate the financial benefits of accountability courts. That study found that, on average, the programs saved the state more than $22,000 per graduate, and that accountability courts cost almost $5,000 less per defendant than traditional adjudication when both state and local costs are considered (https://cjcc.georgia.gov/document/full-report/download). Given these potential savings, it is no wonder that these programs are growing in popularity. The present study aimed to identify the features relevant for graduation and to develop a prediction model that can be deployed for public use.
• What features are most relevant for graduation?
• Does time between arrest and acceptance impact graduation?
• Are there certain individual characteristics that increase one’s risk for termination?
• Acceptance type
• Age
• Arrest date
• Referral date
• Risk level
• Processing time (acceptance date minus arrest date)
• Exit date
• Exit status
• Referral source
• Demographic info: DOB, Education level (could change), Employment status (at entry), Gender, Income level, Employment stability, Military service, Race
• Program type (See program codes below)
• Clinical Diagnosis and Level
• Diagnosis Reason
• Number of drug tests
• Count weekly judicial status meetings
• Primary drug of choice
• Secondary drug of choice
• Number of treatment sessions
• Residence County
• FD - Felony Drug
• DC - DUI Courts
• MH - Mental Health
• JD - Juvenile Drug
• JM - Juvenile Mental Health
• FT - Family Treatment
• VC - Veterans Court
The data were cleaned by reviewing the relevant data points and selecting the items to carry forward into the final analysis. Participants with unrealistic values were removed from the analytic data frame.
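The cleaning step and the processing-time feature (acceptance date minus arrest date) can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (arrest_date, acceptance_date, age) and the plausibility thresholds are hypothetical, since the dataset schema and the exact removal rules are not given in this write-up.

```python
import pandas as pd


def add_processing_time(df, max_days=3650, min_age=0, max_age=110):
    """Derive processing time in days and drop rows with implausible values.

    Column names and thresholds are assumptions made for illustration only.
    """
    out = df.copy()
    # Unparseable dates become NaT instead of raising an error.
    out["arrest_date"] = pd.to_datetime(out["arrest_date"], errors="coerce")
    out["acceptance_date"] = pd.to_datetime(out["acceptance_date"], errors="coerce")
    out["processing_days"] = (out["acceptance_date"] - out["arrest_date"]).dt.days

    # Rows with missing or negative processing time, or an out-of-range age,
    # are treated as unrealistic and removed.
    valid = (
        out["processing_days"].between(0, max_days)
        & out["age"].between(min_age, max_age)
    )
    return out.loc[valid].reset_index(drop=True)


sample = pd.DataFrame({
    "arrest_date": ["2019-01-10", "2019-03-01", "2020-05-05"],
    "acceptance_date": ["2019-02-09", "2019-02-01", "2020-06-04"],
    "age": [34, 29, 150],
})
cleaned = add_processing_time(sample)
```

In this toy sample, the second row is dropped because acceptance precedes arrest, and the third because the age is out of range.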
Descriptive statistics were computed for all variables of interest. In particular, we examined the distribution of all continuous variables to determine whether they were normally distributed. The frequency counts of all categorical variables were explored. Finally, we explored the pros and cons of scaling and binning continuous variables.
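A rough sketch of this kind of exploration with pandas is shown here. The data, the column names (age, risk_level), and the bin edges are made up for illustration and are not taken from the project.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "age": [19, 24, 31, 38, 45, 52, 67],
    "risk_level": ["high", "low", "high", "medium", "high", "low", "medium"],
})

# Summary statistics and skewness give a first look at whether a
# continuous variable is roughly normally distributed.
summary = df["age"].describe()
skewness = df["age"].skew()

# Frequency counts for a categorical variable.
counts = df["risk_level"].value_counts()

# Binning a continuous variable into ordered categories (right-inclusive edges).
age_bins = pd.cut(
    df["age"],
    bins=[0, 25, 40, 60, 120],
    labels=["<=25", "26-40", "41-60", "61+"],
)
```

Binning trades resolution for interpretability, while scaling keeps the full range of values; which is preferable depends on the model being fit.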
The data were pre-processed by one-hot encoding all categorical variables (pandas get_dummies) and standardizing continuous variables using scikit-learn's StandardScaler. We then fit a Principal Component Analysis (PCA), specifying that the retained components account for 99% of the variance in the data. Finally, we used t-SNE to reduce the identified components further and derived an inertia (elbow) plot to determine the relevant clusters in the data. The elbow plot did not suggest a definitive number of clusters. However, after exploring two to four clusters on relevant features, we decided on two clusters because of their ability to distinguish participants on relevant factors. The clusters were later visualized with Matplotlib (see below).
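The pre-processing and clustering steps might look roughly like the sketch below. It runs on synthetic stand-in data with hypothetical column names. The use of k-means is an assumption inferred from the mention of inertia, and the t-SNE settings are illustrative rather than the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "drug_tests": rng.integers(0, 200, n),
    "program_type": rng.choice(["FD", "DC", "MH"], n),
    "gender": rng.choice(["F", "M"], n),
})
numeric_cols = ["age", "drug_tests"]
categorical_cols = ["program_type", "gender"]

# One-hot encode categoricals (pandas) and standardize numerics (scikit-learn).
dummies = pd.get_dummies(df[categorical_cols], dtype=float)
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)
X = pd.concat([scaled, dummies], axis=1)

# A float n_components keeps the fewest components explaining >= 99% of variance.
pca = PCA(n_components=0.99)
components = pca.fit_transform(X)

# Reduce the PCA components to two dimensions with t-SNE.
embedding = TSNE(
    n_components=2, init="random", perplexity=30, random_state=0
).fit_transform(components)

# Inertia for several cluster counts; plotting these values gives the elbow plot.
inertias = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embedding)
    inertias[k] = km.inertia_

# Final assignment with two clusters, matching the choice described above.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
```

The inertias dictionary can be passed to Matplotlib (for example, plotting keys against values) to reproduce an elbow plot, and labels can color a scatter plot of the embedding.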
Initially, we used Recursive Feature Elimination (RFE) with a Random Forest estimator to determine the number of features relevant for further testing. The features were selected using 5-fold cross-validation with 2 repeats, which produced mean accuracy and standard deviation scores for the 5 to 20 features that were evaluated (see figure below).
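One way to set up this evaluation with scikit-learn is sketched below. It uses synthetic data from make_classification and, to keep the run short, checks only two candidate feature counts. The forest size, random seeds, and the use of a Random Forest as the final classifier inside the pipeline are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def score_feature_counts(X, y, feature_counts):
    """Return {k: (mean accuracy, std)} for each candidate feature count k."""
    # 5-fold cross-validation repeated twice, as described in the text.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
    results = {}
    for k in feature_counts:
        # RFE sits inside the pipeline so selection is refit on every
        # training fold, which avoids leaking information from held-out data.
        pipeline = Pipeline([
            ("rfe", RFE(
                RandomForestClassifier(n_estimators=20, random_state=1),
                n_features_to_select=k,
            )),
            ("model", RandomForestClassifier(n_estimators=20, random_state=1)),
        ])
        scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
        results[k] = (scores.mean(), scores.std())
    return results


# Synthetic stand-in for the court data.
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=5, random_state=1
)
results = score_feature_counts(X, y, feature_counts=[5, 8])
```

On the real data, feature_counts would be range(5, 21) to cover the 5 to 20 features that were evaluated.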
The model revealed that eight features produced reasonable accuracy while preserving model parsimony. We then trained a series of five machine learning models using a recursive feature elimination pipeline with 5-fold cross-validation and two repeats. The final model was chosen based on mean accuracy.
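The comparison could be organized along these lines. Only Gradient Boosting is named in this write-up, so the other four candidates below are placeholders, and the decision-tree estimator driving RFE is chosen only to keep the sketch fast. The data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical candidate set: only gradient boosting is confirmed by the text.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
    "k_nearest_neighbors": KNeighborsClassifier(),
}

# Synthetic stand-in for the court data.
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=5, random_state=1
)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

mean_accuracy = {}
for name, model in candidates.items():
    # Each pipeline selects eight features with RFE, then fits the candidate.
    pipeline = Pipeline([
        ("rfe", RFE(DecisionTreeClassifier(random_state=1), n_features_to_select=8)),
        ("model", model),
    ])
    scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=cv)
    mean_accuracy[name] = scores.mean()

# The model with the highest mean cross-validated accuracy is carried forward.
best_model = max(mean_accuracy, key=mean_accuracy.get)
```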
We observed that Gradient Boosting produced the highest accuracy. The model was then tuned using a randomized grid search to obtain the best precision, specificity, recall, and R-square. The classification report for the unbalanced classes is reported below.
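A tuning run of this kind can be sketched with scikit-learn's RandomizedSearchCV. The parameter ranges, the accuracy scoring choice, the train/test split, and the synthetic imbalanced data below are all assumptions. Specificity is computed from the confusion matrix because classification_report does not label it directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (roughly 70% / 30% classes).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Hypothetical search space; the project's actual grid is not given.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    scoring="accuracy",
    cv=3,
    random_state=1,
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out split.
y_pred = search.predict(X_test)
report = classification_report(y_test, y_pred)

# Specificity is the true-negative rate: TN / (TN + FP).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)
```

Printing report shows per-class precision, recall, and F1, and search.best_params_ holds the selected hyperparameters.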
The model is housed, along with other visualizations, on a website hosted on Heroku (or a similar tool). Other tools used to complete this project were Python (pandas), Tableau, Flask, HTML/CSS/Bootstrap, and a SQL database.
The deployed app can be viewed here.
The final presentation can be viewed here.