Josh Burke edited this page May 16, 2019 · 34 revisions

Welcome

Welcome to the GlobalGiving-Depth wiki! This wiki offers an in-depth view of Hack4Impact's initiatives this semester in partnership with GlobalGiving. This page serves as an index, linking to subpages describing each individual approach to the problem. Check out the sidebar for quick links to the pages offered on this wiki.

Problem

With many potential contacts to make, GlobalGiving needs to make an informed choice when reaching out to nonprofits to bring into its network. GlobalGiving's network consists of many organizations based in the US along with some nonprofits in other countries. However, the process of finding and applying to GlobalGiving remains significantly easier within the United States. In certain countries, factors including lack of internet connectivity and lack of access to documents required by GlobalGiving's vetting process have led to slower onboarding and discovery. As a result, GlobalGiving aims to use data science techniques to preemptively find and reach out to nonprofits around the world, streamlining the process of acquiring and benefitting more nonprofits outside the United States.

In late 2018, Hack4Impact provided a solution for GlobalGiving that obtained basic information about many new organizations not yet part of GlobalGiving's network. Now, GlobalGiving aims to fill out these records with more detailed information about the work these organizations do and for whom, in order to further streamline the process of benefitting them. Whereas last semester's problem was about discovering the breadth of organizations around the world, this semester's is about depth.

Approaches

Along with the many algorithms we provide in this repo, we also spent some time developing a new categorization scheme that takes into account statistical trends in the data and implements mechanisms that enforce the consistency of those trends. This idea came about while trying to imagine an ideal categorization scheme for classifying new NGOs.


Classification:

Classification algorithms are one way to characterize the work of new, unknown NGOs. Feeding an NGO's summary text into a properly trained classifier model yields a set of categories that describes that NGO with some degree of accuracy.


Clustering:

Clustering algorithms offer the possibility of generating new sets of categories, or a better understanding of the connections and similarities between NGOs. K-Means with Document Embeddings seems to be the most promising initiative in this category.
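A rough sketch of the K-Means-with-document-embeddings approach: here TF-IDF vectors reduced by truncated SVD (LSA) stand in for learned document embeddings, and the documents are toy examples; an actual run might use a trained embedding model instead.

```python
# Hedged sketch: cluster low-dimensional document vectors with K-Means.
# TF-IDF + TruncatedSVD is a stand-in for real document embeddings.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "clean water and water sanitation",
    "drinking water wells and sanitation",
    "school education and teacher training",
    "education scholarships and school supplies",
]

# Embed each document as a 2-dimensional vector
embed = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
vectors = embed.fit_transform(docs)

# Group the embedded documents into two clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)
```

The cluster assignments, rather than a fixed label set, then suggest candidate categories or reveal which NGOs do similar work.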

We attempted to design a Semi-Supervised LDA algorithm based on an article published online, but were unable to get the code to run. Here is the article for reference.


Data Processing/Visualization:

Most of the processing involved obtaining data, exploring what it looked like, and getting it into a form we could analyze. Preprocessing steps like TF-IDF scoring and count vectorizing are not included here, but stock scikit-learn preprocessors were used in many of the algorithms.
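For reference, these are the stock scikit-learn preprocessors in question: CountVectorizer produces raw token counts, while TfidfVectorizer reweights those counts so terms that appear in every document contribute less. The two-document corpus below is illustrative only.

```python
# Hedged illustration of count vectorizing and TF-IDF scoring with
# stock scikit-learn preprocessors, on a tiny made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["ngo provides water", "ngo provides education"]

# Raw token counts; columns are the vocabulary in alphabetical order
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# TF-IDF learns the same vocabulary but downweights shared terms
tfidf = TfidfVectorizer().fit(docs)
print(sorted(tfidf.vocabulary_))
```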

Conclusions

Classification

The general consensus among our team is that classification with the current data and category sets is not especially useful. Our classifiers yield a maximum F1 score of 0.67 in tests, which is not accurate enough to provide significant value. The SGD Classifier depends on a scikit-learn implementation of SGD, which is already well optimized, so any improvements will have to come from the categorization scheme or from larger, cleaner datasets.

The SGD Classifier can be trained on any arbitrary set of labels, so its performance depends on how predictable an NGO's categories are given its summary text. Improvements can therefore come from either cleaner text or more predictable, consistent categories. The Bag of Words classifier's performance, however, depends mostly on the word dictionaries provided to it, which can be built and rebuilt in many ways (it is hard to say which approach is best). Moving forward, we provide the classifiers in this repository with the intent that they could be used on future datasets, ones that are cleaner than the project summaries dataset or that have different label sets (such as the labels specified in the recategorization scheme).
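For context on the F1 numbers cited above, this is roughly how such a score is computed with scikit-learn; the true and predicted categories below are made up, and the resulting value is not the 0.67 from our tests.

```python
# Hedged sketch of the evaluation metric: weighted F1 over a toy set of
# true vs. predicted NGO categories (hypothetical values).
from sklearn.metrics import f1_score

y_true = ["water", "water", "education", "health", "education"]
y_pred = ["water", "education", "education", "health", "education"]

# average="weighted" accounts for imbalance across category supports
score = f1_score(y_true, y_pred, average="weighted")
print(round(score, 2))
```

Because F1 balances precision and recall per category, it penalizes a classifier that over-predicts a popular label, which matters when category frequencies in the dataset are skewed.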


Past Presentation Slides
