General Recategorization Proposal

Josh Burke edited this page May 16, 2019 · 25 revisions

Summary

As we started classifying NGOs based on text from their websites, we found many anecdotal instances where the work of two NGOs might be very similar, but their self-selected categorizations were different. We uncovered significant trends in the dataset that inform potential solutions. In this document we analyze these trends and propose a revised categorization scheme that would help solve our initial problem of categorizing unknown organizations. Our hope is to help GlobalGiving in vetting new organizations and to better understand the breadth of NGOs that are out there.

The key premises of this proposal are as follows:

  • Categories that describe “What does an organization do?” differ from categories that describe “Who do they serve?”, and both types are necessary. This distinction is better addressed if it is built into the categorization scheme.
  • There are inherent associations between certain categories that might be better addressed by a hierarchical categorization scheme or merging categories.

Current Categorization Scheme

Organizations choose from these themes:

  • Children
  • Women and Girls
  • Health
  • Economic Development
  • Human Rights
  • Environment
  • Disaster Recovery
  • LGBTQAI+
  • Animals
  • Arts and Culture
  • Climate Change
  • Technology
  • Sport
  • Democracy and Governance
  • Microfinance
  • Humanitarian Assistance

Trends in The Data

We were given a collection of 6,544 labelled non-profit organizations following the old categorization schema. While classifying the data and inspecting individual NGOs, we found some key trends:

There are many organizations that have both a “who they support” and a “what they do” categorization.

We define a “who” category as Children, Women and Girls, or Animals; it describes who a nonprofit organization serves. The “what” categories are all the other categories, including Education, Health, and Environment. In an analysis of 3,653 organizations, we found that 83% had a “who” category and 52% had both a “who” and a “what” category.

As a corollary, 31% of the organizations had only a “who” category. This does not provide much information as to what an organization does, which supports the idea that a “who” and “what” category are both necessary for meaningful classification.
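The shares above can be reproduced with a simple count over each organization's selected themes. The following is a minimal sketch, assuming each organization is represented as a set of theme names; the function name `who_what_shares` and the toy data are hypothetical and are not taken from the GlobalGiving dataset.

```python
# "Who" categories as defined in this section.
WHO = {"Children", "Women and Girls", "Animals"}


def who_what_shares(orgs):
    """Return the fraction of organizations with any "who" category,
    with both a "who" and a "what" category, and with only "who" categories.

    `orgs` is a list of sets of theme names (an assumed representation).
    """
    total = len(orgs)
    if total == 0:
        raise ValueError("no organizations supplied")
    has_who = sum(1 for themes in orgs if themes & WHO)
    # An organization has a "what" if any selected theme is outside WHO.
    has_both = sum(1 for themes in orgs if themes & WHO and themes - WHO)
    return {
        "who": has_who / total,
        "who_and_what": has_both / total,
        "who_only": (has_who - has_both) / total,
    }


# Toy data for illustration only.
sample = [
    {"Children", "Health"},
    {"Children"},
    {"Health", "Environment"},
    {"Animals", "Environment"},
]
shares = who_what_shares(sample)
```

By construction, the "who only" share equals the "who" share minus the "who and what" share, which is the same relationship as the 83%, 52%, and 31% figures reported above.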

We also examined the probability that an organization selected a “Who They Support” category given that it had selected a “What They Do” category (see the Appendix). Organizations generally found it necessary to select at least one category from both groups in order to form an informative description of their purpose.

If similar organizations do not select both a “who they support” and a “what they do”, they may end up selecting very different categories.

This non-profit seems to focus on providing education, healthcare, and overall welfare, mainly to children. They selected the “education” and “health” categories when applying to GlobalGiving, meaning they only selected “what they do” categories.

This non-profit focuses on giving male children educational and development opportunities. They selected the “children” category, which is only from “who they support”.

Both organizations have very similar objectives, so it should follow that they are classified similarly. Yet they have no overlapping categories, because one selected only “what they do” categories while the other selected only from “who they support”.

Among the “what they do” categories, conditional probabilities suggest that some categories could be represented with a hierarchical relationship.

Here are the examples we found from this analysis:

  • If Microfinance was picked, Economic Development was often picked
    • P(Economic Development | Microfinance) = .83
  • If Democracy & Governance was picked, Human Rights was often picked
    • P(Human Rights | Democracy and Governance) = .82
  • If LGBTQAI+ was picked, Human Rights was often picked
    • P(Human Rights | LGBTQAI+) = .86
  • If Disaster Recovery was picked, Humanitarian Assistance was often picked
    • P(Humanitarian Assistance | Disaster Recovery) = .64
  • If Hunger was picked, Humanitarian Assistance was often picked
    • P(Humanitarian Assistance | Hunger) = .56
  • If Climate Change was picked, Environment was often picked
    • P(Environment | Climate Change) = .88
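For readers who want to reproduce figures of this kind, a conditional probability P(B | A) can be estimated as the share of organizations that picked A which also picked B. The sketch below assumes each organization is a set of theme names; `conditional_probability` and the toy data are hypothetical and do not reproduce the values listed above.

```python
def conditional_probability(orgs, target, given):
    """Estimate P(target | given) as a co-occurrence ratio.

    `orgs` is a list of sets of theme names (an assumed representation).
    Returns None when no organization selected `given`, since the
    probability is undefined in that case.
    """
    with_given = [themes for themes in orgs if given in themes]
    if not with_given:
        return None
    return sum(1 for themes in with_given if target in themes) / len(with_given)


# Toy data for illustration only.
sample = [
    {"Climate Change", "Environment"},
    {"Climate Change", "Environment"},
    {"Environment"},
    {"Environment", "Animals"},
    {"Climate Change"},
]
p_env_given_cc = conditional_probability(sample, "Environment", "Climate Change")
p_cc_given_env = conditional_probability(sample, "Climate Change", "Environment")
```

The two directions generally differ, which is why a high P(child | parent candidate) in one direction only is what suggests a parent/child relationship rather than two interchangeable categories.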

Objectives

Create a categorization schema that:

  • Provides more informative classifications
  • Provides more specific classifications
  • Is easy to understand for organizations being vetted by GlobalGiving

Proposal

We propose a revised categorization scheme with the following changes:

  • Split categories into two groups:
    • “Who They Serve”
    • “What They Do”
  • Add:
    • Senior Citizens to “Who They Serve”
    • Preservation to “Who They Serve”
  • Merge:
    • Climate Change + Environment -> Environment

Revised Categorization Scheme

Select at least one from the “Who They Serve” group.

Select at least one from the “What They Do” group.

Clarifications on Implementation

  • An organization must select at least 1 “Who” and at least 1 “What.”
  • Parents in hierarchies can be selected without any children being chosen (e.g. you can choose People without necessarily choosing any of the 3 below it)
    • The converse is not true: a child cannot be selected without its parent
  • Design Consideration 1: Visible checkbox on every category.
    • This will make it more intuitive that it is OK to select a parent in a hierarchy and not select any of its children
  • Design Consideration 2: One way to make this easy to fill out for organizations would be to have the full picture visible from the beginning, but have portions that don’t need to be filled out yet “grayed out”
    • Example: At the top, user selects from People, Animals, Preservation, while the rest of the scheme is grayed out. If the user selects People, its 3 children become selectable, as does the “What” category (now that the user has selected at least one from “Who”).
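The selection rules above could be enforced by a small validation step on form submission. The following is a minimal sketch, not a description of any GlobalGiving system: the function `validate` and the `PARENT` mapping are hypothetical, and the assumption that the three children of People are Children, Women and Girls, and Senior Citizens is our reading of the proposal.

```python
# Assumed child -> parent relationships in the "who" hierarchy.
PARENT = {
    "Children": "People",
    "Women and Girls": "People",
    "Senior Citizens": "People",
}


def validate(who, what):
    """Return a list of rule violations; an empty list means the selection is valid."""
    errors = []
    if not who:
        errors.append('select at least one "who" category')
    if not what:
        errors.append('select at least one "what" category')
    # A parent may be chosen alone, but a child requires its parent.
    for category in sorted(who):
        parent = PARENT.get(category)
        if parent is not None and parent not in who:
            errors.append(f"{category} requires {parent}")
    return errors


ok = validate({"People", "Children"}, {"Health"})
parent_only = validate({"People"}, {"Health"})
missing_parent = validate({"Children"}, {"Health"})
```

A form built along the lines of Design Consideration 2 would make the last case impossible to enter, since children only become selectable once their parent is chosen; the check is still useful for data arriving by other routes.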

Added/Deleted Categories

Who Section:

  • People (Added)
    • Not all organizations will select Children or Women & Girls, so this “umbrella” category will help ensure we always have a general representation of who the organization is serving
    • As noted above, an organization could either select People or select both People and Children. However, they cannot select Children without selecting People.
  • Senior Citizens (Added)
  • Preservation (Added)
    • This could be replaced with “Environmental Preservation”
      • Most examples of “Preservation” we found or could come up with that didn’t fall under Environmental Preservation were based on the preservation of human culture or were built exclusively for humans (e.g. museums), and could thus easily fall under “People”
    • This could also simply be replaced with a “Neither” category

What Section:

  • Climate Change (Removed)
    • If Climate Change was selected, Environment was almost always also selected; P(Environment | Climate Change) = .88, while P(Climate Change | Environment) = .45.
    • However, if keeping this category is desired, it could be placed as a “child” of Environment
  • New categories may be needed in the “What” section, although we have not been able to pinpoint new ones thus far
    • Progress is forthcoming with clustering by word/document embeddings

Algorithm Improvements Under Proposed Scheme

With the proposed categorization schema, we can improve the classification algorithms used to better understand organizations that are not yet in GlobalGiving’s network. Because the hierarchical structure separates categories into distinct subsets, an algorithm can classify an organization within several smaller subsets of categories rather than across one large set, which can improve both accuracy and precision.

The hierarchical structure allows algorithms to leverage statistical trends between categories to better classify organizations. For example, if an organization were to select “Animals” as a “who”, it is statistically more likely to also select “Health” (P(Health | Animals) = 0.54) or “Environment” (P(Environment | Animals) = 0.66) than “LGBTQAI+” (P(LGBTQAI+ | Animals) = 0). Classification algorithms such as Naive Bayes could reflect this trend by developing a bias towards classifying likely paired categories together and avoiding irrelevant ones, based on real-world statistical probabilities. In addition, this schema can improve classification within subsets of categories. For example, if an organization were to select “Microfinance” as a “what”, it is also likely to select “Economic Development” (P(Economic Development | Microfinance) = 0.83) rather than “Sport” (P(Sport | Microfinance) = 0.18).

In summary, the revised categorization schema will allow classification algorithms to better leverage statistical trends both between and within subsets of classifications.
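One simple way to picture how a known “who” could bias “what” predictions is to re-weight a text classifier's scores by the conditional probabilities for that “who”. The sketch below uses the Animals probabilities reported above; the text scores are hypothetical values chosen for illustration, the function `rerank` is our own name, and this plain re-weighting is a stand-in for illustration rather than a Naive Bayes implementation or the method used in this project.

```python
# P(what | who = "Animals") as reported in this proposal.
P_WHAT_GIVEN_ANIMALS = {"Health": 0.54, "Environment": 0.66, "LGBTQAI+": 0.0}


def rerank(text_scores, prior):
    """Multiply each text score by the prior for that category, then renormalize.

    Every category in `text_scores` must also appear in `prior`.
    If all weighted scores are zero, the text scores are returned unchanged.
    """
    weighted = {cat: score * prior[cat] for cat, score in text_scores.items()}
    total = sum(weighted.values())
    if total == 0:
        return dict(text_scores)
    return {cat: value / total for cat, value in weighted.items()}


# Hypothetical scores from a text-only classifier (illustrative values).
text_scores = {"Health": 0.30, "Environment": 0.30, "LGBTQAI+": 0.40}
ranked = rerank(text_scores, P_WHAT_GIVEN_ANIMALS)
best = max(ranked, key=ranked.get)
```

In this toy case the text-only classifier slightly prefers “LGBTQAI+”, but once the organization is known to serve Animals that category is ruled out and “Environment” ranks first. A hard zero is a strong assumption; in practice some smoothing of the probabilities may be preferable so that rare pairings are not excluded outright.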

Ideas for Further Improvement of Scheme

  • Adding categories to address all possible fields of service (e.g. preservation, housing - break up some larger categories)
    • We are in the process of exploring document embeddings and their potential to cluster documents precisely
  • Simplify language of category names

Appendix

Percentage and count of "who" categories given "what" categories

Probability of paired categories