N grams

Summary

N-grams are essentially ways of grouping words together - for example, a unigram is a single word, a bigram is two words, and a trigram is three words.

n-gram

Utilizing n-grams for text analysis can be useful as it takes into account words paired together (when n > 1), not just individual words - allowing for associations to come to light that may not have with a unigram model. In our first exploration of GlobalGiving's problem, one of our thoughts was to take advantage of n-grams to find associations with text. While we did not end up pursuing this direct route of classification, we did utilize unigrams and bigrams in our Stochastic Gradient Descent model.

The code we have for n-grams is more of a processing tool that formats an NGO's data into pandas DataFrames.

Usage

There is a main function provided in ngram_classifier.py file, resulting in a JSON dump of organized data.

File structures

The first argument when running the script should be the name of a JSON file that represents a list of NGOs. We have provided the file scraping_data.json. It will be in the following format:

{
  "projects": [
    {
      "url": "url",
      "text": "text",
      "themes": [{ "id": "id", "name": "name" }]
    }
  ]
}

The second argument should be the name of a JSON file to dump the resulting data to, which will be of this format:

{
  "projects": [
    {
      "url": "url",
      "themes": [{ "id": "id", "name": "name" }],
      "tfidf_values": [],
      "features": []
    }
  ]
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

N grams

Summary

Usage

File structures

Clone this wiki locally