-
Notifications
You must be signed in to change notification settings - Fork 2
N grams
N-grams are essentially ways of grouping words together - for example, a unigram is a single word, a bigram is two words, and a trigram is three words.
Utilizing n-grams for text analysis can be useful as it takes into account words paired together (when n > 1), not just individual words - allowing for associations to come to light that may not have with a unigram model. In our first exploration of GlobalGiving's problem, one of our thoughts was to take advantage of n-grams to find associations with text. While we did not end up pursuing this direct route of classification, we did utilize unigrams and bigrams in our Stochastic Gradient Descent model.
The code we have for n-grams is more of a processing tool that formats an NGO's data into pandas DataFrames.
There is a main function provided in ngram_classifier.py
file, resulting in a JSON dump of organized data.
The first argument when running the script should be the name of a JSON file that represents a list of NGOs. We have provided the file scraping_data.json
. It will be in the following format:
{
"projects": [
{
"url": "url",
"text": "text",
"themes": [{ "id": "id", "name": "name" }]
}
]
}
The second argument should be the name of a JSON file to dump the resulting data to, which will be of this format:
{
"projects": [
{
"url": "url",
"themes": [{ "id": "id", "name": "name" }],
"tfidf_values": [],
"features": []
}
]
}