* Add patterns
* Add chunker
* Separate patterns for DE and EN in two classes
* Refactor
* Move counts to separate module
* Move the count module to preprocessing, as counting is needed also for text analysis
* Restructure classes
* Refactor and add some docstrings
* Add example
* More docstrings
* Make extract_mwe_candidates private
* Fix import
* Separate module for association measures; initiate with PMI
* Play around with class hierarchy
* Make prob() private
* Clean up classes
* Move unnecessary classes to functions
* Write count dict to file
* Make n in n-gram a parameter
* Warnings for bad entries
* Move DataFrame reader to a separate module
* Docstrings
* Change pickle to json
* Test end to end
* Add PMI association measure
* Move DataFrameReader out of this module
* Remove unnecessary counting of the MWE matches at sentence level
* Rename pattern classes
* Rename MWE pattern classes
* Clean up
* Add arg for custom patterns
* Rename to better names
* Add tqdm
* Add table output and sort and top_n options
* Update docs with latest API changes
* Show two decimal values
* Remove language version
* Format code and apply corresponding fixes
* Format code and apply corresponding fixes
* Remove legacy MWE
* Format code and apply corresponding fixes
* Update docs
* Remove main
* Import NgramExtractor
* Remove main
* Update docs with ngram extraction
* Remove unused imports
* Rename corpus to df
* Test association measure
* Format, docstring and fix issues
* Fix wrongly placed error
* Tests for MWE
* Bump version
* Add sample dataset
* Exclude unnecessary files
1 parent e7935fd, commit 61fcabc. Showing 15 changed files with 2,759 additions and 413 deletions.
Multiword Expressions (MWEs)
############################

Multiword Expressions (MWEs) are phrases that can be treated as a single
semantic unit, e.g. *swimming pool* and *climate change*. MWEs have
applications in different areas including parsing, language generation,
language modeling, terminology extraction, and topic models.

Wordview can extract different types of MWEs from a text corpus in any of the
supported languages. By default, Wordview extracts the following types of MWEs:
Light Verb Constructions (LVCs), two- and three-word Noun Compounds (NCs),
two- and three-word Adjective-Noun Compounds (ANCs), and Verb-Noun Compounds
(VNCs). However, you can specify other types of MWEs to extract using the
`custom_patterns` argument. For more details, see the documentation.

.. code:: python

    # First we need to extract ngrams from the corpus.
    # If this was not done previously, e.g. when running other functions of
    # Wordview, you can do it as follows:
    import pandas as pd

    from wordview.preprocessing import NgramExtractor

    imdb_train = pd.read_csv("data/IMDB_Dataset_sample.csv")
    extractor = NgramExtractor(imdb_train, "review")
    extractor.extract_ngrams()
    extractor.get_ngram_counts(ngram_count_file_path="data/ngram_counts.json")

    # Now we can extract MWEs
    import json

    from wordview.mwes import MWE

    mwe_obj = MWE(imdb_train, 'review',
                  ngram_count_file_path='data/ngram_counts.json',
                  language='EN',
                  custom_patterns="NP: {<DT>?<JJ>*<NN>}",
                  only_custom_patterns=False,
                  )
    mwes = mwe_obj.extract_mwes(sort=True, top_n=10)
    json.dump(mwes, open('data/mwes.json', 'w'), indent=4)
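The exact contents of `ngram_counts.json` are produced by Wordview's `NgramExtractor`, but conceptually the file is a flat mapping from n-grams to their corpus frequencies. The following is a minimal standard-library sketch of that kind of counting, for illustration only; it is not Wordview's implementation.

```python
from collections import Counter

def ngram_counts(tokens, max_n=2):
    """Count all 1..max_n-grams in a token list (illustrative sketch)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "the swimming pool near the swimming pool".split()
counts = ngram_counts(tokens, max_n=2)
print(counts["swimming pool"])  # 2
```

A flat string-keyed mapping like this is convenient because unigram and bigram frequencies can be looked up with the same key scheme, which is all an association measure such as PMI needs.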
`extract_mwes` returns the results in a dictionary, which in this example we
stored in the `mwes.json` file. You can also print the results as a table:

.. code:: python

    mwe_obj.print_mwe_table()

.. code::

    ╔═════════════════════════╦═══════════════╗
    ║ LVC                     ║   Association ║
    ╠═════════════════════════╬═══════════════╣
    ║ SHOOT the binding       ║         26.02 ║
    ║ achieve this elusive    ║          24.7 ║
    ║ manipulate the wildlife ║         24.44 ║
    ║ offset the darker       ║         24.02 ║
    ║ remove the bindings     ║         24.02 ║
    ║ Wish that Anthony       ║          23.9 ║
    ║ Add some French         ║          23.5 ║
    ║ grab a beer             ║         22.82 ║
    ║ steal the 42            ║          22.5 ║
    ║ invoke the spirit       ║         22.12 ║
    ╚═════════════════════════╩═══════════════╝

    ╔══════════════════════╦═══════════════╗
    ║ NC2                  ║   Association ║
    ╠══════════════════════╬═══════════════╣
    ║ gordon willis        ║         20.74 ║
    ║ Smoking Barrels      ║         20.74 ║
    ║ sadahiv amrapurkar   ║         20.74 ║
    ║ nihilism nothingness ║         20.74 ║
    ║ tomato sauce         ║         20.74 ║
    ║ Picket Fences        ║         20.74 ║
    ║ deja vu              ║         19.74 ║
    ║ cargo bay            ║         19.74 ║
    ║ zoo souvenir         ║         19.16 ║
    ║ cake frosting        ║         19.16 ║
    ╚══════════════════════╩═══════════════╝

    ╔══════════════════════════════╦═══════════════╗
    ║ ANC2                         ║   Association ║
    ╠══════════════════════════════╬═══════════════╣
    ║ bite-sized chunks            ║         20.74 ║
    ║ lizardly snouts              ║         20.74 ║
    ║ behind-the-scenes featurette ║         20.74 ║
    ║ hidebound conservatives      ║         20.74 ║
    ║ judicious pruning            ║         20.74 ║
    ║ substantial gauge            ║         19.74 ║
    ║ haggish airheads             ║         19.74 ║
    ║ global warming               ║         19.74 ║
    ║ Ukrainian flags              ║         19.16 ║
    ║ well-lit sights              ║         19.16 ║
    ╚══════════════════════════════╩═══════════════╝

    ╔═══════════════╦═══════════════╗
    ║ VPC           ║   Association ║
    ╠═══════════════╬═══════════════╣
    ║ upside down   ║         12.67 ║
    ║ Stay away     ║         12.49 ║
    ║ put together. ║         11.62 ║
    ║ sit through   ║         10.93 ║
    ║ ratchet up    ║         10.83 ║
    ║ shoot'em up   ║         10.83 ║
    ║ rip off       ║         10.72 ║
    ║ hunt down     ║         10.67 ║
    ║ screw up      ║         10.41 ║
    ║ scorch out    ║          10.4 ║
    ╚═══════════════╩═══════════════╝

    ╔══════════════╦═══════════════╗
    ║ NP           ║   Association ║
    ╠══════════════╬═══════════════╣
    ║ every penny  ║         12.78 ║
    ║ THE END      ║         12.07 ║
    ║ A JOKE       ║         11.79 ║
    ║ A LOT        ║         11.05 ║
    ║ Either way   ║         11.03 ║
    ║ An absolute  ║         10.72 ║
    ║ half hour    ║         10.65 ║
    ║ no qualms    ║         10.47 ║
    ║ every cliche ║         10.46 ║
    ║ another user ║         10.37 ║
    ╚══════════════╩═══════════════╝
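The NP results come from the custom pattern `NP: {<DT>?<JJ>*<NN>}` shown earlier: an optional determiner, any number of adjectives, then a noun. The snippet below emulates how such an NLTK-style chunk grammar matches a POS-tag sequence; the regex is a hand-rolled stand-in for illustration, not Wordview's actual matcher.

```python
import re

# POS-tagged tokens for "an absolute joke ends" (illustrative input)
tagged = [("an", "DT"), ("absolute", "JJ"), ("joke", "NN"), ("ends", "VBZ")]

# Flatten the tag sequence so a regex can scan it: "DT JJ NN VBZ "
tags = " ".join(tag for _, tag in tagged) + " "

# <DT>?<JJ>*<NN>  ->  optional determiner, any adjectives, then a noun
match = re.search(r"(?:DT )?(?:JJ )*NN ", tags)
print(match.group().split())  # ['DT', 'JJ', 'NN']
```

Matching over tag sequences rather than surface words is what lets a single pattern capture both `An absolute` and `every penny` style phrases.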
Notice how many interesting entities are captured
without the need for any labeled data or a supervised model.
This can speed things up and save costs in certain applications.
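For programmatic filtering, the dictionary returned by `extract_mwes` (saved to `mwes.json` above) can be post-processed like any nested mapping. The structure sketched below, `{mwe_type: {expression: association_score}}`, is inferred from the tables above rather than taken from documented guarantees, so treat it as an assumption.

```python
# Hypothetical sample mimicking the assumed output shape of extract_mwes:
# {mwe_type: {expression: association_score}}
mwes = {
    "NC2": {"tomato sauce": 20.74, "deja vu": 19.74, "cargo bay": 19.74},
    "VPC": {"upside down": 12.67, "rip off": 10.72},
}

# Keep only noun compounds whose association score clears a threshold
strong_ncs = {mwe: score for mwe, score in mwes["NC2"].items() if score >= 19.74}
print(sorted(strong_ncs))  # ['cargo bay', 'deja vu', 'tomato sauce']
```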
import pytest
import pandas as pd

from wordview.mwes.mwe import PMICalculator


@pytest.fixture
def ngram_counts_dict():
    return {
        'coffee': 100,
        'shop': 150,
        'coffee shop': 80,
        'swimming': 50,
        'pool': 60,
        'swimming pool': 40,
        'give': 200,
        'a': 500,
        'speech': 45,
        'give a': 150,
        'a speech': 40,
        'give a speech': 35,
        'take': 100,
        'deep': 30,
        'breath': 25,
        'take a': 80,
        'a deep': 25,
        'deep breath': 20,
        'take a deep': 18,
        'a deep breath': 17,
        'take a deep breath': 15,
    }


class TestPMICalculator:
    def test_compute_association(self, ngram_counts_dict):
        calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
        ngram = 'coffee shop'
        pmi_value = calculator.compute_association(ngram)
        assert pmi_value == pytest.approx(3.24, abs=0.1)

    def test_compute_association_with_zero_count(self, ngram_counts_dict):
        calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
        ngram = 'horseback riding'
        pmi_value = calculator.compute_association(ngram)
        assert pmi_value == float('-inf')
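The expected values in these tests follow from standard pointwise mutual information over the fixture's counts. As a standalone illustration, PMI can be sketched as below; the normalization totals are made-up assumptions, so the resulting numbers are illustrative rather than the exact values `PMICalculator` produces.

```python
import math

def pmi(bigram, counts, unigram_total, bigram_total):
    """Standard PMI: log2(p(xy) / (p(x) * p(y))). Totals are assumed inputs."""
    w1, w2 = bigram.split()
    if counts.get(bigram, 0) == 0:
        # Unseen bigram: p(xy) = 0, so PMI diverges to negative infinity,
        # matching the zero-count test above.
        return float('-inf')
    p_xy = counts[bigram] / bigram_total
    p_x = counts[w1] / unigram_total
    p_y = counts[w2] / unigram_total
    return math.log2(p_xy / (p_x * p_y))

counts = {'coffee': 100, 'shop': 150, 'coffee shop': 80}
print(round(pmi('coffee shop', counts, unigram_total=1000, bigram_total=500), 2))  # 3.42
print(pmi('horseback riding', counts, unigram_total=1000, bigram_total=500))  # -inf
```

A high PMI means the two words co-occur far more often than their individual frequencies would predict, which is why it surfaces collocations like *swimming pool* rather than merely frequent pairs.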