Refactor MWE module. (#92)
* Add patterns

* Add chunker

* Separate patterns for DE and EN in two classes

* Refactor

* Move counts to separate module

* Move the count module to preprocessing, as counting is needed also for text analysis

* Restructuring classes

* Refactor and some docstring

* Add example

* More docstring

* Make extract_mwe_candidates private

* Fix import

* Separate module for association measures. Initiate with PMI

* Playing around with class hierarchy

* Make prob() private

* Clean up classes

* Move unnecessary classes to functions

* Write count dict to file

* Make n in n-gram a parameter

* Warnings for bad entries

* Move DataFrame reader to a separate module

* Docstring

* Change pickle to json

* Test end to end

* Add PMI association measure

* Mv DataFrameReader out of this module

* Remove unnecessary counting of the MWE matches at sentence level

* Rename pattern classes

* Rename MWE pattern classes

* Clean up

* Add arg for custom patterns

* Rename to better names

* Add tqdm

* Add table output and sort and top_n options

* Update docs with latest api changes

* Show two decimal values

* Remove language version

* Format code and corresponding fixes

* Format code and apply corresponding fixes

* Rm legacy MWE

* Format code and apply corresponding fixes

* Update docs

* Rm main

* Import NgramExtractor

* Rm main

* Update docs with ngram extraction

* Rm unused imports

* Rename corpus to df

* Test association measure

* Format, docstring and fix issues

* Fix wrongly placed error

* Tests MWE

* Bump version

* Add sample dataset

* Exclude unnecessary files
meghdadFar authored Aug 3, 2023
1 parent e7935fd commit 61fcabc
Showing 15 changed files with 2,759 additions and 413 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
# supported by your project here, or alternatively use
# pre-commit's default_language_version, see
# https://pre-commit.com/#top_level-default_language_version
language_version: python3.10
# language_version: python3.10
- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
hooks:
10 changes: 10 additions & 0 deletions CHANGES.rst
@@ -1,3 +1,13 @@
Version 1.0.0
-------------
- Complete refactoring and upgrading of the MWE module.
- Support for extracting variable-length MWEs given custom user-defined syntactic patterns of POS tags.
- Predefined patterns for extracting Light Verb Constructions (LVCs), 2-3 word Noun Compounds (NCs), 2-3 word Adjective-Noun Compounds (ANCs), 2-3 word Verb-Noun Compounds (VNCs), and Verb Particle Constructions (VPCs).
- Refactoring of the Association Measure module.
- Move the DataFrame reader to a separate preprocessing module so that it can support all modules more easily.
- Add support for extracting ngrams for MWE extraction and for ngram analysis.


Version 0.4.2
-------------
- Better encapsulation.
2,001 changes: 2,001 additions & 0 deletions data/IMDB_Dataset_sample.csv


163 changes: 118 additions & 45 deletions docs/source/mwes.rst
@@ -1,55 +1,128 @@
Multiword Expressions (MWEs)
############################

Multiword Expressions (MWEs) are phrases that can be treated as a single
semantic unit, e.g. *swimming pool* and *climate change*. MWEs have
application in different areas including parsing, language generation,
language modeling, terminology extraction, and topic models.

Wordview can extract different types of MWEs from a text corpus in any of the supported languages. Wordview by default extracts the following types of MWEs:
Light Verb Constructions (LVCs), 2 and 3 word Noun Compounds (NCs), 2 and 3 word Adjective-Noun Compounds (ANCs), and Verb-Noun Compounds (VNCs).
However, you can specify other types of MWEs you want to extract using the `custom_patterns` argument. For more details, see the documentation.

.. code:: python

    # First we need to extract ngrams from the corpus.
    # If this was not done previously, e.g. when running other functions
    # of Wordview, you can do it as follows:
    from wordview.preprocessing import NgramExtractor
    import pandas as pd

    imdb_train = pd.read_csv("data/IMDB_Dataset_sample.csv")
    extractor = NgramExtractor(imdb_train, "review")
    extractor.extract_ngrams()
    extractor.get_ngram_counts(ngram_count_file_path="data/ngram_counts.json")

    # Now we can extract MWEs
    from wordview.mwes import MWE
    import json

    mwe_obj = MWE(imdb_train, 'review',
                  ngram_count_file_path='data/ngram_counts.json',
                  language='EN',
                  custom_patterns="NP: {<DT>?<JJ>*<NN>}",
                  only_custom_patterns=False,
                  )
    mwes = mwe_obj.extract_mwes(sort=True, top_n=10)
    json.dump(mwes, open('data/mwes.json', 'w'), indent=4)
The above returns the results in a dictionary, which in this example we store in the `mwes.json` file.
You can also print the results as a table:

.. code:: python

    mwe_obj.print_mwe_table()

.. code::

╔═════════════════════════╦═══════════════╗
║ LVC ║ Association ║
╠═════════════════════════╬═══════════════╣
║ SHOOT the binding ║ 26.02 ║
║ achieve this elusive ║ 24.7 ║
║ manipulate the wildlife ║ 24.44 ║
║ offset the darker ║ 24.02 ║
║ remove the bindings ║ 24.02 ║
║ Wish that Anthony ║ 23.9 ║
║ Add some French ║ 23.5 ║
║ grab a beer ║ 22.82 ║
║ steal the 42 ║ 22.5 ║
║ invoke the spirit ║ 22.12 ║
╚═════════════════════════╩═══════════════╝
╔══════════════════════╦═══════════════╗
║ NC2 ║ Association ║
╠══════════════════════╬═══════════════╣
║ gordon willis ║ 20.74 ║
║ Smoking Barrels ║ 20.74 ║
║ sadahiv amrapurkar ║ 20.74 ║
║ nihilism nothingness ║ 20.74 ║
║ tomato sauce ║ 20.74 ║
║ Picket Fences ║ 20.74 ║
║ deja vu ║ 19.74 ║
║ cargo bay ║ 19.74 ║
║ zoo souvenir ║ 19.16 ║
║ cake frosting ║ 19.16 ║
╚══════════════════════╩═══════════════╝
╔══════════════════════════════╦═══════════════╗
║ ANC2 ║ Association ║
╠══════════════════════════════╬═══════════════╣
║ bite-sized chunks ║ 20.74 ║
║ lizardly snouts ║ 20.74 ║
║ behind-the-scenes featurette ║ 20.74 ║
║ hidebound conservatives ║ 20.74 ║
║ judicious pruning ║ 20.74 ║
║ substantial gauge ║ 19.74 ║
║ haggish airheads ║ 19.74 ║
║ global warming ║ 19.74 ║
║ Ukrainian flags ║ 19.16 ║
║ well-lit sights ║ 19.16 ║
╚══════════════════════════════╩═══════════════╝
╔═══════════════╦═══════════════╗
║ VPC ║ Association ║
╠═══════════════╬═══════════════╣
║ upside down ║ 12.67 ║
║ Stay away ║ 12.49 ║
║ put together. ║ 11.62 ║
║ sit through ║ 10.93 ║
║ ratchet up ║ 10.83 ║
║ shoot'em up ║ 10.83 ║
║ rip off ║ 10.72 ║
║ hunt down ║ 10.67 ║
║ screw up ║ 10.41 ║
║ scorch out ║ 10.4 ║
╚═══════════════╩═══════════════╝
╔══════════════╦═══════════════╗
║ NP ║ Association ║
╠══════════════╬═══════════════╣
║ every penny ║ 12.78 ║
║ THE END ║ 12.07 ║
║ A JOKE ║ 11.79 ║
║ A LOT ║ 11.05 ║
║ Either way ║ 11.03 ║
║ An absolute ║ 10.72 ║
║ half hour ║ 10.65 ║
║ no qualms ║ 10.47 ║
║ every cliche ║ 10.46 ║
║ another user ║ 10.37 ║
╚══════════════╩═══════════════╝
from tabulate import tabulate

# NC: NOUN-NOUN MWEs, e.g. climate change
# JNC: ADJECTIVE-NOUN MWEs, e.g. big shot
mwe = MWE(df=imdb_train, mwe_types=["NC", "JNC"], text_column='text')
# The build_counts method, which creates word occurrence counts, is time
# consuming. Hence, you can run it once and store the counts by setting
# the counts_filename argument.
mwe.build_counts(counts_filename='tmp/counts.json')
# Once the counts are created, extraction of MWEs is fast and can be
# carried out with different parameters. If the optional mwes_filename
# parameter is set, the extracted MWEs will be stored in the
# corresponding file.
mwes_dict = mwe.extract_mwes(counts_filename='tmp/counts.json')
top_mwes_nc = list(mwes_dict['NC'].items())[:10]
print(tabulate(top_mwes_nc, tablefmt="double_outline"))
The above results in a table that looks like the following, containing an ordered list of MWEs:

.. code:: none
╔══════════════════╦═══════╗
║ busby berkeley ║ 11.2 ║
║ burgess meredith ║ 11.13 ║
║ bruno mattei ║ 10.92 ║
║ monty python ║ 10.69 ║
║ ki aag ║ 10.65 ║
║ denise richards ║ 10.63 ║
║ guinea pig ║ 10.52 ║
║ blade runner ║ 10.48 ║
║ domino principle ║ 10.44 ║
║ quantum physics ║ 10.38 ║
╚══════════════════╩═══════╝
Notice how show and actor names such as `busby berkeley`,
`burgess meredith`, and `monty python`, as well as other multiword
concepts such as `quantum physics` and `guinea pig`, are captured
without the need for any labeled data or a supervised model. This can
speed things up and save costs in certain applications.
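The `custom_patterns` argument shown earlier (e.g. ``NP: {<DT>?<JJ>*<NN>}``) uses NLTK-style chunk-grammar syntax: a sequence of POS tags with regex quantifiers. As a toy illustration only (not wordview's actual chunker), a greedy matcher for the pattern ``<DT>?<JJ>*<NN>`` over POS-tagged tokens might look like:

```python
def chunk_np(tagged):
    """Greedy left-to-right matcher for the POS pattern <DT>?<JJ>*<NN>.

    `tagged` is a list of (word, POS) tuples; returns the matched chunks
    as lists of words.
    """
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":     # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":     # required noun head
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1
    return chunks

tagged = [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barks", "VBZ")]
print(chunk_np(tagged))  # [['the', 'big', 'dog']]
```

A bare noun also matches, since both the determiner and the adjectives are optional in the pattern.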


3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,10 +1,11 @@
[tool.poetry]
name = "wordview"
version = "0.4.2"
version = "1.0.0"
description = "Wordview is a Python package for text analysis."
authors = ["meghdadFar <meghdad.farahmand@gmail.com>"]
readme = "README.rst"
include = ["CHANGES.rst"]
exclude = ["notebooks/", "tests/", "data/"]
license = "MIT"
keywords = ["nlp", "text analysis", "statistics"]

44 changes: 44 additions & 0 deletions tests/mwe/test_association_measures.py
@@ -0,0 +1,44 @@
import pytest
import pandas as pd
from wordview.mwes.mwe import PMICalculator


@pytest.fixture
def ngram_counts_dict():
return {
'coffee': 100,
'shop': 150,
'coffee shop': 80,
'swimming': 50,
'pool': 60,
'swimming pool': 40,
'give': 200,
'a': 500,
'speech': 45,
'give a': 150,
'a speech': 40,
'give a speech': 35,
'take': 100,
'deep': 30,
'breath': 25,
'take a': 80,
'a deep': 25,
'deep breath': 20,
'take a deep': 18,
'a deep breath': 17,
'take a deep breath': 15
}
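The fixture above hand-codes ngram counts. For illustration, counts of this shape can be produced from tokenized text with a simple sliding window (a sketch only; wordview's `NgramExtractor` may tokenize and count differently):

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Count all 1..max_n-grams in a token list, keyed by the
    space-joined ngram, matching the shape of the fixture above."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("take a deep breath".split())
# e.g. counts["take a deep breath"] == 1 and counts["deep breath"] == 1
```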


class TestPMICalculator:
def test_compute_association(self, ngram_counts_dict):
calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
ngram = 'coffee shop'
pmi_value = calculator.compute_association(ngram)
assert pmi_value == pytest.approx(3.24, abs=0.1)

def test_compute_association_with_zero_count(self, ngram_counts_dict):
calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
ngram = 'horseback riding'
pmi_value = calculator.compute_association(ngram)
assert pmi_value == float('-inf')
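For reference, a generic bits-based PMI over such counts could look like the sketch below. It is illustrative only: the single `total_tokens` normalization is an assumption, and wordview's `PMICalculator` evidently normalizes differently, since the test above expects about 3.24 for *coffee shop* while this naive version gives about 2.75 using the fixture's unigram total of 1260:

```python
import math

def pmi(pair_count, w1_count, w2_count, total_tokens):
    """PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), estimating each
    probability as count / total_tokens."""
    if pair_count == 0:
        return float("-inf")  # unseen pair, like the zero-count test case
    p_xy = pair_count / total_tokens
    p_x = w1_count / total_tokens
    p_y = w2_count / total_tokens
    return math.log2(p_xy / (p_x * p_y))

# Fixture numbers: coffee=100, shop=150, 'coffee shop'=80; unigram total=1260
print(round(pmi(80, 100, 150, 1260), 2))  # 2.75
```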
