* Add patterns
* Add chunker
* Separate patterns for DE and EN in two classes
* Refactor
* Move counts to separate module
* Move the count module to preprocessing, as counting is needed also for text analysis
* Restructure classes
* Refactor and add some docstrings
* Add example
* More docstrings
* Make extract_mwe_candidates private
* Fix import
* Separate module for association measures; initiate with PMI
* Play around with class hierarchy
* Make prob() private
* Clean up classes
* Move unnecessary classes to functions
* Write count dict to file
* Make n in n-gram a parameter
* Warnings for bad entries
* Move DataFrame reader to a separate module
* Docstrings
* Change pickle to json
* Test end to end
* Add PMI association measure
* Move DataFrameReader out of this module
* Remove unnecessary counting of the MWE matches at sentence level
* Rename pattern classes
* Rename MWE pattern classes
* Clean up
* Add arg for custom patterns
* Rename to better names
* Add tqdm
* Add table output and sort and top_n options
* Update docs with latest API changes
* Show two decimal values
* Remove language version
* Format code and apply corresponding fixes
* Format code and apply corresponding fixes
* Remove legacy MWE
* Format code and apply corresponding fixes
* Update docs
* Remove main
* Import NgramExtractor
* Remove main
* Update docs with ngram extraction
* Remove unused imports
* Rename corpus to df
* Test association measure
* Format, docstring and fix issues
* Fix wrongly placed error
* Tests for MWE
* Bump version
* Add sample dataset
* Exclude unnecessary files
1 parent e7935fd, commit 61fcabc. Showing 15 changed files with 2,759 additions and 413 deletions.
Multiword Expressions (MWEs)
############################

Multiword Expressions (MWEs) are phrases that can be treated as a single
semantic unit, e.g. *swimming pool* and *climate change*. MWEs have
applications in different areas including parsing, language generation,
language modeling, terminology extraction, and topic models.

Wordview can extract different types of MWEs from a text corpus in any of the
supported languages. By default, Wordview extracts the following types of MWEs:
Light Verb Constructions (LVCs), two- and three-word Noun Compounds (NCs),
two- and three-word Adjective-Noun Compounds (ANCs), and Verb-Noun Compounds
(VNCs). However, you can specify other types of MWEs to extract using the
`custom_patterns` argument. For more details, see the documentation.

.. code:: python

    # First we need to extract ngrams from the corpus.
    # If this was not done previously, e.g. when running other functions of
    # Wordview, you can do it as follows:
    import pandas as pd

    from wordview.preprocessing import NgramExtractor

    imdb_train = pd.read_csv("data/IMDB_Dataset_sample.csv")
    extractor = NgramExtractor(imdb_train, "review")
    extractor.extract_ngrams()
    extractor.get_ngram_counts(ngram_count_file_path="data/ngram_counts.json")

    # Now we can extract MWEs
    import json

    from wordview.mwes import MWE

    mwe_obj = MWE(imdb_train, 'review',
                  ngram_count_file_path='data/ngram_counts.json',
                  language='EN',
                  custom_patterns="NP: {<DT>?<JJ>*<NN>}",
                  only_custom_patterns=False,
                  )
    mwes = mwe_obj.extract_mwes(sort=True, top_n=10)
    json.dump(mwes, open('data/mwes.json', 'w'), indent=4)
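The exact contents of `ngram_counts.json` are produced by Wordview's `NgramExtractor`, but conceptually the file is a flat mapping from n-grams to their corpus frequencies. The following is a minimal standard-library sketch of that kind of counting, for illustration only; it is not Wordview's implementation.

```python
from collections import Counter

def ngram_counts(tokens, max_n=2):
    """Count all 1..max_n-grams in a token list (illustrative sketch)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "the swimming pool near the swimming pool".split()
counts = ngram_counts(tokens, max_n=2)
print(counts["swimming pool"])  # 2
```

A flat string-keyed mapping like this is convenient because unigram and bigram frequencies can be looked up with the same key scheme, which is all an association measure such as PMI needs.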
`extract_mwes` returns the results in a dictionary, which in this example we
stored in the `mwes.json` file. You can also print the results as a table:

.. code:: python

    mwe_obj.print_mwe_table()

.. code::

    ╔═════════════════════════╦═══════════════╗
    ║ LVC                     ║   Association ║
    ╠═════════════════════════╬═══════════════╣
    ║ SHOOT the binding       ║         26.02 ║
    ║ achieve this elusive    ║          24.7 ║
    ║ manipulate the wildlife ║         24.44 ║
    ║ offset the darker       ║         24.02 ║
    ║ remove the bindings     ║         24.02 ║
    ║ Wish that Anthony       ║          23.9 ║
    ║ Add some French         ║          23.5 ║
    ║ grab a beer             ║         22.82 ║
    ║ steal the 42            ║          22.5 ║
    ║ invoke the spirit       ║         22.12 ║
    ╚═════════════════════════╩═══════════════╝

    ╔══════════════════════╦═══════════════╗
    ║ NC2                  ║   Association ║
    ╠══════════════════════╬═══════════════╣
    ║ gordon willis        ║         20.74 ║
    ║ Smoking Barrels      ║         20.74 ║
    ║ sadahiv amrapurkar   ║         20.74 ║
    ║ nihilism nothingness ║         20.74 ║
    ║ tomato sauce         ║         20.74 ║
    ║ Picket Fences        ║         20.74 ║
    ║ deja vu              ║         19.74 ║
    ║ cargo bay            ║         19.74 ║
    ║ zoo souvenir         ║         19.16 ║
    ║ cake frosting        ║         19.16 ║
    ╚══════════════════════╩═══════════════╝

    ╔══════════════════════════════╦═══════════════╗
    ║ ANC2                         ║   Association ║
    ╠══════════════════════════════╬═══════════════╣
    ║ bite-sized chunks            ║         20.74 ║
    ║ lizardly snouts              ║         20.74 ║
    ║ behind-the-scenes featurette ║         20.74 ║
    ║ hidebound conservatives      ║         20.74 ║
    ║ judicious pruning            ║         20.74 ║
    ║ substantial gauge            ║         19.74 ║
    ║ haggish airheads             ║         19.74 ║
    ║ global warming               ║         19.74 ║
    ║ Ukrainian flags              ║         19.16 ║
    ║ well-lit sights              ║         19.16 ║
    ╚══════════════════════════════╩═══════════════╝

    ╔═══════════════╦═══════════════╗
    ║ VPC           ║   Association ║
    ╠═══════════════╬═══════════════╣
    ║ upside down   ║         12.67 ║
    ║ Stay away     ║         12.49 ║
    ║ put together. ║         11.62 ║
    ║ sit through   ║         10.93 ║
    ║ ratchet up    ║         10.83 ║
    ║ shoot'em up   ║         10.83 ║
    ║ rip off       ║         10.72 ║
    ║ hunt down     ║         10.67 ║
    ║ screw up      ║         10.41 ║
    ║ scorch out    ║          10.4 ║
    ╚═══════════════╩═══════════════╝

    ╔══════════════╦═══════════════╗
    ║ NP           ║   Association ║
    ╠══════════════╬═══════════════╣
    ║ every penny  ║         12.78 ║
    ║ THE END      ║         12.07 ║
    ║ A JOKE       ║         11.79 ║
    ║ A LOT        ║         11.05 ║
    ║ Either way   ║         11.03 ║
    ║ An absolute  ║         10.72 ║
    ║ half hour    ║         10.65 ║
    ║ no qualms    ║         10.47 ║
    ║ every cliche ║         10.46 ║
    ║ another user ║         10.37 ║
    ╚══════════════╩═══════════════╝
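The NP results come from the custom pattern `NP: {<DT>?<JJ>*<NN>}` shown earlier: an optional determiner, any number of adjectives, then a noun. The snippet below emulates how such an NLTK-style chunk grammar matches a POS-tag sequence; the regex is a hand-rolled stand-in for illustration, not Wordview's actual matcher.

```python
import re

# POS-tagged tokens for "an absolute joke ends" (illustrative input)
tagged = [("an", "DT"), ("absolute", "JJ"), ("joke", "NN"), ("ends", "VBZ")]

# Flatten the tag sequence so a regex can scan it: "DT JJ NN VBZ "
tags = " ".join(tag for _, tag in tagged) + " "

# <DT>?<JJ>*<NN>  ->  optional determiner, any adjectives, then a noun
match = re.search(r"(?:DT )?(?:JJ )*NN ", tags)
print(match.group().split())  # ['DT', 'JJ', 'NN']
```

Matching over tag sequences rather than surface words is what lets a single pattern capture both `An absolute` and `every penny` style phrases.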
Notice how many interesting entities are captured
without the need for any labeled data or a supervised model.
This can speed things up and save costs in certain applications.
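For programmatic filtering, the dictionary returned by `extract_mwes` (saved to `mwes.json` above) can be post-processed like any nested mapping. The structure sketched below, `{mwe_type: {expression: association_score}}`, is inferred from the tables above rather than taken from documented guarantees, so treat it as an assumption.

```python
# Hypothetical sample mimicking the assumed output shape of extract_mwes:
# {mwe_type: {expression: association_score}}
mwes = {
    "NC2": {"tomato sauce": 20.74, "deja vu": 19.74, "cargo bay": 19.74},
    "VPC": {"upside down": 12.67, "rip off": 10.72},
}

# Keep only noun compounds whose association score clears a threshold
strong_ncs = {mwe: score for mwe, score in mwes["NC2"].items() if score >= 19.74}
print(sorted(strong_ncs))  # ['cargo bay', 'deja vu', 'tomato sauce']
```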
import pytest
import pandas as pd

from wordview.mwes.mwe import PMICalculator


@pytest.fixture
def ngram_counts_dict():
    return {
        'coffee': 100,
        'shop': 150,
        'coffee shop': 80,
        'swimming': 50,
        'pool': 60,
        'swimming pool': 40,
        'give': 200,
        'a': 500,
        'speech': 45,
        'give a': 150,
        'a speech': 40,
        'give a speech': 35,
        'take': 100,
        'deep': 30,
        'breath': 25,
        'take a': 80,
        'a deep': 25,
        'deep breath': 20,
        'take a deep': 18,
        'a deep breath': 17,
        'take a deep breath': 15,
    }


class TestPMICalculator:
    def test_compute_association(self, ngram_counts_dict):
        calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
        ngram = 'coffee shop'
        pmi_value = calculator.compute_association(ngram)
        assert pmi_value == pytest.approx(3.24, abs=0.1)

    def test_compute_association_with_zero_count(self, ngram_counts_dict):
        calculator = PMICalculator(ngram_count_source=ngram_counts_dict)
        ngram = 'horseback riding'
        pmi_value = calculator.compute_association(ngram)
        assert pmi_value == float('-inf')
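The expected values in these tests follow from standard pointwise mutual information over the fixture's counts. As a standalone illustration, PMI can be sketched as below; the normalization totals are made-up assumptions, so the resulting numbers are illustrative rather than the exact values `PMICalculator` produces.

```python
import math

def pmi(bigram, counts, unigram_total, bigram_total):
    """Standard PMI: log2(p(xy) / (p(x) * p(y))). Totals are assumed inputs."""
    w1, w2 = bigram.split()
    if counts.get(bigram, 0) == 0:
        # Unseen bigram: p(xy) = 0, so PMI diverges to negative infinity,
        # matching the zero-count test above.
        return float('-inf')
    p_xy = counts[bigram] / bigram_total
    p_x = counts[w1] / unigram_total
    p_y = counts[w2] / unigram_total
    return math.log2(p_xy / (p_x * p_y))

counts = {'coffee': 100, 'shop': 150, 'coffee shop': 80}
print(round(pmi('coffee shop', counts, unigram_total=1000, bigram_total=500), 2))  # 3.42
print(pmi('horseback riding', counts, unigram_total=1000, bigram_total=500))  # -inf
```

A high PMI means the two words co-occur far more often than their individual frequencies would predict, which is why it surfaces collocations like *swimming pool* rather than merely frequent pairs.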