irutupatel/Text-Processing-and-indexing: A data mining project that preprocesses input text and indexes it using a multimap data structure, preparing the data for more comprehensive analysis

Description:

This project walks through a simplified workflow for preparing and indexing text ahead of more comprehensive analysis. First, the preprocessor cleans a given text file; then the indexer indexes the unique words using a map data structure.
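The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocessor.py or Indexer.py: the function names `preprocess_line` and `build_index` are hypothetical, the cleaning rules (lowercasing, stripping punctuation, dropping stopwords) are assumptions, and a `collections.defaultdict` stands in for the project's own map classes.

```python
import string
from collections import defaultdict


def preprocess_line(line, stopwords):
    """Lowercase a line, strip ASCII punctuation, and drop stopwords.

    These cleaning rules are assumptions made for illustration.
    """
    table = str.maketrans("", "", string.punctuation)
    words = line.lower().translate(table).split()
    return [word for word in words if word not in stopwords]


def build_index(lines, stopwords):
    """Build a multimap from each unique word to the line numbers it occurs on.

    Line numbers are 1-based. A word that appears twice on one line is
    recorded twice, so the list length equals the word's total occurrences.
    """
    index = defaultdict(list)  # stand-in for the project's map classes
    for line_number, line in enumerate(lines, start=1):
        for word in preprocess_line(line, stopwords):
            index[word].append(line_number)
    return index
```

Keeping one entry per occurrence, rather than one per line, means the same structure can answer both "which lines contain this word" and "how often does it occur".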

Dependencies:

avl_tree.py
unsorted_table_map.py
sorted_table_map.py
chain_hash_map.py
probe_hash_map.py
splay_tree.py
red_black_tree.py
binary_search_tree.py
binary_tree.py
Empty.py
hash_map_base.py
linked_binary_tree.py
linked_queue.py
map_base.py
tree.py

Original:

project4.py
preprocessor.py
Indexer.py

Requirements:

Python 3

Imported modules:

time
collections
argparse
sys

Needed Input files:

original text
preprocessed text
optional typeofMap (--map; the map type used for indexing)
optional indexFile (--index; name of the index file to write)

Run as:

python3 project4.py [-h] [--map MAP] [--index INDEX] original preprocessed
python3 preprocessor.py [-h] [--output OUTPUT] input stopwords
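For reference, a command line of the shape shown for project4.py could be declared with argparse roughly like this. It is a sketch reconstructed from the usage line only: the help strings, the defaults, and the function name `build_parser` are assumptions, and the map names that the real script accepts are not documented here.

```python
import argparse


def build_parser():
    """Build a parser matching: project4.py [-h] [--map MAP] [--index INDEX] original preprocessed."""
    parser = argparse.ArgumentParser(
        prog="project4.py",
        description="Index a preprocessed text file and search it interactively.",
    )
    # Positional arguments, in the order given by the usage line.
    parser.add_argument("original", help="path to the original text file")
    parser.add_argument("preprocessed", help="path to the preprocessed text file")
    # Optional arguments; a default of None is an assumption.
    parser.add_argument("--map", default=None, help="type of map to use for indexing")
    parser.add_argument("--index", default=None, help="name of the index file to write")
    return parser
```

For example, `build_parser().parse_args(["--index", "out.txt", "book.txt", "book_pre.txt"])` yields a namespace with `index="out.txt"`, `original="book.txt"`, and `preprocessed="book_pre.txt"` (the file names here are placeholders).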

Operation:

After indexing, the user interface (UI) prints statistics such as how long the chosen map took to build the index, how many words were indexed, and the average and median word frequency. The user is then asked whether they want to search for a word. If the word is not found, the user is asked to enter a different search word. If it is found, every original line containing the word is printed, along with the lookup time and the total number of occurrences of the word. Finally, the user is asked whether they want to quit: answering no starts another search, and answering yes exits the program.
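The statistics and the lookup described above might be computed along these lines. This is a hedged sketch, assuming the index maps each word to a list of 1-based line numbers with one entry per occurrence; the function names `frequency_stats` and `lookup` are hypothetical, and the `statistics` module is used for brevity even though it is not among the imported modules listed above.

```python
import time
from statistics import mean, median


def frequency_stats(index):
    """Summarize an index that maps word -> list of line numbers."""
    counts = [len(line_numbers) for line_numbers in index.values()]
    if not counts:  # avoid StatisticsError on an empty index
        return {"words_indexed": 0, "average_frequency": 0, "median_frequency": 0}
    return {
        "words_indexed": len(counts),
        "average_frequency": mean(counts),
        "median_frequency": median(counts),
    }


def lookup(index, original_lines, word):
    """Return the matching original lines, occurrence count, and lookup time.

    Returns None when the word is not in the index, so the caller can
    prompt for a different word.
    """
    start = time.perf_counter()
    line_numbers = index.get(word)
    elapsed = time.perf_counter() - start
    if line_numbers is None:
        return None
    unique_lines = sorted(set(line_numbers))  # print each line only once
    return {
        "occurrences": len(line_numbers),
        "lines": [original_lines[n - 1] for n in unique_lines],
        "lookup_seconds": elapsed,
    }
```

The interactive loop would then call `lookup` repeatedly, asking for a new word whenever it returns `None` and stopping once the user chooses to quit.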

Output:

Writing an index file of all the words is optional. If one is requested, a new index file with the name the user supplied is written. Each line of the file starts with a unique word (the key), followed by the line numbers on which that word occurs (the values).
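The index file layout described above could be produced as follows. This is a sketch under assumptions: the single-space separator, the alphabetical ordering of words, and the function names `format_index` and `write_index` are illustrative choices, not details confirmed by the project.

```python
def format_index(index):
    """Return one text line per word: the word, then its line numbers.

    Words are sorted alphabetically and fields are separated by single
    spaces; both choices are assumptions.
    """
    formatted = []
    for word, line_numbers in sorted(index.items()):
        numbers = " ".join(str(n) for n in line_numbers)
        formatted.append(f"{word} {numbers}")
    return formatted


def write_index(index, path):
    """Write the formatted index to the file at `path`, one word per line."""
    with open(path, "w", encoding="utf-8") as index_file:
        for line in format_index(index):
            index_file.write(line + "\n")
```

With an index of `{"sat": [1], "cat": [1, 2, 2]}`, `format_index` returns `["cat 1 2 2", "sat 1"]`.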
