A simple tool that builds a sorted list of the most common words found in a set of pages picked at random from Wikipedia.
To use it, you'll need the wikipedia Python library, installable with:
pip install wikipedia
Run the script with:
wwl.py [OPTIONS]
-h, --help
    Shows the help message.
-p, --pages
    Sets the number of pages to process (default 100).
-l, --lang
    Sets the language of the pages to retrieve (default en).
-s, --special
    Sets the special chars to use as splitters (the space is always used, default \\!\"/()[]{}=?\'<>,;.:-—_+*@#«»).
-m, --min
    Sets the minimum length of the words to process (default 1).
-M, --max
    Sets the maximum length of the words to process (0 for infinity, default 0).
-t, --threads
    Sets the maximum number of threads working simultaneously to retrieve pages (default 1).
-T, --timeout
    Sets the maximum time to wait for the threads to retrieve the pages once the last thread has started (0 for infinity, default 30).
-o, --output
    Specifies the output file location (default output.txt).
-w, --words
    Specifies the maximum number of words to save (0 for infinity, default 0).
-d, --debug
    Shows debug level logs.
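To illustrate how the --special, --min, --max and --words options could interact, here is a minimal sketch of the splitting and counting step. It is not taken from wwl.py: the function name count_words, its parameters, the lowercasing of words, and the choice to split on any whitespace (not only the space) are all assumptions made for the example.

```python
import re
from collections import Counter

# Default splitter characters as listed for --special (written as a Python literal).
DEFAULT_SPECIAL = "\\!\"/()[]{}=?\'<>,;.:-—_+*@#«»"


def count_words(texts, special=DEFAULT_SPECIAL, min_len=1, max_len=0, max_words=0):
    """Return (word, count) pairs sorted from most to least common.

    Hypothetical helper: the names and behaviour are illustrative only.
    """
    # Build one character class from the splitter characters plus whitespace.
    splitter = re.compile("[" + re.escape(special) + r"\s]+")
    counts = Counter()
    for text in texts:
        # Lowercasing is an assumption, so "The" and "the" count as one word.
        for word in splitter.split(text.lower()):
            # Skip empty fragments and words shorter than the minimum.
            if len(word) < max(min_len, 1):
                continue
            # A max_len of 0 means "no upper limit".
            if max_len and len(word) > max_len:
                continue
            counts[word] += 1
    # A max_words of 0 means "keep everything".
    return counts.most_common(max_words or None)
```

For instance, count_words(pages, min_len=8, max_words=100) would keep the 100 most common words of 8 or more characters, where pages is any iterable of page texts.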
wwl.py -p 1000 -l it -m 8 -t 50 -w 100
Saves the 100 most common words of 8 or more characters out of 1000 Italian pages, using 50 threads.
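The retrieval side of a run like this can be sketched as follows. This is only an approximation under stated assumptions, not the script's actual code: fetch_random_page and fetch_pages are hypothetical names, a thread pool stands in for whatever threading wwl.py uses, and the timeout here bounds the whole wait rather than the time since the last thread started. The wikipedia calls used (set_lang, random, page) are part of the wikipedia library; cancel_futures needs Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fetch_random_page(lang="en"):
    """Fetch the text of one random page (needs the wikipedia library and network access)."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it

    wikipedia.set_lang(lang)
    title = wikipedia.random(pages=1)
    return wikipedia.page(title).content


def fetch_pages(pages=100, threads=1, timeout=30, fetch=fetch_random_page):
    """Call `fetch` `pages` times on up to `threads` workers and collect the results.

    Hypothetical helper: names and behaviour are illustrative only.
    """
    pool = ThreadPoolExecutor(max_workers=threads)
    futures = [pool.submit(fetch) for _ in range(pages)]
    # A timeout of 0 becomes None, which waits without limit.
    done, _not_done = wait(futures, timeout=timeout or None)
    # Cancel work that has not started and return without blocking on stragglers.
    pool.shutdown(wait=False, cancel_futures=True)
    texts = []
    for future in done:
        try:
            texts.append(future.result())
        except Exception:
            # Pages that fail to load (for example disambiguation pages) are skipped.
            pass
    return texts
```

To fetch Italian pages with 50 workers, one could call fetch_pages(pages=1000, threads=50, fetch=lambda: fetch_random_page("it")). Passing the fetch function as a parameter also makes it easy to try the sketch offline with a stub.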
WikipediaWordList is released under the Apache License 2.0.