File Language Analyzer is a suite of Python modules that provides objects, constants and functions to recognise the language of a file, analyze its content, and work with (read and generate) .csv letter frequency tables.
Keep in mind that the code quality of this project is poor; the logic behind the adopted method, however, is interesting.
- Recognise the language of a file
- Convert .csv frequency table to Python dictionary
- Convert Python dictionary to .csv frequency table
- Generate frequency table starting from a set of Twitter messages
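The two conversion features can be illustrated with a minimal sketch. It assumes a table layout with one row per letter and one column per language, plus a header row; the function names `csv_to_dict` and `dict_to_csv` and the `letter` header are hypothetical and may not match the project's modules.

```python
import csv
import io


def csv_to_dict(csv_file):
    """Read a frequency table into {language: {letter: frequency}}.

    Assumed layout: a header row, the first column holding the letter,
    and every other column holding one language's frequencies.
    """
    reader = csv.DictReader(csv_file)
    letter_column = reader.fieldnames[0]
    table = {lang: {} for lang in reader.fieldnames[1:]}
    for row in reader:
        for lang in table:
            table[lang][row[letter_column]] = float(row[lang])
    return table


def dict_to_csv(table, csv_file):
    """Write {language: {letter: frequency}} using the same assumed layout."""
    languages = list(table)
    letters = sorted({letter for freqs in table.values() for letter in freqs})
    writer = csv.writer(csv_file)
    writer.writerow(["letter"] + languages)
    for letter in letters:
        # A letter missing for a language is written as frequency 0.0.
        writer.writerow([letter] + [table[lang].get(letter, 0.0) for lang in languages])


# Round trip through an in-memory buffer; the values are placeholders, not real data.
toy_table = {"lang_a": {"a": 0.5, "b": 0.25}, "lang_b": {"a": 0.25, "b": 0.5}}
buffer = io.StringIO()
dict_to_csv(toy_table, buffer)
buffer.seek(0)
restored = csv_to_dict(buffer)
```

With a real file, the buffer would be replaced by something like `open(path, newline="")`, as the `csv` module documentation recommends.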
By analyzing the frequency of every single letter, it is possible to detect the language of a given text.
Once the characters' frequencies have been extracted, this information can be used as a representation of the text.
To find out its language, we have to determine which column of the frequency table has the nearest values.
This can be done with the Pythagorean theorem extended to 26 dimensions (the number of letters in the Latin alphabet), that is, the Euclidean distance between two 26-dimensional frequency vectors.
By computing the distance between the given text and each language in the table, it is possible to determine the nearest language.
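The method described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual code: the function names (`letter_frequencies`, `distance`, `nearest_language`) are hypothetical, only the 26 unaccented letters a-z are counted, and the toy table is built from two sample sentences rather than from a real frequency table.

```python
import math
import string
from collections import Counter

LETTERS = string.ascii_lowercase  # the 26 letters of the Latin alphabet


def letter_frequencies(text):
    """Return the relative frequency of each of the 26 letters in `text`."""
    counts = Counter(ch for ch in text.lower() if ch in LETTERS)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: return an all-zero vector instead of dividing by zero.
        return {letter: 0.0 for letter in LETTERS}
    return {letter: counts[letter] / total for letter in LETTERS}


def distance(freq_a, freq_b):
    """Euclidean distance between two 26-dimensional frequency vectors."""
    return math.sqrt(sum((freq_a[l] - freq_b[l]) ** 2 for l in LETTERS))


def nearest_language(text, table):
    """Return the language in `table` whose frequencies are closest to `text`.

    `table` maps a language name to a {letter: frequency} dictionary.
    """
    text_freq = letter_frequencies(text)
    return min(table, key=lambda lang: distance(text_freq, table[lang]))


# Toy table built from single sentences, for demonstration only.
toy_table = {
    "english": letter_frequencies("the quick brown fox jumps over the lazy dog"),
    "italian": letter_frequencies("la volpe veloce salta sopra il cane pigro"),
}
guess = nearest_language("the quick brown fox jumps over the lazy dog", toy_table)
```

A table built from single sentences is far too small to be reliable; in practice the frequencies would come from a large corpus.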
- Python 3.x
- Python standard library modules
- Twitter API, accessed through the tweepy library
- wikipedia-api module
- Flask
Use one of the following commands (according to the configuration of your environment):
$ pip install -r requirements.txt
or
$ py -m pip install -r requirements.txt
If you are in a Bash-like environment with Python installed, you can run the program directly by typing:
$ ./Main.py
Otherwise, depending on your Python interpreter installation and your OS:
$ python Main.py
or
$ py Main.py
After that, go to http://127.0.0.1:5000 or http://localhost:5000 and try out the web interface.
The default frequency table is letters_frequency_twitter.csv.
If you want to use the functions in tweetrain.py, you have to insert your personal Twitter tokens.
Look at the first four uppercase variables in that file and fill in the double quotes with the proper values.