ml: tools
Collect your experiences here. This should include how to install the tool, what it can do for feature extraction and classification, and your experiences (easy? limited? data import, etc.). Also try to keep this brief.
People learn in different ways. I often like to start with a tutorial, make sure it works, and then gradually migrate from its data and features to my own. Some people like to start with the theory. Others start with the manual and build their own implementation.
Explored by PMR:
Hmm. I thought I had Python installed and wasted 10 minutes before I realised I didn't. I tried pip and only later realised that we now need pip3.
Moral: READ THE DOCS FROM THE START. I found https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-macos (there are similar tutorials for other OS's), which also explained my_env. Note: try to find a recent tutorial; many systems are still fluid.
Time: ca. 20 minutes (includes reading the docs, and checking before issuing commands; typos in installation can be a nightmare).
Easy if done carefully. I now have a virtual Python environment, my_env/.
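For reference, here is a minimal sketch of the virtual-environment commands the tutorial walks through (exact paths and Python versions depend on your install):

```
python3 -m venv my_env          # create the virtual environment my_env/
source my_env/bin/activate      # activate it; the prompt gains a (my_env) prefix
pip3 install --upgrade pip      # keep pip current inside the environment
```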
Scikit-learn (sklearn) is arguably the most widely used Python library for machine learning. It has functionality covering every popular application of ML algorithms, with models ranging from simple supervised methods such as regression, through more complex supervised models such as support vector machines and neural networks, to unsupervised methods such as PCA and k-means clustering.
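One reason the library is so approachable is that all of these models share the same estimator interface: fit, then predict (or inspect the fitted attributes). A minimal sketch, using made-up toy data:

```python
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X = [[0, 0], [1, 1], [2, 2], [3, 3]]  # toy feature matrix
y = [0, 0, 1, 1]                      # toy labels

# Supervised: fit on labelled data, then predict on new points.
clf = SVC()
clf.fit(X, y)
print(clf.predict([[1.5, 1.5]]))

# Unsupervised: fit on the data alone and inspect the result.
km = KMeans(n_clusters=2, n_init=10)
km.fit(X)
print(km.labels_)
```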
This is a standard Python library and can therefore be installed through pip on the command line (note that the package is named scikit-learn, not sklearn):

```
pip3 install scikit-learn
```
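A quick sanity check that the install worked (not from the original notes, just a habit worth having):

```
python3 -c "import sklearn; print(sklearn.__version__)"
```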
Sklearn incorporates many tools to convert image and text files to meaningful numerical representations. All pre-processing is performed with the same basic methodology. A comprehensive list of pre-processing tools available for use can be found here.
- You import and instantiate the method you wish to use to represent your data (in this case, one that converts text to a TF-IDF vector representation):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
```
- You feed this feature-extraction method your data using the `fit_transform` method, and it constructs the representation for you:
```python
mydata = ['mystring1', 'mystring2', 'mystring3']
tfidf_matrix = vectorizer.fit_transform(mydata)
```
The output is a matrix of TF-IDF vectors with one row for every document in your corpus. The same workflow applies to virtually any pre-processing step in this library.
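Putting the two steps together, here is a runnable sketch; the strings are placeholders for real documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per term in the learned vocabulary.
print(tfidf_matrix.shape)                  # (3, 12) for this toy corpus
print(vectorizer.get_feature_names_out())  # the vocabulary, in column order
```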
There is a wide range of classification algorithms available within this toolkit. Links to the more popular models for classification can be found below, with comprehensive descriptions of the mathematics and working principles of these models through the same links; a short end-to-end sketch follows the list.
- Latent Dirichlet Allocation
- Standardization and Normalization
- Classification Metrics (F1 score, precision/recall, accuracy etc.)
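As a minimal end-to-end illustration of training a classifier and computing those metrics (the dataset and model choice here are mine, not prescribed by the links above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# A small built-in dataset, split into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Precision, recall, F1 and accuracy in a single report.
print(classification_report(y_test, clf.predict(X_test)))
```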
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn

Explored by PMR:
- Python 3
- Jupyter Notebook
Time: about 20 minutes. Because you can copy commands straight from the instructions you avoid mistakes, but you can also avoid thinking.
Problems: I was a bit rusty on Notebooks and spent a bit of time working out how to create and rename a Notebook, and how to Run each cell.
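For anyone similarly rusty, the basic commands (assuming the pip3 setup above) are:

```
pip3 install jupyter
jupyter notebook    # opens the Notebook dashboard in your browser
```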