ml: tools
Collect your experiences here. This should include how to install the tool, what it can do for feature extraction and classification, and your experiences (easy? limited? data import, etc.). Also try to keep this brief.
People learn in different ways. I often like to start with a tutorial, make sure it works, and then gradually migrate from its data and features to my own. Some people like to start with the theory. Others start with the manual and build their own implementation.
Explored by PMR:
Hmm. I thought I had Python installed and wasted 10 minutes before I realised I didn't. I tried pip and only later realised that we now need pip3.
Moral: READ THE DOCS FROM THE START. I found https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-macos (there are similar tutorials for other OS's), which also explained my_env. Note: try to find a recent tutorial; many systems are still fluid.
Time: ca. 20 minutes (includes reading the docs, and checking before issuing commands; typos in installation can be a nightmare).
Easy if done carefully. I now have a virtual Python environment, my_env/.
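For reference, here is a minimal sketch of the virtual-environment commands the tutorial walks through (exact paths and Python versions depend on your install):

```
python3 -m venv my_env          # create the virtual environment my_env/
source my_env/bin/activate      # activate it; the prompt gains a (my_env) prefix
pip3 install --upgrade pip      # keep pip current inside the environment
```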
Scikit-learn (sklearn) is arguably the most widely used Python library for machine learning. It has functionality covering every popular application of ML algorithms, with models ranging from simple supervised methods such as regression, through more complex supervised models such as support vector machines and neural networks, to unsupervised methods such as PCA and k-means clustering.
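One reason the library is so approachable is that all of these models share the same estimator interface: fit, then predict (or inspect the fitted attributes). A minimal sketch, using made-up toy data:

```python
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X = [[0, 0], [1, 1], [2, 2], [3, 3]]  # toy feature matrix
y = [0, 0, 1, 1]                      # toy labels

# Supervised: fit on labelled data, then predict on new points.
clf = SVC()
clf.fit(X, y)
print(clf.predict([[1.5, 1.5]]))

# Unsupervised: fit on the data alone and inspect the result.
km = KMeans(n_clusters=2, n_init=10)
km.fit(X)
print(km.labels_)
```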
This is a standard Python library and can therefore be installed through pip on the command line (note that the package is named scikit-learn, not sklearn):

```
pip3 install scikit-learn
```
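A quick sanity check that the install worked (not from the original notes, just a habit worth having):

```
python3 -c "import sklearn; print(sklearn.__version__)"
```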
Sklearn incorporates many tools to convert image and text files to meaningful numerical representations. All pre-processing is performed with the same basic methodology. A comprehensive list of pre-processing tools available for use can be found here.
- You import and instantiate the method you wish to use to represent your data (in this case, one that converts text to a TF-IDF vector representation):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
```
- You feed this feature-extraction method your data using the `fit_transform` method, and it constructs the representation for you:
```python
mydata = ['mystring1', 'mystring2', 'mystring3']
tfidf_matrix = vectorizer.fit_transform(mydata)
```
The output is a matrix of TF-IDF vectors with one row for every document in your corpus. The same workflow applies to virtually any pre-processing step in this library.
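Putting the two steps together, here is a runnable sketch; the strings are placeholders for real documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per term in the learned vocabulary.
print(tfidf_matrix.shape)                  # (3, 12) for this toy corpus
print(vectorizer.get_feature_names_out())  # the vocabulary, in column order
```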
There is a wide range of classification algorithms available within this toolkit. Links to the more popular models for classification can be found below, with comprehensive descriptions of the mathematics and working principles of these models through the same links; a short end-to-end sketch follows the list.
- Latent Dirichlet Allocation
- Standardization and Normalization
- Classification Metrics (F1 score, precision/recall, accuracy etc.)
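As a minimal end-to-end illustration of training a classifier and computing those metrics (the dataset and model choice here are mine, not prescribed by the links above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# A small built-in dataset, split into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Precision, recall, F1 and accuracy in a single report.
print(classification_report(y_test, clf.predict(X_test)))
```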
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn

Explored by PMR:
- Python 3
- Jupyter Notebook
Time: about 20 minutes. Because you can copy commands straight from the instructions you avoid mistakes, but you can also avoid thinking.
Problems: I was a bit rusty on Notebooks and spent a bit of time working out how to create and rename a Notebook, and how to Run each cell.
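For anyone similarly rusty, the basic commands (assuming the pip3 setup above) are:

```
pip3 install jupyter
jupyter notebook    # opens the Notebook dashboard in your browser
```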