Project Setup

Get Leiningen

https://github.com/technomancy/leiningen#leiningen

Create the project:

~$ lein new ciml

Dependencies

We'll be using Incanter and priority-map. Edit ciml/project.clj and add the following dependencies: org.clojure/data.priority-map "0.0.2", incanter/incanter-core "1.5.4", and incanter/incanter-io "1.5.4". The entire file should look like this:

(defproject ciml "0.1.0-SNAPSHOT"
  :description "A Course in Machine Learning"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.priority-map "0.0.2"]
                 [incanter/incanter-core "1.5.4"]
                 [incanter/incanter-io "1.5.4"]])

Launch the REPL:

~/ciml$ lein repl

From here on out, all of the code in the wiki can be entered into the REPL. The REPL output is shown as semicolon-prefixed Clojure comments, starting with "; =>".
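As a small illustration of this convention, here is a minimal sketch using an arbitrary expression that does not come from the wiki itself. The form is what you type at the prompt, and the comment underneath records what the REPL prints back:

```clojure
;; Type the form at the REPL prompt; the "; =>" comment shows the printed result.
(+ 1 2)
; => 3
```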

A quick look at the dependencies

Priority Map

A priority-map orders its entries by their values (as opposed to sorted-map, which orders entries by keys):

(require '[clojure.data.priority-map :refer :all])
(priority-map "Alice" 38 "Bob" 42 "Charles" 32)
; => {"Charles" 32, "Alice" 38, "Bob" 42}
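For contrast, the sorted-map behaviour mentioned above can be sketched with the same entries. sorted-map is part of clojure.core, so no extra require is needed; the var name by-key is only illustrative.

```clojure
;; sorted-map orders entries by key (here: alphabetically by name),
;; regardless of the order in which they are supplied.
(def by-key (sorted-map "Bob" 42 "Charles" 32 "Alice" 38))

by-key
; => {"Alice" 38, "Bob" 42, "Charles" 32}

;; The first entry is therefore the smallest key, not the smallest value.
(first by-key)
; => ["Alice" 38]
```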

What if we want the map ordered with the greatest value first? A comparator function can be passed to priority-map-by. In this case we'll use the > function:

(priority-map-by > "Alice" 38 "Bob" 42 "Charles" 32)
; => {"Bob" 42, "Alice" 38, "Charles" 32}
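Ordering by value is what lets a priority map double as a priority queue. The sketch below assumes the peek, pop, and assoc behaviour described in the data.priority-map README; the var name ages is only illustrative, and the project dependencies above must be on the classpath.

```clojure
(require '[clojure.data.priority-map :refer [priority-map]])

(def ages (priority-map "Alice" 38 "Bob" 42 "Charles" 32))

;; peek returns the entry with the lowest priority value.
(peek ages)
; => ["Charles" 32]

;; pop returns a new priority map without that entry.
(pop ages)
; => {"Alice" 38, "Bob" 42}

;; assoc on an existing key changes its priority and re-orders the map.
(assoc ages "Bob" 30)
; => {"Bob" 30, "Charles" 32, "Alice" 38}
```

Like other Clojure collections, the priority map is immutable, so ages itself is unchanged by pop and assoc.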

Incanter Datasets

Incanter includes a useful concept called a dataset, which represents tabular data. First require dataset from incanter.core and pprint from clojure.pprint, then create an Incanter dataset and pretty-print it:

(require '[incanter.core :refer [dataset]])
(require '[clojure.pprint :refer [pprint]])

(def people
  (dataset [:name :age :gender]
    [["Alice" 38 :f]
     ["Bob" 42 :m]
     ["Charles" 32 :m]]))

(pprint people)
; => {:column-names [:name :age :gender],
;     :rows
;     ({:gender :f, :age 38, :name "Alice"}
;      {:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"})}

Notice that datasets are composed of native Clojure maps and lists, which means we can use Clojure's built-in functions to manipulate them.

Selecting a single column

Since each row is simply a hashmap, and a keyword is a function that can get a value from a hashmap, we can use map to select a column of the dataset:

(map :age (:rows people))
; => (38 42 32)
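The same idea extends to several columns at once. The sketch below uses a hypothetical rows var, a plain vector of maps standing in for (:rows people), so it runs without Incanter; juxt and select-keys are both part of clojure.core.

```clojure
;; Stand-in for (:rows people): a plain sequence of maps.
(def rows
  [{:name "Alice"   :age 38 :gender :f}
   {:name "Bob"     :age 42 :gender :m}
   {:name "Charles" :age 32 :gender :m}])

;; juxt builds a function that applies each keyword and collects the results.
(map (juxt :name :age) rows)
; => (["Alice" 38] ["Bob" 42] ["Charles" 32])

;; select-keys keeps each row as a map, restricted to the given columns.
(map #(select-keys % [:name :age]) rows)
; => ({:name "Alice", :age 38} {:name "Bob", :age 42} {:name "Charles", :age 32})
```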

Grouping rows

If we want to group rows by the values in a given column, we can use group-by:

(pprint (group-by :gender (:rows people)))
; => {:f [{:gender :f, :age 38, :name "Alice"}],
;     :m
;     [{:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"}]}

To avoid deeply nested calls, use the thread-last macro (->>). The following is equivalent to the expression above:

(->> people :rows (group-by :gender) pprint)
; => {:f [{:gender :f, :age 38, :name "Alice"}],
;     :m
;     [{:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"}]}

Frequencies

Another useful function is frequencies, which returns a map from each distinct item in a collection to the number of times that item appears:

(frequencies [:a :a :b :a :c :b])
; => {:a 3, :b 2, :c 1}

So to find the number of people of each gender:

(->> people :rows (map :gender) frequencies)
; => {:f 1, :m 2}
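To connect this with the earlier grouping example, frequencies can be thought of as grouping equal items and then counting each group. The following is a rough sketch of that equivalence, not how frequencies is actually implemented; genders is a hypothetical stand-in for (->> people :rows (map :gender)), and via-group-by is an illustrative name.

```clojure
;; Stand-in for (->> people :rows (map :gender)).
(def genders [:f :m :m])

;; Group equal items together, then replace each group with its size.
(def via-group-by
  (->> genders
       (group-by identity)
       (map (fn [[item group]] [item (count group)]))
       (into {})))

via-group-by
; => {:f 1, :m 2}

;; The result matches what frequencies computes directly.
(= via-group-by (frequencies genders))
; => true
```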

Continue reading at Chapter 1: Decision Trees
