Project Setup
Install Leiningen (https://github.com/technomancy/leiningen#leiningen), then create a new project named ciml:
~$ lein new ciml
We'll be using Incanter and priority-map. Edit ciml/project.clj and add the following dependencies: org.clojure/data.priority-map "0.0.2", incanter/incanter-core "1.5.4", and incanter/incanter-io "1.5.4". The entire file should look like this:
(defproject ciml "0.1.0-SNAPSHOT"
  :description "A Course in Machine Learning"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.priority-map "0.0.2"]
                 [incanter/incanter-core "1.5.4"]
                 [incanter/incanter-io "1.5.4"]])
~/ciml$ lein repl
From here on, all of the code in the wiki can be entered into the REPL. REPL output is shown as semicolon-prefixed Clojure comments starting with "; =>".
A priority-map orders its entries by their values (as opposed to sorted-map, which orders entries by keys):
(require '[clojure.data.priority-map :refer :all])
(priority-map "Alice" 38 "Bob" 42 "Charles" 32)
; => {"Charles" 32, "Alice" 38, "Bob" 42}
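For contrast, here is a quick sketch of the same entries in a sorted-map, which ships with clojure.core. The entries come back ordered by key (alphabetically here), regardless of their values. The var name by-name is only illustrative:

```clojure
;; sorted-map orders entries by key, not by value.
(def by-name (sorted-map "Bob" 42 "Alice" 38 "Charles" 32))

by-name
; => {"Alice" 38, "Bob" 42, "Charles" 32}

;; The first entry is the one with the smallest key.
(first by-name)
; => ["Alice" 38]
```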
What if we want the map ordered with the greatest value first? A comparator function can be passed to priority-map-by. In this case we'll use the > function:
(priority-map-by > "Alice" 38 "Bob" 42 "Charles" 32)
; => {"Bob" 42, "Alice" 38, "Charles" 32}
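Because a priority-map keeps its entries sorted by value, it can also serve as a priority queue. The sketch below assumes the behaviour described in the data.priority-map README: peek returns the first entry, pop drops it, and assoc on an existing key moves that key within the ordering. The var name queue is illustrative only:

```clojure
(require '[clojure.data.priority-map :refer [priority-map priority-map-by]])

(def queue (priority-map "Alice" 38 "Bob" 42 "Charles" 32))

;; peek returns the entry with the lowest value as a [key value] pair.
(peek queue)
; => ["Charles" 32]

;; pop returns a new map without that entry; queue itself is unchanged.
(keys (pop queue))
; => ("Alice" "Bob")

;; Re-associating a key with a new value re-sorts it.
(peek (assoc queue "Bob" 30))
; => ["Bob" 30]

;; With a custom comparator, peek follows that ordering instead.
(peek (priority-map-by > "Alice" 38 "Bob" 42 "Charles" 32))
; => ["Bob" 42]
```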
Incanter includes a nice concept called a dataset, which represents tabular data. First require dataset and pprint, then create an Incanter dataset and pretty-print it:
(require '[incanter.core :refer [dataset]])
(require '[clojure.pprint :refer [pprint]])
(def people
  (dataset [:name :age :gender]
           [["Alice" 38 :f]
            ["Bob" 42 :m]
            ["Charles" 32 :m]]))
(pprint people)
; => {:column-names [:name :age :gender],
;     :rows
;     ({:gender :f, :age 38, :name "Alice"}
;      {:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"})}
Notice that datasets are composed of native Clojure maps and lists, which means we can use Clojure's built-in functions to manipulate them.
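As the printed output suggests, the column names and rows sit under the :column-names and :rows keys. The following is a small sketch that assumes the Incanter 1.5.4 representation shown in that output; other Incanter versions may structure a dataset differently:

```clojure
(require '[incanter.core :refer [dataset]])

(def people
  (dataset [:name :age :gender]
           [["Alice" 38 :f]
            ["Bob" 42 :m]
            ["Charles" 32 :m]]))

;; Keyword lookup works on the dataset itself.
(:column-names people)
; => [:name :age :gender]

(count (:rows people))
; => 3

;; Each row is a plain map, so ordinary sequence functions apply.
(map :name (filter #(> (:age %) 35) (:rows people)))
; => ("Alice" "Bob")
```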
Since each row is simply a hashmap, and a keyword is a function that can get a value from a hashmap, we can use map to select a column of the dataset:
(map :age (:rows people))
; => (38 42 32)
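Once a column is an ordinary sequence, the usual sequence functions apply to it. The sketch below uses a literal vector of row maps as a stand-in for (:rows people), so it runs without Incanter; the names rows and ages are illustrative:

```clojure
;; Stand-in for (:rows people): a plain vector of row maps.
(def rows
  [{:name "Alice"   :age 38 :gender :f}
   {:name "Bob"     :age 42 :gender :m}
   {:name "Charles" :age 32 :gender :m}])

(def ages (map :age rows))

;; Sum and maximum of the column.
(reduce + ages)
; => 112

(apply max ages)
; => 42

;; juxt builds a function that applies several functions to one argument,
;; which selects more than one column per row.
(map (juxt :name :age) rows)
; => (["Alice" 38] ["Bob" 42] ["Charles" 32])
```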
If we want to group rows by the values in a given column, we can use group-by:
(pprint (group-by :gender (:rows people)))
; => {:f [{:gender :f, :age 38, :name "Alice"}],
;     :m
;     [{:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"}]}
To avoid deep nesting, use the thread-last macro (->>), which passes each result as the last argument of the next form. This is equivalent to the above expression:
(->> people :rows (group-by :gender) pprint)
; => {:f [{:gender :f, :age 38, :name "Alice"}],
;     :m
;     [{:gender :m, :age 42, :name "Bob"}
;      {:gender :m, :age 32, :name "Charles"}]}
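The benefit of threading grows with the length of the pipeline. Here is a minimal sketch using a literal vector of row maps in place of (:rows people), so it runs without Incanter; the names rows, nested, threaded, and names-by-gender are illustrative:

```clojure
(def rows
  [{:name "Alice"   :age 38 :gender :f}
   {:name "Bob"     :age 42 :gender :m}
   {:name "Charles" :age 32 :gender :m}])

;; Nested form: read from the inside out.
(def nested (group-by :gender rows))

;; Threaded form: read left to right.
(def threaded (->> rows (group-by :gender)))

(= nested threaded)
; => true

;; A longer pipeline: the names of the people in each group.
(def names-by-gender
  (->> rows
       (group-by :gender)
       (map (fn [[gender members]] [gender (mapv :name members)]))
       (into {})))

names-by-gender
; => {:f ["Alice"], :m ["Bob" "Charles"]}
```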
Another nice function is frequencies, which returns a map from each distinct item in a collection to the number of times that item appears:
(frequencies [:a :a :b :a :c :b])
; => {:a 3, :b 2, :c 1}
So to find the number of people of each gender:
(->> people :rows (map :gender) frequencies)
; => {:f 1, :m 2}
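One way to think about frequencies is as group-by followed by a count of each group. The sketch below checks that equivalence on a literal vector of genders, then uses max-key from clojure.core to pick the most common value. The helper name count-by-group is hypothetical:

```clojure
(def genders [:f :m :m])

(defn count-by-group
  "Counts occurrences by grouping equal items and counting each group."
  [coll]
  (->> coll
       (group-by identity)
       (map (fn [[item group]] [item (count group)]))
       (into {})))

(= (frequencies genders) (count-by-group genders))
; => true

;; The entry with the largest count gives the most common value.
(key (apply max-key val (frequencies genders)))
; => :m
```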
Continue reading at Chapter 1: Decision Trees