
6. Creating DataFrame

Oleksandr Zaytsev edited this page Jan 10, 2018 · 1 revision

There are four ways of creating a data frame:

  1. from an array of rows or columns
  2. from a matrix
  3. from a file
  4. loading a built-in dataset

1. Creating DataFrame from an array of rows or columns

The most straightforward way of creating a DataFrame is to pass all the data as an array of arrays to the fromRows: or fromColumns: message. Here is an example of initializing a DataFrame with rows:

df := DataFrame fromRows: #(
   ('Barcelona' 1.609 true)
   ('Dubai' 2.789 true)
   ('London' 8.788 false)).

The same data frame can be created from an array of columns:

df := DataFrame fromColumns: #(
   ('Barcelona' 'Dubai' 'London')
   (1.609 2.789 8.788)
   (true true false)).

Since the names of rows and columns are not provided, they are initialized with their default values: (1 to: self numberOfRows) and (1 to: self numberOfColumns). Both rowNames and columnNames can be changed at any time by passing an array of new names to the corresponding accessor. The array passed to rowNames: must have as many elements as there are rows, and the array passed to columnNames: must have as many elements as there are columns.

df columnNames: #(City Population BeenThere).
df rowNames: #(A B C).

You can convert this data frame to a pretty-printed table that can be copied and pasted into letters, blog posts, and tutorials (such as this one) using the df asStringTable message:

   |  City       Population  BeenThere  
---+----------------------------------
A  |  Barcelona       1.609       true  
B  |  Dubai           2.789       true  
C  |  London          8.788      false

2. Creating DataFrame from a Matrix

By its nature, a DataFrame is similar to a matrix. It works like a table of values, supports matrix accessors such as at:at: and at:at:put:, and in some cases can be treated like a matrix. Some classes provide tabular data in matrix format, for example the TabularWorksheet class of the Tabular package, which is used for reading XLSX files. To initialize a DataFrame from a matrix of values, use the fromMatrix: method:

matrix := Matrix
   rows: 3 columns: 3
   contents:
      #('Barcelona' 1.609 true
        'Dubai' 2.789 true
        'London' 8.788 false).
         
df := DataFrame fromMatrix: matrix.

Once again, the names of rows and columns are set to their default values.
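
The matrix accessors mentioned above can be used to read or update a single cell by position. The following is a minimal sketch, assuming that at:at: takes the row index first and the column index second, as Matrix does; check the argument order against the DataFrame version you have installed.

```smalltalk
"Build the same data frame as in the examples above."
df := DataFrame fromRows: #(
   ('Barcelona' 1.609 true)
   ('Dubai' 2.789 true)
   ('London' 8.788 false)).

"Assumption: first argument is the row index, second is the column index."
population := df at: 3 at: 2.

"Overwrite the same cell with a new value."
df at: 3 at: 2 put: 8.9.
```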

3. Reading data from file

In most real-world scenarios, the data is located in a file or a database. Support for database connections will be added in future releases. For now, DataFrame provides methods for loading data from the two most common file formats, CSV and XLSX:

DataFrame fromCSV: 'path/to/your/file.csv'.
DataFrame fromXLSX: 'path/to/your/file.xlsx'.

Since JSON does not store data as a table, such a file cannot be read directly into a DataFrame. However, you can parse the JSON using NeoJSON or any other library, construct an array of rows, and pass it to the fromRows: message, as described in section 1.
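
The JSON route can be sketched as follows. This is only an illustration: the JSON string and its keys (city, population, beenThere) are made up for this example, and it assumes NeoJSON is loaded and that the input is an array of flat objects that all have the same keys.

```smalltalk
"Hypothetical JSON payload; the keys are invented for this example."
json := '[
   {"city": "Barcelona", "population": 1.609, "beenThere": true},
   {"city": "Dubai", "population": 2.789, "beenThere": true},
   {"city": "London", "population": 8.788, "beenThere": false}
]'.

"By default NeoJSONReader turns a JSON array of objects into an Array of Dictionaries with String keys."
records := NeoJSONReader fromString: json.

"Pick the values in a fixed order so that every row has the same layout."
rows := records collect: [ :each |
   { each at: 'city'. each at: 'population'. each at: 'beenThere' } ].

df := DataFrame fromRows: rows.
df columnNames: #(City Population BeenThere).
```

Listing the keys explicitly, rather than relying on the order in which a Dictionary enumerates its values, keeps the columns aligned across rows.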

4. Loading the built-in datasets

DataFrame provides several well-known datasets for you to play with. They are compact and can be loaded with a simple message. At this point, three datasets can be loaded this way: the Iris flower dataset, a simplified Boston Housing dataset, and the Restaurant tipping dataset.

DataFrame loadIris.
DataFrame loadHousing.
DataFrame loadTips.