Skip to content

Manipulating Data

David Wright edited this page Apr 28, 2018 · 6 revisions

Once you have data in a FrameTable, Meta.Numerics allows you to filter, transform, re-shape, and otherwise manipulate it. This tutorial explains how.

For the examples in this tutorial, we will work with a data set you can download in CSV format using the following code:

using System;
using System.IO;
using System.Net;
using Meta.Numerics.Data;

FrameTable table;
Uri url = new Uri("https://raw.githubusercontent.com/dcwuser/metanumerics/master/Examples/Data/example.csv");
WebRequest request = WebRequest.Create(url);
using (WebResponse response = request.GetResponse()) {
    using (StreamReader reader = new StreamReader(response.GetResponseStream())) {
        table = FrameTable.FromCsv(reader);
    }
}

How do I select a subset of the columns in my view?

Use the Select method to specify the columns you want:

FrameView selected = table.Select("Height", "Weight", "Sex");

You can also use the Discard method if you prefer to specify which columns to throw away instead of which to keep:

FrameView discarded = selected.Discard("Name");

How do I add computed columns?

Suppose you want to do some analysis using BMI as a predictor. BMI is defined as weight divided by height squared. This code shows how to add it as a computed column.

table.AddComputedColumn("Bmi", r => ((double) r["Weight"])/MoreMath.Sqr((double) r["Height"] / 100.0));
Console.WriteLine($"Bmi of first subject is {table["Bmi"][0]}.");

Note that the value of a computed column is re-computed each time it is accessed, so if you later change the value Height for some row, the value of Bmi will change, too.

How do I filter out rows with nulls?

If some of your columns are nullable (either because they contain reference-types such as strings or because they contain Nullable<T> types), you can easily create a view that contains only the rows with non-null values. To get a view without any null values, use:

FrameView noNulls = table.WhereNotNull();

To get a view without the rows in which a particular column has a null value, but in which null values in other columns may still be present, just give the same method the name of the column:

FrameView noNullWeights = table.WhereNotNull("Weight");

In fact, the WhereNotNull method can take any number of column names, and will eliminate rows with nulls in any of them:

FrameView noNullWeightsOrHeights = table.WhereNotNull("Weight", "Height");

Once you have filtered out nulls from a column of type Nullable<T>, you can present it as a collection of T to any API that wants one. For example:

double meanWeight = table.WhereNotNull("Weight").Columns["Weight"].As<double>().Mean();

How do I filter rows based on more complex criteria?

The Where method allows you to filter rows based on any criteria at all. If your filter depends only on the value of one column, you can use the column-specific overload:

FrameView men = table.Where<string>("Sex", s => s == "M");

If your filter criteria depend on the value of more than one column, you can use the row-processing overload:

FrameView shortMen = table.Where(
    r => ((string) r["Sex"]) == "M" && ((double) r["Height"] < 175.0)
);

How do I sort rows?

Easy! If you want to sort by a particular column, just tell OrderBy the column to sort by:

Frameview ordered = table.OrderBy("Height");

For this to work, the type of the column must be IComparable (or Nullable of an IComparable). You can order in reverse by specifying SortOrder.Descending:

Frameview reversed = table.OrderBy("Height", SortOrder.Descending);

If the column type is not IComparable, or you want to specify a custom comparer, just use the the overload that takes a Comparison delegate:

FrameView custom = table.OrderBy<double>("Height", (h1, h2) => h1.CompareTo(h2)); 

Finally, if you want to make comparisons based on more than one column value, use the row-processing overload. Here's an example that puts women before men, then shorter before taller:

FrameView sorted = table.OrderBy((r1, r2) => {
    int first = ((string) r1["Sex"]).CompareTo((string ) r2["Sex"]);
    int second = ((double) r1["Height"]).CompareTo((double) r2["Height"]);
    return first != 0 ? first : second;
});

Note that, if a column is nullable, and you specify some sort of customer comparer, your custom comparer will need to handle null values, unless you have already filtered the view to remove the null-valued rows. The default comparer for IComparable data already handles nulls by putting them first before all other values.

Don't all these transformations take up a lot of memory?

No! Of course, the original FrameTable with R rows and C columns takes up memory of order R * C. But when you create a new FrameView by filtering, sorting, or otherwise manipulating that 2-dimensional data array, Meta.Numerics does not create a new copy of the data requiring another R * C chunk of memory. Instead, each new view requires only C pointers to the view's columns and R pointers to the rows of the original table, i.e. memory of order R + C.

Why do I have to keep specifying types?

You will notice that in a lot of our examples, you had to tell the method invoked the kind of data it will use. For example, when you specify a filter for the "Height" column, you must tell it that your filter delegate accepts a double. Or when you get the value of the "Height" column from a FrameRow, it comes out as an object and you have to cast it to a double. Doesn't the computer know that the "Height" column contains doubles? It must know, since it tells you that if you ask for table.Columns["Height"].StorageType!

The answer is that Meta.Numerics knows this, but the compiler doesn't know it. Since .NET is statically typed, the compiler needs to know types at compile-type. For homogeneous collections of a type know at compile-time, .NET can deal with this using generics -- the compiler know that every element extracted from a List<double> is a double. But FrameTables are heterogenous collections (every column can have a different type) of types, and those type are not necessarily even known at compile-time (the columns might be formed by reading from a CSV which wasn't even written when the code compiled). So, at the point at which data is pulled out of a frame, you will need to specify its type using either a type parameter or a cast. After that, you will get all the goodness that comes from the compiler knowing that type in your code, but you have to take that first step.

As you will see later when he hand columns of data to analysis APIs, you can do this type specification column-wise rather than cell-wise, and being able to do so actually has some advantages, like being able to use a column of bool? as bool, or even int or string.

How do I find distinct values?

We didn't need to add our own method for this because .NET already provides one:

using System.Linq;

List<string> sexes = table["Sex"].As<string>().Distinct().ToList();

How do I produce group data?

Use the GroupBy method to collect rows into groups by the value of a column, and produce aggregate quantities for the rows in each group. If you just want to produce one aggregate quantity for each group, use the single-output overload. Here is some code that gives the count of each sex:

FrameTable counts = table.GroupBy("Sex", v => v.Rows.Count, "Count");

The delegate you specify will get called once for each group, and each time will be given a FrameView containing just the rows belonging to one group.

If you want to produce more than one aggregate quantity, that is only slightly trickier. Instead of writing a delegate that returns a single object of a particular type, write a delegate that returns an IReadOnlyDictionary<string, object> containing all the aggregate quantities you want to produce. Here is some code that computes the count, mean height, and standard deviation of height for each sex:

using System.Collections.Generic;
using Meta.Numerics.Statistics;

FrameTable summarize = table.GroupBy("Sex", v => { 
    SummaryStatistics summary = new SummaryStatistics(v["Height"].As<double>());
    return (new Dictionary<string, object>() {
        {"Count", summary.Count},
        {"Mean", summary.Mean},
        {"StdDev", summary.StandardDeviation}
    });
});

The keys become the column names of the output table. Your dictionary must, of course, contain the same keys for each row.

What's missing?

Union and Join are two biggies.

What's Next

Okay, now you have loaded in your data and wrangled it into the form you want. How do you actually do statistics on it? For some examples of how to do that, read Analyzing Data.

Home

Clone this wiki locally