Pandas
Pandas is, in and of itself, a very powerful Python data-wrangling library. This documentation focuses on how to use pandas in the best way possible for the VIS-framework.
Greater knowledge of pandas will set you on an expedient path to VIS-ardry. It is recommended that you work through the following paths:
- 10 minutes to pandas - Minimum (Newbie)
- Tutorials - Apprentice
- Cookbook - VIS-ard
There are a lot of other uses for pandas as well, so it would only enhance your knowledge to read more about pandas at the following sites (especially if the terseness of the pandas documentation is not your "thing"):
- Pandas Data Analysis Site
- Chris Albon's Site, especially his Python/Data Wrangling sections
- Useful Pandas Snippets, by Becky Sweger
Pandas is a Python package designed for easily structuring, displaying, and manipulating data. It stores data in table-like structures called Series or DataFrames. The first thing you need to know for working with VIS is the difference between Series and DataFrames. Series are tables that contain only a single column, while DataFrames contain many Series (or columns). The distinction might not seem that big, but as we get into it more, you'll see that they often behave quite differently.
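As a quick illustration with made-up note data (not real indexer output), a single part could be stored as a Series and several parts together as a DataFrame:
import pandas
# one column of note names, indexed by offset: a Series
single_part = pandas.Series(['F4', 'F4', 'E4'], index=[0.0, 8.0, 10.0])
# one column per part, all sharing the same index: a DataFrame
all_parts = pandas.DataFrame({'0': ['F4', 'F4', 'E4'], '1': ['F3', 'F3', 'G3']}, index=[0.0, 8.0, 10.0])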
We use pandas in VIS to create the output of the indexers. It not only provides a nice way of visualizing said output, but also makes the internal work of the indexers much easier. Another benefit is that we can directly output pandas DataFrames to csv files, which can be viewed in Excel.
Pandas also has a lot of its own features that can make your analyses more powerful without requiring a lot of your own scripting. In this documentation we will walk you through some of the most applicable features and demonstrate how using pandas directly is easier and faster than many of the alternative work-arounds.
- DataFrame If you've ever used VIS before, you have probably seen a DataFrame at some point in the process, whether that was printing something in your terminal or looking at the csv files downloaded from the client. Either way, the indexer output will hopefully not look too foreign to you. This is what the noterest indexer DataFrame looks like:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 NaN D3
10 E4 G3 NaN C3
12 C4 A3 Rest F3
15 NaN NaN NaN E3
16 Rest G3 NaN C3
20 G4 NaN NaN NaN
24 A4 F3 F3 C4
27 NaN NaN NaN B3
28 NaN Rest NaN A3
29 NaN NaN NaN G3
30 NaN NaN NaN A3
32 A4 C4 NaN NaN
33 NaN NaN NaN G3
34 NaN NaN NaN F3
35 NaN NaN NaN E3
36 A4 D4 F3 D3
40 G4 B3 G3 E3
- Series In the above example, each part of the piece is itself a Series of data. Individually, the first part as a Series would look like this:
0 F4
8 F4
10 E4
12 C4
15 NaN
16 Rest
20 G4
24 A4
27 NaN
28 NaN
29 NaN
30 NaN
32 A4
33 NaN
34 NaN
35 NaN
36 A4
40 G4
- NaN NaN stands for "Not a Number", or "Not a Note" for our purposes. In either case, it means there is no value at that offset. Instead of leaving cells in the DataFrame blank, pandas fills in data-less spots with this. In the case of VIS, NaN indicates that neither a note nor a rest is being attacked, which means that a note or rest is being held.
- Index The index is the location of the data, also known in VIS as the offset. This is the column of numbers on the far left of the table. Here it is as a list:
[0.0, 8.0, 10.0, 12.0, 15.0, 16.0, 20.0, 24.0, 27.0, 28.0, 29.0, 30.0, 32.0, 33.0, 34.0, 35.0, 36.0, 40.0]
- MultiIndex Normally, a DataFrame has an index and column labels. VIS uses something on top of that, called a MultiIndex, which adds another layer of labels above the column labels. For example, a single-indexed DataFrame would look something like this:
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 NaN D3
And the MultiIndexed DataFrame looks like this:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 NaN D3
There is an extra layer of abstraction here, but also additional information that tells us which indexer produced the data we're actually looking at.
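One practical benefit of the MultiIndex is that you can select everything under a single top-level label to peel off that extra layer. For example:
dataframe['noterest.NoteRestIndexer']
returns a DataFrame whose column labels are just the part names ('0', '1', '2', '3').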
These are just some of the basic terms that you will come across later on. As long as you understand the basics now, more will become clear as you learn to use them.
Since we're looking at music, you might prefer to rotate your pandas DataFrames from their original orientation in order for it to look more like a musical score:
dataframe.T
gives this output:
0 8 10 12 15 16 20 24 27
Indexer Parts
noterest.NoteRestIndexer 0 F4 F4 E4 C4 NaN Rest G4 A4 NaN
1 F3 F3 G3 A3 NaN G3 NaN F3 NaN
2 Rest NaN NaN Rest NaN NaN NaN F3 NaN
3 F3 D3 C3 F3 E3 C3 NaN C4 B3
To get the list of offsets present in the piece, simply get the index of the DataFrame and (optionally) convert it to a list:
dataframe.index.tolist()
You can also do the same thing with the data itself. The values attribute (note that it's an attribute, not a method, so there are no parentheses) holds all the values of the DataFrame or Series, but in different forms. For a DataFrame:
dataframe.values
will give you an array of rows, each row holding the data from every part at that offset:
[[u'F4' u'F3' 'Rest' u'F3']
[u'F4' u'F3' nan u'D3']
[u'E4' u'G3' nan u'C3']
Accessing values on a Series will just give you a flat array of the data:
dataframe['noterest.NoteRestIndexer', '0'].values
There are many ways of comparing DataFrames, a lot of them using the methods you saw above, but pandas also has a built-in method that checks for equality between two DataFrames. Important: '==' does not do what you want here; it compares element by element instead of returning a single True or False.
dataframe1.equals(dataframe2)
If you want to look at only a subset of the full DataFrame, there are a number of ways to approach this. First, there are the head and tail functions. They show the first or last few events, respectively, of the DataFrame. To use this, call the function and pass it the number of events you want to see from the beginning or end:
dataframe.head(3)
gives:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 NaN D3
10 E4 G3 NaN C3
or,
dataframe.tail(6)
gives:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
672 G4 NaN NaN NaN
673 NaN F4 NaN D3
674 F4 D4 NaN B-2
676 NaN C4 NaN C3
678 E4 NaN NaN NaN
680 F4 NaN F3 NaN
You can also use the pandas .loc indexer to slice a range of the DataFrame (note that this label-based slice includes both endpoints):
dataframe.loc[8:24]
gives us:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
8 F4 F3 NaN D3
10 E4 G3 NaN C3
12 C4 A3 Rest F3
15 NaN NaN NaN E3
16 Rest G3 NaN C3
20 G4 NaN NaN NaN
24 A4 F3 F3 C4
Similarly, you might also want to look at what is happening at a particular offset. For this, you would again use .loc, this time with a single offset:
dataframe.loc[8]
and if you want to see that offset in a particular part:
dataframe[('noterest.NoteRestIndexer', '2')].loc[8]
Besides splitting up your DataFrame horizontally, it might be more useful to split it vertically - by part rather than by measure, for example. To get the Series of a single part, you have to index into both levels of the MultiIndex, meaning you have to include both the indexer name and the part name:
dataframe['noterest.NoteRestIndexer', '0']
The above code will give you the entire length of the piece, but only looking at one part. Make sure you have the right indexer name ('noterest.NoteRestIndexer' in this case) and that the part name ('0' in this case) actually exists in the DataFrame.
You might also want to look at all the parts, but one at a time, in which case you can use a for loop:
for part in dataframe:
    dataframe[part]
Just as you can pull DataFrames apart, so can you push them together into one, or concatenate them. The simple way of calling this function:
pandas.concat([dataframe1, dataframe2])
returns a DataFrame that contains the two DataFrames stacked one after the other (concat works along the rows by default). If you were to look at the index of this new DataFrame, it would be the indices of both DataFrames one after another. In other words, this way of concatenating DataFrames is fine for specific situations; it's especially good when you're looking at a lot of different pieces and don't want the data getting confused.
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
8 F4 F3 NaN D3
10 E4 G3 NaN C3
12 C4 A3 Rest F3
15 NaN NaN NaN E3
16 Rest G3 NaN C3
Indexer interval.HorizontalIntervalIndexer
Parts 0 1 2 3
8 -2 2 NaN -2
10 -3 2 NaN 4
12 Rest -2 Rest -2
15 NaN NaN NaN -3
16 Rest -2 NaN 8
However, if you want to look at the notes next to the intervals within a single piece, you probably want both DataFrames to share the same index, so that everything lines up and is much easier to read:
Indexer noterest.NoteRestIndexer interval.HorizontalIntervalIndexer
Parts 0 1 2 3 0 1 2 3
8 F4 F3 NaN D3 -2 2 NaN -2
10 E4 G3 NaN C3 -3 2 NaN 4
12 C4 A3 Rest F3 Rest -2 Rest -2
15 NaN NaN NaN E3 NaN NaN NaN -3
16 Rest G3 NaN C3 Rest -2 NaN 8
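To get this side-by-side layout, concatenate along the columns by passing axis=1:
pandas.concat([dataframe1, dataframe2], axis=1)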
This can be done with as many DataFrames as necessary!
Although NaN's are generally quite a useful part of VIS output, they can lead to problems every now and then. There will come a time when you want to have a DataFrame or Series that doesn't contain any NaN's. For example, you might want to check what note is occurring at a given offset, but don't want NaN to be returned. You could, of course, look at the last valid piece of data in the DataFrame, but why code that yourself when pandas can do it for you!
The two main things you need to know for manipulating NaN's are dropna() and fillna().
dropna(), as the name suggests, is a function that drops the NaN's from your DataFrame. You will probably want to be careful when using this one because, by default, it removes every row that contains a NaN. That means that even if there are notes in other parts of that row, they will be removed as well. This can have its uses, but it is best to be aware of it.
Consequently, I generally use dropna() on Series. First, I separate out each part from the DataFrame:
0 F4
8 F4
10 E4
12 C4
15 NaN
16 Rest
20 G4
24 A4
27 NaN
28 NaN
29 NaN
30 NaN
32 A4
33 NaN
34 NaN
35 NaN
36 A4
40 G4
and then call dropna() on it:
0 F4
8 F4
10 E4
12 C4
16 Rest
20 G4
24 A4
32 A4
36 A4
40 G4
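In code, those two steps look like this (assuming dataframe holds the noterest results from above):
part = dataframe['noterest.NoteRestIndexer', '0']
part.dropna()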
One of the issues with dropna() is that you can easily remove much more of the DataFrame than you meant to. dropna() has a setting called 'how' that has two options: 'all' and 'any'. 'all' only removes rows in which every value is a NaN. You shouldn't find any rows like this in the standard VIS output, but if you've done other post-processing by this point, it can be a very useful tool. The 'any' option removes every row that has at least one NaN in it. Since this is the default, it can be dangerous to call dropna() on a DataFrame, which is why I tend to call it on Series. Either way, you might find your own use for this tool in your analyses.
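In code, the two options look like this:
dataframe.dropna(how='all')   # only drop rows in which every part is NaN
dataframe.dropna(how='any')   # the default: drop any row containing at least one NaN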
fillna() is another very useful DataFrame manipulation. It simply fills in the NaN's with other values that you specify when you call the function. The most useful option for our purposes (because music is linear) is forward fill. This means that instead of only showing a note's name when it is attacked, we can show it at every offset where it is sounding. For example, take the normal indexer results:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 NaN D3
10 E4 G3 NaN C3
12 C4 A3 Rest F3
15 NaN NaN NaN E3
16 Rest G3 NaN C3
20 G4 NaN NaN NaN
24 A4 F3 F3 C4
27 NaN NaN NaN B3
28 NaN Rest NaN A3
29 NaN NaN NaN G3
30 NaN NaN NaN A3
32 A4 C4 NaN NaN
33 NaN NaN NaN G3
34 NaN NaN NaN F3
35 NaN NaN NaN E3
36 A4 D4 F3 D3
40 G4 B3 G3 E3
This is what it would look like after calling fillna(method='ffill') on it:
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
0 F4 F3 Rest F3
8 F4 F3 Rest D3
10 E4 G3 Rest C3
12 C4 A3 Rest F3
15 C4 A3 Rest E3
16 Rest G3 Rest C3
20 G4 G3 Rest C3
24 A4 F3 F3 C4
27 A4 F3 F3 B3
28 A4 Rest F3 A3
29 A4 Rest F3 G3
30 A4 Rest F3 A3
32 A4 C4 F3 A3
33 A4 C4 F3 G3
34 A4 C4 F3 F3
35 A4 C4 F3 E3
36 A4 D4 F3 D3
40 G4 B3 G3 E3
Notice that, as with dropna(), all the NaN's have been removed, but this time there are no offsets missing, and there are many more repetitions in the notes being shown.
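In code, fillna() can take either a literal value or the forward-fill method shown above (the '_' placeholder here is just an arbitrary example value):
dataframe.fillna('_')                # replace every NaN with a fixed value
dataframe.fillna(method='ffill')     # forward fill: carry the last attacked event forward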
There are a few different ways of dealing with duplicate values in pandas. The pandas function drop_duplicates() takes one setting, keep, that has three different options. Depending on which option you choose, the function does three different things:
dataframe.drop_duplicates(keep=False)
Deletes every row that is duplicated anywhere in the DataFrame, keeping none of the copies
dataframe.drop_duplicates(keep='first')
Deletes all duplicated rows except for the first occurrence
dataframe.drop_duplicates(keep='last')
Deletes all duplicated rows except for the last occurrence
All these ways of dropping duplicates can, of course, be useful. However, the most common reason we, as VIS developers, have found to drop duplicates is to drop only consecutive duplicates. Unfortunately, pandas doesn't have a built-in function to do this for us. Luckily, though, it only takes one line of code to get there yourself:
dataframe = dataframe.loc[dataframe.shift() != dataframe]
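For instance, applied to a single part (a hypothetical Series here, since consecutive repetitions are easiest to see in one voice):
part = pandas.Series(['F4', 'F4', 'E4', 'E4', 'E4', 'G4'], index=[0, 4, 8, 10, 12, 16])
part.loc[part.shift() != part]   # keeps offsets 0, 8, and 16: F4, E4, G4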
A common function for many data structures is the map function. Pandas, too, has equivalent functions for DataFrames and Series. With these, you can apply a function across all the data in a DataFrame or Series with only a single line of code, rather than looping through everything yourself.
The basic version is for Series and is called map(). Note that you pass the function itself, without calling it:
series.map(my_func)
For DataFrames, there are two options. The first, apply(), passes each column (or each row, if you set axis=1) to the function as a whole Series:
dataframe.apply(my_func)
The other option is applymap(), which applies the function to every individual element of the DataFrame:
dataframe.applymap(my_func)
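For instance, here is a minimal sketch with a hypothetical is_rest() helper, where series and dataframe stand for the objects used above:
def is_rest(event):
    # True wherever a part is attacking a rest
    return event == 'Rest'

series.map(is_rest)                          # element-wise over a single part
dataframe.applymap(is_rest)                  # element-wise over the whole DataFrame
dataframe.apply(lambda part: part.count())   # once per column: the number of non-NaN events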
Pandas has a cool function called describe() that spits out a new DataFrame of data about your DataFrame. What it includes depends on what is relevant for the given DataFrame. Generally for VIS output, the most common data is:
- count: the number of occurrences
- unique: the number of distinct elements
- top: the most frequently occurring element
- freq: the frequency of the top element
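For example, calling it on the noterest DataFrame from above:
dataframe.describe()
gives: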
Indexer noterest.NoteRestIndexer
Parts 0 1 2 3
count 186 208 84 170
unique 13 16 11 13
top A4 C4 D4 F3
freq 41 30 13 31
Cross-filtering is something we do all the time for music analysis, and pandas makes it that much easier. It is a way of filtering the results of various indexers against each other to get specific and useful information about the music represented in the DataFrames.
The first step in this method is to think of a question you want to ask about the music. To make it simple, let's ask what intervals occur against the bass voice when the bass voice is singing an F3.
First, we have to get the relevant indexers (noterest and interval, in this case) and concatenate them:
notes = noterest.NoteRestIndexer(piece).run()
intervals = interval.IntervalIndexer(notes).run()
piece = pandas.concat([notes, intervals], axis=1)
Next, indicate the condition you want to filter your results by. In this case, we want to filter for when the bass voice sings an F3:
condition = piece.loc[:, ('noterest.NoteRestIndexer', '3')] == 'F3'
Then, specify in which indexer you want to find your results. In this case, we also have to take into consideration that we only want to look at columns that contain intervals occurring against the bass. This means the column label will definitely have the number '3' in it:
location = piece['interval.IntervalIndexer'].loc[:, [part for part in piece['interval.IntervalIndexer'] if '3' in part]]
Now, we simply have to apply the condition to the location and we get our results!
location[condition]
The resulting DataFrame:
Parts 0,3 1,3 2,3
0 8 1 Rest
12 5 3 Rest
34 10 5 1
56 10 5 3
86 10 3 Rest
98 10 10 6
133 9 6 4