Pandas is the most popular Python library for doing data analysis. While it does offer quite a lot of functionality, it is also regarded as a fairly difficult library to learn well. Some reasons for this include:
- There are often multiple ways to complete common tasks
- There are over 240 DataFrame attributes and methods
- There are several methods that are aliases (reference the same exact underlying code) of each other
- There are several methods that have nearly identical functionality
- There are many tutorials written by different people that show different ways to do the same thing
- There is no official document with guidelines on how to idiomatically complete common tasks
- The official documentation itself contains non-idiomatic code
Selecting a Single Column of Data
Selecting a single column of data from a Pandas DataFrame is just about the simplest task you can do and unfortunately, it is here where we first encounter the multiple-choice option that Pandas presents to its users.
You may select a single column as a Series with either the brackets or dot notation. Let’s read in a small, trivial DataFrame and select a column using both methods.
>>> import pandas as pd
>>> df = pd.read_csv('data/sample_data.csv')
>>> df
Selection with the brackets
Placing a column name in the brackets appended to a DataFrame selects a single column of a DataFrame as a Series.
>>> df['state']
name
Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object
Selection with dot notation
Alternatively, you may select a single column with dot notation. Simply place the name of the column after the dot operator. The output is exactly the same as above.
>>> df.state
Issues with the dot notation
There are three issues with using dot notation. It doesn’t work in the following situations:
- When there are spaces in the column name
- When the column name is the same as a DataFrame method
- When the column name is a variable
The column name has spaces
You can only use the brackets to select columns with spaces in their names.
>>> df['favorite food']
The column name is the same as a DataFrame method
When a column name and a DataFrame method collide, Pandas will always reference the method and not the column. For instance, the column name count is also a DataFrame method and will be referenced when using dot notation. This doesn't produce an error, as Python allows you to reference methods without calling them. Let's reference this method now.
>>> df.count
The output is going to be very confusing if you haven't encountered it before. Notice at the top it states 'bound method DataFrame.count of'. Python is telling us that this is a method bound to some DataFrame object. Because we referenced the method without calling it, Python simply displays its official string representation. Many people believe that they've produced some kind of analysis with this result. This isn't true; almost nothing has happened. No computation was performed — only a reference to the method was produced.
The column name is a variable
Let’s say you are using a variable to hold a reference to the column name you would like to select. In this case, the only possibility again is to use the brackets. Below is a simple example where we assign the value of a column name to a variable and then pass this variable to the brackets.
>>> col = 'height'
>>> df[col]
Selection with at and iat
Two additional indexers, at
and iat
, exist that select a single cell of a DataFrame. These provide a slight performance advantage over their analogous loc
and iloc
indexers. But, they introduce the additional burden of having to remember what they do. Also, for most data analyses, the increase in performance isn’t useful unless it’s being done at scale. And if performance truly is an issue, then taking your data out of a DataFrame and into a NumPy array will give you a large performance gain.
Performance comparison: iloc vs iat vs NumPy
Let's compare the performance of selecting a single cell with iloc, iat, and a NumPy array. Here we create a NumPy array with 100k rows and 5 columns containing random data. We then create a DataFrame out of it and make the selections.
>>> import numpy as np
>>> a = np.random.rand(10 ** 5, 5)
>>> df1 = pd.DataFrame(a)
>>> row = 50000
>>> col = 3

>>> %timeit df1.iloc[row, col]
13.8 µs ± 3.36 µs per loop

>>> %timeit df1.iat[row, col]
7.36 µs ± 927 ns per loop

>>> %timeit a[row, col]
232 ns ± 8.72 ns per loop
While iat is a little less than twice as fast as iloc, selection with a NumPy array is about 60x as fast. So, if you really have an application with performance requirements, you should be using NumPy directly and not Pandas.
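A sketch of what that looks like in practice: extract the underlying array once, then index it directly in any hot loop (to_numpy is the modern accessor; older Pandas versions use the values attribute).

```python
import numpy as np
import pandas as pd

# same shape as the timing example: 100k rows, 5 columns of random data
df1 = pd.DataFrame(np.random.rand(10 ** 5, 5))

# pull the underlying data out of the DataFrame once...
a = df1.to_numpy()

# ...then index the array directly wherever performance matters
value = a[50000, 3]
```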
Guidance: Use NumPy arrays, not at or iat, if your application relies on performance when selecting a single cell of data.
read_csv vs read_table
The read_csv and read_table functions are nearly identical; read_table simply defaults to the tab character as its delimiter. Let's verify that they can produce the same DataFrame.
>>> college = pd.read_csv('data/college.csv')
>>> college.head()
>>> college2 = pd.read_table('data/college.csv', delimiter=',')
>>> college.equals(college2)
True
read_table is being deprecated
I made a post in the Pandas GitHub repo suggesting a few functions and methods that I'd like to see deprecated. The read_table function is being deprecated and should no longer be used.
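Since read_csv accepts any single-character delimiter through its sep parameter, read_table adds nothing. A minimal sketch, using an in-memory tab-separated buffer to stand in for a file on disk:

```python
import io
import pandas as pd

# a tab-separated buffer standing in for a .tsv file on disk
data = io.StringIO('name\tstate\nJane\tNY\nNiko\tTX\n')

# read_csv covers the tab-delimited case via sep='\t'
df = pd.read_csv(data, sep='\t')
```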
Guidance: Only use read_csv to read in delimited text files.
isna vs isnull and notna vs notnull
The isna and isnull methods both determine whether each value in the DataFrame is missing or not. The result will always be a DataFrame (or Series) of all boolean values.
These methods are exactly the same. We say that one is an alias of the other. There is no need for both of them in the library. The isna method was added more recently because the characters na are found in other missing value methods such as dropna and fillna. Confusingly, Pandas uses NaN, None, and NaT as missing value representations and not NA.
notna and notnull are aliases of each other as well and simply return the opposite of isna. There's no need for both of them.
Let's verify that isna and isnull are aliases.
>>> college_isna = college.isna()
>>> college_isnull = college.isnull()
>>> college_isna.equals(college_isnull)
True
I only use isna and notna
I use the methods that end in na to match the names of the other missing value methods, dropna and fillna.
You can also avoid ever using notna, since Pandas provides the inversion operator, ~, to invert boolean DataFrames.
Guidance: Use only isna and notna.
Arithmetic and Comparison Operators and their Corresponding Methods
All arithmetic operators have corresponding methods that function similarly.
+ – add
- – sub and subtract
* – mul and multiply
/ – div, divide and truediv
** – pow
// – floordiv
% – mod
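To see that the operator and method forms agree, here is a minimal sketch using a small inline Series (values are illustrative). The method form also exposes extras the operator cannot, such as a fill_value parameter:

```python
import numpy as np
import pandas as pd

ugds = pd.Series([4206.0, np.nan, 291.0])

# the + operator and the add method are equivalent for a plain addition
with_op = ugds + 100
with_method = ugds.add(100)

# the method form additionally accepts fill_value to treat NaN as a number
filled = ugds.add(100, fill_value=0)
```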
All the comparison operators also have corresponding methods.
> – gt
< – lt
>= – ge
<= – le
== – eq
!= – ne
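The same equivalence holds for the comparison operators; a short sketch with illustrative values:

```python
import pandas as pd

ugds = pd.Series([4206.0, 11383.0, 291.0])

# the > operator and the gt method return the same boolean Series
with_op = ugds > 1000
with_method = ugds.gt(1000)
```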
Built-in Python functions vs Pandas methods with the same name
There are a few DataFrame/Series methods that return the same result as a built-in Python function with the same name. They are:
sum
min
max
abs
Let's verify that they give the same result by testing them out on a single column of data. We begin by selecting the non-missing values of the undergraduate student population column, ugds.
>>> ugds = college['ugds'].dropna()
>>> ugds.head()
0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64
Verifying sum
>>> sum(ugds)
16200904.0

>>> ugds.sum()
16200904.0
Verifying max
>>> max(ugds)
151558.0

>>> ugds.max()
151558.0
Verifying min
>>> min(ugds)
0.0

>>> ugds.min()
0.0
Verifying abs
>>> abs(ugds).head()
0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64

>>> ugds.abs().head()
0     4206.0
1    11383.0
2      291.0
3     5451.0
4     4811.0
Name: ugds, dtype: float64
Time the performance of each
Let’s see if there is a performance difference between each method.
sum performance
>>> %timeit sum(ugds)
644 µs ± 80.3 µs per loop

>>> %timeit -n 5 ugds.sum()
164 µs ± 81 µs per loop
max performance
>>> %timeit -n 5 max(ugds)
717 µs ± 46.5 µs per loop

>>> %timeit -n 5 ugds.max()
172 µs ± 81.9 µs per loop
min performance
>>> %timeit -n 5 min(ugds)
705 µs ± 33.6 µs per loop

>>> %timeit -n 5 ugds.min()
151 µs ± 64 µs per loop
abs performance
>>> %timeit -n 5 abs(ugds)
138 µs ± 32.6 µs per loop

>>> %timeit -n 5 ugds.abs()
128 µs ± 12.2 µs per loop
Guidance: Use the Pandas method over any built-in Python function with the same name.
Standardizing groupby Aggregation
There are a number of syntaxes that get used for the groupby method when performing an aggregation. I suggest choosing a single syntax so that all of your code looks the same.
The three components of groupby aggregation
Typically, when calling the groupby method, you will be performing an aggregation. This is by far the most common scenario. When you are performing an aggregation during a groupby, there will always be three components.
- Grouping column — Unique values form independent groups
- Aggregating column — Column whose values will get aggregated. Usually numeric
- Aggregating function — How the values will get aggregated (sum, min, max, mean, median, etc…)
My syntax of choice for groupby
There are a few different syntaxes that Pandas allows to perform a groupby aggregation. The following is the one I use.
df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})
A buffet of groupby syntaxes for finding the maximum math SAT score per state
Below, we will cover several different syntaxes that return the same (or similar) result for finding the maximum math SAT score per state. Let's look at the data we will be using first.
>>> college[['stabbr', 'satmtmid', 'satvrmid', 'ugds']].head()
Method 1: Here is my preferred way of doing the groupby aggregation. It handles complex cases.
>>> college.groupby('stabbr').agg({'satmtmid': 'max'}).head()
Method 2a: The aggregating column can be selected within brackets following the call to groupby. Notice that a Series is returned here and not a DataFrame.
>>> college.groupby('stabbr')['satmtmid'].agg('max').head()
stabbr
AK    503.0
AL    590.0
AR    600.0
AS      NaN
AZ    580.0
Name: satmtmid, dtype: float64
Method 2b: The aggregate method is an alias for agg and can also be used. This returns the same Series as above.
>>> college.groupby('stabbr')['satmtmid'].aggregate('max').head()
Method 3: You can call the aggregating method directly without calling agg. This returns the same Series as above.
>>> college.groupby('stabbr')['satmtmid'].max().head()
Major benefits of preferred syntax
The reason I choose this syntax is that it can handle more complex grouping problems. For instance, if we wanted to find the max and min of the math and verbal SAT scores, along with the average undergrad population per state, we would do the following.
>>> college.groupby('stabbr').agg({'satmtmid': ['min', 'max'],
                                   'satvrmid': ['min', 'max'],
                                   'ugds': 'mean'}).round(0).head(10)
This problem isn’t solvable using the other syntaxes.
Guidance: Use df.groupby('grouping column').agg({'aggregating column': 'aggregating function'}) as your primary syntax of choice.
Best of the API
The Pandas DataFrame API is enormous. There are dozens of methods that have little to no use or are aliases. Below is my list of all the DataFrame attributes and methods that I consider sufficient to complete nearly any task.
Attributes
- columns
- dtypes
- index
- shape
- T
- values
Aggregation Methods
- all
- any
- count
- describe
- idxmax
- idxmin
- max
- mean
- median
- min
- mode
- nunique
- sum
- std
- var
Non-Aggregation Statistical Methods
- abs
- clip
- corr
- cov
- cummax
- cummin
- cumprod
- cumsum
- diff
- nlargest
- nsmallest
- pct_change
- prod
- quantile
- rank
- round
Subset Selection
- head
- iloc
- loc
- tail
Missing Value Handling
- dropna
- fillna
- interpolate
- isna
- notna
Grouping
- expanding
- groupby
- pivot_table
- resample
- rolling
Joining Data
- append
- merge
Other
- asfreq
- astype
- copy
- drop
- drop_duplicates
- equals
- isin
- melt
- plot
- rename
- replace
- reset_index
- sample
- select_dtypes
- shift
- sort_index
- sort_values
- to_csv
- to_json
- to_sql
Functions
- pd.concat
- pd.crosstab
- pd.cut
- pd.qcut
- pd.read_csv
- pd.read_json
- pd.read_sql
- pd.to_datetime
- pd.to_timedelta