Pandas + IPython + Jupyter Incantations

A place to store my hard earned pandas learnings.

Misc
Column operations
Tidy Data Long and Wide
Reshaping Data
Pivoting - Narrow to wide
- Simple pivot table by count
- Simple pivot table by percent change
Group By - Apply aggregate to a group of facts
- Named Aggregations
Pandas performance
Plotting
Exploratory Data Analysis

Misc

Jupyter set to window width

# Resize Jupyter to window width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

Pretty print the data frame

If you use print to show a dataframe, it’ll look ugly, use display instead.

display(df)

Jupyter run on a linux box, access from linux.

https://coderwall.com/p/ohk6cg/remote-access-to-ipython-notebooks-via-ssh

ipython magic commands

%history -n
%history -g <string> search for string
%recall - put in editor
%rerun - execute in editor

fix column width to match terminal

pd.set_option('display.large_repr', 'truncate')
pd.set_option('display.max_columns', 0)

List columns

df.columns()

Sorting

df.sort_index()
df.sort_values()

Column operations

Convert column type to category

df.column = df.column.astype("category")

Count values in category

df.column.value_counts()
df.column.value_counts(normalize=True)
df.column.value_counts(normalize=False).cumsum()

Histogram column

df.groupby(df.column).apply(len)

Custom Apply to a row

df.apply(axis="columns", func=myFuncThatTakesArowAsInput) # pretty darn slow.

Convert Json BLOB to new Json Columns

df = df.join(pd.DataFrame(df['json_string'].apply(json.loads).apply(pd.Series)))

Tidy Data Long and Wide

Tidy data is ideal for analysis. The data definitions are confusing, so I’ll provide my own:

Dimension (Dim/Pivot) - A caregorical property of a entity
Entity (Independent Variable) - Set of dimensions that uniquely identify the facts
Fact (Measure/Dependent Variable) - A fact/measure for an entity.

With these definitions, Tidy data is defined as:

Each fact forms a column.
Each observation (of an entity) forms a row (with a fact per column).
Each type of observational unit forms a table. (??)

This is a wide data set, where a row is the Set of Dimensions (Entity), and a fact per column. This is:

Easiest for humans to digest
An Excel Pivot Table
The best shape for panadas
(From Narrow: pivot, or multi group and un-stack)

Another form of tidy data, often how raw data looks is called narrow data,in this case a row is a Set of Dimensions (Entity), A column for Fact Name, A column for Fact Value. This is

Hard for humans to grock
The best shape for altair
(From wide: Use melt to get from wide to narrow)

Reshaping Data

Pivoting - Narrow to wide

Simple pivot table by count

pv = pd.pivot_table(df,index=pd.Grouper(freq='2W'), columns="column_1",values="column2", aggfunc='count')

Simple pivot table by percent change

pv = pd.pivot_table(df,index=pd.Grouper(freq='2W'), columns="column_1",values="column2", aggfunc='count').pcnt_change()

Group By - Apply aggregate to a group of facts

Named Aggregations

Groupby is great, but it’s a bugger to name aggregations, until pandas 0.25 added named aggregations

Pandas performance

Pandas Performance Tutorial
- Numba - faster apply operations (swifter includes it)
- df.eval - faster df operations
- pandas.read_csv(), usecols to limit columns loaded
- see memory - ts.memory_usage(deep=True)
Swifter - Smart Function Application (will use Numba, or Dask, or parallel apply)
Modin - Parallel DataFrame, design for compatiblity first
Dask - Parallel DataFrame - but use Modin instead)
Numba - JIT your functions, but use Swifter instead.
Pandarell - Parallel Apply (swifter equiv)

Plotting

Altair

Altair is completely obvious to use, and I enjoy it.

Altair normally can’t use index’s so you’ll need to reset_index first. You can do this in_place

You’ll also often need to unstack

Matplotlib

Learning plotting libraries can be a pain, dataviz is probably as important as data analysis, both because the viz will tell the story, but also because the viz will make analysis so much easier.

I spent a tonne of time working with matplotlib, but it’s so non-obvious to write, and I always need to look up its non onbvious syntax, and spend hours trying to figure out whatever quirk I’m getting bitten by:

Gotchyas:

Axes is a synonym for subplot. It should not be confused with axis.

Plotly with cufflinks

To check out:

plot.ly with cufflinks - seems to be emerging as a winner, but careful about the need for on-line vs offline mode.

Exploratory Data Analysis

Pandas-Profiling