Copied from my GitHub techdiary
Pandas + IPython + Jupyter Incantations
A place to store my hard earned pandas learnings.
- Misc
- Column operations
- Tidy Data Long and Wide
- Reshaping Data
- Pivoting - Narrow to wide
- Group By - Apply aggregate to a group of facts
- Pandas performance
- Plotting
- Exploratory Data Analysis
Misc
Jupyter set to window width
# Resize Jupyter to window width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
Pretty print the data frame
If you use print to show a dataframe, it’ll look ugly, use display instead.
display(df)
Jupyter run on a linux box, access from linux.
https://coderwall.com/p/ohk6cg/remote-access-to-ipython-notebooks-via-ssh
ipython magic commands
%history -n
%history -g <string> search for string
%recall - put in editor
%rerun - execute in editor
fix column width to match terminal
pd.set_option('display.large_repr', 'truncate')
pd.set_option('display.max_columns', 0)
List columns
df.columns()
Sorting
df.sort_index()
df.sort_values()
Column operations
Convert column type to category
df.column = df.column.astype("category")
Count values in category
df.column.value_counts()
df.column.value_counts(normalize=True)
df.column.value_counts(normalize=False).cumsum()
Histogram column
df.groupby(df.column).apply(len)
Custom Apply to a row
df.apply(axis="columns", func=myFuncThatTakesArowAsInput) # pretty darn slow.
Convert Json BLOB to new Json Columns
df = df.join(pd.DataFrame(df['json_string'].apply(json.loads).apply(pd.Series)))
Tidy Data Long and Wide
Tidy data is ideal for analysis. The data definitions are confusing, so I’ll provide my own:
- Dimension (Dim/Pivot) - A caregorical property of a entity
- Entity (Independent Variable) - Set of dimensions that uniquely identify the facts
- Fact (Measure/Dependent Variable) - A fact/measure for an entity.
With these definitions, Tidy data is defined as:
- Each fact forms a column.
- Each observation (of an entity) forms a row (with a fact per column).
- Each type of observational unit forms a table. (??)
This is a wide data set, where a row is the Set of Dimensions (Entity), and a fact per column. This is:
- Easiest for humans to digest
- An Excel Pivot Table
- The best shape for panadas
- (From Narrow: pivot, or multi group and un-stack)
Another form of tidy data, often how raw data looks is called narrow data,in this case a row is a Set of Dimensions (Entity), A column for Fact Name, A column for Fact Value. This is
- Hard for humans to grock
- The best shape for altair
- (From wide: Use melt to get from wide to narrow)
Reshaping Data
Pivoting - Narrow to wide
Simple pivot table by count
pv = pd.pivot_table(df,index=pd.Grouper(freq='2W'), columns="column_1",values="column2", aggfunc='count')
Simple pivot table by percent change
pv = pd.pivot_table(df,index=pd.Grouper(freq='2W'), columns="column_1",values="column2", aggfunc='count').pcnt_change()
Group By - Apply aggregate to a group of facts
Named Aggregations
Groupby is great, but it’s a bugger to name aggregations, until pandas 0.25 added named aggregations
Pandas performance
-
- Numba - faster apply operations (swifter includes it)
- df.eval - faster df operations
- pandas.read_csv(), usecols to limit columns loaded
- see memory - ts.memory_usage(deep=True)
- Swifter - Smart Function Application (will use Numba, or Dask, or parallel apply)
- Modin - Parallel DataFrame, design for compatiblity first
- Dask - Parallel DataFrame - but use Modin instead)
- Numba - JIT your functions, but use Swifter instead.
- Pandarell - Parallel Apply (swifter equiv)
Plotting
Altair
Altair is completely obvious to use, and I enjoy it.
Altair normally can’t use index’s so you’ll need to reset_index first. You can do this in_place
You’ll also often need to unstack
Matplotlib
Learning plotting libraries can be a pain, dataviz is probably as important as data analysis, both because the viz will tell the story, but also because the viz will make analysis so much easier.
I spent a tonne of time working with matplotlib, but it’s so non-obvious to write, and I always need to look up its non onbvious syntax, and spend hours trying to figure out whatever quirk I’m getting bitten by:
Gotchyas:
- Axes is a synonym for subplot. It should not be confused with axis.
Plotly with cufflinks
To check out:
plot.ly with cufflinks - seems to be emerging as a winner, but careful about the need for on-line vs offline mode.