Quickstart
Installation
pip install pandas-cat
Requires Python 3.10 or later.
Minimal example
import pandas as pd
import pandas_cat
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]
pandas_cat.profile(df=df, dataset_name="Accidents")
This writes report/report.html. Open it in any browser.
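If you would rather open the report from a script than by hand, the standard library can do it. The snippet below is a minimal sketch that assumes the report was written to report/report.html relative to the current working directory, as stated above; it does nothing but print a hint if the file is not there yet.

```python
from pathlib import Path
import webbrowser

# Assumed location: relative to the directory the script is run from.
report = Path("report") / "report.html"

if report.exists():
    # as_uri() needs an absolute path, hence resolve().
    webbrowser.open(report.resolve().as_uri())
else:
    print(f"{report} not found; run pandas_cat.profile() first")
```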
Built-in templates
pandas-cat ships three templates — pick one via the template= parameter.
default — static HTML with embedded SVG charts (used when template is
omitted):
pandas_cat.profile(df=df, dataset_name="Accidents")
modern — same static output with a refreshed visual style:
pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")
interactive — data-driven report with three correlation metrics (Cramér’s V, Spearman Rank, Theil’s U) and per-category crosstabs:
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")
Automatic data preparation
By default (auto_prepare=True) pandas-cat calls prepare()
before profiling. It converts all columns to ordered pandas.Categorical
where possible — numeric-string ranges ("0-10", "Over 75", "60+")
are sorted by their leading numeric token, plain text columns are sorted
alphabetically, and columns with no meaningful order are left as-is.
The effect is easiest to see directly:
import pandas as pd
import pandas_cat
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]
print("Before:")
print(df['Driver_Age_Band'].dtype) # str
print(sorted(df['Driver_Age_Band'].unique())) # alphabetical
df = pandas_cat.prepare(df)
print("After:")
print(df['Driver_Age_Band'].dtype) # category (ordered=True)
print(df['Driver_Age_Band'].cat.categories.tolist()) # natural numeric order
Output:
Before:
str
['16 - 20', '21 - 25', '26 - 35', '36 - 45',
'46 - 55', '56 - 65', '66 - 75', 'Over 75']
After:
category
['16 - 20', '21 - 25', '26 - 35', '36 - 45',
'46 - 55', '56 - 65', '66 - 75', 'Over 75']
Without prepare(), a band such as "6 - 10" would sort after "56 - 65", because plain string sorting compares character by character.
After prepare(), every downstream operation — bar charts, frequency tables,
correlation matrices — uses the correct semantic order.
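The difference between alphabetical order and ordering by the leading numeric token can be reproduced in plain Python. This is only an illustrative sketch of the idea, not the code prepare() runs; the helper leading_number and the sample labels are assumptions made for the example.

```python
import re

def leading_number(label: str) -> float:
    """Hypothetical helper: first integer found in the label, or infinity if none."""
    match = re.search(r"\d+", label)
    return float(match.group()) if match else float("inf")

bands = ["56 - 65", "Over 75", "6 - 10", "16 - 20", "66 - 75"]

alphabetical = sorted(bands)                  # character-by-character comparison
natural = sorted(bands, key=leading_number)   # compare by first number in each label

print(alphabetical)
# ['16 - 20', '56 - 65', '6 - 10', '66 - 75', 'Over 75']
print(natural)
# ['6 - 10', '16 - 20', '56 - 65', '66 - 75', 'Over 75']
```

Labels without any digit get infinity as their key in this sketch, so they sort last; how pandas-cat itself treats such labels is not shown here.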
To skip preparation:
pandas_cat.profile(df=df, dataset_name="Accidents",
                   opts={"auto_prepare": False})
To run preparation without generating a report:
df = pandas_cat.prepare(df)
Note
pandas-cat is a categorical profiling package. The default preparation
engine reflects this: plain numeric columns with cat_limit or fewer
distinct values (default 20) are converted to ordered CategoricalDtype
and shown as frequency bar charts rather than histograms. Numeric columns
with more distinct values are left as continuous and shown with histograms.
No analytical information is lost: a bar chart shows every value and its exact frequency. A KDE histogram for low-cardinality discrete data (e.g. a 0/1 flag or a 1–5 rating scale) is less informative and visually misleading anyway.
To force a numeric column to stay continuous regardless of cardinality,
cast it explicitly before calling profile():
df['rating'] = df['rating'].astype(float)
pandas_cat.profile(df=df, opts={"auto_prepare": False})
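The cardinality rule in the note above can be pictured as a simple threshold check. The following is a rough sketch under that reading, not the library's implementation: treat_as_categorical is a hypothetical helper, and how pandas-cat counts missing values toward the limit is not covered.

```python
def treat_as_categorical(values, cat_limit=20):
    """Hypothetical helper: True if the column has cat_limit or fewer distinct values."""
    return len(set(values)) <= cat_limit

rating = [1, 2, 3, 4, 5, 5, 4, 3]     # 5 distinct values
income = list(range(1000, 1100))      # 100 distinct values

print(treat_as_categorical(rating))   # True: shown as a frequency bar chart
print(treat_as_categorical(income))   # False: stays continuous, shown as a histogram
```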
Options reference
All options are passed as a dict to the opts parameter:
| Option | Default | Description |
|---|---|---|
| auto_prepare | True | Call prepare() before profiling. |
| cat_limit | 20 | Categorical columns with more distinct values than this are excluded. Does not apply to continuous columns. |
| | | Additional strings to treat as missing (on top of the built-in list). |
| | | Strings from the built-in missing-value list to not treat as missing. |
| | | When |