Quickstart

Installation

pip install pandas-cat

Requires Python 3.10 or later.

Minimal example

import pandas as pd
import pandas_cat

df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

pandas_cat.profile(df=df, dataset_name="Accidents")

This writes report/report.html. Open it in any browser.
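
If opening the report from a script is more convenient, the standard library can do it. The sketch below assumes the output path report/report.html stated above; report_uri is a hypothetical helper name, not part of pandas-cat.

```python
import webbrowser
from pathlib import Path


def report_uri(path="report/report.html"):
    """Turn a (possibly relative) report path into an absolute file:// URI."""
    return Path(path).resolve().as_uri()


report = Path("report/report.html")
if report.exists():
    # Opens the report in the system's default browser.
    webbrowser.open(report_uri(report))
else:
    print(f"No report found at {report}; run pandas_cat.profile() first.")
```

The existence check keeps the snippet harmless when it runs before a report has been generated.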

Built-in templates

pandas-cat ships three templates — pick one via the template= parameter.

default — static HTML with embedded SVG charts (used when template is omitted):

pandas_cat.profile(df=df, dataset_name="Accidents")

modern — same static output with a refreshed visual style:

pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")

interactive — data-driven report with three correlation metrics (Cramér’s V, Spearman Rank, Theil’s U) and per-category crosstabs:

pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")
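
For readers unfamiliar with the metrics, Cramér's V measures association between two categorical variables on a scale from 0 (none) to 1 (perfect). The sketch below computes the textbook, uncorrected form, V = sqrt(chi2 / (n * (min(r, c) - 1))), using only the standard library. It illustrates the metric and is not pandas-cat's implementation; the function name and the sample values are made up for the example.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equal-length sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed count per (x, y) pair
    rows = Counter(xs)             # marginal counts of x
    cols = Counter(ys)             # marginal counts of y

    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Made-up toy data: the second column is fully determined by the first.
sex = ["M", "M", "F", "F"]
journey = ["work", "work", "school", "school"]
print(cramers_v(sex, journey))               # 1.0

# Made-up toy data: the two columns are independent.
print(cramers_v(sex, ["a", "b", "a", "b"]))  # 0.0
```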

Automatic data preparation

By default (auto_prepare=True), pandas-cat calls prepare() before profiling. prepare() converts each column to an ordered pandas.Categorical where possible: labels that contain numbers, such as ranges ("0-10", "Over 75", "60+"), are sorted by their first numeric token; plain text columns are sorted alphabetically; columns with no meaningful order are left as-is.

The effect is easiest to see directly:

import pandas as pd
import pandas_cat

df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

print("Before:")
print(df['Driver_Age_Band'].dtype)           # str
print(sorted(df['Driver_Age_Band'].unique())) # alphabetical

df = pandas_cat.prepare(df)

print("After:")
print(df['Driver_Age_Band'].dtype)                   # category (ordered=True)
print(df['Driver_Age_Band'].cat.categories.tolist())  # natural numeric order

Output:

Before:
str
['16 - 20', '21 - 25', '26 - 35', '36 - 45',
 '46 - 55', '56 - 65', '66 - 75', 'Over 75']

After:
category
['16 - 20', '21 - 25', '26 - 35', '36 - 45',
 '46 - 55', '56 - 65', '66 - 75', 'Over 75']

Without prepare(), a label such as "6 - 10" sorts after "56 - 65", because strings are compared character by character. After prepare(), every downstream operation (bar charts, frequency tables, correlation matrices) uses the correct semantic order.
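
The difference between the two orderings can be reproduced with plain pandas. This is a minimal sketch of the idea, not the code prepare() runs: first_number is a hypothetical helper, the labels are made up, and sending labels without any digits to the end is an assumption of the example.

```python
import re

import pandas as pd


def first_number(label):
    """Return the first integer in a label; labels without digits sort last."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else float("inf")


# Made-up labels in the style of Driver_Age_Band.
bands = pd.Series(["56 - 65", "6 - 10", "Over 75", "16 - 20", "6 - 10"])
labels = bands.unique().tolist()

alphabetical = sorted(labels)                # plain string comparison
natural = sorted(labels, key=first_number)   # compare by first numeric token

# An ordered Categorical carries the natural order into later operations.
ordered = pd.Series(pd.Categorical(bands, categories=natural, ordered=True))

print(alphabetical)  # ['16 - 20', '56 - 65', '6 - 10', 'Over 75']
print(natural)       # ['6 - 10', '16 - 20', '56 - 65', 'Over 75']
print(ordered.min(), "->", ordered.max())
```

Because the categorical is ordered, min(), max(), and sorting follow the category order rather than the string order.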

To skip preparation:

pandas_cat.profile(df=df, dataset_name="Accidents",
                   opts={"auto_prepare": False})

To run preparation without generating a report:

df = pandas_cat.prepare(df)

Note

pandas-cat is a categorical profiling package. The default preparation engine reflects this: plain numeric columns with cat_limit or fewer distinct values (default 20) are converted to ordered CategoricalDtype and shown as frequency bar charts rather than histograms. Numeric columns with more distinct values are left as continuous and shown with histograms.

No analytical information is lost: a bar chart shows every value and its exact frequency. A KDE histogram for low-cardinality discrete data (e.g. a 0/1 flag or a 1–5 rating scale) is less informative and visually misleading anyway.
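
The cardinality rule described above can be approximated in a few lines of pandas. The helper name treated_as_categorical is hypothetical and the check only mirrors the documented behavior; it is not the library's code.

```python
import pandas as pd


def treated_as_categorical(series, cat_limit=20):
    """Approximate the documented rule: a numeric column with at most
    cat_limit distinct values is handled as categorical."""
    is_numeric = pd.api.types.is_numeric_dtype(series)
    return bool(is_numeric and series.nunique(dropna=True) <= cat_limit)


rating = pd.Series([1, 2, 3, 4, 5, 3, 4])   # 5 distinct values
reading = pd.Series(range(100))             # 100 distinct values

print(treated_as_categorical(rating))    # True  -> frequency bar chart
print(treated_as_categorical(reading))   # False -> histogram
```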

To force a numeric column to stay continuous regardless of cardinality, cast it explicitly and disable automatic preparation when calling profile():

df['rating'] = df['rating'].astype(float)
pandas_cat.profile(df=df, opts={"auto_prepare": False})

Options reference

All options are passed as a dict to the opts parameter:

Option            Default   Description
auto_prepare      True      Call prepare() before profiling.
cat_limit         20        Categorical columns with more distinct values than this are excluded. Does not apply to continuous columns.
na_values         None      Additional strings to treat as missing (on top of the built-in list).
na_ignore         None      Strings from the built-in missing-value list to not treat as missing.
keep_default_na   True      When False, only na_values are used; the built-in list is ignored.
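
How the three missing-value options combine can be sketched with set arithmetic. Everything here is illustrative: BUILTIN_NA is a hypothetical stand-in for the built-in list (its real contents are not given above), effective_na_strings is not a pandas-cat function, and letting na_values win over na_ignore is an assumption.

```python
# Hypothetical stand-in for the built-in missing-value list.
BUILTIN_NA = {"", "NA", "N/A"}


def effective_na_strings(opts, builtin=BUILTIN_NA):
    """Combine the options as the table describes them."""
    extra = set(opts.get("na_values") or [])
    if not opts.get("keep_default_na", True):
        # Built-in list ignored entirely: only na_values count.
        return extra
    ignored = set(opts.get("na_ignore") or [])
    return (set(builtin) - ignored) | extra


opts = {"na_values": ["unknown"], "na_ignore": ["NA"]}
print(sorted(effective_na_strings(opts)))
# ['', 'N/A', 'unknown']

opts["keep_default_na"] = False
print(sorted(effective_na_strings(opts)))
# ['unknown']
```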