pandas-cat
pandas-cat (PANDAS-CATegorical profiling) is a library for profiling categorical datasets and preparing them for analysis. It generates HTML reports with category distributions, correlations, and missing-value summaries, and automatically reorders numeric-like categories into their natural order.
Pass any DataFrame and get a self-contained HTML report in one call:
import pandas_cat
pandas_cat.profile(df, dataset_name="Road accidents")
The report gives you:
Bar charts — frequency counts and percentages for every categorical column.
Histograms — distribution for every numeric column.
Correlations — between all variables and between categorical values.
Missing-value summary — sentinel detection and gap counts per column.
Memory breakdown — usage by column.
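For orientation, the figures in such a report roughly correspond to quantities you can compute with plain pandas. The sketch below is not pandas-cat's implementation; the tiny DataFrame and its column names (severity, speed_limit) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "severity": ["Slight", "Slight", "Serious", "Fatal", None],
    "speed_limit": [30, 30, 60, 70, 30],
})

# Frequency counts and percentages for a categorical column
# (value_counts ignores missing values by default).
counts = df["severity"].value_counts()
percentages = df["severity"].value_counts(normalize=True) * 100

# Missing values per column.
missing = df.isna().sum()

# Memory usage per column in bytes, including string contents.
memory_bytes = df.memory_usage(deep=True)

print(counts, percentages, missing, memory_bytes, sep="\n\n")
```

The HTML report presents this kind of information as charts and tables, so none of it needs to be computed by hand.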
Two preparation helpers keep the data clean before profiling:
prepare(df) detects numeric-like categories and converts them to an ordered CategoricalDtype so charts and correlations respect natural order. Without prepare(), pandas sorts categories alphabetically, a common trap:
# Alphabetical (wrong): pandas default
16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
After prepare(), the natural numeric order is restored:
# Natural order (correct): after prepare()
Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
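To make the ordering issue concrete, here is a minimal sketch of building an ordered categorical by hand with plain pandas. It is not how prepare() is implemented; the leading_number helper and its handling of "Under N" labels are assumptions made for this example, and only a few of the age bands are used.

```python
import re
import pandas as pd

# A handful of the age bands from the lists above, in arbitrary order.
ages = pd.Series(["16–20", "Under 6", "76+", "6–10", "21–25"])
labels = ages.unique().tolist()

def leading_number(label: str) -> float:
    """Hypothetical sort key: the first integer found in the label.

    Labels starting with "Under" are placed just below their number,
    and labels without any digits are sorted last.
    """
    match = re.search(r"\d+", label)
    if match is None:
        return float("inf")
    value = float(match.group())
    return value - 0.5 if label.lower().startswith("under") else value

# Plain string sorting compares character by character,
# so "16–20" comes before "6–10".
alphabetical = sorted(labels)

# Sort by the numeric key instead and store the result as an ordered dtype.
natural = sorted(labels, key=leading_number)
age_dtype = pd.CategoricalDtype(categories=natural, ordered=True)
ordered_ages = ages.astype(age_dtype)

print(alphabetical)
print(ordered_ages.sort_values().tolist())
```

Once a column carries an ordered CategoricalDtype, sorting, grouping, and plotting follow the category order rather than the string order.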
handle_missing_values(df) replaces 75+ sentinel strings ("Unknown", "N/A", "–", "Missing", …) with pd.NA so they are counted as missing rather than treated as valid categories.
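The effect of sentinel replacement can be illustrated with plain pandas. This is a sketch of the idea only, not the library's code: the SENTINELS list below is a hypothetical four-item subset, and the sample columns are invented for the example.

```python
import pandas as pd

# Hypothetical subset of sentinel strings; the library's own list is longer.
SENTINELS = ["Unknown", "N/A", "–", "Missing"]

# Illustrative data with sentinels mixed in among real categories.
df = pd.DataFrame({
    "weather": ["Fine", "Unknown", "Rain", "N/A"],
    "surface": ["Dry", "Wet", "–", "Missing"],
})

# Before cleaning, "Unknown" is counted as an ordinary category.
before = df["weather"].value_counts()

# Replace every cell that matches a sentinel with pd.NA.
cleaned = df.mask(df.isin(SENTINELS), pd.NA)

# After cleaning, sentinels show up as missing values instead.
after = cleaned["weather"].value_counts()
missing_per_column = cleaned.isna().sum()

print(before, after, missing_per_column, sep="\n\n")
```

Without this step, a bar chart would show "Unknown" as a category of its own and the missing-value count for the column would be zero.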
Contents