pandas-cat

pandas-cat (PANDAS-CATegorical profiling) is a library for profiling categorical datasets and preparing them for analysis. It generates HTML reports with category distributions, correlations, and missing-value summaries, and automatically reorders numeric-like categories into their natural order.

Pass a pandas DataFrame to profile() and get a self-contained HTML report in one call:

import pandas_cat

# df is an existing pandas DataFrame
pandas_cat.profile(df, dataset_name="Road accidents")

The report gives you:

  • Bar charts — frequency counts and percentages for every categorical column.

  • Histograms — distribution for every numeric column.

  • Correlations — between all variables and between categorical values.

  • Missing-value summary — sentinel detection and gap counts per column.

  • Memory breakdown — usage by column.
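The figures behind several of these report sections can be approximated in plain pandas. The sketch below is not pandas-cat's implementation; it only illustrates what "frequency counts and percentages", per-column missing counts, and per-column memory usage mean. The sample DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "severity": ["Slight", "Slight", "Serious", "Fatal", None],
    "speed_limit": [30, 30, 60, 70, 30],
})

# Frequency counts and percentages for one categorical column.
# value_counts() ignores missing values by default.
counts = df["severity"].value_counts()
percent = (df["severity"].value_counts(normalize=True) * 100).round(1)
freq = pd.DataFrame({"count": counts, "percent": percent})

# Number of missing values in each column.
missing = df.isna().sum()

# Memory usage per column in bytes; deep=True also measures string contents.
memory = df.memory_usage(index=False, deep=True)

print(freq)
print(missing)
print(memory)
```

The generated report presents this kind of information as charts and tables, so the snippet is mainly useful for sanity-checking a report against the raw data.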

Two preparation helpers keep the data clean before profiling:

  • prepare(df) detects numeric-like categories and converts them to ordered CategoricalDtype so charts and correlations respect natural order.

    Without prepare(), pandas sorts string categories alphabetically, which puts numeric ranges out of order:

    # Alphabetical (wrong) — pandas default
    16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
    

    After prepare(), the natural numeric order is restored:

    # Natural order (correct) — after prepare()
    Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
    
  • handle_missing_values(df) replaces 75+ sentinel strings ("Unknown", "N/A", "–", "Missing", …) with pd.NA so they are counted as missing rather than treated as valid categories.
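To make the two ideas above concrete, here is a minimal plain-pandas sketch of natural ordering and sentinel replacement. It is a stand-in written for illustration, not the code pandas-cat runs: the helper names (natural_key, to_natural_order, replace_sentinels), the tiny SENTINELS set, and the rule that "Under N" sorts just before N are all assumptions.

```python
import re
import pandas as pd

# Small illustrative subset; the library's own list is much longer.
SENTINELS = {"unknown", "n/a", "missing", "-", "–"}


def natural_key(label: str):
    """Sort key: first integer in the label; 'Under N' sorts just before N."""
    match = re.search(r"\d+", label)
    number = int(match.group()) if match else float("inf")
    is_lower_bound = label.strip().lower().startswith(("under", "<"))
    return (number, 0 if is_lower_bound else 1)


def to_natural_order(series: pd.Series) -> pd.Series:
    """Convert a string column to an ordered categorical in numeric order."""
    labels = sorted(series.dropna().unique(), key=natural_key)
    dtype = pd.CategoricalDtype(categories=labels, ordered=True)
    return series.astype(dtype)


def replace_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Replace sentinel strings (case-insensitive) with pd.NA."""
    def clean(value):
        if isinstance(value, str) and value.strip().lower() in SENTINELS:
            return pd.NA
        return value
    return df.apply(lambda col: col.map(clean))


# Hypothetical column of age bands with one sentinel value.
ages = pd.Series(["16–20", "Under 6", "76+", "6–10", "21–25", "Unknown"])

cleaned = replace_sentinels(ages.to_frame("age_band"))["age_band"]
ordered = to_natural_order(cleaned)

print(list(ordered.cat.categories))
# ['Under 6', '6–10', '16–20', '21–25', '76+']
print(int(ordered.isna().sum()))
# 1
```

Replacing sentinels first matters in this sketch: otherwise "Unknown" would be kept as a category and, having no digits, would sort to the end of the ordering.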