API Reference

All public functions are available directly on the pandas_cat module.

pandas_cat.profile(df=None, dataset_name=None, template=None, out_html='report.html', opts=None, verbose=True)

Profile a dataset and write an HTML report.

The report is written to <cwd>/report/<out_html>. The directory is created automatically if it does not exist.

Categorical columns produce frequency bar charts and crosstab heatmaps. Numeric (continuous) columns produce histograms with mean/median overlays and descriptive statistics.

Parameters:
  • df (DataFrame | None) – DataFrame to profile.

  • dataset_name (str | None) – Title shown in the report header.

  • template (str | None) – Built-in name (None/'default', 'modern', 'interactive') or a file-system path to a custom .html.j2 template. Custom templates declare their rendering mode with {# pandas-cat: mode=interactive #}; anything without that tag renders as static (SVG charts).

  • out_html (str) – Output filename (basename only).

  • opts (dict | None) –

    Optional settings dict:

    • auto_prepare (bool, default True)

    • cat_limit (int, default 20)

    • na_values (list)

    • na_ignore (list)

    • keep_default_na (bool, default True)

Return type:

None

Returns:

None.
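
A minimal usage sketch of profile(), assuming pandas-cat is installed and df is an existing DataFrame. The keyword names come from the signature above; the wrapper name write_report, the dataset title, the output filename, and the option values are illustrative choices, not part of the library.

```python
# Option keys are the documented ones; the values are arbitrary illustrations.
OPTS = {
    "auto_prepare": True,
    "cat_limit": 15,
    "na_values": ["unknown"],
    "keep_default_na": True,
}


def write_report(df, name="Demo dataset"):
    """Profile df with the interactive template (hypothetical wrapper)."""
    import pandas_cat  # imported lazily so this module loads without the package

    pandas_cat.profile(
        df=df,
        dataset_name=name,
        template="interactive",
        out_html="demo.html",
        opts=OPTS,
    )
    # profile() returns None; the report is written to <cwd>/report/demo.html.
```

Calling write_report(df) from a script would leave the report in report/demo.html under the current working directory.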

pandas_cat.prepare(df=None, opts=None, auto_data_prep='default', verbose=True)

Prepare a categorical dataset by converting numeric-like columns to ordered pandas.Categorical.

Parameters:
  • df (DataFrame | None) – DataFrame to prepare.

  • opts (dict | None) – Options forwarded to the underlying engine.

  • auto_data_prep (str) – 'CLM' to use CleverMiner; any other value (default 'default') uses the built-in conversion, which respects cat_limit and leaves high-cardinality numeric columns as continuous.

Return type:

DataFrame

Returns:

New DataFrame with eligible columns as ordered CategoricalDtype.
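
A short sketch of calling prepare() directly, assuming pandas-cat is installed. The wrapper name is hypothetical, and passing cat_limit through opts is an assumption based on the parameter descriptions above rather than a confirmed contract.

```python
def prepare_with_builtin_engine(df, cat_limit=20):
    """Return a prepared copy of df using the built-in engine (hypothetical wrapper)."""
    import pandas_cat  # lazy import: only needed when the function runs

    # Assumption: the built-in conversion reads cat_limit from opts.
    return pandas_cat.prepare(
        df=df,
        opts={"cat_limit": cat_limit},
        auto_data_prep="default",  # 'CLM' would select CleverMiner instead
        verbose=False,
    )
```

The return value is a DataFrame, so the result can be inspected (for example with its dtypes attribute) before it is passed on to profile().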

pandas_cat.handle_missing_values(df, na_values=None, na_ignore=None, keep_default_na=True)

Replace sentinel string values with pd.NA.

The DataFrame in the returned tuple is a new object; the input is never modified.

Parameters:
  • df (DataFrame) – DataFrame to process.

  • na_values (list | None) – Additional strings to treat as missing.

  • na_ignore (list | None) – Built-in sentinel strings to exclude.

  • keep_default_na (bool) – When False, only na_values are used.

Returns:

(df, detected, counts) tuple.
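
How the three options combine can be sketched in plain Python. This is an illustration of the documented rules, not the library's code: DEFAULT_SENTINELS is a hypothetical stand-in for the built-in list, None stands in for pd.NA, and both function names are invented for the example.

```python
# Hypothetical stand-in for the built-in sentinel list; the real list is
# defined by pandas-cat and is not reproduced here.
DEFAULT_SENTINELS = {"NA", "N/A", "?", "-"}


def effective_sentinels(na_values=None, na_ignore=None, keep_default_na=True):
    """Combine the three options into the set of strings treated as missing."""
    sentinels = set()
    if keep_default_na:
        # na_ignore removes entries from the built-in list only.
        sentinels |= DEFAULT_SENTINELS - set(na_ignore or [])
    sentinels |= set(na_values or [])
    return sentinels


def replace_sentinels(values, sentinels):
    """Return a new list with sentinels replaced by None, plus per-sentinel counts."""
    counts = {}
    cleaned = []
    for value in values:
        if value in sentinels:
            counts[value] = counts.get(value, 0) + 1
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned, counts
```

With keep_default_na=False the built-in list is skipped entirely, so only the strings in na_values are replaced.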


Supplementary notes

Output location

profile() always writes its output to <current working directory>/report/<out_html>. The report/ directory is created if it does not exist. There is no return value; the file is the side effect.
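
Because profile() returns nothing, a script that needs the report afterwards has to rebuild the path itself. A small sketch using only the standard library; report_path is a hypothetical helper, not a pandas-cat function.

```python
from pathlib import Path


def report_path(out_html="report.html"):
    """Location where profile() is documented to write a report named out_html."""
    return Path.cwd() / "report" / out_html


path = report_path("demo.html")
print(path)           # absolute path ending in report/demo.html
print(path.exists())  # whether a file by that name is already present
```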

Column type detection

pandas-cat automatically detects column types and adapts the report:

  • Continuous – detected when dtype is numeric (int8 through float64) and not pandas.CategoricalDtype.

  • Categorical – all other columns: object, str, bool, pandas.CategoricalDtype.

Continuous columns are profiled with a histogram and descriptive statistics. Categorical columns are profiled with a frequency bar chart and frequency table. The cat_limit option applies only to categorical columns — continuous columns are never excluded on category-count grounds.

Effect of automatic data preparation on numeric columns

Because pandas-cat is a categorical profiling package, the default preparation engine converts plain numeric columns with cat_limit or fewer distinct values to ordered CategoricalDtype before profiling — so a 0/1 flag or a 1–5 rating scale produces a frequency bar chart rather than a histogram. Numeric columns with more distinct values are left unchanged and profiled as continuous.

No information is lost: a bar chart shows every value and its exact frequency. A KDE histogram for low-cardinality discrete data is less informative and can be visually misleading. To keep a low-cardinality numeric column as continuous, either set auto_prepare to False in opts (opts={'auto_prepare': False}) or cast the column to float before calling profile().
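
The conversion rule for plain numeric columns can be illustrated without pandas. The sketch below is an approximation of the documented rule; the function name is hypothetical, and ignoring missing values when counting distinct values is an assumption.

```python
import math


def becomes_categorical(values, cat_limit=20):
    """True if a plain numeric column would be converted under the rule above."""
    # Assumption: missing values (None or NaN) do not count as a distinct value.
    distinct = {
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    }
    return len(distinct) <= cat_limit


flag = [0, 1, 1, 0, 1]            # two distinct values
rating = [1, 2, 3, 4, 5, 3, 2]    # five distinct values
age = list(range(18, 90))         # 72 distinct values

print(becomes_categorical(flag))    # True: shown as a frequency bar chart
print(becomes_categorical(rating))  # True
print(becomes_categorical(age))     # False: stays continuous, shown as a histogram
```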

Templates

Values accepted by template=:

  • None / 'default' – Static HTML. Includes a memory-usage bar chart, per-attribute profiles (frequency tables + bar charts for categorical; histograms + descriptive statistics for continuous), a Cramér’s V heatmap (categorical columns), a Pearson correlation heatmap (continuous columns, when >= 2 exist), and individual column-pair crosstab heatmaps (categorical only).

  • 'modern' – Same content as the default template, rendered with an alternative visual style (modern.html.j2). Uses the same context variables and renderer as 'default'.

  • 'interactive' – Data-driven HTML. Three correlation metrics per column pair (Cramér’s V, Spearman Rank, Theil’s U), per-category crosstabs, explicit missing-value report, and a list of excluded attributes. Progress is printed to the console as six numbered steps.
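
Custom templates declare their rendering mode with the tag described under the template parameter of profile(). A sketch of that rule as a regular expression follows; the function name template_mode and the tolerance for extra whitespace are assumptions, and the library's own parsing may be stricter or looser.

```python
import re

# Matches a Jinja comment such as {# pandas-cat: mode=interactive #}.
MODE_TAG = re.compile(r"\{#\s*pandas-cat:\s*mode=(\w+)\s*#\}")


def template_mode(source):
    """Return 'interactive' if the template text carries the tag, else 'static'."""
    match = MODE_TAG.search(source)
    if match and match.group(1) == "interactive":
        return "interactive"
    return "static"  # documented fallback: no tag means static (SVG charts)
```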

Options (opts dict)

  • auto_prepare (default True) – Call prepare() before profiling. Converts numeric-like string columns to ordered pandas.Categorical so charts and tables show categories in natural numeric order.

  • cat_limit (default 20) – Maximum number of distinct categories a categorical column may have to be included in the report. Columns over the limit are excluded with a warning. Does not apply to continuous (numeric) columns.

  • na_values (default None) – Extra strings to treat as missing (in addition to the built-in list).

  • na_ignore (default None) – Strings from the built-in missing-value list that should not be replaced with pd.NA.

  • keep_default_na (default True) – When False, ignore the built-in sentinel list and use only na_values.

Correlation metrics

Cramér’s V (default, modern, and interactive templates; categorical columns)

Symmetric measure based on the chi-squared statistic, using Bergsma’s bias correction, which matters most for small samples. Range: 0 (no association) to 1 (perfect association).
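
For reference, the bias-corrected statistic can be computed in a few lines of plain Python. The formula is the one published by Bergsma (2013); the code is an independent sketch and has not been checked against the pandas-cat implementation.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length sequences of categories."""
    n = len(xs)
    if n < 2:
        return 0.0
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_total in row.items():
        for b, col_total in col.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected

    phi2 = chi2 / n
    r, k = len(row), len(col)
    # Bergsma's correction shrinks phi2 and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0
```

Two identical columns give a value of 1 (up to rounding), and two independent columns give 0 because the corrected phi-squared is clipped at zero.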

Pearson (default and modern templates; continuous columns)

Linear correlation coefficient. Range: −1 (perfect negative) – 1 (perfect positive), 0 = no linear relation. Shown when the DataFrame contains at least two continuous columns.

Spearman Rank (interactive template)

Rank correlation after converting categories to integer codes. Range: −1 – 1. Most meaningful for ordered categoricals.
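
The idea of ranking category codes can be sketched with the standard library alone. The helper names and the explicit category-order arguments are assumptions made for the example; the library derives codes from the column's categorical dtype, and its tie handling may differ from the average-rank method used here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when either column is constant
    return cov / sqrt(var_a * var_b)


def spearman_from_categories(xs, x_order, ys, y_order):
    """Spearman rank correlation after mapping categories to integer codes."""
    x_codes = [x_order.index(x) for x in xs]
    y_codes = [y_order.index(y) for y in ys]
    return pearson(average_ranks(x_codes), average_ranks(y_codes))
```

The result depends on the category order supplied, which is why the metric is most meaningful for ordered categoricals: reversing one column's order flips the sign.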

Theil’s U (interactive template)

Asymmetric uncertainty coefficient. U(X→Y) measures how much knowing X reduces uncertainty about Y. Range: 0 – 1.
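
The asymmetry is easiest to see in a small computation. The sketch below uses the textbook definition U(X→Y) = (H(Y) − H(Y|X)) / H(Y); the function names are illustrative, and returning 1.0 when Y is constant is a convention chosen for this example rather than documented library behaviour.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(V) in nats."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def conditional_entropy(ys, xs):
    """H(Y | X): remaining uncertainty about Y once X is known."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    return -sum(c / n * log(c / x_counts[x]) for (x, _), c in joint.items())


def theils_u(xs, ys):
    """U(X -> Y): fraction of the uncertainty in Y removed by knowing X."""
    h_y = entropy(ys)
    if h_y == 0:
        return 1.0  # convention for a constant Y (assumption)
    return (h_y - conditional_entropy(ys, xs)) / h_y


xs = ["a", "b", "c", "d"]
ys = ["p", "p", "q", "q"]
print(theils_u(xs, ys))  # close to 1.0: X fully determines Y
print(theils_u(ys, xs))  # close to 0.5: Y only halves the uncertainty about X
```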

prepare() engine

The auto_data_prep parameter of prepare() selects the backend:

  • 'default' (default) – uses the built-in conversion: detects numeric-like string columns and converts them to ordered CategoricalDtype. Plain numeric columns with more unique values than cat_limit are left as continuous.

  • 'CLM' – delegates to CleverMiner (>= 1.0.7). If an older version is installed, the DataFrame is returned unchanged. Note: CleverMiner drops high-cardinality numeric columns entirely rather than leaving them as continuous.

Note

prepare() converts all columns that can be given a meaningful order. String-encoded categories (e.g. "0-10", "Over 75") are sorted by their leading numeric token. Plain numeric columns (float64, int64) with cat_limit or fewer distinct values are also converted to ordered CategoricalDtype, so they appear as frequency bar charts rather than histograms. Columns whose values have no natural numeric order are sorted alphabetically.
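
Ordering by a numeric token can be sketched as a sort key. This is an approximation for illustration: it treats the first number found anywhere in the label as the token and falls back to alphabetical order when there is none, which may not match the library's parsing in every case. The function name is hypothetical.

```python
import re


def leading_number(label):
    """Sort key: labels with a number come first, ordered by that number."""
    match = re.search(r"\d+(?:\.\d+)?", label)
    if match:
        return (0, float(match.group()), label)
    return (1, 0.0, label)  # no number: fall back to alphabetical order


labels = ["Over 75", "11-20", "0-10", "21-75"]
ordered = sorted(labels, key=leading_number)
print(ordered)  # ['0-10', '11-20', '21-75', 'Over 75']
```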