API Reference

All public functions are available directly on the pandas_cat module.

pandas_cat.profile(df=None, dataset_name=None, template=None, out_html='report.html', opts=None, verbose=True)

Profile a dataset and write an HTML report.

The report is written to <cwd>/report/<out_html>. The directory is created automatically if it does not exist.

Categorical columns produce frequency bar charts and crosstab heatmaps. Numeric (continuous) columns produce histograms with mean/median overlays and descriptive statistics.

Parameters:

df (DataFrame | None) – DataFrame to profile.
dataset_name (str | None) – Title shown in the report header.
template (str | None) – Built-in name (None/'default', 'modern', 'interactive') or a file-system path to a custom .html.j2 template. Custom templates declare their rendering mode with {# pandas-cat: mode=interactive #}; anything without that tag renders as static (SVG charts).
out_html (str) – Output filename (basename only).
opts (dict | None) –
Optional settings dict:
- auto_prepare (bool, default True)
- cat_limit (int, default 20)
- na_values (list)
- na_ignore (list)
- keep_default_na (bool, default True)

Return type:

None

Returns:

None.
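The rendering-mode rule for custom templates (see the template parameter) can be pictured with a small check. This is only a sketch of the documented rule, not the library's own parser: the function name declared_mode is hypothetical, and the tolerance for extra whitespace inside the tag is an assumption.

```python
import re

# Tag syntax copied from the template parameter description. Allowing
# flexible whitespace is an assumption of this sketch.
INTERACTIVE_TAG = re.compile(r"\{#\s*pandas-cat:\s*mode=interactive\s*#\}")


def declared_mode(template_source: str) -> str:
    """Return 'interactive' if the template declares it, else 'static'."""
    return "interactive" if INTERACTIVE_TAG.search(template_source) else "static"


interactive_tpl = "{# pandas-cat: mode=interactive #}\n<html><body></body></html>"
plain_tpl = "<html><body></body></html>"

print(declared_mode(interactive_tpl))
print(declared_mode(plain_tpl))
```

A template without the tag falls through to static, which matches the "anything without that tag" wording above.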

pandas_cat.prepare(df=None, opts=None, auto_data_prep='default', verbose=True)

Prepare a categorical dataset by converting numeric-like columns to ordered pandas.Categorical.

Parameters:

df (DataFrame | None) – DataFrame to prepare.
opts (dict | None) – Options forwarded to the underlying engine.
auto_data_prep (str) – 'CLM' to use CleverMiner; any other value (default 'default') uses the built-in conversion, which respects cat_limit and leaves high-cardinality numeric columns as continuous.

Return type:

DataFrame

Returns:

New DataFrame with eligible columns as ordered CategoricalDtype.

pandas_cat.handle_missing_values(df, na_values=None, na_ignore=None, keep_default_na=True)

Replace sentinel string values with pd.NA.

The input DataFrame is never modified; the processed copy is returned as the first element of the result tuple.

Parameters:

df (DataFrame) – DataFrame to process.
na_values (list | None) – Additional strings to treat as missing.
na_ignore (list | None) – Built-in sentinel strings to exclude.
keep_default_na (bool) – When False, only na_values are used.

Returns:

(df, detected, counts) tuple.
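How the three parameters might combine into one set of sentinel strings is sketched below. This page does not list the built-in sentinels, so ASSUMED_DEFAULTS is a hypothetical stand-in, and effective_sentinels is an illustrative helper rather than part of pandas_cat. Letting na_values win over na_ignore when a string appears in both is also an assumption.

```python
from typing import Iterable, Optional, Set

# Hypothetical stand-in for the built-in sentinel list, which this page
# does not enumerate.
ASSUMED_DEFAULTS = frozenset({"", "NA", "N/A", "?"})


def effective_sentinels(
    na_values: Optional[Iterable[str]] = None,
    na_ignore: Optional[Iterable[str]] = None,
    keep_default_na: bool = True,
) -> Set[str]:
    """Combine the three parameters into the set of strings treated as missing."""
    extra = set(na_values or [])
    if not keep_default_na:
        # Only the caller-supplied strings count as missing.
        return extra
    # Built-in list minus the ignored entries, plus the extras.
    return (set(ASSUMED_DEFAULTS) - set(na_ignore or [])) | extra


print(sorted(effective_sentinels(na_values=["missing"], na_ignore=["?"])))
print(sorted(effective_sentinels(na_values=["missing"], keep_default_na=False)))
```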

Supplementary notes

Output location

profile() always writes its output to <current working directory>/report/<out_html>. The report/ directory is created if it does not exist. There is no return value; the file is the side effect.
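Because profile() returns nothing, a caller that wants to open or move the report has to work out the path itself. The helper below is a minimal sketch of the documented location. The name expected_report_path is hypothetical, and stripping any directory part from out_html is this sketch's reading of "basename only", not confirmed library behaviour.

```python
from pathlib import Path


def expected_report_path(out_html: str = "report.html") -> Path:
    """Return <cwd>/report/<out_html> as described above."""
    # Path(...).name keeps only the final path component.
    return Path.cwd() / "report" / Path(out_html).name


print(expected_report_path("titanic.html"))
```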

Column type detection

pandas-cat automatically detects column types and adapts the report:

Column type	Detected when
Continuous	`dtype` is numeric (`int8` through `float64`) and not `pandas.CategoricalDtype`.
Categorical	All other columns: `object`, `str`, `bool`, `pandas.CategoricalDtype`.

Continuous columns are profiled with a histogram and descriptive statistics. Categorical columns are profiled with a frequency bar chart and frequency table. The cat_limit option applies only to categorical columns — continuous columns are never excluded on category-count grounds.

Effect of automatic data preparation on numeric columns

Because pandas-cat is a categorical profiling package, the default preparation engine converts plain numeric columns with cat_limit or fewer distinct values to ordered CategoricalDtype before profiling — so a 0/1 flag or a 1–5 rating scale produces a frequency bar chart rather than a histogram. Numeric columns with more distinct values are left unchanged and profiled as continuous.

No information is lost: a bar chart shows every value and its exact frequency. A KDE histogram for low-cardinality discrete data is less informative and can be visually misleading. To keep a low-cardinality numeric column as continuous, either set auto_prepare to False in the opts dict (opts={'auto_prepare': False}) or cast the column to float before calling profile().
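The decision described in this section can be modelled on plain Python values. This is a simplified sketch of the documented rule, not the library's implementation: profiled_as is a hypothetical name, it works on lists rather than pandas dtypes, and it does not model the float-cast escape hatch mentioned above.

```python
def profiled_as(values, cat_limit=20, auto_prepare=True):
    """Classify one column's values as 'categorical' or 'continuous'."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, but bool columns are listed
    # as categorical in the detection table, so exclude it explicitly.
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if not numeric:
        return "categorical"
    if auto_prepare and len(set(present)) <= cat_limit:
        # Low-cardinality numeric column: converted to an ordered categorical.
        return "categorical"
    return "continuous"


print(profiled_as([0, 1, 1, 0]))                      # 0/1 flag
print(profiled_as(list(range(100))))                  # 100 distinct values
print(profiled_as([0, 1, 1, 0], auto_prepare=False))  # preparation disabled
```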

Templates

`template=`	Description
`None` / `'default'`	Static HTML. Includes a memory-usage bar chart, per-attribute profiles (frequency tables + bar charts for categorical; histograms + descriptive statistics for continuous), a Cramér’s V heatmap (categorical columns), a Pearson correlation heatmap (continuous columns, when >= 2 exist), and individual column-pair crosstab heatmaps (categorical only).
`'modern'`	Same content as the default template, rendered with an alternative visual style (`modern.html.j2`). Uses the same context variables and renderer as `'default'`.
`'interactive'`	Data-driven HTML. Three correlation metrics per column pair (Cramér’s V, Spearman Rank, Theil’s U), per-category crosstabs, explicit missing-value report, and a list of excluded attributes. Progress is printed to the console as six numbered steps.

Options (`opts` dict)

Key	Default	Description
`auto_prepare`	`True`	Call `prepare()` before profiling. Converts numeric-like string columns to ordered `pandas.Categorical` so charts and tables show categories in natural numeric order.
`cat_limit`	`20`	Maximum number of distinct categories a categorical column may have to be included in the report. Columns over the limit are excluded with a warning. Does not apply to continuous (numeric) columns.
`na_values`	`None`	Extra strings to treat as missing (in addition to the built-in list).
`na_ignore`	`None`	Strings from the built-in missing-value list to not replace with `pd.NA`.
`keep_default_na`	`True`	When `False`, ignore the built-in sentinel list and use only `na_values`.
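As an illustration, an opts dict that sets every key in the table might look like the following. The values are examples only. The commented call follows the profile() signature documented above, with df standing for your own DataFrame and the dataset name and filename made up for the example.

```python
opts = {
    "auto_prepare": True,      # run prepare() before profiling (the default)
    "cat_limit": 30,           # raise the category limit from the default 20
    "na_values": ["unknown"],  # extra string to treat as missing (example value)
    "na_ignore": None,         # keep every built-in sentinel (the default)
    "keep_default_na": True,   # use the built-in list plus na_values (the default)
}

# Call shape taken from the profile() signature:
# pandas_cat.profile(df=df, dataset_name="Survey", template="interactive",
#                    out_html="survey.html", opts=opts)
print(opts)
```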

Correlation metrics

Cramér’s V (default, modern, and interactive templates; categorical columns): Symmetric measure based on the chi-squared statistic, using Bergsma's bias correction for small samples. Range: 0 (no association) – 1 (perfect association).
Pearson (default and modern templates; continuous columns): Linear correlation coefficient. Range: −1 (perfect negative) – 1 (perfect positive), 0 = no linear relation. Shown when the DataFrame contains at least two continuous columns.
Spearman Rank (interactive template): Rank correlation after converting categories to integer codes. Range: −1 – 1. Most meaningful for ordered categoricals.
Theil’s U (interactive template): Asymmetric uncertainty coefficient. U(X→Y) measures how much knowing X reduces uncertainty about Y. Range: 0 – 1.
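To make the symmetric/asymmetric distinction concrete, here is a standard-library sketch of two of the metrics above: Cramér's V with Bergsma's bias correction, and Theil's U. These are textbook formulas written for illustration. The function names are hypothetical and nothing here is taken from the pandas-cat source, so results may differ from the report in edge cases.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length categorical sequences."""
    n = len(xs)
    if n < 2:
        return 0.0
    joint, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, row_total in row.items():
        for y, col_total in col.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    r, k = len(row), len(col)
    # Bergsma's correction shrinks phi^2 and the table dimensions.
    phi2 = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return math.sqrt(phi2 / denom) if denom > 0 else 0.0


def theils_u(xs, ys):
    """U(X→Y): share of the entropy of ys that is removed by knowing xs."""
    n = len(xs)
    h_y = -sum(c / n * math.log(c / n) for c in Counter(ys).values())
    if h_y == 0:
        return 1.0  # ys is constant, so there is no uncertainty to reduce
    x_counts = Counter(xs)
    # Conditional entropy H(Y|X) from joint and marginal counts.
    h_y_given_x = -sum(
        c / n * math.log(c / x_counts[x])
        for (x, _), c in Counter(zip(xs, ys)).items()
    )
    return (h_y - h_y_given_x) / h_y


xs = ["a", "a", "b", "b", "c", "c"]
ys = ["u", "u", "v", "v", "v", "v"]
print(cramers_v_corrected(xs, ys), cramers_v_corrected(ys, xs))  # same both ways
print(theils_u(xs, ys), theils_u(ys, xs))  # differs by direction
```

In this toy data xs fully determines ys but not the reverse, so Theil's U differs by direction while Cramér's V does not.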

`prepare()` engine

The auto_data_prep parameter of prepare() selects the backend:

'default' (default) — uses the built-in conversion: detects numeric-like string columns and converts them to ordered CategoricalDtype. Plain numeric columns with more unique values than cat_limit are left as continuous.
'CLM' — delegates to CleverMiner (>= 1.0.7). If an older version is installed the DataFrame is returned unchanged. Note: CleverMiner drops high-cardinality numeric columns entirely rather than leaving them as continuous.

Note

prepare() converts all columns that can be given a meaningful order. String-encoded categories (e.g. "0-10", "Over 75") are sorted by their leading numeric token. Plain numeric columns (float64, int64) with cat_limit or fewer distinct values are also converted to ordered CategoricalDtype — they appear as frequency bar charts rather than histograms. Columns that cannot be ordered are sorted alphabetically.
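Ordering string-encoded categories by their numeric token can be sketched as follows. The helper order_categories is hypothetical. Treating the first number found anywhere in the label as the token, and placing labels without any number last in alphabetical order, are assumptions of this sketch rather than documented behaviour.

```python
import re

# First run of digits (with optional decimal part) in a label.
_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def order_categories(labels):
    """Sort distinct labels numerically by their first number."""
    def sort_key(label):
        match = _FIRST_NUMBER.search(label)
        if match:
            return (0, float(match.group()), label)
        # No number at all: fall back to alphabetical order, after numbered labels.
        return (1, 0.0, label)

    return sorted(set(labels), key=sort_key)


print(order_categories(["Over 75", "11-20", "0-10", "21-75"]))
# ['0-10', '11-20', '21-75', 'Over 75']
```

A plain alphabetical sort would put "11-20" before "21-75" only by accident and would fail on labels such as "5-9" versus "10-14", which is why a numeric key is used.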
