API Reference
All public functions are available directly on the pandas_cat module.
- pandas_cat.profile(df=None, dataset_name=None, template=None, out_html='report.html', opts=None, verbose=True)
Profile a dataset and write an HTML report.
The report is written to <cwd>/report/<out_html>. The directory is created automatically if it does not exist.
Categorical columns produce frequency bar charts and crosstab heatmaps. Numeric (continuous) columns produce histograms with mean/median overlays and descriptive statistics.
- Parameters:
  - df (DataFrame | None) – DataFrame to profile.
  - dataset_name (str | None) – Title shown in the report header.
  - template (str | None) – Built-in name (None/'default', 'modern', 'interactive') or a file-system path to a custom .html.j2 template. Custom templates declare their rendering mode with {# pandas-cat: mode=interactive #}; anything without that tag renders as static (SVG charts).
  - out_html (str) – Output filename (basename only).
  - opts (dict | None) – Optional settings dict:
    - auto_prepare (bool, default True)
    - cat_limit (int, default 20)
    - na_values (list)
    - na_ignore (list)
    - keep_default_na (bool, default True)
- Return type:
  None
- Returns:
  None.
- pandas_cat.prepare(df=None, opts=None, auto_data_prep='default', verbose=True)
Prepare a categorical dataset by converting numeric-like columns to ordered pandas.Categorical.
- Parameters:
  - df (DataFrame | None) – DataFrame to prepare.
  - opts (dict | None) – Options forwarded to the underlying engine.
  - auto_data_prep (str) – 'CLM' to use CleverMiner; any other value (default 'default') uses the built-in conversion, which respects cat_limit and leaves high-cardinality numeric columns as continuous.
- Return type:
  DataFrame
- Returns:
  New DataFrame with eligible columns as ordered CategoricalDtype.
- pandas_cat.handle_missing_values(df, na_values=None, na_ignore=None, keep_default_na=True)
Replace sentinel string values with pd.NA.
Returns a new DataFrame; the input is never modified.
- Parameters:
  - df (DataFrame) – DataFrame to process.
  - na_values (list | None) – Additional strings to treat as missing.
  - na_ignore (list | None) – Built-in sentinel strings to exclude.
  - keep_default_na (bool) – When False, only na_values are used.
- Returns:
  (df, detected, counts) tuple.
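The interaction of na_values, na_ignore and keep_default_na can be illustrated with a small plain-Python sketch. It is not the library's implementation: it works on a dict of lists instead of a DataFrame, uses None in place of pd.NA, and the DEFAULT_SENTINELS list, the exact-match rule, and the meaning given to detected and counts are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical built-in sentinel list; the list used by pandas-cat may differ.
DEFAULT_SENTINELS = ["", "NA", "N/A", "null", "?"]


def effective_sentinels(na_values=None, na_ignore=None, keep_default_na=True):
    """Combine the three options into the set of strings treated as missing."""
    base = set(DEFAULT_SENTINELS) if keep_default_na else set()
    base -= set(na_ignore or [])          # built-ins the caller wants to keep
    return base | set(na_values or [])    # extra sentinels supplied by the caller


def replace_sentinels(columns, na_values=None, na_ignore=None, keep_default_na=True):
    """Return (new_columns, detected, counts) without modifying `columns`.

    Assumed meaning: `detected` is the sorted list of sentinels that were
    found, and `counts` maps each of them to its number of occurrences.
    """
    sentinels = effective_sentinels(na_values, na_ignore, keep_default_na)
    new_columns, counts = {}, Counter()
    for name, values in columns.items():
        cleaned = []
        for value in values:
            if isinstance(value, str) and value in sentinels:
                cleaned.append(None)      # stand-in for pd.NA
                counts[value] += 1
            else:
                cleaned.append(value)
        new_columns[name] = cleaned
    return new_columns, sorted(counts), dict(counts)
```

With na_values=['unknown'] and na_ignore=['NA'], the string 'unknown' is treated as missing while 'NA' is kept as an ordinary value.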
Supplementary notes
Output location
profile() always writes its output to
<current working directory>/report/<out_html>. The report/ directory
is created if it does not exist. There is no return value; the file is the
side effect.
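The output location described above can be sketched with pathlib. The helper name report_path and its base parameter are hypothetical, and reducing out_html to its basename is an assumption based on the "basename only" note in the parameter list.

```python
from pathlib import Path


def report_path(out_html="report.html", base=None):
    """Return <base>/report/<basename of out_html>, creating report/ if needed.

    `base` defaults to the current working directory. Dropping any directory
    part of `out_html` is an assumption of this sketch.
    """
    report_dir = (Path(base) if base is not None else Path.cwd()) / "report"
    report_dir.mkdir(parents=True, exist_ok=True)
    return report_dir / Path(out_html).name
```

Calling report_path('titanic.html') from a project directory would resolve to report/titanic.html beneath that directory.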
Column type detection
pandas-cat automatically detects column types and adapts the report:
| Column type | Detected when |
|---|---|
| Continuous | |
| Categorical | All other columns: |
Continuous columns are profiled with a histogram and descriptive statistics.
Categorical columns are profiled with a frequency bar chart and frequency table.
The cat_limit option applies only to categorical columns — continuous
columns are never excluded on category-count grounds.
Effect of automatic data preparation on numeric columns
Because pandas-cat is a categorical profiling package, the default preparation
engine converts plain numeric columns with cat_limit or fewer distinct
values to ordered CategoricalDtype before profiling — so a 0/1 flag or a
1–5 rating scale produces a frequency bar chart rather than a histogram.
Numeric columns with more distinct values are left unchanged and profiled as
continuous.
No information is lost: a bar chart shows every value and its exact frequency.
A KDE histogram for low-cardinality discrete data is less informative and
visually misleading. To keep a low-cardinality numeric column as continuous,
either set auto_prepare to False in the opts dict or cast the column to float
before calling profile().
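The threshold rule for plain numeric columns can be summarised in a few lines. This is a sketch of the rule as described in this section, not the library's code; chart_kind is a hypothetical helper, and ignoring missing values when counting distinct values is an assumption.

```python
from numbers import Number


def chart_kind(values, cat_limit=20, auto_prepare=True):
    """Return 'bar chart' or 'histogram' for a plain numeric column."""
    distinct = {v for v in values if v is not None}  # assumption: NAs not counted
    if not all(isinstance(v, Number) for v in distinct):
        raise TypeError("this sketch only covers plain numeric columns")
    if auto_prepare and len(distinct) <= cat_limit:
        return "bar chart"   # converted to an ordered categorical
    return "histogram"       # left unchanged and profiled as continuous
```

A 0/1 flag or a 1–5 rating falls under the limit and yields a bar chart, while a column with 100 distinct values yields a histogram. With auto_prepare disabled, the same 0/1 flag stays continuous.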
Templates
| Template | Description |
|---|---|
| default | Static HTML. Includes a memory-usage bar chart, per-attribute profiles (frequency tables + bar charts for categorical; histograms + descriptive statistics for continuous), a Cramér’s V heatmap (categorical columns), a Pearson correlation heatmap (continuous columns, when >= 2 exist), and individual column-pair crosstab heatmaps (categorical only). |
| modern | Same content as the default template, rendered with an alternative visual style. |
| interactive | Data-driven HTML. Three correlation metrics per column pair (Cramér’s V, Spearman Rank, Theil’s U), per-category crosstabs, explicit missing-value report, and a list of excluded attributes. Progress is printed to the console as six numbered steps. |
Options (opts dict)
| Key | Default | Description |
|---|---|---|
| auto_prepare | True | Call |
| cat_limit | 20 | Maximum number of distinct categories a categorical column may have to be included in the report. Columns over the limit are excluded with a warning. Does not apply to continuous (numeric) columns. |
| na_values | | Extra strings to treat as missing (in addition to the built-in list). |
| na_ignore | | Strings from the built-in missing-value list to not replace with |
| keep_default_na | True | When |
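How a partial opts dict combines with the documented defaults can be sketched as follows. The helper resolve_opts is hypothetical, and rejecting unknown keys is a choice made for this sketch; the library may simply ignore them. Only the defaults stated in the parameter list (auto_prepare, cat_limit, keep_default_na) are filled in.

```python
# Defaults as documented for the opts dict; na_values and na_ignore have no
# documented default here, so they are left out unless the caller supplies them.
DOCUMENTED_DEFAULTS = {"auto_prepare": True, "cat_limit": 20, "keep_default_na": True}
KNOWN_KEYS = set(DOCUMENTED_DEFAULTS) | {"na_values", "na_ignore"}


def resolve_opts(opts=None):
    """Merge caller-supplied options over the documented defaults."""
    opts = dict(opts or {})
    unknown = set(opts) - KNOWN_KEYS
    if unknown:
        # Catching typos such as 'catlimit' early is a choice of this sketch.
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **opts}
```

For example, resolve_opts({'cat_limit': 50, 'na_values': ['unknown']}) keeps auto_prepare and keep_default_na at True while raising the category limit.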
Correlation metrics
- Cramér’s V (default and interactive templates — categorical columns)
Symmetric measure based on the chi-squared statistic, using Bergsma's bias correction (also known as the Bergsma-Wicher correction) for small samples. Range: 0 (no association) – 1 (perfect association).
- Pearson (default and modern templates — continuous columns)
Linear correlation coefficient. Range: −1 (perfect negative) – 1 (perfect positive), 0 = no linear relation. Shown when the DataFrame contains at least two continuous columns.
- Spearman Rank (interactive template)
Rank correlation after converting categories to integer codes. Range: −1 – 1. Most meaningful for ordered categoricals.
- Theil’s U (interactive template)
Asymmetric uncertainty coefficient. U(X→Y) measures how much knowing X reduces uncertainty about Y. Range: 0 – 1.
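For readers who want to check values by hand, the two less familiar metrics above can be computed with the standard library alone. The sketch below follows the textbook definitions (Cramér's V with the small-sample bias correction, and Theil's U as the relative reduction in entropy); it is not taken from the pandas-cat source, and the conventions for degenerate inputs are assumptions.

```python
from collections import Counter
from math import log, sqrt


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    rows, cols, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    r, k = len(rows), len(cols)
    if n < 2 or r < 2 or k < 2:
        return 0.0  # assumption: a constant column has no association
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2_corr = max(0.0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(values)
    return -sum(c / n * log(c / n) for c in Counter(values).values())


def theils_u(xs, ys):
    """U(X→Y): fraction of the entropy of Y removed by knowing X."""
    h_y = entropy(ys)
    if h_y == 0:
        return 1.0  # assumption: a constant Y counts as fully determined
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_y - h_y_given_x) / h_y
```

The asymmetry of Theil's U shows up whenever one column determines the other but not vice versa: if every value of X maps to a single value of Y, theils_u(xs, ys) is 1, while theils_u(ys, xs) is lower unless the mapping is one-to-one.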
prepare() engine
The auto_data_prep parameter of prepare()
selects the backend:
- 'default' (default): uses the built-in conversion, which detects numeric-like string columns and converts them to ordered CategoricalDtype. Plain numeric columns with more unique values than cat_limit are left as continuous.
- 'CLM': delegates to CleverMiner (>= 1.0.7). If an older version is installed, the DataFrame is returned unchanged. Note: CleverMiner drops high-cardinality numeric columns entirely rather than leaving them as continuous.
Note
prepare() converts all columns that can be given a meaningful order.
String-encoded categories (e.g. "0-10", "Over 75") are sorted by
their leading numeric token. Plain numeric columns (float64, int64)
with cat_limit or fewer distinct values are also converted to ordered
CategoricalDtype — they appear as frequency bar charts rather than
histograms. Columns that cannot be ordered are sorted alphabetically.
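The ordering rule in the note can be illustrated with a short sketch. It assumes that "leading numeric token" means the first number found anywhere in the label (so "Over 75" sorts by 75) and that a single label without a number sends the whole column to alphabetical order; both are assumptions, and order_categories is a hypothetical helper rather than a pandas-cat function.

```python
import re

# First run of digits in a label, with an optional decimal part.
_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def order_categories(labels):
    """Order distinct labels by their first numeric token, else alphabetically."""
    unique = sorted(set(labels))
    keys = {}
    for label in unique:
        match = _FIRST_NUMBER.search(label)
        if match is None:
            return unique  # no usable number: fall back to alphabetical order
        keys[label] = float(match.group())
    # Ties on the numeric key are broken alphabetically.
    return sorted(unique, key=lambda label: (keys[label], label))
```

The resulting list could then be passed as the categories argument of pandas.Categorical with ordered=True. For the labels "5-10", "0-4", "11-20", "Over 75" and "21-75", a plain alphabetical sort would place "5-10" after "21-75", whereas the numeric key keeps the ranges in their natural order.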