
Soft Data

The intake desk. Load a dataset, understand it, and get it clean and shaped before it goes to the models. Layout is cribbed from a data-wrangler: a columns list on the left, the grid in the middle, a column summary on the right, and a scoped chatbot across the bottom.

Soft Data — the grid, per-column headers, and the column inspector

Loading data

Action                What it does
import csv / parquet  Load your own file from disk
load sample           Load a built-in messy sample dataset to explore the tools
clear                 Drop the current dataset

The sample is a realistic mess on purpose: currency strings, %-suffixed numbers, sentinel ages (-999 / 9999), mixed Y/N booleans, mixed date formats, case-only duplicates, mojibake, BOM/NBSP characters, missing markers, and duplicate rows — so every cleaning tool has something to do.

The grid

Each column header shows:

  • A type badge: abc (text), 123 (numeric), 📅 (date).
  • A mini distribution (histogram or category bar).
  • rows · cols · % missing for the dataset.

Click a column to select it; its full summary appears on the right (type, missing, unique, top values, five-number summary, histogram).

Cleaning

When Scelo detects issues, a cleaning banner appears above the grid. It lists each suggested operation with a count of affected cells and a safe flag. Tick the ones you want and Apply.

The full op set:

Op                           What it fixes
trim whitespace              leading/trailing spaces
collapse internal whitespace runs of spaces/tabs/newlines
fix encoding artefacts       mojibake, BOM, NBSP, zero-width chars
normalise missing markers    N/A, ?, -, TBD, … → null
parse numeric strings        $1,234 / (1,234) / 85% → numbers
parse date strings           date-shaped text → ISO YYYY-MM-DD
standardise booleans         mixed yes/no/Y/N → true/false
replace sentinel numerics    repeated -999 / 9999 codes → null
merge case-only duplicates   WEST/west/West → one bucket
rename to snake_case         headers with spaces/dots/mixed case
drop near-empty columns      columns >95% missing
drop constant columns        columns with a single value
drop duplicate rows          exact-match duplicates
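
To make a few of these ops concrete, here is a minimal standard-library sketch of "normalise missing markers", "parse numeric strings", and "replace sentinel numerics". It is an illustration, not Scelo's implementation: the function names and the marker list are hypothetical, and two behaviours are assumptions (a parenthesised amount is read as negative, following accounting convention, and 85% is parsed to 85.0 rather than 0.85).

```python
import re
from typing import List, Optional

# Assumed marker list: the table names N/A, ?, -, TBD and elides the rest.
MISSING_MARKERS = {"", "n/a", "?", "-", "tbd"}
# Sentinel codes named in the table.
SENTINELS = {-999.0, 9999.0}


def normalise_missing(cell: Optional[str]) -> Optional[str]:
    """Return None when the cell holds only a missing marker."""
    if cell is None or cell.strip().lower() in MISSING_MARKERS:
        return None
    return cell


def parse_numeric(cell: Optional[str]) -> Optional[float]:
    """Parse '$1,234', '(1,234)' or '85%' into a float; None if not numeric."""
    cell = normalise_missing(cell)
    if cell is None:
        return None
    text = cell.strip()
    # Assumption: accounting-style parentheses mean a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop the percent sign, currency symbol and thousands separators.
    text = text.rstrip("%").replace("$", "").replace(",", "").strip()
    if not re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return None
    value = float(text)
    return -value if negative else value


def replace_sentinels(values: List[Optional[float]]) -> List[Optional[float]]:
    """Null out the -999 / 9999 placeholder codes."""
    return [None if v in SENTINELS else v for v in values]
```

Calling parse_numeric on each cell of a text column yields a numeric column with nulls wherever the cell was a missing marker or not a number; replace_sentinels can then be run on the result.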

Or just ask

Type "clean my data" in the soft-data chat and Scelo runs the recommended set for you. Everything runs locally, with no backend needed.

Date formatting

Scelo recognises a range of date formats, including day-first (European) ones that a naive date parser would reject. There are two ways to reformat:

  • Click: on a date column, the type badge becomes a 📅 ▾ dropdown — pick American (MM/DD/YYYY), European (DD/MM/YYYY), or ISO 8601.
  • Chat: say "make the dates american format" or "format the dataset european", or hover a single column to open its chat and say "make this ISO".

It infers each column's source convention (so 29-01-2025 is read as day-first) and reports if any cells weren't recognisable dates.
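
One way such per-column inference could work is sketched below, using only the standard library. The heuristic is an assumption, not a description of Scelo's internals: a first field greater than 12 can only be a day, so the column is day-first; a second field greater than 12 means month-first. The names infer_day_first and to_iso are hypothetical, and only numeric day/month/4-digit-year text is handled.

```python
import re
from datetime import date
from typing import List, Optional, Sequence, Tuple

# Two 1-2 digit fields and a 4-digit year, separated by -, / or .
DATE_RE = re.compile(r"^\s*(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\s*$")


def infer_day_first(cells: Sequence[str]) -> Optional[bool]:
    """True = day-first, False = month-first, None = no cell settles it."""
    for cell in cells:
        m = DATE_RE.match(cell or "")
        if not m:
            continue
        first, second = int(m.group(1)), int(m.group(2))
        if first > 12:   # e.g. 29-01-2025: 29 cannot be a month
            return True
        if second > 12:  # e.g. 01/29/2025: 29 cannot be a month
            return False
    return None


def to_iso(cells: Sequence[str], day_first: bool) -> Tuple[List[Optional[str]], int]:
    """Convert cells to ISO YYYY-MM-DD; also count cells that are not dates."""
    out: List[Optional[str]] = []
    unrecognised = 0
    for cell in cells:
        parsed = None
        m = DATE_RE.match(cell or "")
        if m:
            a, b, year = (int(g) for g in m.groups())
            day, month = (a, b) if day_first else (b, a)
            try:
                parsed = date(year, month, day).isoformat()
            except ValueError:  # impossible date such as month 13
                parsed = None
        if parsed is None:
            unrecognised += 1
        out.append(parsed)
    return out, unrecognised


column = ["29-01-2025", "03-02-2025", "not a date"]
iso_dates, bad = to_iso(column, infer_day_first(column) or False)
```

In this sketch a column with no decisive cell falls back to month-first; a real tool would need an explicit default or a prompt for that case.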

Per-column actions (chat)

Hover any column header to open its scoped chat:

  • make this american — reformat just this column's dates.
  • remove all non-dates — null every cell in a date column that isn't a date.
  • clean this column — trim, fix encoding, collapse whitespace, and null missing-markers for that column only.
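
As a rough picture of what "clean this column" covers, the sketch below chains the four steps for a single text cell. It is a hypothetical illustration: clean_cell, fix_mojibake, and the marker list are assumed names and values, and the mojibake repair only handles the common case of UTF-8 text that was mis-decoded as Windows-1252.

```python
from typing import Optional

# Assumed marker list, compared case-insensitively after trimming.
MISSING_MARKERS = {"", "n/a", "?", "-", "tbd"}
# BOM, zero-width space / non-joiner / joiner, word joiner: deleted outright.
INVISIBLE = dict.fromkeys(map(ord, "\ufeff\u200b\u200c\u200d\u2060"))


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 read as cp1252; leave the text alone if that round trip fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_cell(cell: Optional[str]) -> Optional[str]:
    """Fix encoding, collapse whitespace, trim, and null missing markers."""
    if cell is None:
        return None
    text = cell.translate(INVISIBLE)   # strip BOM and zero-width characters
    text = fix_mojibake(text)          # e.g. 'cafÃ©' becomes 'café'
    text = text.replace("\u00a0", " ")  # NBSP becomes a plain space
    text = " ".join(text.split())      # collapse runs of whitespace and trim
    return None if text.lower() in MISSING_MARKERS else text


column = ["\ufeff  North\u00a0 West \n", "caf\u00c3\u00a9", " TBD "]
cleaned = [clean_cell(c) for c in column]
```

The round-trip test in fix_mojibake is deliberately conservative: correctly encoded text such as "café" fails the UTF-8 decode and is returned unchanged.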

Data augmentation

Generate synthetic rows from the soft-data chat:

add 1000 more rows through augmentation

Scelo bootstrap-resamples real rows, so each synthetic row keeps the cross-column relationships of the real row it was drawn from, and adds light Gaussian jitter to numeric columns. Categorical, date, and identifier columns are copied unchanged. Use it for stress-testing intake; for model-based synthesis (SMOTE, copulas, CTGAN), move to the modeling stage.
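
The general technique can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not Scelo's code: the augment function and its parameters are hypothetical, the jitter scale (5% of each column's standard deviation) is an arbitrary choice, and numeric columns are assumed to contain no nulls.

```python
import random
from statistics import pstdev
from typing import Dict, List, Sequence

Row = Dict[str, object]


def augment(rows: Sequence[Row], n_new: int, numeric_cols: Sequence[str],
            jitter: float = 0.05, seed: int = 0) -> List[Row]:
    """Append n_new bootstrap-resampled rows with Gaussian noise on numeric columns."""
    rng = random.Random(seed)
    # Per-column spread of the real data, used to scale the noise.
    spread = {c: pstdev([float(r[c]) for r in rows]) for c in numeric_cols}
    synthetic: List[Row] = []
    for _ in range(n_new):
        # Sample a whole row with replacement, so its columns stay together.
        new_row = dict(rng.choice(rows))
        for c in numeric_cols:
            # Assumed noise level: `jitter` times the column's standard deviation.
            new_row[c] = float(new_row[c]) + rng.gauss(0.0, jitter * spread[c])
        synthetic.append(new_row)
    return list(rows) + synthetic


rows = [
    {"id": "a", "region": "west", "income": 100.0},
    {"id": "b", "region": "east", "income": 200.0},
]
augmented = augment(rows, n_new=5, numeric_cols=["income"])
```

Because identifier columns are copied along with the rest of the row, the augmented data will contain repeated ids; that is usually acceptable for stress-testing but worth keeping in mind before any join.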

Derived columns and filters

  • + ƒ derived — add a column from a formula (df.eval-style expressions).
  • Click a column's distribution to add a filter; active filters show as chips above the grid and can be cleared individually or all at once.
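
For readers unfamiliar with df.eval-style expressions, the pandas sketch below shows the kind of formula a derived column takes, plus a DataFrame.query call as a rough code analogue of a filter chip. The column names and data are made up for illustration, and whether Scelo accepts exactly the same syntax as pandas is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [3, 1, 2],
    "region": ["west", "east", "west"],
})

# Derived column: the expression names existing columns directly.
# With the default inplace=False, eval returns a new DataFrame.
df = df.eval("revenue = price * qty")

# Filter: keep rows matching a condition, much like clicking a bar
# in a column's distribution (analogy only).
west = df.query("region == 'west' and revenue >= 30")
```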

Simulating and exporting

  • ▷ simulate — generate a synthetic dataset by simulating a population's response to a scenario (via the swarm). See The swarm.
  • export ▾ — export the cleaned dataset (CSV / Parquet).
  • export · code — export everything you did as a runnable Python / R / C++ script. See Exporting.

When you're ready, continue to the next page: tools →.