Build a CSV Sales Toolkit : Day 1 - CSV Sales Cleaner

Today we build a Python program that loads a messy CSV, see it in a table, click a button, and watch the data transform — bad values fixed, missing cells filled.

Jun 02, 2026

Projects in this week’s series:

This week, we build a CSV Sales Toolkit that cleans messy data, generates insightful reports, and processes whole folders of files at once — the exact workflow of real data work.

Why build this? Because real-world data is always messy. Inconsistent capitalization, stray whitespace, dollar signs in number columns, missing values — every dataset has them. The ability to reliably clean data is one of the most practical skills in all of Python.

What you’ll learn: This series teaches you data cleaning, type conversion, missing-value handling, aggregation and reporting, and batch file processing — the core skills behind data analysis and automation.

Why this matters: By Day 3, you’ll have a tool that points at a folder of monthly sales files, cleans every one, and produces a single consolidated business report. That’s a genuinely useful piece of software.

Day 1: Visual CSV Cleaner (Today)
Day 2: Sales Analyzer & Report
Day 3: Batch CSV Processor

View All Projects This Week

Today’s Project

We start at the foundation of all data work: cleaning. But instead of a script that runs and prints text, we’re building a visual desktop app. You load a messy CSV, see it in a table, click a button, and watch the data transform — bad values fixed, missing cells filled, all in front of you.

The input file looks fine at a glance. But it has stray whitespace, three different spellings of “Electronics,” dollar signs stuck in the price column, and missing values. Today you build a tool that fixes all of it — and shows you the difference.

Project Task

Build a visual CSV cleaner with Tkinter that:

Loads a raw orders CSV and displays it in a table (Treeview)
Shows a “before” view of the messy data
Cleans the data on a button click:
- Strips whitespace from text columns
- Standardizes category names to Title Case
- Standardizes order status to lowercase
- Removes $ signs and converts prices to numbers
- Converts quantity to whole numbers
- Fills missing values sensibly
- Adds a calculated total_price column
Displays the cleaned “after” data in the same table
Shows a status log of exactly what was cleaned
Saves the cleaned data to a new CSV via a Save button

This project gives you hands-on practice with pandas data cleaning, Tkinter GUIs, the Treeview table widget, and connecting a data pipeline to a visual interface — a practical mix of data and app-building skills.

Expected Output

Running the app:

python csv_cleaner.py

Application Window:

Running the code will produce this desktop app. Here we have already loaded the orders_raw.csv file which you can download here:

After clicking “Clean Data” we see this:

Watch the table change in place: electronics becomes Electronics, $12.99 becomes 12.99, the empty quantity in row 1004 fills with 1, and a new total_price column appears.

Setup Instructions

Install pandas:

pip install pandas

Tkinter ships with Python — nothing else to install.

Get the data file:

Download orders_raw.csv (provided with this project) and keep it handy — you’ll load it from inside the app.

Download CSV

Run the .py program:

python csv_cleaner.py

Understanding the Messy Data

Before building anything, look at orders_raw.csv. These are the six problems hiding in it:

Whitespace — " Alice Johnson" and "Bob Smith " have stray spaces.
Inconsistent categories — Electronics, electronics, ELECTRONICS are the same category spelled three ways.
Inconsistent status — completed, Completed, COMPLETED — same problem.
Currency symbols — some prices are $12.99 (text!) instead of 12.99 (a number).
Missing quantities — a few quantity cells are empty.
Missing prices — a few unit_price cells are empty.

Each one breaks analysis later. Sum revenue and $12.99 crashes the math. Group by category and you get six groups instead of three. Cleaning is what makes analysis possible.

Understanding the App Structure

We organize the whole app as a class. The class holds the DataFrame as self.df, so every method — load, clean, save — works on the same shared data:

import tkinter as tk
from tkinter import ttk, filedialog
import pandas as pd

class CSVCleanerApp:
    def __init__(self, root):
        self.root = root
        self.df = None          # the DataFrame, shared across all methods
        self.build_ui()

    def build_ui(self):
        # buttons, table, status log go here
        ...

self.df starts as None (nothing loaded yet). load_csv() fills it, clean_data() transforms it, save_csv() writes it out. Storing the data on self is what lets separate button clicks act on one shared dataset.

Understanding the Treeview Table

Tkinter’s ttk.Treeview is the widget for showing tabular data — rows and columns, like a spreadsheet:

# Create the table
self.tree = ttk.Treeview(self.root, show="headings")

# Define which columns exist
self.tree["columns"] = ("order_id", "customer_name", "product")

# Set the header text for each column
for col in self.tree["columns"]:
    self.tree.heading(col, text=col)
    self.tree.column(col, width=120)

# Add a row of data
self.tree.insert("", "end", values=(1001, "Alice Johnson", "Wireless Mouse"))

Three ideas:

show="headings" hides Treeview’s default empty first column — you almost always want this.
.heading() sets the visible column title; .column() sets its width.
.insert("", "end", values=(...)) adds one row; the tuple of values fills the columns left to right.

Understanding Displaying a DataFrame

The table widget doesn’t know about pandas — so we need a function that takes any DataFrame and renders it into the Treeview. The pattern is: clear the table, set up the columns, then add every row.

def show_dataframe(self, df):
    # 1. Clear any existing rows
    self.tree.delete(*self.tree.get_children())

    # 2. Set the columns to match the DataFrame
    self.tree["columns"] = list(df.columns)
    for col in df.columns:
        self.tree.heading(col, text=col)
        self.tree.column(col, width=110)

    # 3. Add every row from the DataFrame
    for _, row in df.iterrows():
        self.tree.insert("", "end", values=list(row))

This one function is reused for both the before and after views. Load the messy data — call show_dataframe. Clean it — call show_dataframe again. The table just re-renders whatever DataFrame you hand it.

Understanding Loading a File

The “Load CSV” button opens a file picker with filedialog, then reads the chosen file:

from tkinter import filedialog

def load_csv(self):
    path = filedialog.askopenfilename(
        title="Select a CSV file",
        filetypes=[("CSV files", "*.csv")]
    )
    if not path:          # user clicked Cancel
        return

    self.df = pd.read_csv(path)
    self.show_dataframe(self.df)
    self.log(f"Loaded {path} — {len(self.df)} rows, {len(self.df.columns)} columns")

filedialog.askopenfilename returns the selected path as a string — or an empty string if the user cancels, which is why we check if not path. Once loaded, we show the data and log what happened.

Understanding the Cleaning Pipeline

The cleaning logic is pure pandas — the same regardless of GUI or CLI. We break it into one method per problem, all called by clean_data():

def clean_data(self):
    if self.df is None:
        self.log("Load a CSV first!")
        return

    df = self.df

    # --- Strip whitespace from text columns ---
    text_cols = ["customer_name", "product", "category", "status"]
    for col in text_cols:
        df[col] = df[col].astype(str).str.strip()

    # --- Standardize capitalization ---
    df["category"] = df["category"].str.title()   # electronics -> Electronics
    df["status"] = df["status"].str.lower()        # COMPLETED -> completed

    # --- Fix the price column ---
    df["unit_price"] = df["unit_price"].astype(str).str.replace("$", "", regex=False)
    df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")

    # ... (quantity, missing values, total_price) ...

    self.df = df
    self.show_dataframe(self.df)

The key tools:

.str.strip() / .str.title() / .str.lower() — string operations applied to a whole column via the .str accessor.
.str.replace("$", "", regex=False) — strips the dollar sign so the text can become a number.
pd.to_numeric(..., errors="coerce") — converts text to numbers; anything unconvertible becomes NaN (missing) instead of crashing.

Understanding Missing Value Handling

After converting types, some cells are missing. For sales data we fill rather than drop — dropping rows loses real orders.

Missing quantity → fill with 1 (an order existed, so at least one item):

df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["quantity"] = df["quantity"].fillna(1).astype(int)

Missing unit_price → fill with that category’s average price (a flat number would distort things — a missing webcam price shouldn’t become $5):

category_avg = df.groupby("category")["unit_price"].transform("mean")
df["unit_price"] = df["unit_price"].fillna(category_avg).round(2)

groupby(...).transform("mean") computes each category’s average and aligns it to every row, so each gap is filled with its own category’s average. “Handle missing values” always means think about what the right value is.

Understanding the Calculated Column

Clean, properly-typed data lets you derive new information. Total price is just quantity × unit price — one line, now that both columns are real numbers:

df["total_price"] = (df["quantity"] * df["unit_price"]).round(2)

This is the payoff of cleaning. If unit_price were still text with dollar signs, this multiplication would fail outright.

Understanding the Status Log

A good tool tells the user what it did. We add a scrolling text box at the bottom and append a line for each action:

from tkinter import scrolledtext

# In build_ui:
self.status_log = scrolledtext.ScrolledText(self.root, height=6)

def log(self, message):
    self.status_log.insert(tk.END, message + "\n")
    self.status_log.see(tk.END)   # auto-scroll to the newest line

Each cleaning step calls self.log(...). The user gets a running history — “stripped whitespace,” “filled 3 missing quantities” — instead of changes happening invisibly.

Understanding Saving the Result

The “Save Clean CSV” button uses a save dialog and writes the DataFrame out:

def save_csv(self):
    if self.df is None:
        self.log("Nothing to save — load and clean a file first.")
        return

    path = filedialog.asksaveasfilename(
        defaultextension=".csv",
        filetypes=[("CSV files", "*.csv")]
    )
    if not path:
        return

    self.df.to_csv(path, index=False)
    self.log(f"Saved clean data to {path}")

asksaveasfilename lets the user choose where to save. index=False in to_csv is important — without it, pandas writes its internal row numbers as a junk extra column.

Practical Use Cases

1. Cleaning exported reports:

E-commerce, accounting, and CRM exports are notoriously messy — clean them on arrival.

2. A reusable cleaning tool:

Build it once, then clean any sales CSV by loading it into the app.

3. Visual data inspection:

Seeing the table before and after catches problems a script would hide.

4. Non-technical users:

A GUI means colleagues can clean data without touching code.

5. Foundation for Day 2:

Tomorrow's analyzer needs clean, typed data — this app produces exactly that.

Coming Tomorrow

Tomorrow we put the clean data to work. The Sales Analyzer takes orders_clean.csv and computes real business insights — revenue by category, best-selling products, daily trends, average order value — and generates a formatted summary report.

Skeleton and Solution

Below you will find both a downloadable skeleton.py file to help you code the project with comment guides and the downloadable solution.py file containing the correct solution.

Get the code skeleton here:

View Code Skeleton

Get the code solution here:

View Code Solution

Discussion about this post

Ready for more?