Data Cleaning Techniques for Accurate ML Models in 2025


In machine learning, data cleaning is not just a first step. It is the backbone of model accuracy. Without clean data, even the most sophisticated algorithms can produce unreliable results. Research shows that data scientists spend 60-70% of their time cleaning and organizing data. Why? Because garbage in equals garbage out.

This guide shares key data-cleaning techniques, with real examples that help you turn messy datasets into high-quality inputs for ML models. Whether you are predicting house prices or customer churn, these techniques will improve your predictions and insights.


Data Project: Predicting House Prices

This project walks through data cleaning for a real machine learning task: preparing a dataset to predict housing prices from various features. The goal is to show how each data-cleaning step improves model performance and ensures reliable predictions.

Dataset Overview

The dataset contains information about housing properties, with the following columns:

  • square_footage: The total square footage of the house.
  • price: The sale price of the house (target variable).
  • location: Categorical data specifying the city or neighborhood.
  • number_of_bedrooms: The number of bedrooms in the house.
  • year_built: The year the house was constructed.
  • garage_space: The number of cars that can fit in the garage.
  • sale_date: The date the house was sold.

Common Issues in the Dataset

  1. Missing Values:
    • Around 15% of square_footage values are missing.
    • 5% of price values are missing.
    • garage_space has many NaN values for houses without a garage.
  2. Data Type Inconsistencies:
    • year_built is stored as a string.
    • price contains dollar signs and commas, making it a string instead of a float.
  3. Outliers:
    • price has extreme values for luxury homes, which skew the distribution.
  4. Feature Scaling Issues:
    • square_footage ranges from 500 to 10,000, while number_of_bedrooms ranges from 1 to 7.
  5. Categorical Encoding Needed:
    • location is categorical and needs to be encoded for model training.

Project Goal

Build a regression model to predict house prices (price) based on the given features. This involves:

  • Cleaning the dataset to ensure reliability and consistency.
  • Applying feature engineering techniques for better model performance.
  • Training and evaluating the machine learning model.

Step-by-Step Process

1. Loading the Dataset

Use Python’s Pandas library to load the dataset:

import pandas as pd

# Load dataset
df = pd.read_csv('housing_data.csv')

# Preview data
print(df.head())
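
Before cleaning anything, it helps to inspect the structure. A quick check (assuming the df loaded above) surfaces the dtype and missing-value issues described in the next steps:

# Inspect column dtypes and non-null counts
df.info()

# Summary statistics for the numeric columns
print(df.describe())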

2. Handling Missing Values

Example from the Dataset:

  • square_footage has 15% missing values, which could significantly affect predictions.

Techniques Used:

  1. For square_footage, use mean imputation:
    df['square_footage'] = df['square_footage'].fillna(df['square_footage'].mean())
    
  2. For garage_space, fill NaN values with 0 (no garage):
    df['garage_space'] = df['garage_space'].fillna(0)
    
  3. Drop rows where price is missing, as it’s the target variable:
    df.dropna(subset=['price'], inplace=True)
    

3. Data Type Conversion

Example from the Dataset:

  • price contains strings like "$500,000". Convert these to floats.

Techniques Used:

  1. Remove unwanted characters and convert to numeric:
    df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
    
  2. Convert year_built to integer:
    df['year_built'] = df['year_built'].astype(int)
    

4. Identifying and Removing Outliers

Example from the Dataset:

  • Extreme luxury properties in price (e.g., $20,000,000) create a long tail.

Techniques Used:

  1. Visualize outliers using boxplots:
    import seaborn as sns
    sns.boxplot(x=df['price'])
    
  2. Remove outliers using the interquartile range (IQR):
    Q1 = df['price'].quantile(0.25)
    Q3 = df['price'].quantile(0.75)
    IQR = Q3 - Q1
    
    # Filter out outliers
    df = df[(df['price'] >= (Q1 - 1.5 * IQR)) & (df['price'] <= (Q3 + 1.5 * IQR))]
    

5. Encoding Categorical Variables

Example from the Dataset:

  • location needs to be converted into numerical values.

Techniques Used:

  1. Apply one-hot encoding:
    df = pd.get_dummies(df, columns=['location'], drop_first=True)
    

6. Scaling Features

Example from the Dataset:

  • square_footage and number_of_bedrooms are on different scales. This affects algorithms like SVM or k-NN.

Techniques Used:

  1. Standardize numerical features:
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    df[['square_footage', 'number_of_bedrooms']] = scaler.fit_transform(df[['square_footage', 'number_of_bedrooms']])
    

7. Feature Selection

Example from the Dataset:

  • sale_date (and a feature such as zipcode, if present) might not add much value to price prediction.

Techniques Used:

  1. Use correlation analysis to identify redundant features:
    import seaborn as sns
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    
  2. Apply Recursive Feature Elimination (RFE) with a regression model:
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression
    
    # Use numeric features only; sale_date has not been converted yet
    X = df.drop(columns=['price', 'sale_date'])
    y = df['price']
    
    model = LinearRegression()
    rfe = RFE(estimator=model, n_features_to_select=5)
    rfe.fit(X, y)
    
    selected_features = X.columns[rfe.support_]
    print("Selected Features:", selected_features)
    

8. Splitting Data for Model Training

Split the dataset into training and test sets for validation:

from sklearn.model_selection import train_test_split

# Drop the target and the non-numeric sale_date column
X = df.drop(columns=['price', 'sale_date'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

9. Training the Model

Use a regression algorithm to train the model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
print("R^2 Score:", model.score(X_test, y_test))

Handling Missing Data

Example from the Dataset

In our dataset, 15% of square_footage values and 5% of price values are missing. Missing values can arise from human error, data corruption, or incomplete data collection. If left untreated, they can:

  • Cause algorithms to crash.
  • Skew model predictions.
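
Before choosing a strategy, it helps to quantify how much is missing and where. A minimal sketch, assuming the df from the loading step:

# Share of missing values per column, largest first
print(df.isnull().mean().sort_values(ascending=False))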

Alternative Methods to Handle Missing Data

  1. Mean, Median, or Mode Imputation:
    • Use the mean for numerical data, the median for numerical data with outliers, and the mode for categorical data.

    • Example in Python:
      df['square_footage'] = df['square_footage'].fillna(df['square_footage'].mean())
      
  2. Dropping Rows or Columns:
    • If a column or row has too many missing values, it might be better to remove it.
    • Example:
      df.dropna(axis=0, inplace=True)  # Drops rows with any missing values
      
  3. Advanced Imputation (k-NN):
    • Use the k-Nearest Neighbors algorithm to impute missing values from similar data points (see the sketch after this list).
    • Example: from sklearn.impute import KNNImputer
  4. Predictive Imputation:
    • Use machine learning models to predict missing values based on other features.
    • Ideal for datasets where relationships between variables are strong.
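
As an alternative to the mean imputation shown earlier, here is a minimal sketch of k-NN imputation with scikit-learn's KNNImputer. It works on numeric columns only, and the column list below is an assumption to adjust for your dataset:

from sklearn.impute import KNNImputer

# Each missing value is estimated from the 5 most similar rows
numeric_cols = ['square_footage', 'number_of_bedrooms', 'garage_space']
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])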

Data Type Conversion

Example from the Dataset

Inconsistent data types can lead to errors during analysis. For example:

  • year_built is stored as a string instead of an integer.
  • price is accidentally stored as text ("$500,000" instead of 500000).

Alternative Methods for Data Type Conversion

  1. Fixing Numeric Columns:
    • Convert strings to integers or floats using astype().
    • Example:
      df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
      
  2. Categorical to Numeric Conversion:
    • Convert categories like location into numerical labels using one-hot encoding or label encoding.
    • Example (One-Hot Encoding):
      df = pd.get_dummies(df, columns=['location'])
      
  3. Datetime Parsing:
    • Convert columns like sale_date into datetime objects.
    • Example:
      df['sale_date'] = pd.to_datetime(df['sale_date'])
      

Outlier Removal

Example from the Dataset

The price column has extreme outliers, such as ultra-luxury properties, which skew the distribution and make the mean misleading. These outliers can have a significant impact on models such as linear regression.

Alternative Methods for Handling Outliers

  1. Z-Score Method:
    • Identify and remove data points where the absolute z-score exceeds a threshold (e.g., 3).
    • Example:
      from scipy.stats import zscore
      df['zscore'] = zscore(df['price'])
      df = df[df['zscore'].abs() < 3]
      df = df.drop(columns=['zscore'])  # remove the helper column
      
  2. Interquartile Range (IQR):
    • Remove points outside 1.5 times the IQR.
    • Example:
      Q1 = df['price'].quantile(0.25)
      Q3 = df['price'].quantile(0.75)
      IQR = Q3 - Q1
      df = df[(df['price'] >= (Q1 - 1.5 * IQR)) & (df['price'] <= (Q3 + 1.5 * IQR))]
      
  3. Winsorizing:
    • Cap outliers at a predefined percentile instead of dropping them (see the sketch after this list).
  4. Transformation:
    • Apply log transformations to reduce the effect of outliers.
    • Example:
      import numpy as np
      df['price_log'] = np.log(df['price'])
      
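Winsorizing can be sketched in plain pandas by clipping to chosen percentiles. The 1st and 99th percentiles below are assumptions; pick thresholds that fit your data:

# Cap price at the 1st and 99th percentiles instead of dropping rows
lower = df['price'].quantile(0.01)
upper = df['price'].quantile(0.99)
df['price'] = df['price'].clip(lower=lower, upper=upper)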

Standardization

Example from the Dataset

Square footage ranges from hundreds to thousands, while bedrooms range from 1 to 7. Models like SVM and k-NN are sensitive to such scale differences.

Alternative Methods for Scaling

  1. Min-Max Scaling:
    • Scale values to a range of 0-1.
    • Example:
      from sklearn.preprocessing import MinMaxScaler
      scaler = MinMaxScaler()
      df[['square_footage']] = scaler.fit_transform(df[['square_footage']])
      
  2. Z-Score Standardization:
    • Scale data to have a mean of 0 and a standard deviation of 1.
  3. Robust Scaling:
    • Use the median and IQR for scaling, which is resistant to outliers (see the sketch below).
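
Here is a minimal sketch of robust scaling with scikit-learn's RobustScaler, which centers on the median and scales by the IQR so outliers have little influence:

from sklearn.preprocessing import RobustScaler

# Median/IQR-based scaling; extreme values barely shift the result
scaler = RobustScaler()
df[['square_footage']] = scaler.fit_transform(df[['square_footage']])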

Feature Selection

Example from the Dataset

Not all features are equally important. For example:

  • A feature such as zipcode may add little to price prediction when location is already included.
  • Irrelevant features add noise and slow down model training.

Alternative Methods for Feature Selection

  1. Correlation Analysis:
    • Use a heatmap to visualize and remove highly correlated features.
    • Example:
      import seaborn as sns
      sns.heatmap(df.corr(numeric_only=True), annot=True)
      
  2. Recursive Feature Elimination (RFE):
    • Iteratively remove the least important features.
    • Example (X and y as defined in the walkthrough above):
      from sklearn.feature_selection import RFE
      from sklearn.linear_model import LinearRegression
      rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
      rfe.fit(X, y)
      
  3. Tree-Based Models:
    • Use Random Forest or XGBoost to determine feature importance (sketched below).
  4. Lasso Regression:
    • Penalize irrelevant features, forcing their coefficients to zero (also sketched below).
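
Both ideas are sketched below, assuming the numeric features X and target y from the RFE example; the hyperparameters are assumptions to tune for your data:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Tree-based importance: higher scores mean the feature's splits reduce error more
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)
print(dict(zip(X.columns, forest.feature_importances_)))

# Lasso: the L1 penalty drives weak features' coefficients to exactly zero
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print(dict(zip(X.columns, lasso.coef_)))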

Conclusion

Clean data is the foundation of accurate machine-learning models. By handling missing data, removing outliers, converting data types, and selecting only relevant features, you can create a dataset that helps your model perform at its best. Remember, great models start with great data.

Ready to try these techniques? Download a messy dataset, clean it, and share your results. Let’s keep learning and improving together.
