In machine learning, data cleaning is not just a first step; it is the backbone of model accuracy. Without clean data, even the most sophisticated algorithms can produce unreliable results. Research suggests data scientists spend 60-70% of their time cleaning and organizing data. Why? Because garbage in equals garbage out.
This guide shares key data-cleaning techniques, using real examples to help you turn messy datasets into high-quality inputs for ML models. Whether you are predicting house prices or customer churn, these techniques will improve your predictions and insights.
Data Project: Predicting Price
This project walks through data cleaning for a machine learning task: preparing a dataset to predict housing prices from various features. The goal is to show how each data-cleaning step improves model performance and ensures reliable predictions.
Dataset Overview
The dataset contains information about housing properties, with the following columns:
- `square_footage`: The total square footage of the house.
- `price`: The sale price of the house (target variable).
- `location`: Categorical data specifying the city or neighborhood.
- `number_of_bedrooms`: The number of bedrooms in the house.
- `year_built`: The year the house was constructed.
- `garage_space`: The number of cars that can fit in the garage.
- `sale_date`: The date the house was sold.
Common Issues in the Dataset
- Missing Values:
  - Around 15% of `square_footage` values are missing.
  - 5% of `price` values are missing.
  - `garage_space` has many `NaN` values for houses without a garage.
- Data Type Inconsistencies:
  - `year_built` is stored as a string.
  - `price` contains dollar signs and commas, making it a string instead of a float.
- Outliers:
  - `price` has extreme values for luxury homes, which skew the distribution.
- Feature Scaling Issues:
  - `square_footage` ranges from 500 to 10,000, while `number_of_bedrooms` ranges from 1 to 7.
- Categorical Encoding Needed:
  - `location` is categorical and needs to be encoded for model training.
Project Goal
Build a regression model to predict house prices (`price`) based on the given features. This involves:
- Cleaning the dataset to ensure reliability and consistency.
- Applying feature engineering techniques for better model performance.
- Training and evaluating the machine learning model.
Step-by-Step Process
1. Loading the Dataset
Use Python’s Pandas library to load the dataset:
import pandas as pd
# Load dataset
df = pd.read_csv('housing_data.csv')
# Preview data
print(df.head())
2. Handling Missing Values
Example from the Dataset:
`square_footage` has 15% missing values, which could significantly affect predictions.
Techniques Used:
- For `square_footage`, use mean imputation:
df['square_footage'] = df['square_footage'].fillna(df['square_footage'].mean())
- For `garage_space`, fill `NaN` values with 0 (no garage):
df['garage_space'] = df['garage_space'].fillna(0)
- Drop rows where `price` is missing, as it's the target variable:
df = df.dropna(subset=['price'])
3. Data Type Conversion
Example from the Dataset:
`price` contains strings like `"$500,000"`. Convert these to floats.
Techniques Used:
- Remove unwanted characters and convert to numeric:
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
- Convert `year_built` to integer:
df['year_built'] = df['year_built'].astype(int)
4. Identifying and Removing Outliers
Example from the Dataset:
- Extreme luxury properties in `price` (e.g., $20,000,000) create a long tail.
Techniques Used:
- Visualize outliers using boxplots:
import seaborn as sns
sns.boxplot(x=df['price'])
- Remove outliers using the interquartile range (IQR):
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
# Filter out outliers
df = df[(df['price'] >= Q1 - 1.5 * IQR) & (df['price'] <= Q3 + 1.5 * IQR)]
5. Encoding Categorical Variables
Example from the Dataset:
`location` needs to be converted into numerical values.
Techniques Used:
- Apply one-hot encoding:
df = pd.get_dummies(df, columns=['location'], drop_first=True)
6. Scaling Features
Example from the Dataset:
`square_footage` and `number_of_bedrooms` are on different scales, which affects algorithms like SVM or k-NN.
Techniques Used:
- Standardize numerical features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['square_footage', 'number_of_bedrooms']] = scaler.fit_transform(df[['square_footage', 'number_of_bedrooms']])
7. Feature Selection
Example from the Dataset:
`zipcode` and `sale_date` might not add much value to price prediction.
Techniques Used:
- Use correlation analysis to identify redundant features:
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
- Apply Recursive Feature Elimination (RFE) with a regression model:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X_features = df.drop('price', axis=1)
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=5)
rfe.fit(X_features, df['price'])
selected_features = X_features.columns[rfe.support_]
print("Selected Features:", selected_features)
8. Splitting Data for Model Training
Split the dataset into training and test sets for validation:
from sklearn.model_selection import train_test_split
X = df.drop('price', axis=1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
9. Training the Model
Use a regression algorithm to train the model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate the model
print("R^2 Score:", model.score(X_test, y_test))
Handling Missing Data
Example from the Dataset
In our dataset, 15% of `square_footage` values and 5% of `price` values are missing. Missing values can arise from human error, data corruption, or incomplete data collection. If left untreated, they can:
- Cause algorithms to crash.
- Skew model predictions.
Alternative Methods to Handle Missing Data
- Mean, Median, or Mode Imputation:
  - For numerical data, use the mean; for numerical data with outliers, use the median; for categorical data, use the mode.
  - Example in Python:
df['square_footage'] = df['square_footage'].fillna(df['square_footage'].mean())
- Dropping Rows or Columns:
  - If a column or row has too many missing values, it might be better to remove it.
  - Example:
df.dropna(axis=0, inplace=True)  # Drops rows with any missing values
- Advanced Imputation (k-NN):
  - Use the k-Nearest Neighbors algorithm to fill missing values based on similar data points (see the sketch after this list).
  - Example:
from sklearn.impute import KNNImputer
- Predictive Imputation:
  - Use machine learning models to predict missing values based on other features.
  - Ideal for datasets where relationships between variables are strong.
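For the last two approaches, here is a minimal sketch with scikit-learn, assuming the numeric housing columns from the running example (`n_neighbors=5` and `random_state=0` are arbitrary choices):
from sklearn.impute import KNNImputer
# k-NN imputation: each missing value is filled using the k most
# similar rows, measured on the other numeric columns
numeric_cols = ['square_footage', 'number_of_bedrooms', 'garage_space']
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])
# Predictive (model-based) imputation: IterativeImputer is still
# experimental in scikit-learn, hence the explicit enable import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
df[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])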
Data Type Conversion
Example from the Dataset
Inconsistent data types can lead to errors during analysis. For example:
- `year_built` is stored as a string instead of an integer.
- `price` is accidentally stored as text (`"$500,000"` instead of `500000`).
Alternative Methods for Data Type Conversion
- Fixing Numeric Columns:
  - Convert strings to integers or floats using `astype()`.
  - Example:
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
- Categorical to Numeric Conversion:
  - Convert categories like `location` into numerical labels using one-hot encoding or label encoding.
  - Example (One-Hot Encoding):
df = pd.get_dummies(df, columns=['location'])
- Datetime Parsing:
  - Convert columns like `sale_date` into datetime objects so date-based features can be extracted (see the sketch after this list).
  - Example:
df['sale_date'] = pd.to_datetime(df['sale_date'])
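Once `sale_date` is a proper datetime, you can derive numeric features that a regression model can actually use. A minimal sketch; the derived column names are illustrative, not part of the original dataset:
df['sale_date'] = pd.to_datetime(df['sale_date'])
df['sale_year'] = df['sale_date'].dt.year        # e.g., 2021
df['sale_month'] = df['sale_date'].dt.month      # 1-12, captures seasonality
df['house_age_at_sale'] = df['sale_year'] - df['year_built']
df = df.drop(columns=['sale_date'])  # drop the raw datetime before modeling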
Outlier Removal
Example from the Dataset
The `price` column has extreme outliers, such as ultra-luxury properties, which skew the distribution and make the mean misleading. These outliers can have a significant impact on models such as linear regression.
Alternative Methods for Handling Outliers
- Z-Score Method:
  - Identify and remove data points where the absolute z-score exceeds a threshold (e.g., 3).
  - Example:
from scipy.stats import zscore
df['zscore'] = zscore(df['price'])
df = df[df['zscore'].abs() < 3]
- Interquartile Range (IQR):
  - Remove points outside 1.5 times the IQR.
  - Example:
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['price'] >= Q1 - 1.5 * IQR) & (df['price'] <= Q3 + 1.5 * IQR)]
- Winsorizing:
  - Cap outliers at a predefined percentile (see the sketch after this list).
- Transformation:
  - Apply log transformations to reduce the effect of outliers.
  - Example:
import numpy as np
df['price_log'] = np.log(df['price'])
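For winsorizing, a minimal sketch using pandas' `clip`; the 1st/99th percentile cutoffs are an arbitrary choice:
lower = df['price'].quantile(0.01)
upper = df['price'].quantile(0.99)
df['price'] = df['price'].clip(lower=lower, upper=upper)
Unlike outright removal, winsorizing keeps every row, which matters when the dataset is small.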
Standardization
Example from the Dataset
Square footage ranges from hundreds to thousands, while bedrooms range from 1 to 7. Models like SVM and k-NN are sensitive to such scale differences.
Alternative Methods for Scaling
- Min-Max Scaling:
- Scale values to a range of 0-1.
- Example:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['square_footage']] = scaler.fit_transform(df[['square_footage']])
- Z-Score Standardization:
  - Scale data to have a mean of 0 and a standard deviation of 1 (see the sketch after this list).
- Robust Scaling:
  - Use the median and IQR for scaling; resistant to outliers.
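A minimal sketch of both scalers from scikit-learn, applied to the numeric columns from the running example:
from sklearn.preprocessing import StandardScaler, RobustScaler
cols = ['square_footage', 'number_of_bedrooms']
# Z-score standardization: (x - mean) / std
df[cols] = StandardScaler().fit_transform(df[cols])
# Robust scaling: (x - median) / IQR, so extreme values barely move the scale
# df[cols] = RobustScaler().fit_transform(df[cols])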
Feature Selection
Example from the Dataset
Not all features are equally important. For example:
- `zipcode` may have little impact on price prediction when `location` is already included.
- Irrelevant features add noise and slow down model training.
Alternative Methods for Feature Selection
- Correlation Analysis:
- Use a heatmap to visualize and remove highly correlated features.
- Example:
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
- Recursive Feature Elimination (RFE):
- Iteratively remove the least important features.
- Example:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)  # X and y from the train/test split step
- Tree-Based Models:
  - Use Random Forest or XGBoost to determine feature importance (see the sketch after this list).
- Lasso Regression:
  - Penalize irrelevant features, forcing their coefficients to zero.
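A minimal sketch of both ideas with scikit-learn, assuming the numeric `X` and `y` from the train/test split step; the hyperparameter values are arbitrary:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
# Tree-based importance: higher scores mean the feature was used
# more often for informative splits
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y)
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False))
# Lasso: the L1 penalty shrinks unhelpful coefficients to exactly zero
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print("Features kept by Lasso:", list(X.columns[lasso.coef_ != 0]))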
Conclusion
Clean data is the foundation of accurate machine-learning models. By handling missing data, removing outliers, converting data types, and selecting only the relevant features, you can create a dataset that helps your model perform at its best. Remember, great models start with great data.
Ready to try these techniques? Download a messy dataset, clean it, and share your results. Let’s keep learning and improving together.