Data Cleaning


Core Idea: Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It is a crucial first step in any data science project, as the quality of the data directly impacts the quality of the model.

1. Handling Missing Values

We’ve touched on this in the Intermediate ML notes, but it’s worth reiterating. The strategy for handling missing values depends on the nature of the data and the reason for the missingness: common options are dropping the affected rows or columns, imputing with a simple statistic such as the median or mode, or imputing while keeping a flag that records which values were missing.
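
A minimal sketch of these options, assuming a pandas DataFrame df with a hypothetical numeric column 'age' and a hypothetical categorical column 'city':

import pandas as pd

# Inspect how much is missing in each column
print(df.isnull().sum())

# Option 1: drop rows with any missing values (only safe when few rows are affected)
df_dropped = df.dropna()

# Option 2: record the missingness, then impute (median for numeric, mode for categorical)
df['age_was_missing'] = df['age'].isnull()
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])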

2. Correcting Data Types

Ensure that each column has the correct data type (e.g., numeric, datetime, categorical). Incorrect data types can lead to errors and unexpected behavior in your analysis and modeling.

# Convert 'date' column to datetime objects
df['date'] = pd.to_datetime(df['date'])

# Convert 'zip_code' to a categorical type
df['zip_code'] = df['zip_code'].astype('category')
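
If a column contains values that cannot be parsed, both pd.to_datetime and pd.to_numeric accept errors='coerce', which turns unparseable entries into NaT/NaN so they can be treated like ordinary missing values (the 'price' column below is a hypothetical example):

# Unparseable dates become NaT instead of raising an error
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Numeric columns read in as strings can be coerced the same way
df['price'] = pd.to_numeric(df['price'], errors='coerce')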

3. Dealing with Inconsistent Data

Inconsistent data can arise from typos, different formatting, or different units of measurement. Regular expressions and string manipulation are powerful tools for cleaning up inconsistent text data.

# Standardize street-name abbreviations (regex=False so the '.' is matched literally)
df['street'] = (
    df['street']
    .str.replace('St.', 'Street', regex=False)
    .str.replace('Rd.', 'Road', regex=False)
)
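
Beyond one-off replacements, a couple of generic normalization steps catch most formatting inconsistencies. A sketch, assuming a hypothetical 'city' column with mixed casing, stray whitespace, and known spelling variants:

# Normalize casing and whitespace before comparing or grouping values
df['city'] = df['city'].str.strip().str.lower()

# Map known variants onto a single canonical spelling
df['city'] = df['city'].replace({'nyc': 'new york', 'n.y.': 'new york'})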

4. Identifying and Handling Outliers

Outliers are data points that differ markedly from the other observations. They can be caused by measurement or data-entry errors, or they can be legitimate but extreme values; whether to remove, cap, or keep them depends on which of these is the case.
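
A common rule of thumb for flagging them is the IQR method: treat anything more than 1.5 times the interquartile range below the first quartile or above the third quartile as a potential outlier. A minimal sketch on a hypothetical numeric column 'price':

# Compute the interquartile range of the column
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1

# Flag rows outside the 1.5 * IQR fences
is_outlier = (df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)
print(is_outlier.sum(), "potential outliers")

# Cap extreme values instead of dropping them if they look legitimate
df['price'] = df['price'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)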

5. Removing Duplicates

Duplicate rows can skew your analysis and should generally be removed.

# Remove duplicate rows
df = df.drop_duplicates()
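
By default drop_duplicates compares entire rows; if duplicates should instead be defined by a key column, the subset and keep parameters control that (the 'customer_id' column here is hypothetical):

# Count exact duplicates before dropping, so the removal is a deliberate choice
print(df.duplicated().sum(), "fully duplicated rows")

# Keep only the first record per customer, judged by a key column
df = df.drop_duplicates(subset=['customer_id'], keep='first')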

Key Takeaways:

Clean before you model: handle missing values, fix data types, standardize inconsistent values, inspect outliers, and remove duplicates. The right choice at each step depends on why the data is dirty, so look at the data before correcting it.
