Simulated data with intentional errors, followed by cleaning steps that correct the erroneous data.

kw3ku/Data-Cleaning-Preprocessing-with_R

R-Data-Proj

Data Cleaning and Preprocessing Project

This project demonstrates data cleaning and preprocessing techniques using R. It involves generating synthetic data with intentional inaccuracies and applying various data cleaning methods to prepare the data for analysis.

Project Overview

Synthetic Data Generation

  • Objective: Create a dataset that mimics real-world messy data.
  • Process:
    • Generate a dataset with 200 columns and rows of random data.
    • Introduce issues such as missing values, duplicates, outliers, and inconsistent formatting.
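
The generation step could look roughly like the sketch below. It uses base R only and is not the repository's actual script: the row count, the `Var1`...`Var200` column names, the distribution parameters, and the number of injected errors are all assumptions chosen for illustration.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

n_rows <- 200  # assumed row count
n_cols <- 200  # 200 columns, as described above

# Random numeric data, one column per variable
df <- as.data.frame(matrix(rnorm(n_rows * n_cols, mean = 50, sd = 10),
                           nrow = n_rows, ncol = n_cols))
names(df) <- paste0("Var", seq_len(n_cols))  # hypothetical mixed-case names

# Missing values: blank out 10 random cells in each of the first 10 columns
for (j in 1:10) {
  df[sample(n_rows, 10), j] <- NA
}

# Outliers: inflate a few values in one column
out_idx <- sample(n_rows, 5)
df$Var11[out_idx] <- df$Var11[out_idx] * 100

# Inconsistent formatting: store one numeric column as character
df$Var12 <- as.character(round(df$Var12, 2))

# Incorrect values: force a few entries to be negative
neg_idx <- sample(n_rows, 5)
df$Var13[neg_idx] <- -abs(df$Var13[neg_idx])

# Duplicates: append copies of the first five rows
df <- rbind(df, df[1:5, ])
```

Each kind of error is confined to known columns here, which makes it easy to check afterwards that the matching cleaning step actually fixed it.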

Data Cleaning Techniques

  • Handling Outliers: Cap values above the 99th percentile at the 99th-percentile value.
  • Handling Missing Values: Impute missing numeric values with the mean of the respective column.
  • Removing Duplicates: Remove duplicate rows after rounding numeric columns.
  • Correcting Inconsistent Formatting: Convert numeric data stored in character columns back to numeric.
  • Correcting Incorrect Values: Replace negative values with their absolute values.
  • Standardizing Column Names: Convert column names to lowercase.
  • Normalizing Data: Scale numeric columns.
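
As a minimal sketch, the techniques listed above can be chained in one base R function. The `clean_data` helper, the tiny `raw` data frame, and the `digits` argument are hypothetical. The order of the steps is also an assumption: character columns are converted first so that the numeric steps see them, and the repository's script may order things differently. Scaling is done here with `scale()`, which centers each column and divides by its standard deviation.

```r
# Small messy data frame standing in for the synthetic dataset (illustrative values)
raw <- data.frame(
  Height = c(1.70, 1.82, NA, 1.65, 1.70, 25.0),
  Weight = c("70.5", "82.1", "64.0", "58.3", "70.5", "77.7"),
  Score  = c(10, -20, 30, 40, 10, 50),
  stringsAsFactors = FALSE
)

clean_data <- function(df, digits = 2) {
  # Inconsistent formatting: convert character columns to numeric
  chr <- vapply(df, is.character, logical(1))
  df[chr] <- lapply(df[chr], as.numeric)

  num <- vapply(df, is.numeric, logical(1))

  # Incorrect values: replace negatives with their absolute values
  df[num] <- lapply(df[num], abs)

  # Outliers: cap values above the 99th percentile at that percentile
  df[num] <- lapply(df[num], function(x) {
    cap <- quantile(x, 0.99, na.rm = TRUE, names = FALSE)
    pmin(x, cap)
  })

  # Missing values: impute with the column mean
  df[num] <- lapply(df[num], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })

  # Duplicates: round numeric columns, then drop repeated rows
  df[num] <- lapply(df[num], round, digits = digits)
  df <- df[!duplicated(df), , drop = FALSE]

  # Column names: convert to lowercase
  names(df) <- tolower(names(df))

  # Normalization: center and scale each numeric column
  df[num] <- lapply(df[num], function(x) as.numeric(scale(x)))
  df
}

cleaned <- clean_data(raw)
```

Rounding before `duplicated()` means rows that differ only by floating-point noise are treated as the same record. Note that `as.numeric()` turns any non-numeric string into `NA` with a warning, so a real pipeline may want to inspect such columns before converting them.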

Visualization

  • Create a correlation matrix plot to visualize the relationships between numeric columns.
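
One way to draw such a plot with base R graphics is sketched below. The README does not say which plotting package the project uses, so the choice of `cor()` plus `image()` is an assumption, as are the example data frame and its column names.

```r
set.seed(1)

# Illustrative data: x4 is built to correlate with x1
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$x4 <- 0.8 * df$x1 + rnorm(100, sd = 0.3)

# Correlation matrix over the numeric columns only
num  <- vapply(df, is.numeric, logical(1))
corr <- cor(df[num], use = "pairwise.complete.obs")

# Heat map of the matrix; columns are reversed so the diagonal
# runs from top-left to bottom-right
n    <- ncol(corr)
pal  <- colorRampPalette(c("blue", "white", "red"))(21)
flip <- corr[, n:1]
image(seq_len(n), seq_len(n), flip, zlim = c(-1, 1), col = pal,
      axes = FALSE, xlab = "", ylab = "", main = "Correlation matrix")
axis(1, at = seq_len(n), labels = rownames(corr))
axis(2, at = seq_len(n), labels = rev(colnames(corr)), las = 1)

# Print each coefficient in its cell
text(row(flip), col(flip), labels = round(flip, 2))
```

The `use = "pairwise.complete.obs"` argument lets `cor()` tolerate any remaining missing values by computing each coefficient from the rows where both columns are present.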

Repository Structure

The repository includes the following files and directories:
