This repository contains a data cleaning and exploratory data analysis (EDA) project in Python. It uses Pandas, NumPy, Seaborn, and Matplotlib to prepare and analyze a dataset, and it walks through the main steps of a data science pipeline: initial exploration, visualization, missing-value imputation, outlier treatment, categorical encoding, and export.
- Libraries Used: Pandas, NumPy, Seaborn, Matplotlib.
- Data Source: A CSV file ("online.csv").
- Initial Exploration: Displaying the first rows of the dataset and dropping unnecessary columns ("Unnamed: 0", "Unnamed: 13", "9", "#@%").
- Visualization Tools: Seaborn, Matplotlib.
- Insights Gained:
  - Analyzing the "Marital Status" and "Gender" distributions using count plots.
  - Using boxplots to examine how "Age" varies across "Gender" and "Marital Status."
  - Exploring demographic trends and outliers.
- Handling Missing Values:
  - Imputing missing values in the "Age" column using mean, median, and mode imputation.
- Outlier Detection and Treatment:
  - Identifying outliers in the "Age" column using Z-scores, the interquartile range (IQR), and boxplots.
  - Treating outliers by trimming (removing them) and capping (clipping them to a limit).
- Handling Categorical Values:
  - Converting nominal and ordinal categorical variables into numerical format through one-hot encoding and label encoding.
- Data Export:
  - Saving the cleaned dataset to a new CSV file ("onlineclean.csv") for potential machine learning applications.
- Statistical Analysis:
  - Using descriptive statistics to understand the distribution of variables.
- Visualizing Outliers:
  - Comparing boxplots before and after outlier treatment for the "Age" column.
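
The imputation and outlier steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the "Age" values are synthetic, and the Z-score threshold of 3 and the 1.5 × IQR fences are common conventions assumed here, not settings confirmed by the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "Age" column (assumption: not the real online.csv data).
rng = np.random.default_rng(0)
values = rng.normal(loc=30, scale=5, size=200).round()
values[:10] = np.nan   # pretend ten ages are missing
values[-1] = 95        # one obvious outlier
age = pd.Series(values, name="Age")

# Imputation: mean, median, and mode each give one candidate fill value.
fills = {
    "mean": age.mean(),
    "median": age.median(),
    "mode": age.mode().iloc[0],
}
# The median is used here because it is less sensitive to the outlier.
age_filled = age.fillna(fills["median"])

# Z-score detection: flag values more than 3 standard deviations from the mean
# (threshold assumed).
z_scores = (age_filled - age_filled.mean()) / age_filled.std()
z_outliers = age_filled[z_scores.abs() > 3]

# IQR detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (multiplier assumed).
q1, q3 = age_filled.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = age_filled[(age_filled < lower) | (age_filled > upper)]

# Trimming drops the flagged rows; capping clips them to the IQR fences.
age_trimmed = age_filled[age_filled.between(lower, upper)]
age_capped = age_filled.clip(lower=lower, upper=upper)

print(fills)
print(len(age_filled), len(age_trimmed), age_capped.max())
```

Trimming shortens the series while capping keeps every row, which matters if other columns must stay aligned with "Age". Passing `age_filled` and then `age_capped` to `seaborn.boxplot` would give a before-and-after comparison of the kind described above.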

To run the project:

- Clone the repository to your local machine.
- Open the Jupyter Notebook or Python script in your preferred development environment.
- Run the notebook cells or the script from top to bottom so the data cleaning and EDA steps execute in order.
- Explore the generated visualizations and the cleaned dataset ("onlineclean.csv").
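
For readers who want to check the cleaned file, the column-dropping, encoding, and export steps described earlier might look like the sketch below. The rows are made up, "Education" is a hypothetical ordinal column added only for illustration, and an ordered pandas `Categorical` stands in for whichever label encoder the project actually uses. The file is written to a temporary directory so the sketch does not overwrite a real "onlineclean.csv".

```python
import os
import tempfile

import pandas as pd

# Made-up rows; "Education" is a hypothetical ordinal column added for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Age": [23, 31, 27, 45],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Marital Status": ["Single", "Married", "Single", "Married"],
    "Education": ["Graduate", "School", "Post Graduate", "Graduate"],
})

# Drop leftover columns; errors="ignore" skips any name that is absent.
df = df.drop(columns=["Unnamed: 0", "Unnamed: 13", "9", "#@%"], errors="ignore")

# Ordinal column: map the ordered levels to integer codes (0, 1, 2).
levels = ["School", "Graduate", "Post Graduate"]
df["Education"] = pd.Categorical(
    df["Education"], categories=levels, ordered=True
).codes

# Nominal columns: one-hot encode into 0/1 indicator columns.
df = pd.get_dummies(df, columns=["Gender", "Marital Status"], dtype=int)

# Export without the index so no "Unnamed: 0" column appears on reload.
out_path = os.path.join(tempfile.mkdtemp(), "onlineclean.csv")
df.to_csv(out_path, index=False)

# Reload to confirm the cleaned file contains only numeric columns.
reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
print(reloaded.dtypes)
```

Writing with `index=False` is a deliberate choice: a saved index is read back as an extra unnamed column, which pandas labels "Unnamed: 0".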