This project demonstrates how to read multiple Excel files from a specific directory, concatenate the data into a single DataFrame, and then plot the distribution of file sizes.
├── README.md
├── requirements.txt
├── src
│   ├── cached_df
│   │   └── .gitkeep
│   ├── data_visualization
│   │   └── .gitkeep
│   ├── files_here
│   │   └── .gitkeep
│   ├── get_data.py
│   └── plot_graph.py
└── .gitignore
- `get_data.py`: Python script that reads Excel files from the `src/files_here` directory, concatenates them into a single DataFrame, and caches the result.
- `plot_graph.py`: Python script that plots the distribution of file sizes from the concatenated DataFrame and saves the plot as a PNG file.
- `src/files_here/`: Directory containing the Excel files to read.
- `cached_df.pkl`: Pickle file storing the cached DataFrame after concatenation.
- `src/cached_df/`: Directory containing the pickled DataFrame after concatenation.
- `src/data_visualization/`: Directory for generated graphs.
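
The read, concatenate, and cache flow described for `get_data.py` might look roughly like the sketch below. It is not the script's actual code: the function names `concat_excel_files` and `load_or_build`, the `*.xlsx` glob pattern, and the choice to return an empty DataFrame when no files are found are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def concat_excel_files(directory):
    """Read every .xlsx file in `directory` and stack the rows into one DataFrame.

    Hypothetical helper: the real get_data.py may select files differently.
    """
    frames = [pd.read_excel(path) for path in sorted(Path(directory).glob("*.xlsx"))]
    if not frames:
        return pd.DataFrame()  # nothing to read yet
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)


def load_or_build(cache_path, build):
    """Return the pickled DataFrame if it exists; otherwise call build() and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df


# Example call (paths taken from the project layout above):
# df = load_or_build("src/cached_df/cached_df.pkl",
#                    lambda: concat_excel_files("src/files_here"))
```

With this pattern the Excel files are parsed only on the first run; later runs load the pickle directly. Deleting `cached_df.pkl` forces a rebuild after the input files change.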
- Clone the repository: `git clone https://github.com/egekaplan/concat-excel-data.git`
- Navigate to the project directory: `cd concat-excel-data`
- Install the required dependencies: `pip install -r requirements.txt`
- Ensure your Excel files are placed in the `src/files_here` directory.
- Run `src/get_data.py` to read and concatenate the Excel files: `python3 get_data.py`
  This generates a cached DataFrame, `cached_df.pkl`, at `src/cached_df/cached_df.pkl`.
- Run `src/plot_graph.py` to plot the size distribution of the files: `python3 plot_graph.py`
  The resulting histograms are saved as `file_size_histogram.png` and `extension_frequency_histogram`.
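
As a rough illustration of the plotting step, the sketch below saves two charts from a DataFrame using matplotlib alone. The column names `size` and `extension`, the function names, and the bin count are assumptions; the columns in your Excel files and the code in `plot_graph.py` may differ.

```python
import matplotlib

matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt


def plot_size_histogram(df, out_path, size_column="size", bins=30):
    """Save a histogram of `size_column` to `out_path` and return the path."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(df[size_column].dropna(), bins=bins)  # missing sizes are skipped
    ax.set_xlabel(size_column)
    ax.set_ylabel("Number of files")
    ax.set_title("File size distribution")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory
    return out_path


def plot_extension_counts(df, out_path, extension_column="extension"):
    """Save a bar chart of how often each extension occurs and return the path."""
    counts = df[extension_column].value_counts()
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar([str(label) for label in counts.index], counts.to_numpy())
    ax.set_xlabel(extension_column)
    ax.set_ylabel("Number of files")
    ax.set_title("Extension frequency")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

Passing an output path inside `src/data_visualization/` keeps the generated images with the other graphs.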
- Ensure that Python is installed on your system.
- Additional libraries such as `pandas`, `matplotlib`, and `seaborn` are required. These dependencies are listed in `requirements.txt`.
- Modify the `sheet_name` and `header_row` variables in `get_data.py` according to your Excel file structure.
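
To show what these two settings control, here is a small sketch that passes them to `pandas.read_excel`. The values `"Sheet1"` and `2` and the helper name `read_one_file` are assumed examples, not the defaults in `get_data.py`.

```python
import pandas as pd

# Assumed example values; adjust them to match your workbook.
sheet_name = "Sheet1"  # worksheet to read: a name, or a 0-based position such as 0
header_row = 2         # 0-based index of the row that holds the column names


def read_one_file(path, sheet_name=sheet_name, header_row=header_row):
    """Read one worksheet, treating row `header_row` as the header."""
    # Rows above `header_row` are discarded; rows below it become data.
    return pd.read_excel(path, sheet_name=sheet_name, header=header_row)
```

For a sheet with two title rows above the column names, `header_row = 2` makes the third row the header, so the title rows do not end up in the data.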