Joel post 2 (#594)
* fix: Switch to English (US)

* feat: Add blog post 2

---------

Co-authored-by: Carlos Maltzahn <carlosm@ucsc.edu>
jay-tau and carlosmalt authored Aug 6, 2024
1 parent 4246faf commit da4fa73
Showing 2 changed files with 84 additions and 7 deletions.
15 changes: 8 additions & 7 deletions content/report/osre24/lbl/drishti/20240614-jaytau/index.md
@@ -5,6 +5,7 @@ summary:
authors:
- jaytau
- jeanlucabez
- "Suren Byna"
author_notes: ["CS Undergrad, BITS Pilani"]
tags: ["osre24", "uc", "LBNL", "data science", "visualization", "performance analysis", "I/O", "HPC", "AI"]
categories: ["GSoC'24"]
@@ -24,24 +25,24 @@ image:

Namaste everyone! 🙏🏻

I'm {{% mention jaytau %}}, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I'm truly honored to be part of this year's Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I'm particularly grateful to be working under the mentorship of Dr. {{% mention jeanlucabez %}}, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. [Suren Byna](https://sbyna.github.io), a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, "[Drishti: Visualization and Analysis of AI-based Applications](/project/osre24/lbl/drishti)", aims to extend the [Drishti](https://github.com/hpc-io/drishti) framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer's memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I'm working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I've gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computation often dominates, but in the realm of AI, the tables have turned. As models grow to billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn't just mean more computation; it means handling vastly more data. This shift makes I/O optimization not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my [proposal](https://docs.google.com/document/d/1zfQclXYWFswUbHuuwEU7bjjTvzS3gRCyNci08lTR3Rg/edit?usp=sharing), my tasks are threefold:

1. **Modularize Drishti's codebase**: Currently, it's a single 1700-line file that handles multiple functionalities. I'll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
2. **Enable multi-trace handling**: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I'll build a layer to aggregate these, providing a comprehensive view of the application's I/O behavior.
3. **Craft AI/ML-specific recommendations**: Current suggestions often involve MPI-IO or HDF5, which aren't typical in ML frameworks like PyTorch or TensorFlow. I'll create targeted recommendations that align with these frameworks' data pipelines.
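
To make the third task more concrete: for ML frameworks, the recommendations revolve around data-pipeline settings rather than MPI-IO hints. Below is a minimal, purely illustrative PyTorch sketch; the dataset is a stand-in, and the specific values would depend on what the traces reveal.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class StandInDataset(Dataset):
    """Placeholder for a real training set; __getitem__ is where the I/O happens."""
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # A real workload would read a sample from disk here (the I/O hot path).
        return torch.randn(3, 224, 224), idx % 10

# The kind of knobs an AI/ML-specific recommendation might point at:
loader = DataLoader(
    StandInDataset(),
    batch_size=64,
    num_workers=4,            # overlap data loading with compute
    pin_memory=True,          # speed up host-to-GPU transfers
    prefetch_factor=2,        # batches prefetched per worker (needs num_workers > 0)
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for batch, labels in loader:
    pass  # training step would go here
```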

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it's dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today's AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we're not just tweaking code. We're providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I'm deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.
76 changes: 76 additions & 0 deletions content/report/osre24/lbl/drishti/20240714-jaytau/index.md
@@ -0,0 +1,76 @@
---
title: Midway Through GSoC
subtitle: My Journey with Drishti
summary:
authors:
- jaytau
- jeanlucabez
author_notes: ["CS Undergrad, BITS Pilani"]
tags: ["osre24", "uc", "LBNL", "data science", "visualization", "performance analysis", "I/O", "HPC", "AI"]
categories: ["GSoC'24"]
date: 2024-07-31
lastmod: 2024-08-06
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
caption: ""
focal_point: ""
preview_only: false
---

Hello everyone! I'm {{% mention jaytau %}}, and I'm excited to share my progress update on the [Drishti](https://github.com/hpc-io/drishti) project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I've been diving deep into the world of I/O visualization for scientific applications, and I'm thrilled to tell you about the strides we've made.

## What is Drishti?

For those unfamiliar with Drishti, it's an application for visualizing the I/O traces of scientific applications. Understanding the I/O behavior of a complex scientific application can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using [Darshan](https://wordpress.cels.anl.gov/darshan/), a lightweight I/O characterization tool for HPC applications; while Drishti supports multiple log sources, our current work emphasizes Darshan logs because of the comprehensive I/O information they contain. Based on these logs, Drishti provides human-interpretable insights on how to improve I/O performance, along with clear, easy-to-understand graphs that help users grasp their application's I/O patterns, identify bottlenecks, and optimize performance.
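
Drishti's own parsing layer is more involved, but here is a rough sketch of what reading a Darshan log looks like with the PyDarshan bindings; the log path is hypothetical, and the exact fields depend on the PyDarshan version and on which modules were instrumented.

```python
import darshan

# Hypothetical log produced by a Darshan-instrumented run.
report = darshan.DarshanReport("app_run.darshan", read_all=True)

print(report.metadata["job"])    # job-level metadata (number of ranks, runtime, ...)
print(list(report.modules))      # modules captured, e.g. POSIX, MPI-IO, STDIO

# Per-module records can be converted to data frames for further analysis.
posix = report.records["POSIX"].to_df()
print(posix["counters"].head())
```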

## Progress and Challenges

### Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn't select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.
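
The real option name and plumbing in Drishti may differ, but a minimal sketch of exposing such a choice with `argparse` (names here are illustrative) looks like this:

```python
import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser(prog="drishti")
    parser.add_argument("log_path", help="path to the Darshan log to analyze")
    parser.add_argument(
        "--export-dir",
        default=None,
        help="directory where generated reports are written "
             "(defaults to the current working directory)",
    )
    return parser.parse_args()

def output_path(export_dir, filename):
    # Write wherever the user asked, never next to a possibly read-only log.
    base = export_dir or os.getcwd()
    os.makedirs(base, exist_ok=True)
    return os.path.join(base, filename)
```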

### CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don't adequately test the interactions between different branches of these interconnected tools. This is an area we've identified for future improvement to ensure smoother integration and fewer conflicts between projects.

### Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing a single Darshan file to handling multiple files. This task was more complex than it initially appeared, as Drishti's insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

1. Better separation of computation and condition checking
2. Easier parallelization of processing multiple traces
3. Finer-grained profiling of performance bottlenecks
4. More flexibility in data manipulation and memory management
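
As a rough illustration of the idea (the field and method names are hypothetical, not Drishti's actual ones), each trace gets its own record, and checks and aggregation operate on those records instead of living inside one monolithic parser:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TraceSummary:
    """Per-trace state extracted from one Darshan log."""
    path: str
    total_bytes_read: int = 0
    total_bytes_written: int = 0
    small_requests: int = 0               # e.g. requests under some size threshold
    issues: List[str] = field(default_factory=list)

    def check_small_requests(self, threshold: int = 1000) -> None:
        # Condition checking sits next to the data it inspects.
        if self.small_requests > threshold:
            self.issues.append(f"{self.path}: high number of small I/O requests")

def aggregate(traces: List[TraceSummary]) -> Dict[str, object]:
    """Combine several per-trace summaries into one application-level view."""
    return {
        "bytes_read": sum(t.total_bytes_read for t in traces),
        "bytes_written": sum(t.total_bytes_written for t in traces),
        "issues": [issue for t in traces for issue in t.issues],
    }
```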

## Learnings and Skills Gained

Through this process, I've gained valuable insights into:

1. Refactoring large codebases
2. Understanding and improving cross-project dependencies
3. Implementing data classes in Python for better code organization
4. Balancing performance with code readability and maintainability

## Next Steps

As I move forward with the project, my focus will be on:

1. Adding unit tests for individual methods to ensure functionality
2. Exploring alternative data frame implementations such as Polars for better performance (see the sketch after this list)
3. Developing aggregation methods for different types of data across multiple Darshan files
4. Optimizing memory usage and computational efficiency for large datasets
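
For the Polars exploration in step 2, here is a purely illustrative sketch of combining per-trace frames, assuming a recent Polars release (where the method is `group_by`); the column names are made up, and real Darshan-derived frames carry far more counters.

```python
import polars as pl

# Two per-trace frames, as they might be extracted from separate Darshan logs.
trace_a = pl.DataFrame({"rank": [0, 1], "bytes_read": [1_048_576, 524_288]})
trace_b = pl.DataFrame({"rank": [0, 1], "bytes_read": [2_097_152, 262_144]})

# Aggregate across traces to get an application-level view per rank.
per_rank = (
    pl.concat([trace_a, trace_b])
    .group_by("rank")
    .agg(pl.col("bytes_read").sum().alias("total_bytes_read"))
    .sort("rank")
)
print(per_rank)
```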

## Conclusion

Working on Drishti has been an incredible learning experience. I've had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I'm excited about the potential impact of these improvements on the scientific community's ability to optimize their applications' I/O performance.

I'm grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to [reach out to me](https://www.jaytau.com/#contact?ref=uc-ospo). Let's keep pushing the boundaries of scientific computing together!
