This repository contains my Bachelor's CS degree project, together with its timeline and incremental progress.
- ☑️ Define a specific set of use cases, detailed in the Images/Specifications.png file, as a result of the discussion with Flavian, Cosmin R., and Dan T.
- ☑️ Research on the appropriate technologies (Apache Flume, Hadoop, Solr)
- ☑️ Create the project's preliminary architecture and establish it
- ☑️ Research on Apache Flume to see if it supports metadata extraction and different log formats
- ☑️ Create a prototype Solr project using Docker and SolrJ
- ☑️ Request access to AWS infrastructure
- ☑️ Discuss with Ciprian D. (coordinating professor) to get approval for the project architecture and features (last week's progress)
- ☑️ Test Solr with a manual configuration to understand the flow (results presented here)
- ☑️ Research on different types of logs and think about a unified, structured data transfer object
- ☑️ Research on Solr Analyzers
- ☑️ Get access to AWS and Splunk
- ☑️ Research on Morphlines and MapReduceIndexerTool
- ☑️ Discuss with Andrei F. to get access to the AWS infrastructure
- ☑️ Bootstrap the infrastructure and get access to Cloudera dashboard
- ☑️ Analyze the current infrastructure
- ☑️ Install Flume through Cloudera (this needs to be made persistent in the future)
- ☑️ Analyze all Keystone logs to see their format
- ☑️ Create a simulator that takes a large log file, splits it into multiple files and archives those files
- ☑️ Create a parser that decompresses each archive and converts each log event into a structured JSON model
- ☑️ Test the parser along with the simulator
- ☑️ Modify the parser to support high-speed archive ingestion
- ☑️ Add support for multiline logs (e.g., stack traces)
- ☑️ Add parser functions that support only a subset of log formats (ongoing work)
- ☑️ Research on the Grok parser as the final parsing solution
- ☑️ Created a test project to illustrate the functionality of Grok
- ☑️ Started to analyze the Flume configuration
- ☑️ Modified the Flume config file to send data to a specific HDFS directory
- ☑️ Changed the Flume config to send an entire blob (file) to HDFS, with a configured maximum size
- ☑️ Tested the LogGenerator along with the LogParser and with Flume
- ☑️ Modified both projects accordingly to pass the functionality test
- ☑️ Research on the MapReduceIndexerTool job
- ☑️ Research on the Morphline concept along with the MapReduceIndexerTool
- ☑️ Created a Morphline config file that matches the project needs
- ☑️ Created a script that starts the indexing job
- ☑️ Debugged a strange MapReduceIndexerTool error with the help of Stack Overflow
- ☑️ Managed to run the indexing job in --dry-run mode (without loading into Solr)
- ☑️ Modified the Morphline config file to load the index into Solr
- ☑️ Created a script that generates a Solr Config file
- ☑️ Created a script that creates a Solr Core based on a generated configuration
- ☑️ Adapted the default Solr schema to match the serialized JSON model of a log event
- ☑️ Ran the entire flow and checked the index correctness on small files with a few models
- ☑️ Implemented the Grok engine in the project parser (see the Grok sketch after this list)
- ☑️ Created a mockup client project
- ☑️ Implemented the TarGz decompressor (see the decompressor sketch after this list)
- ☑️ Implemented the Zip decompressor
- ☑️ Created unit tests (using JUnit) for all decompressors
- ☑️ Presented the current progress to the coordinating professor
- ☑️ Discussed with Cosmin R. how to trigger the index job using SQS
- ☑️ Created a set of configuration scripts that prepare the newly created infrastructure
- ☑️ Created the daemon that runs on the HDFS machine and triggers the index job (IndexTrigger)
- ☑️ Started building a desktop app UI using Java Swing
- ☑️ Replaced the Swing UI with a JavaFX one (because it is more flexible)
- ☑️ Tried to fix the QE cluster with Dragos C. and Vlad C.
- ☑️ Built the presentation for the Scientific Communication Session 2018
- ☑️ Wrote a document covering the initial work on this project
- ☑️ Added Apache Commons CLI support to the Client project (see the CLI sketch after this list)
- ☑️ Added a custom control in the desktop UI for the search fields
- ☑️ Developed the SolrAPI class on the client (see the SolrJ query sketch after this list)
- ☑️ Created custom classes for each command
- ☑️ Created the Spring Boot REST API in the HadoopDriver project (index trigger; see the endpoint sketch after this list)
- ☑️ Fixed some package collisions that were producing a corrupt fat JAR
- ☑️ Developed a HadoopRestAPI client in the Client project
- ☑️ Tested the API to make sure that index-now and index-interval commands work as expected
- ☑️ Created a command executor model on the client
- ☑️ Developed the export command in both CLI and GUI
- ☑️ Implemented merge logic between the time-interval and date-interval commands
- ☑️ Started to work on the official documentation
- ☑️ Added S3 download functionality to the parser
- ☑️ Added SQS receive logic to the parser
- ☑️ Implemented the processing logic for each archive using an ExecutorService (the SQS/S3 ingestion sketch after this list illustrates this flow)
- ☑️ Created a test S3 bucket and tested the developed workflow
- ☑️ Created a multithreaded DataGenerator project that generates archives and uploads them to S3
- ☑️ Finalized the client side and tested it manually
- ☑️ Implemented the Job Scheduler on the Hadoop Driver (using a client-controllable timer; see the scheduler sketch after this list)
- ☑️ Developed logic to detect when the indexing job has finished
- ☑️ Diagnosed a log4j deadlock (in callAppenders) caused by a missing log4j.properties file
- ☑️ Accelerated the work on the documentation
- ☑️ Worked on the presentation for the Keystone team
- ☑️ Presented the project to the Keystone team
- ☑️ Worked on the documentation
- ☑️ Loaded 34.7 GB of data into the system and tested the entire data flow
- ☑️ Worked on the Faculty formalities regarding the diploma exam
- ☑️ Finalized and delivered the documentation
- ☑️ Loaded 120 GB of data and tested the entire workflow
- ☑️ Created the official presentation for next week
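
The sketches below illustrate a few of the building blocks mentioned in the checklist above. They are hedged approximations written under stated assumptions, not the project's actual code.

First, the Grok-based parsing. This is a minimal sketch assuming the io.krakens java-grok library; the pattern, field names, and log line are placeholders, while the real parser loads its own pattern per log format.

```java
import io.krakens.grok.api.Grok;
import io.krakens.grok.api.GrokCompiler;
import io.krakens.grok.api.Match;

import java.util.Map;

public class GrokSketch {

    public static void main(String[] args) throws Exception {
        // Register the library's built-in patterns (TIMESTAMP_ISO8601, LOGLEVEL, ...).
        GrokCompiler compiler = GrokCompiler.newInstance();
        compiler.registerDefaultPatterns();

        // Placeholder pattern; the real parser would use one pattern per supported log format.
        Grok grok = compiler.compile(
                "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}");

        String line = "2018-05-14T10:22:01,123 ERROR Connection to broker lost";

        // Extract the named captures into a map, ready to be serialized as the JSON log model.
        Match match = grok.match(line);
        Map<String, Object> fields = match.capture();
        fields.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```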
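The TarGz decompressor can be outlined with Apache Commons Compress. The class and method names below (TarGzDecompressor, extractTo) are hypothetical and only show the general shape, not the project's implementation.

```java
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative .tar.gz extractor; the real decompressor may differ. */
public class TarGzDecompressor {

    public void extractTo(Path archive, Path targetDir) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(archive));
             TarArchiveInputStream tar =
                     new TarArchiveInputStream(new GzipCompressorInputStream(in))) {

            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                if (!out.startsWith(targetDir)) {
                    continue; // guard against path traversal ("zip slip") entries
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(tar, out); // copies only the current entry's bytes
                }
            }
        }
    }
}
```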
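The SQS/S3 ingestion loop on the parser side (receive a message, download the referenced archive from S3, hand it to a worker pool) could look roughly like the sketch below, written against the AWS SDK for Java v1. The queue URL, bucket name, message format, and thread count are assumptions.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ArchiveIngestor {

    // Placeholder values; the real parser reads these from its configuration.
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/log-archives";
    private static final String BUCKET = "log-archives-bucket";

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        while (true) { // poll forever; a real daemon would also handle shutdown
            // Assume each SQS message body is simply the S3 key of a newly uploaded archive.
            for (Message message : sqs.receiveMessage(QUEUE_URL).getMessages()) {
                String key = message.getBody();
                pool.submit(() -> {
                    File local = new File("/tmp/" + key.replace('/', '_'));
                    s3.getObject(new GetObjectRequest(BUCKET, key), local);
                    // ... decompress and parse the archive, then emit the JSON log events ...
                });
                // A real implementation would delete only after successful processing.
                sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
            }
        }
    }
}
```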
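The Spring Boot REST API on the HadoopDriver exposes commands such as index-now and index-interval. The controller below is only a sketch of what such endpoints could look like; the class name, paths, and response bodies are hypothetical, and the controller is meant to live inside a regular Spring Boot application.

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

/** Hypothetical controller shape for the index trigger; names are placeholders. */
@RestController
public class IndexTriggerController {

    @PostMapping("/index-now")
    public ResponseEntity<String> indexNow() {
        // In the real driver this would launch the MapReduceIndexerTool job immediately.
        return ResponseEntity.ok("Indexing job started");
    }

    @PostMapping("/index-interval")
    public ResponseEntity<String> indexInterval(@RequestParam("minutes") long minutes) {
        // In the real driver this would (re)schedule the periodic indexing timer.
        return ResponseEntity.ok("Indexing scheduled every " + minutes + " minutes");
    }
}
```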
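The job scheduler on the Hadoop Driver runs the indexing job on a client-controllable interval. The sketch below uses a ScheduledExecutorService instead of a plain timer, which serves the same purpose; the class name and method signatures are placeholders.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler shape: the interval can be changed or cancelled by the client. */
public class IndexJobScheduler {

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> scheduled;

    /** (Re)schedules the indexing job to run every given number of minutes. */
    public synchronized void scheduleEvery(long minutes, Runnable indexJob) {
        cancel();
        scheduled = executor.scheduleAtFixedRate(indexJob, 0, minutes, TimeUnit.MINUTES);
    }

    /** Stops periodic indexing, e.g. when the client switches to index-now only. */
    public synchronized void cancel() {
        if (scheduled != null) {
            scheduled.cancel(false);
            scheduled = null;
        }
    }
}
```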
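On the client side, the SolrAPI class wraps searches against the Solr core. A minimal SolrJ sketch of such a query follows; the Solr URL, core name, field names, and time interval are placeholders rather than the client's actual values.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name; the real client reads these from its configuration.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {

            // Search for ERROR events in a hypothetical date interval.
            SolrQuery query = new SolrQuery("level:ERROR");
            query.addFilterQuery("timestamp:[2018-05-01T00:00:00Z TO 2018-05-31T23:59:59Z]");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("timestamp")
                        + " " + doc.getFieldValue("message"));
            }
        }
    }
}
```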
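Finally, command-line parsing in the Client project is built on Apache Commons CLI. The sketch below shows the general pattern; the option names are illustrative and are not the client's actual flags.

```java
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class ClientCliSketch {

    public static void main(String[] args) {
        // Illustrative options; the real client defines options per command.
        Options options = new Options();
        options.addOption("q", "query", true, "Solr query string");
        options.addOption("e", "export", true, "export results to the given file");
        options.addOption("h", "help", false, "print usage");

        try {
            CommandLine cmd = new DefaultParser().parse(options, args);
            if (cmd.hasOption("h")) {
                new HelpFormatter().printHelp("client", options);
                return;
            }
            if (cmd.hasOption("q")) {
                System.out.println("Would run query: " + cmd.getOptionValue("q"));
            }
        } catch (ParseException e) {
            System.err.println("Invalid arguments: " + e.getMessage());
        }
    }
}
```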