We developed a robust and scalable system that proactively identifies new generally available (GA) software products and checks their availability on the G2 software marketplace. The goal is to compile a list of products that are not yet listed on G2, simplifying the process of onboarding them onto the platform.
- Identify New GA Products: The system should periodically gather information about newly released software products.
- Check Availability on G2: Utilize the G2 API to verify if identified products are listed on the platform.
- Compile a List: Maintain a record of products that are not currently listed on G2.
- Streamline Onboarding: Simplify the process of listing new products on G2.
Our Solution - Discovery Dino
Welcome to Discovery Dino, your friendly assistant designed to simplify the process of discovering and listing the latest Generally Available (GA) software products on G2. We've developed a cloud-native solution deployed on AWS, ensuring scalability, robustness, and efficiency.
Our solution comprises three key sections, seamlessly integrated to streamline the product discovery and onboarding process.
We've implemented a robust scraping mechanism to gather unstructured data from reliable sources. This data is streamed into Kafka and subsequently stored in our data lake on AWS S3, ensuring real-time ingestion and scalability.
Once the data is stored in our data lake, our data processing pipeline comes into play. We process the data with Large Language Models (LLMs), including OpenAI's GPT-3 and Llama 2 hosted on our own instances. This involves extracting features, categorizing products, identifying business types, and enhancing product descriptions. The processed data is then ingested into our MongoDB database, enabling seamless access and search functionalities.
To empower users in navigating and interacting with the collected information, we've developed a user-friendly web application. This application acts as a co-pilot, offering intuitive features such as filtering, search capabilities, and AI-driven insights. Users can easily explore and identify new GA products, making informed decisions effortlessly.
- Real-time Data Acquisition: Continuous scraping and ingestion ensure up-to-date product information.
- AI-powered Processing: Utilizing LLMs for advanced data processing and enrichment.
- Efficient Storage and Retrieval: Data is stored in a scalable manner on AWS S3 and MongoDB.
- Intuitive User Interface: The web application provides a seamless experience for accessing and interacting with the data.
- Scalable and Cloud-native: Deployed on AWS, our solution can handle large volumes of data and user interactions without compromising performance.
To set up and run Discovery Dino on your local machine, follow these steps:
- Clone the repository:

      git clone https://github.com/Sakthe-Balan/DiscoveryDino

- Download the environment file (`.env`) from the provided link and place it in the project directory. The expected variables are listed in `.env.template`:

      MONGO_URI=
      DB_NAME=
      AWS_ACCESS_KEY_ID=
      AWS_SECRET_ACCESS_KEY=
      AWS_REGION=
      OPENAI_API_KEY=
      SERPAPI_API_KEY=
      G2_API_KEY=
      LEPTON_API_KEY=

- Backend Setup:

      cd DiscoveryDino/server
      pip install -r requirements.txt
      python main.py

- Frontend Setup:

      cd DiscoveryDino/discoverydino
      npm install
      npm run build
      npm start

WARNING: If you want to scrape locally, make sure your Kafka server is set up and its URL is configured accordingly (see the `kafka` folder). Alternatively, use `s3_spiders`, which write directly to S3.
Alternatively, you can use Docker to run Discovery Dino:
- Build and run the frontend and backend images:

      docker-compose up
By following these steps, you'll have Discovery Dino up and running locally, ready to explore and discover the latest Generally Available (GA) software products effortlessly.
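Before starting the backend, it can help to confirm that every variable from `.env.template` is actually set. The following is a minimal sketch, not part of the repository: it assumes the variables have already been loaded into the process environment, and that all of them are required, which may not hold for every component. `REQUIRED_VARS` and `missing_env_vars` are hypothetical names.

```python
import os

# Variable names copied from .env.template in the setup steps.
REQUIRED_VARS = [
    "MONGO_URI", "DB_NAME", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION", "OPENAI_API_KEY", "SERPAPI_API_KEY", "G2_API_KEY",
    "LEPTON_API_KEY",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running a check like this once at startup surfaces a misconfigured environment early, rather than at the first database or API call.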
Our technology stack for Discovery Dino is carefully selected to ensure scalability, performance, and ease of development. Each component plays a crucial role in delivering a robust and efficient solution.
- FastAPI: FastAPI is a modern, high-performance web framework for building APIs with Python 3. Its asynchronous design makes it well suited to our backend API services.
- MongoDB: MongoDB is a NoSQL database that provides flexibility and scalability for handling unstructured data. It's well-suited for our use case of storing and retrieving product information efficiently.
- Kafka: Kafka is used for stream processing and real-time data ingestion. It enables scalable and reliable messaging between different components of our system.
- Qdrant: Qdrant is a high-performance vector database optimized for similarity search and recommendation systems. It enhances our data analysis capabilities for efficient product categorization and search.
- Next.js: Next.js is a React framework for building server-rendered applications. It offers benefits like improved SEO, faster page loading, and efficient routing, making it ideal for our web application.
- AWS (Amazon Web Services):
- Lambda: AWS Lambda allows us to run serverless functions, enabling event-driven architectures and reducing operational overhead.
- EC2 (Elastic Compute Cloud): EC2 provides scalable compute capacity in the cloud, allowing us to host and run our backend services.
- S3 (Simple Storage Service): AWS S3 is used as our data lake for storing scraped data and other assets securely and at scale.
- Performance: FastAPI, Kafka, and Qdrant are chosen for their high performance, enabling real-time data processing and efficient querying.
- Scalability: AWS services (Lambda, EC2, S3) provide scalability and flexibility, allowing us to handle varying workloads and data volumes effectively.
- Ease of Development: Next.js simplifies frontend development with its built-in features like automatic code splitting, server-side rendering, and routing.
Our scraping and data acquisition process is designed to efficiently gather product information from various sources and store it in a structured manner for further processing.
- Kafka Server Setup:
  - Start by setting up the Kafka server as described in the instructions in the `kafka` folder. The `consumer.py` script consumes the data that the crawlers stream to the producer and then uploads it to the designated S3 bucket.
- Crawlers:
  - Inside the `dino` folder, we have two sets of crawlers:
    - s3_spiders: These crawlers push JSON data directly to the S3 bucket without using Kafka.
    - spiders: These crawlers stream each scraped JSON record to Kafka before it is uploaded to S3.
- Concurrent Spider Execution:
  - The spiders are designed to run concurrently, with each spider executing as a separate background process. This parallel execution enables efficient scraping of data from multiple sources simultaneously.
- FastAPI Server Integration:
  - All functionality related to running the spiders, querying databases, and preprocessing scraped data is defined in the `main.py` of our FastAPI server.
  - The FastAPI server provides endpoints to start, stop, and monitor the status of individual spiders, as well as a mechanism to trigger concurrent scraping of all spiders.
- Template for Creating Scrapers:
  - We provide a `spider_template.py` that simplifies the process of creating new scrapers for additional websites. This template includes placeholders for inserting scraping selectors and generating JSON output.
- Environment Setup:
  - Each crawler requires its own environment variables defined in a `.env` file (template provided). Set these variables before running the crawlers.
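The concurrent execution described above, with each spider running as its own background process, can be sketched with the standard library. This is an illustration under assumptions rather than the code in `main.py`: `SpiderRunner` and `command_for` are hypothetical names, and the placeholder command stands in for whatever actually launches a spider.

```python
import subprocess
import sys


class SpiderRunner:
    """Tracks one background process per spider name."""

    def __init__(self, command_for):
        # command_for: hypothetical callable mapping a spider name to an argv list.
        self._command_for = command_for
        self._procs = {}

    def start(self, name):
        """Launch the spider unless it is already running."""
        if self.status(name) == "running":
            return False
        self._procs[name] = subprocess.Popen(self._command_for(name))
        return True

    def status(self, name):
        proc = self._procs.get(name)
        if proc is None:
            return "not started"
        return "running" if proc.poll() is None else "finished"

    def stop(self, name, timeout=5.0):
        """Terminate a running spider, escalating to kill if it does not exit."""
        proc = self._procs.get(name)
        if proc is None or proc.poll() is not None:
            return False
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        return True


if __name__ == "__main__":
    # Placeholder command: a sleeping Python process stands in for a real spider.
    runner = SpiderRunner(
        lambda name: [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    runner.start("demo_spider")
    print(runner.status("demo_spider"))
    runner.stop("demo_spider")
    print(runner.status("demo_spider"))
```

Keeping a registry keyed by spider name is one way to back start, stop, and status endpoints with the same bookkeeping.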
To recreate the scraping and data acquisition process locally, follow these steps:
- Install required dependencies:

      pip install -r requirements.txt

- Set up the Kafka server and `consumer.py` (instructions in the `kafka` folder). Alternatively, use `s3_spiders` if you prefer direct data upload to S3.
- Run the FastAPI server:

      uvicorn main:app --reload

- Trigger the scraping process:
  - Invoke the `/scrape` endpoint of the FastAPI server (http://localhost:8000/scrape) to start the scraping process.
By following these steps, you can replicate our scraping and data acquisition pipeline locally and explore the functionalities of Discovery Dino's backend system.
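The Kafka path in this section (crawlers stream JSON, a consumer uploads it to S3) can be sketched independently of any client library. This is a hedged illustration, not the project's `consumer.py`: `drain_to_store`, `put_object`, the batching, and the key naming are all assumptions. In a real deployment `messages` would come from a Kafka consumer and `put_object` would wrap an S3 upload.

```python
import json


def drain_to_store(messages, put_object, batch_size=100, prefix="scraped/batch"):
    """Group JSON messages into batches and pass each batch to put_object(key, body)."""
    batch, uploaded_keys = [], []

    def flush():
        # Key naming is illustrative: a zero-padded running batch index.
        key = f"{prefix}-{len(uploaded_keys):05d}.json"
        put_object(key, json.dumps(batch).encode("utf-8"))
        uploaded_keys.append(key)
        batch.clear()

    for raw in messages:
        try:
            batch.append(json.loads(raw))
        except ValueError:
            continue  # skip malformed payloads instead of stopping the stream
        if len(batch) >= batch_size:
            flush()
    if batch:
        flush()  # upload the final partial batch
    return uploaded_keys


if __name__ == "__main__":
    store = {}  # in-memory stand-in for the S3 bucket
    keys = drain_to_store(['{"title": "A"}', '{"title": "B"}'], store.__setitem__)
    print(keys)
```

Passing the uploader in as a callable keeps the batching logic testable without a broker or a bucket.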
After acquiring semi-structured data stored in S3, the data processing and ingestion phase involves transforming and enriching the data for meaningful analysis and storage.
- Dynamic Data Retrieval:
- Retrieve the latest files from the designated S3 bucket dynamically, downloading them onto serverless containers for temporary processing.
- Data Batch Processing:
- Batch process the downloaded data, focusing on understanding product details and identifying existing entries on G2 to minimize API calls.
- Utilizing Language Models (LLMs):
- For detailed product descriptions, leverage LLMs such as GPT-3 and Llama 2 (hosted on our own cluster) to enhance and enrich product information.
- Feature Extraction and Categorization:
- Extract features like categories and identify business types (B2B or B2C) to enhance product understanding and categorization.
- Integration with External APIs:
- Integrate with external APIs (e.g., Google Search API) to supplement data and gather the latest information where feasible.
- Database Integration:
- Push processed and enriched data into MongoDB, ensuring a structured and organized repository of product information.
- MongoDB is structured to store detailed product entries, including descriptions, categories, business types, and metadata.
- The database is continuously updated with new product information sourced from the scraping and ingestion pipeline.
By following this data processing and ingestion workflow, Discovery Dino ensures that product information is systematically analyzed, enriched, and stored for efficient retrieval and analysis.
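The dynamic retrieval step, picking the latest files from the bucket, can be illustrated as a small selection function. This is a sketch under assumptions: `latest_keys` is a hypothetical helper, and the input is assumed to look like an S3 object listing, where each entry carries a `Key` and a `LastModified` timestamp.

```python
from datetime import datetime, timezone


def latest_keys(objects, since=None, limit=10):
    """Return up to `limit` keys, newest first, optionally only those modified after `since`."""
    fresh = [o for o in objects if since is None or o["LastModified"] > since]
    fresh.sort(key=lambda o: o["LastModified"], reverse=True)
    return [o["Key"] for o in fresh[:limit]]


# Illustrative listing; the keys and dates are made up for the example.
listing = [
    {"Key": "scraped/a.json", "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Key": "scraped/b.json", "LastModified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"Key": "scraped/c.json", "LastModified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]
print(latest_keys(listing, limit=2))
```

Tracking the timestamp of the last processed file and passing it as `since` would be one way to avoid reprocessing older uploads.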
Our web application provides a user-friendly interface to explore and interact with the collected product data, offering intuitive features for enhanced discovery and decision-making.
- Dynamic Dashboard:
- Display all scraped products that are not yet listed on G2, providing visibility into potential additions to the platform.
- Filtering and Sorting:
- Implement filters based on star ratings, categories, and other criteria to refine product searches.
- Dynamic Search:
- Enable users to search the entire database for specific products or keywords, ensuring quick access to relevant information.
- Detailed Product Information:
- View detailed product descriptions, website links, ratings, and reviews with a single click for comprehensive insights.
- Integration with Database:
- Query product data directly from the MongoDB database, which is continuously updated with fresh information from the preprocessing pipeline.
Our web application acts as a powerful tool for software discovery, leveraging the processed data to facilitate informed decision-making and effortless navigation of the vast landscape of B2B software products. Experience the convenience of Discovery Dino's user-centric interface for exploring the latest Generally Available (GA) products!
In our Discovery Dino project, we have designed a MongoDB database to efficiently store and manage various types of product information, categorizing them based on their source, filtering status, and existing presence on the G2 platform.
- `scraped_products`: This collection contains all products that have been scraped, providing detailed information for search and analysis purposes.

  ```
  {
    "_id": "string",
    "productName": "string",
    "photoUrl": "string",
    "description": "string",
    "rating": 0,
    "similarProducts": ["list of URLs"],
    "contactMail": "string",
    "website": "string",
    "category": ["string"],
    "additionalInfo": "string",
    "scarpedLink": "string",
    "reviews": ["list of review objects"]
  }
  ```

- `filtered_products`: This collection stores products that have been filtered as B2B and are not currently listed on G2.

  ```
  {
    "_id": "string",
    "productName": "string",
    "photoUrl": "string",
    "description": "string",
    "rating": 0,
    "similarProducts": ["list of URLs"],
    "contactMail": "string",
    "website": "string",
    "category": ["string"],
    "additionalInfo": "string",
    "scarpedLink": "string",
    "reviews": ["list of review objects"]
  }
  ```

- `g2_products`: This collection serves as a cache for products already listed on G2, storing relevant metadata and associations.

  ```
  {
    "associatedProductName": "product name for which we got this result",
    "id": "129d01fa-f6db-4477-a3f2-549cee2b6d54",
    "type": "products",
    "links": {},
    "attributes": {},
    "relationships": {}
  }
  ```
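Since `g2_products` acts as a cache, a batch of scraped names can be split into those already known and those that still need a G2 API lookup. The sketch below is an assumption about how such a check could work, not the project's implementation: `split_by_g2_cache` is a hypothetical helper, and matching on a lowercased, trimmed `associatedProductName` is an illustrative choice.

```python
def split_by_g2_cache(product_names, cached_names):
    """Separate names found in the cache from names that still need an API lookup."""
    cached = {name.strip().lower() for name in cached_names}
    known, unknown = [], []
    for name in product_names:
        target = known if name.strip().lower() in cached else unknown
        target.append(name)
    return known, unknown


# cached_names would come from the associatedProductName field of g2_products.
known, unknown = split_by_g2_cache(
    ["Workspace ONE", "Some New Tool"],
    ["workspace one"],
)
print(known, unknown)
```

Only the names in `unknown` would then be sent to the G2 API, which is the point of keeping the cache.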
We have predefined categories to classify products based on their nature and functionality, enabling efficient organization and filtering.
categories = [
"Sales Tools",
"Marketing",
"Analytics Tools & Software",
"Artificial Intelligence",
"AR/VR",
"B2B Marketplaces",
"Business Services",
"CAD & PLM",
"Collaboration & Productivity",
"Commerce",
"Content Management",
"Converged Infrastructure",
"Customer Service",
"Data Privacy",
"Design",
"Development",
"Digital Advertising Tech",
"Ecosystem Service Providers",
"ERP",
"Governance, Risk & Compliance",
"Greentech",
"Hosting",
"HR",
"IoT Management",
"IT Infrastructure",
"IT Management",
"Marketing Services",
"Marketplace Apps",
"Office",
"Other Services",
"Professional Services",
"Routers",
"Security"
]
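When categories are produced by an LLM, the labels may differ from the predefined list in case or spacing. A minimal sketch of normalizing them, assuming exact matches after lowercasing are sufficient; `normalize_categories` is a hypothetical helper and `ALLOWED` is a shortened copy of the list above:

```python
# Shortened copy of the predefined categories, for illustration only.
ALLOWED = ["Sales Tools", "Marketing", "Artificial Intelligence", "Security", "HR"]


def normalize_categories(raw_labels, allowed):
    """Map free-form labels onto the predefined list, ignoring case and outer whitespace."""
    lookup = {category.lower(): category for category in allowed}
    result = []
    for label in raw_labels:
        match = lookup.get(label.strip().lower())
        if match is not None and match not in result:
            result.append(match)  # keep first-seen order, drop duplicates
    return result


print(normalize_categories([" security", "HR", "Blockchain", "hr"], ALLOWED))
```

Labels that do not match any predefined category are dropped here; routing them to a fallback such as "Other Services" would be an equally reasonable design.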
Our data lake setup in Amazon S3 organizes scraped data into a structured format for further processing and analysis.
{
"title": "Workspace ONE",
"description": "Workspace ONE is a user-friendly intelligence-driven digital workspace solution that enables users to securely manage and deliver any app anywhere and on any device. It provides one source of truth for end user access, provisioning, security, compliance, and management across all devices",
"price": "4.7",
"image_url": "data:image/svg+xml,%3csvg%20xmlns=%27http://www.w3.org/2000/svg%27%20version=%271.1%27%20width=%27104%27%20height=%27104%27/%3e",
"link": "https://www.softwareadvice.com/help-desk/workspace-one-profile/",
"additional_info": "Workspace ONE is a user-friendly intelligence-driven digital workspace solution that enables users to securely manage and deliver any app anywhere and on any device. It provides one source of truth for end user access, provisioning, security, compliance, and management across all devices",
"website": "https://www.vmware.com/products/workspace-one.html",
"reviews": [
{ "content": "Remote user access, application/desktop virtualization." },
{
"content": "The implementation took some time but all of our end-users are extremely happy to be off the legacy MDM platform."
},
{
"content": "IT implemented it and I was tasked with enforcing security controls and policies. Integrating it with our SIEM has been a huge pain so we ended up going with something else. Forcing HD encryption to all endpoints was nice."
}
]
}
- title → productName
- description → description
- price → rating
- image_url → photoUrl
- link → scarpedLink
- additional_info → additionalInfo
- website → website
- reviews → reviews
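The field mapping above translates directly into a small conversion function. This is a sketch rather than the project's preprocessing code: `FIELD_MAP` and `to_product_document` are hypothetical names, and converting the string `price` into a numeric `rating` is an assumption based on the numeric `rating` field in the collection schema.

```python
# Source field in the S3 record -> target field in the MongoDB document.
FIELD_MAP = {
    "title": "productName",
    "description": "description",
    "price": "rating",
    "image_url": "photoUrl",
    "link": "scarpedLink",  # spelling follows the collection schema
    "additional_info": "additionalInfo",
    "website": "website",
    "reviews": "reviews",
}


def to_product_document(raw):
    """Rename the fields of a scraped record according to FIELD_MAP."""
    doc = {target: raw[source] for source, target in FIELD_MAP.items() if source in raw}
    if "rating" in doc:
        try:
            doc["rating"] = float(doc["rating"])
        except (TypeError, ValueError):
            doc["rating"] = None  # assumption: unparseable ratings become null
    return doc


print(to_product_document({"title": "Workspace ONE", "price": "4.7"}))
```

Fields missing from the scraped record are simply omitted from the result, so downstream code should not assume every key is present.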
The data processing phase involves dynamically retrieving data from S3, processing it for detailed insights, and storing the enriched data in MongoDB for further analysis and application usage. This workflow ensures that product information is organized, categorized, and made accessible for effective decision-making.
By implementing this comprehensive database design and data processing workflow, Discovery Dino optimizes the handling and utilization of scraped product data, enabling efficient discovery and analysis of B2B software products.
This document provides detailed information on the API endpoints of the FastAPI application.
Each endpoint in this documentation includes details about its purpose, the parameters it requires, and the expected response format.
Endpoint: /
Method: GET
Parameters: None
Response:
- Type: JSON
- Content:
  - message: Welcome message and service information.
  - DB_Status: Status of the MongoDB connection.
Description: Provides basic information about the service, including the status of the connection to MongoDB.
Endpoint: /api/data
Method: GET
Parameters:
- limit (int): The number of documents to retrieve.
Response:
- Type: JSON
- Content: An array of documents from the specified MongoDB collection.
Description: Retrieves a specified number of documents from a MongoDB collection.
Endpoint: /api/search
Method: GET
Parameters:
- collection (str): Name of the MongoDB collection to search in.
- searchString (str, optional): String to search for within the productName field.
- limit (int, optional): The number of documents to retrieve.
Response:
- Type: JSON
- Content:
  - collection: The MongoDB collection searched.
  - searchString: The string searched for.
  - results: An array of search results.
Description: Searches for a specific string within a specified collection based on the productName field.
Endpoint: /api/filter
Method: GET
Parameters:
- collection (str): Name of the MongoDB collection to search in.
- rating (str, optional): The star rating to filter by.
- category (str, optional): Category to filter from.
- limit (int, optional): The number of documents to retrieve.
Response:
- Type: JSON
- Content:
  - collection: The MongoDB collection searched.
  - results: An array of documents that meet the filter criteria.
Description: Applies filters to the data based on rating and category within a specified collection.
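The search and filter parameters above could be turned into MongoDB query documents along the following lines. This is a sketch, not the server's actual query logic: the function names are hypothetical, the case-insensitive substring match is an assumption, and treating `rating` as a minimum threshold is one possible reading of the parameter.

```python
import re


def build_search_query(search_string=None):
    """Case-insensitive substring match on productName; empty input matches everything."""
    if not search_string:
        return {}
    # Escape the input so user text is matched literally, not as a regex.
    return {"productName": {"$regex": re.escape(search_string), "$options": "i"}}


def build_filter_query(rating=None, category=None):
    """Combine the optional rating and category filters into one query document."""
    query = {}
    if rating is not None:
        query["rating"] = {"$gte": float(rating)}  # assumption: minimum rating
    if category:
        query["category"] = category  # matches when the category array contains this value
    return query


print(build_search_query("workspace"))
print(build_filter_query(rating="4", category="Security"))
```

The resulting dictionaries can be passed as the filter argument of a `find` call, with `limit` applied on the cursor.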
Endpoint: /scrape
Method: GET
Parameters: None
Response:
- Type: JSON
- Content:
  - message: Message indicating scraping completion.
Description: Initiates scraping by starting spiders in separate processes.
Endpoint: /stop_spider/{spider_name}
Method: POST
Parameters:
- spider_name (str): The name of the class of the spider to stop.
Response:
- Type: JSON
- Content:
  - message: Message indicating whether the spider was stopped successfully.
Description: Stops a running spider by its class name.
Endpoint: /run_spider/{spider_name}
Method: GET
Parameters:
- spider_name (str): The name of the class of the spider to be run.
Response:
- Type: JSON
- Content:
  - message: Message indicating whether the spider was started successfully.
Description: Runs a specific spider by name, initiating a new process.
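The endpoints documented above can be exercised from Python with the standard library alone. The client below is a minimal sketch: `build_url` and `get_json` are hypothetical helpers, and the base URL assumes the server is running locally on port 8000 as in the setup steps.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumes the local uvicorn server from the setup steps


def build_url(path, **params):
    """Build an endpoint URL, dropping parameters whose value is None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def get_json(path, **params):
    """Send a GET request and decode the JSON response body."""
    with urlopen(build_url(path, **params), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call get_json(...) instead once the server is running.
    print(build_url("/api/search", collection="filtered_products", searchString="crm", limit=5))
```

For example, `get_json("/api/filter", collection="filtered_products", category="Security")` would query the filter endpoint, provided the server is up.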
We welcome contributions to Discovery Dino from the community. If you are interested in contributing to the project, please check out our GitHub repositories:
Feel free to submit pull requests or open issues to collaborate and improve the project.
For any project-related inquiries or issues, please contact:
- Email: sakthebalan2003@gmail.com
- Email: adithyaskolavi@gmail.com
Our team is available to address any questions, feedback, or concerns you may have regarding Discovery Dino.
Discovery Dino is a powerful tool designed to simplify the process of discovering and onboarding new Generally Available (GA) software products onto the G2 platform. With robust data scraping, processing, and a user-friendly web interface, Discovery Dino empowers users to explore and interact with a comprehensive database of B2B software products.
By leveraging advanced technologies and a scalable architecture, Discovery Dino aims to contribute to informed decision-making in software purchasing and promote the visibility of emerging software solutions.
Explore the world of B2B software discovery with Discovery Dino today!