Artifact Collector

This tool collects data from GitHub repositories and websites and consolidates it into a small number of files. It can also consolidate the collected data to fit a specified token context window, using Ollama with a Llama model.

Prerequisites

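- Python 3.7 or higher
- pip (Python package installer)
- uv (the installation commands below use it to create the virtual environment and install packages)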

Installation

Clone this repository:

```
git clone https://github.com/bacalhau-project/scraper.git
cd scraper
```

Install the required Python packages:

```
uv venv .venv --seed
source .venv/bin/activate
uv pip install -r requirements.txt
```

Install Ollama:
- For Linux (the `-L` flag makes curl follow redirects, so the installer script itself is piped to `sh`):
```
curl -fsSL https://ollama.ai/install.sh | sh
```
- For macOS:
```
brew install ollama
```
- For Windows: Download the installer from Ollama's official website.

Pull the Llama3.1 model using Ollama:
```
ollama pull llama3.1:8b
```

Usage

The script provides several options:

Download data:
```
python main.py --download
```
Consolidate data into multiple files (max 5MB each):
```
python main.py --consolidate
```
Consolidate data to a specific token context size using Ollama:
```
python main.py --context-consolidate 2048 --model llama3.1:8b
```
You can adjust the context size (e.g., 2048) and the model name as needed.
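To make the context size concrete, here is a minimal sketch of how a context window can be passed to a local Ollama server over its REST API. It is not taken from `main.py`. It assumes Ollama is listening on its default port (11434), and the function names and prompt wording are hypothetical.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumption: default port is in use).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(text, model="llama3.1:8b", context_size=2048):
    """Build a non-streaming generate request with an explicit context window."""
    return {
        "model": model,
        # Hypothetical prompt wording, for illustration only.
        "prompt": "Condense the following material, keeping key technical details:\n\n" + text,
        "stream": False,
        # num_ctx sets the context window size, in tokens, for this request.
        "options": {"num_ctx": context_size},
    }


def consolidate(text, model="llama3.1:8b", context_size=2048):
    """Send the request to a running Ollama server and return the model's reply."""
    payload = json.dumps(build_request(text, model, context_size)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `consolidate(some_text, context_size=4096)` would request a larger window. Input that exceeds the window is likely to be truncated by the model, so long inputs would need to be split first.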

Perform all operations:
```
python main.py --download --consolidate --context-consolidate 2048
```
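The size cap used by `--consolidate` can be pictured as a simple greedy grouping. The sketch below is an illustration of that idea and not the project's actual code. It assumes "5MB" means 5 MiB, and the helper names `sizes_in` and `plan_bundles` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

MAX_BYTES = 5 * 1024 * 1024  # assumption: the 5MB limit is 5 MiB


def sizes_in(directory: str) -> Dict[str, int]:
    """Map every file under directory to its size in bytes, in sorted order."""
    return {
        str(p): p.stat().st_size
        for p in sorted(Path(directory).rglob("*"))
        if p.is_file()
    }


def plan_bundles(sizes: Dict[str, int], max_bytes: int = MAX_BYTES) -> List[List[str]]:
    """Group file names into bundles whose total size stays within max_bytes.

    A single file larger than max_bytes ends up in a bundle of its own.
    """
    bundles: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sizes.items():
        # Start a new bundle when adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles
```

For example, `plan_bundles(sizes_in("output"))` would return lists of file paths, one list per consolidated file, where `"output"` stands in for whatever output directory is configured.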

Configuration

Edit the config.json file to specify:

- Output directory
- GitHub repositories to clone/update
- Websites to scrape
- Maximum depth for web crawling
- Maximum pages per site
- Number of worker threads
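As an illustration of how these settings might be read, the following sketch loads a JSON config file and fills in defaults. The key names and default values are hypothetical and are not taken from the project. Check the `config.json` shipped with the repository for the real ones.

```python
import json
from pathlib import Path

# Hypothetical key names and defaults, one per setting listed above.
DEFAULTS = {
    "output_dir": "output",
    "github_repos": [],
    "websites": [],
    "max_depth": 2,
    "max_pages_per_site": 100,
    "num_workers": 4,
}


def load_config(path="config.json"):
    """Read the config file, fill in defaults, and sanity-check numeric limits."""
    config = dict(DEFAULTS)
    config.update(json.loads(Path(path).read_text(encoding="utf-8")))
    for key in ("max_depth", "max_pages_per_site", "num_workers"):
        if not isinstance(config[key], int) or config[key] < 0:
            raise ValueError(
                "{} must be a non-negative integer, got {!r}".format(key, config[key])
            )
    return config
```

Validating the numeric limits up front means a typo such as a negative crawl depth fails immediately, before any download starts.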

Troubleshooting

If you encounter any issues with Ollama or the Llama model:

Ensure Ollama is running:
```
ollama serve
```
Check available models:
```
ollama list
```
If the Llama3.1 model is missing, pull it again:
```
ollama pull llama3.1:8b
```
For more detailed Ollama usage, refer to the Ollama documentation.
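The checks above can also be scripted. This is a minimal sketch that asks a local Ollama server for its installed models through the `/api/tags` endpoint. It assumes the server is on the default port (11434), and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(tags_json):
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]


def installed_models(base_url="http://localhost:11434"):
    """Return local model names, or raise RuntimeError if the server is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as exc:
        raise RuntimeError(
            "Ollama does not appear to be running; try `ollama serve`"
        ) from exc
```

If `"llama3.1:8b"` is not in the list returned by `installed_models()`, the model has not been pulled yet.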

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
