An API designed for code completion and fine-tuning of open-source large language models on internal codebases and documents.
- Code Completion API: Seamlessly integrate advanced code suggestions into your development process.
- Custom Fine-tuning: Personalize models to your company's codebase and internal knowledge, including support for documents and PDFs.
- Fine-tuning Techniques: Supports Standard, LoRA, and QLoRA fine-tuning methods.
- Multi-user Support: Run multiple users with different models on a shared server.
- Retrieval-Augmented Generation (RAG): Experimental feature enabling context-aware generation.
We provide instructions for running the API with and without Docker. Follow either the Without Docker or With Docker section, and then follow the instructions in the Testing the API section to get started.
- Install dependencies: Ensure you have Python 3.9+ and pip installed. Then run:

  `pip install -r requirements.txt`

- Install CUDA (recommended): If you want to use GPU acceleration, make sure you have CUDA installed. The version of flash attention that we use is only compatible with CUDA 12.2 and 11.8. You can instead build it from source if you want to use a different version of CUDA, but this takes a lot longer and is more work.

- Install Flash Attention (recommended): Installing flash attention will significantly improve performance and is highly recommended:

  `MAX_JOBS=4 pip install flash-attn==2.5.8 --no-build-isolation`

- Set OpenAI API Key (optional): If you want to use SFT with rejection sampling or RAG, you need to set the OPENAI_API_KEY environment variable:

  `export OPENAI_API_KEY=your_api_key_here`

- Start the API: Navigate to the `src` directory and run `python main.py`.
  - This starts the server on `localhost:8000`.
  - Uses the `deepseek-ai/deepseek-coder-1.3b-base` model by default.
  - Add a `--config-name=cpu_config` flag to run on CPU instead of GPU (extremely slow).
You can also use the provided `Dockerfile` to run the API:
- Build the Docker image:

  `docker build -t intract_api .`

- Set OpenAI API Key (optional): If you want to use SFT with rejection sampling or RAG, you need to set the OPENAI_API_KEY environment variable:

  `export OPENAI_API_KEY=your_api_key_here`

- Start the Docker container:

  `docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY -it --rm --name intract_api intract_api --config-name=cpu_config`

  - Binds port `8000` on the host to port `8000` on the container.
  - Removes the container when it stops.
  - Uses the `deepseek-ai/deepseek-coder-1.3b-base` model by default.

- Enable GPU Acceleration (recommended): To use GPU acceleration, add the `--gpus all` flag when starting the container, and remove the `--config-name=cpu_config` flag to revert to the default, GPU-compatible config:

  `docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY --gpus all -it --rm --name intract_api intract_api`
Once the server is running (either with or without Docker), you can test the API:
- Open a web browser and navigate to `http://localhost:8000/docs` to access the Swagger UI, where you can explore and interact with the available API endpoints.
- Complete the following steps to set up and authorize your session. This is required for using the other endpoints:
  - Navigate to `localhost:8000/register` to create an account (data will only be stored locally).
  - Return to the Swagger UI at `localhost:8000/docs`.
  - Click the "Authorize" button on the top right of the Swagger UI to authorize your session.
- You can now test any of the endpoints through the Swagger UI (or programmatically, as sketched below), such as:
  - `/generate` to get a code completion
  - `/finetune/project` to start a fine-tuning process
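If you prefer to exercise the API from a script rather than the Swagger UI, the sketch below shows roughly what a completion request could look like. It is illustrative only: the `/token` login endpoint, the request/response field names, and the bearer-token flow are assumptions, so check the schemas in the Swagger UI before using it.

```python
# Hypothetical sketch of calling the API programmatically; endpoint payloads,
# field names, and the auth flow are assumptions -- verify them at /docs.
import requests

BASE_URL = "http://localhost:8000"

# Log in with the account created at /register (endpoint and fields assumed).
token_resp = requests.post(
    f"{BASE_URL}/token",
    data={"username": "alice", "password": "secret"},
)
token = token_resp.json()["access_token"]

# Request a code completion from /generate (payload schema assumed).
resp = requests.post(
    f"{BASE_URL}/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={"prompt": "def fibonacci(n):"},
)
print(resp.json())
```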
The model's behavior and training parameters can be customized by modifying the `src/conf/config.yaml` file. Key configuration options include the following (a sample snippet follows the list):
- `model_name`: Set the model to use (default: deepseek-ai/deepseek-coder-1.3b-base)
- `context_length`: Set the context length for the model (default: 512)
- `device`: Choose the device to run the model on (default: cuda)
- `use_flash_attention`: Enable or disable flash attention (default: True)
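For orientation, here is a sketch of how these options might appear in `src/conf/config.yaml`. Only the option names and defaults above come from the project; the flat layout is an assumption, so defer to the actual file.

```yaml
# Illustrative sketch only; the real src/conf/config.yaml may nest or group
# these options differently.
model_name: deepseek-ai/deepseek-coder-1.3b-base
context_length: 512
device: cuda              # the cpu_config runs without a GPU instead
use_flash_attention: true
```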
You can switch between different fine-tuning methods by adjusting the following parameters (an example snippet follows the list):
- Standard fine-tuning: Set `model_type: standard` in the configuration.
- LoRA: Set `model_type: lora` and adjust these parameters:
  - `lora_r`: Rank of the LoRA update matrices (default: 64)
  - `lora_alpha`: LoRA scaling factor (default: 16)
  - `lora_dropout`: Dropout probability for LoRA layers (default: 0.01)
- QLoRA: Set `model_type: qlora` and adjust these parameters:
  - `bits`: Quantization bits (default: 4)
  - `double_quant`: Enable double quantization (default: True)
  - `quant_type`: Quantization data type (default: nf4)
  - `optim`: Optimizer for QLoRA (default: paged_adamw_32bit)
  - `gradient_checkpointing`: Enable gradient checkpointing (default: True)
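As a rough illustration, selecting QLoRA could look like the snippet below. The parameter names and defaults come from the list above; the flat layout and placement within the config are assumptions.

```yaml
# Illustrative QLoRA selection; check src/conf/config.yaml for the real layout.
model_type: qlora
bits: 4
double_quant: true
quant_type: nf4
optim: paged_adamw_32bit
gradient_checkpointing: true
# For LoRA instead, set model_type: lora and tune lora_r, lora_alpha, lora_dropout.
```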
Additional generation and training parameters include:
- `max_gen_length`: Maximum length of generated code (default: 128)
- `max_revision_steps`: Maximum number of code revision steps (default: 2)
- `use_ntp` and `use_fim`: Enable/disable specific training techniques
- `train_on_code`, `train_on_docs`, etc.: Configure what to train on
For a complete list of configurable parameters, refer to the `src/conf/config.yaml` file in the project repository.
Explore the full API documentation by visiting `http://localhost:8000/docs` after starting the server.
Our fine-tuning process is versatile and powerful, supporting multiple approaches:
- Self-supervised learning
  - Next Token Prediction (NTP)
  - Fill-in-the-Middle (FIM)
- Supervised fine-tuning (SFT) with rejection sampling
- Standard fine-tuning
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
Standard fine-tuning will provide the best results, but it is also the most expensive. LoRA and QLoRA use less memory, but may not be as accurate, and were slower in our experiments.
We employ two main techniques (illustrated in the sketch after this list):
- Next Token Prediction (NTP): Trains the model to predict the next token in a sequence.
- Fill-in-the-Middle (FIM): Masks out a portion of the input and trains the model to reconstruct it.
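As a rough illustration of the difference between the two objectives, the snippet below formats one training example for each. The FIM sentinel strings are placeholders, not the exact special tokens the project or the DeepSeek tokenizer uses.

```python
# Illustrative only: the FIM sentinel strings are placeholders, not the
# project's real special tokens.
def make_ntp_example(code: str) -> str:
    # Next Token Prediction: the model learns to continue the raw text as-is.
    return code

def make_fim_example(code: str, start: int, end: int) -> str:
    # Fill-in-the-Middle: mask out code[start:end] and train the model to
    # reconstruct it given the surrounding prefix and suffix.
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

snippet = "def add(a, b):\n    return a + b\n"
print(make_ntp_example(snippet))
print(make_fim_example(snippet, start=19, end=31))
```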
These methods can be applied to various data sources:
- User's codebase
- Documentation text
- Code snippets extracted from documentation
- Auto-generated problems and solutions
- External documents (text files and PDFs)
The fine-tuning process is highly configurable through the config file (a sample snippet follows the list):
- Choose data sources: `train_on_code`, `train_on_docs`, `train_on_doc_code`, `train_on_practice_problems`, `train_on_verified_solutions`, `train_on_documents`
- Select training methods: `use_ntp`, `use_fim`
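A possible selection of these flags is sketched below; the true/false values are placeholders rather than the project's defaults.

```yaml
# Illustrative flag selection; values are placeholders, not project defaults.
train_on_code: true
train_on_docs: true
train_on_doc_code: true
train_on_practice_problems: false
train_on_verified_solutions: false
train_on_documents: false
use_ntp: true
use_fim: true
```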
This approach generates and solves synthetic problems to improve the model's performance. The full process involves the following steps (a schematic code sketch is shown below):
- Problem statements are generated automatically
- The model produces multiple solutions for each problem
- Solutions are executed and evaluated automatically
- This process is repeated until a solution is found or the maximum number of revisions is reached
- The model is trained on the solved problems and their solutions
Key features:
- Allows iterative improvement without human intervention
- Automatically assesses solution correctness
- Creates a feedback loop for continual refinement
This method leverages the fact that judging solution correctness is often easier than generating correct solutions from scratch, enabling the model to enhance its problem-solving skills over multiple iterations.
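To make the loop concrete, here is a schematic sketch of one rejection-sampling round. It is not the project's actual implementation (that lives in the `interactive/` training code); `generate_problems`, `generate_solutions`, `run_and_check`, and `fine_tune_on` are placeholder names for the steps described above.

```python
# Schematic sketch of SFT with rejection sampling. All helper functions are
# hypothetical placeholders, not the project's real API.
def rejection_sampling_round(model, num_problems, num_samples, max_revisions):
    verified_pairs = []
    problems = generate_problems(model, num_problems)   # auto-generate problem statements
    for problem in problems:
        solved = None
        for _ in range(max_revisions):                   # iterate up to the revision limit
            candidates = generate_solutions(model, problem, num_samples)
            passing = [s for s in candidates if run_and_check(problem, s)]
            if passing:                                  # keep the first solution that passes
                solved = passing[0]
                break
        if solved is not None:
            verified_pairs.append((problem, solved))
    fine_tune_on(model, verified_pairs)                  # train on verified problem/solution pairs
    return model
```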
- `main.py` - Entry point to running the server.
- `modeling.py` - Handles the construction, loading, and management of language models and tokenizers. It includes:
  - A `ModelLoader` class for creating models with various configurations.
  - A `ModelProvider` singleton class that manages model instances for multiple users, allowing retrieval of user-specific models.
  - Utility functions and classes to support model operations and tokenization.
- `config_handler.py` - Contains a singleton class `ConfigProvider` that manages the configuration for the server. It provides methods to initialize the configuration, retrieve the configuration instance, and access the configuration data.
- `database.py` - Manages database operations and connections through a `DatabaseProvider` singleton class, including table creation and connection handling.
- `users.py` - Manages user sessions, authentication, and token handling through a `SessionTracker` singleton and various utility functions. The `SessionTracker` maintains active user sessions, handles user eviction based on inactivity, and manages user-specific resources like models and vector stores.
- `rag.py` - Implements the Retrieval-Augmented Generation (RAG) functionality through a `VectorStoreProvider` singleton class. It manages vector stores for each user, handles document insertion, and provides methods for context retrieval during inference.
- `documents.py` - Handles document processing and conversion, including PDF to text conversion using different libraries (Nougat and PyMuPDF). It also provides caching mechanisms for processed documents and utility functions for handling various document formats.
- `routers/` - Contains the API endpoints for different functionalities:
  - `generator.py` - Handles text generation requests and responses.
  - `fine_tuner.py` - Manages the fine-tuning process for models based on user input.
  - `auth.py` - Handles user authentication, registration, and token management.
- `static/` - Contains static files for authentication and login that are no longer used.
- `training/` - Contains files related to model training and fine-tuning:
  - `data_formatting.py` - Handles data preparation and formatting for training, including functions for tokenization and dataset creation.
  - `finetune.py` - Implements the fine-tuning process, including dataset processing, model configuration, and training loop management.
  - `trainer.py` - Extends the Hugging Face Trainer class to provide custom training functionality. It includes modifications for continual learning, custom evaluation, and memory optimizations.
  - `interactive/` - Contains files for multi-step SFT with rejection sampling. This folder includes implementations for generating and evaluating solutions to programming problems and handling multi-step training processes. It supports features like automated problem generation, solution verification, and iterative improvement of model responses.
- `crawler/` - Contains files for web scraping and document extraction. The crawler functionality uses libraries like Scrapy and BeautifulSoup to extract content from web pages and documentation sites, with explicit support for both GitHub repositories and web-based documentation. It includes utilities for finding documentation URLs and processing HTML content (a minimal extraction sketch is shown below).
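For illustration, the snippet below shows the kind of HTML text extraction such a crawler performs with BeautifulSoup. It is a hypothetical sketch, not the project's actual crawler code, and the URL is only an example.

```python
# Minimal sketch of extracting visible text from a documentation page;
# not the project's actual crawler implementation.
import requests
from bs4 import BeautifulSoup

def extract_page_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts and styles, then return the page's visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

print(extract_page_text("https://docs.python.org/3/")[:500])
```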
If you want to contribute, we assume that you have read the rest of this document.
We are no longer actively working on this project, and we don't plan to make any updates. If you still want to contribute knowing that, you are welcome to start by opening an Issue to ask whether it is something we would approve. If you do make a pull request, please do your best to follow the Google Python Style Guide.
This project is licensed under the MIT License. See the LICENSE file for more information.