Adds local LLM inference for categorization via Ollama #2

Merged
merged 1 commit into from
Jun 13, 2024
11 changes: 8 additions & 3 deletions Ena.py
@@ -14,8 +14,13 @@
@click.option("-d", "--directory", "statements_dir", type=click.Path(exists=True),
default=STATEMENTS_PATH, help="Directory where statements are. Defaults to Ena/statements")
@click.option("-v", "--verbose", is_flag=True, default=False,
help="If true, logs at INFO level. Else, defaults to WARNING level.")
def cli(statements_dir: str, verbose: bool):
help="If set, logs at INFO level. Else, defaults to WARNING level.")
@click.option("-m", "--manual-review", is_flag=True, default=False,
help="""
If set and using LLM to infer categories, any transactions that are categorized as
Expense (catch-all) will be presented for manual review. Defaults to False.
""")
def cli(statements_dir: str, verbose: bool, manual_review: bool):
"""
Parses FI Statements into CSVs to be used for book-keeping purposes. Officially
supported use-cases are Dime (iOS) and Google Sheets.
@@ -28,7 +33,7 @@ def cli(statements_dir: str, verbose: bool):
except FileNotFoundError:
write_preferences()

ena = Ena(statements_dir)
ena = Ena(statements_dir, manual_review)
ena.parse_statements()


3 changes: 2 additions & 1 deletion Preferences.py
@@ -1,9 +1,10 @@
#!/usr/bin/env python3

import os
import click
import logging

import click

from typing import Dict
from configparser import ConfigParser

10 changes: 5 additions & 5 deletions README.md
@@ -19,7 +19,9 @@ Ena was built as a tool to better house-keep finances, rather than simply paying

The core feature of Ena is to take Credit Card bill statements from major Financial Institutes in Canada and turn them into CSV files to be used with Google Sheets or Dime (iOS) to better visualize your monthly spending.

It's currently still a WIP, but hopefully in the soonish future, local LLM can be integrated into Ena to automatically categorize expenses into pre-set categories (defined in `src/model:Category`) so users have a better idea of what they're spending on.
The second core feature is Ena's ability to query a local custom LLM to categorize transactions into pre-set categories (defined in `src/model:Category`) so users have a better idea of what they're spending on. Currently, the local LLM is based on Meta's Llama3, and its model is defined at `src/llm/Modelfile`. It has been instructed to only categorize a transaction if it's at least 90% confident (0.9); otherwise the transaction falls into the catch-all Expense category. While not perfect, it serves as a nice starting point for most users.

An option is also included, if preferred, to manually categorize transactions that have been categorized into the catch-all category of Expense. If this option is enabled (the `-m` / `--manual-review` flag of `Ena.py`), every time a transaction is categorized into the generic category, the console will prompt the user to type in a valid category before moving on to the next transaction. This gives some control back to the user. At this point, the LLM is not trained on these corrections, as that is a bit out of my scope.
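
The 90% confidence rule above lives in the model's system prompt. As a rough illustration only, the same rule could also be checked on the client side once the JSON reply arrives. The helper name `pick_category`, the `threshold` parameter and the hard-coded category set are assumptions made for this sketch; the PR itself only logs the confidence value.

```python
import json

# Category labels as listed in src/llm/Modelfile; "Expense" is the catch-all.
CATCH_ALL = "Expense"
KNOWN = {"Recurring", "Household", "Food", "Fashion", "Games", "Travel", "Expense"}


def pick_category(reply: str, threshold: float = 90.0) -> str:
    """Hypothetical helper: return the model's category only when the reply is
    valid JSON, names a known category and meets the confidence threshold."""
    try:
        data = json.loads(reply)
        category = data["category"]
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or a missing / non-numeric field.
        return CATCH_ALL

    if isinstance(category, str) and category in KNOWN and confidence >= threshold:
        return category
    return CATCH_ALL


print(pick_category('{"category": "Food", "confidence": 95}'))
print(pick_category('{"category": "Food", "confidence": 40}'))
```

Falling back to the catch-all on any malformed reply keeps the caller simple: it always gets back a label it can work with.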

## Usage
Two scripts are provided, the first of which, `Ena.py`, is the one to use to process statement PDFs into CSVs. The second one, `Preferences.py`, is used to configure `preferences.ini` for individual users to determine Ena's overall behaviour.
@@ -68,7 +70,7 @@ This feature is represented by the `csv_order` flag, and there are three options
`DEFAULT` and `SIMPLE` are quite self-explanatory; `DIME` is the order in which the iOS app expects and prompts the user for the column names, and it's easier to simply click next 4 times in a row without moving anything around, a real lazy approach to it if you will.

##### Categories
By default, Ena will only categorize transactions into two categories - expenses and income. This behaviour can be configured and will be expanded at a later date to integrate a local Ollama instance to use a custom LLM model to categorize transactions. The idea is to prompt an open-sourced model, like Llama3 or Phi3 with all possible categories along with some examples, so that it can categorize expenses to better help visualize their spendings into accurate categories without the user manually having to do so.
By default, Ena will only categorize transactions into two categories - expenses and income. This behaviour can be configured to instead use a local custom LLM via Ollama to categorize transactions. This will allow Ena to automatically categorize transactions without manual user input.

This feature is represented by the `use_llm` flag, and by default, is set to False (no).
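
For readers unfamiliar with INI-style flags, here is a minimal sketch of how a boolean such as `use_llm` can be read with the standard library's `configparser`. The section name `preferences`, the sample file contents and the helper `read_use_llm` are assumptions for illustration; the actual layout of `preferences.ini` is not shown in this diff.

```python
from configparser import ConfigParser

# Hypothetical file contents. Only the key names csv_order and use_llm come
# from the README; the [preferences] section name is an assumption.
SAMPLE_INI = """
[preferences]
csv_order = DIME
use_llm = yes
"""


def read_use_llm(ini_text: str, section: str = "preferences") -> bool:
    """Return the use_llm flag, defaulting to False when it is absent."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    # getboolean understands yes/no, true/false, on/off and 1/0.
    # The fallback mirrors the documented default of False (no).
    return parser.getboolean(section, "use_llm", fallback=False)


print(read_use_llm(SAMPLE_INI))
```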

@@ -132,9 +134,7 @@ The following Financial Insitutes are a WIP as I do not have access to them atm.
* BNS
* TD

Integrating Ollama and LLM in general is a continued WIP as I'm in the process of learning and creating a custom model image that is tuned to the categories I've set.

Another feature I have in mind has to do with reocurring expenses, such as Rent, Mortgage, Utilities, etc. In Canada, some of those things are not payable via Credit Card, and thus cannot be tracked by Ena. However, they should be around the same every month, and so, can be included via another option for `Ena.py`. Current approach is to include another file to be looked at by Ena, which include utilities.
Another feature I have in mind has to do with recurring expenses, such as Rent, Mortgage, Utilities, etc. In Canada, some of those things are not payable via Credit Card, and thus cannot be tracked by Ena. However, they should be around the same every month, and so, can be included via another option for `Ena.py`. The current approach is to include another file to be looked at by Ena, which includes utilities.

As the basis of this fork comes from Teller (noted above), some of the code isn't exactly what I need for my purposes. Thus, a refactor of the processor code is planned for the future.

1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
click>=8.1.2
ollama>=0.2.0
pdfplumber>=0.11.0
38 changes: 29 additions & 9 deletions src/api.py
@@ -2,18 +2,20 @@
import re
import csv
import logging
import pdfplumber

from Preferences import ROOT_PATH, get_preferences
import pdfplumber

from typing import List
from datetime import datetime
from collections import defaultdict

from src.llm.api import LLM
from Preferences import ROOT_PATH, get_preferences
from src.model import Category, Orders, Transaction, FIFactory, CSV_ORDERS


class Ena:
def __init__(self, statements_dir: str):
def __init__(self, statements_dir: str, manual_review: bool):
"""
Does two things:
1. Globs available statements and maps FI Name to corresponding statements'
@@ -24,6 +26,8 @@ def __init__(self, statements_dir: str):
Args:
statements_dir (str): Directory where statements are stored.
"""
self.llm = LLM()
self.manual_review = manual_review
self.preferences = get_preferences()
self.statements = defaultdict(list)
for item in os.listdir(statements_dir):
@@ -51,11 +55,13 @@ def parse_statements(self):
csv_order = CSV_ORDERS[self.preferences.csv_order]
writer = csv.DictWriter(csv_file, csv_order)
writer.writeheader()
for txn in csv_data:
for transaction in csv_data:
if csv_order == Orders.SIMPLE:
writer.writerow(txn.simple_repr())
writer.writerow(transaction.simple_repr())
else:
writer.writerow(txn.row_repr())
writer.writerow(transaction.row_repr())

print(f"CSV written to {file_path}")

def _parse_statement(self, processor: FIFactory.type_FI, statement_path: str) -> List[Transaction]:
"""
@@ -94,14 +100,14 @@ def _parse_statement(self, processor: FIFactory.type_FI, statement_path: str) ->
opening_balance = processor.get_opening_balance(text)
closing_balance = processor.get_closing_balance(text)

print(text)
logging.info(text)

# debugging transaction mapping - all 3 regex in transaction have to find a result in order for it to be considered a "match"
year_end = False
transaction_regex = processor.get_transaction_regex()
for match in re.finditer(transaction_regex, text, re.MULTILINE):
match_dict = match.groupdict()
print(match_dict)
logging.info(match_dict)

date = match_dict["dates"].replace("/", " ") # change format to standard: 03/13 -> 03 13
date = date.split(" ")[0:2] # Aug. 10 Aug. 13 -> ["Aug.", "10"]
@@ -148,7 +154,21 @@ def _parse_statement(self, processor: FIFactory.type_FI, statement_path: str) ->
else:
if self.preferences.use_llm:
# Get category via inference
...
llm_category = self.llm.categorize_transaction(transaction)
if llm_category == Category.EXPENSE and self.manual_review:
# Get human category from input
all_categories = [c.value for c in Category]
print(f"Transaction: [{transaction}] needs a manual review. What category is it?")
print(f"List of possible categories are: {all_categories}")
human_category = input("Please type a new Category (must be exact match): ").strip()

# ensure it's one of the options
while human_category not in all_categories:
human_category = input("Input category did not match possible categories. Please try again (must be exact match): ").strip()

transaction.category = Category[human_category.upper()]
else:
transaction.category = llm_category
else:
transaction.category = Category.EXPENSE
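
The manual-review prompt in this hunk can be pictured as a small standalone loop. The sketch below is not the PR's code: it uses a trimmed stand-in for `Category`, and the `read_line` parameter is a hypothetical addition that lets the loop be exercised without a real console.

```python
from enum import Enum
from typing import Callable


class Category(Enum):
    # Trimmed stand-in for src/model.py:Category, for illustration only.
    FOOD = "Food"
    TRAVEL = "Travel"
    EXPENSE = "Expense"


def ask_for_category(read_line: Callable[[str], str] = input) -> Category:
    """Keep prompting until the reply exactly matches a category value."""
    valid = {c.value: c for c in Category}
    answer = read_line(f"Please type a new Category {sorted(valid)}: ").strip()
    while answer not in valid:
        # Strip on every attempt so stray whitespace never causes a mismatch.
        answer = read_line("No match, please try again (must be exact match): ").strip()
    return valid[answer]


# Scripted replies stand in for a user typing at the console.
replies = iter(["food", " Travel "])
print(ask_for_category(lambda _prompt: next(replies)))
```

Mapping values straight to enum members avoids a second lookup by upper-cased name, at the cost of building a small dictionary on each call.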

28 changes: 28 additions & 0 deletions src/llm/Modelfile
@@ -0,0 +1,28 @@
FROM llama3

# set the system message
SYSTEM """
You are Ena, the super accountant. Your job is to categorize expenses into categories,
defined below.

The available categories are:
* Recurring
* Household
* Food
* Fashion
* Games
* Travel
* Expense

Recurring expenses are those associated with Rent, Mortgage, Utilities, etc.

Household expenses are any purchases that have to do with the household, including but not limited to groceries, cleaning supplies, furniture, etc.

Food expenses are explicitly going out to dine at a restaurant OR ordering food via delivery apps, like UberEats, DoorDash, etc.

Fashion expenses would be anything clothing related, including but not limited to shoes and sporting goods.

Expense is the catch-all category if you are not at least 90% confident with the categorization.

You should answer only in JSON, with keys being category and confidence, where confidence is represented in a number between 0 and 100. Please do not yap, thank you.
"""
54 changes: 54 additions & 0 deletions src/llm/api.py
@@ -0,0 +1,54 @@
import json
import logging

import ollama

from json.decoder import JSONDecodeError

from src.model import Category, Transaction


class LLM:
    def __init__(self):
        """
        Ensures that the ryanliu6/ena model is always on disk
        """
        ollama.pull("ryanliu6/ena")

    def categorize_transaction(self, transaction: Transaction) -> Category:
        """
        Categorizes a given transaction using LLM via local ollama.

        If the LLM does not return a valid JSON object, then we return the catch-all
        expense (Category.EXPENSE) category.

        Similarly, if the LLM returns a category that is not specified, we will also return
        the same catch-all expense (Category.EXPENSE) category.

        Note: First usage of this function might take a long time, due to ollama needing to
        load the model into memory. If possible, load the model beforehand. By default, ollama
        keeps a model in memory for 5 minutes before unloading.

        Args:
            transaction (Transaction): Transaction to be categorized, assumed to be an expense.

        Returns:
            Category: Category of this Transaction, generated by an LLM
        """
        # The result of ollama.generate is a dictionary with several fields; we only need "response".
        # Prompt with this transaction's note (the instance attribute, not the class attribute).
        llm_result = ollama.generate(model="ryanliu6/ena", prompt=transaction.note)["response"]

        # LLM response should be in JSON, where keys are category and confidence
        try:
            json_result = json.loads(llm_result)
        except JSONDecodeError:
            return Category.EXPENSE

        # Map the category generated by the LLM onto the Category enum. Members are looked up
        # by name (e.g. FOOD), while the model replies with values (e.g. Food), hence upper().
        try:
            llm_category = Category[json_result["category"].upper()]
            logging.info(f"Transaction [{transaction}] has been categorized as {llm_category} with a {json_result['confidence']}% confidence.")
        except (KeyError, TypeError, AttributeError):
            # Unknown category, missing key, or a reply that is not a JSON object of strings
            llm_category = Category.EXPENSE

        return llm_category
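
The enum lookup in `categorize_transaction` depends on whether the model's reply matches a member name or a member value, which is easy to mix up. The sketch below shows the difference with a trimmed stand-in for `Category`; the helper `from_label` is hypothetical and not part of the PR.

```python
from enum import Enum


class Category(Enum):
    # Trimmed stand-in for src/model.py:Category, for illustration only.
    FOOD = "Food"
    EXPENSE = "Expense"


def from_label(label: str) -> Category:
    """Hypothetical helper: resolve a label such as "Food" by enum value,
    falling back to the catch-all when it is unknown."""
    try:
        return Category(label)  # call syntax looks up by value
    except ValueError:
        return Category.EXPENSE


# Subscript syntax looks up by member name, so it needs "FOOD", not "Food".
print(Category["FOOD"])
print(from_label("Food"))
print(from_label("Pets"))
```

Note that a failed lookup by value raises `ValueError`, while a failed lookup by name raises `KeyError`, so the `except` clause has to match whichever form is used.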
10 changes: 5 additions & 5 deletions src/model.py
@@ -31,19 +31,16 @@ class Preferences:

class Category(Enum):
RECURRING = "Recurring"
GROCERIES = "Groceries"
HOUSEHOLD = "Household"
FOOD = "Food"
FUN = "Fun"
FASHION = "Fashion"
GAMES = "Games"
TRAVEL = "Travel"
GIFTS = "Gifts"
# Serves as generic expense
EXPENSE = "Expense"
# Used to differentiate payments / refunds
# / cashback rewards from expenses
INCOME = "Income"
# Serves as generic expense
EXPENSE = "Expense"


@dataclass
@@ -57,6 +54,9 @@ def __eq__(self, other):
return isinstance(other, Transaction) and self.date == other.date and self.amount == other.amount \
and self.note == other.note and self.category == other.category

def __repr__(self):
return f"Date: {self.date} | Amount: {self.amount} | Note: {self.note} | Category: {self.category.value}"

def row_repr(self) -> Dict:
"""
Returns the Row Representation of a Transaction.