
Introduction

This project is based on the paper Company Similarity using Large Language Models by D. Vamvourellis et al. (2023) from BlackRock.

The original paper tested both open-source and proprietary LLMs; due to cost, I only reproduced the open-source part. This project generates embeddings for stocks in the Russell 3000 universe using an LLM fine-tuned on SEC 10-K data.

If you find this repo useful, please star it to help others find it! Stars are also the only way I can tell how useful it is to the open-source community.

Embeddings

Each company in the universe is assigned to an industry by GICS. The idea is to use these industry classifications as labels and the company description in 10-K Item 1 as input to fine-tune an LLM such as BERT or Sentence-BERT. Because of the model's input length limit, each company description is split into chunks, and the final embedding is the average of the chunk embeddings.
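The chunk-and-average step can be sketched as follows. Here `embed_chunk` is a deterministic stand-in for the fine-tuned BERT encoder, and the word-based chunk size is an assumption (BERT's actual limit is 512 tokens, not words):

```python
import hashlib

import numpy as np

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    """Split a long 10-K description into chunks that fit the model input."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_chunk(chunk: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the fine-tuned encoder: a deterministic pseudo-random
    vector per chunk, so only the averaging logic is demonstrated."""
    seed = int(hashlib.md5(chunk.encode()).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim)

def company_embedding(description: str) -> np.ndarray:
    """Average the per-chunk embeddings into one company-level vector."""
    chunks = chunk_text(description)
    return np.mean([embed_chunk(c) for c in chunks], axis=0)
```

In the actual pipeline, `embed_chunk` would run the fine-tuned model and pool its token outputs; everything else stays the same.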

Infer GICS

By design, GICS assigns each company to exactly one sector and industry based on its primary business. However, a company may be involved in businesses beyond the one specified by GICS. The goal is to infer these other industries by assigning probabilities. To achieve this, we use the embeddings as features and the GICS industry / sector as the label to train a multinomial logistic regression; the predicted probabilities are used as GICS percentages.
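The classifier step can be sketched with scikit-learn; the embeddings and labels below are random stand-ins for the real inputs (dimension, sample count, and number of sectors are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data: 200 company embeddings
# (dimension 16) and one GICS sector label per company (4 sectors here).
X = rng.normal(size=(200, 16))
y = rng.integers(0, 4, size=200)

# Multinomial logistic regression: embeddings as features, GICS as label.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# The predicted class probabilities are read as the company's GICS
# percentages across sectors.
gics_pct = clf.predict_proba(X[:1])[0]
```

Because `predict_proba` returns a proper probability distribution over classes, the inferred percentages for a company always sum to 100%.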

Application

A simple Dash dashboard demonstrates the multiple GICS industries or sectors inferred by the classifier trained in the embedding step. You can select a company by its ticker and view its description and inferred GICS industries / sectors. For example, Disney (DIS) is classified into Entertainment (71%), Diversified Consumer Services (5%), Hotels, Restaurants & Leisure (2%), etc.

I created a SQLite database that stores company descriptions and embeddings. To run the app locally, follow these steps:

  • Clone the repo.
  • Manually download src/data/db/dash.db and place it under src/data/db/, as the file is tracked with Git LFS.
  • Install packages in a virtual environment based on requirements-app.txt.
  • Activate venv.
  • cd to src/viz and run python app.py.
  • Open a browser and use http://127.0.0.1:8050/ to access the dashboard.
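The ticker lookup the dashboard performs against the SQLite database can be sketched as below; the table and column names are hypothetical, not the actual schema of dash.db:

```python
import sqlite3

# In-memory database standing in for src/data/db/dash.db; the schema
# here (table "company", columns "ticker" / "description") is assumed
# for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (ticker TEXT PRIMARY KEY, description TEXT)")
conn.execute("INSERT INTO company VALUES ('DIS', 'A diversified entertainment company.')")

def lookup_description(ticker: str):
    """Fetch the stored 10-K description for a ticker, or None if absent."""
    row = conn.execute(
        "SELECT description FROM company WHERE ticker = ?", (ticker,)
    ).fetchone()
    return row[0] if row else None
```

A parameterized query (`?` placeholder) is used rather than string formatting, so user-supplied tickers cannot inject SQL.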

How to Reproduce the Results

  • Add the following environment variables:
    • SEC_DIR: directory for storing company descriptions as text files.
    • BERT_MODEL_DIR: directory for storing the fine-tuned BERT model.
    • BERT_MODEL_DIR: directory for storing the fine-tuned Sentence-BERT model.
  • Activate venv.
  • SEC data scraping: python src/data/sec_data/fetch_submission.py. If you encounter an error due to an unstable network, simply rerun the script until all the companies are fetched.
  • Fine-tune the BERT model: python src/model/fine_tune/bert_fine_tune.py.
  • Create the local SQLite database: python src/data/db/local_db.py.
