Commit: typo
adhocmaster committed Apr 5, 2024
1 parent 2ce0413 commit d5fe5a0
Showing 2 changed files with 8 additions and 8 deletions.
@@ -1,11 +1,11 @@
# Challenges
-Co-occurance matrices are quite large. A 50k vocabulary will have 2.5B entries requiring 2.5GB. There is no point having a co-occurance matrix instead of a 1GB embedding.
+Co-occurrence matrices are quite large. A 50k vocabulary will have 2.5B entries requiring 2.5GB. There is no point having a co-occurrence matrix instead of a 1GB embedding.
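
A quick back-of-the-envelope check of that figure (a sketch; the bytes-per-entry values are my assumptions, and 2.5 GB implies one byte per count):

```python
vocab_size = 50_000
entries = vocab_size ** 2              # dense co-occurrence matrix: 2,500,000,000 entries
for bytes_per_entry in (1, 4):         # 1 B: tiny counts (the implied size); 4 B: int32/float32
    gb = entries * bytes_per_entry / 1e9
    print(f"{bytes_per_entry} B/entry -> {gb:.1f} GB")   # prints 2.5 GB and 10.0 GB
```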

## Solution #1 - Session-based

-1. Build session-based co-occurance matrix with available words. e.g.
+1. Build session-based co-occurrence matrix with available words. e.g.
2. Create a session-based vocabulary from the user's history and new contents, e.g. the top 1000 words, with new-content words weighted higher.
-3. build co-occurance matrix for the new vocabulary
+3. build co-occurrence matrix for the new vocabulary
4. Create user and content vectors and apply weights using p(w1, w2). Need to revisit the literature on this. Basically, if a word w1 appears in a vector, we add a count of 1 to it and add p(w2|w1) to w2.
5. Use Jaccard similarity. Need to revisit the similarity research on this. (A sketch of steps 2-5 follows this list.)
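
A minimal sketch of steps 2-5 under my own naming. The p(w2|w1) weighting follows the description in step 4, with conditional probabilities estimated from session co-occurrence counts; the open literature questions from steps 4-5 remain open:

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(sessions):
    """Steps 2-3: pair and word counts over the per-session word lists."""
    pair_counts, word_counts = Counter(), Counter()
    for words in sessions:
        unique = set(words)
        word_counts.update(unique)
        for w1, w2 in permutations(unique, 2):   # ordered pairs, so p(w2|w1) != p(w1|w2)
            pair_counts[(w1, w2)] += 1
    return pair_counts, word_counts

def vectorize(words, vocab, pair_counts, word_counts):
    """Step 4: add 1 for each present word, plus p(w2|w1) for its co-occurring words."""
    vec = dict.fromkeys(vocab, 0.0)
    for w1 in set(words) & set(vocab):
        vec[w1] += 1.0
        if not word_counts[w1]:
            continue                             # word never seen in any session
        for w2 in vocab:
            vec[w2] += pair_counts[(w1, w2)] / word_counts[w1]   # p(w2|w1) estimate
    return vec

def jaccard(u, v):
    """Step 5: Jaccard similarity over the words each vector activates."""
    a = {w for w, x in u.items() if x > 0}
    b = {w for w, x in v.items() if x > 0}
    return len(a & b) / len(a | b) if a | b else 0.0
```

Here `sessions`, `vocab`, and the function names are placeholders for whatever the session-vocabulary pass in step 2 actually produces.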

10 changes: 5 additions & 5 deletions documentation/ranking.md
@@ -1,21 +1,21 @@
# Ranking

-## Approach 1 - No embedding/co-occurance
+## Approach 1 - No embedding/co-occurrence
Input: (new contents, user history)

1. Build a vocabulary with more weight on new-content words. No stop-words, only word roots.
-2. Build content representation with word occurance. The representation is a vector of word frequencies.
+2. Build content representation with word occurrence. The representation is a vector of word frequencies.
3. Build user representation using the same approach.
4. Calculate similarity with TF-IDF weights and stacking weights. Use Jaccard similarity. Need to revisit the similarity research on this (a sketch follows this list).
5. May be improved by introducing synonyms of important words.
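
A sketch of steps 1-4 with hypothetical helper names. The weighted (generalized) Jaccard similarity, sum of minima over sum of maxima, is one common way to combine TF-IDF weights with Jaccard; the stacking weights are left out since they are still to be defined:

```python
import math
from collections import Counter

def build_vocab(new_contents, history, stop_words, top_k=1000, new_weight=2.0):
    """Step 1: frequency-ranked roots, with new-content words weighted higher."""
    scores = Counter()
    for doc in history:
        scores.update(w for w in doc if w not in stop_words)
    for doc in new_contents:
        for w in doc:
            if w not in stop_words:
                scores[w] += new_weight
    return {w for w, _ in scores.most_common(top_k)}

def idf_weights(docs, vocab):
    """Smoothed inverse document frequency over all documents."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc) if w in vocab)
    return {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}

def represent(doc, vocab, idf):
    """Steps 2-3: TF-IDF-weighted word-frequency vector."""
    tf = Counter(w for w in doc if w in vocab)
    return {w: tf[w] * idf[w] for w in tf}

def weighted_jaccard(u, v):
    """Step 4: generalized Jaccard = sum of minima / sum of maxima."""
    keys = u.keys() | v.keys()
    num = sum(min(u.get(w, 0.0), v.get(w, 0.0)) for w in keys)
    den = sum(max(u.get(w, 0.0), v.get(w, 0.0)) for w in keys)
    return num / den if den else 0.0
```

The user representation would be `represent` applied to the flattened history; step 5's synonym expansion would extend `vocab` before vectorizing.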


-## Approach 2 - session-based co-occurance
-This is a semi-semantic approach. With co-occurance, we can partially find semantic overlap between contents and user history even when they do not share the same words.
+## Approach 2 - session-based co-occurrence
+This is a semi-semantic approach. With co-occurrence, we can partially find semantic overlap between contents and user history even when they do not share the same words.

Input: (new contents, user history)

-[Details](./co-occurance.md)
+[Details](./co-occurrence.md)

## Approach 3 - embedding
Embedding lookup is O(log n). We don't need to store embeddings in the database initially; later we need to find a way to create and cache per-user preference embeddings (one possible shape is sketched below).
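
A sketch of the caching idea only; the class, names, and mean-of-word-vectors construction are my assumptions, since the doc leaves how to build user preference embeddings open:

```python
import numpy as np

class UserEmbeddingCache:
    """Build a user's preference embedding from history on first use, then reuse it."""

    def __init__(self, word_vectors):        # word -> np.ndarray, e.g. pretrained vectors
        self.word_vectors = word_vectors
        self.cache = {}

    def get(self, user_id, history_words):
        if user_id not in self.cache:         # compute lazily, cache afterwards
            vecs = [self.word_vectors[w] for w in history_words if w in self.word_vectors]
            self.cache[user_id] = np.mean(vecs, axis=0) if vecs else None
        return self.cache[user_id]

def rank(contents, user_vec, word_vectors):
    """Score each content (a word list) by cosine similarity to the user embedding."""
    def embed(words):
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros_like(user_vec)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(contents, key=lambda c: cos(embed(c), user_vec), reverse=True)
```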
