TF IDF with Morel #87
Comments
The first problem seems to get solved if I calculate:

    val tfs =
      from doc in docs, term in split doc.text
        yield {doc.name, term}
        group name, term compute tf = count;
    val dfs =
      from doc in docs, term in split doc.text
        yield {doc.name, term}
        group term compute df = count
    val tfidfs =
      from tf in tfs
        join df in dfs on tf.term = df.term
        yield {tf.term, tf.name, tfidf = tf.tf * (n/df.df + 1)};
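For readers without a Morel build at hand, here is a rough Python sketch of the same three steps: count per (document, term), count per term, then join on the term. The sample `docs`, the reading of `n` as the number of documents, and the deduplication before counting `df` are all assumptions of this sketch, not something the query above states. The `log` comes from the formula in the issue, whereas the query above writes `n/df.df + 1`.

```python
import math
from collections import Counter

# Hypothetical sample data; the thread does not show the real `docs`.
docs = [
    {"name": "d1", "text": "a b a"},
    {"name": "d2", "text": "a c"},
]

# One row per (document, term) occurrence, like the `from ... yield` scans.
pairs = [(doc["name"], term) for doc in docs for term in doc["text"].split()]

# group name, term compute tf = count
tfs = Counter(pairs)

# group term compute df = count, taken over distinct (name, term) pairs so
# that each document counts once per term (an assumption of this sketch).
dfs = Counter(term for _, term in set(pairs))

n = len(docs)  # assumed meaning of `n`: total number of documents

# Join tfs with dfs on the term, then apply tf * (log(n / df) + 1).
tfidfs = [
    {"term": term, "name": name, "tfidf": tf * (math.log(n / dfs[term]) + 1)}
    for (name, term), tf in tfs.items()
]
```

Counting the raw pairs per term instead of the distinct ones would give `a` a count of 3 here (total occurrences) rather than 2 (documents containing it), which may be what the question about a `unique` operator is getting at.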
Thanks for giving Morel a try. This is an interesting problem, because the different measures require you to group at different granularities. That is hard to do in relational algebra because - using my favorite analogy, the pasta machine - you need to run the pasta through the machine more than once. SQL gives us So I settled on an approach that uses correlated functions. Your approach produces collections that could be later joined on the common attributes, but it's basically similar. With some query optimization magic, perhaps both solutions could produce the same physical plan. Here's my solution:
You'll need to use my https://github.com/julianhyde/morel/tree/0088-math branch (work in progress for #88). It includes the There is potentially also a solution that uses clever aggregate functions over sets, but I didn't have time to write it.
Oops, I missed the '+ 1'. Here is a revised solution that sorts by idf-tf:
Thank you, this code is super clean! It is interesting how the relational and functional worlds intersect 🤔 Is the goal for a program like this to execute on a distributed system like Spark or Flink using Calcite, or does Calcite translate the query plan back into Morel? I am also wondering, is there a chance Morel will support data streams?
Yes, absolutely. The goal is to provide a language more powerful than SQL that can be executed as efficiently as SQL, on any system that can execute SQL. In my StrangeLoop talk, I describe how Morel can implement WordCount. WordCount is the exemplar of data-parallel systems like MapReduce and Spark. Those systems execute DAGs of extended relational algebra (extended with aggregate, table functions, and iteration). Via Calcite we can generate such DAGs.
By the way, the program I have written is a query. The functions, when inlined, become correlated scalar sub-queries. The query is initially correlated and seems to make multiple passes over the data, but those problems can be solved via rewrites. I expect that the relational algebra plan would include Aggregate with multiple grouping sets, but that can be evaluated efficiently in a data parallel system. Someone could work backwards from that plan to a SQL query that contains
Hi, I tried to implement TF-IDF (Term Frequency - Inverse Document Frequency) in Morel, and I almost managed to get it to work.
The formula that is used to compute tf-idf is defined as follows:
tf-idf(t, d) = tf(t, d) * idf(t)
where:
t is a term
d is a document in a document set
idf(t) = log [n/df(t)] + 1 is the inverse document frequency
n is the total number of documents in the document set
df(t) is the document frequency of t
tf(t, d) is the term frequency of t in d, that is, the count of t within the document d
1 is added so that terms which occur in all documents will not be entirely ignored.
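To make the effect of the `+ 1` concrete, here is a small Python sketch of the formula with hand-checkable inputs. The function names are made up for illustration, and the natural logarithm is an assumption, since the formula above does not name a base.

```python
import math

def idf(n: int, df: int) -> float:
    """Inverse document frequency: log(n / df) + 1 (natural log assumed)."""
    return math.log(n / df) + 1

def tf_idf(tf: int, n: int, df: int) -> float:
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return tf * idf(n, df)

# A term found in all 4 of 4 documents: log(4/4) = 0, so idf is exactly 1
# and the term keeps its raw term frequency instead of being zeroed out.
print(tf_idf(tf=2, n=4, df=4))  # 2.0

# A term found in only 1 of 4 documents gets a larger weight.
print(tf_idf(tf=2, n=4, df=1))
```

Without the `+ 1`, the first call would return 0 and the term would vanish from the ranking for every document.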
This is the code:
Is there a way to calculate tf-idf that I might have missed? It would be interesting to see if it was possible. I am unsure, but maybe a unique relational operator is needed to calculate df.