Add offline inference support #103
Conversation
docs/source/arguments.rst
Outdated
A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--bcast``
Whether to broadcast topK on a specific side. By default, a universal topK across all scores is returned. Users can specify ``head`` to broadcast at head, which returns topK for each head; ``rel`` to broadcast at relation, which returns topK for each relation; ``tail`` to broadcast at tail, which returns topK for each tail. Default: None.
don't quite understand this one either. I suppose this is bool, so Default is False?
A string. If you have a list of (h, r, t) triplets and want to calculate the top5 most relevant links for each
but h_r_t doesn't broadcast head, relation and tail? i think you have an option to do pair-wise computation, right? so i suppose the default behavior is broadcast?
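To make the ``--bcast`` semantics being discussed here concrete, a small NumPy sketch (illustrative only, not dglke code) contrasting the default universal topK with ``head`` broadcasting:

```python
import numpy as np

# toy score matrix: rows are heads, columns are candidate tails
scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.7]])
k = 2

# default (--bcast None): one universal topK over all scores
universal = np.sort(scores, axis=None)[-k:][::-1]

# --bcast head: a separate topK per head (one topK per row)
per_head = np.sort(scores, axis=1)[:, -k:][:, ::-1]

print(universal)   # [0.9 0.8]
print(per_head)    # [[0.9 0.5]
                   #  [0.8 0.7]]
```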
How many results are returned. Default: 10.

- ``--score_func SCORE_FUNC_NAME or None``
What kind of score is used in ranking. Currently, we support two functions: ``none`` (score = $x$) and ``logsigmoid`` ($score = log(sigmoid(x))$). Default: none.
does logsigmoid change ranking?
No, it just changes the score distribution shape.
the reason we want this is that the returned results contain score? does anyone care about this transformation?
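The point made in this thread, that ``logsigmoid`` reshapes the score distribution without changing the ranking, follows from it being strictly increasing; a quick sketch:

```python
import numpy as np

def logsigmoid(x):
    # log(sigmoid(x)); the plain form is fine for small example values
    return np.log(1.0 / (1.0 + np.exp(-x)))

scores = np.array([2.3, -0.5, 1.1, 0.0])
transformed = logsigmoid(scores)

# logsigmoid is strictly increasing, so the ordering is identical;
# only the shape of the reported scores changes
print(np.argsort(scores))        # [1 3 2 0]
print(np.argsort(transformed))   # [1 3 2 0]
```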
docs/source/arguments.rst
Outdated
@@ -304,3 +303,69 @@ dglke_partition
- ``-k NUM_PARTS`` or ``--num-parts NUM_PARTS``
The number of partitions.

dglke_score
i think we need to explain what this command does.
I think we need another PR to refine our doc before 0.1.1 release
i think it's better to call it dglke_predict. The main function of this command is to predict the likely head/tail entities or relations given the other elements.
docs/source/arguments.rst
Outdated
The numpy file containing the embeddings.

- ``--format FORMAT``
The format of the input data, specified as ``e_e``. Ideally, users should provide two files, one for heads and one for tails. But we also allow users to use ``*`` to represent *all* of the embeddings. For example, ``e_*`` only requires users to provide a file containing heads and uses the whole embedding set as tails; ``*_e`` only requires users to provide a file containing tails and uses the whole embedding set as heads; users can even specify a single ``*`` to treat the whole embedding set as both heads and tails. By default, the calculation is done in an N_head x N_tail manner, but users can use ``e_e_pw`` to give two files of the same length, in which case the similarity is calculated pair by pair.
users don't need to specify relations? why is the format ``e_e`` instead of ``h_r_t``?
For embedding similarity, it is only entity-to-entity or relation-to-relation. It is meaningless to compare an entity with a relation.
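A quick NumPy sketch of the cross (``e_e``) versus pairwise (``e_e_pw``) modes described above, using dot product as a stand-in similarity (the actual similarity depends on ``--sim_func``):

```python
import numpy as np

rng = np.random.default_rng(0)
heads = rng.random((2, 8))   # embeddings loaded from the "heads" file
tails = rng.random((2, 8))   # embeddings loaded from the "tails" file

# e_e: every head against every tail -> N_head x N_tail score matrix
cross = heads @ tails.T                          # shape (2, 2)

# e_e_pw: files of the same length, compared pair by pair -> one score each
pairwise = np.einsum('ij,ij->i', heads, tails)   # shape (2,)

print(cross.shape, pairwise.shape)   # (2, 2) (2,)
```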
docs/source/inference.rst
Outdated
The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
i'm confused if the command returns score of all input edges or returns top K ranked ones.
By design it only returns top5.
Do we need an option for user to output all the scores?
maybe --topk -1
it seems the description is kind of redundant.
* ``--model_path``, The path containing the pretrained model, including the embedding files (.npy) and a config.json containing the configuration of the model.
* ``--format``, The format of the input data, specified as ``h_r_t``. Ideally, users should provide three files, one for head entities, one for relations and one for tail entities. But we also allow users to use ``*`` to represent *all* of the entities or relations. For example, ``h_r_*`` requires users to provide files containing head entities and relations and uses the whole entity set as tail entities; ``*_*_t`` requires users to provide a single file containing tail entities and uses the whole entity set as head entities and the whole relation set as relations. The supported formats include ``h_r_t``, ``h_r_*``, ``h_*_t``, ``*_r_t``, ``h_*_*``, ``*_r_*``, ``*_*_t``. By default, the calculation is done in an N_h x N_r x N_t manner.
* ``--data_files``, A list of data file names. This is used to provide the files containing the required data according to the format, e.g., for ``h_r_t``, three files are required as h_data, r_data and t_data, while for ``h_*_t``, two files are required as h_data and t_data.
* ``--raw_data``, A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation.
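To illustrate how a format string pairs with ``--data_files``, here is a hypothetical helper (``resolve_format`` is not part of dglke; it is only a sketch of the convention described above):

```python
def resolve_format(fmt, data_files):
    """Map the 'h'/'r'/'t' slots of a format string to data files.

    A '*' slot means "use the whole entity/relation set" and
    therefore consumes no file from data_files.
    """
    files = iter(data_files)
    return {pos: (next(files) if slot != '*' else None)
            for pos, slot in zip(('head', 'rel', 'tail'), fmt.split('_'))}

print(resolve_format('h_*_t', ['h_data.txt', 't_data.txt']))
# {'head': 'h_data.txt', 'rel': None, 'tail': 't_data.txt'}
```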
It seems the arguments are also listed in the argument list file? can we organize these arguments like this (required arguments, task related arguments, etc) in the other file?
Sure. We may discuss how to present our arguments in a more readable way.
docs/source/inference.rst
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of embedding similarity ranking is: given a list of entity (e1, e2) pairs or relation (r1, r2) pairs, calculate the similarity between their corresponding embeddings and return the topk most similar pairs. An example return value of the top5 similar entities looks like this::

    head tail score
is this actual head entity or tail entity? why do we want to calculate the similarity of a head entity and tail enity?
Emmm..., the naming here is misleading. It should be emb1 and emb2, not the head and tail of the triplets.
yes, i think we should call them entity1 and entity2.
python/dglke/infer_emb_sim.py
Outdated
head = load_entity_data(data_files[0])
tail = load_entity_data(data_files[1])
args.bcast = False
pair_wise = True
i still get confused by broadcast and pair wise. shouldn't one be enough?
python/dglke/infer_emb_sim.py
Outdated
head = load_entity_data()
tail = load_entity_data()

model = EmbSimInfor(args.gpu, args.emb_file, args.sim_func)
i suppose you mean EmbSimInfer?
Yes
for j in range((num_tail + batch_size - 1) // batch_size):
    st_emb = tail_emb[j * batch_size : (j + 1) * batch_size \
                      if (j + 1) * batch_size < num_tail \
                      else num_tail]
is it always broadcast?
result.append((F.asnumpy(head[head_idx]),
               F.asnumpy(rel[rel_idx]),
               np.full((k,), F.asnumpy(tail[i])),
               F.asnumpy(score)))
can we share code in the branches? it seems there is a lot of duplications.
python/dglke/models/infer.py
Outdated
    return result

class EmbSimInfor():
Infer instead of Infor?
Yes
@@ -120,6 +127,9 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=1, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
why no infer?
@@ -60,6 +60,13 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=self.dist_ord, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
can we use edge_func to implement it?
some infer func is different from score_func for some algorithms.
docs/source/arguments.rst
Outdated
How many results are returned. Default: 10.

- ``--sim_func SIM_FUNC_NAME``
What kind of distance function is used in ranking and will be output. It supports five functions: 1) cosine: use cosine distance; 2) l2: use l2 distance; 3) l1: use l1 distance; 4) dot: use dot product as distance; 5) ext_jaccard: use extended Jaccard as distance.
we need a clear explanation of ext_jaccard here.
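Since ext_jaccard was asked about, here is a sketch of the five functions using their standard definitions (how dglke orients the l1/l2 distances for ranking is not confirmed by this thread; the plain distances are shown):

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2(a, b):
    return np.linalg.norm(a - b)          # smaller = more similar

def l1(a, b):
    return np.linalg.norm(a - b, ord=1)   # smaller = more similar

def dot(a, b):
    return a @ b

def ext_jaccard(a, b):
    # extended (continuous) Jaccard: a.b / (|a|^2 + |b|^2 - a.b);
    # equals 1 for identical vectors, mirroring set Jaccard
    ab = a @ b
    return ab / (a @ a + b @ b - ab)

a = np.array([1.0, 0.0, 1.0])
print(ext_jaccard(a, a))   # 1.0
```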
A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--exec_mode``
How to calculate scores for triplets and calculate topK. Possible candidates include: ``triplet_wise``, ``all``, ``batch_head``, ``batch_rel``, ``batch_tail``.
are you going to explain each option means? btw, pairwise isn't good?
Actually it is (h, r, t) triplets, pairwise is not correct.
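My reading of the ``triplet_wise`` versus ``all`` modes, sketched with a toy additive score standing in for the real score function (illustrative only):

```python
import numpy as np

# toy stand-in score: score(h, r, t) = h + r + t
heads = np.array([1.0, 2.0])
rels  = np.array([10.0, 20.0])
tails = np.array([100.0, 200.0])

# triplet_wise: inputs are aligned lists; one score per (h_i, r_i, t_i)
triplet_wise = heads + rels + tails                      # shape (2,)

# all: every (h, r, t) combination, an N_h x N_r x N_t score tensor
all_mode = (heads[:, None, None] + rels[None, :, None]
            + tails[None, None, :])                      # shape (2, 2, 2)

print(triplet_wise.shape, all_mode.shape)   # (2,) (2, 2, 2)
```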
docs/source/inference.rst
Outdated
Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score of each triplet using the predefined score function, sort the resulting scores and output the topk most confident triplets.
i feel the description of the task isn't very intuitive. How about this?
predicting missing entities/relations in a triplet: given entities and/or relations, predict which entities or relations are likely to connect with the existing entities for given relations. For example, given a head entity and a relation, predict which entities are likely to connect to the head entity via the given relation.
This version is better.
docs/source/inference.rst
Outdated
Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score of each triplet using the predefined score function, sort the resulting scores and output the topk most confident triplets.
* **Embedding similarity ranking** Given a list of (e, e) entity pairs or (r, r) relation pairs, calculate the similarity for each pair, sort the resulting similarity scores and output the topk most similar pairs.
finding similar embeddings: given an entity/relation embedding, find the most similar entity/relation embeddings for some pre-defined similarity functions.
docs/source/inference.rst
Outdated
The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
it seems the description is kind of redundant.
remb = rel_emb[j]
temb = F.unsqueeze(tail_emb[k], dim=0)
edge = FakeEdge(hemb, temb, remb)
score = score_func.edge_func(edge)['score']
the difference between infer and edge_func is that infer computes scores in a batch fashion?
infer will do broadcasting as (N, 1) * (1, N). edge_func only does N * N.
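The distinction described above, sketched in NumPy (the real score functions are more involved; an elementwise product stands in here):

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0])   # pretend per-head partial scores
t = np.array([4.0, 5.0, 6.0])   # pretend per-tail partial scores

# edge_func style: N aligned edges -> N scores
aligned = h * t                   # shape (3,)

# infer style: broadcast (N, 1) against (1, N) -> all N x N combinations
combos = h[:, None] * t[None, :]  # shape (3, 3)

print(aligned.shape, combos.shape)   # (3,) (3, 3)
```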
Add support for offline inference of two different tasks:

For linkage score ranking, we support the TransE_l1, TransE_l2, DistMult, ComplEx, RESCAL and RotatE score functions. We provide two different score funcs: ``none`` and ``logsigmoid``.

For embedding similarity ranking, we support five different similarity funcs: ``cosine``, ``l2``, ``l1``, ``dot`` and ``ext_jaccard``.