Add offline inference support #103

Merged · 28 commits into awslabs:master · Jun 11, 2020

Conversation

classicsong (Contributor) commented May 26, 2020

Add support for offline inference of two different tasks:

  • The Linkage Score Ranking
  • The Embedding Similarity Ranking

For linkage score ranking, we support the TransE_l1, TransE_l2, DistMult, ComplEx, RESCAL, and RotatE score functions. We provide two different score funcs:

  • none: $score = x$
  • logsigmoid: $score = log(sigmoid(x))$
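The two transforms can be sketched in plain Python (illustrative only; the actual implementation operates on tensors):

```python
import math

def transform_scores(xs, func="none"):
    """Apply the selected score transform to a list of raw scores."""
    if func == "none":
        return list(xs)  # score = x
    if func == "logsigmoid":
        # score = log(sigmoid(x)) = -log(1 + exp(-x)), numerically stable for x >= 0
        return [-math.log1p(math.exp(-x)) for x in xs]
    raise ValueError("unknown score func: %s" % func)
```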

For embedding similarity ranking, we support five different similarity funcs:

  • cosine: score = $\frac{x \cdot y}{||x||_2||y||_2}$
  • l2: score = $||x - y||_2$
  • l1: score = $||x - y||_1$
  • dot: score = $x \cdot y$
  • extended jaccard: score = $\frac{x \cdot y}{||x||_2^2 + ||y||_2^2 - x \cdot y}$

Checklist:

  • linkage score ranking inference
  • embedding similarity ranking inference
  • docs
  • unit tests
  • examples
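The five similarity functions listed earlier can be written out as a plain-Python reference sketch (not the tensorized DGL-KE code):

```python
import math

def similarity(x, y, func="cosine"):
    """Reference formulas for the five similarity/distance functions."""
    dot = sum(a * b for a, b in zip(x, y))
    nx2 = sum(a * a for a in x)  # ||x||_2^2
    ny2 = sum(b * b for b in y)  # ||y||_2^2
    if func == "cosine":
        return dot / (math.sqrt(nx2) * math.sqrt(ny2))
    if func == "l2":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if func == "l1":
        return sum(abs(a - b) for a, b in zip(x, y))
    if func == "dot":
        return dot
    if func == "ext_jaccard":
        # extended Jaccard: x.y / (||x||^2 + ||y||^2 - x.y)
        return dot / (nx2 + ny2 - dot)
    raise ValueError("unknown similarity func: %s" % func)
```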

A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--bcast``
Whether to broadcast topK on a specific side. By default, a universal topK across all scores is returned. Users can specify ``head`` to broadcast at the head side, returning topK for each head; ``rel`` to broadcast at the relation side, returning topK for each relation; ``tail`` to broadcast at the tail side, returning topK for each tail. Default: None.
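The intended semantics can be sketched with a hypothetical helper (assuming scores arrive as ((h, r, t), score) pairs; this is not the actual implementation):

```python
from collections import defaultdict
from heapq import nlargest

def topk_scores(scores, k, bcast=None):
    """scores: iterable of ((h, r, t), score). bcast: None, 'head', 'rel' or 'tail'."""
    if bcast is None:
        # universal topK across all scores
        return nlargest(k, scores, key=lambda item: item[1])
    side = {"head": 0, "rel": 1, "tail": 2}[bcast]
    groups = defaultdict(list)
    for triple, s in scores:
        groups[triple[side]].append((triple, s))
    # a separate topK per head / relation / tail value
    return {key: nlargest(k, grp, key=lambda item: item[1])
            for key, grp in groups.items()}
```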
Contributor:

don't quite understand this one either. I suppose this is bool, so Default is False?

Contributor (Author):

A string. If you have a list of (h, r, t) triplets and want to calculate the top5 most relevant links for each $h_i$, you can specify ``--bcast 'head'``.

Contributor:

but h_r_t doesn't broadcast head, relation and tail? i think you have an option to do pair-wise computation, right? so i suppose the default behavior is broadcast?

How many results are returned. Default: 10.

- ``--score_func SCORE_FUNC_NAME or None``
What kind of score is used in ranking. Currently, we support two functions: ``none`` (score = $x$) and ``logsigmoid`` ($score = log(sigmoid(x))$). Default: none.
Contributor:

does logsigmoid change ranking?

Contributor (Author):

No, it just changes the shape of the score distribution.
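Concretely: log(sigmoid(x)) is strictly increasing in x, so ranking by the transformed scores gives the same order as ranking by the raw scores. A quick check:

```python
import math

raw = [3.2, -1.5, 0.7, 8.0, -0.2]
logsig = [-math.log1p(math.exp(-x)) for x in raw]

def ranking(values):
    # indices sorted from highest to lowest score
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

assert ranking(raw) == ranking(logsig)  # same order, different score shape
```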

Contributor:

the reason we want this is that the returned results contain score? does anyone care about this transformation?

@@ -304,3 +303,69 @@ dglke_partition
- ``-k NUM_PARTS`` or ``--num-parts NUM_PARTS``
The number of partitions.

dglke_score
Contributor:

i think we need to explain what this command does.

Contributor (Author):

I think we need another PR to refine our doc before 0.1.1 release

Contributor:

i think it's better to call it dglke_predict. the main function of this command is to predict the likely head/tail entities or relations given other elements.

The numpy file containing the embeddings.

- ``--format FORMAT``
The format of the input data, specified as ``e_e``. Ideally, the user should provide two files, one for heads and one for tails. But we also allow users to use ``*`` to represent *all* of the embeddings. For example, ``e_*`` only requires users to provide a file containing heads and uses the whole embedding set as tails; ``*_e`` only requires users to provide a file containing tails and uses the whole embedding set as heads; users can even specify a single ``*`` to treat the whole embedding set as both heads and tails. By default, the calculation is performed in an N_head x N_tail manner, but users can use ``e_e_pw`` to give two files of the same length, in which case the similarity is calculated pair by pair.
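The difference between the default cross mode and ``e_e_pw`` can be sketched in plain Python, with cosine as an example similarity (illustrative, not the actual code):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def emb_sim(heads, tails, pairwise=False):
    if pairwise:
        # e_e_pw: both files must have the same length; one score per pair
        assert len(heads) == len(tails)
        return [cosine(h, t) for h, t in zip(heads, tails)]
    # default: the full N_head x N_tail score matrix
    return [[cosine(h, t) for t in tails] for h in heads]
```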
Contributor:

users don't need to specify relations? why are the format e_e instead of h_r_t?

Contributor (Author):

For embedding similarity, only entity-to-entity or relation-to-relation comparisons make sense; it is meaningless to compare an entity with a relation.


The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
Contributor:

i'm confused if the command returns score of all input edges or returns top K ranked ones.

Contributor (Author):

By design it only returns top5.
Do we need an option for users to output all the scores?

Contributor (Author):

maybe --topk -1

Contributor:

it seems the description is kind of redundant.

* ``--model_path``, The path containing the pretrained model, including the embedding files (.npy) and a config.json containing the configuration information of the model.
* ``--format``, The format of the input data, specified as ``h_r_t``. Ideally, the user should provide three files, one for head entities, one for relations and one for tail entities. But we also allow users to use ``*`` to represent *all* of the entities or relations. For example, ``h_r_*`` requires users to provide files containing head entities and relations and uses the whole entity set as tail entities; ``*_*_t`` requires users to provide a single file containing tail entities and uses the whole entity set as head entities and the whole relation set as relations. The supported formats include ``h_r_t``, ``h_r_*``, ``h_*_t``, ``*_r_t``, ``h_*_*``, ``*_r_*``, ``*_*_t``. By default, the calculation is performed in an N_h x N_r x N_t manner.
* ``--data_files``, A list of data file names. This is used to provide the necessary files containing the required data according to the format, e.g., for ``h_r_t``, three files are required as h_data, r_data and t_data, while for ``h_*_t``, two files are required as h_data and t_data.
* ``--raw_data``, A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation.
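How a format string maps onto the files expected in ``--data_files`` can be illustrated with a small helper (hypothetical code, not part of the CLI):

```python
SUPPORTED_FORMATS = {"h_r_t", "h_r_*", "h_*_t", "*_r_t", "h_*_*", "*_r_*", "*_*_t"}

def required_files(fmt):
    """Return the data files a given --format requires, in order."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported format: %s" % fmt)
    names = ("h_data", "r_data", "t_data")
    # a '*' slot takes the whole entity/relation set, so no file is needed
    return [name for part, name in zip(fmt.split("_"), names) if part != "*"]
```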
Contributor:

It seems the arguments are also listed in the argument list file? can we organize these arguments like this (required arguments, task related arguments, etc) in the other file?

Contributor (Author):

Sure. We can discuss how to present our arguments in a more readable way.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of embedding similarity ranking is: given a list of entity (e1, e2) pairs or relation (r1, r2) pairs, calculate the similarity between their corresponding embeddings and return the topk most similar pairs. An example return value of the top5 most similar entities looks like this::

head tail score
zheng-da (Contributor) commented Jun 6, 2020:

is this the actual head entity or tail entity? why do we want to calculate the similarity of a head entity and a tail entity?

Contributor (Author):

Emmm..., the naming here is misleading. It should be emb1 and emb2, not the head and tail of a triplet.

Contributor:

yes, i think we should call them entity1 and entity2.

head = load_entity_data(data_files[0])
tail = load_entity_data(data_files[1])
args.bcast = False
pair_wise = True
zheng-da (Contributor) commented Jun 7, 2020:

i still get confused by broadcast and pair wise. shouldn't one be enough?

head = load_entity_data()
tail = load_entity_data()

model = EmbSimInfor(args.gpu, args.emb_file, args.sim_func)
zheng-da (Contributor) commented Jun 7, 2020:

i suppose you mean EmbSimInfer?

Contributor (Author):

Yes

for j in range((num_tail + batch_size - 1) // batch_size):
st_emb = tail_emb[j * batch_size : (j + 1) * batch_size \
if (j + 1) * batch_size < num_tail \
else num_tail]
Contributor:

is it always broadcast?

result.append((F.asnumpy(head[head_idx]),
F.asnumpy(rel[rel_idx]),
np.full((k,), F.asnumpy(tail[i])),
F.asnumpy(score)))
Contributor:

can we share code in the branches? it seems there is a lot of duplications.


return result

class EmbSimInfor():
Contributor:

Infer instead of Infor?

Contributor (Author):

Yes

@@ -120,6 +127,9 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=1, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
Contributor:

why no infer?

@@ -60,6 +60,13 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=self.dist_ord, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
Contributor:

can we use edge_func to implement it?

Contributor (Author):

The infer func is different from the score_func for some algorithms.

@classicsong classicsong requested a review from zheng-da June 10, 2020 00:01
How many results are returned. Default: 10.

- ``--sim_func SIM_FUNC_NAME``
What kind of distance function is used in ranking and will be output. It supports five functions: 1) cosine: use cosine distance; 2) l2: use l2 distance; 3) l1: use l1 distance; 4) dot: use the dot product as distance; 5) ext_jaccard: use the extended Jaccard similarity as distance.
Contributor:

we need a clear explanation of ext_jaccard here.

A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--exec_mode``
How to calculate scores for triplets and calculate topK. Possible candidates include: ``triplet_wise``, ``all``, ``batch_head``, ``batch_rel``, ``batch_tail``.
Contributor:

are you going to explain what each option means? btw, pairwise isn't good?

Contributor (Author):

Actually it is (h, r, t) triplets, pairwise is not correct.


Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score using the predefined score function for each triplet, sort the resulting scores and output the topk most confident triplets.
Contributor:

i feel the description of the task isn't very intuitive. How about this?
predicting missing entities/relations in a triplet: given entities and/or relations, predict which entities or relations are likely to connect with the existing entities for given relations. For example, given a head entity and a relation, predict which entities are likely to connect to the head entity via the given relation.

Contributor (Author):

This version is better.

Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score using the predefined score function for each triplet, sort the resulting scores and output the topk most confident triplets.
* **Embedding similarity ranking** Given a list of (e, e) entity pairs or (r, r) relation pairs, calculate the similarity for each pair, sort the resulting similarity scores and output the topk most similar pairs.
Contributor:

finding similar embeddings: given an entity/relation embedding, find the most similar entity/relation embeddings for some pre-defined similarity functions.


The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
Contributor:

it seems the description is kind of redundant.

remb = rel_emb[j]
temb = F.unsqueeze(tail_emb[k], dim=0)
edge = FakeEdge(hemb, temb, remb)
score = score_func.edge_func(edge)['score']
Contributor:

the difference between infer and edge_func is that infer computes scores in a batch fashion?

Contributor (Author):

infer will do broadcasting as (N, 1) * (1, N).
edge_func only does N * N.
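The distinction can be sketched with a toy 1-d TransE-style score (illustrative only): ``edge_func`` scores N aligned triplets one-to-one, while ``infer`` broadcasts N heads against M tails into an N x M score grid.

```python
def score(h, r, t):
    # toy 1-d TransE-style score: -|h + r - t|
    return -abs(h + r - t)

def edge_func(heads, rels, tails):
    # aligned: one score per (h, r, t) triplet
    return [score(h, r, t) for h, r, t in zip(heads, rels, tails)]

def infer(heads, rel, tails):
    # broadcast: an N x M grid of scores for a fixed relation
    return [[score(h, rel, t) for t in tails] for h in heads]
```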

@zheng-da zheng-da merged commit ff4da0b into awslabs:master Jun 11, 2020