Add offline inference support #103

Merged · 28 commits into awslabs:master · Jun 11, 2020

Conversation

classicsong (Contributor) commented May 26, 2020

Add support for offline inference of two different tasks:

  • The Linkage Score Ranking
  • The Embedding Similarity Ranking

For linkage score ranking, we support the TransE_l1, TransE_l2, DistMult, ComplEx, RESCAL, and RotatE score functions. We provide two different score funcs:

  • none: $score = x$
  • logsigmoid: $score = log(sigmoid(x))$
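The two transforms can be sketched in plain Python (illustrative only; the actual implementation operates on tensors):

```python
import math

def transform_scores(xs, func="none"):
    """Apply the selected score transform to a list of raw scores."""
    if func == "none":
        return list(xs)  # score = x
    if func == "logsigmoid":
        # score = log(sigmoid(x)) = -log(1 + exp(-x)), numerically stable for x >= 0
        return [-math.log1p(math.exp(-x)) for x in xs]
    raise ValueError("unknown score func: %s" % func)
```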

For embedding similarity ranking, we support five different similarity funcs:

  • cosine: score = $\frac{x \cdot y}{||x||_2||y||_2}$
  • l2: score = $||x - y||_2$
  • l1: score = $||x - y||_1$
  • dot: score = $x \cdot y$
  • extended jaccard: score = $\frac{x \cdot y}{||x||_2^2 + ||y||_2^2 - x \cdot y}$

Checklist:

  • linkage score ranking inference
  • embedding similarity ranking inference
  • docs
  • unit tests
  • examples
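The five similarity functions listed earlier can be written out as a plain-Python reference sketch (not the tensorized DGL-KE code):

```python
import math

def similarity(x, y, func="cosine"):
    """Reference formulas for the five similarity/distance functions."""
    dot = sum(a * b for a, b in zip(x, y))
    nx2 = sum(a * a for a in x)  # ||x||_2^2
    ny2 = sum(b * b for b in y)  # ||y||_2^2
    if func == "cosine":
        return dot / (math.sqrt(nx2) * math.sqrt(ny2))
    if func == "l2":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if func == "l1":
        return sum(abs(a - b) for a, b in zip(x, y))
    if func == "dot":
        return dot
    if func == "ext_jaccard":
        # extended Jaccard: x.y / (||x||^2 + ||y||^2 - x.y)
        return dot / (nx2 + ny2 - dot)
    raise ValueError("unknown similarity func: %s" % func)
```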

A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--bcast``
Whether to broadcast topK on a specific side. By default, a universal topK across all scores is returned. Users can specify ``head`` to broadcast at the head side, returning topK for each head; ``rel`` to broadcast at the relation side, returning topK for each relation; ``tail`` to broadcast at the tail side, returning topK for each tail. Default: None.
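The intended semantics can be sketched with a hypothetical helper (assuming scores arrive as ((h, r, t), score) pairs; this is not the actual implementation):

```python
from collections import defaultdict
from heapq import nlargest

def topk_scores(scores, k, bcast=None):
    """scores: iterable of ((h, r, t), score). bcast: None, 'head', 'rel' or 'tail'."""
    if bcast is None:
        # universal topK across all scores
        return nlargest(k, scores, key=lambda item: item[1])
    side = {"head": 0, "rel": 1, "tail": 2}[bcast]
    groups = defaultdict(list)
    for triple, s in scores:
        groups[triple[side]].append((triple, s))
    # a separate topK per head / relation / tail value
    return {key: nlargest(k, grp, key=lambda item: item[1])
            for key, grp in groups.items()}
```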
Contributor:

don't quite understand this one either. I suppose this is bool, so Default is False?

Contributor (Author):

A string. If you have a list of (h, r, t) triplets and want to calculate the top5 most relevant links for each $h_i$, you can specify ``--bcast 'head'``.

Contributor:

but h_r_t doesn't broadcast head, relation and tail? i think you have an option to do pair-wise computation, right? so i suppose the default behavior is broadcast?

How many results are returned. Default: 10.

- ``--score_func SCORE_FUNC_NAME or None``
What kind of score is used in ranking. Currently, we support two functions: ``none`` (score = $x$) and ``logsigmoid`` ($score = log(sigmoid(x))$). Default: none.
Contributor:

does logsigmoid change ranking?

Contributor (Author):

No, it just changes the shape of the score distribution.
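Concretely: log(sigmoid(x)) is strictly increasing in x, so ranking by the transformed scores gives the same order as ranking by the raw scores. A quick check:

```python
import math

raw = [3.2, -1.5, 0.7, 8.0, -0.2]
logsig = [-math.log1p(math.exp(-x)) for x in raw]

def ranking(values):
    # indices sorted from highest to lowest score
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

assert ranking(raw) == ranking(logsig)  # same order, different score shape
```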

Contributor:

the reason we want this is that the returned results contain score? does anyone care about this transformation?

@@ -304,3 +303,69 @@ dglke_partition
- ``-k NUM_PARTS`` or ``--num-parts NUM_PARTS``
The number of partitions.

dglke_score
Contributor:

i think we need to explain what this command does.

Contributor (Author):

I think we need another PR to refine our doc before 0.1.1 release

Contributor:

i think it's better to call it dglke_predict. the main function of this command is to predict the likely head/tail entities or relations given other elements.

The numpy file containing the embeddings.

- ``--format FORMAT``
The format of the input data, specified as ``e_e``. Ideally, the user should provide two files, one for heads and one for tails. But we also allow users to use ``*`` to represent *all* of the embeddings. For example, ``e_*`` only requires users to provide a file containing heads and uses the whole embedding set as tails; ``*_e`` only requires users to provide a file containing tails and uses the whole embedding set as heads; users can even specify a single ``*`` to treat the whole embedding set as both heads and tails. By default, the calculation is performed in an N_head x N_tail manner, but users can use ``e_e_pw`` to give two files of the same length, in which case the similarity is calculated pair by pair.
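The difference between the default cross mode and ``e_e_pw`` can be sketched in plain Python, with cosine as an example similarity (illustrative, not the actual code):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def emb_sim(heads, tails, pairwise=False):
    if pairwise:
        # e_e_pw: both files must have the same length; one score per pair
        assert len(heads) == len(tails)
        return [cosine(h, t) for h, t in zip(heads, tails)]
    # default: the full N_head x N_tail score matrix
    return [[cosine(h, t) for t in tails] for h in heads]
```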
Contributor:

users don't need to specify relations? why are the format e_e instead of h_r_t?

Contributor (Author):

For embedding similarity, only entity-to-entity or relation-to-relation comparisons make sense; it is meaningless to compare an entity with a relation.


The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
Contributor:

i'm confused if the command returns score of all input edges or returns top K ranked ones.

Contributor (Author):

By design it only returns top5.
Do we need an option for users to output all the scores?

Contributor (Author):

maybe --topk -1

Contributor:

it seems the description is kind of redundant.

* ``--model_path``, The path containing the pretrained model, including the embedding files (.npy) and a config.json containing the configuration information of the model.
* ``--format``, The format of the input data, specified as ``h_r_t``. Ideally, the user should provide three files, one for head entities, one for relations and one for tail entities. But we also allow users to use ``*`` to represent *all* of the entities or relations. For example, ``h_r_*`` requires users to provide files containing head entities and relations and uses the whole entity set as tail entities; ``*_*_t`` requires users to provide a single file containing tail entities and uses the whole entity set as head entities and the whole relation set as relations. The supported formats include ``h_r_t``, ``h_r_*``, ``h_*_t``, ``*_r_t``, ``h_*_*``, ``*_r_*``, ``*_*_t``. By default, the calculation is performed in an N_h x N_r x N_t manner.
* ``--data_files``, A list of data file names. This is used to provide the necessary files containing the required data according to the format, e.g., for ``h_r_t``, three files are required as h_data, r_data and t_data, while for ``h_*_t``, two files are required as h_data and t_data.
* ``--raw_data``, A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation.
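How a format string maps onto the files expected in ``--data_files`` can be illustrated with a small helper (hypothetical code, not part of the CLI):

```python
SUPPORTED_FORMATS = {"h_r_t", "h_r_*", "h_*_t", "*_r_t", "h_*_*", "*_r_*", "*_*_t"}

def required_files(fmt):
    """Return the data files a given --format requires, in order."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported format: %s" % fmt)
    names = ("h_data", "r_data", "t_data")
    # a '*' slot takes the whole entity/relation set, so no file is needed
    return [name for part, name in zip(fmt.split("_"), names) if part != "*"]
```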
Contributor:

It seems the arguments are also listed in the argument list file? can we organize these arguments like this (required arguments, task related arguments, etc) in the other file?

Contributor (Author):

Sure. We can discuss how to present our arguments in a more readable way.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of embedding similarity ranking is: given a list of entity (e1, e2) pairs or relation (r1, r2) pairs, calculate the similarity between their corresponding embeddings and return the topk most similar pairs. An example return value of the top5 most similar entities looks like this::

head tail score
zheng-da (Contributor) commented Jun 6, 2020:

is this the actual head entity or tail entity? why do we want to calculate the similarity of a head entity and a tail entity?

Contributor (Author):

Emmm..., the naming here is misleading. It should be emb1 and emb2, not the head and tail of a triplet.

Contributor:

yes, i think we should call them entity1 and entity2.

head = load_entity_data(data_files[0])
tail = load_entity_data(data_files[1])
args.bcast = False
pair_wise = True
zheng-da (Contributor) commented Jun 7, 2020:

i still get confused by broadcast and pair wise. shouldn't one be enough?

head = load_entity_data()
tail = load_entity_data()

model = EmbSimInfor(args.gpu, args.emb_file, args.sim_func)
zheng-da (Contributor) commented Jun 7, 2020:

i suppose you mean EmbSimInfer?

Contributor (Author):

Yes

for j in range((num_tail + batch_size - 1) // batch_size):
st_emb = tail_emb[j * batch_size : (j + 1) * batch_size \
if (j + 1) * batch_size < num_tail \
else num_tail]
Contributor:

is it always broadcast?

result.append((F.asnumpy(head[head_idx]),
F.asnumpy(rel[rel_idx]),
np.full((k,), F.asnumpy(tail[i])),
F.asnumpy(score)))
Contributor:

can we share code in the branches? it seems there is a lot of duplications.


return result

class EmbSimInfor():
Contributor:

Infer instead of Infor?

Contributor (Author):

Yes

@@ -120,6 +127,9 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=1, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
Contributor:

why no infer?

@@ -60,6 +60,13 @@ def edge_func(self, edges):
score = head + rel - tail
return {'score': self.gamma - nd.norm(score, ord=self.dist_ord, axis=-1)}

def infer(self, head_emb, rel_emb, tail_emb):
Contributor:

can we use edge_func to implement it?

Contributor (Author):

The infer func is different from the score_func for some algorithms.

@classicsong classicsong requested a review from zheng-da June 10, 2020 00:01
How many results are returned. Default: 10.

- ``--sim_func SIM_FUNC_NAME``
What kind of distance function is used in ranking and will be output. It supports five functions: 1) cosine: use cosine distance; 2) l2: use l2 distance; 3) l1: use l1 distance; 4) dot: use the dot product as distance; 5) ext_jaccard: use the extended Jaccard similarity as distance.
Contributor:

we need a clear explanation of ext_jaccard here.

A flag telling whether the data provided in data_files is in the raw object naming space or in the mapped id space. If True, the data is in the original naming space and the inference program will do the id translation according to the id mapping files. If False, the data is just integers and it is assumed that the user has already done the id translation. Default: False.

- ``--exec_mode``
How to calculate scores for triplets and calculate topK. Possible candidates include: ``triplet_wise``, ``all``, ``batch_head``, ``batch_rel``, ``batch_tail``.
Contributor:

are you going to explain what each option means? btw, pairwise isn't good?

Contributor (Author):

Actually it is (h, r, t) triplets, pairwise is not correct.


Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score using the predefined score function for each triplet, sort the resulting scores and output the topk most confident triplets.
Contributor:

i feel the description of the task isn't very intuitive. How about this?
predicting missing entities/relations in a triplet: given entities and/or relations, predict which entities or relations are likely to connect with the existing entities for given relations. For example, given a head entity and a relation, predict which entities are likely to connect to the head entity via the given relation.

Contributor (Author):

This version is better.

Users can use DGL-KE to do inference tasks based on pretrained embeddings (we recommend using DGL-KE to generate these embeddings). Here we support two kinds of inference tasks:

* **Linkage score ranking** Given a list of (h, r, t) triplets, calculate the linkage score using the predefined score function for each triplet, sort the resulting scores and output the topk most confident triplets.
* **Embedding similarity ranking** Given a list of (e, e) entity pairs or (r, r) relation pairs, calculate the similarity for each pair, sort the resulting similarity scores and output the topk most similar pairs.
Contributor:

finding similar embeddings: given an entity/relation embedding, find the most similar entity/relation embeddings for some pre-defined similarity functions.


The Linkage Score Ranking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task of linkage score ranking is: given a list of candidate (h, r, t) triplets, calculate the edge score of each triplet based on the trained model and the pretrained embeddings, then return the topk most relevant triplets along with their scores. An example return value of the top5 linkage scores looks like this::
Contributor:

it seems the description is kind of redundant.

remb = rel_emb[j]
temb = F.unsqueeze(tail_emb[k], dim=0)
edge = FakeEdge(hemb, temb, remb)
score = score_func.edge_func(edge)['score']
Contributor:

the difference between infer and edge_func is that infer computes scores in a batch fashion?

Contributor (Author):

infer will do broadcasting as (N, 1) * (1, N).
edge_func only does N * N.
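The distinction can be sketched with a toy 1-d TransE-style score (illustrative only): ``edge_func`` scores N aligned triplets one-to-one, while ``infer`` broadcasts N heads against M tails into an N x M score grid.

```python
def score(h, r, t):
    # toy 1-d TransE-style score: -|h + r - t|
    return -abs(h + r - t)

def edge_func(heads, rels, tails):
    # aligned: one score per (h, r, t) triplet
    return [score(h, r, t) for h, r, t in zip(heads, rels, tails)]

def infer(heads, rel, tails):
    # broadcast: an N x M grid of scores for a fixed relation
    return [[score(h, rel, t) for t in tails] for h in heads]
```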

@zheng-da zheng-da merged commit ff4da0b into awslabs:master Jun 11, 2020