
Question about the implementation of 'camera-aware positional encoding' part #45

Open
d1024choi opened this issue Mar 20, 2023 · 0 comments


d1024choi commented Mar 20, 2023

First of all, thank you for sharing your great work with the community :)

According to your published paper, the camera location embeddings (tau_{k}) are subtracted from the map-view positional encodings (c^{n}) to form the map-view queries (c^{n} - tau_{k}).

However, I found in your code that the camera location embeddings (tau_{k}) are also subtracted from the camera positional embeddings (delta_{k,i}), which differs from Eq. 3. Please see the last two lines of the following code.

# -------------------------
# translation embedding, tau_{k}
# -------------------------
c = E_inv[..., -1:]                                                     # b n 4 1
c_flat = rearrange(c, 'b n ... -> (b n) ...')[..., None]                # (b n) 4 1 1
c_embed = self.cam_embed(c_flat)                                        # (b n) d 1 1

# -------------------------
# R_{k}^{-1} X K_{k}^{-1} X x_{i}^{(I)}
# -------------------------
pixel_flat = rearrange(pixel, '... h w -> ... (h w)')                   # 1 1 3 (h w)
cam = I_inv @ pixel_flat                                                # b n 3 (h w)
cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1)                     # b n 4 (h w)
d = E_inv @ cam                                                         # b n 4 (h w)
d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w)           # (b n) 4 h w
d_embed = self.img_embed(d_flat)                                        # (b n) d h w

# -------------------------
# Normalization for attention
# -------------------------
# TODO : why subtract c_embed?
img_embed = d_embed - c_embed                                           # (b n) d h w
img_embed = img_embed / (img_embed.norm(dim=1, keepdim=True) + 1e-7)    # (b n) d h w

Am I missing something?
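To make the discrepancy concrete, here is a minimal numpy sketch of the attention logit under each reading. The vectors are random stand-ins for illustration only, not the model's learned embeddings:

```python
import numpy as np

# Compare the key as written in Eq. 3 of the paper (delta_{k,i} alone)
# with the key as computed in the snippet above
# (delta_{k,i} - tau_{k}, then L2-normalized).
rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

c_n = rng.standard_normal(d)       # map-view positional encoding c^{n}
tau_k = rng.standard_normal(d)     # camera location embedding tau_{k}
delta_ki = rng.standard_normal(d)  # camera positional embedding delta_{k,i}

query = c_n - tau_k                # map-view query, same in both readings

key_paper = delta_ki                                     # Eq. 3 as written
key_code = delta_ki - tau_k                              # what the code computes
key_code = key_code / (np.linalg.norm(key_code) + 1e-7)  # normalization step

print("logit per Eq. 3:   ", query @ key_paper)
print("logit per the code:", query @ key_code)  # the two logits generally differ
```

So unless I misread the code, the two formulations give different attention logits in general, which is why I am asking whether the subtraction of c_embed on the image side is intentional.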
