LoxiLB with multi-AZ HA support in AWS don't working #813

Open
agixio opened this issue Sep 27, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@agixio

agixio commented Sep 27, 2024

Describe the bug
I followed all the instructions in the docs to deploy loxilb in multi-AZ HA mode on AWS, but my two loxilb instances time out:
2024-09-27 15:18:29 XSync netRPC Connect - 192.168.228.58:22222 :Fail(dial tcp 192.168.228.58:22222: i/o timeout)
2024-09-27 15:18:24 XSync netRPC Connect - 192.168.218.57:22222 :Fail(dial tcp 192.168.218.57:22222: i/o timeout)

(More errors are available at the Google Drive link below.)

To Reproduce
Follow the documentation here: https://github.com/loxilb-io/loxilbdocs/blob/main/docs/aws-multi-az.md

Screenshots
I have all screenshots at this link: https://docs.google.com/document/d/1DFvKcP8WCYQhhud0h9FWDaQoYm2iX417t7Uit1ObRcg/edit?usp=sharing

@agixio agixio added the bug Something isn't working label Sep 27, 2024
@TrekkieCoder
Collaborator

I haven't looked at the detailed logs yet, but it seems the loxilb instances can't communicate over the grpc channel. This is usually due to one of the following:

  • Make sure the AWS security policy of the instances allows TCP communication on port 22222.
  • Inside the loxilb nodes, there might be firewall/iptables rules preventing this. Kindly clear such entries and check again.
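
As a rough sketch of the first point, the inbound rule could be added with the AWS CLI. This assumes both loxilb instances share a single security group. The group ID below is a placeholder, and the helper name `sg_allow_self` is hypothetical. The command is printed rather than executed so it can be reviewed first.

```shell
# Placeholder ID: substitute the security group attached to the loxilb instances.
LOXILB_SG="${LOXILB_SG:-sg-0123456789abcdef0}"

# Build the AWS CLI call that lets members of one security group reach each
# other on a single TCP port. The call is only printed, not executed.
sg_allow_self() {
  echo "aws ec2 authorize-security-group-ingress --group-id $1 --protocol tcp --port $2 --source-group $1"
}

# Port 22222 is the sync port that times out in the logs above.
sg_allow_self "$LOXILB_SG" 22222

# For the second point, existing host firewall rules can be listed with:
#   sudo iptables -S
```

Copy the printed command into a shell with AWS credentials configured to apply it.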

@agixio
Author

agixio commented Sep 27, 2024

Indeed, I haven't done any of that. I'll test it and get back to you afterwards. Thanks!

@UltraInstinct14
Contributor

The doc did not mention setting up inbound security group rules for the loxilb instances. It has been updated here. You can check whether things are working by trying:

## From loxilb1
$ nc <loxilb2-instance-IP> 11111 -v
Connection to <loxilb2-instance-IP> 11111 port [tcp/*] succeeded!

$ nc <loxilb2-instance-IP> 22222 -v
Connection to <loxilb2-instance-IP> 22222 port [tcp/*] succeeded!

## From loxilb2
$ nc <loxilb1-instance-IP> 11111 -v
Connection to <loxilb1-instance-IP> 11111 port [tcp/*] succeeded!

$ nc <loxilb1-instance-IP> 22222 -v
Connection to <loxilb1-instance-IP> 22222 port [tcp/*] succeeded!
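
The same checks can be scripted so that both ports are tested in one go. This is a minimal sketch, not part of the loxilb tooling. The `check_port` helper and the `PEER_IP` variable are hypothetical names, and it assumes `bash` and `timeout` (coreutils) are available on the instance.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so nc does not need to be installed.
check_port() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# PEER_IP is a placeholder: set it to the other loxilb instance's private IP.
PEER_IP="${PEER_IP:-127.0.0.1}"

for port in 11111 22222; do
  if check_port "$PEER_IP" "$port"; then
    echo "${PEER_IP}:${port} reachable"
  else
    echo "${PEER_IP}:${port} NOT reachable"
  fi
done
```

Run it once from each instance with `PEER_IP` set to the other instance's address.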

@agixio
Author

agixio commented Sep 29, 2024

@UltraInstinct14 @TrekkieCoder Thank you very much for your help. Everything works better with all the rules in place, and now they can communicate properly. The HA status is also fine. However, after completing all the steps, the Elastic IP is still not associated with any of my EC2 instances, which is strange. I followed the exact same network architecture and used the same CIDR as in the tutorial.

We have an error talking to the kernel
2024/09/29 23:00:32 Serving loxilb rest API at http://[::]:11111
2024-09-29 23:00:33 [API] HA POST API called. url : /netlox/v1/config/cistate
2024-09-29 23:00:33 [API] Instance default New HA State : BACKUP, VIP: 0.0.0.0
2024-09-29 23:00:33 [CLUSTER] Instance default Current State NOT_DEFINED Updated State: BACKUP VIP : 0.0.0.0
2024-09-29 23:00:33 failed to get ENI intf name ()
2024-09-29 23:00:33 [API] Load balancer POST API called. url : /netlox/v1/config/loadbalancer
2024-09-29 23:00:33 [API] lbRules : {{35.180.153.68 192.168.248.254 55002 tcp 0 0 false false 2 1800 true  0   0 0 default_nginx-lb1 0} [] [{192.168.36.180 31630 1  }]}
2024-09-29 23:00:33 ep-host added 192.168.36.180_tcp_31630:0
2024-09-29 23:00:33 fullnat:suitable source for 192.168.36.180: 192.168.248.254
2024-09-29 23:00:33 nat lb-rule added - 1:dst-35.180.153.68/32,proto-6,dport-55002,-do-fullnat:eip-192.168.36.180,ep-31630,w-1,alive|
2024-09-29 23:00:33 Added cluster-peer 192.168.228.173
2024-09-29 23:00:33 [DP] LB rule 192.168.248.254 add[OK]
2024-09-29 23:00:35 inactive ep - 192.168.36.180_tcp_31630:tcp:31630(next try after 60s)
2024-09-29 23:00:35 [NLP] Link msgs subscribed

...

2024-09-29 23:00:42 [API] HA POST API called. url : /netlox/v1/config/cistate
2024-09-29 23:00:42 [API] Instance default New HA State : MASTER, VIP: 0.0.0.0
2024-09-29 23:00:42 [CLUSTER] Instance default Current State BACKUP Updated State: MASTER VIP : 0.0.0.0
2024-09-29 23:00:42 no loxiType intf found
2024-09-29 23:00:42 Get xsync()
2024-09-29 23:00:42 XSync netRPC - 192.168.228.173:22222 :Connected
2024-09-29 23:00:42 RPC - CT Xsync Remote-1
2024-09-29 23:00:42 cidrBlock (192.168.248.0/24) associate failed in VPC vpc-09d80cc6acffaad98:operation error EC2: AssociateVpcCidrBlock, https response error StatusCode: 400, RequestID: 6d3a8c7f-ccd7-4cdc-82d3-750360f3c8e9, api error CidrConflict: CIDR range conflicts with 192.168.0.0/16 with association ID vpc-cidr-assoc-0acaa95895001e808
2024-09-29 23:00:44 XSync netRPC - 192.168.228.173:22222 :Reset
2024-09-29 23:00:44 XSync netRPC - 192.168.228.173:22222 :Connected
2024-09-29 23:00:44 RPC -  CT Get 1
2024-09-29 23:00:44 CT Bcast
2024-09-29 23:00:44 [CT]  CTBcast Complete
23:01:02 TRACE loxilb_libdp.c:2592: ct: #192.168.218.226:35631 -> 192.168.0.2:53 (17)# rid:0 est:0 nat:0 (Aged:202194512

@TrekkieCoder
Collaborator

TrekkieCoder commented Sep 30, 2024

Sorry for the inconvenience. After trying the scenario again from the tutorial doc, it seems the document has not been updated for the latest versions. I will try to summarize what things need to be done differently.

  1. Run loxilb with a cloudcidrblock subnet that is not currently in use in the VPC. For example, with the 124.124.124.0/24 CIDR:
sudo docker run -u root --cap-add SYS_ADMIN \
  --restart unless-stopped \
  --net=host \
  --privileged \
  -dit \
  -v /dev/log:/dev/log -e AWS_REGION=ap-northeast-2 \
  --name loxilb \
  ghcr.io/loxilb-io/loxilb:latest \
  --cloud=aws --cloudcidrblock=124.124.124.0/24

Please note that the image to use is ghcr.io/loxilb-io/loxilb:latest. The image labeled "aws-support" has been discontinued.

  2. Change kube-loxilb.yaml to use a private CIDR VIP from the above CIDR subnet:
    - --externalCIDR=13.208.X.X/32
    - --privateCIDR=124.124.124.250/32
    - --setRoles=0.0.0.0
  3. Additionally, please edit the security groups of the EKS nodes to allow traffic from this CIDR subnet, because loxilb uses this subnet to send traffic to the EKS nodes in "fullnat" mode.
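
To illustrate the first step: the CidrConflict error in the earlier log is what AWS returns when the requested block (192.168.248.0/24) falls inside a CIDR already associated with the VPC (192.168.0.0/16). The sketch below is a hypothetical pre-check, not part of loxilb. The helper names `ip_to_int` and `cidr_overlaps` are made up for illustration, and it handles IPv4 only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if two IPv4 CIDR blocks share at least one address.
# Two blocks overlap exactly when they agree on the shorter of the two prefixes.
cidr_overlaps() {
  local net1 len1 net2 len2 len mask
  net1=${1%/*}; len1=${1#*/}
  net2=${2%/*}; len2=${2#*/}
  len=$len1
  if [ "$len2" -lt "$len" ]; then len=$len2; fi
  mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
  [ $(( $(ip_to_int "$net1") & mask )) -eq $(( $(ip_to_int "$net2") & mask )) ]
}

VPC_CIDR="192.168.0.0/16"   # existing VPC CIDR taken from the error log above
for candidate in 192.168.248.0/24 124.124.124.0/24; do
  if cidr_overlaps "$candidate" "$VPC_CIDR"; then
    echo "$candidate conflicts with $VPC_CIDR"
  else
    echo "$candidate does not overlap $VPC_CIDR"
  fi
done
```

The CIDRs currently associated with a VPC can be listed with `aws ec2 describe-vpcs --vpc-ids <vpc-id>` and fed into the same check.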

TrekkieCoder added a commit to TrekkieCoder/loxilb that referenced this issue Sep 30, 2024
UltraInstinct14 added a commit that referenced this issue Sep 30, 2024
PR: gh-813 Fixed unnecessary error logs
@agixio
Author

agixio commented Sep 30, 2024

@TrekkieCoder Thank you. No problem at all regarding the documentation being slightly out of date; I completely understand. I'm happy to help a bit to update it to make it accessible to everyone.
I followed your configuration exactly as provided, but it didn't work. I then tried various combinations of your arguments with those from the documentation, but I am still unable to attach the Elastic IP. I don't know where I went wrong.

[Screenshots: 2024-09-30 at 23:32:20, 23:34:08, 23:38:46]

Log trace link: https://docs.google.com/document/d/1ZFliJHdDZ3ruM30V6TxiICTCBxl1HsECS6B3hcVlZCg/edit?usp=sharing

@TrekkieCoder
Collaborator

Hi @agixio ,

Your log trace file can't be opened because it asks for permission.

https://docs.google.com/document/d/1ZFliJHdDZ3ruM30V6TxiICTCBxl1HsECS6B3hcVlZCg/edit?usp=sharing

@agixio
Author

agixio commented Oct 2, 2024

My bad: https://docs.google.com/document/d/18xf9R6WpDWyzmxnzM_sOny9gP2S-hwOpIkPgPahYkN4/edit?usp=sharing

@TrekkieCoder
Collaborator

I double-checked the logs, but there are no logs related to service creation?

@agixio
Author

agixio commented Oct 2, 2024

@TrekkieCoder: My service doesn't work, but sorry, I sent you the old log from the deprecated Docker image. I have updated everything, with all the pictures:

https://docs.google.com/document/d/1MEkQzmVEvkiOgLSS6OnqyAVaZLnNgrl2-I8q6KDTipg/edit?usp=sharing

TrekkieCoder added a commit to TrekkieCoder/loxilb that referenced this issue Oct 3, 2024
UltraInstinct14 added a commit that referenced this issue Oct 3, 2024
PR : gh-813 Fixed logs level verbosity