KubeFATE FAQs
1. When launching a cluster using docker-compose, the Python container initialization restarts several times.
TBD
Log in to the machine where the cluster is located, and run: TBD
We plan to fix this issue in KubeFATE v1.9.0. Until then, if you are using Docker Compose to deploy a FATE cluster, please check the workaround here.
If run_toy_example fails, you may see logs like this:
2019-11-14 07:27:48,165 - task_executor.py[line:127] - ERROR: <_Rendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "172.18.0.8:8011: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
- First check that all containers are running normally, especially the egg, roll, and meta-service containers.
- The machine may be short on resources such as memory. Check the kernel log to verify whether the OOM killer was invoked.
- The storage-service of the egg service requires CPU instruction sets such as avx2. Please make sure your CPU supports these instructions; otherwise the storage-service will fail to start with the following error: Illegal instruction (core dumped)
  Use the following commands to check the log of storage-service:
  $ docker exec -it confs-xxxx_egg_1 bash
  $ cat storage-service-cxx/logs/error.log
- Decrease the processor_count of the egg service if the job is able to start but cannot finish.
  For Docker-Compose deployment:
  - Check the number of processors with
    $ cat /proc/cpuinfo | grep processor | wc -l
  - Log in to the egg container and update the config
    $ docker exec -it confs-xxx_egg_1 bash
    $ vi egg/conf/egg.properties
    set `eggroll.computing.processor.session.max.count` to the output of step 1
  - Restart the egg container
    $ docker restart confs-xxx_egg_1
  For k8s-deploy, change the egg setting (e.g. in namespace fate-9999) with
    kubectl edit configmap egg-config -n fate-9999
  By default, `eggroll.computing.processor.session.max.count` is set to 16. Change it to match your CPU processor count, then save it and restart the egg pod. The egg pod will restart by itself.
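The checks in the list above can be scripted. The following is a minimal sketch, not part of KubeFATE: the function names are hypothetical, `has_avx2` assumes an x86-style `/proc/cpuinfo` with a `flags` line, and `set_max_count` assumes the property is stored as `key=value` with no spaces around `=`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the troubleshooting steps; not shipped with KubeFATE.

# Count logical processors in a cpuinfo-style file (default: /proc/cpuinfo).
count_processors() {
  grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Succeed if the first "flags" line of a cpuinfo-style file lists avx2.
# Assumption: x86 layout; other architectures name this line differently.
has_avx2() {
  grep -m1 '^flags' "${1:-/proc/cpuinfo}" | grep -qw 'avx2'
}

# Print kernel log lines that suggest the OOM killer ran.
# Reads log text from stdin, e.g. `dmesg | oom_events`.
oom_events() {
  grep -iE 'oom-killer|out of memory' || true
}

# Set eggroll.computing.processor.session.max.count in a properties file.
# $1 = path to egg.properties, $2 = new value.
# Assumption: the file uses plain key=value lines.
set_max_count() {
  local key='eggroll.computing.processor.session.max.count'
  local key_re='eggroll\.computing\.processor\.session\.max\.count'
  if grep -q "^${key_re}=" "$1"; then
    # Rewrite through a temporary file so this works without `sed -i`.
    sed "s/^${key_re}=.*/${key}=$2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
  else
    printf '%s=%s\n' "$key" "$2" >> "$1"
  fi
}
```

Inside the egg container, something like `set_max_count egg/conf/egg.properties "$(count_processors)"` would apply the processor count in one step; the container still needs a restart afterwards.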
Q: How does KubeFATE support PodSecurityPolicy (PSP) of k8s?
A: You can enable PSP by adding the following configuration to cluster.yaml:
podSecurityPolicy:
  enabled: true
When deploying, KubeFATE will create PodSecurityPolicy and the corresponding role, rolebinding and serviceaccount resources.
Q: How do I work around Docker Hub's limit on the number of image pulls?
A: You can configure your own imagePullSecrets:
image:
  imagePullSecrets:
  - name: myregistrykey
Then create your own imagePullSecrets in the corresponding namespace. For details, refer to Use image pull secrets.
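As a rough sketch of that last step, the secret can be created with `kubectl create secret docker-registry`. The wrapper function below is hypothetical, the namespace and credentials are placeholders, and `https://index.docker.io/v1/` is assumed as the Docker Hub registry address.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: create a docker-registry secret for Docker Hub.
# $1 = namespace, $2 = secret name, $3 = username, $4 = password or access token
create_pull_secret() {
  "$KUBECTL" create secret docker-registry "$2" \
    --namespace "$1" \
    --docker-server=https://index.docker.io/v1/ \
    --docker-username="$3" \
    --docker-password="$4"
}
```

For example, `create_pull_secret fate-9999 myregistrykey <user> <token>` would create a secret whose name matches the `imagePullSecrets` entry shown above.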
Q: How to change the resource requirements for computing engines?
By "computing engines", we mean eggroll's nodemanager or spark's spark-worker.
These two components can have higher resource requirements than the others in the K8s cluster. In the Helm chart, we have set a default request for these two components: at least 2 CPU cores and 4GB of memory. For more information about resource requests, please check the official K8s documentation. In the cluster.yaml file, you can set a customized resource configuration for each component of FATE, including nodemanager and spark-worker.
Taking this cluster.yaml file as an example, we can add the lines below to enlarge the resource request for the spark worker:
spark:
  worker:
    replicas: 2
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
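Before raising requests, it may help to confirm that the nodes can actually satisfy them; with the values above, two replicas ask for 8 CPU cores and 16Gi of memory in total. The following sketch lists each node's allocatable CPU and memory. The function name is hypothetical and is not part of KubeFATE.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: show allocatable CPU and memory for every node.
node_allocatable() {
  "$KUBECTL" get nodes \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
}
```

If no node has enough allocatable resources for a single worker's request, the pod will stay in the Pending state.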