Attempt to introduce a monolithic deployment with Helm #4858
base: main
Conversation
I know this PR is not ready; I created it in the scope of #4832, but I got stuck on getting things working.
Nothing strikes me as the cause of the "connection refused" error. Can you provide more surrounding log lines? I am trying to understand what it is trying to connect to. Also, can you confirm what this IP belongs to?
@dimitarvdimitrov It's the IP of the mimir-monolithic service. See the screenshot from Lens (the IP changed because it is a new installation).
I've had a bit of a play with this as our company is interested in deploying Mimir in monolithic mode. I think at the moment there's a bit of a catch-22 situation where Mimir's ready check depends on the service, but the service won't publish the IPs until Mimir is ready, so the /ready endpoint just keeps reporting a not-ready error. In microservices mode it looks like this is worked around by setting publishNotReadyAddresses: true on a headless service (line 18 in 548ca6a).
I think if you were to set publishNotReadyAddresses: true on a dedicated headless service and point the frontend worker at it, that should break the deadlock. Proposed changes:

monolithic-svc-headless.yaml:

```yaml
{{- if eq .Values.deploymentMode "monolithic" }}
apiVersion: v1
kind: Service
metadata:
  name: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless
  ...
spec:
  ...
  publishNotReadyAddresses: true
  ...
{{- end }}
```

values.yaml:

```yaml
mimir:
  frontend_worker:
    frontend_address: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}
```

Apologies if I'm completely off the mark, I only started looking into Mimir as a whole last week. Thanks for the fantastic work on this PR so far, deploying this in monolithic mode definitely simplifies things 👍
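For reference (not part of the PR), a headless Service with publishNotReadyAddresses rendered into plain Kubernetes YAML would look roughly like the sketch below; the name, namespace, selector labels and port are assumptions based on the defaults discussed in this thread. Because clusterIP is None, DNS resolves straight to the pod IPs, and publishNotReadyAddresses makes those IPs available before the pods pass their readiness checks, which is what breaks the catch-22.

```yaml
# Sketch only: approximate rendered output; names, labels and ports assumed from this thread.
apiVersion: v1
kind: Service
metadata:
  name: mimir-monolithic-headless
  namespace: mimir
spec:
  clusterIP: None                 # headless: DNS returns pod IPs directly
  publishNotReadyAddresses: true  # expose addresses even before /ready succeeds
  selector:
    app.kubernetes.io/component: monolithic
  ports:
    - name: grpc
      port: 9095
      targetPort: 9095
```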
@kieranbrown Did you test it with the proposed configuration? I applied your proposed changes and it's still failing with the exact same error message. Any other suggestions, @dimitarvdimitrov?
@rubenvw-ngdata apologies, there was a typo in my recommended changes. The frontend_address change should be here: https://github.com/rubenvw-ngdata/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L259, and it should be changed to:

```yaml
frontend_address: {{ include "mimir.resourceName" (dict "ctx" . "component" "monolithic") }}-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}
```

But to answer your question: yes, I've been running your recommended configuration. I'll attach my full values file below for reference. I've been running this configuration for about a week with no issues.

My full values file:

```yaml
deploymentMode: monolithic
fullnameOverride: mimir
image:
  tag: 2.9.0-rc.1
ingester:
  zoneAwareReplication:
    enabled: false
mimir:
  structuredConfig:
    alertmanager_storage:
      s3:
        bucket_name: <redacted>-alertmanager
    blocks_storage:
      s3:
        bucket_name: <redacted>-blocks
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.eu-west-2.amazonaws.com
          region: eu-west-2
    frontend_worker: # this is temporarily set until it's fixed in the upstream chart
      frontend_address: mimir-monolithic-headless.mimir:9095
    ingester:
      ring:
        replication_factor: 3
    limits:
      compactor_blocks_retention_period: 2y
      ingestion_burst_size: 100000
      ingestion_rate: 50000
      max_global_series_per_user: 1000000 # increased as all requests are under an 'anonymous' user
      max_label_names_per_series: 50 # increased as we breached the default of '30'
      out_of_order_time_window: 1h
    ruler_storage:
      s3:
        bucket_name: <redacted>-ruler
metaMonitoring:
  dashboards:
    enabled: true
    annotations:
      k8s-sidecar-target-directory: Platforming - Mimir
  prometheusRule:
    mimirAlerts: true
    mimirRules: true
  serviceMonitor:
    enabled: true
minio:
  enabled: false
monolithic:
  persistentVolume:
    size: 50Gi
  replicas: 3
  resources:
    limits:
      cpu: '2'
      memory: 8Gi
    requests:
      cpu: '1'
      memory: 8Gi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
  service:
    annotations:
      service.kubernetes.io/topology-aware-hints: Auto
  zoneAwareReplication:
    enabled: false
nginx:
  enabled: false
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<redacted>:role/ops-tooling/mimir
store_gateway:
  zoneAwareReplication:
    enabled: false
```
@kieranbrown Thanks for the input, that helped. I have it working now and will continue testing. Still hoping for official support from the Mimir team in the future.
@kieranbrown Do you use Mimir with Prometheus? How have you configured remote write in Prometheus? I'm still trying to get that working on my side.
We don't use Prometheus; we use a central OpenTelemetry collector to forward metrics to Mimir, although the config should be roughly the same. Your push endpoint should be something like http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push. This is just a complete guess based on a quick Google and I can't validate it because I don't have a Prometheus instance, but perhaps you want something like this:

```yaml
remote_write:
  - url: http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous
```
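For comparison, if you are routing through an OpenTelemetry collector as mentioned above, a minimal sketch of the collector-side equivalent (using the contrib prometheusremotewrite exporter; the endpoint and tenant header below are assumptions taken from this thread, not a verified setup) might look like:

```yaml
# Sketch only: OpenTelemetry Collector (contrib) exporting to the monolithic Mimir service.
receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  prometheusremotewrite:
    endpoint: http://mimir-monolithic.<namespace>.svc.cluster.local:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous   # tenant header; 'anonymous' when multi-tenancy is disabled

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```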
@rubenvw-ngdata I think the tricky part is how the target arg gets set for the monolithic container. There's 2 approaches we could take: we build the target arg based on the Helm values that already exist today (i.e. if a component is enabled, it is included in the target), or we just add a dedicated value to set the target explicitly. Keen to hear your thoughts.
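To make the two options concrete, here is a hypothetical sketch; none of these keys exist in the chart today, and the value name monolithic.target and the args wiring are assumptions for illustration only:

```yaml
# Option 1 (hypothetical): derive the -target flag from existing component values
# in the monolithic StatefulSet template, e.g.
#   args:
#     - "-target=all{{ if .Values.alertmanager.enabled }},alertmanager{{ end }}"
#
# Option 2 (hypothetical): expose the target directly as a chart value.
monolithic:
  target: all   # users could override, e.g. "all,alertmanager"
```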
The flexibility
@kieranbrown @dimitarvdimitrov I don't think it's as simple as you propose. I have only done a very quick test and this is my first impression.
Even though all data seems to get into Mimir correctly and I can query the data without issues, the Mimir logs are still full of errors.
I believe this is related to periodically renewing gRPC connections. Currently they are renewed every 2 minutes (values.yaml), so this log line shouldn't be too frequent. This helps to refresh connection pools with new pods. It should be harmless because the connection will be reestablished after the failure.
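For context, the connection-age settings being referred to live in the chart's default Mimir configuration. From memory the relevant block looks roughly like the sketch below; the exact durations and config location may differ between chart versions, so treat these values as assumptions rather than authoritative:

```yaml
# Sketch of the server section in the chart's default Mimir config (values may vary by version).
mimir:
  structuredConfig:
    server:
      grpc_server_max_connection_age: 2m        # forces clients to re-dial periodically
      grpc_server_max_connection_age_grace: 5m
      grpc_server_max_connection_idle: 1m
```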
@dimitarvdimitrov I'm not sure about your statement. An example of a successful query action in the Mimir logs:
I also don't understand why I should ignore error-level logs...
Sharing from a private message:
Force-pushed from 328fd50 to bbe5a1a
@kieranbrown Just a note to make you aware that we have moved the repository to be an organisation repository. The fork should be available at https://github.com/NGDATA/mimir/
Nice work @rubenvw-ngdata. Your fork doesn't have an 'issues' section, so I thought I'd mention this here: this chart doesn't support turning off zoneAwareReplication without running into validation errors, and it looks like there are other sections in the chart that hit the same validation. The _helpers.tpl has the validation logic in question. I checked each component for zoneAwareReplication, and narrowed it down to 'alertmanager', 'store_gateway', 'ingester' and 'monolithic'. If I explicitly set zoneAwareReplication.enabled: false for each of those components, the errors go away.
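For anyone hitting the same thing, the explicit opt-out discussed above would look something like this in values.yaml; this mirrors the working values file shared earlier in the thread, and whether the 'alertmanager' entry is needed depends on your setup:

```yaml
# Explicitly disable zone-aware replication per component.
alertmanager:
  zoneAwareReplication:
    enabled: false
ingester:
  zoneAwareReplication:
    enabled: false
store_gateway:
  zoneAwareReplication:
    enabled: false
monolithic:
  zoneAwareReplication:
    enabled: false
```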
@raffraffraff Thank you, and thanks for reporting the issue. I agree that Mimir should support this functionality. If more people start using this, they might consider it, so do not hesitate to spread the word. For the first issue, you are right: I have fixed some validations that I had forgotten. Since we are still working on the initial setup, I have squashed the changes into the same commit. Can you validate whether it is working correctly for you now? If you still encounter the second problem (I did not face it), can you share your configuration so I can help? I have opened up the issues section on the fork to allow logging issues.
Thanks @rubenvw-ngdata - I'll test it with the latest version of your code and create an issue in your project if it isn't working for me. EDIT: I pulled the latest code but don't see any significant differences that would resolve this. In the meantime I have created an issue on your project to track what I was experiencing. I'll update it as soon as I can test it (hopefully tomorrow).
I wrote a completely new Helm chart with a monolithic Mimir deployment. This is not production ready by any means! I can publish the chart if someone is interested in it.
Force-pushed from 0318ff5 to 203d1ab
Force-pushed from b068f97 to 8b07871
The same chart is used for monolithic as for the existing microservices deployment mode. To do this, a new parameter 'deploymentMode' has been added to the Helm chart. All templates are adapted to deploy services only when applicable for the chosen deployment mode. A new template is added that installs the monolithic Mimir container, and all nginx endpoints are updated to direct to the monolithic service if this deployment mode has been chosen.

The name of the chart has not been updated from 'mimir-distributed' to 'mimir' yet. This allows for easier maintenance as long as this is not merged; ideally the rename happens when merging, to avoid further confusion.

I attached a monolithic.yaml file that I have used to do a monolithic setup. In monolithic mode, not-ready addresses are published and used for the frontend. The read-write mode could be added in the future.

The alertmanager is not included in the monolithic deployment. If you want to run it, you can run it in a separate pod by enabling it. The zone-aware replication functionality is supported in the monolithic deployment with a similar configuration as required for the ingester or store gateway.

I have also added the functionality to enable compression for the monolithic chart. Ideally this would be embedded into the complete chart instead; a common configuration is probably the easiest, because if you want gRPC message compression, you probably want it everywhere. The common configuration setting could then also be used in the monolithic approach.
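Regarding the compression point above, a minimal sketch of what enabling gRPC message compression chart-wide could look like via structuredConfig is shown below; the choice of snappy and the specific client sections listed are assumptions for illustration, not necessarily what this PR implements:

```yaml
# Sketch only: enable gRPC compression on the main client paths (sections and value assumed).
mimir:
  structuredConfig:
    frontend_worker:
      grpc_client_config:
        grpc_compression: snappy
    ingester_client:
      grpc_client_config:
        grpc_compression: snappy
    query_scheduler:
      grpc_client_config:
        grpc_compression: snappy
```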
The CHANGELOG has just been cut to prepare for the next release. Please rebase on the latest main.
Ideally the same chart can be used for the monolithic, read-write and microservices deployment modes.
This allows an easier migration between modes when the load on a system changes.
To do this, a new parameter 'deploymentMode' has been added to the Helm chart. All templates are adapted to deploy services only when applicable for the chosen deployment mode.
A new template is added that installs the monolithic Mimir container. All nginx endpoints are updated to direct to the monolithic service if this deployment mode has been chosen. A minimal example of selecting the mode is sketched below.
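As a concrete illustration of the new parameter, a minimal values override for switching the chart into monolithic mode might look like this; the key names are taken from the values file shared earlier in this thread, and the replica count and storage size are placeholders:

```yaml
# Minimal sketch: select the monolithic deployment mode introduced by this PR.
deploymentMode: monolithic   # default remains the microservices layout
monolithic:
  replicas: 3
  persistentVolume:
    size: 50Gi
nginx:
  enabled: true   # nginx gateway routes all endpoints to the monolithic service
```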
I attached a monolithic.yaml file that I have used to do a monolithic setup.
Didn't fully get it working yet, but sharing it as it could be a starting point for a future implementation.
Currently I'm stuck on issues with my monolithic container, which reports errors that I don't really understand:

```
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.98.18.233:9095: connect: connection refused"
```

If anyone has a clue and is willing to help me out, let me know.
What this PR does
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]