This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Allow instance number to be passed in as an environment variable #1242

Closed
SEJeff opened this issue Feb 25, 2015 · 65 comments

Comments

@SEJeff

SEJeff commented Feb 25, 2015

Say I have a Docker container, i.e. kafka:0.8.2.0, and want to run it under Mesos. In Marathon terminology, for each app, I want 10 instances. I need an integer that is unique amongst all instances of that app, but only within that app.

Currently I've got a start script in python which does terrible black magic along the lines of:

import hashlib, os, random

sha = hashlib.sha1((os.environ.get('HOSTNAME', '') + str(random.randint(1, 100000))).encode())
default_broker_id = 10000 * int(sha.hexdigest()[:5], 16) // int('10000', 16)

This gives me a unique integer I can pass in as a Kafka broker id. However, I'm having Marathon start up 10 instances of said brokers. It would be super nice if the instance number were passed from Marathon to the container. Then the above code could be more like:

default_broker_id = int(os.environ.get('INSTANCE_NUMBER'))

That way I get a per-app unique integer for each instance of said app. It seems like this wouldn't be super difficult to expose.

Thoughts?

@kolloch
Contributor

kolloch commented Apr 2, 2015

Hi @SEJeff,

thanks for your idea. That seems to be a pretty specific use case and I am not sure that I understand it well. Please elaborate if you think I am missing something crucial.

We do export the Mesos Task ID as the environment variable MESOS_TASK_ID. Of course, that is unique cluster-wide and not an integer.

As of now, I would like to close the issue.

@kolloch kolloch closed this as completed Apr 2, 2015
@lusid
Contributor

lusid commented May 31, 2015

I need something like this for creating volumes in networked storage that are immediately available when a crashed instance comes back up on another server. Right now, if an instance comes up on a different server, the data isn't available. If I could identify individual instances of a single application, this would become insanely easy. At the moment, I have two options: no shared filesystem, or using the same shared path for all instances, which works for some applications but not for most of them.

@kolloch
Contributor

kolloch commented Jun 1, 2015

Hi @lusid, can you elaborate, please? What do you mean by identifying the instances?

MESOS_TASK_ID uniquely identifies a task.

https://mesosphere.github.io/marathon/docs/task-environment-vars.html

@lusid
Contributor

lusid commented Jun 1, 2015

Let's say I have 10 physical servers. I run an app that creates a Docker container with 5 instances that are constrained uniquely by hostname. I want each instance to attach to a volume on the physical server that includes the number of the instance in the scaling group that it represents.

Instance 1 = ID 1
Instance 2 = ID 2
... and so on

Now, let's say Instance 3 dies and is recreated on a completely different server. I want that newly created instance to be able to take over the storage volume it created originally before it died. If I used the MESOS_TASK_ID, then I get a completely unique ID that is in no way related to the previous task that died.

Because I have a networked file system between all servers in my cluster, this would basically solve the problem of not being able to locate data when a crashed instance returns on a completely different server, especially when the data stored by each instance must be stored in a different location to avoid data corruption.
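
A minimal sketch of what this could look like, assuming a hypothetical INSTANCE_NUMBER variable existed (the app name and mount path below are made up for illustration):

import os

instance = os.environ.get("INSTANCE_NUMBER", "0")      # hypothetical variable
data_dir = "/mnt/shared/myapp/instance-%s" % instance  # per-instance directory on the networked filesystem
os.makedirs(data_dir, exist_ok=True)                   # a replacement task finds its predecessor's data here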

@kolloch
Contributor

kolloch commented Jun 1, 2015

Let me summarize: You want all tasks to have some kind of sequentially assigned ID. If a task fails, you want its replacement task to get the same sequentially assigned ID. So that if you specify "instances": 10, you want to make sure that you always have tasks with the IDs 1-10 running somewhere. You assign network volumes using these IDs. Thus you always have one task per network volume.

@mhausenblas / @air What's our current best practice for dealing with persistence in cases like this?

@lusid
Contributor

lusid commented Jun 1, 2015

Correct. It would be nice if it worked automatically with scaling to different sizes as well, but I can see where difficulties would start appearing in those instances.

I've been thinking about this a lot, and having an ID like this is the only thing I've been able to come up with for this use case. I have no idea how anyone runs long running processes on Marathon in its current state when they require persistence and when they aren't scaling to the full capacity of the cluster. If I could find a reliable alternative that worked in most cases, I would be happy. I would prefer to not have to constrain an app to X machines by hostname, and never be able to scale them up further than that.

I'm sure I'm missing something, but it is driving me crazy. As soon as I need to store persistent data, all the awesomeness of Marathon starts to turn into crazy tedious Bash hacking tricks, or constraining myself to one machine which defeats the purpose altogether.

@kolloch
Contributor

kolloch commented Jul 27, 2015

This has come up again a number of times. Maybe this idea has more applications than I originally thought.

@kolloch kolloch reopened this Jul 27, 2015
@air
Contributor

air commented Aug 9, 2015

Another use case where having a strong 'I am instance N of M' identity is useful: Cassandra nodes. e.g. instances 1 and 2 know that they are the leaders (their instance numbers are lowest) and configure themselves as seeds.
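
A sketch of that seed election, assuming a hypothetical MARATHON_APP_INSTANCE_NUMBER variable were available:

import os

instance = int(os.environ.get("MARATHON_APP_INSTANCE_NUMBER", "0"))  # hypothetical variable
is_seed = instance < 2  # the two lowest-numbered instances configure themselves as seeds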

I'm not convinced Marathon is the right level to provide this level of guaranteed identity. It seems like something a minority of apps would benefit from - the implementation weight would be wasted on other apps that scale horizontally with true independence.

Marathon's current guarantee is, 'I'll run N of these for you and uniquely identify them' - but they are cattle, and the sense of 'being instance 5' is not carried over if #5 dies and is replaced. That feels like more of a pet.

Technically do we see difficulties? I wonder if - in the event of network partitions, restarts - we might run into issues where e.g. there are two 'instance 5s'.

@aquamatthias aquamatthias added this to the Backlog milestone Aug 12, 2015
@cberry777

I am +infinity for this feature. I can think of several places where having an instance number would make things much simpler.

Most prominent for me is monitoring. Let’s say that I have 7 Foos. Typically I’d then want to see a graph with 7 Foo metrics (lines) that I can compare and contrast. The fact that they are ephemeral doesn’t really matter. Conceptually I have 7 Foos — that may move about. I don’t want to see disjointed (and likely different colored) lines and multiple instances on the legend of my graph. I want to see 7 lines. And if I spot an anomaly I want to be able to overlay “event bars” that show me when an instance moved. Something like: “Whoa, what happened to 7? Oh, it flipped onto that spotty server…”

And more importantly, a named instance (what we are asking for with “lasting instance numbers”) helps to keep the number of metric datasources from becoming ridiculously large. Rather than having a zillion instances in the history of a given metric, I can have 7. I.e. 7 datasources versus a brand new one every time that Docker instance is redeployed.

In fact, we have had to create exactly this capability (instance numbers) on top of mesos/marathon, which is a real PITA.
Make sense?

Honestly, I believe that most people think in terms of "instances of services”.
It gives us something to hang our hat on. We say "Node 7 is acting up — what’s up with that." We don’t say "Node bfa1c68ce497 is acting up” — particularly when that name changes every time we redeploy a new version!

Yeah sure, maybe you don’t name your cattle like that. But really, I think we all kinda do. (I’ve never raised cattle. But I have raised chickens, and while they certainly weren’t pets, I could tell them apart. And it was the same chicken whether it was in the yard or in the coop :~)

Most of us don’t run 1000s of Foos. We run 10s or 100s. And clustered solutions (e.g. elasticsearch, cassandra, a bank of proxy servers, …) often want us to conceptually identify Nodes, so we can do things like traffic shaping (e.g. hot spots are routed to specific Nodes, etc). I don’t particularly care where Node 7 lives, but it is servicing only XYZ or is operating on this set of Shards.

I like to think of these things as workers — not cattle vs. pets. (Personally, I think the whole cattle/pets analogy misses the mark somewhat.) My workers should be relatively interchangeable — think check-out people at the big-box store. But they do have names. And if Bill is working aisle 1 today and aisle 2 tomorrow, I don’t care. But I do care about Bill’s productivity, or whether he died last night. And it would be problematic if every time Bill worked a different aisle, he had a different name…

Thanks,
— Chris

@jamesbouressa

Instances may live like cattle, but we need to treat them a little bit like pets when they get ill. Even real cattle are numbered.

Even if this were only to function as a way to make it easier for humans to keep instances straight in their heads for a few minutes, it would be worth the effort (as the DNS was, and for much the same reason). Human-friendly naming imposes no burden upon automation, and it eases the cognitive load on the humans involved.

@air
Contributor

air commented Sep 14, 2015

Hey @BenWhitehead, have you thoughts on this? We were discussing similar issues recently.

@eyalfink

+1
While the 'put N identical replicas' pattern to increase the load capacity of your service is common, there is a no less common pattern of 'shard your data into N pieces and put an instance per shard'.
In fact I'm quite sure that the latter is more common for services which deal with a lot of data/computation that needs to be served at low latency (e.g. search engines of different sorts).

Without this requested feature, is there a way to have an instance know its 'shard id' and load its own data when coming up?

@kolloch
Contributor

kolloch commented Sep 16, 2015

@air, @mwasn This should be reasonably easy to implement. Since there is only one Marathon instance that is currently leader and starting tasks, there should be no problems with network partitions (except of course those unrelated to this feature, e.g. that we don't restart tasks in that case).

Implementation Proposal:

  • Save the instance number for every task as part of the MarathonTask protocol buffer structure
  • Assign the lowest available instance number for new tasks (start at 0)
  • Create the associated MARATHON_APP_INSTANCE_NUMBER environment variable in the TaskInfo for starting the task
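
A minimal sketch of the lowest-available-number rule above (illustrative only; the MarathonTask protobuf handling and TaskInfo construction are not shown):

def next_instance_number(numbers_in_use):
    # Return the lowest non-negative integer not currently assigned to a task.
    n = 0
    while n in numbers_in_use:
        n += 1
    return n

# Example: tasks 0, 1 and 3 are running, so the next task gets number 2.
assert next_instance_number({0, 1, 3}) == 2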

@sielaq
Contributor

sielaq commented Sep 16, 2015

@kolloch does it mean that when an application is restarted (or killed), Marathon will remember the INSTANCE_NUMBER?
And in the case of a restart: will both the new and old instance have the same INSTANCE_NUMBER for a small amount of time?

@BenWhitehead
Contributor

Conceptually it's not difficult for marathon to pass a value as an environment variable. The complicated part is what that value should be and what is done with it during failure scenarios.

For example, what should the instance number be for new instances that are being started for an updated app? Once the new instances are healthy the old ones will be torn down; should those numbers be re-used, or is it safe to abandon them? If numbers are supposed to be re-used, what are the semantics around re-use?

Here is a more concrete example that exposes some questions:
Imagine you're trying to run 5 Kafka brokers via Marathon. Each broker needs a unique id. Once an id is defined for a broker, that id has to stick around, since the data that has been written to disk is directly associated with that broker and its id. This means Marathon would have to keep track of this new metadata and the corresponding association ("broker id 4 maps to mesos slave slave-a.mesos"), which is not something it currently does. Assuming it could, there are many more challenges that arise when dealing with failure cases. In addition to keeping track of the id for the slave, Marathon now has to change its offer evaluation code to effectively constrain "restarting a lost task" to only restart on the slave that was previously running it.

Managing state in distributed systems is a very challenging thing to do well. Marathon (currently) is first and foremost a system for running stateless applications. If your application has a lot of complex state that needs to be managed/coordinated, it would be a good idea to look into what it would take to write a Mesos framework, where you will have full control over managing the specific considerations of your app. This is why there are frameworks specific to Kafka, Cassandra, HDFS and other stateful apps.

If you're asking for anything more than "In the history of my app, what task number am I?", I don't think it's a good idea for Marathon to support it. Marathon already creates a task id that is available as the environment variable MESOS_TASK_ID and can be used to identify tasks. This task id is a UUID, so it can be identified uniquely across the whole cluster and its lifetime.

To the point about pets vs. cattle vs. Bill: from the standpoint of Mesos, the thing running here is a task. Attempting to further map the analogy to Mesos, Bill is a worker (mesos-slave) whose resources, when available (he's at work), are used to perform a task (checking out customers). This task has the same shape of work day-to-day but it is not exactly the same every day. It could also be argued that Bill is a pretty stateless task that could easily be taken over by someone else if Bill were no longer able to perform his task for the day (sickness, break, etc.).

@eyalfink

If I understand your concern correctly, I think the problems you are raising can be overcome by leaving these things at the application level instead of pulling them into the framework (Marathon) level. Instead of supporting a "which task number am I" via the framework, just let the job creation API specify small variations between the replicas' args. For example:

{
    "id": "/product/service/myApp",
    "instances": 3,
    "cmd": "cp /path/to/remote/data/shard_$INSTANCE_NUMBER /local/data && run_my_service --data /local/data",
...

And let $INSTANCE_NUMBER be replaced with a running number for each instance.
Now we've defined 3 Mesos tasks which are similar but not the same, so if one dies it's clear what needs to be rerun.
It's also clear that it's the application's responsibility to make sure it correctly handles restarts, or the coexistence of task replicas due to a loss of communication.
In your broker example I would expect the application to deal with the association of the data with ID 4, by writing it to a network location and fetching it from a new task with ID 4, or by being able to recreate it if needed.
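
An application-side sketch of that idea (INSTANCE_NUMBER is the hypothetical variable from the example above; paths are made up):

import os
import shutil

shard = os.environ.get("INSTANCE_NUMBER", "0")               # hypothetical variable
remote = "/path/to/remote/data/shard_%s" % shard
shutil.copytree(remote, "/local/data", dirs_exist_ok=True)   # fetch only this replica's shard
# ... then start the service against /local/data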

@kolloch
Contributor

kolloch commented Sep 21, 2015

@sielaq: What I specified would actually not reuse the MARATHON_APP_INSTANCE_NUMBER of a running task. But that might be a problem. I guess after you have upgraded an app successfully, you would expect:

  • That all tasks of the app have a MARATHON_APP_INSTANCE_NUMBER assigned which is unique among the tasks with the same configuration. That allows clashes with tasks of an app version which has a different configuration (e.g. an updated docker image).
  • There exists a MARATHON_APP_INSTANCE_NUMBER for every value in [0; instances-1]. That means that there are no "holes" in the ID assignment.

This would be achievable by updating the rules I provided above to only consider tasks with the same configuration.

It would NOT ensure that a task with a certain MARATHON_APP_INSTANCE_NUMBER is respawned on the same slave on upgrade or failure.

There are plans in Marathon to use "dynamic reservations" to allow sticky tasks that are restarted in-place on failure or upgrade. It would definitely be nice if the MARATHON_APP_INSTANCE_NUMBER were preserved in this case. But I would consider that a distinct issue.

@xargstop

xargstop commented Nov 5, 2015

+1

@drewrobb
Contributor

drewrobb commented Nov 5, 2015

+1, my use case is limited to monitoring as well. We just want a way of enumerating app tasks that doesn't have duplicates but is otherwise as small as possible. I think it is also worth noting that when scaling down, this feature would mean that the highest-numbered tasks would need to be terminated first. That might make satisfying placement constraints difficult. Also, when deploying, if upgradeStrategy.maximumOverCapacity > 0 you have a problem. (I wouldn't actually care about these aspects of correctness, but I'd assume others would.)

@cberry777

I agree that “instance numbering” is not all that simple when one considers failure scenarios, but I don't think that that necessarily means it isn’t worth doing. In fact, I think that a “best effort” solution is completely adequate. I have enumerated some scenarios below.

For those apps that don’t care about instance-naming, they can simply ignore it altogether.

Also, I believe that “host affinity” is a separate concern (although related by the common underlying use case), though I do think it is another valuable addition to the ecosystem.

AFAICT, the different scenarios for instance numbering are as follows:
Let’s assume the initial mapping: 1=>ab, 2=>bc, 3=>cd

A) we scale down: bc is destroyed. (2 is now free)
So 1=>ab, 3=>cd

B) we scale back up, Add de and ef. We reuse the free slot (2) and add a new one.
So 1=>ab, 2=>de, 3=>cd, 4=>ef

C) de & ab die, and are replaced by aa, bb
So 1=>aa, 2=>bb, 3=>cd, 4=>ef

D) Version X --> Y
D1) Spin up Y:
1=>aa, 2=>bb, 3=>cd, 4=>ef, 5=>cc, 6=>dd, 7=>ee, 8=>ff
D2) Bring down X
5=>cc, 6=>dd, 7=>ee, 8=>ff
that would result in 2X the buckets...
next roll would reuse 1,2,3,4
but if you "roll in place" then it would stay 1,2,3,4 always

Optionally, if you have blue/green
Blue : 1=>aa, 2=>bb, 3=>cd, 4=>ef
Green: 5=>cc, 6=>dd, 7=>ee, 8=>ff
and we "flip" Blue to Green with a "live” alias (as is often done in Elasticsearch)

About Mesos frameworks (per Ben's comment above) — my problem with them is that they are often not layered on top of Docker. They use whatever OS, JVM, etc. is already host-resident. Docker’s promise and raison d’être is to bring repeatability all the way down to the OS level. We have all been bitten by an OS that has a different set of patches, or has swap turned on, etc. IMHO, when we step away from that vision, it is a step backwards.

Cheers,
— Chris

@rasputnik
Contributor

It sounds like people are discussing two different use cases here. I'd also dearly love a way to get metrics consolidated for an app rather than at either the task or slave level.

But re-using instance numbers seems a bit wrong - taking the cattle/pet analogy, this is akin to renaming your new cat 'Mr Tiddles' because that was the old cat's name.

Doesn't anyone else think it might be confusing to operators to notice Mr Tiddles suddenly grew his leg back and lost 10 pounds?

@memelet

memelet commented Nov 12, 2015

I am looking for INSTANCE_NUMBER to be able to assign the correct Flocker volume. Maybe this will be handled some other way soon?

@bydga

bydga commented Dec 9, 2015

Hi, I think this would be a really nice feature and I have another use case:

We are logging app metrics (CPU, RSS, event-loop hangs, etc.) into Graphite. Our app is usually a long-running service with a stable instance count of between 2 and 8 instances. The metrics mentioned definitely need to be logged per instance (= per task in Marathon terminology). And when one of the tasks fails/restarts/whatever, we want the line in Graphite to continue. We definitely don't want to have hundreds of metrics in Graphite (it's difficult to read them and it takes too much disk space).

So this feature would be really helpful - one sequential number that gets recycled (if it's free) when a new task starts.
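
A sketch of how such a number could keep Graphite series stable (MARATHON_APP_INSTANCE_NUMBER is the proposed, not-yet-existing variable; MARATHON_APP_ID already exists):

import os

app = os.environ.get("MARATHON_APP_ID", "/myapp").strip("/").replace("/", ".")
instance = os.environ.get("MARATHON_APP_INSTANCE_NUMBER", "0")   # proposed variable
metric_prefix = "apps.%s.instance-%s" % (app, instance)
# e.g. "apps.myapp.instance-3.cpu" stays the same series when the task is replaced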

@Radek44

Radek44 commented Feb 20, 2016

Adding a scenario in line with this request.
We have tasks deployed with Marathon that we want to scale but with the caveat that when we scale a given process we need to make sure it talks to a given queue (without going into details, ordering of items in the queue is important so only 1 process at a time should be consuming the queue):
For example let’s say we have 2 queues:

  • queue.01
  • queue.02

We now deploy a task called queue.consumer

We want to scale queue.consumer using Marathon to 2 instances.
But now we would want to make sure that queue.consumer-Instance1 talks to queue.01 and queue.consumer-Instance2 talks to queue.02

It would be great if there was a way in Marathon to either:

  1. Get the information on the task itself (from an env variable for example) on which instance number it is (1 or 2)
  2. Pass a dynamic env variable on scaling, for example by setting a script that sets ENV_QUEUE_TO_LISTEN to queue.{%i}, where {%i} is the number of the instance
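
A sketch of option 1, assuming the proposed instance-number variable existed:

import os

instance = int(os.environ.get("MARATHON_APP_INSTANCE_NUMBER", "0"))   # proposed variable
queue_name = "queue.%02d" % (instance + 1)   # instance 0 -> queue.01, instance 1 -> queue.02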

@air
Contributor

air commented Mar 19, 2016

Also see Cardinal Service idea in Kubernetes kubernetes/kubernetes#260 (comment)

...which on further reading became the PetSet proposal https://github.com/smarterclayton/kubernetes/blob/petset/docs/proposals/petset.md

@cherweg

cherweg commented Apr 12, 2016

+1

3 similar comments
@Krassi10

+1

@harpreet-sawhney

+1

@samwiise

+1

@krestjaninoff

+1

@air
Contributor

air commented Apr 29, 2016

Good news everyone! This is officially on the radar and we'll look at prioritizing it. Thank you for all the excellent use case examples. Internal tracker https://mesosphere.atlassian.net/browse/MARATHON-983

@jdef
Contributor

jdef commented May 2, 2016

has anyone tried using https://github.com/spacejam/zk-glove to coordinate/track the number of instances? it may be trivial to hack up zk-glove to provide an INSTANCE_ID environment variable
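
A rough application-side sketch of that kind of ZooKeeper coordination (this is not zk-glove's API; it uses the kazoo client, and the paths/app name are made up):

import os
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts=os.environ.get("ZK_HOSTS", "localhost:2181"))
zk.start()

def claim_instance_number(app="myapp", max_instances=10):
    for i in range(max_instances):
        try:
            # Ephemeral node: the claim disappears when this task's session ends,
            # so a replacement task can later claim the same number.
            zk.create("/instance-ids/%s/%d" % (app, i), ephemeral=True, makepath=True)
            return i
        except NodeExistsError:
            continue
    raise RuntimeError("no free instance number")

instance_id = claim_instance_number()
os.environ["INSTANCE_ID"] = str(instance_id)   # visible to anything this wrapper script starts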

@reachbach

reachbach commented Jun 1, 2016

@air any update on this issue? This is quite an important requirement for our production use case.

@air
Contributor

air commented Jun 22, 2016

Copying from marathon-framework post:

The next release is mostly planned at this point, so this feature would be a couple of months out realistically.
We'll be publishing a roadmap shortly to help make this transparent.

Two things in the interim,

  1. I would love to know if the workaround suggested here is workable, can you try it?
  2. We would love to see more community contributions! A PR with a design proposal for this would be welcome.

@SEJeff
Author

SEJeff commented Jul 14, 2016

For what it's worth, with PetSets, Kubernetes now supports this exact thing. Clearly there is demand for a feature such as this, or there wouldn't be so many comments on this issue. It makes managing stateful services much nicer.

Pet Sets provides the following capabilities:

A stable hostname, available to others in DNS. Number is based off of the Pet Set name and starts at zero. For example cassandra-0.
An ordinal index of Pets. 0, 1, 2, 3, etc
... snip ...

@air
Contributor

air commented Jul 15, 2016

Thanks @SEJeff - spotted that a few comments back. This feature is on the backlog and awaiting prioritization - it's officially a good idea!

@isavcic

isavcic commented Aug 22, 2016

+1

1 similar comment
@psyhomb

psyhomb commented Aug 22, 2016

+1

@yuefengz

+infinity

The args section in the container spec can probably help to work around it. But we may have to create N very similar apps/containers, just with different ids in their args sections. I am wondering whether there is a more elegant way to handle this.

@raghu999

+1

@isavcic

isavcic commented Nov 2, 2016

"Good news everyone! This is officially on the radar and we'll look at prioritizing it." ...from April 2016.

Are there any updates on this?

@isavcic

isavcic commented Jan 30, 2017

Can someone be assigned to this issue, pretty please? By the way, Kubernetes already improved upon the initial PetSet concept in the form of StatefulSet.

@jmgpeeters

+1. Would be a great feature.

@rtoma

rtoma commented Feb 15, 2017

This ticket was created on Feb 25 2015, so it's almost 2 years old.

Meanwhile Kubernetes came along and implemented this months ago. @air @jdef do you have any idea how we can give this issue a higher priority?

@fpapleux

+1 - configuring app clusters across multiple Docker containers, for all kinds of reasons, would benefit from this feature.

@meichstedt
Contributor

Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-3602. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.

@d2iq-archive d2iq-archive locked and limited conversation to collaborators Mar 27, 2017