Fixes #509 - Flag for reviveOffers and the duration for which to reject offers #510

Merged
merged 1 commit into from
Aug 11, 2015

gkleiman
Member

@gkleiman gkleiman commented Aug 6, 2015

@ConnorDoyle @elingg @brndnmtthws

NOTE: I plan to add some tests tomorrow, but I would really appreciate any feedback in the meantime. The same change has already been made in Marathon; please see mesosphere/marathon#1931.

This PR adds the following command line flags:

--revive_offers_for_new_jobs: if true, reviveOffers is called whenever a new
job is added to the TaskManager.

Note that Mesos only filters offers that are a strict subset of a
rejected offer.

--decline_offer_duration: configures the duration for which unused
offers are declined.

--min_revive_offers_interval: if --revive_offers_for_new_jobs is specified,
reviveOffers is not called more often than this interval. It defaults to 5
seconds.
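
To illustrate why these flags interact, here is a minimal sketch of decline filters for a single framework. It is a simplified model, not the Mesos API: `DeclineFilter`, `OfferFilterModel`, and the map-based resource representation are hypothetical names introduced for illustration. The sketch assumes that an offer is withheld when every resource in it is covered by an unexpired declined offer (an identical offer is treated as covered, which is an assumption of this model), and that reviveOffers removes all filters.

```scala
import scala.concurrent.duration._

// Hypothetical, simplified model of one framework's decline filters.
// Resources are reduced to a name -> amount map; this is not the Mesos API.
final case class DeclineFilter(resources: Map[String, Double], expiresAtMillis: Long)

final class OfferFilterModel {
  private var filters = List.empty[DeclineFilter]

  // Declining an offer installs a filter that lasts for `duration`
  // (the role played by --decline_offer_duration).
  def decline(resources: Map[String, Double], duration: FiniteDuration, nowMillis: Long): Unit = {
    filters = DeclineFilter(resources, nowMillis + duration.toMillis) :: filters
  }

  // Reviving offers removes every filter the framework has set.
  def revive(): Unit = {
    filters = Nil
  }

  // An offer is withheld while an unexpired filter covers all of its resources.
  def isFiltered(offer: Map[String, Double], nowMillis: Long): Boolean =
    filters.exists { f =>
      f.expiresAtMillis > nowMillis &&
        offer.forall { case (name, amount) => f.resources.getOrElse(name, 0.0) >= amount }
    }
}
```

In this model, a long decline duration keeps smaller offers away until the filter expires, unless `revive()` is called first. That is the situation the revive flag addresses when a new job is added.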

@ConnorDoyle
Contributor

One important thing to add is a call to reviveOffers on registration and re-registration. See: mesosphere/marathon@d3560b4...marathon-0.9.1-backports#diff-27dd4ebda88590d649eae0d05e3a6b47R34

@gkleiman
Member Author

gkleiman commented Aug 6, 2015

@ConnorDoyle
Contributor

Oops, I missed it. Great!

@gkleiman gkleiman force-pushed the gk/fix_#509_add_declineoffer_filters branch from 4cab5cb to db8e1ae Compare August 7, 2015 13:12
@gkleiman
Member Author

gkleiman commented Aug 7, 2015

Added some unit tests.

@gkleiman gkleiman force-pushed the gk/fix_#509_add_declineoffer_filters branch 4 times, most recently from c555b2b to 52a53ae Compare August 7, 2015 14:06
@@ -161,6 +164,10 @@ class TaskManager @Inject()(val listeningExecutor: ListeningScheduledExecutorSer
        val job = jobOption.get
        jobsObserver.apply(JobQueued(job, taskId, attempt))
      }

      if (config.reviveOffersForNewJobs()) {
        mesosOfferReviver.reviveOffers()
Member

Maybe add a counter here? From the metrics registry.

@brndnmtthws
Member

Code looks good. Can you comment on how this has been tested? Aside from the unit tests (thanks for those, btw).

@gkleiman gkleiman force-pushed the gk/fix_#509_add_declineoffer_filters branch from 52a53ae to 71a4b2b Compare August 10, 2015 16:26
@gkleiman
Member Author

@brndnmtthws

Thanks for having taken the time to review this.

The new actor debounces reviveOffers calls, so I added two counters: reviveOffersRequest and reviveOffers. The first counts the number of times a reviveOffers call was requested, and the second counts the number of times reviveOffers was actually called on the Mesos driver.
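
The relationship between the two counters can be sketched as follows. This is not the actor from this PR: it swaps in a plain class with explicit timestamps so the logic is easy to follow. Apart from the two counter names, every identifier (`ReviveOffersDebouncer`, `request`, `flush`, the `revive` callback) is a hypothetical name chosen for illustration.

```scala
import scala.concurrent.duration._

// Hypothetical sketch of debounced revive calls; not the PR's actor.
// `revive` stands in for the call made on the Mesos driver.
final class ReviveOffersDebouncer(minInterval: FiniteDuration, revive: () => Unit) {
  var reviveOffersRequest = 0L // number of times a revive was requested
  var reviveOffers = 0L        // number of times `revive` was actually invoked

  private var lastReviveMillis = Option.empty[Long]
  private var pending = false

  // Called whenever a new job is added.
  def request(nowMillis: Long): Unit = {
    reviveOffersRequest += 1
    pending = true
    flush(nowMillis)
  }

  // Called by `request` and, in a real system, by a timer. Fires a pending
  // revive once the minimum interval since the last call has elapsed.
  def flush(nowMillis: Long): Unit = {
    val intervalElapsed =
      lastReviveMillis.forall(last => nowMillis - last >= minInterval.toMillis)
    if (pending && intervalElapsed) {
      pending = false
      lastReviveMillis = Some(nowMillis)
      reviveOffers += 1
      revive()
    }
  }
}
```

With a 5 second interval, three requests arriving at 0 s, 1 s, and 2 s raise reviveOffersRequest to 3 but invoke the callback only once. A later `flush` at 5 s fires the pending call, bringing reviveOffers to 2.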

I added a couple more unit tests and also did some manual testing:

Reproducing the starvation

  • Start Mesos locally: MESOS_RESOURCES="cpus(*):1;mem(*):512;disk(*):100" mesos local --ip=127.0.0.1
  • Start 10 Chronos frameworks without the new flags:
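#!/bin/bash
# [[ ... ]] and &> are bash features, so run this with bash rather than sh.

# Make sure the log directory exists before redirecting output into it.
mkdir -p /tmp/chronos_logs

i=0
while [[ $i -le 9 ]]; do
  # Each framework gets its own Graphite prefix, HTTP port, and ZK path.
  bin/start-chronos.bash --master 127.0.0.1:5050 \
    --graphite_host_port 192.168.99.100:2003 \
    --graphite_reporting_interval 5 \
    --graphite_group_prefix "chronos_${i}" \
    --http_port "808${i}" \
    --zk_path "/chronos${i}/state" &>"/tmp/chronos_logs/${i}.log" &

  i=$((i+1))
done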
  • Add the following jobs to one of the just-started Chronos frameworks:
{
   "name": "test",
   "command": "sleep 500000",
   "schedule": "R//PT10M",
   "runAsUser": "gaston",
   "disk": "35",
   "mem": "1"
}
{
   "name": "starved",
   "command": "sleep 20",
   "schedule": "R//PT1M",
   "runAsUser": "gaston",
   "disk": "2",
   "mem": "1"
}
  • Watch in the logs how that framework stops getting offers after having scheduled test and is unable to start starved.

Trying out the new flags

  • Stop all the processes and clean up the ZK state.
  • Start Mesos as before.
  • Start the 10 Chronos frameworks, this time with the following extra parameters: --revive_offers_for_new_jobs --decline_offer_duration 3000000.
  • Add the same jobs as before.
  • Verify that the framework is able to start both the test and the starved job.

@gkleiman gkleiman assigned brndnmtthws and unassigned elingg Aug 10, 2015
@brndnmtthws
Member

Looks great to me, nice work. 👍

Merge away.

gkleiman added a commit that referenced this pull request Aug 11, 2015
Fixes #509 - Flag for reviveOffers and the duration for which to reject offers
@gkleiman gkleiman merged commit 5c61adf into master Aug 11, 2015
@gkleiman gkleiman deleted the gk/fix_#509_add_declineoffer_filters branch August 11, 2015 10:04