4.8 muqss #336

Closed
wants to merge 43 commits into from

Changes from all commits
43 commits
36acb02
MuQSS version 0.104
ckolivas Oct 9, 2016
24ba405
MuQSS version 0.105
ckolivas Oct 9, 2016
560a7c9
MuQSS version 0.106
ckolivas Oct 9, 2016
386dd43
MuQSS version 0.108
ckolivas Oct 9, 2016
659a4a8
muqss108-001-check_affinity_switch
ckolivas Oct 9, 2016
1641360
muqss108-002-bias_idle_on_wake
ckolivas Oct 9, 2016
225a663
Any time we have two runqueues locked we use that as an opportunity to
ckolivas Oct 9, 2016
3201c4e
Don't reinitialise deadline on wake up new task in case runqueue has …
ckolivas Oct 10, 2016
8d5ded8
sched_info_de/queued only on de/activate.
ckolivas Oct 10, 2016
31a3fa8
Lock pi_lock as well when migrating a task in finish_lock_switch.
ckolivas Oct 10, 2016
b1068ce
Drop task waking which is not meaningfully used.
ckolivas Oct 10, 2016
b9889eb
Do not update task_thread_info cpu on a running task until it is off …
ckolivas Oct 10, 2016
4c1b1f5
Consolidate when and where we update_clocks and account for when niff…
ckolivas Oct 11, 2016
3688005
Update sched_info only on first enqueue/dequeue and alter en/dequeue_…
ckolivas Oct 11, 2016
f12d07e
Wrong again
ckolivas Oct 11, 2016
0f6b426
Should update sched info data when moving CPUs to get correct rq clock.
ckolivas Oct 11, 2016
182a0ce
Rework SCHED_ISO to work per-runqueue with numerous updates, overhead
ckolivas Oct 11, 2016
b8e3f69
Build fixes for various different stripped configurations.
ckolivas Oct 11, 2016
7d2ed0b
Revert unnecessary moving of preempt_enable.
ckolivas Oct 11, 2016
73e3f98
Bump MuQSS version to 0.110
ckolivas Oct 11, 2016
753f3af
Select a valid CPU from the online masks only
ckolivas Oct 13, 2016
a25a303
Fix suspend and resume.
ckolivas Oct 13, 2016
2c80363
Bump MuQSS version to 0.111
ckolivas Oct 13, 2016
fdd879d
Clean up bind_zero to not try and change affinity or reschedule the s…
ckolivas Oct 13, 2016
74ccb93
Remove rq_policy which is only used locally
ckolivas Oct 13, 2016
121b4e9
Double lock and differentiate rq from new_rq in wake_up_new_task, alt…
ckolivas Oct 13, 2016
fb66efa
Remove use of rq_time_slice as time_slice is now only checked locally…
ckolivas Oct 13, 2016
e34f461
Remove use of rq_last_ran as it is now only checked locally or under …
ckolivas Oct 13, 2016
83d4c55
Update sched info when we change CPUs and not on en/dequeuing to be run
ckolivas Oct 14, 2016
73a10ed
Remove unused code.
ckolivas Oct 14, 2016
1a62697
Avoid spurious missed preemption due to lockless resched_curr.
ckolivas Oct 14, 2016
7e45f4a
Time slice expire should be set to correct rq
ckolivas Oct 15, 2016
ab70a89
Remove redundant reassignment
ckolivas Oct 15, 2016
9989220
Use rq dither as an offset itself, avoiding one conditional
ckolivas Oct 15, 2016
0709a96
Dequeue sched info on task deactivation
ckolivas Oct 15, 2016
402bc62
Do irq_enter on scheduler_ipi called when idle to update xtime.
ckolivas Oct 15, 2016
755fcbd
Don't re-set values that haven't changed.
ckolivas Oct 16, 2016
2ce037a
Yielding repeatedly without resetting timeslice and deadline can lead…
ckolivas Oct 16, 2016
3e4e282
Remove dup'd vtime_task_switch
ckolivas Oct 17, 2016
046c8df
Add more documentation to sched-MuQSS.txt.
ckolivas Oct 17, 2016
a0a5f7e
Choose deadline task by skip list key instead of deadline to ensure w…
ckolivas Oct 17, 2016
2ddc12e
Bump MuQSS version to 0.112
ckolivas Oct 17, 2016
75f10f0
Typo
ckolivas Oct 17, 2016
361 changes: 361 additions & 0 deletions Documentation/scheduler/sched-BFS.txt

Large diffs are not rendered by default.

78 changes: 78 additions & 0 deletions Documentation/scheduler/sched-MuQSS.txt
@@ -0,0 +1,78 @@
MuQSS - The Multiple Queue Skiplist Scheduler by Con Kolivas.

See sched-BFS.txt for the basic design; MuQSS is a per-CPU runqueue variant with
one 8-level skip list per runqueue, and fine-grained locking for much greater
scalability.

Goals.

The goal of the Multiple Queue Skiplist Scheduler, referred to as MuQSS from
here on (pronounced "mux"), is to do away entirely with the complex designs of
the past for the CPU process scheduler and instead implement one that is very
simple in basic design. The main focus of MuQSS is to achieve excellent desktop
interactivity and responsiveness without heuristics and tuning knobs that are
difficult to understand, impossible to model and predict the effect of, and
that, when tuned to one workload, cause massive detriment to another, while
still scaling to many CPUs and processes.


Design summary.

MuQSS is best described as a per-CPU, multiple-runqueue, O(log n) insertion,
O(1) lookup, earliest effective virtual deadline first design, loosely based on
EEVDF (earliest eligible virtual deadline first) and my previous Staircase
Deadline scheduler, and evolved from the single-runqueue O(n) BFS scheduler.
Each component is described below so that its significance, and the reasoning
behind it, can be understood.
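To make the ordering concrete, the sketch below (plain C with invented names
such as skiplist_node, sl_insert and sl_first; it is not the kernel
implementation) shows the essential property of a fixed 8-level skip list keyed
by priority and virtual deadline: insertion is O(log n) expected, and the best
candidate task is always the first node, so lookup is O(1).

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SL_LEVELS 8

struct skiplist_node {
	uint64_t key;			/* priority in high bits, virtual deadline in low bits */
	int pid;			/* stand-in for a task pointer */
	struct skiplist_node *next[SL_LEVELS];
};

struct skiplist {
	struct skiplist_node head;	/* sentinel: head.next[0] is always the best task */
};

static void sl_init(struct skiplist *sl)
{
	for (int i = 0; i < SL_LEVELS; i++)
		sl->head.next[i] = NULL;
}

/* Each node keeps an extra level with probability 1/2, capped at 8. */
static int sl_random_level(void)
{
	int level = 1;

	while (level < SL_LEVELS && (rand() & 1))
		level++;
	return level;
}

/* O(log n) expected: descend from the top level, recording the update path. */
static void sl_insert(struct skiplist *sl, struct skiplist_node *node)
{
	struct skiplist_node *update[SL_LEVELS];
	struct skiplist_node *cur = &sl->head;
	int level = sl_random_level();

	for (int i = SL_LEVELS - 1; i >= 0; i--) {
		while (cur->next[i] && cur->next[i]->key < node->key)
			cur = cur->next[i];
		update[i] = cur;
	}
	for (int i = 0; i < level; i++) {
		node->next[i] = update[i]->next[i];
		update[i]->next[i] = node;
	}
	for (int i = level; i < SL_LEVELS; i++)
		node->next[i] = NULL;
}

/* O(1): the lowest key (best priority, then earliest deadline) is first. */
static struct skiplist_node *sl_first(struct skiplist *sl)
{
	return sl->head.next[0];
}

int main(void)
{
	struct skiplist rq;
	struct skiplist_node a = { .key = ((uint64_t)120 << 48) | 500, .pid = 1 };
	struct skiplist_node b = { .key = ((uint64_t)120 << 48) | 300, .pid = 2 };

	sl_init(&rq);
	sl_insert(&rq, &a);
	sl_insert(&rq, &b);
	printf("next task: pid %d\n", sl_first(&rq)->pid);	/* pid 2: earlier deadline */
	return 0;
}

Compiled standalone, this prints "next task: pid 2", since at equal priority
the task with the earlier deadline sorts first.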


Design reasoning.

In BFS, the use of a single runqueue across all CPUs meant that each CPU would
need to scan the entire runqueue looking for the process with the earliest
deadline and schedule that next, regardless of which CPU it originally came
from. This made BFS deterministic with respect to latency and provided
guaranteed latencies dependent on number of processes and CPUs. The single
runqueue, however, meant that all CPUs would compete for the single lock
protecting it, which would lead to increasing lock contention as the number of
CPUs rose and appeared to limit scalability of common workloads beyond 16
logical CPUs. Additionally, the O(n) lookup of the runqueue list increased
overhead in proportion to the number of queued processes and led to cache
thrashing while iterating over the linked list.

MuQSS is an evolution of BFS, designed to keep the same scheduling decision
mechanism and remain virtually deterministic without the constraints of the
single runqueue design: the single runqueue is split out into one runqueue per
CPU, and skip lists are used instead of linked lists.

The original reason for going back to a single runqueue design in BFS was that
once multiple runqueues are introduced, per-CPU or otherwise, complex
interactions follow. Each runqueue is responsible for the scheduling latency
and fairness of the tasks on its own runqueue only, so any throughput advantage
from keeping tasks CPU-local brings disadvantages elsewhere: achieving even a
semblance of fairness across CPUs requires a very complex balancing system, and
relatively low latency can only be maintained for tasks bound to the same CPU,
not across CPUs. To improve fairness and latency across CPUs, the advantage of
local runqueue locking, which makes for better scalability, is lost because
multiple locks must be taken.

MuQSS works around the problems inherent in multiple runqueue designs by
making its skip lists priority ordered and by novel lockless examination of the
other runqueues: each CPU can decide whether to take the earliest deadline task
from another runqueue, either for latency or for CPU balancing reasons. There
is still no separate balancing system; balance emerges from the next-task
scheduling decision and from the CPU chosen at task wakeup.
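A minimal sketch of that idea, assuming each runqueue simply publishes the
deadline of its best queued task in a word that can be read without taking its
lock (names such as pick_best_cpu and best_deadline are invented for
illustration and are not the MuQSS symbols):

#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4

struct runqueue {
	_Atomic uint64_t best_deadline;	/* UINT64_MAX when the queue is empty */
	/* lock, skip list and the rest omitted for the sketch */
};

static struct runqueue runqueues[NR_CPUS];

/*
 * Lockless scan: the cached values may be stale, which at worst costs a
 * suboptimal pick, never a correctness problem, because the chosen
 * runqueue is locked and re-checked before a task is actually dequeued.
 */
static int pick_best_cpu(int this_cpu)
{
	uint64_t best = atomic_load_explicit(&runqueues[this_cpu].best_deadline,
					     memory_order_relaxed);
	int best_cpu = this_cpu;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		uint64_t dl = atomic_load_explicit(&runqueues[cpu].best_deadline,
						   memory_order_relaxed);
		if (dl < best) {
			best = dl;
			best_cpu = cpu;
		}
	}
	return best_cpu;	/* caller locks runqueues[best_cpu], then re-checks */
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_store(&runqueues[cpu].best_deadline, UINT64_MAX);
	atomic_store(&runqueues[2].best_deadline, 1000);	/* CPU 2 has the earliest deadline */
	return pick_best_cpu(0) == 2 ? 0 : 1;
}

Stale values only cost a slightly suboptimal pick; locking and re-checking the
one chosen runqueue is what keeps the scan safe without any global lock.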


Design:

MuQSS is a variant of BFS with one 8-level skip list per runqueue.

See sched-BFS.txt for some of the shared design details.

Documentation yet to be completed.


Con Kolivas <kernel@kolivas.org> Sun, 2nd October 2016
26 changes: 26 additions & 0 deletions Documentation/sysctl/kernel.txt
@@ -39,6 +39,7 @@ show up in /proc/sys/kernel:
- hung_task_timeout_secs
- hung_task_warnings
- kexec_load_disabled
- iso_cpu
- kptr_restrict
- kstack_depth_to_print [ X86 only ]
- l2cr [ PPC only ]
@@ -73,6 +74,7 @@ show up in /proc/sys/kernel:
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
- rr_interval
- rtsig-max
- rtsig-nr
- sem
@@ -402,6 +404,16 @@ kernel stack.

==============================================================

iso_cpu: (MuQSS CPU scheduler only)

This sets the percentage of CPU time that unprivileged SCHED_ISO tasks may run
at effectively realtime priority, averaged over a rolling five seconds across
the -whole- system, meaning all CPUs. For example, at the default setting of 70
on a 4-CPU machine, ISO tasks between them may use up to the equivalent of 2.8
CPUs' worth of realtime-priority time over that window.

Set to 70 (percent) by default.

==============================================================

l2cr: (PPC only)

This flag controls the L2 cache of G3 processor boards. If
@@ -818,6 +830,20 @@ rebooting. ???

==============================================================

rr_interval: (MuQSS CPU scheduler only)

This is the smallest duration that any cpu process scheduling unit will run
for. Increasing this value can substantially increase the throughput of
CPU-bound tasks, at the expense of increased overall latencies. Conversely,
decreasing it will decrease average and maximum latencies at the expense of
throughput. The value is in milliseconds, and the default chosen depends on
the number of CPUs available at scheduler initialisation, with a minimum of 6.

Valid values are from 1-1000.
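As a usage illustration only (the paths follow from the /proc/sys/kernel
entries documented above; the particular values written are arbitrary examples,
not recommendations), both knobs can be read and set through procfs, for
instance from a small C helper run as root:

#include <stdio.h>

/* Write a single integer to a sysctl file; returns 0 on success. */
static int write_sysctl(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", value);
	return fclose(f);
}

int main(void)
{
	int rr = 0;
	FILE *f = fopen("/proc/sys/kernel/rr_interval", "r");

	if (f) {
		if (fscanf(f, "%d", &rr) == 1)
			printf("current rr_interval: %d ms\n", rr);
		fclose(f);
	}

	/* A larger rr_interval trades latency for throughput; 1-1000 is valid. */
	write_sysctl("/proc/sys/kernel/rr_interval", 6);
	/* Cap unprivileged SCHED_ISO usage at 70% of all CPUs (the default). */
	write_sysctl("/proc/sys/kernel/iso_cpu", 70);
	return 0;
}

The same effect can be had with sysctl(8), e.g. setting kernel.rr_interval and
kernel.iso_cpu directly.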

==============================================================

rtsig-max & rtsig-nr:

The file rtsig-max can be used to tune the maximum number
5 changes: 0 additions & 5 deletions arch/powerpc/platforms/cell/spufs/sched.c
@@ -63,11 +63,6 @@ static struct task_struct *spusched_task;
static struct timer_list spusched_timer;
static struct timer_list spuloadavg_timer;

/*
* Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
*/
#define NORMAL_PRIO 120

/*
* Frequency of the spu scheduler tick. By default we do one SPU scheduler
* tick for every 10 CPU scheduler ticks.
22 changes: 19 additions & 3 deletions arch/x86/Kconfig
@@ -914,10 +914,26 @@ config SCHED_SMT
depends on SMP
---help---
SMT scheduler support improves the CPU scheduler's decision making
when dealing with Intel Pentium 4 chips with HyperThreading at a
when dealing with Intel P4/Core 2 chips with HyperThreading at a
cost of slightly increased overhead in some places. If unsure say
N here.

config SMT_NICE
bool "SMT (Hyperthreading) aware nice priority and policy support"
depends on SCHED_MUQSS && SCHED_SMT
default y
---help---
Enabling Hyperthreading on Intel CPUs decreases the effectiveness
of the use of 'nice' levels and different scheduling policies
(e.g. realtime) due to sharing of CPU power between hyperthreads.
SMT nice support makes each logical CPU aware of what is running on
its hyperthread siblings, maintaining appropriate distribution of
CPU according to nice levels and scheduling policies at the expense
of slightly increased overhead.

If unsure say Y here.


config SCHED_MC
def_bool y
prompt "Multi-core scheduler support"
@@ -2036,7 +2052,7 @@ config HOTPLUG_CPU
config BOOTPARAM_HOTPLUG_CPU0
bool "Set default setting of cpu0_hotpluggable"
default n
depends on HOTPLUG_CPU
depends on HOTPLUG_CPU && !SCHED_MUQSS
---help---
Set whether default state of cpu0_hotpluggable is on or off.

@@ -2065,7 +2081,7 @@ config BOOTPARAM_HOTPLUG_CPU0
config DEBUG_HOTPLUG_CPU0
def_bool n
prompt "Debug CPU0 hotplug"
depends on HOTPLUG_CPU
depends on HOTPLUG_CPU && !SCHED_MUQSS
---help---
Enabling this option offlines CPU0 (if CPU0 can be offlined) as
soon as possible and boots up userspace with CPU0 offlined. User
4 changes: 2 additions & 2 deletions drivers/cpufreq/cpufreq_conservative.c
@@ -30,8 +30,8 @@ struct cs_dbs_tuners {
};

/* Conservative governor macros */
#define DEF_FREQUENCY_UP_THRESHOLD (80)
#define DEF_FREQUENCY_DOWN_THRESHOLD (20)
#define DEF_FREQUENCY_UP_THRESHOLD (63)
#define DEF_FREQUENCY_DOWN_THRESHOLD (26)
#define DEF_FREQUENCY_STEP (5)
#define DEF_SAMPLING_DOWN_FACTOR (1)
#define MAX_SAMPLING_DOWN_FACTOR (10)
4 changes: 2 additions & 2 deletions drivers/cpufreq/cpufreq_ondemand.c
@@ -20,7 +20,7 @@
#include "cpufreq_ondemand.h"

/* On-demand governor macros */
#define DEF_FREQUENCY_UP_THRESHOLD (80)
#define DEF_FREQUENCY_UP_THRESHOLD (63)
#define DEF_SAMPLING_DOWN_FACTOR (1)
#define MAX_SAMPLING_DOWN_FACTOR (100000)
#define MICRO_FREQUENCY_UP_THRESHOLD (95)
@@ -129,7 +129,7 @@ static void dbs_freq_increase(struct cpufreq_policy *policy, unsigned int freq)
}

/*
* Every sampling_rate, we check, if current idle time is less than 20%
* Every sampling_rate, we check, if current idle time is less than 37%
* (default), then we try to increase frequency. Else, we adjust the frequency
* proportional to load.
*/
2 changes: 1 addition & 1 deletion fs/proc/base.c
@@ -505,7 +505,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m, "0 0 0\n");
else
seq_printf(m, "%llu %llu %lu\n",
(unsigned long long)task->se.sum_exec_runtime,
(unsigned long long)tsk_seruntime(task),
(unsigned long long)task->sched_info.run_delay,
task->sched_info.pcount);

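The tsk_seruntime() helper used above is defined elsewhere in this patch set
and is not shown in this hunk. As an assumption only, following the convention
the earlier BFS patches used, its shape is likely along these lines:

/* Assumed shape only; the real definition lives in the scheduler headers of
 * this patch set. With MuQSS enabled, the scheduler's own accumulated runtime
 * is reported; otherwise the mainline field is used, so this call site works
 * for both configurations. */
#ifdef CONFIG_SCHED_MUQSS
#define tsk_seruntime(t)	((t)->sched_time)
#else
#define tsk_seruntime(t)	((t)->se.sum_exec_runtime)
#endif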
75 changes: 72 additions & 3 deletions include/linux/init_task.h
@@ -157,8 +157,6 @@ extern struct task_group root_task_group;
# define INIT_VTIME(tsk)
#endif

#define INIT_TASK_COMM "swapper"

#ifdef CONFIG_RT_MUTEXES
# define INIT_RT_MUTEXES(tsk) \
.pi_waiters = RB_ROOT, \
@@ -187,6 +185,77 @@ extern struct task_group root_task_group;
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
*/
#ifdef CONFIG_SCHED_MUQSS
#define INIT_TASK_COMM "MuQSS"
#define INIT_TASK(tsk) \
{ \
.state = 0, \
.stack = &init_thread_info, \
.usage = ATOMIC_INIT(2), \
.flags = PF_KTHREAD, \
.prio = NORMAL_PRIO, \
.static_prio = MAX_PRIO-20, \
.normal_prio = NORMAL_PRIO, \
.deadline = 0, \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
.active_mm = &init_mm, \
.restart_block = { \
.fn = do_no_restart_syscall, \
}, \
.time_slice = 1000000, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
INIT_PUSHABLE_TASKS(tsk) \
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
.real_parent = &tsk, \
.parent = &tsk, \
.children = LIST_HEAD_INIT(tsk.children), \
.sibling = LIST_HEAD_INIT(tsk.sibling), \
.group_leader = &tsk, \
RCU_POINTER_INITIALIZER(real_cred, &init_cred), \
RCU_POINTER_INITIALIZER(cred, &init_cred), \
.comm = INIT_TASK_COMM, \
.thread = INIT_THREAD, \
.fs = &init_fs, \
.files = &init_files, \
.signal = &init_signals, \
.sighand = &init_sighand, \
.nsproxy = &init_nsproxy, \
.pending = { \
.list = LIST_HEAD_INIT(tsk.pending.list), \
.signal = {{0}}}, \
.blocked = {{0}}, \
.alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
.timer_slack_ns = 50000, /* 50 usec default slack */ \
.pids = { \
[PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \
[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
}, \
.thread_group = LIST_HEAD_INIT(tsk.thread_group), \
.thread_node = LIST_HEAD_INIT(init_signals.thread_head), \
INIT_IDS \
INIT_PERF_EVENTS(tsk) \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_TASK_RCU_TASKS(tsk) \
INIT_CPUSET_SEQ(tsk) \
INIT_RT_MUTEXES(tsk) \
INIT_PREV_CPUTIME(tsk) \
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk) \
INIT_KASAN(tsk) \
}
#else /* CONFIG_SCHED_MUQSS */
#define INIT_TASK_COMM "swapper"
#define INIT_TASK(tsk) \
{ \
.state = 0, \
@@ -261,7 +330,7 @@ extern struct task_group root_task_group;
INIT_NUMA_BALANCING(tsk) \
INIT_KASAN(tsk) \
}

#endif /* CONFIG_SCHED_MUQSS */

#define INIT_CPU_TIMERS(cpu_timers) \
{ \
2 changes: 2 additions & 0 deletions include/linux/ioprio.h
@@ -52,6 +52,8 @@ enum {
*/
static inline int task_nice_ioprio(struct task_struct *task)
{
if (iso_task(task))
return 0;
return (task_nice(task) + 20) / 5;
}

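iso_task() is likewise defined elsewhere in the patch set; assuming it is the
usual policy check (an assumption, shown here only so the hunk reads on its
own), SCHED_ISO tasks are simply given the best I/O nice level:

/* Assumed shape of the helper used above, not a verbatim copy of the patch. */
static inline bool iso_task(struct task_struct *p)
{
	return unlikely(p->policy == SCHED_ISO);
}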
2 changes: 1 addition & 1 deletion include/linux/jiffies.h
@@ -164,7 +164,7 @@ static inline u64 get_jiffies_64(void)
* Have the 32 bit jiffies value wrap 5 minutes after boot
* so jiffies wrap bugs show up earlier.
*/
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))

/*
* Change timeval to jiffies, trying to avoid the