Skip to content

Commit

Permalink
workqueue: fix subtle pool management issue which can stall whole wor…
Browse files Browse the repository at this point in the history
…ker_pool

commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.

A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items && !nr_running && !nr_idle

The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable inbetween making it non-zero.

If this happens, manage_worker() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items.  If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.

We can leave the early exit condition alone and just ignore the return
value but the only reason it was put there is because the
manage_workers() used to perform both creations and destructions of
workers and thus the function may be invoked while the pool is trying
to reduce the number of workers.  Now that manage_workers() is called
only when more workers are needed, the only case this early exit
condition is triggered is rare race conditions rendering it pointless.

Tested with simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>

workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.

cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.

try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
this case, __cancel_work_timer() currently invokes flush_work().  The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping

Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority.  Let's say task A just got woken
up from flush_work() by the completion of the target work item.  If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing.  This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.

task A			task B			worker

						executing work
__cancel_work_timer()
  try_to_grab_pending()
  set work CANCELING
  flush_work()
    block for work completion
						completion, wakes up A
			__cancel_work_timer()
			while (forever) {
			  try_to_grab_pending()
			    -ENOENT as work is being canceled
			  flush_work()
			    false as work is no longer executing
			}

This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.

Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com

v3: bit_waitqueue() can't be used for work items defined in vmalloc
    area.  Switched to custom wake function which matches the target
    work item and exclusive wait and wakeup.

v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
    the target bit waitqueue has wait_bit_queue's on it.  Use
    DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
    Vizoso.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rabin Vincent <rabin.vincent@axis.com>
Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
Tested-by: Rabin Vincent <rabin.vincent@axis.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>

workqueue: make sure delayed work run in local cpu

commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.

My system keeps crashing with below message. vmstat_update() schedules a delayed
work in current cpu and expects the work runs in the cpu.
schedule_delayed_work() is expected to make delayed work run in local cpu. The
problem is timer can be migrated with NO_HZ. __queue_work() queues work in
timer handler, which could run in a different cpu other than where the delayed
work is scheduled. The end result is the delayed work runs in different cpu.
The patch makes __queue_delayed_work records local cpu earlier. Where the timer
runs doesn't change where the work runs with the change.

[   28.010131] ------------[ cut here ]------------
[   28.010609] kernel BUG at ../mm/vmstat.c:1392!
[   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[   28.011860] Modules linked in:
[   28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W4.3.0-rc3+ #634
[   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[   28.014160] Workqueue: events vmstat_update
[   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[   28.015445] RIP: 0010:[<ffffffff8115f921>]  [<ffffffff8115f921>]vmstat_update+0x31/0x80
[   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
[   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
[   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
[   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
[   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
[   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
[   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
[   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
[   28.020071] Stack:
[   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
[   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
[   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
[   28.020071] Call Trace:
[   28.020071]  [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
[   28.020071]  [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
[   28.020071]  [<ffffffff8106c214>] worker_thread+0x114/0x460
[   28.020071]  [<ffffffff8106c100>] ? process_one_work+0x540/0x540
[   28.020071]  [<ffffffff81071bf8>] kthread+0xf8/0x110
[   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
[   28.020071]  [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
[   28.020071]  [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>

workqueue: clear POOL_DISASSOCIATED in rebind_workers()

a9ab775 ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

There is nothing necessarily wrong with it, but there is no benefit
either.  Let's move it into rebind_workers() and achieve the following
benefits:

  1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
     as expected.

  2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
     running workers of the pool are on the local CPU (pool->cpu).

tj: Minor description update.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>

workqueue: Fix workqueue stall issue after cpu down failure

When the hotplug notifier call chain with CPU_DOWN_PREPARE
is broken before reaching workqueue_cpu_down_callback(),
rebind_workers() adds WORKER_REBOUND flag for running workers.
Hence, the nr_running of the pool is not increased when scheduler
wakes up the worker. The fix is skipping adding WORKER_REBOUND
flag when the worker doesn't have WORKER_UNBOUND flag in
CPU_DOWN_FAILED path.

Change-Id: I2528e9154f4913d9ec14b63adbcbcd1eaa8a8452
Signed-off-by: Se Wang (Patrick) Oh <sewango@codeaurora.org>
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>

workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

Workqueues can be performance or power-oriented. Currently, most workqueues are
bound to the CPU they were created on. This gives good performance (due to cache
effects) at the cost of potentially waking up otherwise idle cores (Idle from
scheduler's perspective. Which may or may not be physically idle) just to
process some work. To save power, we can allow the work to be rescheduled on a
core that is already awake.

Workqueues created with the WQ_UNBOUND flag will allow some power savings.
However, we don't change the default behaviour of the system.  To enable
power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to
be turned on. This option can also be overridden by the
workqueue.power_efficient boot parameter.

tj: Updated config description and comments.  Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

workqueue: Add system wide power_efficient workqueues

This patch adds system wide workqueues aligned towards power saving. This is
done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to
'true'.

tj: updated comments a bit.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

firmware: use power efficient workqueue for unloading and aborting fw load

Allow the scheduler to select the most appropriate CPU for running the
firmware load timeout routine and delayed routine for firmware unload.
This extends idle residency times and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Ming Lei <ming.lei@canonical.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel, added commit message.
Fixed code alignment.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

net: wireless: move regulatory timeout work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the timeout work of regulatory settings would be executed.
This extends CPU idle residency time and saves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel. Added commit message.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

rcu: Move SRCU grace period work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the SRCU grace period work would be scheduled. This improves
idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
message. Fixed code alignment.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

net/ipv4: queue work on power efficient wq

Workqueue used in ipv4 layer have no real dependency of scheduling these on the
cpu which scheduled them.

On a idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

block: queue work on power efficient wq

Block layer uses workqueues for multiple purposes. There is no real dependency
of scheduling these on the cpu which scheduled them.

On a idle system, it is observed that and idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

PHYLIB: queue work on system_power_efficient_wq

Phylib uses workqueues for multiple purposes. There is no real dependency of
scheduling these on the cpu which scheduled them.

On a idle system, it is observed that and idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces system_wq with system_power_efficient_wq for PHYLIB.

Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

ASoC: pcm: Use the power efficient workqueue for delayed powerdown

There is no need to use a normal per-CPU workqueue for delayed power downs
as they're not timing or performance critical and waking up a core for them
would defeat some of the point.

Signed-off-by: Mark Brown <broonie@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

ASoC: compress: Use power efficient workqueue

There is no need for the power down work to be done on a per CPU workqueue
especially considering the fairly long delay before powerdown.

Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

ASoC: jack: Use power efficient workqueue

The accessory detect debounce work is not performance sensitive so let
the scheduler run it wherever is most efficient rather than in a per CPU
workqueue by using the system power efficient workqueue.

Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

net/neighbour: queue work on power efficient wq

Workqueue used in neighbour layer have no real dependency of scheduling these on
the cpu which scheduled them.

On a idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we can schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

timekeeping: Move clock sync work to power efficient workqueue

For better use of CPU idle time, allow the scheduler to select the CPU
on which the CMOS clock sync work would be scheduled. This improves
idle residency time and conserver power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Added commit message. Aligned code.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1391195904-12497-1-git-send-email-zoran.markovic@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

net: rfkill: move poll work to power efficient workqueue

This patch moves the rfkill poll_work to the power efficient workqueue.
This work does not have to be bound to the CPU that scheduled it, hence
the selection of CPU that executes it would be left to the scheduler.
Net result is that CPU idle times would be extended, resulting in power
savings.

This behaviour is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel, added commit message.
Fixed workqueue selection after suspend/resume cycle.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

usb: move hub init and LED blink work to power efficient workqueue

Allow the scheduler to select the best CPU to handle hub initalization
and LED blinking work. This extends idle residency times on idle CPUs
and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

[zoran.markovic@linaro.org: Rebased to latest kernel. Added commit message.
Changed reference from system to power efficient workqueue for LEDs in
check_highspeed() and hub_port_connect_change().]

Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Cc: Xenia Ragiadakou <burzalodowa@gmail.com>
Cc: Julius Werner <jwerner@chromium.org>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Cc: Matthias Beyer <mail@beyermatthias.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Thomas Pugliese <thomas.pugliese@gmail.com>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

block: remove WQ_POWER_EFFICIENT from kblockd

blk-mq issues async requests through kblockd. To issue a work request on
a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
specific CPU choice may not be honored, if the power_efficient option
for workqueues is set. blk-mq requires that we have strict per-cpu
scheduling, so it wont work properly if kblockd is marked
POWER_EFFICIENT and power_efficient is set.

Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
This essentially reverts part of commit 695588f9454b, which added
the WQ_POWER_EFFICIENT marker to kblockd.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

regulator: core: Use the power efficient workqueue for delayed powerdown

There is no need to use a normal per-CPU workqueue for delayed power downs
as they're not timing or performance critical and waking up a core for them
would defeat some of the point.

Signed-off-by: Mark Brown <broonie@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Liam Girdwood <liam.r.girdwood@intel.com>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>

power: smb135x: queue work on system_power_efficient_wq

There doesn't seem to be any real dependency of scheduling these on
the cpu which scheduled them, so moving every *_delayed to the
power efficient wq save potential needlessly idle cpu wake ups
leaving the scheduler to decide the most appropriate cpus to wake up.

Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
  • Loading branch information
htejun authored and franciscofranco committed Oct 6, 2016
1 parent dc3cedb commit d1590c8
Show file tree
Hide file tree
Showing 20 changed files with 280 additions and 109 deletions.
15 changes: 15 additions & 0 deletions Documentation/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3349,6 +3349,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that this also can be controlled per-workqueue for
workqueues visible under /sys/bus/workqueue/.

workqueue.power_efficient
Per-cpu workqueues are generally preferred because
they show better performance thanks to cache
locality; unfortunately, per-cpu workqueues tend to
be more power hungry than unbound workqueues.

Enabling this makes the per-cpu workqueues which
were observed to contribute significantly to power
consumption unbound, leading to measurably lower
power usage at the cost of small performance
overhead.

The default value of this parameter is determined by
the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
Expand Down
3 changes: 2 additions & 1 deletion block/blk-ioc.c
Original file line number Diff line number Diff line change
Expand Up @@ -144,7 +144,8 @@ void put_io_context(struct io_context *ioc)
if (atomic_long_dec_and_test(&ioc->refcount)) {
spin_lock_irqsave(&ioc->lock, flags);
if (!hlist_empty(&ioc->icq_list))
schedule_work(&ioc->release_work);
queue_work(system_power_efficient_wq,
&ioc->release_work);
else
free_ioc = true;
spin_unlock_irqrestore(&ioc->lock, flags);
Expand Down
12 changes: 8 additions & 4 deletions block/genhd.c
Original file line number Diff line number Diff line change
Expand Up @@ -1506,9 +1506,11 @@ static void __disk_unblock_events(struct gendisk *disk, bool check_now)
intv = disk_events_poll_jiffies(disk);
set_timer_slack(&ev->dwork.timer, intv / 4);
if (check_now)
queue_delayed_work(system_freezable_wq, &ev->dwork, 0);
queue_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, 0);
else if (intv)
queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
queue_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, intv);
out_unlock:
spin_unlock_irqrestore(&ev->lock, flags);
}
Expand Down Expand Up @@ -1551,7 +1553,8 @@ void disk_flush_events(struct gendisk *disk, unsigned int mask)
spin_lock_irq(&ev->lock);
ev->clearing |= mask;
if (!ev->block)
mod_delayed_work(system_freezable_wq, &ev->dwork, 0);
mod_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, 0);
spin_unlock_irq(&ev->lock);
}

Expand Down Expand Up @@ -1644,7 +1647,8 @@ static void disk_check_events(struct disk_events *ev,

intv = disk_events_poll_jiffies(disk);
if (!ev->block && intv)
queue_delayed_work(system_freezable_wq, &ev->dwork, intv);
queue_delayed_work(system_freezable_power_efficient_wq,
&ev->dwork, intv);

spin_unlock_irq(&ev->lock);

Expand Down
7 changes: 4 additions & 3 deletions drivers/base/firmware_class.c
Original file line number Diff line number Diff line change
Expand Up @@ -1006,7 +1006,8 @@ static int _request_firmware_load(struct firmware_priv *fw_priv, bool uevent,
dev_set_uevent_suppress(f_dev, false);
dev_dbg(f_dev, "firmware: requesting %s\n", buf->fw_id);
if (timeout != MAX_SCHEDULE_TIMEOUT)
schedule_delayed_work(&fw_priv->timeout_work, timeout);
queue_delayed_work(system_power_efficient_wq,
&fw_priv->timeout_work, timeout);

kobject_uevent(&fw_priv->dev.kobj, KOBJ_ADD);
}
Expand Down Expand Up @@ -1699,8 +1700,8 @@ static void device_uncache_fw_images_work(struct work_struct *work)
*/
static void device_uncache_fw_images_delay(unsigned long delay)
{
schedule_delayed_work(&fw_cache.work,
msecs_to_jiffies(delay));
queue_delayed_work(system_power_efficient_wq, &fw_cache.work,
msecs_to_jiffies(delay));
}

static int fw_pm_notify(struct notifier_block *notify_block,
Expand Down
9 changes: 5 additions & 4 deletions drivers/net/phy/phy.c
Original file line number Diff line number Diff line change
Expand Up @@ -439,7 +439,7 @@ void phy_start_machine(struct phy_device *phydev,
{
phydev->adjust_state = handler;

schedule_delayed_work(&phydev->state_queue, HZ);
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, HZ);
}

/**
Expand Down Expand Up @@ -500,7 +500,7 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
disable_irq_nosync(irq);
atomic_inc(&phydev->irq_disable);

schedule_work(&phydev->phy_queue);
queue_work(system_power_efficient_wq, &phydev->phy_queue);

return IRQ_HANDLED;
}
Expand Down Expand Up @@ -655,7 +655,7 @@ static void phy_change(struct work_struct *work)

/* reschedule state queue work to run as soon as possible */
cancel_delayed_work_sync(&phydev->state_queue);
schedule_delayed_work(&phydev->state_queue, 0);
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);

return;

Expand Down Expand Up @@ -918,7 +918,8 @@ void phy_state_machine(struct work_struct *work)
if (err < 0)
phy_error(phydev);

schedule_delayed_work(&phydev->state_queue, PHY_STATE_TIME * HZ);
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
PHY_STATE_TIME * HZ);
}

static inline void mmd_phy_indirect(struct mii_bus *bus, int prtad, int devad,
Expand Down
50 changes: 31 additions & 19 deletions drivers/power/smb135x-charger.c
Original file line number Diff line number Diff line change
Expand Up @@ -1911,7 +1911,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
smb_stay_awake(&chip->smb_wake_source);
chip->bms_check = 1;
cancel_delayed_work(&chip->heartbeat_work);
schedule_delayed_work(&chip->heartbeat_work,
queue_delayed_work(system_power_efficient_wq,
&chip->heartbeat_work,
msecs_to_jiffies(0));
break;
case POWER_SUPPLY_PROP_HEALTH:
Expand All @@ -1921,7 +1922,8 @@ static int smb135x_battery_set_property(struct power_supply *psy,
smb135x_set_chrg_path_temp(chip);
chip->temp_check = 1;
cancel_delayed_work(&chip->heartbeat_work);
schedule_delayed_work(&chip->heartbeat_work,
queue_delayed_work(system_power_efficient_wq,
&chip->heartbeat_work,
msecs_to_jiffies(0));
break;
/* Block from Fuel Gauge */
Expand Down Expand Up @@ -2598,7 +2600,8 @@ static void aicl_check_work(struct work_struct *work)
chip->aicl_weak_detect = true;

cancel_delayed_work(&chip->src_removal_work);
schedule_delayed_work(&chip->src_removal_work,
queue_delayed_work(system_power_efficient_wq,
&chip->src_removal_work,
msecs_to_jiffies(3000));
if (!rc) {
dev_dbg(chip->dev, "Reached Bottom IC!\n");
Expand Down Expand Up @@ -2678,8 +2681,9 @@ static void rate_check_work(struct work_struct *work)

chip->rate_check_count++;
if (chip->rate_check_count < 6)
schedule_delayed_work(&chip->rate_check_work,
msecs_to_jiffies(500));
queue_delayed_work(system_power_efficient_wq,
&chip->rate_check_work,
msecs_to_jiffies(500));
}

static void usb_insertion_work(struct work_struct *work)
Expand Down Expand Up @@ -2734,8 +2738,9 @@ static void heartbeat_work(struct work_struct *work)
smb135x_get_prop_batt_health(chip, &batt_health)) {
dev_warn(chip->dev, "HB Failed to run resume = %d!\n",
(int)chip->resume_completed);
schedule_delayed_work(&chip->heartbeat_work,
msecs_to_jiffies(1000));
queue_delayed_work(system_power_efficient_wq,
&chip->heartbeat_work,
msecs_to_jiffies(1000));
return;
}

Expand Down Expand Up @@ -2833,8 +2838,9 @@ static void heartbeat_work(struct work_struct *work)

power_supply_changed(&chip->batt_psy);

schedule_delayed_work(&chip->heartbeat_work,
msecs_to_jiffies(60000));
queue_delayed_work(system_power_efficient_wq,
&chip->heartbeat_work,
msecs_to_jiffies(60000));
chip->hb_running = false;
if (!usb_present && !dc_present)
smb_relax(&chip->smb_wake_source);
Expand Down Expand Up @@ -2945,7 +2951,8 @@ static int otg_oc_handler(struct smb135x_chg *chip, u8 rt_stat)
return 0;
}

schedule_delayed_work(&chip->ocp_clear_work,
queue_delayed_work(system_power_efficient_wq,
&chip->ocp_clear_work,
msecs_to_jiffies(0));

pr_err("rt_stat = 0x%02x\n", rt_stat);
Expand All @@ -2970,7 +2977,8 @@ static int handle_dc_removal(struct smb135x_chg *chip)
static int handle_dc_insertion(struct smb135x_chg *chip)
{
if (chip->dc_psy_type == POWER_SUPPLY_TYPE_WIRELESS)
schedule_delayed_work(&chip->wireless_insertion_work,
queue_delayed_work(system_power_efficient_wq,
&chip->wireless_insertion_work,
msecs_to_jiffies(DCIN_UNSUSPEND_DELAY_MS));
if (chip->dc_psy_type != -EINVAL)
power_supply_set_online(&chip->dc_psy,
Expand Down Expand Up @@ -3121,8 +3129,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
smb_stay_awake(&chip->smb_wake_source);
chip->apsd_rerun_cnt++;
chip->usb_present = 0;
schedule_delayed_work(&chip->usb_insertion_work,
msecs_to_jiffies(1000));
queue_delayed_work(system_power_efficient_wq,
&chip->usb_insertion_work,
msecs_to_jiffies(1000));
return 0;
}

Expand Down Expand Up @@ -3162,8 +3171,9 @@ static int handle_usb_insertion(struct smb135x_chg *chip)
chip->charger_rate = POWER_SUPPLY_CHARGE_RATE_NORMAL;
chip->rate_check_count = 0;
cancel_delayed_work(&chip->rate_check_work);
schedule_delayed_work(&chip->rate_check_work,
msecs_to_jiffies(500));
queue_delayed_work(system_power_efficient_wq,
&chip->rate_check_work,
msecs_to_jiffies(500));
return 0;
}

Expand Down Expand Up @@ -3204,8 +3214,9 @@ static int usbin_uv_handler(struct smb135x_chg *chip, u8 rt_stat)
if (rc < 0)
pr_err("Failed to Disable USBIN UV IRQ\n");
cancel_delayed_work(&chip->aicl_check_work);
schedule_delayed_work(&chip->aicl_check_work,
msecs_to_jiffies(0));
queue_delayed_work(system_power_efficient_wq,
&chip->aicl_check_work,
msecs_to_jiffies(0));
}
return 0;
}
Expand Down Expand Up @@ -5435,8 +5446,9 @@ static int smb135x_charger_probe(struct i2c_client *client,
if (rc < 0)
pr_err("failed to set up voltage notifications: %d\n", rc);

schedule_delayed_work(&chip->heartbeat_work,
msecs_to_jiffies(60000));
queue_delayed_work(system_power_efficient_wq,
&chip->heartbeat_work,
msecs_to_jiffies(60000));

dev_info(chip->dev, "SMB135X version = %s revision = %s successfully probed batt=%d dc = %d usb = %d\n",
version_str[chip->version],
Expand Down
5 changes: 3 additions & 2 deletions drivers/regulator/core.c
Original file line number Diff line number Diff line change
Expand Up @@ -1925,8 +1925,9 @@ int regulator_disable_deferred(struct regulator *regulator, int ms)
rdev->deferred_disables++;
mutex_unlock(&rdev->mutex);

ret = schedule_delayed_work(&rdev->disable_work,
msecs_to_jiffies(ms));
ret = queue_delayed_work(system_power_efficient_wq,
&rdev->disable_work,
msecs_to_jiffies(ms));
if (ret < 0)
return ret;
else
Expand Down
19 changes: 13 additions & 6 deletions drivers/usb/core/hub.c
Original file line number Diff line number Diff line change
Expand Up @@ -506,7 +506,8 @@ static void led_work (struct work_struct *work)
changed++;
}
if (changed)
schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
queue_delayed_work(system_power_efficient_wq,
&hub->leds, LED_CYCLE_PERIOD);
}

/* use a short timeout for hub/port status fetches */
Expand Down Expand Up @@ -1057,7 +1058,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
goto init2;
#endif
PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func2);
schedule_delayed_work(&hub->init_work,
queue_delayed_work(system_power_efficient_wq,
&hub->init_work,
msecs_to_jiffies(delay));

/* Suppress autosuspend until init is done */
Expand Down Expand Up @@ -1218,7 +1220,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
/* Don't do a long sleep inside a workqueue routine */
if (type == HUB_INIT2) {
PREPARE_DELAYED_WORK(&hub->init_work, hub_init_func3);
schedule_delayed_work(&hub->init_work,
queue_delayed_work(system_power_efficient_wq,
&hub->init_work,
msecs_to_jiffies(delay));
device_unlock(hub->intfdev);
return; /* Continues at init3: below */
Expand All @@ -1233,7 +1236,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
if (status < 0)
dev_err(hub->intfdev, "activate --> %d\n", status);
if (hub->has_indicators && blinkenlights)
schedule_delayed_work(&hub->leds, LED_CYCLE_PERIOD);
queue_delayed_work(system_power_efficient_wq,
&hub->leds, LED_CYCLE_PERIOD);

/* Scan all ports that need attention */
kick_khubd(hub);
Expand Down Expand Up @@ -4396,7 +4400,8 @@ check_highspeed (struct usb_hub *hub, struct usb_device *udev, int port1)
/* hub LEDs are probably harder to miss than syslog */
if (hub->has_indicators) {
hub->indicator[port1-1] = INDICATOR_GREEN_BLINK;
schedule_delayed_work (&hub->leds, 0);
queue_delayed_work(system_power_efficient_wq,
&hub->leds, 0);
}
}
kfree(qual);
Expand Down Expand Up @@ -4626,7 +4631,9 @@ static void hub_port_connect_change(struct usb_hub *hub, int port1,
if (hub->has_indicators) {
hub->indicator[port1-1] =
INDICATOR_AMBER_BLINK;
schedule_delayed_work (&hub->leds, 0);
queue_delayed_work(
system_power_efficient_wq,
&hub->leds, 0);
}
status = -ENOTCONN; /* Don't retry */
goto loop_disable;
Expand Down
38 changes: 37 additions & 1 deletion include/linux/workqueue.h
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,8 @@ enum {
/* data contains off-queue information when !WORK_STRUCT_PWQ */
WORK_OFFQ_FLAG_BASE = WORK_STRUCT_COLOR_SHIFT,

WORK_OFFQ_CANCELING = (1 << WORK_OFFQ_FLAG_BASE),
__WORK_OFFQ_CANCELING = WORK_OFFQ_FLAG_BASE,
WORK_OFFQ_CANCELING = (1 << __WORK_OFFQ_CANCELING),

/*
* When a work item is off queue, its high bits point to the last
Expand Down Expand Up @@ -303,6 +304,33 @@ enum {
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */
WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */

/*
* Per-cpu workqueues are generally preferred because they tend to
* show better performance thanks to cache locality. Per-cpu
* workqueues exclude the scheduler from choosing the CPU to
* execute the worker threads, which has an unfortunate side effect
* of increasing power consumption.
*
* The scheduler considers a CPU idle if it doesn't have any task
* to execute and tries to keep idle cores idle to conserve power;
* however, for example, a per-cpu work item scheduled from an
* interrupt handler on an idle CPU will force the scheduler to
* excute the work item on that CPU breaking the idleness, which in
* turn may lead to more scheduling choices which are sub-optimal
* in terms of power consumption.
*
* Workqueues marked with WQ_POWER_EFFICIENT are per-cpu by default
* but become unbound if workqueue.power_efficient kernel param is
* specified. Per-cpu workqueues which are identified to
* contribute significantly to power-consumption are identified and
* marked with this flag and enabling the power_efficient mode
* leads to noticeable power saving at the cost of small
* performance disadvantage.
*
* http://thread.gmane.org/gmane.linux.kernel/1480396
*/
WQ_POWER_EFFICIENT = 1 << 7,

__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */

Expand Down Expand Up @@ -333,11 +361,19 @@ enum {
*
* system_freezable_wq is equivalent to system_wq except that it's
* freezable.
*
* *_power_efficient_wq are inclined towards saving power and converted
* into WQ_UNBOUND variants if 'wq_power_efficient' is enabled; otherwise,
* they are same as their non-power-efficient counterparts - e.g.
* system_power_efficient_wq is identical to system_wq if
* 'wq_power_efficient' is disabled. See WQ_POWER_EFFICIENT for more info.
*/
extern struct workqueue_struct *system_wq;
extern struct workqueue_struct *system_long_wq;
extern struct workqueue_struct *system_unbound_wq;
extern struct workqueue_struct *system_freezable_wq;
extern struct workqueue_struct *system_power_efficient_wq;
extern struct workqueue_struct *system_freezable_power_efficient_wq;

static inline struct workqueue_struct * __deprecated __system_nrt_wq(void)
{
Expand Down
Loading

0 comments on commit d1590c8

Please sign in to comment.