We're getting this in production. We're using Docker 1.3.0 on kernels 3.16.0 and 3.16.4, and seeing this crash:
[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210
Someone has had this issue before (https://lkml.org/lkml/2014/2/15/217):
Since the code is heavily inlined, the faulting instruction pointer resolves to:
check_preempt_wakeup()
  find_matching_se()
    is_same_group()
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) <=== HERE
return se->cfs_rq;
return NULL;
}
@graemej and I verified that the faulting address, 0x150, is the offset of the cfs_rq field within struct sched_entity. This means that either se or pse is NULL.
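For the record, here's how one could pin that offset down with a compile-time assertion. This is a hypothetical snippet (verify_cfs_rq_offset is our name, not the kernel's) that could be dropped anywhere in the kernel tree; note that 0x150 is specific to our kernel config, since CONFIG_* options shift the struct layout:
/* Hypothetical compile-time check: the build fails unless ->cfs_rq sits
 * at offset 0x150 of struct sched_entity. It never needs to be called;
 * BUILD_BUG_ON() fires during compilation. */
#include <linux/bug.h>
#include <linux/sched.h>
#include <linux/stddef.h>

static inline void verify_cfs_rq_offset(void)
{
	BUILD_BUG_ON(offsetof(struct sched_entity, cfs_rq) != 0x150);
}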
The responses to the post indicate that this happens because se->depth is wrong, and there's a patch to fix it.
https://lkml.org/lkml/2014/2/17/66:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 235cfa7..4445e56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7317,7 +7317,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
*/
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
- if (!p->se.on_rq)
+ struct sched_entity *se = &p->se;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ se->depth = se->parent ? se->parent->depth + 1 : 0;
+#endif
+ if (!se->on_rq)
return;
/*
Problem solved, right? Sadly, this patch is present in 3.15+, and we've seen this issue on 3.16, so it's time to do some more digging.
The error was in is_same_group:
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) <=== HERE
return se->cfs_rq;
return NULL;
}
Up one level, we get find_matching_se:
static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
int se_depth, pse_depth;
/*
* preemption test can be made between sibling entities who are in the
* same cfs_rq i.e who have a common parent. Walk up the hierarchy of
* both tasks until we find their ancestors who are siblings of common
* parent.
*/
/* First walk up until both entities are at same depth */
se_depth = (*se)->depth;
pse_depth = (*pse)->depth;
while (se_depth > pse_depth) {
se_depth--;
*se = parent_entity(*se);
}
while (pse_depth > se_depth) {
pse_depth--;
*pse = parent_entity(*pse);
}
while (!is_same_group(*se, *pse)) {
*se = parent_entity(*se);
*pse = parent_entity(*pse);
}
}
Clearly, neither *se nor *pse as inputs to find_matching_se is NULL, since a NULL input would oops on the first reads of ->depth in this function. Rather, one of the calls to parent_entity is returning NULL, which we then pass in to is_same_group.
Here's the body of parent_entity and an excerpt from the definition of sched_entity:
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
return se->parent;
}
struct sched_entity {
int depth;
struct sched_entity *parent;
struct cfs_rq *cfs_rq;
};
It seems clear that when depth=0, parent=NULL: this mirrors the CPU cgroup hierarchy, where the root group has depth zero and no parent. So find_matching_se is traversing all the way up to the root without finding a common group, then attempting to continue past the root, and crashing.
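Another way to state the invariant: depth should always equal the number of ->parent hops to the root. A hypothetical helper (for illustration only; this doesn't exist in the kernel) would compute it like so:
/* Illustrative only: the "true" depth of a sched_entity is the number
 * of ->parent links separating it from the root. se->depth is meant
 * to be a cache of exactly this value. */
static int se_real_depth(struct sched_entity *se)
{
	int depth = 0;

	while (se->parent) {
		se = se->parent;
		depth++;
	}
	return depth;
}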
Let's take another look:
while (se_depth > pse_depth) {
se_depth--;
*se = parent_entity(*se);
}
while (pse_depth > se_depth) {
pse_depth--;
*pse = parent_entity(*pse);
}
while (!is_same_group(*se, *pse)) { // this is called with NULL at some point.
*se = parent_entity(*se);
*pse = parent_entity(*pse);
}
It's clear now that this guy was right:
Hrm.. that means we got se->depth wrong. I'll have a poke tomorrow. (https://lkml.org/lkml/2014/2/16/83)
If you really think through that algorithm, you'll be able to convince yourself that it can't possibly call is_same_group with NULL unless the parent+depth information in one or more of the sched_entities is incorrect.
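If you'd rather let a computer convince you, here's a self-contained userspace sketch that replays the walk with a toy stand-in for sched_entity (all names here are ours, not kernel code). With consistent parent/depth data, the lockstep walk always meets at a common ancestor; give one entity a stale depth, and the final loop walks past the root into NULL:
/* Userspace replay of the find_matching_se() walk with a toy struct. */
#include <stdio.h>
#include <stddef.h>

struct toy_se {
	int depth;
	struct toy_se *parent;
	void *cfs_rq;
};

static struct toy_se *parent_entity(struct toy_se *se)
{
	return se->parent;
}

static int is_same_group(struct toy_se *se, struct toy_se *pse)
{
	return se->cfs_rq == pse->cfs_rq; /* a NULL se/pse faults here */
}

static void find_matching_se(struct toy_se **se, struct toy_se **pse)
{
	int se_depth = (*se)->depth;
	int pse_depth = (*pse)->depth;

	while (se_depth > pse_depth) {
		se_depth--;
		*se = parent_entity(*se);
	}
	while (pse_depth > se_depth) {
		pse_depth--;
		*pse = parent_entity(*pse);
	}
	while (!is_same_group(*se, *pse)) {
		*se = parent_entity(*se);
		*pse = parent_entity(*pse);
	}
}

int main(void)
{
	int root_rq, a_rq, b_rq; /* dummy cfs_rq identities */
	struct toy_se root = { 0, NULL,  &root_rq };
	struct toy_se a    = { 1, &root, &a_rq };
	struct toy_se b    = { 1, &root, &b_rq };
	struct toy_se *se, *pse;

	/* Correct depths: both walk up once and meet at the root. */
	se = &a; pse = &b;
	find_matching_se(&se, &pse);
	printf("consistent depths: met at root? %d\n",
	       se == &root && pse == &root);

	/* Stale depth on b (claims 0, is really 1): the first loop drags
	 * a up to the root, the groups never match, and the final loop
	 * follows root->parent == NULL. */
	b.depth = 0;
	se = &a; pse = &b;
	find_matching_se(&se, &pse);
	return 0;
}
The second call segfaults dereferencing ->cfs_rq through a NULL pointer: the same shape as the oops at the top of this report.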
Now, the funny thing is that the parent or depth is wrong even with this patch applied. Let's trace through what's happening here. Here's the "fix": we're setting the sched_entity's depth to its parent's depth, plus one.
/*
* We switched to the sched_fair class.
*/
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* Since the real-depth could have been changed (only FAIR
* class maintain depth value), reset depth properly.
*/
se->depth = se->parent ? se->parent->depth + 1 : 0;
#endif
if (!se->on_rq)
return;
/*
* We were most likely switched from sched_rt, so
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
if (rq->curr == p)
resched_task(rq->curr);
else
check_preempt_curr(rq, p, 0);
}
So above, we've clobbered the task_struct's sched_entity's depth to the correct value. Now we call check_preempt_curr with that task_struct:
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
if (class == rq->curr->sched_class)
break;
if (class == p->sched_class) {
resched_task(rq->curr);
break;
}
}
}
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
rq->skip_clock_update = 1;
}
This calls the check_preempt_curr member of the sched_class, which...
const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,
}
And the definition for that is...
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
int next_buddy_marked = 0;
if (unlikely(se == pse))
return;
/*
* This is possible from callers such as move_task(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
next_buddy_marked = 1;
}
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
* Note: this also catches the edge-case of curr being in a throttled
* group (e.g. via set_curr_task), since update_curr() (in the
* enqueue of curr) will have resulted in resched being set. This
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
if (test_tsk_need_resched(curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
goto preempt;
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
* triggering this preemption.
*/
if (!next_buddy_marked)
set_next_buddy(pse);
goto preempt;
}
return;
preempt:
resched_task(curr);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);
}
Now, you'll see we call find_matching_se about two-thirds of the way down that function with se and pse. pse is our task_struct *p passed in, which we have clobbered into correctness. se, on the other hand, comes from rq->curr, which we have made no recent clobbers to. We can therefore assume that the currently-scheduled task has incorrect parent+depth information.
Yes indeed. My first idea yesterday was to put it in set_task_rq() to be absolutely sure we catch all; but if this is sufficient its better. (https://lkml.org/lkml/2014/2/17/221)
Well, it would seem it wasn't sufficient after all! As Peter alluded to, this patch should be sufficient to fix it:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..bcffba2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -844,6 +844,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = tg->cfs_rq[cpu];
p->se.parent = tg->se[cpu];
+ p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
This seems completely reasonable: since depth is essentially a cache of the number of parents above the sched_entity in question, we should probably be setting depth whenever we set parent.
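Stated as a rule: whoever writes parent must also write depth. A hypothetical helper makes the principle explicit (the actual patch just open-codes this in set_task_rq()):
/* Illustrative helper: keep the cached ->depth in sync with ->parent
 * by only ever updating the two together. */
static inline void se_set_parent(struct sched_entity *se,
				 struct sched_entity *parent)
{
	se->parent = parent;
	se->depth = parent ? parent->depth + 1 : 0;
}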
Good analysis @burke. I think we are hitting the same or a very similar thing on the 3.16.0-34 kernel from Ubuntu. I am going to try to reproduce and confirm when I get a chance. For now I reverted to 3.13, and that dodges the problem.
Have you raised the issue with the kernel team?