We're getting this in production. We're using Docker 1.3.0 on kernels 3.16.0 and 3.16.4, and seeing this crash:
[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210
Someone has had this issue before (https://lkml.org/lkml/2014/2/15/217):
Since the code is heavily inlined, the faulting instruction pointer resolves to:
check_preempt_wakeup()
  find_matching_se()
    is_same_group()
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) <=== HERE
return se->cfs_rq;
return NULL;
}
@graemej and I verified that the faulting address, 0x150, is the offset of the cfs_rq field within struct sched_entity. This means that either se or pse is NULL.
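For the record, here's how one could pin that offset down with a compile-time assertion. This is a hypothetical snippet (verify_cfs_rq_offset is our name, not the kernel's) that could be dropped anywhere in the kernel tree; note that 0x150 is specific to our kernel config, since CONFIG_* options shift the struct layout:
/* Hypothetical compile-time check: the build fails unless ->cfs_rq sits
 * at offset 0x150 of struct sched_entity. It never needs to be called;
 * BUILD_BUG_ON() fires during compilation. */
#include <linux/bug.h>
#include <linux/sched.h>
#include <linux/stddef.h>

static inline void verify_cfs_rq_offset(void)
{
	BUILD_BUG_ON(offsetof(struct sched_entity, cfs_rq) != 0x150);
}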
The responses to the post indicate that this happens because se->depth is wrong, and there's a patch to fix it.
https://lkml.org/lkml/2014/2/17/66:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 235cfa7..4445e56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7317,7 +7317,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
*/
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
- if (!p->se.on_rq)
+ struct sched_entity *se = &p->se;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ se->depth = se->parent ? se->parent->depth + 1 : 0;
+#endif
+ if (!se->on_rq)
return;
/*
Problem solved, right? Sadly, this patch is present in 3.15+, and we've seen this issue on 3.16, so it's time to do some more digging.
The error was in is_same_group:
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) <=== HERE
return se->cfs_rq;
return NULL;
}
Up one level, we get find_matching_se:
static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
int se_depth, pse_depth;
/*
* preemption test can be made between sibling entities who are in the
* same cfs_rq i.e who have a common parent. Walk up the hierarchy of
* both tasks until we find their ancestors who are siblings of common
* parent.
*/
/* First walk up until both entities are at same depth */
se_depth = (*se)->depth;
pse_depth = (*pse)->depth;
while (se_depth > pse_depth) {
se_depth--;
*se = parent_entity(*se);
}
while (pse_depth > se_depth) {
pse_depth--;
*pse = parent_entity(*pse);
}
while (!is_same_group(*se, *pse)) {
*se = parent_entity(*se);
*pse = parent_entity(*pse);
}
}
Clearly, neither *se nor *pse as inputs to find_matching_se is NULL, since a NULL input would oops on the first reads of ->depth in this function. Rather, one of the calls to parent_entity is returning NULL, which we then pass in to is_same_group.
Here's the body of parent_entity and an excerpt from the definition of sched_entity:
static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
return se->parent;
}
struct sched_entity {
int depth;
struct sched_entity *parent;
struct cfs_rq *cfs_rq;
};
It seems clear that when depth=0, parent=NULL: this mirrors the CPU cgroup hierarchy, where the root group has depth zero and no parent. So find_matching_se is traversing all the way up to the root without finding a common group, then attempting to continue past the root, and crashing.
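Another way to state the invariant: depth should always equal the number of ->parent hops to the root. A hypothetical helper (for illustration only; this doesn't exist in the kernel) would compute it like so:
/* Illustrative only: the "true" depth of a sched_entity is the number
 * of ->parent links separating it from the root. se->depth is meant
 * to be a cache of exactly this value. */
static int se_real_depth(struct sched_entity *se)
{
	int depth = 0;

	while (se->parent) {
		se = se->parent;
		depth++;
	}
	return depth;
}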
Let's take another look:
while (se_depth > pse_depth) {
se_depth--;
*se = parent_entity(*se);
}
while (pse_depth > se_depth) {
pse_depth--;
*pse = parent_entity(*pse);
}
while (!is_same_group(*se, *pse)) { // this is called with NULL at some point.
*se = parent_entity(*se);
*pse = parent_entity(*pse);
}
It's clear now that this guy was right:
Hrm.. that means we got se->depth wrong. I'll have a poke tomorrow. (https://lkml.org/lkml/2014/2/16/83)
If you really think through that algorithm, you'll be able to convince yourself that it can't possibly call is_same_group with NULL unless the parent+depth information in one or more of the sched_entities is incorrect.
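If you'd rather let a computer convince you, here's a self-contained userspace sketch that replays the walk with a toy stand-in for sched_entity (all names here are ours, not kernel code). With consistent parent/depth data, the lockstep walk always meets at a common ancestor; give one entity a stale depth, and the final loop walks past the root into NULL:
/* Userspace replay of the find_matching_se() walk with a toy struct. */
#include <stdio.h>
#include <stddef.h>

struct toy_se {
	int depth;
	struct toy_se *parent;
	void *cfs_rq;
};

static struct toy_se *parent_entity(struct toy_se *se)
{
	return se->parent;
}

static int is_same_group(struct toy_se *se, struct toy_se *pse)
{
	return se->cfs_rq == pse->cfs_rq; /* a NULL se/pse faults here */
}

static void find_matching_se(struct toy_se **se, struct toy_se **pse)
{
	int se_depth = (*se)->depth;
	int pse_depth = (*pse)->depth;

	while (se_depth > pse_depth) {
		se_depth--;
		*se = parent_entity(*se);
	}
	while (pse_depth > se_depth) {
		pse_depth--;
		*pse = parent_entity(*pse);
	}
	while (!is_same_group(*se, *pse)) {
		*se = parent_entity(*se);
		*pse = parent_entity(*pse);
	}
}

int main(void)
{
	int root_rq, a_rq, b_rq; /* dummy cfs_rq identities */
	struct toy_se root = { 0, NULL,  &root_rq };
	struct toy_se a    = { 1, &root, &a_rq };
	struct toy_se b    = { 1, &root, &b_rq };
	struct toy_se *se, *pse;

	/* Correct depths: both walk up once and meet at the root. */
	se = &a; pse = &b;
	find_matching_se(&se, &pse);
	printf("consistent depths: met at root? %d\n",
	       se == &root && pse == &root);

	/* Stale depth on b (claims 0, is really 1): the first loop drags
	 * a up to the root, the groups never match, and the final loop
	 * follows root->parent == NULL. */
	b.depth = 0;
	se = &a; pse = &b;
	find_matching_se(&se, &pse);
	return 0;
}
The second call segfaults dereferencing ->cfs_rq through a NULL pointer: the same shape as the oops at the top of this report.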
Now, the funny thing is that the parent or depth is wrong even with this patch applied. Let's trace through what's happening here. Here's the "fix": we're setting the sched_entity's depth to its parent's depth, plus one.
/*
* We switched to the sched_fair class.
*/
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* Since the real-depth could have been changed (only FAIR
* class maintain depth value), reset depth properly.
*/
se->depth = se->parent ? se->parent->depth + 1 : 0;
#endif
if (!se->on_rq)
return;
/*
* We were most likely switched from sched_rt, so
* kick off the schedule if running, otherwise just see
* if we can still preempt the current task.
*/
if (rq->curr == p)
resched_task(rq->curr);
else
check_preempt_curr(rq, p, 0);
}
So above, we've clobbered the task_struct's sched_entity's depth to the correct value. Now we call check_preempt_curr with that task_struct:
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {
for_each_class(class) {
if (class == rq->curr->sched_class)
break;
if (class == p->sched_class) {
resched_task(rq->curr);
break;
}
}
}
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
rq->skip_clock_update = 1;
}
This calls the check_preempt_curr member of the sched_class, which...
const struct sched_class fair_sched_class = {
.check_preempt_curr = check_preempt_wakeup,
}
And the definition for that is...
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;
int next_buddy_marked = 0;
if (unlikely(se == pse))
return;
/*
* This is possible from callers such as move_task(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);
next_buddy_marked = 1;
}
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
* Note: this also catches the edge-case of curr being in a throttled
* group (e.g. via set_curr_task), since update_curr() (in the
* enqueue of curr) will have resulted in resched being set. This
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
if (test_tsk_need_resched(curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
goto preempt;
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
* triggering this preemption.
*/
if (!next_buddy_marked)
set_next_buddy(pse);
goto preempt;
}
return;
preempt:
resched_task(curr);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);
}
Now, you'll see we call find_matching_se about two-thirds of the way down that function with se and pse. pse is our task_struct *p passed in, which we have clobbered into correctness. se, on the other hand, comes from rq->curr, which we have made no recent clobbers to. We can therefore assume that the currently-scheduled task has incorrect parent+depth information.
Yes indeed. My first idea yesterday was to put it in set_task_rq() to be absolutely sure we catch all; but if this is sufficient its better. (https://lkml.org/lkml/2014/2/17/221)
Well, it would seem it wasn't sufficient after all! As Peter alluded to, this patch should be sufficient to fix it:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..bcffba2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -844,6 +844,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = tg->cfs_rq[cpu];
p->se.parent = tg->se[cpu];
+ p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
This seems completely reasonable: since depth is essentially a cache of the number of parents above the sched_entity in question, we should probably be setting depth whenever we set parent.
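Stated as a rule: whoever writes parent must also write depth. A hypothetical helper makes the principle explicit (the actual patch just open-codes this in set_task_rq()):
/* Illustrative helper: keep the cached ->depth in sync with ->parent
 * by only ever updating the two together. */
static inline void se_set_parent(struct sched_entity *se,
				 struct sched_entity *parent)
{
	se->parent = parent;
	se->depth = parent ? parent->depth + 1 : 0;
}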
Good analysis @burke. I think we are hitting the same or a very similar thing on the 3.16.0-34 kernel from Ubuntu. I am going to try to reproduce and confirm when I get a chance. For now I reverted to 3.13, and that dodges the problem.
Have you raised the issue with the kernel team?