We're seeing this crash in production, running Docker 1.3.0 on kernels 3.16.0 and 3.16.4:

[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210

Someone has had this issue before (https://lkml.org/lkml/2014/2/15/217):

Since it's pretty inlined, the code points to:

	check_preempt_wakeup()
		find_matching_se()
			is_same_group()


	static inline struct cfs_rq *
	is_same_group(struct sched_entity *se, struct sched_entity *pse)
	{
	        if (se->cfs_rq == pse->cfs_rq)	<=== HERE
	                return se->cfs_rq;
	        return NULL;
	}

@graemej and I verified that offset 0x150 of struct sched_entity is cfs_rq. This means that either se or pse is NULL. The responses to the post indicate that this happens because se->depth is wrong, and there's a patch to fix it.
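(If you want to double-check that offset yourself, here's a minimal sketch of one way to do it, not necessarily how we did: a throwaway module, built against the headers for the running kernel, that prints the offset. It assumes CONFIG_FAIR_GROUP_SCHED is enabled, which it is on these kernels.)

/*
 * se_offset.c: throwaway module to confirm that offset 0x150 in
 * struct sched_entity really is cfs_rq on this kernel config.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stddef.h>

static int __init se_offset_init(void)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	pr_info("offsetof(struct sched_entity, cfs_rq) = 0x%zx\n",
		offsetof(struct sched_entity, cfs_rq));
#endif
	return 0;
}

static void __exit se_offset_exit(void)
{
}

module_init(se_offset_init);
module_exit(se_offset_exit);
MODULE_LICENSE("GPL");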

The patch, from https://lkml.org/lkml/2014/2/17/66:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 235cfa7..4445e56 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7317,7 +7317,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
-	if (!p->se.on_rq)
+	struct sched_entity *se = &p->se;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	se->depth = se->parent ? se->parent->depth + 1 : 0;
+#endif
+	if (!se->on_rq)
 		return;
 
 	/*

Problem solved, right? Sadly, this patch is present in 3.15+, and we've seen this issue on 3.16, so it's time to do some more digging.

The error was in is_same_group:

static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
        if (se->cfs_rq == pse->cfs_rq)	<=== HERE
                return se->cfs_rq;
        return NULL;
}

Up one level, we get find_matching_se:

static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
	int se_depth, pse_depth;

	/*
	 * preemption test can be made between sibling entities who are in the
	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
	 * both tasks until we find their ancestors who are siblings of common
	 * parent.
	 */

	/* First walk up until both entities are at same depth */
	se_depth = (*se)->depth;
	pse_depth = (*pse)->depth;

	while (se_depth > pse_depth) {
		se_depth--;
		*se = parent_entity(*se);
	}

	while (pse_depth > se_depth) {
		pse_depth--;
		*pse = parent_entity(*pse);
	}

	while (!is_same_group(*se, *pse)) {
		*se = parent_entity(*se);
		*pse = parent_entity(*pse);
	}
}

Clearly, neither *se nor *pse is NULL on entry to find_matching_se, since that would panic at the first ->depth dereferences in this function. Rather, one of the calls to parent_entity is returning NULL, which we then pass into is_same_group.

Here's the contents of parent_entity and an excerpt from the definition of sched_entity:

static inline struct sched_entity *parent_entity(struct sched_entity *se)
{
	return se->parent;
}

struct sched_entity {
	int			depth;
	struct sched_entity	*parent;
	struct cfs_rq		*cfs_rq;
};

It seems clear that when depth is 0, parent is NULL: the sched_entity hierarchy mirrors the cpu cgroup hierarchy, where the root group has depth zero and no parent. So find_matching_se is walking all the way up to the root without finding a common cfs_rq, then trying to keep going past it, and crashing.

Let's take another look:

while (se_depth > pse_depth) {
	se_depth--;
	*se = parent_entity(*se);
}

while (pse_depth > se_depth) {
	pse_depth--;
	*pse = parent_entity(*pse);
}

while (!is_same_group(*se, *pse)) { // this is called with NULL at some point.
	*se = parent_entity(*se);
	*pse = parent_entity(*pse);
}

It's clear now that this guy was right:

Hrm.. that means we got se->depth wrong. I'll have a poke tomorrow. (https://lkml.org/lkml/2014/2/16/83)

If you really think through that algorithm, you'll be able to convince yourself that it can't possibly call is_same_group with NULL unless the parent+depth information in one or more of the sched_entities is incorrect.
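To see why, here's a rough userspace mock of that walk (ordinary C, not kernel code; the structs and values are made up) showing how a stale depth on one entity sends the final loop off the top of the hierarchy and hands NULL to is_same_group:

#include <stdio.h>
#include <stdlib.h>

struct cfs_rq { int id; };

struct entity {
	int		depth;
	struct entity	*parent;
	struct cfs_rq	*cfs_rq;
};

static struct entity *parent_entity(struct entity *se)
{
	return se->parent;
}

static struct cfs_rq *is_same_group(struct entity *se, struct entity *pse)
{
	/* In the kernel, se or pse being NULL here is the crash. */
	if (!se || !pse) {
		printf("boom: is_same_group() reached with a NULL entity\n");
		exit(0);
	}
	return se->cfs_rq == pse->cfs_rq ? se->cfs_rq : NULL;
}

int main(void)
{
	struct cfs_rq root_rq = { 0 }, child_rq = { 1 };
	struct entity root  = { .depth = 0, .parent = NULL,  .cfs_rq = &root_rq };
	struct entity child = { .depth = 1, .parent = &root, .cfs_rq = &child_rq };

	/* curr's entity really lives in the child group, but its cached depth
	 * was never updated when it moved, so it still claims depth 0. */
	struct entity *se = &child, *pse = &root;
	se->depth = 0;					/* stale! */

	int se_depth = se->depth, pse_depth = pse->depth;

	/* Both entities look like they're already at the same depth, so the
	 * equalising loops do nothing... */
	while (se_depth > pse_depth)  { se_depth--;  se  = parent_entity(se);  }
	while (pse_depth > se_depth)  { pse_depth--; pse = parent_entity(pse); }

	/* ...and since the groups never match, this walks past the root. */
	while (!is_same_group(se, pse)) {
		se  = parent_entity(se);
		pse = parent_entity(pse);
	}
	return 0;
}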

Now, the funny thing is that the parent or depth is wrong even with this patch. Let's trace through what's happening here:

Here's the "fix". We're setting the sched_entity's depth to its parent's depth, plus one.

/*
 * We switched to the sched_fair class.
 */
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
	struct sched_entity *se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED
	/*
	 * Since the real-depth could have been changed (only FAIR
	 * class maintain depth value), reset depth properly.
	 */
	se->depth = se->parent ? se->parent->depth + 1 : 0;
#endif
	if (!se->on_rq)
		return;

	/*
	 * We were most likely switched from sched_rt, so
	 * kick off the schedule if running, otherwise just see
	 * if we can still preempt the current task.
	 */
	if (rq->curr == p)
		resched_task(rq->curr);
	else
		check_preempt_curr(rq, p, 0);
}

So above, we've clobbered the task_struct's sched_entity's depth to the correct value. Now we call check_preempt_curr with that task_struct:

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
	const struct sched_class *class;

	if (p->sched_class == rq->curr->sched_class) {
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
	} else {
		for_each_class(class) {
			if (class == rq->curr->sched_class)
				break;
			if (class == p->sched_class) {
				resched_task(rq->curr);
				break;
			}
		}
	}

	/*
	 * A queue event has occurred, and we're going to schedule.  In
	 * this case, we can save a useless back to back clock update.
	 */
	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
		rq->skip_clock_update = 1;
}

This calls the check_preempt_curr member of the sched_class, which...

const struct sched_class fair_sched_class = {
	/* ... */
	.check_preempt_curr	= check_preempt_wakeup,
	/* ... */
};

And the definition for that is...

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
	int scale = cfs_rq->nr_running >= sched_nr_latency;
	int next_buddy_marked = 0;

	if (unlikely(se == pse))
		return;

	/*
	 * This is possible from callers such as move_task(), in which we
	 * unconditionally check_prempt_curr() after an enqueue (which may have
	 * lead to a throttle).  This both saves work and prevents false
	 * next-buddy nomination below.
	 */
	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
		return;

	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
		set_next_buddy(pse);
		next_buddy_marked = 1;
	}

	/*
	 * We can come here with TIF_NEED_RESCHED already set from new task
	 * wake up path.
	 *
	 * Note: this also catches the edge-case of curr being in a throttled
	 * group (e.g. via set_curr_task), since update_curr() (in the
	 * enqueue of curr) will have resulted in resched being set.  This
	 * prevents us from potentially nominating it as a false LAST_BUDDY
	 * below.
	 */
	if (test_tsk_need_resched(curr))
		return;

	/* Idle tasks are by definition preempted by non-idle tasks. */
	if (unlikely(curr->policy == SCHED_IDLE) &&
	    likely(p->policy != SCHED_IDLE))
		goto preempt;

	/*
	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
	 * is driven by the tick):
	 */
	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
		return;

	find_matching_se(&se, &pse);
	update_curr(cfs_rq_of(se));
	BUG_ON(!pse);
	if (wakeup_preempt_entity(se, pse) == 1) {
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);
		goto preempt;
	}

	return;

preempt:
	resched_task(curr);
	/*
	 * Only set the backward buddy when the current task is still
	 * on the rq. This can happen when a wakeup gets interleaved
	 * with schedule on the ->pre_schedule() or idle_balance()
	 * point, either of which can * drop the rq lock.
	 *
	 * Also, during early boot the idle thread is in the fair class,
	 * for obvious reasons its a bad idea to schedule back to it.
	 */
	if (unlikely(!se->on_rq || curr == rq->idle))
		return;

	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
		set_last_buddy(se);
}

Now, you'll see we call find_matching_se about 2/3 of the way down that function with se and pse.

pse comes from the task_struct *p we passed in (pse = &p->se), which we've just clobbered into correctness.

se, on the other hand, comes from rq->curr (se = &curr->se), which we have made no such clobbers to.

We can therefore assume that the currently-scheduled task has incorrect parent+depth information.

Yes indeed. My first idea yesterday was to put it in set_task_rq() to be absolutely sure we catch all; but if this is sufficient its better. (https://lkml.org/lkml/2014/2/17/221)

Well, it would seem it wasn't sufficient after all! As Peter alluded to, a patch along these lines in set_task_rq() should be enough to fix it:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..bcffba2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -844,6 +844,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #ifdef CONFIG_FAIR_GROUP_SCHED
        p->se.cfs_rq = tg->cfs_rq[cpu];
        p->se.parent = tg->se[cpu];
+       p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
 #endif

 #ifdef CONFIG_RT_GROUP_SCHED

This seems completely reasonable: Since depth is essentially a cache of the number of parents above the sched_entity in question, we should probably be setting depth when we set parent.
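Stated as the invariant the patch maintains (a purely illustrative helper, not kernel code): walking the parent chain should always agree with the cached value.

/* Illustrative only: recompute an entity's depth from its parent chain.
 * With the set_task_rq() patch above, se->depth should always equal
 * this, because depth is now updated wherever parent is assigned. */
static int recomputed_depth(struct sched_entity *se)
{
	int depth = 0;

	for (se = se->parent; se; se = se->parent)
		depth++;

	return depth;
}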

@warnerpr

Good analysis @burke. I think we are hitting the same or very similar thing on 3.16.0-34 kernel from Ubuntu. I am going to reproduce and try to confirm when I get a chance. For now I reverted to 3.13 and that dodges the problem.

Have you raised the issue with the kernel team?
