From a420207d144d9d39515136fa4902ee33f125b53e Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan@openresty.com>
Date: Tue, 1 Dec 2020 11:43:07 -0800
Subject: [PATCH] runtime: fix print races in IRQ context and during print
cleanup
Prints can race when a print is issued from IRQ context or while print
cleanup takes place, which can lead to garbled print messages,
out-of-bounds memory accesses, and memory use-after-free. Here is one
example where racy modification of the print buffer length from IRQ
context caused a panic due to an out-of-bounds memory access:
BUG: unable to handle kernel paging request at ffffe8ffff621000
IP: [<ffffffffc05da0f3>] _stp_vsprint_memory+0x83/0x950 [stap_2c44636dfda18135ca3012a752599da6_13_533]
PGD 174b90067 PUD 174b8f067 PMD 174b93067 PTE 0
Oops: 0002 [#1] SMP
CPU: 12 PID: 3468 Comm: cat Kdump: loaded Tainted: G OE ------------ 3.10.0-1127.19.1.el7.x86_64.debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
task: ffff88001f4f0000 ti: ffff88004ea5c000 task.ti: ffff88004ea5c000
RIP: 0010:[<ffffffffc05da0f3>] [<ffffffffc05da0f3>] _stp_vsprint_memory+0x83/0x950 [stap_2c44636dfda18135ca3012a752599da6_13_533]
RSP: 0018:ffff88004ea5f9a8 EFLAGS: 00010082
RAX: ffffe8ffff621001 RBX: ffffe8ffff620ff2 RCX: fffffffffffffffe
RDX: 000000000000006e RSI: ffffffffffffffff RDI: ffffc90002c23730
RBP: ffff88004ea5fa28 R08: 00000000ffffffff R09: 0000000000000073
R10: ffffc90002c243d7 R11: 0000000000000001 R12: ffffc90002c2373f
R13: ffffe8ffff621004 R14: 0000000000000012 R15: 00000000fffffffe
FS: 00007f8a9b1d4740(0000) GS:ffff880179e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe8ffff621000 CR3: 00000000b3e3c000 CR4: 0000000000360fe0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff8103eb89>] ? sched_clock+0x9/0x10
[<ffffffff8114036f>] ? lock_release_holdtime.part.30+0xf/0x1a0
[<ffffffffc05dcb80>] function___global_trace__overload_0+0x5b0/0x1220 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffffc05d8993>] ? stp_lock_probe+0x53/0xe0 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff8188d879>] ? kretprobe_trampoline_holder+0x9/0x9
[<ffffffffc05e0662>] probe_7118+0x82/0xe0 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffffc05de866>] enter_kretprobe_common+0x256/0x490 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffffc05dead0>] enter_kretprobe_probe+0x10/0x20 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff8188e1d8>] trampoline_handler+0x148/0x220
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffff8188d89e>] kretprobe_trampoline+0x25/0x57
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffff8188d879>] kretprobe_trampoline_holder+0x9/0x9
[<ffffffff81384702>] ? security_inode_permission+0x22/0x30
[<ffffffff813489a0>] ? sysctl_head_finish+0x50/0x50
[<ffffffff812ac11d>] vfs_open+0x5d/0xb0
[<ffffffff812bb74a>] ? may_open+0x5a/0x120
[<ffffffff812c0af5>] do_last+0x285/0x15b0
[<ffffffff812bf18e>] ? link_path_walk+0x27e/0x8c0
[<ffffffff812c1ef0>] path_openat+0xd0/0x5d0
[<ffffffff8107a7f3>] ? kvm_clock_read+0x33/0x40
[<ffffffff812c38ad>] do_filp_open+0x4d/0xb0
[<ffffffff81889497>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff812d5a9b>] ? __alloc_fd+0xfb/0x270
[<ffffffff812ad784>] do_sys_open+0x124/0x220
[<ffffffff812ad89e>] SyS_open+0x1e/0x20
[<ffffffff8188d879>] kretprobe_trampoline_holder+0x9/0x9
This patch resolves the IRQ print races by disabling IRQs on the local
CPU while accessing that CPU's print buffer, and resolves the cleanup
races with a lock. We now also protect against data corruption and
panics from prints inside NMIs, by checking whether the current CPU was
accessing the log buffer when the NMI fired; in that case, the NMI's
prints are dropped, as there is no way to service them safely without a
dedicated log buffer for NMI context. This is achieved by forbidding
reentrancy with respect to _stp_print_trylock_irqsave() when the
runtime context isn't held. Reentrancy is otherwise allowed while the
runtime context is held, because the runtime context itself provides
reentrancy protection.
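As a rough sketch (names taken from the code below, control flow
simplified), every print path now follows this pattern:

    unsigned long flags;

    /* Fails during cleanup, or when an NMI interrupted a print on
     * this CPU; the print is dropped rather than corrupted. */
    if (!_stp_print_trylock_irqsave(&flags))
            return;
    log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
    /* ... append to log->buf / log->len; __stp_print_flush(log)
     * if the buffer fills up ... */
    _stp_print_unlock_irqrestore(&flags);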
---
runtime/linux/io.c | 44 ++++--
runtime/linux/print.c | 268 ++++++++++++++++++++++++--------
runtime/linux/runtime_context.h | 16 +-
runtime/print.h | 3 +
runtime/print_flush.c | 56 +------
runtime/stack.c | 34 ++--
runtime/transport/transport.c | 2 +-
runtime/vsprintf.c | 18 ++-
translate.cxx | 20 ++-
9 files changed, 310 insertions(+), 151 deletions(-)
diff --git a/runtime/linux/io.c b/runtime/linux/io.c
index 74a032c52..48adcecd7 100644
--- a/runtime/linux/io.c
+++ b/runtime/linux/io.c
@@ -20,9 +20,6 @@
#define WARN_STRING "WARNING: "
#define ERR_STRING "ERROR: "
-#if (STP_LOG_BUF_LEN < 10) /* sizeof(WARN_STRING) */
-#error "STP_LOG_BUF_LEN is too short"
-#endif
enum code { INFO=0, WARN, ERROR, DBUG };
@@ -31,25 +28,37 @@ static void _stp_vlog (enum code type, const char *func, int line, const char *f
static void _stp_vlog (enum code type, const char *func, int line, const char *fmt, va_list args)
{
+ struct _stp_log *log;
+ unsigned long flags;
+ size_t bytes_avail;
int num;
- char *buf = per_cpu_ptr(Stp_lbuf, get_cpu());
+ char *buf;
int start = 0;
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ bytes_avail = STP_BUFFER_SIZE - log->len;
+ if (unlikely(!bytes_avail))
+ goto err_unlock;
+
+ buf = &log->buf[log->len];
if (type == DBUG) {
- start = _stp_snprintf(buf, STP_LOG_BUF_LEN, "%s:%d: ", func, line);
+ start = _stp_snprintf(buf, bytes_avail, "%s:%d: ", func, line);
} else if (type == WARN) {
- /* This strcpy() is OK, since we know STP_LOG_BUF_LEN
- * is > sizeof(WARN_STRING). */
- strcpy (buf, WARN_STRING);
- start = sizeof(WARN_STRING) - 1;
+ strncpy(buf, WARN_STRING, bytes_avail);
+ start = min(bytes_avail, sizeof(WARN_STRING) - 1);
} else if (type == ERROR) {
- /* This strcpy() is OK, since we know STP_LOG_BUF_LEN
- * is > sizeof(ERR_STRING) (which is < sizeof(WARN_STRING). */
- strcpy (buf, ERR_STRING);
- start = sizeof(ERR_STRING) - 1;
+ strncpy(buf, ERR_STRING, bytes_avail);
+ start = min(bytes_avail, sizeof(ERR_STRING) - 1);
}
- num = vscnprintf (buf + start, STP_LOG_BUF_LEN - start - 1, fmt, args);
+ bytes_avail -= start;
+ if (unlikely(!bytes_avail))
+ goto err_unlock;
+
+ num = vscnprintf(buf + start, bytes_avail - 1, fmt, args);
if (num + start) {
if (buf[num + start - 1] != '\n') {
buf[num + start] = '\n';
@@ -66,12 +75,13 @@ static void _stp_vlog (enum code type, const char *func, int line, const char *f
if (type != DBUG) {
_stp_ctl_send(STP_OOB_DATA, buf, start + num + 1);
} else {
- _stp_print(buf);
- _stp_print_flush();
+ log->len += start + num + 1;
+ __stp_print_flush(log);
}
#endif
}
- put_cpu();
+err_unlock:
+ _stp_print_unlock_irqrestore(&flags);
}
/** Prints warning.
diff --git a/runtime/linux/print.c b/runtime/linux/print.c
index 777bca8b0..b7039f257 100644
--- a/runtime/linux/print.c
+++ b/runtime/linux/print.c
@@ -35,84 +35,211 @@
* @{
*/
-typedef struct __stp_pbuf {
- uint32_t len; /* bytes used in the buffer */
+struct _stp_log {
+ unsigned int len; /* Bytes used in the buffer */
char buf[STP_BUFFER_SIZE];
-} _stp_pbuf;
+ spinlock_t reentrancy_lock;
+};
+#include "print_flush.c"
-static void *Stp_pbuf = NULL;
+static struct _stp_log *_stp_log_pcpu;
+static bool _stp_log_stop;
+
+/*
+ * Can't use STP_DEFINE_RWLOCK() or this might be replaced with a spin lock. A
+ * read lock is used to synchronize every possible print buffer usage with the
+ * _stp_print_cleanup() function. The cleanup function sets _stp_log_stop to 1
+ * and then acquires a write lock to wait for everything using the print buffer
+ * to finish. Then, to stop any new instances of print buffer usage from
+ * occurring, _stp_log_stop is checked inside the read lock every time the read
+ * lock is acquired. This is all to prevent races from causing mayhem when it
+ * comes time to free the print buffers.
+ */
+static DEFINE_RWLOCK(_stp_print_lock);
-/** private buffer for _stp_vlog() */
-#ifndef STP_LOG_BUF_LEN
-#define STP_LOG_BUF_LEN 256
-#endif
+/*
+ * This disables IRQs to make per-CPU print buffer accesses atomic. There is a
+ * reentrancy protection mechanism specifically for NMIs, since they can violate
+ * our atomic guarantee. Reentrancy is otherwise allowed within code sections
+ * that have the runtime context held (via _stp_runtime_entryfn_get_context()).
+ */
+static bool _stp_print_trylock_irqsave(unsigned long *flags)
+ __acquires(&_stp_print_lock)
+{
+ struct _stp_log *log;
+
+ /*
+ * This needs to be a trylock because we could be inside an NMI that
+ * occurs while the write lock is held, resulting in a deadlock. The
+ * only time there is contention is during cleanup anyway.
+ *
+ * IRQs don't need to be disabled before attempting to acquire this lock
+ * because the write lock is never acquired inside an IRQ, so there's no
+ * risk of a deadlock.
+ */
+ if (!read_trylock(&_stp_print_lock))
+ return false;
+
+ /* Has to be checked with the lock held to avoid racing with cleanup */
+ if (_stp_log_stop)
+ goto print_unlock;
+
+ /*
+ * Check the per-CPU reentrancy lock for contention, unless the runtime
+ * context is already held, in which case we already have reentrancy
+ * protection. Otherwise, if the reentrancy lock is contended, that
+ * means we're either inside an NMI that fired while the current CPU was
+ * accessing the log buffer, or something is trying to nest calls to
+ * _stp_print_trylock_irqsave(). Our only choice is to reject the log
+ * access attempt in this case because log buffer corruption and panics
+ * could ensue if we're inside an NMI.
+ */
+ if (read_trylock(&_stp_context_lock)) {
+ struct context *c = _stp_runtime_get_context();
+ bool context_held = c && atomic_read(&c->busy);
+ read_unlock(&_stp_context_lock);
+
+ /* We only need to disable IRQs if the context is held */
+ if (context_held) {
+ local_irq_save(*flags);
+ return true;
+ }
+ }
+
+ /* Fall back onto the reentrancy lock if the context isn't held */
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (!spin_trylock_irqsave(&log->reentrancy_lock, *flags))
+ goto print_unlock;
-typedef char _stp_lbuf[STP_LOG_BUF_LEN];
-static void *Stp_lbuf = NULL;
+ return true;
+
+print_unlock:
+ read_unlock(&_stp_print_lock);
+ return false;
+}
+
+static void _stp_print_unlock_irqrestore(unsigned long *flags)
+ __releases(&_stp_print_lock)
+{
+ bool context_held = false;
+ struct _stp_log *log;
+
+ /*
+ * This read_trylock() is guaranteed to succeed if the context is held.
+ * The guarantee for this is provided by _stp_runtime_contexts_free().
+ * In other words, this won't fail when _stp_runtime_contexts_free() is
+ * trying to start freeing the contexts.
+ */
+ if (read_trylock(&_stp_context_lock)) {
+ struct context *c = _stp_runtime_get_context();
+ context_held = c && atomic_read(&c->busy);
+ read_unlock(&_stp_context_lock);
+ }
+
+ if (context_held) {
+ local_irq_restore(*flags);
+ } else {
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ spin_unlock_irqrestore(&log->reentrancy_lock, *flags);
+ }
+ read_unlock(&_stp_print_lock);
+}
/* create percpu print and io buffers */
static int _stp_print_init (void)
{
- Stp_pbuf = _stp_alloc_percpu(sizeof(_stp_pbuf));
- if (unlikely(Stp_pbuf == 0))
- return -1;
-
- /* now initialize IO buffer used in io.c */
- Stp_lbuf = _stp_alloc_percpu(sizeof(_stp_lbuf));
- if (unlikely(Stp_lbuf == 0)) {
- _stp_free_percpu(Stp_pbuf);
- return -1;
+ unsigned int cpu;
+
+ _stp_log_pcpu = _stp_alloc_percpu(sizeof(*_stp_log_pcpu));
+ if (!_stp_log_pcpu)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ struct _stp_log *log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ spin_lock_init(&log->reentrancy_lock);
}
return 0;
}
static void _stp_print_cleanup (void)
{
- if (Stp_pbuf)
- _stp_free_percpu(Stp_pbuf);
- if (Stp_lbuf)
- _stp_free_percpu(Stp_lbuf);
-}
+ unsigned int cpu;
+
+ /* Wait for the loggers to finish modifying the print buffers */
+ write_lock(&_stp_print_lock);
+ _stp_log_stop = true;
+ write_unlock(&_stp_print_lock);
+
+ for_each_possible_cpu(cpu) {
+ struct _stp_log *log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ /*
+ * Flush anything that could be left in the print buffer. It is
+ * safe to do this without any kind of synchronization mechanism
+ * because nothing is using this print buffer anymore.
+ */
+ __stp_print_flush(log);
+ }
-#include "print_flush.c"
+ _stp_free_percpu(_stp_log_pcpu);
+}
static inline void _stp_print_flush(void)
{
- stp_print_flush(per_cpu_ptr(Stp_pbuf, smp_processor_id()));
+ struct _stp_log *log;
+ unsigned long flags;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ __stp_print_flush(log);
+ _stp_print_unlock_irqrestore(&flags);
}
+
#ifndef STP_MAXBINARYARGS
#define STP_MAXBINARYARGS 127
#endif
-/** Reserves space in the output buffer for direct I/O.
+/** Reserves space in the output buffer for direct I/O. Must be called with
+ * _stp_print_lock held with IRQs disabled.
*/
static void * _stp_reserve_bytes (int numbytes)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- int size = STP_BUFFER_SIZE - pb->len;
- void * ret;
+ struct _stp_log *log;
+ char *ret;
+
+ BUG_ON(!irqs_disabled());
+ lockdep_assert_held(&_stp_print_lock);
if (unlikely(numbytes == 0 || numbytes > STP_BUFFER_SIZE))
return NULL;
- if (unlikely(numbytes > size))
- _stp_print_flush();
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (unlikely(numbytes > (STP_BUFFER_SIZE - log->len)))
+ __stp_print_flush(log);
- ret = pb->buf + pb->len;
- pb->len += numbytes;
+ ret = &log->buf[log->len];
+ log->len += numbytes;
return ret;
}
static void _stp_unreserve_bytes (int numbytes)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
+ struct _stp_log *log;
+
+ BUG_ON(!irqs_disabled());
+ lockdep_assert_held(&_stp_print_lock);
- if (unlikely(numbytes == 0 || numbytes > pb->len))
+ if (unlikely(numbytes == 0))
return;
- pb->len -= numbytes;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (numbytes <= log->len)
+ log->len -= numbytes;
}
/** Write 64-bit args directly into the output stream.
@@ -123,22 +250,25 @@ static void _stp_unreserve_bytes (int numbytes)
*/
static void _stp_print_binary (int num, ...)
{
+ unsigned long flags;
va_list vargs;
int i;
int64_t *args;
-
+
if (unlikely(num > STP_MAXBINARYARGS))
num = STP_MAXBINARYARGS;
- args = _stp_reserve_bytes(num * sizeof(int64_t));
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
- if (likely(args != NULL)) {
+ args = _stp_reserve_bytes(num * sizeof(int64_t));
+ if (args) {
va_start(vargs, num);
- for (i = 0; i < num; i++) {
+ for (i = 0; i < num; i++)
args[i] = va_arg(vargs, int64_t);
- }
va_end(vargs);
}
+ _stp_print_unlock_irqrestore(&flags);
}
/** Print into the print buffer.
@@ -149,6 +279,7 @@ static void _stp_print_binary (int num, ...)
static void _stp_printf (const char *fmt, ...)
{
va_list args;
+
va_start(args, fmt);
_stp_vsnprintf(NULL, 0, fmt, args);
va_end(args);
@@ -160,37 +291,38 @@ static void _stp_printf (const char *fmt, ...)
static void _stp_print (const char *str)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- char *end = pb->buf + STP_BUFFER_SIZE;
- char *ptr = pb->buf + pb->len;
- char *instr = (char *)str;
-
- while (ptr < end && *instr)
- *ptr++ = *instr++;
-
- /* Did loop terminate due to lack of buffer space? */
- if (unlikely(*instr)) {
- /* Don't break strings across subbufs. */
- /* Restart after flushing. */
- _stp_print_flush();
- end = pb->buf + STP_BUFFER_SIZE;
- ptr = pb->buf + pb->len;
- instr = (char *)str;
- while (ptr < end && *instr)
- *ptr++ = *instr++;
+ struct _stp_log *log;
+ unsigned long flags;
+ unsigned int i;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ while (1) {
+ for (i = log->len; i < STP_BUFFER_SIZE && *str; i++)
+ log->buf[i] = *str++;
+ log->len = i;
+ if (likely(!*str))
+ break;
+ __stp_print_flush(log);
}
- pb->len = ptr - pb->buf;
+ _stp_print_unlock_irqrestore(&flags);
}
static void _stp_print_char (const char c)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- int size = STP_BUFFER_SIZE - pb->len;
- if (unlikely(1 >= size))
- _stp_print_flush();
-
- pb->buf[pb->len] = c;
- pb->len ++;
+ struct _stp_log *log;
+ unsigned long flags;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (unlikely(log->len == STP_BUFFER_SIZE))
+ __stp_print_flush(log);
+ log->buf[log->len++] = c;
+ _stp_print_unlock_irqrestore(&flags);
}
static void _stp_print_kernel_info(char *sname, char *vstr, int ctx, int num_probes)
diff --git a/runtime/linux/runtime_context.h b/runtime/linux/runtime_context.h
index e716e6d39..2587956fa 100644
--- a/runtime/linux/runtime_context.h
+++ b/runtime/linux/runtime_context.h
@@ -42,8 +42,20 @@ static void _stp_runtime_contexts_free(void)
{
unsigned int cpu;
- /* Sync to make sure existing readers are done */
- write_lock(&_stp_context_lock);
+ /*
+ * Sync to make sure existing readers are done. We have to use
+ * write_trylock() here because write_lock() prevents new readers from
+ * acquiring the read lock. In runtime/linux/print.c we need to make
+ * reentrant attempts to acquire the context read lock always work,
+ * otherwise there will be an unbalanced spin unlock. This can happen if
+ * _stp_print_trylock_irqsave() is called and skips acquiring its spin
+ * lock since the context is held, and _stp_print_unlock_irqrestore() is
+ * called but fails to check if the context has been held due to
+ * read_trylock(&_stp_context_lock) failing because of a queued up write
+ * lock, in which case it will erroneously do a spin unlock instead.
+ */
+ while (!write_trylock(&_stp_context_lock))
+ cpu_relax();
_stp_context_stop = true;
write_unlock(&_stp_context_lock);
diff --git a/runtime/print.h b/runtime/print.h
index ede71f033..ffdea594d 100644
--- a/runtime/print.h
+++ b/runtime/print.h
@@ -10,6 +10,9 @@
#ifndef _STP_PRINT_H_
#define _STP_PRINT_H_
+/* The lock must be held with IRQs disabled to do any printing */
+static bool _stp_print_trylock_irqsave(unsigned long *flags);
+static void _stp_print_unlock_irqrestore(unsigned long *flags);
static int _stp_print_init(void);
static void _stp_print_cleanup(void);
static void *_stp_reserve_bytes(int numbytes);
diff --git a/runtime/print_flush.c b/runtime/print_flush.c
index cf40a2645..acd6a32d9 100644
--- a/runtime/print_flush.c
+++ b/runtime/print_flush.c
@@ -13,40 +13,31 @@
* is filled, or this is called. This MUST be called before returning
* from a probe or accumulated output in the print buffer will be lost.
*
- * @note Preemption must be disabled to use this.
+ * @note Interrupts must be disabled to use this.
*/
-static STP_DEFINE_SPINLOCK(_stp_print_lock);
-
-void stp_print_flush(_stp_pbuf *pb)
+static void __stp_print_flush(struct _stp_log *log)
{
- size_t len = pb->len;
+ size_t len = log->len;
void *entry = NULL;
/* check to see if there is anything in the buffer */
if (likely(len == 0))
return;
- pb->len = 0;
-
- if (unlikely(_stp_transport_get_state() != STP_TRANSPORT_RUNNING))
- return;
+ log->len = 0;
dbug_trans(1, "len = %zu\n", len);
#ifdef STP_BULKMODE
#ifdef NO_PERCPU_HEADERS
{
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
int inode_locked;
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
@@ -70,26 +61,20 @@ void stp_print_flush(_stp_pbuf *pb)
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#else /* !NO_PERCPU_HEADERS */
{
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
struct _stp_trace t = { .sequence = _stp_seq_inc(),
.pdu_len = len};
size_t bytes_reserved;
int inode_locked;
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
@@ -124,48 +109,24 @@ void stp_print_flush(_stp_pbuf *pb)
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#endif /* !NO_PERCPU_HEADERS */
#else /* !STP_BULKMODE */
{
- unsigned long flags;
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
int inode_locked;
- /* Prevent probe reentrancy on _stp_print_lock.
- *
- * Since stp_print_flush may be called from probe context, we
- * have to make sure that its lock, _stp_print_lock, can't
- * possibly be held outside probe context too. We ensure this
- * by grabbing the context here, so any probe triggered by this
- * region will appear reentrant and be skipped rather than
- * deadlock. Failure to get_context just means we're already
- * in a probe, which is fine.
- *
- * (see also _stp_ctl_send for a similar situation)
- *
- * A better solution would be to replace this
- * concurrency-control-laden effort with a lockless
- * algorithm.
- */
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
dbug_trans(0, "discarding %zu bytes of data\n", len);
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
dbug_trans(1, "calling _stp_data_write...\n");
- stp_spin_lock_irqsave(&_stp_print_lock, flags);
while (len > 0) {
size_t bytes_reserved;
@@ -182,12 +143,9 @@ void stp_print_flush(_stp_pbuf *pb)
break;
}
}
- stp_spin_unlock_irqrestore(&_stp_print_lock, flags);
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#endif /* !STP_BULKMODE */
}
diff --git a/runtime/stack.c b/runtime/stack.c
index 241ccf793..da23d4395 100644
--- a/runtime/stack.c
+++ b/runtime/stack.c
@@ -690,13 +690,20 @@ static void _stp_stack_kernel_sprint(char *str, int size, struct context* c,
* then call _stp_stack_print,
* then copy the result into the output string
* and clear the print buffer. */
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- _stp_print_flush();
+ struct _stp_log *log;
+ unsigned long flags;
- _stp_stack_kernel_print(c, sym_flags);
+ if (!_stp_print_trylock_irqsave(&flags)) {
+ *str = '\0';
+ return;
+ }
- strlcpy(str, pb->buf, size < (int)pb->len ? size : (int)pb->len);
- pb->len = 0;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ __stp_print_flush(log);
+ _stp_stack_kernel_print(c, sym_flags);
+ strlcpy(str, log->buf, min_t(int, size, log->len));
+ log->len = 0;
+ _stp_print_unlock_irqrestore(&flags);
}
static void _stp_stack_user_sprint(char *str, int size, struct context* c,
@@ -707,13 +714,20 @@ static void _stp_stack_user_sprint(char *str, int size, struct context* c,
* then call _stp_stack_print,
* then copy the result into the output string
* and clear the print buffer. */
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- _stp_print_flush();
+ struct _stp_log *log;
+ unsigned long flags;
- _stp_stack_user_print(c, sym_flags);
+ if (!_stp_print_trylock_irqsave(&flags)) {
+ *str = '\0';
+ return;
+ }
- strlcpy(str, pb->buf, size < (int)pb->len ? size : (int)pb->len);
- pb->len = 0;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ __stp_print_flush(log);
+ _stp_stack_user_print(c, sym_flags);
+ strlcpy(str, log->buf, min_t(int, size, log->len));
+ log->len = 0;
+ _stp_print_unlock_irqrestore(&flags);
}
#endif /* _STACK_C_ */
diff --git a/runtime/transport/transport.c b/runtime/transport/transport.c
index 57955334b..44e69b68c 100644
--- a/runtime/transport/transport.c
+++ b/runtime/transport/transport.c
@@ -540,8 +540,8 @@ static void _stp_transport_close(void)
current->pid);
_stp_cleanup_and_exit(0);
_stp_unregister_ctl_channel();
+ _stp_print_cleanup(); /* Requires the transport, so free this first */
_stp_transport_fs_close();
- _stp_print_cleanup(); /* free print buffers */
_stp_mem_debug_done();
dbug_trans(1, "---- CLOSED ----\n");
diff --git a/runtime/vsprintf.c b/runtime/vsprintf.c
index 28fd18f16..417d9f7f3 100644
--- a/runtime/vsprintf.c
+++ b/runtime/vsprintf.c
@@ -542,6 +542,8 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
number of chars for from string */
int qualifier; /* 'h', 'l', or 'L' for integer fields */
int num_bytes = 0;
+ unsigned long irqflags = 0;
+ bool got_print_lock = false;
/* Reject out-of-range values early */
if (unlikely((int) size < 0))
@@ -724,11 +726,14 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
num_bytes = STP_BUFFER_SIZE;
}
+ if (!_stp_print_trylock_irqsave(&irqflags))
+ return 0;
str = (char*)_stp_reserve_bytes(num_bytes);
if (str == NULL) {
_stp_error("Couldn't reserve any print buffer space\n");
- return 0;
+ goto err_unlock;
}
+ got_print_lock = true;
size = num_bytes;
end = str + size - 1;
@@ -820,8 +825,10 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
field_width, precision,
*fmt, flags);
if (unlikely(str == NULL)) {
- if (num_bytes > 0)
+ if (num_bytes > 0) {
_stp_unreserve_bytes(num_bytes);
+ goto err_unlock;
+ }
return 0;
}
continue;
@@ -923,7 +930,14 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
/* don't write out a null byte if the buf size is zero */
*end = '\0';
}
+
+ if (got_print_lock)
+ _stp_print_unlock_irqrestore(&irqflags);
return str-buf;
+
+err_unlock:
+ _stp_print_unlock_irqrestore(&irqflags);
+ return 0;
}
#endif /* _VSPRINTF_C_ */
diff --git a/translate.cxx b/translate.cxx
index 53f1d0725..f0195486c 100644
--- a/translate.cxx
+++ b/translate.cxx
@@ -1354,6 +1354,9 @@ c_unparser::emit_compiled_printfs ()
o->newline() << "unsigned long ptr_value;";
o->newline() << "int num_bytes;";
+ if (print_to_stream)
+ o->newline() << "unsigned long irqflags;";
+
o->newline() << "(void) width;";
o->newline() << "(void) precision;";
o->newline() << "(void) ptr_value;";
@@ -1452,7 +1455,9 @@ c_unparser::emit_compiled_printfs ()
}
o->newline() << "num_bytes = clamp(num_bytes, 0, STP_BUFFER_SIZE);";
- o->newline() << "str = (char*)_stp_reserve_bytes(num_bytes);";
+ o->newline() << "if (!_stp_print_trylock_irqsave(&irqflags))";
+ o->newline(1) << "return;";
+ o->newline(-1) << "str = (char*)_stp_reserve_bytes(num_bytes);";
o->newline() << "end = str ? str + num_bytes - 1 : 0;";
}
else // !print_to_stream
@@ -1547,8 +1552,14 @@ c_unparser::emit_compiled_printfs ()
o->newline() << "if (unlikely(str == NULL)) {";
o->indent(1);
if (print_to_stream)
+ {
o->newline() << "_stp_unreserve_bytes(num_bytes);";
- o->newline() << "return;";
+ o->newline() << "goto err_unlock;";
+ }
+ else
+ {
+ o->newline() << "return;";
+ }
o->newline(-1) << "}";
break;
@@ -1575,6 +1586,11 @@ c_unparser::emit_compiled_printfs ()
o->newline(-1) << "}";
+ if (print_to_stream)
+ {
+ o->newline(-1) << "err_unlock:";
+ o->newline(1) << "_stp_print_unlock_irqrestore(&irqflags);";
+ }
o->newline(-1) << "}";
}
o->newline() << "#endif // STP_LEGACY_PRINT";
--
2.29.2
A related hang, apparently caused by __stp_print_flush() re-entering
itself through a probe on the lock_acquire tracepoint that fires inside
mutex_trylock() during a flush (note frames #5 and #12 below):

PID: 40 TASK: ffff880175344000 CPU: 6 COMMAND: "kworker/6:0"
[exception RIP: native_halt+10]
RIP: ffffffff81888a1a RSP: ffff88017534ba00 RFLAGS: 00000046
RAX: 0000000000000001 RBX: 0000000000000046 RCX: 0000000000000001
RDX: ffff880175344000 RSI: 0000000000000001 RDI: ffffffff81079f25
RBP: ffff88017534ba00 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000006 R12: ffff8801793dcc04
R13: 0000000000000001 R14: 0000000000310000 R15: 0000000000000001
CS: 0010 SS: 0018
#0 [ffff88017534ba08] kvm_wait at ffffffff81079f5a
#1 [ffff88017534ba30] __pv_queued_spin_lock_slowpath at ffffffff8114f496
#2 [ffff88017534ba70] queued_spin_lock_slowpath at ffffffff810e842b
#3 [ffff88017534ba80] mutex_trylock at ffffffff81882b1b
#4 [ffff88017534bab8] _stp_transport_trylock_relay_inode at ffffffffc0c50c07 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#5 [ffff88017534bad8] __stp_print_flush at ffffffffc09b4733 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#6 [ffff88017534bb10] probe_8023 at ffffffffc0a73d85 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#7 [ffff88017534bb38] enter_real_tracepoint_probe_1687 at ffffffffc0c46947 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#8 [ffff88017534bb70] enter_tracepoint_probe_1687 at ffffffffc09b1e7e [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#9 [ffff88017534bb80] lock_acquire at ffffffff811460ba
#10 [ffff88017534bbe8] mutex_trylock at ffffffff81882a27
#11 [ffff88017534bc20] _stp_transport_trylock_relay_inode at ffffffffc0c50c07 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#12 [ffff88017534bc40] __stp_print_flush at ffffffffc09b4733 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#13 [ffff88017534bc78] _stp_vlog at ffffffffc09b7862 [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#14 [ffff88017534bcd8] _stp_dbug at ffffffffc09b800b [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#15 [ffff88017534bd38] systemtap_module_refresh at ffffffffc09b80ed [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#16 [ffff88017534bd50] module_refresher at ffffffffc09b810e [stap_116e6f88fb0960f76c434dbce65b51e3__12685]
#17 [ffff88017534bd60] process_one_work at ffffffff810da9cc
#18 [ffff88017534bde8] worker_thread at ffffffff810dafe6
#19 [ffff88017534be48] kthread at ffffffff810e44cf
#20 [ffff88017534bf50] ret_from_fork_nospec_begin at ffffffff818958dd
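A minimal, self-contained sketch of the per-CPU trylock guard that
turns this kind of reentrant flush into a dropped print instead of a
hang. The demo_* names are hypothetical; the logic is simplified from
the patch's reentrancy_lock handling in _stp_print_trylock_irqsave():

    #include <linux/percpu.h>
    #include <linux/spinlock.h>

    struct demo_log {
            spinlock_t reentrancy_lock;
    };

    static DEFINE_PER_CPU(struct demo_log, demo_log_pcpu) = {
            .reentrancy_lock =
                    __SPIN_LOCK_UNLOCKED(demo_log_pcpu.reentrancy_lock),
    };

    static bool demo_print_enter(unsigned long *flags)
    {
            struct demo_log *log = this_cpu_ptr(&demo_log_pcpu);

            /* If this CPU already holds the lock (e.g. an NMI or a
             * tracepoint probe fired mid-flush), spin_trylock fails
             * instead of spinning, so the nested print is rejected
             * rather than deadlocking on its own CPU. */
            return spin_trylock_irqsave(&log->reentrancy_lock, *flags);
    }

    static void demo_print_exit(unsigned long *flags)
    {
            struct demo_log *log = this_cpu_ptr(&demo_log_pcpu);

            spin_unlock_irqrestore(&log->reentrancy_lock, *flags);
    }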