From 8349e89fc19ddfeae40e866471ac42ad03445f0d Mon Sep 17 00:00:00 2001
From: Sultan Alsawaf <sultan@openresty.com>
Date: Wed, 18 Nov 2020 22:34:29 -0800
Subject: [PATCH] runtime: make print statements safe to use in any context
Currently, a lockdep warning can be triggered by using a print statement
inside a probe that can fire in IRQ context, such as an hrtimer probe.
The warning is caused by stp_print_flush() acquiring a mutex from a
context not backed by any process, which leaves the mutex with a
nonsensical owner. Since mutexes cannot be acquired in IRQ context, we
must do some juggling to make printing safe in all contexts and defer
the print flush to a safe context. This is achieved by having a worker
poll in the background for CPUs which need their print buffer flushed.
The nitty-gritty details to achieve this are documented in the comments.
Polling is unfortunately the only option here because every form of
asynchronous function execution can deadlock inside a tracepoint. This
patch also makes printing NMI-safe, though printing is not guaranteed to
always work in NMI context.
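The deferral scheme above can be sketched in userspace C. This is only an illustrative analogue, not the patch's kernel code: the struct names, buffer sizes, and a plain bitmask stand in for the patch's per-CPU `struct _stp_log`, `_stp_flush_mask` cpumask, and delayed-work poller. The key property it demonstrates is that the unsafe context only sets a bit, while the actual flush happens later in the poller's safe context:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NCPU  4
#define BUFSZ 64

/* Stand-in for the patch's per-CPU struct _stp_log (sizes are made up). */
struct log_buf {
	unsigned int len;
	char buf[BUFSZ];
};

static struct log_buf logs[NCPU];
static unsigned long flush_mask;  /* bit n set => CPU n wants a flush */
static char out[NCPU][BUFSZ];     /* stands in for the relay transport */

/* Called from the "unsafe" (IRQ-like) context: only marks the CPU as
 * needing a flush; it never touches the transport directly. */
static void request_flush(int cpu)
{
	flush_mask |= 1UL << cpu;
}

/* One pass of the polling worker: snapshot and clear the request mask,
 * then flush each marked CPU's buffer from this safe context. */
static void poll_worker_once(void)
{
	unsigned long mask = flush_mask;
	int cpu;

	flush_mask = 0;
	for (cpu = 0; cpu < NCPU; cpu++) {
		if (mask & (1UL << cpu)) {
			memcpy(out[cpu], logs[cpu].buf, logs[cpu].len);
			out[cpu][logs[cpu].len] = '\0';
			logs[cpu].len = 0;
		}
	}
}
```

In the real patch the poller is a delayed work item that reschedules itself every jiffy and punts each flush to a worker pinned on the owning CPU; the sketch collapses that into a single synchronous pass.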
This is the lockdep warning that is resolved by this patch:
WARNING: CPU: 1 PID: 2809 at kernel/mutex.c:819 mutex_trylock+0x19c/0x1b0
DEBUG_LOCKS_WARN_ON(in_interrupt())
CPU: 1 PID: 2809 Comm: dmesg Kdump: loaded Tainted: G OE ------------ 3.10.0-1127.19.1.el7.x86_64.debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
Call Trace:
<IRQ> [<ffffffff8187e46b>] dump_stack+0x19/0x1b
[<ffffffff810af108>] __warn+0xd8/0x100
[<ffffffff810af18f>] warn_slowpath_fmt+0x5f/0x80
[<ffffffff8107a7f3>] ? kvm_clock_read+0x33/0x40
[<ffffffff81882b0c>] mutex_trylock+0x19c/0x1b0
[<ffffffffc08d52e0>] _stp_transport_trylock_relay_inode+0x3b/0x54 [stap_42f154dcc9a767977c2d6336d57ca6dd_1_1193]
[<ffffffffc08cf14e>] stp_print_flush+0x4e/0x230 [stap_42f154dcc9a767977c2d6336d57ca6dd_1_1193]
[<ffffffff81889bd7>] ? _raw_write_unlock+0x27/0x40
[<ffffffffc08d46f1>] probe_6332+0x41/0xe0 [stap_42f154dcc9a767977c2d6336d57ca6dd_1_1193]
[<ffffffff81339c61>] ? proc_i_callback+0x1/0x20
[<ffffffffc08d2d01>] enter_kprobe_probe+0x231/0x460 [stap_42f154dcc9a767977c2d6336d57ca6dd_1_1193]
[<ffffffff81890744>] aggr_pre_handler+0x54/0xb0
[<ffffffff81339c61>] ? proc_i_callback+0x1/0x20
[<ffffffff8188f41a>] kprobe_ftrace_handler+0xfa/0x170
[<ffffffff81339c65>] ? proc_i_callback+0x5/0x20
[<ffffffff81339c60>] ? proc_destroy_inode+0x20/0x20
[<ffffffff811999f5>] ? rcu_process_callbacks+0x235/0x840
[<ffffffff811a2b94>] ftrace_ops_list_func+0x144/0x150
[<ffffffff81899954>] ftrace_regs_call+0x5/0x81
[<ffffffff81339c60>] ? proc_destroy_inode+0x20/0x20
[<ffffffff81339c60>] ? proc_destroy_inode+0x20/0x20
[<ffffffff81339c65>] ? proc_i_callback+0x5/0x20
[<ffffffff811999f5>] ? rcu_process_callbacks+0x235/0x840
[<ffffffff810bab7c>] __do_softirq+0x10c/0x470
[<ffffffff8189923c>] call_softirq+0x1c/0x30
[<ffffffff8103811d>] do_softirq+0x8d/0xc0
[<ffffffff810bb125>] irq_exit+0x125/0x140
[<ffffffff81065249>] smp_call_function_single_interrupt+0x39/0x40
[<ffffffff81897eef>] call_function_single_interrupt+0x16f/0x180
<EOI> [<ffffffff811405cb>] ? lock_is_held_type+0xcb/0xf0
[<ffffffff81297e32>] __mem_cgroup_count_vm_event+0x182/0x2c0
[<ffffffff81297cb5>] ? __mem_cgroup_count_vm_event+0x5/0x2c0
[<ffffffff8120d8bf>] filemap_fault+0x12f/0x460
[<ffffffff810eab85>] ? down_read_nested+0x95/0xc0
[<ffffffffc02bf408>] ? xfs_ilock+0x1d8/0x300 [xfs]
[<ffffffffc02ad9d7>] __xfs_filemap_fault+0x77/0x2a0 [xfs]
[<ffffffffc02adcac>] xfs_filemap_fault+0x2c/0x30 [xfs]
[<ffffffff812416fe>] __do_fault.isra.55+0x8e/0x110
[<ffffffff81241ee0>] do_read_fault.isra.57+0x50/0x1c0
[<ffffffff81246b95>] handle_pte_fault+0x2f5/0xd70
[<ffffffff8124a344>] handle_mm_fault+0x3a4/0x9b0
[<ffffffff8188fb9f>] __do_page_fault+0x20f/0x5a0
[<ffffffff81890016>] trace_do_page_fault+0x56/0x2e0
[<ffffffff8188f4f2>] do_async_page_fault+0x22/0xf0
[<ffffffff8188b848>] async_page_fault+0x28/0x30
This patch also resolves the following crash caused by racy modification
of the print buffer len in IRQ context:
BUG: unable to handle kernel paging request at ffffe8ffff621000
IP: [<ffffffffc05da0f3>] _stp_vsprint_memory+0x83/0x950 [stap_2c44636dfda18135ca3012a752599da6_13_533]
PGD 174b90067 PUD 174b8f067 PMD 174b93067 PTE 0
Oops: 0002 [#1] SMP
CPU: 12 PID: 3468 Comm: cat Kdump: loaded Tainted: G OE ------------ 3.10.0-1127.19.1.el7.x86_64.debug #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
task: ffff88001f4f0000 ti: ffff88004ea5c000 task.ti: ffff88004ea5c000
RIP: 0010:[<ffffffffc05da0f3>] [<ffffffffc05da0f3>] _stp_vsprint_memory+0x83/0x950 [stap_2c44636dfda18135ca3012a752599da6_13_533]
RSP: 0018:ffff88004ea5f9a8 EFLAGS: 00010082
RAX: ffffe8ffff621001 RBX: ffffe8ffff620ff2 RCX: fffffffffffffffe
RDX: 000000000000006e RSI: ffffffffffffffff RDI: ffffc90002c23730
RBP: ffff88004ea5fa28 R08: 00000000ffffffff R09: 0000000000000073
R10: ffffc90002c243d7 R11: 0000000000000001 R12: ffffc90002c2373f
R13: ffffe8ffff621004 R14: 0000000000000012 R15: 00000000fffffffe
FS: 00007f8a9b1d4740(0000) GS:ffff880179e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe8ffff621000 CR3: 00000000b3e3c000 CR4: 0000000000360fe0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff8103eb89>] ? sched_clock+0x9/0x10
[<ffffffff8114036f>] ? lock_release_holdtime.part.30+0xf/0x1a0
[<ffffffffc05dcb80>] function___global_trace__overload_0+0x5b0/0x1220 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffffc05d8993>] ? stp_lock_probe+0x53/0xe0 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff8188d879>] ? kretprobe_trampoline_holder+0x9/0x9
[<ffffffffc05e0662>] probe_7118+0x82/0xe0 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffffc05de866>] enter_kretprobe_common+0x256/0x490 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffffc05dead0>] enter_kretprobe_probe+0x10/0x20 [stap_2c44636dfda18135ca3012a752599da6_13_533]
[<ffffffff8188e1d8>] trampoline_handler+0x148/0x220
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffff8188d89e>] kretprobe_trampoline+0x25/0x57
[<ffffffff813489f1>] ? proc_sys_open+0x51/0x60
[<ffffffff8188d879>] kretprobe_trampoline_holder+0x9/0x9
[<ffffffff81384702>] ? security_inode_permission+0x22/0x30
[<ffffffff813489a0>] ? sysctl_head_finish+0x50/0x50
[<ffffffff812ac11d>] vfs_open+0x5d/0xb0
[<ffffffff812bb74a>] ? may_open+0x5a/0x120
[<ffffffff812c0af5>] do_last+0x285/0x15b0
[<ffffffff812bf18e>] ? link_path_walk+0x27e/0x8c0
[<ffffffff812c1ef0>] path_openat+0xd0/0x5d0
[<ffffffff8107a7f3>] ? kvm_clock_read+0x33/0x40
[<ffffffff812c38ad>] do_filp_open+0x4d/0xb0
[<ffffffff81889497>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff812d5a9b>] ? __alloc_fd+0xfb/0x270
[<ffffffff812ad784>] do_sys_open+0x124/0x220
[<ffffffff812ad89e>] SyS_open+0x1e/0x20
[<ffffffff8188d879>] kretprobe_trampoline_holder+0x9/0x9
---
runtime/linux/io.c | 47 +++--
runtime/linux/print.c | 328 +++++++++++++++++++++++++++-------
runtime/print.h | 3 +
runtime/print_flush.c | 57 +-----
runtime/stack.c | 62 ++++---
runtime/transport/transport.c | 8 +-
runtime/vsprintf.c | 18 +-
translate.cxx | 20 ++-
8 files changed, 386 insertions(+), 157 deletions(-)
diff --git a/runtime/linux/io.c b/runtime/linux/io.c
index 74a032c52..70a4ff4d7 100644
--- a/runtime/linux/io.c
+++ b/runtime/linux/io.c
@@ -20,9 +20,6 @@
#define WARN_STRING "WARNING: "
#define ERR_STRING "ERROR: "
-#if (STP_LOG_BUF_LEN < 10) /* sizeof(WARN_STRING) */
-#error "STP_LOG_BUF_LEN is too short"
-#endif
enum code { INFO=0, WARN, ERROR, DBUG };
@@ -31,25 +28,38 @@ static void _stp_vlog (enum code type, const char *func, int line, const char *f
static void _stp_vlog (enum code type, const char *func, int line, const char *fmt, va_list args)
{
+ struct _stp_log *log;
+ unsigned long flags;
+ size_t bytes_avail;
int num;
- char *buf = per_cpu_ptr(Stp_lbuf, get_cpu());
+ char *buf;
int start = 0;
+ bool do_flush = false;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ bytes_avail = STP_BUFFER_SIZE - log->len;
+ if (unlikely(!bytes_avail))
+ goto err_unlock;
+ buf = &log->buf[log->len];
if (type == DBUG) {
- start = _stp_snprintf(buf, STP_LOG_BUF_LEN, "%s:%d: ", func, line);
+ start = _stp_snprintf(buf, bytes_avail, "%s:%d: ", func, line);
} else if (type == WARN) {
- /* This strcpy() is OK, since we know STP_LOG_BUF_LEN
- * is > sizeof(WARN_STRING). */
- strcpy (buf, WARN_STRING);
- start = sizeof(WARN_STRING) - 1;
+ strncpy(buf, WARN_STRING, bytes_avail);
+ start = min(bytes_avail, sizeof(WARN_STRING) - 1);
} else if (type == ERROR) {
- /* This strcpy() is OK, since we know STP_LOG_BUF_LEN
- * is > sizeof(ERR_STRING) (which is < sizeof(WARN_STRING). */
- strcpy (buf, ERR_STRING);
- start = sizeof(ERR_STRING) - 1;
+ strncpy(buf, ERR_STRING, bytes_avail);
+ start = min(bytes_avail, sizeof(ERR_STRING) - 1);
}
- num = vscnprintf (buf + start, STP_LOG_BUF_LEN - start - 1, fmt, args);
+ bytes_avail -= start;
+ if (unlikely(!bytes_avail))
+ goto err_unlock;
+
+ num = vscnprintf(buf + start, bytes_avail, fmt, args);
if (num + start) {
if (buf[num + start - 1] != '\n') {
buf[num + start] = '\n';
@@ -67,11 +77,16 @@ static void _stp_vlog (enum code type, const char *func, int line, const char *f
_stp_ctl_send(STP_OOB_DATA, buf, start + num + 1);
} else {
_stp_print(buf);
- _stp_print_flush();
+ do_flush = true;
}
#endif
}
- put_cpu();
+err_unlock:
+ _stp_print_unlock_irqrestore(&flags);
+
+ /* _stp_print_flush() can go faster if IRQs are enabled */
+ if (do_flush)
+ _stp_print_flush();
}
/** Prints warning.
diff --git a/runtime/linux/print.c b/runtime/linux/print.c
index 777bca8b0..9cbc4607d 100644
--- a/runtime/linux/print.c
+++ b/runtime/linux/print.c
@@ -16,6 +16,7 @@
#include "print.h"
#include "transport/transport.c"
#include "vsprintf.c"
+#include <linux/interrupt.h>
/** @file print.c
* Printing Functions.
@@ -35,84 +36,256 @@
* @{
*/
-typedef struct __stp_pbuf {
- uint32_t len; /* bytes used in the buffer */
+struct _stp_log {
+ unsigned int len; /* Bytes used in the buffer */
char buf[STP_BUFFER_SIZE];
-} _stp_pbuf;
+ struct work_struct flush_work;
+};
+
+static struct _stp_log *_stp_log_pcpu;
+static atomic_t _stp_log_stop = ATOMIC_INIT(0);
+static void _stp_flush_poll_worker(struct work_struct *work);
+static DECLARE_DELAYED_WORK(_stp_flush_poll_dwork, _stp_flush_poll_worker);
+static struct cpumask _stp_flush_mask = { CPU_BITS_NONE };
+static DEFINE_RWLOCK(_stp_flush_mask_lock);
+
+/*
+ * Can't use STP_DEFINE_RWLOCK() or this might be replaced with a spin lock. A
+ * read lock is used to synchronize every possible print buffer usage with the
+ * _stp_print_cleanup() function. The cleanup function sets _stp_log_stop to 1
+ * and then acquires a write lock to wait for everything using the print buffer
+ * to finish. Then, to stop any new instances of print buffer usage from
+ * occurring, _stp_log_stop is checked inside the read lock every time the read
+ * lock is acquired. This is all to prevent races from causing mayhem when it
+ * comes time to free the print buffers.
+ */
+static DEFINE_RWLOCK(_stp_print_lock);
+
+/*
+ * This doesn't disable IRQs to protect the lock; it disables them because the
+ * caller intends on accessing the log buffer, so IRQs don't need to be disabled
+ * before attempting to acquire the lock. This is because the write lock is
+ * never acquired inside an IRQ, so there's no risk of a deadlock. Returns true
+ * when the lock is successfully acquired.
+ */
+static bool _stp_print_trylock_irqsave(unsigned long *flags)
+ __acquires(&_stp_print_lock)
+{
+ /*
+ * This needs to be a trylock because we could be inside an NMI that
+ * occurs while the write lock is held, resulting in a deadlock. The
+ * only time there is contention is during cleanup anyway.
+ */
+ if (!read_trylock(&_stp_print_lock))
+ return false;
+
+ /* Has to be checked with the lock held to avoid racing with cleanup */
+ if (atomic_read(&_stp_log_stop)) {
+ read_unlock(&_stp_print_lock);
+ return false;
+ }
-static void *Stp_pbuf = NULL;
+ /* Disable IRQs if requested */
+ if (flags)
+ local_irq_save(*flags);
+ return true;
+}
-/** private buffer for _stp_vlog() */
-#ifndef STP_LOG_BUF_LEN
-#define STP_LOG_BUF_LEN 256
-#endif
+static void _stp_print_unlock_irqrestore(unsigned long *flags)
+ __releases(&_stp_print_lock)
+{
+ if (flags)
+ local_irq_restore(*flags);
+ read_unlock(&_stp_print_lock);
+}
+
+#include "print_flush.c"
+static void _stp_print_flush_work(struct work_struct *work)
+{
+ struct _stp_log *log = container_of(work, typeof(*log), flush_work);
+
+ local_irq_disable();
+ __stp_print_flush(log);
+ local_irq_enable();
+}
+
+static void _stp_flush_poll_worker(struct work_struct *work)
+{
+ /* This could be too big to fit on the stack, so it has to be static */
+ static struct cpumask mask;
+ unsigned long flags;
+ unsigned int cpu;
+
+ /* Copy out the flush mask and then clear it */
+ write_lock_irqsave(&_stp_flush_mask_lock, flags);
+ cpumask_copy(&mask, &_stp_flush_mask);
+ cpumask_clear(&_stp_flush_mask);
+ write_unlock_irqrestore(&_stp_flush_mask_lock, flags);
+
+ /* Flush each CPU's print buffer in a worker on that CPU */
+ for_each_cpu(cpu, &mask) {
+ struct _stp_log *log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ schedule_work_on(cpu, &log->flush_work);
+ }
-typedef char _stp_lbuf[STP_LOG_BUF_LEN];
-static void *Stp_lbuf = NULL;
+ /* Poll again at the next jiffy */
+ schedule_delayed_work(&_stp_flush_poll_dwork, 1);
+}
/* create percpu print and io buffers */
static int _stp_print_init (void)
{
- Stp_pbuf = _stp_alloc_percpu(sizeof(_stp_pbuf));
- if (unlikely(Stp_pbuf == 0))
- return -1;
-
- /* now initialize IO buffer used in io.c */
- Stp_lbuf = _stp_alloc_percpu(sizeof(_stp_lbuf));
- if (unlikely(Stp_lbuf == 0)) {
- _stp_free_percpu(Stp_pbuf);
- return -1;
+ unsigned int cpu;
+
+ _stp_log_pcpu = _stp_alloc_percpu(sizeof(*_stp_log_pcpu));
+ if (!_stp_log_pcpu)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ struct _stp_log *log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ INIT_WORK(&log->flush_work, _stp_print_flush_work);
}
+
+ /* Fire up the poll worker */
+ schedule_delayed_work(&_stp_flush_poll_dwork, 0);
return 0;
}
static void _stp_print_cleanup (void)
{
- if (Stp_pbuf)
- _stp_free_percpu(Stp_pbuf);
- if (Stp_lbuf)
- _stp_free_percpu(Stp_lbuf);
-}
+ unsigned long flags;
+ unsigned int cpu;
+
+ /*
+ * Wait for the loggers to finish requesting print buffer flushes so we
+ * can flush all of the workers and ensure that none of them will be
+ * running or will start up again after this function returns.
+ */
+ atomic_set(&_stp_log_stop, 1);
+ write_lock_irqsave(&_stp_print_lock, flags);
+ write_unlock_irqrestore(&_stp_print_lock, flags);
+
+ /* Stop the poll worker and don't let it rearm itself */
+ cancel_delayed_work_sync(&_stp_flush_poll_dwork);
+
+ for_each_possible_cpu(cpu) {
+ struct _stp_log *log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ /* Wait for the flush worker to finish */
+ flush_work(&log->flush_work);
+
+ /*
+ * Flush anything that could be left in the print buffer. It is
+ * safe to do this without any kind of synchronization mechanism
+ * because nothing is using this print buffer anymore.
+ */
+ __stp_print_flush(log);
+ }
-#include "print_flush.c"
+ _stp_free_percpu(_stp_log_pcpu);
+}
static inline void _stp_print_flush(void)
{
- stp_print_flush(per_cpu_ptr(Stp_pbuf, smp_processor_id()));
+ struct _stp_log *log;
+ unsigned int cpu;
+
+ /*
+ * The write lock is never acquired inside an IRQ so we don't need to
+ * disable IRQs while holding the read lock here, since we don't intend
+ * on accessing the log buffer.
+ */
+ if (!_stp_print_trylock_irqsave(NULL))
+ return;
+
+ cpu = raw_smp_processor_id();
+ log = per_cpu_ptr(_stp_log_pcpu, cpu);
+
+ /*
+ * If IRQs are disabled, we're either inside an IRQ or a locked critical
+ * section which disables IRQs. In the case of being in an IRQ, we
+ * cannot flush the print buffer because a mutex lock is acquired in
+ * __stp_print_flush(). Even though the mutex is locked atomically, we
+ * cannot acquire a mutex from inside an IRQ because the mutex owner
+ * will be nonsense. In the case of being inside a critical section with
+ * IRQs disabled, we could be inside a tracepoint that makes the
+ * following functions deadlock: schedule_work(), tasklet_schedule(),
+ * add_timer(), hrtimer_start(), and wake_up(). This is because there
+ * are tracepoints called under locks that these functions need. We
+ * could also recurse and deadlock since some of these functions
+ * themselves contain tracepoints which are called under locks. As such,
+ * we can neither tell if it is safe to call __stp_print_flush() nor use
+ * any of the listed methods for asynchronous execution when IRQs are
+ * disabled, so the only solution available is to have a worker polling
+ * for print flushes in the background. There is just no way to safely
+ * request a function to be executed asynchronously from this context.
+ * The polling is achieved by having a delayed worker schedule itself
+ * every jiffy to check the cpumask containing CPUs which need their
+ * print buffer flushed.
+ */
+ if (irqs_disabled()) {
+ /*
+ * Setting CPU bits is atomic but copying out the mask is not,
+ * so we need the read lock here to synchronize with the poll
+ * worker. It must be trylock'd in case we're in an NMI, so that
+ * we don't deadlock with the write lock in the poll worker.
+ */
+ if (read_trylock(&_stp_flush_mask_lock)) {
+ cpumask_set_cpu(cpu, &_stp_flush_mask);
+ read_unlock(&_stp_flush_mask_lock);
+ }
+ } else {
+ local_irq_disable();
+ __stp_print_flush(log);
+ local_irq_enable();
+ }
+ _stp_print_unlock_irqrestore(NULL);
}
#ifndef STP_MAXBINARYARGS
#define STP_MAXBINARYARGS 127
#endif
-/** Reserves space in the output buffer for direct I/O.
+/** Reserves space in the output buffer for direct I/O. Must be called with
+ * _stp_print_lock held with IRQs disabled.
*/
static void * _stp_reserve_bytes (int numbytes)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- int size = STP_BUFFER_SIZE - pb->len;
- void * ret;
+ struct _stp_log *log;
+ char *ret = NULL;
+
+ BUG_ON(!irqs_disabled());
+ lockdep_assert_held(&_stp_print_lock);
if (unlikely(numbytes == 0 || numbytes > STP_BUFFER_SIZE))
return NULL;
- if (unlikely(numbytes > size))
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (unlikely(numbytes > (STP_BUFFER_SIZE - log->len))) {
_stp_print_flush();
-
- ret = pb->buf + pb->len;
- pb->len += numbytes;
+ } else {
+ ret = &log->buf[log->len];
+ log->len += numbytes;
+ }
return ret;
}
static void _stp_unreserve_bytes (int numbytes)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
+ struct _stp_log *log;
+
+ BUG_ON(!irqs_disabled());
+ lockdep_assert_held(&_stp_print_lock);
- if (unlikely(numbytes == 0 || numbytes > pb->len))
+ if (unlikely(numbytes == 0))
return;
- pb->len -= numbytes;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (numbytes <= log->len)
+ log->len -= numbytes;
}
/** Write 64-bit args directly into the output stream.
@@ -123,22 +296,28 @@ static void _stp_unreserve_bytes (int numbytes)
*/
static void _stp_print_binary (int num, ...)
{
+ unsigned long flags;
va_list vargs;
int i;
int64_t *args;
-
+
+ if (atomic_read(&_stp_log_stop))
+ return;
+
if (unlikely(num > STP_MAXBINARYARGS))
num = STP_MAXBINARYARGS;
- args = _stp_reserve_bytes(num * sizeof(int64_t));
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
- if (likely(args != NULL)) {
+ args = _stp_reserve_bytes(num * sizeof(int64_t));
+ if (args) {
va_start(vargs, num);
- for (i = 0; i < num; i++) {
+ for (i = 0; i < num; i++)
args[i] = va_arg(vargs, int64_t);
- }
va_end(vargs);
}
+ _stp_print_unlock_irqrestore(&flags);
}
/** Print into the print buffer.
@@ -149,6 +328,7 @@ static void _stp_print_binary (int num, ...)
static void _stp_printf (const char *fmt, ...)
{
va_list args;
+
va_start(args, fmt);
_stp_vsnprintf(NULL, 0, fmt, args);
va_end(args);
@@ -160,37 +340,51 @@ static void _stp_printf (const char *fmt, ...)
static void _stp_print (const char *str)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- char *end = pb->buf + STP_BUFFER_SIZE;
- char *ptr = pb->buf + pb->len;
- char *instr = (char *)str;
-
- while (ptr < end && *instr)
- *ptr++ = *instr++;
-
- /* Did loop terminate due to lack of buffer space? */
- if (unlikely(*instr)) {
- /* Don't break strings across subbufs. */
- /* Restart after flushing. */
+ struct _stp_log *log;
+ unsigned long flags;
+ unsigned int i;
+ bool do_flush;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ for (i = log->len; i < STP_BUFFER_SIZE && *str; i++)
+ log->buf[i] = *str++;
+ log->len = i;
+ do_flush = *str;
+ _stp_print_unlock_irqrestore(&flags);
+
+ /*
+ * Flush if the log buffer ran out of space. This is done outside the
+ * lock because _stp_print_flush() can go faster if IRQs are enabled.
+ */
+ if (do_flush)
_stp_print_flush();
- end = pb->buf + STP_BUFFER_SIZE;
- ptr = pb->buf + pb->len;
- instr = (char *)str;
- while (ptr < end && *instr)
- *ptr++ = *instr++;
- }
- pb->len = ptr - pb->buf;
}
static void _stp_print_char (const char c)
{
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- int size = STP_BUFFER_SIZE - pb->len;
- if (unlikely(1 >= size))
+ struct _stp_log *log;
+ unsigned long flags;
+ bool do_flush = false;
+
+ if (!_stp_print_trylock_irqsave(&flags))
+ return;
+
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ if (log->len < STP_BUFFER_SIZE)
+ log->buf[log->len++] = c;
+ else
+ do_flush = true;
+ _stp_print_unlock_irqrestore(&flags);
+
+ /*
+ * Flush if the log buffer didn't have space. This is done outside the
+ * lock because _stp_print_flush() can go faster if IRQs are enabled.
+ */
+ if (do_flush)
_stp_print_flush();
-
- pb->buf[pb->len] = c;
- pb->len ++;
}
static void _stp_print_kernel_info(char *sname, char *vstr, int ctx, int num_probes)
diff --git a/runtime/print.h b/runtime/print.h
index ede71f033..ffdea594d 100644
--- a/runtime/print.h
+++ b/runtime/print.h
@@ -10,6 +10,9 @@
#ifndef _STP_PRINT_H_
#define _STP_PRINT_H_
+/* The lock must be held with IRQs disabled to do any printing */
+static bool _stp_print_trylock_irqsave(unsigned long *flags);
+static void _stp_print_unlock_irqrestore(unsigned long *flags);
static int _stp_print_init(void);
static void _stp_print_cleanup(void);
static void *_stp_reserve_bytes(int numbytes);
diff --git a/runtime/print_flush.c b/runtime/print_flush.c
index cf40a2645..0215fba74 100644
--- a/runtime/print_flush.c
+++ b/runtime/print_flush.c
@@ -13,40 +13,32 @@
* is filled, or this is called. This MUST be called before returning
* from a probe or accumulated output in the print buffer will be lost.
*
- * @note Preemption must be disabled to use this.
+ * @note Interrupts must be disabled to use this, and this cannot be used from
+ * interrupt context. Use _stp_print_flush() instead.
*/
-static STP_DEFINE_SPINLOCK(_stp_print_lock);
-
-void stp_print_flush(_stp_pbuf *pb)
+void __stp_print_flush(struct _stp_log *log)
{
- size_t len = pb->len;
+ size_t len = log->len;
void *entry = NULL;
/* check to see if there is anything in the buffer */
if (likely(len == 0))
return;
- pb->len = 0;
-
- if (unlikely(_stp_transport_get_state() != STP_TRANSPORT_RUNNING))
- return;
+ log->len = 0;
dbug_trans(1, "len = %zu\n", len);
#ifdef STP_BULKMODE
#ifdef NO_PERCPU_HEADERS
{
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
int inode_locked;
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
@@ -70,26 +62,20 @@ void stp_print_flush(_stp_pbuf *pb)
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#else /* !NO_PERCPU_HEADERS */
{
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
struct _stp_trace t = { .sequence = _stp_seq_inc(),
.pdu_len = len};
size_t bytes_reserved;
int inode_locked;
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
@@ -124,48 +110,24 @@ void stp_print_flush(_stp_pbuf *pb)
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#endif /* !NO_PERCPU_HEADERS */
#else /* !STP_BULKMODE */
{
- unsigned long flags;
- struct context* __restrict__ c = NULL;
- char *bufp = pb->buf;
+ char *bufp = log->buf;
int inode_locked;
- /* Prevent probe reentrancy on _stp_print_lock.
- *
- * Since stp_print_flush may be called from probe context, we
- * have to make sure that its lock, _stp_print_lock, can't
- * possibly be held outside probe context too. We ensure this
- * by grabbing the context here, so any probe triggered by this
- * region will appear reentrant and be skipped rather than
- * deadlock. Failure to get_context just means we're already
- * in a probe, which is fine.
- *
- * (see also _stp_ctl_send for a similar situation)
- *
- * A better solution would be to replace this
- * concurrency-control-laden effort with a lockless
- * algorithm.
- */
- c = _stp_runtime_entryfn_get_context();
-
if (!(inode_locked = _stp_transport_trylock_relay_inode())) {
atomic_inc (&_stp_transport_failures);
#ifndef STP_TRANSPORT_RISKY
dbug_trans(0, "discarding %zu bytes of data\n", len);
- _stp_runtime_entryfn_put_context(c);
return;
#endif
}
dbug_trans(1, "calling _stp_data_write...\n");
- stp_spin_lock_irqsave(&_stp_print_lock, flags);
while (len > 0) {
size_t bytes_reserved;
@@ -182,12 +144,9 @@ void stp_print_flush(_stp_pbuf *pb)
break;
}
}
- stp_spin_unlock_irqrestore(&_stp_print_lock, flags);
if (inode_locked)
_stp_transport_unlock_relay_inode();
-
- _stp_runtime_entryfn_put_context(c);
}
#endif /* !STP_BULKMODE */
}
diff --git a/runtime/stack.c b/runtime/stack.c
index 241ccf793..c81c8474a 100644
--- a/runtime/stack.c
+++ b/runtime/stack.c
@@ -685,35 +685,57 @@ static void _stp_stack_user_print(struct context *c, int sym_flags)
static void _stp_stack_kernel_sprint(char *str, int size, struct context* c,
int sym_flags)
{
- /* To get an hex string, we use a simple trick.
- * First flush the print buffer,
- * then call _stp_stack_print,
- * then copy the result into the output string
- * and clear the print buffer. */
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- _stp_print_flush();
+ /*
+ * We use a simple trick to print out hex. We call _stp_stack_*_print(),
+ * copy the result into the output string, and then reset the print
+ * buffer to its prior state.
+ */
+ struct _stp_log *log;
+ unsigned long flags;
+ unsigned int old_len;
+
+ if (!size)
+ return;
- _stp_stack_kernel_print(c, sym_flags);
+ if (!_stp_print_trylock_irqsave(&flags)) {
+ *str = '\0';
+ return;
+ }
- strlcpy(str, pb->buf, size < (int)pb->len ? size : (int)pb->len);
- pb->len = 0;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ old_len = log->len;
+ _stp_stack_kernel_print(c, sym_flags);
+ strlcpy(str, &log->buf[old_len], min_t(unsigned int, log->len - old_len, size));
+ log->len = old_len;
+ _stp_print_unlock_irqrestore(&flags);
}
static void _stp_stack_user_sprint(char *str, int size, struct context* c,
int sym_flags)
{
- /* To get an hex string, we use a simple trick.
- * First flush the print buffer,
- * then call _stp_stack_print,
- * then copy the result into the output string
- * and clear the print buffer. */
- _stp_pbuf *pb = per_cpu_ptr(Stp_pbuf, smp_processor_id());
- _stp_print_flush();
+ /*
+ * We use a simple trick to print out hex. We call _stp_stack_*_print(),
+ * copy the result into the output string, and then reset the print
+ * buffer to its prior state.
+ */
+ struct _stp_log *log;
+ unsigned long flags;
+ unsigned int old_len;
+
+ if (!size)
+ return;
- _stp_stack_user_print(c, sym_flags);
+ if (!_stp_print_trylock_irqsave(&flags)) {
+ *str = '\0';
+ return;
+ }
- strlcpy(str, pb->buf, size < (int)pb->len ? size : (int)pb->len);
- pb->len = 0;
+ log = per_cpu_ptr(_stp_log_pcpu, raw_smp_processor_id());
+ old_len = log->len;
+ _stp_stack_user_print(c, sym_flags);
+ strlcpy(str, &log->buf[old_len], min_t(unsigned int, log->len - old_len, size));
+ log->len = old_len;
+ _stp_print_unlock_irqrestore(&flags);
}
#endif /* _STACK_C_ */
diff --git a/runtime/transport/transport.c b/runtime/transport/transport.c
index bb4a98bd3..aebd7e914 100644
--- a/runtime/transport/transport.c
+++ b/runtime/transport/transport.c
@@ -534,8 +534,14 @@ static void _stp_transport_close(void)
current->pid);
_stp_cleanup_and_exit(0);
_stp_unregister_ctl_channel();
+
+ /*
+ * Print cleanup needs to be done before the transport is closed because
+ * the print driver uses a worker to asynchronously send data via the
+ * transport.
+ */
+ _stp_print_cleanup();
_stp_transport_fs_close();
- _stp_print_cleanup(); /* free print buffers */
_stp_mem_debug_done();
dbug_trans(1, "---- CLOSED ----\n");
diff --git a/runtime/vsprintf.c b/runtime/vsprintf.c
index 28fd18f16..417d9f7f3 100644
--- a/runtime/vsprintf.c
+++ b/runtime/vsprintf.c
@@ -542,6 +542,8 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
number of chars for from string */
int qualifier; /* 'h', 'l', or 'L' for integer fields */
int num_bytes = 0;
+ unsigned long irqflags = 0;
+ bool got_print_lock = false;
/* Reject out-of-range values early */
if (unlikely((int) size < 0))
@@ -724,11 +726,14 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
num_bytes = STP_BUFFER_SIZE;
}
+ if (!_stp_print_trylock_irqsave(&irqflags))
+ return 0;
str = (char*)_stp_reserve_bytes(num_bytes);
if (str == NULL) {
_stp_error("Couldn't reserve any print buffer space\n");
- return 0;
+ goto err_unlock;
}
+ got_print_lock = true;
size = num_bytes;
end = str + size - 1;
@@ -820,8 +825,10 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
field_width, precision,
*fmt, flags);
if (unlikely(str == NULL)) {
- if (num_bytes > 0)
+ if (num_bytes > 0) {
_stp_unreserve_bytes(num_bytes);
+ goto err_unlock;
+ }
return 0;
}
continue;
@@ -923,7 +930,14 @@ _stp_vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
/* don't write out a null byte if the buf size is zero */
*end = '\0';
}
+
+ if (got_print_lock)
+ _stp_print_unlock_irqrestore(&irqflags);
return str-buf;
+
+err_unlock:
+ _stp_print_unlock_irqrestore(&irqflags);
+ return 0;
}
#endif /* _VSPRINTF_C_ */
diff --git a/translate.cxx b/translate.cxx
index 53f1d0725..f0195486c 100644
--- a/translate.cxx
+++ b/translate.cxx
@@ -1354,6 +1354,9 @@ c_unparser::emit_compiled_printfs ()
o->newline() << "unsigned long ptr_value;";
o->newline() << "int num_bytes;";
+ if (print_to_stream)
+ o->newline() << "unsigned long irqflags;";
+
o->newline() << "(void) width;";
o->newline() << "(void) precision;";
o->newline() << "(void) ptr_value;";
@@ -1452,7 +1455,9 @@ c_unparser::emit_compiled_printfs ()
}
o->newline() << "num_bytes = clamp(num_bytes, 0, STP_BUFFER_SIZE);";
- o->newline() << "str = (char*)_stp_reserve_bytes(num_bytes);";
+ o->newline() << "if (!_stp_print_trylock_irqsave(&irqflags))";
+ o->newline(1) << "return;";
+ o->newline(-1) << "str = (char*)_stp_reserve_bytes(num_bytes);";
o->newline() << "end = str ? str + num_bytes - 1 : 0;";
}
else // !print_to_stream
@@ -1547,8 +1552,14 @@ c_unparser::emit_compiled_printfs ()
o->newline() << "if (unlikely(str == NULL)) {";
o->indent(1);
if (print_to_stream)
+ {
o->newline() << "_stp_unreserve_bytes(num_bytes);";
- o->newline() << "return;";
+ o->newline() << "goto err_unlock;";
+ }
+ else
+ {
+ o->newline() << "return;";
+ }
o->newline(-1) << "}";
break;
@@ -1575,6 +1586,11 @@ c_unparser::emit_compiled_printfs ()
o->newline(-1) << "}";
+ if (print_to_stream)
+ {
+ o->newline(-1) << "err_unlock:";
+ o->newline(1) << "_stp_print_unlock_irqrestore(&irqflags);";
+ }
o->newline(-1) << "}";
}
o->newline() << "#endif // STP_LEGACY_PRINT";
--
2.29.2