Skip to content

Instantly share code, notes, and snippets.

@jamesls
Last active December 5, 2018 18:26
Show Gist options
  • Save jamesls/3457056845a9ccddc8bbc1665e5b93ce to your computer and use it in GitHub Desktop.
Save jamesls/3457056845a9ccddc8bbc1665e5b93ce to your computer and use it in GitHub Desktop.

We last left off confirming that the rename() syscall is the culprit of a slow mv:

$ sudo strace -t mv 5GB.copy 5GB
   ....
    05:05:10 stat("5GB", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
    05:05:10 lstat("5GB.copy", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
    05:05:10 lstat("5GB", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
    05:05:10 geteuid()                      = 0
*** 05:05:10 rename("5GB.copy", "5GB")      = 0
*** 05:05:18 lseek(0, 0, SEEK_CUR)          = -1 ESPIPE (Illegal seek)
    05:05:18 close(0)                       = 0
    05:05:18 close(1)                       = 0
    05:05:18 close(2)                       = 0
    05:05:18 exit_group(0)                  = ?
    05:05:18 +++ exited with 0 +++

So now we have to start tracing in the kernel: Tracing what's happening in the rename via ftrace:

#!/bin/bash
cp 5GB 5GB.copy
echo '*' >/debug/tracing/set_ftrace_filter
echo function_graph >/debug/tracing/current_tracer
echo 1 >/debug/tracing/tracing_on

mv 5GB.copy 5GB

echo 0 >/debug/tracing/tracing_on
cat /debug/tracing/trace > /tmp/tracing.out$$

Looking for the slow calls we see:

 0) $ 7670875 us  |                      } /* truncate_inode_pages_range */
 0) $ 7670876 us  |                    } /* truncate_inode_pages_final */
 0) $ 7674280 us  |                  } /* ext4_evict_inode */
 0) $ 7674291 us  |                } /* evict */
 0) $ 7674292 us  |              } /* iput */
 0) $ 7674295 us  |            } /* dentry_unlink_inode */
 0) $ 7674299 us  |          } /* __dentry_kill */
 0) $ 7674302 us  |        } /* dput.part.23 */
 0) $ 7674303 us  |      } /* dput */
 0) $ 17314240 us |    } /* SyS_rename */

Looking at what trunate_inode_pages_range is doing (omitting boilerplate code):

void truncate_inode_pages_range(struct address_space *mapping,
				loff_t lstart, loff_t lend)
{
	pgoff_t		start;		/* inclusive */
	pgoff_t		end;		/* exclusive */
	unsigned int	partial_start;	/* inclusive */
	unsigned int	partial_end;	/* exclusive */
	struct pagevec	pvec;
	pgoff_t		indices[PAGEVEC_SIZE];
	pgoff_t		index;
	int		i;

	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
		goto out;

	/* Offsets within partial pages */
	partial_start = lstart & (PAGE_SIZE - 1);
	partial_end = (lend + 1) & (PAGE_SIZE - 1);

	start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (lend == -1)
		end = -1;
	else
		end = (lend + 1) >> PAGE_SHIFT;

	pagevec_init(&pvec);
	index = start;
	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
			min(end - index, (pgoff_t)PAGEVEC_SIZE),
			indices)) {
                /* ... omitted, populate pvec ... */
	}
	if (partial_start) {
		struct page *page = find_lock_page(mapping, start - 1);
		if (page) {
			unsigned int top = PAGE_SIZE;
			if (start > end) {
				/* Truncation within a single page */
				top = partial_end;
				partial_end = 0;
			}
			wait_on_page_writeback(page);
			zero_user_segment(page, partial_start, top);
			cleancache_invalidate_page(mapping, page);
			if (page_has_private(page))
				do_invalidatepage(page, partial_start,
						  top - partial_start);
			unlock_page(page);
			put_page(page);
		}
	}
	if (partial_end) {
                /* omitted, similar to partial_start case */
	}
	if (start >= end)
		goto out;

	index = start;

        /* omitted, additional booking */

out:
	cleancache_invalidate_inode(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);

So the wait_on_page_writeback suggests that it is writing out pages to disk in the rename() code path, which confirms the suspicion I had for why rename() was slow (triggering disk IO). The question is, why is it doing this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment