We last left off confirming that the rename() syscall is the culprit of the slow mv:
$ sudo strace -t mv 5GB.copy 5GB
....
05:05:10 stat("5GB", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
05:05:10 lstat("5GB.copy", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
05:05:10 lstat("5GB", {st_mode=S_IFREG|0664, st_size=5120000000, ...}) = 0
05:05:10 geteuid() = 0
*** 05:05:10 rename("5GB.copy", "5GB") = 0
*** 05:05:18 lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
05:05:18 close(0) = 0
05:05:18 close(1) = 0
05:05:18 close(2) = 0
05:05:18 exit_group(0) = ?
05:05:18 +++ exited with 0 +++
The timestamps show that rename() alone took about eight seconds (05:05:10 to 05:05:18). To see where that time goes, we have to start tracing in the kernel. Tracing what happens during the rename via ftrace:
#!/bin/bash
# assumes debugfs is mounted at /debug (commonly /sys/kernel/debug)
cp 5GB 5GB.copy
echo '*' >/debug/tracing/set_ftrace_filter
echo function_graph >/debug/tracing/current_tracer
echo 1 >/debug/tracing/tracing_on
mv 5GB.copy 5GB
echo 0 >/debug/tracing/tracing_on
cat /debug/tracing/trace > /tmp/tracing.out$$
Looking for the slow calls (the function_graph tracer flags durations over one second with a `$`), we see:
0) $ 7670875 us | } /* truncate_inode_pages_range */
0) $ 7670876 us | } /* truncate_inode_pages_final */
0) $ 7674280 us | } /* ext4_evict_inode */
0) $ 7674291 us | } /* evict */
0) $ 7674292 us | } /* iput */
0) $ 7674295 us | } /* dentry_unlink_inode */
0) $ 7674299 us | } /* __dentry_kill */
0) $ 7674302 us | } /* dput.part.23 */
0) $ 7674303 us | } /* dput */
0) $ 17314240 us | } /* SyS_rename */
Of the ~17.3 seconds spent in SyS_rename, roughly 7.67 seconds go to truncate_inode_pages_range, called while evicting the overwritten file's inode. Looking at what truncate_inode_pages_range is doing (omitting boilerplate code):
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
unsigned int partial_start; /* inclusive */
unsigned int partial_end; /* exclusive */
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;
if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
goto out;
/* Offsets within partial pages */
partial_start = lstart & (PAGE_SIZE - 1);
partial_end = (lend + 1) & (PAGE_SIZE - 1);
start = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (lend == -1)
end = -1;
else
end = (lend + 1) >> PAGE_SHIFT;
pagevec_init(&pvec);
index = start;
while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
indices)) {
/* ... omitted, populate pvec ... */
}
if (partial_start) {
struct page *page = find_lock_page(mapping, start - 1);
if (page) {
unsigned int top = PAGE_SIZE;
if (start > end) {
/* Truncation within a single page */
top = partial_end;
partial_end = 0;
}
wait_on_page_writeback(page);
zero_user_segment(page, partial_start, top);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
do_invalidatepage(page, partial_start,
top - partial_start);
unlock_page(page);
put_page(page);
}
}
if (partial_end) {
/* omitted, similar to partial_start case */
}
if (start >= end)
goto out;
index = start;
/* omitted, additional booking */
out:
cleancache_invalidate_inode(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
So the wait_on_page_writeback call suggests that pages are being written out to disk in the rename() code path, which confirms the suspicion I had for why rename() was slow (it was triggering disk I/O). The question is: why is it doing this?