Created December 13, 2014 16:04
diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt | |
new file mode 100644 | |
index 0000000..c10d956 | |
--- /dev/null | |
+++ b/Documentation/scheduler/sched-BFS.txt | |
@@ -0,0 +1,347 @@ | |
+BFS - The Brain Fuck Scheduler by Con Kolivas. | |
+ | |
+Goals. | |
+ | |
+The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to | |
+completely do away with the complex designs of the past for the cpu process | |
+scheduler and instead implement one that is very simple in basic design. | |
+The main focus of BFS is to achieve excellent desktop interactivity and | |
+responsiveness without heuristics and tuning knobs that are difficult to | |
+understand, impossible to model and predict the effect of, and when tuned to | |
+one workload cause massive detriment to another. | |
+ | |
+ | |
+Design summary. | |
+ | |
+BFS is best described as a single runqueue, O(n) lookup, earliest effective | |
+virtual deadline first design, loosely based on EEVDF (earliest eligible virtual | |
+deadline first) and my previous Staircase Deadline scheduler. Each component | |
+shall be described in order to understand the significance of, and reasoning | |
+for, it. The codebase when the first stable version was released was | |
+approximately 9000 fewer lines of code than the existing mainline linux | |
+kernel scheduler (in | |
+2.6.31). This does not even take into account the removal of documentation and | |
+the cgroups code that is not used. | |
+ | |
+Design reasoning. | |
+ | |
+The single runqueue refers to the queued but not running processes for the | |
+entire system, regardless of the number of CPUs. The reason for going back to | |
+a single runqueue design is that once multiple runqueues are introduced, | |
+per-CPU or otherwise, complex interactions follow: each runqueue is responsible | |
+for the scheduling latency and fairness only of the tasks on its own runqueue, | |
+so to achieve fairness and low latency across multiple CPUs, any throughput | |
+advantage of having CPU local tasks brings other disadvantages. A very complex | |
+balancing system is required to achieve, at best, some semblance of fairness | |
+across CPUs, and it can only maintain relatively low latency for tasks bound to | |
+the same CPUs, not across them. To improve fairness and latency across CPUs, | |
+the advantage of local runqueue locking, which makes for better scalability, is | |
+lost due to having to grab multiple locks. | |
+ | |
+A significant feature of BFS is that all accounting is done purely based on CPU | |
+used and nowhere is sleep time used in any way to determine entitlement or | |
+interactivity. Interactivity "estimators" that use some kind of sleep/run | |
+algorithm are doomed to fail to detect all interactive tasks, and to falsely tag | |
+tasks that aren't interactive as being so. The reason for this is that it is | |
+close to impossible to determine, when a task is sleeping, whether it is | |
+doing so voluntarily, as in a userspace application waiting for input in the | |
+form of a mouse click or otherwise, or involuntarily, because it is waiting for | |
+another thread, process, I/O, kernel activity or whatever. Thus, such an | |
+estimator will introduce corner cases, and more heuristics will be required to | |
+cope with those corner cases, introducing more corner cases and failed | |
+interactivity detection and so on. Interactivity in BFS is built into the design | |
+by virtue of the fact that tasks that are waking up have not used up their quota | |
+of CPU time, and have earlier effective deadlines, thereby making it very likely | |
+they will preempt any CPU bound task of equivalent nice level. See below for | |
+more information on the virtual deadline mechanism. Even if they do not preempt | |
+a running task, because the rr interval guarantees a bounded upper limit on | |
+how long a task will wait, it will be scheduled within a timeframe | |
+that will not cause visible interface jitter. | |
+ | |
+ | |
+Design details. | |
+ | |
+Task insertion. | |
+ | |
+BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly | |
+linked list. On insertion, *every* running queue is checked to see if the newly | |
+queued task can run on any idle queue, or preempt the lowest running task on the | |
+system. This is how the cross-CPU scheduling of BFS achieves significantly lower | |
+latency per extra CPU the system has. In this case the lookup is, in the worst | |
+case scenario, O(n) where n is the number of CPUs on the system. | |
+ | |
+Data protection. | |
+ | |
+BFS has one single lock protecting the process local data of every task in the | |
+global queue. Thus every insertion, removal and modification of task data in the | |
+global runqueue needs to grab the global lock. However, once a task is taken by | |
+a CPU, the CPU has its own local data copy of the running process' accounting | |
+information which only that CPU accesses and modifies (such as during a | |
+timer tick) thus allowing the accounting data to be updated lockless. Once a | |
+CPU has taken a task to run, it removes it from the global queue. Thus the | |
+global queue only ever has, at most, | |
+ | |
+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1 | |
+ | |
+tasks in the global queue. This value is relevant for the time taken to look up | |
+tasks during scheduling. This will increase if many tasks have CPU affinity | |
+set in their policy to limit which CPUs they're allowed to run on and those | |
+tasks outnumber the number of CPUs. The +1 is because when rescheduling a | |
+task, the CPU's | |
+currently running task is put back on the queue. Lookup will be described after | |
+the virtual deadline mechanism is explained. | |
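
As a worked example of that bound (illustrative numbers only, not from the patch): on a machine with 4 logical CPUs and 20 tasks requesting CPU time, 4 of those tasks are running on the CPUs, so at most 20 - 4 + 1 = 17 tasks sit in the global queue at the moment one CPU reschedules and has just put its current task back.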
+ | |
+Virtual deadline. | |
+ | |
+The key to achieving low latency, scheduling fairness, and "nice level" | |
+distribution in BFS is entirely in the virtual deadline mechanism. The one | |
+tunable in BFS is the rr_interval, or "round robin interval". This is the | |
+maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) | |
+tasks of the same nice level will be running for, or looking at it the other | |
+way around, the longest duration two tasks of the same nice level will be | |
+delayed for. When a task requests cpu time, it is given a quota (time_slice) | |
+equal to the rr_interval and a virtual deadline. The virtual deadline is | |
+offset from the current time in jiffies by this equation: | |
+ | |
+ jiffies + (prio_ratio * rr_interval) | |
+ | |
+The prio_ratio is determined as a ratio compared to the baseline of nice -20 | |
+and increases by 10% per nice level. The deadline is a virtual one only in that | |
+no guarantee is placed that a task will actually be scheduled by this time, but | |
+it is used to compare which task should go next. There are three components to | |
+how a task is next chosen. First is time_slice expiration. If a task runs out | |
+of its time_slice, it is descheduled, the time_slice is refilled, and the | |
+deadline reset to that formula above. Second is sleep, where a task no longer | |
+is requesting CPU for whatever reason. The time_slice and deadline are _not_ | |
+adjusted in this case and are just carried over for when the task is next | |
+scheduled. Third is preemption, and that is when a newly waking task is deemed | |
+higher priority than a currently running task on any cpu by virtue of the fact | |
+that it has an earlier virtual deadline than the currently running task. The | |
+earlier deadline is the key to which task is next chosen for the first and | |
+second cases. Once a task is descheduled, it is put back on the queue, and an | |
+O(n) lookup of all queued-but-not-running tasks is done to determine which has | |
+the earliest deadline and that task is chosen to receive CPU next. | |
+ | |
+The CPU proportion of different nice tasks works out to be approximately the | |
+ | |
+ (prio_ratio difference)^2 | |
+ | |
+The reason it is squared is that a task's deadline does not change while it is | |
+running unless it runs out of time_slice. Thus, even if the time actually | |
+passes the deadline of another task that is queued, it will not get CPU time | |
+unless the current running task deschedules, and the time "base" (jiffies) is | |
+constantly moving. | |
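
To make the deadline arithmetic above concrete, here is a minimal userspace sketch, not the kernel implementation: the percent-scaled prio_ratio, the compounding reading of "increases by 10% per nice level", and rr_interval being expressed in jiffies are all assumptions made purely for illustration.

    #include <stdio.h>

    /* Illustrative ratio relative to nice -20, compounding ~10% per nice
     * level, expressed in percent so that nice -20 == 100. */
    static unsigned long prio_ratio(int nice)
    {
        unsigned long ratio = 100;
        int level;

        for (level = -20; level < nice; level++)
            ratio += ratio / 10;
        return ratio;
    }

    /* deadline = jiffies + prio_ratio * rr_interval (rr_interval in jiffies);
     * dividing by 100 only undoes the percent scaling used above. */
    static unsigned long virtual_deadline(unsigned long jiffies_now,
                                          unsigned long rr_interval,
                                          int nice)
    {
        return jiffies_now + prio_ratio(nice) * rr_interval / 100;
    }

    int main(void)
    {
        unsigned long now = 1000;   /* a pretend current jiffies value */

        printf("nice -20 deadline: %lu\n", virtual_deadline(now, 6, -20));
        printf("nice   0 deadline: %lu\n", virtual_deadline(now, 6, 0));
        printf("nice  19 deadline: %lu\n", virtual_deadline(now, 6, 19));
        return 0;
    }

A task at nice 19 thus receives a deadline much further in the future than a nice -20 task queued at the same instant, which is what skews CPU distribution towards lower nice values.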
+ | |
+Task lookup. | |
+ | |
+BFS has 103 priority queues. 100 of these are dedicated to the static priority | |
+of realtime tasks, and the remaining 3 are, in order of best to worst priority, | |
+SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority | |
+scheduling). When a task of these priorities is queued, a bitmap of running | |
+priorities is set showing which of these priorities has tasks waiting for CPU | |
+time. When a CPU is made to reschedule, the lookup for the next task to get | |
+CPU time is performed in the following way: | |
+ | |
+First the bitmap is checked to see what static priority tasks are queued. If | |
+any realtime priorities are found, the corresponding queue is checked and the | |
+first task listed there is taken (provided CPU affinity is suitable) and lookup | |
+is complete. If the priority corresponds to a SCHED_ISO task, they are also | |
+taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds | |
+to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this | |
+stage, every task in the runlist that corresponds to that priority is checked | |
+to see which has the earliest set deadline, and (provided it has suitable CPU | |
+affinity) it is taken off the runqueue and given the CPU. If a task has an | |
+expired deadline, it is taken and the rest of the lookup aborted (as they are | |
+chosen in FIFO order). | |
+ | |
+Thus, the lookup is O(n) in the worst case only, where n is as described | |
+earlier, as tasks may be chosen before the whole task list is looked over. | |
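
The lookup order just described can be sketched as follows; this is simplified, illustrative C only, with invented names (queue_bitmap, struct task, PRIO_NORMAL) rather than the kernel's actual data structures, and jiffy wraparound is ignored.

    #include <stddef.h>
    #include <stdbool.h>

    #define NR_PRIO_QUEUES 103          /* 100 RT + ISO + NORMAL + IDLEPRIO */
    #define PRIO_NORMAL    101          /* illustrative index only          */

    struct task {
        unsigned long deadline;         /* virtual deadline in jiffies      */
        bool          affine_ok;        /* allowed to run on this CPU?      */
        struct task  *next;             /* singly linked for simplicity     */
    };

    struct task *queues[NR_PRIO_QUEUES];    /* per-priority waiting lists   */
    unsigned long queue_bitmap[2];          /* bit set => queue non-empty   */

    static bool queue_has_tasks(int prio)
    {
        return queue_bitmap[prio / 64] & (1UL << (prio % 64));
    }

    /* FIFO for realtime and ISO levels, earliest deadline for NORMAL/IDLE. */
    struct task *lookup_next(unsigned long now)
    {
        for (int prio = 0; prio < NR_PRIO_QUEUES; prio++) {
            if (!queue_has_tasks(prio))
                continue;

            if (prio < PRIO_NORMAL) {   /* realtime and SCHED_ISO: FIFO     */
                for (struct task *t = queues[prio]; t; t = t->next)
                    if (t->affine_ok)
                        return t;
                continue;
            }

            /* SCHED_NORMAL / SCHED_IDLEPRIO: O(n) earliest-deadline scan.  */
            struct task *best = NULL;
            for (struct task *t = queues[prio]; t; t = t->next) {
                if (!t->affine_ok)
                    continue;
                if (t->deadline <= now) /* expired: take it, stop scanning  */
                    return t;
                if (!best || t->deadline < best->deadline)
                    best = t;
            }
            if (best)
                return best;
        }
        return NULL;                    /* nothing runnable for this CPU    */
    }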
+ | |
+ | |
+Scalability. | |
+ | |
+The major limitation of BFS will be that of scalability, as the separate | |
+runqueue designs will have less lock contention as the number of CPUs rises. | |
+However they do not scale linearly even with separate runqueues as multiple | |
+runqueues will need to be locked concurrently on such designs to be able to | |
+achieve fair CPU balancing, to try and achieve some sort of nice-level fairness | |
+across CPUs, and to achieve low enough latency for tasks on a busy CPU when | |
+other CPUs would be more suited. BFS has the advantage that it requires no | |
+balancing algorithm whatsoever, as balancing occurs by proxy simply because | |
+all CPUs draw off the global runqueue, in priority and deadline order. Despite | |
+the fact that scalability is _not_ the prime concern of BFS, it both shows very | |
+good scalability to smaller numbers of CPUs and is likely a more scalable design | |
+at these numbers of CPUs. | |
+ | |
+It also has some very low overhead scalability features built into the design | |
+where their overhead has been deemed so marginal that they're worth adding. | |
+The first is the local copy of the running process' data to the CPU it's running | |
+on to allow that data to be updated lockless where possible. Then there is | |
+deference paid to the last CPU a task was running on, by trying that CPU first | |
+when looking for an idle CPU to use the next time it's scheduled. Finally there | |
+is the notion of "sticky" tasks that are flagged when they are involuntarily | |
+descheduled, meaning they still want further CPU time. This sticky flag is | |
+used to bias heavily against those tasks being scheduled on a different CPU | |
+unless that CPU would be otherwise idle. When a cpu frequency governor is used | |
+that scales with CPU load, such as ondemand, sticky tasks are not scheduled | |
+on a different CPU at all, preferring instead to go idle. This means the CPU | |
+they were bound to is more likely to increase its speed while the other CPU | |
+will go idle, thus speeding up total task execution time and likely decreasing | |
+power usage. This is the only scenario where BFS will allow a CPU to go idle | |
+in preference to scheduling a task on the earliest available spare CPU. | |
+ | |
+The real cost of migrating a task from one CPU to another is entirely dependent | |
+on the cache footprint of the task, how cache intensive the task is, how long | |
+it's been running on that CPU to take up the bulk of its cache, how big the CPU | |
+cache is, how fast and how layered the CPU cache is, how fast a context switch | |
+is... and so on. In other words, it's close to random in the real world where we | |
+do more than just one sole workload. The only thing we can be sure of is that | |
+it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and | |
+utilising idle CPUs is more important than cache locality, and cache locality | |
+only plays a part after that. | |
+ | |
+When choosing an idle CPU for a waking task, the cache locality is determined | |
+according to where the task last ran and then idle CPUs are ranked from best | |
+to worst to choose the most suitable idle CPU based on cache locality, NUMA | |
+node locality and hyperthread sibling busyness. They are chosen in the | |
+following preference (if idle): | |
+ | |
+* Same core, idle or busy cache, idle threads. | |
+* Other core, same cache, idle or busy cache, idle threads. | |
+* Same node, other CPU, idle cache, idle threads. | |
+* Same node, other CPU, busy cache, idle threads. | |
+* Same core, busy threads. | |
+* Other core, same cache, busy threads. | |
+* Same node, other CPU, busy threads. | |
+* Other node, other CPU, idle cache, idle threads. | |
+* Other node, other CPU, busy cache, idle threads. | |
+* Other node, other CPU, busy threads. | |
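
One compact way to read this ranking is as a scoring function; the sketch below is illustrative C only, mirroring the list above rather than the kernel's actual locality code, and the flag names are invented.

    /* Rank a candidate CPU for a waking task: lower is better. */
    struct cpu_locality {
        int same_core;      /* SMT sibling of the CPU the task last ran on */
        int same_cache;     /* shares cache with that CPU                  */
        int same_node;      /* same NUMA node                              */
        int cache_idle;     /* candidate's cache is idle                   */
        int threads_idle;   /* candidate's sibling threads are idle        */
    };

    static int locality_rank(const struct cpu_locality *c)
    {
        if (c->threads_idle) {
            if (c->same_core)               return 0;
            if (c->same_cache)              return 1;
            if (c->same_node)               return c->cache_idle ? 2 : 3;
            return c->cache_idle ? 7 : 8;   /* other node                  */
        }
        /* busy sibling threads */
        if (c->same_core)                   return 4;
        if (c->same_cache)                  return 5;
        if (c->same_node)                   return 6;
        return 9;                           /* other node, busy threads    */
    }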
+ | |
+This ranking shows the SMT or "hyperthread" awareness in the design, which will | |
+choose a real idle core first before a logical SMT sibling which already has | |
+tasks on the physical CPU. | |
+ | |
+Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. | |
+However this benchmarking was performed on an earlier design that was far less | |
+scalable than the current one so it's hard to know how scalable it is in terms | |
+of both CPUs (due to the global runqueue) and heavily loaded machines (due to | |
+O(n) lookup) at this stage. Note that in terms of scalability, the number of | |
+_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) | |
+quad core (4x) hyperthreaded (2x) machine is effectively a 16x. Newer benchmark | |
+results are very promising indeed, without needing to tweak any knobs, features | |
+or options. Benchmark contributions are most welcome. | |
+ | |
+ | |
+Features | |
+ | |
+As the initial prime target audience for BFS was the average desktop user, it | |
+was designed to not need tweaking, tuning or have features set to obtain benefit | |
+from it. Thus the number of knobs and features has been kept to an absolute | |
+minimum and should not require extra user input for the vast majority of cases. | |
+There are precisely 2 tunables and 2 extra scheduling policies: the rr_interval | |
+and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition | |
+to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is | |
+support for CGROUPS. The average user should neither need to know what these | |
+are, nor should they need to be using them to have good desktop behaviour. | |
+ | |
+rr_interval | |
+ | |
+There is only one "scheduler" tunable, the round robin interval. This can be | |
+accessed in | |
+ | |
+ /proc/sys/kernel/rr_interval | |
+ | |
+The value is in milliseconds, and the default value is set to 6ms. Valid values | |
+are from 1 to 1000. Decreasing the value will decrease latencies at the cost of | |
+decreasing throughput, while increasing it will improve throughput, but at the | |
+cost of worsening latencies. The accuracy of the rr interval is limited by HZ | |
+resolution of the kernel configuration. Thus, the worst case latencies are | |
+usually slightly higher than this actual value. BFS uses "dithering" to try and | |
+minimise the effect the HZ limitation has. The default value of 6 is not an | |
+arbitrary one. It is based on the fact that humans can detect jitter at | |
+approximately 7ms, so aiming for much lower latencies is pointless under most | |
+circumstances. It is worth noting this fact when comparing the latency | |
+performance of BFS to other schedulers. Worst case latencies being higher than | |
+7ms are far worse than average latencies not being in the microsecond range. | |
+Experimentation has shown that rr intervals being increased up to 300 can | |
+improve throughput but beyond that, scheduling noise from elsewhere prevents | |
+further demonstrable throughput. | |
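
For instance, the current value can be inspected and, as root, changed at runtime; these commands simply assume a running kernel with BFS applied:

    cat /proc/sys/kernel/rr_interval
    echo 10 > /proc/sys/kernel/rr_interval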
+ | |
+Isochronous scheduling. | |
+ | |
+Isochronous scheduling is a unique scheduling policy designed to provide | |
+near-real-time performance to unprivileged (ie non-root) users without the | |
+ability to starve the machine indefinitely. Isochronous tasks (which means | |
+"same time") are set using, for example, the schedtool application like so: | |
+ | |
+ schedtool -I -e amarok | |
+ | |
+This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works | |
+is that it has a priority level between true realtime tasks and SCHED_NORMAL, | |
+which allows ISO tasks to preempt all normal tasks, in a SCHED_RR fashion (ie, | |
+if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval | |
+rate). However if ISO tasks run for more than a tunable finite amount of time, | |
+they are then demoted back to SCHED_NORMAL scheduling. This finite amount of | |
+time is the percentage of _total CPU_ available across the machine, configurable | |
+as a percentage in the following "resource handling" tunable (as opposed to a | |
+scheduler tunable): | |
+ | |
+ /proc/sys/kernel/iso_cpu | |
+ | |
+and is set to 70% by default. It is calculated over a rolling 5 second average. | |
+Because it is the total CPU available, it means that on a multi CPU machine, it | |
+is possible to have an ISO task running with realtime scheduling indefinitely on | |
+just one CPU, as the other CPUs will be available. Setting this to 100 is the | |
+equivalent of giving all users SCHED_RR access and setting it to 0 removes the | |
+ability to run any pseudo-realtime tasks. | |
+ | |
+A feature of BFS is that it detects when an application tries to obtain a | |
+realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the | |
+appropriate privileges to use those policies. When it detects this, it will | |
+give the task SCHED_ISO policy instead. Thus it is transparent to the user. | |
+Because some applications constantly set their policy as well as their nice | |
+level, there is potential for them to undo the SCHED_ISO override specified by | |
+the user on the command line. To counter this, once | |
+a task has been set to SCHED_ISO policy, it needs superuser privileges to set | |
+it back to SCHED_NORMAL. This will ensure the task remains ISO and all child | |
+processes and threads will also inherit the ISO policy. | |
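
As a rough userspace illustration of this behaviour (ordinary POSIX calls, nothing BFS-specific; the exact return value an unprivileged caller sees may differ between BFS versions, so treat this as a sketch):

    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 1 };

        /* An unprivileged process asking for a realtime policy.  On mainline
         * this simply fails with EPERM; under BFS, as described above, the
         * task is given SCHED_ISO instead, transparently to the caller. */
        if (sched_setscheduler(0, SCHED_RR, &sp) == -1)
            printf("sched_setscheduler: %s\n", strerror(errno));

        printf("current policy: %d\n", sched_getscheduler(0));
        return 0;
    }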
+ | |
+Idleprio scheduling. | |
+ | |
+Idleprio scheduling is a scheduling policy designed to give out CPU to a task | |
+_only_ when the CPU would be otherwise idle. The idea behind this is to allow | |
+ultra low priority tasks to be run in the background that have virtually no | |
+effect on the foreground tasks. This is ideally suited to distributed computing | |
+clients (like setiathome, folding, mprime etc) but can also be used to start | |
+a video encode or so on without any slowdown of other tasks. To prevent tasks | |
+in this policy from grabbing shared resources and holding them indefinitely, if | |
+the scheduler detects a state where the task is waiting on I/O, the machine is | |
+about to suspend to ram and so on, the task will transiently be scheduled as | |
+SCHED_NORMAL. As | |
+per the Isochronous task management, once a task has been scheduled as IDLEPRIO, | |
+it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can | |
+be set to start as SCHED_IDLEPRIO with the schedtool command like so: | |
+ | |
+ schedtool -D -e ./mprime | |
+ | |
+Subtick accounting. | |
+ | |
+It is surprisingly difficult to get accurate CPU accounting, and in many cases, | |
+the accounting is done by simply determining what is happening at the precise | |
+moment a timer tick fires off. This becomes increasingly inaccurate as the | |
+timer tick frequency (HZ) is lowered. It is possible to create an application | |
+which uses almost 100% CPU, yet by being descheduled at the right time, records | |
+zero CPU usage. While the main problem with this is that there are possible | |
+security implications, it is also difficult to determine how much CPU a task | |
+really does use. BFS tries to use the sub-tick accounting from the TSC clock, | |
+where possible, to determine real CPU usage. This is not entirely reliable, but | |
+is far more likely to produce accurate CPU usage data than the existing designs | |
+and will not show tasks as consuming no CPU usage when they actually are. Thus, | |
+the amount of CPU reported as being used by BFS will more accurately represent | |
+how much CPU the task itself is using (as is shown for example by the 'time' | |
+application), so the reported values may differ from those of other schedulers. | |
+Values reported as the 'load' are more prone to problems with this design, but | |
+per process values are closer to real usage. When comparing throughput of BFS | |
+to other designs, it is important to compare the actual completed work in terms | |
+of total wall clock time taken and total work done, rather than the reported | |
+"cpu usage". | |
+ | |
+ | |
+Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011 | |
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt | |
index 57baff5..800e894 100644 | |
--- a/Documentation/sysctl/kernel.txt | |
+++ b/Documentation/sysctl/kernel.txt | |
@@ -38,6 +38,7 @@ show up in /proc/sys/kernel: | |
- hung_task_timeout_secs | |
- hung_task_warnings | |
- kexec_load_disabled | |
+- iso_cpu | |
- kptr_restrict | |
- kstack_depth_to_print [ X86 only ] | |
- l2cr [ PPC only ] | |
@@ -65,6 +66,7 @@ show up in /proc/sys/kernel: | |
- randomize_va_space | |
- real-root-dev ==> Documentation/initrd.txt | |
- reboot-cmd [ SPARC only ] | |
+- rr_interval | |
- rtsig-max | |
- rtsig-nr | |
- sem | |
@@ -379,6 +381,16 @@ kernel stack. | |
============================================================== | |
+iso_cpu: (BFS CPU scheduler only). | |
+ | |
+This sets the percentage of cpu that unprivileged SCHED_ISO tasks can | |
+use to run effectively at realtime priority, averaged over a rolling | |
+five seconds across the -whole- system, meaning all cpus. | |
+ | |
+Set to 70 (percent) by default. | |
+ | |
+============================================================== | |
+ | |
l2cr: (PPC only) | |
This flag controls the L2 cache of G3 processor boards. If | |
@@ -700,6 +712,20 @@ rebooting. ??? | |
============================================================== | |
+rr_interval: (BFS CPU scheduler only) | |
+ | |
+This is the smallest duration that any cpu process scheduling unit | |
+will run for. Increasing this value can increase throughput of cpu | |
+bound tasks substantially but at the expense of increased latencies | |
+overall. Conversely decreasing it will decrease average and maximum | |
+latencies but at the expense of throughput. This value is in | |
+milliseconds and the default value chosen depends on the number of | |
+cpus available at scheduler initialisation with a minimum of 6. | |
+ | |
+Valid values are from 1-1000. | |
+ | |
+============================================================== | |
+ | |
rtsig-max & rtsig-nr: | |
The file rtsig-max can be used to tune the maximum number | |
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX | |
index 081c497..75bde3d 100644 | |
--- a/Documentation/vm/00-INDEX | |
+++ b/Documentation/vm/00-INDEX | |
@@ -16,6 +16,8 @@ hwpoison.txt | |
- explains what hwpoison is | |
ksm.txt | |
- how to use the Kernel Samepage Merging feature. | |
+uksm.txt | |
+ - Introduction to Ultra KSM | |
numa | |
- information about NUMA specific code in the Linux vm. | |
numa_memory_policy.txt | |
diff --git a/Documentation/vm/uksm.txt b/Documentation/vm/uksm.txt | |
new file mode 100644 | |
index 0000000..08bd645 | |
--- /dev/null | |
+++ b/Documentation/vm/uksm.txt | |
@@ -0,0 +1,58 @@ | |
+The Ultra Kernel Samepage Merging feature | |
+---------------------------------------------- | |
+/* | |
+ * Ultra KSM. Copyright (C) 2011-2012 Nai Xia | |
+ * | |
+ * This is an improvement upon KSM. Some basic data structures and routines | |
+ * are borrowed from ksm.c . | |
+ * | |
+ * Its new features: | |
+ * 1. Full system scan: | |
+ * It automatically scans all user processes' anonymous VMAs. Kernel-user | |
+ * interaction to submit a memory area to KSM is no longer needed. | |
+ * | |
+ * 2. Rich area detection: | |
+ * It automatically detects rich areas containing abundant duplicated | |
+ * pages. Rich areas are given a full scan speed. Poor areas are | |
+ * sampled at a reasonable speed with very low CPU consumption. | |
+ * | |
+ * 3. Ultra Per-page scan speed improvement: | |
+ * A new hash algorithm is proposed. As a result, on a machine with | |
+ * Core(TM)2 Quad Q9300 CPU in 32-bit mode and 800MHZ DDR2 main memory, it | |
+ * can scan memory areas that do not contain duplicated pages at speeds of | |
+ * 627MB/sec ~ 2445MB/sec and can merge duplicated areas at speeds of | |
+ * 477MB/sec ~ 923MB/sec. | |
+ * | |
+ * 4. Thrashing area avoidance: | |
+ * A thrashing area (a VMA that has frequent KSM page break-outs) can be | |
+ * filtered out. My benchmark shows it's more efficient than KSM's per-page | |
+ * hash value based volatile page detection. | |
+ * | |
+ * | |
+ * 5. Misc changes upon KSM: | |
+ * * It has a fully x86-optimized memcmp dedicated for 4-byte-aligned page | |
+ * comparison. It's much faster than default C version on x86. | |
+ * * rmap_item now has a struct page * member to loosely cache an | |
+ * address-->page mapping, which avoids many time-costly | |
+ * follow_page() calls. | |
+ * * The VMA creation/exit procedures are hooked to let the Ultra KSM know. | |
+ * * try_to_merge_two_pages() now can revert a pte if it fails. No break_ | |
+ * ksm is needed for this case. | |
+ * | |
+ * 6. Full Zero Page consideration (contributed by Figo Zhang) | |
+ * Now uksmd considers full zero pages as special pages and merges them | |
+ * into a special unswappable uksm zero page. | |
+ */ | |
+ | |
+ChangeLog: | |
+ | |
+2012-05-05 The creation of this Doc | |
+2012-05-08 UKSM 0.1.1.1 libc crash bug fix, api clean up, doc clean up. | |
+2012-05-28 UKSM 0.1.1.2 bug fix release | |
+2012-06-26 UKSM 0.1.2-beta1 first beta release for 0.1.2 | |
+2012-07-2 UKSM 0.1.2-beta2 | |
+2012-07-10 UKSM 0.1.2-beta3 | |
+2012-07-26 UKSM 0.1.2 Fine grained speed control, more scan optimization. | |
+2012-10-13 UKSM 0.1.2.1 Bug fixes. | |
+2012-12-31 UKSM 0.1.2.2 Minor bug fixes | |
+2014-07-02 UKSM 0.1.2.3 Fix a " __this_cpu_read() in preemptible bug" | |
diff --git a/Makefile b/Makefile | |
index fd80c6e..0290fc0 100644 | |
--- a/Makefile | |
+++ b/Makefile | |
@@ -1,7 +1,7 @@ | |
VERSION = 3 | |
PATCHLEVEL = 18 | |
SUBLEVEL = 0 | |
-EXTRAVERSION = | |
+EXTRAVERSION = -pf0 | |
NAME = Diseased Newt | |
# *DOCUMENTATION* | |
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c | |
index 998f632..5c80a0f 100644 | |
--- a/arch/powerpc/platforms/cell/spufs/sched.c | |
+++ b/arch/powerpc/platforms/cell/spufs/sched.c | |
@@ -64,11 +64,6 @@ static struct timer_list spusched_timer; | |
static struct timer_list spuloadavg_timer; | |
/* | |
- * Priority of a normal, non-rt, non-niced'd process (aka nice level 0). | |
- */ | |
-#define NORMAL_PRIO 120 | |
- | |
-/* | |
* Frequency of the spu scheduler tick. By default we do one SPU scheduler | |
* tick for every 10 CPU scheduler ticks. | |
*/ | |
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig | |
index 41a503c..b17eeb9 100644 | |
--- a/arch/x86/Kconfig | |
+++ b/arch/x86/Kconfig | |
@@ -839,10 +839,26 @@ config SCHED_SMT | |
depends on X86_HT | |
---help--- | |
SMT scheduler support improves the CPU scheduler's decision making | |
- when dealing with Intel Pentium 4 chips with HyperThreading at a | |
+ when dealing with Intel P4/Core 2 chips with HyperThreading at a | |
cost of slightly increased overhead in some places. If unsure say | |
N here. | |
+config SMT_NICE | |
+ bool "SMT (Hyperthreading) aware nice priority and policy support" | |
+ depends on X86_HT && SCHED_BFS && SCHED_SMT | |
+ default y | |
+ ---help--- | |
+ Enabling Hyperthreading on Intel CPUs decreases the effectiveness | |
+ of the use of 'nice' levels and different scheduling policies | |
+ (e.g. realtime) due to sharing of CPU power between hyperthreads. | |
+ SMT nice support makes each logical CPU aware of what is running on | |
+ its hyperthread siblings, maintaining appropriate distribution of | |
+ CPU according to nice levels and scheduling policies at the expense | |
+ of slightly increased overhead. | |
+ | |
+ If unsure say Y here. | |
+ | |
+ | |
config SCHED_MC | |
def_bool y | |
prompt "Multi-core scheduler support" | |
@@ -1179,7 +1195,7 @@ config HIGHMEM64G | |
endchoice | |
choice | |
- prompt "Memory split" if EXPERT | |
+ prompt "Memory split" | |
default VMSPLIT_3G | |
depends on X86_32 | |
---help--- | |
@@ -1199,17 +1215,17 @@ choice | |
option alone! | |
config VMSPLIT_3G | |
- bool "3G/1G user/kernel split" | |
+ bool "Default 896MB lowmem (3G/1G user/kernel split)" | |
config VMSPLIT_3G_OPT | |
depends on !X86_PAE | |
- bool "3G/1G user/kernel split (for full 1G low memory)" | |
+ bool "1GB lowmem (3G/1G user/kernel split)" | |
config VMSPLIT_2G | |
- bool "2G/2G user/kernel split" | |
+ bool "2GB lowmem (2G/2G user/kernel split)" | |
config VMSPLIT_2G_OPT | |
depends on !X86_PAE | |
- bool "2G/2G user/kernel split (for full 2G low memory)" | |
+ bool "2GB lowmem (2G/2G user/kernel split)" | |
config VMSPLIT_1G | |
- bool "1G/3G user/kernel split" | |
+ bool "3GB lowmem (1G/3G user/kernel split)" | |
endchoice | |
config PAGE_OFFSET | |
@@ -1854,7 +1870,7 @@ config HOTPLUG_CPU | |
config BOOTPARAM_HOTPLUG_CPU0 | |
bool "Set default setting of cpu0_hotpluggable" | |
default n | |
- depends on HOTPLUG_CPU | |
+ depends on HOTPLUG_CPU && !SCHED_BFS | |
---help--- | |
Set whether default state of cpu0_hotpluggable is on or off. | |
@@ -1883,7 +1899,7 @@ config BOOTPARAM_HOTPLUG_CPU0 | |
config DEBUG_HOTPLUG_CPU0 | |
def_bool n | |
prompt "Debug CPU0 hotplug" | |
- depends on HOTPLUG_CPU | |
+ depends on HOTPLUG_CPU && !SCHED_BFS | |
---help--- | |
Enabling this option offlines CPU0 (if CPU0 can be offlined) as | |
soon as possible and boots up userspace with CPU0 offlined. User | |
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched | |
index 421bef9..0ee5f0f 100644 | |
--- a/block/Kconfig.iosched | |
+++ b/block/Kconfig.iosched | |
@@ -39,6 +39,27 @@ config CFQ_GROUP_IOSCHED | |
---help--- | |
Enable group IO scheduling in CFQ. | |
+config IOSCHED_BFQ | |
+ tristate "BFQ I/O scheduler" | |
+ default n | |
+ ---help--- | |
+ The BFQ I/O scheduler tries to distribute bandwidth among | |
+ all processes according to their weights. | |
+ It aims at distributing the bandwidth as desired, independently of | |
+ the disk parameters and with any workload. It also tries to | |
+ guarantee low latency to interactive and soft real-time | |
+ applications. If compiled built-in (saying Y here), BFQ can | |
+ be configured to support hierarchical scheduling. | |
+ | |
+config CGROUP_BFQIO | |
+ bool "BFQ hierarchical scheduling support" | |
+ depends on CGROUPS && IOSCHED_BFQ=y | |
+ default n | |
+ ---help--- | |
+ Enable hierarchical scheduling in BFQ, using the cgroups | |
+ filesystem interface. The name of the subsystem will be | |
+ bfqio. | |
+ | |
choice | |
prompt "Default I/O scheduler" | |
default DEFAULT_CFQ | |
@@ -52,6 +73,16 @@ choice | |
config DEFAULT_CFQ | |
bool "CFQ" if IOSCHED_CFQ=y | |
+ config DEFAULT_BFQ | |
+ bool "BFQ" if IOSCHED_BFQ=y | |
+ help | |
+ Selects BFQ as the default I/O scheduler which will be | |
+ used by default for all block devices. | |
+ The BFQ I/O scheduler aims at distributing the bandwidth | |
+ as desired, independently of the disk parameters and with | |
+ any workload. It also tries to guarantee low latency to | |
+ interactive and soft real-time applications. | |
+ | |
config DEFAULT_NOOP | |
bool "No-op" | |
@@ -61,6 +92,7 @@ config DEFAULT_IOSCHED | |
string | |
default "deadline" if DEFAULT_DEADLINE | |
default "cfq" if DEFAULT_CFQ | |
+ default "bfq" if DEFAULT_BFQ | |
default "noop" if DEFAULT_NOOP | |
endmenu | |
diff --git a/block/Makefile b/block/Makefile | |
index 00ecc97..1ed86d5 100644 | |
--- a/block/Makefile | |
+++ b/block/Makefile | |
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o | |
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o | |
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o | |
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o | |
+obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o | |
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o | |
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o | |
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c | |
new file mode 100644 | |
index 0000000..eb140eb | |
--- /dev/null | |
+++ b/block/bfq-cgroup.c | |
@@ -0,0 +1,930 @@ | |
+/* | |
+ * BFQ: CGROUPS support. | |
+ * | |
+ * Based on ideas and code from CFQ: | |
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> | |
+ * | |
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> | |
+ * Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ | |
+ * file. | |
+ */ | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ | |
+static DEFINE_MUTEX(bfqio_mutex); | |
+ | |
+static bool bfqio_is_removed(struct bfqio_cgroup *bgrp) | |
+{ | |
+ return bgrp ? !bgrp->online : false; | |
+} | |
+ | |
+static struct bfqio_cgroup bfqio_root_cgroup = { | |
+ .weight = BFQ_DEFAULT_GRP_WEIGHT, | |
+ .ioprio = BFQ_DEFAULT_GRP_IOPRIO, | |
+ .ioprio_class = BFQ_DEFAULT_GRP_CLASS, | |
+}; | |
+ | |
+static inline void bfq_init_entity(struct bfq_entity *entity, | |
+ struct bfq_group *bfqg) | |
+{ | |
+ entity->weight = entity->new_weight; | |
+ entity->orig_weight = entity->new_weight; | |
+ entity->ioprio = entity->new_ioprio; | |
+ entity->ioprio_class = entity->new_ioprio_class; | |
+ entity->parent = bfqg->my_entity; | |
+ entity->sched_data = &bfqg->sched_data; | |
+} | |
+ | |
+static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css) | |
+{ | |
+ return css ? container_of(css, struct bfqio_cgroup, css) : NULL; | |
+} | |
+ | |
+/* | |
+ * Search the bfq_group for bfqd into the hash table (by now only a list) | |
+ * of bgrp. Must be called under rcu_read_lock(). | |
+ */ | |
+static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp, | |
+ struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_group *bfqg; | |
+ void *key; | |
+ | |
+ hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) { | |
+ key = rcu_dereference(bfqg->bfqd); | |
+ if (key == bfqd) | |
+ return bfqg; | |
+ } | |
+ | |
+ return NULL; | |
+} | |
+ | |
+static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp, | |
+ struct bfq_group *bfqg) | |
+{ | |
+ struct bfq_entity *entity = &bfqg->entity; | |
+ | |
+ /* | |
+ * If the weight of the entity has never been set via the sysfs | |
+ * interface, then bgrp->weight == 0. In this case we initialize | |
+ * the weight from the current ioprio value. Otherwise, the group | |
+ * weight, if set, has priority over the ioprio value. | |
+ */ | |
+ if (bgrp->weight == 0) { | |
+ entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio); | |
+ entity->new_ioprio = bgrp->ioprio; | |
+ } else { | |
+ entity->new_weight = bgrp->weight; | |
+ entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight); | |
+ } | |
+ entity->orig_weight = entity->weight = entity->new_weight; | |
+ entity->ioprio = entity->new_ioprio; | |
+ entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class; | |
+ entity->my_sched_data = &bfqg->sched_data; | |
+ bfqg->active_entities = 0; | |
+} | |
+ | |
+static inline void bfq_group_set_parent(struct bfq_group *bfqg, | |
+ struct bfq_group *parent) | |
+{ | |
+ struct bfq_entity *entity; | |
+ | |
+ BUG_ON(parent == NULL); | |
+ BUG_ON(bfqg == NULL); | |
+ | |
+ entity = &bfqg->entity; | |
+ entity->parent = parent->my_entity; | |
+ entity->sched_data = &parent->sched_data; | |
+} | |
+ | |
+/** | |
+ * bfq_group_chain_alloc - allocate a chain of groups. | |
+ * @bfqd: queue descriptor. | |
+ * @css: the leaf cgroup_subsys_state this chain starts from. | |
+ * | |
+ * Allocate a chain of groups starting from the one belonging to | |
+ * @cgroup up to the root cgroup. Stop if a cgroup on the chain | |
+ * to the root has already an allocated group on @bfqd. | |
+ */ | |
+static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd, | |
+ struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfqio_cgroup *bgrp; | |
+ struct bfq_group *bfqg, *prev = NULL, *leaf = NULL; | |
+ | |
+ for (; css != NULL; css = css->parent) { | |
+ bgrp = css_to_bfqio(css); | |
+ | |
+ bfqg = bfqio_lookup_group(bgrp, bfqd); | |
+ if (bfqg != NULL) { | |
+ /* | |
+ * All the cgroups in the path from there to the | |
+ * root must have a bfq_group for bfqd, so we don't | |
+ * need any more allocations. | |
+ */ | |
+ break; | |
+ } | |
+ | |
+ bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC); | |
+ if (bfqg == NULL) | |
+ goto cleanup; | |
+ | |
+ bfq_group_init_entity(bgrp, bfqg); | |
+ bfqg->my_entity = &bfqg->entity; | |
+ | |
+ if (leaf == NULL) { | |
+ leaf = bfqg; | |
+ prev = leaf; | |
+ } else { | |
+ bfq_group_set_parent(prev, bfqg); | |
+ /* | |
+ * Build a list of allocated nodes using the bfqd | |
+ * field, which is still unused and will be | |
+ * initialized only after the node is | |
+ * connected. | |
+ */ | |
+ prev->bfqd = bfqg; | |
+ prev = bfqg; | |
+ } | |
+ } | |
+ | |
+ return leaf; | |
+ | |
+cleanup: | |
+ while (leaf != NULL) { | |
+ prev = leaf; | |
+ leaf = leaf->bfqd; | |
+ kfree(prev); | |
+ } | |
+ | |
+ return NULL; | |
+} | |
+ | |
+/** | |
+ * bfq_group_chain_link - link an allocated group chain to a cgroup | |
+ * hierarchy. | |
+ * @bfqd: the queue descriptor. | |
+ * @css: the leaf cgroup_subsys_state to start from. | |
+ * @leaf: the leaf group (to be associated to @cgroup). | |
+ * | |
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the | |
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the | |
+ * hierarchy that already has a group associated to @bfqd all the nodes | |
+ * in the path to the root cgroup have one too. | |
+ * | |
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy | |
+ * per device) while the bfqio_cgroup lock protects the list of groups | |
+ * belonging to the same cgroup. | |
+ */ | |
+static void bfq_group_chain_link(struct bfq_data *bfqd, | |
+ struct cgroup_subsys_state *css, | |
+ struct bfq_group *leaf) | |
+{ | |
+ struct bfqio_cgroup *bgrp; | |
+ struct bfq_group *bfqg, *next, *prev = NULL; | |
+ unsigned long flags; | |
+ | |
+ assert_spin_locked(bfqd->queue->queue_lock); | |
+ | |
+ for (; css != NULL && leaf != NULL; css = css->parent) { | |
+ bgrp = css_to_bfqio(css); | |
+ next = leaf->bfqd; | |
+ | |
+ bfqg = bfqio_lookup_group(bgrp, bfqd); | |
+ BUG_ON(bfqg != NULL); | |
+ | |
+ spin_lock_irqsave(&bgrp->lock, flags); | |
+ | |
+ rcu_assign_pointer(leaf->bfqd, bfqd); | |
+ hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data); | |
+ hlist_add_head(&leaf->bfqd_node, &bfqd->group_list); | |
+ | |
+ spin_unlock_irqrestore(&bgrp->lock, flags); | |
+ | |
+ prev = leaf; | |
+ leaf = next; | |
+ } | |
+ | |
+ BUG_ON(css == NULL && leaf != NULL); | |
+ if (css != NULL && prev != NULL) { | |
+ bgrp = css_to_bfqio(css); | |
+ bfqg = bfqio_lookup_group(bgrp, bfqd); | |
+ bfq_group_set_parent(prev, bfqg); | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup. | |
+ * @bfqd: queue descriptor. | |
+ * @cgroup: cgroup being searched for. | |
+ * | |
+ * Return a group associated to @bfqd in @cgroup, allocating one if | |
+ * necessary. When a group is returned all the cgroups in the path | |
+ * to the root have a group associated to @bfqd. | |
+ * | |
+ * If the allocation fails, return the root group: this breaks guarantees | |
+ * but is a safe fallback. If this loss becomes a problem it can be | |
+ * mitigated using the equivalent weight (given by the product of the | |
+ * weights of the groups in the path from @group to the root) in the | |
+ * root scheduler. | |
+ * | |
+ * We allocate all the missing nodes in the path from the leaf cgroup | |
+ * to the root and we connect the nodes only after all the allocations | |
+ * have been successful. | |
+ */ | |
+static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd, | |
+ struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); | |
+ struct bfq_group *bfqg; | |
+ | |
+ bfqg = bfqio_lookup_group(bgrp, bfqd); | |
+ if (bfqg != NULL) | |
+ return bfqg; | |
+ | |
+ bfqg = bfq_group_chain_alloc(bfqd, css); | |
+ if (bfqg != NULL) | |
+ bfq_group_chain_link(bfqd, css, bfqg); | |
+ else | |
+ bfqg = bfqd->root_group; | |
+ | |
+ return bfqg; | |
+} | |
+ | |
+/** | |
+ * bfq_bfqq_move - migrate @bfqq to @bfqg. | |
+ * @bfqd: queue descriptor. | |
+ * @bfqq: the queue to move. | |
+ * @entity: @bfqq's entity. | |
+ * @bfqg: the group to move to. | |
+ * | |
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating | |
+ * it on the new one. Avoid putting the entity on the old group idle tree. | |
+ * | |
+ * Must be called under the queue lock; the cgroup owning @bfqg must | |
+ * not disappear (by now this just means that we are called under | |
+ * rcu_read_lock()). | |
+ */ | |
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ struct bfq_entity *entity, struct bfq_group *bfqg) | |
+{ | |
+ int busy, resume; | |
+ | |
+ busy = bfq_bfqq_busy(bfqq); | |
+ resume = !RB_EMPTY_ROOT(&bfqq->sort_list); | |
+ | |
+ BUG_ON(resume && !entity->on_st); | |
+ BUG_ON(busy && !resume && entity->on_st && | |
+ bfqq != bfqd->in_service_queue); | |
+ | |
+ if (busy) { | |
+ BUG_ON(atomic_read(&bfqq->ref) < 2); | |
+ | |
+ if (!resume) | |
+ bfq_del_bfqq_busy(bfqd, bfqq, 0); | |
+ else | |
+ bfq_deactivate_bfqq(bfqd, bfqq, 0); | |
+ } else if (entity->on_st) | |
+ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity); | |
+ | |
+ /* | |
+ * Here we use a reference to bfqg. We don't need a refcounter | |
+ * as the cgroup reference will not be dropped, so that its | |
+ * destroy() callback will not be invoked. | |
+ */ | |
+ entity->parent = bfqg->my_entity; | |
+ entity->sched_data = &bfqg->sched_data; | |
+ | |
+ if (busy && resume) | |
+ bfq_activate_bfqq(bfqd, bfqq); | |
+ | |
+ if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver) | |
+ bfq_schedule_dispatch(bfqd); | |
+} | |
+ | |
+/** | |
+ * __bfq_bic_change_cgroup - move @bic to @cgroup. | |
+ * @bfqd: the queue descriptor. | |
+ * @bic: the bic to move. | |
+ * @cgroup: the cgroup to move to. | |
+ * | |
+ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller | |
+ * has to make sure that the reference to cgroup is valid across the call. | |
+ * | |
+ * NOTE: an alternative approach might have been to store the current | |
+ * cgroup in bfqq and getting a reference to it, reducing the lookup | |
+ * time here, at the price of slightly more complex code. | |
+ */ | |
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd, | |
+ struct bfq_io_cq *bic, | |
+ struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0); | |
+ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1); | |
+ struct bfq_entity *entity; | |
+ struct bfq_group *bfqg; | |
+ struct bfqio_cgroup *bgrp; | |
+ | |
+ bgrp = css_to_bfqio(css); | |
+ | |
+ bfqg = bfq_find_alloc_group(bfqd, css); | |
+ if (async_bfqq != NULL) { | |
+ entity = &async_bfqq->entity; | |
+ | |
+ if (entity->sched_data != &bfqg->sched_data) { | |
+ bic_set_bfqq(bic, NULL, 0); | |
+ bfq_log_bfqq(bfqd, async_bfqq, | |
+ "bic_change_group: %p %d", | |
+ async_bfqq, atomic_read(&async_bfqq->ref)); | |
+ bfq_put_queue(async_bfqq); | |
+ } | |
+ } | |
+ | |
+ if (sync_bfqq != NULL) { | |
+ entity = &sync_bfqq->entity; | |
+ if (entity->sched_data != &bfqg->sched_data) | |
+ bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg); | |
+ } | |
+ | |
+ return bfqg; | |
+} | |
+ | |
+/** | |
+ * bfq_bic_change_cgroup - move @bic to @cgroup. | |
+ * @bic: the bic being migrated. | |
+ * @cgroup: the destination cgroup. | |
+ * | |
+ * When the task owning @bic is moved to @cgroup, @bic is immediately | |
+ * moved into its new parent group. | |
+ */ | |
+static void bfq_bic_change_cgroup(struct bfq_io_cq *bic, | |
+ struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfq_data *bfqd; | |
+ unsigned long uninitialized_var(flags); | |
+ | |
+ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data), | |
+ &flags); | |
+ if (bfqd != NULL) { | |
+ __bfq_bic_change_cgroup(bfqd, bic, css); | |
+ bfq_put_bfqd_unlock(bfqd, &flags); | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_bic_update_cgroup - update the cgroup of @bic. | |
+ * @bic: the @bic to update. | |
+ * | |
+ * Make sure that @bic is enqueued in the cgroup of the current task. | |
+ * We need this in addition to moving bics during the cgroup attach | |
+ * phase because the task owning @bic could be at its first disk | |
+ * access or we may end up in the root cgroup as the result of a | |
+ * memory allocation failure and here we try to move to the right | |
+ * group. | |
+ * | |
+ * Must be called under the queue lock. It is safe to use the returned | |
+ * value even after the rcu_read_unlock() as the migration/destruction | |
+ * paths act under the queue lock too. IOW it is impossible to race with | |
+ * group migration/destruction and end up with an invalid group as: | |
+ * a) here cgroup has not yet been destroyed, nor its destroy callback | |
+ * has started execution, as current holds a reference to it, | |
+ * b) if it is destroyed after rcu_read_unlock() [after current is | |
+ * migrated to a different cgroup] its attach() callback will have | |
+ * taken care of removing all the references to the old cgroup data. | |
+ */ | |
+static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic) | |
+{ | |
+ struct bfq_data *bfqd = bic_to_bfqd(bic); | |
+ struct bfq_group *bfqg; | |
+ struct cgroup_subsys_state *css; | |
+ | |
+ BUG_ON(bfqd == NULL); | |
+ | |
+ rcu_read_lock(); | |
+ css = task_css(current, bfqio_cgrp_id); | |
+ bfqg = __bfq_bic_change_cgroup(bfqd, bic, css); | |
+ rcu_read_unlock(); | |
+ | |
+ return bfqg; | |
+} | |
+ | |
+/** | |
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st. | |
+ * @st: the service tree being flushed. | |
+ */ | |
+static inline void bfq_flush_idle_tree(struct bfq_service_tree *st) | |
+{ | |
+ struct bfq_entity *entity = st->first_idle; | |
+ | |
+ for (; entity != NULL; entity = st->first_idle) | |
+ __bfq_deactivate_entity(entity, 0); | |
+} | |
+ | |
+/** | |
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group. | |
+ * @bfqd: the device data structure with the root group. | |
+ * @entity: the entity to move. | |
+ */ | |
+static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ | |
+ BUG_ON(bfqq == NULL); | |
+ bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group); | |
+ return; | |
+} | |
+ | |
+/** | |
+ * bfq_reparent_active_entities - move to the root group all active | |
+ * entities. | |
+ * @bfqd: the device data structure with the root group. | |
+ * @bfqg: the group to move from. | |
+ * @st: the service tree with the entities. | |
+ * | |
+ * Needs queue_lock to be taken and reference to be valid over the call. | |
+ */ | |
+static inline void bfq_reparent_active_entities(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg, | |
+ struct bfq_service_tree *st) | |
+{ | |
+ struct rb_root *active = &st->active; | |
+ struct bfq_entity *entity = NULL; | |
+ | |
+ if (!RB_EMPTY_ROOT(&st->active)) | |
+ entity = bfq_entity_of(rb_first(active)); | |
+ | |
+ for (; entity != NULL; entity = bfq_entity_of(rb_first(active))) | |
+ bfq_reparent_leaf_entity(bfqd, entity); | |
+ | |
+ if (bfqg->sched_data.in_service_entity != NULL) | |
+ bfq_reparent_leaf_entity(bfqd, | |
+ bfqg->sched_data.in_service_entity); | |
+ | |
+ return; | |
+} | |
+ | |
+/** | |
+ * bfq_destroy_group - destroy @bfqg. | |
+ * @bgrp: the bfqio_cgroup containing @bfqg. | |
+ * @bfqg: the group being destroyed. | |
+ * | |
+ * Destroy @bfqg, making sure that it is not referenced from its parent. | |
+ */ | |
+static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg) | |
+{ | |
+ struct bfq_data *bfqd; | |
+ struct bfq_service_tree *st; | |
+ struct bfq_entity *entity = bfqg->my_entity; | |
+ unsigned long uninitialized_var(flags); | |
+ int i; | |
+ | |
+ hlist_del(&bfqg->group_node); | |
+ | |
+ /* | |
+ * Empty all service_trees belonging to this group before | |
+ * deactivating the group itself. | |
+ */ | |
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) { | |
+ st = bfqg->sched_data.service_tree + i; | |
+ | |
+ /* | |
+ * The idle tree may still contain bfq_queues belonging | |
+ * to exited tasks because they never migrated to a different | |
+ * cgroup from the one being destroyed now. No one else | |
+ * can access them so it's safe to act without any lock. | |
+ */ | |
+ bfq_flush_idle_tree(st); | |
+ | |
+ /* | |
+ * It may happen that some queues are still active | |
+ * (busy) upon group destruction (if the corresponding | |
+ * processes have been forced to terminate). We move | |
+ * all the leaf entities corresponding to these queues | |
+ * to the root_group. | |
+ * Also, it may happen that the group has an entity | |
+ * in service, which is disconnected from the active | |
+ * tree: it must be moved, too. | |
+ * There is no need to put the sync queues, as the | |
+ * scheduler has taken no reference. | |
+ */ | |
+ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags); | |
+ if (bfqd != NULL) { | |
+ bfq_reparent_active_entities(bfqd, bfqg, st); | |
+ bfq_put_bfqd_unlock(bfqd, &flags); | |
+ } | |
+ BUG_ON(!RB_EMPTY_ROOT(&st->active)); | |
+ BUG_ON(!RB_EMPTY_ROOT(&st->idle)); | |
+ } | |
+ BUG_ON(bfqg->sched_data.next_in_service != NULL); | |
+ BUG_ON(bfqg->sched_data.in_service_entity != NULL); | |
+ | |
+ /* | |
+ * We may race with device destruction, take extra care when | |
+ * dereferencing bfqg->bfqd. | |
+ */ | |
+ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags); | |
+ if (bfqd != NULL) { | |
+ hlist_del(&bfqg->bfqd_node); | |
+ __bfq_deactivate_entity(entity, 0); | |
+ bfq_put_async_queues(bfqd, bfqg); | |
+ bfq_put_bfqd_unlock(bfqd, &flags); | |
+ } | |
+ BUG_ON(entity->tree != NULL); | |
+ | |
+ /* | |
+ * No need to defer the kfree() to the end of the RCU grace | |
+ * period: we are called from the destroy() callback of our | |
+ * cgroup, so we can be sure that no one is a) still using | |
+ * this cgroup or b) doing lookups in it. | |
+ */ | |
+ kfree(bfqg); | |
+} | |
+ | |
+static void bfq_end_wr_async(struct bfq_data *bfqd) | |
+{ | |
+ struct hlist_node *tmp; | |
+ struct bfq_group *bfqg; | |
+ | |
+ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) | |
+ bfq_end_wr_async_queues(bfqd, bfqg); | |
+ bfq_end_wr_async_queues(bfqd, bfqd->root_group); | |
+} | |
+ | |
+/** | |
+ * bfq_disconnect_groups - disconnect @bfqd from all its groups. | |
+ * @bfqd: the device descriptor being exited. | |
+ * | |
+ * When the device exits we just make sure that no lookup can return | |
+ * the now unused group structures. They will be deallocated on cgroup | |
+ * destruction. | |
+ */ | |
+static void bfq_disconnect_groups(struct bfq_data *bfqd) | |
+{ | |
+ struct hlist_node *tmp; | |
+ struct bfq_group *bfqg; | |
+ | |
+ bfq_log(bfqd, "disconnect_groups beginning"); | |
+ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) { | |
+ hlist_del(&bfqg->bfqd_node); | |
+ | |
+ __bfq_deactivate_entity(bfqg->my_entity, 0); | |
+ | |
+ /* | |
+ * Don't remove from the group hash, just set an | |
+ * invalid key. No lookups can race with the | |
+ * assignment as bfqd is being destroyed; this | |
+ * implies also that new elements cannot be added | |
+ * to the list. | |
+ */ | |
+ rcu_assign_pointer(bfqg->bfqd, NULL); | |
+ | |
+ bfq_log(bfqd, "disconnect_groups: put async for group %p", | |
+ bfqg); | |
+ bfq_put_async_queues(bfqd, bfqg); | |
+ } | |
+} | |
+ | |
+static inline void bfq_free_root_group(struct bfq_data *bfqd) | |
+{ | |
+ struct bfqio_cgroup *bgrp = &bfqio_root_cgroup; | |
+ struct bfq_group *bfqg = bfqd->root_group; | |
+ | |
+ bfq_put_async_queues(bfqd, bfqg); | |
+ | |
+ spin_lock_irq(&bgrp->lock); | |
+ hlist_del_rcu(&bfqg->group_node); | |
+ spin_unlock_irq(&bgrp->lock); | |
+ | |
+ /* | |
+ * No need to synchronize_rcu() here: since the device is gone | |
+ * there cannot be any read-side access to its root_group. | |
+ */ | |
+ kfree(bfqg); | |
+} | |
+ | |
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node) | |
+{ | |
+ struct bfq_group *bfqg; | |
+ struct bfqio_cgroup *bgrp; | |
+ int i; | |
+ | |
+ bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node); | |
+ if (bfqg == NULL) | |
+ return NULL; | |
+ | |
+ bfqg->entity.parent = NULL; | |
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) | |
+ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; | |
+ | |
+ bgrp = &bfqio_root_cgroup; | |
+ spin_lock_irq(&bgrp->lock); | |
+ rcu_assign_pointer(bfqg->bfqd, bfqd); | |
+ hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data); | |
+ spin_unlock_irq(&bgrp->lock); | |
+ | |
+ return bfqg; | |
+} | |
+ | |
+#define SHOW_FUNCTION(__VAR) \ | |
+static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \ | |
+ struct cftype *cftype) \ | |
+{ \ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \ | |
+ u64 ret = -ENODEV; \ | |
+ \ | |
+ mutex_lock(&bfqio_mutex); \ | |
+ if (bfqio_is_removed(bgrp)) \ | |
+ goto out_unlock; \ | |
+ \ | |
+ spin_lock_irq(&bgrp->lock); \ | |
+ ret = bgrp->__VAR; \ | |
+ spin_unlock_irq(&bgrp->lock); \ | |
+ \ | |
+out_unlock: \ | |
+ mutex_unlock(&bfqio_mutex); \ | |
+ return ret; \ | |
+} | |
+ | |
+SHOW_FUNCTION(weight); | |
+SHOW_FUNCTION(ioprio); | |
+SHOW_FUNCTION(ioprio_class); | |
+#undef SHOW_FUNCTION | |
+ | |
+#define STORE_FUNCTION(__VAR, __MIN, __MAX) \ | |
+static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\ | |
+ struct cftype *cftype, \ | |
+ u64 val) \ | |
+{ \ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \ | |
+ struct bfq_group *bfqg; \ | |
+ int ret = -EINVAL; \ | |
+ \ | |
+ if (val < (__MIN) || val > (__MAX)) \ | |
+ return ret; \ | |
+ \ | |
+ ret = -ENODEV; \ | |
+ mutex_lock(&bfqio_mutex); \ | |
+ if (bfqio_is_removed(bgrp)) \ | |
+ goto out_unlock; \ | |
+ ret = 0; \ | |
+ \ | |
+ spin_lock_irq(&bgrp->lock); \ | |
+ bgrp->__VAR = (unsigned short)val; \ | |
+ hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) { \ | |
+ /* \ | |
+ * Setting the ioprio_changed flag of the entity \ | |
+ * to 1 with new_##__VAR == ##__VAR would re-set \ | |
+ * the value of the weight to its ioprio mapping. \ | |
+ * Set the flag only if necessary. \ | |
+ */ \ | |
+ if ((unsigned short)val != bfqg->entity.new_##__VAR) { \ | |
+ bfqg->entity.new_##__VAR = (unsigned short)val; \ | |
+ /* \ | |
+ * Make sure that the above new value has been \ | |
+ * stored in bfqg->entity.new_##__VAR before \ | |
+ * setting the ioprio_changed flag. In fact, \ | |
+ * this flag may be read asynchronously (in \ | |
+ * critical sections protected by a different \ | |
+ * lock than that held here), and finding this \ | |
+ * flag set may cause the execution of the code \ | |
+ * for updating parameters whose value may \ | |
+ * depend also on bfqg->entity.new_##__VAR (in \ | |
+ * __bfq_entity_update_weight_prio). \ | |
+ * This barrier makes sure that the new value \ | |
+ * of bfqg->entity.new_##__VAR is correctly \ | |
+ * seen in that code. \ | |
+ */ \ | |
+ smp_wmb(); \ | |
+ bfqg->entity.ioprio_changed = 1; \ | |
+ } \ | |
+ } \ | |
+ spin_unlock_irq(&bgrp->lock); \ | |
+ \ | |
+out_unlock: \ | |
+ mutex_unlock(&bfqio_mutex); \ | |
+ return ret; \ | |
+} | |
+ | |
+STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT); | |
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1); | |
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE); | |
+#undef STORE_FUNCTION | |
+ | |
+static struct cftype bfqio_files[] = { | |
+ { | |
+ .name = "weight", | |
+ .read_u64 = bfqio_cgroup_weight_read, | |
+ .write_u64 = bfqio_cgroup_weight_write, | |
+ }, | |
+ { | |
+ .name = "ioprio", | |
+ .read_u64 = bfqio_cgroup_ioprio_read, | |
+ .write_u64 = bfqio_cgroup_ioprio_write, | |
+ }, | |
+ { | |
+ .name = "ioprio_class", | |
+ .read_u64 = bfqio_cgroup_ioprio_class_read, | |
+ .write_u64 = bfqio_cgroup_ioprio_class_write, | |
+ }, | |
+ { }, /* terminate */ | |
+}; | |
+ | |
+static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state | |
+ *parent_css) | |
+{ | |
+ struct bfqio_cgroup *bgrp; | |
+ | |
+ if (parent_css != NULL) { | |
+ bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL); | |
+ if (bgrp == NULL) | |
+ return ERR_PTR(-ENOMEM); | |
+ } else | |
+ bgrp = &bfqio_root_cgroup; | |
+ | |
+ spin_lock_init(&bgrp->lock); | |
+ INIT_HLIST_HEAD(&bgrp->group_data); | |
+ bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO; | |
+ bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS; | |
+ | |
+ return &bgrp->css; | |
+} | |
+ | |
+/* | |
+ * We cannot support shared io contexts, as we have no means to support | |
+ * two tasks with the same ioc in two different groups without major rework | |
+ * of the main bic/bfqq data structures. For now we allow a task to change | |
+ * its cgroup only if it's the only owner of its ioc; the drawback of this | |
+ * behavior is that a group containing a task that forked using CLONE_IO | |
+ * will not be destroyed until the tasks sharing the ioc die. | |
+ */ | |
+static int bfqio_can_attach(struct cgroup_subsys_state *css, | |
+ struct cgroup_taskset *tset) | |
+{ | |
+ struct task_struct *task; | |
+ struct io_context *ioc; | |
+ int ret = 0; | |
+ | |
+ cgroup_taskset_for_each(task, tset) { | |
+ /* | |
+ * task_lock() is needed to avoid races with | |
+ * exit_io_context() | |
+ */ | |
+ task_lock(task); | |
+ ioc = task->io_context; | |
+ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1) | |
+ /* | |
+ * ioc == NULL means that the task is either too | |
+			 * young or exiting: if it still has no ioc, the | |
+			 * ioc can't be shared; if the task is exiting, the | |
+			 * attach will fail anyway, no matter what we | |
+ * return here. | |
+ */ | |
+ ret = -EINVAL; | |
+ task_unlock(task); | |
+ if (ret) | |
+ break; | |
+ } | |
+ | |
+ return ret; | |
+} | |
+ | |
+static void bfqio_attach(struct cgroup_subsys_state *css, | |
+ struct cgroup_taskset *tset) | |
+{ | |
+ struct task_struct *task; | |
+ struct io_context *ioc; | |
+ struct io_cq *icq; | |
+ | |
+ /* | |
+ * IMPORTANT NOTE: The move of more than one process at a time to a | |
+ * new group has not yet been tested. | |
+ */ | |
+ cgroup_taskset_for_each(task, tset) { | |
+ ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE); | |
+ if (ioc) { | |
+ /* | |
+ * Handle cgroup change here. | |
+ */ | |
+ rcu_read_lock(); | |
+ hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node) | |
+ if (!strncmp( | |
+ icq->q->elevator->type->elevator_name, | |
+ "bfq", ELV_NAME_MAX)) | |
+ bfq_bic_change_cgroup(icq_to_bic(icq), | |
+ css); | |
+ rcu_read_unlock(); | |
+ put_io_context(ioc); | |
+ } | |
+ } | |
+} | |
+ | |
+static void bfqio_destroy(struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); | |
+ struct hlist_node *tmp; | |
+ struct bfq_group *bfqg; | |
+ | |
+ /* | |
+ * Since we are destroying the cgroup, there are no more tasks | |
+ * referencing it, and all the RCU grace periods that may have | |
+ * referenced it are ended (as the destruction of the parent | |
+ * cgroup is RCU-safe); bgrp->group_data will not be accessed by | |
+ * anything else and we don't need any synchronization. | |
+ */ | |
+ hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node) | |
+ bfq_destroy_group(bgrp, bfqg); | |
+ | |
+ BUG_ON(!hlist_empty(&bgrp->group_data)); | |
+ | |
+ kfree(bgrp); | |
+} | |
+ | |
+static int bfqio_css_online(struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); | |
+ | |
+ mutex_lock(&bfqio_mutex); | |
+ bgrp->online = true; | |
+ mutex_unlock(&bfqio_mutex); | |
+ | |
+ return 0; | |
+} | |
+ | |
+static void bfqio_css_offline(struct cgroup_subsys_state *css) | |
+{ | |
+ struct bfqio_cgroup *bgrp = css_to_bfqio(css); | |
+ | |
+ mutex_lock(&bfqio_mutex); | |
+ bgrp->online = false; | |
+ mutex_unlock(&bfqio_mutex); | |
+} | |
+ | |
+struct cgroup_subsys bfqio_cgrp_subsys = { | |
+ .css_alloc = bfqio_create, | |
+ .css_online = bfqio_css_online, | |
+ .css_offline = bfqio_css_offline, | |
+ .can_attach = bfqio_can_attach, | |
+ .attach = bfqio_attach, | |
+ .css_free = bfqio_destroy, | |
+ .legacy_cftypes = bfqio_files, | |
+}; | |
+#else | |
+static inline void bfq_init_entity(struct bfq_entity *entity, | |
+ struct bfq_group *bfqg) | |
+{ | |
+ entity->weight = entity->new_weight; | |
+ entity->orig_weight = entity->new_weight; | |
+ entity->ioprio = entity->new_ioprio; | |
+ entity->ioprio_class = entity->new_ioprio_class; | |
+ entity->sched_data = &bfqg->sched_data; | |
+} | |
+ | |
+static inline struct bfq_group * | |
+bfq_bic_update_cgroup(struct bfq_io_cq *bic) | |
+{ | |
+ struct bfq_data *bfqd = bic_to_bfqd(bic); | |
+ return bfqd->root_group; | |
+} | |
+ | |
+static inline void bfq_bfqq_move(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ struct bfq_entity *entity, | |
+ struct bfq_group *bfqg) | |
+{ | |
+} | |
+ | |
+static void bfq_end_wr_async(struct bfq_data *bfqd) | |
+{ | |
+ bfq_end_wr_async_queues(bfqd, bfqd->root_group); | |
+} | |
+ | |
+static inline void bfq_disconnect_groups(struct bfq_data *bfqd) | |
+{ | |
+ bfq_put_async_queues(bfqd, bfqd->root_group); | |
+} | |
+ | |
+static inline void bfq_free_root_group(struct bfq_data *bfqd) | |
+{ | |
+ kfree(bfqd->root_group); | |
+} | |
+ | |
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node) | |
+{ | |
+ struct bfq_group *bfqg; | |
+ int i; | |
+ | |
+ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node); | |
+ if (bfqg == NULL) | |
+ return NULL; | |
+ | |
+ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) | |
+ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; | |
+ | |
+ return bfqg; | |
+} | |
+#endif | |
diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c | |
new file mode 100644 | |
index 0000000..7f6b000 | |
--- /dev/null | |
+++ b/block/bfq-ioc.c | |
@@ -0,0 +1,36 @@ | |
+/* | |
+ * BFQ: I/O context handling. | |
+ * | |
+ * Based on ideas and code from CFQ: | |
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> | |
+ * | |
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> | |
+ * Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> | |
+ */ | |
+ | |
+/** | |
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq. | |
+ * @icq: the iocontext queue. | |
+ */ | |
+static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq) | |
+{ | |
+ /* bic->icq is the first member, %NULL will convert to %NULL */ | |
+ return container_of(icq, struct bfq_io_cq, icq); | |
+} | |
+ | |
+/** | |
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd. | |
+ * @bfqd: the lookup key. | |
+ * @ioc: the io_context of the process doing I/O. | |
+ * | |
+ * Queue lock must be held. | |
+ */ | |
+static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd, | |
+ struct io_context *ioc) | |
+{ | |
+ if (ioc) | |
+ return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue)); | |
+ return NULL; | |
+} | |
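A side note on icq_to_bic() above: it works only because icq is the first member of struct bfq_io_cq, so container_of() subtracts an offset of zero and a NULL icq (e.g. from a failed ioc_lookup_icq()) comes back as a NULL bic, exactly as the comment promises. A minimal userspace sketch of that property, using stand-in struct definitions rather than the real kernel ones:

#include <stddef.h>
#include <stdio.h>

/* Stand-ins for struct io_cq / struct bfq_io_cq: icq is the FIRST member. */
struct io_cq { int dummy; };
struct bfq_io_cq { struct io_cq icq; int wr_time_left; };

/* Simplified container_of(), without the kernel's type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
{
	/* offsetof(struct bfq_io_cq, icq) == 0, so NULL maps to NULL. */
	return container_of(icq, struct bfq_io_cq, icq);
}

int main(void)
{
	struct bfq_io_cq bic;

	printf("offset of icq   : %zu\n", offsetof(struct bfq_io_cq, icq));
	printf("round trip ok   : %d\n", icq_to_bic(&bic.icq) == &bic);
	printf("NULL stays NULL : %d\n", icq_to_bic(NULL) == NULL);
	return 0;
}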
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c | |
new file mode 100644 | |
index 0000000..bbfb4e1 | |
--- /dev/null | |
+++ b/block/bfq-iosched.c | |
@@ -0,0 +1,4200 @@ | |
+/* | |
+ * Budget Fair Queueing (BFQ) disk scheduler. | |
+ * | |
+ * Based on ideas and code from CFQ: | |
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> | |
+ * | |
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> | |
+ * Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ | |
+ * file. | |
+ * | |
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based on | |
+ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets, | |
+ * measured in number of sectors, to processes instead of time slices. The | |
+ * device is not granted to the in-service process for a given time slice, | |
+ * but until it has exhausted its assigned budget. This change from the time | |
+ * to the service domain allows BFQ to distribute the device throughput | |
+ * among processes as desired, without any distortion due to ZBR, workload | |
+ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler, | |
+ * called B-WF2Q+, to schedule processes according to their budgets. More | |
+ * precisely, BFQ schedules queues associated to processes. Thanks to the | |
+ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to | |
+ * I/O-bound processes issuing sequential requests (to boost the | |
+ * throughput), and yet guarantee a low latency to interactive and soft | |
+ * real-time applications. | |
+ * | |
+ * BFQ is described in [1], where a reference to the initial, more | |
+ * theoretical paper on BFQ can also be found. The interested reader can find | |
+ * in the latter paper full details on the main algorithm, as well as | |
+ * formulas of the guarantees and formal proofs of all the properties. | |
+ * With respect to the version of BFQ presented in these papers, this | |
+ * implementation adds a few more heuristics, such as the one that | |
+ * guarantees a low latency to soft real-time applications, and a | |
+ * hierarchical extension based on H-WF2Q+. | |
+ * | |
+ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with | |
+ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N) | |
+ * complexity derives from the one introduced with EEVDF in [3]. | |
+ * | |
+ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness | |
+ * with the BFQ Disk I/O Scheduler'', | |
+ * Proceedings of the 5th Annual International Systems and Storage | |
+ * Conference (SYSTOR '12), June 2012. | |
+ * | |
+ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf | |
+ * | |
+ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing | |
+ * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689, | |
+ * Oct 1997. | |
+ * | |
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz | |
+ * | |
+ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline | |
+ * First: A Flexible and Accurate Mechanism for Proportional Share | |
+ * Resource Allocation,'' technical report. | |
+ * | |
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf | |
+ */ | |
+#include <linux/module.h> | |
+#include <linux/slab.h> | |
+#include <linux/blkdev.h> | |
+#include <linux/cgroup.h> | |
+#include <linux/elevator.h> | |
+#include <linux/jiffies.h> | |
+#include <linux/rbtree.h> | |
+#include <linux/ioprio.h> | |
+#include "bfq.h" | |
+#include "blk.h" | |
+ | |
+/* Max number of dispatches in one round of service. */ | |
+static const int bfq_quantum = 4; | |
+ | |
+/* Expiration time of sync (0) and async (1) requests, in jiffies. */ | |
+static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; | |
+ | |
+/* Maximum backwards seek, in KiB. */ | |
+static const int bfq_back_max = 16 * 1024; | |
+ | |
+/* Penalty of a backwards seek, in number of sectors. */ | |
+static const int bfq_back_penalty = 2; | |
+ | |
+/* Idling period duration, in jiffies. */ | |
+static int bfq_slice_idle = HZ / 125; | |
+ | |
+/* Default maximum budget values, in sectors and number of requests. */ | |
+static const int bfq_default_max_budget = 16 * 1024; | |
+static const int bfq_max_budget_async_rq = 4; | |
+ | |
+/* | |
+ * Async to sync throughput distribution is controlled as follows: | |
+ * when an async request is served, the entity is charged the number | |
+ * of sectors of the request, multiplied by the factor below | |
+ */ | |
+static const int bfq_async_charge_factor = 10; | |
+ | |
+/* Default timeout values, in jiffies, approximating CFQ defaults. */ | |
+static const int bfq_timeout_sync = HZ / 8; | |
+static int bfq_timeout_async = HZ / 25; | |
+ | |
+struct kmem_cache *bfq_pool; | |
+ | |
+/* Below this threshold (in ms), we consider thinktime immediate. */ | |
+#define BFQ_MIN_TT 2 | |
+ | |
+/* hw_tag detection: parallel requests threshold and min samples needed. */ | |
+#define BFQ_HW_QUEUE_THRESHOLD 4 | |
+#define BFQ_HW_QUEUE_SAMPLES 32 | |
+ | |
+#define BFQQ_SEEK_THR (sector_t)(8 * 1024) | |
+#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR) | |
+ | |
+/* Min samples used for peak rate estimation (for autotuning). */ | |
+#define BFQ_PEAK_RATE_SAMPLES 32 | |
+ | |
+/* Shift used for peak rate fixed precision calculations. */ | |
+#define BFQ_RATE_SHIFT 16 | |
+ | |
+/* | |
+ * By default, BFQ computes the duration of the weight raising for | |
+ * interactive applications automatically, using the following formula: | |
+ * duration = (R / r) * T, where r is the peak rate of the device, and | |
+ * R and T are two reference parameters. | |
+ * In particular, R is the peak rate of the reference device (see below), | |
+ * and T is a reference time: given the systems that are likely to be | |
+ * installed on the reference device according to its speed class, T is | |
+ * about the maximum time needed, under BFQ and while reading two files in | |
+ * parallel, to load typical large applications on these systems. | |
+ * In practice, the slower/faster the device at hand is, the more/less it | |
+ * takes to load applications with respect to the reference device. | |
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive | |
+ * applications. | |
+ * | |
+ * BFQ uses four different reference pairs (R, T), depending on: | |
+ * . whether the device is rotational or non-rotational; | |
+ * . whether the device is slow, such as old or portable HDDs, as well as | |
+ * SD cards, or fast, such as newer HDDs and SSDs. | |
+ * | |
+ * The device's speed class is dynamically (re)detected in | |
+ * bfq_update_peak_rate() every time the estimated peak rate is updated. | |
+ * | |
+ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0] | |
+ * are the reference values for a slow/fast rotational device, whereas | |
+ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for | |
+ * a slow/fast non-rotational device. Finally, device_speed_thresh are the | |
+ * thresholds used to switch between speed classes. | |
+ * Both the reference peak rates and the thresholds are measured in | |
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT. | |
+ */ | |
+static int R_slow[2] = {1536, 10752}; | |
+static int R_fast[2] = {17415, 34791}; | |
+/* | |
+ * To improve readability, a conversion function is used to initialize the | |
+ * following arrays, which entails that they can be initialized only in a | |
+ * function. | |
+ */ | |
+static int T_slow[2]; | |
+static int T_fast[2]; | |
+static int device_speed_thresh[2]; | |
+ | |
+#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \ | |
+ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 }) | |
+ | |
+#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0]) | |
+#define RQ_BFQQ(rq) ((rq)->elv.priv[1]) | |
+ | |
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd); | |
+ | |
+#include "bfq-ioc.c" | |
+#include "bfq-sched.c" | |
+#include "bfq-cgroup.c" | |
+ | |
+#define bfq_class_idle(bfqq) ((bfqq)->entity.ioprio_class ==\ | |
+ IOPRIO_CLASS_IDLE) | |
+#define bfq_class_rt(bfqq) ((bfqq)->entity.ioprio_class ==\ | |
+ IOPRIO_CLASS_RT) | |
+ | |
+#define bfq_sample_valid(samples) ((samples) > 80) | |
+ | |
+/* | |
+ * We regard a request as SYNC if it is a read or has the SYNC bit | |
+ * set (in which case it could also be a direct WRITE). | |
+ */ | |
+static inline int bfq_bio_sync(struct bio *bio) | |
+{ | |
+ if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC)) | |
+ return 1; | |
+ | |
+ return 0; | |
+} | |
+ | |
+/* | |
+ * Scheduler run of queue, if there are requests pending and no one in the | |
+ * driver that will restart queueing. | |
+ */ | |
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd) | |
+{ | |
+ if (bfqd->queued != 0) { | |
+ bfq_log(bfqd, "schedule dispatch"); | |
+ kblockd_schedule_work(&bfqd->unplug_work); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Lifted from AS - choose which of rq1 and rq2 is best served now. | |
+ * We choose the request that is closest to the head right now. Distance | |
+ * behind the head is penalized and only allowed to a certain extent. | |
+ */ | |
+static struct request *bfq_choose_req(struct bfq_data *bfqd, | |
+ struct request *rq1, | |
+ struct request *rq2, | |
+ sector_t last) | |
+{ | |
+ sector_t s1, s2, d1 = 0, d2 = 0; | |
+ unsigned long back_max; | |
+#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */ | |
+#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */ | |
+ unsigned wrap = 0; /* bit mask: requests behind the disk head? */ | |
+ | |
+ if (rq1 == NULL || rq1 == rq2) | |
+ return rq2; | |
+ if (rq2 == NULL) | |
+ return rq1; | |
+ | |
+ if (rq_is_sync(rq1) && !rq_is_sync(rq2)) | |
+ return rq1; | |
+ else if (rq_is_sync(rq2) && !rq_is_sync(rq1)) | |
+ return rq2; | |
+ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META)) | |
+ return rq1; | |
+ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META)) | |
+ return rq2; | |
+ | |
+ s1 = blk_rq_pos(rq1); | |
+ s2 = blk_rq_pos(rq2); | |
+ | |
+ /* | |
+ * By definition, 1KiB is 2 sectors. | |
+ */ | |
+ back_max = bfqd->bfq_back_max * 2; | |
+ | |
+ /* | |
+ * Strict one way elevator _except_ in the case where we allow | |
+ * short backward seeks which are biased as twice the cost of a | |
+ * similar forward seek. | |
+ */ | |
+ if (s1 >= last) | |
+ d1 = s1 - last; | |
+ else if (s1 + back_max >= last) | |
+ d1 = (last - s1) * bfqd->bfq_back_penalty; | |
+ else | |
+ wrap |= BFQ_RQ1_WRAP; | |
+ | |
+ if (s2 >= last) | |
+ d2 = s2 - last; | |
+ else if (s2 + back_max >= last) | |
+ d2 = (last - s2) * bfqd->bfq_back_penalty; | |
+ else | |
+ wrap |= BFQ_RQ2_WRAP; | |
+ | |
+ /* Found required data */ | |
+ | |
+ /* | |
+ * By doing switch() on the bit mask "wrap" we avoid having to | |
+ * check two variables for all permutations: --> faster! | |
+ */ | |
+ switch (wrap) { | |
+ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */ | |
+ if (d1 < d2) | |
+ return rq1; | |
+ else if (d2 < d1) | |
+ return rq2; | |
+ else { | |
+ if (s1 >= s2) | |
+ return rq1; | |
+ else | |
+ return rq2; | |
+ } | |
+ | |
+ case BFQ_RQ2_WRAP: | |
+ return rq1; | |
+ case BFQ_RQ1_WRAP: | |
+ return rq2; | |
+ case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */ | |
+ default: | |
+ /* | |
+ * Since both rqs are wrapped, | |
+ * start with the one that's further behind head | |
+ * (--> only *one* back seek required), | |
+ * since back seek takes more time than forward. | |
+ */ | |
+ if (s1 <= s2) | |
+ return rq1; | |
+ else | |
+ return rq2; | |
+ } | |
+} | |
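A quick numeric check of the distance rule above, using the defaults defined earlier in this file (bfq_back_penalty = 2, bfq_back_max = 16384 KiB, i.e. 32768 sectors of allowed slack behind the head): with the head at sector 10000, a request 100 sectors ahead gets d = 100, a request 50 sectors behind gets d = 50 * 2 = 100, and the tie then goes to the higher sector. The standalone sketch below reproduces only that distance computation, not the full wrap/switch handling:

#include <stdio.h>

#define BACK_MAX	(16 * 1024 * 2)	/* bfq_back_max in KiB -> sectors */
#define BACK_PENALTY	2		/* bfq_back_penalty */

/*
 * Distance of a request at sector s from the head position 'last', with
 * backward seeks penalized as in bfq_choose_req(); -1 means the request
 * is too far behind the head (it "wraps").
 */
static long long rq_dist(unsigned long long s, unsigned long long last)
{
	if (s >= last)
		return s - last;
	if (s + BACK_MAX >= last)
		return (last - s) * BACK_PENALTY;
	return -1;
}

int main(void)
{
	unsigned long long head = 10000;

	printf("100 ahead : %lld\n", rq_dist(head + 100, head));	/* 100 */
	printf("50 behind : %lld\n", rq_dist(head - 50, head));	/* 100 */
	printf("far behind: %lld\n", rq_dist(0, head + 100000));	/* -1  */
	return 0;
}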
+ | |
+static struct bfq_queue * | |
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root, | |
+ sector_t sector, struct rb_node **ret_parent, | |
+ struct rb_node ***rb_link) | |
+{ | |
+ struct rb_node **p, *parent; | |
+ struct bfq_queue *bfqq = NULL; | |
+ | |
+ parent = NULL; | |
+ p = &root->rb_node; | |
+ while (*p) { | |
+ struct rb_node **n; | |
+ | |
+ parent = *p; | |
+ bfqq = rb_entry(parent, struct bfq_queue, pos_node); | |
+ | |
+ /* | |
+ * Sort strictly based on sector. Smallest to the left, | |
+ * largest to the right. | |
+ */ | |
+ if (sector > blk_rq_pos(bfqq->next_rq)) | |
+ n = &(*p)->rb_right; | |
+ else if (sector < blk_rq_pos(bfqq->next_rq)) | |
+ n = &(*p)->rb_left; | |
+ else | |
+ break; | |
+ p = n; | |
+ bfqq = NULL; | |
+ } | |
+ | |
+ *ret_parent = parent; | |
+ if (rb_link) | |
+ *rb_link = p; | |
+ | |
+ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d", | |
+ (long long unsigned)sector, | |
+ bfqq != NULL ? bfqq->pid : 0); | |
+ | |
+ return bfqq; | |
+} | |
+ | |
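+/* | |
+ * (Re)position @bfqq in the rq_pos_tree, keyed by the sector of its | |
+ * next_rq, unless another queue already sits at that position; the tree | |
+ * is what bfqq_close() walks to find close cooperating queues. | |
+ */ | |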
+static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ struct rb_node **p, *parent; | |
+ struct bfq_queue *__bfqq; | |
+ | |
+ if (bfqq->pos_root != NULL) { | |
+ rb_erase(&bfqq->pos_node, bfqq->pos_root); | |
+ bfqq->pos_root = NULL; | |
+ } | |
+ | |
+ if (bfq_class_idle(bfqq)) | |
+ return; | |
+ if (!bfqq->next_rq) | |
+ return; | |
+ | |
+ bfqq->pos_root = &bfqd->rq_pos_tree; | |
+ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root, | |
+ blk_rq_pos(bfqq->next_rq), &parent, &p); | |
+ if (__bfqq == NULL) { | |
+ rb_link_node(&bfqq->pos_node, parent, p); | |
+ rb_insert_color(&bfqq->pos_node, bfqq->pos_root); | |
+ } else | |
+ bfqq->pos_root = NULL; | |
+} | |
+ | |
+/* | |
+ * Tell whether there are active queues or groups with differentiated weights. | |
+ */ | |
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd) | |
+{ | |
+ BUG_ON(!bfqd->hw_tag); | |
+ /* | |
+ * For weights to differ, at least one of the trees must contain | |
+ * at least two nodes. | |
+ */ | |
+ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) && | |
+ (bfqd->queue_weights_tree.rb_node->rb_left || | |
+ bfqd->queue_weights_tree.rb_node->rb_right) | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ ) || | |
+ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) && | |
+ (bfqd->group_weights_tree.rb_node->rb_left || | |
+ bfqd->group_weights_tree.rb_node->rb_right) | |
+#endif | |
+ ); | |
+} | |
+ | |
+/* | |
+ * If the weight-counter tree passed as input contains no counter for | |
+ * the weight of the input entity, then add that counter; otherwise just | |
+ * increment the existing counter. | |
+ * | |
+ * Note that weight-counter trees contain few nodes in mostly symmetric | |
+ * scenarios. For example, if all queues have the same weight, then the | |
+ * weight-counter tree for the queues may contain at most one node. | |
+ * This holds even if low_latency is on, because weight-raised queues | |
+ * are not inserted in the tree. | |
+ * In most scenarios, the rate at which nodes are created/destroyed | |
+ * should be low too. | |
+ */ | |
+static void bfq_weights_tree_add(struct bfq_data *bfqd, | |
+ struct bfq_entity *entity, | |
+ struct rb_root *root) | |
+{ | |
+ struct rb_node **new = &(root->rb_node), *parent = NULL; | |
+ | |
+ /* | |
+ * Do not insert if: | |
+ * - the device does not support queueing; | |
+ * - the entity is already associated with a counter, which happens if: | |
+ * 1) the entity is associated with a queue, 2) a request arrival | |
+ * has caused the queue to become both non-weight-raised, and hence | |
+ * change its weight, and backlogged; in this respect, each | |
+ * of the two events causes an invocation of this function, | |
+ * 3) this is the invocation of this function caused by the second | |
+ * event. This second invocation is actually useless, and we handle | |
+ * this fact by exiting immediately. More efficient or clearer | |
+	 *    solutions might be adopted. | |
+ */ | |
+ if (!bfqd->hw_tag || entity->weight_counter) | |
+ return; | |
+ | |
+ while (*new) { | |
+ struct bfq_weight_counter *__counter = container_of(*new, | |
+ struct bfq_weight_counter, | |
+ weights_node); | |
+ parent = *new; | |
+ | |
+ if (entity->weight == __counter->weight) { | |
+ entity->weight_counter = __counter; | |
+ goto inc_counter; | |
+ } | |
+ if (entity->weight < __counter->weight) | |
+ new = &((*new)->rb_left); | |
+ else | |
+ new = &((*new)->rb_right); | |
+ } | |
+ | |
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter), | |
+					 GFP_ATOMIC); | |
+	/* If the atomic allocation fails, simply do not track this weight. */ | |
+	if (entity->weight_counter == NULL) | |
+		return; | |
+	entity->weight_counter->weight = entity->weight; | |
+ rb_link_node(&entity->weight_counter->weights_node, parent, new); | |
+ rb_insert_color(&entity->weight_counter->weights_node, root); | |
+ | |
+inc_counter: | |
+ entity->weight_counter->num_active++; | |
+} | |
+ | |
+/* | |
+ * Decrement the weight counter associated with the entity, and, if the | |
+ * counter reaches 0, remove the counter from the tree. | |
+ * See the comments to the function bfq_weights_tree_add() for considerations | |
+ * about overhead. | |
+ */ | |
+static void bfq_weights_tree_remove(struct bfq_data *bfqd, | |
+ struct bfq_entity *entity, | |
+ struct rb_root *root) | |
+{ | |
+ /* | |
+ * Check whether the entity is actually associated with a counter. | |
+ * In fact, the device may not be considered NCQ-capable for a while, | |
+ * which implies that no insertion in the weight trees is performed, | |
+ * after which the device may start to be deemed NCQ-capable, and hence | |
+ * this function may start to be invoked. This may cause the function | |
+ * to be invoked for entities that are not associated with any counter. | |
+ */ | |
+ if (!entity->weight_counter) | |
+ return; | |
+ | |
+ BUG_ON(RB_EMPTY_ROOT(root)); | |
+ BUG_ON(entity->weight_counter->weight != entity->weight); | |
+ | |
+ BUG_ON(!entity->weight_counter->num_active); | |
+ entity->weight_counter->num_active--; | |
+ if (entity->weight_counter->num_active > 0) | |
+ goto reset_entity_pointer; | |
+ | |
+ rb_erase(&entity->weight_counter->weights_node, root); | |
+ kfree(entity->weight_counter); | |
+ | |
+reset_entity_pointer: | |
+ entity->weight_counter = NULL; | |
+} | |
+ | |
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ struct request *last) | |
+{ | |
+ struct rb_node *rbnext = rb_next(&last->rb_node); | |
+ struct rb_node *rbprev = rb_prev(&last->rb_node); | |
+ struct request *next = NULL, *prev = NULL; | |
+ | |
+ BUG_ON(RB_EMPTY_NODE(&last->rb_node)); | |
+ | |
+ if (rbprev != NULL) | |
+ prev = rb_entry_rq(rbprev); | |
+ | |
+ if (rbnext != NULL) | |
+ next = rb_entry_rq(rbnext); | |
+ else { | |
+ rbnext = rb_first(&bfqq->sort_list); | |
+ if (rbnext && rbnext != &last->rb_node) | |
+ next = rb_entry_rq(rbnext); | |
+ } | |
+ | |
+ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last)); | |
+} | |
+ | |
+/* see the definition of bfq_async_charge_factor for details */ | |
+static inline unsigned long bfq_serv_to_charge(struct request *rq, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ return blk_rq_sectors(rq) * | |
+ (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) * | |
+ bfq_async_charge_factor)); | |
+} | |
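Plugging numbers into the expression above, with bfq_async_charge_factor = 10 as defined near the top of this file: a 64-sector request is charged 64 sectors when the queue is sync or currently weight-raised, and 64 * (1 + 10) = 704 sectors when it is async and not weight-raised; this over-charging is what skews the async-to-sync throughput distribution mentioned earlier in favour of sync I/O. A standalone sketch of just that arithmetic (the queue state is passed as plain parameters here, not via the kernel structures):

#include <stdio.h>

static const int bfq_async_charge_factor = 10;

/* Same arithmetic as bfq_serv_to_charge(), with the queue state inlined. */
static unsigned long serv_to_charge(unsigned long sectors, int sync,
				    int wr_coeff)
{
	return sectors *
	       (1 + (!sync * (wr_coeff == 1) * bfq_async_charge_factor));
}

int main(void)
{
	printf("sync, 64 sectors         -> %lu\n", serv_to_charge(64, 1, 1));
	printf("async, not raised, 64    -> %lu\n", serv_to_charge(64, 0, 1));
	printf("async, weight-raised, 64 -> %lu\n", serv_to_charge(64, 0, 20));
	return 0;
}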
+ | |
+/** | |
+ * bfq_updated_next_req - update the queue after a new next_rq selection. | |
+ * @bfqd: the device data the queue belongs to. | |
+ * @bfqq: the queue to update. | |
+ * | |
+ * If the first request of a queue changes we make sure that the queue | |
+ * has enough budget to serve at least its first request (if the | |
+ * request has grown). We do this because if the queue has not enough | |
+ * budget for its first request, it has to go through two dispatch | |
+ * rounds to actually get it dispatched. | |
+ */ | |
+static void bfq_updated_next_req(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity); | |
+ struct request *next_rq = bfqq->next_rq; | |
+ unsigned long new_budget; | |
+ | |
+ if (next_rq == NULL) | |
+ return; | |
+ | |
+ if (bfqq == bfqd->in_service_queue) | |
+ /* | |
+ * In order not to break guarantees, budgets cannot be | |
+ * changed after an entity has been selected. | |
+ */ | |
+ return; | |
+ | |
+ BUG_ON(entity->tree != &st->active); | |
+ BUG_ON(entity == entity->sched_data->in_service_entity); | |
+ | |
+ new_budget = max_t(unsigned long, bfqq->max_budget, | |
+ bfq_serv_to_charge(next_rq, bfqq)); | |
+ if (entity->budget != new_budget) { | |
+ entity->budget = new_budget; | |
+ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu", | |
+ new_budget); | |
+ bfq_activate_bfqq(bfqd, bfqq); | |
+ } | |
+} | |
+ | |
+static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd) | |
+{ | |
+ u64 dur; | |
+ | |
+ if (bfqd->bfq_wr_max_time > 0) | |
+ return bfqd->bfq_wr_max_time; | |
+ | |
+ dur = bfqd->RT_prod; | |
+ do_div(dur, bfqd->peak_rate); | |
+ | |
+ return dur; | |
+} | |
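A worked instance of the duration = (R / r) * T rule that this function implements, assuming bfqd->RT_prod caches R * T for the detected speed class (that initialization is outside this hunk) and using an illustrative T value; the point is only that a device measured at half the reference rate gets twice the reference weight-raising duration:

#include <stdio.h>

/*
 * Sketch of duration = (R / r) * T as computed by bfq_wr_duration(),
 * i.e. an integer division of a cached R * T product by the measured
 * peak rate (do_div() in the kernel version). Units follow the
 * R_slow[]/peak_rate convention: sectors/usec << BFQ_RATE_SHIFT.
 */
static unsigned long wr_duration(unsigned long long rt_prod,
				 unsigned long peak_rate)
{
	return rt_prod / peak_rate;
}

int main(void)
{
	unsigned long R = 1536;	/* R_slow[0]: slow rotational reference rate */
	unsigned long T = 1000;	/* reference time, illustrative value only   */

	/* Device exactly as fast as the reference: duration is T. */
	printf("r = R     -> %lu\n", wr_duration((unsigned long long)R * T, R));
	/* Device half as fast: weight raising lasts twice as long. */
	printf("r = R / 2 -> %lu\n",
	       wr_duration((unsigned long long)R * T, R / 2));
	return 0;
}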
+ | |
+static inline unsigned | |
+bfq_bfqq_cooperations(struct bfq_queue *bfqq) | |
+{ | |
+ return bfqq->bic ? bfqq->bic->cooperations : 0; | |
+} | |
+ | |
+static inline void | |
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic) | |
+{ | |
+ if (bic->saved_idle_window) | |
+ bfq_mark_bfqq_idle_window(bfqq); | |
+ else | |
+ bfq_clear_bfqq_idle_window(bfqq); | |
+ if (bic->saved_IO_bound) | |
+ bfq_mark_bfqq_IO_bound(bfqq); | |
+ else | |
+ bfq_clear_bfqq_IO_bound(bfqq); | |
+ /* Assuming that the flag in_large_burst is already correctly set */ | |
+ if (bic->wr_time_left && bfqq->bfqd->low_latency && | |
+ !bfq_bfqq_in_large_burst(bfqq) && | |
+ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) { | |
+ /* | |
+ * Start a weight raising period with the duration given by | |
+ * the raising_time_left snapshot. | |
+ */ | |
+ if (bfq_bfqq_busy(bfqq)) | |
+ bfqq->bfqd->wr_busy_queues++; | |
+ bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff; | |
+ bfqq->wr_cur_max_time = bic->wr_time_left; | |
+ bfqq->last_wr_start_finish = jiffies; | |
+ bfqq->entity.ioprio_changed = 1; | |
+ } | |
+ /* | |
+ * Clear wr_time_left to prevent bfq_bfqq_save_state() from | |
+ * getting confused about the queue's need of a weight-raising | |
+	 * getting confused about the queue's need for a weight-raising | |
+ */ | |
+ bic->wr_time_left = 0; | |
+} | |
+ | |
+/* Must be called with the queue_lock held. */ | |
+static int bfqq_process_refs(struct bfq_queue *bfqq) | |
+{ | |
+ int process_refs, io_refs; | |
+ | |
+ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE]; | |
+ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st; | |
+ BUG_ON(process_refs < 0); | |
+ return process_refs; | |
+} | |
+ | |
+/* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */ | |
+static inline void bfq_reset_burst_list(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_queue *item; | |
+ struct hlist_node *n; | |
+ | |
+ hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node) | |
+ hlist_del_init(&item->burst_list_node); | |
+ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); | |
+ bfqd->burst_size = 1; | |
+} | |
+ | |
+/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */ | |
+static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ /* Increment burst size to take into account also bfqq */ | |
+ bfqd->burst_size++; | |
+ | |
+ if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) { | |
+ struct bfq_queue *pos, *bfqq_item; | |
+ struct hlist_node *n; | |
+ | |
+ /* | |
+ * Enough queues have been activated shortly after each | |
+ * other to consider this burst as large. | |
+ */ | |
+ bfqd->large_burst = true; | |
+ | |
+ /* | |
+ * We can now mark all queues in the burst list as | |
+ * belonging to a large burst. | |
+ */ | |
+ hlist_for_each_entry(bfqq_item, &bfqd->burst_list, | |
+ burst_list_node) | |
+ bfq_mark_bfqq_in_large_burst(bfqq_item); | |
+ bfq_mark_bfqq_in_large_burst(bfqq); | |
+ | |
+ /* | |
+ * From now on, and until the current burst finishes, any | |
+ * new queue being activated shortly after the last queue | |
+ * was inserted in the burst can be immediately marked as | |
+ * belonging to a large burst. So the burst list is not | |
+ * needed any more. Remove it. | |
+ */ | |
+ hlist_for_each_entry_safe(pos, n, &bfqd->burst_list, | |
+ burst_list_node) | |
+ hlist_del_init(&pos->burst_list_node); | |
+ } else /* burst not yet large: add bfqq to the burst list */ | |
+ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); | |
+} | |
+ | |
+/* | |
+ * If many queues happen to become active shortly after each other, then, | |
+ * to help the processes associated to these queues get their job done as | |
+ * soon as possible, it is usually better to not grant either weight-raising | |
+ * or device idling to these queues. In this comment we describe, firstly, | |
+ * the reasons why this fact holds, and, secondly, the next function, which | |
+ * implements the main steps needed to properly mark these queues so that | |
+ * they can then be treated in a different way. | |
+ * | |
+ * As for the terminology, we say that a queue becomes active, i.e., | |
+ * switches from idle to backlogged, either when it is created (as a | |
+ * consequence of the arrival of an I/O request), or, if already existing, | |
+ * when a new request for the queue arrives while the queue is idle. | |
+ * Bursts of activations, i.e., activations of different queues occurring | |
+ * shortly after each other, are typically caused by services or applications | |
+ * that spawn or reactivate many parallel threads/processes. Examples are | |
+ * systemd during boot or git grep. | |
+ * | |
+ * These services or applications benefit mostly from a high throughput: | |
+ * the quicker the requests of the activated queues are cumulatively served, | |
+ * the sooner the target job of these queues gets completed. As a consequence, | |
+ * weight-raising any of these queues, which also implies idling the device | |
+ * for it, is almost always counterproductive: in most cases it just lowers | |
+ * throughput. | |
+ * | |
+ * On the other hand, a burst of activations may also be caused by the start | |
+ * of an application that does not consist of a lot of parallel I/O-bound | |
+ * threads. In fact, with a complex application, the burst may be just a | |
+ * consequence of the fact that several processes need to be executed to | |
+ * start-up the application. To start an application as quickly as possible, | |
+ * the best thing to do is to privilege the I/O related to the application | |
+ * with respect to all other I/O. Therefore, the best strategy to start an | |
+ * application that causes a burst of activations as quickly as possible is | |
+ * to weight-raise all the queues activated during the burst. This is the | |
+ * exact opposite of the best strategy for the other type of bursts. | |
+ * | |
+ * In the end, to take the best action for each of the two cases, the two | |
+ * types of bursts need to be distinguished. Fortunately, this seems | |
+ * relatively easy to do, by looking at the sizes of the bursts. In | |
+ * particular, we found a threshold such that bursts with a larger size | |
+ * than that threshold are apparently caused only by services or commands | |
+ * such as systemd or git grep. For brevity, hereafter we call just 'large' | |
+ * these bursts. BFQ *does not* weight-raise queues whose activations occur | |
+ * in a large burst. In addition, for each of these queues BFQ performs or | |
+ * does not perform idling depending on which choice boosts the throughput | |
+ * most. The exact choice depends on the device and request pattern at | |
+ * hand. | |
+ * | |
+ * Turning back to the next function, it implements all the steps needed | |
+ * to detect the occurrence of a large burst and to properly mark all the | |
+ * queues belonging to it (so that they can then be treated in a different | |
+ * way). This goal is achieved by maintaining a special "burst list" that | |
+ * holds, temporarily, the queues that belong to the burst in progress. The | |
+ * list is then used to mark these queues as belonging to a large burst if | |
+ * the burst does become large. The main steps are the following. | |
+ * | |
+ * . when the very first queue is activated, the queue is inserted into the | |
+ * list (as it could be the first queue in a possible burst) | |
+ * | |
+ * . if the current burst has not yet become large, and a queue Q that does | |
+ * not yet belong to the burst is activated shortly after the last time | |
+ * at which a new queue entered the burst list, then the function appends | |
+ * Q to the burst list | |
+ * | |
+ * . if, as a consequence of the previous step, the burst size reaches | |
+ * the large-burst threshold, then | |
+ * | |
+ * . all the queues in the burst list are marked as belonging to a | |
+ * large burst | |
+ * | |
+ * . the burst list is deleted; in fact, the burst list already served | |
+ * its purpose (keeping temporarily track of the queues in a burst, | |
+ *      its purpose (temporarily keeping track of the queues in a burst, | |
+ * previous sub-step), and now is not needed any more | |
+ * | |
+ * . the device enters a large-burst mode | |
+ * | |
+ * . if a queue Q that does not belong to the burst is activated while | |
+ * the device is in large-burst mode and shortly after the last time | |
+ * at which a queue either entered the burst list or was marked as | |
+ * belonging to the current large burst, then Q is immediately marked | |
+ * as belonging to a large burst. | |
+ * | |
+ * . if a queue Q that does not belong to the burst is activated a while | |
+ *   later, i.e., not shortly after the last time at which a queue | |
+ * either entered the burst list or was marked as belonging to the | |
+ * current large burst, then the current burst is deemed as finished and: | |
+ * | |
+ * . the large-burst mode is reset if set | |
+ * | |
+ * . the burst list is emptied | |
+ * | |
+ * . Q is inserted in the burst list, as Q may be the first queue | |
+ * in a possible new burst (then the burst list contains just Q | |
+ * after this step). | |
+ */ | |
+static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ bool idle_for_long_time) | |
+{ | |
+ /* | |
+ * If bfqq happened to be activated in a burst, but has been idle | |
+ * for at least as long as an interactive queue, then we assume | |
+ * that, in the overall I/O initiated in the burst, the I/O | |
+ * associated to bfqq is finished. So bfqq does not need to be | |
+ * treated as a queue belonging to a burst anymore. Accordingly, | |
+ * we reset bfqq's in_large_burst flag if set, and remove bfqq | |
+	 * from the burst list if it's there. We do not, however, decrement | |
+	 * burst_size, because the fact that bfqq does not need to belong | |
+ * to the burst list any more does not invalidate the fact that | |
+ * bfqq may have been activated during the current burst. | |
+ */ | |
+ if (idle_for_long_time) { | |
+ hlist_del_init(&bfqq->burst_list_node); | |
+ bfq_clear_bfqq_in_large_burst(bfqq); | |
+ } | |
+ | |
+ /* | |
+ * If bfqq is already in the burst list or is part of a large | |
+ * burst, then there is nothing else to do. | |
+ */ | |
+ if (!hlist_unhashed(&bfqq->burst_list_node) || | |
+ bfq_bfqq_in_large_burst(bfqq)) | |
+ return; | |
+ | |
+ /* | |
+ * If bfqq's activation happens late enough, then the current | |
+ * burst is finished, and related data structures must be reset. | |
+ * | |
+ * In this respect, consider the special case where bfqq is the very | |
+ * first queue being activated. In this case, last_ins_in_burst is | |
+ * not yet significant when we get here. But it is easy to verify | |
+ * that, whether or not the following condition is true, bfqq will | |
+ * end up being inserted into the burst list. In particular the | |
+ * list will happen to contain only bfqq. And this is exactly what | |
+ * has to happen, as bfqq may be the first queue in a possible | |
+ * burst. | |
+ */ | |
+ if (time_is_before_jiffies(bfqd->last_ins_in_burst + | |
+ bfqd->bfq_burst_interval)) { | |
+ bfqd->large_burst = false; | |
+ bfq_reset_burst_list(bfqd, bfqq); | |
+ return; | |
+ } | |
+ | |
+ /* | |
+ * If we get here, then bfqq is being activated shortly after the | |
+ * last queue. So, if the current burst is also large, we can mark | |
+ * bfqq as belonging to this large burst immediately. | |
+ */ | |
+ if (bfqd->large_burst) { | |
+ bfq_mark_bfqq_in_large_burst(bfqq); | |
+ return; | |
+ } | |
+ | |
+ /* | |
+ * If we get here, then a large-burst state has not yet been | |
+ * reached, but bfqq is being activated shortly after the last | |
+ * queue. Then we add bfqq to the burst. | |
+ */ | |
+ bfq_add_to_burst(bfqd, bfqq); | |
+} | |
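To make the bookkeeping described in the long comment above a bit more concrete, here is a toy userspace model of just the counting logic of bfq_handle_burst()/bfq_add_to_burst(): the burst list handling, the per-queue flags and the interaction with weight raising are stripped out, and the threshold and interval values are made up for the example.

#include <stdio.h>
#include <stdbool.h>

#define BURST_INTERVAL	2	/* stand-in for bfqd->bfq_burst_interval */
#define LARGE_THRESH	3	/* stand-in for bfqd->bfq_large_burst_thresh */

static long last_ins;		/* last insertion time (last_ins_in_burst) */
static int burst_size;
static bool large_burst;

static void activate(long now)
{
	if (now - last_ins > BURST_INTERVAL) {
		/* Activated too late: the burst is over, start a new one. */
		large_burst = false;
		burst_size = 1;
	} else if (large_burst) {
		/* Burst already large: the new queue just joins it. */
	} else if (++burst_size == LARGE_THRESH) {
		/* Threshold reached: the whole burst becomes large. */
		large_burst = true;
	}
	last_ins = now;
	printf("t=%2ld  burst_size=%d  large=%d\n", now, burst_size, large_burst);
}

int main(void)
{
	long t[] = { 0, 1, 2, 3, 10, 11 };	/* activation times */
	unsigned int i;

	for (i = 0; i < sizeof(t) / sizeof(t[0]); i++)
		activate(t[i]);
	return 0;
}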
+ | |
+static void bfq_add_request(struct request *rq) | |
+{ | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ struct request *next_rq, *prev; | |
+ unsigned long old_wr_coeff = bfqq->wr_coeff; | |
+ bool interactive = false; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq)); | |
+ bfqq->queued[rq_is_sync(rq)]++; | |
+ bfqd->queued++; | |
+ | |
+ elv_rb_add(&bfqq->sort_list, rq); | |
+ | |
+ /* | |
+ * Check if this request is a better next-serve candidate. | |
+ */ | |
+ prev = bfqq->next_rq; | |
+ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position); | |
+ BUG_ON(next_rq == NULL); | |
+ bfqq->next_rq = next_rq; | |
+ | |
+ /* | |
+ * Adjust priority tree position, if next_rq changes. | |
+ */ | |
+ if (prev != bfqq->next_rq) | |
+ bfq_rq_pos_tree_add(bfqd, bfqq); | |
+ | |
+ if (!bfq_bfqq_busy(bfqq)) { | |
+ bool soft_rt, coop_or_in_burst, | |
+ idle_for_long_time = time_is_before_jiffies( | |
+ bfqq->budget_timeout + | |
+ bfqd->bfq_wr_min_idle_time); | |
+ | |
+ if (bfq_bfqq_sync(bfqq)) { | |
+ bool already_in_burst = | |
+ !hlist_unhashed(&bfqq->burst_list_node) || | |
+ bfq_bfqq_in_large_burst(bfqq); | |
+ bfq_handle_burst(bfqd, bfqq, idle_for_long_time); | |
+ /* | |
+ * If bfqq was not already in the current burst, | |
+ * then, at this point, bfqq either has been | |
+ * added to the current burst or has caused the | |
+ * current burst to terminate. In particular, in | |
+ * the second case, bfqq has become the first | |
+ * queue in a possible new burst. | |
+ * In both cases last_ins_in_burst needs to be | |
+ * moved forward. | |
+ */ | |
+ if (!already_in_burst) | |
+ bfqd->last_ins_in_burst = jiffies; | |
+ } | |
+ | |
+ coop_or_in_burst = bfq_bfqq_in_large_burst(bfqq) || | |
+ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh; | |
+ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 && | |
+ !coop_or_in_burst && | |
+ time_is_before_jiffies(bfqq->soft_rt_next_start); | |
+ interactive = !coop_or_in_burst && idle_for_long_time; | |
+ entity->budget = max_t(unsigned long, bfqq->max_budget, | |
+ bfq_serv_to_charge(next_rq, bfqq)); | |
+ | |
+ if (!bfq_bfqq_IO_bound(bfqq)) { | |
+ if (time_before(jiffies, | |
+ RQ_BIC(rq)->ttime.last_end_request + | |
+ bfqd->bfq_slice_idle)) { | |
+ bfqq->requests_within_timer++; | |
+ if (bfqq->requests_within_timer >= | |
+ bfqd->bfq_requests_within_timer) | |
+ bfq_mark_bfqq_IO_bound(bfqq); | |
+ } else | |
+ bfqq->requests_within_timer = 0; | |
+ } | |
+ | |
+ if (!bfqd->low_latency) | |
+ goto add_bfqq_busy; | |
+ | |
+ if (bfq_bfqq_just_split(bfqq)) | |
+ goto set_ioprio_changed; | |
+ | |
+ /* | |
+ * If the queue: | |
+ * - is not being boosted, | |
+ * - has been idle for enough time, | |
+ * - is not a sync queue or is linked to a bfq_io_cq (it is | |
+		 *   shared "by its nature" or it is not shared and its | |
+ * requests have not been redirected to a shared queue) | |
+ * start a weight-raising period. | |
+ */ | |
+ if (old_wr_coeff == 1 && (interactive || soft_rt) && | |
+ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) { | |
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff; | |
+ if (interactive) | |
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); | |
+ else | |
+ bfqq->wr_cur_max_time = | |
+ bfqd->bfq_wr_rt_max_time; | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "wrais starting at %lu, rais_max_time %u", | |
+ jiffies, | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time)); | |
+ } else if (old_wr_coeff > 1) { | |
+ if (interactive) | |
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); | |
+ else if (coop_or_in_burst || | |
+ (bfqq->wr_cur_max_time == | |
+ bfqd->bfq_wr_rt_max_time && | |
+ !soft_rt)) { | |
+ bfqq->wr_coeff = 1; | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "wrais ending at %lu, rais_max_time %u", | |
+ jiffies, | |
+ jiffies_to_msecs(bfqq-> | |
+ wr_cur_max_time)); | |
+ } else if (time_before( | |
+ bfqq->last_wr_start_finish + | |
+ bfqq->wr_cur_max_time, | |
+ jiffies + | |
+ bfqd->bfq_wr_rt_max_time) && | |
+ soft_rt) { | |
+ /* | |
+ * The remaining weight-raising time is lower | |
+ * than bfqd->bfq_wr_rt_max_time, which means | |
+ * that the application is enjoying weight | |
+ * raising either because deemed soft-rt in | |
+ * the near past, or because deemed interactive | |
+			 * long ago. | |
+ * In both cases, resetting now the current | |
+ * remaining weight-raising time for the | |
+ * application to the weight-raising duration | |
+ * for soft rt applications would not cause any | |
+ * latency increase for the application (as the | |
+ * new duration would be higher than the | |
+ * remaining time). | |
+ * | |
+ * In addition, the application is now meeting | |
+ * the requirements for being deemed soft rt. | |
+ * In the end we can correctly and safely | |
+ * (re)charge the weight-raising duration for | |
+ * the application with the weight-raising | |
+ * duration for soft rt applications. | |
+ * | |
+ * In particular, doing this recharge now, i.e., | |
+ * before the weight-raising period for the | |
+ * application finishes, reduces the probability | |
+ * of the following negative scenario: | |
+ * 1) the weight of a soft rt application is | |
+ * raised at startup (as for any newly | |
+ * created application), | |
+ * 2) since the application is not interactive, | |
+ * at a certain time weight-raising is | |
+ * stopped for the application, | |
+ * 3) at that time the application happens to | |
+ * still have pending requests, and hence | |
+ * is destined to not have a chance to be | |
+ * deemed soft rt before these requests are | |
+ * completed (see the comments to the | |
+ * function bfq_bfqq_softrt_next_start() | |
+ * for details on soft rt detection), | |
+ * 4) these pending requests experience a high | |
+ * latency because the application is not | |
+ * weight-raised while they are pending. | |
+ */ | |
+ bfqq->last_wr_start_finish = jiffies; | |
+ bfqq->wr_cur_max_time = | |
+ bfqd->bfq_wr_rt_max_time; | |
+ } | |
+ } | |
+set_ioprio_changed: | |
+ if (old_wr_coeff != bfqq->wr_coeff) | |
+ entity->ioprio_changed = 1; | |
+add_bfqq_busy: | |
+ bfqq->last_idle_bklogged = jiffies; | |
+ bfqq->service_from_backlogged = 0; | |
+ bfq_clear_bfqq_softrt_update(bfqq); | |
+ bfq_add_bfqq_busy(bfqd, bfqq); | |
+ } else { | |
+ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) && | |
+ time_is_before_jiffies( | |
+ bfqq->last_wr_start_finish + | |
+ bfqd->bfq_wr_min_inter_arr_async)) { | |
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff; | |
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); | |
+ | |
+ bfqd->wr_busy_queues++; | |
+ entity->ioprio_changed = 1; | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "non-idle wrais starting at %lu, rais_max_time %u", | |
+ jiffies, | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time)); | |
+ } | |
+ if (prev != bfqq->next_rq) | |
+ bfq_updated_next_req(bfqd, bfqq); | |
+ } | |
+ | |
+ if (bfqd->low_latency && | |
+ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive)) | |
+ bfqq->last_wr_start_finish = jiffies; | |
+} | |
+ | |
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd, | |
+ struct bio *bio) | |
+{ | |
+ struct task_struct *tsk = current; | |
+ struct bfq_io_cq *bic; | |
+ struct bfq_queue *bfqq; | |
+ | |
+ bic = bfq_bic_lookup(bfqd, tsk->io_context); | |
+ if (bic == NULL) | |
+ return NULL; | |
+ | |
+ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio)); | |
+ if (bfqq != NULL) | |
+ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio)); | |
+ | |
+ return NULL; | |
+} | |
+ | |
+static void bfq_activate_request(struct request_queue *q, struct request *rq) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ | |
+ bfqd->rq_in_driver++; | |
+ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq); | |
+ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu", | |
+ (long long unsigned)bfqd->last_position); | |
+} | |
+ | |
+static inline void bfq_deactivate_request(struct request_queue *q, | |
+ struct request *rq) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ | |
+ BUG_ON(bfqd->rq_in_driver == 0); | |
+ bfqd->rq_in_driver--; | |
+} | |
+ | |
+static void bfq_remove_request(struct request *rq) | |
+{ | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ const int sync = rq_is_sync(rq); | |
+ | |
+ if (bfqq->next_rq == rq) { | |
+ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq); | |
+ bfq_updated_next_req(bfqd, bfqq); | |
+ } | |
+ | |
+ list_del_init(&rq->queuelist); | |
+ BUG_ON(bfqq->queued[sync] == 0); | |
+ bfqq->queued[sync]--; | |
+ bfqd->queued--; | |
+ elv_rb_del(&bfqq->sort_list, rq); | |
+ | |
+ if (RB_EMPTY_ROOT(&bfqq->sort_list)) { | |
+ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) | |
+ bfq_del_bfqq_busy(bfqd, bfqq, 1); | |
+ /* | |
+ * Remove queue from request-position tree as it is empty. | |
+ */ | |
+ if (bfqq->pos_root != NULL) { | |
+ rb_erase(&bfqq->pos_node, bfqq->pos_root); | |
+ bfqq->pos_root = NULL; | |
+ } | |
+ } | |
+ | |
+ if (rq->cmd_flags & REQ_META) { | |
+ BUG_ON(bfqq->meta_pending == 0); | |
+ bfqq->meta_pending--; | |
+ } | |
+} | |
+ | |
+static int bfq_merge(struct request_queue *q, struct request **req, | |
+ struct bio *bio) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct request *__rq; | |
+ | |
+ __rq = bfq_find_rq_fmerge(bfqd, bio); | |
+ if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) { | |
+ *req = __rq; | |
+ return ELEVATOR_FRONT_MERGE; | |
+ } | |
+ | |
+ return ELEVATOR_NO_MERGE; | |
+} | |
+ | |
+static void bfq_merged_request(struct request_queue *q, struct request *req, | |
+ int type) | |
+{ | |
+ if (type == ELEVATOR_FRONT_MERGE && | |
+ rb_prev(&req->rb_node) && | |
+ blk_rq_pos(req) < | |
+ blk_rq_pos(container_of(rb_prev(&req->rb_node), | |
+ struct request, rb_node))) { | |
+ struct bfq_queue *bfqq = RQ_BFQQ(req); | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ struct request *prev, *next_rq; | |
+ | |
+ /* Reposition request in its sort_list */ | |
+ elv_rb_del(&bfqq->sort_list, req); | |
+ elv_rb_add(&bfqq->sort_list, req); | |
+ /* Choose next request to be served for bfqq */ | |
+ prev = bfqq->next_rq; | |
+ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req, | |
+ bfqd->last_position); | |
+ BUG_ON(next_rq == NULL); | |
+ bfqq->next_rq = next_rq; | |
+ /* | |
+ * If next_rq changes, update both the queue's budget to | |
+ * fit the new request and the queue's position in its | |
+ * rq_pos_tree. | |
+ */ | |
+ if (prev != bfqq->next_rq) { | |
+ bfq_updated_next_req(bfqd, bfqq); | |
+ bfq_rq_pos_tree_add(bfqd, bfqq); | |
+ } | |
+ } | |
+} | |
+ | |
+static void bfq_merged_requests(struct request_queue *q, struct request *rq, | |
+ struct request *next) | |
+{ | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ | |
+ /* | |
+ * Reposition in fifo if next is older than rq. | |
+ */ | |
+ if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) && | |
+ time_before(next->fifo_time, rq->fifo_time)) { | |
+ list_move(&rq->queuelist, &next->queuelist); | |
+ rq->fifo_time = next->fifo_time; | |
+ } | |
+ | |
+ if (bfqq->next_rq == next) | |
+ bfqq->next_rq = rq; | |
+ | |
+ bfq_remove_request(next); | |
+} | |
+ | |
+/* Must be called with bfqq != NULL */ | |
+static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq) | |
+{ | |
+ BUG_ON(bfqq == NULL); | |
+ if (bfq_bfqq_busy(bfqq)) | |
+ bfqq->bfqd->wr_busy_queues--; | |
+ bfqq->wr_coeff = 1; | |
+ bfqq->wr_cur_max_time = 0; | |
+ /* Trigger a weight change on the next activation of the queue */ | |
+ bfqq->entity.ioprio_changed = 1; | |
+} | |
+ | |
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg) | |
+{ | |
+ int i, j; | |
+ | |
+ for (i = 0; i < 2; i++) | |
+ for (j = 0; j < IOPRIO_BE_NR; j++) | |
+ if (bfqg->async_bfqq[i][j] != NULL) | |
+ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]); | |
+ if (bfqg->async_idle_bfqq != NULL) | |
+ bfq_bfqq_end_wr(bfqg->async_idle_bfqq); | |
+} | |
+ | |
+static void bfq_end_wr(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq; | |
+ | |
+ spin_lock_irq(bfqd->queue->queue_lock); | |
+ | |
+ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) | |
+ bfq_bfqq_end_wr(bfqq); | |
+ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) | |
+ bfq_bfqq_end_wr(bfqq); | |
+ bfq_end_wr_async(bfqd); | |
+ | |
+ spin_unlock_irq(bfqd->queue->queue_lock); | |
+} | |
+ | |
+static inline sector_t bfq_io_struct_pos(void *io_struct, bool request) | |
+{ | |
+ if (request) | |
+ return blk_rq_pos(io_struct); | |
+ else | |
+ return ((struct bio *)io_struct)->bi_iter.bi_sector; | |
+} | |
+ | |
+static inline sector_t bfq_dist_from(sector_t pos1, | |
+ sector_t pos2) | |
+{ | |
+ if (pos1 >= pos2) | |
+ return pos1 - pos2; | |
+ else | |
+ return pos2 - pos1; | |
+} | |
+ | |
+static inline int bfq_rq_close_to_sector(void *io_struct, bool request, | |
+ sector_t sector) | |
+{ | |
+ return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <= | |
+ BFQQ_SEEK_THR; | |
+} | |
+ | |
+static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector) | |
+{ | |
+ struct rb_root *root = &bfqd->rq_pos_tree; | |
+ struct rb_node *parent, *node; | |
+ struct bfq_queue *__bfqq; | |
+ | |
+ if (RB_EMPTY_ROOT(root)) | |
+ return NULL; | |
+ | |
+ /* | |
+ * First, if we find a request starting at the end of the last | |
+ * request, choose it. | |
+ */ | |
+ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL); | |
+ if (__bfqq != NULL) | |
+ return __bfqq; | |
+ | |
+ /* | |
+ * If the exact sector wasn't found, the parent of the NULL leaf | |
+ * will contain the closest sector (rq_pos_tree sorted by | |
+ * next_request position). | |
+ */ | |
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node); | |
+ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) | |
+ return __bfqq; | |
+ | |
+ if (blk_rq_pos(__bfqq->next_rq) < sector) | |
+ node = rb_next(&__bfqq->pos_node); | |
+ else | |
+ node = rb_prev(&__bfqq->pos_node); | |
+ if (node == NULL) | |
+ return NULL; | |
+ | |
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node); | |
+ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) | |
+ return __bfqq; | |
+ | |
+ return NULL; | |
+} | |
+ | |
+/* | |
+ * bfqd - obvious | |
+ * cur_bfqq - passed in so that we don't decide that the current queue | |
+ * is closely cooperating with itself | |
+ * sector - used as a reference point to search for a close queue | |
+ */ | |
+static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd, | |
+ struct bfq_queue *cur_bfqq, | |
+ sector_t sector) | |
+{ | |
+ struct bfq_queue *bfqq; | |
+ | |
+ if (bfq_class_idle(cur_bfqq)) | |
+ return NULL; | |
+ if (!bfq_bfqq_sync(cur_bfqq)) | |
+ return NULL; | |
+ if (BFQQ_SEEKY(cur_bfqq)) | |
+ return NULL; | |
+ | |
+ /* If device has only one backlogged bfq_queue, don't search. */ | |
+ if (bfqd->busy_queues == 1) | |
+ return NULL; | |
+ | |
+ /* | |
+ * We should notice if some of the queues are cooperating, e.g. | |
+ * working closely on the same area of the disk. In that case, | |
+	 * we can group them together and not waste time idling. | |
+ */ | |
+ bfqq = bfqq_close(bfqd, sector); | |
+ if (bfqq == NULL || bfqq == cur_bfqq) | |
+ return NULL; | |
+ | |
+ /* | |
+ * Do not merge queues from different bfq_groups. | |
+ */ | |
+ if (bfqq->entity.parent != cur_bfqq->entity.parent) | |
+ return NULL; | |
+ | |
+ /* | |
+ * It only makes sense to merge sync queues. | |
+ */ | |
+ if (!bfq_bfqq_sync(bfqq)) | |
+ return NULL; | |
+ if (BFQQ_SEEKY(bfqq)) | |
+ return NULL; | |
+ | |
+ /* | |
+ * Do not merge queues of different priority classes. | |
+ */ | |
+ if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq)) | |
+ return NULL; | |
+ | |
+ return bfqq; | |
+} | |
+ | |
+static struct bfq_queue * | |
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) | |
+{ | |
+ int process_refs, new_process_refs; | |
+ struct bfq_queue *__bfqq; | |
+ | |
+ /* | |
+ * If there are no process references on the new_bfqq, then it is | |
+ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain | |
+ * may have dropped their last reference (not just their last process | |
+ * reference). | |
+ */ | |
+ if (!bfqq_process_refs(new_bfqq)) | |
+ return NULL; | |
+ | |
+ /* Avoid a circular list and skip interim queue merges. */ | |
+ while ((__bfqq = new_bfqq->new_bfqq)) { | |
+ if (__bfqq == bfqq) | |
+ return NULL; | |
+ new_bfqq = __bfqq; | |
+ } | |
+ | |
+ process_refs = bfqq_process_refs(bfqq); | |
+ new_process_refs = bfqq_process_refs(new_bfqq); | |
+ /* | |
+ * If the process for the bfqq has gone away, there is no | |
+ * sense in merging the queues. | |
+ */ | |
+ if (process_refs == 0 || new_process_refs == 0) | |
+ return NULL; | |
+ | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d", | |
+ new_bfqq->pid); | |
+ | |
+ /* | |
+ * Merging is just a redirection: the requests of the process | |
+ * owning one of the two queues are redirected to the other queue. | |
+ * The latter queue, in its turn, is set as shared if this is the | |
+ * first time that the requests of some process are redirected to | |
+ * it. | |
+ * | |
+ * We redirect bfqq to new_bfqq and not the opposite, because we | |
+ * are in the context of the process owning bfqq, hence we have | |
+ * the io_cq of this process. So we can immediately configure this | |
+ * io_cq to redirect the requests of the process to new_bfqq. | |
+ * | |
+ * NOTE, even if new_bfqq coincides with the in-service queue, the | |
+ * io_cq of new_bfqq is not available, because, if the in-service | |
+ * queue is shared, bfqd->in_service_bic may not point to the | |
+ * io_cq of the in-service queue. | |
+ * Redirecting the requests of the process owning bfqq to the | |
+ * currently in-service queue is in any case the best option, as | |
+ * we feed the in-service queue with new requests close to the | |
+ * last request served and, by doing so, hopefully increase the | |
+ * throughput. | |
+ */ | |
+ bfqq->new_bfqq = new_bfqq; | |
+ atomic_add(process_refs, &new_bfqq->ref); | |
+ return new_bfqq; | |
+} | |
+ | |
+/* | |
+ * Attempt to schedule a merge of bfqq with the currently in-service queue | |
+ * or with a close queue among the scheduled queues. | |
+ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue | |
+ * structure otherwise. | |
+ */ | |
+static struct bfq_queue * | |
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ void *io_struct, bool request) | |
+{ | |
+ struct bfq_queue *in_service_bfqq, *new_bfqq; | |
+ | |
+ if (bfqq->new_bfqq) | |
+ return bfqq->new_bfqq; | |
+ | |
+ if (!io_struct) | |
+ return NULL; | |
+ | |
+ in_service_bfqq = bfqd->in_service_queue; | |
+ | |
+ if (in_service_bfqq == NULL || in_service_bfqq == bfqq || | |
+ !bfqd->in_service_bic) | |
+ goto check_scheduled; | |
+ | |
+ if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq)) | |
+ goto check_scheduled; | |
+ | |
+ if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq)) | |
+ goto check_scheduled; | |
+ | |
+ if (in_service_bfqq->entity.parent != bfqq->entity.parent) | |
+ goto check_scheduled; | |
+ | |
+ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) && | |
+ bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) { | |
+ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq); | |
+ if (new_bfqq != NULL) | |
+ return new_bfqq; /* Merge with in-service queue */ | |
+ } | |
+ | |
+ /* | |
+ * Check whether there is a cooperator among currently scheduled | |
+ * queues. The only thing we need is that the bio/request is not | |
+ * NULL, as we need it to establish whether a cooperator exists. | |
+ */ | |
+check_scheduled: | |
+ new_bfqq = bfq_close_cooperator(bfqd, bfqq, | |
+ bfq_io_struct_pos(io_struct, request)); | |
+ if (new_bfqq) | |
+ return bfq_setup_merge(bfqq, new_bfqq); | |
+ | |
+ return NULL; | |
+} | |
+ | |
+static inline void | |
+bfq_bfqq_save_state(struct bfq_queue *bfqq) | |
+{ | |
+ /* | |
+ * If bfqq->bic == NULL, the queue is already shared or its requests | |
+ * have already been redirected to a shared queue; both idle window | |
+ * and weight raising state have already been saved. Do nothing. | |
+ */ | |
+ if (bfqq->bic == NULL) | |
+ return; | |
+ if (bfqq->bic->wr_time_left) | |
+ /* | |
+ * This is the queue of a just-started process, and would | |
+ * deserve weight raising: we set wr_time_left to the full | |
+ * weight-raising duration to trigger weight-raising when | |
+ * and if the queue is split and the first request of the | |
+ * queue is enqueued. | |
+ */ | |
+ bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd); | |
+ else if (bfqq->wr_coeff > 1) { | |
+ unsigned long wr_duration = | |
+ jiffies - bfqq->last_wr_start_finish; | |
+ /* | |
+ * It may happen that a queue's weight raising period lasts | |
+ * longer than its wr_cur_max_time, as weight raising is | |
+ * handled only when a request is enqueued or dispatched (it | |
+ * does not use any timer). If the weight raising period is | |
+ * about to end, don't save it. | |
+ */ | |
+ if (bfqq->wr_cur_max_time <= wr_duration) | |
+ bfqq->bic->wr_time_left = 0; | |
+ else | |
+ bfqq->bic->wr_time_left = | |
+ bfqq->wr_cur_max_time - wr_duration; | |
+ /* | |
+ * The bfq_queue is becoming shared or the requests of the | |
+ * process owning the queue are being redirected to a shared | |
+ * queue. Stop the weight raising period of the queue, as in | |
+ * both cases it should not be owned by an interactive or | |
+ * soft real-time application. | |
+ */ | |
+ bfq_bfqq_end_wr(bfqq); | |
+ } else | |
+ bfqq->bic->wr_time_left = 0; | |
+ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq); | |
+ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq); | |
+ bfqq->bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq); | |
+ bfqq->bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node); | |
+ bfqq->bic->cooperations++; | |
+ bfqq->bic->failed_cooperations = 0; | |
+} | |
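For readers skimming the patch, the residual weight-raising time saved here is just a clamped difference between the maximum duration and the time already spent being raised. The following standalone userspace sketch (illustrative only, not part of the patch; names and values are invented, everything is in jiffies) reproduces that arithmetic.

#include <stdio.h>

/* Illustrative sketch: residual weight-raising time, clamped to zero
 * when the raising period has already run past its maximum duration. */
static unsigned long residual_wr_time(unsigned long now,
                                      unsigned long last_wr_start_finish,
                                      unsigned long wr_cur_max_time)
{
        unsigned long wr_duration = now - last_wr_start_finish;

        /* raising period (about to be) over: nothing left to save */
        if (wr_cur_max_time <= wr_duration)
                return 0;
        return wr_cur_max_time - wr_duration;
}

int main(void)
{
        /* raised for 300 out of 1000 jiffies: 700 jiffies left */
        printf("%lu\n", residual_wr_time(1300, 1000, 1000));
        /* raised for 1200 out of 1000 jiffies: nothing left */
        printf("%lu\n", residual_wr_time(2200, 1000, 1000));
        return 0;
}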
+ | |
+static inline void | |
+bfq_get_bic_reference(struct bfq_queue *bfqq) | |
+{ | |
+ /* | |
+ * If bfqq->bic has a non-NULL value, the bic to which it belongs | |
+ * is about to begin using a shared bfq_queue. | |
+ */ | |
+ if (bfqq->bic) | |
+ atomic_long_inc(&bfqq->bic->icq.ioc->refcount); | |
+} | |
+ | |
+static void | |
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic, | |
+ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) | |
+{ | |
+ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu", | |
+ (long unsigned)new_bfqq->pid); | |
+ /* Save weight raising and idle window of the merged queues */ | |
+ bfq_bfqq_save_state(bfqq); | |
+ bfq_bfqq_save_state(new_bfqq); | |
+ if (bfq_bfqq_IO_bound(bfqq)) | |
+ bfq_mark_bfqq_IO_bound(new_bfqq); | |
+ bfq_clear_bfqq_IO_bound(bfqq); | |
+ /* | |
+ * Grab a reference to the bic, to prevent it from being destroyed | |
+ * before being possibly touched by a bfq_split_bfqq(). | |
+ */ | |
+ bfq_get_bic_reference(bfqq); | |
+ bfq_get_bic_reference(new_bfqq); | |
+ /* | |
+ * Merge queues (that is, let bic redirect its requests to new_bfqq) | |
+ */ | |
+ bic_set_bfqq(bic, new_bfqq, 1); | |
+ bfq_mark_bfqq_coop(new_bfqq); | |
+ /* | |
+ * new_bfqq now belongs to at least two bics (it is a shared queue): | |
+ * set new_bfqq->bic to NULL. bfqq either: | |
+ * - does not belong to any bic any more, and hence bfqq->bic must | |
+ * be set to NULL, or | |
+ * - is a queue whose owning bics have already been redirected to a | |
+ * different queue, hence the queue is destined to not belong to | |
+ * any bic soon and bfqq->bic is already NULL (therefore the next | |
+ * assignment causes no harm). | |
+ */ | |
+ new_bfqq->bic = NULL; | |
+ bfqq->bic = NULL; | |
+ bfq_put_queue(bfqq); | |
+} | |
+ | |
+static inline void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_io_cq *bic = bfqq->bic; | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ | |
+ if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) { | |
+ bic->failed_cooperations++; | |
+ if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations) | |
+ bic->cooperations = 0; | |
+ } | |
+} | |
+ | |
+static int bfq_allow_merge(struct request_queue *q, struct request *rq, | |
+ struct bio *bio) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct bfq_io_cq *bic; | |
+ struct bfq_queue *bfqq, *new_bfqq; | |
+ | |
+ /* | |
+ * Disallow merge of a sync bio into an async request. | |
+ */ | |
+ if (bfq_bio_sync(bio) && !rq_is_sync(rq)) | |
+ return 0; | |
+ | |
+	 * Look up the bfqq that this bio will be queued with. Allow | |
+ * Lookup the bfqq that this bio will be queued with. Allow | |
+ * merge only if rq is queued there. | |
+ * Queue lock is held here. | |
+ */ | |
+ bic = bfq_bic_lookup(bfqd, current->io_context); | |
+ if (bic == NULL) | |
+ return 0; | |
+ | |
+ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio)); | |
+ /* | |
+ * We take advantage of this function to perform an early merge | |
+ * of the queues of possible cooperating processes. | |
+ */ | |
+ if (bfqq != NULL) { | |
+ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false); | |
+ if (new_bfqq != NULL) { | |
+ bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq); | |
+ /* | |
+ * If we get here, the bio will be queued in the | |
+ * shared queue, i.e., new_bfqq, so use new_bfqq | |
+ * to decide whether bio and rq can be merged. | |
+ */ | |
+ bfqq = new_bfqq; | |
+ } else | |
+ bfq_bfqq_increase_failed_cooperations(bfqq); | |
+ } | |
+ | |
+ return bfqq == RQ_BFQQ(rq); | |
+} | |
+ | |
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ if (bfqq != NULL) { | |
+ bfq_mark_bfqq_must_alloc(bfqq); | |
+ bfq_mark_bfqq_budget_new(bfqq); | |
+ bfq_clear_bfqq_fifo_expire(bfqq); | |
+ | |
+ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "set_in_service_queue, cur-budget = %lu", | |
+ bfqq->entity.budget); | |
+ } | |
+ | |
+ bfqd->in_service_queue = bfqq; | |
+} | |
+ | |
+/* | |
+ * Get and set a new queue for service. | |
+ */ | |
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd); | |
+ | |
+ __bfq_set_in_service_queue(bfqd, bfqq); | |
+ return bfqq; | |
+} | |
+ | |
+/* | |
+ * If enough samples have been computed, return the current max budget | |
+ * stored in bfqd, which is dynamically updated according to the | |
+ * estimated disk peak rate; otherwise return the default max budget | |
+ */ | |
+static inline unsigned long bfq_max_budget(struct bfq_data *bfqd) | |
+{ | |
+ if (bfqd->budgets_assigned < 194) | |
+ return bfq_default_max_budget; | |
+ else | |
+ return bfqd->bfq_max_budget; | |
+} | |
+ | |
+/* | |
+ * Return min budget, which is a fraction of the current or default | |
+ * max budget (trying with 1/32) | |
+ */ | |
+static inline unsigned long bfq_min_budget(struct bfq_data *bfqd) | |
+{ | |
+ if (bfqd->budgets_assigned < 194) | |
+ return bfq_default_max_budget / 32; | |
+ else | |
+ return bfqd->bfq_max_budget / 32; | |
+} | |
+ | |
+static void bfq_arm_slice_timer(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq = bfqd->in_service_queue; | |
+ struct bfq_io_cq *bic; | |
+ unsigned long sl; | |
+ | |
+ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list)); | |
+ | |
+ /* Processes have exited, don't wait. */ | |
+ bic = bfqd->in_service_bic; | |
+ if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0) | |
+ return; | |
+ | |
+ bfq_mark_bfqq_wait_request(bfqq); | |
+ | |
+ /* | |
+ * We don't want to idle for seeks, but we do want to allow | |
+ * fair distribution of slice time for a process doing back-to-back | |
+	 * seeks. So allow a little bit of time for it to submit a new rq. | |
+ * | |
+ * To prevent processes with (partly) seeky workloads from | |
+ * being too ill-treated, grant them a small fraction of the | |
+ * assigned budget before reducing the waiting time to | |
+ * BFQ_MIN_TT. This happened to help reduce latency. | |
+ */ | |
+ sl = bfqd->bfq_slice_idle; | |
+ /* | |
+ * Unless the queue is being weight-raised, grant only minimum idle | |
+ * time if the queue either has been seeky for long enough or has | |
+ * already proved to be constantly seeky. | |
+ */ | |
+ if (bfq_sample_valid(bfqq->seek_samples) && | |
+ ((BFQQ_SEEKY(bfqq) && bfqq->entity.service > | |
+ bfq_max_budget(bfqq->bfqd) / 8) || | |
+ bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1) | |
+ sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT)); | |
+ else if (bfqq->wr_coeff > 1) | |
+ sl = sl * 3; | |
+ bfqd->last_idling_start = ktime_get(); | |
+ mod_timer(&bfqd->idle_slice_timer, jiffies + sl); | |
+ bfq_log(bfqd, "arm idle: %u/%u ms", | |
+ jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle)); | |
+} | |
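The idle-slice choice above boils down to three cases: shrink the wait for queues that have proved seeky (unless they are weight-raised), stretch it for weight-raised queues, and otherwise use the configured value. The sketch below is a simplified userspace restatement of that decision only; the sample-validity and budget-consumption checks of the real code are omitted, and all names and numbers are invented.

#include <stdbool.h>
#include <stdio.h>

/* Simplified restatement of the idle-slice selection above (illustrative only). */
static unsigned long pick_idle_slice(unsigned long slice_idle,
                                     unsigned long min_tt,
                                     bool seeky, bool weight_raised)
{
        unsigned long sl = slice_idle;

        if (seeky && !weight_raised)
                sl = sl < min_tt ? sl : min_tt;   /* min(sl, BFQ_MIN_TT) */
        else if (weight_raised)
                sl *= 3;
        return sl;
}

int main(void)
{
        printf("seeky, not raised: %lu\n", pick_idle_slice(8, 2, true, false));
        printf("weight-raised:     %lu\n", pick_idle_slice(8, 2, false, true));
        printf("default:           %lu\n", pick_idle_slice(8, 2, false, false));
        return 0;
}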
+ | |
+/* | |
+ * Set the maximum time for the in-service queue to consume its | |
+ * budget. This prevents seeky processes from lowering the disk | |
+ * throughput (always guaranteed with a time slice scheme as in CFQ). | |
+ */ | |
+static void bfq_set_budget_timeout(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq = bfqd->in_service_queue; | |
+ unsigned int timeout_coeff; | |
+ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time) | |
+ timeout_coeff = 1; | |
+ else | |
+ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight; | |
+ | |
+ bfqd->last_budget_start = ktime_get(); | |
+ | |
+ bfq_clear_bfqq_budget_new(bfqq); | |
+ bfqq->budget_timeout = jiffies + | |
+ bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u", | |
+ jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * | |
+ timeout_coeff)); | |
+} | |
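Concretely, the budget timeout grows with the ratio between the raised weight and the original weight (except in the soft real-time case above, where the coefficient is forced to 1). A minimal sketch of that arithmetic, with invented numbers and a hypothetical helper name, follows.

#include <stdio.h>

/* Illustrative sketch only: budget timeout scaled by the weight-raising ratio. */
static unsigned long budget_timeout(unsigned long now,
                                    unsigned long base_timeout,
                                    unsigned long weight,
                                    unsigned long orig_weight)
{
        unsigned long coeff = weight / orig_weight;

        return now + base_timeout * coeff;
}

int main(void)
{
        /* weight raised 10x: the queue gets ten times the base timeout */
        printf("%lu\n", budget_timeout(1000, 16, 1000, 100));
        return 0;
}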
+ | |
+/* | |
+ * Move request from internal lists to the request queue dispatch list. | |
+ */ | |
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ | |
+ /* | |
+ * For consistency, the next instruction should have been executed | |
+ * after removing the request from the queue and dispatching it. | |
+	 * We instead execute this instruction before bfq_remove_request() | |
+	 * (and hence introduce a temporary inconsistency), for efficiency. | |
+	 * In fact, in a forced_dispatch, this prevents the two counters | |
+	 * related to bfqq->dispatched from being uselessly decremented if | |
+	 * bfqq is not in service, and then incremented again after | |
+	 * incrementing bfqq->dispatched. | |
+ */ | |
+ bfqq->dispatched++; | |
+ bfq_remove_request(rq); | |
+ elv_dispatch_sort(q, rq); | |
+ | |
+ if (bfq_bfqq_sync(bfqq)) | |
+ bfqd->sync_flight++; | |
+} | |
+ | |
+/* | |
+ * Return expired entry, or NULL to just start from scratch in rbtree. | |
+ */ | |
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq) | |
+{ | |
+ struct request *rq = NULL; | |
+ | |
+ if (bfq_bfqq_fifo_expire(bfqq)) | |
+ return NULL; | |
+ | |
+ bfq_mark_bfqq_fifo_expire(bfqq); | |
+ | |
+ if (list_empty(&bfqq->fifo)) | |
+ return NULL; | |
+ | |
+ rq = rq_entry_fifo(bfqq->fifo.next); | |
+ | |
+ if (time_before(jiffies, rq->fifo_time)) | |
+ return NULL; | |
+ | |
+ return rq; | |
+} | |
+ | |
+static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ return entity->budget - entity->service; | |
+} | |
+ | |
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ BUG_ON(bfqq != bfqd->in_service_queue); | |
+ | |
+ __bfq_bfqd_reset_in_service(bfqd); | |
+ | |
+ /* | |
+ * If this bfqq is shared between multiple processes, check | |
+ * to make sure that those processes are still issuing I/Os | |
+ * within the mean seek distance. If not, it may be time to | |
+ * break the queues apart again. | |
+ */ | |
+ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq)) | |
+ bfq_mark_bfqq_split_coop(bfqq); | |
+ | |
+ if (RB_EMPTY_ROOT(&bfqq->sort_list)) { | |
+ /* | |
+ * Overloading budget_timeout field to store the time | |
+ * at which the queue remains with no backlog; used by | |
+ * the weight-raising mechanism. | |
+ */ | |
+ bfqq->budget_timeout = jiffies; | |
+ bfq_del_bfqq_busy(bfqd, bfqq, 1); | |
+ } else { | |
+ bfq_activate_bfqq(bfqd, bfqq); | |
+ /* | |
+ * Resort priority tree of potential close cooperators. | |
+ */ | |
+ bfq_rq_pos_tree_add(bfqd, bfqq); | |
+ } | |
+} | |
+ | |
+/** | |
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior. | |
+ * @bfqd: device data. | |
+ * @bfqq: queue to update. | |
+ * @reason: reason for expiration. | |
+ * | |
+ * Handle the feedback on @bfqq budget. See the body for detailed | |
+ * comments. | |
+ */ | |
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ enum bfqq_expiration reason) | |
+{ | |
+ struct request *next_rq; | |
+ unsigned long budget, min_budget; | |
+ | |
+ budget = bfqq->max_budget; | |
+ min_budget = bfq_min_budget(bfqd); | |
+ | |
+ BUG_ON(bfqq != bfqd->in_service_queue); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu", | |
+ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq)); | |
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu", | |
+ budget, bfq_min_budget(bfqd)); | |
+ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d", | |
+ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue)); | |
+ | |
+ if (bfq_bfqq_sync(bfqq)) { | |
+ switch (reason) { | |
+ /* | |
+ * Caveat: in all the following cases we trade latency | |
+ * for throughput. | |
+ */ | |
+ case BFQ_BFQQ_TOO_IDLE: | |
+ /* | |
+ * This is the only case where we may reduce | |
+ * the budget: if there is no request of the | |
+ * process still waiting for completion, then | |
+ * we assume (tentatively) that the timer has | |
+ * expired because the batch of requests of | |
+ * the process could have been served with a | |
+			 * smaller budget. Hence, betting that the | |
+ * process will behave in the same way when it | |
+ * becomes backlogged again, we reduce its | |
+ * next budget. As long as we guess right, | |
+ * this budget cut reduces the latency | |
+ * experienced by the process. | |
+ * | |
+ * However, if there are still outstanding | |
+ * requests, then the process may have not yet | |
+ * issued its next request just because it is | |
+ * still waiting for the completion of some of | |
+ * the still outstanding ones. So in this | |
+ * subcase we do not reduce its budget, on the | |
+ * contrary we increase it to possibly boost | |
+ * the throughput, as discussed in the | |
+ * comments to the BUDGET_TIMEOUT case. | |
+ */ | |
+ if (bfqq->dispatched > 0) /* still outstanding reqs */ | |
+ budget = min(budget * 2, bfqd->bfq_max_budget); | |
+ else { | |
+ if (budget > 5 * min_budget) | |
+ budget -= 4 * min_budget; | |
+ else | |
+ budget = min_budget; | |
+ } | |
+ break; | |
+ case BFQ_BFQQ_BUDGET_TIMEOUT: | |
+ /* | |
+ * We double the budget here because: 1) it | |
+ * gives the chance to boost the throughput if | |
+ * this is not a seeky process (which may have | |
+ * bumped into this timeout because of, e.g., | |
+ * ZBR), 2) together with charge_full_budget | |
+ * it helps give seeky processes higher | |
+ * timestamps, and hence be served less | |
+ * frequently. | |
+ */ | |
+ budget = min(budget * 2, bfqd->bfq_max_budget); | |
+ break; | |
+ case BFQ_BFQQ_BUDGET_EXHAUSTED: | |
+ /* | |
+ * The process still has backlog, and did not | |
+ * let either the budget timeout or the disk | |
+ * idling timeout expire. Hence it is not | |
+ * seeky, has a short thinktime and may be | |
+ * happy with a higher budget too. So | |
+ * definitely increase the budget of this good | |
+ * candidate to boost the disk throughput. | |
+ */ | |
+ budget = min(budget * 4, bfqd->bfq_max_budget); | |
+ break; | |
+ case BFQ_BFQQ_NO_MORE_REQUESTS: | |
+ /* | |
+ * Leave the budget unchanged. | |
+ */ | |
+ default: | |
+ return; | |
+ } | |
+ } else /* async queue */ | |
+	/* async queues always get the maximum possible budget | |
+ * (their ability to dispatch is limited by | |
+ * @bfqd->bfq_max_budget_async_rq). | |
+ */ | |
+ budget = bfqd->bfq_max_budget; | |
+ | |
+ bfqq->max_budget = budget; | |
+ | |
+ if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 && | |
+ bfqq->max_budget > bfqd->bfq_max_budget) | |
+ bfqq->max_budget = bfqd->bfq_max_budget; | |
+ | |
+ /* | |
+ * Make sure that we have enough budget for the next request. | |
+ * Since the finish time of the bfqq must be kept in sync with | |
+ * the budget, be sure to call __bfq_bfqq_expire() after the | |
+ * update. | |
+ */ | |
+ next_rq = bfqq->next_rq; | |
+ if (next_rq != NULL) | |
+ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget, | |
+ bfq_serv_to_charge(next_rq, bfqq)); | |
+ else | |
+ bfqq->entity.budget = bfqq->max_budget; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu", | |
+ next_rq != NULL ? blk_rq_sectors(next_rq) : 0, | |
+ bfqq->entity.budget); | |
+} | |
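The feedback rules above can be summarized compactly: the budget is reduced only on an idle timeout with no requests still in flight, doubled on a budget timeout (or on an idle timeout with requests in flight), quadrupled on exhaustion, and left alone otherwise, always within the configured bounds. The userspace sketch below restates those rules for sync queues with invented names; it is not part of the patch.

#include <stdio.h>

enum reason { TOO_IDLE, BUDGET_TIMEOUT, BUDGET_EXHAUSTED, NO_MORE_REQUESTS };

static unsigned long min_ul(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

/* Illustrative restatement of the budget feedback for sync queues. */
static unsigned long next_budget(unsigned long budget, unsigned long min_budget,
                                 unsigned long max_budget, int dispatched,
                                 enum reason reason)
{
        switch (reason) {
        case TOO_IDLE:
                if (dispatched > 0)     /* requests still in flight: grow */
                        return min_ul(budget * 2, max_budget);
                return budget > 5 * min_budget ?
                        budget - 4 * min_budget : min_budget;
        case BUDGET_TIMEOUT:
                return min_ul(budget * 2, max_budget);
        case BUDGET_EXHAUSTED:
                return min_ul(budget * 4, max_budget);
        default:
                return budget;          /* leave unchanged */
        }
}

int main(void)
{
        printf("%lu\n", next_budget(1024, 128, 16384, 0, TOO_IDLE));         /* 512 */
        printf("%lu\n", next_budget(1024, 128, 16384, 0, BUDGET_EXHAUSTED)); /* 4096 */
        return 0;
}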
+ | |
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout) | |
+{ | |
+ unsigned long max_budget; | |
+ | |
+ /* | |
+ * The max_budget calculated when autotuning is equal to the | |
+	 * amount of sectors transferred in timeout_sync at the | |
+ * estimated peak rate. | |
+ */ | |
+ max_budget = (unsigned long)(peak_rate * 1000 * | |
+ timeout >> BFQ_RATE_SHIFT); | |
+ | |
+ return max_budget; | |
+} | |
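As a worked example (illustrative only; the fixed-point shift of 16 and the helper name are assumptions made just for this sketch): a device estimated at two sectors per microsecond with a 125 ms sync timeout yields a budget of 250,000 sectors.

#include <stdio.h>

#define EXAMPLE_RATE_SHIFT 16   /* assumed fixed-point shift for this sketch */

static unsigned long calc_max_budget(unsigned long long peak_rate,
                                     unsigned long long timeout_ms)
{
        /* sectors transferable in timeout_ms at the estimated peak rate */
        return (unsigned long)(peak_rate * 1000 * timeout_ms >> EXAMPLE_RATE_SHIFT);
}

int main(void)
{
        /* two sectors per usec, in fixed point, over a 125 ms timeout */
        printf("%lu sectors\n",
               calc_max_budget(2ULL << EXAMPLE_RATE_SHIFT, 125));
        return 0;
}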
+ | |
+/* | |
+ * In addition to updating the peak rate, checks whether the process | |
+ * is "slow", and returns 1 if so. This slow flag is used, in addition | |
+ * to the budget timeout, to reduce the amount of service provided to | |
+ * seeky processes, and hence reduce their chances to lower the | |
+ * throughput. See the code for more details. | |
+ */ | |
+static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ int compensate, enum bfqq_expiration reason) | |
+{ | |
+ u64 bw, usecs, expected, timeout; | |
+ ktime_t delta; | |
+ int update = 0; | |
+ | |
+ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq)) | |
+ return 0; | |
+ | |
+ if (compensate) | |
+ delta = bfqd->last_idling_start; | |
+ else | |
+ delta = ktime_get(); | |
+ delta = ktime_sub(delta, bfqd->last_budget_start); | |
+ usecs = ktime_to_us(delta); | |
+ | |
+ /* Don't trust short/unrealistic values. */ | |
+ if (usecs < 100 || usecs >= LONG_MAX) | |
+ return 0; | |
+ | |
+ /* | |
+ * Calculate the bandwidth for the last slice. We use a 64 bit | |
+ * value to store the peak rate, in sectors per usec in fixed | |
+ * point math. We do so to have enough precision in the estimate | |
+ * and to avoid overflows. | |
+ */ | |
+ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT; | |
+ do_div(bw, (unsigned long)usecs); | |
+ | |
+ timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]); | |
+ | |
+ /* | |
+ * Use only long (> 20ms) intervals to filter out spikes for | |
+ * the peak rate estimation. | |
+ */ | |
+ if (usecs > 20000) { | |
+ if (bw > bfqd->peak_rate || | |
+ (!BFQQ_SEEKY(bfqq) && | |
+ reason == BFQ_BFQQ_BUDGET_TIMEOUT)) { | |
+ bfq_log(bfqd, "measured bw =%llu", bw); | |
+ /* | |
+ * To smooth oscillations use a low-pass filter with | |
+ * alpha=7/8, i.e., | |
+ * new_rate = (7/8) * old_rate + (1/8) * bw | |
+ */ | |
+ do_div(bw, 8); | |
+ if (bw == 0) | |
+ return 0; | |
+ bfqd->peak_rate *= 7; | |
+ do_div(bfqd->peak_rate, 8); | |
+ bfqd->peak_rate += bw; | |
+ update = 1; | |
+ bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate); | |
+ } | |
+ | |
+ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1; | |
+ | |
+ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES) | |
+ bfqd->peak_rate_samples++; | |
+ | |
+ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES && | |
+ update) { | |
+ int dev_type = blk_queue_nonrot(bfqd->queue); | |
+ if (bfqd->bfq_user_max_budget == 0) { | |
+ bfqd->bfq_max_budget = | |
+ bfq_calc_max_budget(bfqd->peak_rate, | |
+ timeout); | |
+ bfq_log(bfqd, "new max_budget=%lu", | |
+ bfqd->bfq_max_budget); | |
+ } | |
+ if (bfqd->device_speed == BFQ_BFQD_FAST && | |
+ bfqd->peak_rate < device_speed_thresh[dev_type]) { | |
+ bfqd->device_speed = BFQ_BFQD_SLOW; | |
+ bfqd->RT_prod = R_slow[dev_type] * | |
+ T_slow[dev_type]; | |
+ } else if (bfqd->device_speed == BFQ_BFQD_SLOW && | |
+ bfqd->peak_rate > device_speed_thresh[dev_type]) { | |
+ bfqd->device_speed = BFQ_BFQD_FAST; | |
+ bfqd->RT_prod = R_fast[dev_type] * | |
+ T_fast[dev_type]; | |
+ } | |
+ } | |
+ } | |
+ | |
+ /* | |
+	 * If the process has been served for too short a time | |
+	 * interval to let its possible sequential accesses prevail over | |
+	 * the initial seek time needed to move the disk head to the | |
+	 * first sector it requested, then give the process a chance | |
+	 * and, for the moment, return false. | |
+ */ | |
+ if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8) | |
+ return 0; | |
+ | |
+ /* | |
+ * A process is considered ``slow'' (i.e., seeky, so that we | |
+ * cannot treat it fairly in the service domain, as it would | |
+ * slow down too much the other processes) if, when a slice | |
+ * ends for whatever reason, it has received service at a | |
+ * rate that would not be high enough to complete the budget | |
+ * before the budget timeout expiration. | |
+ */ | |
+ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT; | |
+ | |
+ /* | |
+ * Caveat: processes doing IO in the slower disk zones will | |
+ * tend to be slow(er) even if not seeky. And the estimated | |
+ * peak rate will actually be an average over the disk | |
+ * surface. Hence, to not be too harsh with unlucky processes, | |
+ * we keep a budget/3 margin of safety before declaring a | |
+ * process slow. | |
+ */ | |
+ return expected > (4 * bfqq->entity.budget) / 3; | |
+} | |
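The peak-rate update above is a plain exponential moving average with alpha = 1/8, computed with integer divisions. The standalone sketch below (not part of the patch; the early return for a zero increment is omitted for brevity) shows how a burst of higher measurements moves the estimate only gradually.

#include <stdio.h>

/* Illustrative low-pass filter: new_rate = (7/8) * old_rate + (1/8) * bw,
 * using the same integer divisions as the code above. */
static unsigned long long lowpass(unsigned long long old_rate,
                                  unsigned long long bw)
{
        return old_rate * 7 / 8 + bw / 8;
}

int main(void)
{
        unsigned long long rate = 1000;
        int i;

        /* repeated higher measurements pull the estimate up only slowly */
        for (i = 0; i < 5; i++) {
                rate = lowpass(rate, 2000);
                printf("step %d: %llu\n", i, rate);
        }
        return 0;
}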
+ | |
+/* | |
+ * To be deemed as soft real-time, an application must meet two | |
+ * requirements. First, the application must not require an average | |
+ * bandwidth higher than the approximate bandwidth required to play back or | |
+ * record a compressed high-definition video. | |
+ * The next function is invoked on the completion of the last request of a | |
+ * batch, to compute the next-start time instant, soft_rt_next_start, such | |
+ * that, if the next request of the application does not arrive before | |
+ * soft_rt_next_start, then the above requirement on the bandwidth is met. | |
+ * | |
+ * The second requirement is that the request pattern of the application is | |
+ * isochronous, i.e., that, after issuing a request or a batch of requests, | |
+ * the application stops issuing new requests until all its pending requests | |
+ * have been completed. After that, the application may issue a new batch, | |
+ * and so on. | |
+ * For this reason the next function is invoked to compute | |
+ * soft_rt_next_start only for applications that meet this requirement, | |
+ * whereas soft_rt_next_start is set to infinity for applications that do | |
+ * not. | |
+ * | |
+ * Unfortunately, even a greedy application may happen to behave in an | |
+ * isochronous way if the CPU load is high. In fact, the application may | |
+ * stop issuing requests while the CPUs are busy serving other processes, | |
+ * then restart, then stop again for a while, and so on. In addition, if | |
+ * the disk achieves a low enough throughput with the request pattern | |
+ * issued by the application (e.g., because the request pattern is random | |
+ * and/or the device is slow), then the application may meet the above | |
+ * bandwidth requirement too. To prevent such a greedy application from | |
+ * being deemed as soft real-time, a further rule is used in the computation of | |
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current | |
+ * time plus the maximum time for which the arrival of a request is waited | |
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle. | |
+ * This filters out greedy applications, as the latter issue instead their | |
+ * next request as soon as possible after the last one has been completed | |
+ * (in contrast, when a batch of requests is completed, a soft real-time | |
+ * application spends some time processing data). | |
+ * | |
+ * Unfortunately, the last filter may easily generate false positives if | |
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one | |
+ * or both the following cases occur: | |
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher | |
+ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with | |
+ * HZ=100. | |
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing | |
+ * for a while, then suddenly 'jump' by several units to recover the lost | |
+ * increments. This seems to happen, e.g., inside virtual machines. | |
+ * To address this issue, we do not use as a reference time interval just | |
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In | |
+ * particular we add the minimum number of jiffies for which the filter | |
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual | |
+ * machines. | |
+ */ | |
+static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ return max(bfqq->last_idle_bklogged + | |
+ HZ * bfqq->service_from_backlogged / | |
+ bfqd->bfq_wr_max_softrt_rate, | |
+ jiffies + bfqq->bfqd->bfq_slice_idle + 4); | |
+} | |
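In other words, soft_rt_next_start is the later of two instants: the time by which serving the backlog already received would not exceed the soft real-time bandwidth cap, and now plus the idle slice plus a few jiffies. A standalone sketch with invented values (HZ = 250 is an assumption made only for the example) follows.

#include <stdio.h>

#define EXAMPLE_HZ 250   /* assumed tick rate for this sketch */

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/* Illustrative restatement of the soft_rt_next_start computation above. */
static unsigned long softrt_next_start(unsigned long last_idle_bklogged,
                                       unsigned long service_from_backlogged,
                                       unsigned long max_softrt_rate,
                                       unsigned long now,
                                       unsigned long slice_idle)
{
        return max_ul(last_idle_bklogged +
                      EXAMPLE_HZ * service_from_backlogged / max_softrt_rate,
                      now + slice_idle + 4);
}

int main(void)
{
        /* 7000 sectors served since the last idle instant, 7000 sectors/s cap */
        printf("%lu\n", softrt_next_start(10000, 7000, 7000, 10050, 2));
        return 0;
}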
+ | |
+/* | |
+ * Return the largest-possible time instant such that, for as long as possible, | |
+ * the current time will be lower than this time instant according to the macro | |
+ * time_is_before_jiffies(). | |
+ */ | |
+static inline unsigned long bfq_infinity_from_now(unsigned long now) | |
+{ | |
+ return now + ULONG_MAX / 2; | |
+} | |
+ | |
+/** | |
+ * bfq_bfqq_expire - expire a queue. | |
+ * @bfqd: device owning the queue. | |
+ * @bfqq: the queue to expire. | |
+ * @compensate: if true, compensate for the time spent idling. | |
+ * @reason: the reason causing the expiration. | |
+ * | |
+ * | |
+ * If the process associated to the queue is slow (i.e., seeky), or in | |
+ * case of budget timeout, or, finally, if it is async, we | |
+ * artificially charge it an entire budget (independently of the | |
+ * actual service it received). As a consequence, the queue will get | |
+ * higher timestamps than the correct ones upon reactivation, and | |
+ * hence it will be rescheduled as if it had received more service | |
+ * than what it actually received. In the end, this class of processes | |
+ * will receive less service in proportion to how slowly they consume | |
+ * their budgets (and hence how seriously they tend to lower the | |
+ * throughput). | |
+ * | |
+ * In contrast, when a queue expires because it has been idling for | |
+ * too long or because it exhausted its budget, we do not touch the | |
+ * amount of service it has received. Hence when the queue will be | |
+ * reactivated and its timestamps updated, the latter will be in sync | |
+ * with the actual service received by the queue until expiration. | |
+ * | |
+ * Charging a full budget to the first type of queues and the exact | |
+ * service to the others has the effect of using the WF2Q+ policy to | |
+ * schedule the former on a timeslice basis, without violating the | |
+ * service domain guarantees of the latter. | |
+ */ | |
+static void bfq_bfqq_expire(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ int compensate, | |
+ enum bfqq_expiration reason) | |
+{ | |
+ int slow; | |
+ BUG_ON(bfqq != bfqd->in_service_queue); | |
+ | |
+ /* Update disk peak rate for autotuning and check whether the | |
+ * process is slow (see bfq_update_peak_rate). | |
+ */ | |
+ slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason); | |
+ | |
+ /* | |
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out | |
+ * and async queues, to favor sequential sync workloads. | |
+ * | |
+ * Processes doing I/O in the slower disk zones will tend to be | |
+ * slow(er) even if not seeky. Hence, since the estimated peak | |
+ * rate is actually an average over the disk surface, these | |
+	 * processes may time out just for bad luck. To avoid punishing | |
+ * them we do not charge a full budget to a process that | |
+ * succeeded in consuming at least 2/3 of its budget. | |
+ */ | |
+ if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT && | |
+ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)) | |
+ bfq_bfqq_charge_full_budget(bfqq); | |
+ | |
+ bfqq->service_from_backlogged += bfqq->entity.service; | |
+ | |
+ if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT && | |
+ !bfq_bfqq_constantly_seeky(bfqq)) { | |
+ bfq_mark_bfqq_constantly_seeky(bfqq); | |
+ if (!blk_queue_nonrot(bfqd->queue)) | |
+ bfqd->const_seeky_busy_in_flight_queues++; | |
+ } | |
+ | |
+ if (reason == BFQ_BFQQ_TOO_IDLE && | |
+	    bfqq->entity.service <= 2 * bfqq->entity.budget / 10) | |
+ bfq_clear_bfqq_IO_bound(bfqq); | |
+ | |
+ if (bfqd->low_latency && bfqq->wr_coeff == 1) | |
+ bfqq->last_wr_start_finish = jiffies; | |
+ | |
+ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 && | |
+ RB_EMPTY_ROOT(&bfqq->sort_list)) { | |
+ /* | |
+ * If we get here, and there are no outstanding requests, | |
+ * then the request pattern is isochronous (see the comments | |
+ * to the function bfq_bfqq_softrt_next_start()). Hence we | |
+ * can compute soft_rt_next_start. If, instead, the queue | |
+ * still has outstanding requests, then we have to wait | |
+ * for the completion of all the outstanding requests to | |
+ * discover whether the request pattern is actually | |
+ * isochronous. | |
+ */ | |
+ if (bfqq->dispatched == 0) | |
+ bfqq->soft_rt_next_start = | |
+ bfq_bfqq_softrt_next_start(bfqd, bfqq); | |
+ else { | |
+ /* | |
+ * The application is still waiting for the | |
+ * completion of one or more requests: | |
+ * prevent it from possibly being incorrectly | |
+ * deemed as soft real-time by setting its | |
+ * soft_rt_next_start to infinity. In fact, | |
+ * without this assignment, the application | |
+ * would be incorrectly deemed as soft | |
+ * real-time if: | |
+ * 1) it issued a new request before the | |
+ * completion of all its in-flight | |
+ * requests, and | |
+ * 2) at that time, its soft_rt_next_start | |
+ * happened to be in the past. | |
+ */ | |
+ bfqq->soft_rt_next_start = | |
+ bfq_infinity_from_now(jiffies); | |
+ /* | |
+ * Schedule an update of soft_rt_next_start to when | |
+ * the task may be discovered to be isochronous. | |
+ */ | |
+ bfq_mark_bfqq_softrt_update(bfqq); | |
+ } | |
+ } | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason, | |
+ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq)); | |
+ | |
+ /* | |
+ * Increase, decrease or leave budget unchanged according to | |
+ * reason. | |
+ */ | |
+ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason); | |
+ __bfq_bfqq_expire(bfqd, bfqq); | |
+} | |
+ | |
+/* | |
+ * Budget timeout is not implemented through a dedicated timer, but | |
+ * just checked on request arrivals and completions, as well as on | |
+ * idle timer expirations. | |
+ */ | |
+static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq) | |
+{ | |
+ if (bfq_bfqq_budget_new(bfqq) || | |
+ time_before(jiffies, bfqq->budget_timeout)) | |
+ return 0; | |
+ return 1; | |
+} | |
+ | |
+/* | |
+ * If we expire a queue that is waiting for the arrival of a new | |
+ * request, we may prevent the fictitious timestamp back-shifting that | |
+ * allows the guarantees of the queue to be preserved (see [1] for | |
+ * this tricky aspect). Hence we return true only if this condition | |
+ * does not hold, or if the queue is slow enough to deserve only to be | |
+ * kicked off for preserving a high throughput. | |
+*/ | |
+static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq) | |
+{ | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, | |
+ "may_budget_timeout: wait_request %d left %d timeout %d", | |
+ bfq_bfqq_wait_request(bfqq), | |
+ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3, | |
+ bfq_bfqq_budget_timeout(bfqq)); | |
+ | |
+ return (!bfq_bfqq_wait_request(bfqq) || | |
+ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3) | |
+ && | |
+ bfq_bfqq_budget_timeout(bfqq); | |
+} | |
+ | |
+/* | |
+ * Device idling is allowed only for the queues for which this function | |
+ * returns true. For this reason, the return value of this function plays a | |
+ * critical role for both throughput boosting and service guarantees. The | |
+ * return value is computed through a logical expression. In this rather | |
+ * long comment, we try to briefly describe all the details and motivations | |
+ * behind the components of this logical expression. | |
+ * | |
+ * First, the expression is false if bfqq is not sync, or if: bfqq happened | |
+ * to become active during a large burst of queue activations, and the | |
+ * pattern of requests bfqq contains boosts the throughput if bfqq is | |
+ * expired. In fact, queues that became active during a large burst benefit | |
+ * only from throughput, as discussed in the comments to bfq_handle_burst. | |
+ * In this respect, expiring bfqq certainly boosts the throughput on NCQ- | |
+ * capable flash-based devices, whereas, on rotational devices, it boosts | |
+ * the throughput only if bfqq contains random requests. | |
+ * | |
+ * On the opposite end, if (a) bfqq is sync, (b) the above burst-related | |
+ * condition does not hold, and (c) bfqq is being weight-raised, then the | |
+ * expression always evaluates to true, as device idling is instrumental | |
+ * for preserving low-latency guarantees (see [1]). If, instead, conditions | |
+ * (a) and (b) do hold, but (c) does not, then the expression evaluates to | |
+ * true only if: (1) bfqq is I/O-bound and has a non-null idle window, and | |
+ * (2) at least one of the following two conditions holds. | |
+ * The first condition is that the device is not performing NCQ, because | |
+ * idling the device most certainly boosts the throughput if this condition | |
+ * holds and bfqq is I/O-bound and has been granted a non-null idle window. | |
+ * The second compound condition is made of the logical AND of two components. | |
+ * | |
+ * The first component is true only if there is no weight-raised busy | |
+ * queue. This guarantees that the device is not idled for a sync non- | |
+ * weight-raised queue when there are busy weight-raised queues. The former | |
+ * is then expired immediately if empty. Combined with the timestamping | |
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised | |
+ * queues to get a lower number of requests served, and hence to ask for a | |
+ * lower number of requests from the request pool, before the busy weight- | |
+ * raised queues get served again. | |
+ * | |
+ * This is beneficial for the processes associated with weight-raised | |
+ * queues, when the request pool is saturated (e.g., in the presence of | |
+ * write hogs). In fact, if the processes associated with the other queues | |
+ * ask for requests at a lower rate, then weight-raised processes have a | |
+ * higher probability to get a request from the pool immediately (or at | |
+ * least soon) when they need one. Hence they have a higher probability to | |
+ * actually get a fraction of the disk throughput proportional to their | |
+ * high weight. This is especially true with NCQ-capable drives, which | |
+ * enqueue several requests in advance and further reorder internally- | |
+ * queued requests. | |
+ * | |
+ * In the end, mistreating non-weight-raised queues when there are busy | |
+ * weight-raised queues seems to mitigate starvation problems in the | |
+ * presence of heavy write workloads and NCQ, and hence to guarantee a | |
+ * higher application and system responsiveness in these hostile scenarios. | |
+ * | |
+ * If the first component of the compound condition is instead true, i.e., | |
+ * there is no weight-raised busy queue, then the second component of the | |
+ * compound condition takes into account service-guarantee and throughput | |
+ * issues related to NCQ (recall that the compound condition is evaluated | |
+ * only if the device is detected as supporting NCQ). | |
+ * | |
+ * As for service guarantees, allowing the drive to enqueue more than one | |
+ * request at a time, and hence delegating de facto final scheduling | |
+ * decisions to the drive's internal scheduler, causes loss of control on | |
+ * the actual request service order. In this respect, when the drive is | |
+ * allowed to enqueue more than one request at a time, the service | |
+ * distribution enforced by the drive's internal scheduler is likely to | |
+ * coincide with the desired device-throughput distribution only in the | |
+ * following, perfectly symmetric, scenario: | |
+ * 1) all active queues have the same weight, | |
+ * 2) all active groups at the same level in the groups tree have the same | |
+ * weight, | |
+ * 3) all active groups at the same level in the groups tree have the same | |
+ * number of children. | |
+ * | |
+ * Even in such a scenario, sequential I/O may still receive a preferential | |
+ * treatment, but this is not likely to be a big issue with flash-based | |
+ * devices, because of their non-dramatic loss of throughput with random | |
+ * I/O. Things do differ with HDDs, for which additional care is taken, as | |
+ * explained after completing the discussion for flash-based devices. | |
+ * | |
+ * Unfortunately, keeping the necessary state for evaluating exactly the | |
+ * above symmetry conditions would be quite complex and time-consuming. | |
+ * Therefore BFQ evaluates instead the following stronger sub-conditions, | |
+ * for which it is much easier to maintain the needed state: | |
+ * 1) all active queues have the same weight, | |
+ * 2) all active groups have the same weight, | |
+ * 3) all active groups have at most one active child each. | |
+ * In particular, the last two conditions are always true if hierarchical | |
+ * support and the cgroups interface are not enabled, hence no state needs | |
+ * to be maintained in this case. | |
+ * | |
+ * According to the above considerations, the second component of the | |
+ * compound condition evaluates to true if any of the above symmetry | |
+ * sub-conditions does not hold, or the device is not flash-based. Therefore, | |
+ * if also the first component is true, then idling is allowed for a sync | |
+ * queue. These are the only sub-conditions considered if the device is | |
+ * flash-based, as, for such a device, it is sensible to force idling only | |
+ * for service-guarantee issues. In fact, as for throughput, idling | |
+ * NCQ-capable flash-based devices would not boost the throughput even | |
+ * with sequential I/O; rather it would lower the throughput in proportion | |
+ * to how fast the device is. In the end, (only) if all the three | |
+ * sub-conditions hold and the device is flash-based, the compound | |
+ * condition evaluates to false and therefore no idling is performed. | |
+ * | |
+ * As already said, things change with a rotational device, where idling | |
+ * boosts the throughput with sequential I/O (even with NCQ). Hence, for | |
+ * such a device the second component of the compound condition evaluates | |
+ * to true also if the following additional sub-condition does not hold: | |
+ * the queue is constantly seeky. Unfortunately, this different behavior | |
+ * with respect to flash-based devices causes an additional asymmetry: if | |
+ * some sync queues enjoy idling and some other sync queues do not, then | |
+ * the latter get a low share of the device throughput, simply because the | |
+ * former get many requests served after being set as in service, whereas | |
+ * the latter do not. As a consequence, to guarantee the desired throughput | |
+ * distribution, on HDDs the compound expression evaluates to true (and | |
+ * hence device idling is performed) also if the following last symmetry | |
+ * condition does not hold: no other queue is benefiting from idling. Also | |
+ * this last condition is actually replaced with a simpler-to-maintain and | |
+ * stronger condition: there is no busy queue which is not constantly seeky | |
+ * (and hence may also benefit from idling). | |
+ * | |
+ * To sum up, when all the required symmetry and throughput-boosting | |
+ * sub-conditions hold, the second component of the compound condition | |
+ * evaluates to false, and hence no idling is performed. This helps to | |
+ * keep the drives' internal queues full on NCQ-capable devices, and hence | |
+ * to boost the throughput, without causing 'almost' any loss of service | |
+ * guarantees. The 'almost' follows from the fact that, if the internal | |
+ * queue of one such device is filled while all the sub-conditions hold, | |
+ * but at some point in time some sub-condition stops to hold, then it may | |
+ * become impossible to let requests be served in the new desired order | |
+ * until all the requests already queued in the device have been served. | |
+ */ | |
+static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+#define symmetric_scenario (!bfqd->active_numerous_groups && \ | |
+ !bfq_differentiated_weights(bfqd)) | |
+#else | |
+#define symmetric_scenario (!bfq_differentiated_weights(bfqd)) | |
+#endif | |
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \ | |
+ bfqd->busy_in_flight_queues == \ | |
+ bfqd->const_seeky_busy_in_flight_queues) | |
+ | |
+#define cond_for_expiring_in_burst (bfq_bfqq_in_large_burst(bfqq) && \ | |
+ bfqd->hw_tag && \ | |
+ (blk_queue_nonrot(bfqd->queue) || \ | |
+ bfq_bfqq_constantly_seeky(bfqq))) | |
+ | |
+/* | |
+ * Condition for expiring a non-weight-raised queue (and hence not idling | |
+ * the device). | |
+ */ | |
+#define cond_for_expiring_non_wr (bfqd->hw_tag && \ | |
+ (bfqd->wr_busy_queues > 0 || \ | |
+ (symmetric_scenario && \ | |
+ (blk_queue_nonrot(bfqd->queue) || \ | |
+ cond_for_seeky_on_ncq_hdd)))) | |
+ | |
+ return bfq_bfqq_sync(bfqq) && | |
+ !cond_for_expiring_in_burst && | |
+ (bfqq->wr_coeff > 1 || | |
+ (bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_idle_window(bfqq) && | |
+ !cond_for_expiring_non_wr) | |
+ ); | |
+} | |
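Flattening every input of the expression above into a boolean makes its structure easier to follow. The userspace sketch below is only a restatement of that structure, not part of the patch: the parameter names are invented, hw_tag stands for "the device appears to support NCQ", and symmetric stands for the symmetric_scenario macro.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative restatement of the compound condition in
 * bfq_bfqq_must_not_expire() above. */
static bool may_idle(bool sync, bool weight_raised, bool io_bound,
                     bool idle_window, bool in_large_burst,
                     bool hw_tag, bool nonrot, bool constantly_seeky,
                     bool wr_busy_queues, bool symmetric,
                     bool all_busy_queues_seeky)
{
        bool expiring_in_burst = in_large_burst && hw_tag &&
                                 (nonrot || constantly_seeky);
        bool seeky_on_ncq_hdd = constantly_seeky && all_busy_queues_seeky;
        bool expiring_non_wr = hw_tag &&
                               (wr_busy_queues ||
                                (symmetric && (nonrot || seeky_on_ncq_hdd)));

        return sync && !expiring_in_burst &&
               (weight_raised ||
                (io_bound && idle_window && !expiring_non_wr));
}

int main(void)
{
        /* a sync, weight-raised queue outside any large burst is idled */
        printf("%d\n", may_idle(true, true, false, false, false,
                                true, true, false, false, true, false));
        return 0;
}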
+ | |
+/* | |
+ * If the in-service queue is empty but sync, and the function | |
+ * bfq_bfqq_must_not_expire returns true, then: | |
+ * 1) the queue must remain in service and cannot be expired, and | |
+ * 2) the disk must be idled to wait for the possible arrival of a new | |
+ * request for the queue. | |
+ * See the comments to the function bfq_bfqq_must_not_expire for the reasons | |
+ * why performing device idling is the best choice to boost the throughput | |
+ * and preserve service guarantees when bfq_bfqq_must_not_expire itself | |
+ * returns true. | |
+ */ | |
+static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ | |
+ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 && | |
+ bfq_bfqq_must_not_expire(bfqq); | |
+} | |
+ | |
+/* | |
+ * Select a queue for service. If we have a current queue in service, | |
+ * check whether to continue servicing it, or retrieve and set a new one. | |
+ */ | |
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq; | |
+ struct request *next_rq; | |
+ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT; | |
+ | |
+ bfqq = bfqd->in_service_queue; | |
+ if (bfqq == NULL) | |
+ goto new_queue; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue"); | |
+ | |
+ if (bfq_may_expire_for_budg_timeout(bfqq) && | |
+ !timer_pending(&bfqd->idle_slice_timer) && | |
+ !bfq_bfqq_must_idle(bfqq)) | |
+ goto expire; | |
+ | |
+ next_rq = bfqq->next_rq; | |
+ /* | |
+ * If bfqq has requests queued and it has enough budget left to | |
+ * serve them, keep the queue, otherwise expire it. | |
+ */ | |
+ if (next_rq != NULL) { | |
+ if (bfq_serv_to_charge(next_rq, bfqq) > | |
+ bfq_bfqq_budget_left(bfqq)) { | |
+ reason = BFQ_BFQQ_BUDGET_EXHAUSTED; | |
+ goto expire; | |
+ } else { | |
+ /* | |
+ * The idle timer may be pending because we may | |
+ * not disable disk idling even when a new request | |
+ * arrives. | |
+ */ | |
+ if (timer_pending(&bfqd->idle_slice_timer)) { | |
+ /* | |
+				 * If we get here then: 1) at least one new | |
+				 * request has arrived, but we have not | |
+				 * disabled the timer because the request | |
+				 * was too small; 2) the block layer has | |
+				 * then unplugged the device, causing the | |
+				 * dispatch to be invoked. | |
+ * | |
+ * Since the device is unplugged, now the | |
+ * requests are probably large enough to | |
+ * provide a reasonable throughput. | |
+ * So we disable idling. | |
+ */ | |
+ bfq_clear_bfqq_wait_request(bfqq); | |
+ del_timer(&bfqd->idle_slice_timer); | |
+ } | |
+ goto keep_queue; | |
+ } | |
+ } | |
+ | |
+ /* | |
+ * No requests pending. If the in-service queue still has requests | |
+ * in flight (possibly waiting for a completion) or is idling for a | |
+ * new request, then keep it. | |
+ */ | |
+ if (timer_pending(&bfqd->idle_slice_timer) || | |
+ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) { | |
+ bfqq = NULL; | |
+ goto keep_queue; | |
+ } | |
+ | |
+ reason = BFQ_BFQQ_NO_MORE_REQUESTS; | |
+expire: | |
+ bfq_bfqq_expire(bfqd, bfqq, 0, reason); | |
+new_queue: | |
+ bfqq = bfq_set_in_service_queue(bfqd); | |
+ bfq_log(bfqd, "select_queue: new queue %d returned", | |
+ bfqq != NULL ? bfqq->pid : 0); | |
+keep_queue: | |
+ return bfqq; | |
+} | |
+ | |
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */ | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)", | |
+ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish), | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time), | |
+ bfqq->wr_coeff, | |
+ bfqq->entity.weight, bfqq->entity.orig_weight); | |
+ | |
+ BUG_ON(bfqq != bfqd->in_service_queue && entity->weight != | |
+ entity->orig_weight * bfqq->wr_coeff); | |
+ if (entity->ioprio_changed) | |
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change"); | |
+ | |
+ /* | |
+ * If the queue was activated in a burst, or | |
+ * too much time has elapsed from the beginning | |
+ * of this weight-raising period, or the queue has | |
+ * exceeded the acceptable number of cooperations, | |
+ * then end weight raising. | |
+ */ | |
+ if (bfq_bfqq_in_large_burst(bfqq) || | |
+ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh || | |
+ time_is_before_jiffies(bfqq->last_wr_start_finish + | |
+ bfqq->wr_cur_max_time)) { | |
+ bfqq->last_wr_start_finish = jiffies; | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "wrais ending at %lu, rais_max_time %u", | |
+ bfqq->last_wr_start_finish, | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time)); | |
+ bfq_bfqq_end_wr(bfqq); | |
+ } | |
+ } | |
+ /* Update weight both if it must be raised and if it must be lowered */ | |
+ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1)) | |
+ __bfq_entity_update_weight_prio( | |
+ bfq_entity_service_tree(entity), | |
+ entity); | |
+} | |
+ | |
+/* | |
+ * Dispatch one request from bfqq, moving it to the request queue | |
+ * dispatch list. | |
+ */ | |
+static int bfq_dispatch_request(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq) | |
+{ | |
+ int dispatched = 0; | |
+ struct request *rq; | |
+ unsigned long service_to_charge; | |
+ | |
+ BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list)); | |
+ | |
+ /* Follow expired path, else get first next available. */ | |
+ rq = bfq_check_fifo(bfqq); | |
+ if (rq == NULL) | |
+ rq = bfqq->next_rq; | |
+ service_to_charge = bfq_serv_to_charge(rq, bfqq); | |
+ | |
+ if (service_to_charge > bfq_bfqq_budget_left(bfqq)) { | |
+ /* | |
+ * This may happen if the next rq is chosen in fifo order | |
+ * instead of sector order. The budget is properly | |
+		 * dimensioned to always be sufficient to serve the next | |
+		 * request only if it is chosen in sector order. The reason | |
+		 * is that it would be quite inefficient, and of little use, | |
+		 * to always make sure that the budget is large enough to | |
+		 * serve even the possible next rq in fifo order. | |
+ * In fact, requests are seldom served in fifo order. | |
+ * | |
+ * Expire the queue for budget exhaustion, and make sure | |
+ * that the next act_budget is enough to serve the next | |
+ * request, even if it comes from the fifo expired path. | |
+ */ | |
+ bfqq->next_rq = rq; | |
+ /* | |
+		 * Since this dispatch failed, make sure that | |
+		 * a new one will be performed. | |
+ */ | |
+ if (!bfqd->rq_in_driver) | |
+ bfq_schedule_dispatch(bfqd); | |
+ goto expire; | |
+ } | |
+ | |
+ /* Finally, insert request into driver dispatch list. */ | |
+ bfq_bfqq_served(bfqq, service_to_charge); | |
+ bfq_dispatch_insert(bfqd->queue, rq); | |
+ | |
+ bfq_update_wr_data(bfqd, bfqq); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "dispatched %u sec req (%llu), budg left %lu", | |
+ blk_rq_sectors(rq), | |
+ (long long unsigned)blk_rq_pos(rq), | |
+ bfq_bfqq_budget_left(bfqq)); | |
+ | |
+ dispatched++; | |
+ | |
+ if (bfqd->in_service_bic == NULL) { | |
+ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount); | |
+ bfqd->in_service_bic = RQ_BIC(rq); | |
+ } | |
+ | |
+ if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) && | |
+ dispatched >= bfqd->bfq_max_budget_async_rq) || | |
+ bfq_class_idle(bfqq))) | |
+ goto expire; | |
+ | |
+ return dispatched; | |
+ | |
+expire: | |
+ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED); | |
+ return dispatched; | |
+} | |
+ | |
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq) | |
+{ | |
+ int dispatched = 0; | |
+ | |
+ while (bfqq->next_rq != NULL) { | |
+ bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq); | |
+ dispatched++; | |
+ } | |
+ | |
+ BUG_ON(!list_empty(&bfqq->fifo)); | |
+ return dispatched; | |
+} | |
+ | |
+/* | |
+ * Drain our current requests. | |
+ * Used for barriers and when switching io schedulers on-the-fly. | |
+ */ | |
+static int bfq_forced_dispatch(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_queue *bfqq, *n; | |
+ struct bfq_service_tree *st; | |
+ int dispatched = 0; | |
+ | |
+ bfqq = bfqd->in_service_queue; | |
+ if (bfqq != NULL) | |
+ __bfq_bfqq_expire(bfqd, bfqq); | |
+ | |
+ /* | |
+ * Loop through classes, and be careful to leave the scheduler | |
+ * in a consistent state, as feedback mechanisms and vtime | |
+ * updates cannot be disabled during the process. | |
+ */ | |
+ list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) { | |
+ st = bfq_entity_service_tree(&bfqq->entity); | |
+ | |
+ dispatched += __bfq_forced_dispatch_bfqq(bfqq); | |
+ bfqq->max_budget = bfq_max_budget(bfqd); | |
+ | |
+ bfq_forget_idle(st); | |
+ } | |
+ | |
+ BUG_ON(bfqd->busy_queues != 0); | |
+ | |
+ return dispatched; | |
+} | |
+ | |
+static int bfq_dispatch_requests(struct request_queue *q, int force) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct bfq_queue *bfqq; | |
+ int max_dispatch; | |
+ | |
+ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues); | |
+ if (bfqd->busy_queues == 0) | |
+ return 0; | |
+ | |
+ if (unlikely(force)) | |
+ return bfq_forced_dispatch(bfqd); | |
+ | |
+ bfqq = bfq_select_queue(bfqd); | |
+ if (bfqq == NULL) | |
+ return 0; | |
+ | |
+ max_dispatch = bfqd->bfq_quantum; | |
+ if (bfq_class_idle(bfqq)) | |
+ max_dispatch = 1; | |
+ | |
+ if (!bfq_bfqq_sync(bfqq)) | |
+ max_dispatch = bfqd->bfq_max_budget_async_rq; | |
+ | |
+ if (bfqq->dispatched >= max_dispatch) { | |
+ if (bfqd->busy_queues > 1) | |
+ return 0; | |
+ if (bfqq->dispatched >= 4 * max_dispatch) | |
+ return 0; | |
+ } | |
+ | |
+ if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq)) | |
+ return 0; | |
+ | |
+ bfq_clear_bfqq_wait_request(bfqq); | |
+ BUG_ON(timer_pending(&bfqd->idle_slice_timer)); | |
+ | |
+ if (!bfq_dispatch_request(bfqd, bfqq)) | |
+ return 0; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)", | |
+ bfqq->pid, max_dispatch); | |
+ | |
+ return 1; | |
+} | |
+ | |
+/* | |
+ * Task holds one reference to the queue, dropped when task exits. Each rq | |
+ * in-flight on this queue also holds a reference, dropped when rq is freed. | |
+ * | |
+ * Queue lock must be held here. | |
+ */ | |
+static void bfq_put_queue(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ | |
+ BUG_ON(atomic_read(&bfqq->ref) <= 0); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq, | |
+ atomic_read(&bfqq->ref)); | |
+ if (!atomic_dec_and_test(&bfqq->ref)) | |
+ return; | |
+ | |
+ BUG_ON(rb_first(&bfqq->sort_list) != NULL); | |
+ BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0); | |
+ BUG_ON(bfqq->entity.tree != NULL); | |
+ BUG_ON(bfq_bfqq_busy(bfqq)); | |
+ BUG_ON(bfqd->in_service_queue == bfqq); | |
+ | |
+ if (bfq_bfqq_sync(bfqq)) | |
+ /* | |
+ * The fact that this queue is being destroyed does not | |
+ * invalidate the fact that this queue may have been | |
+ * activated during the current burst. As a consequence, | |
+		 * although the queue no longer exists, and hence | |
+		 * needs to be removed from the burst list if it is there, | |
+		 * the burst size must not be decremented. | |
+ */ | |
+ hlist_del_init(&bfqq->burst_list_node); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq); | |
+ | |
+ kmem_cache_free(bfq_pool, bfqq); | |
+} | |
+ | |
+static void bfq_put_cooperator(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_queue *__bfqq, *next; | |
+ | |
+ /* | |
+ * If this queue was scheduled to merge with another queue, be | |
+ * sure to drop the reference taken on that queue (and others in | |
+ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs. | |
+ */ | |
+ __bfqq = bfqq->new_bfqq; | |
+ while (__bfqq) { | |
+ if (__bfqq == bfqq) | |
+ break; | |
+ next = __bfqq->new_bfqq; | |
+ bfq_put_queue(__bfqq); | |
+ __bfqq = next; | |
+ } | |
+} | |
+ | |
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ if (bfqq == bfqd->in_service_queue) { | |
+ __bfq_bfqq_expire(bfqd, bfqq); | |
+ bfq_schedule_dispatch(bfqd); | |
+ } | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, | |
+ atomic_read(&bfqq->ref)); | |
+ | |
+ bfq_put_cooperator(bfqq); | |
+ | |
+ bfq_put_queue(bfqq); | |
+} | |
+ | |
+static inline void bfq_init_icq(struct io_cq *icq) | |
+{ | |
+ struct bfq_io_cq *bic = icq_to_bic(icq); | |
+ | |
+ bic->ttime.last_end_request = jiffies; | |
+ /* | |
+ * A newly created bic indicates that the process has just | |
+ * started doing I/O, and is probably mapping into memory its | |
+ * executable and libraries: it definitely needs weight raising. | |
+ * There is however the possibility that the process performs, | |
+ * for a while, I/O close to some other process. EQM intercepts | |
+ * this behavior and may merge the queue corresponding to the | |
+ * process with some other queue, BEFORE the weight of the queue | |
+ * is raised. Merged queues are not weight-raised (they are assumed | |
+ * to belong to processes that benefit only from high throughput). | |
+ * If the merge is basically the consequence of an accident, then | |
+ * the queue will be split soon and will get back its old weight. | |
+	 * It is then important to record somewhere that this queue | |
+	 * does need weight raising, even if it did not manage to get its | |
+	 * weight raised before being merged. To this purpose, we overload | |
+ * the field raising_time_left and assign 1 to it, to mark the queue | |
+ * as needing weight raising. | |
+ */ | |
+ bic->wr_time_left = 1; | |
+} | |
+ | |
+static void bfq_exit_icq(struct io_cq *icq) | |
+{ | |
+ struct bfq_io_cq *bic = icq_to_bic(icq); | |
+ struct bfq_data *bfqd = bic_to_bfqd(bic); | |
+ | |
+ if (bic->bfqq[BLK_RW_ASYNC]) { | |
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]); | |
+ bic->bfqq[BLK_RW_ASYNC] = NULL; | |
+ } | |
+ | |
+ if (bic->bfqq[BLK_RW_SYNC]) { | |
+ /* | |
+ * If the bic is using a shared queue, put the reference | |
+ * taken on the io_context when the bic started using a | |
+ * shared bfq_queue. | |
+ */ | |
+ if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC])) | |
+ put_io_context(icq->ioc); | |
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]); | |
+ bic->bfqq[BLK_RW_SYNC] = NULL; | |
+ } | |
+} | |
+ | |
+/* | |
+ * Update the entity prio values; note that the new values will not | |
+ * be used until the next (re)activation. | |
+ */ | |
+static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic) | |
+{ | |
+ struct task_struct *tsk = current; | |
+ int ioprio_class; | |
+ | |
+ if (!bfq_bfqq_prio_changed(bfqq)) | |
+ return; | |
+ | |
+ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); | |
+ switch (ioprio_class) { | |
+ default: | |
+ dev_err(bfqq->bfqd->queue->backing_dev_info.dev, | |
+ "bfq: bad prio %x\n", ioprio_class); | |
+ case IOPRIO_CLASS_NONE: | |
+ /* | |
+ * No prio set, inherit CPU scheduling settings. | |
+ */ | |
+ bfqq->entity.new_ioprio = task_nice_ioprio(tsk); | |
+ bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk); | |
+ break; | |
+ case IOPRIO_CLASS_RT: | |
+ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); | |
+ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT; | |
+ break; | |
+ case IOPRIO_CLASS_BE: | |
+ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); | |
+ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE; | |
+ break; | |
+ case IOPRIO_CLASS_IDLE: | |
+ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE; | |
+ bfqq->entity.new_ioprio = 7; | |
+ bfq_clear_bfqq_idle_window(bfqq); | |
+ break; | |
+ } | |
+ | |
+ bfqq->entity.ioprio_changed = 1; | |
+ | |
+ bfq_clear_bfqq_prio_changed(bfqq); | |
+} | |
+ | |
+static void bfq_changed_ioprio(struct bfq_io_cq *bic) | |
+{ | |
+ struct bfq_data *bfqd; | |
+ struct bfq_queue *bfqq, *new_bfqq; | |
+ struct bfq_group *bfqg; | |
+ unsigned long uninitialized_var(flags); | |
+ int ioprio = bic->icq.ioc->ioprio; | |
+ | |
+ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data), | |
+ &flags); | |
+ /* | |
+	 * This condition may trigger on a newly created bic; be sure to | |
+ * drop the lock before returning. | |
+ */ | |
+ if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio)) | |
+ goto out; | |
+ | |
+ bfqq = bic->bfqq[BLK_RW_ASYNC]; | |
+ if (bfqq != NULL) { | |
+ bfqg = container_of(bfqq->entity.sched_data, struct bfq_group, | |
+ sched_data); | |
+ new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic, | |
+ GFP_ATOMIC); | |
+ if (new_bfqq != NULL) { | |
+ bic->bfqq[BLK_RW_ASYNC] = new_bfqq; | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "changed_ioprio: bfqq %p %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ bfq_put_queue(bfqq); | |
+ } | |
+ } | |
+ | |
+ bfqq = bic->bfqq[BLK_RW_SYNC]; | |
+ if (bfqq != NULL) | |
+ bfq_mark_bfqq_prio_changed(bfqq); | |
+ | |
+ bic->ioprio = ioprio; | |
+ | |
+out: | |
+ bfq_put_bfqd_unlock(bfqd, &flags); | |
+} | |
+ | |
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ pid_t pid, int is_sync) | |
+{ | |
+ RB_CLEAR_NODE(&bfqq->entity.rb_node); | |
+ INIT_LIST_HEAD(&bfqq->fifo); | |
+ INIT_HLIST_NODE(&bfqq->burst_list_node); | |
+ | |
+ atomic_set(&bfqq->ref, 0); | |
+ bfqq->bfqd = bfqd; | |
+ | |
+ bfq_mark_bfqq_prio_changed(bfqq); | |
+ | |
+ if (is_sync) { | |
+ if (!bfq_class_idle(bfqq)) | |
+ bfq_mark_bfqq_idle_window(bfqq); | |
+ bfq_mark_bfqq_sync(bfqq); | |
+ } | |
+ bfq_mark_bfqq_IO_bound(bfqq); | |
+ | |
+ /* Tentative initial value to trade off between thr and lat */ | |
+ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3; | |
+ bfqq->pid = pid; | |
+ | |
+ bfqq->wr_coeff = 1; | |
+ bfqq->last_wr_start_finish = 0; | |
+ /* | |
+ * Set to the value for which bfqq will not be deemed as | |
+ * soft rt when it becomes backlogged. | |
+ */ | |
+ bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies); | |
+} | |
+ | |
+static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg, | |
+ int is_sync, | |
+ struct bfq_io_cq *bic, | |
+ gfp_t gfp_mask) | |
+{ | |
+ struct bfq_queue *bfqq, *new_bfqq = NULL; | |
+ | |
+retry: | |
+ /* bic always exists here */ | |
+ bfqq = bic_to_bfqq(bic, is_sync); | |
+ | |
+ /* | |
+	 * Always try a new alloc if we fell back to the OOM bfqq | |
+ * originally, since it should just be a temporary situation. | |
+ */ | |
+ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) { | |
+ bfqq = NULL; | |
+ if (new_bfqq != NULL) { | |
+ bfqq = new_bfqq; | |
+ new_bfqq = NULL; | |
+ } else if (gfp_mask & __GFP_WAIT) { | |
+ spin_unlock_irq(bfqd->queue->queue_lock); | |
+ new_bfqq = kmem_cache_alloc_node(bfq_pool, | |
+ gfp_mask | __GFP_ZERO, | |
+ bfqd->queue->node); | |
+ spin_lock_irq(bfqd->queue->queue_lock); | |
+ if (new_bfqq != NULL) | |
+ goto retry; | |
+ } else { | |
+ bfqq = kmem_cache_alloc_node(bfq_pool, | |
+ gfp_mask | __GFP_ZERO, | |
+ bfqd->queue->node); | |
+ } | |
+ | |
+ if (bfqq != NULL) { | |
+ bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync); | |
+ bfq_log_bfqq(bfqd, bfqq, "allocated"); | |
+ } else { | |
+ bfqq = &bfqd->oom_bfqq; | |
+ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq"); | |
+ } | |
+ | |
+ bfq_init_prio_data(bfqq, bic); | |
+ bfq_init_entity(&bfqq->entity, bfqg); | |
+ } | |
+ | |
+ if (new_bfqq != NULL) | |
+ kmem_cache_free(bfq_pool, new_bfqq); | |
+ | |
+ return bfqq; | |
+} | |
+ | |
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg, | |
+ int ioprio_class, int ioprio) | |
+{ | |
+ switch (ioprio_class) { | |
+ case IOPRIO_CLASS_RT: | |
+ return &bfqg->async_bfqq[0][ioprio]; | |
+ case IOPRIO_CLASS_NONE: | |
+ ioprio = IOPRIO_NORM; | |
+ /* fall through */ | |
+ case IOPRIO_CLASS_BE: | |
+ return &bfqg->async_bfqq[1][ioprio]; | |
+ case IOPRIO_CLASS_IDLE: | |
+ return &bfqg->async_idle_bfqq; | |
+ default: | |
+ BUG(); | |
+ } | |
+} | |
+ | |
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg, int is_sync, | |
+ struct bfq_io_cq *bic, gfp_t gfp_mask) | |
+{ | |
+ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio); | |
+ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); | |
+ struct bfq_queue **async_bfqq = NULL; | |
+ struct bfq_queue *bfqq = NULL; | |
+ | |
+ if (!is_sync) { | |
+ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class, | |
+ ioprio); | |
+ bfqq = *async_bfqq; | |
+ } | |
+ | |
+ if (bfqq == NULL) | |
+ bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask); | |
+ | |
+ /* | |
+ * Pin the queue now that it's allocated, scheduler exit will | |
+ * prune it. | |
+ */ | |
+ if (!is_sync && *async_bfqq == NULL) { | |
+ atomic_inc(&bfqq->ref); | |
+ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ *async_bfqq = bfqq; | |
+ } | |
+ | |
+ atomic_inc(&bfqq->ref); | |
+ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, | |
+ atomic_read(&bfqq->ref)); | |
+ return bfqq; | |
+} | |
+ | |
+static void bfq_update_io_thinktime(struct bfq_data *bfqd, | |
+ struct bfq_io_cq *bic) | |
+{ | |
+ unsigned long elapsed = jiffies - bic->ttime.last_end_request; | |
+ unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle); | |
+ | |
+ bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8; | |
+ bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8; | |
+ bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) / | |
+ bic->ttime.ttime_samples; | |
+} | |
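Note (not part of the patch): the think-time update above is a fixed-point exponential moving average. On every update the sample count and the accumulated think time are decayed by 7/8 and the new observation is added scaled by 256, so the sample count converges toward 256 and the mean can be computed in plain integer arithmetic. The following user-space sketch reproduces the same arithmetic with ordinary local variables instead of the bic fields:

/* Illustrative user-space sketch of the fixed-point think-time average
 * used above (not part of the patch). The factor 256 keeps fractional
 * precision in integer arithmetic; each update decays the history by 7/8.
 */
#include <stdio.h>

int main(void)
{
	unsigned long samples = 0, total = 0, mean = 0;
	unsigned long observations[] = { 4, 4, 12, 4 };	/* think times, in jiffies */
	size_t i;

	for (i = 0; i < sizeof(observations) / sizeof(observations[0]); i++) {
		samples = (7 * samples + 256) / 8;
		total = (7 * total + 256 * observations[i]) / 8;
		mean = (total + 128) / samples;
		printf("after sample %zu: mean think time ~ %lu jiffies\n",
		       i + 1, mean);
	}
	return 0;
}

The same decay-by-7/8 scheme is reused a few lines below for the per-queue seek statistics.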
+ | |
+static void bfq_update_io_seektime(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ struct request *rq) | |
+{ | |
+ sector_t sdist; | |
+ u64 total; | |
+ | |
+ if (bfqq->last_request_pos < blk_rq_pos(rq)) | |
+ sdist = blk_rq_pos(rq) - bfqq->last_request_pos; | |
+ else | |
+ sdist = bfqq->last_request_pos - blk_rq_pos(rq); | |
+ | |
+ /* | |
+ * Don't allow the seek distance to get too large from the | |
+ * odd fragment, pagein, etc. | |
+ */ | |
+ if (bfqq->seek_samples == 0) /* first request, not really a seek */ | |
+ sdist = 0; | |
+ else if (bfqq->seek_samples <= 60) /* second & third seek */ | |
+ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024); | |
+ else | |
+ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64); | |
+ | |
+ bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8; | |
+ bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8; | |
+ total = bfqq->seek_total + (bfqq->seek_samples/2); | |
+ do_div(total, bfqq->seek_samples); | |
+ bfqq->seek_mean = (sector_t)total; | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist, | |
+ (u64)bfqq->seek_mean); | |
+} | |
+ | |
+/* | |
+ * Disable idle window if the process thinks too long or seeks so much that | |
+ * it doesn't matter. | |
+ */ | |
+static void bfq_update_idle_window(struct bfq_data *bfqd, | |
+ struct bfq_queue *bfqq, | |
+ struct bfq_io_cq *bic) | |
+{ | |
+ int enable_idle; | |
+ | |
+ /* Don't idle for async or idle io prio class. */ | |
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq)) | |
+ return; | |
+ | |
+ /* Idle window just restored, statistics are meaningless. */ | |
+ if (bfq_bfqq_just_split(bfqq)) | |
+ return; | |
+ | |
+ enable_idle = bfq_bfqq_idle_window(bfqq); | |
+ | |
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 || | |
+ bfqd->bfq_slice_idle == 0 || | |
+ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) && | |
+ bfqq->wr_coeff == 1)) | |
+ enable_idle = 0; | |
+ else if (bfq_sample_valid(bic->ttime.ttime_samples)) { | |
+ if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle && | |
+ bfqq->wr_coeff == 1) | |
+ enable_idle = 0; | |
+ else | |
+ enable_idle = 1; | |
+ } | |
+ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d", | |
+ enable_idle); | |
+ | |
+ if (enable_idle) | |
+ bfq_mark_bfqq_idle_window(bfqq); | |
+ else | |
+ bfq_clear_bfqq_idle_window(bfqq); | |
+} | |
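Note (not part of the patch): in plain terms, the function above keeps the idle window only when waiting is likely to pay off. It is dropped if the process has no active I/O context, idling is disabled, or the queue is seeky on an NCQ-capable drive without weight raising; otherwise, with enough think-time samples, it is kept only if the mean think time fits within the idle slice. A stand-alone restatement of that decision, with hypothetical parameter names chosen only for this sketch, could look like:

/* Illustrative restatement of the idle-window decision above (not part of
 * the patch); parameter names are hypothetical, the logic mirrors the code.
 */
int keep_idle_window(int has_active_io, unsigned long slice_idle,
		     int hw_tag, int seeky, unsigned int wr_coeff,
		     int samples_valid, unsigned long ttime_mean,
		     int currently_idling)
{
	if (!has_active_io || slice_idle == 0 ||
	    (hw_tag && seeky && wr_coeff == 1))
		return 0;	/* idling would only waste the device */
	if (samples_valid)	/* keep idling only for quick thinkers */
		return !(ttime_mean > slice_idle && wr_coeff == 1);
	return currently_idling;	/* not enough data: keep the old choice */
}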
+ | |
+/* | |
+ * Called when a new fs request (rq) is added to bfqq. Check if there's | |
+ * something we should do about it. | |
+ */ | |
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ struct request *rq) | |
+{ | |
+ struct bfq_io_cq *bic = RQ_BIC(rq); | |
+ | |
+ if (rq->cmd_flags & REQ_META) | |
+ bfqq->meta_pending++; | |
+ | |
+ bfq_update_io_thinktime(bfqd, bic); | |
+ bfq_update_io_seektime(bfqd, bfqq, rq); | |
+ if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) { | |
+ bfq_clear_bfqq_constantly_seeky(bfqq); | |
+ if (!blk_queue_nonrot(bfqd->queue)) { | |
+ BUG_ON(!bfqd->const_seeky_busy_in_flight_queues); | |
+ bfqd->const_seeky_busy_in_flight_queues--; | |
+ } | |
+ } | |
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 || | |
+ !BFQQ_SEEKY(bfqq)) | |
+ bfq_update_idle_window(bfqd, bfqq, bic); | |
+ bfq_clear_bfqq_just_split(bfqq); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, | |
+ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)", | |
+ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq), | |
+ (long long unsigned)bfqq->seek_mean); | |
+ | |
+ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq); | |
+ | |
+ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) { | |
+ int small_req = bfqq->queued[rq_is_sync(rq)] == 1 && | |
+ blk_rq_sectors(rq) < 32; | |
+ int budget_timeout = bfq_bfqq_budget_timeout(bfqq); | |
+ | |
+ /* | |
+ * There is just this request queued: if the request | |
+ * is small and the queue is not to be expired, then | |
+ * just exit. | |
+ * | |
+ * In this way, if the disk is being idled to wait for | |
+ * a new request from the in-service queue, we avoid | |
+ * unplugging the device and committing the disk to serve | |
+		 * just a small request. Instead, we wait for | |
+ * the block layer to decide when to unplug the device: | |
+ * hopefully, new requests will be merged to this one | |
+ * quickly, then the device will be unplugged and | |
+ * larger requests will be dispatched. | |
+ */ | |
+ if (small_req && !budget_timeout) | |
+ return; | |
+ | |
+ /* | |
+ * A large enough request arrived, or the queue is to | |
+ * be expired: in both cases disk idling is to be | |
+ * stopped, so clear wait_request flag and reset | |
+ * timer. | |
+ */ | |
+ bfq_clear_bfqq_wait_request(bfqq); | |
+ del_timer(&bfqd->idle_slice_timer); | |
+ | |
+ /* | |
+ * The queue is not empty, because a new request just | |
+ * arrived. Hence we can safely expire the queue, in | |
+ * case of budget timeout, without risking that the | |
+ * timestamps of the queue are not updated correctly. | |
+ * See [1] for more details. | |
+ */ | |
+ if (budget_timeout) | |
+ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT); | |
+ | |
+ /* | |
+ * Let the request rip immediately, or let a new queue be | |
+ * selected if bfqq has just been expired. | |
+ */ | |
+ __blk_run_queue(bfqd->queue); | |
+ } | |
+} | |
+ | |
+static void bfq_insert_request(struct request_queue *q, struct request *rq) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq; | |
+ | |
+ assert_spin_locked(bfqd->queue->queue_lock); | |
+ | |
+ /* | |
+ * An unplug may trigger a requeue of a request from the device | |
+ * driver: make sure we are in process context while trying to | |
+ * merge two bfq_queues. | |
+ */ | |
+ if (!in_interrupt()) { | |
+ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true); | |
+ if (new_bfqq != NULL) { | |
+ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq) | |
+ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1); | |
+ /* | |
+ * Release the request's reference to the old bfqq | |
+ * and make sure one is taken to the shared queue. | |
+ */ | |
+ new_bfqq->allocated[rq_data_dir(rq)]++; | |
+ bfqq->allocated[rq_data_dir(rq)]--; | |
+ atomic_inc(&new_bfqq->ref); | |
+ bfq_put_queue(bfqq); | |
+ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq) | |
+ bfq_merge_bfqqs(bfqd, RQ_BIC(rq), | |
+ bfqq, new_bfqq); | |
+ rq->elv.priv[1] = new_bfqq; | |
+ bfqq = new_bfqq; | |
+ } else | |
+ bfq_bfqq_increase_failed_cooperations(bfqq); | |
+ } | |
+ | |
+ bfq_init_prio_data(bfqq, RQ_BIC(rq)); | |
+ | |
+ bfq_add_request(rq); | |
+ | |
+ /* | |
+ * Here a newly-created bfq_queue has already started a weight-raising | |
+ * period: clear raising_time_left to prevent bfq_bfqq_save_state() | |
+ * from assigning it a full weight-raising period. See the detailed | |
+ * comments about this field in bfq_init_icq(). | |
+ */ | |
+ if (bfqq->bic != NULL) | |
+ bfqq->bic->wr_time_left = 0; | |
+ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)]; | |
+ list_add_tail(&rq->queuelist, &bfqq->fifo); | |
+ | |
+ bfq_rq_enqueued(bfqd, bfqq, rq); | |
+} | |
+ | |
+static void bfq_update_hw_tag(struct bfq_data *bfqd) | |
+{ | |
+ bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver, | |
+ bfqd->rq_in_driver); | |
+ | |
+ if (bfqd->hw_tag == 1) | |
+ return; | |
+ | |
+ /* | |
+ * This sample is valid if the number of outstanding requests | |
+ * is large enough to allow a queueing behavior. Note that the | |
+ * sum is not exact, as it's not taking into account deactivated | |
+ * requests. | |
+ */ | |
+ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD) | |
+ return; | |
+ | |
+ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES) | |
+ return; | |
+ | |
+ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD; | |
+ bfqd->max_rq_in_driver = 0; | |
+ bfqd->hw_tag_samples = 0; | |
+} | |
+ | |
+static void bfq_completed_request(struct request_queue *q, struct request *rq) | |
+{ | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ struct bfq_data *bfqd = bfqq->bfqd; | |
+ bool sync = bfq_bfqq_sync(bfqq); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)", | |
+ blk_rq_sectors(rq), sync); | |
+ | |
+ bfq_update_hw_tag(bfqd); | |
+ | |
+ BUG_ON(!bfqd->rq_in_driver); | |
+ BUG_ON(!bfqq->dispatched); | |
+ bfqd->rq_in_driver--; | |
+ bfqq->dispatched--; | |
+ | |
+ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) { | |
+ bfq_weights_tree_remove(bfqd, &bfqq->entity, | |
+ &bfqd->queue_weights_tree); | |
+ if (!blk_queue_nonrot(bfqd->queue)) { | |
+ BUG_ON(!bfqd->busy_in_flight_queues); | |
+ bfqd->busy_in_flight_queues--; | |
+ if (bfq_bfqq_constantly_seeky(bfqq)) { | |
+ BUG_ON(!bfqd-> | |
+ const_seeky_busy_in_flight_queues); | |
+ bfqd->const_seeky_busy_in_flight_queues--; | |
+ } | |
+ } | |
+ } | |
+ | |
+ if (sync) { | |
+ bfqd->sync_flight--; | |
+ RQ_BIC(rq)->ttime.last_end_request = jiffies; | |
+ } | |
+ | |
+ /* | |
+ * If we are waiting to discover whether the request pattern of the | |
+ * task associated with the queue is actually isochronous, and | |
+ * both requisites for this condition to hold are satisfied, then | |
+ * compute soft_rt_next_start (see the comments to the function | |
+ * bfq_bfqq_softrt_next_start()). | |
+ */ | |
+ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 && | |
+ RB_EMPTY_ROOT(&bfqq->sort_list)) | |
+ bfqq->soft_rt_next_start = | |
+ bfq_bfqq_softrt_next_start(bfqd, bfqq); | |
+ | |
+ /* | |
+ * If this is the in-service queue, check if it needs to be expired, | |
+ * or if we want to idle in case it has no pending requests. | |
+ */ | |
+ if (bfqd->in_service_queue == bfqq) { | |
+ if (bfq_bfqq_budget_new(bfqq)) | |
+ bfq_set_budget_timeout(bfqd); | |
+ | |
+ if (bfq_bfqq_must_idle(bfqq)) { | |
+ bfq_arm_slice_timer(bfqd); | |
+ goto out; | |
+ } else if (bfq_may_expire_for_budg_timeout(bfqq)) | |
+ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT); | |
+ else if (RB_EMPTY_ROOT(&bfqq->sort_list) && | |
+ (bfqq->dispatched == 0 || | |
+ !bfq_bfqq_must_not_expire(bfqq))) | |
+ bfq_bfqq_expire(bfqd, bfqq, 0, | |
+ BFQ_BFQQ_NO_MORE_REQUESTS); | |
+ } | |
+ | |
+ if (!bfqd->rq_in_driver) | |
+ bfq_schedule_dispatch(bfqd); | |
+ | |
+out: | |
+ return; | |
+} | |
+ | |
+static inline int __bfq_may_queue(struct bfq_queue *bfqq) | |
+{ | |
+ if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) { | |
+ bfq_clear_bfqq_must_alloc(bfqq); | |
+ return ELV_MQUEUE_MUST; | |
+ } | |
+ | |
+ return ELV_MQUEUE_MAY; | |
+} | |
+ | |
+static int bfq_may_queue(struct request_queue *q, int rw) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct task_struct *tsk = current; | |
+ struct bfq_io_cq *bic; | |
+ struct bfq_queue *bfqq; | |
+ | |
+ /* | |
+ * Don't force setup of a queue from here, as a call to may_queue | |
+ * does not necessarily imply that a request actually will be | |
+ * queued. So just lookup a possibly existing queue, or return | |
+ * 'may queue' if that fails. | |
+ */ | |
+ bic = bfq_bic_lookup(bfqd, tsk->io_context); | |
+ if (bic == NULL) | |
+ return ELV_MQUEUE_MAY; | |
+ | |
+ bfqq = bic_to_bfqq(bic, rw_is_sync(rw)); | |
+ if (bfqq != NULL) { | |
+ bfq_init_prio_data(bfqq, bic); | |
+ | |
+ return __bfq_may_queue(bfqq); | |
+ } | |
+ | |
+ return ELV_MQUEUE_MAY; | |
+} | |
+ | |
+/* | |
+ * Queue lock held here. | |
+ */ | |
+static void bfq_put_request(struct request *rq) | |
+{ | |
+ struct bfq_queue *bfqq = RQ_BFQQ(rq); | |
+ | |
+ if (bfqq != NULL) { | |
+ const int rw = rq_data_dir(rq); | |
+ | |
+ BUG_ON(!bfqq->allocated[rw]); | |
+ bfqq->allocated[rw]--; | |
+ | |
+ rq->elv.priv[0] = NULL; | |
+ rq->elv.priv[1] = NULL; | |
+ | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ bfq_put_queue(bfqq); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this | |
+ * was the last process referring to said bfqq. | |
+ */ | |
+static struct bfq_queue * | |
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq) | |
+{ | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue"); | |
+ | |
+ put_io_context(bic->icq.ioc); | |
+ | |
+ if (bfqq_process_refs(bfqq) == 1) { | |
+ bfqq->pid = current->pid; | |
+ bfq_clear_bfqq_coop(bfqq); | |
+ bfq_clear_bfqq_split_coop(bfqq); | |
+ return bfqq; | |
+ } | |
+ | |
+ bic_set_bfqq(bic, NULL, 1); | |
+ | |
+ bfq_put_cooperator(bfqq); | |
+ | |
+ bfq_put_queue(bfqq); | |
+ return NULL; | |
+} | |
+ | |
+/* | |
+ * Allocate bfq data structures associated with this request. | |
+ */ | |
+static int bfq_set_request(struct request_queue *q, struct request *rq, | |
+ struct bio *bio, gfp_t gfp_mask) | |
+{ | |
+ struct bfq_data *bfqd = q->elevator->elevator_data; | |
+ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq); | |
+ const int rw = rq_data_dir(rq); | |
+ const int is_sync = rq_is_sync(rq); | |
+ struct bfq_queue *bfqq; | |
+ struct bfq_group *bfqg; | |
+ unsigned long flags; | |
+ bool split = false; | |
+ | |
+ might_sleep_if(gfp_mask & __GFP_WAIT); | |
+ | |
+ bfq_changed_ioprio(bic); | |
+ | |
+ spin_lock_irqsave(q->queue_lock, flags); | |
+ | |
+ if (bic == NULL) | |
+ goto queue_fail; | |
+ | |
+ bfqg = bfq_bic_update_cgroup(bic); | |
+ | |
+new_queue: | |
+ bfqq = bic_to_bfqq(bic, is_sync); | |
+ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) { | |
+ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask); | |
+ bic_set_bfqq(bic, bfqq, is_sync); | |
+ if (split && is_sync) { | |
+ if ((bic->was_in_burst_list && bfqd->large_burst) || | |
+ bic->saved_in_large_burst) | |
+ bfq_mark_bfqq_in_large_burst(bfqq); | |
+ else { | |
+ bfq_clear_bfqq_in_large_burst(bfqq); | |
+ if (bic->was_in_burst_list) | |
+ hlist_add_head(&bfqq->burst_list_node, | |
+ &bfqd->burst_list); | |
+ } | |
+ } | |
+ } else { | |
+ /* If the queue was seeky for too long, break it apart. */ | |
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) { | |
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq"); | |
+ bfqq = bfq_split_bfqq(bic, bfqq); | |
+ split = true; | |
+ if (!bfqq) | |
+ goto new_queue; | |
+ } | |
+ } | |
+ | |
+ bfqq->allocated[rw]++; | |
+ atomic_inc(&bfqq->ref); | |
+ bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq, | |
+ atomic_read(&bfqq->ref)); | |
+ | |
+ rq->elv.priv[0] = bic; | |
+ rq->elv.priv[1] = bfqq; | |
+ | |
+ /* | |
+ * If a bfq_queue has only one process reference, it is owned | |
+ * by only one bfq_io_cq: we can set the bic field of the | |
+ * bfq_queue to the address of that structure. Also, if the | |
+ * queue has just been split, mark a flag so that the | |
+ * information is available to the other scheduler hooks. | |
+ */ | |
+ if (bfqq_process_refs(bfqq) == 1) { | |
+ bfqq->bic = bic; | |
+ if (split) { | |
+ bfq_mark_bfqq_just_split(bfqq); | |
+ /* | |
+ * If the queue has just been split from a shared | |
+ * queue, restore the idle window and the possible | |
+ * weight raising period. | |
+ */ | |
+ bfq_bfqq_resume_state(bfqq, bic); | |
+ } | |
+ } | |
+ | |
+ spin_unlock_irqrestore(q->queue_lock, flags); | |
+ | |
+ return 0; | |
+ | |
+queue_fail: | |
+ bfq_schedule_dispatch(bfqd); | |
+ spin_unlock_irqrestore(q->queue_lock, flags); | |
+ | |
+ return 1; | |
+} | |
+ | |
+static void bfq_kick_queue(struct work_struct *work) | |
+{ | |
+ struct bfq_data *bfqd = | |
+ container_of(work, struct bfq_data, unplug_work); | |
+ struct request_queue *q = bfqd->queue; | |
+ | |
+ spin_lock_irq(q->queue_lock); | |
+ __blk_run_queue(q); | |
+ spin_unlock_irq(q->queue_lock); | |
+} | |
+ | |
+/* | |
+ * Handler of the expiration of the timer running if the in-service queue | |
+ * is idling inside its time slice. | |
+ */ | |
+static void bfq_idle_slice_timer(unsigned long data) | |
+{ | |
+ struct bfq_data *bfqd = (struct bfq_data *)data; | |
+ struct bfq_queue *bfqq; | |
+ unsigned long flags; | |
+ enum bfqq_expiration reason; | |
+ | |
+ spin_lock_irqsave(bfqd->queue->queue_lock, flags); | |
+ | |
+ bfqq = bfqd->in_service_queue; | |
+ /* | |
+ * Theoretical race here: the in-service queue can be NULL or | |
+ * different from the queue that was idling if the timer handler | |
+ * spins on the queue_lock and a new request arrives for the | |
+ * current queue and there is a full dispatch cycle that changes | |
+ * the in-service queue. This can hardly happen, but in the worst | |
+ * case we just expire a queue too early. | |
+ */ | |
+ if (bfqq != NULL) { | |
+ bfq_log_bfqq(bfqd, bfqq, "slice_timer expired"); | |
+ if (bfq_bfqq_budget_timeout(bfqq)) | |
+ /* | |
+ * Also here the queue can be safely expired | |
+ * for budget timeout without wasting | |
+ * guarantees | |
+ */ | |
+ reason = BFQ_BFQQ_BUDGET_TIMEOUT; | |
+ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0) | |
+ /* | |
+ * The queue may not be empty upon timer expiration, | |
+ * because we may not disable the timer when the | |
+ * first request of the in-service queue arrives | |
+ * during disk idling. | |
+ */ | |
+ reason = BFQ_BFQQ_TOO_IDLE; | |
+ else | |
+ goto schedule_dispatch; | |
+ | |
+ bfq_bfqq_expire(bfqd, bfqq, 1, reason); | |
+ } | |
+ | |
+schedule_dispatch: | |
+ bfq_schedule_dispatch(bfqd); | |
+ | |
+ spin_unlock_irqrestore(bfqd->queue->queue_lock, flags); | |
+} | |
+ | |
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd) | |
+{ | |
+ del_timer_sync(&bfqd->idle_slice_timer); | |
+ cancel_work_sync(&bfqd->unplug_work); | |
+} | |
+ | |
+static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd, | |
+ struct bfq_queue **bfqq_ptr) | |
+{ | |
+ struct bfq_group *root_group = bfqd->root_group; | |
+ struct bfq_queue *bfqq = *bfqq_ptr; | |
+ | |
+ bfq_log(bfqd, "put_async_bfqq: %p", bfqq); | |
+ if (bfqq != NULL) { | |
+ bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group); | |
+ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ bfq_put_queue(bfqq); | |
+ *bfqq_ptr = NULL; | |
+ } | |
+} | |
+ | |
+/* | |
+ * Release all the bfqg references to its async queues. If we are | |
+ * deallocating the group these queues may still contain requests, so | |
+ * we reparent them to the root cgroup (i.e., the only one that will | |
+ * exist for sure until all the requests on a device are gone). | |
+ */ | |
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg) | |
+{ | |
+ int i, j; | |
+ | |
+ for (i = 0; i < 2; i++) | |
+ for (j = 0; j < IOPRIO_BE_NR; j++) | |
+ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]); | |
+ | |
+ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq); | |
+} | |
+ | |
+static void bfq_exit_queue(struct elevator_queue *e) | |
+{ | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ struct request_queue *q = bfqd->queue; | |
+ struct bfq_queue *bfqq, *n; | |
+ | |
+ bfq_shutdown_timer_wq(bfqd); | |
+ | |
+ spin_lock_irq(q->queue_lock); | |
+ | |
+ BUG_ON(bfqd->in_service_queue != NULL); | |
+ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list) | |
+ bfq_deactivate_bfqq(bfqd, bfqq, 0); | |
+ | |
+ bfq_disconnect_groups(bfqd); | |
+ spin_unlock_irq(q->queue_lock); | |
+ | |
+ bfq_shutdown_timer_wq(bfqd); | |
+ | |
+ synchronize_rcu(); | |
+ | |
+ BUG_ON(timer_pending(&bfqd->idle_slice_timer)); | |
+ | |
+ bfq_free_root_group(bfqd); | |
+ kfree(bfqd); | |
+} | |
+ | |
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) | |
+{ | |
+ struct bfq_group *bfqg; | |
+ struct bfq_data *bfqd; | |
+ struct elevator_queue *eq; | |
+ | |
+ eq = elevator_alloc(q, e); | |
+ if (eq == NULL) | |
+ return -ENOMEM; | |
+ | |
+ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node); | |
+ if (bfqd == NULL) { | |
+ kobject_put(&eq->kobj); | |
+ return -ENOMEM; | |
+ } | |
+ eq->elevator_data = bfqd; | |
+ | |
+ /* | |
+ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues. | |
+ * Grab a permanent reference to it, so that the normal code flow | |
+ * will not attempt to free it. | |
+ */ | |
+ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0); | |
+ atomic_inc(&bfqd->oom_bfqq.ref); | |
+ | |
+ bfqd->queue = q; | |
+ | |
+ spin_lock_irq(q->queue_lock); | |
+ q->elevator = eq; | |
+ spin_unlock_irq(q->queue_lock); | |
+ | |
+ bfqg = bfq_alloc_root_group(bfqd, q->node); | |
+ if (bfqg == NULL) { | |
+ kfree(bfqd); | |
+ kobject_put(&eq->kobj); | |
+ return -ENOMEM; | |
+ } | |
+ | |
+ bfqd->root_group = bfqg; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ bfqd->active_numerous_groups = 0; | |
+#endif | |
+ | |
+ init_timer(&bfqd->idle_slice_timer); | |
+ bfqd->idle_slice_timer.function = bfq_idle_slice_timer; | |
+ bfqd->idle_slice_timer.data = (unsigned long)bfqd; | |
+ | |
+ bfqd->rq_pos_tree = RB_ROOT; | |
+ bfqd->queue_weights_tree = RB_ROOT; | |
+ bfqd->group_weights_tree = RB_ROOT; | |
+ | |
+ INIT_WORK(&bfqd->unplug_work, bfq_kick_queue); | |
+ | |
+ INIT_LIST_HEAD(&bfqd->active_list); | |
+ INIT_LIST_HEAD(&bfqd->idle_list); | |
+ INIT_HLIST_HEAD(&bfqd->burst_list); | |
+ | |
+ bfqd->hw_tag = -1; | |
+ | |
+ bfqd->bfq_max_budget = bfq_default_max_budget; | |
+ | |
+ bfqd->bfq_quantum = bfq_quantum; | |
+ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0]; | |
+ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1]; | |
+ bfqd->bfq_back_max = bfq_back_max; | |
+ bfqd->bfq_back_penalty = bfq_back_penalty; | |
+ bfqd->bfq_slice_idle = bfq_slice_idle; | |
+ bfqd->bfq_class_idle_last_service = 0; | |
+ bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq; | |
+ bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async; | |
+ bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync; | |
+ | |
+ bfqd->bfq_coop_thresh = 2; | |
+ bfqd->bfq_failed_cooperations = 7000; | |
+ bfqd->bfq_requests_within_timer = 120; | |
+ | |
+ bfqd->bfq_large_burst_thresh = 11; | |
+ bfqd->bfq_burst_interval = msecs_to_jiffies(500); | |
+ | |
+ bfqd->low_latency = true; | |
+ | |
+ bfqd->bfq_wr_coeff = 20; | |
+ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300); | |
+ bfqd->bfq_wr_max_time = 0; | |
+ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000); | |
+ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500); | |
+ bfqd->bfq_wr_max_softrt_rate = 7000; /* | |
+ * Approximate rate required | |
+ * to playback or record a | |
+ * high-definition compressed | |
+ * video. | |
+ */ | |
+ bfqd->wr_busy_queues = 0; | |
+ bfqd->busy_in_flight_queues = 0; | |
+ bfqd->const_seeky_busy_in_flight_queues = 0; | |
+ | |
+ /* | |
+ * Begin by assuming, optimistically, that the device peak rate is | |
+ * equal to the highest reference rate. | |
+ */ | |
+ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] * | |
+ T_fast[blk_queue_nonrot(bfqd->queue)]; | |
+ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)]; | |
+ bfqd->device_speed = BFQ_BFQD_FAST; | |
+ | |
+ return 0; | |
+} | |
+ | |
+static void bfq_slab_kill(void) | |
+{ | |
+ if (bfq_pool != NULL) | |
+ kmem_cache_destroy(bfq_pool); | |
+} | |
+ | |
+static int __init bfq_slab_setup(void) | |
+{ | |
+ bfq_pool = KMEM_CACHE(bfq_queue, 0); | |
+ if (bfq_pool == NULL) | |
+ return -ENOMEM; | |
+ return 0; | |
+} | |
+ | |
+static ssize_t bfq_var_show(unsigned int var, char *page) | |
+{ | |
+ return sprintf(page, "%d\n", var); | |
+} | |
+ | |
+static ssize_t bfq_var_store(unsigned long *var, const char *page, | |
+ size_t count) | |
+{ | |
+ unsigned long new_val; | |
+ int ret = kstrtoul(page, 10, &new_val); | |
+ | |
+ if (ret == 0) | |
+ *var = new_val; | |
+ | |
+ return count; | |
+} | |
+ | |
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page) | |
+{ | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ? | |
+ jiffies_to_msecs(bfqd->bfq_wr_max_time) : | |
+ jiffies_to_msecs(bfq_wr_duration(bfqd))); | |
+} | |
+ | |
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page) | |
+{ | |
+ struct bfq_queue *bfqq; | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ ssize_t num_char = 0; | |
+ | |
+ num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n", | |
+ bfqd->queued); | |
+ | |
+ spin_lock_irq(bfqd->queue->queue_lock); | |
+ | |
+ num_char += sprintf(page + num_char, "Active:\n"); | |
+ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) { | |
+ num_char += sprintf(page + num_char, | |
+ "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n", | |
+ bfqq->pid, | |
+ bfqq->entity.weight, | |
+ bfqq->queued[0], | |
+ bfqq->queued[1], | |
+ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish), | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time)); | |
+ } | |
+ | |
+ num_char += sprintf(page + num_char, "Idle:\n"); | |
+ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) { | |
+ num_char += sprintf(page + num_char, | |
+ "pid%d: weight %hu, dur %d/%u\n", | |
+ bfqq->pid, | |
+ bfqq->entity.weight, | |
+ jiffies_to_msecs(jiffies - | |
+ bfqq->last_wr_start_finish), | |
+ jiffies_to_msecs(bfqq->wr_cur_max_time)); | |
+ } | |
+ | |
+ spin_unlock_irq(bfqd->queue->queue_lock); | |
+ | |
+ return num_char; | |
+} | |
+ | |
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ | |
+static ssize_t __FUNC(struct elevator_queue *e, char *page) \ | |
+{ \ | |
+ struct bfq_data *bfqd = e->elevator_data; \ | |
+ unsigned int __data = __VAR; \ | |
+ if (__CONV) \ | |
+ __data = jiffies_to_msecs(__data); \ | |
+ return bfq_var_show(__data, (page)); \ | |
+} | |
+SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0); | |
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1); | |
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1); | |
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0); | |
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0); | |
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1); | |
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0); | |
+SHOW_FUNCTION(bfq_max_budget_async_rq_show, | |
+ bfqd->bfq_max_budget_async_rq, 0); | |
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1); | |
+SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1); | |
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0); | |
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0); | |
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1); | |
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1); | |
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async, | |
+ 1); | |
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0); | |
+#undef SHOW_FUNCTION | |
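To make the macro above easier to follow, here is roughly what a single instantiation expands to; the expansion below is shown only as an illustration and is not part of the patch, taking SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0) as the example:

/* Approximate expansion of SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
 * illustration only, the real function is generated by the macro above.
 */
static ssize_t bfq_quantum_show(struct elevator_queue *e, char *page)
{
	struct bfq_data *bfqd = e->elevator_data;
	unsigned int __data = bfqd->bfq_quantum;

	if (0)	/* __CONV is 0 for this tunable: no jiffies-to-msecs conversion */
		__data = jiffies_to_msecs(__data);
	return bfq_var_show(__data, (page));
}

Reading one of these tunables through sysfs (for instance the quantum attribute under the device's queue/iosched directory) ends up in the corresponding generated helper.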
+ | |
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ | |
+static ssize_t \ | |
+__FUNC(struct elevator_queue *e, const char *page, size_t count) \ | |
+{ \ | |
+ struct bfq_data *bfqd = e->elevator_data; \ | |
+ unsigned long uninitialized_var(__data); \ | |
+ int ret = bfq_var_store(&__data, (page), count); \ | |
+ if (__data < (MIN)) \ | |
+ __data = (MIN); \ | |
+ else if (__data > (MAX)) \ | |
+ __data = (MAX); \ | |
+ if (__CONV) \ | |
+ *(__PTR) = msecs_to_jiffies(__data); \ | |
+ else \ | |
+ *(__PTR) = __data; \ | |
+ return ret; \ | |
+} | |
+STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0); | |
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1, | |
+ INT_MAX, 1); | |
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1, | |
+ INT_MAX, 1); | |
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0); | |
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1, | |
+ INT_MAX, 0); | |
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1); | |
+STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq, | |
+ 1, INT_MAX, 0); | |
+STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0, | |
+ INT_MAX, 1); | |
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0); | |
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1); | |
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX, | |
+ 1); | |
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0, | |
+ INT_MAX, 1); | |
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store, | |
+ &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1); | |
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0, | |
+ INT_MAX, 0); | |
+#undef STORE_FUNCTION | |
+ | |
+/* do nothing for the moment */ | |
+static ssize_t bfq_weights_store(struct elevator_queue *e, | |
+ const char *page, size_t count) | |
+{ | |
+ return count; | |
+} | |
+ | |
+static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd) | |
+{ | |
+ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]); | |
+ | |
+ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES) | |
+ return bfq_calc_max_budget(bfqd->peak_rate, timeout); | |
+ else | |
+ return bfq_default_max_budget; | |
+} | |
+ | |
+static ssize_t bfq_max_budget_store(struct elevator_queue *e, | |
+ const char *page, size_t count) | |
+{ | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ unsigned long uninitialized_var(__data); | |
+ int ret = bfq_var_store(&__data, (page), count); | |
+ | |
+ if (__data == 0) | |
+ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd); | |
+ else { | |
+ if (__data > INT_MAX) | |
+ __data = INT_MAX; | |
+ bfqd->bfq_max_budget = __data; | |
+ } | |
+ | |
+ bfqd->bfq_user_max_budget = __data; | |
+ | |
+ return ret; | |
+} | |
+ | |
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, | |
+ const char *page, size_t count) | |
+{ | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ unsigned long uninitialized_var(__data); | |
+ int ret = bfq_var_store(&__data, (page), count); | |
+ | |
+ if (__data < 1) | |
+ __data = 1; | |
+ else if (__data > INT_MAX) | |
+ __data = INT_MAX; | |
+ | |
+ bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data); | |
+ if (bfqd->bfq_user_max_budget == 0) | |
+ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd); | |
+ | |
+ return ret; | |
+} | |
+ | |
+static ssize_t bfq_low_latency_store(struct elevator_queue *e, | |
+ const char *page, size_t count) | |
+{ | |
+ struct bfq_data *bfqd = e->elevator_data; | |
+ unsigned long uninitialized_var(__data); | |
+ int ret = bfq_var_store(&__data, (page), count); | |
+ | |
+ if (__data > 1) | |
+ __data = 1; | |
+ if (__data == 0 && bfqd->low_latency != 0) | |
+ bfq_end_wr(bfqd); | |
+ bfqd->low_latency = __data; | |
+ | |
+ return ret; | |
+} | |
+ | |
+#define BFQ_ATTR(name) \ | |
+ __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store) | |
+ | |
+static struct elv_fs_entry bfq_attrs[] = { | |
+ BFQ_ATTR(quantum), | |
+ BFQ_ATTR(fifo_expire_sync), | |
+ BFQ_ATTR(fifo_expire_async), | |
+ BFQ_ATTR(back_seek_max), | |
+ BFQ_ATTR(back_seek_penalty), | |
+ BFQ_ATTR(slice_idle), | |
+ BFQ_ATTR(max_budget), | |
+ BFQ_ATTR(max_budget_async_rq), | |
+ BFQ_ATTR(timeout_sync), | |
+ BFQ_ATTR(timeout_async), | |
+ BFQ_ATTR(low_latency), | |
+ BFQ_ATTR(wr_coeff), | |
+ BFQ_ATTR(wr_max_time), | |
+ BFQ_ATTR(wr_rt_max_time), | |
+ BFQ_ATTR(wr_min_idle_time), | |
+ BFQ_ATTR(wr_min_inter_arr_async), | |
+ BFQ_ATTR(wr_max_softrt_rate), | |
+ BFQ_ATTR(weights), | |
+ __ATTR_NULL | |
+}; | |
+ | |
+static struct elevator_type iosched_bfq = { | |
+ .ops = { | |
+ .elevator_merge_fn = bfq_merge, | |
+ .elevator_merged_fn = bfq_merged_request, | |
+ .elevator_merge_req_fn = bfq_merged_requests, | |
+ .elevator_allow_merge_fn = bfq_allow_merge, | |
+ .elevator_dispatch_fn = bfq_dispatch_requests, | |
+ .elevator_add_req_fn = bfq_insert_request, | |
+ .elevator_activate_req_fn = bfq_activate_request, | |
+ .elevator_deactivate_req_fn = bfq_deactivate_request, | |
+ .elevator_completed_req_fn = bfq_completed_request, | |
+ .elevator_former_req_fn = elv_rb_former_request, | |
+ .elevator_latter_req_fn = elv_rb_latter_request, | |
+ .elevator_init_icq_fn = bfq_init_icq, | |
+ .elevator_exit_icq_fn = bfq_exit_icq, | |
+ .elevator_set_req_fn = bfq_set_request, | |
+ .elevator_put_req_fn = bfq_put_request, | |
+ .elevator_may_queue_fn = bfq_may_queue, | |
+ .elevator_init_fn = bfq_init_queue, | |
+ .elevator_exit_fn = bfq_exit_queue, | |
+ }, | |
+ .icq_size = sizeof(struct bfq_io_cq), | |
+ .icq_align = __alignof__(struct bfq_io_cq), | |
+ .elevator_attrs = bfq_attrs, | |
+ .elevator_name = "bfq", | |
+ .elevator_owner = THIS_MODULE, | |
+}; | |
+ | |
+static int __init bfq_init(void) | |
+{ | |
+ /* | |
+ * Can be 0 on HZ < 1000 setups. | |
+ */ | |
+ if (bfq_slice_idle == 0) | |
+ bfq_slice_idle = 1; | |
+ | |
+ if (bfq_timeout_async == 0) | |
+ bfq_timeout_async = 1; | |
+ | |
+ if (bfq_slab_setup()) | |
+ return -ENOMEM; | |
+ | |
+ /* | |
+ * Times to load large popular applications for the typical systems | |
+ * installed on the reference devices (see the comments before the | |
+ * definitions of the two arrays). | |
+ */ | |
+ T_slow[0] = msecs_to_jiffies(2600); | |
+ T_slow[1] = msecs_to_jiffies(1000); | |
+ T_fast[0] = msecs_to_jiffies(5500); | |
+ T_fast[1] = msecs_to_jiffies(2000); | |
+ | |
+ /* | |
+ * Thresholds that determine the switch between speed classes (see | |
+ * the comments before the definition of the array). | |
+ */ | |
+ device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2; | |
+ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2; | |
+ | |
+ elv_register(&iosched_bfq); | |
+ pr_info("BFQ I/O-scheduler version: v7r6"); | |
+ | |
+ return 0; | |
+} | |
+ | |
+static void __exit bfq_exit(void) | |
+{ | |
+ elv_unregister(&iosched_bfq); | |
+ bfq_slab_kill(); | |
+} | |
+ | |
+module_init(bfq_init); | |
+module_exit(bfq_exit); | |
+ | |
+MODULE_AUTHOR("Fabio Checconi, Paolo Valente"); | |
+MODULE_LICENSE("GPL"); | |
diff --git a/block/bfq-sched.c b/block/bfq-sched.c | |
new file mode 100644 | |
index 0000000..546a254 | |
--- /dev/null | |
+++ b/block/bfq-sched.c | |
@@ -0,0 +1,1179 @@ | |
+/* | |
+ * BFQ: Hierarchical B-WF2Q+ scheduler. | |
+ * | |
+ * Based on ideas and code from CFQ: | |
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> | |
+ * | |
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> | |
+ * Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> | |
+ */ | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+#define for_each_entity(entity) \ | |
+ for (; entity != NULL; entity = entity->parent) | |
+ | |
+#define for_each_entity_safe(entity, parent) \ | |
+ for (; entity && ({ parent = entity->parent; 1; }); entity = parent) | |
+ | |
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd, | |
+ int extract, | |
+ struct bfq_data *bfqd); | |
+ | |
+static inline void bfq_update_budget(struct bfq_entity *next_in_service) | |
+{ | |
+ struct bfq_entity *bfqg_entity; | |
+ struct bfq_group *bfqg; | |
+ struct bfq_sched_data *group_sd; | |
+ | |
+ BUG_ON(next_in_service == NULL); | |
+ | |
+ group_sd = next_in_service->sched_data; | |
+ | |
+ bfqg = container_of(group_sd, struct bfq_group, sched_data); | |
+ /* | |
+ * bfq_group's my_entity field is not NULL only if the group | |
+ * is not the root group. We must not touch the root entity | |
+ * as it must never become an in-service entity. | |
+ */ | |
+ bfqg_entity = bfqg->my_entity; | |
+ if (bfqg_entity != NULL) | |
+ bfqg_entity->budget = next_in_service->budget; | |
+} | |
+ | |
+static int bfq_update_next_in_service(struct bfq_sched_data *sd) | |
+{ | |
+ struct bfq_entity *next_in_service; | |
+ | |
+ if (sd->in_service_entity != NULL) | |
+ /* will update/requeue at the end of service */ | |
+ return 0; | |
+ | |
+ /* | |
+ * NOTE: this can be improved in many ways, such as returning | |
+ * 1 (and thus propagating upwards the update) only when the | |
+ * budget changes, or caching the bfqq that will be scheduled | |
+ * next from this subtree. By now we worry more about | |
+ * correctness than about performance... | |
+ */ | |
+ next_in_service = bfq_lookup_next_entity(sd, 0, NULL); | |
+ sd->next_in_service = next_in_service; | |
+ | |
+ if (next_in_service != NULL) | |
+ bfq_update_budget(next_in_service); | |
+ | |
+ return 1; | |
+} | |
+ | |
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd, | |
+ struct bfq_entity *entity) | |
+{ | |
+ BUG_ON(sd->next_in_service != entity); | |
+} | |
+#else | |
+#define for_each_entity(entity) \ | |
+ for (; entity != NULL; entity = NULL) | |
+ | |
+#define for_each_entity_safe(entity, parent) \ | |
+ for (parent = NULL; entity != NULL; entity = parent) | |
+ | |
+static inline int bfq_update_next_in_service(struct bfq_sched_data *sd) | |
+{ | |
+ return 0; | |
+} | |
+ | |
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd, | |
+ struct bfq_entity *entity) | |
+{ | |
+} | |
+ | |
+static inline void bfq_update_budget(struct bfq_entity *next_in_service) | |
+{ | |
+} | |
+#endif | |
+ | |
+/* | |
+ * Shift for timestamp calculations. This actually limits the maximum | |
+ * service allowed in one timestamp delta (small shift values increase it), | |
+ * the maximum total weight that can be used for the queues in the system | |
+ * (big shift values increase it), and the period of virtual time | |
+ * wraparounds. | |
+ */ | |
+#define WFQ_SERVICE_SHIFT 22 | |
+ | |
+/** | |
+ * bfq_gt - compare two timestamps. | |
+ * @a: first ts. | |
+ * @b: second ts. | |
+ * | |
+ * Return @a > @b, dealing with wrapping correctly. | |
+ */ | |
+static inline int bfq_gt(u64 a, u64 b) | |
+{ | |
+ return (s64)(a - b) > 0; | |
+} | |
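The signed-difference trick above keeps comparisons correct across wraparound of the 64-bit virtual clock. A minimal user-space demonstration of why the cast matters (not part of the patch):

/* Stand-alone illustration of the wrap-safe comparison used by bfq_gt():
 * the signed difference stays correct even when the 64-bit clock wraps.
 */
#include <stdint.h>
#include <stdio.h>

static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

int main(void)
{
	uint64_t before_wrap = UINT64_MAX - 5;	/* just before the wrap */
	uint64_t after_wrap = 10;		/* just after the wrap */

	/* A plain "a > b" would claim before_wrap is later; the signed
	 * difference correctly reports after_wrap as the later timestamp. */
	printf("naive: %d, wrap-safe: %d\n",
	       before_wrap < after_wrap, ts_gt(after_wrap, before_wrap));
	return 0;
}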
+ | |
+static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = NULL; | |
+ | |
+ BUG_ON(entity == NULL); | |
+ | |
+ if (entity->my_sched_data == NULL) | |
+ bfqq = container_of(entity, struct bfq_queue, entity); | |
+ | |
+ return bfqq; | |
+} | |
+ | |
+ | |
+/** | |
+ * bfq_delta - map service into the virtual time domain. | |
+ * @service: amount of service. | |
+ * @weight: scale factor (weight of an entity or weight sum). | |
+ */ | |
+static inline u64 bfq_delta(unsigned long service, | |
+ unsigned long weight) | |
+{ | |
+ u64 d = (u64)service << WFQ_SERVICE_SHIFT; | |
+ | |
+ do_div(d, weight); | |
+ return d; | |
+} | |
+ | |
+/** | |
+ * bfq_calc_finish - assign the finish time to an entity. | |
+ * @entity: the entity to act upon. | |
+ * @service: the service to be charged to the entity. | |
+ */ | |
+static inline void bfq_calc_finish(struct bfq_entity *entity, | |
+ unsigned long service) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ | |
+ BUG_ON(entity->weight == 0); | |
+ | |
+ entity->finish = entity->start + | |
+ bfq_delta(service, entity->weight); | |
+ | |
+ if (bfqq != NULL) { | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, | |
+ "calc_finish: serv %lu, w %d", | |
+ service, entity->weight); | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, | |
+ "calc_finish: start %llu, finish %llu, delta %llu", | |
+ entity->start, entity->finish, | |
+ bfq_delta(service, entity->weight)); | |
+ } | |
+} | |
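Together, bfq_delta() and bfq_calc_finish() implement the WF2Q+ timestamping rule: an entity's finish time advances by the service it received divided by its weight, kept in fixed point through WFQ_SERVICE_SHIFT. The user-space sketch below (not part of the patch) shows how the same amount of service moves the finish time ten times less for an entity with ten times the weight:

/* User-space sketch of the timestamp update performed by
 * bfq_delta()/bfq_calc_finish(): virtual time advances by service/weight,
 * kept in fixed point via WFQ_SERVICE_SHIFT. Illustration only.
 */
#include <stdint.h>
#include <stdio.h>

#define WFQ_SERVICE_SHIFT 22

static uint64_t delta(unsigned long service, unsigned long weight)
{
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}

int main(void)
{
	unsigned long service = 16;	/* sectors served in this round */

	/* The entity with ten times the weight accumulates one tenth of
	 * the virtual-time debt for the same amount of service. */
	printf("finish delta, weight 1:  %llu\n",
	       (unsigned long long)delta(service, 1));
	printf("finish delta, weight 10: %llu\n",
	       (unsigned long long)delta(service, 10));
	return 0;
}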
+ | |
+/** | |
+ * bfq_entity_of - get an entity from a node. | |
+ * @node: the node field of the entity. | |
+ * | |
+ * Convert a node pointer to the corresponding entity. This is used only | |
+ * to simplify the logic of some functions and not as the generic | |
+ * conversion mechanism because, e.g., in the tree walking functions, | |
+ * the check for a %NULL value would be redundant. | |
+ */ | |
+static inline struct bfq_entity *bfq_entity_of(struct rb_node *node) | |
+{ | |
+ struct bfq_entity *entity = NULL; | |
+ | |
+ if (node != NULL) | |
+ entity = rb_entry(node, struct bfq_entity, rb_node); | |
+ | |
+ return entity; | |
+} | |
+ | |
+/** | |
+ * bfq_extract - remove an entity from a tree. | |
+ * @root: the tree root. | |
+ * @entity: the entity to remove. | |
+ */ | |
+static inline void bfq_extract(struct rb_root *root, | |
+ struct bfq_entity *entity) | |
+{ | |
+ BUG_ON(entity->tree != root); | |
+ | |
+ entity->tree = NULL; | |
+ rb_erase(&entity->rb_node, root); | |
+} | |
+ | |
+/** | |
+ * bfq_idle_extract - extract an entity from the idle tree. | |
+ * @st: the service tree of the owning @entity. | |
+ * @entity: the entity being removed. | |
+ */ | |
+static void bfq_idle_extract(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ struct rb_node *next; | |
+ | |
+ BUG_ON(entity->tree != &st->idle); | |
+ | |
+ if (entity == st->first_idle) { | |
+ next = rb_next(&entity->rb_node); | |
+ st->first_idle = bfq_entity_of(next); | |
+ } | |
+ | |
+ if (entity == st->last_idle) { | |
+ next = rb_prev(&entity->rb_node); | |
+ st->last_idle = bfq_entity_of(next); | |
+ } | |
+ | |
+ bfq_extract(&st->idle, entity); | |
+ | |
+ if (bfqq != NULL) | |
+ list_del(&bfqq->bfqq_list); | |
+} | |
+ | |
+/** | |
+ * bfq_insert - generic tree insertion. | |
+ * @root: tree root. | |
+ * @entity: entity to insert. | |
+ * | |
+ * This is used for the idle and the active tree, since they are both | |
+ * ordered by finish time. | |
+ */ | |
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity) | |
+{ | |
+ struct bfq_entity *entry; | |
+ struct rb_node **node = &root->rb_node; | |
+ struct rb_node *parent = NULL; | |
+ | |
+ BUG_ON(entity->tree != NULL); | |
+ | |
+ while (*node != NULL) { | |
+ parent = *node; | |
+ entry = rb_entry(parent, struct bfq_entity, rb_node); | |
+ | |
+ if (bfq_gt(entry->finish, entity->finish)) | |
+ node = &parent->rb_left; | |
+ else | |
+ node = &parent->rb_right; | |
+ } | |
+ | |
+ rb_link_node(&entity->rb_node, parent, node); | |
+ rb_insert_color(&entity->rb_node, root); | |
+ | |
+ entity->tree = root; | |
+} | |
+ | |
+/** | |
+ * bfq_update_min - update the min_start field of an entity. | |
+ * @entity: the entity to update. | |
+ * @node: one of its children. | |
+ * | |
+ * This function is called when @entity may store an invalid value for | |
+ * min_start due to updates to the active tree. The function assumes | |
+ * that the subtree rooted at @node (which may be its left or its right | |
+ * child) has a valid min_start value. | |
+ */ | |
+static inline void bfq_update_min(struct bfq_entity *entity, | |
+ struct rb_node *node) | |
+{ | |
+ struct bfq_entity *child; | |
+ | |
+ if (node != NULL) { | |
+ child = rb_entry(node, struct bfq_entity, rb_node); | |
+ if (bfq_gt(entity->min_start, child->min_start)) | |
+ entity->min_start = child->min_start; | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_update_active_node - recalculate min_start. | |
+ * @node: the node to update. | |
+ * | |
+ * @node may have changed position or one of its children may have moved; | |
+ * this function updates its min_start value. The left and right subtrees | |
+ * are assumed to hold a correct min_start value. | |
+ */ | |
+static inline void bfq_update_active_node(struct rb_node *node) | |
+{ | |
+ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node); | |
+ | |
+ entity->min_start = entity->start; | |
+ bfq_update_min(entity, node->rb_right); | |
+ bfq_update_min(entity, node->rb_left); | |
+} | |
+ | |
+/** | |
+ * bfq_update_active_tree - update min_start for the whole active tree. | |
+ * @node: the starting node. | |
+ * | |
+ * @node must be the deepest modified node after an update. This function | |
+ * updates its min_start using the values held by its children, assuming | |
+ * that they did not change, and then updates all the nodes that may have | |
+ * changed in the path to the root. The only nodes that may have changed | |
+ * are the ones in the path or their siblings. | |
+ */ | |
+static void bfq_update_active_tree(struct rb_node *node) | |
+{ | |
+ struct rb_node *parent; | |
+ | |
+up: | |
+ bfq_update_active_node(node); | |
+ | |
+ parent = rb_parent(node); | |
+ if (parent == NULL) | |
+ return; | |
+ | |
+ if (node == parent->rb_left && parent->rb_right != NULL) | |
+ bfq_update_active_node(parent->rb_right); | |
+ else if (parent->rb_left != NULL) | |
+ bfq_update_active_node(parent->rb_left); | |
+ | |
+ node = parent; | |
+ goto up; | |
+} | |
+ | |
+static void bfq_weights_tree_add(struct bfq_data *bfqd, | |
+ struct bfq_entity *entity, | |
+ struct rb_root *root); | |
+ | |
+static void bfq_weights_tree_remove(struct bfq_data *bfqd, | |
+ struct bfq_entity *entity, | |
+ struct rb_root *root); | |
+ | |
+ | |
+/** | |
+ * bfq_active_insert - insert an entity in the active tree of its | |
+ * group/device. | |
+ * @st: the service tree of the entity. | |
+ * @entity: the entity being inserted. | |
+ * | |
+ * The active tree is ordered by finish time, but an extra key is kept | |
+ * per node, containing the minimum value for the start times of | |
+ * its children (and the node itself), so it's possible to search for | |
+ * the eligible node with the lowest finish time in logarithmic time. | |
+ */ | |
+static void bfq_active_insert(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ struct rb_node *node = &entity->rb_node; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ struct bfq_sched_data *sd = NULL; | |
+ struct bfq_group *bfqg = NULL; | |
+ struct bfq_data *bfqd = NULL; | |
+#endif | |
+ | |
+ bfq_insert(&st->active, entity); | |
+ | |
+ if (node->rb_left != NULL) | |
+ node = node->rb_left; | |
+ else if (node->rb_right != NULL) | |
+ node = node->rb_right; | |
+ | |
+ bfq_update_active_tree(node); | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ sd = entity->sched_data; | |
+ bfqg = container_of(sd, struct bfq_group, sched_data); | |
+ BUG_ON(!bfqg); | |
+ bfqd = (struct bfq_data *)bfqg->bfqd; | |
+#endif | |
+ if (bfqq != NULL) | |
+ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list); | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ else { /* bfq_group */ | |
+ BUG_ON(!bfqd); | |
+ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree); | |
+ } | |
+ if (bfqg != bfqd->root_group) { | |
+ BUG_ON(!bfqg); | |
+ BUG_ON(!bfqd); | |
+ bfqg->active_entities++; | |
+ if (bfqg->active_entities == 2) | |
+ bfqd->active_numerous_groups++; | |
+ } | |
+#endif | |
+} | |
+ | |
+/** | |
+ * bfq_ioprio_to_weight - calc a weight from an ioprio. | |
+ * @ioprio: the ioprio value to convert. | |
+ */ | |
+static inline unsigned short bfq_ioprio_to_weight(int ioprio) | |
+{ | |
+ BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR); | |
+ return IOPRIO_BE_NR - ioprio; | |
+} | |
+ | |
+/** | |
+ * bfq_weight_to_ioprio - calc an ioprio from a weight. | |
+ * @weight: the weight value to convert. | |
+ * | |
+ * To preserve as much as possible the old only-ioprio user interface, | |
+ * 0 is used as an escape ioprio value for weights (numerically) equal to | |
+ * or larger than IOPRIO_BE_NR. | |
+ */ | |
+static inline unsigned short bfq_weight_to_ioprio(int weight) | |
+{ | |
+ BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT); | |
+ return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight; | |
+} | |
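+ | |
+/* | |
+ * For illustration: with IOPRIO_BE_NR equal to 8, the two helpers above | |
+ * map ioprio 0 (highest best-effort priority) to weight 8 and ioprio 7 | |
+ * to weight 1, while bfq_weight_to_ioprio() turns any weight of 8 or | |
+ * more into the escape ioprio 0. | |
+ */ | |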
+ | |
+static inline void bfq_get_entity(struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ | |
+ if (bfqq != NULL) { | |
+ atomic_inc(&bfqq->ref); | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_find_deepest - find the deepest node that an extraction can modify. | |
+ * @node: the node being removed. | |
+ * | |
+ * Do the first step of an extraction in an rb tree, looking for the | |
+ * node that will replace @node, and returning the deepest node that | |
+ * the following modifications to the tree can touch. If @node is the | |
+ * last node in the tree return %NULL. | |
+ */ | |
+static struct rb_node *bfq_find_deepest(struct rb_node *node) | |
+{ | |
+ struct rb_node *deepest; | |
+ | |
+ if (node->rb_right == NULL && node->rb_left == NULL) | |
+ deepest = rb_parent(node); | |
+ else if (node->rb_right == NULL) | |
+ deepest = node->rb_left; | |
+ else if (node->rb_left == NULL) | |
+ deepest = node->rb_right; | |
+ else { | |
+ deepest = rb_next(node); | |
+ if (deepest->rb_right != NULL) | |
+ deepest = deepest->rb_right; | |
+ else if (rb_parent(deepest) != node) | |
+ deepest = rb_parent(deepest); | |
+ } | |
+ | |
+ return deepest; | |
+} | |
+ | |
+/** | |
+ * bfq_active_extract - remove an entity from the active tree. | |
+ * @st: the service_tree containing the tree. | |
+ * @entity: the entity being removed. | |
+ */ | |
+static void bfq_active_extract(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ struct rb_node *node; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ struct bfq_sched_data *sd = NULL; | |
+ struct bfq_group *bfqg = NULL; | |
+ struct bfq_data *bfqd = NULL; | |
+#endif | |
+ | |
+ node = bfq_find_deepest(&entity->rb_node); | |
+ bfq_extract(&st->active, entity); | |
+ | |
+ if (node != NULL) | |
+ bfq_update_active_tree(node); | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ sd = entity->sched_data; | |
+ bfqg = container_of(sd, struct bfq_group, sched_data); | |
+ BUG_ON(!bfqg); | |
+ bfqd = (struct bfq_data *)bfqg->bfqd; | |
+#endif | |
+ if (bfqq != NULL) | |
+ list_del(&bfqq->bfqq_list); | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ else { /* bfq_group */ | |
+ BUG_ON(!bfqd); | |
+ bfq_weights_tree_remove(bfqd, entity, | |
+ &bfqd->group_weights_tree); | |
+ } | |
+ if (bfqg != bfqd->root_group) { | |
+ BUG_ON(!bfqg); | |
+ BUG_ON(!bfqd); | |
+ BUG_ON(!bfqg->active_entities); | |
+ bfqg->active_entities--; | |
+ if (bfqg->active_entities == 1) { | |
+ BUG_ON(!bfqd->active_numerous_groups); | |
+ bfqd->active_numerous_groups--; | |
+ } | |
+ } | |
+#endif | |
+} | |
+ | |
+/** | |
+ * bfq_idle_insert - insert an entity into the idle tree. | |
+ * @st: the service tree containing the tree. | |
+ * @entity: the entity to insert. | |
+ */ | |
+static void bfq_idle_insert(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ struct bfq_entity *first_idle = st->first_idle; | |
+ struct bfq_entity *last_idle = st->last_idle; | |
+ | |
+ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish)) | |
+ st->first_idle = entity; | |
+ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish)) | |
+ st->last_idle = entity; | |
+ | |
+ bfq_insert(&st->idle, entity); | |
+ | |
+ if (bfqq != NULL) | |
+ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list); | |
+} | |
+ | |
+/** | |
+ * bfq_forget_entity - remove an entity from the wfq trees. | |
+ * @st: the service tree. | |
+ * @entity: the entity being removed. | |
+ * | |
+ * Update the device status and forget everything about @entity, putting | |
+ * the device reference to it, if it is a queue. Entities belonging to | |
+ * groups are not refcounted. | |
+ */ | |
+static void bfq_forget_entity(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ struct bfq_sched_data *sd; | |
+ | |
+ BUG_ON(!entity->on_st); | |
+ | |
+ entity->on_st = 0; | |
+ st->wsum -= entity->weight; | |
+ if (bfqq != NULL) { | |
+ sd = entity->sched_data; | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d", | |
+ bfqq, atomic_read(&bfqq->ref)); | |
+ bfq_put_queue(bfqq); | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_put_idle_entity - release the idle tree ref of an entity. | |
+ * @st: service tree for the entity. | |
+ * @entity: the entity being released. | |
+ */ | |
+static void bfq_put_idle_entity(struct bfq_service_tree *st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ bfq_idle_extract(st, entity); | |
+ bfq_forget_entity(st, entity); | |
+} | |
+ | |
+/** | |
+ * bfq_forget_idle - update the idle tree if necessary. | |
+ * @st: the service tree to act upon. | |
+ * | |
+ * To preserve the global O(log N) complexity we only remove one entry here; | |
+ * as the idle tree will not grow indefinitely this can be done safely. | |
+ */ | |
+static void bfq_forget_idle(struct bfq_service_tree *st) | |
+{ | |
+ struct bfq_entity *first_idle = st->first_idle; | |
+ struct bfq_entity *last_idle = st->last_idle; | |
+ | |
+ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL && | |
+ !bfq_gt(last_idle->finish, st->vtime)) { | |
+ /* | |
+ * Forget the whole idle tree, increasing the vtime past | |
+ * the last finish time of idle entities. | |
+ */ | |
+ st->vtime = last_idle->finish; | |
+ } | |
+ | |
+ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime)) | |
+ bfq_put_idle_entity(st, first_idle); | |
+} | |
+ | |
+static struct bfq_service_tree * | |
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, | |
+ struct bfq_entity *entity) | |
+{ | |
+ struct bfq_service_tree *new_st = old_st; | |
+ | |
+ if (entity->ioprio_changed) { | |
+ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); | |
+ unsigned short prev_weight, new_weight; | |
+ struct bfq_data *bfqd = NULL; | |
+ struct rb_root *root; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ struct bfq_sched_data *sd; | |
+ struct bfq_group *bfqg; | |
+#endif | |
+ | |
+ if (bfqq != NULL) | |
+ bfqd = bfqq->bfqd; | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ else { | |
+ sd = entity->my_sched_data; | |
+ bfqg = container_of(sd, struct bfq_group, sched_data); | |
+ BUG_ON(!bfqg); | |
+ bfqd = (struct bfq_data *)bfqg->bfqd; | |
+ BUG_ON(!bfqd); | |
+ } | |
+#endif | |
+ | |
+ BUG_ON(old_st->wsum < entity->weight); | |
+ old_st->wsum -= entity->weight; | |
+ | |
+ if (entity->new_weight != entity->orig_weight) { | |
+ entity->orig_weight = entity->new_weight; | |
+ entity->ioprio = | |
+ bfq_weight_to_ioprio(entity->orig_weight); | |
+ } else if (entity->new_ioprio != entity->ioprio) { | |
+ entity->ioprio = entity->new_ioprio; | |
+ entity->orig_weight = | |
+ bfq_ioprio_to_weight(entity->ioprio); | |
+ } else | |
+ entity->new_weight = entity->orig_weight = | |
+ bfq_ioprio_to_weight(entity->ioprio); | |
+ | |
+ entity->ioprio_class = entity->new_ioprio_class; | |
+ entity->ioprio_changed = 0; | |
+ | |
+ /* | |
+ * NOTE: here we may be changing the weight too early, | |
+ * this will cause unfairness. The correct approach | |
+ * would have required additional complexity to defer | |
+ * weight changes to the proper time instants (i.e., | |
+ * when entity->finish <= old_st->vtime). | |
+ */ | |
+ new_st = bfq_entity_service_tree(entity); | |
+ | |
+ prev_weight = entity->weight; | |
+ new_weight = entity->orig_weight * | |
+ (bfqq != NULL ? bfqq->wr_coeff : 1); | |
+ /* | |
+ * If the weight of the entity changes, remove the entity | |
+ * from its old weight counter (if there is a counter | |
+ * associated with the entity), and add it to the counter | |
+ * associated with its new weight. | |
+ */ | |
+ if (prev_weight != new_weight) { | |
+ root = bfqq ? &bfqd->queue_weights_tree : | |
+ &bfqd->group_weights_tree; | |
+ bfq_weights_tree_remove(bfqd, entity, root); | |
+ } | |
+ entity->weight = new_weight; | |
+ /* | |
+ * Add the entity to its weights tree only if it is | |
+ * not associated with a weight-raised queue. | |
+ */ | |
+ if (prev_weight != new_weight && | |
+ (bfqq ? bfqq->wr_coeff == 1 : 1)) | |
+ /* If we get here, root has been initialized. */ | |
+ bfq_weights_tree_add(bfqd, entity, root); | |
+ | |
+ new_st->wsum += entity->weight; | |
+ | |
+ if (new_st != old_st) | |
+ entity->start = new_st->vtime; | |
+ } | |
+ | |
+ return new_st; | |
+} | |
+ | |
+/** | |
+ * bfq_bfqq_served - update the scheduler status after selection for | |
+ * service. | |
+ * @bfqq: the queue being served. | |
+ * @served: bytes to transfer. | |
+ * | |
+ * NOTE: this can be optimized, as the timestamps of upper level entities | |
+ * are synchronized every time a new bfqq is selected for service. For now, | |
+ * we keep it to better check consistency. | |
+ */ | |
+static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ struct bfq_service_tree *st; | |
+ | |
+ for_each_entity(entity) { | |
+ st = bfq_entity_service_tree(entity); | |
+ | |
+ entity->service += served; | |
+ BUG_ON(entity->service > entity->budget); | |
+ BUG_ON(st->wsum == 0); | |
+ | |
+ st->vtime += bfq_delta(served, st->wsum); | |
+ bfq_forget_idle(st); | |
+ } | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served); | |
+} | |
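+ | |
+/* | |
+ * For illustration: the vtime update above advances the scheduler's | |
+ * virtual time by the service just received divided by the service | |
+ * tree's total weight (wsum), which is what keeps the B-WF2Q+ | |
+ * timestamps of differently weighted entities comparable. | |
+ */ | |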
+ | |
+/** | |
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget. | |
+ * @bfqq: the queue that needs a service update. | |
+ * | |
+ * When it's not possible to be fair in the service domain, because | |
+ * a queue is not consuming its budget fast enough (the meaning of | |
+ * fast depends on the timeout parameter), we charge it a full | |
+ * budget. In this way we should obtain a sort of time-domain | |
+ * fairness among all the seeky/slow queues. | |
+ */ | |
+static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ | |
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget"); | |
+ | |
+ bfq_bfqq_served(bfqq, entity->budget - entity->service); | |
+} | |
+ | |
+/** | |
+ * __bfq_activate_entity - activate an entity. | |
+ * @entity: the entity being activated. | |
+ * | |
+ * Called whenever an entity is activated, i.e., it is not active and one | |
+ * of its children receives a new request, or has to be reactivated due to | |
+ * budget exhaustion. It uses the current budget of the entity (and the | |
+ * service received, if @entity is in service) to calculate its | |
+ * timestamps. | |
+ */ | |
+static void __bfq_activate_entity(struct bfq_entity *entity) | |
+{ | |
+ struct bfq_sched_data *sd = entity->sched_data; | |
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity); | |
+ | |
+ if (entity == sd->in_service_entity) { | |
+ BUG_ON(entity->tree != NULL); | |
+ /* | |
+ * If we are requeueing the current entity, we have | |
+ * to take care not to charge it for service it has | |
+ * not received. | |
+ */ | |
+ bfq_calc_finish(entity, entity->service); | |
+ entity->start = entity->finish; | |
+ sd->in_service_entity = NULL; | |
+ } else if (entity->tree == &st->active) { | |
+ /* | |
+ * Requeueing an entity due to a change of some | |
+ * next_in_service entity below it. We reuse the | |
+ * old start time. | |
+ */ | |
+ bfq_active_extract(st, entity); | |
+ } else if (entity->tree == &st->idle) { | |
+ /* | |
+ * Must be on the idle tree, bfq_idle_extract() will | |
+ * check for that. | |
+ */ | |
+ bfq_idle_extract(st, entity); | |
+ entity->start = bfq_gt(st->vtime, entity->finish) ? | |
+ st->vtime : entity->finish; | |
+ } else { | |
+ /* | |
+ * The finish time of the entity may be invalid, and | |
+ * it is in the past for sure, otherwise the queue | |
+ * would have been on the idle tree. | |
+ */ | |
+ entity->start = st->vtime; | |
+ st->wsum += entity->weight; | |
+ bfq_get_entity(entity); | |
+ | |
+ BUG_ON(entity->on_st); | |
+ entity->on_st = 1; | |
+ } | |
+ | |
+ st = __bfq_entity_update_weight_prio(st, entity); | |
+ bfq_calc_finish(entity, entity->budget); | |
+ bfq_active_insert(st, entity); | |
+} | |
+ | |
+/** | |
+ * bfq_activate_entity - activate an entity and its ancestors if necessary. | |
+ * @entity: the entity to activate. | |
+ * | |
+ * Activate @entity and all the entities on the path from it to the root. | |
+ */ | |
+static void bfq_activate_entity(struct bfq_entity *entity) | |
+{ | |
+ struct bfq_sched_data *sd; | |
+ | |
+ for_each_entity(entity) { | |
+ __bfq_activate_entity(entity); | |
+ | |
+ sd = entity->sched_data; | |
+ if (!bfq_update_next_in_service(sd)) | |
+ /* | |
+ * No need to propagate the activation to the | |
+ * upper entities, as they will be updated when | |
+ * the in-service entity is rescheduled. | |
+ */ | |
+ break; | |
+ } | |
+} | |
+ | |
+/** | |
+ * __bfq_deactivate_entity - deactivate an entity from its service tree. | |
+ * @entity: the entity to deactivate. | |
+ * @requeue: if false, the entity will not be put into the idle tree. | |
+ * | |
+ * Deactivate an entity, independently from its previous state. If the | |
+ * entity was not on a service tree just return, otherwise if it is on | |
+ * any scheduler tree, extract it from that tree, and if necessary | |
+ * and if the caller specified @requeue, put it on the idle tree. | |
+ * | |
+ * Return %1 if the caller should update the entity hierarchy, i.e., | |
+ * if the entity was in service or if it was the next_in_service for | |
+ * its sched_data; return %0 otherwise. | |
+ */ | |
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue) | |
+{ | |
+ struct bfq_sched_data *sd = entity->sched_data; | |
+ struct bfq_service_tree *st = bfq_entity_service_tree(entity); | |
+ int was_in_service = entity == sd->in_service_entity; | |
+ int ret = 0; | |
+ | |
+ if (!entity->on_st) | |
+ return 0; | |
+ | |
+ BUG_ON(was_in_service && entity->tree != NULL); | |
+ | |
+ if (was_in_service) { | |
+ bfq_calc_finish(entity, entity->service); | |
+ sd->in_service_entity = NULL; | |
+ } else if (entity->tree == &st->active) | |
+ bfq_active_extract(st, entity); | |
+ else if (entity->tree == &st->idle) | |
+ bfq_idle_extract(st, entity); | |
+ else if (entity->tree != NULL) | |
+ BUG(); | |
+ | |
+ if (was_in_service || sd->next_in_service == entity) | |
+ ret = bfq_update_next_in_service(sd); | |
+ | |
+ if (!requeue || !bfq_gt(entity->finish, st->vtime)) | |
+ bfq_forget_entity(st, entity); | |
+ else | |
+ bfq_idle_insert(st, entity); | |
+ | |
+ BUG_ON(sd->in_service_entity == entity); | |
+ BUG_ON(sd->next_in_service == entity); | |
+ | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * bfq_deactivate_entity - deactivate an entity. | |
+ * @entity: the entity to deactivate. | |
+ * @requeue: true if the entity can be put on the idle tree | |
+ */ | |
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue) | |
+{ | |
+ struct bfq_sched_data *sd; | |
+ struct bfq_entity *parent; | |
+ | |
+ for_each_entity_safe(entity, parent) { | |
+ sd = entity->sched_data; | |
+ | |
+ if (!__bfq_deactivate_entity(entity, requeue)) | |
+ /* | |
+ * The parent entity is still backlogged, and | |
+ * we don't need to update it as it is still | |
+ * in service. | |
+ */ | |
+ break; | |
+ | |
+ if (sd->next_in_service != NULL) | |
+ /* | |
+ * The parent entity is still backlogged and | |
+ * the budgets on the path towards the root | |
+ * need to be updated. | |
+ */ | |
+ goto update; | |
+ | |
+ /* | |
+ * If we reach this point, the parent is no longer backlogged and | |
+ * we want to propagate the dequeue upwards. | |
+ */ | |
+ requeue = 1; | |
+ } | |
+ | |
+ return; | |
+ | |
+update: | |
+ entity = parent; | |
+ for_each_entity(entity) { | |
+ __bfq_activate_entity(entity); | |
+ | |
+ sd = entity->sched_data; | |
+ if (!bfq_update_next_in_service(sd)) | |
+ break; | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_update_vtime - update vtime if necessary. | |
+ * @st: the service tree to act upon. | |
+ * | |
+ * If necessary update the service tree vtime to have at least one | |
+ * eligible entity, skipping to its start time. Assumes that the | |
+ * active tree of the device is not empty. | |
+ * | |
+ * NOTE: this hierarchical implementation updates vtimes quite often, | |
+ * we may end up with reactivated processes getting timestamps after a | |
+ * vtime skip done because we needed a ->first_active entity on some | |
+ * intermediate node. | |
+ */ | |
+static void bfq_update_vtime(struct bfq_service_tree *st) | |
+{ | |
+ struct bfq_entity *entry; | |
+ struct rb_node *node = st->active.rb_node; | |
+ | |
+ entry = rb_entry(node, struct bfq_entity, rb_node); | |
+ if (bfq_gt(entry->min_start, st->vtime)) { | |
+ st->vtime = entry->min_start; | |
+ bfq_forget_idle(st); | |
+ } | |
+} | |
+ | |
+/** | |
+ * bfq_first_active_entity - find the eligible entity with | |
+ * the smallest finish time | |
+ * @st: the service tree to select from. | |
+ * | |
+ * This function searches the first schedulable entity, starting from the | |
+ * root of the tree and going on the left every time on this side there is | |
+ * a subtree with at least one eligible (start <= vtime) entity. The path on | |
+ * the right is followed only if a) the left subtree contains no eligible | |
+ * entities and b) no eligible entity has been found yet. | |
+ */ | |
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st) | |
+{ | |
+ struct bfq_entity *entry, *first = NULL; | |
+ struct rb_node *node = st->active.rb_node; | |
+ | |
+ while (node != NULL) { | |
+ entry = rb_entry(node, struct bfq_entity, rb_node); | |
+left: | |
+ if (!bfq_gt(entry->start, st->vtime)) | |
+ first = entry; | |
+ | |
+ BUG_ON(bfq_gt(entry->min_start, st->vtime)); | |
+ | |
+ if (node->rb_left != NULL) { | |
+ entry = rb_entry(node->rb_left, | |
+ struct bfq_entity, rb_node); | |
+ if (!bfq_gt(entry->min_start, st->vtime)) { | |
+ node = node->rb_left; | |
+ goto left; | |
+ } | |
+ } | |
+ if (first != NULL) | |
+ break; | |
+ node = node->rb_right; | |
+ } | |
+ | |
+ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active)); | |
+ return first; | |
+} | |
+ | |
+/** | |
+ * __bfq_lookup_next_entity - return the first eligible entity in @st. | |
+ * @st: the service tree. | |
+ * | |
+ * Update the virtual time in @st and return the first eligible entity | |
+ * it contains. | |
+ */ | |
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st, | |
+ bool force) | |
+{ | |
+ struct bfq_entity *entity, *new_next_in_service = NULL; | |
+ | |
+ if (RB_EMPTY_ROOT(&st->active)) | |
+ return NULL; | |
+ | |
+ bfq_update_vtime(st); | |
+ entity = bfq_first_active_entity(st); | |
+ BUG_ON(bfq_gt(entity->start, st->vtime)); | |
+ | |
+ /* | |
+ * If the chosen entity does not match with the sched_data's | |
+ * next_in_service and we are forcedly serving the IDLE priority | |
+ * class tree, bubble up budget update. | |
+ */ | |
+ if (unlikely(force && entity != entity->sched_data->next_in_service)) { | |
+ new_next_in_service = entity; | |
+ for_each_entity(new_next_in_service) | |
+ bfq_update_budget(new_next_in_service); | |
+ } | |
+ | |
+ return entity; | |
+} | |
+ | |
+/** | |
+ * bfq_lookup_next_entity - return the first eligible entity in @sd. | |
+ * @sd: the sched_data. | |
+ * @extract: if true the returned entity will be also extracted from @sd. | |
+ * | |
+ * NOTE: since we cache the next_in_service entity at each level of the | |
+ * hierarchy, the complexity of the lookup can be decreased with | |
+ * absolutely no effort just returning the cached next_in_service value; | |
+ * we prefer to do full lookups to test the consistency of the data | |
+ * structures. | |
+ */ | |
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd, | |
+ int extract, | |
+ struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_service_tree *st = sd->service_tree; | |
+ struct bfq_entity *entity; | |
+ int i = 0; | |
+ | |
+ BUG_ON(sd->in_service_entity != NULL); | |
+ | |
+ if (bfqd != NULL && | |
+ jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) { | |
+ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1, | |
+ true); | |
+ if (entity != NULL) { | |
+ i = BFQ_IOPRIO_CLASSES - 1; | |
+ bfqd->bfq_class_idle_last_service = jiffies; | |
+ sd->next_in_service = entity; | |
+ } | |
+ } | |
+ for (; i < BFQ_IOPRIO_CLASSES; i++) { | |
+ entity = __bfq_lookup_next_entity(st + i, false); | |
+ if (entity != NULL) { | |
+ if (extract) { | |
+ bfq_check_next_in_service(sd, entity); | |
+ bfq_active_extract(st + i, entity); | |
+ sd->in_service_entity = entity; | |
+ sd->next_in_service = NULL; | |
+ } | |
+ break; | |
+ } | |
+ } | |
+ | |
+ return entity; | |
+} | |
+ | |
+/* | |
+ * Get next queue for service. | |
+ */ | |
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd) | |
+{ | |
+ struct bfq_entity *entity = NULL; | |
+ struct bfq_sched_data *sd; | |
+ struct bfq_queue *bfqq; | |
+ | |
+ BUG_ON(bfqd->in_service_queue != NULL); | |
+ | |
+ if (bfqd->busy_queues == 0) | |
+ return NULL; | |
+ | |
+ sd = &bfqd->root_group->sched_data; | |
+ for (; sd != NULL; sd = entity->my_sched_data) { | |
+ entity = bfq_lookup_next_entity(sd, 1, bfqd); | |
+ BUG_ON(entity == NULL); | |
+ entity->service = 0; | |
+ } | |
+ | |
+ bfqq = bfq_entity_to_bfqq(entity); | |
+ BUG_ON(bfqq == NULL); | |
+ | |
+ return bfqq; | |
+} | |
+ | |
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) | |
+{ | |
+ if (bfqd->in_service_bic != NULL) { | |
+ put_io_context(bfqd->in_service_bic->icq.ioc); | |
+ bfqd->in_service_bic = NULL; | |
+ } | |
+ | |
+ bfqd->in_service_queue = NULL; | |
+ del_timer(&bfqd->idle_slice_timer); | |
+} | |
+ | |
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ int requeue) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ | |
+ if (bfqq == bfqd->in_service_queue) | |
+ __bfq_bfqd_reset_in_service(bfqd); | |
+ | |
+ bfq_deactivate_entity(entity, requeue); | |
+} | |
+ | |
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ struct bfq_entity *entity = &bfqq->entity; | |
+ | |
+ bfq_activate_entity(entity); | |
+} | |
+ | |
+/* | |
+ * Called when the bfqq no longer has requests pending, remove it from | |
+ * the service tree. | |
+ */ | |
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq, | |
+ int requeue) | |
+{ | |
+ BUG_ON(!bfq_bfqq_busy(bfqq)); | |
+ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list)); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "del from busy"); | |
+ | |
+ bfq_clear_bfqq_busy(bfqq); | |
+ | |
+ BUG_ON(bfqd->busy_queues == 0); | |
+ bfqd->busy_queues--; | |
+ | |
+ if (!bfqq->dispatched) { | |
+ bfq_weights_tree_remove(bfqd, &bfqq->entity, | |
+ &bfqd->queue_weights_tree); | |
+ if (!blk_queue_nonrot(bfqd->queue)) { | |
+ BUG_ON(!bfqd->busy_in_flight_queues); | |
+ bfqd->busy_in_flight_queues--; | |
+ if (bfq_bfqq_constantly_seeky(bfqq)) { | |
+ BUG_ON(!bfqd-> | |
+ const_seeky_busy_in_flight_queues); | |
+ bfqd->const_seeky_busy_in_flight_queues--; | |
+ } | |
+ } | |
+ } | |
+ if (bfqq->wr_coeff > 1) | |
+ bfqd->wr_busy_queues--; | |
+ | |
+ bfq_deactivate_bfqq(bfqd, bfqq, requeue); | |
+} | |
+ | |
+/* | |
+ * Called when an inactive queue receives a new request. | |
+ */ | |
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq) | |
+{ | |
+ BUG_ON(bfq_bfqq_busy(bfqq)); | |
+ BUG_ON(bfqq == bfqd->in_service_queue); | |
+ | |
+ bfq_log_bfqq(bfqd, bfqq, "add to busy"); | |
+ | |
+ bfq_activate_bfqq(bfqd, bfqq); | |
+ | |
+ bfq_mark_bfqq_busy(bfqq); | |
+ bfqd->busy_queues++; | |
+ | |
+ if (!bfqq->dispatched) { | |
+ if (bfqq->wr_coeff == 1) | |
+ bfq_weights_tree_add(bfqd, &bfqq->entity, | |
+ &bfqd->queue_weights_tree); | |
+ if (!blk_queue_nonrot(bfqd->queue)) { | |
+ bfqd->busy_in_flight_queues++; | |
+ if (bfq_bfqq_constantly_seeky(bfqq)) | |
+ bfqd->const_seeky_busy_in_flight_queues++; | |
+ } | |
+ } | |
+ if (bfqq->wr_coeff > 1) | |
+ bfqd->wr_busy_queues++; | |
+} | |
diff --git a/block/bfq.h b/block/bfq.h | |
new file mode 100644 | |
index 0000000..a193a56 | |
--- /dev/null | |
+++ b/block/bfq.h | |
@@ -0,0 +1,809 @@ | |
+/* | |
+ * BFQ-v7r6 for 3.18.0: data structures and common functions prototypes. | |
+ * | |
+ * Based on ideas and code from CFQ: | |
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> | |
+ * | |
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> | |
+ * Paolo Valente <paolo.valente@unimore.it> | |
+ * | |
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> | |
+ */ | |
+ | |
+#ifndef _BFQ_H | |
+#define _BFQ_H | |
+ | |
+#include <linux/blktrace_api.h> | |
+#include <linux/hrtimer.h> | |
+#include <linux/ioprio.h> | |
+#include <linux/rbtree.h> | |
+ | |
+#define BFQ_IOPRIO_CLASSES 3 | |
+#define BFQ_CL_IDLE_TIMEOUT (HZ/5) | |
+ | |
+#define BFQ_MIN_WEIGHT 1 | |
+#define BFQ_MAX_WEIGHT 1000 | |
+ | |
+#define BFQ_DEFAULT_GRP_WEIGHT 10 | |
+#define BFQ_DEFAULT_GRP_IOPRIO 0 | |
+#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE | |
+ | |
+struct bfq_entity; | |
+ | |
+/** | |
+ * struct bfq_service_tree - per ioprio_class service tree. | |
+ * @active: tree for active entities (i.e., those backlogged). | |
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i). | |
+ * @first_idle: idle entity with minimum F_i. | |
+ * @last_idle: idle entity with maximum F_i. | |
+ * @vtime: scheduler virtual time. | |
+ * @wsum: scheduler weight sum; active and idle entities contribute to it. | |
+ * | |
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each | |
+ * ioprio_class has its own independent scheduler, and so its own | |
+ * bfq_service_tree. All the fields are protected by the queue lock | |
+ * of the containing bfqd. | |
+ */ | |
+struct bfq_service_tree { | |
+ struct rb_root active; | |
+ struct rb_root idle; | |
+ | |
+ struct bfq_entity *first_idle; | |
+ struct bfq_entity *last_idle; | |
+ | |
+ u64 vtime; | |
+ unsigned long wsum; | |
+}; | |
+ | |
+/** | |
+ * struct bfq_sched_data - multi-class scheduler. | |
+ * @in_service_entity: entity in service. | |
+ * @next_in_service: head-of-the-line entity in the scheduler. | |
+ * @service_tree: array of service trees, one per ioprio_class. | |
+ * | |
+ * bfq_sched_data is the basic scheduler queue. It supports three | |
+ * ioprio_classes, and can be used either as a toplevel queue or as | |
+ * an intermediate queue on a hierarchical setup. | |
+ * @next_in_service points to the active entity of the sched_data | |
+ * service trees that will be scheduled next. | |
+ * | |
+ * The supported ioprio_classes are the same as in CFQ, in descending | |
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE. | |
+ * Requests from higher priority queues are served before all the | |
+ * requests from lower priority queues; among requests of the same | |
+ * queue requests are served according to B-WF2Q+. | |
+ * All the fields are protected by the queue lock of the containing bfqd. | |
+ */ | |
+struct bfq_sched_data { | |
+ struct bfq_entity *in_service_entity; | |
+ struct bfq_entity *next_in_service; | |
+ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES]; | |
+}; | |
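+ | |
+/* | |
+ * For illustration: service_tree[] is indexed by ioprio_class - 1 (see | |
+ * bfq_entity_service_tree() below), so IOPRIO_CLASS_RT uses slot 0, | |
+ * IOPRIO_CLASS_BE slot 1 and IOPRIO_CLASS_IDLE slot 2. | |
+ */ | |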
+ | |
+/** | |
+ * struct bfq_weight_counter - counter of the number of all active entities | |
+ * with a given weight. | |
+ * @weight: weight of the entities that this counter refers to. | |
+ * @num_active: number of active entities with this weight. | |
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree | |
+ * and @group_weights_tree). | |
+ */ | |
+struct bfq_weight_counter { | |
+ short int weight; | |
+ unsigned int num_active; | |
+ struct rb_node weights_node; | |
+}; | |
+ | |
+/** | |
+ * struct bfq_entity - schedulable entity. | |
+ * @rb_node: service_tree member. | |
+ * @weight_counter: pointer to the weight counter associated with this entity. | |
+ * @on_st: flag, true if the entity is on a tree (either the active or | |
+ * the idle one of its service_tree). | |
+ * @finish: B-WF2Q+ finish timestamp (aka F_i). | |
+ * @start: B-WF2Q+ start timestamp (aka S_i). | |
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree. | |
+ * @min_start: minimum start time of the (active) subtree rooted at | |
+ * this entity; used for O(log N) lookups into active trees. | |
+ * @service: service received during the last round of service. | |
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight. | |
+ * @weight: weight of the queue | |
+ * @parent: parent entity, for hierarchical scheduling. | |
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the | |
+ * associated scheduler queue, %NULL on leaf nodes. | |
+ * @sched_data: the scheduler queue this entity belongs to. | |
+ * @ioprio: the ioprio in use. | |
+ * @new_weight: when a weight change is requested, the new weight value. | |
+ * @orig_weight: original weight, used to implement weight boosting | |
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value. | |
+ * @ioprio_class: the ioprio_class in use. | |
+ * @new_ioprio_class: when an ioprio_class change is requested, the new | |
+ * ioprio_class value. | |
+ * @ioprio_changed: flag, true when the user requested a weight, ioprio or | |
+ * ioprio_class change. | |
+ * | |
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the | |
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each | |
+ * entity belongs to the sched_data of the parent group in the cgroup | |
+ * hierarchy. Non-leaf entities have also their own sched_data, stored | |
+ * in @my_sched_data. | |
+ * | |
+ * Each entity stores independently its priority values; this would | |
+ * allow different weights on different devices, but this | |
+ * functionality is not exported to userspace for now. Priorities and | |
+ * weights are updated lazily, first storing the new values into the | |
+ * new_* fields, then setting the @ioprio_changed flag. As soon as | |
+ * there is a transition in the entity state that allows the priority | |
+ * update to take place the effective and the requested priority | |
+ * values are synchronized. | |
+ * | |
+ * Unless cgroups are used, the weight value is calculated from the | |
+ * ioprio to export the same interface as CFQ. When dealing with | |
+ * ``well-behaved'' queues (i.e., queues that do not spend too much | |
+ * time to consume their budget and have true sequential behavior, and | |
+ * when there are no external factors breaking anticipation) the | |
+ * relative weights at each level of the cgroups hierarchy should be | |
+ * guaranteed. All the fields are protected by the queue lock of the | |
+ * containing bfqd. | |
+ */ | |
+struct bfq_entity { | |
+ struct rb_node rb_node; | |
+ struct bfq_weight_counter *weight_counter; | |
+ | |
+ int on_st; | |
+ | |
+ u64 finish; | |
+ u64 start; | |
+ | |
+ struct rb_root *tree; | |
+ | |
+ u64 min_start; | |
+ | |
+ unsigned long service, budget; | |
+ unsigned short weight, new_weight; | |
+ unsigned short orig_weight; | |
+ | |
+ struct bfq_entity *parent; | |
+ | |
+ struct bfq_sched_data *my_sched_data; | |
+ struct bfq_sched_data *sched_data; | |
+ | |
+ unsigned short ioprio, new_ioprio; | |
+ unsigned short ioprio_class, new_ioprio_class; | |
+ | |
+ int ioprio_changed; | |
+}; | |
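+ | |
+/* | |
+ * For illustration, applying the F_i = S_i + @budget / @weight rule | |
+ * documented above: an entity with a budget of 8192 (e.g., sectors) and | |
+ * weight 4 gets a finish timestamp 2048 service units after its start | |
+ * timestamp, while an equal-budget entity with weight 8 finishes after | |
+ * only 1024 and is therefore selected earlier among eligible entities. | |
+ */ | |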
+ | |
+struct bfq_group; | |
+ | |
+/** | |
+ * struct bfq_queue - leaf schedulable entity. | |
+ * @ref: reference counter. | |
+ * @bfqd: parent bfq_data. | |
+ * @new_bfqq: shared bfq_queue if queue is cooperating with | |
+ * one or more other queues. | |
+ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree). | |
+ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree). | |
+ * @sort_list: sorted list of pending requests. | |
+ * @next_rq: if fifo isn't expired, next request to serve. | |
+ * @queued: nr of requests queued in @sort_list. | |
+ * @allocated: currently allocated requests. | |
+ * @meta_pending: pending metadata requests. | |
+ * @fifo: fifo list of requests in sort_list. | |
+ * @entity: entity representing this queue in the scheduler. | |
+ * @max_budget: maximum budget allowed from the feedback mechanism. | |
+ * @budget_timeout: budget expiration (in jiffies). | |
+ * @dispatched: number of requests on the dispatch list or inside driver. | |
+ * @flags: status flags. | |
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd. | |
+ * @burst_list_node: node for the device's burst list. | |
+ * @seek_samples: number of seeks sampled | |
+ * @seek_total: sum of the distances of the seeks sampled | |
+ * @seek_mean: mean seek distance | |
+ * @last_request_pos: position of the last request enqueued | |
+ * @requests_within_timer: number of consecutive pairs of request completion | |
+ * and arrival, such that the queue becomes idle | |
+ * after the completion, but the next request arrives | |
+ * within an idle time slice; used only if the queue's | |
+ * IO_bound has been cleared. | |
+ * @pid: pid of the process owning the queue, used for logging purposes. | |
+ * @last_wr_start_finish: start time of the current weight-raising period if | |
+ * the @bfq-queue is being weight-raised, otherwise | |
+ * finish time of the last weight-raising period | |
+ * @wr_cur_max_time: current max raising time for this queue | |
+ * @soft_rt_next_start: minimum time instant such that, only if a new | |
+ * request is enqueued after this time instant in an | |
+ * idle @bfq_queue with no outstanding requests, then | |
+ * the task associated with the queue it is deemed as | |
+ * soft real-time (see the comments to the function | |
+ * bfq_bfqq_softrt_next_start()) | |
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from | |
+ * idle to backlogged | |
+ * @service_from_backlogged: cumulative service received from the @bfq_queue | |
+ * since the last transition from idle to | |
+ * backlogged | |
+ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the | |
+ * queue is shared | |
+ * | |
+ * A bfq_queue is a leaf request queue; it can be associated with an | |
+ * io_context or more, if it is async or shared between cooperating | |
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it | |
+ * does not disappear while a bfqq still references it (mostly to avoid | |
+ * races between request issuing and task migration followed by cgroup | |
+ * destruction). | |
+ * All the fields are protected by the queue lock of the containing bfqd. | |
+ */ | |
+struct bfq_queue { | |
+ atomic_t ref; | |
+ struct bfq_data *bfqd; | |
+ | |
+ /* fields for cooperating queues handling */ | |
+ struct bfq_queue *new_bfqq; | |
+ struct rb_node pos_node; | |
+ struct rb_root *pos_root; | |
+ | |
+ struct rb_root sort_list; | |
+ struct request *next_rq; | |
+ int queued[2]; | |
+ int allocated[2]; | |
+ int meta_pending; | |
+ struct list_head fifo; | |
+ | |
+ struct bfq_entity entity; | |
+ | |
+ unsigned long max_budget; | |
+ unsigned long budget_timeout; | |
+ | |
+ int dispatched; | |
+ | |
+ unsigned int flags; | |
+ | |
+ struct list_head bfqq_list; | |
+ | |
+ struct hlist_node burst_list_node; | |
+ | |
+ unsigned int seek_samples; | |
+ u64 seek_total; | |
+ sector_t seek_mean; | |
+ sector_t last_request_pos; | |
+ | |
+ unsigned int requests_within_timer; | |
+ | |
+ pid_t pid; | |
+ struct bfq_io_cq *bic; | |
+ | |
+ /* weight-raising fields */ | |
+ unsigned long wr_cur_max_time; | |
+ unsigned long soft_rt_next_start; | |
+ unsigned long last_wr_start_finish; | |
+ unsigned int wr_coeff; | |
+ unsigned long last_idle_bklogged; | |
+ unsigned long service_from_backlogged; | |
+}; | |
+ | |
+/** | |
+ * struct bfq_ttime - per process thinktime stats. | |
+ * @ttime_total: total process thinktime | |
+ * @ttime_samples: number of thinktime samples | |
+ * @ttime_mean: average process thinktime | |
+ */ | |
+struct bfq_ttime { | |
+ unsigned long last_end_request; | |
+ | |
+ unsigned long ttime_total; | |
+ unsigned long ttime_samples; | |
+ unsigned long ttime_mean; | |
+}; | |
+ | |
+/** | |
+ * struct bfq_io_cq - per (request_queue, io_context) structure. | |
+ * @icq: associated io_cq structure | |
+ * @bfqq: array of two process queues, the sync and the async | |
+ * @ttime: associated @bfq_ttime struct | |
+ * @wr_time_left: snapshot of the time left before weight raising ends | |
+ * for the sync queue associated to this process; this | |
+ * snapshot is taken to remember this value while the weight | |
+ * raising is suspended because the queue is merged with a | |
+ * shared queue, and is used to set @raising_cur_max_time | |
+ * when the queue is split from the shared queue and its | |
+ * weight is raised again | |
+ * @saved_idle_window: same purpose as the previous field for the idle | |
+ * window | |
+ * @saved_IO_bound: same purpose as the previous two fields for the I/O | |
+ * bound classification of a queue | |
+ * @saved_in_large_burst: same purpose as the previous fields for the | |
+ * value of the field keeping the queue's belonging | |
+ * to a large burst | |
+ * @was_in_burst_list: true if the queue belonged to a burst list | |
+ * before its merge with another cooperating queue | |
+ * @cooperations: counter of consecutive successful queue merges underwent | |
+ * by any of the process' @bfq_queues | |
+ * @failed_cooperations: counter of consecutive failed queue merges of any | |
+ * of the process' @bfq_queues | |
+ */ | |
+struct bfq_io_cq { | |
+ struct io_cq icq; /* must be the first member */ | |
+ struct bfq_queue *bfqq[2]; | |
+ struct bfq_ttime ttime; | |
+ int ioprio; | |
+ | |
+ unsigned int wr_time_left; | |
+ bool saved_idle_window; | |
+ bool saved_IO_bound; | |
+ | |
+ bool saved_in_large_burst; | |
+ bool was_in_burst_list; | |
+ | |
+ unsigned int cooperations; | |
+ unsigned int failed_cooperations; | |
+}; | |
+ | |
+enum bfq_device_speed { | |
+ BFQ_BFQD_FAST, | |
+ BFQ_BFQD_SLOW, | |
+}; | |
+ | |
+/** | |
+ * struct bfq_data - per device data structure. | |
+ * @queue: request queue for the managed device. | |
+ * @root_group: root bfq_group for the device. | |
+ * @rq_pos_tree: rbtree sorted by next_request position, used when | |
+ * determining if two or more queues have interleaving | |
+ * requests (see bfq_close_cooperator()). | |
+ * @active_numerous_groups: number of bfq_groups containing more than one | |
+ * active @bfq_entity. | |
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by | |
+ * weight. Used to keep track of whether all @bfq_queues | |
+ * have the same weight. The tree contains one counter | |
+ * for each distinct weight associated to some active | |
+ * and not weight-raised @bfq_queue (see the comments to | |
+ * the functions bfq_weights_tree_[add|remove] for | |
+ * further details). | |
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted | |
+ * by weight. Used to keep track of whether all | |
+ * @bfq_groups have the same weight. The tree contains | |
+ * one counter for each distinct weight associated to | |
+ * some active @bfq_group (see the comments to the | |
+ * functions bfq_weights_tree_[add|remove] for further | |
+ * details). | |
+ * @busy_queues: number of bfq_queues containing requests (including the | |
+ * queue in service, even if it is idling). | |
+ * @busy_in_flight_queues: number of @bfq_queues containing pending or | |
+ * in-flight requests, plus the @bfq_queue in | |
+ * service, even if idle but waiting for the | |
+ * possible arrival of its next sync request. This | |
+ * field is updated only if the device is rotational, | |
+ * but used only if the device is also NCQ-capable. | |
+ * The reason why the field is updated also for non- | |
+ * NCQ-capable rotational devices is related to the | |
+ * fact that the value of @hw_tag may be set also | |
+ * later than when busy_in_flight_queues may need to | |
+ * be incremented for the first time(s). Taking also | |
+ * this possibility into account, to avoid unbalanced | |
+ * increments/decrements, would imply more overhead | |
+ * than just updating busy_in_flight_queues | |
+ * regardless of the value of @hw_tag. | |
+ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues | |
+ * (that is, seeky queues that expired | |
+ * for budget timeout at least once) | |
+ * containing pending or in-flight | |
+ * requests, including the in-service | |
+ * @bfq_queue if constantly seeky. This | |
+ * field is updated only if the device | |
+ * is rotational, but used only if the | |
+ * device is also NCQ-capable (see the | |
+ * comments to @busy_in_flight_queues). | |
+ * @wr_busy_queues: number of weight-raised busy @bfq_queues. | |
+ * @queued: number of queued requests. | |
+ * @rq_in_driver: number of requests dispatched and waiting for completion. | |
+ * @sync_flight: number of sync requests in the driver. | |
+ * @max_rq_in_driver: max number of reqs in driver in the last | |
+ * @hw_tag_samples completed requests. | |
+ * @hw_tag_samples: nr of samples used to calculate hw_tag. | |
+ * @hw_tag: flag set to one if the driver is showing a queueing behavior. | |
+ * @budgets_assigned: number of budgets assigned. | |
+ * @idle_slice_timer: timer set when idling for the next sequential request | |
+ * from the queue in service. | |
+ * @unplug_work: delayed work to restart dispatching on the request queue. | |
+ * @in_service_queue: bfq_queue in service. | |
+ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue. | |
+ * @last_position: on-disk position of the last served request. | |
+ * @last_budget_start: beginning of the last budget. | |
+ * @last_idling_start: beginning of the last idle slice. | |
+ * @peak_rate: peak transfer rate observed for a budget. | |
+ * @peak_rate_samples: number of samples used to calculate @peak_rate. | |
+ * @bfq_max_budget: maximum budget allotted to a bfq_queue before | |
+ * rescheduling. | |
+ * @group_list: list of all the bfq_groups active on the device. | |
+ * @active_list: list of all the bfq_queues active on the device. | |
+ * @idle_list: list of all the bfq_queues idle on the device. | |
+ * @bfq_quantum: max number of requests dispatched per dispatch round. | |
+ * @bfq_fifo_expire: timeout for async/sync requests; when it expires | |
+ * requests are served in fifo order. | |
+ * @bfq_back_penalty: weight of backward seeks wrt forward ones. | |
+ * @bfq_back_max: maximum allowed backward seek. | |
+ * @bfq_slice_idle: maximum idling time. | |
+ * @bfq_user_max_budget: user-configured max budget value | |
+ * (0 for auto-tuning). | |
+ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to | |
+ * async queues. | |
+ * @bfq_timeout: timeout for bfq_queues to consume their budget; used | |
+ * to prevent seeky queues from imposing long latencies on | |
+ * well-behaved ones (this also implies that seeky queues cannot | |
+ * receive guarantees in the service domain; after a timeout | |
+ * they are charged for the whole allocated budget, to try | |
+ * to preserve a behavior reasonably fair among them, but | |
+ * without service-domain guarantees). | |
+ * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is | |
+ * no more granted any weight-raising. | |
+ * @bfq_failed_cooperations: number of consecutive failed cooperation | |
+ * chances after which weight-raising is restored | |
+ * to a queue subject to more than bfq_coop_thresh | |
+ * queue merges. | |
+ * @bfq_requests_within_timer: number of consecutive requests that must be | |
+ * issued within the idle time slice to set | |
+ * again idling to a queue which was marked as | |
+ * non-I/O-bound (see the definition of the | |
+ * IO_bound flag for further details). | |
+ * @last_ins_in_burst: last time at which a queue entered the current | |
+ * burst of queues being activated shortly after | |
+ * each other; for more details about this and the | |
+ * following parameters related to a burst of | |
+ * activations, see the comments to the function | |
+ * @bfq_handle_burst. | |
+ * @bfq_burst_interval: reference time interval used to decide whether a | |
+ * queue has been activated shortly after | |
+ * @last_ins_in_burst. | |
+ * @burst_size: number of queues in the current burst of queue activations. | |
+ * @bfq_large_burst_thresh: maximum burst size above which the current | |
+ * queue-activation burst is deemed as 'large'. | |
+ * @large_burst: true if a large queue-activation burst is in progress. | |
+ * @burst_list: head of the burst list (as for the above fields, more details | |
+ * in the comments to the function bfq_handle_burst). | |
+ * @low_latency: if set to true, low-latency heuristics are enabled. | |
+ * @bfq_wr_coeff: maximum factor by which the weight of a weight-raised | |
+ * queue is multiplied. | |
+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies). | |
+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes. | |
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising | |
+ * may be reactivated for a queue (in jiffies). | |
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals | |
+ * after which weight-raising may be | |
+ * reactivated for an already busy queue | |
+ * (in jiffies). | |
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue, | |
+ * sectors per seconds. | |
+ * @RT_prod: cached value of the product R*T used for computing the maximum | |
+ * duration of the weight raising automatically. | |
+ * @device_speed: device-speed class for the low-latency heuristic. | |
+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions. | |
+ * | |
+ * All the fields are protected by the @queue lock. | |
+ */ | |
+struct bfq_data { | |
+ struct request_queue *queue; | |
+ | |
+ struct bfq_group *root_group; | |
+ struct rb_root rq_pos_tree; | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+ int active_numerous_groups; | |
+#endif | |
+ | |
+ struct rb_root queue_weights_tree; | |
+ struct rb_root group_weights_tree; | |
+ | |
+ int busy_queues; | |
+ int busy_in_flight_queues; | |
+ int const_seeky_busy_in_flight_queues; | |
+ int wr_busy_queues; | |
+ int queued; | |
+ int rq_in_driver; | |
+ int sync_flight; | |
+ | |
+ int max_rq_in_driver; | |
+ int hw_tag_samples; | |
+ int hw_tag; | |
+ | |
+ int budgets_assigned; | |
+ | |
+ struct timer_list idle_slice_timer; | |
+ struct work_struct unplug_work; | |
+ | |
+ struct bfq_queue *in_service_queue; | |
+ struct bfq_io_cq *in_service_bic; | |
+ | |
+ sector_t last_position; | |
+ | |
+ ktime_t last_budget_start; | |
+ ktime_t last_idling_start; | |
+ int peak_rate_samples; | |
+ u64 peak_rate; | |
+ unsigned long bfq_max_budget; | |
+ | |
+ struct hlist_head group_list; | |
+ struct list_head active_list; | |
+ struct list_head idle_list; | |
+ | |
+ unsigned int bfq_quantum; | |
+ unsigned int bfq_fifo_expire[2]; | |
+ unsigned int bfq_back_penalty; | |
+ unsigned int bfq_back_max; | |
+ unsigned int bfq_slice_idle; | |
+ u64 bfq_class_idle_last_service; | |
+ | |
+ unsigned int bfq_user_max_budget; | |
+ unsigned int bfq_max_budget_async_rq; | |
+ unsigned int bfq_timeout[2]; | |
+ | |
+ unsigned int bfq_coop_thresh; | |
+ unsigned int bfq_failed_cooperations; | |
+ unsigned int bfq_requests_within_timer; | |
+ | |
+ unsigned long last_ins_in_burst; | |
+ unsigned long bfq_burst_interval; | |
+ int burst_size; | |
+ unsigned long bfq_large_burst_thresh; | |
+ bool large_burst; | |
+ struct hlist_head burst_list; | |
+ | |
+ bool low_latency; | |
+ | |
+ /* parameters of the low_latency heuristics */ | |
+ unsigned int bfq_wr_coeff; | |
+ unsigned int bfq_wr_max_time; | |
+ unsigned int bfq_wr_rt_max_time; | |
+ unsigned int bfq_wr_min_idle_time; | |
+ unsigned long bfq_wr_min_inter_arr_async; | |
+ unsigned int bfq_wr_max_softrt_rate; | |
+ u64 RT_prod; | |
+ enum bfq_device_speed device_speed; | |
+ | |
+ struct bfq_queue oom_bfqq; | |
+}; | |
+ | |
+enum bfqq_state_flags { | |
+ BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */ | |
+ BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */ | |
+ BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */ | |
+ BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */ | |
+ BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */ | |
+ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */ | |
+ BFQ_BFQQ_FLAG_sync, /* synchronous queue */ | |
+ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */ | |
+ BFQ_BFQQ_FLAG_IO_bound, /* | |
+ * bfqq has timed-out at least once | |
+ * having consumed at most 2/10 of | |
+ * its budget | |
+ */ | |
+ BFQ_BFQQ_FLAG_in_large_burst, /* | |
+ * bfqq activated in a large burst, | |
+ * see comments to bfq_handle_burst. | |
+ */ | |
+ BFQ_BFQQ_FLAG_constantly_seeky, /* | |
+ * bfqq has proved to be slow and | |
+ * seeky until budget timeout | |
+ */ | |
+ BFQ_BFQQ_FLAG_softrt_update, /* | |
+ * may need softrt-next-start | |
+ * update | |
+ */ | |
+ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */ | |
+ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */ | |
+ BFQ_BFQQ_FLAG_just_split, /* queue has just been split */ | |
+}; | |
+ | |
+#define BFQ_BFQQ_FNS(name) \ | |
+static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \ | |
+{ \ | |
+ (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \ | |
+} \ | |
+static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \ | |
+{ \ | |
+ (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \ | |
+} \ | |
+static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq) \ | |
+{ \ | |
+ return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \ | |
+} | |
+ | |
+BFQ_BFQQ_FNS(busy); | |
+BFQ_BFQQ_FNS(wait_request); | |
+BFQ_BFQQ_FNS(must_alloc); | |
+BFQ_BFQQ_FNS(fifo_expire); | |
+BFQ_BFQQ_FNS(idle_window); | |
+BFQ_BFQQ_FNS(prio_changed); | |
+BFQ_BFQQ_FNS(sync); | |
+BFQ_BFQQ_FNS(budget_new); | |
+BFQ_BFQQ_FNS(IO_bound); | |
+BFQ_BFQQ_FNS(in_large_burst); | |
+BFQ_BFQQ_FNS(constantly_seeky); | |
+BFQ_BFQQ_FNS(coop); | |
+BFQ_BFQQ_FNS(split_coop); | |
+BFQ_BFQQ_FNS(just_split); | |
+BFQ_BFQQ_FNS(softrt_update); | |
+#undef BFQ_BFQQ_FNS | |
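+ | |
+/* | |
+ * For illustration: each BFQ_BFQQ_FNS(name) invocation above expands to | |
+ * three helpers; e.g. BFQ_BFQQ_FNS(busy) defines bfq_mark_bfqq_busy(), | |
+ * bfq_clear_bfqq_busy() and bfq_bfqq_busy(), which set, clear and test | |
+ * the BFQ_BFQQ_FLAG_busy bit in bfqq->flags. | |
+ */ | |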
+ | |
+/* Logging facilities. */ | |
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \ | |
+ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args) | |
+ | |
+#define bfq_log(bfqd, fmt, args...) \ | |
+ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args) | |
+ | |
+/* Expiration reasons. */ | |
+enum bfqq_expiration { | |
+ BFQ_BFQQ_TOO_IDLE = 0, /* | |
+ * queue has been idling for | |
+ * too long | |
+ */ | |
+ BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */ | |
+ BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */ | |
+ BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */ | |
+}; | |
+ | |
+#ifdef CONFIG_CGROUP_BFQIO | |
+/** | |
+ * struct bfq_group - per (device, cgroup) data structure. | |
+ * @entity: schedulable entity to insert into the parent group sched_data. | |
+ * @sched_data: own sched_data, to contain child entities (they may be | |
+ * both bfq_queues and bfq_groups). | |
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data | |
+ * list of the containing cgroup's bfqio_cgroup. | |
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list | |
+ * of the groups active on the same device; used for cleanup. | |
+ * @bfqd: the bfq_data for the device this group acts upon. | |
+ * @async_bfqq: array of async queues for all the tasks belonging to | |
+ * the group, one queue per ioprio value per ioprio_class, | |
+ * except for the idle class that has only one queue. | |
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored). | |
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used | |
+ * to avoid too many special cases during group creation/ | |
+ * migration. | |
+ * @active_entities: number of active entities belonging to the group; | |
+ * unused for the root group. Used to know whether there | |
+ * are groups with more than one active @bfq_entity | |
+ * (see the comments to the function | |
+ * bfq_bfqq_must_not_expire()). | |
+ * | |
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup | |
+ * there is a set of bfq_groups, each one collecting the lower-level | |
+ * entities belonging to the group that are acting on the same device. | |
+ * | |
+ * Locking works as follows: | |
+ * o @group_node is protected by the bfqio_cgroup lock, and is accessed | |
+ * via RCU from its readers. | |
+ * o @bfqd is protected by the queue lock, RCU is used to access it | |
+ * from the readers. | |
+ * o All the other fields are protected by the @bfqd queue lock. | |
+ */ | |
+struct bfq_group { | |
+ struct bfq_entity entity; | |
+ struct bfq_sched_data sched_data; | |
+ | |
+ struct hlist_node group_node; | |
+ struct hlist_node bfqd_node; | |
+ | |
+ void *bfqd; | |
+ | |
+ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; | |
+ struct bfq_queue *async_idle_bfqq; | |
+ | |
+ struct bfq_entity *my_entity; | |
+ | |
+ int active_entities; | |
+}; | |
+ | |
+/** | |
+ * struct bfqio_cgroup - bfq cgroup data structure. | |
+ * @css: subsystem state for bfq in the containing cgroup. | |
+ * @online: flag marked when the subsystem is inserted. | |
+ * @weight: cgroup weight. | |
+ * @ioprio: cgroup ioprio. | |
+ * @ioprio_class: cgroup ioprio_class. | |
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data. | |
+ * @group_data: list containing the bfq_group belonging to this cgroup. | |
+ * | |
+ * @group_data is accessed using RCU, with @lock protecting the updates, | |
+ * @ioprio and @ioprio_class are protected by @lock. | |
+ */ | |
+struct bfqio_cgroup { | |
+ struct cgroup_subsys_state css; | |
+ bool online; | |
+ | |
+ unsigned short weight, ioprio, ioprio_class; | |
+ | |
+ spinlock_t lock; | |
+ struct hlist_head group_data; | |
+}; | |
+#else | |
+struct bfq_group { | |
+ struct bfq_sched_data sched_data; | |
+ | |
+ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; | |
+ struct bfq_queue *async_idle_bfqq; | |
+}; | |
+#endif | |
+ | |
+static inline struct bfq_service_tree * | |
+bfq_entity_service_tree(struct bfq_entity *entity) | |
+{ | |
+ struct bfq_sched_data *sched_data = entity->sched_data; | |
+ unsigned int idx = entity->ioprio_class - 1; | |
+ | |
+ BUG_ON(idx >= BFQ_IOPRIO_CLASSES); | |
+ BUG_ON(sched_data == NULL); | |
+ | |
+ return sched_data->service_tree + idx; | |
+} | |
+ | |
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, | |
+ bool is_sync) | |
+{ | |
+ return bic->bfqq[is_sync]; | |
+} | |
+ | |
+static inline void bic_set_bfqq(struct bfq_io_cq *bic, | |
+ struct bfq_queue *bfqq, bool is_sync) | |
+{ | |
+ bic->bfqq[is_sync] = bfqq; | |
+} | |
+ | |
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic) | |
+{ | |
+ return bic->icq.q->elevator->elevator_data; | |
+} | |
+ | |
+/** | |
+ * bfq_get_bfqd_locked - get a lock on a bfqd using an RCU-protected pointer. | |
+ * @ptr: a pointer to a bfqd. | |
+ * @flags: storage for the flags to be saved. | |
+ * | |
+ * This function allows bfqg->bfqd to be protected by the | |
+ * queue lock of the bfqd it references; the pointer is dereferenced | |
+ * under RCU, so the storage for bfqd is assured to be safe as long | |
+ * as the RCU read side critical section does not end. After the | |
+ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be | |
+ * sure that no other writer accessed it. If we raced with a writer, | |
+ * the function returns NULL, with the queue unlocked, otherwise it | |
+ * returns the dereferenced pointer, with the queue locked. | |
+ */ | |
+static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr, | |
+ unsigned long *flags) | |
+{ | |
+ struct bfq_data *bfqd; | |
+ | |
+ rcu_read_lock(); | |
+ bfqd = rcu_dereference(*(struct bfq_data **)ptr); | |
+ | |
+ if (bfqd != NULL) { | |
+ spin_lock_irqsave(bfqd->queue->queue_lock, *flags); | |
+ if (*ptr == bfqd) | |
+ goto out; | |
+ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags); | |
+ } | |
+ | |
+ bfqd = NULL; | |
+out: | |
+ rcu_read_unlock(); | |
+ return bfqd; | |
+} | |
+ | |
+static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd, | |
+ unsigned long *flags) | |
+{ | |
+ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags); | |
+} | |
+ | |
+static void bfq_changed_ioprio(struct bfq_io_cq *bic); | |
+static void bfq_put_queue(struct bfq_queue *bfqq); | |
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq); | |
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg, int is_sync, | |
+ struct bfq_io_cq *bic, gfp_t gfp_mask); | |
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd, | |
+ struct bfq_group *bfqg); | |
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg); | |
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq); | |
+ | |
+#endif /* _BFQ_H */ | |
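The RCU-plus-recheck protocol documented for bfq_get_bfqd_locked() above is easiest to see from the caller's side. Below is a minimal sketch of how the get/put pair is meant to bracket access to a group's bfqd; the function name example_bfqg_access is hypothetical and the sketch is an illustration, not code from the patch.

	static void example_bfqg_access(struct bfq_group *bfqg)
	{
		unsigned long flags;
		struct bfq_data *bfqd;

		/* RCU dereference + queue lock + recheck, as described above */
		bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
		if (bfqd == NULL)
			return;	/* raced with a writer; the queue is left unlocked */

		/* bfqd->queue->queue_lock is held here, so bfqd cannot go away */

		bfq_put_bfqd_unlock(bfqd, &flags);	/* drops the queue lock */
	}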
diff --git a/distro/archlinux/PKGBUILD b/distro/archlinux/PKGBUILD | |
new file mode 100644 | |
index 0000000..bedfd7f | |
--- /dev/null | |
+++ b/distro/archlinux/PKGBUILD | |
@@ -0,0 +1,70 @@ | |
+pkgname=('linux-pf' 'linux-pf-headers') | |
+pkgver=3.18.0 | |
+pkgrel=0 | |
+_pkgsuffix="-pf$pkgrel" | |
+pkgdesc="pf-kernel with modules" | |
+arch=('i686' 'x86_64') | |
+makedepends=('xz' 'rsync' 'bc') | |
+options=('!strip') | |
+license=('GPL') | |
+url="http://pf.natalenko.name/" | |
+ | |
+build() { | |
+ # Go to the kernel tree root | |
+ cd $startdir | |
+ | |
+ # Disable depmod in the kernel's scripts/depmod.sh during the build (a trick borrowed from Arch) | |
+ sed -i '2iexit 0' scripts/depmod.sh | |
+ | |
+ # Detect CPU count automatically | |
+ CPUS_COUNT=$(grep -c '^processor' /proc/cpuinfo) | |
+ echo "Compiling using $CPUS_COUNT thread(s)" | |
+ LOCALVERSION="" make -j$CPUS_COUNT bzImage modules || return 1 | |
+} | |
+ | |
+package_linux-pf() { | |
+ depends=('coreutils' 'linux-firmware' 'kmod' 'mkinitcpio') | |
+ provides=('linux-pf') | |
+ install='linux-pf.install' | |
+ | |
+ cd $startdir | |
+ | |
+ # Note that modules are in /usr/lib/modules now | |
+ mkdir -p $pkgdir/{usr/lib/modules,boot} | |
+ make INSTALL_MOD_PATH=$pkgdir/usr modules_install || return 1 | |
+ | |
+ # Running depmod for installed modules | |
+ depmod -b "$pkgdir/usr" -F System.map "$pkgver$_pkgsuffix" | |
+ | |
+ # There's no separation of firmware by kernel version - comment this line | |
+ # out if you intend to use the built kernel exclusively, otherwise there | |
+ # will be file conflicts with the existing kernel | |
+ rm -rf $pkgdir/usr/lib/firmware | |
+ | |
+ rm -f $pkgdir/usr/lib/modules/$pkgver$_pkgsuffix/{source,build} | |
+ | |
+ install -Dm644 "System.map" "$pkgdir/boot/System.map-linux-pf" | |
+ install -Dm644 "arch/x86/boot/bzImage" "$pkgdir/boot/vmlinuz-linux-pf" | |
+ install -Dm644 "distro/archlinux/linux-pf.preset" "$pkgdir/etc/mkinitcpio.d/linux-pf.preset" | |
+} | |
+ | |
+package_linux-pf-headers() { | |
+ provides=('linux-pf-headers') | |
+ | |
+ cd $startdir | |
+ | |
+ mkdir -p $pkgdir/usr/lib/modules/$pkgver$_pkgsuffix/ | |
+ cd $pkgdir/usr/lib/modules/$pkgver$_pkgsuffix/ | |
+ ln -s ../../../src/linux-$pkgver$_pkgsuffix build | |
+ | |
+ cd $startdir | |
+ | |
+ mkdir -p $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ make INSTALL_HDR_PATH=$pkgdir/usr/src/linux-$pkgver$_pkgsuffix headers_install | |
+ install -Dm644 .config $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ install -Dm644 Module.symvers $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ rsync -a --include='*/' --include='Kbuild*' --include='Kconfig*' --include='*Makefile*' --include='auto.conf' --include='autoconf.h' --include='kconfig.h' --include='asm-offsets.s' --exclude='*' --prune-empty-dirs . $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ rsync -a scripts $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ rsync -a include $pkgdir/usr/src/linux-$pkgver$_pkgsuffix | |
+ rsync -a arch/x86/include $pkgdir/usr/src/linux-$pkgver$_pkgsuffix/arch/x86 | |
+} | |
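A usage note, stated as an assumption about the intended workflow rather than something spelled out in the patch: since build() and both package functions cd into $startdir and run make from there, this PKGBUILD expects to be driven from the root of the patched kernel tree (for example by copying or symlinking the files from distro/archlinux/ into the tree root and running makepkg there), with linux-pf.install and linux-pf.preset resolvable at the paths referenced above.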
diff --git a/distro/archlinux/linux-pf.install b/distro/archlinux/linux-pf.install | |
new file mode 100644 | |
index 0000000..64ca691 | |
--- /dev/null | |
+++ b/distro/archlinux/linux-pf.install | |
@@ -0,0 +1,17 @@ | |
+KERNEL_VERSION="3.18.0" | |
+LOCAL_VERSION="-pf0" | |
+ | |
+post_install () { | |
+ echo ">>> Updating module dependencies..." | |
+ /sbin/depmod -A -v ${KERNEL_VERSION}${LOCAL_VERSION} | |
+ echo ">>> Creating initial ramdisk..." | |
+ mkinitcpio -p linux-pf | |
+} | |
+ | |
+post_upgrade() { | |
+ echo ">>> Updating module dependencies..." | |
+ /sbin/depmod -A -v ${KERNEL_VERSION}${LOCAL_VERSION} | |
+ echo ">>> Creating initial ramdisk..." | |
+ mkinitcpio -p linux-pf | |
+} | |
+ | |
diff --git a/distro/archlinux/linux-pf.preset b/distro/archlinux/linux-pf.preset | |
new file mode 100644 | |
index 0000000..e77e3f3 | |
--- /dev/null | |
+++ b/distro/archlinux/linux-pf.preset | |
@@ -0,0 +1,6 @@ | |
+ALL_config="/etc/mkinitcpio.conf" | |
+ALL_kver="/boot/vmlinuz-linux-pf" | |
+PRESETS=('default' 'fallback') | |
+default_image="/boot/initramfs-linux-pf.img" | |
+fallback_image="/boot/initramfs-linux-pf-fallback.img" | |
+fallback_options="-S autodetect" | |
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c | |
index 4473eba..7aa9d61 100644 | |
--- a/drivers/cpufreq/cpufreq.c | |
+++ b/drivers/cpufreq/cpufreq.c | |
@@ -25,6 +25,7 @@ | |
#include <linux/kernel_stat.h> | |
#include <linux/module.h> | |
#include <linux/mutex.h> | |
+#include <linux/sched.h> | |
#include <linux/slab.h> | |
#include <linux/suspend.h> | |
#include <linux/tick.h> | |
@@ -1979,6 +1980,12 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy, | |
} | |
out: | |
+ if (likely(retval != -EINVAL)) { | |
+ if (target_freq == policy->max) | |
+ cpu_nonscaling(policy->cpu); | |
+ else | |
+ cpu_scaling(policy->cpu); | |
+ } | |
return retval; | |
} | |
EXPORT_SYMBOL_GPL(__cpufreq_driver_target); | |
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c | |
index 25a70d0..7a8282a 100644 | |
--- a/drivers/cpufreq/cpufreq_conservative.c | |
+++ b/drivers/cpufreq/cpufreq_conservative.c | |
@@ -15,8 +15,8 @@ | |
#include "cpufreq_governor.h" | |
/* Conservative governor macros */ | |
-#define DEF_FREQUENCY_UP_THRESHOLD (80) | |
-#define DEF_FREQUENCY_DOWN_THRESHOLD (20) | |
+#define DEF_FREQUENCY_UP_THRESHOLD (63) | |
+#define DEF_FREQUENCY_DOWN_THRESHOLD (26) | |
#define DEF_FREQUENCY_STEP (5) | |
#define DEF_SAMPLING_DOWN_FACTOR (1) | |
#define MAX_SAMPLING_DOWN_FACTOR (10) | |
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c | |
index ad3f38f..9d6f4b2 100644 | |
--- a/drivers/cpufreq/cpufreq_ondemand.c | |
+++ b/drivers/cpufreq/cpufreq_ondemand.c | |
@@ -19,7 +19,7 @@ | |
#include "cpufreq_governor.h" | |
/* On-demand governor macros */ | |
-#define DEF_FREQUENCY_UP_THRESHOLD (80) | |
+#define DEF_FREQUENCY_UP_THRESHOLD (63) | |
#define DEF_SAMPLING_DOWN_FACTOR (1) | |
#define MAX_SAMPLING_DOWN_FACTOR (100000) | |
#define MICRO_FREQUENCY_UP_THRESHOLD (95) | |
@@ -148,7 +148,7 @@ static void dbs_freq_increase(struct cpufreq_policy *policy, unsigned int freq) | |
} | |
/* | |
- * Every sampling_rate, we check, if current idle time is less than 20% | |
+ * Every sampling_rate, we check, if current idle time is less than 37% | |
* (default), then we try to increase frequency. Else, we adjust the frequency | |
* proportional to load. | |
*/ | |
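For reference (arithmetic implicit in the hunks above, not text from the patch): the 37% figure in the updated comment is simply 100 - 63, the complement of the new DEF_FREQUENCY_UP_THRESHOLD. Lowering the threshold from 80 to 63 makes ondemand raise the frequency as soon as idle time drops below 37% instead of 20%, i.e. it ramps up noticeably earlier under load.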
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c | |
index 27bb6d3..3adf36f 100644 | |
--- a/drivers/cpufreq/intel_pstate.c | |
+++ b/drivers/cpufreq/intel_pstate.c | |
@@ -438,8 +438,13 @@ static void byt_set_pstate(struct cpudata *cpudata, int pstate) | |
vid_fp = clamp_t(int32_t, vid_fp, cpudata->vid.min, cpudata->vid.max); | |
vid = ceiling_fp(vid_fp); | |
- if (pstate > cpudata->pstate.max_pstate) | |
- vid = cpudata->vid.turbo; | |
+ if (pstate < cpudata->pstate.max_pstate) | |
+ cpu_scaling(cpudata->cpu); | |
+ else { | |
+ if (pstate > cpudata->pstate.max_pstate) | |
+ vid = cpudata->vid.turbo; | |
+ cpu_nonscaling(cpudata->cpu); | |
+ } | |
val |= vid; | |
diff --git a/fs/exec.c b/fs/exec.c | |
index 7302b75..84b0df5 100644 | |
--- a/fs/exec.c | |
+++ b/fs/exec.c | |
@@ -19,7 +19,7 @@ | |
* current->executable is only used by the procfs. This allows a dispatch | |
* table to check for several different types of binary formats. We keep | |
* trying until we recognize the file or we run out of supported binary | |
- * formats. | |
+ * formats. | |
*/ | |
#include <linux/slab.h> | |
@@ -56,6 +56,7 @@ | |
#include <linux/pipe_fs_i.h> | |
#include <linux/oom.h> | |
#include <linux/compat.h> | |
+#include <linux/ksm.h> | |
#include <asm/uaccess.h> | |
#include <asm/mmu_context.h> | |
@@ -1128,6 +1129,7 @@ void setup_new_exec(struct linux_binprm * bprm) | |
/* An exec changes our domain. We are no longer part of the thread | |
group */ | |
current->self_exec_id++; | |
+ | |
flush_signal_handlers(current, 0); | |
do_close_on_exec(current->files); | |
} | |
diff --git a/fs/proc/base.c b/fs/proc/base.c | |
index 772efa4..bba6c63 100644 | |
--- a/fs/proc/base.c | |
+++ b/fs/proc/base.c | |
@@ -310,7 +310,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns, | |
struct pid *pid, struct task_struct *task) | |
{ | |
return seq_printf(m, "%llu %llu %lu\n", | |
- (unsigned long long)task->se.sum_exec_runtime, | |
+ (unsigned long long)tsk_seruntime(task), | |
(unsigned long long)task->sched_info.run_delay, | |
task->sched_info.pcount); | |
} | |
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c | |
index aa1eee0..5a9149c 100644 | |
--- a/fs/proc/meminfo.c | |
+++ b/fs/proc/meminfo.c | |
@@ -121,6 +121,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) | |
"SUnreclaim: %8lu kB\n" | |
"KernelStack: %8lu kB\n" | |
"PageTables: %8lu kB\n" | |
+#ifdef CONFIG_UKSM | |
+ "KsmZeroPages: %8lu kB\n" | |
+#endif | |
#ifdef CONFIG_QUICKLIST | |
"Quicklists: %8lu kB\n" | |
#endif | |
@@ -175,6 +178,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) | |
K(global_page_state(NR_SLAB_UNRECLAIMABLE)), | |
global_page_state(NR_KERNEL_STACK) * THREAD_SIZE / 1024, | |
K(global_page_state(NR_PAGETABLE)), | |
+#ifdef CONFIG_UKSM | |
+ K(global_page_state(NR_UKSM_ZERO_PAGES)), | |
+#endif | |
#ifdef CONFIG_QUICKLIST | |
K(quicklist_total_size()), | |
#endif | |
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h | |
index 752e30d..1e7c826 100644 | |
--- a/include/asm-generic/pgtable.h | |
+++ b/include/asm-generic/pgtable.h | |
@@ -537,12 +537,25 @@ extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn, | |
unsigned long size); | |
#endif | |
+#ifdef CONFIG_UKSM | |
+static inline int is_uksm_zero_pfn(unsigned long pfn) | |
+{ | |
+ extern unsigned long uksm_zero_pfn; | |
+ return pfn == uksm_zero_pfn; | |
+} | |
+#else | |
+static inline int is_uksm_zero_pfn(unsigned long pfn) | |
+{ | |
+ return 0; | |
+} | |
+#endif | |
+ | |
#ifdef __HAVE_COLOR_ZERO_PAGE | |
static inline int is_zero_pfn(unsigned long pfn) | |
{ | |
extern unsigned long zero_pfn; | |
unsigned long offset_from_zero_pfn = pfn - zero_pfn; | |
- return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT); | |
+ return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT) || is_uksm_zero_pfn(pfn); | |
} | |
#define my_zero_pfn(addr) page_to_pfn(ZERO_PAGE(addr)) | |
@@ -551,7 +564,7 @@ static inline int is_zero_pfn(unsigned long pfn) | |
static inline int is_zero_pfn(unsigned long pfn) | |
{ | |
extern unsigned long zero_pfn; | |
- return pfn == zero_pfn; | |
+ return (pfn == zero_pfn) || (is_uksm_zero_pfn(pfn)); | |
} | |
static inline unsigned long my_zero_pfn(unsigned long addr) | |
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h | |
index 98c4f9b..13b010d 100644 | |
--- a/include/linux/cgroup_subsys.h | |
+++ b/include/linux/cgroup_subsys.h | |
@@ -35,6 +35,10 @@ SUBSYS(net_cls) | |
SUBSYS(blkio) | |
#endif | |
+#if IS_ENABLED(CONFIG_CGROUP_BFQIO) | |
+SUBSYS(bfqio) | |
+#endif | |
+ | |
#if IS_ENABLED(CONFIG_CGROUP_PERF) | |
SUBSYS(perf_event) | |
#endif | |
diff --git a/include/linux/init_task.h b/include/linux/init_task.h | |
index 77fc43f..5e27ce4 100644 | |
--- a/include/linux/init_task.h | |
+++ b/include/linux/init_task.h | |
@@ -156,8 +156,6 @@ extern struct task_group root_task_group; | |
# define INIT_VTIME(tsk) | |
#endif | |
-#define INIT_TASK_COMM "swapper" | |
- | |
#ifdef CONFIG_RT_MUTEXES | |
# define INIT_RT_MUTEXES(tsk) \ | |
.pi_waiters = RB_ROOT, \ | |
@@ -170,6 +168,68 @@ extern struct task_group root_task_group; | |
* INIT_TASK is used to set up the first task table, touch at | |
* your own risk!. Base=0, limit=0x1fffff (=2MB) | |
*/ | |
+#ifdef CONFIG_SCHED_BFS | |
+#define INIT_TASK_COMM "BFS" | |
+#define INIT_TASK(tsk) \ | |
+{ \ | |
+ .state = 0, \ | |
+ .stack = &init_thread_info, \ | |
+ .usage = ATOMIC_INIT(2), \ | |
+ .flags = PF_KTHREAD, \ | |
+ .prio = NORMAL_PRIO, \ | |
+ .static_prio = MAX_PRIO-20, \ | |
+ .normal_prio = NORMAL_PRIO, \ | |
+ .deadline = 0, \ | |
+ .policy = SCHED_NORMAL, \ | |
+ .cpus_allowed = CPU_MASK_ALL, \ | |
+ .mm = NULL, \ | |
+ .active_mm = &init_mm, \ | |
+ .run_list = LIST_HEAD_INIT(tsk.run_list), \ | |
+ .time_slice = HZ, \ | |
+ .tasks = LIST_HEAD_INIT(tsk.tasks), \ | |
+ INIT_PUSHABLE_TASKS(tsk) \ | |
+ .ptraced = LIST_HEAD_INIT(tsk.ptraced), \ | |
+ .ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \ | |
+ .real_parent = &tsk, \ | |
+ .parent = &tsk, \ | |
+ .children = LIST_HEAD_INIT(tsk.children), \ | |
+ .sibling = LIST_HEAD_INIT(tsk.sibling), \ | |
+ .group_leader = &tsk, \ | |
+ RCU_POINTER_INITIALIZER(real_cred, &init_cred), \ | |
+ RCU_POINTER_INITIALIZER(cred, &init_cred), \ | |
+ .comm = INIT_TASK_COMM, \ | |
+ .thread = INIT_THREAD, \ | |
+ .fs = &init_fs, \ | |
+ .files = &init_files, \ | |
+ .signal = &init_signals, \ | |
+ .sighand = &init_sighand, \ | |
+ .nsproxy = &init_nsproxy, \ | |
+ .pending = { \ | |
+ .list = LIST_HEAD_INIT(tsk.pending.list), \ | |
+ .signal = {{0}}}, \ | |
+ .blocked = {{0}}, \ | |
+ .alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \ | |
+ .journal_info = NULL, \ | |
+ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ | |
+ .pi_lock = __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ | |
+ .timer_slack_ns = 50000, /* 50 usec default slack */ \ | |
+ .pids = { \ | |
+ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \ | |
+ [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \ | |
+ [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \ | |
+ }, \ | |
+ .thread_group = LIST_HEAD_INIT(tsk.thread_group), \ | |
+ .thread_node = LIST_HEAD_INIT(init_signals.thread_head), \ | |
+ INIT_IDS \ | |
+ INIT_PERF_EVENTS(tsk) \ | |
+ INIT_TRACE_IRQFLAGS \ | |
+ INIT_LOCKDEP \ | |
+ INIT_FTRACE_GRAPH \ | |
+ INIT_TRACE_RECURSION \ | |
+ INIT_TASK_RCU_PREEMPT(tsk) \ | |
+} | |
+#else /* CONFIG_SCHED_BFS */ | |
+#define INIT_TASK_COMM "swapper" | |
#define INIT_TASK(tsk) \ | |
{ \ | |
.state = 0, \ | |
@@ -238,7 +298,7 @@ extern struct task_group root_task_group; | |
INIT_RT_MUTEXES(tsk) \ | |
INIT_VTIME(tsk) \ | |
} | |
- | |
+#endif /* CONFIG_SCHED_BFS */ | |
#define INIT_CPU_TIMERS(cpu_timers) \ | |
{ \ | |
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h | |
index beb9ce1..ce2fc3c 100644 | |
--- a/include/linux/ioprio.h | |
+++ b/include/linux/ioprio.h | |
@@ -52,6 +52,8 @@ enum { | |
*/ | |
static inline int task_nice_ioprio(struct task_struct *task) | |
{ | |
+ if (iso_task(task)) | |
+ return 0; | |
return (task_nice(task) + 20) / 5; | |
} | |
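A worked example of the mapping above, as an illustration only: task_nice() ranges from -20 to 19, so (task_nice(task) + 20) / 5 yields best-effort ioprio 0 for nice -20, 4 for nice 0 and 7 for nice 19; the added branch short-circuits this so that SCHED_ISO tasks always get ioprio 0, the highest best-effort I/O priority.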
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h | |
index c367cbd..55feba7 100644 | |
--- a/include/linux/jiffies.h | |
+++ b/include/linux/jiffies.h | |
@@ -163,7 +163,7 @@ static inline u64 get_jiffies_64(void) | |
* Have the 32 bit jiffies value wrap 5 minutes after boot | |
* so jiffies wrap bugs show up earlier. | |
*/ | |
-#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ)) | |
+#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ)) | |
/* | |
* Change timeval to jiffies, trying to avoid the | |
diff --git a/include/linux/ksm.h b/include/linux/ksm.h | |
index 3be6bb1..51557d1 100644 | |
--- a/include/linux/ksm.h | |
+++ b/include/linux/ksm.h | |
@@ -19,21 +19,6 @@ struct mem_cgroup; | |
#ifdef CONFIG_KSM | |
int ksm_madvise(struct vm_area_struct *vma, unsigned long start, | |
unsigned long end, int advice, unsigned long *vm_flags); | |
-int __ksm_enter(struct mm_struct *mm); | |
-void __ksm_exit(struct mm_struct *mm); | |
- | |
-static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) | |
-{ | |
- if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags)) | |
- return __ksm_enter(mm); | |
- return 0; | |
-} | |
- | |
-static inline void ksm_exit(struct mm_struct *mm) | |
-{ | |
- if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) | |
- __ksm_exit(mm); | |
-} | |
/* | |
* A KSM page is one of those write-protected "shared pages" or "merged pages" | |
@@ -76,6 +61,33 @@ struct page *ksm_might_need_to_copy(struct page *page, | |
int rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); | |
void ksm_migrate_page(struct page *newpage, struct page *oldpage); | |
+#ifdef CONFIG_KSM_LEGACY | |
+int __ksm_enter(struct mm_struct *mm); | |
+void __ksm_exit(struct mm_struct *mm); | |
+static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) | |
+{ | |
+ if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags)) | |
+ return __ksm_enter(mm); | |
+ return 0; | |
+} | |
+ | |
+static inline void ksm_exit(struct mm_struct *mm) | |
+{ | |
+ if (test_bit(MMF_VM_MERGEABLE, &mm->flags)) | |
+ __ksm_exit(mm); | |
+} | |
+ | |
+#elif defined(CONFIG_UKSM) | |
+static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) | |
+{ | |
+ return 0; | |
+} | |
+ | |
+static inline void ksm_exit(struct mm_struct *mm) | |
+{ | |
+} | |
+#endif /* !CONFIG_UKSM */ | |
+ | |
#else /* !CONFIG_KSM */ | |
static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm) | |
@@ -123,4 +135,6 @@ static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage) | |
#endif /* CONFIG_MMU */ | |
#endif /* !CONFIG_KSM */ | |
+#include <linux/uksm.h> | |
+ | |
#endif /* __LINUX_KSM_H */ | |
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h | |
index 6e0b286..ebd6243 100644 | |
--- a/include/linux/mm_types.h | |
+++ b/include/linux/mm_types.h | |
@@ -308,6 +308,9 @@ struct vm_area_struct { | |
#ifdef CONFIG_NUMA | |
struct mempolicy *vm_policy; /* NUMA policy for the VMA */ | |
#endif | |
+#ifdef CONFIG_UKSM | |
+ struct vma_slot *uksm_vma_slot; | |
+#endif | |
}; | |
struct core_thread { | |
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h | |
index ffe66e3..355f0e9 100644 | |
--- a/include/linux/mmzone.h | |
+++ b/include/linux/mmzone.h | |
@@ -157,6 +157,9 @@ enum zone_stat_item { | |
WORKINGSET_NODERECLAIM, | |
NR_ANON_TRANSPARENT_HUGEPAGES, | |
NR_FREE_CMA_PAGES, | |
+#ifdef CONFIG_UKSM | |
+ NR_UKSM_ZERO_PAGES, | |
+#endif | |
NR_VM_ZONE_STAT_ITEMS }; | |
/* | |
@@ -865,7 +868,7 @@ static inline int is_highmem_idx(enum zone_type idx) | |
} | |
/** | |
- * is_highmem - helper function to quickly check if a struct zone is a | |
+ * is_highmem - helper function to quickly check if a struct zone is a | |
* highmem zone or not. This is an attempt to keep references | |
* to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. | |
* @zone - pointer to struct zone variable | |
diff --git a/include/linux/sched.h b/include/linux/sched.h | |
index 5e344bb..84a20f3 100644 | |
--- a/include/linux/sched.h | |
+++ b/include/linux/sched.h | |
@@ -290,8 +290,6 @@ extern asmlinkage void schedule_tail(struct task_struct *prev); | |
extern void init_idle(struct task_struct *idle, int cpu); | |
extern void init_idle_bootup_task(struct task_struct *idle); | |
-extern int runqueue_is_locked(int cpu); | |
- | |
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON) | |
extern void nohz_balance_enter_idle(int cpu); | |
extern void set_cpu_sd_state_idle(void); | |
@@ -1239,9 +1237,11 @@ struct task_struct { | |
unsigned int flags; /* per process flags, defined below */ | |
unsigned int ptrace; | |
-#ifdef CONFIG_SMP | |
+#if defined(CONFIG_SMP) || defined(CONFIG_SCHED_BFS) | |
struct llist_node wake_entry; | |
int on_cpu; | |
+#endif | |
+#ifdef CONFIG_SMP | |
struct task_struct *last_wakee; | |
unsigned long wakee_flips; | |
unsigned long wakee_flip_decay_ts; | |
@@ -1249,12 +1249,29 @@ struct task_struct { | |
int wake_cpu; | |
#endif | |
int on_rq; | |
- | |
int prio, static_prio, normal_prio; | |
unsigned int rt_priority; | |
+#ifdef CONFIG_SCHED_BFS | |
+ int time_slice; | |
+ u64 deadline; | |
+ struct list_head run_list; | |
+ u64 last_ran; | |
+ u64 sched_time; /* sched_clock time spent running */ | |
+#ifdef CONFIG_SMT_NICE | |
+ int smt_bias; /* Policy/nice level bias across smt siblings */ | |
+#endif | |
+#ifdef CONFIG_SMP | |
+ bool sticky; /* Soft affined flag */ | |
+#endif | |
+#ifdef CONFIG_HOTPLUG_CPU | |
+ bool zerobound; /* Bound to CPU0 for hotplug */ | |
+#endif | |
+ unsigned long rt_timeout; | |
+#else /* CONFIG_SCHED_BFS */ | |
const struct sched_class *sched_class; | |
struct sched_entity se; | |
struct sched_rt_entity rt; | |
+#endif | |
#ifdef CONFIG_CGROUP_SCHED | |
struct task_group *sched_task_group; | |
#endif | |
@@ -1366,6 +1383,9 @@ struct task_struct { | |
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ | |
cputime_t utime, stime, utimescaled, stimescaled; | |
+#ifdef CONFIG_SCHED_BFS | |
+ unsigned long utime_pc, stime_pc; | |
+#endif | |
cputime_t gtime; | |
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE | |
struct cputime prev_cputime; | |
@@ -1663,6 +1683,63 @@ struct task_struct { | |
#endif | |
}; | |
+#ifdef CONFIG_SCHED_BFS | |
+bool grunqueue_is_locked(void); | |
+void grq_unlock_wait(void); | |
+void cpu_scaling(int cpu); | |
+void cpu_nonscaling(int cpu); | |
+#define tsk_seruntime(t) ((t)->sched_time) | |
+#define tsk_rttimeout(t) ((t)->rt_timeout) | |
+ | |
+static inline void tsk_cpus_current(struct task_struct *p) | |
+{ | |
+} | |
+ | |
+static inline int runqueue_is_locked(int cpu) | |
+{ | |
+ return grunqueue_is_locked(); | |
+} | |
+ | |
+void print_scheduler_version(void); | |
+ | |
+static inline bool iso_task(struct task_struct *p) | |
+{ | |
+ return (p->policy == SCHED_ISO); | |
+} | |
+#else /* CFS */ | |
+extern int runqueue_is_locked(int cpu); | |
+static inline void cpu_scaling(int cpu) | |
+{ | |
+} | |
+ | |
+static inline void cpu_nonscaling(int cpu) | |
+{ | |
+} | |
+#define tsk_seruntime(t) ((t)->se.sum_exec_runtime) | |
+#define tsk_rttimeout(t) ((t)->rt.timeout) | |
+ | |
+static inline void tsk_cpus_current(struct task_struct *p) | |
+{ | |
+ p->nr_cpus_allowed = current->nr_cpus_allowed; | |
+} | |
+ | |
+static inline void print_scheduler_version(void) | |
+{ | |
+ printk(KERN_INFO"CFS CPU scheduler.\n"); | |
+} | |
+ | |
+static inline bool iso_task(struct task_struct *p) | |
+{ | |
+ return false; | |
+} | |
+ | |
+/* Anyone feel like implementing this? */ | |
+static inline bool above_background_load(void) | |
+{ | |
+ return false; | |
+} | |
+#endif /* CONFIG_SCHED_BFS */ | |
+ | |
/* Future-safe accessor for struct task_struct's cpus_allowed. */ | |
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed) | |
@@ -2151,7 +2228,7 @@ extern unsigned long long | |
task_sched_runtime(struct task_struct *task); | |
/* sched_exec is called by processes performing an exec */ | |
-#ifdef CONFIG_SMP | |
+#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS) | |
extern void sched_exec(void); | |
#else | |
#define sched_exec() {} | |
@@ -2943,7 +3020,7 @@ static inline unsigned int task_cpu(const struct task_struct *p) | |
return 0; | |
} | |
-static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) | |
+static inline void set_task_cpu(struct task_struct *p, int cpu) | |
{ | |
} | |
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h | |
index d9cf5a5..7d5d0b8 100644 | |
--- a/include/linux/sched/prio.h | |
+++ b/include/linux/sched/prio.h | |
@@ -19,8 +19,20 @@ | |
*/ | |
#define MAX_USER_RT_PRIO 100 | |
+ | |
+#ifdef CONFIG_SCHED_BFS | |
+/* Note different MAX_RT_PRIO */ | |
+#define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1) | |
+ | |
+#define ISO_PRIO (MAX_RT_PRIO) | |
+#define NORMAL_PRIO (MAX_RT_PRIO + 1) | |
+#define IDLE_PRIO (MAX_RT_PRIO + 2) | |
+#define PRIO_LIMIT ((IDLE_PRIO) + 1) | |
+#else /* CONFIG_SCHED_BFS */ | |
#define MAX_RT_PRIO MAX_USER_RT_PRIO | |
+#endif /* CONFIG_SCHED_BFS */ | |
+ | |
#define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH) | |
#define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2) | |
diff --git a/include/linux/sradix-tree.h b/include/linux/sradix-tree.h | |
new file mode 100644 | |
index 0000000..6780fdb | |
--- /dev/null | |
+++ b/include/linux/sradix-tree.h | |
@@ -0,0 +1,77 @@ | |
+#ifndef _LINUX_SRADIX_TREE_H | |
+#define _LINUX_SRADIX_TREE_H | |
+ | |
+ | |
+#define INIT_SRADIX_TREE(root, mask) \ | |
+do { \ | |
+ (root)->height = 0; \ | |
+ (root)->gfp_mask = (mask); \ | |
+ (root)->rnode = NULL; \ | |
+} while (0) | |
+ | |
+#define ULONG_BITS (sizeof(unsigned long) * 8) | |
+#define SRADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long)) | |
+//#define SRADIX_TREE_MAP_SHIFT 6 | |
+//#define SRADIX_TREE_MAP_SIZE (1UL << SRADIX_TREE_MAP_SHIFT) | |
+//#define SRADIX_TREE_MAP_MASK (SRADIX_TREE_MAP_SIZE-1) | |
+ | |
+struct sradix_tree_node { | |
+ unsigned int height; /* Height from the bottom */ | |
+ unsigned int count; | |
+ unsigned int fulls; /* Number of full sublevel trees */ | |
+ struct sradix_tree_node *parent; | |
+ void *stores[0]; | |
+}; | |
+ | |
+/* A simple radix tree implementation */ | |
+struct sradix_tree_root { | |
+ unsigned int height; | |
+ struct sradix_tree_node *rnode; | |
+ | |
+ /* Last node found to have available empty stores in its sublevels */ | |
+ struct sradix_tree_node *enter_node; | |
+ unsigned int shift; | |
+ unsigned int stores_size; | |
+ unsigned int mask; | |
+ unsigned long min; /* The first hole index */ | |
+ unsigned long num; | |
+ //unsigned long *height_to_maxindex; | |
+ | |
+ /* How the node is allocated and freed. */ | |
+ struct sradix_tree_node *(*alloc)(void); | |
+ void (*free)(struct sradix_tree_node *node); | |
+ | |
+ /* When a new node is added and removed */ | |
+ void (*extend)(struct sradix_tree_node *parent, struct sradix_tree_node *child); | |
+ void (*assign)(struct sradix_tree_node *node, unsigned index, void *item); | |
+ void (*rm)(struct sradix_tree_node *node, unsigned offset); | |
+}; | |
+ | |
+struct sradix_tree_path { | |
+ struct sradix_tree_node *node; | |
+ int offset; | |
+}; | |
+ | |
+static inline | |
+void init_sradix_tree_root(struct sradix_tree_root *root, unsigned long shift) | |
+{ | |
+ root->height = 0; | |
+ root->rnode = NULL; | |
+ root->shift = shift; | |
+ root->stores_size = 1UL << shift; | |
+ root->mask = root->stores_size - 1; | |
+} | |
+ | |
+ | |
+extern void *sradix_tree_next(struct sradix_tree_root *root, | |
+ struct sradix_tree_node *node, unsigned long index, | |
+ int (*iter)(void *, unsigned long)); | |
+ | |
+extern int sradix_tree_enter(struct sradix_tree_root *root, void **item, int num); | |
+ | |
+extern void sradix_tree_delete_from_leaf(struct sradix_tree_root *root, | |
+ struct sradix_tree_node *node, unsigned long index); | |
+ | |
+extern void *sradix_tree_lookup(struct sradix_tree_root *root, unsigned long index); | |
+ | |
+#endif /* _LINUX_SRADIX_TREE_H */ | |
diff --git a/include/linux/uksm.h b/include/linux/uksm.h | |
new file mode 100644 | |
index 0000000..a644bca | |
--- /dev/null | |
+++ b/include/linux/uksm.h | |
@@ -0,0 +1,146 @@ | |
+#ifndef __LINUX_UKSM_H | |
+#define __LINUX_UKSM_H | |
+/* | |
+ * Memory merging support. | |
+ * | |
+ * This code enables dynamic sharing of identical pages found in different | |
+ * memory areas, even if they are not shared by fork(). | |
+ */ | |
+ | |
+/* if !CONFIG_UKSM this file should not be compiled at all. */ | |
+#ifdef CONFIG_UKSM | |
+ | |
+#include <linux/bitops.h> | |
+#include <linux/mm.h> | |
+#include <linux/pagemap.h> | |
+#include <linux/rmap.h> | |
+#include <linux/sched.h> | |
+ | |
+extern unsigned long zero_pfn __read_mostly; | |
+extern unsigned long uksm_zero_pfn __read_mostly; | |
+extern struct page *empty_uksm_zero_page; | |
+ | |
+/* must be done before linked to mm */ | |
+extern void uksm_vma_add_new(struct vm_area_struct *vma); | |
+extern void uksm_remove_vma(struct vm_area_struct *vma); | |
+ | |
+#define UKSM_SLOT_NEED_SORT (1 << 0) | |
+#define UKSM_SLOT_NEED_RERAND (1 << 1) | |
+#define UKSM_SLOT_SCANNED (1 << 2) /* It's scanned in this round */ | |
+#define UKSM_SLOT_FUL_SCANNED (1 << 3) | |
+#define UKSM_SLOT_IN_UKSM (1 << 4) | |
+ | |
+struct vma_slot { | |
+ struct sradix_tree_node *snode; | |
+ unsigned long sindex; | |
+ | |
+ struct list_head slot_list; | |
+ unsigned long fully_scanned_round; | |
+ unsigned long dedup_num; | |
+ unsigned long pages_scanned; | |
+ unsigned long last_scanned; | |
+ unsigned long pages_to_scan; | |
+ struct scan_rung *rung; | |
+ struct page **rmap_list_pool; | |
+ unsigned int *pool_counts; | |
+ unsigned long pool_size; | |
+ struct vm_area_struct *vma; | |
+ struct mm_struct *mm; | |
+ unsigned long ctime_j; | |
+ unsigned long pages; | |
+ unsigned long flags; | |
+ unsigned long pages_cowed; /* pages cowed this round */ | |
+ unsigned long pages_merged; /* pages merged this round */ | |
+ unsigned long pages_bemerged; | |
+ | |
+ /* when it has page merged in this eval round */ | |
+ struct list_head dedup_list; | |
+}; | |
+ | |
+static inline void uksm_unmap_zero_page(pte_t pte) | |
+{ | |
+ if (pte_pfn(pte) == uksm_zero_pfn) | |
+ __dec_zone_page_state(empty_uksm_zero_page, NR_UKSM_ZERO_PAGES); | |
+} | |
+ | |
+static inline void uksm_map_zero_page(pte_t pte) | |
+{ | |
+ if (pte_pfn(pte) == uksm_zero_pfn) | |
+ __inc_zone_page_state(empty_uksm_zero_page, NR_UKSM_ZERO_PAGES); | |
+} | |
+ | |
+static inline void uksm_cow_page(struct vm_area_struct *vma, struct page *page) | |
+{ | |
+ if (vma->uksm_vma_slot && PageKsm(page)) | |
+ vma->uksm_vma_slot->pages_cowed++; | |
+} | |
+ | |
+static inline void uksm_cow_pte(struct vm_area_struct *vma, pte_t pte) | |
+{ | |
+ if (vma->uksm_vma_slot && pte_pfn(pte) == uksm_zero_pfn) | |
+ vma->uksm_vma_slot->pages_cowed++; | |
+} | |
+ | |
+static inline int uksm_flags_can_scan(unsigned long vm_flags) | |
+{ | |
+#ifndef VM_SAO | |
+#define VM_SAO 0 | |
+#endif | |
+ return !(vm_flags & (VM_PFNMAP | VM_IO | VM_DONTEXPAND | | |
+ VM_HUGETLB | VM_NONLINEAR | VM_MIXEDMAP | | |
+ VM_SHARED | VM_MAYSHARE | VM_GROWSUP | VM_GROWSDOWN | VM_SAO)); | |
+} | |
+ | |
+static inline void uksm_vm_flags_mod(unsigned long *vm_flags_p) | |
+{ | |
+ if (uksm_flags_can_scan(*vm_flags_p)) | |
+ *vm_flags_p |= VM_MERGEABLE; | |
+} | |
+ | |
+/* | |
+ * A wrapper around BUG_ON for places where the uksm zero page must never | |
+ * appear. TODO: remove this once the uksm zero page patch is stable enough. | |
+ */ | |
+static inline void uksm_bugon_zeropage(pte_t pte) | |
+{ | |
+ BUG_ON(pte_pfn(pte) == uksm_zero_pfn); | |
+} | |
+#else | |
+static inline void uksm_vma_add_new(struct vm_area_struct *vma) | |
+{ | |
+} | |
+ | |
+static inline void uksm_remove_vma(struct vm_area_struct *vma) | |
+{ | |
+} | |
+ | |
+static inline void uksm_unmap_zero_page(pte_t pte) | |
+{ | |
+} | |
+ | |
+static inline void uksm_map_zero_page(pte_t pte) | |
+{ | |
+} | |
+ | |
+static inline void uksm_cow_page(struct vm_area_struct *vma, struct page *page) | |
+{ | |
+} | |
+ | |
+static inline void uksm_cow_pte(struct vm_area_struct *vma, pte_t pte) | |
+{ | |
+} | |
+ | |
+static inline int uksm_flags_can_scan(unsigned long vm_flags) | |
+{ | |
+ return 0; | |
+} | |
+ | |
+static inline void uksm_vm_flags_mod(unsigned long *vm_flags_p) | |
+{ | |
+} | |
+ | |
+static inline void uksm_bugon_zeropage(pte_t pte) | |
+{ | |
+} | |
+#endif /* !CONFIG_UKSM */ | |
+#endif /* __LINUX_UKSM_H */ | |
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h | |
index b932be9..66cbe8a 100644 | |
--- a/include/uapi/linux/sched.h | |
+++ b/include/uapi/linux/sched.h | |
@@ -37,9 +37,16 @@ | |
#define SCHED_FIFO 1 | |
#define SCHED_RR 2 | |
#define SCHED_BATCH 3 | |
-/* SCHED_ISO: reserved but not implemented yet */ | |
+/* SCHED_ISO: Implemented on BFS only */ | |
#define SCHED_IDLE 5 | |
+#ifdef CONFIG_SCHED_BFS | |
+#define SCHED_ISO 4 | |
+#define SCHED_IDLEPRIO SCHED_IDLE | |
+#define SCHED_MAX (SCHED_IDLEPRIO) | |
+#define SCHED_RANGE(policy) ((policy) <= SCHED_MAX) | |
+#else /* CONFIG_SCHED_BFS */ | |
#define SCHED_DEADLINE 6 | |
+#endif /* CONFIG_SCHED_BFS */ | |
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */ | |
#define SCHED_RESET_ON_FORK 0x40000000 | |
diff --git a/init/Kconfig b/init/Kconfig | |
index 2081a4d..30cb994 100644 | |
--- a/init/Kconfig | |
+++ b/init/Kconfig | |
@@ -28,6 +28,20 @@ config BUILDTIME_EXTABLE_SORT | |
menu "General setup" | |
+config SCHED_BFS | |
+ bool "BFS cpu scheduler" | |
+ ---help--- | |
+ The Brain Fuck CPU Scheduler for excellent interactivity and | |
+ responsiveness on the desktop and solid scalability on normal | |
+ hardware and commodity servers. Not recommended for 4096 CPUs. | |
+ | |
+ Currently incompatible with the Group CPU scheduler and the RCU TORTURE | |
+ TEST, so these options are disabled. | |
+ | |
+ Say Y here. | |
+ default y | |
+ | |
+ | |
config BROKEN | |
bool | |
@@ -340,7 +354,7 @@ choice | |
# Kind of a stub config for the pure tick based cputime accounting | |
config TICK_CPU_ACCOUNTING | |
bool "Simple tick based cputime accounting" | |
- depends on !S390 && !NO_HZ_FULL | |
+ depends on !S390 && !NO_HZ_FULL && !SCHED_BFS | |
help | |
This is the basic tick based cputime accounting that maintains | |
statistics about user, system and idle time spent on per jiffies | |
@@ -365,6 +379,7 @@ config VIRT_CPU_ACCOUNTING_GEN | |
bool "Full dynticks CPU time accounting" | |
depends on HAVE_CONTEXT_TRACKING | |
depends on HAVE_VIRT_CPU_ACCOUNTING_GEN | |
+ depends on !SCHED_BFS | |
select VIRT_CPU_ACCOUNTING | |
select CONTEXT_TRACKING | |
help | |
@@ -530,7 +545,7 @@ config CONTEXT_TRACKING | |
config RCU_USER_QS | |
bool "Consider userspace as in RCU extended quiescent state" | |
- depends on HAVE_CONTEXT_TRACKING && SMP | |
+ depends on HAVE_CONTEXT_TRACKING && SMP && !SCHED_BFS | |
select CONTEXT_TRACKING | |
help | |
This option sets hooks on kernel / userspace boundaries and | |
@@ -715,7 +730,7 @@ config RCU_BOOST_DELAY | |
config RCU_NOCB_CPU | |
bool "Offload RCU callback processing from boot-selected CPUs" | |
- depends on TREE_RCU || TREE_PREEMPT_RCU | |
+ depends on (TREE_RCU || TREE_PREEMPT_RCU) && !SCHED_BFS | |
default n | |
help | |
Use this option to reduce OS jitter for aggressive HPC or | |
@@ -913,6 +928,7 @@ config NUMA_BALANCING | |
depends on ARCH_SUPPORTS_NUMA_BALANCING | |
depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY | |
depends on SMP && NUMA && MIGRATION | |
+ depends on !SCHED_BFS | |
help | |
This option adds support for automatic NUMA aware memory/task placement. | |
The mechanism is quite primitive and is based on migrating memory when | |
@@ -975,6 +991,7 @@ config PROC_PID_CPUSET | |
config CGROUP_CPUACCT | |
bool "Simple CPU accounting cgroup subsystem" | |
+ depends on !SCHED_BFS | |
help | |
Provides a simple Resource Controller for monitoring the | |
total CPU consumed by the tasks in a cgroup. | |
@@ -1080,6 +1097,7 @@ config CGROUP_PERF | |
menuconfig CGROUP_SCHED | |
bool "Group CPU scheduler" | |
+ depends on !SCHED_BFS | |
default n | |
help | |
This feature lets CPU scheduler recognize task groups and control CPU | |
@@ -1220,6 +1238,7 @@ endif # NAMESPACES | |
config SCHED_AUTOGROUP | |
bool "Automatic process group scheduling" | |
+ depends on !SCHED_BFS | |
select CGROUPS | |
select CGROUP_SCHED | |
select FAIR_GROUP_SCHED | |
@@ -1669,38 +1688,8 @@ config COMPAT_BRK | |
On non-ancient distros (post-2000 ones) N is usually a safe choice. | |
-choice | |
- prompt "Choose SLAB allocator" | |
- default SLUB | |
- help | |
- This option allows to select a slab allocator. | |
- | |
-config SLAB | |
- bool "SLAB" | |
- help | |
- The regular slab allocator that is established and known to work | |
- well in all environments. It organizes cache hot objects in | |
- per cpu and per node queues. | |
- | |
config SLUB | |
- bool "SLUB (Unqueued Allocator)" | |
- help | |
- SLUB is a slab allocator that minimizes cache line usage | |
- instead of managing queues of cached objects (SLAB approach). | |
- Per cpu caching is realized using slabs of objects instead | |
- of queues of objects. SLUB can use memory efficiently | |
- and has enhanced diagnostics. SLUB is the default choice for | |
- a slab allocator. | |
- | |
-config SLOB | |
- depends on EXPERT | |
- bool "SLOB (Simple Allocator)" | |
- help | |
- SLOB replaces the stock allocator with a drastically simpler | |
- allocator. SLOB is generally more space efficient but | |
- does not perform as well on large systems. | |
- | |
-endchoice | |
+ def_bool y | |
config SLUB_CPU_PARTIAL | |
default y | |
diff --git a/init/main.c b/init/main.c | |
index 321d0ce..c721cfe 100644 | |
--- a/init/main.c | |
+++ b/init/main.c | |
@@ -805,7 +805,6 @@ int __init_or_module do_one_initcall(initcall_t fn) | |
return ret; | |
} | |
- | |
extern initcall_t __initcall_start[]; | |
extern initcall_t __initcall0_start[]; | |
extern initcall_t __initcall1_start[]; | |
@@ -941,6 +940,8 @@ static int __ref kernel_init(void *unused) | |
flush_delayed_fput(); | |
+ print_scheduler_version(); | |
+ | |
if (ramdisk_execute_command) { | |
ret = run_init_process(ramdisk_execute_command); | |
if (!ret) | |
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz | |
index 2a202a8..0e8075b 100644 | |
--- a/kernel/Kconfig.hz | |
+++ b/kernel/Kconfig.hz | |
@@ -4,7 +4,7 @@ | |
choice | |
prompt "Timer frequency" | |
- default HZ_250 | |
+ default HZ_1000 | |
help | |
Allows the configuration of the timer frequency. It is customary | |
to have the timer interrupt run at 1000 Hz but 100 Hz may be more | |
@@ -23,13 +23,14 @@ choice | |
with lots of processors that may show reduced performance if | |
too many timer interrupts are occurring. | |
- config HZ_250 | |
+ config HZ_250_NODEFAULT | |
bool "250 HZ" | |
help | |
- 250 Hz is a good compromise choice allowing server performance | |
- while also showing good interactive responsiveness even | |
- on SMP and NUMA systems. If you are going to be using NTSC video | |
- or multimedia, selected 300Hz instead. | |
+ 250 HZ is a lousy compromise choice allowing server interactivity | |
+ while also showing desktop throughput and no extra power saving on | |
+ laptops. No good for anything. | |
+ | |
+ Recommend 100 or 1000 instead. | |
config HZ_300 | |
bool "300 HZ" | |
@@ -43,14 +44,16 @@ choice | |
bool "1000 HZ" | |
help | |
1000 Hz is the preferred choice for desktop systems and other | |
- systems requiring fast interactive responses to events. | |
+ systems requiring fast interactive responses to events. Laptops | |
+ can also benefit from this choice without sacrificing battery life | |
+ if dynticks is also enabled. | |
endchoice | |
config HZ | |
int | |
default 100 if HZ_100 | |
- default 250 if HZ_250 | |
+ default 250 if HZ_250_NODEFAULT | |
default 300 if HZ_300 | |
default 1000 if HZ_1000 | |
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt | |
index 3f9c974..1dc79ec 100644 | |
--- a/kernel/Kconfig.preempt | |
+++ b/kernel/Kconfig.preempt | |
@@ -1,7 +1,7 @@ | |
choice | |
prompt "Preemption Model" | |
- default PREEMPT_NONE | |
+ default PREEMPT | |
config PREEMPT_NONE | |
bool "No Forced Preemption (Server)" | |
@@ -17,7 +17,7 @@ config PREEMPT_NONE | |
latencies. | |
config PREEMPT_VOLUNTARY | |
- bool "Voluntary Kernel Preemption (Desktop)" | |
+ bool "Voluntary Kernel Preemption (Nothing)" | |
help | |
This option reduces the latency of the kernel by adding more | |
"explicit preemption points" to the kernel code. These new | |
@@ -31,7 +31,8 @@ config PREEMPT_VOLUNTARY | |
applications to run more 'smoothly' even when the system is | |
under load. | |
- Select this if you are building a kernel for a desktop system. | |
+ Select this for no system in particular (choose Preemptible | |
+ instead on a desktop if you know what's good for you). | |
config PREEMPT | |
bool "Preemptible Kernel (Low-Latency Desktop)" | |
diff --git a/kernel/delayacct.c b/kernel/delayacct.c | |
index ef90b04..d12807d 100644 | |
--- a/kernel/delayacct.c | |
+++ b/kernel/delayacct.c | |
@@ -104,7 +104,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) | |
*/ | |
t1 = tsk->sched_info.pcount; | |
t2 = tsk->sched_info.run_delay; | |
- t3 = tsk->se.sum_exec_runtime; | |
+ t3 = tsk_seruntime(tsk); | |
d->cpu_count += t1; | |
diff --git a/kernel/exit.c b/kernel/exit.c | |
index 5d30019..13a5e3b 100644 | |
--- a/kernel/exit.c | |
+++ b/kernel/exit.c | |
@@ -138,7 +138,7 @@ static void __exit_signal(struct task_struct *tsk) | |
sig->inblock += task_io_get_inblock(tsk); | |
sig->oublock += task_io_get_oublock(tsk); | |
task_io_accounting_add(&sig->ioac, &tsk->ioac); | |
- sig->sum_sched_runtime += tsk->se.sum_exec_runtime; | |
+ sig->sum_sched_runtime += tsk_seruntime(tsk); | |
sig->nr_threads--; | |
__unhash_process(tsk, group_dead); | |
write_sequnlock(&sig->stats_lock); | |
diff --git a/kernel/fork.c b/kernel/fork.c | |
index 9b7d746..73ad90d 100644 | |
--- a/kernel/fork.c | |
+++ b/kernel/fork.c | |
@@ -412,7 +412,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) | |
goto fail_nomem; | |
charge = len; | |
} | |
- tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); | |
+ tmp = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL); | |
if (!tmp) | |
goto fail_nomem; | |
*tmp = *mpnt; | |
@@ -467,7 +467,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) | |
__vma_link_rb(mm, tmp, rb_link, rb_parent); | |
rb_link = &tmp->vm_rb.rb_right; | |
rb_parent = &tmp->vm_rb; | |
- | |
+ uksm_vma_add_new(tmp); | |
mm->map_count++; | |
retval = copy_page_range(mm, oldmm, mpnt); | |
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile | |
index ab32b7b..5bb8ab8 100644 | |
--- a/kernel/sched/Makefile | |
+++ b/kernel/sched/Makefile | |
@@ -11,11 +11,16 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) | |
CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer | |
endif | |
+ifdef CONFIG_SCHED_BFS | |
+obj-y += bfs.o clock.o | |
+else | |
obj-y += core.o proc.o clock.o cputime.o | |
obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o | |
-obj-y += wait.o completion.o idle.o | |
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o | |
+obj-$(CONFIG_SMP) += cpudeadline.o | |
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o | |
-obj-$(CONFIG_SCHEDSTATS) += stats.o | |
obj-$(CONFIG_SCHED_DEBUG) += debug.o | |
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o | |
+endif | |
+obj-y += wait.o completion.o idle.o | |
+obj-$(CONFIG_SMP) += cpupri.o | |
+obj-$(CONFIG_SCHEDSTATS) += stats.o | |
diff --git a/kernel/sched/bfs.c b/kernel/sched/bfs.c | |
new file mode 100644 | |
index 0000000..210772a | |
--- /dev/null | |
+++ b/kernel/sched/bfs.c | |
@@ -0,0 +1,7369 @@ | |
+/* | |
+ * kernel/sched/bfs.c, was kernel/sched.c | |
+ * | |
+ * Kernel scheduler and related syscalls | |
+ * | |
+ * Copyright (C) 1991-2002 Linus Torvalds | |
+ * | |
+ * 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and | |
+ * make semaphores SMP safe | |
+ * 1998-11-19 Implemented schedule_timeout() and related stuff | |
+ * by Andrea Arcangeli | |
+ * 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar: | |
+ * hybrid priority-list and round-robin design with | |
+ * an array-switch method of distributing timeslices | |
+ * and per-CPU runqueues. Cleanups and useful suggestions | |
+ * by Davide Libenzi, preemptible kernel bits by Robert Love. | |
+ * 2003-09-03 Interactivity tuning by Con Kolivas. | |
+ * 2004-04-02 Scheduler domains code by Nick Piggin | |
+ * 2007-04-15 Work begun on replacing all interactivity tuning with a | |
+ * fair scheduling design by Con Kolivas. | |
+ * 2007-05-05 Load balancing (smp-nice) and other improvements | |
+ * by Peter Williams | |
+ * 2007-05-06 Interactivity improvements to CFS by Mike Galbraith | |
+ * 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri | |
+ * 2007-11-29 RT balancing improvements by Steven Rostedt, Gregory Haskins, | |
+ * Thomas Gleixner, Mike Kravetz | |
+ * now Brainfuck deadline scheduling policy by Con Kolivas deletes | |
+ * a whole lot of those previous things. | |
+ */ | |
+ | |
+#include <linux/mm.h> | |
+#include <linux/module.h> | |
+#include <linux/nmi.h> | |
+#include <linux/init.h> | |
+#include <asm/uaccess.h> | |
+#include <linux/highmem.h> | |
+#include <asm/mmu_context.h> | |
+#include <linux/interrupt.h> | |
+#include <linux/capability.h> | |
+#include <linux/completion.h> | |
+#include <linux/kernel_stat.h> | |
+#include <linux/debug_locks.h> | |
+#include <linux/perf_event.h> | |
+#include <linux/security.h> | |
+#include <linux/notifier.h> | |
+#include <linux/profile.h> | |
+#include <linux/freezer.h> | |
+#include <linux/vmalloc.h> | |
+#include <linux/blkdev.h> | |
+#include <linux/delay.h> | |
+#include <linux/smp.h> | |
+#include <linux/threads.h> | |
+#include <linux/timer.h> | |
+#include <linux/rcupdate.h> | |
+#include <linux/cpu.h> | |
+#include <linux/cpuset.h> | |
+#include <linux/cpumask.h> | |
+#include <linux/percpu.h> | |
+#include <linux/proc_fs.h> | |
+#include <linux/seq_file.h> | |
+#include <linux/syscalls.h> | |
+#include <linux/sched/sysctl.h> | |
+#include <linux/times.h> | |
+#include <linux/tsacct_kern.h> | |
+#include <linux/kprobes.h> | |
+#include <linux/delayacct.h> | |
+#include <linux/log2.h> | |
+#include <linux/bootmem.h> | |
+#include <linux/ftrace.h> | |
+#include <linux/slab.h> | |
+#include <linux/init_task.h> | |
+#include <linux/binfmts.h> | |
+#include <linux/context_tracking.h> | |
+#include <linux/sched/prio.h> | |
+ | |
+#include <asm/irq_regs.h> | |
+#include <asm/switch_to.h> | |
+#include <asm/tlb.h> | |
+#include <asm/unistd.h> | |
+#include <asm/mutex.h> | |
+#ifdef CONFIG_PARAVIRT | |
+#include <asm/paravirt.h> | |
+#endif | |
+ | |
+#include "cpupri.h" | |
+#include "../workqueue_internal.h" | |
+#include "../smpboot.h" | |
+ | |
+#define CREATE_TRACE_POINTS | |
+#include <trace/events/sched.h> | |
+ | |
+#include "bfs_sched.h" | |
+ | |
+#define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO) | |
+#define rt_task(p) rt_prio((p)->prio) | |
+#define rt_queue(rq) rt_prio((rq)->rq_prio) | |
+#define batch_task(p) (unlikely((p)->policy == SCHED_BATCH)) | |
+#define is_rt_policy(policy) ((policy) == SCHED_FIFO || \ | |
+ (policy) == SCHED_RR) | |
+#define has_rt_policy(p) unlikely(is_rt_policy((p)->policy)) | |
+ | |
+#define is_idle_policy(policy) ((policy) == SCHED_IDLEPRIO) | |
+#define idleprio_task(p) unlikely(is_idle_policy((p)->policy)) | |
+#define task_running_idle(p) unlikely((p)->prio == IDLE_PRIO) | |
+#define idle_queue(rq) (unlikely(is_idle_policy((rq)->rq_policy))) | |
+ | |
+#define is_iso_policy(policy) ((policy) == SCHED_ISO) | |
+#define iso_task(p) unlikely(is_iso_policy((p)->policy)) | |
+#define iso_queue(rq) unlikely(is_iso_policy((rq)->rq_policy)) | |
+#define task_running_iso(p) unlikely((p)->prio == ISO_PRIO) | |
+#define rq_running_iso(rq) ((rq)->rq_prio == ISO_PRIO) | |
+ | |
+#define rq_idle(rq) ((rq)->rq_prio == PRIO_LIMIT) | |
+ | |
+#define ISO_PERIOD ((5 * HZ * grq.noc) + 1) | |
+ | |
+#define SCHED_PRIO(p) ((p) + MAX_RT_PRIO) | |
+#define STOP_PRIO (MAX_RT_PRIO - 1) | |
+ | |
+/* | |
+ * Some helpers for converting to/from various scales. Use shifts to get | |
+ * approximate multiples of ten for less overhead. | |
+ */ | |
+#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ)) | |
+#define JIFFY_NS (1000000000 / HZ) | |
+#define HALF_JIFFY_NS (1000000000 / HZ / 2) | |
+#define HALF_JIFFY_US (1000000 / HZ / 2) | |
+#define MS_TO_NS(TIME) ((TIME) << 20) | |
+#define MS_TO_US(TIME) ((TIME) << 10) | |
+#define NS_TO_MS(TIME) ((TIME) >> 20) | |
+#define NS_TO_US(TIME) ((TIME) >> 10) | |
+ | |
+#define RESCHED_US (100) /* Reschedule if less than this many μs left */ | |
+ | |
+void print_scheduler_version(void) | |
+{ | |
+ printk(KERN_INFO "BFS CPU scheduler v0.460 by Con Kolivas.\n"); | |
+} | |
+ | |
+/* | |
+ * This is the time all tasks within the same priority round robin. | |
+ * Value is in ms and set to a minimum of 6ms. Scales with number of cpus. | |
+ * Tunable via /proc interface. | |
+ */ | |
+int rr_interval __read_mostly = 6; | |
+ | |
+/* | |
+ * sched_iso_cpu - sysctl which determines the cpu percentage SCHED_ISO tasks | |
+ * are allowed to run as real time tasks, averaged over a rolling five | |
+ * second period. This is the total over all online cpus. | |
+ */ | |
+int sched_iso_cpu __read_mostly = 70; | |
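As a rough illustration of how these defaults combine (not wording from the patch): on a machine with four online CPUs, ISO_PERIOD above evaluates to 5 * HZ * 4 + 1 ticks, i.e. the number of scheduler ticks all four CPUs produce together over roughly five seconds of wall time, and sched_iso_cpu = 70 means SCHED_ISO tasks may account for up to 70% of that total before the iso_refractory handling later in this file temporarily stops treating them as realtime.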
+ | |
+/* | |
+ * The relative length of deadline for each priority(nice) level. | |
+ */ | |
+static int prio_ratios[NICE_WIDTH] __read_mostly; | |
+ | |
+/* | |
+ * The quota handed out to tasks of all priority levels when refilling their | |
+ * time_slice. | |
+ */ | |
+static inline int timeslice(void) | |
+{ | |
+ return MS_TO_US(rr_interval); | |
+} | |
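A worked example of the shift approximation mentioned above, for illustration only: with the default rr_interval of 6, timeslice() returns MS_TO_US(6) = 6 << 10 = 6144, so the quantum is accounted as 6144 us rather than exactly 6000 us. The shift replaces a multiply by 1000 with a multiply by 1024, an error of about 2.4% that is harmless at this granularity and cheaper to compute.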
+ | |
+/* | |
+ * The global runqueue data that all CPUs work off. Data is protected either | |
+ * by the global grq lock, or the discrete lock that precedes the data in this | |
+ * struct. | |
+ */ | |
+struct global_rq { | |
+ raw_spinlock_t lock; | |
+ unsigned long nr_running; | |
+ unsigned long nr_uninterruptible; | |
+ unsigned long long nr_switches; | |
+ struct list_head queue[PRIO_LIMIT]; | |
+ DECLARE_BITMAP(prio_bitmap, PRIO_LIMIT + 1); | |
+#ifdef CONFIG_SMP | |
+ unsigned long qnr; /* queued not running */ | |
+ cpumask_t cpu_idle_map; | |
+ bool idle_cpus; | |
+#endif | |
+ int noc; /* num_online_cpus stored and updated when it changes */ | |
+ u64 niffies; /* Nanosecond jiffies */ | |
+ unsigned long last_jiffy; /* Last jiffy we updated niffies */ | |
+ | |
+ raw_spinlock_t iso_lock; | |
+ int iso_ticks; | |
+ bool iso_refractory; | |
+}; | |
+ | |
+#ifdef CONFIG_SMP | |
+/* | |
+ * We add the notion of a root-domain which will be used to define per-domain | |
+ * variables. Each exclusive cpuset essentially defines an island domain by | |
+ * fully partitioning the member cpus from any other cpuset. Whenever a new | |
+ * exclusive cpuset is created, we also create and attach a new root-domain | |
+ * object. | |
+ * | |
+ */ | |
+struct root_domain { | |
+ atomic_t refcount; | |
+ atomic_t rto_count; | |
+ struct rcu_head rcu; | |
+ cpumask_var_t span; | |
+ cpumask_var_t online; | |
+ | |
+ /* | |
+ * The "RT overload" flag: it gets set if a CPU has more than | |
+ * one runnable RT task. | |
+ */ | |
+ cpumask_var_t rto_mask; | |
+ struct cpupri cpupri; | |
+}; | |
+ | |
+/* | |
+ * By default the system creates a single root-domain with all cpus as | |
+ * members (mimicking the global state we have today). | |
+ */ | |
+static struct root_domain def_root_domain; | |
+ | |
+#endif /* CONFIG_SMP */ | |
+ | |
+/* There can be only one */ | |
+static struct global_rq grq; | |
+ | |
+static DEFINE_MUTEX(sched_hotcpu_mutex); | |
+ | |
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); | |
+#ifdef CONFIG_SMP | |
+struct rq *cpu_rq(int cpu) | |
+{ | |
+ return &per_cpu(runqueues, (cpu)); | |
+} | |
+#define task_rq(p) cpu_rq(task_cpu(p)) | |
+#define cpu_curr(cpu) (cpu_rq(cpu)->curr) | |
+/* | |
+ * sched_domains_mutex serialises calls to init_sched_domains, | |
+ * detach_destroy_domains and partition_sched_domains. | |
+ */ | |
+static DEFINE_MUTEX(sched_domains_mutex); | |
+ | |
+/* | |
+ * By default the system creates a single root-domain with all cpus as | |
+ * members (mimicking the global state we have today). | |
+ */ | |
+static struct root_domain def_root_domain; | |
+ | |
+int __weak arch_sd_sibling_asym_packing(void) | |
+{ | |
+ return 0*SD_ASYM_PACKING; | |
+} | |
+#endif /* CONFIG_SMP */ | |
+ | |
+static inline void update_rq_clock(struct rq *rq); | |
+ | |
+/* | |
+ * Sanity check should sched_clock return bogus values. We make sure it does | |
+ * not appear to go backwards, and use jiffies to determine the maximum and | |
+ * minimum it could possibly have increased, and round down to the nearest | |
+ * jiffy when it falls outside this. | |
+ */ | |
+static inline void niffy_diff(s64 *niff_diff, int jiff_diff) | |
+{ | |
+ unsigned long min_diff, max_diff; | |
+ | |
+ if (jiff_diff > 1) | |
+ min_diff = JIFFIES_TO_NS(jiff_diff - 1); | |
+ else | |
+ min_diff = 1; | |
+ /* Round up to the nearest tick for maximum */ | |
+ max_diff = JIFFIES_TO_NS(jiff_diff + 1); | |
+ | |
+ if (unlikely(*niff_diff < min_diff || *niff_diff > max_diff)) | |
+ *niff_diff = min_diff; | |
+} | |
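To make the clamping above concrete (illustration only): with HZ = 1000, a jiff_diff of 3 bounds the accepted delta to between JIFFIES_TO_NS(2) = 2,000,000 ns and JIFFIES_TO_NS(4) = 4,000,000 ns; any *niff_diff outside that window is assumed to be a bogus sched_clock reading and is replaced by the 2 ms minimum.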
+ | |
+#ifdef CONFIG_SMP | |
+static inline int cpu_of(struct rq *rq) | |
+{ | |
+ return rq->cpu; | |
+} | |
+ | |
+/* | |
+ * Niffies are a globally increasing nanosecond counter. Whenever a runqueue | |
+ * clock is updated with the grq.lock held, it is an opportunity to update the | |
+ * niffies value. Any CPU can update it by adding how much its clock has | |
+ * increased since it last updated niffies, minus any added niffies by other | |
+ * CPUs. | |
+ */ | |
+static inline void update_clocks(struct rq *rq) | |
+{ | |
+ s64 ndiff; | |
+ long jdiff; | |
+ | |
+ update_rq_clock(rq); | |
+ ndiff = rq->clock - rq->old_clock; | |
+ /* old_clock is only updated when we are updating niffies */ | |
+ rq->old_clock = rq->clock; | |
+ ndiff -= grq.niffies - rq->last_niffy; | |
+ jdiff = jiffies - grq.last_jiffy; | |
+ niffy_diff(&ndiff, jdiff); | |
+ grq.last_jiffy += jdiff; | |
+ grq.niffies += ndiff; | |
+ rq->last_niffy = grq.niffies; | |
+} | |
+#else /* CONFIG_SMP */ | |
+static inline int cpu_of(struct rq *rq) | |
+{ | |
+ return 0; | |
+} | |
+ | |
+static inline void update_clocks(struct rq *rq) | |
+{ | |
+ s64 ndiff; | |
+ long jdiff; | |
+ | |
+ update_rq_clock(rq); | |
+ ndiff = rq->clock - rq->old_clock; | |
+ rq->old_clock = rq->clock; | |
+ jdiff = jiffies - grq.last_jiffy; | |
+ niffy_diff(&ndiff, jdiff); | |
+ grq.last_jiffy += jdiff; | |
+ grq.niffies += ndiff; | |
+} | |
+#endif | |
+ | |
+#include "stats.h" | |
+ | |
+#ifndef prepare_arch_switch | |
+# define prepare_arch_switch(next) do { } while (0) | |
+#endif | |
+#ifndef finish_arch_switch | |
+# define finish_arch_switch(prev) do { } while (0) | |
+#endif | |
+#ifndef finish_arch_post_lock_switch | |
+# define finish_arch_post_lock_switch() do { } while (0) | |
+#endif | |
+ | |
+/* | |
+ * All common locking functions performed on grq.lock. rq->clock is local to | |
+ * the CPU accessing it so it can be modified just with interrupts disabled | |
+ * when we're not updating niffies. | |
+ * Looking up task_rq must be done under grq.lock to be safe. | |
+ */ | |
+static void update_rq_clock_task(struct rq *rq, s64 delta); | |
+ | |
+static inline void update_rq_clock(struct rq *rq) | |
+{ | |
+ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; | |
+ | |
+ if (unlikely(delta < 0)) | |
+ return; | |
+ rq->clock += delta; | |
+ update_rq_clock_task(rq, delta); | |
+} | |
+ | |
+static inline bool task_running(struct task_struct *p) | |
+{ | |
+ return p->on_cpu; | |
+} | |
+ | |
+static inline void grq_lock(void) | |
+ __acquires(grq.lock) | |
+{ | |
+ raw_spin_lock(&grq.lock); | |
+} | |
+ | |
+static inline void grq_unlock(void) | |
+ __releases(grq.lock) | |
+{ | |
+ raw_spin_unlock(&grq.lock); | |
+} | |
+ | |
+static inline void grq_lock_irq(void) | |
+ __acquires(grq.lock) | |
+{ | |
+ raw_spin_lock_irq(&grq.lock); | |
+} | |
+ | |
+static inline void time_lock_grq(struct rq *rq) | |
+ __acquires(grq.lock) | |
+{ | |
+ grq_lock(); | |
+ update_clocks(rq); | |
+} | |
+ | |
+static inline void grq_unlock_irq(void) | |
+ __releases(grq.lock) | |
+{ | |
+ raw_spin_unlock_irq(&grq.lock); | |
+} | |
+ | |
+static inline void grq_lock_irqsave(unsigned long *flags) | |
+ __acquires(grq.lock) | |
+{ | |
+ raw_spin_lock_irqsave(&grq.lock, *flags); | |
+} | |
+ | |
+static inline void grq_unlock_irqrestore(unsigned long *flags) | |
+ __releases(grq.lock) | |
+{ | |
+ raw_spin_unlock_irqrestore(&grq.lock, *flags); | |
+} | |
+ | |
+static inline struct rq | |
+*task_grq_lock(struct task_struct *p, unsigned long *flags) | |
+ __acquires(grq.lock) | |
+{ | |
+ grq_lock_irqsave(flags); | |
+ return task_rq(p); | |
+} | |
+ | |
+static inline struct rq | |
+*time_task_grq_lock(struct task_struct *p, unsigned long *flags) | |
+ __acquires(grq.lock) | |
+{ | |
+ struct rq *rq = task_grq_lock(p, flags); | |
+ update_clocks(rq); | |
+ return rq; | |
+} | |
+ | |
+static inline struct rq *task_grq_lock_irq(struct task_struct *p) | |
+ __acquires(grq.lock) | |
+{ | |
+ grq_lock_irq(); | |
+ return task_rq(p); | |
+} | |
+ | |
+static inline void time_task_grq_lock_irq(struct task_struct *p) | |
+ __acquires(grq.lock) | |
+{ | |
+ struct rq *rq = task_grq_lock_irq(p); | |
+ update_clocks(rq); | |
+} | |
+ | |
+static inline void task_grq_unlock_irq(void) | |
+ __releases(grq.lock) | |
+{ | |
+ grq_unlock_irq(); | |
+} | |
+ | |
+static inline void task_grq_unlock(unsigned long *flags) | |
+ __releases(grq.lock) | |
+{ | |
+ grq_unlock_irqrestore(flags); | |
+} | |
+ | |
+/** | |
+ * grunqueue_is_locked | |
+ * | |
+ * Returns true if the global runqueue is locked. | |
+ * This interface allows printk to be called with the runqueue lock | |
+ * held and know whether or not it is OK to wake up the klogd. | |
+ */ | |
+bool grunqueue_is_locked(void) | |
+{ | |
+ return raw_spin_is_locked(&grq.lock); | |
+} | |
+ | |
+void grq_unlock_wait(void) | |
+ __releases(grq.lock) | |
+{ | |
+ smp_mb(); /* spin-unlock-wait is not a full memory barrier */ | |
+ raw_spin_unlock_wait(&grq.lock); | |
+} | |
+ | |
+static inline void time_grq_lock(struct rq *rq, unsigned long *flags) | |
+ __acquires(grq.lock) | |
+{ | |
+ local_irq_save(*flags); | |
+ time_lock_grq(rq); | |
+} | |
+ | |
+static inline struct rq *__task_grq_lock(struct task_struct *p) | |
+ __acquires(grq.lock) | |
+{ | |
+ grq_lock(); | |
+ return task_rq(p); | |
+} | |
+ | |
+static inline void __task_grq_unlock(void) | |
+ __releases(grq.lock) | |
+{ | |
+ grq_unlock(); | |
+} | |
+ | |
+static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next) | |
+{ | |
+} | |
+ | |
+static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) | |
+{ | |
+#ifdef CONFIG_DEBUG_SPINLOCK | |
+ /* this is a valid case when another task releases the spinlock */ | |
+ grq.lock.owner = current; | |
+#endif | |
+ /* | |
+ * If we are tracking spinlock dependencies then we have to | |
+ * fix up the runqueue lock - which gets 'carried over' from | |
+ * prev into current: | |
+ */ | |
+ spin_acquire(&grq.lock.dep_map, 0, 0, _THIS_IP_); | |
+ | |
+ grq_unlock_irq(); | |
+} | |
+ | |
+static inline bool deadline_before(u64 deadline, u64 time) | |
+{ | |
+ return (deadline < time); | |
+} | |
+ | |
+static inline bool deadline_after(u64 deadline, u64 time) | |
+{ | |
+ return (deadline > time); | |
+} | |
+ | |
+/* | |
+ * A task that is queued but not running will be on the grq run list. | |
+ * A task that is not running or queued will not be on the grq run list. | |
+ * A task that is currently running will have ->on_cpu set but not on the | |
+ * grq run list. | |
+ */ | |
+static inline bool task_queued(struct task_struct *p) | |
+{ | |
+ return (!list_empty(&p->run_list)); | |
+} | |
+ | |
+/* | |
+ * Removing from the global runqueue. Enter with grq locked. | |
+ */ | |
+static void dequeue_task(struct task_struct *p) | |
+{ | |
+ list_del_init(&p->run_list); | |
+ if (list_empty(grq.queue + p->prio)) | |
+ __clear_bit(p->prio, grq.prio_bitmap); | |
+ sched_info_dequeued(task_rq(p), p); | |
+} | |
+ | |
+/* | |
+ * To determine if it's safe for a task of SCHED_IDLEPRIO to actually run as | |
+ * an idle task, we ensure none of the following conditions are met. | |
+ */ | |
+static bool idleprio_suitable(struct task_struct *p) | |
+{ | |
+ return (!freezing(p) && !signal_pending(p) && | |
+ !(task_contributes_to_load(p)) && !(p->flags & (PF_EXITING))); | |
+} | |
+ | |
+/* | |
+ * To determine if a task of SCHED_ISO can run in pseudo-realtime, we check | |
+ * that the iso_refractory flag is not set. | |
+	/* state is a volatile long, why, I don't understand */ | |
+static bool isoprio_suitable(void) | |
+{ | |
+ return !grq.iso_refractory; | |
+} | |
+ | |
+/* | |
+ * Adding to the global runqueue. Enter with grq locked. | |
+ */ | |
+static void enqueue_task(struct task_struct *p, struct rq *rq) | |
+{ | |
+ if (!rt_task(p)) { | |
+ /* Check it hasn't gotten rt from PI */ | |
+ if ((idleprio_task(p) && idleprio_suitable(p)) || | |
+ (iso_task(p) && isoprio_suitable())) | |
+ p->prio = p->normal_prio; | |
+ else | |
+ p->prio = NORMAL_PRIO; | |
+ } | |
+ __set_bit(p->prio, grq.prio_bitmap); | |
+ list_add_tail(&p->run_list, grq.queue + p->prio); | |
+ sched_info_queued(rq, p); | |
+} | |
+ | |
+static inline void requeue_task(struct task_struct *p) | |
+{ | |
+ sched_info_queued(task_rq(p), p); | |
+} | |
+ | |
+/* | |
+ * Returns the relative length of a task's deadline compared to the | |
+ * shortest deadline, which is that of nice -20. | |
+ */ | |
+static inline int task_prio_ratio(struct task_struct *p) | |
+{ | |
+ return prio_ratios[TASK_USER_PRIO(p)]; | |
+} | |
+ | |
+/* | |
+ * task_timeslice - all tasks of all priorities get the exact same timeslice | |
+ * length. CPU distribution is handled by giving different deadlines to | |
+ * tasks of different priorities. Use 128 as the base value for fast shifts. | |
+ */ | |
+static inline int task_timeslice(struct task_struct *p) | |
+{ | |
+ return (rr_interval * task_prio_ratio(p) / 128); | |
+} | |
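+ | |
+/* | |
+ * For example, with rr_interval at its usual default of 6 and a prio ratio | |
+ * of 128 (the nice -20 base), this evaluates to 6; a task whose ratio is | |
+ * 256 gets a value twice as long. | |
+ */ | |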
+ | |
+static void resched_task(struct task_struct *p); | |
+ | |
+static inline void resched_curr(struct rq *rq) | |
+{ | |
+ resched_task(rq->curr); | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+/* | |
+ * qnr is the "queued but not running" count which is the total number of | |
+ * tasks on the global runqueue list waiting for cpu time but not actually | |
+ * currently running on a cpu. | |
+ */ | |
+static inline void inc_qnr(void) | |
+{ | |
+ grq.qnr++; | |
+} | |
+ | |
+static inline void dec_qnr(void) | |
+{ | |
+ grq.qnr--; | |
+} | |
+ | |
+static inline int queued_notrunning(void) | |
+{ | |
+ return grq.qnr; | |
+} | |
+ | |
+/* | |
+ * The cpu_idle_map stores a bitmap of all the CPUs currently idle to | |
+ * allow easy lookup of whether any suitable idle CPUs are available. | |
+ * It's cheaper to maintain a binary yes/no in the idle_cpus variable for | |
+ * whether any idle CPUs exist than to do a full bitmask check when we are | |
+ * busy. | |
+ */ | |
+static inline void set_cpuidle_map(int cpu) | |
+{ | |
+ if (likely(cpu_online(cpu))) { | |
+ cpu_set(cpu, grq.cpu_idle_map); | |
+ grq.idle_cpus = true; | |
+ } | |
+} | |
+ | |
+static inline void clear_cpuidle_map(int cpu) | |
+{ | |
+ cpu_clear(cpu, grq.cpu_idle_map); | |
+ if (cpus_empty(grq.cpu_idle_map)) | |
+ grq.idle_cpus = false; | |
+} | |
+ | |
+static bool suitable_idle_cpus(struct task_struct *p) | |
+{ | |
+ if (!grq.idle_cpus) | |
+ return false; | |
+ return (cpus_intersects(p->cpus_allowed, grq.cpu_idle_map)); | |
+} | |
+ | |
+#define CPUIDLE_DIFF_THREAD (1) | |
+#define CPUIDLE_DIFF_CORE (2) | |
+#define CPUIDLE_CACHE_BUSY (4) | |
+#define CPUIDLE_DIFF_CPU (8) | |
+#define CPUIDLE_THREAD_BUSY (16) | |
+#define CPUIDLE_THROTTLED (32) | |
+#define CPUIDLE_DIFF_NODE (64) | |
+ | |
+static inline bool scaling_rq(struct rq *rq); | |
+ | |
+/* | |
+ * The best idle CPU is chosen according to the CPUIDLE ranking above where the | |
+ * lowest value would give the most suitable CPU to schedule p onto next. The | |
+ * order works out to be the following: | |
+ * | |
+ * Same core, idle or busy cache, idle or busy threads | |
+ * Other core, same cache, idle or busy cache, idle threads. | |
+ * Same node, other CPU, idle cache, idle threads. | |
+ * Same node, other CPU, busy cache, idle threads. | |
+ * Other core, same cache, busy threads. | |
+ * Same node, other CPU, busy threads. | |
+ * Other node, other CPU, idle cache, idle threads. | |
+ * Other node, other CPU, busy cache, idle threads. | |
+ * Other node, other CPU, busy threads. | |
+ */ | |
+static int best_mask_cpu(int best_cpu, struct rq *rq, cpumask_t *tmpmask) | |
+{ | |
+ int best_ranking = CPUIDLE_DIFF_NODE | CPUIDLE_THROTTLED | | |
+ CPUIDLE_THREAD_BUSY | CPUIDLE_DIFF_CPU | CPUIDLE_CACHE_BUSY | | |
+ CPUIDLE_DIFF_CORE | CPUIDLE_DIFF_THREAD; | |
+ int cpu_tmp; | |
+ | |
+ if (cpu_isset(best_cpu, *tmpmask)) | |
+ goto out; | |
+ | |
+ for_each_cpu_mask(cpu_tmp, *tmpmask) { | |
+ int ranking, locality; | |
+ struct rq *tmp_rq; | |
+ | |
+ ranking = 0; | |
+ tmp_rq = cpu_rq(cpu_tmp); | |
+ | |
+ locality = rq->cpu_locality[cpu_tmp]; | |
+#ifdef CONFIG_NUMA | |
+ if (locality > 3) | |
+ ranking |= CPUIDLE_DIFF_NODE; | |
+ else | |
+#endif | |
+ if (locality > 2) | |
+ ranking |= CPUIDLE_DIFF_CPU; | |
+#ifdef CONFIG_SCHED_MC | |
+ else if (locality == 2) | |
+ ranking |= CPUIDLE_DIFF_CORE; | |
+ if (!(tmp_rq->cache_idle(cpu_tmp))) | |
+ ranking |= CPUIDLE_CACHE_BUSY; | |
+#endif | |
+#ifdef CONFIG_SCHED_SMT | |
+ if (locality == 1) | |
+ ranking |= CPUIDLE_DIFF_THREAD; | |
+ if (!(tmp_rq->siblings_idle(cpu_tmp))) | |
+ ranking |= CPUIDLE_THREAD_BUSY; | |
+#endif | |
+ if (scaling_rq(tmp_rq)) | |
+ ranking |= CPUIDLE_THROTTLED; | |
+ | |
+ if (ranking < best_ranking) { | |
+ best_cpu = cpu_tmp; | |
+ best_ranking = ranking; | |
+ } | |
+ } | |
+out: | |
+ return best_cpu; | |
+} | |
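+ | |
+/* | |
+ * To illustrate the ranking: an idle SMT sibling on the same core with an | |
+ * idle cache scores just CPUIDLE_DIFF_THREAD (1), while an idle CPU on | |
+ * another NUMA node scores at least CPUIDLE_DIFF_NODE (64), so the loop | |
+ * above naturally prefers the topologically closest idle CPU. | |
+ */ | |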
+ | |
+static void resched_best_mask(int best_cpu, struct rq *rq, cpumask_t *tmpmask) | |
+{ | |
+ best_cpu = best_mask_cpu(best_cpu, rq, tmpmask); | |
+ resched_curr(cpu_rq(best_cpu)); | |
+} | |
+ | |
+bool cpus_share_cache(int this_cpu, int that_cpu) | |
+{ | |
+ struct rq *this_rq = cpu_rq(this_cpu); | |
+ | |
+ return (this_rq->cpu_locality[that_cpu] < 3); | |
+} | |
+ | |
+#ifdef CONFIG_SCHED_SMT | |
+#ifdef CONFIG_SMT_NICE | |
+static const cpumask_t *thread_cpumask(int cpu); | |
+ | |
+/* Find the best realtime priority running on any SMT siblings of cpu and, if | |
+ * none are running, the static priority of the best deadline task running. | |
+ * The lookups of the other runqueues are done locklessly as the occasional | |
+ * wrong value would be harmless. */ | |
+static int best_smt_bias(int cpu) | |
+{ | |
+ int other_cpu, best_bias = 0; | |
+ | |
+ for_each_cpu_mask(other_cpu, *thread_cpumask(cpu)) { | |
+ struct rq *rq; | |
+ | |
+ if (other_cpu == cpu) | |
+ continue; | |
+ rq = cpu_rq(other_cpu); | |
+ if (rq_idle(rq)) | |
+ continue; | |
+ if (!rq->online) | |
+ continue; | |
+ if (likely(rq->rq_smt_bias > best_bias)) | |
+ best_bias = rq->rq_smt_bias; | |
+ } | |
+ return best_bias; | |
+} | |
+ | |
+static int task_prio_bias(struct task_struct *p) | |
+{ | |
+ if (rt_task(p)) | |
+ return 1 << 30; | |
+ else if (task_running_iso(p)) | |
+ return 1 << 29; | |
+ else if (task_running_idle(p)) | |
+ return 0; | |
+ return MAX_PRIO - p->static_prio; | |
+} | |
+ | |
+/* We've already decided p can run on this CPU; now test whether it shouldn't | |
+ * for SMT nice reasons. */ | |
+static bool smt_should_schedule(struct task_struct *p, int cpu) | |
+{ | |
+ int best_bias, task_bias; | |
+ | |
+ /* Kernel threads always run */ | |
+ if (unlikely(!p->mm)) | |
+ return true; | |
+ if (rt_task(p)) | |
+ return true; | |
+ best_bias = best_smt_bias(cpu); | |
+ /* The smt siblings are all idle or running IDLEPRIO */ | |
+ if (best_bias < 1) | |
+ return true; | |
+ task_bias = task_prio_bias(p); | |
+ if (task_bias < 1) | |
+ return false; | |
+ if (task_bias >= best_bias) | |
+ return true; | |
+	/* Dither: give normal tasks ~25% of the cpu regardless of nice difference */ | |
+ if (best_bias % 4 == 1) | |
+ return true; | |
+ /* Sorry, you lose */ | |
+ return false; | |
+} | |
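+ | |
+/* | |
+ * In practice the dither check above means a lower-bias task facing a | |
+ * higher-bias sibling is not shut out entirely: it still runs whenever | |
+ * best_bias % 4 == 1, i.e. for roughly a quarter of possible bias values. | |
+ */ | |
+ | |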
+#endif | |
+#endif | |
+ | |
+static bool resched_best_idle(struct task_struct *p) | |
+{ | |
+ cpumask_t tmpmask; | |
+ int best_cpu; | |
+ | |
+ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); | |
+ best_cpu = best_mask_cpu(task_cpu(p), task_rq(p), &tmpmask); | |
+#ifdef CONFIG_SMT_NICE | |
+ if (!smt_should_schedule(p, best_cpu)) | |
+ return false; | |
+#endif | |
+ resched_curr(cpu_rq(best_cpu)); | |
+ return true; | |
+} | |
+ | |
+static inline void resched_suitable_idle(struct task_struct *p) | |
+{ | |
+ if (suitable_idle_cpus(p)) | |
+ resched_best_idle(p); | |
+} | |
+ | |
+/* | |
+ * Flags to tell us whether this CPU is running a CPU frequency governor that | |
+ * has slowed its speed or not. No locking required as the very rare wrongly | |
+ * read value would be harmless. | |
+ */ | |
+void cpu_scaling(int cpu) | |
+{ | |
+ cpu_rq(cpu)->scaling = true; | |
+} | |
+ | |
+void cpu_nonscaling(int cpu) | |
+{ | |
+ cpu_rq(cpu)->scaling = false; | |
+} | |
+ | |
+static inline bool scaling_rq(struct rq *rq) | |
+{ | |
+ return rq->scaling; | |
+} | |
+ | |
+static inline int locality_diff(struct task_struct *p, struct rq *rq) | |
+{ | |
+ return rq->cpu_locality[task_cpu(p)]; | |
+} | |
+#else /* CONFIG_SMP */ | |
+static inline void inc_qnr(void) | |
+{ | |
+} | |
+ | |
+static inline void dec_qnr(void) | |
+{ | |
+} | |
+ | |
+static inline int queued_notrunning(void) | |
+{ | |
+ return grq.nr_running; | |
+} | |
+ | |
+static inline void set_cpuidle_map(int cpu) | |
+{ | |
+} | |
+ | |
+static inline void clear_cpuidle_map(int cpu) | |
+{ | |
+} | |
+ | |
+static inline bool suitable_idle_cpus(struct task_struct *p) | |
+{ | |
+ return uprq->curr == uprq->idle; | |
+} | |
+ | |
+static inline void resched_suitable_idle(struct task_struct *p) | |
+{ | |
+} | |
+ | |
+void cpu_scaling(int __unused) | |
+{ | |
+} | |
+ | |
+void cpu_nonscaling(int __unused) | |
+{ | |
+} | |
+ | |
+/* | |
+ * Although CPUs can scale in UP, there is nowhere else for tasks to go so this | |
+ * always returns false. | |
+ */ | |
+static inline bool scaling_rq(struct rq *rq) | |
+{ | |
+ return false; | |
+} | |
+ | |
+static inline int locality_diff(struct task_struct *p, struct rq *rq) | |
+{ | |
+ return 0; | |
+} | |
+#endif /* CONFIG_SMP */ | |
+EXPORT_SYMBOL_GPL(cpu_scaling); | |
+EXPORT_SYMBOL_GPL(cpu_nonscaling); | |
+ | |
+static inline int normal_prio(struct task_struct *p) | |
+{ | |
+ if (has_rt_policy(p)) | |
+ return MAX_RT_PRIO - 1 - p->rt_priority; | |
+ if (idleprio_task(p)) | |
+ return IDLE_PRIO; | |
+ if (iso_task(p)) | |
+ return ISO_PRIO; | |
+ return NORMAL_PRIO; | |
+} | |
+ | |
+/* | |
+ * Calculate the current priority, i.e. the priority | |
+ * taken into account by the scheduler. This value might | |
+ * be boosted by RT tasks as it will be RT if the task got | |
+ * RT-boosted. If not then it returns p->normal_prio. | |
+ */ | |
+static int effective_prio(struct task_struct *p) | |
+{ | |
+ p->normal_prio = normal_prio(p); | |
+ /* | |
+ * If we are RT tasks or we were boosted to RT priority, | |
+ * keep the priority unchanged. Otherwise, update priority | |
+ * to the normal priority: | |
+ */ | |
+ if (!rt_prio(p->prio)) | |
+ return p->normal_prio; | |
+ return p->prio; | |
+} | |
+ | |
+/* | |
+ * activate_task - move a task to the runqueue. Enter with grq locked. | |
+ */ | |
+static void activate_task(struct task_struct *p, struct rq *rq) | |
+{ | |
+ update_clocks(rq); | |
+ | |
+ /* | |
+ * Sleep time is in units of nanosecs, so shift by 20 to get a | |
+ * milliseconds-range estimation of the amount of time that the task | |
+ * spent sleeping: | |
+ */ | |
+ if (unlikely(prof_on == SLEEP_PROFILING)) { | |
+ if (p->state == TASK_UNINTERRUPTIBLE) | |
+ profile_hits(SLEEP_PROFILING, (void *)get_wchan(p), | |
+ (rq->clock_task - p->last_ran) >> 20); | |
+ } | |
+ | |
+ p->prio = effective_prio(p); | |
+ if (task_contributes_to_load(p)) | |
+ grq.nr_uninterruptible--; | |
+ enqueue_task(p, rq); | |
+ rq->soft_affined++; | |
+ p->on_rq = 1; | |
+ grq.nr_running++; | |
+ inc_qnr(); | |
+} | |
+ | |
+static inline void clear_sticky(struct task_struct *p); | |
+ | |
+/* | |
+ * deactivate_task - If it's running, it's not on the grq and we can just | |
+ * decrement the nr_running. Enter with grq locked. | |
+ */ | |
+static inline void deactivate_task(struct task_struct *p, struct rq *rq) | |
+{ | |
+ if (task_contributes_to_load(p)) | |
+ grq.nr_uninterruptible++; | |
+ rq->soft_affined--; | |
+ p->on_rq = 0; | |
+ grq.nr_running--; | |
+ clear_sticky(p); | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+void set_task_cpu(struct task_struct *p, unsigned int cpu) | |
+{ | |
+#ifdef CONFIG_LOCKDEP | |
+ /* | |
+ * The caller should hold grq lock. | |
+ */ | |
+ WARN_ON_ONCE(debug_locks && !lockdep_is_held(&grq.lock)); | |
+#endif | |
+ if (task_cpu(p) == cpu) | |
+ return; | |
+ trace_sched_migrate_task(p, cpu); | |
+ perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0); | |
+ | |
+ /* | |
+ * After ->cpu is set up to a new value, task_grq_lock(p, ...) can be | |
+ * successfully executed on another CPU. We must ensure that updates of | |
+ * per-task data have been completed by this moment. | |
+ */ | |
+ smp_wmb(); | |
+ if (p->on_rq) { | |
+ task_rq(p)->soft_affined--; | |
+ cpu_rq(cpu)->soft_affined++; | |
+ } | |
+ task_thread_info(p)->cpu = cpu; | |
+} | |
+ | |
+static inline void clear_sticky(struct task_struct *p) | |
+{ | |
+ p->sticky = false; | |
+} | |
+ | |
+static inline bool task_sticky(struct task_struct *p) | |
+{ | |
+ return p->sticky; | |
+} | |
+ | |
+/* Reschedule the best idle CPU that is not this one. */ | |
+static void | |
+resched_closest_idle(struct rq *rq, int cpu, struct task_struct *p) | |
+{ | |
+ cpumask_t tmpmask; | |
+ | |
+ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); | |
+ cpu_clear(cpu, tmpmask); | |
+ if (cpus_empty(tmpmask)) | |
+ return; | |
+ resched_best_mask(cpu, rq, &tmpmask); | |
+} | |
+ | |
+/* | |
+ * We set the sticky flag on a task that is descheduled involuntarily meaning | |
+ * it is awaiting further CPU time. If the last sticky task is still sticky | |
+ * but unlucky enough to not be the next task scheduled, we unstick it and try | |
+ * to find it an idle CPU. Realtime tasks do not stick to minimise their | |
+ * latency at all times. | |
+ */ | |
+static inline void | |
+swap_sticky(struct rq *rq, int cpu, struct task_struct *p) | |
+{ | |
+ if (rq->sticky_task) { | |
+ if (rq->sticky_task == p) { | |
+ p->sticky = true; | |
+ return; | |
+ } | |
+ if (task_sticky(rq->sticky_task)) { | |
+ clear_sticky(rq->sticky_task); | |
+ resched_closest_idle(rq, cpu, rq->sticky_task); | |
+ } | |
+ } | |
+ if (!rt_task(p)) { | |
+ p->sticky = true; | |
+ rq->sticky_task = p; | |
+ } else { | |
+ resched_closest_idle(rq, cpu, p); | |
+ rq->sticky_task = NULL; | |
+ } | |
+} | |
+ | |
+static inline void unstick_task(struct rq *rq, struct task_struct *p) | |
+{ | |
+ rq->sticky_task = NULL; | |
+ clear_sticky(p); | |
+} | |
+#else | |
+static inline void clear_sticky(struct task_struct *p) | |
+{ | |
+} | |
+ | |
+static inline bool task_sticky(struct task_struct *p) | |
+{ | |
+ return false; | |
+} | |
+ | |
+static inline void | |
+swap_sticky(struct rq *rq, int cpu, struct task_struct *p) | |
+{ | |
+} | |
+ | |
+static inline void unstick_task(struct rq *rq, struct task_struct *p) | |
+{ | |
+} | |
+#endif | |
+ | |
+/* | |
+ * Move a task off the global queue and take it to a cpu where it will | |
+ * become the running task. | |
+ */ | |
+static inline void take_task(int cpu, struct task_struct *p) | |
+{ | |
+ set_task_cpu(p, cpu); | |
+ dequeue_task(p); | |
+ clear_sticky(p); | |
+ dec_qnr(); | |
+} | |
+ | |
+/* | |
+ * Returns a descheduling task to the grq runqueue unless it is being | |
+ * deactivated. | |
+ */ | |
+static inline void return_task(struct task_struct *p, struct rq *rq, bool deactivate) | |
+{ | |
+ if (deactivate) | |
+ deactivate_task(p, rq); | |
+ else { | |
+ inc_qnr(); | |
+ enqueue_task(p, rq); | |
+ } | |
+} | |
+ | |
+/* Enter with grq lock held. We know p is on the local cpu */ | |
+static inline void __set_tsk_resched(struct task_struct *p) | |
+{ | |
+ set_tsk_need_resched(p); | |
+ set_preempt_need_resched(); | |
+} | |
+ | |
+/* | |
+ * resched_task - mark a task 'to be rescheduled now'. | |
+ * | |
+ * On UP this means the setting of the need_resched flag, on SMP it | |
+ * might also involve a cross-CPU call to trigger the scheduler on | |
+ * the target CPU. | |
+ */ | |
+void resched_task(struct task_struct *p) | |
+{ | |
+ int cpu; | |
+ | |
+ lockdep_assert_held(&grq.lock); | |
+ | |
+ if (test_tsk_need_resched(p)) | |
+ return; | |
+ | |
+ set_tsk_need_resched(p); | |
+ | |
+ cpu = task_cpu(p); | |
+ if (cpu == smp_processor_id()) { | |
+ set_preempt_need_resched(); | |
+ return; | |
+ } | |
+ | |
+ smp_send_reschedule(cpu); | |
+} | |
+ | |
+/** | |
+ * task_curr - is this task currently executing on a CPU? | |
+ * @p: the task in question. | |
+ * | |
+ * Return: 1 if the task is currently executing. 0 otherwise. | |
+ */ | |
+inline int task_curr(const struct task_struct *p) | |
+{ | |
+ return cpu_curr(task_cpu(p)) == p; | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+struct migration_req { | |
+ struct task_struct *task; | |
+ int dest_cpu; | |
+}; | |
+ | |
+/* | |
+ * wait_task_inactive - wait for a thread to unschedule. | |
+ * | |
+ * If @match_state is nonzero, it's the @p->state value just checked and | |
+ * not expected to change. If it changes, i.e. @p might have woken up, | |
+ * then return zero. When we succeed in waiting for @p to be off its CPU, | |
+ * we return a positive number (its total switch count). If a second call | |
+ * a short while later returns the same number, the caller can be sure that | |
+ * @p has remained unscheduled the whole time. | |
+ * | |
+ * The caller must ensure that the task *will* unschedule sometime soon, | |
+ * else this function might spin for a *long* time. This function can't | |
+ * be called with interrupts off, or it may introduce deadlock with | |
+ * smp_call_function() if an IPI is sent by the same process we are | |
+ * waiting to become inactive. | |
+ */ | |
+unsigned long wait_task_inactive(struct task_struct *p, long match_state) | |
+{ | |
+ unsigned long flags; | |
+ bool running, on_rq; | |
+ unsigned long ncsw; | |
+ struct rq *rq; | |
+ | |
+ for (;;) { | |
+ rq = task_rq(p); | |
+ | |
+ /* | |
+ * If the task is actively running on another CPU | |
+ * still, just relax and busy-wait without holding | |
+ * any locks. | |
+ * | |
+ * NOTE! Since we don't hold any locks, it's not | |
+ * even sure that "rq" stays as the right runqueue! | |
+ * But we don't care, since this will return false | |
+ * if the runqueue has changed and p is actually now | |
+ * running somewhere else! | |
+ */ | |
+ while (task_running(p) && p == rq->curr) { | |
+ if (match_state && unlikely(p->state != match_state)) | |
+ return 0; | |
+ cpu_relax(); | |
+ } | |
+ | |
+ /* | |
+ * Ok, time to look more closely! We need the grq | |
+ * lock now, to be *sure*. If we're wrong, we'll | |
+ * just go back and repeat. | |
+ */ | |
+ rq = task_grq_lock(p, &flags); | |
+ trace_sched_wait_task(p); | |
+ running = task_running(p); | |
+ on_rq = p->on_rq; | |
+ ncsw = 0; | |
+ if (!match_state || p->state == match_state) | |
+ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */ | |
+ task_grq_unlock(&flags); | |
+ | |
+ /* | |
+ * If it changed from the expected state, bail out now. | |
+ */ | |
+ if (unlikely(!ncsw)) | |
+ break; | |
+ | |
+ /* | |
+ * Was it really running after all now that we | |
+ * checked with the proper locks actually held? | |
+ * | |
+ * Oops. Go back and try again.. | |
+ */ | |
+ if (unlikely(running)) { | |
+ cpu_relax(); | |
+ continue; | |
+ } | |
+ | |
+ /* | |
+ * It's not enough that it's not actively running, | |
+ * it must be off the runqueue _entirely_, and not | |
+ * preempted! | |
+ * | |
+ * So if it was still runnable (but just not actively | |
+ * running right now), it's preempted, and we should | |
+ * yield - it could be a while. | |
+ */ | |
+ if (unlikely(on_rq)) { | |
+ ktime_t to = ktime_set(0, NSEC_PER_SEC / HZ); | |
+ | |
+ set_current_state(TASK_UNINTERRUPTIBLE); | |
+ schedule_hrtimeout(&to, HRTIMER_MODE_REL); | |
+ continue; | |
+ } | |
+ | |
+ /* | |
+ * Ahh, all good. It wasn't running, and it wasn't | |
+ * runnable, which means that it will never become | |
+ * running in the future either. We're all done! | |
+ */ | |
+ break; | |
+ } | |
+ | |
+ return ncsw; | |
+} | |
+ | |
+/*** | |
+ * kick_process - kick a running thread to enter/exit the kernel | |
+ * @p: the to-be-kicked thread | |
+ * | |
+ * Cause a process which is running on another CPU to enter | |
+ * kernel-mode, without any delay. (to get signals handled.) | |
+ * | |
+ * NOTE: this function doesn't have to take the runqueue lock, | |
+ * because all it wants to ensure is that the remote task enters | |
+ * the kernel. If the IPI races and the task has been migrated | |
+ * to another CPU then no harm is done and the purpose has been | |
+ * achieved as well. | |
+ */ | |
+void kick_process(struct task_struct *p) | |
+{ | |
+ int cpu; | |
+ | |
+ preempt_disable(); | |
+ cpu = task_cpu(p); | |
+ if ((cpu != smp_processor_id()) && task_curr(p)) | |
+ smp_send_reschedule(cpu); | |
+ preempt_enable(); | |
+} | |
+EXPORT_SYMBOL_GPL(kick_process); | |
+#endif | |
+ | |
+/* | |
+ * RT tasks preempt purely on priority. SCHED_NORMAL tasks preempt on the | |
+ * basis of earlier deadlines. SCHED_IDLEPRIO tasks don't preempt anything | |
+ * else or each other; they cooperatively multitask. An idle rq scores as | |
+ * prio PRIO_LIMIT so it is always preempted. | |
+ */ | |
+static inline bool | |
+can_preempt(struct task_struct *p, int prio, u64 deadline) | |
+{ | |
+ /* Better static priority RT task or better policy preemption */ | |
+ if (p->prio < prio) | |
+ return true; | |
+ if (p->prio > prio) | |
+ return false; | |
+ /* SCHED_NORMAL, BATCH and ISO will preempt based on deadline */ | |
+ if (!deadline_before(p->deadline, deadline)) | |
+ return false; | |
+ return true; | |
+} | |
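+ | |
+/* | |
+ * For example, a SCHED_NORMAL wakee preempts a running SCHED_NORMAL task | |
+ * only if its virtual deadline is strictly earlier; if the two deadlines | |
+ * are equal the running task is left alone. | |
+ */ | |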
+ | |
+#ifdef CONFIG_SMP | |
+#define cpu_online_map (*(cpumask_t *)cpu_online_mask) | |
+#ifdef CONFIG_HOTPLUG_CPU | |
+/* | |
+ * Check to see if there is a task that is affined only to offline CPUs but | |
+ * still wants runtime. This happens to kernel threads during suspend/halt and | |
+ * disabling of CPUs. | |
+ */ | |
+static inline bool online_cpus(struct task_struct *p) | |
+{ | |
+ return (likely(cpus_intersects(cpu_online_map, p->cpus_allowed))); | |
+} | |
+#else /* CONFIG_HOTPLUG_CPU */ | |
+/* All available CPUs are always online without hotplug. */ | |
+static inline bool online_cpus(struct task_struct *p) | |
+{ | |
+ return true; | |
+} | |
+#endif | |
+ | |
+/* | |
+ * Check to see if p can run on cpu, and if not, whether there are any online | |
+ * CPUs it can run on instead. | |
+ */ | |
+static inline bool needs_other_cpu(struct task_struct *p, int cpu) | |
+{ | |
+ if (unlikely(!cpu_isset(cpu, p->cpus_allowed))) | |
+ return true; | |
+ return false; | |
+} | |
+ | |
+/* | |
+ * When all else is equal, still prefer this_rq. | |
+ */ | |
+static void try_preempt(struct task_struct *p, struct rq *this_rq) | |
+{ | |
+ struct rq *highest_prio_rq = NULL; | |
+ int cpu, highest_prio; | |
+ u64 latest_deadline; | |
+ cpumask_t tmp; | |
+ | |
+ /* | |
+	 * We clear the sticky flag here because a task reaching try_preempt with | |
+	 * the sticky flag set means some complicated re-scheduling has occurred | |
+	 * and the flag should be ignored. | |
+ */ | |
+ clear_sticky(p); | |
+ | |
+ if (suitable_idle_cpus(p) && resched_best_idle(p)) | |
+ return; | |
+ | |
+ /* IDLEPRIO tasks never preempt anything but idle */ | |
+ if (p->policy == SCHED_IDLEPRIO) | |
+ return; | |
+ | |
+ if (likely(online_cpus(p))) | |
+ cpus_and(tmp, cpu_online_map, p->cpus_allowed); | |
+ else | |
+ return; | |
+ | |
+ highest_prio = latest_deadline = 0; | |
+ | |
+ for_each_cpu_mask(cpu, tmp) { | |
+ struct rq *rq; | |
+ int rq_prio; | |
+ | |
+ rq = cpu_rq(cpu); | |
+ rq_prio = rq->rq_prio; | |
+ if (rq_prio < highest_prio) | |
+ continue; | |
+ | |
+ if (rq_prio > highest_prio || | |
+ deadline_after(rq->rq_deadline, latest_deadline)) { | |
+ latest_deadline = rq->rq_deadline; | |
+ highest_prio = rq_prio; | |
+ highest_prio_rq = rq; | |
+ } | |
+ } | |
+ | |
+ if (likely(highest_prio_rq)) { | |
+#ifdef CONFIG_SMT_NICE | |
+ cpu = cpu_of(highest_prio_rq); | |
+ if (!smt_should_schedule(p, cpu)) | |
+ return; | |
+#endif | |
+ if (can_preempt(p, highest_prio, highest_prio_rq->rq_deadline)) | |
+ resched_curr(highest_prio_rq); | |
+ } | |
+} | |
+#else /* CONFIG_SMP */ | |
+static inline bool needs_other_cpu(struct task_struct *p, int cpu) | |
+{ | |
+ return false; | |
+} | |
+ | |
+static void try_preempt(struct task_struct *p, struct rq *this_rq) | |
+{ | |
+ if (p->policy == SCHED_IDLEPRIO) | |
+ return; | |
+ if (can_preempt(p, uprq->rq_prio, uprq->rq_deadline)) | |
+ resched_curr(uprq); | |
+} | |
+#endif /* CONFIG_SMP */ | |
+ | |
+static void | |
+ttwu_stat(struct task_struct *p, int cpu, int wake_flags) | |
+{ | |
+#ifdef CONFIG_SCHEDSTATS | |
+ struct rq *rq = this_rq(); | |
+ | |
+#ifdef CONFIG_SMP | |
+ int this_cpu = smp_processor_id(); | |
+ | |
+ if (cpu == this_cpu) | |
+ schedstat_inc(rq, ttwu_local); | |
+ else { | |
+ struct sched_domain *sd; | |
+ | |
+ rcu_read_lock(); | |
+ for_each_domain(this_cpu, sd) { | |
+ if (cpumask_test_cpu(cpu, sched_domain_span(sd))) { | |
+ schedstat_inc(sd, ttwu_wake_remote); | |
+ break; | |
+ } | |
+ } | |
+ rcu_read_unlock(); | |
+ } | |
+ | |
+#endif /* CONFIG_SMP */ | |
+ | |
+ schedstat_inc(rq, ttwu_count); | |
+#endif /* CONFIG_SCHEDSTATS */ | |
+} | |
+ | |
+void wake_up_if_idle(int cpu) | |
+{ | |
+ struct rq *rq = cpu_rq(cpu); | |
+ unsigned long flags; | |
+ | |
+ if (!is_idle_task(rq->curr)) | |
+ return; | |
+ | |
+ grq_lock_irqsave(&flags); | |
+ if (likely(is_idle_task(rq->curr))) | |
+ smp_send_reschedule(cpu); | |
+ /* Else cpu is not in idle, do nothing here */ | |
+ grq_unlock_irqrestore(&flags); | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+void scheduler_ipi(void) | |
+{ | |
+ /* | |
+ * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting | |
+ * TIF_NEED_RESCHED remotely (for the first time) will also send | |
+ * this IPI. | |
+ */ | |
+ preempt_fold_need_resched(); | |
+} | |
+#endif | |
+ | |
+static inline void ttwu_activate(struct task_struct *p, struct rq *rq, | |
+ bool is_sync) | |
+{ | |
+ activate_task(p, rq); | |
+ | |
+ /* | |
+ * Sync wakeups (i.e. those types of wakeups where the waker | |
+ * has indicated that it will leave the CPU in short order) | |
+ * don't trigger a preemption if there are no idle cpus, | |
+ * instead waiting for current to deschedule. | |
+ */ | |
+ if (!is_sync || suitable_idle_cpus(p)) | |
+ try_preempt(p, rq); | |
+} | |
+ | |
+static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq, | |
+ bool success) | |
+{ | |
+ trace_sched_wakeup(p, success); | |
+ p->state = TASK_RUNNING; | |
+ | |
+ /* | |
+ * if a worker is waking up, notify workqueue. Note that on BFS, we | |
+ * don't really know what cpu it will be, so we fake it for | |
+ * wq_worker_waking_up :/ | |
+ */ | |
+ if ((p->flags & PF_WQ_WORKER) && success) | |
+ wq_worker_waking_up(p, cpu_of(rq)); | |
+} | |
+ | |
+/* | |
+ * wake flags | |
+ */ | |
+#define WF_SYNC 0x01 /* waker goes to sleep after wakeup */ | |
+#define WF_FORK 0x02 /* child wakeup after fork */ | |
+#define WF_MIGRATED 0x4 /* internal use, task got migrated */ | |
+ | |
+/*** | |
+ * try_to_wake_up - wake up a thread | |
+ * @p: the thread to be awakened | |
+ * @state: the mask of task states that can be woken | |
+ * @wake_flags: wake modifier flags (WF_*) | |
+ * | |
+ * Put it on the run-queue if it's not already there. The "current" | |
+ * thread is always on the run-queue (except when the actual | |
+ * re-schedule is in progress), and as such you're allowed to do | |
+ * the simpler "current->state = TASK_RUNNING" to mark yourself | |
+ * runnable without the overhead of this. | |
+ * | |
+ * Return: %true if @p was woken up, %false if it was already running | |
+ * or @state didn't match @p's state. | |
+ */ | |
+static bool try_to_wake_up(struct task_struct *p, unsigned int state, | |
+ int wake_flags) | |
+{ | |
+ bool success = false; | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ int cpu; | |
+ | |
+ get_cpu(); | |
+ | |
+ /* | |
+ * If we are going to wake up a thread waiting for CONDITION we | |
+ * need to ensure that CONDITION=1 done by the caller can not be | |
+ * reordered with p->state check below. This pairs with mb() in | |
+ * set_current_state() the waiting thread does. | |
+ */ | |
+ smp_mb__before_spinlock(); | |
+ | |
+ /* | |
+ * No need to do time_lock_grq as we only need to update the rq clock | |
+ * if we activate the task | |
+ */ | |
+ rq = task_grq_lock(p, &flags); | |
+ cpu = task_cpu(p); | |
+ | |
+ /* state is a volatile long, どうして、分からない */ | |
+ if (!((unsigned int)p->state & state)) | |
+ goto out_unlock; | |
+ | |
+ if (task_queued(p) || task_running(p)) | |
+ goto out_running; | |
+ | |
+ ttwu_activate(p, rq, wake_flags & WF_SYNC); | |
+ success = true; | |
+ | |
+out_running: | |
+ ttwu_post_activation(p, rq, success); | |
+out_unlock: | |
+ task_grq_unlock(&flags); | |
+ | |
+ ttwu_stat(p, cpu, wake_flags); | |
+ | |
+ put_cpu(); | |
+ | |
+ return success; | |
+} | |
+ | |
+/** | |
+ * try_to_wake_up_local - try to wake up a local task with grq lock held | |
+ * @p: the thread to be awakened | |
+ * | |
+ * Put @p on the run-queue if it's not already there. The caller must | |
+ * ensure that grq is locked and @p is not the current task. | |
+ * grq stays locked over invocation. | |
+ */ | |
+static void try_to_wake_up_local(struct task_struct *p) | |
+{ | |
+ struct rq *rq = task_rq(p); | |
+ bool success = false; | |
+ | |
+ lockdep_assert_held(&grq.lock); | |
+ | |
+ if (!(p->state & TASK_NORMAL)) | |
+ return; | |
+ | |
+ if (!task_queued(p)) { | |
+ if (likely(!task_running(p))) { | |
+ schedstat_inc(rq, ttwu_count); | |
+ schedstat_inc(rq, ttwu_local); | |
+ } | |
+ ttwu_activate(p, rq, false); | |
+ ttwu_stat(p, smp_processor_id(), 0); | |
+ success = true; | |
+ } | |
+ ttwu_post_activation(p, rq, success); | |
+} | |
+ | |
+/** | |
+ * wake_up_process - Wake up a specific process | |
+ * @p: The process to be woken up. | |
+ * | |
+ * Attempt to wake up the nominated process and move it to the set of runnable | |
+ * processes. | |
+ * | |
+ * Return: 1 if the process was woken up, 0 if it was already running. | |
+ * | |
+ * It may be assumed that this function implies a write memory barrier before | |
+ * changing the task state if and only if any tasks are woken up. | |
+ */ | |
+int wake_up_process(struct task_struct *p) | |
+{ | |
+ WARN_ON(task_is_stopped_or_traced(p)); | |
+ return try_to_wake_up(p, TASK_NORMAL, 0); | |
+} | |
+EXPORT_SYMBOL(wake_up_process); | |
+ | |
+int wake_up_state(struct task_struct *p, unsigned int state) | |
+{ | |
+ return try_to_wake_up(p, state, 0); | |
+} | |
+ | |
+static void time_slice_expired(struct task_struct *p); | |
+ | |
+/* | |
+ * Perform scheduler related setup for a newly forked process p. | |
+ * p is forked by current. | |
+ */ | |
+int sched_fork(unsigned long __maybe_unused clone_flags, struct task_struct *p) | |
+{ | |
+#ifdef CONFIG_PREEMPT_NOTIFIERS | |
+ INIT_HLIST_HEAD(&p->preempt_notifiers); | |
+#endif | |
+ /* | |
+	 * The process state is set to the same value as that of the process | |
+	 * executing do_fork(), i.e. running. This guarantees that nobody will | |
+ * actually run it, and a signal or other external event cannot wake | |
+ * it up and insert it on the runqueue either. | |
+ */ | |
+ | |
+ /* Should be reset in fork.c but done here for ease of bfs patching */ | |
+ p->on_rq = | |
+ p->utime = | |
+ p->stime = | |
+ p->utimescaled = | |
+ p->stimescaled = | |
+ p->sched_time = | |
+ p->stime_pc = | |
+ p->utime_pc = 0; | |
+ | |
+ /* | |
+ * Revert to default priority/policy on fork if requested. | |
+ */ | |
+ if (unlikely(p->sched_reset_on_fork)) { | |
+ if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) { | |
+ p->policy = SCHED_NORMAL; | |
+ p->normal_prio = normal_prio(p); | |
+ } | |
+ | |
+ if (PRIO_TO_NICE(p->static_prio) < 0) { | |
+ p->static_prio = NICE_TO_PRIO(0); | |
+ p->normal_prio = p->static_prio; | |
+ } | |
+ | |
+ /* | |
+ * We don't need the reset flag anymore after the fork. It has | |
+ * fulfilled its duty: | |
+ */ | |
+ p->sched_reset_on_fork = 0; | |
+ } | |
+ | |
+ INIT_LIST_HEAD(&p->run_list); | |
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) | |
+ if (unlikely(sched_info_on())) | |
+ memset(&p->sched_info, 0, sizeof(p->sched_info)); | |
+#endif | |
+ p->on_cpu = false; | |
+ clear_sticky(p); | |
+ init_task_preempt_count(p); | |
+ return 0; | |
+} | |
+ | |
+/* | |
+ * wake_up_new_task - wake up a newly created task for the first time. | |
+ * | |
+ * This function will do some initial scheduler statistics housekeeping | |
+ * that must be done for every newly created context, then puts the task | |
+ * on the runqueue and wakes it. | |
+ */ | |
+void wake_up_new_task(struct task_struct *p) | |
+{ | |
+ struct task_struct *parent; | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ | |
+ parent = p->parent; | |
+ rq = task_grq_lock(p, &flags); | |
+ | |
+ /* | |
+ * Reinit new task deadline as its creator deadline could have changed | |
+ * since call to dup_task_struct(). | |
+ */ | |
+ p->deadline = rq->rq_deadline; | |
+ | |
+ /* | |
+ * If the task is a new process, current and parent are the same. If | |
+ * the task is a new thread in the thread group, it will have much more | |
+ * in common with current than with the parent. | |
+ */ | |
+ set_task_cpu(p, task_cpu(rq->curr)); | |
+ | |
+ /* | |
+ * Make sure we do not leak PI boosting priority to the child. | |
+ */ | |
+ p->prio = rq->curr->normal_prio; | |
+ | |
+ activate_task(p, rq); | |
+ trace_sched_wakeup_new(p, 1); | |
+ if (unlikely(p->policy == SCHED_FIFO)) | |
+ goto after_ts_init; | |
+ | |
+ /* | |
+ * Share the timeslice between parent and child, thus the | |
+ * total amount of pending timeslices in the system doesn't change, | |
+ * resulting in more scheduling fairness. If it's negative, it won't | |
+ * matter since that's the same as being 0. current's time_slice is | |
+ * actually in rq_time_slice when it's running, as is its last_ran | |
+ * value. rq->rq_deadline is only modified within schedule() so it | |
+ * is always equal to current->deadline. | |
+ */ | |
+ p->last_ran = rq->rq_last_ran; | |
+ if (likely(rq->rq_time_slice >= RESCHED_US * 2)) { | |
+ rq->rq_time_slice /= 2; | |
+ p->time_slice = rq->rq_time_slice; | |
+after_ts_init: | |
+ if (rq->curr == parent && !suitable_idle_cpus(p)) { | |
+ /* | |
+ * The VM isn't cloned, so we're in a good position to | |
+ * do child-runs-first in anticipation of an exec. This | |
+ * usually avoids a lot of COW overhead. | |
+ */ | |
+ __set_tsk_resched(parent); | |
+ } else | |
+ try_preempt(p, rq); | |
+ } else { | |
+ if (rq->curr == parent) { | |
+ /* | |
+ * Forking task has run out of timeslice. Reschedule it and | |
+ * start its child with a new time slice and deadline. The | |
+ * child will end up running first because its deadline will | |
+ * be slightly earlier. | |
+ */ | |
+ rq->rq_time_slice = 0; | |
+ __set_tsk_resched(parent); | |
+ } | |
+ time_slice_expired(p); | |
+ } | |
+ task_grq_unlock(&flags); | |
+} | |
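+ | |
+/* | |
+ * For example, a parent with 4000 units of rq_time_slice remaining keeps | |
+ * 2000 and the child starts with the other 2000; only when less than | |
+ * RESCHED_US * 2 remains does the child instead start via | |
+ * time_slice_expired() with a fresh slice and deadline. | |
+ */ | |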
+ | |
+#ifdef CONFIG_PREEMPT_NOTIFIERS | |
+ | |
+/** | |
+ * preempt_notifier_register - tell me when current is being preempted & rescheduled | |
+ * @notifier: notifier struct to register | |
+ */ | |
+void preempt_notifier_register(struct preempt_notifier *notifier) | |
+{ | |
+ hlist_add_head(¬ifier->link, ¤t->preempt_notifiers); | |
+} | |
+EXPORT_SYMBOL_GPL(preempt_notifier_register); | |
+ | |
+/** | |
+ * preempt_notifier_unregister - no longer interested in preemption notifications | |
+ * @notifier: notifier struct to unregister | |
+ * | |
+ * This is safe to call from within a preemption notifier. | |
+ */ | |
+void preempt_notifier_unregister(struct preempt_notifier *notifier) | |
+{ | |
+ hlist_del(¬ifier->link); | |
+} | |
+EXPORT_SYMBOL_GPL(preempt_notifier_unregister); | |
+ | |
+static void fire_sched_in_preempt_notifiers(struct task_struct *curr) | |
+{ | |
+ struct preempt_notifier *notifier; | |
+ | |
+ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link) | |
+ notifier->ops->sched_in(notifier, raw_smp_processor_id()); | |
+} | |
+ | |
+static void | |
+fire_sched_out_preempt_notifiers(struct task_struct *curr, | |
+ struct task_struct *next) | |
+{ | |
+ struct preempt_notifier *notifier; | |
+ | |
+ hlist_for_each_entry(notifier, &curr->preempt_notifiers, link) | |
+ notifier->ops->sched_out(notifier, next); | |
+} | |
+ | |
+#else /* !CONFIG_PREEMPT_NOTIFIERS */ | |
+ | |
+static void fire_sched_in_preempt_notifiers(struct task_struct *curr) | |
+{ | |
+} | |
+ | |
+static void | |
+fire_sched_out_preempt_notifiers(struct task_struct *curr, | |
+ struct task_struct *next) | |
+{ | |
+} | |
+ | |
+#endif /* CONFIG_PREEMPT_NOTIFIERS */ | |
+ | |
+/** | |
+ * prepare_task_switch - prepare to switch tasks | |
+ * @rq: the runqueue preparing to switch | |
+ * @next: the task we are going to switch to. | |
+ * | |
+ * This is called with the rq lock held and interrupts off. It must | |
+ * be paired with a subsequent finish_task_switch after the context | |
+ * switch. | |
+ * | |
+ * prepare_task_switch sets up locking and calls architecture specific | |
+ * hooks. | |
+ */ | |
+static inline void | |
+prepare_task_switch(struct rq *rq, struct task_struct *prev, | |
+ struct task_struct *next) | |
+{ | |
+ sched_info_switch(rq, prev, next); | |
+ perf_event_task_sched_out(prev, next); | |
+ fire_sched_out_preempt_notifiers(prev, next); | |
+ prepare_lock_switch(rq, next); | |
+ prepare_arch_switch(next); | |
+ trace_sched_switch(prev, next); | |
+} | |
+ | |
+/** | |
+ * finish_task_switch - clean up after a task-switch | |
+ * @rq: runqueue associated with task-switch | |
+ * @prev: the thread we just switched away from. | |
+ * | |
+ * finish_task_switch must be called after the context switch, paired | |
+ * with a prepare_task_switch call before the context switch. | |
+ * finish_task_switch will reconcile locking set up by prepare_task_switch, | |
+ * and do any other architecture-specific cleanup actions. | |
+ * | |
+ * Note that we may have delayed dropping an mm in context_switch(). If | |
+ * so, we finish that here outside of the runqueue lock. (Doing it | |
+ * with the lock held can cause deadlocks; see schedule() for | |
+ * details.) | |
+ */ | |
+static inline void finish_task_switch(struct rq *rq, struct task_struct *prev) | |
+ __releases(grq.lock) | |
+{ | |
+ struct mm_struct *mm = rq->prev_mm; | |
+ long prev_state; | |
+ | |
+ rq->prev_mm = NULL; | |
+ | |
+ /* | |
+ * A task struct has one reference for the use as "current". | |
+ * If a task dies, then it sets TASK_DEAD in tsk->state and calls | |
+ * schedule one last time. The schedule call will never return, and | |
+ * the scheduled task must drop that reference. | |
+ * The test for TASK_DEAD must occur while the runqueue locks are | |
+ * still held, otherwise prev could be scheduled on another cpu, die | |
+ * there before we look at prev->state, and then the reference would | |
+ * be dropped twice. | |
+ * Manfred Spraul <manfred@colorfullife.com> | |
+ */ | |
+ prev_state = prev->state; | |
+ vtime_task_switch(prev); | |
+ finish_arch_switch(prev); | |
+ perf_event_task_sched_in(prev, current); | |
+ finish_lock_switch(rq, prev); | |
+ finish_arch_post_lock_switch(); | |
+ | |
+ fire_sched_in_preempt_notifiers(current); | |
+ if (mm) | |
+ mmdrop(mm); | |
+ if (unlikely(prev_state == TASK_DEAD)) { | |
+ /* | |
+ * Remove function-return probe instances associated with this | |
+ * task and put them back on the free list. | |
+ */ | |
+ kprobe_flush_task(prev); | |
+ put_task_struct(prev); | |
+ } | |
+} | |
+ | |
+/** | |
+ * schedule_tail - first thing a freshly forked thread must call. | |
+ * @prev: the thread we just switched away from. | |
+ */ | |
+asmlinkage __visible void schedule_tail(struct task_struct *prev) | |
+ __releases(grq.lock) | |
+{ | |
+ struct rq *rq = this_rq(); | |
+ | |
+ finish_task_switch(rq, prev); | |
+ if (current->set_child_tid) | |
+ put_user(task_pid_vnr(current), current->set_child_tid); | |
+} | |
+ | |
+/* | |
+ * context_switch - switch to the new MM and the new | |
+ * thread's register state. | |
+ */ | |
+static inline void | |
+context_switch(struct rq *rq, struct task_struct *prev, | |
+ struct task_struct *next) | |
+{ | |
+ struct mm_struct *mm, *oldmm; | |
+ | |
+ prepare_task_switch(rq, prev, next); | |
+ | |
+ mm = next->mm; | |
+ oldmm = prev->active_mm; | |
+ /* | |
+ * For paravirt, this is coupled with an exit in switch_to to | |
+ * combine the page table reload and the switch backend into | |
+ * one hypercall. | |
+ */ | |
+ arch_start_context_switch(prev); | |
+ | |
+ if (!mm) { | |
+ next->active_mm = oldmm; | |
+ atomic_inc(&oldmm->mm_count); | |
+ enter_lazy_tlb(oldmm, next); | |
+ } else | |
+ switch_mm(oldmm, mm, next); | |
+ | |
+ if (!prev->mm) { | |
+ prev->active_mm = NULL; | |
+ rq->prev_mm = oldmm; | |
+ } | |
+ /* | |
+	 * The runqueue lock will be released by the next | |
+ * task (which is an invalid locking op but in the case | |
+ * of the scheduler it's an obvious special-case), so we | |
+ * do an early lockdep release here: | |
+ */ | |
+ spin_release(&grq.lock.dep_map, 1, _THIS_IP_); | |
+ | |
+ /* Here we just switch the register state and the stack. */ | |
+ context_tracking_task_switch(prev, next); | |
+ switch_to(prev, next, prev); | |
+ | |
+ barrier(); | |
+ /* | |
+ * this_rq must be evaluated again because prev may have moved | |
+ * CPUs since it called schedule(), thus the 'rq' on its stack | |
+ * frame will be invalid. | |
+ */ | |
+ finish_task_switch(this_rq(), prev); | |
+} | |
+ | |
+/* | |
+ * nr_running, nr_uninterruptible and nr_context_switches: | |
+ * | |
+ * externally visible scheduler statistics: current number of runnable | |
+ * threads, total number of context switches performed since bootup. All are | |
+ * measured without grabbing the grq lock but the occasional inaccurate result | |
+ * doesn't matter so long as it's positive. | |
+ */ | |
+unsigned long nr_running(void) | |
+{ | |
+ long nr = grq.nr_running; | |
+ | |
+ if (unlikely(nr < 0)) | |
+ nr = 0; | |
+ return (unsigned long)nr; | |
+} | |
+ | |
+static unsigned long nr_uninterruptible(void) | |
+{ | |
+ long nu = grq.nr_uninterruptible; | |
+ | |
+ if (unlikely(nu < 0)) | |
+ nu = 0; | |
+ return nu; | |
+} | |
+ | |
+/* | |
+ * Check if only the current task is running on the cpu. | |
+ */ | |
+bool single_task_running(void) | |
+{ | |
+ if (cpu_rq(smp_processor_id())->soft_affined == 1) | |
+ return true; | |
+ else | |
+ return false; | |
+} | |
+EXPORT_SYMBOL(single_task_running); | |
+ | |
+unsigned long long nr_context_switches(void) | |
+{ | |
+ long long ns = grq.nr_switches; | |
+ | |
+ /* This is of course impossible */ | |
+ if (unlikely(ns < 0)) | |
+ ns = 1; | |
+ return (unsigned long long)ns; | |
+} | |
+ | |
+unsigned long nr_iowait(void) | |
+{ | |
+ unsigned long i, sum = 0; | |
+ | |
+ for_each_possible_cpu(i) | |
+ sum += atomic_read(&cpu_rq(i)->nr_iowait); | |
+ | |
+ return sum; | |
+} | |
+ | |
+unsigned long nr_iowait_cpu(int cpu) | |
+{ | |
+ struct rq *this = cpu_rq(cpu); | |
+ return atomic_read(&this->nr_iowait); | |
+} | |
+ | |
+unsigned long nr_active(void) | |
+{ | |
+ return nr_running() + nr_uninterruptible(); | |
+} | |
+ | |
+/* Beyond the task running on this CPU, load is equal everywhere on BFS, so we | |
+ * base it on the number of running or queued tasks whose ->rq pointer is set | |
+ * to this cpu, i.e. the CPU they are most likely to run on. */ | |
+void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) | |
+{ | |
+ struct rq *this = this_rq(); | |
+ | |
+ *nr_waiters = atomic_read(&this->nr_iowait); | |
+ *load = this->soft_affined; | |
+} | |
+ | |
+/* Variables and functions for calc_load */ | |
+static unsigned long calc_load_update; | |
+unsigned long avenrun[3]; | |
+EXPORT_SYMBOL(avenrun); | |
+ | |
+/** | |
+ * get_avenrun - get the load average array | |
+ * @loads: pointer to dest load array | |
+ * @offset: offset to add | |
+ * @shift: shift count to shift the result left | |
+ * | |
+ * These values are estimates at best, so no need for locking. | |
+ */ | |
+void get_avenrun(unsigned long *loads, unsigned long offset, int shift) | |
+{ | |
+ loads[0] = (avenrun[0] + offset) << shift; | |
+ loads[1] = (avenrun[1] + offset) << shift; | |
+ loads[2] = (avenrun[2] + offset) << shift; | |
+} | |
+ | |
+static unsigned long | |
+calc_load(unsigned long load, unsigned long exp, unsigned long active) | |
+{ | |
+ load *= exp; | |
+ load += active * (FIXED_1 - exp); | |
+ return load >> FSHIFT; | |
+} | |
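+ | |
+/* | |
+ * Worked example, assuming the usual fixed-point constants FSHIFT = 11, | |
+ * FIXED_1 = 2048 and EXP_1 = 1884: with a previous 1-minute average of 0 | |
+ * and 4 active tasks (active = 8192), the new value is | |
+ * (0 * 1884 + 8192 * (2048 - 1884)) >> 11 = 656, i.e. a load of | |
+ * 656 / 2048 ~= 0.32 that decays towards 4.0 over subsequent periods. | |
+ */ | |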
+ | |
+/* | |
+ * calc_load - update the avenrun load estimates every LOAD_FREQ seconds. | |
+ */ | |
+void calc_global_load(unsigned long ticks) | |
+{ | |
+ long active; | |
+ | |
+ if (time_before(jiffies, calc_load_update)) | |
+ return; | |
+ active = nr_active() * FIXED_1; | |
+ | |
+ avenrun[0] = calc_load(avenrun[0], EXP_1, active); | |
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active); | |
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active); | |
+ | |
+ calc_load_update = jiffies + LOAD_FREQ; | |
+} | |
+ | |
+DEFINE_PER_CPU(struct kernel_stat, kstat); | |
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat); | |
+ | |
+EXPORT_PER_CPU_SYMBOL(kstat); | |
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat); | |
+ | |
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING | |
+ | |
+/* | |
+ * There are no locks covering percpu hardirq/softirq time. | |
+ * They are only modified in account_system_vtime, on the corresponding CPU | |
+ * with interrupts disabled. So, writes are safe. | |
+ * They are read and saved off onto struct rq in update_rq_clock(). | |
+ * This may result in another CPU reading this CPU's irq time and can | |
+ * race with irq/account_system_vtime on this CPU. We would either get the | |
+ * old or the new value, with a side effect of accounting a slice of irq time | |
+ * to the wrong task when an irq is in progress while we read rq->clock. That | |
+ * is a worthy compromise in place of having locks on each irq in | |
+ * account_system_time. | |
+ */ | |
+static DEFINE_PER_CPU(u64, cpu_hardirq_time); | |
+static DEFINE_PER_CPU(u64, cpu_softirq_time); | |
+ | |
+static DEFINE_PER_CPU(u64, irq_start_time); | |
+static int sched_clock_irqtime; | |
+ | |
+void enable_sched_clock_irqtime(void) | |
+{ | |
+ sched_clock_irqtime = 1; | |
+} | |
+ | |
+void disable_sched_clock_irqtime(void) | |
+{ | |
+ sched_clock_irqtime = 0; | |
+} | |
+ | |
+#ifndef CONFIG_64BIT | |
+static DEFINE_PER_CPU(seqcount_t, irq_time_seq); | |
+ | |
+static inline void irq_time_write_begin(void) | |
+{ | |
+ __this_cpu_inc(irq_time_seq.sequence); | |
+ smp_wmb(); | |
+} | |
+ | |
+static inline void irq_time_write_end(void) | |
+{ | |
+ smp_wmb(); | |
+ __this_cpu_inc(irq_time_seq.sequence); | |
+} | |
+ | |
+static inline u64 irq_time_read(int cpu) | |
+{ | |
+ u64 irq_time; | |
+ unsigned seq; | |
+ | |
+ do { | |
+ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); | |
+ irq_time = per_cpu(cpu_softirq_time, cpu) + | |
+ per_cpu(cpu_hardirq_time, cpu); | |
+ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); | |
+ | |
+ return irq_time; | |
+} | |
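+ | |
+/* | |
+ * The seqcount is only needed on 32-bit: a 64-bit load is not atomic there, | |
+ * so a reader could otherwise see a torn value while the owning CPU is | |
+ * mid-update. On 64-bit the plain reads below are sufficient. | |
+ */ | |
+ | |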
+#else /* CONFIG_64BIT */ | |
+static inline void irq_time_write_begin(void) | |
+{ | |
+} | |
+ | |
+static inline void irq_time_write_end(void) | |
+{ | |
+} | |
+ | |
+static inline u64 irq_time_read(int cpu) | |
+{ | |
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); | |
+} | |
+#endif /* CONFIG_64BIT */ | |
+ | |
+/* | |
+ * Called before incrementing preempt_count on {soft,}irq_enter | |
+ * and before decrementing preempt_count on {soft,}irq_exit. | |
+ */ | |
+void irqtime_account_irq(struct task_struct *curr) | |
+{ | |
+ unsigned long flags; | |
+ s64 delta; | |
+ int cpu; | |
+ | |
+ if (!sched_clock_irqtime) | |
+ return; | |
+ | |
+ local_irq_save(flags); | |
+ | |
+ cpu = smp_processor_id(); | |
+ delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); | |
+ __this_cpu_add(irq_start_time, delta); | |
+ | |
+ irq_time_write_begin(); | |
+ /* | |
+ * We do not account for softirq time from ksoftirqd here. | |
+	 * We want to continue accounting softirq time to the ksoftirqd thread | |
+	 * in that case, so as not to confuse the scheduler with a special task | |
+	 * that does not consume any time but still wants to run. | |
+ */ | |
+ if (hardirq_count()) | |
+ __this_cpu_add(cpu_hardirq_time, delta); | |
+ else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) | |
+ __this_cpu_add(cpu_softirq_time, delta); | |
+ | |
+ irq_time_write_end(); | |
+ local_irq_restore(flags); | |
+} | |
+EXPORT_SYMBOL_GPL(irqtime_account_irq); | |
+ | |
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ | |
+ | |
+#ifdef CONFIG_PARAVIRT | |
+static inline u64 steal_ticks(u64 steal) | |
+{ | |
+ if (unlikely(steal > NSEC_PER_SEC)) | |
+ return div_u64(steal, TICK_NSEC); | |
+ | |
+ return __iter_div_u64_rem(steal, TICK_NSEC, &steal); | |
+} | |
+#endif | |
+ | |
+static void update_rq_clock_task(struct rq *rq, s64 delta) | |
+{ | |
+/* | |
+ * In theory, the compiler should just see 0 here, and optimize out the call | |
+ * to sched_rt_avg_update. But I don't trust it... | |
+ */ | |
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING | |
+ s64 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; | |
+ | |
+ /* | |
+ * Since irq_time is only updated on {soft,}irq_exit, we might run into | |
+ * this case when a previous update_rq_clock() happened inside a | |
+ * {soft,}irq region. | |
+ * | |
+ * When this happens, we stop ->clock_task and only update the | |
+ * prev_irq_time stamp to account for the part that fit, so that a next | |
+ * update will consume the rest. This ensures ->clock_task is | |
+ * monotonic. | |
+ * | |
+	 * It does however cause some slight mis-attribution of {soft,}irq | |
+	 * time; a more accurate solution would be to update the irq_time using | |
+ * the current rq->clock timestamp, except that would require using | |
+ * atomic ops. | |
+ */ | |
+ if (irq_delta > delta) | |
+ irq_delta = delta; | |
+ | |
+ rq->prev_irq_time += irq_delta; | |
+ delta -= irq_delta; | |
+#endif | |
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING | |
+ if (static_key_false((¶virt_steal_rq_enabled))) { | |
+ s64 steal = paravirt_steal_clock(cpu_of(rq)); | |
+ | |
+ steal -= rq->prev_steal_time_rq; | |
+ | |
+ if (unlikely(steal > delta)) | |
+ steal = delta; | |
+ | |
+ rq->prev_steal_time_rq += steal; | |
+ | |
+ delta -= steal; | |
+ } | |
+#endif | |
+ | |
+ rq->clock_task += delta; | |
+} | |
+ | |
+#ifndef nsecs_to_cputime | |
+# define nsecs_to_cputime(__nsecs) nsecs_to_jiffies(__nsecs) | |
+#endif | |
+ | |
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING | |
+static void irqtime_account_hi_si(void) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ u64 latest_ns; | |
+ | |
+ latest_ns = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)); | |
+ if (latest_ns > cpustat[CPUTIME_IRQ]) | |
+ cpustat[CPUTIME_IRQ] += (__force u64)cputime_one_jiffy; | |
+ | |
+ latest_ns = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)); | |
+ if (latest_ns > cpustat[CPUTIME_SOFTIRQ]) | |
+ cpustat[CPUTIME_SOFTIRQ] += (__force u64)cputime_one_jiffy; | |
+} | |
+#else /* CONFIG_IRQ_TIME_ACCOUNTING */ | |
+ | |
+#define sched_clock_irqtime (0) | |
+ | |
+static inline void irqtime_account_hi_si(void) | |
+{ | |
+} | |
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ | |
+ | |
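+/* | |
+ * Returns true when at least one cputime unit of steal time was accounted | |
+ * this tick, in which case the caller (update_cpu_clock_tick() below) skips | |
+ * the pc_* accounting for that tick. Any sub-cputime remainder is left out | |
+ * of prev_steal_time so it is accounted on a later round. | |
+ */ | |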
+static __always_inline bool steal_account_process_tick(void) | |
+{ | |
+#ifdef CONFIG_PARAVIRT | |
+ if (static_key_false(¶virt_steal_enabled)) { | |
+ u64 steal; | |
+ cputime_t steal_ct; | |
+ | |
+ steal = paravirt_steal_clock(smp_processor_id()); | |
+ steal -= this_rq()->prev_steal_time; | |
+ | |
+ /* | |
+ * cputime_t may be less precise than nsecs (e.g. if it's | |
+ * based on jiffies). Let's cast the result to cputime | |
+ * granularity and account the rest on the next rounds. | |
+ */ | |
+ steal_ct = nsecs_to_cputime(steal); | |
+ this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct); | |
+ | |
+ account_steal_time(steal_ct); | |
+ return steal_ct; | |
+ } | |
+#endif | |
+ return false; | |
+} | |
+ | |
+/* | |
+ * Accumulate raw cputime values of dead tasks (sig->[us]time) and live | |
+ * tasks (sum on group iteration) belonging to @tsk's group. | |
+ */ | |
+void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times) | |
+{ | |
+ struct signal_struct *sig = tsk->signal; | |
+ cputime_t utime, stime; | |
+ struct task_struct *t; | |
+ unsigned int seq, nextseq; | |
+ unsigned long flags; | |
+ | |
+ rcu_read_lock(); | |
+ /* Attempt a lockless read on the first round. */ | |
+ nextseq = 0; | |
+ do { | |
+ seq = nextseq; | |
+ flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq); | |
+ times->utime = sig->utime; | |
+ times->stime = sig->stime; | |
+ times->sum_exec_runtime = sig->sum_sched_runtime; | |
+ | |
+ for_each_thread(tsk, t) { | |
+ task_cputime(t, &utime, &stime); | |
+ times->utime += utime; | |
+ times->stime += stime; | |
+ times->sum_exec_runtime += task_sched_runtime(t); | |
+ } | |
+ /* If lockless access failed, take the lock. */ | |
+ nextseq = 1; | |
+ } while (need_seqretry(&sig->stats_lock, seq)); | |
+ done_seqretry_irqrestore(&sig->stats_lock, seq, flags); | |
+ rcu_read_unlock(); | |
+} | |
+ | |
+/* | |
+ * On each tick, see what percentage of that tick was attributed to each | |
+ * component and add the percentage to the _pc values. Once a _pc value has | |
+ * accumulated one tick's worth, account for that. This means the total | |
+ * percentage of load components will always be 128 (pseudo 100) per tick. | |
+ */ | |
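+/* | |
+ * For example, a tick spent entirely in user context hands pc_user_time() | |
+ * below a pc of 128; once p->utime_pc reaches 128, whole jiffies are banked | |
+ * into p->utime and the remainder carries over, so partial ticks are never | |
+ * lost, only deferred. | |
+ */ | |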
+static void pc_idle_time(struct rq *rq, struct task_struct *idle, unsigned long pc) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ | |
+ if (atomic_read(&rq->nr_iowait) > 0) { | |
+ rq->iowait_pc += pc; | |
+ if (rq->iowait_pc >= 128) { | |
+ cpustat[CPUTIME_IOWAIT] += (__force u64)cputime_one_jiffy * rq->iowait_pc / 128; | |
+ rq->iowait_pc %= 128; | |
+ } | |
+ } else { | |
+ rq->idle_pc += pc; | |
+ if (rq->idle_pc >= 128) { | |
+ cpustat[CPUTIME_IDLE] += (__force u64)cputime_one_jiffy * rq->idle_pc / 128; | |
+ rq->idle_pc %= 128; | |
+ } | |
+ } | |
+ acct_update_integrals(idle); | |
+} | |
+ | |
+static void | |
+pc_system_time(struct rq *rq, struct task_struct *p, int hardirq_offset, | |
+ unsigned long pc, unsigned long ns) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy); | |
+ | |
+ p->stime_pc += pc; | |
+ if (p->stime_pc >= 128) { | |
+ int jiffs = p->stime_pc / 128; | |
+ | |
+ p->stime_pc %= 128; | |
+ p->stime += (__force u64)cputime_one_jiffy * jiffs; | |
+ p->stimescaled += one_jiffy_scaled * jiffs; | |
+ account_group_system_time(p, cputime_one_jiffy * jiffs); | |
+ } | |
+ p->sched_time += ns; | |
+ account_group_exec_runtime(p, ns); | |
+ | |
+ if (hardirq_count() - hardirq_offset) { | |
+ rq->irq_pc += pc; | |
+ if (rq->irq_pc >= 128) { | |
+ cpustat[CPUTIME_IRQ] += (__force u64)cputime_one_jiffy * rq->irq_pc / 128; | |
+ rq->irq_pc %= 128; | |
+ } | |
+ } else if (in_serving_softirq()) { | |
+ rq->softirq_pc += pc; | |
+ if (rq->softirq_pc >= 128) { | |
+ cpustat[CPUTIME_SOFTIRQ] += (__force u64)cputime_one_jiffy * rq->softirq_pc / 128; | |
+ rq->softirq_pc %= 128; | |
+ } | |
+ } else { | |
+ rq->system_pc += pc; | |
+ if (rq->system_pc >= 128) { | |
+ cpustat[CPUTIME_SYSTEM] += (__force u64)cputime_one_jiffy * rq->system_pc / 128; | |
+ rq->system_pc %= 128; | |
+ } | |
+ } | |
+ acct_update_integrals(p); | |
+} | |
+ | |
+static void pc_user_time(struct rq *rq, struct task_struct *p, | |
+ unsigned long pc, unsigned long ns) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy); | |
+ | |
+ p->utime_pc += pc; | |
+ if (p->utime_pc >= 128) { | |
+ int jiffs = p->utime_pc / 128; | |
+ | |
+ p->utime_pc %= 128; | |
+ p->utime += (__force u64)cputime_one_jiffy * jiffs; | |
+ p->utimescaled += one_jiffy_scaled * jiffs; | |
+ account_group_user_time(p, cputime_one_jiffy * jiffs); | |
+ } | |
+ p->sched_time += ns; | |
+ account_group_exec_runtime(p, ns); | |
+ | |
+ if (this_cpu_ksoftirqd() == p) { | |
+ /* | |
+ * ksoftirqd time does not get accounted in cpu_softirq_time. | |
+ * So, we have to handle it separately here. | |
+ */ | |
+ rq->softirq_pc += pc; | |
+ if (rq->softirq_pc >= 128) { | |
+ cpustat[CPUTIME_SOFTIRQ] += (__force u64)cputime_one_jiffy * rq->softirq_pc / 128; | |
+ rq->softirq_pc %= 128; | |
+ } | |
+ } | |
+ | |
+ if (task_nice(p) > 0 || idleprio_task(p)) { | |
+ rq->nice_pc += pc; | |
+ if (rq->nice_pc >= 128) { | |
+ cpustat[CPUTIME_NICE] += (__force u64)cputime_one_jiffy * rq->nice_pc / 128; | |
+ rq->nice_pc %= 128; | |
+ } | |
+ } else { | |
+ rq->user_pc += pc; | |
+ if (rq->user_pc >= 128) { | |
+ cpustat[CPUTIME_USER] += (__force u64)cputime_one_jiffy * rq->user_pc / 128; | |
+ rq->user_pc %= 128; | |
+ } | |
+ } | |
+ acct_update_integrals(p); | |
+} | |
+ | |
+/* | |
+ * Convert nanoseconds to pseudo percentage of one tick. Use 128 for fast | |
+ * shifts instead of 100 | |
+ */ | |
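+/* | |
+ * Example, assuming HZ=250 and thus JIFFY_NS of 4,000,000: an interval of | |
+ * 1,000,000ns gives 1,000,000 * 128 / 4,000,000 = 32, i.e. a quarter of a | |
+ * tick on the 128 (pseudo 100) scale used by the pc_*_time functions above. | |
+ */ | |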
+#define NS_TO_PC(NS) (NS * 128 / JIFFY_NS) | |
+ | |
+/* | |
+ * This is called on clock ticks. | |
+ * Bank in p->sched_time the ns elapsed since the last tick or switch. | |
+ * CPU scheduler quota accounting is also performed here in microseconds. | |
+ */ | |
+static void | |
+update_cpu_clock_tick(struct rq *rq, struct task_struct *p) | |
+{ | |
+ long account_ns = rq->clock_task - rq->rq_last_ran; | |
+ struct task_struct *idle = rq->idle; | |
+ unsigned long account_pc; | |
+ | |
+ if (unlikely(account_ns < 0) || steal_account_process_tick()) | |
+ goto ts_account; | |
+ | |
+ account_pc = NS_TO_PC(account_ns); | |
+ | |
+ /* Accurate tick timekeeping */ | |
+ if (user_mode(get_irq_regs())) | |
+ pc_user_time(rq, p, account_pc, account_ns); | |
+ else if (p != idle || (irq_count() != HARDIRQ_OFFSET)) | |
+ pc_system_time(rq, p, HARDIRQ_OFFSET, | |
+ account_pc, account_ns); | |
+ else | |
+ pc_idle_time(rq, idle, account_pc); | |
+ | |
+ if (sched_clock_irqtime) | |
+ irqtime_account_hi_si(); | |
+ | |
+ts_account: | |
+ /* time_slice accounting is done in usecs to avoid overflow on 32bit */ | |
+ if (rq->rq_policy != SCHED_FIFO && p != idle) { | |
+ s64 time_diff = rq->clock - rq->timekeep_clock; | |
+ | |
+ niffy_diff(&time_diff, 1); | |
+ rq->rq_time_slice -= NS_TO_US(time_diff); | |
+ } | |
+ | |
+ rq->rq_last_ran = rq->clock_task; | |
+ rq->timekeep_clock = rq->clock; | |
+} | |
+ | |
+/* | |
+ * This is called on context switches. | |
+ * Bank in p->sched_time the ns elapsed since the last tick or switch. | |
+ * CPU scheduler quota accounting is also performed here in microseconds. | |
+ */ | |
+static void | |
+update_cpu_clock_switch(struct rq *rq, struct task_struct *p) | |
+{ | |
+ long account_ns = rq->clock_task - rq->rq_last_ran; | |
+ struct task_struct *idle = rq->idle; | |
+ unsigned long account_pc; | |
+ | |
+ if (unlikely(account_ns < 0)) | |
+ goto ts_account; | |
+ | |
+ account_pc = NS_TO_PC(account_ns); | |
+ | |
+ /* Accurate subtick timekeeping */ | |
+ if (p != idle) { | |
+ pc_user_time(rq, p, account_pc, account_ns); | |
+ } | |
+ else | |
+ pc_idle_time(rq, idle, account_pc); | |
+ | |
+ts_account: | |
+ /* time_slice accounting is done in usecs to avoid overflow on 32bit */ | |
+ if (rq->rq_policy != SCHED_FIFO && p != idle) { | |
+ s64 time_diff = rq->clock - rq->timekeep_clock; | |
+ | |
+ niffy_diff(&time_diff, 1); | |
+ rq->rq_time_slice -= NS_TO_US(time_diff); | |
+ } | |
+ | |
+ rq->rq_last_ran = rq->clock_task; | |
+ rq->timekeep_clock = rq->clock; | |
+} | |
+ | |
+/* | |
+ * Return any ns on the sched_clock that have not yet been accounted in | |
+ * @p in case that task is currently running. | |
+ * | |
+ * Called with task_grq_lock() held. | |
+ */ | |
+static inline u64 do_task_delta_exec(struct task_struct *p, struct rq *rq) | |
+{ | |
+ u64 ns = 0; | |
+ | |
+ /* | |
+ * Must be ->curr _and_ ->on_rq. If dequeued, we would | |
+ * project cycles that may never be accounted to this | |
+ * thread, breaking clock_gettime(). | |
+ */ | |
+ if (p == rq->curr && p->on_rq) { | |
+ update_clocks(rq); | |
+ ns = rq->clock_task - rq->rq_last_ran; | |
+ if (unlikely((s64)ns < 0)) | |
+ ns = 0; | |
+ } | |
+ | |
+ return ns; | |
+} | |
+ | |
+/* | |
+ * Return accounted runtime for the task. | |
+ * Also return separately the current task's pending runtime that has not | |
+ * been accounted yet. | |
+ */ | |
+unsigned long long task_sched_runtime(struct task_struct *p) | |
+{ | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ u64 ns; | |
+ | |
+#if defined(CONFIG_64BIT) && defined(CONFIG_SMP) | |
+ /* | |
+ * 64-bit doesn't need locks to atomically read a 64-bit value. | |
+ * So we have an optimization chance when the task's delta_exec is 0. | |
+ * Reading ->on_cpu is racy, but this is ok. | |
+ * | |
+ * If we race with it leaving cpu, we'll take a lock. So we're correct. | |
+ * If we race with it entering cpu, unaccounted time is 0. This is | |
+ * indistinguishable from the read occurring a few cycles earlier. | |
+ * If we see ->on_cpu without ->on_rq, the task is leaving, and has | |
+ * been accounted, so we're correct here as well. | |
+ */ | |
+ if (!p->on_cpu || !p->on_rq) | |
+ return tsk_seruntime(p); | |
+#endif | |
+ | |
+ rq = task_grq_lock(p, &flags); | |
+ ns = p->sched_time + do_task_delta_exec(p, rq); | |
+ task_grq_unlock(&flags); | |
+ | |
+ return ns; | |
+} | |
+ | |
+/* Compatibility crap */ | |
+void account_user_time(struct task_struct *p, cputime_t cputime, | |
+ cputime_t cputime_scaled) | |
+{ | |
+} | |
+ | |
+void account_idle_time(cputime_t cputime) | |
+{ | |
+} | |
+ | |
+void update_cpu_load_nohz(void) | |
+{ | |
+} | |
+ | |
+#ifdef CONFIG_NO_HZ_COMMON | |
+void calc_load_enter_idle(void) | |
+{ | |
+} | |
+ | |
+void calc_load_exit_idle(void) | |
+{ | |
+} | |
+#endif /* CONFIG_NO_HZ_COMMON */ | |
+ | |
+/* | |
+ * Account guest cpu time to a process. | |
+ * @p: the process that the cpu time gets accounted to | |
+ * @cputime: the cpu time spent in virtual machine since the last update | |
+ * @cputime_scaled: cputime scaled by cpu frequency | |
+ */ | |
+static void account_guest_time(struct task_struct *p, cputime_t cputime, | |
+ cputime_t cputime_scaled) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ | |
+ /* Add guest time to process. */ | |
+ p->utime += (__force u64)cputime; | |
+ p->utimescaled += (__force u64)cputime_scaled; | |
+ account_group_user_time(p, cputime); | |
+ p->gtime += (__force u64)cputime; | |
+ | |
+ /* Add guest time to cpustat. */ | |
+ if (task_nice(p) > 0) { | |
+ cpustat[CPUTIME_NICE] += (__force u64)cputime; | |
+ cpustat[CPUTIME_GUEST_NICE] += (__force u64)cputime; | |
+ } else { | |
+ cpustat[CPUTIME_USER] += (__force u64)cputime; | |
+ cpustat[CPUTIME_GUEST] += (__force u64)cputime; | |
+ } | |
+} | |
+ | |
+/* | |
+ * Account system cpu time to a process and desired cpustat field | |
+ * @p: the process that the cpu time gets accounted to | |
+ * @cputime: the cpu time spent in kernel space since the last update | |
+ * @cputime_scaled: cputime scaled by cpu frequency | |
+ * @target_cputime64: pointer to cpustat field that has to be updated | |
+ */ | |
+static inline | |
+void __account_system_time(struct task_struct *p, cputime_t cputime, | |
+ cputime_t cputime_scaled, cputime64_t *target_cputime64) | |
+{ | |
+ /* Add system time to process. */ | |
+ p->stime += (__force u64)cputime; | |
+ p->stimescaled += (__force u64)cputime_scaled; | |
+ account_group_system_time(p, cputime); | |
+ | |
+ /* Add system time to cpustat. */ | |
+ *target_cputime64 += (__force u64)cputime; | |
+ | |
+ /* Account for system time used */ | |
+ acct_update_integrals(p); | |
+} | |
+ | |
+/* | |
+ * Account system cpu time to a process. | |
+ * @p: the process that the cpu time gets accounted to | |
+ * @hardirq_offset: the offset to subtract from hardirq_count() | |
+ * @cputime: the cpu time spent in kernel space since the last update | |
+ * @cputime_scaled: cputime scaled by cpu frequency | |
+ * This is for guest only now. | |
+ */ | |
+void account_system_time(struct task_struct *p, int hardirq_offset, | |
+ cputime_t cputime, cputime_t cputime_scaled) | |
+{ | |
+ | |
+ if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) | |
+ account_guest_time(p, cputime, cputime_scaled); | |
+} | |
+ | |
+/* | |
+ * Account for involuntary wait time. | |
+ * @cputime: the cpu time spent in involuntary wait | |
+ */ | |
+void account_steal_time(cputime_t cputime) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ | |
+ cpustat[CPUTIME_STEAL] += (__force u64)cputime; | |
+} | |
+ | |
+/* | |
+ * Account for idle time. | |
+ * @cputime: the cpu time spent in idle wait | |
+ */ | |
+static void account_idle_times(cputime_t cputime) | |
+{ | |
+ u64 *cpustat = kcpustat_this_cpu->cpustat; | |
+ struct rq *rq = this_rq(); | |
+ | |
+ if (atomic_read(&rq->nr_iowait) > 0) | |
+ cpustat[CPUTIME_IOWAIT] += (__force u64)cputime; | |
+ else | |
+ cpustat[CPUTIME_IDLE] += (__force u64)cputime; | |
+} | |
+ | |
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE | |
+ | |
+void account_process_tick(struct task_struct *p, int user_tick) | |
+{ | |
+} | |
+ | |
+/* | |
+ * Account multiple ticks of steal time. | |
+ * @ticks: number of stolen ticks | |
+ */ | |
+void account_steal_ticks(unsigned long ticks) | |
+{ | |
+ account_steal_time(jiffies_to_cputime(ticks)); | |
+} | |
+ | |
+/* | |
+ * Account multiple ticks of idle time. | |
+ * @ticks: number of ticks spent idle | |
+ */ | |
+void account_idle_ticks(unsigned long ticks) | |
+{ | |
+ account_idle_times(jiffies_to_cputime(ticks)); | |
+} | |
+#endif | |
+ | |
+static inline void grq_iso_lock(void) | |
+ __acquires(grq.iso_lock) | |
+{ | |
+ raw_spin_lock(&grq.iso_lock); | |
+} | |
+ | |
+static inline void grq_iso_unlock(void) | |
+ __releases(grq.iso_lock) | |
+{ | |
+ raw_spin_unlock(&grq.iso_lock); | |
+} | |
+ | |
+/* | |
+ * Functions to test for when SCHED_ISO tasks have used their allocated | |
+ * quota as real time scheduling and convert them back to SCHED_NORMAL. | |
+ * Where possible, the data is tested lockless, to avoid grabbing iso_lock | |
+ * because the occasional inaccurate result won't matter. However the | |
+ * tick data is only ever modified under lock. iso_refractory is only ever | |
+ * set to 0 or 1, so it's not worth grabbing the lock yet again for that. | |
+ */ | |
+static bool set_iso_refractory(void) | |
+{ | |
+ grq.iso_refractory = true; | |
+ return grq.iso_refractory; | |
+} | |
+ | |
+static bool clear_iso_refractory(void) | |
+{ | |
+ grq.iso_refractory = false; | |
+ return grq.iso_refractory; | |
+} | |
+ | |
+/* | |
+ * Test if SCHED_ISO tasks have run longer than their allotted period as RT | |
+ * tasks and set the refractory flag if necessary. There is 10% hysteresis | |
+ * for unsetting the flag. 115/128 is ~90/100 as a fast shift instead of a | |
+ * slow division. | |
+ */ | |
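+/* | |
+ * For instance, if sched_iso_cpu is at its default of 70 (set elsewhere in | |
+ * this patch), the refractory flag is set once iso_ticks exceeds | |
+ * ISO_PERIOD * 70 and is cleared again once it decays below roughly 90% of | |
+ * that threshold (70 * 115 / 128). | |
+ */ | |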
+static bool test_ret_isorefractory(struct rq *rq) | |
+{ | |
+ if (likely(!grq.iso_refractory)) { | |
+ if (grq.iso_ticks > ISO_PERIOD * sched_iso_cpu) | |
+ return set_iso_refractory(); | |
+ } else { | |
+ if (grq.iso_ticks < ISO_PERIOD * (sched_iso_cpu * 115 / 128)) | |
+ return clear_iso_refractory(); | |
+ } | |
+ return grq.iso_refractory; | |
+} | |
+ | |
+static void iso_tick(void) | |
+{ | |
+ grq_iso_lock(); | |
+ grq.iso_ticks += 100; | |
+ grq_iso_unlock(); | |
+} | |
+ | |
+/* No SCHED_ISO task was running so decrease grq.iso_ticks */ | |
+static inline void no_iso_tick(void) | |
+{ | |
+ if (grq.iso_ticks) { | |
+ grq_iso_lock(); | |
+ grq.iso_ticks -= grq.iso_ticks / ISO_PERIOD + 1; | |
+ if (unlikely(grq.iso_refractory && grq.iso_ticks < | |
+ ISO_PERIOD * (sched_iso_cpu * 115 / 128))) | |
+ clear_iso_refractory(); | |
+ grq_iso_unlock(); | |
+ } | |
+} | |
+ | |
+/* This manages tasks that have run out of timeslice during a scheduler_tick */ | |
+static void task_running_tick(struct rq *rq) | |
+{ | |
+ struct task_struct *p; | |
+ | |
+ /* | |
+ * If a SCHED_ISO task is running we increment the iso_ticks. In | |
+ * order to prevent SCHED_ISO tasks from causing starvation in the | |
+ * presence of true RT tasks we account those as iso_ticks as well. | |
+ */ | |
+ if ((rt_queue(rq) || (iso_queue(rq) && !grq.iso_refractory))) { | |
+ if (grq.iso_ticks <= (ISO_PERIOD * 128) - 128) | |
+ iso_tick(); | |
+ } else | |
+ no_iso_tick(); | |
+ | |
+ if (iso_queue(rq)) { | |
+ if (unlikely(test_ret_isorefractory(rq))) { | |
+ if (rq_running_iso(rq)) { | |
+ /* | |
+ * SCHED_ISO task is running as RT and limit | |
+ * has been hit. Force it to reschedule as | |
+ * SCHED_NORMAL by zeroing its time_slice | |
+ */ | |
+ rq->rq_time_slice = 0; | |
+ } | |
+ } | |
+ } | |
+ | |
+ /* SCHED_FIFO tasks never run out of timeslice. */ | |
+ if (rq->rq_policy == SCHED_FIFO) | |
+ return; | |
+ /* | |
+ * Tasks that were scheduled in the first half of a tick are not | |
+ * allowed to run into the 2nd half of the next tick if they will | |
+ * run out of time slice in the interim. Otherwise, if they have | |
+ * less than RESCHED_US μs of time slice left they will be rescheduled. | |
+ */ | |
+ if (rq->dither) { | |
+ if (rq->rq_time_slice > HALF_JIFFY_US) | |
+ return; | |
+ else | |
+ rq->rq_time_slice = 0; | |
+ } else if (rq->rq_time_slice >= RESCHED_US) | |
+ return; | |
+ | |
+ /* p->time_slice < RESCHED_US. We only modify task_struct under grq lock */ | |
+ p = rq->curr; | |
+ | |
+ grq_lock(); | |
+ requeue_task(p); | |
+ __set_tsk_resched(p); | |
+ grq_unlock(); | |
+} | |
+ | |
+/* | |
+ * This function gets called by the timer code, with HZ frequency. | |
+ * We call it with interrupts disabled. The data modified is all | |
+ * local to struct rq so we don't need to grab grq lock. | |
+ */ | |
+void scheduler_tick(void) | |
+{ | |
+ int cpu __maybe_unused = smp_processor_id(); | |
+ struct rq *rq = cpu_rq(cpu); | |
+ | |
+ sched_clock_tick(); | |
+ /* grq lock not grabbed, so only update rq clock */ | |
+ update_rq_clock(rq); | |
+ update_cpu_clock_tick(rq, rq->curr); | |
+ if (!rq_idle(rq)) | |
+ task_running_tick(rq); | |
+ else | |
+ no_iso_tick(); | |
+ rq->last_tick = rq->clock; | |
+ perf_event_task_tick(); | |
+} | |
+ | |
+notrace unsigned long get_parent_ip(unsigned long addr) | |
+{ | |
+ if (in_lock_functions(addr)) { | |
+ addr = CALLER_ADDR2; | |
+ if (in_lock_functions(addr)) | |
+ addr = CALLER_ADDR3; | |
+ } | |
+ return addr; | |
+} | |
+ | |
+#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ | |
+ defined(CONFIG_PREEMPT_TRACER)) | |
+void preempt_count_add(int val) | |
+{ | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ /* | |
+ * Underflow? | |
+ */ | |
+ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0))) | |
+ return; | |
+#endif | |
+ __preempt_count_add(val); | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ /* | |
+ * Spinlock count overflowing soon? | |
+ */ | |
+ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >= | |
+ PREEMPT_MASK - 10); | |
+#endif | |
+ if (preempt_count() == val) { | |
+ unsigned long ip = get_parent_ip(CALLER_ADDR1); | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ current->preempt_disable_ip = ip; | |
+#endif | |
+ trace_preempt_off(CALLER_ADDR0, ip); | |
+ } | |
+} | |
+EXPORT_SYMBOL(preempt_count_add); | |
+NOKPROBE_SYMBOL(preempt_count_add); | |
+ | |
+void preempt_count_sub(int val) | |
+{ | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ /* | |
+ * Underflow? | |
+ */ | |
+ if (DEBUG_LOCKS_WARN_ON(val > preempt_count())) | |
+ return; | |
+ /* | |
+ * Is the spinlock portion underflowing? | |
+ */ | |
+ if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) && | |
+ !(preempt_count() & PREEMPT_MASK))) | |
+ return; | |
+#endif | |
+ | |
+ if (preempt_count() == val) | |
+ trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); | |
+ __preempt_count_sub(val); | |
+} | |
+EXPORT_SYMBOL(preempt_count_sub); | |
+NOKPROBE_SYMBOL(preempt_count_sub); | |
+#endif | |
+ | |
+/* | |
+ * Deadline is "now" in niffies + (offset by priority). Setting the deadline | |
+ * is the key to everything. It distributes cpu fairly amongst tasks of the | |
+ * same nice value, it proportions cpu according to nice level, it means the | |
+ * task that last woke up the longest ago has the earliest deadline, thus | |
+ * ensuring that interactive tasks get low latency on wake up. The CPU | |
+ * proportion works out to the square of the virtual deadline difference, so | |
+ * this equation gives nice 19 about 3% CPU compared to nice 0. | |
+ */ | |
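+/* | |
+ * Rough worked example: prio_ratios[] (initialised earlier in this patch) | |
+ * starts at 128 for the best user priority and grows ~10% per nice level, | |
+ * so the offset for nice 19 is about 1.1^19 ~= 6 times that of nice 0; | |
+ * squared, as per the comment above, that is ~37x, i.e. the quoted ~3% of | |
+ * the CPU. For the best user priority the offset is ~rr_interval ms. | |
+ */ | |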
+static inline u64 prio_deadline_diff(int user_prio) | |
+{ | |
+ return (prio_ratios[user_prio] * rr_interval * (MS_TO_NS(1) / 128)); | |
+} | |
+ | |
+static inline u64 task_deadline_diff(struct task_struct *p) | |
+{ | |
+ return prio_deadline_diff(TASK_USER_PRIO(p)); | |
+} | |
+ | |
+static inline u64 static_deadline_diff(int static_prio) | |
+{ | |
+ return prio_deadline_diff(USER_PRIO(static_prio)); | |
+} | |
+ | |
+static inline int longest_deadline_diff(void) | |
+{ | |
+ return prio_deadline_diff(39); | |
+} | |
+ | |
+static inline int ms_longest_deadline_diff(void) | |
+{ | |
+ return NS_TO_MS(longest_deadline_diff()); | |
+} | |
+ | |
+/* | |
+ * The time_slice is only refilled when it is empty and that is when we set a | |
+ * new deadline. | |
+ */ | |
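+/* | |
+ * With CONFIG_SMT_NICE, the smt_bias set below feeds the | |
+ * smt_should_schedule() checks (defined elsewhere in this patch): realtime | |
+ * tasks (1 << 30) outrank SCHED_ISO tasks (1 << 29), which outrank | |
+ * SCHED_NORMAL tasks (whose bias counts down from MAX_PRIO - static_prio), | |
+ * which outrank SCHED_IDLEPRIO tasks and mm-less kernel threads (1 or 0). | |
+ */ | |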
+static void time_slice_expired(struct task_struct *p) | |
+{ | |
+ p->time_slice = timeslice(); | |
+ p->deadline = grq.niffies + task_deadline_diff(p); | |
+#ifdef CONFIG_SMT_NICE | |
+ if (!p->mm) | |
+ p->smt_bias = 0; | |
+ else if (rt_task(p)) | |
+ p->smt_bias = 1 << 30; | |
+ else if (task_running_iso(p)) | |
+ p->smt_bias = 1 << 29; | |
+ else if (idleprio_task(p)) { | |
+ if (task_running_idle(p)) | |
+ p->smt_bias = 0; | |
+ else | |
+ p->smt_bias = 1; | |
+ } else if (--p->smt_bias < 1) | |
+ p->smt_bias = MAX_PRIO - p->static_prio; | |
+#endif | |
+} | |
+ | |
+/* | |
+ * Timeslices below RESCHED_US are considered as good as expired as there's no | |
+ * point rescheduling when there's so little time left. SCHED_BATCH tasks | |
+ * have been flagged as not latency sensitive and are likely to be fully CPU | |
+ * bound, so every time they're rescheduled they have their time_slice | |
+ * refilled, but get a new, later deadline so they have little effect on | |
+ * SCHED_NORMAL tasks. | |
+ */ | |
+static inline void check_deadline(struct task_struct *p) | |
+{ | |
+ if (p->time_slice < RESCHED_US || batch_task(p)) | |
+ time_slice_expired(p); | |
+} | |
+ | |
+#define BITOP_WORD(nr) ((nr) / BITS_PER_LONG) | |
+ | |
+/* | |
+ * Scheduler queue bitmap specific find next bit. | |
+ */ | |
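+/* | |
+ * Effectively an open-coded find_next_bit() with the size fixed at | |
+ * PRIO_LIMIT, presumably so the hot lookup path below avoids a generic | |
+ * out-of-line call. | |
+ */ | |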
+static inline unsigned long | |
+next_sched_bit(const unsigned long *addr, unsigned long offset) | |
+{ | |
+ const unsigned long *p; | |
+ unsigned long result; | |
+ unsigned long size; | |
+ unsigned long tmp; | |
+ | |
+ size = PRIO_LIMIT; | |
+ if (offset >= size) | |
+ return size; | |
+ | |
+ p = addr + BITOP_WORD(offset); | |
+ result = offset & ~(BITS_PER_LONG-1); | |
+ size -= result; | |
+ offset %= BITS_PER_LONG; | |
+ if (offset) { | |
+ tmp = *(p++); | |
+ tmp &= (~0UL << offset); | |
+ if (size < BITS_PER_LONG) | |
+ goto found_first; | |
+ if (tmp) | |
+ goto found_middle; | |
+ size -= BITS_PER_LONG; | |
+ result += BITS_PER_LONG; | |
+ } | |
+ while (size & ~(BITS_PER_LONG-1)) { | |
+ if ((tmp = *(p++))) | |
+ goto found_middle; | |
+ result += BITS_PER_LONG; | |
+ size -= BITS_PER_LONG; | |
+ } | |
+ if (!size) | |
+ return result; | |
+ tmp = *p; | |
+ | |
+found_first: | |
+ tmp &= (~0UL >> (BITS_PER_LONG - size)); | |
+ if (tmp == 0UL) /* Are any bits set? */ | |
+ return result + size; /* Nope. */ | |
+found_middle: | |
+ return result + __ffs(tmp); | |
+} | |
+ | |
+/* | |
+ * O(n) lookup of all tasks in the global runqueue. The real brainfuck | |
+ * of lock contention and O(n). Only the queued but not running tasks are | |
+ * scanned, and O(n) queued is the worst case scenario only, because the | |
+ * right task can be found before all of them have been scanned. | |
+ * Tasks are selected in this order: | |
+ * Real time tasks are selected purely by their static priority and in the | |
+ * order they were queued, so the lowest value idx, and the first queued task | |
+ * of that priority value is chosen. | |
+ * If no real time tasks are found, the SCHED_ISO priority is checked, and | |
+ * all SCHED_ISO tasks have the same priority value, so they're selected by | |
+ * the earliest deadline value. | |
+ * If no SCHED_ISO tasks are found, SCHED_NORMAL tasks are selected by the | |
+ * earliest deadline. | |
+ * Finally if no SCHED_NORMAL tasks are found, SCHED_IDLEPRIO tasks are | |
+ * selected by the earliest deadline. | |
+ */ | |
+static inline struct | |
+task_struct *earliest_deadline_task(struct rq *rq, int cpu, struct task_struct *idle) | |
+{ | |
+ struct task_struct *edt = NULL; | |
+ unsigned long idx = -1; | |
+ | |
+ do { | |
+ struct list_head *queue; | |
+ struct task_struct *p; | |
+ u64 earliest_deadline; | |
+ | |
+ idx = next_sched_bit(grq.prio_bitmap, ++idx); | |
+ if (idx >= PRIO_LIMIT) | |
+ return idle; | |
+ queue = grq.queue + idx; | |
+ | |
+ if (idx < MAX_RT_PRIO) { | |
+ /* We found an rt task */ | |
+ list_for_each_entry(p, queue, run_list) { | |
+ /* Make sure cpu affinity is ok */ | |
+ if (needs_other_cpu(p, cpu)) | |
+ continue; | |
+ edt = p; | |
+ goto out_take; | |
+ } | |
+ /* | |
+ * None of the RT tasks at this priority can run on | |
+ * this cpu | |
+ */ | |
+ continue; | |
+ } | |
+ | |
+ /* | |
+ * No rt tasks. Find the earliest deadline task. Now we're in | |
+ * O(n) territory. | |
+ */ | |
+ earliest_deadline = ~0ULL; | |
+ list_for_each_entry(p, queue, run_list) { | |
+ u64 dl; | |
+ | |
+ /* Make sure cpu affinity is ok */ | |
+ if (needs_other_cpu(p, cpu)) | |
+ continue; | |
+ | |
+#ifdef CONFIG_SMT_NICE | |
+ if (!smt_should_schedule(p, cpu)) | |
+ continue; | |
+#endif | |
+ /* | |
+ * Soft affinity happens here by not scheduling a task | |
+ * with its sticky flag set that last ran on a different | |
+ * CPU when this CPU is scaling, or, when it is not, by | |
+ * greatly biasing against its deadline based on cpu | |
+ * cache locality. | |
+ */ | |
+ if (task_sticky(p) && task_rq(p) != rq) { | |
+ if (scaling_rq(rq)) | |
+ continue; | |
+ dl = p->deadline << locality_diff(p, rq); | |
+ } else | |
+ dl = p->deadline; | |
+ | |
+ if (deadline_before(dl, earliest_deadline)) { | |
+ earliest_deadline = dl; | |
+ edt = p; | |
+ } | |
+ } | |
+ } while (!edt); | |
+ | |
+out_take: | |
+ take_task(cpu, edt); | |
+ return edt; | |
+} | |
+ | |
+ | |
+/* | |
+ * Print scheduling while atomic bug: | |
+ */ | |
+static noinline void __schedule_bug(struct task_struct *prev) | |
+{ | |
+ if (oops_in_progress) | |
+ return; | |
+ | |
+ printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n", | |
+ prev->comm, prev->pid, preempt_count()); | |
+ | |
+ debug_show_held_locks(prev); | |
+ print_modules(); | |
+ if (irqs_disabled()) | |
+ print_irqtrace_events(prev); | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ if (in_atomic_preempt_off()) { | |
+ pr_err("Preemption disabled at:"); | |
+ print_ip_sym(current->preempt_disable_ip); | |
+ pr_cont("\n"); | |
+ } | |
+#endif | |
+ dump_stack(); | |
+ add_taint(TAINT_WARN, LOCKDEP_STILL_OK); | |
+} | |
+ | |
+/* | |
+ * Various schedule()-time debugging checks and statistics: | |
+ */ | |
+static inline void schedule_debug(struct task_struct *prev) | |
+{ | |
+#ifdef CONFIG_SCHED_STACK_END_CHECK | |
+ BUG_ON(unlikely(task_stack_end_corrupted(prev))); | |
+#endif | |
+ /* | |
+ * Test if we are atomic. Since do_exit() needs to call into | |
+ * schedule() atomically, we ignore that path. Otherwise whine | |
+ * if we are scheduling when we should not. | |
+ */ | |
+ if (unlikely(in_atomic_preempt_off() && prev->state != TASK_DEAD)) | |
+ __schedule_bug(prev); | |
+ rcu_sleep_check(); | |
+ | |
+ profile_hit(SCHED_PROFILING, __builtin_return_address(0)); | |
+ | |
+ schedstat_inc(this_rq(), sched_count); | |
+} | |
+ | |
+/* | |
+ * The currently running task's information is all stored in rq local data | |
+ * which is only modified by the local CPU, thereby allowing the data to be | |
+ * changed without grabbing the grq lock. | |
+ */ | |
+static inline void set_rq_task(struct rq *rq, struct task_struct *p) | |
+{ | |
+ rq->rq_time_slice = p->time_slice; | |
+ rq->rq_deadline = p->deadline; | |
+ rq->rq_last_ran = p->last_ran = rq->clock_task; | |
+ rq->rq_policy = p->policy; | |
+ rq->rq_prio = p->prio; | |
+#ifdef CONFIG_SMT_NICE | |
+ rq->rq_smt_bias = p->smt_bias; | |
+#endif | |
+ if (p != rq->idle) | |
+ rq->rq_running = true; | |
+ else | |
+ rq->rq_running = false; | |
+} | |
+ | |
+static void reset_rq_task(struct rq *rq, struct task_struct *p) | |
+{ | |
+ rq->rq_policy = p->policy; | |
+ rq->rq_prio = p->prio; | |
+#ifdef CONFIG_SMT_NICE | |
+ rq->rq_smt_bias = p->smt_bias; | |
+#endif | |
+} | |
+ | |
+#ifdef CONFIG_SMT_NICE | |
+/* Iterate over smt siblings when we've scheduled a process on cpu and decide | |
+ * whether they should continue running or be descheduled. */ | |
+static void check_smt_siblings(int cpu) | |
+{ | |
+ int other_cpu; | |
+ | |
+ for_each_cpu_mask(other_cpu, *thread_cpumask(cpu)) { | |
+ struct task_struct *p; | |
+ struct rq *rq; | |
+ | |
+ if (other_cpu == cpu) | |
+ continue; | |
+ rq = cpu_rq(other_cpu); | |
+ if (rq_idle(rq)) | |
+ continue; | |
+ if (!rq->online) | |
+ continue; | |
+ p = rq->curr; | |
+ if (!smt_should_schedule(p, cpu)) { | |
+ set_tsk_need_resched(p); | |
+ smp_send_reschedule(other_cpu); | |
+ } | |
+ } | |
+} | |
+ | |
+static void wake_smt_siblings(int cpu) | |
+{ | |
+ int other_cpu; | |
+ | |
+ if (!queued_notrunning()) | |
+ return; | |
+ | |
+ for_each_cpu_mask(other_cpu, *thread_cpumask(cpu)) { | |
+ struct rq *rq; | |
+ | |
+ if (other_cpu == cpu) | |
+ continue; | |
+ rq = cpu_rq(other_cpu); | |
+ if (rq_idle(rq)) { | |
+ struct task_struct *p = rq->curr; | |
+ | |
+ set_tsk_need_resched(p); | |
+ smp_send_reschedule(other_cpu); | |
+ } | |
+ } | |
+} | |
+#else | |
+static void check_smt_siblings(int __maybe_unused cpu) {} | |
+static void wake_smt_siblings(int __maybe_unused cpu) {} | |
+#endif | |
+ | |
+/* | |
+ * schedule() is the main scheduler function. | |
+ * | |
+ * The main means of driving the scheduler and thus entering this function are: | |
+ * | |
+ * 1. Explicit blocking: mutex, semaphore, waitqueue, etc. | |
+ * | |
+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return | |
+ * paths. For example, see arch/x86/entry_64.S. | |
+ * | |
+ * To drive preemption between tasks, the scheduler sets the flag in timer | |
+ * interrupt handler scheduler_tick(). | |
+ * | |
+ * 3. Wakeups don't really cause entry into schedule(). They add a | |
+ * task to the run-queue and that's it. | |
+ * | |
+ * Now, if the new task added to the run-queue preempts the current | |
+ * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets | |
+ * called on the nearest possible occasion: | |
+ * | |
+ * - If the kernel is preemptible (CONFIG_PREEMPT=y): | |
+ * | |
+ * - in syscall or exception context, at the next outermost | |
+ * preempt_enable(). (this might be as soon as the wake_up()'s | |
+ * spin_unlock()!) | |
+ * | |
+ * - in IRQ context, return from interrupt-handler to | |
+ * preemptible context | |
+ * | |
+ * - If the kernel is not preemptible (CONFIG_PREEMPT is not set) | |
+ * then at the next: | |
+ * | |
+ * - cond_resched() call | |
+ * - explicit schedule() call | |
+ * - return from syscall or exception to user-space | |
+ * - return from interrupt-handler to user-space | |
+ */ | |
+static void __sched __schedule(void) | |
+{ | |
+ struct task_struct *prev, *next, *idle; | |
+ unsigned long *switch_count; | |
+ bool deactivate; | |
+ struct rq *rq; | |
+ int cpu; | |
+ | |
+need_resched: | |
+ preempt_disable(); | |
+ cpu = smp_processor_id(); | |
+ rq = cpu_rq(cpu); | |
+ rcu_note_context_switch(cpu); | |
+ prev = rq->curr; | |
+ | |
+ deactivate = false; | |
+ schedule_debug(prev); | |
+ | |
+ /* | |
+ * Make sure that signal_pending_state()->signal_pending() below | |
+ * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) | |
+ * done by the caller to avoid the race with signal_wake_up(). | |
+ */ | |
+ smp_mb__before_spinlock(); | |
+ grq_lock_irq(); | |
+ | |
+ switch_count = &prev->nivcsw; | |
+ if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { | |
+ if (unlikely(signal_pending_state(prev->state, prev))) { | |
+ prev->state = TASK_RUNNING; | |
+ } else { | |
+ deactivate = true; | |
+ prev->on_rq = 0; | |
+ | |
+ /* | |
+ * If a worker is going to sleep, notify and | |
+ * ask workqueue whether it wants to wake up a | |
+ * task to maintain concurrency. If so, wake | |
+ * up the task. | |
+ */ | |
+ if (prev->flags & PF_WQ_WORKER) { | |
+ struct task_struct *to_wakeup; | |
+ | |
+ to_wakeup = wq_worker_sleeping(prev, cpu); | |
+ if (to_wakeup) { | |
+ /* This shouldn't happen, but does */ | |
+ if (unlikely(to_wakeup == prev)) | |
+ deactivate = false; | |
+ else | |
+ try_to_wake_up_local(to_wakeup); | |
+ } | |
+ } | |
+ } | |
+ switch_count = &prev->nvcsw; | |
+ } | |
+ | |
+ update_clocks(rq); | |
+ update_cpu_clock_switch(rq, prev); | |
+ if (rq->clock - rq->last_tick > HALF_JIFFY_NS) | |
+ rq->dither = false; | |
+ else | |
+ rq->dither = true; | |
+ | |
+ clear_tsk_need_resched(prev); | |
+ clear_preempt_need_resched(); | |
+ | |
+ idle = rq->idle; | |
+ if (idle != prev) { | |
+ /* Update all the information stored on struct rq */ | |
+ prev->time_slice = rq->rq_time_slice; | |
+ prev->deadline = rq->rq_deadline; | |
+ check_deadline(prev); | |
+ prev->last_ran = rq->clock_task; | |
+ | |
+ /* Task changed affinity off this CPU */ | |
+ if (likely(!needs_other_cpu(prev, cpu))) { | |
+ if (!deactivate) { | |
+ if (!queued_notrunning()) { | |
+ /* | |
+ * We now know prev is the only thing that is | |
+ * awaiting CPU so we can bypass rechecking for | |
+ * the earliest deadline task and just run it | |
+ * again. | |
+ */ | |
+ set_rq_task(rq, prev); | |
+ check_smt_siblings(cpu); | |
+ grq_unlock_irq(); | |
+ goto rerun_prev_unlocked; | |
+ } else | |
+ swap_sticky(rq, cpu, prev); | |
+ } | |
+ } | |
+ return_task(prev, rq, deactivate); | |
+ } | |
+ | |
+ if (unlikely(!queued_notrunning())) { | |
+ /* | |
+ * This CPU is now truly idle as opposed to when idle is | |
+ * scheduled as a high priority task in its own right. | |
+ */ | |
+ next = idle; | |
+ schedstat_inc(rq, sched_goidle); | |
+ set_cpuidle_map(cpu); | |
+ } else { | |
+ next = earliest_deadline_task(rq, cpu, idle); | |
+ if (likely(next->prio != PRIO_LIMIT)) | |
+ clear_cpuidle_map(cpu); | |
+ else | |
+ set_cpuidle_map(cpu); | |
+ } | |
+ | |
+ if (likely(prev != next)) { | |
+ /* | |
+ * Don't reschedule an idle task or deactivated tasks | |
+ */ | |
+ if (prev != idle && !deactivate) | |
+ resched_suitable_idle(prev); | |
+ /* | |
+ * Don't stick tasks when a real time task is going to run as | |
+ * they may literally get stuck. | |
+ */ | |
+ if (rt_task(next)) | |
+ unstick_task(rq, prev); | |
+ set_rq_task(rq, next); | |
+ if (next != idle) | |
+ check_smt_siblings(cpu); | |
+ else | |
+ wake_smt_siblings(cpu); | |
+ grq.nr_switches++; | |
+ prev->on_cpu = false; | |
+ next->on_cpu = true; | |
+ rq->curr = next; | |
+ ++*switch_count; | |
+ | |
+ context_switch(rq, prev, next); /* unlocks the grq */ | |
+ /* | |
+ * The context switch has flipped the stack from under us | |
+ * and restored the local variables which were saved when | |
+ * this task called schedule() in the past. prev == current | |
+ * is still correct, but it can be moved to another cpu/rq. | |
+ */ | |
+ cpu = smp_processor_id(); | |
+ rq = cpu_rq(cpu); | |
+ idle = rq->idle; | |
+ } else { | |
+ check_smt_siblings(cpu); | |
+ grq_unlock_irq(); | |
+ } | |
+ | |
+rerun_prev_unlocked: | |
+ sched_preempt_enable_no_resched(); | |
+ if (unlikely(need_resched())) | |
+ goto need_resched; | |
+} | |
+ | |
+static inline void sched_submit_work(struct task_struct *tsk) | |
+{ | |
+ if (!tsk->state || tsk_is_pi_blocked(tsk)) | |
+ return; | |
+ /* | |
+ * If we are going to sleep and we have plugged IO queued, | |
+ * make sure to submit it to avoid deadlocks. | |
+ */ | |
+ if (blk_needs_flush_plug(tsk)) | |
+ blk_schedule_flush_plug(tsk); | |
+} | |
+ | |
+asmlinkage __visible void __sched schedule(void) | |
+{ | |
+ struct task_struct *tsk = current; | |
+ | |
+ sched_submit_work(tsk); | |
+ __schedule(); | |
+} | |
+EXPORT_SYMBOL(schedule); | |
+ | |
+#ifdef CONFIG_CONTEXT_TRACKING | |
+asmlinkage __visible void __sched schedule_user(void) | |
+{ | |
+ /* | |
+ * If we come here after a random call to set_need_resched(), | |
+ * or we have been woken up remotely but the IPI has not yet arrived, | |
+ * we haven't yet exited the RCU idle mode. Do it here manually until | |
+ * we find a better solution. | |
+ * | |
+ * NB: There are buggy callers of this function. Ideally we | |
+ * should warn if prev_state != IN_USER, but that will trigger | |
+ * too frequently to make sense yet. | |
+ */ | |
+ enum ctx_state prev_state = exception_enter(); | |
+ schedule(); | |
+ exception_exit(prev_state); | |
+} | |
+#endif | |
+ | |
+/** | |
+ * schedule_preempt_disabled - called with preemption disabled | |
+ * | |
+ * Returns with preemption disabled. Note: preempt_count must be 1 | |
+ */ | |
+void __sched schedule_preempt_disabled(void) | |
+{ | |
+ sched_preempt_enable_no_resched(); | |
+ schedule(); | |
+ preempt_disable(); | |
+} | |
+ | |
+#ifdef CONFIG_PREEMPT | |
+/* | |
+ * this is the entry point to schedule() from in-kernel preemption | |
+ * off of preempt_enable. Kernel preemptions off return from interrupt | |
+ * occur there and call schedule directly. | |
+ */ | |
+asmlinkage __visible void __sched notrace preempt_schedule(void) | |
+{ | |
+ /* | |
+ * If there is a non-zero preempt_count or interrupts are disabled, | |
+ * we do not want to preempt the current task. Just return.. | |
+ */ | |
+ if (likely(!preemptible())) | |
+ return; | |
+ | |
+ do { | |
+ __preempt_count_add(PREEMPT_ACTIVE); | |
+ schedule(); | |
+ __preempt_count_sub(PREEMPT_ACTIVE); | |
+ | |
+ /* | |
+ * Check again in case we missed a preemption opportunity | |
+ * between schedule and now. | |
+ */ | |
+ barrier(); | |
+ } while (need_resched()); | |
+} | |
+NOKPROBE_SYMBOL(preempt_schedule); | |
+EXPORT_SYMBOL(preempt_schedule); | |
+ | |
+#ifdef CONFIG_CONTEXT_TRACKING | |
+/** | |
+ * preempt_schedule_context - preempt_schedule called by tracing | |
+ * | |
+ * The tracing infrastructure uses preempt_enable_notrace to prevent | |
+ * recursion and tracing preempt enabling caused by the tracing | |
+ * infrastructure itself. But as tracing can happen in areas coming | |
+ * from userspace or just about to enter userspace, a preempt enable | |
+ * can occur before user_exit() is called. This will cause the scheduler | |
+ * to be called when the system is still in usermode. | |
+ * | |
+ * To prevent this, the preempt_enable_notrace will use this function | |
+ * instead of preempt_schedule() to exit user context if needed before | |
+ * calling the scheduler. | |
+ */ | |
+asmlinkage __visible void __sched notrace preempt_schedule_context(void) | |
+{ | |
+ enum ctx_state prev_ctx; | |
+ | |
+ if (likely(!preemptible())) | |
+ return; | |
+ | |
+ do { | |
+ __preempt_count_add(PREEMPT_ACTIVE); | |
+ /* | |
+ * Needs preempt disabled in case user_exit() is traced | |
+ * and the tracer calls preempt_enable_notrace() causing | |
+ * an infinite recursion. | |
+ */ | |
+ prev_ctx = exception_enter(); | |
+ __schedule(); | |
+ exception_exit(prev_ctx); | |
+ | |
+ __preempt_count_sub(PREEMPT_ACTIVE); | |
+ barrier(); | |
+ } while (need_resched()); | |
+} | |
+EXPORT_SYMBOL_GPL(preempt_schedule_context); | |
+#endif /* CONFIG_CONTEXT_TRACKING */ | |
+ | |
+#endif /* CONFIG_PREEMPT */ | |
+ | |
+/* | |
+ * this is the entry point to schedule() from kernel preemption | |
+ * off of irq context. | |
+ * Note that this is called and returns with irqs disabled. This will | |
+ * protect us against recursive calling from irq. | |
+ */ | |
+asmlinkage __visible void __sched preempt_schedule_irq(void) | |
+{ | |
+ enum ctx_state prev_state; | |
+ | |
+ /* Catch callers which need to be fixed */ | |
+ BUG_ON(preempt_count() || !irqs_disabled()); | |
+ | |
+ prev_state = exception_enter(); | |
+ | |
+ do { | |
+ __preempt_count_add(PREEMPT_ACTIVE); | |
+ local_irq_enable(); | |
+ schedule(); | |
+ local_irq_disable(); | |
+ __preempt_count_sub(PREEMPT_ACTIVE); | |
+ | |
+ /* | |
+ * Check again in case we missed a preemption opportunity | |
+ * between schedule and now. | |
+ */ | |
+ barrier(); | |
+ } while (need_resched()); | |
+ | |
+ exception_exit(prev_state); | |
+} | |
+ | |
+int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags, | |
+ void *key) | |
+{ | |
+ return try_to_wake_up(curr->private, mode, wake_flags); | |
+} | |
+EXPORT_SYMBOL(default_wake_function); | |
+ | |
+#ifdef CONFIG_RT_MUTEXES | |
+ | |
+/* | |
+ * rt_mutex_setprio - set the current priority of a task | |
+ * @p: task | |
+ * @prio: prio value (kernel-internal form) | |
+ * | |
+ * This function changes the 'effective' priority of a task. It does | |
+ * not touch ->normal_prio like __setscheduler(). | |
+ * | |
+ * Used by the rt_mutex code to implement priority inheritance | |
+ * logic. Call site only calls if the priority of the task changed. | |
+ */ | |
+void rt_mutex_setprio(struct task_struct *p, int prio) | |
+{ | |
+ unsigned long flags; | |
+ int queued, oldprio; | |
+ struct rq *rq; | |
+ | |
+ BUG_ON(prio < 0 || prio > MAX_PRIO); | |
+ | |
+ rq = task_grq_lock(p, &flags); | |
+ | |
+ /* | |
+ * Idle task boosting is a no-no in general. There is one | |
+ * exception, when PREEMPT_RT and NOHZ is active: | |
+ * | |
+ * The idle task calls get_next_timer_interrupt() and holds | |
+ * the timer wheel base->lock on the CPU and another CPU wants | |
+ * to access the timer (probably to cancel it). We can safely | |
+ * ignore the boosting request, as the idle CPU runs this code | |
+ * with interrupts disabled and will complete the lock | |
+ * protected section without being interrupted. So there is no | |
+ * real need to boost. | |
+ */ | |
+ if (unlikely(p == rq->idle)) { | |
+ WARN_ON(p != rq->curr); | |
+ WARN_ON(p->pi_blocked_on); | |
+ goto out_unlock; | |
+ } | |
+ | |
+ trace_sched_pi_setprio(p, prio); | |
+ oldprio = p->prio; | |
+ queued = task_queued(p); | |
+ if (queued) | |
+ dequeue_task(p); | |
+ p->prio = prio; | |
+ if (task_running(p) && prio > oldprio) | |
+ resched_task(p); | |
+ if (queued) { | |
+ enqueue_task(p, rq); | |
+ try_preempt(p, rq); | |
+ } | |
+ | |
+out_unlock: | |
+ task_grq_unlock(&flags); | |
+} | |
+ | |
+#endif | |
+ | |
+/* | |
+ * Adjust the deadline for when the priority is to change, before it's | |
+ * changed. | |
+ */ | |
+static inline void adjust_deadline(struct task_struct *p, int new_prio) | |
+{ | |
+ p->deadline += static_deadline_diff(new_prio) - task_deadline_diff(p); | |
+} | |
+ | |
+void set_user_nice(struct task_struct *p, long nice) | |
+{ | |
+ int queued, new_static, old_static; | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ | |
+ if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE) | |
+ return; | |
+ new_static = NICE_TO_PRIO(nice); | |
+ /* | |
+ * We have to be careful, if called from sys_setpriority(), | |
+ * the task might be in the middle of scheduling on another CPU. | |
+ */ | |
+ rq = time_task_grq_lock(p, &flags); | |
+ /* | |
+ * The RT priorities are set via sched_setscheduler(), but we still | |
+ * allow the 'normal' nice value to be set - but as expected | |
+ * it won't have any effect on scheduling while the task has an | |
+ * RT policy, only once it is SCHED_NORMAL/SCHED_BATCH again: | |
+ */ | |
+ if (has_rt_policy(p)) { | |
+ p->static_prio = new_static; | |
+ goto out_unlock; | |
+ } | |
+ queued = task_queued(p); | |
+ if (queued) | |
+ dequeue_task(p); | |
+ | |
+ adjust_deadline(p, new_static); | |
+ old_static = p->static_prio; | |
+ p->static_prio = new_static; | |
+ p->prio = effective_prio(p); | |
+ | |
+ if (queued) { | |
+ enqueue_task(p, rq); | |
+ if (new_static < old_static) | |
+ try_preempt(p, rq); | |
+ } else if (task_running(p)) { | |
+ reset_rq_task(rq, p); | |
+ if (old_static < new_static) | |
+ resched_task(p); | |
+ } | |
+out_unlock: | |
+ task_grq_unlock(&flags); | |
+} | |
+EXPORT_SYMBOL(set_user_nice); | |
+ | |
+/* | |
+ * can_nice - check if a task can reduce its nice value | |
+ * @p: task | |
+ * @nice: nice value | |
+ */ | |
+int can_nice(const struct task_struct *p, const int nice) | |
+{ | |
+ /* convert nice value [19,-20] to rlimit style value [1,40] */ | |
+ int nice_rlim = nice_to_rlimit(nice); | |
+ | |
+ return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) || | |
+ capable(CAP_SYS_NICE)); | |
+} | |
+ | |
+#ifdef __ARCH_WANT_SYS_NICE | |
+ | |
+/* | |
+ * sys_nice - change the priority of the current process. | |
+ * @increment: priority increment | |
+ * | |
+ * sys_setpriority is a more generic, but much slower function that | |
+ * does similar things. | |
+ */ | |
+SYSCALL_DEFINE1(nice, int, increment) | |
+{ | |
+ long nice, retval; | |
+ | |
+ /* | |
+ * Setpriority might change our priority at the same moment. | |
+ * We don't have to worry. Conceptually one call occurs first | |
+ * and we have a single winner. | |
+ */ | |
+ | |
+ increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH); | |
+ nice = task_nice(current) + increment; | |
+ | |
+ nice = clamp_val(nice, MIN_NICE, MAX_NICE); | |
+ if (increment < 0 && !can_nice(current, nice)) | |
+ return -EPERM; | |
+ | |
+ retval = security_task_setnice(current, nice); | |
+ if (retval) | |
+ return retval; | |
+ | |
+ set_user_nice(current, nice); | |
+ return 0; | |
+} | |
+ | |
+#endif | |
+ | |
+/** | |
+ * task_prio - return the priority value of a given task. | |
+ * @p: the task in question. | |
+ * | |
+ * Return: The priority value as seen by users in /proc. | |
+ * RT tasks are offset by -100. Normal tasks are centered around 1, value goes | |
+ * from 0 (SCHED_ISO) up to 82 (nice +19 SCHED_IDLEPRIO). | |
+ */ | |
+int task_prio(const struct task_struct *p) | |
+{ | |
+ int delta, prio = p->prio - MAX_RT_PRIO; | |
+ | |
+ /* rt tasks and iso tasks */ | |
+ if (prio <= 0) | |
+ goto out; | |
+ | |
+ /* Convert to ms to avoid overflows */ | |
+ delta = NS_TO_MS(p->deadline - grq.niffies); | |
+ delta = delta * 40 / ms_longest_deadline_diff(); | |
+ if (delta > 0 && delta <= 80) | |
+ prio += delta; | |
+ if (idleprio_task(p)) | |
+ prio += 40; | |
+out: | |
+ return prio; | |
+} | |
+ | |
+/** | |
+ * idle_cpu - is a given cpu idle currently? | |
+ * @cpu: the processor in question. | |
+ * | |
+ * Return: 1 if the CPU is currently idle. 0 otherwise. | |
+ */ | |
+int idle_cpu(int cpu) | |
+{ | |
+ return cpu_curr(cpu) == cpu_rq(cpu)->idle; | |
+} | |
+ | |
+/** | |
+ * idle_task - return the idle task for a given cpu. | |
+ * @cpu: the processor in question. | |
+ * | |
+ * Return: The idle task for the cpu @cpu. | |
+ */ | |
+struct task_struct *idle_task(int cpu) | |
+{ | |
+ return cpu_rq(cpu)->idle; | |
+} | |
+ | |
+/** | |
+ * find_process_by_pid - find a process with a matching PID value. | |
+ * @pid: the pid in question. | |
+ * | |
+ * The task of @pid, if found. %NULL otherwise. | |
+ */ | |
+static inline struct task_struct *find_process_by_pid(pid_t pid) | |
+{ | |
+ return pid ? find_task_by_vpid(pid) : current; | |
+} | |
+ | |
+/* Actually do priority change: must hold grq lock. */ | |
+static void | |
+__setscheduler(struct task_struct *p, struct rq *rq, int policy, int prio) | |
+{ | |
+ int oldrtprio, oldprio; | |
+ | |
+ p->policy = policy; | |
+ oldrtprio = p->rt_priority; | |
+ p->rt_priority = prio; | |
+ p->normal_prio = normal_prio(p); | |
+ oldprio = p->prio; | |
+ /* we are holding p->pi_lock already */ | |
+ p->prio = rt_mutex_getprio(p); | |
+ if (task_running(p)) { | |
+ reset_rq_task(rq, p); | |
+ /* Resched only if we might now be preempted */ | |
+ if (p->prio > oldprio || p->rt_priority > oldrtprio) | |
+ resched_task(p); | |
+ } | |
+} | |
+ | |
+/* | |
+ * check the target process has a UID that matches the current process's | |
+ */ | |
+static bool check_same_owner(struct task_struct *p) | |
+{ | |
+ const struct cred *cred = current_cred(), *pcred; | |
+ bool match; | |
+ | |
+ rcu_read_lock(); | |
+ pcred = __task_cred(p); | |
+ match = (uid_eq(cred->euid, pcred->euid) || | |
+ uid_eq(cred->euid, pcred->uid)); | |
+ rcu_read_unlock(); | |
+ return match; | |
+} | |
+ | |
+static int __sched_setscheduler(struct task_struct *p, int policy, | |
+ const struct sched_param *param, bool user) | |
+{ | |
+ struct sched_param zero_param = { .sched_priority = 0 }; | |
+ int queued, retval, oldpolicy = -1; | |
+ unsigned long flags, rlim_rtprio = 0; | |
+ int reset_on_fork; | |
+ struct rq *rq; | |
+ | |
+ /* may grab non-irq protected spin_locks */ | |
+ BUG_ON(in_interrupt()); | |
+ | |
+ if (is_rt_policy(policy) && !capable(CAP_SYS_NICE)) { | |
+ unsigned long lflags; | |
+ | |
+ if (!lock_task_sighand(p, &lflags)) | |
+ return -ESRCH; | |
+ rlim_rtprio = task_rlimit(p, RLIMIT_RTPRIO); | |
+ unlock_task_sighand(p, &lflags); | |
+ if (rlim_rtprio) | |
+ goto recheck; | |
+ /* | |
+ * If the caller requested an RT policy without having the | |
+ * necessary rights, we downgrade the policy to SCHED_ISO. | |
+ * We also set the parameter to zero to pass the checks. | |
+ */ | |
+ policy = SCHED_ISO; | |
+ param = &zero_param; | |
+ } | |
+recheck: | |
+ /* double check policy once rq lock held */ | |
+ if (policy < 0) { | |
+ reset_on_fork = p->sched_reset_on_fork; | |
+ policy = oldpolicy = p->policy; | |
+ } else { | |
+ reset_on_fork = !!(policy & SCHED_RESET_ON_FORK); | |
+ policy &= ~SCHED_RESET_ON_FORK; | |
+ | |
+ if (!SCHED_RANGE(policy)) | |
+ return -EINVAL; | |
+ } | |
+ | |
+ /* | |
+ * Valid priorities for SCHED_FIFO and SCHED_RR are | |
+ * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL and | |
+ * SCHED_BATCH is 0. | |
+ */ | |
+ if (param->sched_priority < 0 || | |
+ (p->mm && param->sched_priority > MAX_USER_RT_PRIO - 1) || | |
+ (!p->mm && param->sched_priority > MAX_RT_PRIO - 1)) | |
+ return -EINVAL; | |
+ if (is_rt_policy(policy) != (param->sched_priority != 0)) | |
+ return -EINVAL; | |
+ | |
+ /* | |
+ * Allow unprivileged RT tasks to decrease priority: | |
+ */ | |
+ if (user && !capable(CAP_SYS_NICE)) { | |
+ if (is_rt_policy(policy)) { | |
+ unsigned long rlim_rtprio = | |
+ task_rlimit(p, RLIMIT_RTPRIO); | |
+ | |
+ /* can't set/change the rt policy */ | |
+ if (policy != p->policy && !rlim_rtprio) | |
+ return -EPERM; | |
+ | |
+ /* can't increase priority */ | |
+ if (param->sched_priority > p->rt_priority && | |
+ param->sched_priority > rlim_rtprio) | |
+ return -EPERM; | |
+ } else { | |
+ switch (p->policy) { | |
+ /* | |
+ * Can only downgrade policies but not back to | |
+ * SCHED_NORMAL | |
+ */ | |
+ case SCHED_ISO: | |
+ if (policy == SCHED_ISO) | |
+ goto out; | |
+ if (policy == SCHED_NORMAL) | |
+ return -EPERM; | |
+ break; | |
+ case SCHED_BATCH: | |
+ if (policy == SCHED_BATCH) | |
+ goto out; | |
+ if (policy != SCHED_IDLEPRIO) | |
+ return -EPERM; | |
+ break; | |
+ case SCHED_IDLEPRIO: | |
+ if (policy == SCHED_IDLEPRIO) | |
+ goto out; | |
+ return -EPERM; | |
+ default: | |
+ break; | |
+ } | |
+ } | |
+ | |
+ /* can't change other user's priorities */ | |
+ if (!check_same_owner(p)) | |
+ return -EPERM; | |
+ | |
+ /* Normal users shall not reset the sched_reset_on_fork flag */ | |
+ if (p->sched_reset_on_fork && !reset_on_fork) | |
+ return -EPERM; | |
+ } | |
+ | |
+ if (user) { | |
+ retval = security_task_setscheduler(p); | |
+ if (retval) | |
+ return retval; | |
+ } | |
+ | |
+ /* | |
+ * make sure no PI-waiters arrive (or leave) while we are | |
+ * changing the priority of the task: | |
+ */ | |
+ raw_spin_lock_irqsave(&p->pi_lock, flags); | |
+ /* | |
+ * To be able to change p->policy safely, the grunqueue lock must be | |
+ * held. | |
+ */ | |
+ rq = __task_grq_lock(p); | |
+ | |
+ /* | |
+ * Changing the policy of the stop threads is a very bad idea | |
+ */ | |
+ if (p == rq->stop) { | |
+ __task_grq_unlock(); | |
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags); | |
+ return -EINVAL; | |
+ } | |
+ | |
+ /* | |
+ * If not changing anything there's no need to proceed further: | |
+ */ | |
+ if (unlikely(policy == p->policy && (!is_rt_policy(policy) || | |
+ param->sched_priority == p->rt_priority))) { | |
+ | |
+ __task_grq_unlock(); | |
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags); | |
+ return 0; | |
+ } | |
+ | |
+ /* recheck policy now with rq lock held */ | |
+ if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) { | |
+ policy = oldpolicy = -1; | |
+ __task_grq_unlock(); | |
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags); | |
+ goto recheck; | |
+ } | |
+ update_clocks(rq); | |
+ p->sched_reset_on_fork = reset_on_fork; | |
+ | |
+ queued = task_queued(p); | |
+ if (queued) | |
+ dequeue_task(p); | |
+ __setscheduler(p, rq, policy, param->sched_priority); | |
+ if (queued) { | |
+ enqueue_task(p, rq); | |
+ try_preempt(p, rq); | |
+ } | |
+ __task_grq_unlock(); | |
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags); | |
+ | |
+ rt_mutex_adjust_pi(p); | |
+out: | |
+ return 0; | |
+} | |
+ | |
+/** | |
+ * sched_setscheduler - change the scheduling policy and/or RT priority of a thread. | |
+ * @p: the task in question. | |
+ * @policy: new policy. | |
+ * @param: structure containing the new RT priority. | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ * | |
+ * NOTE that the task may be already dead. | |
+ */ | |
+int sched_setscheduler(struct task_struct *p, int policy, | |
+ const struct sched_param *param) | |
+{ | |
+ return __sched_setscheduler(p, policy, param, true); | |
+} | |
+ | |
+EXPORT_SYMBOL_GPL(sched_setscheduler); | |
+ | |
+int sched_setattr(struct task_struct *p, const struct sched_attr *attr) | |
+{ | |
+ const struct sched_param param = { .sched_priority = attr->sched_priority }; | |
+ int policy = attr->sched_policy; | |
+ | |
+ return __sched_setscheduler(p, policy, ¶m, true); | |
+} | |
+EXPORT_SYMBOL_GPL(sched_setattr); | |
+ | |
+/** | |
+ * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace. | |
+ * @p: the task in question. | |
+ * @policy: new policy. | |
+ * @param: structure containing the new RT priority. | |
+ * | |
+ * Just like sched_setscheduler, only don't bother checking if the | |
+ * current context has permission. For example, this is needed in | |
+ * stop_machine(): we create temporary high priority worker threads, | |
+ * but our caller might not have that capability. | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+int sched_setscheduler_nocheck(struct task_struct *p, int policy, | |
+ const struct sched_param *param) | |
+{ | |
+ return __sched_setscheduler(p, policy, param, false); | |
+} | |
+ | |
+static int | |
+do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param) | |
+{ | |
+ struct sched_param lparam; | |
+ struct task_struct *p; | |
+ int retval; | |
+ | |
+ if (!param || pid < 0) | |
+ return -EINVAL; | |
+ if (copy_from_user(&lparam, param, sizeof(struct sched_param))) | |
+ return -EFAULT; | |
+ | |
+ rcu_read_lock(); | |
+ retval = -ESRCH; | |
+ p = find_process_by_pid(pid); | |
+ if (p != NULL) | |
+ retval = sched_setscheduler(p, policy, &lparam); | |
+ rcu_read_unlock(); | |
+ | |
+ return retval; | |
+} | |
+ | |
+/* | |
+ * Mimics kernel/events/core.c perf_copy_attr(). | |
+ */ | |
+static int sched_copy_attr(struct sched_attr __user *uattr, | |
+ struct sched_attr *attr) | |
+{ | |
+ u32 size; | |
+ int ret; | |
+ | |
+ if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0)) | |
+ return -EFAULT; | |
+ | |
+ /* | |
+ * Zero the full structure, so that a short copy leaves the remaining fields zeroed. | |
+ */ | |
+ memset(attr, 0, sizeof(*attr)); | |
+ | |
+ ret = get_user(size, &uattr->size); | |
+ if (ret) | |
+ return ret; | |
+ | |
+ if (size > PAGE_SIZE) /* silly large */ | |
+ goto err_size; | |
+ | |
+ if (!size) /* abi compat */ | |
+ size = SCHED_ATTR_SIZE_VER0; | |
+ | |
+ if (size < SCHED_ATTR_SIZE_VER0) | |
+ goto err_size; | |
+ | |
+ /* | |
+ * If we're handed a bigger struct than we know of, | |
+ * ensure all the unknown bits are 0 - i.e. new | |
+ * user-space does not rely on any kernel feature | |
+ * extensions we don't know about yet. | |
+ */ | |
+ if (size > sizeof(*attr)) { | |
+ unsigned char __user *addr; | |
+ unsigned char __user *end; | |
+ unsigned char val; | |
+ | |
+ addr = (void __user *)uattr + sizeof(*attr); | |
+ end = (void __user *)uattr + size; | |
+ | |
+ for (; addr < end; addr++) { | |
+ ret = get_user(val, addr); | |
+ if (ret) | |
+ return ret; | |
+ if (val) | |
+ goto err_size; | |
+ } | |
+ size = sizeof(*attr); | |
+ } | |
+ | |
+ ret = copy_from_user(attr, uattr, size); | |
+ if (ret) | |
+ return -EFAULT; | |
+ | |
+ /* | |
+ * XXX: do we want to be lenient like existing syscalls; or do we want | |
+ * to be strict and return an error on out-of-bounds values? | |
+ */ | |
+ attr->sched_nice = clamp(attr->sched_nice, -20, 19); | |
+ | |
+ /* sched/core.c uses zero here but we already know ret is zero */ | |
+ return 0; | |
+ | |
+err_size: | |
+ put_user(sizeof(*attr), &uattr->size); | |
+ return -E2BIG; | |
+} | |
+ | |
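+/* | |
+ * Illustrative summary of the size handling above (an editorial note, | |
+ * not part of the original patch): | |
+ * | |
+ *      uattr->size == 0                 -> treated as SCHED_ATTR_SIZE_VER0 | |
+ *      size < VER0 or size > PAGE_SIZE  -> -E2BIG, kernel size written back | |
+ *      size > sizeof(struct sched_attr) -> accepted only if every extra | |
+ *                                          byte is zero, otherwise -E2BIG | |
+ */ | |
+ | |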
+/** | |
+ * sys_sched_setscheduler - set/change the scheduler policy and RT priority | |
+ * @pid: the pid in question. | |
+ * @policy: new policy. | |
+ * @param: structure containing the new RT priority. | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+asmlinkage long sys_sched_setscheduler(pid_t pid, int policy, | |
+ struct sched_param __user *param) | |
+{ | |
+ /* negative values for policy are not valid */ | |
+ if (policy < 0) | |
+ return -EINVAL; | |
+ | |
+ return do_sched_setscheduler(pid, policy, param); | |
+} | |
+ | |
+/* | |
+ * sched_setparam() passes in -1 for its policy, to let the functions | |
+ * it calls know not to change it. | |
+ */ | |
+#define SETPARAM_POLICY -1 | |
+ | |
+/** | |
+ * sys_sched_setparam - set/change the RT priority of a thread | |
+ * @pid: the pid in question. | |
+ * @param: structure containing the new RT priority. | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param) | |
+{ | |
+ return do_sched_setscheduler(pid, SETPARAM_POLICY, param); | |
+} | |
+ | |
+/** | |
+ * sys_sched_setattr - same as above, but with extended sched_attr | |
+ * @pid: the pid in question. | |
+ * @uattr: structure containing the extended parameters. | |
+ * @flags: for future extension. | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr, | |
+ unsigned int, flags) | |
+{ | |
+ struct sched_attr attr; | |
+ struct task_struct *p; | |
+ int retval; | |
+ | |
+ if (!uattr || pid < 0 || flags) | |
+ return -EINVAL; | |
+ | |
+ retval = sched_copy_attr(uattr, &attr); | |
+ if (retval) | |
+ return retval; | |
+ | |
+ if ((int)attr.sched_policy < 0) | |
+ return -EINVAL; | |
+ | |
+ rcu_read_lock(); | |
+ retval = -ESRCH; | |
+ p = find_process_by_pid(pid); | |
+ if (p != NULL) | |
+ retval = sched_setattr(p, &attr); | |
+ rcu_read_unlock(); | |
+ | |
+ return retval; | |
+} | |
+ | |
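+/* | |
+ * Illustrative userspace sketch (an editorial example, not part of this | |
+ * patch): there is no glibc wrapper, so the raw syscall is used, and | |
+ * struct sched_attr is assumed to be declared locally to match the | |
+ * kernel's layout. | |
+ * | |
+ *      #include <sched.h> | |
+ *      #include <stdio.h> | |
+ *      #include <string.h> | |
+ *      #include <sys/syscall.h> | |
+ *      #include <unistd.h> | |
+ * | |
+ *      struct sched_attr attr; | |
+ * | |
+ *      memset(&attr, 0, sizeof(attr)); | |
+ *      attr.size = sizeof(attr); | |
+ *      attr.sched_policy = SCHED_FIFO; | |
+ *      attr.sched_priority = 10; | |
+ *      if (syscall(__NR_sched_setattr, 0, &attr, 0)) | |
+ *              perror("sched_setattr"); | |
+ */ | |
+ | |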
+/** | |
+ * sys_sched_getscheduler - get the policy (scheduling class) of a thread | |
+ * @pid: the pid in question. | |
+ * | |
+ * Return: On success, the policy of the thread. Otherwise, a negative error | |
+ * code. | |
+ */ | |
+SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid) | |
+{ | |
+ struct task_struct *p; | |
+ int retval = -EINVAL; | |
+ | |
+ if (pid < 0) | |
+ goto out_nounlock; | |
+ | |
+ retval = -ESRCH; | |
+ rcu_read_lock(); | |
+ p = find_process_by_pid(pid); | |
+ if (p) { | |
+ retval = security_task_getscheduler(p); | |
+ if (!retval) | |
+ retval = p->policy; | |
+ } | |
+ rcu_read_unlock(); | |
+ | |
+out_nounlock: | |
+ return retval; | |
+} | |
+ | |
+/** | |
+ * sys_sched_getparam - get the RT priority of a thread | |
+ * @pid: the pid in question. | |
+ * @param: structure containing the RT priority. | |
+ * | |
+ * Return: On success, 0 and the RT priority is in @param. Otherwise, an error | |
+ * code. | |
+ */ | |
+SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param) | |
+{ | |
+ struct sched_param lp = { .sched_priority = 0 }; | |
+ struct task_struct *p; | |
+ int retval = -EINVAL; | |
+ | |
+ if (!param || pid < 0) | |
+ goto out_nounlock; | |
+ | |
+ rcu_read_lock(); | |
+ p = find_process_by_pid(pid); | |
+ retval = -ESRCH; | |
+ if (!p) | |
+ goto out_unlock; | |
+ | |
+ retval = security_task_getscheduler(p); | |
+ if (retval) | |
+ goto out_unlock; | |
+ | |
+ if (has_rt_policy(p)) | |
+ lp.sched_priority = p->rt_priority; | |
+ rcu_read_unlock(); | |
+ | |
+ /* | |
+ * This one might sleep, we cannot do it with a spinlock held ... | |
+ */ | |
+ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0; | |
+ | |
+out_nounlock: | |
+ return retval; | |
+ | |
+out_unlock: | |
+ rcu_read_unlock(); | |
+ return retval; | |
+} | |
+ | |
+static int sched_read_attr(struct sched_attr __user *uattr, | |
+ struct sched_attr *attr, | |
+ unsigned int usize) | |
+{ | |
+ int ret; | |
+ | |
+ if (!access_ok(VERIFY_WRITE, uattr, usize)) | |
+ return -EFAULT; | |
+ | |
+ /* | |
+ * If we're handed a smaller struct than we know of, | |
+ * ensure all the unknown bits are 0 - i.e. old | |
+ * user-space does not get incomplete information. | |
+ */ | |
+ if (usize < sizeof(*attr)) { | |
+ unsigned char *addr; | |
+ unsigned char *end; | |
+ | |
+ addr = (void *)attr + usize; | |
+ end = (void *)attr + sizeof(*attr); | |
+ | |
+ for (; addr < end; addr++) { | |
+ if (*addr) | |
+ return -EFBIG; | |
+ } | |
+ | |
+ attr->size = usize; | |
+ } | |
+ | |
+ ret = copy_to_user(uattr, attr, attr->size); | |
+ if (ret) | |
+ return -EFAULT; | |
+ | |
+ /* sched/core.c uses zero here but we already know ret is zero */ | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * sys_sched_getattr - similar to sched_getparam, but with sched_attr | |
+ * @pid: the pid in question. | |
+ * @uattr: structure containing the extended parameters. | |
+ * @size: sizeof(attr) for fwd/bwd comp. | |
+ * @flags: for future extension. | |
+ */ | |
+SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr, | |
+ unsigned int, size, unsigned int, flags) | |
+{ | |
+ struct sched_attr attr = { | |
+ .size = sizeof(struct sched_attr), | |
+ }; | |
+ struct task_struct *p; | |
+ int retval; | |
+ | |
+ if (!uattr || pid < 0 || size > PAGE_SIZE || | |
+ size < SCHED_ATTR_SIZE_VER0 || flags) | |
+ return -EINVAL; | |
+ | |
+ rcu_read_lock(); | |
+ p = find_process_by_pid(pid); | |
+ retval = -ESRCH; | |
+ if (!p) | |
+ goto out_unlock; | |
+ | |
+ retval = security_task_getscheduler(p); | |
+ if (retval) | |
+ goto out_unlock; | |
+ | |
+ attr.sched_policy = p->policy; | |
+ if (rt_task(p)) | |
+ attr.sched_priority = p->rt_priority; | |
+ else | |
+ attr.sched_nice = task_nice(p); | |
+ | |
+ rcu_read_unlock(); | |
+ | |
+ retval = sched_read_attr(uattr, &attr, size); | |
+ return retval; | |
+ | |
+out_unlock: | |
+ rcu_read_unlock(); | |
+ return retval; | |
+} | |
+ | |
+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) | |
+{ | |
+ cpumask_var_t cpus_allowed, new_mask; | |
+ struct task_struct *p; | |
+ int retval; | |
+ | |
+ get_online_cpus(); | |
+ rcu_read_lock(); | |
+ | |
+ p = find_process_by_pid(pid); | |
+ if (!p) { | |
+ rcu_read_unlock(); | |
+ put_online_cpus(); | |
+ return -ESRCH; | |
+ } | |
+ | |
+ /* Prevent p going away */ | |
+ get_task_struct(p); | |
+ rcu_read_unlock(); | |
+ | |
+ if (p->flags & PF_NO_SETAFFINITY) { | |
+ retval = -EINVAL; | |
+ goto out_put_task; | |
+ } | |
+ if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) { | |
+ retval = -ENOMEM; | |
+ goto out_put_task; | |
+ } | |
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) { | |
+ retval = -ENOMEM; | |
+ goto out_free_cpus_allowed; | |
+ } | |
+ retval = -EPERM; | |
+ if (!check_same_owner(p)) { | |
+ rcu_read_lock(); | |
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) { | |
+ rcu_read_unlock(); | |
+ goto out_unlock; | |
+ } | |
+ rcu_read_unlock(); | |
+ } | |
+ | |
+ retval = security_task_setscheduler(p); | |
+ if (retval) | |
+ goto out_unlock; | |
+ | |
+ cpuset_cpus_allowed(p, cpus_allowed); | |
+ cpumask_and(new_mask, in_mask, cpus_allowed); | |
+again: | |
+ retval = set_cpus_allowed_ptr(p, new_mask); | |
+ | |
+ if (!retval) { | |
+ cpuset_cpus_allowed(p, cpus_allowed); | |
+ if (!cpumask_subset(new_mask, cpus_allowed)) { | |
+ /* | |
+ * We must have raced with a concurrent cpuset | |
+ * update. Just reset the cpus_allowed to the | |
+ * cpuset's cpus_allowed | |
+ */ | |
+ cpumask_copy(new_mask, cpus_allowed); | |
+ goto again; | |
+ } | |
+ } | |
+out_unlock: | |
+ free_cpumask_var(new_mask); | |
+out_free_cpus_allowed: | |
+ free_cpumask_var(cpus_allowed); | |
+out_put_task: | |
+ put_task_struct(p); | |
+ put_online_cpus(); | |
+ return retval; | |
+} | |
+ | |
+static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len, | |
+ cpumask_t *new_mask) | |
+{ | |
+ if (len < sizeof(cpumask_t)) { | |
+ memset(new_mask, 0, sizeof(cpumask_t)); | |
+ } else if (len > sizeof(cpumask_t)) { | |
+ len = sizeof(cpumask_t); | |
+ } | |
+ return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0; | |
+} | |
+ | |
+ | |
+/** | |
+ * sys_sched_setaffinity - set the cpu affinity of a process | |
+ * @pid: pid of the process | |
+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr | |
+ * @user_mask_ptr: user-space pointer to the new cpu mask | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len, | |
+ unsigned long __user *, user_mask_ptr) | |
+{ | |
+ cpumask_var_t new_mask; | |
+ int retval; | |
+ | |
+ if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) | |
+ return -ENOMEM; | |
+ | |
+ retval = get_user_cpu_mask(user_mask_ptr, len, new_mask); | |
+ if (retval == 0) | |
+ retval = sched_setaffinity(pid, new_mask); | |
+ free_cpumask_var(new_mask); | |
+ return retval; | |
+} | |
+ | |
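+/* | |
+ * Illustrative userspace sketch (editorial, not part of this patch): pin | |
+ * the calling thread to cpu 0 via the glibc wrapper. | |
+ * | |
+ *      #define _GNU_SOURCE | |
+ *      #include <sched.h> | |
+ * | |
+ *      cpu_set_t set; | |
+ * | |
+ *      CPU_ZERO(&set); | |
+ *      CPU_SET(0, &set); | |
+ *      sched_setaffinity(0, sizeof(set), &set); | |
+ */ | |
+ | |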
+long sched_getaffinity(pid_t pid, cpumask_t *mask) | |
+{ | |
+ struct task_struct *p; | |
+ unsigned long flags; | |
+ int retval; | |
+ | |
+ get_online_cpus(); | |
+ rcu_read_lock(); | |
+ | |
+ retval = -ESRCH; | |
+ p = find_process_by_pid(pid); | |
+ if (!p) | |
+ goto out_unlock; | |
+ | |
+ retval = security_task_getscheduler(p); | |
+ if (retval) | |
+ goto out_unlock; | |
+ | |
+ grq_lock_irqsave(&flags); | |
+ cpumask_and(mask, tsk_cpus_allowed(p), cpu_active_mask); | |
+ grq_unlock_irqrestore(&flags); | |
+ | |
+out_unlock: | |
+ rcu_read_unlock(); | |
+ put_online_cpus(); | |
+ | |
+ return retval; | |
+} | |
+ | |
+/** | |
+ * sys_sched_getaffinity - get the cpu affinity of a process | |
+ * @pid: pid of the process | |
+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr | |
+ * @user_mask_ptr: user-space pointer to hold the current cpu mask | |
+ * | |
+ * Return: 0 on success. An error code otherwise. | |
+ */ | |
+SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, | |
+ unsigned long __user *, user_mask_ptr) | |
+{ | |
+ int ret; | |
+ cpumask_var_t mask; | |
+ | |
+ if ((len * BITS_PER_BYTE) < nr_cpu_ids) | |
+ return -EINVAL; | |
+ if (len & (sizeof(unsigned long)-1)) | |
+ return -EINVAL; | |
+ | |
+ if (!alloc_cpumask_var(&mask, GFP_KERNEL)) | |
+ return -ENOMEM; | |
+ | |
+ ret = sched_getaffinity(pid, mask); | |
+ if (ret == 0) { | |
+ size_t retlen = min_t(size_t, len, cpumask_size()); | |
+ | |
+ if (copy_to_user(user_mask_ptr, mask, retlen)) | |
+ ret = -EFAULT; | |
+ else | |
+ ret = retlen; | |
+ } | |
+ free_cpumask_var(mask); | |
+ | |
+ return ret; | |
+} | |
+ | |
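+/* | |
+ * Illustrative note (editorial, not part of this patch): the raw syscall | |
+ * returns the number of bytes written into the user mask; glibc's | |
+ * sched_getaffinity() wrapper hides this and returns 0 on success: | |
+ * | |
+ *      #define _GNU_SOURCE | |
+ *      #include <sched.h> | |
+ *      #include <stdio.h> | |
+ * | |
+ *      cpu_set_t set; | |
+ * | |
+ *      if (sched_getaffinity(0, sizeof(set), &set) == 0) | |
+ *              printf("%d cpus allowed\n", CPU_COUNT(&set)); | |
+ */ | |
+ | |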
+/** | |
+ * sys_sched_yield - yield the current processor to other threads. | |
+ * | |
+ * This function yields the current CPU to other tasks. It does this by | |
+ * scheduling away the current task. If it still has the earliest deadline | |
+ * it will be scheduled again as the next task. | |
+ * | |
+ * Return: 0. | |
+ */ | |
+SYSCALL_DEFINE0(sched_yield) | |
+{ | |
+ struct task_struct *p; | |
+ | |
+ p = current; | |
+ grq_lock_irq(); | |
+ schedstat_inc(task_rq(p), yld_count); | |
+ requeue_task(p); | |
+ | |
+ /* | |
+ * Since we are going to call schedule() anyway, there's | |
+ * no need to preempt or enable interrupts: | |
+ */ | |
+ __release(grq.lock); | |
+ spin_release(&grq.lock.dep_map, 1, _THIS_IP_); | |
+ do_raw_spin_unlock(&grq.lock); | |
+ sched_preempt_enable_no_resched(); | |
+ | |
+ schedule(); | |
+ | |
+ return 0; | |
+} | |
+ | |
+static void __cond_resched(void) | |
+{ | |
+ __preempt_count_add(PREEMPT_ACTIVE); | |
+ schedule(); | |
+ __preempt_count_sub(PREEMPT_ACTIVE); | |
+} | |
+ | |
+int __sched _cond_resched(void) | |
+{ | |
+ if (should_resched()) { | |
+ __cond_resched(); | |
+ return 1; | |
+ } | |
+ return 0; | |
+} | |
+EXPORT_SYMBOL(_cond_resched); | |
+ | |
+/* | |
+ * __cond_resched_lock() - if a reschedule is pending, drop the given lock, | |
+ * call schedule, and on return reacquire the lock. | |
+ * | |
+ * This works OK both with and without CONFIG_PREEMPT. We do strange low-level | |
+ * operations here to prevent schedule() from being called twice (once via | |
+ * spin_unlock(), once by hand). | |
+ */ | |
+int __cond_resched_lock(spinlock_t *lock) | |
+{ | |
+ int resched = should_resched(); | |
+ int ret = 0; | |
+ | |
+ lockdep_assert_held(lock); | |
+ | |
+ if (spin_needbreak(lock) || resched) { | |
+ spin_unlock(lock); | |
+ if (resched) | |
+ __cond_resched(); | |
+ else | |
+ cpu_relax(); | |
+ ret = 1; | |
+ spin_lock(lock); | |
+ } | |
+ return ret; | |
+} | |
+EXPORT_SYMBOL(__cond_resched_lock); | |
+ | |
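+/* | |
+ * Illustrative usage sketch (editorial, not from this patch; more_work() | |
+ * and do_one_item() are hypothetical helpers): a long loop under a | |
+ * spinlock can offer to reschedule without open-coding the drop/retake: | |
+ * | |
+ *      spin_lock(&lock); | |
+ *      while (more_work()) { | |
+ *              do_one_item(); | |
+ *              cond_resched_lock(&lock);  -- may drop and retake the lock | |
+ *      } | |
+ *      spin_unlock(&lock); | |
+ */ | |
+ | |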
+int __sched __cond_resched_softirq(void) | |
+{ | |
+ BUG_ON(!in_softirq()); | |
+ | |
+ if (should_resched()) { | |
+ local_bh_enable(); | |
+ __cond_resched(); | |
+ local_bh_disable(); | |
+ return 1; | |
+ } | |
+ return 0; | |
+} | |
+EXPORT_SYMBOL(__cond_resched_softirq); | |
+ | |
+/** | |
+ * yield - yield the current processor to other threads. | |
+ * | |
+ * Do not ever use this function, there's a 99% chance you're doing it wrong. | |
+ * | |
+ * The scheduler is at all times free to pick the calling task as the most | |
+ * eligible task to run, if removing the yield() call from your code breaks | |
+ * it, it's already broken. | |
+ * | |
+ * Typical broken usage is: | |
+ * | |
+ * while (!event) | |
+ * yield(); | |
+ * | |
+ * where one assumes that yield() will let 'the other' process run that will | |
+ * make event true. If the current task is a SCHED_FIFO task that will never | |
+ * happen. Never use yield() as a progress guarantee!! | |
+ * | |
+ * If you want to use yield() to wait for something, use wait_event(). | |
+ * If you want to use yield() to be 'nice' for others, use cond_resched(). | |
+ * If you still want to use yield(), do not! | |
+ */ | |
+void __sched yield(void) | |
+{ | |
+ set_current_state(TASK_RUNNING); | |
+ sys_sched_yield(); | |
+} | |
+EXPORT_SYMBOL(yield); | |
+ | |
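+/* | |
+ * Illustrative restatement of the advice above (editorial, not from this | |
+ * patch; my_wq and event are hypothetical): a waiter should sleep on a | |
+ * waitqueue and be woken by the producer instead of spinning on yield(): | |
+ * | |
+ *      DECLARE_WAIT_QUEUE_HEAD(my_wq); | |
+ * | |
+ *      wait_event(my_wq, event);  -- waiter side | |
+ * | |
+ *      event = true; | |
+ *      wake_up(&my_wq);           -- producer side | |
+ */ | |
+ | |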
+/** | |
+ * yield_to - yield the current processor to another thread in | |
+ * your thread group, or accelerate that thread toward the | |
+ * processor it's on. | |
+ * @p: target task | |
+ * @preempt: whether task preemption is allowed or not | |
+ * | |
+ * It's the caller's job to ensure that the target task struct | |
+ * can't go away on us before we can do any checks. | |
+ * | |
+ * Return: | |
+ * true (>0) if we indeed boosted the target task. | |
+ * false (0) if we failed to boost the target. | |
+ * -ESRCH if there's no task to yield to. | |
+ */ | |
+int __sched yield_to(struct task_struct *p, bool preempt) | |
+{ | |
+ struct rq *rq, *p_rq; | |
+ unsigned long flags; | |
+ int yielded = 0; | |
+ | |
+ rq = this_rq(); | |
+ grq_lock_irqsave(&flags); | |
+ if (task_running(p) || p->state) { | |
+ yielded = -ESRCH; | |
+ goto out_unlock; | |
+ } | |
+ | |
+ p_rq = task_rq(p); | |
+ yielded = 1; | |
+ if (p->deadline > rq->rq_deadline) | |
+ p->deadline = rq->rq_deadline; | |
+ p->time_slice += rq->rq_time_slice; | |
+ rq->rq_time_slice = 0; | |
+ if (p->time_slice > timeslice()) | |
+ p->time_slice = timeslice(); | |
+ if (preempt && rq != p_rq) | |
+ resched_curr(p_rq); | |
+out_unlock: | |
+ grq_unlock_irqrestore(&flags); | |
+ | |
+ if (yielded > 0) | |
+ schedule(); | |
+ return yielded; | |
+} | |
+EXPORT_SYMBOL_GPL(yield_to); | |
+ | |
+/* | |
+ * This task is about to go to sleep on IO. Increment rq->nr_iowait so | |
+ * that process accounting knows that this is a task in IO wait state. | |
+ * | |
+ * But don't do that if it is a deliberate, throttling IO wait (this task | |
+ * has set its backing_dev_info: the queue against which it should throttle) | |
+ */ | |
+void __sched io_schedule(void) | |
+{ | |
+ struct rq *rq = raw_rq(); | |
+ | |
+ delayacct_blkio_start(); | |
+ atomic_inc(&rq->nr_iowait); | |
+ blk_flush_plug(current); | |
+ current->in_iowait = 1; | |
+ schedule(); | |
+ current->in_iowait = 0; | |
+ atomic_dec(&rq->nr_iowait); | |
+ delayacct_blkio_end(); | |
+} | |
+EXPORT_SYMBOL(io_schedule); | |
+ | |
+long __sched io_schedule_timeout(long timeout) | |
+{ | |
+ struct rq *rq = raw_rq(); | |
+ long ret; | |
+ | |
+ delayacct_blkio_start(); | |
+ atomic_inc(&rq->nr_iowait); | |
+ blk_flush_plug(current); | |
+ current->in_iowait = 1; | |
+ ret = schedule_timeout(timeout); | |
+ current->in_iowait = 0; | |
+ atomic_dec(&rq->nr_iowait); | |
+ delayacct_blkio_end(); | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * sys_sched_get_priority_max - return maximum RT priority. | |
+ * @policy: scheduling class. | |
+ * | |
+ * Return: On success, this syscall returns the maximum | |
+ * rt_priority that can be used by a given scheduling class. | |
+ * On failure, a negative error code is returned. | |
+ */ | |
+SYSCALL_DEFINE1(sched_get_priority_max, int, policy) | |
+{ | |
+ int ret = -EINVAL; | |
+ | |
+ switch (policy) { | |
+ case SCHED_FIFO: | |
+ case SCHED_RR: | |
+ ret = MAX_USER_RT_PRIO-1; | |
+ break; | |
+ case SCHED_NORMAL: | |
+ case SCHED_BATCH: | |
+ case SCHED_ISO: | |
+ case SCHED_IDLEPRIO: | |
+ ret = 0; | |
+ break; | |
+ } | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * sys_sched_get_priority_min - return minimum RT priority. | |
+ * @policy: scheduling class. | |
+ * | |
+ * Return: On success, this syscall returns the minimum | |
+ * rt_priority that can be used by a given scheduling class. | |
+ * On failure, a negative error code is returned. | |
+ */ | |
+SYSCALL_DEFINE1(sched_get_priority_min, int, policy) | |
+{ | |
+ int ret = -EINVAL; | |
+ | |
+ switch (policy) { | |
+ case SCHED_FIFO: | |
+ case SCHED_RR: | |
+ ret = 1; | |
+ break; | |
+ case SCHED_NORMAL: | |
+ case SCHED_BATCH: | |
+ case SCHED_ISO: | |
+ case SCHED_IDLEPRIO: | |
+ ret = 0; | |
+ break; | |
+ } | |
+ return ret; | |
+} | |
+ | |
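+/* | |
+ * Illustrative userspace sketch (editorial, not part of this patch): | |
+ * query the valid range before choosing an RT priority. | |
+ * | |
+ *      #include <sched.h> | |
+ *      #include <stdio.h> | |
+ * | |
+ *      int lo = sched_get_priority_min(SCHED_FIFO); | |
+ *      int hi = sched_get_priority_max(SCHED_FIFO); | |
+ * | |
+ *      printf("SCHED_FIFO priorities: %d..%d\n", lo, hi); | |
+ */ | |
+ | |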
+/** | |
+ * sys_sched_rr_get_interval - return the default timeslice of a process. | |
+ * @pid: pid of the process. | |
+ * @interval: userspace pointer to the timeslice value. | |
+ * | |
+ * Return: On success, 0 and the timeslice is in @interval. Otherwise, | |
+ * an error code. | |
+ */ | |
+SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid, | |
+ struct timespec __user *, interval) | |
+{ | |
+ struct task_struct *p; | |
+ unsigned int time_slice; | |
+ unsigned long flags; | |
+ int retval; | |
+ struct timespec t; | |
+ | |
+ if (pid < 0) | |
+ return -EINVAL; | |
+ | |
+ retval = -ESRCH; | |
+ rcu_read_lock(); | |
+ p = find_process_by_pid(pid); | |
+ if (!p) | |
+ goto out_unlock; | |
+ | |
+ retval = security_task_getscheduler(p); | |
+ if (retval) | |
+ goto out_unlock; | |
+ | |
+ grq_lock_irqsave(&flags); | |
+ time_slice = p->policy == SCHED_FIFO ? 0 : MS_TO_NS(task_timeslice(p)); | |
+ grq_unlock_irqrestore(&flags); | |
+ | |
+ rcu_read_unlock(); | |
+ t = ns_to_timespec(time_slice); | |
+ retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0; | |
+ return retval; | |
+ | |
+out_unlock: | |
+ rcu_read_unlock(); | |
+ return retval; | |
+} | |
+ | |
+static const char stat_nam[] = TASK_STATE_TO_CHAR_STR; | |
+ | |
+void sched_show_task(struct task_struct *p) | |
+{ | |
+ unsigned long free = 0; | |
+ int ppid; | |
+ unsigned state; | |
+ | |
+ state = p->state ? __ffs(p->state) + 1 : 0; | |
+ printk(KERN_INFO "%-15.15s %c", p->comm, | |
+ state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?'); | |
+#if BITS_PER_LONG == 32 | |
+ if (state == TASK_RUNNING) | |
+ printk(KERN_CONT " running "); | |
+ else | |
+ printk(KERN_CONT " %08lx ", thread_saved_pc(p)); | |
+#else | |
+ if (state == TASK_RUNNING) | |
+ printk(KERN_CONT " running task "); | |
+ else | |
+ printk(KERN_CONT " %016lx ", thread_saved_pc(p)); | |
+#endif | |
+#ifdef CONFIG_DEBUG_STACK_USAGE | |
+ free = stack_not_used(p); | |
+#endif | |
+ rcu_read_lock(); | |
+ ppid = task_pid_nr(rcu_dereference(p->real_parent)); | |
+ rcu_read_unlock(); | |
+ printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free, | |
+ task_pid_nr(p), ppid, | |
+ (unsigned long)task_thread_info(p)->flags); | |
+ | |
+ print_worker_info(KERN_INFO, p); | |
+ show_stack(p, NULL); | |
+} | |
+ | |
+void show_state_filter(unsigned long state_filter) | |
+{ | |
+ struct task_struct *g, *p; | |
+ | |
+#if BITS_PER_LONG == 32 | |
+ printk(KERN_INFO | |
+ " task PC stack pid father\n"); | |
+#else | |
+ printk(KERN_INFO | |
+ " task PC stack pid father\n"); | |
+#endif | |
+ rcu_read_lock(); | |
+ for_each_process_thread(g, p) { | |
+ /* | |
+ * reset the NMI-timeout, listing all tasks on a slow | |
+ * console might take a lot of time: | |
+ */ | |
+ touch_nmi_watchdog(); | |
+ if (!state_filter || (p->state & state_filter)) | |
+ sched_show_task(p); | |
+ } | |
+ | |
+ touch_all_softlockup_watchdogs(); | |
+ | |
+ rcu_read_unlock(); | |
+ /* | |
+ * Only show locks if all tasks are dumped: | |
+ */ | |
+ if (!state_filter) | |
+ debug_show_all_locks(); | |
+} | |
+ | |
+void dump_cpu_task(int cpu) | |
+{ | |
+ pr_info("Task dump for CPU %d:\n", cpu); | |
+ sched_show_task(cpu_curr(cpu)); | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask) | |
+{ | |
+ cpumask_copy(tsk_cpus_allowed(p), new_mask); | |
+} | |
+#endif | |
+ | |
+/** | |
+ * init_idle - set up an idle thread for a given CPU | |
+ * @idle: task in question | |
+ * @cpu: cpu the idle task belongs to | |
+ * | |
+ * NOTE: this function does not set the idle thread's NEED_RESCHED | |
+ * flag, to make booting more robust. | |
+ */ | |
+void init_idle(struct task_struct *idle, int cpu) | |
+{ | |
+ struct rq *rq = cpu_rq(cpu); | |
+ unsigned long flags; | |
+ | |
+ time_grq_lock(rq, &flags); | |
+ idle->last_ran = rq->clock_task; | |
+ idle->state = TASK_RUNNING; | |
+ /* Setting prio to illegal value shouldn't matter when never queued */ | |
+ idle->prio = PRIO_LIMIT; | |
+#ifdef CONFIG_SMT_NICE | |
+ idle->smt_bias = 0; | |
+#endif | |
+ set_rq_task(rq, idle); | |
+ do_set_cpus_allowed(idle, &cpumask_of_cpu(cpu)); | |
+ /* Silence PROVE_RCU */ | |
+ rcu_read_lock(); | |
+ set_task_cpu(idle, cpu); | |
+ rcu_read_unlock(); | |
+ rq->curr = rq->idle = idle; | |
+ idle->on_cpu = 1; | |
+ grq_unlock_irqrestore(&flags); | |
+ | |
+ /* Set the preempt count _outside_ the spinlocks! */ | |
+ init_idle_preempt_count(idle, cpu); | |
+ | |
+ ftrace_graph_init_idle_task(idle, cpu); | |
+#if defined(CONFIG_SMP) | |
+ sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu); | |
+#endif | |
+} | |
+ | |
+void resched_cpu(int cpu) | |
+{ | |
+ unsigned long flags; | |
+ | |
+ grq_lock_irqsave(&flags); | |
+ resched_task(cpu_curr(cpu)); | |
+ grq_unlock_irqrestore(&flags); | |
+} | |
+ | |
+#ifdef CONFIG_SMP | |
+#ifdef CONFIG_NO_HZ_COMMON | |
+void nohz_balance_enter_idle(int cpu) | |
+{ | |
+} | |
+ | |
+void select_nohz_load_balancer(int stop_tick) | |
+{ | |
+} | |
+ | |
+void set_cpu_sd_state_idle(void) {} | |
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT) | |
+/** | |
+ * lowest_flag_domain - Return lowest sched_domain containing flag. | |
+ * @cpu: The cpu whose lowest level of sched domain is to | |
+ * be returned. | |
+ * @flag: The flag to check for the lowest sched_domain | |
+ * for the given cpu. | |
+ * | |
+ * Returns the lowest sched_domain of a cpu which contains the given flag. | |
+ */ | |
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) | |
+{ | |
+ struct sched_domain *sd; | |
+ | |
+ for_each_domain(cpu, sd) | |
+ if (sd && (sd->flags & flag)) | |
+ break; | |
+ | |
+ return sd; | |
+} | |
+ | |
+/** | |
+ * for_each_flag_domain - Iterates over sched_domains containing the flag. | |
+ * @cpu: The cpu whose domains we're iterating over. | |
+ * @sd: variable holding the value of the power_savings_sd | |
+ * for cpu. | |
+ * @flag: The flag to filter the sched_domains to be iterated. | |
+ * | |
+ * Iterates over all the scheduler domains for a given cpu that has the 'flag' | |
+ * set, starting from the lowest sched_domain to the highest. | |
+ */ | |
+#define for_each_flag_domain(cpu, sd, flag) \ | |
+ for (sd = lowest_flag_domain(cpu, flag); \ | |
+ (sd && (sd->flags & flag)); sd = sd->parent) | |
+ | |
+#endif /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */ | |
+ | |
+/* | |
+ * In the semi idle case, use the nearest busy cpu for migrating timers | |
+ * from an idle cpu. This is good for power-savings. | |
+ * | |
+ * We don't do a similar optimization for a completely idle system, as | |
+ * selecting an idle cpu would add more delay to the timers than intended | |
+ * (as that cpu's timer base may not be up to date wrt jiffies etc). | |
+ */ | |
+int get_nohz_timer_target(int pinned) | |
+{ | |
+ int cpu = smp_processor_id(); | |
+ int i; | |
+ struct sched_domain *sd; | |
+ | |
+ if (pinned || !get_sysctl_timer_migration() || !idle_cpu(cpu)) | |
+ return cpu; | |
+ | |
+ rcu_read_lock(); | |
+ for_each_domain(cpu, sd) { | |
+ for_each_cpu(i, sched_domain_span(sd)) { | |
+ if (!idle_cpu(i)) { | |
+ cpu = i; | |
+ goto unlock; | |
+ } | |
+ } | |
+ } | |
+unlock: | |
+ rcu_read_unlock(); | |
+ return cpu; | |
+} | |
+ | |
+/* | |
+ * When add_timer_on() enqueues a timer into the timer wheel of an | |
+ * idle CPU then this timer might expire before the next timer event | |
+ * which is scheduled to wake up that CPU. In case of a completely | |
+ * idle system the next event might even be infinite time into the | |
+ * future. wake_up_idle_cpu() ensures that the CPU is woken up and | |
+ * leaves the inner idle loop so the newly added timer is taken into | |
+ * account when the CPU goes back to idle and evaluates the timer | |
+ * wheel for the next timer event. | |
+ */ | |
+void wake_up_idle_cpu(int cpu) | |
+{ | |
+ if (cpu == smp_processor_id()) | |
+ return; | |
+ | |
+ set_tsk_need_resched(cpu_rq(cpu)->idle); | |
+ smp_send_reschedule(cpu); | |
+} | |
+ | |
+void wake_up_nohz_cpu(int cpu) | |
+{ | |
+ wake_up_idle_cpu(cpu); | |
+} | |
+#endif /* CONFIG_NO_HZ_COMMON */ | |
+ | |
+/* | |
+ * Change a given task's CPU affinity. Migrate the thread to a | |
+ * proper CPU and schedule it away if the CPU it's executing on | |
+ * is removed from the allowed bitmask. | |
+ * | |
+ * NOTE: the caller must have a valid reference to the task, the | |
+ * task must not exit() & deallocate itself prematurely. The | |
+ * call is not atomic; no spinlocks may be held. | |
+ */ | |
+int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask) | |
+{ | |
+ bool running_wrong = false; | |
+ bool queued = false; | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ int ret = 0; | |
+ | |
+ rq = task_grq_lock(p, &flags); | |
+ | |
+ if (cpumask_equal(tsk_cpus_allowed(p), new_mask)) | |
+ goto out; | |
+ | |
+ if (!cpumask_intersects(new_mask, cpu_active_mask)) { | |
+ ret = -EINVAL; | |
+ goto out; | |
+ } | |
+ | |
+ queued = task_queued(p); | |
+ | |
+ do_set_cpus_allowed(p, new_mask); | |
+ | |
+ /* Can the task run on the task's current CPU? If so, we're done */ | |
+ if (cpumask_test_cpu(task_cpu(p), new_mask)) | |
+ goto out; | |
+ | |
+ if (task_running(p)) { | |
+ /* Task is running on the wrong cpu now, reschedule it. */ | |
+ if (rq == this_rq()) { | |
+ set_tsk_need_resched(p); | |
+ running_wrong = true; | |
+ } else | |
+ resched_task(p); | |
+ } else | |
+ set_task_cpu(p, cpumask_any_and(cpu_active_mask, new_mask)); | |
+ | |
+out: | |
+ if (queued) | |
+ try_preempt(p, rq); | |
+ task_grq_unlock(&flags); | |
+ | |
+ if (running_wrong) | |
+ __cond_resched(); | |
+ | |
+ return ret; | |
+} | |
+EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr); | |
+ | |
+#ifdef CONFIG_HOTPLUG_CPU | |
+extern struct task_struct *cpu_stopper_task; | |
+/* Run through the task list and find tasks affined to the dead cpu, then | |
+ * remove that cpu from their allowed mask, add cpu0 to it and set the | |
+ * zerobound flag. */ | |
+static void bind_zero(int src_cpu) | |
+{ | |
+ struct task_struct *p, *t, *stopper; | |
+ int bound = 0; | |
+ | |
+ if (src_cpu == 0) | |
+ return; | |
+ | |
+ stopper = per_cpu(cpu_stopper_task, src_cpu); | |
+ do_each_thread(t, p) { | |
+ if (p != stopper && cpu_isset(src_cpu, *tsk_cpus_allowed(p))) { | |
+ cpumask_clear_cpu(src_cpu, tsk_cpus_allowed(p)); | |
+ cpumask_set_cpu(0, tsk_cpus_allowed(p)); | |
+ p->zerobound = true; | |
+ bound++; | |
+ } | |
+ clear_sticky(p); | |
+ } while_each_thread(t, p); | |
+ | |
+ if (bound) { | |
+ printk(KERN_INFO "Removed affinity for %d processes to cpu %d\n", | |
+ bound, src_cpu); | |
+ } | |
+} | |
+ | |
+/* Find processes with the zerobound flag and reenable their affinity for the | |
+ * CPU coming alive. */ | |
+static void unbind_zero(int src_cpu) | |
+{ | |
+ int unbound = 0, zerobound = 0; | |
+ struct task_struct *p, *t; | |
+ | |
+ if (src_cpu == 0) | |
+ return; | |
+ | |
+ do_each_thread(t, p) { | |
+ if (!p->mm) | |
+ p->zerobound = false; | |
+ if (p->zerobound) { | |
+ unbound++; | |
+ cpumask_set_cpu(src_cpu, tsk_cpus_allowed(p)); | |
+ /* Once every CPU affinity has been re-enabled, remove | |
+ * the zerobound flag */ | |
+ if (cpumask_subset(cpu_possible_mask, tsk_cpus_allowed(p))) { | |
+ p->zerobound = false; | |
+ zerobound++; | |
+ } | |
+ } | |
+ } while_each_thread(t, p); | |
+ | |
+ if (unbound) { | |
+ printk(KERN_INFO "Added affinity for %d processes to cpu %d\n", | |
+ unbound, src_cpu); | |
+ } | |
+ if (zerobound) { | |
+ printk(KERN_INFO "Released forced binding to cpu0 for %d processes\n", | |
+ zerobound); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Ensures that the idle task is using init_mm right before its cpu goes | |
+ * offline. | |
+ */ | |
+void idle_task_exit(void) | |
+{ | |
+ struct mm_struct *mm = current->active_mm; | |
+ | |
+ BUG_ON(cpu_online(smp_processor_id())); | |
+ | |
+ if (mm != &init_mm) { | |
+ switch_mm(mm, &init_mm, current); | |
+ finish_arch_post_lock_switch(); | |
+ } | |
+ mmdrop(mm); | |
+} | |
+#else /* CONFIG_HOTPLUG_CPU */ | |
+static void unbind_zero(int src_cpu) {} | |
+#endif /* CONFIG_HOTPLUG_CPU */ | |
+ | |
+void sched_set_stop_task(int cpu, struct task_struct *stop) | |
+{ | |
+ struct sched_param stop_param = { .sched_priority = STOP_PRIO }; | |
+ struct sched_param start_param = { .sched_priority = 0 }; | |
+ struct task_struct *old_stop = cpu_rq(cpu)->stop; | |
+ | |
+ if (stop) { | |
+ /* | |
+ * Make it appear like a SCHED_FIFO task, it's something | |
+ * userspace knows about and won't get confused about. | |
+ * | |
+ * Also, it will make PI more or less work without too | |
+ * much confusion -- but then, stop work should not | |
+ * rely on PI working anyway. | |
+ */ | |
+ sched_setscheduler_nocheck(stop, SCHED_FIFO, &stop_param); | |
+ } | |
+ | |
+ cpu_rq(cpu)->stop = stop; | |
+ | |
+ if (old_stop) { | |
+ /* | |
+ * Reset it back to a normal scheduling policy so that | |
+ * it can die in pieces. | |
+ */ | |
+ sched_setscheduler_nocheck(old_stop, SCHED_NORMAL, &start_param); | |
+ } | |
+} | |
+ | |
+ | |
+#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL) | |
+ | |
+static struct ctl_table sd_ctl_dir[] = { | |
+ { | |
+ .procname = "sched_domain", | |
+ .mode = 0555, | |
+ }, | |
+ {} | |
+}; | |
+ | |
+static struct ctl_table sd_ctl_root[] = { | |
+ { | |
+ .procname = "kernel", | |
+ .mode = 0555, | |
+ .child = sd_ctl_dir, | |
+ }, | |
+ {} | |
+}; | |
+ | |
+static struct ctl_table *sd_alloc_ctl_entry(int n) | |
+{ | |
+ struct ctl_table *entry = | |
+ kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL); | |
+ | |
+ return entry; | |
+} | |
+ | |
+static void sd_free_ctl_entry(struct ctl_table **tablep) | |
+{ | |
+ struct ctl_table *entry; | |
+ | |
+ /* | |
+ * In the intermediate directories, both the child directory and | |
+ * procname are dynamically allocated and could fail but the mode | |
+ * will always be set. In the lowest directory the names are | |
+ * static strings and all have proc handlers. | |
+ */ | |
+ for (entry = *tablep; entry->mode; entry++) { | |
+ if (entry->child) | |
+ sd_free_ctl_entry(&entry->child); | |
+ if (entry->proc_handler == NULL) | |
+ kfree(entry->procname); | |
+ } | |
+ | |
+ kfree(*tablep); | |
+ *tablep = NULL; | |
+} | |
+ | |
+static void | |
+set_table_entry(struct ctl_table *entry, | |
+ const char *procname, void *data, int maxlen, | |
+ mode_t mode, proc_handler *proc_handler) | |
+{ | |
+ entry->procname = procname; | |
+ entry->data = data; | |
+ entry->maxlen = maxlen; | |
+ entry->mode = mode; | |
+ entry->proc_handler = proc_handler; | |
+} | |
+ | |
+static struct ctl_table * | |
+sd_alloc_ctl_domain_table(struct sched_domain *sd) | |
+{ | |
+ struct ctl_table *table = sd_alloc_ctl_entry(14); | |
+ | |
+ if (table == NULL) | |
+ return NULL; | |
+ | |
+ set_table_entry(&table[0], "min_interval", &sd->min_interval, | |
+ sizeof(long), 0644, proc_doulongvec_minmax); | |
+ set_table_entry(&table[1], "max_interval", &sd->max_interval, | |
+ sizeof(long), 0644, proc_doulongvec_minmax); | |
+ set_table_entry(&table[2], "busy_idx", &sd->busy_idx, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[3], "idle_idx", &sd->idle_idx, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[5], "wake_idx", &sd->wake_idx, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[7], "busy_factor", &sd->busy_factor, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[9], "cache_nice_tries", | |
+ &sd->cache_nice_tries, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[10], "flags", &sd->flags, | |
+ sizeof(int), 0644, proc_dointvec_minmax); | |
+ set_table_entry(&table[11], "max_newidle_lb_cost", | |
+ &sd->max_newidle_lb_cost, | |
+ sizeof(long), 0644, proc_doulongvec_minmax); | |
+ set_table_entry(&table[12], "name", sd->name, | |
+ CORENAME_MAX_SIZE, 0444, proc_dostring); | |
+ /* &table[13] is terminator */ | |
+ | |
+ return table; | |
+} | |
+ | |
+static struct ctl_table *sd_alloc_ctl_cpu_table(int cpu) | |
+{ | |
+ struct ctl_table *entry, *table; | |
+ struct sched_domain *sd; | |
+ int domain_num = 0, i; | |
+ char buf[32]; | |
+ | |
+ for_each_domain(cpu, sd) | |
+ domain_num++; | |
+ entry = table = sd_alloc_ctl_entry(domain_num + 1); | |
+ if (table == NULL) | |
+ return NULL; | |
+ | |
+ i = 0; | |
+ for_each_domain(cpu, sd) { | |
+ snprintf(buf, 32, "domain%d", i); | |
+ entry->procname = kstrdup(buf, GFP_KERNEL); | |
+ entry->mode = 0555; | |
+ entry->child = sd_alloc_ctl_domain_table(sd); | |
+ entry++; | |
+ i++; | |
+ } | |
+ return table; | |
+} | |
+ | |
+static struct ctl_table_header *sd_sysctl_header; | |
+static void register_sched_domain_sysctl(void) | |
+{ | |
+ int i, cpu_num = num_possible_cpus(); | |
+ struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); | |
+ char buf[32]; | |
+ | |
+ WARN_ON(sd_ctl_dir[0].child); | |
+ sd_ctl_dir[0].child = entry; | |
+ | |
+ if (entry == NULL) | |
+ return; | |
+ | |
+ for_each_possible_cpu(i) { | |
+ snprintf(buf, 32, "cpu%d", i); | |
+ entry->procname = kstrdup(buf, GFP_KERNEL); | |
+ entry->mode = 0555; | |
+ entry->child = sd_alloc_ctl_cpu_table(i); | |
+ entry++; | |
+ } | |
+ | |
+ WARN_ON(sd_sysctl_header); | |
+ sd_sysctl_header = register_sysctl_table(sd_ctl_root); | |
+} | |
+ | |
+/* may be called multiple times per register */ | |
+static void unregister_sched_domain_sysctl(void) | |
+{ | |
+ if (sd_sysctl_header) | |
+ unregister_sysctl_table(sd_sysctl_header); | |
+ sd_sysctl_header = NULL; | |
+ if (sd_ctl_dir[0].child) | |
+ sd_free_ctl_entry(&sd_ctl_dir[0].child); | |
+} | |
+#else | |
+static void register_sched_domain_sysctl(void) | |
+{ | |
+} | |
+static void unregister_sched_domain_sysctl(void) | |
+{ | |
+} | |
+#endif | |
+ | |
+static void set_rq_online(struct rq *rq) | |
+{ | |
+ if (!rq->online) { | |
+ cpumask_set_cpu(cpu_of(rq), rq->rd->online); | |
+ rq->online = true; | |
+ } | |
+} | |
+ | |
+static void set_rq_offline(struct rq *rq) | |
+{ | |
+ if (rq->online) { | |
+ cpumask_clear_cpu(cpu_of(rq), rq->rd->online); | |
+ rq->online = false; | |
+ } | |
+} | |
+ | |
+/* | |
+ * migration_call - callback that gets triggered when a CPU is added. | |
+ */ | |
+static int | |
+migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu) | |
+{ | |
+ int cpu = (long)hcpu; | |
+ unsigned long flags; | |
+ struct rq *rq = cpu_rq(cpu); | |
+#ifdef CONFIG_HOTPLUG_CPU | |
+ struct task_struct *idle = rq->idle; | |
+#endif | |
+ | |
+ switch (action & ~CPU_TASKS_FROZEN) { | |
+ case CPU_STARTING: | |
+ return NOTIFY_OK; | |
+ case CPU_UP_PREPARE: | |
+ break; | |
+ | |
+ case CPU_ONLINE: | |
+ /* Update our root-domain */ | |
+ grq_lock_irqsave(&flags); | |
+ if (rq->rd) { | |
+ BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); | |
+ | |
+ set_rq_online(rq); | |
+ } | |
+ unbind_zero(cpu); | |
+ grq.noc = num_online_cpus(); | |
+ grq_unlock_irqrestore(&flags); | |
+ break; | |
+ | |
+#ifdef CONFIG_HOTPLUG_CPU | |
+ case CPU_DEAD: | |
+ grq_lock_irq(); | |
+ set_rq_task(rq, idle); | |
+ update_clocks(rq); | |
+ grq_unlock_irq(); | |
+ break; | |
+ | |
+ case CPU_DYING: | |
+ /* Update our root-domain */ | |
+ grq_lock_irqsave(&flags); | |
+ if (rq->rd) { | |
+ BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span)); | |
+ set_rq_offline(rq); | |
+ } | |
+ bind_zero(cpu); | |
+ grq.noc = num_online_cpus(); | |
+ grq_unlock_irqrestore(&flags); | |
+ break; | |
+#endif | |
+ } | |
+ return NOTIFY_OK; | |
+} | |
+ | |
+/* | |
+ * Register at high priority so that task migration (migrate_all_tasks) | |
+ * happens before everything else. This has to be lower priority than | |
+ * the notifier in the perf_counter subsystem, though. | |
+ */ | |
+static struct notifier_block migration_notifier = { | |
+ .notifier_call = migration_call, | |
+ .priority = CPU_PRI_MIGRATION, | |
+}; | |
+ | |
+static int sched_cpu_active(struct notifier_block *nfb, | |
+ unsigned long action, void *hcpu) | |
+{ | |
+ switch (action & ~CPU_TASKS_FROZEN) { | |
+ case CPU_DOWN_FAILED: | |
+ set_cpu_active((long)hcpu, true); | |
+ return NOTIFY_OK; | |
+ default: | |
+ return NOTIFY_DONE; | |
+ } | |
+} | |
+ | |
+static int sched_cpu_inactive(struct notifier_block *nfb, | |
+ unsigned long action, void *hcpu) | |
+{ | |
+ switch (action & ~CPU_TASKS_FROZEN) { | |
+ case CPU_DOWN_PREPARE: | |
+ set_cpu_active((long)hcpu, false); | |
+ return NOTIFY_OK; | |
+ default: | |
+ return NOTIFY_DONE; | |
+ } | |
+} | |
+ | |
+int __init migration_init(void) | |
+{ | |
+ void *cpu = (void *)(long)smp_processor_id(); | |
+ int err; | |
+ | |
+ /* Initialise migration for the boot CPU */ | |
+ err = migration_call(&migration_notifier, CPU_UP_PREPARE, cpu); | |
+ BUG_ON(err == NOTIFY_BAD); | |
+ migration_call(&migration_notifier, CPU_ONLINE, cpu); | |
+ register_cpu_notifier(&migration_notifier); | |
+ | |
+ /* Register cpu active notifiers */ | |
+ cpu_notifier(sched_cpu_active, CPU_PRI_SCHED_ACTIVE); | |
+ cpu_notifier(sched_cpu_inactive, CPU_PRI_SCHED_INACTIVE); | |
+ | |
+ return 0; | |
+} | |
+early_initcall(migration_init); | |
+#endif | |
+ | |
+#ifdef CONFIG_SMP | |
+ | |
+static cpumask_var_t sched_domains_tmpmask; /* sched_domains_mutex */ | |
+ | |
+#ifdef CONFIG_SCHED_DEBUG | |
+ | |
+static __read_mostly int sched_debug_enabled; | |
+ | |
+static int __init sched_debug_setup(char *str) | |
+{ | |
+ sched_debug_enabled = 1; | |
+ | |
+ return 0; | |
+} | |
+early_param("sched_debug", sched_debug_setup); | |
+ | |
+static inline bool sched_debug(void) | |
+{ | |
+ return sched_debug_enabled; | |
+} | |
+ | |
+static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, | |
+ struct cpumask *groupmask) | |
+{ | |
+ char str[256]; | |
+ | |
+ cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd)); | |
+ cpumask_clear(groupmask); | |
+ | |
+ printk(KERN_DEBUG "%*s domain %d: ", level, "", level); | |
+ | |
+ if (!(sd->flags & SD_LOAD_BALANCE)) { | |
+ printk("does not load-balance\n"); | |
+ if (sd->parent) | |
+ printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain" | |
+ " has parent"); | |
+ return -1; | |
+ } | |
+ | |
+ printk(KERN_CONT "span %s level %s\n", str, sd->name); | |
+ | |
+ if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) { | |
+ printk(KERN_ERR "ERROR: domain->span does not contain " | |
+ "CPU%d\n", cpu); | |
+ } | |
+ | |
+ printk(KERN_CONT "\n"); | |
+ | |
+ if (!cpumask_equal(sched_domain_span(sd), groupmask)) | |
+ printk(KERN_ERR "ERROR: groups don't span domain->span\n"); | |
+ | |
+ if (sd->parent && | |
+ !cpumask_subset(groupmask, sched_domain_span(sd->parent))) | |
+ printk(KERN_ERR "ERROR: parent span is not a superset " | |
+ "of domain->span\n"); | |
+ return 0; | |
+} | |
+ | |
+static void sched_domain_debug(struct sched_domain *sd, int cpu) | |
+{ | |
+ int level = 0; | |
+ | |
+ if (!sched_debug_enabled) | |
+ return; | |
+ | |
+ if (!sd) { | |
+ printk(KERN_DEBUG "CPU%d attaching NULL sched-domain.\n", cpu); | |
+ return; | |
+ } | |
+ | |
+ printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu); | |
+ | |
+ for (;;) { | |
+ if (sched_domain_debug_one(sd, cpu, level, sched_domains_tmpmask)) | |
+ break; | |
+ level++; | |
+ sd = sd->parent; | |
+ if (!sd) | |
+ break; | |
+ } | |
+} | |
+#else /* !CONFIG_SCHED_DEBUG */ | |
+# define sched_domain_debug(sd, cpu) do { } while (0) | |
+static inline bool sched_debug(void) | |
+{ | |
+ return false; | |
+} | |
+#endif /* CONFIG_SCHED_DEBUG */ | |
+ | |
+static int sd_degenerate(struct sched_domain *sd) | |
+{ | |
+ if (cpumask_weight(sched_domain_span(sd)) == 1) | |
+ return 1; | |
+ | |
+ /* Following flags don't use groups */ | |
+ if (sd->flags & (SD_WAKE_AFFINE)) | |
+ return 0; | |
+ | |
+ return 1; | |
+} | |
+ | |
+static int | |
+sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent) | |
+{ | |
+ unsigned long cflags = sd->flags, pflags = parent->flags; | |
+ | |
+ if (sd_degenerate(parent)) | |
+ return 1; | |
+ | |
+ if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent))) | |
+ return 0; | |
+ | |
+ if (~cflags & pflags) | |
+ return 0; | |
+ | |
+ return 1; | |
+} | |
+ | |
+static void free_rootdomain(struct rcu_head *rcu) | |
+{ | |
+ struct root_domain *rd = container_of(rcu, struct root_domain, rcu); | |
+ | |
+ cpupri_cleanup(&rd->cpupri); | |
+ free_cpumask_var(rd->rto_mask); | |
+ free_cpumask_var(rd->online); | |
+ free_cpumask_var(rd->span); | |
+ kfree(rd); | |
+} | |
+ | |
+static void rq_attach_root(struct rq *rq, struct root_domain *rd) | |
+{ | |
+ struct root_domain *old_rd = NULL; | |
+ unsigned long flags; | |
+ | |
+ grq_lock_irqsave(&flags); | |
+ | |
+ if (rq->rd) { | |
+ old_rd = rq->rd; | |
+ | |
+ if (cpumask_test_cpu(rq->cpu, old_rd->online)) | |
+ set_rq_offline(rq); | |
+ | |
+ cpumask_clear_cpu(rq->cpu, old_rd->span); | |
+ | |
+ /* | |
+ * If we don't want to free the old_rd yet then | |
+ * set old_rd to NULL to skip the freeing later | |
+ * in this function: | |
+ */ | |
+ if (!atomic_dec_and_test(&old_rd->refcount)) | |
+ old_rd = NULL; | |
+ } | |
+ | |
+ atomic_inc(&rd->refcount); | |
+ rq->rd = rd; | |
+ | |
+ cpumask_set_cpu(rq->cpu, rd->span); | |
+ if (cpumask_test_cpu(rq->cpu, cpu_active_mask)) | |
+ set_rq_online(rq); | |
+ | |
+ grq_unlock_irqrestore(&flags); | |
+ | |
+ if (old_rd) | |
+ call_rcu_sched(&old_rd->rcu, free_rootdomain); | |
+} | |
+ | |
+static int init_rootdomain(struct root_domain *rd) | |
+{ | |
+ memset(rd, 0, sizeof(*rd)); | |
+ | |
+ if (!alloc_cpumask_var(&rd->span, GFP_KERNEL)) | |
+ goto out; | |
+ if (!alloc_cpumask_var(&rd->online, GFP_KERNEL)) | |
+ goto free_span; | |
+ if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL)) | |
+ goto free_online; | |
+ | |
+ if (cpupri_init(&rd->cpupri) != 0) | |
+ goto free_rto_mask; | |
+ return 0; | |
+ | |
+free_rto_mask: | |
+ free_cpumask_var(rd->rto_mask); | |
+free_online: | |
+ free_cpumask_var(rd->online); | |
+free_span: | |
+ free_cpumask_var(rd->span); | |
+out: | |
+ return -ENOMEM; | |
+} | |
+ | |
+static void init_defrootdomain(void) | |
+{ | |
+ init_rootdomain(&def_root_domain); | |
+ | |
+ atomic_set(&def_root_domain.refcount, 1); | |
+} | |
+ | |
+static struct root_domain *alloc_rootdomain(void) | |
+{ | |
+ struct root_domain *rd; | |
+ | |
+ rd = kmalloc(sizeof(*rd), GFP_KERNEL); | |
+ if (!rd) | |
+ return NULL; | |
+ | |
+ if (init_rootdomain(rd) != 0) { | |
+ kfree(rd); | |
+ return NULL; | |
+ } | |
+ | |
+ return rd; | |
+} | |
+ | |
+static void free_sched_domain(struct rcu_head *rcu) | |
+{ | |
+ struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu); | |
+ | |
+ kfree(sd); | |
+} | |
+ | |
+static void destroy_sched_domain(struct sched_domain *sd, int cpu) | |
+{ | |
+ call_rcu(&sd->rcu, free_sched_domain); | |
+} | |
+ | |
+static void destroy_sched_domains(struct sched_domain *sd, int cpu) | |
+{ | |
+ for (; sd; sd = sd->parent) | |
+ destroy_sched_domain(sd, cpu); | |
+} | |
+ | |
+/* | |
+ * Attach the domain 'sd' to 'cpu' as its base domain. Callers must | |
+ * hold the hotplug lock. | |
+ */ | |
+static void | |
+cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) | |
+{ | |
+ struct rq *rq = cpu_rq(cpu); | |
+ struct sched_domain *tmp; | |
+ | |
+ /* Remove the sched domains which do not contribute to scheduling. */ | |
+ for (tmp = sd; tmp; ) { | |
+ struct sched_domain *parent = tmp->parent; | |
+ if (!parent) | |
+ break; | |
+ | |
+ if (sd_parent_degenerate(tmp, parent)) { | |
+ tmp->parent = parent->parent; | |
+ if (parent->parent) | |
+ parent->parent->child = tmp; | |
+ /* | |
+ * Transfer SD_PREFER_SIBLING down in case of a | |
+ * degenerate parent; the spans match for this | |
+ * so the property transfers. | |
+ */ | |
+ if (parent->flags & SD_PREFER_SIBLING) | |
+ tmp->flags |= SD_PREFER_SIBLING; | |
+ destroy_sched_domain(parent, cpu); | |
+ } else | |
+ tmp = tmp->parent; | |
+ } | |
+ | |
+ if (sd && sd_degenerate(sd)) { | |
+ tmp = sd; | |
+ sd = sd->parent; | |
+ destroy_sched_domain(tmp, cpu); | |
+ if (sd) | |
+ sd->child = NULL; | |
+ } | |
+ | |
+ sched_domain_debug(sd, cpu); | |
+ | |
+ rq_attach_root(rq, rd); | |
+ tmp = rq->sd; | |
+ rcu_assign_pointer(rq->sd, sd); | |
+ destroy_sched_domains(tmp, cpu); | |
+} | |
+ | |
+/* cpus with isolated domains */ | |
+static cpumask_var_t cpu_isolated_map; | |
+ | |
+/* Setup the mask of cpus configured for isolated domains */ | |
+static int __init isolated_cpu_setup(char *str) | |
+{ | |
+ alloc_bootmem_cpumask_var(&cpu_isolated_map); | |
+ cpulist_parse(str, cpu_isolated_map); | |
+ return 1; | |
+} | |
+ | |
+__setup("isolcpus=", isolated_cpu_setup); | |
+ | |
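+/* | |
+ * Illustrative note (editorial, not from this patch): booting with e.g. | |
+ * "isolcpus=2-3" keeps cpus 2 and 3 out of the scheduler domains; tasks | |
+ * normally have to be placed there explicitly with sched_setaffinity. | |
+ */ | |
+ | |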
+struct s_data { | |
+ struct sched_domain ** __percpu sd; | |
+ struct root_domain *rd; | |
+}; | |
+ | |
+enum s_alloc { | |
+ sa_rootdomain, | |
+ sa_sd, | |
+ sa_sd_storage, | |
+ sa_none, | |
+}; | |
+ | |
+/* | |
+ * Initializers for schedule domains | |
+ * Non-inlined to reduce accumulated stack pressure in build_sched_domains() | |
+ */ | |
+ | |
+static int default_relax_domain_level = -1; | |
+int sched_domain_level_max; | |
+ | |
+static int __init setup_relax_domain_level(char *str) | |
+{ | |
+ if (kstrtoint(str, 0, &default_relax_domain_level)) | |
+ pr_warn("Unable to set relax_domain_level\n"); | |
+ | |
+ return 1; | |
+} | |
+__setup("relax_domain_level=", setup_relax_domain_level); | |
+ | |
+static void set_domain_attribute(struct sched_domain *sd, | |
+ struct sched_domain_attr *attr) | |
+{ | |
+ int request; | |
+ | |
+ if (!attr || attr->relax_domain_level < 0) { | |
+ if (default_relax_domain_level < 0) | |
+ return; | |
+ else | |
+ request = default_relax_domain_level; | |
+ } else | |
+ request = attr->relax_domain_level; | |
+ if (request < sd->level) { | |
+ /* turn off idle balance on this domain */ | |
+ sd->flags &= ~(SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE); | |
+ } else { | |
+ /* turn on idle balance on this domain */ | |
+ sd->flags |= (SD_BALANCE_WAKE|SD_BALANCE_NEWIDLE); | |
+ } | |
+} | |
+ | |
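+/* | |
+ * Illustrative note (editorial, not from this patch): the level can be | |
+ * requested at boot with e.g. "relax_domain_level=2"; domains above the | |
+ * requested level get SD_BALANCE_WAKE/SD_BALANCE_NEWIDLE cleared, the | |
+ * rest get them set. | |
+ */ | |
+ | |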
+static void __sdt_free(const struct cpumask *cpu_map); | |
+static int __sdt_alloc(const struct cpumask *cpu_map); | |
+ | |
+static void __free_domain_allocs(struct s_data *d, enum s_alloc what, | |
+ const struct cpumask *cpu_map) | |
+{ | |
+ switch (what) { | |
+ case sa_rootdomain: | |
+ if (!atomic_read(&d->rd->refcount)) | |
+ free_rootdomain(&d->rd->rcu); /* fall through */ | |
+ case sa_sd: | |
+ free_percpu(d->sd); /* fall through */ | |
+ case sa_sd_storage: | |
+ __sdt_free(cpu_map); /* fall through */ | |
+ case sa_none: | |
+ break; | |
+ } | |
+} | |
+ | |
+static enum s_alloc __visit_domain_allocation_hell(struct s_data *d, | |
+ const struct cpumask *cpu_map) | |
+{ | |
+ memset(d, 0, sizeof(*d)); | |
+ | |
+ if (__sdt_alloc(cpu_map)) | |
+ return sa_sd_storage; | |
+ d->sd = alloc_percpu(struct sched_domain *); | |
+ if (!d->sd) | |
+ return sa_sd_storage; | |
+ d->rd = alloc_rootdomain(); | |
+ if (!d->rd) | |
+ return sa_sd; | |
+ return sa_rootdomain; | |
+} | |
+ | |
+/* | |
+ * NULL the sd_data elements we've used to build the sched_domain | |
+ * structure so that the subsequent __free_domain_allocs() | |
+ * will not free the data we're using. | |
+ */ | |
+static void claim_allocations(int cpu, struct sched_domain *sd) | |
+{ | |
+ struct sd_data *sdd = sd->private; | |
+ | |
+ WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd); | |
+ *per_cpu_ptr(sdd->sd, cpu) = NULL; | |
+} | |
+ | |
+#ifdef CONFIG_NUMA | |
+static int sched_domains_numa_levels; | |
+static int *sched_domains_numa_distance; | |
+static struct cpumask ***sched_domains_numa_masks; | |
+static int sched_domains_curr_level; | |
+#endif | |
+ | |
+/* | |
+ * SD_flags allowed in topology descriptions. | |
+ * | |
+ * SD_SHARE_CPUCAPACITY - describes SMT topologies | |
+ * SD_SHARE_PKG_RESOURCES - describes shared caches | |
+ * SD_NUMA - describes NUMA topologies | |
+ * SD_SHARE_POWERDOMAIN - describes shared power domain | |
+ * | |
+ * Odd one out: | |
+ * SD_ASYM_PACKING - describes SMT quirks | |
+ */ | |
+#define TOPOLOGY_SD_FLAGS \ | |
+ (SD_SHARE_CPUCAPACITY | \ | |
+ SD_SHARE_PKG_RESOURCES | \ | |
+ SD_NUMA | \ | |
+ SD_ASYM_PACKING | \ | |
+ SD_SHARE_POWERDOMAIN) | |
+ | |
+static struct sched_domain * | |
+sd_init(struct sched_domain_topology_level *tl, int cpu) | |
+{ | |
+ struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu); | |
+ int sd_weight, sd_flags = 0; | |
+ | |
+#ifdef CONFIG_NUMA | |
+ /* | |
+ * Ugly hack to pass state to sd_numa_mask()... | |
+ */ | |
+ sched_domains_curr_level = tl->numa_level; | |
+#endif | |
+ | |
+ sd_weight = cpumask_weight(tl->mask(cpu)); | |
+ | |
+ if (tl->sd_flags) | |
+ sd_flags = (*tl->sd_flags)(); | |
+ if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS, | |
+ "wrong sd_flags in topology description\n")) | |
+ sd_flags &= ~TOPOLOGY_SD_FLAGS; | |
+ | |
+ *sd = (struct sched_domain){ | |
+ .min_interval = sd_weight, | |
+ .max_interval = 2*sd_weight, | |
+ .busy_factor = 32, | |
+ .imbalance_pct = 125, | |
+ | |
+ .cache_nice_tries = 0, | |
+ .busy_idx = 0, | |
+ .idle_idx = 0, | |
+ .newidle_idx = 0, | |
+ .wake_idx = 0, | |
+ .forkexec_idx = 0, | |
+ | |
+ .flags = 1*SD_LOAD_BALANCE | |
+ | 1*SD_BALANCE_NEWIDLE | |
+ | 1*SD_BALANCE_EXEC | |
+ | 1*SD_BALANCE_FORK | |
+ | 0*SD_BALANCE_WAKE | |
+ | 1*SD_WAKE_AFFINE | |
+ | 0*SD_SHARE_CPUCAPACITY | |
+ | 0*SD_SHARE_PKG_RESOURCES | |
+ | 0*SD_SERIALIZE | |
+ | 0*SD_PREFER_SIBLING | |
+ | 0*SD_NUMA | |
+ | sd_flags | |
+ , | |
+ | |
+ .last_balance = jiffies, | |
+ .balance_interval = sd_weight, | |
+ .smt_gain = 0, | |
+ .max_newidle_lb_cost = 0, | |
+ .next_decay_max_lb_cost = jiffies, | |
+#ifdef CONFIG_SCHED_DEBUG | |
+ .name = tl->name, | |
+#endif | |
+ }; | |
+ | |
+ /* | |
+ * Convert topological properties into behaviour. | |
+ */ | |
+ | |
+ if (sd->flags & SD_SHARE_CPUCAPACITY) { | |
+ sd->imbalance_pct = 110; | |
+ sd->smt_gain = 1178; /* ~15% */ | |
+ | |
+ } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { | |
+ sd->imbalance_pct = 117; | |
+ sd->cache_nice_tries = 1; | |
+ sd->busy_idx = 2; | |
+ | |
+#ifdef CONFIG_NUMA | |
+ } else if (sd->flags & SD_NUMA) { | |
+ sd->cache_nice_tries = 2; | |
+ sd->busy_idx = 3; | |
+ sd->idle_idx = 2; | |
+ | |
+ sd->flags |= SD_SERIALIZE; | |
+ if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) { | |
+ sd->flags &= ~(SD_BALANCE_EXEC | | |
+ SD_BALANCE_FORK | | |
+ SD_WAKE_AFFINE); | |
+ } | |
+ | |
+#endif | |
+ } else { | |
+ sd->flags |= SD_PREFER_SIBLING; | |
+ sd->cache_nice_tries = 1; | |
+ sd->busy_idx = 2; | |
+ sd->idle_idx = 1; | |
+ } | |
+ | |
+ sd->private = &tl->data; | |
+ | |
+ return sd; | |
+} | |
+ | |
+/* | |
+ * Topology list, bottom-up. | |
+ */ | |
+static struct sched_domain_topology_level default_topology[] = { | |
+#ifdef CONFIG_SCHED_SMT | |
+ { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, | |
+#endif | |
+#ifdef CONFIG_SCHED_MC | |
+ { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, | |
+#endif | |
+ { cpu_cpu_mask, SD_INIT_NAME(DIE) }, | |
+ { NULL, }, | |
+}; | |
+ | |
+struct sched_domain_topology_level *sched_domain_topology = default_topology; | |
+ | |
+#define for_each_sd_topology(tl) \ | |
+ for (tl = sched_domain_topology; tl->mask; tl++) | |
+ | |
+void set_sched_topology(struct sched_domain_topology_level *tl) | |
+{ | |
+ sched_domain_topology = tl; | |
+} | |
+ | |
+#ifdef CONFIG_NUMA | |
+ | |
+static const struct cpumask *sd_numa_mask(int cpu) | |
+{ | |
+ return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)]; | |
+} | |
+ | |
+static void sched_numa_warn(const char *str) | |
+{ | |
+ static int done = false; | |
+ int i,j; | |
+ | |
+ if (done) | |
+ return; | |
+ | |
+ done = true; | |
+ | |
+ printk(KERN_WARNING "ERROR: %s\n\n", str); | |
+ | |
+ for (i = 0; i < nr_node_ids; i++) { | |
+ printk(KERN_WARNING " "); | |
+ for (j = 0; j < nr_node_ids; j++) | |
+ printk(KERN_CONT "%02d ", node_distance(i,j)); | |
+ printk(KERN_CONT "\n"); | |
+ } | |
+ printk(KERN_WARNING "\n"); | |
+} | |
+ | |
+static bool find_numa_distance(int distance) | |
+{ | |
+ int i; | |
+ | |
+ if (distance == node_distance(0, 0)) | |
+ return true; | |
+ | |
+ for (i = 0; i < sched_domains_numa_levels; i++) { | |
+ if (sched_domains_numa_distance[i] == distance) | |
+ return true; | |
+ } | |
+ | |
+ return false; | |
+} | |
+ | |
+static void sched_init_numa(void) | |
+{ | |
+ int next_distance, curr_distance = node_distance(0, 0); | |
+ struct sched_domain_topology_level *tl; | |
+ int level = 0; | |
+ int i, j, k; | |
+ | |
+ sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL); | |
+ if (!sched_domains_numa_distance) | |
+ return; | |
+ | |
+ /* | |
+ * O(nr_nodes^2) deduplicating selection sort -- in order to find the | |
+ * unique distances in the node_distance() table. | |
+ * | |
+ * Assumes node_distance(0,j) includes all distances in | |
+ * node_distance(i,j) in order to avoid cubic time. | |
+ */ | |
+ next_distance = curr_distance; | |
+ for (i = 0; i < nr_node_ids; i++) { | |
+ for (j = 0; j < nr_node_ids; j++) { | |
+ for (k = 0; k < nr_node_ids; k++) { | |
+ int distance = node_distance(i, k); | |
+ | |
+ if (distance > curr_distance && | |
+ (distance < next_distance || | |
+ next_distance == curr_distance)) | |
+ next_distance = distance; | |
+ | |
+ /* | |
+			 * While not a strong assumption, it would be nice to know | 
+			 * about cases where node A is connected to B but B is not | 
+			 * equally connected to A. | 
+ */ | |
+ if (sched_debug() && node_distance(k, i) != distance) | |
+ sched_numa_warn("Node-distance not symmetric"); | |
+ | |
+ if (sched_debug() && i && !find_numa_distance(distance)) | |
+ sched_numa_warn("Node-0 not representative"); | |
+ } | |
+ if (next_distance != curr_distance) { | |
+ sched_domains_numa_distance[level++] = next_distance; | |
+ sched_domains_numa_levels = level; | |
+ curr_distance = next_distance; | |
+ } else break; | |
+ } | |
+ | |
+ /* | |
+ * In case of sched_debug() we verify the above assumption. | |
+ */ | |
+ if (!sched_debug()) | |
+ break; | |
+ } | |
+ /* | |
+ * 'level' contains the number of unique distances, excluding the | |
+ * identity distance node_distance(i,i). | |
+ * | |
+ * The sched_domains_numa_distance[] array includes the actual distance | |
+ * numbers. | |
+ */ | |
+ | |
+ /* | |
+ * Here, we should temporarily reset sched_domains_numa_levels to 0. | |
+ * If it fails to allocate memory for array sched_domains_numa_masks[][], | |
+	 * the array will contain fewer than 'level' members. This could be | 
+	 * dangerous when we use it to iterate over the array sched_domains_numa_masks[][] | 
+ * in other functions. | |
+ * | |
+ * We reset it to 'level' at the end of this function. | |
+ */ | |
+ sched_domains_numa_levels = 0; | |
+ | |
+ sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL); | |
+ if (!sched_domains_numa_masks) | |
+ return; | |
+ | |
+ /* | |
+ * Now for each level, construct a mask per node which contains all | |
+ * cpus of nodes that are that many hops away from us. | |
+ */ | |
+ for (i = 0; i < level; i++) { | |
+ sched_domains_numa_masks[i] = | |
+ kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL); | |
+ if (!sched_domains_numa_masks[i]) | |
+ return; | |
+ | |
+ for (j = 0; j < nr_node_ids; j++) { | |
+ struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL); | |
+ if (!mask) | |
+ return; | |
+ | |
+ sched_domains_numa_masks[i][j] = mask; | |
+ | |
+ for (k = 0; k < nr_node_ids; k++) { | |
+ if (node_distance(j, k) > sched_domains_numa_distance[i]) | |
+ continue; | |
+ | |
+ cpumask_or(mask, mask, cpumask_of_node(k)); | |
+ } | |
+ } | |
+ } | |
+ | |
+ /* Compute default topology size */ | |
+ for (i = 0; sched_domain_topology[i].mask; i++); | |
+ | |
+ tl = kzalloc((i + level + 1) * | |
+ sizeof(struct sched_domain_topology_level), GFP_KERNEL); | |
+ if (!tl) | |
+ return; | |
+ | |
+ /* | |
+ * Copy the default topology bits.. | |
+ */ | |
+ for (i = 0; sched_domain_topology[i].mask; i++) | |
+ tl[i] = sched_domain_topology[i]; | |
+ | |
+ /* | |
+ * .. and append 'j' levels of NUMA goodness. | |
+ */ | |
+ for (j = 0; j < level; i++, j++) { | |
+ tl[i] = (struct sched_domain_topology_level){ | |
+ .mask = sd_numa_mask, | |
+ .sd_flags = cpu_numa_flags, | |
+ .flags = SDTL_OVERLAP, | |
+ .numa_level = j, | |
+ SD_INIT_NAME(NUMA) | |
+ }; | |
+ } | |
+ | |
+ sched_domain_topology = tl; | |
+ | |
+ sched_domains_numa_levels = level; | |
+} | |
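
A minimal user-space sketch of the unique-distance extraction that sched_init_numa() performs above (illustration only, not part of the patch; the kernel loop additionally keeps the list sorted in ascending order, while this sketch only deduplicates). The 4-node distance table is an assumption chosen for the example:

#include <stdio.h>

/* Hypothetical 4-node table standing in for node_distance(). */
static const int dist[4][4] = {
	{ 10, 20, 30, 30 },
	{ 20, 10, 30, 30 },
	{ 30, 30, 10, 20 },
	{ 30, 30, 20, 10 },
};

int main(void)
{
	int uniq[16], levels = 0;
	int i, j, k, found;

	/* Record every distance larger than the local one exactly once. */
	for (i = 0; i < 4; i++) {
		for (j = 0; j < 4; j++) {
			int d = dist[i][j];

			if (d == dist[0][0])	/* skip the identity distance */
				continue;
			found = 0;
			for (k = 0; k < levels; k++)
				if (uniq[k] == d)
					found = 1;
			if (!found)
				uniq[levels++] = d;
		}
	}

	for (i = 0; i < levels; i++)
		printf("NUMA level %d: distance %d\n", i, uniq[i]);
	/* Prints two levels here: distance 20 and distance 30. */
	return 0;
}

Each unique distance found this way becomes one extra NUMA topology level appended to the default topology later in sched_init_numa().
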
+ | |
+static void sched_domains_numa_masks_set(int cpu) | |
+{ | |
+ int i, j; | |
+ int node = cpu_to_node(cpu); | |
+ | |
+ for (i = 0; i < sched_domains_numa_levels; i++) { | |
+ for (j = 0; j < nr_node_ids; j++) { | |
+ if (node_distance(j, node) <= sched_domains_numa_distance[i]) | |
+ cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]); | |
+ } | |
+ } | |
+} | |
+ | |
+static void sched_domains_numa_masks_clear(int cpu) | |
+{ | |
+ int i, j; | |
+ for (i = 0; i < sched_domains_numa_levels; i++) { | |
+ for (j = 0; j < nr_node_ids; j++) | |
+ cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Update sched_domains_numa_masks[level][node] array when new cpus | |
+ * are onlined. | |
+ */ | |
+static int sched_domains_numa_masks_update(struct notifier_block *nfb, | |
+ unsigned long action, | |
+ void *hcpu) | |
+{ | |
+ int cpu = (long)hcpu; | |
+ | |
+ switch (action & ~CPU_TASKS_FROZEN) { | |
+ case CPU_ONLINE: | |
+ sched_domains_numa_masks_set(cpu); | |
+ break; | |
+ | |
+ case CPU_DEAD: | |
+ sched_domains_numa_masks_clear(cpu); | |
+ break; | |
+ | |
+ default: | |
+ return NOTIFY_DONE; | |
+ } | |
+ | |
+ return NOTIFY_OK; | |
+} | |
+#else | |
+static inline void sched_init_numa(void) | |
+{ | |
+} | |
+ | |
+static int sched_domains_numa_masks_update(struct notifier_block *nfb, | |
+ unsigned long action, | |
+ void *hcpu) | |
+{ | |
+ return 0; | |
+} | |
+#endif /* CONFIG_NUMA */ | |
+ | |
+static int __sdt_alloc(const struct cpumask *cpu_map) | |
+{ | |
+ struct sched_domain_topology_level *tl; | |
+ int j; | |
+ | |
+ for_each_sd_topology(tl) { | |
+ struct sd_data *sdd = &tl->data; | |
+ | |
+ sdd->sd = alloc_percpu(struct sched_domain *); | |
+ if (!sdd->sd) | |
+ return -ENOMEM; | |
+ | |
+ for_each_cpu(j, cpu_map) { | |
+ struct sched_domain *sd; | |
+ | |
+ sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(), | |
+ GFP_KERNEL, cpu_to_node(j)); | |
+ if (!sd) | |
+ return -ENOMEM; | |
+ | |
+ *per_cpu_ptr(sdd->sd, j) = sd; | |
+ } | |
+ } | |
+ | |
+ return 0; | |
+} | |
+ | |
+static void __sdt_free(const struct cpumask *cpu_map) | |
+{ | |
+ struct sched_domain_topology_level *tl; | |
+ int j; | |
+ | |
+ for_each_sd_topology(tl) { | |
+ struct sd_data *sdd = &tl->data; | |
+ | |
+ for_each_cpu(j, cpu_map) { | |
+ struct sched_domain *sd; | |
+ | |
+ if (sdd->sd) { | |
+ sd = *per_cpu_ptr(sdd->sd, j); | |
+ kfree(*per_cpu_ptr(sdd->sd, j)); | |
+ } | |
+ } | |
+ free_percpu(sdd->sd); | |
+ sdd->sd = NULL; | |
+ } | |
+} | |
+ | |
+struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, | |
+ const struct cpumask *cpu_map, struct sched_domain_attr *attr, | |
+ struct sched_domain *child, int cpu) | |
+{ | |
+ struct sched_domain *sd = sd_init(tl, cpu); | |
+ if (!sd) | |
+ return child; | |
+ | |
+ cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); | |
+ if (child) { | |
+ sd->level = child->level + 1; | |
+ sched_domain_level_max = max(sched_domain_level_max, sd->level); | |
+ child->parent = sd; | |
+ sd->child = child; | |
+ | |
+ if (!cpumask_subset(sched_domain_span(child), | |
+ sched_domain_span(sd))) { | |
+ pr_err("BUG: arch topology borken\n"); | |
+#ifdef CONFIG_SCHED_DEBUG | |
+ pr_err(" the %s domain not a subset of the %s domain\n", | |
+ child->name, sd->name); | |
+#endif | |
+ /* Fixup, ensure @sd has at least @child cpus. */ | |
+ cpumask_or(sched_domain_span(sd), | |
+ sched_domain_span(sd), | |
+ sched_domain_span(child)); | |
+ } | |
+ | |
+ } | |
+ set_domain_attribute(sd, attr); | |
+ | |
+ return sd; | |
+} | |
+ | |
+/* | |
+ * Build sched domains for a given set of cpus and attach the sched domains | |
+ * to the individual cpus | |
+ */ | |
+static int build_sched_domains(const struct cpumask *cpu_map, | |
+ struct sched_domain_attr *attr) | |
+{ | |
+ enum s_alloc alloc_state; | |
+ struct sched_domain *sd; | |
+ struct s_data d; | |
+ int i, ret = -ENOMEM; | |
+ | |
+ alloc_state = __visit_domain_allocation_hell(&d, cpu_map); | |
+ if (alloc_state != sa_rootdomain) | |
+ goto error; | |
+ | |
+ /* Set up domains for cpus specified by the cpu_map. */ | |
+ for_each_cpu(i, cpu_map) { | |
+ struct sched_domain_topology_level *tl; | |
+ | |
+ sd = NULL; | |
+ for_each_sd_topology(tl) { | |
+ sd = build_sched_domain(tl, cpu_map, attr, sd, i); | |
+ if (tl == sched_domain_topology) | |
+ *per_cpu_ptr(d.sd, i) = sd; | |
+ if (tl->flags & SDTL_OVERLAP) | |
+ sd->flags |= SD_OVERLAP; | |
+ if (cpumask_equal(cpu_map, sched_domain_span(sd))) | |
+ break; | |
+ } | |
+ } | |
+ | |
+ /* Calculate CPU capacity for physical packages and nodes */ | |
+ for (i = nr_cpumask_bits-1; i >= 0; i--) { | |
+ if (!cpumask_test_cpu(i, cpu_map)) | |
+ continue; | |
+ | |
+ for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { | |
+ claim_allocations(i, sd); | |
+ } | |
+ } | |
+ | |
+ /* Attach the domains */ | |
+ rcu_read_lock(); | |
+ for_each_cpu(i, cpu_map) { | |
+ sd = *per_cpu_ptr(d.sd, i); | |
+ cpu_attach_domain(sd, d.rd, i); | |
+ } | |
+ rcu_read_unlock(); | |
+ | |
+ ret = 0; | |
+error: | |
+ __free_domain_allocs(&d, alloc_state, cpu_map); | |
+ return ret; | |
+} | |
+ | |
+static cpumask_var_t *doms_cur; /* current sched domains */ | |
+static int ndoms_cur; /* number of sched domains in 'doms_cur' */ | |
+static struct sched_domain_attr *dattr_cur; | |
+				/* attributes of custom domains in 'doms_cur' */ | 
+ | |
+/* | |
+ * Special case: If a kmalloc of a doms_cur partition (array of | |
+ * cpumask) fails, then fallback to a single sched domain, | |
+ * as determined by the single cpumask fallback_doms. | |
+ */ | |
+static cpumask_var_t fallback_doms; | |
+ | |
+/* | |
+ * arch_update_cpu_topology lets virtualized architectures update the | |
+ * cpu core maps. It is supposed to return 1 if the topology changed | |
+ * or 0 if it stayed the same. | |
+ */ | |
+int __weak arch_update_cpu_topology(void) | |
+{ | |
+ return 0; | |
+} | |
+ | |
+cpumask_var_t *alloc_sched_domains(unsigned int ndoms) | |
+{ | |
+ int i; | |
+ cpumask_var_t *doms; | |
+ | |
+ doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL); | |
+ if (!doms) | |
+ return NULL; | |
+ for (i = 0; i < ndoms; i++) { | |
+ if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) { | |
+ free_sched_domains(doms, i); | |
+ return NULL; | |
+ } | |
+ } | |
+ return doms; | |
+} | |
+ | |
+void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms) | |
+{ | |
+ unsigned int i; | |
+ for (i = 0; i < ndoms; i++) | |
+ free_cpumask_var(doms[i]); | |
+ kfree(doms); | |
+} | |
+ | |
+/* | |
+ * Set up scheduler domains and groups. Callers must hold the hotplug lock. | |
+ * For now this just excludes isolated cpus, but could be used to | |
+ * exclude other special cases in the future. | |
+ */ | |
+static int init_sched_domains(const struct cpumask *cpu_map) | |
+{ | |
+ int err; | |
+ | |
+ arch_update_cpu_topology(); | |
+ ndoms_cur = 1; | |
+ doms_cur = alloc_sched_domains(ndoms_cur); | |
+ if (!doms_cur) | |
+ doms_cur = &fallback_doms; | |
+ cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map); | |
+ err = build_sched_domains(doms_cur[0], NULL); | |
+ register_sched_domain_sysctl(); | |
+ | |
+ return err; | |
+} | |
+ | |
+/* | |
+ * Detach sched domains from a group of cpus specified in cpu_map | |
+ * These cpus will now be attached to the NULL domain | |
+ */ | |
+static void detach_destroy_domains(const struct cpumask *cpu_map) | |
+{ | |
+ int i; | |
+ | |
+ rcu_read_lock(); | |
+ for_each_cpu(i, cpu_map) | |
+ cpu_attach_domain(NULL, &def_root_domain, i); | |
+ rcu_read_unlock(); | |
+} | |
+ | |
+/* handle null as "default" */ | |
+static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur, | |
+ struct sched_domain_attr *new, int idx_new) | |
+{ | |
+ struct sched_domain_attr tmp; | |
+ | |
+ /* fast path */ | |
+ if (!new && !cur) | |
+ return 1; | |
+ | |
+ tmp = SD_ATTR_INIT; | |
+ return !memcmp(cur ? (cur + idx_cur) : &tmp, | |
+ new ? (new + idx_new) : &tmp, | |
+ sizeof(struct sched_domain_attr)); | |
+} | |
+ | |
+/* | |
+ * Partition sched domains as specified by the 'ndoms_new' | |
+ * cpumasks in the array doms_new[] of cpumasks. This compares | |
+ * doms_new[] to the current sched domain partitioning, doms_cur[]. | |
+ * It destroys each deleted domain and builds each new domain. | |
+ * | |
+ * 'doms_new' is an array of cpumask_var_t's of length 'ndoms_new'. | |
+ * The masks don't intersect (don't overlap). We should set up one | 
+ * sched domain for each mask. CPUs not in any of the cpumasks will | |
+ * not be load balanced. If the same cpumask appears both in the | |
+ * current 'doms_cur' domains and in the new 'doms_new', we can leave | |
+ * it as it is. | |
+ * | |
+ * The passed in 'doms_new' should be allocated using | |
+ * alloc_sched_domains. This routine takes ownership of it and will | |
+ * free_sched_domains it when done with it. If the caller failed the | |
+ * alloc call, then it can pass in doms_new == NULL && ndoms_new == 1, | |
+ * and partition_sched_domains() will fallback to the single partition | |
+ * 'fallback_doms'; this also forces the domains to be rebuilt. | 
+ * | |
+ * If doms_new == NULL it will be replaced with cpu_online_mask. | |
+ * ndoms_new == 0 is a special case for destroying existing domains, | |
+ * and it will not create the default domain. | |
+ * | |
+ * Call with hotplug lock held | |
+ */ | |
+void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[], | |
+ struct sched_domain_attr *dattr_new) | |
+{ | |
+ int i, j, n; | |
+ int new_topology; | |
+ | |
+ mutex_lock(&sched_domains_mutex); | |
+ | |
+ /* always unregister in case we don't destroy any domains */ | |
+ unregister_sched_domain_sysctl(); | |
+ | |
+ /* Let architecture update cpu core mappings. */ | |
+ new_topology = arch_update_cpu_topology(); | |
+ | |
+ n = doms_new ? ndoms_new : 0; | |
+ | |
+ /* Destroy deleted domains */ | |
+ for (i = 0; i < ndoms_cur; i++) { | |
+ for (j = 0; j < n && !new_topology; j++) { | |
+ if (cpumask_equal(doms_cur[i], doms_new[j]) | |
+ && dattrs_equal(dattr_cur, i, dattr_new, j)) | |
+ goto match1; | |
+ } | |
+ /* no match - a current sched domain not in new doms_new[] */ | |
+ detach_destroy_domains(doms_cur[i]); | |
+match1: | |
+ ; | |
+ } | |
+ | |
+ n = ndoms_cur; | |
+ if (doms_new == NULL) { | |
+ n = 0; | |
+ doms_new = &fallback_doms; | |
+ cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map); | |
+ WARN_ON_ONCE(dattr_new); | |
+ } | |
+ | |
+ /* Build new domains */ | |
+ for (i = 0; i < ndoms_new; i++) { | |
+ for (j = 0; j < n && !new_topology; j++) { | |
+ if (cpumask_equal(doms_new[i], doms_cur[j]) | |
+ && dattrs_equal(dattr_new, i, dattr_cur, j)) | |
+ goto match2; | |
+ } | |
+ /* no match - add a new doms_new */ | |
+ build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL); | |
+match2: | |
+ ; | |
+ } | |
+ | |
+ /* Remember the new sched domains */ | |
+ if (doms_cur != &fallback_doms) | |
+ free_sched_domains(doms_cur, ndoms_cur); | |
+ kfree(dattr_cur); /* kfree(NULL) is safe */ | |
+ doms_cur = doms_new; | |
+ dattr_cur = dattr_new; | |
+ ndoms_cur = ndoms_new; | |
+ | |
+ register_sched_domain_sysctl(); | |
+ | |
+ mutex_unlock(&sched_domains_mutex); | |
+} | |
+ | |
+static int num_cpus_frozen; /* used to mark begin/end of suspend/resume */ | |
+ | |
+/* | |
+ * Update cpusets according to cpu_active mask. If cpusets are | |
+ * disabled, cpuset_update_active_cpus() becomes a simple wrapper | |
+ * around partition_sched_domains(). | |
+ * | |
+ * If we come here as part of a suspend/resume, don't touch cpusets because we | |
+ * want to restore it back to its original state upon resume anyway. | |
+ */ | |
+static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action, | |
+ void *hcpu) | |
+{ | |
+ switch (action) { | |
+ case CPU_ONLINE_FROZEN: | |
+ case CPU_DOWN_FAILED_FROZEN: | |
+ | |
+ /* | |
+ * num_cpus_frozen tracks how many CPUs are involved in suspend | |
+ * resume sequence. As long as this is not the last online | |
+ * operation in the resume sequence, just build a single sched | |
+ * domain, ignoring cpusets. | |
+ */ | |
+ num_cpus_frozen--; | |
+ if (likely(num_cpus_frozen)) { | |
+ partition_sched_domains(1, NULL, NULL); | |
+ break; | |
+ } | |
+ | |
+ /* | |
+ * This is the last CPU online operation. So fall through and | |
+ * restore the original sched domains by considering the | |
+ * cpuset configurations. | |
+ */ | |
+ | |
+ case CPU_ONLINE: | |
+ case CPU_DOWN_FAILED: | |
+ cpuset_update_active_cpus(true); | |
+ break; | |
+ default: | |
+ return NOTIFY_DONE; | |
+ } | |
+ return NOTIFY_OK; | |
+} | |
+ | |
+static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action, | |
+ void *hcpu) | |
+{ | |
+ switch (action) { | |
+ case CPU_DOWN_PREPARE: | |
+ cpuset_update_active_cpus(false); | |
+ break; | |
+ case CPU_DOWN_PREPARE_FROZEN: | |
+ num_cpus_frozen++; | |
+ partition_sched_domains(1, NULL, NULL); | |
+ break; | |
+ default: | |
+ return NOTIFY_DONE; | |
+ } | |
+ return NOTIFY_OK; | |
+} | |
+ | |
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC) | |
+/* | |
+ * Cheaper version of the below functions in case support for SMT and MC is | |
+ * compiled in but CPUs have no siblings. | |
+ */ | |
+static bool sole_cpu_idle(int cpu) | |
+{ | |
+ return rq_idle(cpu_rq(cpu)); | |
+} | |
+#endif | |
+#ifdef CONFIG_SCHED_SMT | |
+static const cpumask_t *thread_cpumask(int cpu) | |
+{ | |
+ return topology_thread_cpumask(cpu); | |
+} | |
+/* All this CPU's SMT siblings are idle */ | |
+static bool siblings_cpu_idle(int cpu) | |
+{ | |
+ return cpumask_subset(thread_cpumask(cpu), &grq.cpu_idle_map); | |
+} | |
+#endif | |
+#ifdef CONFIG_SCHED_MC | |
+static const cpumask_t *core_cpumask(int cpu) | |
+{ | |
+ return topology_core_cpumask(cpu); | |
+} | |
+/* All this CPU's shared cache siblings are idle */ | |
+static bool cache_cpu_idle(int cpu) | |
+{ | |
+ return cpumask_subset(core_cpumask(cpu), &grq.cpu_idle_map); | |
+} | |
+#endif | |
+ | |
+enum sched_domain_level { | |
+ SD_LV_NONE = 0, | |
+ SD_LV_SIBLING, | |
+ SD_LV_MC, | |
+ SD_LV_BOOK, | |
+ SD_LV_CPU, | |
+ SD_LV_NODE, | |
+ SD_LV_ALLNODES, | |
+ SD_LV_MAX | |
+}; | |
+ | |
+void __init sched_init_smp(void) | |
+{ | |
+ struct sched_domain *sd; | |
+ int cpu, other_cpu; | |
+ | |
+ cpumask_var_t non_isolated_cpus; | |
+ | |
+ alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL); | |
+ alloc_cpumask_var(&fallback_doms, GFP_KERNEL); | |
+ | |
+ sched_init_numa(); | |
+ | |
+ /* | |
+ * There's no userspace yet to cause hotplug operations; hence all the | |
+ * cpu masks are stable and all blatant races in the below code cannot | |
+ * happen. | |
+ */ | |
+ mutex_lock(&sched_domains_mutex); | |
+ init_sched_domains(cpu_active_mask); | |
+ cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map); | |
+ if (cpumask_empty(non_isolated_cpus)) | |
+ cpumask_set_cpu(smp_processor_id(), non_isolated_cpus); | |
+ mutex_unlock(&sched_domains_mutex); | |
+ | |
+ hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE); | |
+ hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE); | |
+ hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE); | |
+ | |
+ /* Move init over to a non-isolated CPU */ | |
+ if (set_cpus_allowed_ptr(current, non_isolated_cpus) < 0) | |
+ BUG(); | |
+ free_cpumask_var(non_isolated_cpus); | |
+ | |
+ grq_lock_irq(); | |
+ /* | |
+ * Set up the relative cache distance of each online cpu from each | |
+ * other in a simple array for quick lookup. Locality is determined | |
+ * by the closest sched_domain that CPUs are separated by. CPUs with | |
+ * shared cache in SMT and MC are treated as local. Separate CPUs | |
+ * (within the same package or physically) within the same node are | |
+ * treated as not local. CPUs not even in the same domain (different | |
+ * nodes) are treated as very distant. | |
+ */ | |
+ for_each_online_cpu(cpu) { | |
+ struct rq *rq = cpu_rq(cpu); | |
+ | |
+ /* First check if this cpu is in the same node */ | |
+ for_each_domain(cpu, sd) { | |
+ if (sd->level > SD_LV_NODE) | |
+ continue; | |
+ /* Set locality to local node if not already found lower */ | |
+ for_each_cpu_mask(other_cpu, *sched_domain_span(sd)) { | |
+ if (rq->cpu_locality[other_cpu] > 3) | |
+ rq->cpu_locality[other_cpu] = 3; | |
+ } | |
+ } | |
+ | |
+ /* | |
+		 * Each runqueue has its own idle-check function in case it doesn't | 
+		 * have siblings of its own, which allows mixed topologies. | 
+ */ | |
+#ifdef CONFIG_SCHED_MC | |
+ for_each_cpu_mask(other_cpu, *core_cpumask(cpu)) { | |
+ if (rq->cpu_locality[other_cpu] > 2) | |
+ rq->cpu_locality[other_cpu] = 2; | |
+ } | |
+ if (cpus_weight(*core_cpumask(cpu)) > 1) | |
+ rq->cache_idle = cache_cpu_idle; | |
+#endif | |
+#ifdef CONFIG_SCHED_SMT | |
+ for_each_cpu_mask(other_cpu, *thread_cpumask(cpu)) | |
+ rq->cpu_locality[other_cpu] = 1; | |
+ if (cpus_weight(*thread_cpumask(cpu)) > 1) | |
+ rq->siblings_idle = siblings_cpu_idle; | |
+#endif | |
+ } | |
+ grq_unlock_irq(); | |
+ | |
+ for_each_online_cpu(cpu) { | |
+ struct rq *rq = cpu_rq(cpu); | |
+ for_each_online_cpu(other_cpu) { | |
+ if (other_cpu <= cpu) | |
+ continue; | |
+ printk(KERN_WARNING "LOCALITY CPU %d to %d: %d\n", cpu, other_cpu, rq->cpu_locality[other_cpu]); | |
+ } | |
+ } | |
+} | |
+#else | |
+void __init sched_init_smp(void) | |
+{ | |
+} | |
+#endif /* CONFIG_SMP */ | |
+ | |
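
The locality values assigned in sched_init_smp() above form a small fixed scale per runqueue. A short sketch of that scale (the enum names are illustrative assumptions; only the numeric values appear in the patch):

/* Relative cache distance stored in rq->cpu_locality[] by BFS. */
enum bfs_locality {
	BFS_LOCAL_SELF    = 0,	/* the CPU itself */
	BFS_LOCAL_SMT     = 1,	/* SMT/hyperthread sibling */
	BFS_LOCAL_MC      = 2,	/* core sharing cache in the same package */
	BFS_LOCAL_NODE    = 3,	/* separate CPU within the same NUMA node */
	BFS_LOCAL_DISTANT = 4,	/* different NUMA node (initial default) */
};

The table is filled with 4 in sched_init() below and then lowered here as SMT, MC and node-level sched domains are discovered.
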
+unsigned int sysctl_timer_migration = 1; | |
+ | |
+int in_sched_functions(unsigned long addr) | |
+{ | |
+ return in_lock_functions(addr) || | |
+ (addr >= (unsigned long)__sched_text_start | |
+ && addr < (unsigned long)__sched_text_end); | |
+} | |
+ | |
+void __init sched_init(void) | |
+{ | |
+#ifdef CONFIG_SMP | |
+ int cpu_ids; | |
+#endif | |
+ int i; | |
+ struct rq *rq; | |
+ | |
+ prio_ratios[0] = 128; | |
+ for (i = 1 ; i < NICE_WIDTH ; i++) | |
+ prio_ratios[i] = prio_ratios[i - 1] * 11 / 10; | |
+ | |
+ raw_spin_lock_init(&grq.lock); | |
+ grq.nr_running = grq.nr_uninterruptible = grq.nr_switches = 0; | |
+ grq.niffies = 0; | |
+ grq.last_jiffy = jiffies; | |
+ raw_spin_lock_init(&grq.iso_lock); | |
+ grq.iso_ticks = 0; | |
+ grq.iso_refractory = false; | |
+ grq.noc = 1; | |
+#ifdef CONFIG_SMP | |
+ init_defrootdomain(); | |
+ grq.qnr = grq.idle_cpus = 0; | |
+ cpumask_clear(&grq.cpu_idle_map); | |
+#else | |
+ uprq = &per_cpu(runqueues, 0); | |
+#endif | |
+ for_each_possible_cpu(i) { | |
+ rq = cpu_rq(i); | |
+ rq->user_pc = rq->nice_pc = rq->softirq_pc = rq->system_pc = | |
+ rq->iowait_pc = rq->idle_pc = 0; | |
+ rq->dither = false; | |
+#ifdef CONFIG_SMP | |
+ rq->sticky_task = NULL; | |
+ rq->last_niffy = 0; | |
+ rq->sd = NULL; | |
+ rq->rd = NULL; | |
+ rq->online = false; | |
+ rq->cpu = i; | |
+ rq_attach_root(rq, &def_root_domain); | |
+#endif | |
+ atomic_set(&rq->nr_iowait, 0); | |
+ } | |
+ | |
+#ifdef CONFIG_SMP | |
+ cpu_ids = i; | |
+ /* | |
+ * Set the base locality for cpu cache distance calculation to | |
+ * "distant" (3). Make sure the distance from a CPU to itself is 0. | |
+ */ | |
+ for_each_possible_cpu(i) { | |
+ int j; | |
+ | |
+ rq = cpu_rq(i); | |
+#ifdef CONFIG_SCHED_SMT | |
+ rq->siblings_idle = sole_cpu_idle; | |
+#endif | |
+#ifdef CONFIG_SCHED_MC | |
+ rq->cache_idle = sole_cpu_idle; | |
+#endif | |
+ rq->cpu_locality = kmalloc(cpu_ids * sizeof(int *), GFP_ATOMIC); | |
+ for_each_possible_cpu(j) { | |
+ if (i == j) | |
+ rq->cpu_locality[j] = 0; | |
+ else | |
+ rq->cpu_locality[j] = 4; | |
+ } | |
+ } | |
+#endif | |
+ | |
+ for (i = 0; i < PRIO_LIMIT; i++) | |
+ INIT_LIST_HEAD(grq.queue + i); | |
+ /* delimiter for bitsearch */ | |
+ __set_bit(PRIO_LIMIT, grq.prio_bitmap); | |
+ | |
+#ifdef CONFIG_PREEMPT_NOTIFIERS | |
+ INIT_HLIST_HEAD(&init_task.preempt_notifiers); | |
+#endif | |
+ | |
+ /* | |
+ * The boot idle thread does lazy MMU switching as well: | |
+ */ | |
+ atomic_inc(&init_mm.mm_count); | |
+ enter_lazy_tlb(&init_mm, current); | |
+ | |
+ /* | |
+ * Make us the idle thread. Technically, schedule() should not be | |
+ * called from this thread, however somewhere below it might be, | |
+ * but because we are the idle thread, we just pick up running again | |
+ * when this runqueue becomes "idle". | |
+ */ | |
+ init_idle(current, smp_processor_id()); | |
+ | |
+#ifdef CONFIG_SMP | |
+ zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT); | |
+ /* May be allocated at isolcpus cmdline parse time */ | |
+ if (cpu_isolated_map == NULL) | |
+ zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT); | |
+ idle_thread_set_boot_cpu(); | |
+#endif /* SMP */ | |
+} | |
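
The prio_ratios[] table built at the top of sched_init() above is a simple geometric progression: entry 0 is 128 and every following nice level is about 10% larger, giving roughly a 40:1 spread across the 40 levels. A standalone sketch that reproduces the table (NICE_WIDTH = 40 is assumed here, matching the kernel's nice range of -20..19):

#include <stdio.h>

#define NICE_WIDTH 40	/* assumed: number of nice levels (-20..19) */

int main(void)
{
	int prio_ratios[NICE_WIDTH];
	int i;

	prio_ratios[0] = 128;
	for (i = 1; i < NICE_WIDTH; i++)
		prio_ratios[i] = prio_ratios[i - 1] * 11 / 10;

	/* Each entry is ~10% larger than the previous one. */
	printf("prio_ratios[0]  = %d\n", prio_ratios[0]);
	printf("prio_ratios[20] = %d\n", prio_ratios[20]);
	printf("prio_ratios[39] = %d\n", prio_ratios[39]);
	return 0;
}

Elsewhere in the patch these ratios scale per-task virtual deadlines, so a one-step nice change translates into a roughly proportional change in CPU entitlement.
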
+ | |
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP | |
+static inline int preempt_count_equals(int preempt_offset) | |
+{ | |
+ int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth(); | |
+ | |
+ return (nested == preempt_offset); | |
+} | |
+ | |
+void __might_sleep(const char *file, int line, int preempt_offset) | |
+{ | |
+ static unsigned long prev_jiffy; /* ratelimiting */ | |
+ | |
+ rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */ | |
+ if ((preempt_count_equals(preempt_offset) && !irqs_disabled() && | |
+ !is_idle_task(current)) || | |
+ system_state != SYSTEM_RUNNING || oops_in_progress) | |
+ return; | |
+ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) | |
+ return; | |
+ prev_jiffy = jiffies; | |
+ | |
+ printk(KERN_ERR | |
+ "BUG: sleeping function called from invalid context at %s:%d\n", | |
+ file, line); | |
+ printk(KERN_ERR | |
+ "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", | |
+ in_atomic(), irqs_disabled(), | |
+ current->pid, current->comm); | |
+ | |
+ debug_show_held_locks(current); | |
+ if (irqs_disabled()) | |
+ print_irqtrace_events(current); | |
+#ifdef CONFIG_DEBUG_PREEMPT | |
+ if (!preempt_count_equals(preempt_offset)) { | |
+ pr_err("Preemption disabled at:"); | |
+ print_ip_sym(current->preempt_disable_ip); | |
+ pr_cont("\n"); | |
+ } | |
+#endif | |
+ dump_stack(); | |
+} | |
+EXPORT_SYMBOL(__might_sleep); | |
+#endif | |
+ | |
+#ifdef CONFIG_MAGIC_SYSRQ | |
+void normalize_rt_tasks(void) | |
+{ | |
+ struct task_struct *g, *p; | |
+ unsigned long flags; | |
+ struct rq *rq; | |
+ int queued; | |
+ | |
+ read_lock(&tasklist_lock); | |
+ for_each_process_thread(g, p) { | |
+ if (!rt_task(p) && !iso_task(p)) | |
+ continue; | |
+ | |
+ rq = task_grq_lock(p, &flags); | |
+ queued = task_queued(p); | |
+ if (queued) | |
+ dequeue_task(p); | |
+ __setscheduler(p, rq, SCHED_NORMAL, 0); | |
+ if (queued) { | |
+ enqueue_task(p, rq); | |
+ try_preempt(p, rq); | |
+ } | |
+ | |
+ task_grq_unlock(&flags); | |
+ } | |
+ read_unlock(&tasklist_lock); | |
+} | |
+#endif /* CONFIG_MAGIC_SYSRQ */ | |
+ | |
+#if defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) | |
+/* | |
+ * These functions are only useful for the IA64 MCA handling, or kdb. | |
+ * | |
+ * They can only be called when the whole system has been | |
+ * stopped - every CPU needs to be quiescent, and no scheduling | |
+ * activity can take place. Using them for anything else would | |
+ * be a serious bug, and as a result, they aren't even visible | |
+ * under any other configuration. | |
+ */ | |
+ | |
+/** | |
+ * curr_task - return the current task for a given cpu. | |
+ * @cpu: the processor in question. | |
+ * | |
+ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED! | |
+ * | |
+ * Return: The current task for @cpu. | |
+ */ | |
+struct task_struct *curr_task(int cpu) | |
+{ | |
+ return cpu_curr(cpu); | |
+} | |
+ | |
+#endif /* defined(CONFIG_IA64) || defined(CONFIG_KGDB_KDB) */ | |
+ | |
+#ifdef CONFIG_IA64 | |
+/** | |
+ * set_curr_task - set the current task for a given cpu. | |
+ * @cpu: the processor in question. | |
+ * @p: the task pointer to set. | |
+ * | |
+ * Description: This function must only be used when non-maskable interrupts | |
+ * are serviced on a separate stack. It allows the architecture to switch the | |
+ * notion of the current task on a cpu in a non-blocking manner. This function | |
+ * must be called with all CPUs synchronised and interrupts disabled, and the | 
+ * caller must save the original value of the current task (see | 
+ * curr_task() above) and restore that value before reenabling interrupts and | |
+ * re-starting the system. | |
+ * | |
+ * ONLY VALID WHEN THE WHOLE SYSTEM IS STOPPED! | |
+ */ | |
+void set_curr_task(int cpu, struct task_struct *p) | |
+{ | |
+ cpu_curr(cpu) = p; | |
+} | |
+ | |
+#endif | |
+ | |
+/* | |
+ * Use precise platform statistics if available: | |
+ */ | |
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE | |
+void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) | |
+{ | |
+ *ut = p->utime; | |
+ *st = p->stime; | |
+} | |
+ | |
+void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) | |
+{ | |
+ struct task_cputime cputime; | |
+ | |
+ thread_group_cputime(p, &cputime); | |
+ | |
+ *ut = cputime.utime; | |
+ *st = cputime.stime; | |
+} | |
+ | |
+void vtime_account_system_irqsafe(struct task_struct *tsk) | |
+{ | |
+ unsigned long flags; | |
+ | |
+ local_irq_save(flags); | |
+ vtime_account_system(tsk); | |
+ local_irq_restore(flags); | |
+} | |
+EXPORT_SYMBOL_GPL(vtime_account_system_irqsafe); | |
+ | |
+#ifndef __ARCH_HAS_VTIME_TASK_SWITCH | |
+void vtime_task_switch(struct task_struct *prev) | |
+{ | |
+ if (is_idle_task(prev)) | |
+ vtime_account_idle(prev); | |
+ else | |
+ vtime_account_system(prev); | |
+ | |
+ vtime_account_user(prev); | |
+ arch_vtime_task_switch(prev); | |
+} | |
+#endif | |
+ | |
+#else | |
+/* | |
+ * Perform (stime * rtime) / total, but avoid multiplication overflow by | |
+ * losing precision when the numbers are big. | |
+ */ | |
+static cputime_t scale_stime(u64 stime, u64 rtime, u64 total) | |
+{ | |
+ u64 scaled; | |
+ | |
+ for (;;) { | |
+ /* Make sure "rtime" is the bigger of stime/rtime */ | |
+ if (stime > rtime) { | |
+ u64 tmp = rtime; rtime = stime; stime = tmp; | |
+ } | |
+ | |
+ /* Make sure 'total' fits in 32 bits */ | |
+ if (total >> 32) | |
+ goto drop_precision; | |
+ | |
+ /* Does rtime (and thus stime) fit in 32 bits? */ | |
+ if (!(rtime >> 32)) | |
+ break; | |
+ | |
+ /* Can we just balance rtime/stime rather than dropping bits? */ | |
+ if (stime >> 31) | |
+ goto drop_precision; | |
+ | |
+ /* We can grow stime and shrink rtime and try to make them both fit */ | |
+ stime <<= 1; | |
+ rtime >>= 1; | |
+ continue; | |
+ | |
+drop_precision: | |
+ /* We drop from rtime, it has more bits than stime */ | |
+ rtime >>= 1; | |
+ total >>= 1; | |
+ } | |
+ | |
+ /* | |
+ * Make sure gcc understands that this is a 32x32->64 multiply, | |
+ * followed by a 64/32->64 divide. | |
+ */ | |
+ scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total); | |
+ return (__force cputime_t) scaled; | |
+} | |
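
For reference, a user-space port of the scale_stime() loop above (illustration only; fixed-width types replace the kernel's cputime_t and div_u64()). It shows how the operands are rebalanced or truncated until a 32x32->64 multiply followed by a 64/32 divide is safe:

#include <stdio.h>
#include <stdint.h>

/* Compute stime * rtime / total without overflowing 64 bits,
 * trading a little precision when the inputs are very large. */
static uint64_t scale(uint64_t stime, uint64_t rtime, uint64_t total)
{
	for (;;) {
		if (stime > rtime) {		/* keep rtime the larger value */
			uint64_t tmp = rtime;
			rtime = stime;
			stime = tmp;
		}
		if (total >> 32)		/* 'total' must fit in 32 bits */
			goto drop_precision;
		if (!(rtime >> 32))		/* both factors fit: done */
			break;
		if (stime >> 31)		/* cannot rebalance any further */
			goto drop_precision;
		stime <<= 1;			/* grow stime, shrink rtime */
		rtime >>= 1;
		continue;
drop_precision:
		rtime >>= 1;			/* rtime has more spare bits */
		total >>= 1;
	}
	return (uint64_t)(uint32_t)stime * (uint32_t)rtime / (uint32_t)total;
}

int main(void)
{
	/* 3/10 of ten trillion; the exact answer is 3000000000000. */
	printf("%llu\n", (unsigned long long)scale(3, 10000000000000ULL, 10));
	return 0;
}
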
+ | |
+/* | |
+ * Adjust tick based cputime random precision against scheduler | |
+ * runtime accounting. | |
+ */ | |
+static void cputime_adjust(struct task_cputime *curr, | |
+ struct cputime *prev, | |
+ cputime_t *ut, cputime_t *st) | |
+{ | |
+ cputime_t rtime, stime, utime, total; | |
+ | |
+ stime = curr->stime; | |
+ total = stime + curr->utime; | |
+ | |
+ /* | |
+	 * Tick based cputime accounting depends on random scheduling | 
+ * timeslices of a task to be interrupted or not by the timer. | |
+ * Depending on these circumstances, the number of these interrupts | |
+ * may be over or under-optimistic, matching the real user and system | |
+ * cputime with a variable precision. | |
+ * | |
+ * Fix this by scaling these tick based values against the total | |
+ * runtime accounted by the CFS scheduler. | |
+ */ | |
+ rtime = nsecs_to_cputime(curr->sum_exec_runtime); | |
+ | |
+ /* | |
+ * Update userspace visible utime/stime values only if actual execution | |
+	 * time is bigger than already exported. Note that it can happen that we | 
+ * provided bigger values due to scaling inaccuracy on big numbers. | |
+ */ | |
+ if (prev->stime + prev->utime >= rtime) | |
+ goto out; | |
+ | |
+ if (total) { | |
+ stime = scale_stime((__force u64)stime, | |
+ (__force u64)rtime, (__force u64)total); | |
+ utime = rtime - stime; | |
+ } else { | |
+ stime = rtime; | |
+ utime = 0; | |
+ } | |
+ | |
+ /* | |
+ * If the tick based count grows faster than the scheduler one, | |
+ * the result of the scaling may go backward. | |
+ * Let's enforce monotonicity. | |
+ */ | |
+ prev->stime = max(prev->stime, stime); | |
+ prev->utime = max(prev->utime, utime); | |
+ | |
+out: | |
+ *ut = prev->utime; | |
+ *st = prev->stime; | |
+} | |
+ | |
+void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) | |
+{ | |
+ struct task_cputime cputime = { | |
+ .sum_exec_runtime = tsk_seruntime(p), | |
+ }; | |
+ | |
+ task_cputime(p, &cputime.utime, &cputime.stime); | |
+ cputime_adjust(&cputime, &p->prev_cputime, ut, st); | |
+} | |
+ | |
+/* | |
+ * Must be called with siglock held. | |
+ */ | |
+void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st) | |
+{ | |
+ struct task_cputime cputime; | |
+ | |
+ thread_group_cputime(p, &cputime); | |
+ cputime_adjust(&cputime, &p->signal->prev_cputime, ut, st); | |
+} | |
+#endif | |
+ | |
+void init_idle_bootup_task(struct task_struct *idle) | |
+{} | |
+ | |
+#ifdef CONFIG_SCHED_DEBUG | |
+void proc_sched_show_task(struct task_struct *p, struct seq_file *m) | |
+{} | |
+ | |
+void proc_sched_set_task(struct task_struct *p) | |
+{} | |
+#endif | |
+ | |
+#ifdef CONFIG_SMP | |
+#define SCHED_LOAD_SHIFT (10) | |
+#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT) | |
+ | |
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu) | |
+{ | |
+ return SCHED_LOAD_SCALE; | |
+} | |
+ | |
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu) | |
+{ | |
+ unsigned long weight = cpumask_weight(sched_domain_span(sd)); | |
+ unsigned long smt_gain = sd->smt_gain; | |
+ | |
+ smt_gain /= weight; | |
+ | |
+ return smt_gain; | |
+} | |
+#endif | |
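
As a worked example for default_scale_smt_power() above, using the smt_gain value of 1178 that sd_init() assigns to SD_SHARE_CPUCAPACITY domains: a two-way SMT core reports 1178 / 2 = 589 capacity units per hardware thread, so the pair together is worth 1178, about 15% more than the 1024 (SCHED_LOAD_SCALE) of a single non-SMT core. A trivial sketch (the sibling count of 2 is an assumption):

#include <stdio.h>

int main(void)
{
	unsigned long smt_gain = 1178;	/* value set in sd_init() */
	unsigned long siblings = 2;	/* assumed 2-way SMT */

	printf("per-thread capacity: %lu\n", smt_gain / siblings);	/* 589 */
	printf("whole core: %lu (vs. 1024 for one non-SMT core)\n",
	       smt_gain / siblings * siblings);				/* 1178 */
	return 0;
}
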
diff --git a/kernel/sched/bfs_sched.h b/kernel/sched/bfs_sched.h | |
new file mode 100644 | |
index 0000000..198108e | |
--- /dev/null | |
+++ b/kernel/sched/bfs_sched.h | |
@@ -0,0 +1,161 @@ | |
+#include <linux/sched.h> | |
+#include <linux/cpuidle.h> | |
+ | |
+#ifndef BFS_SCHED_H | |
+#define BFS_SCHED_H | |
+ | |
+/* | |
+ * This is the main, per-CPU runqueue data structure. | |
+ * This data should only be modified by the local cpu. | |
+ */ | |
+struct rq { | |
+ struct task_struct *curr, *idle, *stop; | |
+ struct mm_struct *prev_mm; | |
+ | |
+ /* Stored data about rq->curr to work outside grq lock */ | |
+ u64 rq_deadline; | |
+ unsigned int rq_policy; | |
+ int rq_time_slice; | |
+ u64 rq_last_ran; | |
+ int rq_prio; | |
+ bool rq_running; /* There is a task running */ | |
+ int soft_affined; /* Running or queued tasks with this set as their rq */ | |
+#ifdef CONFIG_SMT_NICE | |
+ int rq_smt_bias; /* Policy/nice level bias across smt siblings */ | |
+#endif | |
+ /* Accurate timekeeping data */ | |
+ u64 timekeep_clock; | |
+ unsigned long user_pc, nice_pc, irq_pc, softirq_pc, system_pc, | |
+ iowait_pc, idle_pc; | |
+ atomic_t nr_iowait; | |
+ | |
+#ifdef CONFIG_SMP | |
+ int cpu; /* cpu of this runqueue */ | |
+ bool online; | |
+ bool scaling; /* This CPU is managed by a scaling CPU freq governor */ | |
+ struct task_struct *sticky_task; | |
+ | |
+ struct root_domain *rd; | |
+ struct sched_domain *sd; | |
+ int *cpu_locality; /* CPU relative cache distance */ | |
+#ifdef CONFIG_SCHED_SMT | |
+ bool (*siblings_idle)(int cpu); | |
+ /* See if all smt siblings are idle */ | |
+#endif /* CONFIG_SCHED_SMT */ | |
+#ifdef CONFIG_SCHED_MC | |
+ bool (*cache_idle)(int cpu); | |
+ /* See if all cache siblings are idle */ | |
+#endif /* CONFIG_SCHED_MC */ | |
+ u64 last_niffy; /* Last time this RQ updated grq.niffies */ | |
+#endif /* CONFIG_SMP */ | |
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING | |
+ u64 prev_irq_time; | |
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ | |
+#ifdef CONFIG_PARAVIRT | |
+ u64 prev_steal_time; | |
+#endif /* CONFIG_PARAVIRT */ | |
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING | |
+ u64 prev_steal_time_rq; | |
+#endif /* CONFIG_PARAVIRT_TIME_ACCOUNTING */ | |
+ | |
+ u64 clock, old_clock, last_tick; | |
+ u64 clock_task; | |
+ bool dither; | |
+ | |
+#ifdef CONFIG_SCHEDSTATS | |
+ | |
+ /* latency stats */ | |
+ struct sched_info rq_sched_info; | |
+ unsigned long long rq_cpu_time; | |
+ /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */ | |
+ | |
+ /* sys_sched_yield() stats */ | |
+ unsigned int yld_count; | |
+ | |
+ /* schedule() stats */ | |
+ unsigned int sched_switch; | |
+ unsigned int sched_count; | |
+ unsigned int sched_goidle; | |
+ | |
+ /* try_to_wake_up() stats */ | |
+ unsigned int ttwu_count; | |
+ unsigned int ttwu_local; | |
+#endif /* CONFIG_SCHEDSTATS */ | |
+#ifdef CONFIG_CPU_IDLE | |
+	/* Must be inspected within an RCU lock section */ | 
+ struct cpuidle_state *idle_state; | |
+#endif | |
+}; | |
+ | |
+#ifdef CONFIG_SMP | |
+struct rq *cpu_rq(int cpu); | |
+#endif | |
+ | |
+#ifndef CONFIG_SMP | |
+static struct rq *uprq; | |
+#define cpu_rq(cpu) (uprq) | |
+#define this_rq() (uprq) | |
+#define raw_rq() (uprq) | |
+#define task_rq(p) (uprq) | |
+#define cpu_curr(cpu) ((uprq)->curr) | |
+#else /* CONFIG_SMP */ | |
+DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); | |
+#define this_rq() this_cpu_ptr(&runqueues) | |
+#define raw_rq() raw_cpu_ptr(&runqueues) | |
+#endif /* CONFIG_SMP */ | |
+ | |
+static inline u64 rq_clock(struct rq *rq) | |
+{ | |
+ return rq->clock; | |
+} | |
+ | |
+static inline u64 rq_clock_task(struct rq *rq) | |
+{ | |
+ return rq->clock_task; | |
+} | |
+ | |
+#define rcu_dereference_check_sched_domain(p) \ | |
+ rcu_dereference_check((p), \ | |
+ lockdep_is_held(&sched_domains_mutex)) | |
+ | |
+/* | |
+ * The domain tree (rq->sd) is protected by RCU's quiescent state transition. | |
+ * See detach_destroy_domains: synchronize_sched for details. | |
+ * | |
+ * The domain tree of any CPU may only be accessed from within | |
+ * preempt-disabled sections. | |
+ */ | |
+#define for_each_domain(cpu, __sd) \ | |
+ for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent) | |
+ | |
+static inline void sched_ttwu_pending(void) { } | |
+ | |
+static inline int task_on_rq_queued(struct task_struct *p) | |
+{ | |
+ return p->on_rq; | |
+} | |
+ | |
+#ifdef CONFIG_CPU_IDLE | |
+static inline void idle_set_state(struct rq *rq, | |
+ struct cpuidle_state *idle_state) | |
+{ | |
+ rq->idle_state = idle_state; | |
+} | |
+ | |
+static inline struct cpuidle_state *idle_get_state(struct rq *rq) | |
+{ | |
+ WARN_ON(!rcu_read_lock_held()); | |
+ return rq->idle_state; | |
+} | |
+#else | |
+static inline void idle_set_state(struct rq *rq, | |
+ struct cpuidle_state *idle_state) | |
+{ | |
+} | |
+ | |
+static inline struct cpuidle_state *idle_get_state(struct rq *rq) | |
+{ | |
+ return NULL; | |
+} | |
+#endif | |
+#endif /* BFS_SCHED_H */ | |
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c | |
index c47fce7..a9ef9f1 100644 | |
--- a/kernel/sched/idle.c | |
+++ b/kernel/sched/idle.c | |
@@ -12,7 +12,11 @@ | |
#include <trace/events/power.h> | |
+#ifdef CONFIG_SCHED_BFS | |
+#include "bfs_sched.h" | |
+#else | |
#include "sched.h" | |
+#endif | |
static int __read_mostly cpu_idle_force_poll; | |
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c | |
index a476bea..73d8cdd 100644 | |
--- a/kernel/sched/stats.c | |
+++ b/kernel/sched/stats.c | |
@@ -4,7 +4,11 @@ | |
#include <linux/seq_file.h> | |
#include <linux/proc_fs.h> | |
+#ifndef CONFIG_SCHED_BFS | |
#include "sched.h" | |
+#else | |
+#include "bfs_sched.h" | |
+#endif | |
/* | |
* bump this up when changing the output format or the meaning of an existing | |
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c | |
index 695f0c6..263b0e1 100644 | |
--- a/kernel/stop_machine.c | |
+++ b/kernel/stop_machine.c | |
@@ -41,7 +41,8 @@ struct cpu_stopper { | |
}; | |
static DEFINE_PER_CPU(struct cpu_stopper, cpu_stopper); | |
-static DEFINE_PER_CPU(struct task_struct *, cpu_stopper_task); | |
+DEFINE_PER_CPU(struct task_struct *, cpu_stopper_task); | |
+ | |
static bool stop_machine_initialized = false; | |
/* | |
diff --git a/kernel/sysctl.c b/kernel/sysctl.c | |
index 15f2511..7cdee7e 100644 | |
--- a/kernel/sysctl.c | |
+++ b/kernel/sysctl.c | |
@@ -125,7 +125,12 @@ static int __maybe_unused one = 1; | |
static int __maybe_unused two = 2; | |
static int __maybe_unused four = 4; | |
static unsigned long one_ul = 1; | |
-static int one_hundred = 100; | |
+static int __maybe_unused one_hundred = 100; | |
+#ifdef CONFIG_SCHED_BFS | |
+extern int rr_interval; | |
+extern int sched_iso_cpu; | |
+static int __read_mostly one_thousand = 1000; | |
+#endif | |
#ifdef CONFIG_PRINTK | |
static int ten_thousand = 10000; | |
#endif | |
@@ -260,7 +265,7 @@ static struct ctl_table sysctl_base_table[] = { | |
{ } | |
}; | |
-#ifdef CONFIG_SCHED_DEBUG | |
+#if defined(CONFIG_SCHED_DEBUG) && !defined(CONFIG_SCHED_BFS) | |
static int min_sched_granularity_ns = 100000; /* 100 usecs */ | |
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */ | |
static int min_wakeup_granularity_ns; /* 0 usecs */ | |
@@ -277,6 +282,7 @@ static int max_extfrag_threshold = 1000; | |
#endif | |
static struct ctl_table kern_table[] = { | |
+#ifndef CONFIG_SCHED_BFS | |
{ | |
.procname = "sched_child_runs_first", | |
.data = &sysctl_sched_child_runs_first, | |
@@ -443,6 +449,7 @@ static struct ctl_table kern_table[] = { | |
.extra1 = &one, | |
}, | |
#endif | |
+#endif /* !CONFIG_SCHED_BFS */ | |
#ifdef CONFIG_PROVE_LOCKING | |
{ | |
.procname = "prove_locking", | |
@@ -953,6 +960,26 @@ static struct ctl_table kern_table[] = { | |
.proc_handler = proc_dointvec, | |
}, | |
#endif | |
+#ifdef CONFIG_SCHED_BFS | |
+ { | |
+ .procname = "rr_interval", | |
+ .data = &rr_interval, | |
+ .maxlen = sizeof (int), | |
+ .mode = 0644, | |
+ .proc_handler = &proc_dointvec_minmax, | |
+ .extra1 = &one, | |
+ .extra2 = &one_thousand, | |
+ }, | |
+ { | |
+ .procname = "iso_cpu", | |
+ .data = &sched_iso_cpu, | |
+ .maxlen = sizeof (int), | |
+ .mode = 0644, | |
+ .proc_handler = &proc_dointvec_minmax, | |
+ .extra1 = &zero, | |
+ .extra2 = &one_hundred, | |
+ }, | |
+#endif | |
#if defined(CONFIG_S390) && defined(CONFIG_SMP) | |
{ | |
.procname = "spin_retry", | |
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig | |
index d626dc9..205a1a9 100644 | |
--- a/kernel/time/Kconfig | |
+++ b/kernel/time/Kconfig | |
@@ -95,7 +95,7 @@ config NO_HZ_IDLE | |
config NO_HZ_FULL | |
bool "Full dynticks system (tickless)" | |
# NO_HZ_COMMON dependency | |
- depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS | |
+ depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS && !SCHED_BFS | |
# We need at least one periodic CPU for timekeeping | |
depends on SMP | |
# RCU_USER_QS dependency | |
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c | |
index a16b678..b7fa15e 100644 | |
--- a/kernel/time/posix-cpu-timers.c | |
+++ b/kernel/time/posix-cpu-timers.c | |
@@ -425,7 +425,7 @@ static void cleanup_timers(struct list_head *head) | |
*/ | |
void posix_cpu_timers_exit(struct task_struct *tsk) | |
{ | |
- add_device_randomness((const void*) &tsk->se.sum_exec_runtime, | |
+ add_device_randomness((const void*) &tsk_seruntime(tsk), | |
sizeof(unsigned long long)); | |
cleanup_timers(tsk->cpu_timers); | |
@@ -847,7 +847,7 @@ static void check_thread_timers(struct task_struct *tsk, | |
tsk_expires->virt_exp = expires_to_cputime(expires); | |
tsk_expires->sched_exp = check_timers_list(++timers, firing, | |
- tsk->se.sum_exec_runtime); | |
+ tsk_seruntime(tsk)); | |
/* | |
* Check for the special case thread timers. | |
@@ -858,7 +858,7 @@ static void check_thread_timers(struct task_struct *tsk, | |
ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_max); | |
if (hard != RLIM_INFINITY && | |
- tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) { | |
+ tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) { | |
/* | |
* At the hard limit, we just die. | |
* No need to calculate anything else now. | |
@@ -866,7 +866,7 @@ static void check_thread_timers(struct task_struct *tsk, | |
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk); | |
return; | |
} | |
- if (tsk->rt.timeout > DIV_ROUND_UP(soft, USEC_PER_SEC/HZ)) { | |
+ if (tsk_rttimeout(tsk) > DIV_ROUND_UP(soft, USEC_PER_SEC/HZ)) { | |
/* | |
* At the soft limit, send a SIGXCPU every second. | |
*/ | |
@@ -1103,7 +1103,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk) | |
struct task_cputime task_sample = { | |
.utime = utime, | |
.stime = stime, | |
- .sum_exec_runtime = tsk->se.sum_exec_runtime | |
+ .sum_exec_runtime = tsk_seruntime(tsk) | |
}; | |
if (task_cputime_expired(&task_sample, &tsk->cputime_expires)) | |
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug | |
index 4e35a5d..b1ac86e 100644 | |
--- a/lib/Kconfig.debug | |
+++ b/lib/Kconfig.debug | |
@@ -1197,7 +1197,7 @@ config TORTURE_TEST | |
config RCU_TORTURE_TEST | |
tristate "torture tests for RCU" | |
- depends on DEBUG_KERNEL | |
+ depends on DEBUG_KERNEL && !SCHED_BFS | |
select TORTURE_TEST | |
default n | |
help | |
diff --git a/lib/Makefile b/lib/Makefile | |
index 0211d2b..426536f 100644 | |
--- a/lib/Makefile | |
+++ b/lib/Makefile | |
@@ -8,7 +8,7 @@ KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS)) | |
endif | |
lib-y := ctype.o string.o vsprintf.o cmdline.o \ | |
- rbtree.o radix-tree.o dump_stack.o timerqueue.o\ | |
+ rbtree.o radix-tree.o sradix-tree.o dump_stack.o timerqueue.o\ | |
idr.o int_sqrt.o extable.o \ | |
sha1.o md5.o irq_regs.o argv_split.o \ | |
proportions.o flex_proportions.o ratelimit.o show_mem.o \ | |
diff --git a/lib/sradix-tree.c b/lib/sradix-tree.c | |
new file mode 100644 | |
index 0000000..8d06329 | |
--- /dev/null | |
+++ b/lib/sradix-tree.c | |
@@ -0,0 +1,476 @@ | |
+#include <linux/errno.h> | |
+#include <linux/mm.h> | |
+#include <linux/mman.h> | |
+#include <linux/spinlock.h> | |
+#include <linux/slab.h> | |
+#include <linux/gcd.h> | |
+#include <linux/sradix-tree.h> | |
+ | |
+static inline int sradix_node_full(struct sradix_tree_root *root, struct sradix_tree_node *node) | |
+{ | |
+ return node->fulls == root->stores_size || | |
+ (node->height == 1 && node->count == root->stores_size); | |
+} | |
+ | |
+/* | |
+ * Extend a sradix tree so it can store key @index. | |
+ */ | |
+static int sradix_tree_extend(struct sradix_tree_root *root, unsigned long index) | |
+{ | |
+ struct sradix_tree_node *node; | |
+ unsigned int height; | |
+ | |
+ if (unlikely(root->rnode == NULL)) { | |
+ if (!(node = root->alloc())) | |
+ return -ENOMEM; | |
+ | |
+ node->height = 1; | |
+ root->rnode = node; | |
+ root->height = 1; | |
+ } | |
+ | |
+ /* Figure out what the height should be. */ | |
+ height = root->height; | |
+ index >>= root->shift * height; | |
+ | |
+ while (index) { | |
+ index >>= root->shift; | |
+ height++; | |
+ } | |
+ | |
+ while (height > root->height) { | |
+ unsigned int newheight; | |
+ if (!(node = root->alloc())) | |
+ return -ENOMEM; | |
+ | |
+ /* Increase the height. */ | |
+ node->stores[0] = root->rnode; | |
+ root->rnode->parent = node; | |
+ if (root->extend) | |
+ root->extend(node, root->rnode); | |
+ | |
+ newheight = root->height + 1; | |
+ node->height = newheight; | |
+ node->count = 1; | |
+ if (sradix_node_full(root, root->rnode)) | |
+ node->fulls = 1; | |
+ | |
+ root->rnode = node; | |
+ root->height = newheight; | |
+ } | |
+ | |
+ return 0; | |
+} | |
+ | |
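
A small sketch of the height calculation used in sradix_tree_extend() above (illustration only, assuming a tree that currently has a single level): with root->shift bits consumed per level, the tree needs enough levels for index >> (shift * height) to reach zero.

#include <stdio.h>

/* Levels needed by a radix tree with 'shift' bits per level to address
 * 'index'; mirrors the while loop in sradix_tree_extend(). */
static unsigned int needed_height(unsigned long index, unsigned int shift)
{
	unsigned int height = 1;

	index >>= shift;
	while (index) {
		index >>= shift;
		height++;
	}
	return height;
}

int main(void)
{
	/* shift = 6 (64 slots per node): 0..63 fit in one level,
	 * 64..4095 need two, 4096 needs three. */
	printf("%u %u %u\n", needed_height(63, 6),
	       needed_height(64, 6), needed_height(4096, 6));	/* 1 2 3 */
	return 0;
}
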
+/* | |
+ * Search for the next item from the current node that is not NULL | 
+ * and can satisfy root->iter(). | 
+ */ | |
+void *sradix_tree_next(struct sradix_tree_root *root, | |
+ struct sradix_tree_node *node, unsigned long index, | |
+ int (*iter)(void *item, unsigned long height)) | |
+{ | |
+ unsigned long offset; | |
+ void *item; | |
+ | |
+ if (unlikely(node == NULL)) { | |
+ node = root->rnode; | |
+ for (offset = 0; offset < root->stores_size; offset++) { | |
+ item = node->stores[offset]; | |
+ if (item && (!iter || iter(item, node->height))) | |
+ break; | |
+ } | |
+ | |
+ if (unlikely(offset >= root->stores_size)) | |
+ return NULL; | |
+ | |
+ if (node->height == 1) | |
+ return item; | |
+ else | |
+ goto go_down; | |
+ } | |
+ | |
+ while (node) { | |
+ offset = (index & root->mask) + 1; | |
+ for (;offset < root->stores_size; offset++) { | |
+ item = node->stores[offset]; | |
+ if (item && (!iter || iter(item, node->height))) | |
+ break; | |
+ } | |
+ | |
+ if (offset < root->stores_size) | |
+ break; | |
+ | |
+ node = node->parent; | |
+ index >>= root->shift; | |
+ } | |
+ | |
+ if (!node) | |
+ return NULL; | |
+ | |
+ while (node->height > 1) { | |
+go_down: | |
+ node = item; | |
+ for (offset = 0; offset < root->stores_size; offset++) { | |
+ item = node->stores[offset]; | |
+ if (item && (!iter || iter(item, node->height))) | |
+ break; | |
+ } | |
+ | |
+ if (unlikely(offset >= root->stores_size)) | |
+ return NULL; | |
+ } | |
+ | |
+ BUG_ON(offset > root->stores_size); | |
+ | |
+ return item; | |
+} | |
+ | |
+/* | |
+ * Blindly insert the item into the tree. Typically, we reuse the | 
+ * first empty store item. | |
+ */ | |
+int sradix_tree_enter(struct sradix_tree_root *root, void **item, int num) | |
+{ | |
+ unsigned long index; | |
+ unsigned int height; | |
+ struct sradix_tree_node *node, *tmp = NULL; | |
+ int offset, offset_saved; | |
+ void **store = NULL; | |
+ int error, i, j, shift; | |
+ | |
+go_on: | |
+ index = root->min; | |
+ | |
+ if (root->enter_node && !sradix_node_full(root, root->enter_node)) { | |
+ node = root->enter_node; | |
+ BUG_ON((index >> (root->shift * root->height))); | |
+ } else { | |
+ node = root->rnode; | |
+ if (node == NULL || (index >> (root->shift * root->height)) | |
+ || sradix_node_full(root, node)) { | |
+ error = sradix_tree_extend(root, index); | |
+ if (error) | |
+ return error; | |
+ | |
+ node = root->rnode; | |
+ } | |
+ } | |
+ | |
+ | |
+ height = node->height; | |
+ shift = (height - 1) * root->shift; | |
+ offset = (index >> shift) & root->mask; | |
+ while (shift > 0) { | |
+ offset_saved = offset; | |
+ for (; offset < root->stores_size; offset++) { | |
+ store = &node->stores[offset]; | |
+ tmp = *store; | |
+ | |
+ if (!tmp || !sradix_node_full(root, tmp)) | |
+ break; | |
+ } | |
+ BUG_ON(offset >= root->stores_size); | |
+ | |
+ if (offset != offset_saved) { | |
+ index += (offset - offset_saved) << shift; | |
+ index &= ~((1UL << shift) - 1); | |
+ } | |
+ | |
+ if (!tmp) { | |
+ if (!(tmp = root->alloc())) | |
+ return -ENOMEM; | |
+ | |
+ tmp->height = shift / root->shift; | |
+ *store = tmp; | |
+ tmp->parent = node; | |
+ node->count++; | |
+// if (root->extend) | |
+// root->extend(node, tmp); | |
+ } | |
+ | |
+ node = tmp; | |
+ shift -= root->shift; | |
+ offset = (index >> shift) & root->mask; | |
+ } | |
+ | |
+ BUG_ON(node->height != 1); | |
+ | |
+ | |
+ store = &node->stores[offset]; | |
+ for (i = 0, j = 0; | |
+ j < root->stores_size - node->count && | |
+ i < root->stores_size - offset && j < num; i++) { | |
+ if (!store[i]) { | |
+ store[i] = item[j]; | |
+ if (root->assign) | |
+ root->assign(node, index + i, item[j]); | |
+ j++; | |
+ } | |
+ } | |
+ | |
+ node->count += j; | |
+ root->num += j; | |
+ num -= j; | |
+ | |
+ while (sradix_node_full(root, node)) { | |
+ node = node->parent; | |
+ if (!node) | |
+ break; | |
+ | |
+ node->fulls++; | |
+ } | |
+ | |
+ if (unlikely(!node)) { | |
+ /* All nodes are full */ | |
+ root->min = 1 << (root->height * root->shift); | |
+ root->enter_node = NULL; | |
+ } else { | |
+ root->min = index + i - 1; | |
+ root->min |= (1UL << (node->height - 1)) - 1; | |
+ root->min++; | |
+ root->enter_node = node; | |
+ } | |
+ | |
+ if (num) { | |
+ item += j; | |
+ goto go_on; | |
+ } | |
+ | |
+ return 0; | |
+} | |
+ | |
+ | |
+/** | |
+ * sradix_tree_shrink - shrink the height of a sradix tree to the minimum | 
+ * @root: sradix tree root | 
+ * | |
+ */ | |
+static inline void sradix_tree_shrink(struct sradix_tree_root *root) | |
+{ | |
+ /* try to shrink tree height */ | |
+ while (root->height > 1) { | |
+ struct sradix_tree_node *to_free = root->rnode; | |
+ | |
+ /* | |
+		 * If the candidate node has more than one child, or its only child | 
+		 * is not at the leftmost store, we cannot shrink. | 
+ */ | |
+ if (to_free->count != 1 || !to_free->stores[0]) | |
+ break; | |
+ | |
+ root->rnode = to_free->stores[0]; | |
+ root->rnode->parent = NULL; | |
+ root->height--; | |
+ if (unlikely(root->enter_node == to_free)) { | |
+ root->enter_node = NULL; | |
+ } | |
+ root->free(to_free); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Delete the item at the known leaf node and index | 
+ */ | |
+void sradix_tree_delete_from_leaf(struct sradix_tree_root *root, | |
+ struct sradix_tree_node *node, unsigned long index) | |
+{ | |
+ unsigned int offset; | |
+ struct sradix_tree_node *start, *end; | |
+ | |
+ BUG_ON(node->height != 1); | |
+ | |
+ start = node; | |
+ while (node && !(--node->count)) | |
+ node = node->parent; | |
+ | |
+ end = node; | |
+ if (!node) { | |
+ root->rnode = NULL; | |
+ root->height = 0; | |
+ root->min = 0; | |
+ root->num = 0; | |
+ root->enter_node = NULL; | |
+ } else { | |
+ offset = (index >> (root->shift * (node->height - 1))) & root->mask; | |
+ if (root->rm) | |
+ root->rm(node, offset); | |
+ node->stores[offset] = NULL; | |
+ root->num--; | |
+ if (root->min > index) { | |
+ root->min = index; | |
+ root->enter_node = node; | |
+ } | |
+ } | |
+ | |
+ if (start != end) { | |
+ do { | |
+ node = start; | |
+ start = start->parent; | |
+ if (unlikely(root->enter_node == node)) | |
+ root->enter_node = end; | |
+ root->free(node); | |
+ } while (start != end); | |
+ | |
+ /* | |
+		 * Note that shrink may free "end", so enter_node still needs to | 
+ * be checked inside. | |
+ */ | |
+ sradix_tree_shrink(root); | |
+ } else if (node->count == root->stores_size - 1) { | |
+ /* It WAS a full leaf node. Update the ancestors */ | |
+ node = node->parent; | |
+ while (node) { | |
+ node->fulls--; | |
+ if (node->fulls != root->stores_size - 1) | |
+ break; | |
+ | |
+ node = node->parent; | |
+ } | |
+ } | |
+} | |
+ | |
+void *sradix_tree_lookup(struct sradix_tree_root *root, unsigned long index) | |
+{ | |
+ unsigned int height, offset; | |
+ struct sradix_tree_node *node; | |
+ int shift; | |
+ | |
+ node = root->rnode; | |
+ if (node == NULL || (index >> (root->shift * root->height))) | |
+ return NULL; | |
+ | |
+ height = root->height; | |
+ shift = (height - 1) * root->shift; | |
+ | |
+ do { | |
+ offset = (index >> shift) & root->mask; | |
+ node = node->stores[offset]; | |
+ if (!node) | |
+ return NULL; | |
+ | |
+ shift -= root->shift; | |
+ } while (shift >= 0); | |
+ | |
+ return node; | |
+} | |
+ | |
+/* | |
+ * Return the item if it exists, otherwise create it in place | |
+ * and return the created item. | |
+ */ | |
+void *sradix_tree_lookup_create(struct sradix_tree_root *root, | |
+ unsigned long index, void *(*item_alloc)(void)) | |
+{ | |
+ unsigned int height, offset; | |
+ struct sradix_tree_node *node, *tmp; | |
+ void *item; | |
+ int shift, error; | |
+ | |
+ if (root->rnode == NULL || (index >> (root->shift * root->height))) { | |
+ if (item_alloc) { | |
+ error = sradix_tree_extend(root, index); | |
+ if (error) | |
+ return NULL; | |
+ } else { | |
+ return NULL; | |
+ } | |
+ } | |
+ | |
+ node = root->rnode; | |
+ height = root->height; | |
+ shift = (height - 1) * root->shift; | |
+ | |
+ do { | |
+ offset = (index >> shift) & root->mask; | |
+ if (!node->stores[offset]) { | |
+ if (!(tmp = root->alloc())) | |
+ return NULL; | |
+ | |
+ tmp->height = shift / root->shift; | |
+ node->stores[offset] = tmp; | |
+ tmp->parent = node; | |
+ node->count++; | |
+ node = tmp; | |
+ } else { | |
+ node = node->stores[offset]; | |
+ } | |
+ | |
+ shift -= root->shift; | |
+ } while (shift > 0); | |
+ | |
+ BUG_ON(node->height != 1); | |
+ offset = index & root->mask; | |
+ if (node->stores[offset]) { | |
+ return node->stores[offset]; | |
+ } else if (item_alloc) { | |
+ if (!(item = item_alloc())) | |
+ return NULL; | |
+ | |
+ node->stores[offset] = item; | |
+ | |
+ /* | |
+ * NOTE: we do NOT call root->assign here, since this item is | |
+		 * newly created by us and carries no meaning yet. The caller can | 
+		 * invoke root->assign itself if necessary. | 
+ */ | |
+ | |
+ node->count++; | |
+ root->num++; | |
+ | |
+ while (sradix_node_full(root, node)) { | |
+ node = node->parent; | |
+ if (!node) | |
+ break; | |
+ | |
+ node->fulls++; | |
+ } | |
+ | |
+ if (unlikely(!node)) { | |
+ /* All nodes are full */ | |
+ root->min = 1 << (root->height * root->shift); | |
+ } else { | |
+ if (root->min == index) { | |
+ root->min |= (1UL << (node->height - 1)) - 1; | |
+ root->min++; | |
+ root->enter_node = node; | |
+ } | |
+ } | |
+ | |
+ return item; | |
+ } else { | |
+ return NULL; | |
+ } | |
+ | |
+} | |
+ | |
+int sradix_tree_delete(struct sradix_tree_root *root, unsigned long index) | |
+{ | |
+ unsigned int height, offset; | |
+ struct sradix_tree_node *node; | |
+ int shift; | |
+ | |
+ node = root->rnode; | |
+ if (node == NULL || (index >> (root->shift * root->height))) | |
+ return -ENOENT; | |
+ | |
+ height = root->height; | |
+ shift = (height - 1) * root->shift; | |
+ | |
+ do { | |
+ offset = (index >> shift) & root->mask; | |
+ node = node->stores[offset]; | |
+ if (!node) | |
+ return -ENOENT; | |
+ | |
+ shift -= root->shift; | |
+ } while (shift > 0); | |
+ | |
+ offset = index & root->mask; | |
+ if (!node->stores[offset]) | |
+ return -ENOENT; | |
+ | |
+ sradix_tree_delete_from_leaf(root, node, index); | |
+ | |
+ return 0; | |
+} | |
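+ | 
+/* | 
+ * Illustrative usage sketch: how a caller might pair | 
+ * sradix_tree_lookup_create() with an item allocator. The names my_item, | 
+ * my_item_alloc and my_root are hypothetical and only serve as an example. | 
+ * | 
+ *	static void *my_item_alloc(void) | 
+ *	{ | 
+ *		return kzalloc(sizeof(struct my_item), GFP_KERNEL); | 
+ *	} | 
+ * | 
+ *	item = sradix_tree_lookup_create(&my_root, index, my_item_alloc); | 
+ *	if (!item) | 
+ *		return -ENOMEM; | 
+ * | 
+ * On return, item is either the entry already stored at index or the | 
+ * freshly allocated one; root->assign() is intentionally not called for a | 
+ * new item, so the caller may do so itself if needed. | 
+ */ | 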
diff --git a/mm/Kconfig b/mm/Kconfig | |
index 1d1ae6b..511dde6 100644 | |
--- a/mm/Kconfig | |
+++ b/mm/Kconfig | |
@@ -339,6 +339,32 @@ config KSM | |
See Documentation/vm/ksm.txt for more information: KSM is inactive | |
until a program has madvised that an area is MADV_MERGEABLE, and | |
root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). | |
+choice | |
+ prompt "Choose UKSM/KSM strategy" | |
+ default UKSM | |
+ depends on KSM | |
+ help | |
+	  This option allows you to select a UKSM/KSM strategy. | 
+ | |
+config UKSM | |
+ bool "Ultra-KSM for page merging" | |
+ depends on KSM | |
+ help | |
+	  UKSM is inspired by the Linux kernel project KSM (Kernel Samepage | 
+	  Merging), but has a fundamentally rewritten core algorithm. With | 
+	  this advanced algorithm, UKSM can transparently scan all anonymously | 
+	  mapped user-space applications with significantly improved scan speed | 
+	  and CPU efficiency. Since KVM is friendly to KSM, KVM can also benefit | 
+	  from UKSM. UKSM has had its first stable release and its first | 
+	  real-world enterprise user. For more information, please visit its | 
+	  project page. (www.kerneldedup.org) | 
+ | |
+config KSM_LEGACY | |
+ bool "Legacy KSM implementation" | |
+ depends on KSM | |
+ help | |
+	  The legacy KSM implementation from Red Hat. | 
+endchoice | |
config DEFAULT_MMAP_MIN_ADDR | |
int "Low address space to protect from user allocation" | |
diff --git a/mm/Makefile b/mm/Makefile | |
index 8405eb0..7689f0c 100644 | |
--- a/mm/Makefile | |
+++ b/mm/Makefile | |
@@ -44,7 +44,8 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o | |
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o | |
obj-$(CONFIG_SLOB) += slob.o | |
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o | |
-obj-$(CONFIG_KSM) += ksm.o | |
+obj-$(CONFIG_KSM_LEGACY) += ksm.o | |
+obj-$(CONFIG_UKSM) += uksm.o | |
obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o | |
obj-$(CONFIG_SLAB) += slab.o | |
obj-$(CONFIG_SLUB) += slub.o | |
diff --git a/mm/memory.c b/mm/memory.c | |
index d5f2ae9..86b5d09 100644 | |
--- a/mm/memory.c | |
+++ b/mm/memory.c | |
@@ -120,6 +120,28 @@ unsigned long highest_memmap_pfn __read_mostly; | |
EXPORT_SYMBOL(zero_pfn); | |
+#ifdef CONFIG_UKSM | |
+unsigned long uksm_zero_pfn __read_mostly; | |
+EXPORT_SYMBOL_GPL(uksm_zero_pfn); | |
+struct page *empty_uksm_zero_page; | |
+ | |
+static int __init setup_uksm_zero_page(void) | |
+{ | |
+ unsigned long addr; | |
+ addr = __get_free_pages(GFP_KERNEL | __GFP_ZERO, 0); | |
+ if (!addr) | |
+ panic("Oh boy, that early out of memory?"); | |
+ | |
+ empty_uksm_zero_page = virt_to_page((void *) addr); | |
+ SetPageReserved(empty_uksm_zero_page); | |
+ | |
+ uksm_zero_pfn = page_to_pfn(empty_uksm_zero_page); | |
+ | |
+ return 0; | |
+} | |
+core_initcall(setup_uksm_zero_page); | |
+#endif | |
+ | |
/* | |
* CONFIG_MMU architectures set up ZERO_PAGE in their paging_init() | |
*/ | |
@@ -131,6 +153,7 @@ static int __init init_zero_pfn(void) | |
core_initcall(init_zero_pfn); | |
+ | |
#if defined(SPLIT_RSS_COUNTING) | |
void sync_mm_rss(struct mm_struct *mm) | |
@@ -878,6 +901,11 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, | |
rss[MM_ANONPAGES]++; | |
else | |
rss[MM_FILEPAGES]++; | |
+ | |
+ /* Should return NULL in vm_normal_page() */ | |
+ uksm_bugon_zeropage(pte); | |
+ } else { | |
+ uksm_map_zero_page(pte); | |
} | |
out_set_pte: | |
@@ -1120,8 +1148,10 @@ again: | |
ptent = ptep_get_and_clear_full(mm, addr, pte, | |
tlb->fullmm); | |
tlb_remove_tlb_entry(tlb, pte, addr); | |
- if (unlikely(!page)) | |
+ if (unlikely(!page)) { | |
+ uksm_unmap_zero_page(ptent); | |
continue; | |
+ } | |
if (unlikely(details) && details->nonlinear_vma | |
&& linear_page_index(details->nonlinear_vma, | |
addr) != page->index) { | |
@@ -1983,8 +2013,10 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo | |
clear_page(kaddr); | |
kunmap_atomic(kaddr); | |
flush_dcache_page(dst); | |
- } else | |
+ } else { | |
copy_user_highpage(dst, src, va, vma); | |
+ uksm_cow_page(vma, src); | |
+ } | |
} | |
/* | |
@@ -2198,6 +2230,7 @@ gotten: | |
new_page = alloc_zeroed_user_highpage_movable(vma, address); | |
if (!new_page) | |
goto oom; | |
+ uksm_cow_pte(vma, orig_pte); | |
} else { | |
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); | |
if (!new_page) | |
@@ -2223,8 +2256,11 @@ gotten: | |
dec_mm_counter_fast(mm, MM_FILEPAGES); | |
inc_mm_counter_fast(mm, MM_ANONPAGES); | |
} | |
- } else | |
+ uksm_bugon_zeropage(orig_pte); | |
+ } else { | |
+ uksm_unmap_zero_page(orig_pte); | |
inc_mm_counter_fast(mm, MM_ANONPAGES); | |
+ } | |
flush_cache_page(vma, address, pte_pfn(orig_pte)); | |
entry = mk_pte(new_page, vma->vm_page_prot); | |
entry = maybe_mkwrite(pte_mkdirty(entry), vma); | |
diff --git a/mm/mmap.c b/mm/mmap.c | |
index ae91989..844f366 100644 | |
--- a/mm/mmap.c | |
+++ b/mm/mmap.c | |
@@ -41,6 +41,7 @@ | |
#include <linux/notifier.h> | |
#include <linux/memory.h> | |
#include <linux/printk.h> | |
+#include <linux/ksm.h> | |
#include <asm/uaccess.h> | |
#include <asm/cacheflush.h> | |
@@ -279,6 +280,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma) | |
if (vma->vm_file) | |
fput(vma->vm_file); | |
mpol_put(vma_policy(vma)); | |
+ uksm_remove_vma(vma); | |
kmem_cache_free(vm_area_cachep, vma); | |
return next; | |
} | |
@@ -739,9 +741,16 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start, | |
long adjust_next = 0; | |
int remove_next = 0; | |
+/* | |
+ * To avoid deadlock, uksm_remove_vma() must be called before any spinlock | 
+ * is acquired. | 
+ */ | |
+ uksm_remove_vma(vma); | |
+ | |
if (next && !insert) { | |
struct vm_area_struct *exporter = NULL; | |
+ uksm_remove_vma(next); | |
if (end >= next->vm_end) { | |
/* | |
* vma expands, overlapping all the next, and | |
@@ -838,6 +847,7 @@ again: remove_next = 1 + (end > next->vm_end); | |
end_changed = true; | |
} | |
vma->vm_pgoff = pgoff; | |
+ | |
if (adjust_next) { | |
next->vm_start += adjust_next << PAGE_SHIFT; | |
next->vm_pgoff += adjust_next; | |
@@ -908,16 +918,22 @@ again: remove_next = 1 + (end > next->vm_end); | |
* up the code too much to do both in one go. | |
*/ | |
next = vma->vm_next; | |
- if (remove_next == 2) | |
+ if (remove_next == 2) { | |
+ uksm_remove_vma(next); | |
goto again; | |
- else if (next) | |
+ } else if (next) { | |
vma_gap_update(next); | |
- else | |
+ } else { | |
mm->highest_vm_end = end; | |
+ } | |
+ } else { | |
+ if (next && !insert) | |
+ uksm_vma_add_new(next); | |
} | |
if (insert && file) | |
uprobe_mmap(insert); | |
+ uksm_vma_add_new(vma); | |
validate_mm(mm); | |
return 0; | |
@@ -1310,6 +1326,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, | |
vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) | | |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; | |
+	/* If uksm is enabled, we add VM_MERGEABLE to new VMAs. */ | 
+ uksm_vm_flags_mod(&vm_flags); | |
+ | |
if (flags & MAP_LOCKED) | |
if (!can_do_mlock()) | |
return -EPERM; | |
@@ -1651,6 +1670,7 @@ munmap_back: | |
allow_write_access(file); | |
} | |
file = vma->vm_file; | |
+ uksm_vma_add_new(vma); | |
out: | |
perf_event_mmap(vma); | |
@@ -1692,6 +1712,7 @@ allow_write_and_free_vma: | |
if (vm_flags & VM_DENYWRITE) | |
allow_write_access(file); | |
free_vma: | |
+ uksm_remove_vma(vma); | |
kmem_cache_free(vm_area_cachep, vma); | |
unacct_error: | |
if (charged) | |
@@ -2488,6 +2509,8 @@ static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, | |
else | |
err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new); | |
+ uksm_vma_add_new(new); | |
+ | |
/* Success. */ | |
if (!err) | |
return 0; | |
@@ -2654,6 +2677,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) | |
return addr; | |
flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; | |
+ uksm_vm_flags_mod(&flags); | |
error = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED); | |
if (error & ~PAGE_MASK) | |
@@ -2712,6 +2736,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len) | |
vma->vm_flags = flags; | |
vma->vm_page_prot = vm_get_page_prot(flags); | |
vma_link(mm, vma, prev, rb_link, rb_parent); | |
+ uksm_vma_add_new(vma); | |
out: | |
perf_event_mmap(vma); | |
mm->total_vm += len >> PAGE_SHIFT; | |
@@ -2747,6 +2772,12 @@ void exit_mmap(struct mm_struct *mm) | |
/* mm's last user has gone, and its about to be pulled down */ | |
mmu_notifier_release(mm); | |
+ /* | |
+ * Taking write lock on mmap_sem does not harm others, | |
+ * but it's crucial for uksm to avoid races. | |
+ */ | |
+ down_write(&mm->mmap_sem); | |
+ | |
if (mm->locked_vm) { | |
vma = mm->mmap; | |
while (vma) { | |
@@ -2783,6 +2814,11 @@ void exit_mmap(struct mm_struct *mm) | |
} | |
vm_unacct_memory(nr_accounted); | |
+ mm->mmap = NULL; | |
+ mm->mm_rb = RB_ROOT; | |
+ vmacache_invalidate(mm); | |
+ up_write(&mm->mmap_sem); | |
+ | |
WARN_ON(atomic_long_read(&mm->nr_ptes) > | |
(FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT); | |
} | |
@@ -2891,6 +2927,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, | |
new_vma->vm_ops->open(new_vma); | |
vma_link(mm, new_vma, prev, rb_link, rb_parent); | |
*need_rmap_locks = false; | |
+ uksm_vma_add_new(new_vma); | |
} | |
} | |
return new_vma; | |
@@ -3004,10 +3041,10 @@ static struct vm_area_struct *__install_special_mapping( | |
ret = insert_vm_struct(mm, vma); | |
if (ret) | |
goto out; | |
- | |
mm->total_vm += len >> PAGE_SHIFT; | |
perf_event_mmap(vma); | |
+ uksm_vma_add_new(vma); | |
return vma; | |
diff --git a/mm/rmap.c b/mm/rmap.c | |
index 3e4c721..d39d8a3 100644 | |
--- a/mm/rmap.c | |
+++ b/mm/rmap.c | |
@@ -908,9 +908,9 @@ void page_move_anon_rmap(struct page *page, | |
/** | |
* __page_set_anon_rmap - set up new anonymous rmap | |
- * @page: Page to add to rmap | |
+ * @page: Page to add to rmap | |
* @vma: VM area to add page to. | |
- * @address: User virtual address of the mapping | |
+ * @address: User virtual address of the mapping | |
* @exclusive: the page is exclusively owned by the current process | |
*/ | |
static void __page_set_anon_rmap(struct page *page, | |
diff --git a/mm/uksm.c b/mm/uksm.c | |
new file mode 100644 | |
index 0000000..c76fcfc | |
--- /dev/null | |
+++ b/mm/uksm.c | |
@@ -0,0 +1,5519 @@ | |
+/* | |
+ * Ultra KSM. Copyright (C) 2011-2012 Nai Xia | |
+ * | |
+ * This is an improvement upon KSM. Some basic data structures and routines | |
+ * are borrowed from ksm.c . | |
+ * | |
+ * Its new features: | |
+ * 1. Full system scan: | |
+ * It automatically scans all user processes' anonymous VMAs. Kernel-user | |
+ * interaction to submit a memory area to KSM is no longer needed. | |
+ * | |
+ * 2. Rich area detection: | |
+ *	It automatically detects rich areas containing abundant duplicated | 
+ *	pages. Rich areas are given full scan speed. Poor areas are | 
+ * sampled at a reasonable speed with very low CPU consumption. | |
+ * | |
+ * 3. Ultra Per-page scan speed improvement: | |
+ * A new hash algorithm is proposed. As a result, on a machine with | |
+ *	Core(TM)2 Quad Q9300 CPU in 32-bit mode and 800MHz DDR2 main memory, it | 
+ *	can scan memory areas that do not contain duplicated pages at a speed of | 
+ *	627MB/sec ~ 2445MB/sec and can merge duplicated areas at a speed of | 
+ * 477MB/sec ~ 923MB/sec. | |
+ * | |
+ * 4. Thrashing area avoidance: | |
+ *	A thrashing area (a VMA with frequent KSM page break-outs) can be | 
+ *	filtered out. My benchmark shows it's more efficient than KSM's | 
+ *	per-page hash-value-based volatile page detection. | 
+ * | |
+ * | |
+ * 5. Misc changes upon KSM: | |
+ *	* It has a fully x86-optimized memcmp dedicated to 4-byte-aligned page | 
+ *	  comparison. It's much faster than the default C version on x86. | 
+ *	* rmap_item now has a struct page *page member to loosely cache an | 
+ *	  address-->page mapping, which avoids many time-costly calls to | 
+ *	  follow_page(). | 
+ * * The VMA creation/exit procedures are hooked to let the Ultra KSM know. | |
+ * * try_to_merge_two_pages() now can revert a pte if it fails. No break_ | |
+ * ksm is needed for this case. | |
+ * | |
+ * 6. Full Zero Page consideration (contributed by Figo Zhang): | 
+ *    Now uksmd considers full zero pages as special pages and merges them | 
+ *    into a special unswappable uksm zero page. | 
+ */ | |
+ | |
+#include <linux/errno.h> | |
+#include <linux/mm.h> | |
+#include <linux/fs.h> | |
+#include <linux/mman.h> | |
+#include <linux/sched.h> | |
+#include <linux/rwsem.h> | |
+#include <linux/pagemap.h> | |
+#include <linux/rmap.h> | |
+#include <linux/spinlock.h> | |
+#include <linux/jhash.h> | |
+#include <linux/delay.h> | |
+#include <linux/kthread.h> | |
+#include <linux/wait.h> | |
+#include <linux/slab.h> | |
+#include <linux/rbtree.h> | |
+#include <linux/memory.h> | |
+#include <linux/mmu_notifier.h> | |
+#include <linux/swap.h> | |
+#include <linux/ksm.h> | |
+#include <linux/crypto.h> | |
+#include <linux/scatterlist.h> | |
+#include <crypto/hash.h> | |
+#include <linux/random.h> | |
+#include <linux/math64.h> | |
+#include <linux/gcd.h> | |
+#include <linux/freezer.h> | |
+#include <linux/sradix-tree.h> | |
+ | |
+#include <asm/tlbflush.h> | |
+#include "internal.h" | |
+ | |
+#ifdef CONFIG_X86 | |
+#undef memcmp | |
+ | |
+#ifdef CONFIG_X86_32 | |
+#define memcmp memcmpx86_32 | |
+/* | |
+ * Compare the 4-byte-aligned addresses s1 and s2, with length n | 
+ */ | |
+int memcmpx86_32(void *s1, void *s2, size_t n) | |
+{ | |
+ size_t num = n / 4; | |
+ register int res; | |
+ | |
+ __asm__ __volatile__ | |
+ ( | |
+ "testl %3,%3\n\t" | |
+ "repe; cmpsd\n\t" | |
+ "je 1f\n\t" | |
+ "sbbl %0,%0\n\t" | |
+ "orl $1,%0\n" | |
+ "1:" | |
+ : "=&a" (res), "+&S" (s1), "+&D" (s2), "+&c" (num) | |
+ : "0" (0) | |
+ : "cc"); | |
+ | |
+ return res; | |
+} | |
+ | |
+/* | |
+ * Check whether the page is all zero. | 
+ */ | |
+static int is_full_zero(const void *s1, size_t len) | |
+{ | |
+ unsigned char same; | |
+ | |
+ len /= 4; | |
+ | |
+ __asm__ __volatile__ | |
+ ("repe; scasl;" | |
+ "sete %0" | |
+ : "=qm" (same), "+D" (s1), "+c" (len) | |
+ : "a" (0) | |
+ : "cc"); | |
+ | |
+ return same; | |
+} | |
+ | |
+ | |
+#elif defined(CONFIG_X86_64) | |
+#define memcmp memcmpx86_64 | |
+/* | |
+ * Compare the 8-byte-aligned addresses s1 and s2, with length n | 
+ */ | |
+int memcmpx86_64(void *s1, void *s2, size_t n) | |
+{ | |
+ size_t num = n / 8; | |
+ register int res; | |
+ | |
+ __asm__ __volatile__ | |
+ ( | |
+ "testq %q3,%q3\n\t" | |
+ "repe; cmpsq\n\t" | |
+ "je 1f\n\t" | |
+ "sbbq %q0,%q0\n\t" | |
+ "orq $1,%q0\n" | |
+ "1:" | |
+ : "=&a" (res), "+&S" (s1), "+&D" (s2), "+&c" (num) | |
+ : "0" (0) | |
+ : "cc"); | |
+ | |
+ return res; | |
+} | |
+ | |
+static int is_full_zero(const void *s1, size_t len) | |
+{ | |
+ unsigned char same; | |
+ | |
+ len /= 8; | |
+ | |
+ __asm__ __volatile__ | |
+ ("repe; scasq;" | |
+ "sete %0" | |
+ : "=qm" (same), "+D" (s1), "+c" (len) | |
+ : "a" (0) | |
+ : "cc"); | |
+ | |
+ return same; | |
+} | |
+ | |
+#endif | |
+#else | |
+static int is_full_zero(const void *s1, size_t len) | |
+{ | |
+	const unsigned long *src = s1; | 
+ int i; | |
+ | |
+ len /= sizeof(*src); | |
+ | |
+ for (i = 0; i < len; i++) { | |
+ if (src[i]) | |
+ return 0; | |
+ } | |
+ | |
+ return 1; | |
+} | |
+#endif | |
+ | |
+#define UKSM_RUNG_ROUND_FINISHED (1 << 0) | |
+#define TIME_RATIO_SCALE 10000 | |
+ | |
+#define SLOT_TREE_NODE_SHIFT 8 | |
+#define SLOT_TREE_NODE_STORE_SIZE (1UL << SLOT_TREE_NODE_SHIFT) | |
+struct slot_tree_node { | |
+ unsigned long size; | |
+ struct sradix_tree_node snode; | |
+ void *stores[SLOT_TREE_NODE_STORE_SIZE]; | |
+}; | |
+ | |
+static struct kmem_cache *slot_tree_node_cachep; | |
+ | |
+static struct sradix_tree_node *slot_tree_node_alloc(void) | |
+{ | |
+ struct slot_tree_node *p; | |
+ p = kmem_cache_zalloc(slot_tree_node_cachep, GFP_KERNEL); | |
+ if (!p) | |
+ return NULL; | |
+ | |
+ return &p->snode; | |
+} | |
+ | |
+static void slot_tree_node_free(struct sradix_tree_node *node) | |
+{ | |
+ struct slot_tree_node *p; | |
+ | |
+ p = container_of(node, struct slot_tree_node, snode); | |
+ kmem_cache_free(slot_tree_node_cachep, p); | |
+} | |
+ | |
+static void slot_tree_node_extend(struct sradix_tree_node *parent, | |
+ struct sradix_tree_node *child) | |
+{ | |
+ struct slot_tree_node *p, *c; | |
+ | |
+ p = container_of(parent, struct slot_tree_node, snode); | |
+ c = container_of(child, struct slot_tree_node, snode); | |
+ | |
+ p->size += c->size; | |
+} | |
+ | |
+void slot_tree_node_assign(struct sradix_tree_node *node, | |
+ unsigned index, void *item) | |
+{ | |
+ struct vma_slot *slot = item; | |
+ struct slot_tree_node *cur; | |
+ | |
+ slot->snode = node; | |
+ slot->sindex = index; | |
+ | |
+ while (node) { | |
+ cur = container_of(node, struct slot_tree_node, snode); | |
+ cur->size += slot->pages; | |
+ node = node->parent; | |
+ } | |
+} | |
+ | |
+void slot_tree_node_rm(struct sradix_tree_node *node, unsigned offset) | |
+{ | |
+ struct vma_slot *slot; | |
+ struct slot_tree_node *cur; | |
+ unsigned long pages; | |
+ | |
+ if (node->height == 1) { | |
+ slot = node->stores[offset]; | |
+ pages = slot->pages; | |
+ } else { | |
+ cur = container_of(node->stores[offset], | |
+ struct slot_tree_node, snode); | |
+ pages = cur->size; | |
+ } | |
+ | |
+ while (node) { | |
+ cur = container_of(node, struct slot_tree_node, snode); | |
+ cur->size -= pages; | |
+ node = node->parent; | |
+ } | |
+} | |
+ | |
+unsigned long slot_iter_index; | |
+int slot_iter(void *item, unsigned long height) | |
+{ | |
+ struct slot_tree_node *node; | |
+ struct vma_slot *slot; | |
+ | |
+ if (height == 1) { | |
+ slot = item; | |
+ if (slot_iter_index < slot->pages) { | |
+ /*in this one*/ | |
+ return 1; | |
+ } else { | |
+ slot_iter_index -= slot->pages; | |
+ return 0; | |
+ } | |
+ | |
+ } else { | |
+ node = container_of(item, struct slot_tree_node, snode); | |
+ if (slot_iter_index < node->size) { | |
+ /*in this one*/ | |
+ return 1; | |
+ } else { | |
+ slot_iter_index -= node->size; | |
+ return 0; | |
+ } | |
+ } | |
+} | |
+ | |
+ | |
+static inline void slot_tree_init_root(struct sradix_tree_root *root) | |
+{ | |
+ init_sradix_tree_root(root, SLOT_TREE_NODE_SHIFT); | |
+ root->alloc = slot_tree_node_alloc; | |
+ root->free = slot_tree_node_free; | |
+ root->extend = slot_tree_node_extend; | |
+ root->assign = slot_tree_node_assign; | |
+ root->rm = slot_tree_node_rm; | |
+} | |
+ | |
+void slot_tree_init(void) | |
+{ | |
+ slot_tree_node_cachep = kmem_cache_create("slot_tree_node", | |
+ sizeof(struct slot_tree_node), 0, | |
+ SLAB_PANIC | SLAB_RECLAIM_ACCOUNT, | |
+ NULL); | |
+} | |
+ | |
+ | |
+/* Each rung of this ladder is a list of VMAs having the same scan ratio */ | 
+struct scan_rung { | |
+ //struct list_head scanned_list; | |
+ struct sradix_tree_root vma_root; | |
+ struct sradix_tree_root vma_root2; | |
+ | |
+ struct vma_slot *current_scan; | |
+ unsigned long current_offset; | |
+ | |
+ /* | |
+	 * The initial value for current_offset; it should loop over | 
+	 * [0, step - 1] to let every slot have a chance to be scanned. | 
+ */ | |
+ unsigned long offset_init; | |
+ unsigned long step; /* dynamic step for current_offset */ | |
+ unsigned int flags; | |
+ unsigned long pages_to_scan; | |
+ //unsigned long fully_scanned_slots; | |
+ /* | |
+	 * A little bit tricky - if cpu_ratio > 0, the value is the cpu time | 
+	 * ratio it can spend in this rung for every scan period. If < 0, it | 
+	 * is the cpu time ratio relative to the max cpu percentage the user | 
+	 * specified. Both are in units of 1/TIME_RATIO_SCALE. | 
+ */ | |
+ int cpu_ratio; | |
+ | |
+ /* | |
+	 * How long will it take for all slots in this rung to be fully | 
+	 * scanned? If it's zero, we don't care about the cover time: | 
+	 * the rung is treated as fully scanned. | 
+ */ | |
+ unsigned int cover_msecs; | |
+ //unsigned long vma_num; | |
+ //unsigned long pages; /* Sum of all slot's pages in rung */ | |
+}; | |
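+ | 
+/* | 
+ * Illustrative reading of cpu_ratio (TIME_RATIO_SCALE is 10000 here): | 
+ * | 
+ *	cpu_ratio = 40     => the rung may use 40/10000 = 0.4% of cpu time | 
+ *	                      in each scan period; | 
+ *	cpu_ratio = -2500  => the rung may use 2500/10000 = 25% of the | 
+ *	                      user-specified max cpu percentage, e.g. 25% of | 
+ *	                      a 95% cap is roughly 23.75%. | 
+ * | 
+ * The exact accounting is done by uksmd's scan scheduling code; the numbers | 
+ * above are only a sketch of how the sign convention is meant to be read. | 
+ */ | 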
+ | |
+/** | |
+ * node of either the stable or unstable rbtree | 
+ * | |
+ */ | |
+struct tree_node { | |
+ struct rb_node node; /* link in the main (un)stable rbtree */ | |
+ struct rb_root sub_root; /* rb_root for sublevel collision rbtree */ | |
+ u32 hash; | |
+	unsigned long count; /* TODO: merge with sub_root */ | 
+ struct list_head all_list; /* all tree nodes in stable/unstable tree */ | |
+}; | |
+ | |
+/** | |
+ * struct stable_node - node of the stable rbtree | |
+ * @node: rb node of this ksm page in the stable tree | |
+ * @hlist: hlist head of rmap_items using this ksm page | |
+ * @kpfn: page frame number of this ksm page | |
+ */ | |
+struct stable_node { | |
+ struct rb_node node; /* link in sub-rbtree */ | |
+	struct tree_node *tree_node; /* its tree_node root in the stable tree, NULL if it's in the hell list */ | 
+ struct hlist_head hlist; | |
+ unsigned long kpfn; | |
+ u32 hash_max; /* if ==0 then it's not been calculated yet */ | |
+ struct list_head all_list; /* in a list for all stable nodes */ | |
+}; | |
+ | |
+/** | |
+ * struct node_vma - group rmap_items linked to the same stable | 
+ * node together. | |
+ */ | |
+struct node_vma { | |
+ union { | |
+ struct vma_slot *slot; | |
+ unsigned long key; /* slot is used as key sorted on hlist */ | |
+ }; | |
+ struct hlist_node hlist; | |
+ struct hlist_head rmap_hlist; | |
+ struct stable_node *head; | |
+}; | |
+ | |
+/** | |
+ * struct rmap_item - reverse mapping item for virtual addresses | |
+ * @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list | |
+ * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree | |
+ * @mm: the memory structure this rmap_item is pointing into | |
+ * @address: the virtual address this rmap_item tracks (+ flags in low bits) | |
+ * @node: rb node of this rmap_item in the unstable tree | |
+ * @head: pointer to stable_node heading this list in the stable tree | |
+ * @hlist: link into hlist of rmap_items hanging off that stable_node | |
+ */ | |
+struct rmap_item { | |
+ struct vma_slot *slot; | |
+ struct page *page; | |
+ unsigned long address; /* + low bits used for flags below */ | |
+ unsigned long hash_round; | |
+ unsigned long entry_index; | |
+ union { | |
+ struct {/* when in unstable tree */ | |
+ struct rb_node node; | |
+ struct tree_node *tree_node; | |
+ u32 hash_max; | |
+ }; | |
+ struct { /* when in stable tree */ | |
+ struct node_vma *head; | |
+ struct hlist_node hlist; | |
+ struct anon_vma *anon_vma; | |
+ }; | |
+ }; | |
+} __attribute__((aligned(4))); | |
+ | |
+struct rmap_list_entry { | |
+ union { | |
+ struct rmap_item *item; | |
+ unsigned long addr; | |
+ }; | |
+ /* lowest bit is used for is_addr tag */ | |
+} __attribute__((aligned(4))); /* 4-byte aligned to fit into pages */ | 
+ | |
+ | |
+/* Basic data structure definition ends */ | |
+ | |
+ | |
+/* | |
+ * Flags for rmap_item to judge if it's listed in the stable/unstable tree. | |
+ * The flags use the low bits of rmap_item.address | |
+ */ | |
+#define UNSTABLE_FLAG 0x1 | |
+#define STABLE_FLAG 0x2 | |
+#define get_rmap_addr(x) ((x)->address & PAGE_MASK) | |
+ | |
+/* | |
+ * rmap_list_entry helpers | |
+ */ | |
+#define IS_ADDR_FLAG 1 | |
+#define is_addr(ptr) ((unsigned long)(ptr) & IS_ADDR_FLAG) | |
+#define set_is_addr(ptr) ((ptr) |= IS_ADDR_FLAG) | |
+#define get_clean_addr(ptr) (((ptr) & ~(__typeof__(ptr))IS_ADDR_FLAG)) | |
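+ | 
+/* | 
+ * Sketch of how the low-bit tag is meant to be used (hypothetical local | 
+ * variables). Both rmap_item and rmap_list_entry are 4-byte aligned, so | 
+ * bit 0 of a valid pointer is always clear and can safely carry the tag: | 
+ * | 
+ *	struct rmap_list_entry e; | 
+ * | 
+ *	e.addr = some_addr; | 
+ *	set_is_addr(e.addr);			// now is_addr(e.item) is true | 
+ *	... | 
+ *	if (is_addr(e.item)) | 
+ *		addr = get_clean_addr(e.addr);	// recover the plain address | 
+ *	else | 
+ *		item = e.item;			// a real rmap_item pointer | 
+ */ | 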
+ | |
+ | |
+/* | |
+ * High speed caches for frequently allocated and freed structs | |
+ */ | |
+static struct kmem_cache *rmap_item_cache; | |
+static struct kmem_cache *stable_node_cache; | |
+static struct kmem_cache *node_vma_cache; | |
+static struct kmem_cache *vma_slot_cache; | |
+static struct kmem_cache *tree_node_cache; | |
+#define UKSM_KMEM_CACHE(__struct, __flags) kmem_cache_create("uksm_"#__struct,\ | |
+ sizeof(struct __struct), __alignof__(struct __struct),\ | |
+ (__flags), NULL) | |
+ | |
+/* Array of all scan_rung, uksm_scan_ladder[0] having the minimum scan ratio */ | |
+#define SCAN_LADDER_SIZE 4 | |
+static struct scan_rung uksm_scan_ladder[SCAN_LADDER_SIZE]; | |
+ | |
+/* The evaluation rounds uksmd has finished */ | |
+static unsigned long long uksm_eval_round = 1; | |
+ | |
+/* | |
+ * we add 1 to this var when we consider we should rebuild the whole | |
+ * unstable tree. | |
+ */ | |
+static unsigned long uksm_hash_round = 1; | |
+ | |
+/* | |
+ * How many times the whole memory is scanned. | |
+ */ | |
+static unsigned long long fully_scanned_round = 1; | |
+ | |
+/* The total number of virtual pages of all vma slots */ | |
+static u64 uksm_pages_total; | |
+ | |
+/* The number of pages that have been scanned since startup */ | 
+static u64 uksm_pages_scanned; | |
+ | |
+static u64 scanned_virtual_pages; | |
+ | |
+/* The number of pages that have been scanned since the last encode_benefit call */ | 
+static u64 uksm_pages_scanned_last; | |
+ | |
+/* If the scanned number is too large, we encode it here */ | 
+static u64 pages_scanned_stored; | |
+ | |
+static unsigned long pages_scanned_base; | |
+ | |
+/* The number of nodes in the stable tree */ | |
+static unsigned long uksm_pages_shared; | |
+ | |
+/* The number of page slots additionally sharing those nodes */ | |
+static unsigned long uksm_pages_sharing; | |
+ | |
+/* The number of nodes in the unstable tree */ | |
+static unsigned long uksm_pages_unshared; | |
+ | |
+/* | |
+ * Milliseconds ksmd should sleep between scans, | |
+ * >= 100ms to be consistent with | |
+ * scan_time_to_sleep_msec() | |
+ */ | |
+static unsigned int uksm_sleep_jiffies; | |
+ | |
+/* The real value for the uksmd next sleep */ | |
+static unsigned int uksm_sleep_real; | |
+ | |
+/* Saved value for user input uksm_sleep_jiffies when it's enlarged */ | |
+static unsigned int uksm_sleep_saved; | |
+ | |
+/* Max percentage of cpu utilization ksmd can take to scan in one batch */ | |
+static unsigned int uksm_max_cpu_percentage; | |
+ | |
+static int uksm_cpu_governor; | |
+ | |
+static char *uksm_cpu_governor_str[4] = { "full", "medium", "low", "quiet" }; | |
+ | |
+struct uksm_cpu_preset_s { | |
+ int cpu_ratio[SCAN_LADDER_SIZE]; | |
+ unsigned int cover_msecs[SCAN_LADDER_SIZE]; | |
+ unsigned int max_cpu; /* percentage */ | |
+}; | |
+ | |
+struct uksm_cpu_preset_s uksm_cpu_preset[4] = { | |
+ { {20, 40, -2500, -10000}, {1000, 500, 200, 50}, 95}, | |
+ { {20, 30, -2500, -10000}, {1000, 500, 400, 100}, 50}, | |
+ { {10, 20, -5000, -10000}, {1500, 1000, 1000, 250}, 20}, | |
+ { {10, 20, 40, 75}, {2000, 1000, 1000, 1000}, 1}, | |
+}; | |
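+ | 
+/* | 
+ * Reading the preset table above (a sketch; the row order is assumed to | 
+ * follow uksm_cpu_governor_str, i.e. "full", "medium", "low", "quiet"): | 
+ * the "full" row gives the four ladder rungs cpu ratios of 20/10000, | 
+ * 40/10000, 25% of the cap and 100% of the cap, asks for cover times of | 
+ * 1000/500/200/50 msecs, and caps uksmd at 95% cpu overall. | 
+ */ | 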
+ | |
+/* The default value for uksm_ema_page_time if it's not initialized */ | |
+#define UKSM_PAGE_TIME_DEFAULT 500 | |
+ | |
+/* Cost to scan one page, tracked as an exponential moving average, in nsecs */ | 
+static unsigned long uksm_ema_page_time = UKSM_PAGE_TIME_DEFAULT; | |
+ | |
+/* The exponential moving average alpha weight, in percentage. */ | 
+#define EMA_ALPHA 20 | |
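+ | 
+/* | 
+ * A minimal sketch of the assumed update rule, i.e. a standard exponential | 
+ * moving average with EMA_ALPHA as the weight in percent (the helper name | 
+ * ema_sketch is hypothetical; the in-tree update code may differ slightly): | 
+ * | 
+ *	static inline unsigned long ema_sketch(unsigned long cur_cost, | 
+ *					       unsigned long last_ema) | 
+ *	{ | 
+ *		return (EMA_ALPHA * cur_cost + | 
+ *			(100 - EMA_ALPHA) * last_ema) / 100; | 
+ *	} | 
+ * | 
+ * With EMA_ALPHA == 20, a new sample contributes 20% and the history 80%. | 
+ */ | 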
+ | |
+/* | |
+ * The threshold used to filter out thrashing areas. | 
+ * If it == 0, filtering is disabled; otherwise it's the percentage upper bound | 
+ * on the thrashing ratio of all areas. Any area with a bigger thrashing ratio | 
+ * will be considered as having a zero duplication ratio. | |
+ */ | |
+static unsigned int uksm_thrash_threshold = 50; | |
+ | |
+/* The dedup ratio at or above which an area is considered abundant */ | 
+static unsigned int uksm_abundant_threshold = 10; | |
+ | |
+/* All slots having merged pages in this eval round. */ | |
+struct list_head vma_slot_dedup = LIST_HEAD_INIT(vma_slot_dedup); | |
+ | |
+/* How many times the ksmd has slept since startup */ | |
+static unsigned long long uksm_sleep_times; | |
+ | |
+#define UKSM_RUN_STOP 0 | |
+#define UKSM_RUN_MERGE 1 | |
+static unsigned int uksm_run = 1; | |
+ | |
+static DECLARE_WAIT_QUEUE_HEAD(uksm_thread_wait); | |
+static DEFINE_MUTEX(uksm_thread_mutex); | |
+ | |
+/* | |
+ * List vma_slot_new is for newly created vma_slot waiting to be added by | |
+ * ksmd. If one cannot be added (e.g. because it's too small), it's moved to | 
+ * vma_slot_noadd. vma_slot_del is the list for vma_slot whose corresponding | |
+ * VMA has been removed/freed. | |
+ */ | |
+struct list_head vma_slot_new = LIST_HEAD_INIT(vma_slot_new); | |
+struct list_head vma_slot_noadd = LIST_HEAD_INIT(vma_slot_noadd); | |
+struct list_head vma_slot_del = LIST_HEAD_INIT(vma_slot_del); | |
+static DEFINE_SPINLOCK(vma_slot_list_lock); | |
+ | |
+/* The unstable tree heads */ | |
+static struct rb_root root_unstable_tree = RB_ROOT; | |
+ | |
+/* | |
+ * All tree_nodes are in a list to be freed at once when unstable tree is | |
+ * freed after each scan round. | |
+ */ | |
+static struct list_head unstable_tree_node_list = | |
+ LIST_HEAD_INIT(unstable_tree_node_list); | |
+ | |
+/* List contains all stable nodes */ | |
+static struct list_head stable_node_list = LIST_HEAD_INIT(stable_node_list); | |
+ | |
+/* | |
+ * When the hash strength is changed, the stable tree must be delta_hashed and | |
+ * re-structured. We use two sets of the structs below to speed up the | 
+ * re-structuring of stable tree. | |
+ */ | |
+static struct list_head | |
+stable_tree_node_list[2] = {LIST_HEAD_INIT(stable_tree_node_list[0]), | |
+ LIST_HEAD_INIT(stable_tree_node_list[1])}; | |
+ | |
+static struct list_head *stable_tree_node_listp = &stable_tree_node_list[0]; | |
+static struct rb_root root_stable_tree[2] = {RB_ROOT, RB_ROOT}; | |
+static struct rb_root *root_stable_treep = &root_stable_tree[0]; | |
+static unsigned long stable_tree_index; | |
+ | |
+/* The hash strength needed to hash a full page */ | |
+#define HASH_STRENGTH_FULL (PAGE_SIZE / sizeof(u32)) | |
+ | |
+/* The hash strength needed for loop-back hashing */ | |
+#define HASH_STRENGTH_MAX (HASH_STRENGTH_FULL + 10) | |
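+ | 
+/* | 
+ * Worked example, assuming 4 KiB pages: HASH_STRENGTH_FULL is | 
+ * 4096 / sizeof(u32) = 1024, i.e. every 32-bit word of the page is sampled | 
+ * once; HASH_STRENGTH_MAX is then 1034, which makes the sampling loop wrap | 
+ * around and hash the first 10 sampled offsets a second time ("loop-back" | 
+ * hashing). | 
+ */ | 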
+ | |
+/* The random offsets in a page */ | |
+static u32 *random_nums; | |
+ | |
+/* The hash strength */ | |
+static unsigned long hash_strength = HASH_STRENGTH_FULL >> 4; | |
+ | |
+/* The delta value each time the hash strength increases or decreases */ | |
+static unsigned long hash_strength_delta; | |
+#define HASH_STRENGTH_DELTA_MAX 5 | |
+ | |
+/* The time we have saved due to random_sample_hash */ | |
+static u64 rshash_pos; | |
+ | |
+/* The time we have wasted due to hash collision */ | |
+static u64 rshash_neg; | |
+ | |
+struct uksm_benefit { | |
+ u64 pos; | |
+ u64 neg; | |
+ u64 scanned; | |
+ unsigned long base; | |
+} benefit; | |
+ | |
+/* | |
+ * The relative cost of memcmp, compared to 1 time unit of random sample | |
+ * hash; this value is tested when the ksm module is initialized | 
+ */ | |
+static unsigned long memcmp_cost; | |
+ | |
+static unsigned long rshash_neg_cont_zero; | |
+static unsigned long rshash_cont_obscure; | |
+ | |
+/* The possible states of hash strength adjustment heuristic */ | |
+enum rshash_states { | |
+ RSHASH_STILL, | |
+ RSHASH_TRYUP, | |
+ RSHASH_TRYDOWN, | |
+ RSHASH_NEW, | |
+ RSHASH_PRE_STILL, | |
+}; | |
+ | |
+/* The possible direction we are about to adjust hash strength */ | |
+enum rshash_direct { | |
+ GO_UP, | |
+ GO_DOWN, | |
+ OBSCURE, | |
+ STILL, | |
+}; | |
+ | |
+/* random sampling hash state machine */ | |
+static struct { | |
+ enum rshash_states state; | |
+ enum rshash_direct pre_direct; | |
+ u8 below_count; | |
+	/* Keep a lookup window of size 5; if above_count/below_count > 3 | 
+	 * in this window, we stop trying. | 
+ */ | |
+ u8 lookup_window_index; | |
+ u64 stable_benefit; | |
+ unsigned long turn_point_down; | |
+ unsigned long turn_benefit_down; | |
+ unsigned long turn_point_up; | |
+ unsigned long turn_benefit_up; | |
+ unsigned long stable_point; | |
+} rshash_state; | |
+ | |
+/*zero page hash table, hash_strength [0 ~ HASH_STRENGTH_MAX]*/ | |
+static u32 *zero_hash_table; | |
+ | |
+static inline struct node_vma *alloc_node_vma(void) | |
+{ | |
+ struct node_vma *node_vma; | |
+ node_vma = kmem_cache_zalloc(node_vma_cache, GFP_KERNEL); | |
+ if (node_vma) { | |
+ INIT_HLIST_HEAD(&node_vma->rmap_hlist); | |
+ INIT_HLIST_NODE(&node_vma->hlist); | |
+ } | |
+ return node_vma; | |
+} | |
+ | |
+static inline void free_node_vma(struct node_vma *node_vma) | |
+{ | |
+ kmem_cache_free(node_vma_cache, node_vma); | |
+} | |
+ | |
+ | |
+static inline struct vma_slot *alloc_vma_slot(void) | |
+{ | |
+ struct vma_slot *slot; | |
+ | |
+ /* | |
+ * In case ksm is not initialized by now. | |
+ * Oops, we need to consider the call site of uksm_init() in the future. | |
+ */ | |
+ if (!vma_slot_cache) | |
+ return NULL; | |
+ | |
+ slot = kmem_cache_zalloc(vma_slot_cache, GFP_KERNEL); | |
+ if (slot) { | |
+ INIT_LIST_HEAD(&slot->slot_list); | |
+ INIT_LIST_HEAD(&slot->dedup_list); | |
+ slot->flags |= UKSM_SLOT_NEED_RERAND; | |
+ } | |
+ return slot; | |
+} | |
+ | |
+static inline void free_vma_slot(struct vma_slot *vma_slot) | |
+{ | |
+ kmem_cache_free(vma_slot_cache, vma_slot); | |
+} | |
+ | |
+ | |
+ | |
+static inline struct rmap_item *alloc_rmap_item(void) | |
+{ | |
+ struct rmap_item *rmap_item; | |
+ | |
+ rmap_item = kmem_cache_zalloc(rmap_item_cache, GFP_KERNEL); | |
+ if (rmap_item) { | |
+		/* BUG if the lowest bit is not clear; it is reserved for flag use */ | 
+ BUG_ON(is_addr(rmap_item)); | |
+ } | |
+ return rmap_item; | |
+} | |
+ | |
+static inline void free_rmap_item(struct rmap_item *rmap_item) | |
+{ | |
+ rmap_item->slot = NULL; /* debug safety */ | |
+ kmem_cache_free(rmap_item_cache, rmap_item); | |
+} | |
+ | |
+static inline struct stable_node *alloc_stable_node(void) | |
+{ | |
+ struct stable_node *node; | |
+ node = kmem_cache_alloc(stable_node_cache, GFP_KERNEL | GFP_ATOMIC); | |
+ if (!node) | |
+ return NULL; | |
+ | |
+ INIT_HLIST_HEAD(&node->hlist); | |
+ list_add(&node->all_list, &stable_node_list); | |
+ return node; | |
+} | |
+ | |
+static inline void free_stable_node(struct stable_node *stable_node) | |
+{ | |
+ list_del(&stable_node->all_list); | |
+ kmem_cache_free(stable_node_cache, stable_node); | |
+} | |
+ | |
+static inline struct tree_node *alloc_tree_node(struct list_head *list) | |
+{ | |
+ struct tree_node *node; | |
+ node = kmem_cache_zalloc(tree_node_cache, GFP_KERNEL | GFP_ATOMIC); | |
+ if (!node) | |
+ return NULL; | |
+ | |
+ list_add(&node->all_list, list); | |
+ return node; | |
+} | |
+ | |
+static inline void free_tree_node(struct tree_node *node) | |
+{ | |
+ list_del(&node->all_list); | |
+ kmem_cache_free(tree_node_cache, node); | |
+} | |
+ | |
+static void uksm_drop_anon_vma(struct rmap_item *rmap_item) | |
+{ | |
+ struct anon_vma *anon_vma = rmap_item->anon_vma; | |
+ | |
+ put_anon_vma(anon_vma); | |
+} | |
+ | |
+ | |
+/** | |
+ * Remove a stable node from stable_tree, may unlink from its tree_node and | |
+ * may remove its parent tree_node if no other stable node is pending. | |
+ * | |
+ * @stable_node	The node to be removed | 
+ * @unlink_rb	Will this node be unlinked from the rbtree? | 
+ * @remove_tree_node	Will its tree_node be removed if empty? | 
+ */ | |
+static void remove_node_from_stable_tree(struct stable_node *stable_node, | |
+ int unlink_rb, int remove_tree_node) | |
+{ | |
+ struct node_vma *node_vma; | |
+ struct rmap_item *rmap_item; | |
+ struct hlist_node *n; | |
+ | |
+ if (!hlist_empty(&stable_node->hlist)) { | |
+ hlist_for_each_entry_safe(node_vma, n, | |
+ &stable_node->hlist, hlist) { | |
+ hlist_for_each_entry(rmap_item, &node_vma->rmap_hlist, hlist) { | |
+ uksm_pages_sharing--; | |
+ | |
+ uksm_drop_anon_vma(rmap_item); | |
+ rmap_item->address &= PAGE_MASK; | |
+ } | |
+ free_node_vma(node_vma); | |
+ cond_resched(); | |
+ } | |
+ | |
+ /* the last one is counted as shared */ | |
+ uksm_pages_shared--; | |
+ uksm_pages_sharing++; | |
+ } | |
+ | |
+ if (stable_node->tree_node && unlink_rb) { | |
+ rb_erase(&stable_node->node, | |
+ &stable_node->tree_node->sub_root); | |
+ | |
+ if (RB_EMPTY_ROOT(&stable_node->tree_node->sub_root) && | |
+ remove_tree_node) { | |
+ rb_erase(&stable_node->tree_node->node, | |
+ root_stable_treep); | |
+ free_tree_node(stable_node->tree_node); | |
+ } else { | |
+ stable_node->tree_node->count--; | |
+ } | |
+ } | |
+ | |
+ free_stable_node(stable_node); | |
+} | |
+ | |
+ | |
+/* | |
+ * get_uksm_page: checks if the page indicated by the stable node | |
+ * is still its ksm page, despite having held no reference to it. | |
+ * In which case we can trust the content of the page, and it | |
+ * returns the gotten page; but if the page has now been zapped, | |
+ * remove the stale node from the stable tree and return NULL. | |
+ * | |
+ * You would expect the stable_node to hold a reference to the ksm page. | |
+ * But if it increments the page's count, swapping out has to wait for | |
+ * ksmd to come around again before it can free the page, which may take | |
+ * seconds or even minutes: much too unresponsive. So instead we use a | |
+ * "keyhole reference": access to the ksm page from the stable node peeps | |
+ * out through its keyhole to see if that page still holds the right key, | |
+ * pointing back to this stable node. This relies on freeing a PageAnon | |
+ * page to reset its page->mapping to NULL, and relies on no other use of | |
+ * a page to put something that might look like our key in page->mapping. | |
+ * | |
+ * include/linux/pagemap.h page_cache_get_speculative() is a good reference, | |
+ * but this is different - made simpler by uksm_thread_mutex being held, but | |
+ * interesting for assuming that no other use of the struct page could ever | |
+ * put our expected_mapping into page->mapping (or a field of the union which | |
+ * coincides with page->mapping). The RCU calls are not for KSM at all, but | |
+ * to keep the page_count protocol described with page_cache_get_speculative. | |
+ * | |
+ * Note: it is possible that get_uksm_page() will return NULL one moment, | |
+ * then page the next, if the page is in between page_freeze_refs() and | |
+ * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page | |
+ * is on its way to being freed; but it is an anomaly to bear in mind. | |
+ * | |
+ * @unlink_rb: whether the removal of this node will first unlink it from | 
+ * its rbtree. stable_node_reinsert will prevent this when restructuring the | |
+ * node from its old tree. | |
+ * | |
+ * @remove_tree_node: if this is the last one of its tree_node, will the | |
+ * tree_node be freed? If we are inserting a stable node, this tree_node may | 
+ * be reused, so don't free it. | |
+ */ | |
+static struct page *get_uksm_page(struct stable_node *stable_node, | |
+ int unlink_rb, int remove_tree_node) | |
+{ | |
+ struct page *page; | |
+ void *expected_mapping; | |
+ | |
+ page = pfn_to_page(stable_node->kpfn); | |
+ expected_mapping = (void *)stable_node + | |
+ (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM); | |
+ rcu_read_lock(); | |
+ if (page->mapping != expected_mapping) | |
+ goto stale; | |
+ if (!get_page_unless_zero(page)) | |
+ goto stale; | |
+ if (page->mapping != expected_mapping) { | |
+ put_page(page); | |
+ goto stale; | |
+ } | |
+ rcu_read_unlock(); | |
+ return page; | |
+stale: | |
+ rcu_read_unlock(); | |
+ remove_node_from_stable_tree(stable_node, unlink_rb, remove_tree_node); | |
+ | |
+ return NULL; | |
+} | |
+ | |
+/* | |
+ * Removing rmap_item from stable or unstable tree. | |
+ * This function will clean the information from the stable/unstable tree. | |
+ */ | |
+static inline void remove_rmap_item_from_tree(struct rmap_item *rmap_item) | |
+{ | |
+ if (rmap_item->address & STABLE_FLAG) { | |
+ struct stable_node *stable_node; | |
+ struct node_vma *node_vma; | |
+ struct page *page; | |
+ | |
+ node_vma = rmap_item->head; | |
+ stable_node = node_vma->head; | |
+ page = get_uksm_page(stable_node, 1, 1); | |
+ if (!page) | |
+ goto out; | |
+ | |
+ /* | |
+ * page lock is needed because it's racing with | |
+ * try_to_unmap_ksm(), etc. | |
+ */ | |
+ lock_page(page); | |
+ hlist_del(&rmap_item->hlist); | |
+ | |
+ if (hlist_empty(&node_vma->rmap_hlist)) { | |
+ hlist_del(&node_vma->hlist); | |
+ free_node_vma(node_vma); | |
+ } | |
+ unlock_page(page); | |
+ | |
+ put_page(page); | |
+ if (hlist_empty(&stable_node->hlist)) { | |
+ /* do NOT call remove_node_from_stable_tree() here, | |
+			 * it's possible for a forked rmap_item to be outside | 
+			 * the stable tree while the in-tree rmap_items have | 
+			 * been deleted. | 
+ */ | |
+ uksm_pages_shared--; | |
+ } else | |
+ uksm_pages_sharing--; | |
+ | |
+ | |
+ uksm_drop_anon_vma(rmap_item); | |
+ } else if (rmap_item->address & UNSTABLE_FLAG) { | |
+ if (rmap_item->hash_round == uksm_hash_round) { | |
+ | |
+ rb_erase(&rmap_item->node, | |
+ &rmap_item->tree_node->sub_root); | |
+ if (RB_EMPTY_ROOT(&rmap_item->tree_node->sub_root)) { | |
+ rb_erase(&rmap_item->tree_node->node, | |
+ &root_unstable_tree); | |
+ | |
+ free_tree_node(rmap_item->tree_node); | |
+ } else | |
+ rmap_item->tree_node->count--; | |
+ } | |
+ uksm_pages_unshared--; | |
+ } | |
+ | |
+ rmap_item->address &= PAGE_MASK; | |
+ rmap_item->hash_max = 0; | |
+ | |
+out: | |
+ cond_resched(); /* we're called from many long loops */ | |
+} | |
+ | |
+static inline int slot_in_uksm(struct vma_slot *slot) | |
+{ | |
+ return list_empty(&slot->slot_list); | |
+} | |
+ | |
+/* | |
+ * Test if the mm is exiting | |
+ */ | |
+static inline bool uksm_test_exit(struct mm_struct *mm) | |
+{ | |
+ return atomic_read(&mm->mm_users) == 0; | |
+} | |
+ | |
+/** | |
+ * Need to do two things: | |
+ * 1. check if slot was moved to del list | |
+ * 2. make sure the mmap_sem is manipulated while the vma is still valid. | 
+ * | 
+ * My concern here is that in some cases, this may cause | 
+ * vma_slot_list_lock() waiters to be serialized further by some | 
+ * sem->wait_lock; can this really be expensive? | 
+ * | |
+ * | |
+ * @return | |
+ * 0: if successfully locked mmap_sem | |
+ * -ENOENT: this slot was moved to del list | |
+ * -EBUSY: vma lock failed | |
+ */ | |
+static int try_down_read_slot_mmap_sem(struct vma_slot *slot) | |
+{ | |
+ struct vm_area_struct *vma; | |
+ struct mm_struct *mm; | |
+ struct rw_semaphore *sem; | |
+ | |
+ spin_lock(&vma_slot_list_lock); | |
+ | |
+	/* The slot_list is removed from the new list and re-initialized when | 
+	 * the slot enters uksm. If it's not empty now, the slot must have been | 
+	 * moved to the del list. | 
+ */ | |
+ if (!slot_in_uksm(slot)) { | |
+ spin_unlock(&vma_slot_list_lock); | |
+ return -ENOENT; | |
+ } | |
+ | |
+ BUG_ON(slot->pages != vma_pages(slot->vma)); | |
+ /* Ok, vma still valid */ | |
+ vma = slot->vma; | |
+ mm = vma->vm_mm; | |
+ sem = &mm->mmap_sem; | |
+ | |
+ if (uksm_test_exit(mm)) { | |
+ spin_unlock(&vma_slot_list_lock); | |
+ return -ENOENT; | |
+ } | |
+ | |
+ if (down_read_trylock(sem)) { | |
+ spin_unlock(&vma_slot_list_lock); | |
+ return 0; | |
+ } | |
+ | |
+ spin_unlock(&vma_slot_list_lock); | |
+ return -EBUSY; | |
+} | |
+ | |
+static inline unsigned long | |
+vma_page_address(struct page *page, struct vm_area_struct *vma) | |
+{ | |
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); | |
+ unsigned long address; | |
+ | |
+ address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); | |
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) { | |
+ /* page should be within @vma mapping range */ | |
+ return -EFAULT; | |
+ } | |
+ return address; | |
+} | |
+ | |
+ | |
+/* return 0 on success with the item's mmap_sem locked */ | |
+static inline int get_mergeable_page_lock_mmap(struct rmap_item *item) | |
+{ | |
+ struct mm_struct *mm; | |
+ struct vma_slot *slot = item->slot; | |
+ int err = -EINVAL; | |
+ | |
+ struct page *page; | |
+ | |
+ /* | |
+ * try_down_read_slot_mmap_sem() returns non-zero if the slot | |
+ * has been removed by uksm_remove_vma(). | |
+ */ | |
+ if (try_down_read_slot_mmap_sem(slot)) | |
+ return -EBUSY; | |
+ | |
+ mm = slot->vma->vm_mm; | |
+ | |
+ if (uksm_test_exit(mm)) | |
+ goto failout_up; | |
+ | |
+ page = item->page; | |
+ rcu_read_lock(); | |
+ if (!get_page_unless_zero(page)) { | |
+ rcu_read_unlock(); | |
+ goto failout_up; | |
+ } | |
+ | |
+ /* No need to consider huge page here. */ | |
+ if (item->slot->vma->anon_vma != page_anon_vma(page) || | |
+ vma_page_address(page, item->slot->vma) != get_rmap_addr(item)) { | |
+ /* | |
+ * TODO: | |
+		 * should we release this item because of its stale page | 
+ * mapping? | |
+ */ | |
+ put_page(page); | |
+ rcu_read_unlock(); | |
+ goto failout_up; | |
+ } | |
+ rcu_read_unlock(); | |
+ return 0; | |
+ | |
+failout_up: | |
+ up_read(&mm->mmap_sem); | |
+ return err; | |
+} | |
+ | |
+/* | |
+ * What kind of VMA is considered ? | |
+ */ | |
+static inline int vma_can_enter(struct vm_area_struct *vma) | |
+{ | |
+ return uksm_flags_can_scan(vma->vm_flags); | |
+} | |
+ | |
+/* | |
+ * Called whenever a fresh new vma is created. A new vma_slot | 
+ * is created and inserted into a global list. Must be called | 
+ * after the vma is inserted into its mm. | 
+ */ | |
+void uksm_vma_add_new(struct vm_area_struct *vma) | |
+{ | |
+ struct vma_slot *slot; | |
+ | |
+ if (!vma_can_enter(vma)) { | |
+ vma->uksm_vma_slot = NULL; | |
+ return; | |
+ } | |
+ | |
+ slot = alloc_vma_slot(); | |
+ if (!slot) { | |
+ vma->uksm_vma_slot = NULL; | |
+ return; | |
+ } | |
+ | |
+ vma->uksm_vma_slot = slot; | |
+ vma->vm_flags |= VM_MERGEABLE; | |
+ slot->vma = vma; | |
+ slot->mm = vma->vm_mm; | |
+ slot->ctime_j = jiffies; | |
+ slot->pages = vma_pages(vma); | |
+ spin_lock(&vma_slot_list_lock); | |
+ list_add_tail(&slot->slot_list, &vma_slot_new); | |
+ spin_unlock(&vma_slot_list_lock); | |
+} | |
+ | |
+/* | |
+ * Called after vma is unlinked from its mm | |
+ */ | |
+void uksm_remove_vma(struct vm_area_struct *vma) | |
+{ | |
+ struct vma_slot *slot; | |
+ | |
+ if (!vma->uksm_vma_slot) | |
+ return; | |
+ | |
+ slot = vma->uksm_vma_slot; | |
+ spin_lock(&vma_slot_list_lock); | |
+ if (slot_in_uksm(slot)) { | |
+ /** | |
+		 * This slot has been added by ksmd, so move it to the del | 
+		 * list and wait for ksmd to free it. | 
+ */ | |
+ list_add_tail(&slot->slot_list, &vma_slot_del); | |
+ } else { | |
+ /** | |
+ * It's still on new list. It's ok to free slot directly. | |
+ */ | |
+ list_del(&slot->slot_list); | |
+ free_vma_slot(slot); | |
+ } | |
+ spin_unlock(&vma_slot_list_lock); | |
+ vma->uksm_vma_slot = NULL; | |
+} | |
+ | |
+/* 32/3 < they < 32/2 */ | |
+#define shiftl 8 | |
+#define shiftr 12 | |
+ | |
+#define HASH_FROM_TO(from, to) \ | |
+for (index = from; index < to; index++) { \ | |
+ pos = random_nums[index]; \ | |
+ hash += key[pos]; \ | |
+ hash += (hash << shiftl); \ | |
+ hash ^= (hash >> shiftr); \ | |
+} | |
+ | |
+ | |
+#define HASH_FROM_DOWN_TO(from, to) \ | |
+for (index = from - 1; index >= to; index--) { \ | |
+ hash ^= (hash >> shiftr); \ | |
+ hash ^= (hash >> (shiftr*2)); \ | |
+ hash -= (hash << shiftl); \ | |
+ hash += (hash << (shiftl*2)); \ | |
+ pos = random_nums[index]; \ | |
+ hash -= key[pos]; \ | |
+} | |
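+ | 
+/* | 
+ * The two macros above are meant to be exact inverses of each other over | 
+ * u32 arithmetic. A sketch of one forward mixing step and its undo: | 
+ * | 
+ *	forward:	hash += key[pos]; | 
+ *			hash += (hash << shiftl); | 
+ *			hash ^= (hash >> shiftr); | 
+ * | 
+ *	inverse:	hash ^= (hash >> shiftr); hash ^= (hash >> (shiftr*2)); | 
+ *			hash -= (hash << shiftl); hash += (hash << (shiftl*2)); | 
+ *			hash -= key[pos]; | 
+ * | 
+ * The "+= (hash << shiftl)" step multiplies by (1 + 2^shiftl) mod 2^32, and | 
+ * (1 + 2^8) * (1 - 2^8) * (1 + 2^16) = 1 - 2^32 == 1 (mod 2^32), which is | 
+ * why subtracting the << shiftl term and then adding the << (shiftl*2) term | 
+ * undoes it. | 
+ */ | 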
+ | |
+/* | |
+ * The main random sample hash function. | |
+ */ | |
+static u32 random_sample_hash(void *addr, u32 hash_strength) | |
+{ | |
+ u32 hash = 0xdeadbeef; | |
+ int index, pos, loop = hash_strength; | |
+ u32 *key = (u32 *)addr; | |
+ | |
+ if (loop > HASH_STRENGTH_FULL) | |
+ loop = HASH_STRENGTH_FULL; | |
+ | |
+ HASH_FROM_TO(0, loop); | |
+ | |
+ if (hash_strength > HASH_STRENGTH_FULL) { | |
+ loop = hash_strength - HASH_STRENGTH_FULL; | |
+ HASH_FROM_TO(0, loop); | |
+ } | |
+ | |
+ return hash; | |
+} | |
+ | |
+ | |
+/** | |
+ * It's used when hash strength is adjusted | |
+ * | |
+ * @addr The page's virtual address | |
+ * @from The original hash strength | |
+ * @to The hash strength changed to | |
+ * @hash The hash value generated with the "from" hash strength | 
+ * | |
+ * return the hash value | |
+ */ | |
+static u32 delta_hash(void *addr, int from, int to, u32 hash) | |
+{ | |
+ u32 *key = (u32 *)addr; | |
+ int index, pos; /* make sure they are int type */ | |
+ | |
+ if (to > from) { | |
+ if (from >= HASH_STRENGTH_FULL) { | |
+ from -= HASH_STRENGTH_FULL; | |
+ to -= HASH_STRENGTH_FULL; | |
+ HASH_FROM_TO(from, to); | |
+ } else if (to <= HASH_STRENGTH_FULL) { | |
+ HASH_FROM_TO(from, to); | |
+ } else { | |
+ HASH_FROM_TO(from, HASH_STRENGTH_FULL); | |
+ HASH_FROM_TO(0, to - HASH_STRENGTH_FULL); | |
+ } | |
+ } else { | |
+ if (from <= HASH_STRENGTH_FULL) { | |
+ HASH_FROM_DOWN_TO(from, to); | |
+ } else if (to >= HASH_STRENGTH_FULL) { | |
+ from -= HASH_STRENGTH_FULL; | |
+ to -= HASH_STRENGTH_FULL; | |
+ HASH_FROM_DOWN_TO(from, to); | |
+ } else { | |
+ HASH_FROM_DOWN_TO(from - HASH_STRENGTH_FULL, 0); | |
+ HASH_FROM_DOWN_TO(HASH_STRENGTH_FULL, to); | |
+ } | |
+ } | |
+ | |
+ return hash; | |
+} | |
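+ | 
+/* | 
+ * Usage sketch (hypothetical variable names): when the hash strength | 
+ * changes from old_strength to hash_strength, an already-known hash value | 
+ * can be updated incrementally instead of resampling the whole page: | 
+ * | 
+ *	addr = kmap_atomic(page); | 
+ *	hash = delta_hash(addr, old_strength, hash_strength, hash); | 
+ *	kunmap_atomic(addr); | 
+ * | 
+ * Only the sample offsets that differ between the two strengths are | 
+ * touched. | 
+ */ | 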
+ | |
+ | |
+ | |
+ | |
+#define CAN_OVERFLOW_U64(x, delta) (U64_MAX - (x) < (delta)) | |
+ | |
+/** | |
+ * | |
+ * Called when: rshash_pos or rshash_neg is about to overflow or a scan round | |
+ * has finished. | |
+ * | |
+ * return 0 if no page has been scanned since last call, 1 otherwise. | |
+ */ | |
+static inline int encode_benefit(void) | |
+{ | |
+ u64 scanned_delta, pos_delta, neg_delta; | |
+ unsigned long base = benefit.base; | |
+ | |
+ scanned_delta = uksm_pages_scanned - uksm_pages_scanned_last; | |
+ | |
+ if (!scanned_delta) | |
+ return 0; | |
+ | |
+ scanned_delta >>= base; | |
+ pos_delta = rshash_pos >> base; | |
+ neg_delta = rshash_neg >> base; | |
+ | |
+ if (CAN_OVERFLOW_U64(benefit.pos, pos_delta) || | |
+ CAN_OVERFLOW_U64(benefit.neg, neg_delta) || | |
+ CAN_OVERFLOW_U64(benefit.scanned, scanned_delta)) { | |
+ benefit.scanned >>= 1; | |
+ benefit.neg >>= 1; | |
+ benefit.pos >>= 1; | |
+ benefit.base++; | |
+ scanned_delta >>= 1; | |
+ pos_delta >>= 1; | |
+ neg_delta >>= 1; | |
+ } | |
+ | |
+ benefit.pos += pos_delta; | |
+ benefit.neg += neg_delta; | |
+ benefit.scanned += scanned_delta; | |
+ | |
+ BUG_ON(!benefit.scanned); | |
+ | |
+ rshash_pos = rshash_neg = 0; | |
+ uksm_pages_scanned_last = uksm_pages_scanned; | |
+ | |
+ return 1; | |
+} | |
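+ | 
+/* | 
+ * Sketch of how the accumulated benefit can be read back (the consumer is | 
+ * presumably the hash strength state machine; the variable names here are | 
+ * hypothetical). Since pos, neg and scanned are all scaled down by the same | 
+ * "base" shift, their ratios remain meaningful: | 
+ * | 
+ *	u64 per_page_pos = div64_u64(benefit.pos, benefit.scanned); | 
+ *	u64 per_page_neg = div64_u64(benefit.neg, benefit.scanned); | 
+ * | 
+ * per_page_pos > per_page_neg suggests the current hash strength saves more | 
+ * time than hash collisions cost. | 
+ */ | 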
+ | |
+static inline void reset_benefit(void) | |
+{ | |
+ benefit.pos = 0; | |
+ benefit.neg = 0; | |
+ benefit.base = 0; | |
+ benefit.scanned = 0; | |
+} | |
+ | |
+static inline void inc_rshash_pos(unsigned long delta) | |
+{ | |
+ if (CAN_OVERFLOW_U64(rshash_pos, delta)) | |
+ encode_benefit(); | |
+ | |
+ rshash_pos += delta; | |
+} | |
+ | |
+static inline void inc_rshash_neg(unsigned long delta) | |
+{ | |
+ if (CAN_OVERFLOW_U64(rshash_neg, delta)) | |
+ encode_benefit(); | |
+ | |
+ rshash_neg += delta; | |
+} | |
+ | |
+ | |
+static inline u32 page_hash(struct page *page, unsigned long hash_strength, | |
+ int cost_accounting) | |
+{ | |
+ u32 val; | |
+ unsigned long delta; | |
+ | |
+ void *addr = kmap_atomic(page); | |
+ | |
+ val = random_sample_hash(addr, hash_strength); | |
+ kunmap_atomic(addr); | |
+ | |
+ if (cost_accounting) { | |
+ if (HASH_STRENGTH_FULL > hash_strength) | |
+ delta = HASH_STRENGTH_FULL - hash_strength; | |
+ else | |
+ delta = 0; | |
+ | |
+ inc_rshash_pos(delta); | |
+ } | |
+ | |
+ return val; | |
+} | |
+ | |
+static int memcmp_pages(struct page *page1, struct page *page2, | |
+ int cost_accounting) | |
+{ | |
+ char *addr1, *addr2; | |
+ int ret; | |
+ | |
+ addr1 = kmap_atomic(page1); | |
+ addr2 = kmap_atomic(page2); | |
+ ret = memcmp(addr1, addr2, PAGE_SIZE); | |
+ kunmap_atomic(addr2); | |
+ kunmap_atomic(addr1); | |
+ | |
+ if (cost_accounting) | |
+ inc_rshash_neg(memcmp_cost); | |
+ | |
+ return ret; | |
+} | |
+ | |
+static inline int pages_identical(struct page *page1, struct page *page2) | |
+{ | |
+ return !memcmp_pages(page1, page2, 0); | |
+} | |
+ | |
+static inline int is_page_full_zero(struct page *page) | |
+{ | |
+ char *addr; | |
+ int ret; | |
+ | |
+ addr = kmap_atomic(page); | |
+ ret = is_full_zero(addr, PAGE_SIZE); | |
+ kunmap_atomic(addr); | |
+ | |
+ return ret; | |
+} | |
+ | |
+static int write_protect_page(struct vm_area_struct *vma, struct page *page, | |
+ pte_t *orig_pte, pte_t *old_pte) | |
+{ | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ unsigned long addr; | |
+ pte_t *ptep; | |
+ spinlock_t *ptl; | |
+ int swapped; | |
+ int err = -EFAULT; | |
+ unsigned long mmun_start; /* For mmu_notifiers */ | |
+ unsigned long mmun_end; /* For mmu_notifiers */ | |
+ | |
+ addr = page_address_in_vma(page, vma); | |
+ if (addr == -EFAULT) | |
+ goto out; | |
+ | |
+ BUG_ON(PageTransCompound(page)); | |
+ | |
+ mmun_start = addr; | |
+ mmun_end = addr + PAGE_SIZE; | |
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); | |
+ | |
+ ptep = page_check_address(page, mm, addr, &ptl, 0); | |
+ if (!ptep) | |
+ goto out_mn; | |
+ | |
+ if (old_pte) | |
+ *old_pte = *ptep; | |
+ | |
+ if (pte_write(*ptep) || pte_dirty(*ptep)) { | |
+ pte_t entry; | |
+ | |
+ swapped = PageSwapCache(page); | |
+ flush_cache_page(vma, addr, page_to_pfn(page)); | |
+ /* | |
+		 * Ok this is tricky: when get_user_pages_fast() runs it doesn't | 
+		 * take any lock, therefore the check that we are going to make | 
+		 * with the pagecount against the mapcount is racy and | 
+		 * O_DIRECT can happen right after the check. | 
+		 * So we clear the pte and flush the tlb before the check; | 
+		 * this assures us that no O_DIRECT can happen after the check | 
+		 * or in the middle of the check. | 
+ */ | |
+ entry = ptep_clear_flush(vma, addr, ptep); | |
+ /* | |
+ * Check that no O_DIRECT or similar I/O is in progress on the | |
+ * page | |
+ */ | |
+ if (page_mapcount(page) + 1 + swapped != page_count(page)) { | |
+ set_pte_at(mm, addr, ptep, entry); | |
+ goto out_unlock; | |
+ } | |
+ if (pte_dirty(entry)) | |
+ set_page_dirty(page); | |
+ entry = pte_mkclean(pte_wrprotect(entry)); | |
+ set_pte_at_notify(mm, addr, ptep, entry); | |
+ } | |
+ *orig_pte = *ptep; | |
+ err = 0; | |
+ | |
+out_unlock: | |
+ pte_unmap_unlock(ptep, ptl); | |
+out_mn: | |
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); | |
+out: | |
+ return err; | |
+} | |
+ | |
+#define MERGE_ERR_PGERR		1 /* the page is invalid, cannot continue */ | 
+#define MERGE_ERR_COLLI 2 /* there is a collision */ | |
+#define MERGE_ERR_COLLI_MAX 3 /* collision at the max hash strength */ | |
+#define MERGE_ERR_CHANGED 4 /* the page has changed since last hash */ | |
+ | |
+ | |
+/** | |
+ * replace_page - replace page in vma by new ksm page | |
+ * @vma: vma that holds the pte pointing to page | |
+ * @page: the page we are replacing by kpage | |
+ * @kpage: the ksm page we replace page by | |
+ * @orig_pte: the original value of the pte | |
+ * | |
+ * Returns 0 on success, MERGE_ERR_PGERR on failure. | |
+ */ | |
+static int replace_page(struct vm_area_struct *vma, struct page *page, | |
+ struct page *kpage, pte_t orig_pte) | |
+{ | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ pgd_t *pgd; | |
+ pud_t *pud; | |
+ pmd_t *pmd; | |
+ pte_t *ptep; | |
+ spinlock_t *ptl; | |
+ pte_t entry; | |
+ | |
+ unsigned long addr; | |
+ int err = MERGE_ERR_PGERR; | |
+ unsigned long mmun_start; /* For mmu_notifiers */ | |
+ unsigned long mmun_end; /* For mmu_notifiers */ | |
+ | |
+ addr = page_address_in_vma(page, vma); | |
+ if (addr == -EFAULT) | |
+ goto out; | |
+ | |
+ pgd = pgd_offset(mm, addr); | |
+ if (!pgd_present(*pgd)) | |
+ goto out; | |
+ | |
+ pud = pud_offset(pgd, addr); | |
+ if (!pud_present(*pud)) | |
+ goto out; | |
+ | |
+ pmd = pmd_offset(pud, addr); | |
+ BUG_ON(pmd_trans_huge(*pmd)); | |
+ if (!pmd_present(*pmd)) | |
+ goto out; | |
+ | |
+ mmun_start = addr; | |
+ mmun_end = addr + PAGE_SIZE; | |
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); | |
+ | |
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); | |
+ if (!pte_same(*ptep, orig_pte)) { | |
+ pte_unmap_unlock(ptep, ptl); | |
+ goto out_mn; | |
+ } | |
+ | |
+ flush_cache_page(vma, addr, pte_pfn(*ptep)); | |
+ ptep_clear_flush(vma, addr, ptep); | |
+ entry = mk_pte(kpage, vma->vm_page_prot); | |
+ | |
+ /* special treatment is needed for zero_page */ | |
+ if ((page_to_pfn(kpage) == uksm_zero_pfn) || | |
+ (page_to_pfn(kpage) == zero_pfn)) | |
+ entry = pte_mkspecial(entry); | |
+ else { | |
+ get_page(kpage); | |
+ page_add_anon_rmap(kpage, vma, addr); | |
+ } | |
+ | |
+ set_pte_at_notify(mm, addr, ptep, entry); | |
+ | |
+ page_remove_rmap(page); | |
+ if (!page_mapped(page)) | |
+ try_to_free_swap(page); | |
+ put_page(page); | |
+ | |
+ pte_unmap_unlock(ptep, ptl); | |
+ err = 0; | |
+out_mn: | |
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); | |
+out: | |
+ return err; | |
+} | |
+ | |
+ | |
+/** | |
+ * Fully hash a page with HASH_STRENGTH_MAX and return a non-zero hash value. | |
+ * A zero hash value at HASH_STRENGTH_MAX is used to indicate that the | |
+ * hash_max member has not been calculated yet. | |
+ * | |
+ * @page: the page that needs to be hashed | |
+ * @hash_old: the hash value calculated with the current hash strength | |
+ * | |
+ * Return: the new hash value calculated at HASH_STRENGTH_MAX | |
+ */ | |
+static inline u32 page_hash_max(struct page *page, u32 hash_old) | |
+{ | |
+ u32 hash_max = 0; | |
+ void *addr; | |
+ | |
+ addr = kmap_atomic(page); | |
+ hash_max = delta_hash(addr, hash_strength, | |
+ HASH_STRENGTH_MAX, hash_old); | |
+ | |
+ kunmap_atomic(addr); | |
+ | |
+ if (!hash_max) | |
+ hash_max = 1; | |
+ | |
+ inc_rshash_neg(HASH_STRENGTH_MAX - hash_strength); | |
+ return hash_max; | |
+} | |
+ | |
+/* | |
+ * We compare the hash again, to ensure that it is really a hash collision | |
+ * instead of being caused by a page write. | |
+ */ | |
+static inline int check_collision(struct rmap_item *rmap_item, | |
+ u32 hash) | |
+{ | |
+ int err; | |
+ struct page *page = rmap_item->page; | |
+ | |
+ /* If this rmap_item has already been hash_maxed, then the collision | |
+ * must appear in the second-level rbtree search. In this case we check | |
+ * whether its hash_max value has changed. Otherwise, the collision | |
+ * happened in the first-level rbtree search, so we check against its | |
+ * current hash value. | |
+ */ | |
+ if (rmap_item->hash_max) { | |
+ inc_rshash_neg(memcmp_cost); | |
+ inc_rshash_neg(HASH_STRENGTH_MAX - hash_strength); | |
+ | |
+ if (rmap_item->hash_max == page_hash_max(page, hash)) | |
+ err = MERGE_ERR_COLLI; | |
+ else | |
+ err = MERGE_ERR_CHANGED; | |
+ } else { | |
+ inc_rshash_neg(memcmp_cost + hash_strength); | |
+ | |
+ if (page_hash(page, hash_strength, 0) == hash) | |
+ err = MERGE_ERR_COLLI; | |
+ else | |
+ err = MERGE_ERR_CHANGED; | |
+ } | |
+ | |
+ return err; | |
+} | |
+ | |
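+/* | |
+ * Return the compound head if @page belongs to an anonymous transparent | |
+ * hugepage, NULL otherwise. | |
+ */ | |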
+static struct page *page_trans_compound_anon(struct page *page) | |
+{ | |
+ if (PageTransCompound(page)) { | |
+ struct page *head = compound_head(page); | |
+ /* | |
+ * The head may actually be split and freed from under | |
+ * us, but that's ok here. | |
+ */ | |
+ if (PageAnon(head)) | |
+ return head; | |
+ } | |
+ return NULL; | |
+} | |
+ | |
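+/* | |
+ * Split the anonymous transparent hugepage that contains @page. Returns 0 if | |
+ * the page is not (or no longer) part of an anon THP or the split succeeded; | |
+ * non-zero means the caller should give up for now and retry later. | |
+ */ | |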
+static int page_trans_compound_anon_split(struct page *page) | |
+{ | |
+ int ret = 0; | |
+ struct page *transhuge_head = page_trans_compound_anon(page); | |
+ if (transhuge_head) { | |
+ /* Get the reference on the head to split it. */ | |
+ if (get_page_unless_zero(transhuge_head)) { | |
+ /* | |
+ * Recheck we got the reference while the head | |
+ * was still anonymous. | |
+ */ | |
+ if (PageAnon(transhuge_head)) | |
+ ret = split_huge_page(transhuge_head); | |
+ else | |
+ /* | |
+ * Retry later if split_huge_page ran | |
+ * from under us. | |
+ */ | |
+ ret = 1; | |
+ put_page(transhuge_head); | |
+ } else | |
+ /* Retry later if split_huge_page ran from under us. */ | |
+ ret = 1; | |
+ } | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * Try to merge an rmap_item->page with a kpage from a stable node. kpage must | |
+ * already be a ksm page. | |
+ * | |
+ * @return 0 if the pages were merged, a MERGE_ERR_* code otherwise. | |
+ */ | |
+static int try_to_merge_with_uksm_page(struct rmap_item *rmap_item, | |
+ struct page *kpage, u32 hash) | |
+{ | |
+ struct vm_area_struct *vma = rmap_item->slot->vma; | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ pte_t orig_pte = __pte(0); | |
+ int err = MERGE_ERR_PGERR; | |
+ struct page *page; | |
+ | |
+ if (uksm_test_exit(mm)) | |
+ goto out; | |
+ | |
+ page = rmap_item->page; | |
+ | |
+ if (page == kpage) { /* ksm page forked */ | |
+ err = 0; | |
+ goto out; | |
+ } | |
+ | |
+ if (PageTransCompound(page) && page_trans_compound_anon_split(page)) | |
+ goto out; | |
+ BUG_ON(PageTransCompound(page)); | |
+ | |
+ if (!PageAnon(page) || !PageKsm(kpage)) | |
+ goto out; | |
+ | |
+ /* | |
+ * We need the page lock to read a stable PageSwapCache in | |
+ * write_protect_page(). We use trylock_page() instead of | |
+ * lock_page() because we don't want to wait here - we | |
+ * prefer to continue scanning and merging different pages, | |
+ * then come back to this page when it is unlocked. | |
+ */ | |
+ if (!trylock_page(page)) | |
+ goto out; | |
+ /* | |
+ * If this anonymous page is mapped only here, its pte may need | |
+ * to be write-protected. If it's mapped elsewhere, all of its | |
+ * ptes are necessarily already write-protected. But in either | |
+ * case, we need to lock and check page_count is not raised. | |
+ */ | |
+ if (write_protect_page(vma, page, &orig_pte, NULL) == 0) { | |
+ if (pages_identical(page, kpage)) | |
+ err = replace_page(vma, page, kpage, orig_pte); | |
+ else | |
+ err = check_collision(rmap_item, hash); | |
+ } | |
+ | |
+ if ((vma->vm_flags & VM_LOCKED) && kpage && !err) { | |
+ munlock_vma_page(page); | |
+ if (!PageMlocked(kpage)) { | |
+ unlock_page(page); | |
+ lock_page(kpage); | |
+ mlock_vma_page(kpage); | |
+ page = kpage; /* for final unlock */ | |
+ } | |
+ } | |
+ | |
+ unlock_page(page); | |
+out: | |
+ return err; | |
+} | |
+ | |
+ | |
+ | |
+/** | |
+ * If two pages fail to merge in try_to_merge_two_pages, then we have a chance | |
+ * to restore a page mapping that has been changed in try_to_merge_two_pages. | |
+ * | |
+ * @return 0 on success. | |
+ */ | |
+static int restore_uksm_page_pte(struct vm_area_struct *vma, unsigned long addr, | |
+ pte_t orig_pte, pte_t wprt_pte) | |
+{ | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ pgd_t *pgd; | |
+ pud_t *pud; | |
+ pmd_t *pmd; | |
+ pte_t *ptep; | |
+ spinlock_t *ptl; | |
+ | |
+ int err = -EFAULT; | |
+ | |
+ pgd = pgd_offset(mm, addr); | |
+ if (!pgd_present(*pgd)) | |
+ goto out; | |
+ | |
+ pud = pud_offset(pgd, addr); | |
+ if (!pud_present(*pud)) | |
+ goto out; | |
+ | |
+ pmd = pmd_offset(pud, addr); | |
+ if (!pmd_present(*pmd)) | |
+ goto out; | |
+ | |
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); | |
+ if (!pte_same(*ptep, wprt_pte)) { | |
+ /* already copied, let it be */ | |
+ pte_unmap_unlock(ptep, ptl); | |
+ goto out; | |
+ } | |
+ | |
+ /* | |
+ * Good, the ksm page is still here. As long as we hold the ksm page, | |
+ * it cannot return to the free page pool, so there is no way a pte | |
+ * could have been changed to another page and then back to this one. | |
+ * Remember also that ksm pages are never reused in do_wp_page(). | |
+ * So it's safe to restore the original pte. | |
+ */ | |
+ flush_cache_page(vma, addr, pte_pfn(*ptep)); | |
+ ptep_clear_flush(vma, addr, ptep); | |
+ set_pte_at_notify(mm, addr, ptep, orig_pte); | |
+ | |
+ pte_unmap_unlock(ptep, ptl); | |
+ err = 0; | |
+out: | |
+ return err; | |
+} | |
+ | |
+/** | |
+ * try_to_merge_two_pages() - take two identical pages and prepare | |
+ * them to be merged into one page (rmap_item->page) | |
+ * | |
+ * @return 0 if we successfully merged two identical pages into | |
+ * one ksm page. MERGE_ERR_COLLI if it is only a hash collision | |
+ * found in the rbtree search. MERGE_ERR_CHANGED if the rmap_item | |
+ * has changed since it was hashed. MERGE_ERR_PGERR otherwise. | |
+ */ | |
+static int try_to_merge_two_pages(struct rmap_item *rmap_item, | |
+ struct rmap_item *tree_rmap_item, | |
+ u32 hash) | |
+{ | |
+ pte_t orig_pte1 = __pte(0), orig_pte2 = __pte(0); | |
+ pte_t wprt_pte1 = __pte(0), wprt_pte2 = __pte(0); | |
+ struct vm_area_struct *vma1 = rmap_item->slot->vma; | |
+ struct vm_area_struct *vma2 = tree_rmap_item->slot->vma; | |
+ struct page *page = rmap_item->page; | |
+ struct page *tree_page = tree_rmap_item->page; | |
+ int err = MERGE_ERR_PGERR; | |
+ struct address_space *saved_mapping; | |
+ | |
+ | |
+ if (rmap_item->page == tree_rmap_item->page) | |
+ goto out; | |
+ | |
+ if (PageTransCompound(page) && page_trans_compound_anon_split(page)) | |
+ goto out; | |
+ BUG_ON(PageTransCompound(page)); | |
+ | |
+ if (PageTransCompound(tree_page) && page_trans_compound_anon_split(tree_page)) | |
+ goto out; | |
+ BUG_ON(PageTransCompound(tree_page)); | |
+ | |
+ if (!PageAnon(page) || !PageAnon(tree_page)) | |
+ goto out; | |
+ | |
+ if (!trylock_page(page)) | |
+ goto out; | |
+ | |
+ | |
+ if (write_protect_page(vma1, page, &wprt_pte1, &orig_pte1) != 0) { | |
+ unlock_page(page); | |
+ goto out; | |
+ } | |
+ | |
+ /* | |
+ * While we hold page lock, upgrade page from | |
+ * PageAnon+anon_vma to PageKsm+NULL stable_node: | |
+ * stable_tree_insert() will update stable_node. | |
+ */ | |
+ saved_mapping = page->mapping; | |
+ set_page_stable_node(page, NULL); | |
+ mark_page_accessed(page); | |
+ unlock_page(page); | |
+ | |
+ if (!trylock_page(tree_page)) | |
+ goto restore_out; | |
+ | |
+ if (write_protect_page(vma2, tree_page, &wprt_pte2, &orig_pte2) != 0) { | |
+ unlock_page(tree_page); | |
+ goto restore_out; | |
+ } | |
+ | |
+ if (pages_identical(page, tree_page)) { | |
+ err = replace_page(vma2, tree_page, page, wprt_pte2); | |
+ if (err) { | |
+ unlock_page(tree_page); | |
+ goto restore_out; | |
+ } | |
+ | |
+ if ((vma2->vm_flags & VM_LOCKED)) { | |
+ munlock_vma_page(tree_page); | |
+ if (!PageMlocked(page)) { | |
+ unlock_page(tree_page); | |
+ lock_page(page); | |
+ mlock_vma_page(page); | |
+ tree_page = page; /* for final unlock */ | |
+ } | |
+ } | |
+ | |
+ unlock_page(tree_page); | |
+ | |
+ goto out; /* success */ | |
+ | |
+ } else { | |
+ if (tree_rmap_item->hash_max && | |
+ tree_rmap_item->hash_max == rmap_item->hash_max) { | |
+ err = MERGE_ERR_COLLI_MAX; | |
+ } else if (page_hash(page, hash_strength, 0) == | |
+ page_hash(tree_page, hash_strength, 0)) { | |
+ inc_rshash_neg(memcmp_cost + hash_strength * 2); | |
+ err = MERGE_ERR_COLLI; | |
+ } else { | |
+ err = MERGE_ERR_CHANGED; | |
+ } | |
+ | |
+ unlock_page(tree_page); | |
+ } | |
+ | |
+restore_out: | |
+ lock_page(page); | |
+ if (!restore_uksm_page_pte(vma1, get_rmap_addr(rmap_item), | |
+ orig_pte1, wprt_pte1)) | |
+ page->mapping = saved_mapping; | |
+ | |
+ unlock_page(page); | |
+out: | |
+ return err; | |
+} | |
+ | |
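+/* Three-way compare of two hash values for rbtree ordering: 1, -1 or 0. */ | |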
+static inline int hash_cmp(u32 new_val, u32 node_val) | |
+{ | |
+ if (new_val > node_val) | |
+ return 1; | |
+ else if (new_val < node_val) | |
+ return -1; | |
+ else | |
+ return 0; | |
+} | |
+ | |
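+/* | |
+ * Return the rmap_item's max-strength hash, computing and caching it on | |
+ * first use. | |
+ */ | |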
+static inline u32 rmap_item_hash_max(struct rmap_item *item, u32 hash) | |
+{ | |
+ u32 hash_max = item->hash_max; | |
+ | |
+ if (!hash_max) { | |
+ hash_max = page_hash_max(item->page, hash); | |
+ | |
+ item->hash_max = hash_max; | |
+ } | |
+ | |
+ return hash_max; | |
+} | |
+ | |
+ | |
+ | |
+/** | |
+ * stable_tree_search() - search the stable tree for a page | |
+ * | |
+ * @item: the rmap_item we are comparing with | |
+ * @hash: the hash value of this item->page already calculated | |
+ * | |
+ * @return the page we have found, NULL otherwise. A reference on the | |
+ * returned page has already been taken. | |
+ */ | |
+static struct page *stable_tree_search(struct rmap_item *item, u32 hash) | |
+{ | |
+ struct rb_node *node = root_stable_treep->rb_node; | |
+ struct tree_node *tree_node; | |
+ unsigned long hash_max; | |
+ struct page *page = item->page; | |
+ struct stable_node *stable_node; | |
+ | |
+ stable_node = page_stable_node(page); | |
+ if (stable_node) { | |
+ /* The ksm page was forked, that is | |
+ * if (PageKsm(page) && !in_stable_tree(rmap_item)); | |
+ * a reference has actually been taken once outside already. | |
+ */ | |
+ get_page(page); | |
+ return page; | |
+ } | |
+ | |
+ while (node) { | |
+ int cmp; | |
+ | |
+ tree_node = rb_entry(node, struct tree_node, node); | |
+ | |
+ cmp = hash_cmp(hash, tree_node->hash); | |
+ | |
+ if (cmp < 0) | |
+ node = node->rb_left; | |
+ else if (cmp > 0) | |
+ node = node->rb_right; | |
+ else | |
+ break; | |
+ } | |
+ | |
+ if (!node) | |
+ return NULL; | |
+ | |
+ if (tree_node->count == 1) { | |
+ stable_node = rb_entry(tree_node->sub_root.rb_node, | |
+ struct stable_node, node); | |
+ BUG_ON(!stable_node); | |
+ | |
+ goto get_page_out; | |
+ } | |
+ | |
+ /* | |
+ * Ok, we have to search the second-level | |
+ * subtree; hash the page at full | |
+ * strength. | |
+ */ | |
+ node = tree_node->sub_root.rb_node; | |
+ BUG_ON(!node); | |
+ hash_max = rmap_item_hash_max(item, hash); | |
+ | |
+ while (node) { | |
+ int cmp; | |
+ | |
+ stable_node = rb_entry(node, struct stable_node, node); | |
+ | |
+ cmp = hash_cmp(hash_max, stable_node->hash_max); | |
+ | |
+ if (cmp < 0) | |
+ node = node->rb_left; | |
+ else if (cmp > 0) | |
+ node = node->rb_right; | |
+ else | |
+ goto get_page_out; | |
+ } | |
+ | |
+ return NULL; | |
+ | |
+get_page_out: | |
+ page = get_uksm_page(stable_node, 1, 1); | |
+ return page; | |
+} | |
+ | |
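+/* | |
+ * Remap one rmap_item's pte from kpage to tree_page, provided the pte still | |
+ * points at kpage and is still write-protected. Returns 1 on success, 0 if | |
+ * the mapping has changed. | |
+ */ | |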
+static int try_merge_rmap_item(struct rmap_item *item, | |
+ struct page *kpage, | |
+ struct page *tree_page) | |
+{ | |
+ spinlock_t *ptl; | |
+ pte_t *ptep; | |
+ unsigned long addr; | |
+ struct vm_area_struct *vma = item->slot->vma; | |
+ | |
+ addr = get_rmap_addr(item); | |
+ ptep = page_check_address(kpage, vma->vm_mm, addr, &ptl, 0); | |
+ if (!ptep) | |
+ return 0; | |
+ | |
+ if (pte_write(*ptep)) { | |
+ /* has changed, abort! */ | |
+ pte_unmap_unlock(ptep, ptl); | |
+ return 0; | |
+ } | |
+ | |
+ get_page(tree_page); | |
+ page_add_anon_rmap(tree_page, vma, addr); | |
+ | |
+ flush_cache_page(vma, addr, pte_pfn(*ptep)); | |
+ ptep_clear_flush(vma, addr, ptep); | |
+ set_pte_at_notify(vma->vm_mm, addr, ptep, | |
+ mk_pte(tree_page, vma->vm_page_prot)); | |
+ | |
+ page_remove_rmap(kpage); | |
+ put_page(kpage); | |
+ | |
+ pte_unmap_unlock(ptep, ptl); | |
+ | |
+ return 1; | |
+} | |
+ | |
+/** | |
+ * try_merge_with_stable() - when two rmap_items need to be inserted into the | |
+ * stable tree and their page was found to be identical to a stable ksm page, | |
+ * this is the last chance we have to merge them into one. | |
+ * | |
+ * @item1: the rmap_item holding the page which we want to insert | |
+ * into the stable tree. | |
+ * @item2: the other rmap_item we found in the unstable tree search | |
+ * @kpage: the page currently mapped by the two rmap_items | |
+ * @tree_page: the identical page we found in the stable tree node | |
+ * @success1: reports whether item1 was successfully merged | |
+ * @success2: reports whether item2 was successfully merged | |
+ */ | |
+static void try_merge_with_stable(struct rmap_item *item1, | |
+ struct rmap_item *item2, | |
+ struct page **kpage, | |
+ struct page *tree_page, | |
+ int *success1, int *success2) | |
+{ | |
+ struct vm_area_struct *vma1 = item1->slot->vma; | |
+ struct vm_area_struct *vma2 = item2->slot->vma; | |
+ *success1 = 0; | |
+ *success2 = 0; | |
+ | |
+ if (unlikely(*kpage == tree_page)) { | |
+ /* I don't think this can really happen */ | |
+ printk(KERN_WARNING "UKSM: unexpected condition detected in " | |
+ "try_merge_with_stable() -- *kpage == tree_page !\n"); | |
+ *success1 = 1; | |
+ *success2 = 1; | |
+ return; | |
+ } | |
+ | |
+ if (!PageAnon(*kpage) || !PageKsm(*kpage)) | |
+ goto failed; | |
+ | |
+ if (!trylock_page(tree_page)) | |
+ goto failed; | |
+ | |
+ /* If the old page is still a ksm page, still pointed | |
+ * to from the right place, and still write protected, | |
+ * we are confident it has not changed, so no memcmp | |
+ * is needed anymore. | |
+ * Beware: we cannot take nested pte locks, there is a | |
+ * deadlock risk. | |
+ */ | |
+ if (!try_merge_rmap_item(item1, *kpage, tree_page)) | |
+ goto unlock_failed; | |
+ | |
+ /* ok, now vma2; remember that pte1 has already been set */ | |
+ if (!try_merge_rmap_item(item2, *kpage, tree_page)) | |
+ goto success_1; | |
+ | |
+ *success2 = 1; | |
+success_1: | |
+ *success1 = 1; | |
+ | |
+ | |
+ if ((*success1 && vma1->vm_flags & VM_LOCKED) || | |
+ (*success2 && vma2->vm_flags & VM_LOCKED)) { | |
+ munlock_vma_page(*kpage); | |
+ if (!PageMlocked(tree_page)) | |
+ mlock_vma_page(tree_page); | |
+ } | |
+ | |
+ /* | |
+ * We do not need the old page any more in the caller, so we can | |
+ * release its lock now. | |
+ */ | |
+ unlock_page(*kpage); | |
+ *kpage = tree_page; /* Get unlocked outside. */ | |
+ return; | |
+ | |
+unlock_failed: | |
+ unlock_page(tree_page); | |
+failed: | |
+ return; | |
+} | |
+ | |
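+/* Compute and cache the stable node's max-strength hash on first use. */ | |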
+static inline void stable_node_hash_max(struct stable_node *node, | |
+ struct page *page, u32 hash) | |
+{ | |
+ u32 hash_max = node->hash_max; | |
+ | |
+ if (!hash_max) { | |
+ hash_max = page_hash_max(page, hash); | |
+ node->hash_max = hash_max; | |
+ } | |
+} | |
+ | |
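+/* | |
+ * Allocate a stable_node for kpage: record its pfn, max-strength hash and | |
+ * owning tree_node, and mark kpage as a stable ksm page. | |
+ */ | |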
+static inline | |
+struct stable_node *new_stable_node(struct tree_node *tree_node, | |
+ struct page *kpage, u32 hash_max) | |
+{ | |
+ struct stable_node *new_stable_node; | |
+ | |
+ new_stable_node = alloc_stable_node(); | |
+ if (!new_stable_node) | |
+ return NULL; | |
+ | |
+ new_stable_node->kpfn = page_to_pfn(kpage); | |
+ new_stable_node->hash_max = hash_max; | |
+ new_stable_node->tree_node = tree_node; | |
+ set_page_stable_node(kpage, new_stable_node); | |
+ | |
+ return new_stable_node; | |
+} | |
+ | |
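+/* | |
+ * Insert under a tree_node that currently holds a single stable_node: either | |
+ * merge with that node's page, or start a second-level sub-tree keyed by the | |
+ * max-strength hash. | |
+ */ | |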
+static inline | |
+struct stable_node *first_level_insert(struct tree_node *tree_node, | |
+ struct rmap_item *rmap_item, | |
+ struct rmap_item *tree_rmap_item, | |
+ struct page **kpage, u32 hash, | |
+ int *success1, int *success2) | |
+{ | |
+ int cmp; | |
+ struct page *tree_page; | |
+ u32 hash_max = 0; | |
+ struct stable_node *stable_node, *new_snode; | |
+ struct rb_node *parent = NULL, **new; | |
+ | |
+ /* this tree node contains no sub-tree yet */ | |
+ stable_node = rb_entry(tree_node->sub_root.rb_node, | |
+ struct stable_node, node); | |
+ | |
+ tree_page = get_uksm_page(stable_node, 1, 0); | |
+ if (tree_page) { | |
+ cmp = memcmp_pages(*kpage, tree_page, 1); | |
+ if (!cmp) { | |
+ try_merge_with_stable(rmap_item, tree_rmap_item, kpage, | |
+ tree_page, success1, success2); | |
+ put_page(tree_page); | |
+ if (!*success1 && !*success2) | |
+ goto failed; | |
+ | |
+ return stable_node; | |
+ | |
+ } else { | |
+ /* | |
+ * Collision at the first level; try to create a subtree. | |
+ * A new node needs to be created. | |
+ */ | |
+ put_page(tree_page); | |
+ | |
+ stable_node_hash_max(stable_node, tree_page, | |
+ tree_node->hash); | |
+ hash_max = rmap_item_hash_max(rmap_item, hash); | |
+ cmp = hash_cmp(hash_max, stable_node->hash_max); | |
+ | |
+ parent = &stable_node->node; | |
+ if (cmp < 0) { | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ new = &parent->rb_right; | |
+ } else { | |
+ goto failed; | |
+ } | |
+ } | |
+ | |
+ } else { | |
+ /* the only stable_node was deleted, so we reuse its tree_node. | |
+ */ | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ } | |
+ | |
+ new_snode = new_stable_node(tree_node, *kpage, hash_max); | |
+ if (!new_snode) | |
+ goto failed; | |
+ | |
+ rb_link_node(&new_snode->node, parent, new); | |
+ rb_insert_color(&new_snode->node, &tree_node->sub_root); | |
+ tree_node->count++; | |
+ *success1 = *success2 = 1; | |
+ | |
+ return new_snode; | |
+ | |
+failed: | |
+ return NULL; | |
+} | |
+ | |
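+/* | |
+ * Insert under a tree_node whose second-level sub-tree (keyed by the | |
+ * max-strength hash) already holds more than one stable_node. | |
+ */ | |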
+static inline | |
+struct stable_node *stable_subtree_insert(struct tree_node *tree_node, | |
+ struct rmap_item *rmap_item, | |
+ struct rmap_item *tree_rmap_item, | |
+ struct page **kpage, u32 hash, | |
+ int *success1, int *success2) | |
+{ | |
+ struct page *tree_page; | |
+ u32 hash_max; | |
+ struct stable_node *stable_node, *new_snode; | |
+ struct rb_node *parent, **new; | |
+ | |
+research: | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ BUG_ON(!*new); | |
+ hash_max = rmap_item_hash_max(rmap_item, hash); | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ stable_node = rb_entry(*new, struct stable_node, node); | |
+ | |
+ cmp = hash_cmp(hash_max, stable_node->hash_max); | |
+ | |
+ if (cmp < 0) { | |
+ parent = *new; | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ parent = *new; | |
+ new = &parent->rb_right; | |
+ } else { | |
+ tree_page = get_uksm_page(stable_node, 1, 0); | |
+ if (tree_page) { | |
+ cmp = memcmp_pages(*kpage, tree_page, 1); | |
+ if (!cmp) { | |
+ try_merge_with_stable(rmap_item, | |
+ tree_rmap_item, kpage, | |
+ tree_page, success1, success2); | |
+ | |
+ put_page(tree_page); | |
+ if (!*success1 && !*success2) | |
+ goto failed; | |
+ /* | |
+ * successfully merged with a stable | |
+ * node | |
+ */ | |
+ return stable_node; | |
+ } else { | |
+ put_page(tree_page); | |
+ goto failed; | |
+ } | |
+ } else { | |
+ /* | |
+ * The stable node may have been | |
+ * deleted and the subtree may | |
+ * have been restructured; we | |
+ * cannot continue, so restart | |
+ * the search. | |
+ */ | |
+ if (tree_node->count) { | |
+ goto research; | |
+ } else { | |
+ /* reuse the tree node*/ | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ } | |
+ } | |
+ } | |
+ } | |
+ | |
+ new_snode = new_stable_node(tree_node, *kpage, hash_max); | |
+ if (!new_snode) | |
+ goto failed; | |
+ | |
+ rb_link_node(&new_snode->node, parent, new); | |
+ rb_insert_color(&new_snode->node, &tree_node->sub_root); | |
+ tree_node->count++; | |
+ *success1 = *success2 = 1; | |
+ | |
+ return new_snode; | |
+ | |
+failed: | |
+ return NULL; | |
+} | |
+ | |
+ | |
+/** | |
+ * stable_tree_insert() - try to insert a page merged in the unstable tree | |
+ * into the stable tree | |
+ * | |
+ * @kpage: the page that needs to be inserted | |
+ * @hash: the current hash of this page | |
+ * @rmap_item: the rmap_item being scanned | |
+ * @tree_rmap_item: the rmap_item found in the unstable tree | |
+ * @success1: reports whether rmap_item was merged | |
+ * @success2: reports whether tree_rmap_item was merged | |
+ * | |
+ * @return the stable_node in the stable tree if at least one | |
+ * rmap_item was inserted into the stable tree, NULL | |
+ * otherwise. | |
+ */ | |
+static struct stable_node * | |
+stable_tree_insert(struct page **kpage, u32 hash, | |
+ struct rmap_item *rmap_item, | |
+ struct rmap_item *tree_rmap_item, | |
+ int *success1, int *success2) | |
+{ | |
+ struct rb_node **new = &root_stable_treep->rb_node; | |
+ struct rb_node *parent = NULL; | |
+ struct stable_node *stable_node; | |
+ struct tree_node *tree_node; | |
+ u32 hash_max = 0; | |
+ | |
+ *success1 = *success2 = 0; | |
+ | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ tree_node = rb_entry(*new, struct tree_node, node); | |
+ | |
+ cmp = hash_cmp(hash, tree_node->hash); | |
+ | |
+ if (cmp < 0) { | |
+ parent = *new; | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ parent = *new; | |
+ new = &parent->rb_right; | |
+ } else | |
+ break; | |
+ } | |
+ | |
+ if (*new) { | |
+ if (tree_node->count == 1) { | |
+ stable_node = first_level_insert(tree_node, rmap_item, | |
+ tree_rmap_item, kpage, | |
+ hash, success1, success2); | |
+ } else { | |
+ stable_node = stable_subtree_insert(tree_node, | |
+ rmap_item, tree_rmap_item, kpage, | |
+ hash, success1, success2); | |
+ } | |
+ } else { | |
+ | |
+ /* no tree node found */ | |
+ tree_node = alloc_tree_node(stable_tree_node_listp); | |
+ if (!tree_node) { | |
+ stable_node = NULL; | |
+ goto out; | |
+ } | |
+ | |
+ stable_node = new_stable_node(tree_node, *kpage, hash_max); | |
+ if (!stable_node) { | |
+ free_tree_node(tree_node); | |
+ goto out; | |
+ } | |
+ | |
+ tree_node->hash = hash; | |
+ rb_link_node(&tree_node->node, parent, new); | |
+ rb_insert_color(&tree_node->node, root_stable_treep); | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ | |
+ rb_link_node(&stable_node->node, parent, new); | |
+ rb_insert_color(&stable_node->node, &tree_node->sub_root); | |
+ tree_node->count++; | |
+ *success1 = *success2 = 1; | |
+ } | |
+ | |
+out: | |
+ return stable_node; | |
+} | |
+ | |
+ | |
+/** | |
+ * get_tree_rmap_item_page() - try to get the page and lock the mmap_sem | |
+ * | |
+ * @return 0 on success, -EBUSY if unable to lock the mmap_sem, | |
+ * -EINVAL if the page mapping has been changed. | |
+ */ | |
+static inline int get_tree_rmap_item_page(struct rmap_item *tree_rmap_item) | |
+{ | |
+ int err; | |
+ | |
+ err = get_mergeable_page_lock_mmap(tree_rmap_item); | |
+ | |
+ if (err == -EINVAL) { | |
+ /* its page mapping has been changed, remove it */ | |
+ remove_rmap_item_from_tree(tree_rmap_item); | |
+ } | |
+ | |
+ /* On success, the page has been pinned and mmap_sem is locked now. */ | |
+ return err; | |
+} | |
+ | |
+ | |
+/** | |
+ * unstable_tree_search_insert() - search the unstable tree for an rmap_item | |
+ * with the same hash value; get its page and trylock its mmap_sem. If none | |
+ * is found, insert this rmap_item into the unstable tree. | |
+ */ | |
+static inline | |
+struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item, | |
+ u32 hash) | |
+ | |
+{ | |
+ struct rb_node **new = &root_unstable_tree.rb_node; | |
+ struct rb_node *parent = NULL; | |
+ struct tree_node *tree_node; | |
+ u32 hash_max; | |
+ struct rmap_item *tree_rmap_item; | |
+ | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ tree_node = rb_entry(*new, struct tree_node, node); | |
+ | |
+ cmp = hash_cmp(hash, tree_node->hash); | |
+ | |
+ if (cmp < 0) { | |
+ parent = *new; | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ parent = *new; | |
+ new = &parent->rb_right; | |
+ } else | |
+ break; | |
+ } | |
+ | |
+ if (*new) { | |
+ /* got the tree_node */ | |
+ if (tree_node->count == 1) { | |
+ tree_rmap_item = rb_entry(tree_node->sub_root.rb_node, | |
+ struct rmap_item, node); | |
+ BUG_ON(!tree_rmap_item); | |
+ | |
+ goto get_page_out; | |
+ } | |
+ | |
+ /* well, search the collision subtree */ | |
+ new = &tree_node->sub_root.rb_node; | |
+ BUG_ON(!*new); | |
+ hash_max = rmap_item_hash_max(rmap_item, hash); | |
+ | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ tree_rmap_item = rb_entry(*new, struct rmap_item, | |
+ node); | |
+ | |
+ cmp = hash_cmp(hash_max, tree_rmap_item->hash_max); | |
+ parent = *new; | |
+ if (cmp < 0) | |
+ new = &parent->rb_left; | |
+ else if (cmp > 0) | |
+ new = &parent->rb_right; | |
+ else | |
+ goto get_page_out; | |
+ } | |
+ } else { | |
+ /* alloc a new tree_node */ | |
+ tree_node = alloc_tree_node(&unstable_tree_node_list); | |
+ if (!tree_node) | |
+ return NULL; | |
+ | |
+ tree_node->hash = hash; | |
+ rb_link_node(&tree_node->node, parent, new); | |
+ rb_insert_color(&tree_node->node, &root_unstable_tree); | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ } | |
+ | |
+ /* not found even in the sub-tree, so insert this rmap_item */ | |
+ rmap_item->tree_node = tree_node; | |
+ rmap_item->address |= UNSTABLE_FLAG; | |
+ rmap_item->hash_round = uksm_hash_round; | |
+ rb_link_node(&rmap_item->node, parent, new); | |
+ rb_insert_color(&rmap_item->node, &tree_node->sub_root); | |
+ | |
+ uksm_pages_unshared++; | |
+ return NULL; | |
+ | |
+get_page_out: | |
+ if (tree_rmap_item->page == rmap_item->page) | |
+ return NULL; | |
+ | |
+ if (get_tree_rmap_item_page(tree_rmap_item)) | |
+ return NULL; | |
+ | |
+ return tree_rmap_item; | |
+} | |
+ | |
+static void hold_anon_vma(struct rmap_item *rmap_item, | |
+ struct anon_vma *anon_vma) | |
+{ | |
+ rmap_item->anon_vma = anon_vma; | |
+ get_anon_vma(anon_vma); | |
+} | |
+ | |
+ | |
+/** | |
+ * stable_tree_append() - append an rmap_item to a stable node. Deduplication | |
+ * ratio statistics are also updated in this function. | |
+ */ | |
+static void stable_tree_append(struct rmap_item *rmap_item, | |
+ struct stable_node *stable_node, int logdedup) | |
+{ | |
+ struct node_vma *node_vma = NULL, *new_node_vma, *node_vma_cont = NULL; | |
+ unsigned long key = (unsigned long)rmap_item->slot; | |
+ unsigned long factor = rmap_item->slot->rung->step; | |
+ | |
+ BUG_ON(!stable_node); | |
+ rmap_item->address |= STABLE_FLAG; | |
+ | |
+ if (hlist_empty(&stable_node->hlist)) { | |
+ uksm_pages_shared++; | |
+ goto node_vma_new; | |
+ } else { | |
+ uksm_pages_sharing++; | |
+ } | |
+ | |
+ hlist_for_each_entry(node_vma, &stable_node->hlist, hlist) { | |
+ if (node_vma->key >= key) | |
+ break; | |
+ | |
+ if (logdedup) { | |
+ node_vma->slot->pages_bemerged += factor; | |
+ if (list_empty(&node_vma->slot->dedup_list)) | |
+ list_add(&node_vma->slot->dedup_list, | |
+ &vma_slot_dedup); | |
+ } | |
+ } | |
+ | |
+ if (node_vma) { | |
+ if (node_vma->key == key) { | |
+ node_vma_cont = hlist_entry_safe(node_vma->hlist.next, struct node_vma, hlist); | |
+ goto node_vma_ok; | |
+ } else if (node_vma->key > key) { | |
+ node_vma_cont = node_vma; | |
+ } | |
+ } | |
+ | |
+node_vma_new: | |
+ /* no same vma already in node, alloc a new node_vma */ | |
+ new_node_vma = alloc_node_vma(); | |
+ BUG_ON(!new_node_vma); | |
+ new_node_vma->head = stable_node; | |
+ new_node_vma->slot = rmap_item->slot; | |
+ | |
+ if (!node_vma) { | |
+ hlist_add_head(&new_node_vma->hlist, &stable_node->hlist); | |
+ } else if (node_vma->key != key) { | |
+ if (node_vma->key < key) | |
+ hlist_add_behind(&new_node_vma->hlist, &node_vma->hlist); | |
+ else { | |
+ hlist_add_before(&new_node_vma->hlist, | |
+ &node_vma->hlist); | |
+ } | |
+ | |
+ } | |
+ node_vma = new_node_vma; | |
+ | |
+node_vma_ok: /* ok, ready to add to the list */ | |
+ rmap_item->head = node_vma; | |
+ hlist_add_head(&rmap_item->hlist, &node_vma->rmap_hlist); | |
+ hold_anon_vma(rmap_item, rmap_item->slot->vma->anon_vma); | |
+ if (logdedup) { | |
+ rmap_item->slot->pages_merged++; | |
+ if (node_vma_cont) { | |
+ node_vma = node_vma_cont; | |
+ hlist_for_each_entry_continue(node_vma, hlist) { | |
+ node_vma->slot->pages_bemerged += factor; | |
+ if (list_empty(&node_vma->slot->dedup_list)) | |
+ list_add(&node_vma->slot->dedup_list, | |
+ &vma_slot_dedup); | |
+ } | |
+ } | |
+ } | |
+} | |
+ | |
+/* | |
+ * We use break_ksm to break COW on a ksm page: it's a stripped down | |
+ * | |
+ * if (get_user_pages(current, mm, addr, 1, 1, 1, &page, NULL) == 1) | |
+ * put_page(page); | |
+ * | |
+ * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma, | |
+ * in case the application has unmapped and remapped mm,addr meanwhile. | |
+ * Could a ksm page appear anywhere else? Actually yes, in a VM_PFNMAP | |
+ * mmap of /dev/mem or /dev/kmem, where we would not want to touch it. | |
+ */ | |
+static int break_ksm(struct vm_area_struct *vma, unsigned long addr) | |
+{ | |
+ struct page *page; | |
+ int ret = 0; | |
+ | |
+ do { | |
+ cond_resched(); | |
+ page = follow_page(vma, addr, FOLL_GET); | |
+ if (IS_ERR_OR_NULL(page)) | |
+ break; | |
+ if (PageKsm(page)) { | |
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, | |
+ FAULT_FLAG_WRITE); | |
+ } else | |
+ ret = VM_FAULT_WRITE; | |
+ put_page(page); | |
+ } while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_OOM))); | |
+ /* | |
+ * We must loop because handle_mm_fault() may back out if there's | |
+ * any difficulty e.g. if pte accessed bit gets updated concurrently. | |
+ * | |
+ * VM_FAULT_WRITE is what we have been hoping for: it indicates that | |
+ * COW has been broken, even if the vma does not permit VM_WRITE; | |
+ * but note that a concurrent fault might break PageKsm for us. | |
+ * | |
+ * VM_FAULT_SIGBUS could occur if we race with truncation of the | |
+ * backing file, which also invalidates anonymous pages: that's | |
+ * okay, that truncation will have unmapped the PageKsm for us. | |
+ * | |
+ * VM_FAULT_OOM: at the time of writing (late July 2009), setting | |
+ * aside mem_cgroup limits, VM_FAULT_OOM would only be set if the | |
+ * current task has TIF_MEMDIE set, and will be OOM killed on return | |
+ * to user; and ksmd, having no mm, would never be chosen for that. | |
+ * | |
+ * But if the mm is in a limited mem_cgroup, then the fault may fail | |
+ * with VM_FAULT_OOM even if the current task is not TIF_MEMDIE; and | |
+ * even ksmd can fail in this way - though it's usually breaking ksm | |
+ * just to undo a merge it made a moment before, so unlikely to oom. | |
+ * | |
+ * That's a pity: we might therefore have more kernel pages allocated | |
+ * than we're counting as nodes in the stable tree; but uksm_do_scan | |
+ * will retry to break_cow on each pass, so should recover the page | |
+ * in due course. The important thing is to not let VM_MERGEABLE | |
+ * be cleared while any such pages might remain in the area. | |
+ */ | |
+ return (ret & VM_FAULT_OOM) ? -ENOMEM : 0; | |
+} | |
+ | |
+static void break_cow(struct rmap_item *rmap_item) | |
+{ | |
+ struct vm_area_struct *vma = rmap_item->slot->vma; | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ unsigned long addr = get_rmap_addr(rmap_item); | |
+ | |
+ if (uksm_test_exit(mm)) | |
+ goto out; | |
+ | |
+ break_ksm(vma, addr); | |
+out: | |
+ return; | |
+} | |
+ | |
+/* | |
+ * Though it's very tempting to unmerge in_stable_tree(rmap_item)s rather | |
+ * than check every pte of a given vma, the locking doesn't quite work for | |
+ * that - an rmap_item is assigned to the stable tree after inserting ksm | |
+ * page and upping mmap_sem. Nor does it fit with the way we skip dup'ing | |
+ * rmap_items from parent to child at fork time (so as not to waste time | |
+ * if exit comes before the next scan reaches it). | |
+ * | |
+ * Similarly, although we'd like to remove rmap_items (so updating counts | |
+ * and freeing memory) when unmerging an area, it's easier to leave that | |
+ * to the next pass of ksmd - consider, for example, how ksmd might be | |
+ * in cmp_and_merge_page on one of the rmap_items we would be removing. | |
+ */ | |
+inline int unmerge_uksm_pages(struct vm_area_struct *vma, | |
+ unsigned long start, unsigned long end) | |
+{ | |
+ unsigned long addr; | |
+ int err = 0; | |
+ | |
+ for (addr = start; addr < end && !err; addr += PAGE_SIZE) { | |
+ if (uksm_test_exit(vma->vm_mm)) | |
+ break; | |
+ if (signal_pending(current)) | |
+ err = -ERESTARTSYS; | |
+ else | |
+ err = break_ksm(vma, addr); | |
+ } | |
+ return err; | |
+} | |
+ | |
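+/* | |
+ * Bump the scanned-pages counter. When it saturates, fold it into the scaled | |
+ * accumulator (pages_scanned_stored at scale pages_scanned_base) and restart | |
+ * the counter from zero. | |
+ */ | |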
+static inline void inc_uksm_pages_scanned(void) | |
+{ | |
+ u64 delta; | |
+ | |
+ | |
+ if (uksm_pages_scanned == U64_MAX) { | |
+ encode_benefit(); | |
+ | |
+ delta = uksm_pages_scanned >> pages_scanned_base; | |
+ | |
+ if (CAN_OVERFLOW_U64(pages_scanned_stored, delta)) { | |
+ pages_scanned_stored >>= 1; | |
+ delta >>= 1; | |
+ pages_scanned_base++; | |
+ } | |
+ | |
+ pages_scanned_stored += delta; | |
+ | |
+ uksm_pages_scanned = uksm_pages_scanned_last = 0; | |
+ } | |
+ | |
+ uksm_pages_scanned++; | |
+} | |
+ | |
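+/* | |
+ * True if @hash matches the precomputed hash of an all-zero page at this | |
+ * hash strength. | |
+ */ | |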
+static inline int find_zero_page_hash(int strength, u32 hash) | |
+{ | |
+ return (zero_hash_table[strength] == hash); | |
+} | |
+ | |
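+/* | |
+ * If @page is an anonymous page whose content is all zero, write-protect it | |
+ * and replace its mapping with the UKSM zero page. Returns 0 on success. | |
+ */ | |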
+static | |
+int cmp_and_merge_zero_page(struct vm_area_struct *vma, struct page *page) | |
+{ | |
+ struct page *zero_page = empty_uksm_zero_page; | |
+ struct mm_struct *mm = vma->vm_mm; | |
+ pte_t orig_pte = __pte(0); | |
+ int err = -EFAULT; | |
+ | |
+ if (uksm_test_exit(mm)) | |
+ goto out; | |
+ | |
+ if (PageTransCompound(page) && page_trans_compound_anon_split(page)) | |
+ goto out; | |
+ BUG_ON(PageTransCompound(page)); | |
+ | |
+ if (!PageAnon(page)) | |
+ goto out; | |
+ | |
+ if (!trylock_page(page)) | |
+ goto out; | |
+ | |
+ if (write_protect_page(vma, page, &orig_pte, 0) == 0) { | |
+ if (is_page_full_zero(page)) | |
+ err = replace_page(vma, page, zero_page, orig_pte); | |
+ } | |
+ | |
+ unlock_page(page); | |
+out: | |
+ return err; | |
+} | |
+ | |
+/* | |
+ * cmp_and_merge_page() - first see if the page can be merged into the stable | |
+ * tree; if not, see if it can be inserted into the unstable tree, or merged | |
+ * with a page already there and both transferred to the stable tree. | |
+ * | |
+ * @rmap_item: the reverse mapping into the virtual address of the page we | |
+ * are searching an identical page for | |
+ * @hash: the already-calculated hash value of that page | |
+ */ | |
+static void cmp_and_merge_page(struct rmap_item *rmap_item, u32 hash) | |
+{ | |
+ struct rmap_item *tree_rmap_item; | |
+ struct page *page; | |
+ struct page *kpage = NULL; | |
+ u32 hash_max; | |
+ int err; | |
+ unsigned int success1, success2; | |
+ struct stable_node *snode; | |
+ int cmp; | |
+ struct rb_node *parent = NULL, **new; | |
+ | |
+ remove_rmap_item_from_tree(rmap_item); | |
+ page = rmap_item->page; | |
+ | |
+ /* We first start with searching the page inside the stable tree */ | |
+ kpage = stable_tree_search(rmap_item, hash); | |
+ if (kpage) { | |
+ err = try_to_merge_with_uksm_page(rmap_item, kpage, | |
+ hash); | |
+ if (!err) { | |
+ /* | |
+ * The page was successfully merged, add | |
+ * its rmap_item to the stable tree. | |
+ * page lock is needed because it's | |
+ * racing with try_to_unmap_ksm(), etc. | |
+ */ | |
+ lock_page(kpage); | |
+ snode = page_stable_node(kpage); | |
+ stable_tree_append(rmap_item, snode, 1); | |
+ unlock_page(kpage); | |
+ put_page(kpage); | |
+ return; /* success */ | |
+ } | |
+ put_page(kpage); | |
+ | |
+ /* | |
+ * If it's a collision and it has already been searched in the | |
+ * sub-rbtree (hash_max != 0), we want to abort, because if it | |
+ * were successfully merged in the unstable tree, the collision | |
+ * would tend to happen again. | |
+ */ | |
+ if (err == MERGE_ERR_COLLI && rmap_item->hash_max) | |
+ return; | |
+ } | |
+ | |
+ tree_rmap_item = | |
+ unstable_tree_search_insert(rmap_item, hash); | |
+ if (tree_rmap_item) { | |
+ err = try_to_merge_two_pages(rmap_item, tree_rmap_item, hash); | |
+ /* | |
+ * As soon as we merge this page, we want to remove the | |
+ * rmap_item of the page we have merged with from the unstable | |
+ * tree, and insert it instead as new node in the stable tree. | |
+ */ | |
+ if (!err) { | |
+ kpage = page; | |
+ remove_rmap_item_from_tree(tree_rmap_item); | |
+ lock_page(kpage); | |
+ snode = stable_tree_insert(&kpage, hash, | |
+ rmap_item, tree_rmap_item, | |
+ &success1, &success2); | |
+ | |
+ /* | |
+ * Do not log dedup for tree item, it's not counted as | |
+ * scanned in this round. | |
+ */ | |
+ if (success2) | |
+ stable_tree_append(tree_rmap_item, snode, 0); | |
+ | |
+ /* | |
+ * The order of these two stable_tree_append() calls is | |
+ * important: we are scanning rmap_item. | |
+ */ | |
+ if (success1) | |
+ stable_tree_append(rmap_item, snode, 1); | |
+ | |
+ /* | |
+ * The original kpage may be unlocked inside | |
+ * stable_tree_insert() already. This page | |
+ * should be unlocked before doing | |
+ * break_cow(). | |
+ */ | |
+ unlock_page(kpage); | |
+ | |
+ if (!success1) | |
+ break_cow(rmap_item); | |
+ | |
+ if (!success2) | |
+ break_cow(tree_rmap_item); | |
+ | |
+ } else if (err == MERGE_ERR_COLLI) { | |
+ BUG_ON(tree_rmap_item->tree_node->count > 1); | |
+ | |
+ rmap_item_hash_max(tree_rmap_item, | |
+ tree_rmap_item->tree_node->hash); | |
+ | |
+ hash_max = rmap_item_hash_max(rmap_item, hash); | |
+ cmp = hash_cmp(hash_max, tree_rmap_item->hash_max); | |
+ parent = &tree_rmap_item->node; | |
+ if (cmp < 0) | |
+ new = &parent->rb_left; | |
+ else if (cmp > 0) | |
+ new = &parent->rb_right; | |
+ else | |
+ goto put_up_out; | |
+ | |
+ rmap_item->tree_node = tree_rmap_item->tree_node; | |
+ rmap_item->address |= UNSTABLE_FLAG; | |
+ rmap_item->hash_round = uksm_hash_round; | |
+ rb_link_node(&rmap_item->node, parent, new); | |
+ rb_insert_color(&rmap_item->node, | |
+ &tree_rmap_item->tree_node->sub_root); | |
+ rmap_item->tree_node->count++; | |
+ } else { | |
+ /* | |
+ * Either one of the pages has changed or they collide | |
+ * at the max hash strength; we consider them ill items. | |
+ */ | |
+ remove_rmap_item_from_tree(tree_rmap_item); | |
+ } | |
+put_up_out: | |
+ put_page(tree_rmap_item->page); | |
+ up_read(&tree_rmap_item->slot->vma->vm_mm->mmap_sem); | |
+ } | |
+} | |
+ | |
+ | |
+ | |
+ | |
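+/* | |
+ * Index of the pool page that backs the rmap_list entry for page @index of | |
+ * this vma slot. | |
+ */ | |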
+static inline unsigned long get_pool_index(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ unsigned long pool_index; | |
+ | |
+ pool_index = (sizeof(struct rmap_list_entry *) * index) >> PAGE_SHIFT; | |
+ if (pool_index >= slot->pool_size) | |
+ BUG(); | |
+ return pool_index; | |
+} | |
+ | |
+static inline unsigned long index_page_offset(unsigned long index) | |
+{ | |
+ return offset_in_page(sizeof(struct rmap_list_entry *) * index); | |
+} | |
+ | |
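+/* | |
+ * Return a kmap()ed pointer to the rmap_list entry for @index, allocating | |
+ * the backing pool page on demand when @need_alloc is set. Must be paired | |
+ * with put_rmap_list_entry(). | |
+ */ | |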
+static inline | |
+struct rmap_list_entry *get_rmap_list_entry(struct vma_slot *slot, | |
+ unsigned long index, int need_alloc) | |
+{ | |
+ unsigned long pool_index; | |
+ struct page *page; | |
+ void *addr; | |
+ | |
+ | |
+ pool_index = get_pool_index(slot, index); | |
+ if (!slot->rmap_list_pool[pool_index]) { | |
+ if (!need_alloc) | |
+ return NULL; | |
+ | |
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN); | |
+ if (!page) | |
+ return NULL; | |
+ | |
+ slot->rmap_list_pool[pool_index] = page; | |
+ } | |
+ | |
+ addr = kmap(slot->rmap_list_pool[pool_index]); | |
+ addr += index_page_offset(index); | |
+ | |
+ return addr; | |
+} | |
+ | |
+static inline void put_rmap_list_entry(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ unsigned long pool_index; | |
+ | |
+ pool_index = get_pool_index(slot, index); | |
+ BUG_ON(!slot->rmap_list_pool[pool_index]); | |
+ kunmap(slot->rmap_list_pool[pool_index]); | |
+} | |
+ | |
+static inline int entry_is_new(struct rmap_list_entry *entry) | |
+{ | |
+ return !entry->item; | |
+} | |
+ | |
+static inline unsigned long get_index_orig_addr(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ return slot->vma->vm_start + (index << PAGE_SHIFT); | |
+} | |
+ | |
+static inline unsigned long get_entry_address(struct rmap_list_entry *entry) | |
+{ | |
+ unsigned long addr; | |
+ | |
+ if (is_addr(entry->addr)) | |
+ addr = get_clean_addr(entry->addr); | |
+ else if (entry->item) | |
+ addr = get_rmap_addr(entry->item); | |
+ else | |
+ BUG(); | |
+ | |
+ return addr; | |
+} | |
+ | |
+static inline struct rmap_item *get_entry_item(struct rmap_list_entry *entry) | |
+{ | |
+ if (is_addr(entry->addr)) | |
+ return NULL; | |
+ | |
+ return entry->item; | |
+} | |
+ | |
+static inline void inc_rmap_list_pool_count(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ unsigned long pool_index; | |
+ | |
+ pool_index = get_pool_index(slot, index); | |
+ BUG_ON(!slot->rmap_list_pool[pool_index]); | |
+ slot->pool_counts[pool_index]++; | |
+} | |
+ | |
+static inline void dec_rmap_list_pool_count(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ unsigned long pool_index; | |
+ | |
+ pool_index = get_pool_index(slot, index); | |
+ BUG_ON(!slot->rmap_list_pool[pool_index]); | |
+ BUG_ON(!slot->pool_counts[pool_index]); | |
+ slot->pool_counts[pool_index]--; | |
+} | |
+ | |
+static inline int entry_has_rmap(struct rmap_list_entry *entry) | |
+{ | |
+ return !is_addr(entry->addr) && entry->item; | |
+} | |
+ | |
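+/* | |
+ * Swap two rmap_list entries, fixing up their items' entry_index back | |
+ * references and the per-pool-page item counts. | |
+ */ | |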
+static inline void swap_entries(struct rmap_list_entry *entry1, | |
+ unsigned long index1, | |
+ struct rmap_list_entry *entry2, | |
+ unsigned long index2) | |
+{ | |
+ struct rmap_list_entry tmp; | |
+ | |
+ /* swapping two new entries is meaningless */ | |
+ BUG_ON(entry_is_new(entry1) && entry_is_new(entry2)); | |
+ | |
+ tmp = *entry1; | |
+ *entry1 = *entry2; | |
+ *entry2 = tmp; | |
+ | |
+ if (entry_has_rmap(entry1)) | |
+ entry1->item->entry_index = index1; | |
+ | |
+ if (entry_has_rmap(entry2)) | |
+ entry2->item->entry_index = index2; | |
+ | |
+ if (entry_has_rmap(entry1) && !entry_has_rmap(entry2)) { | |
+ inc_rmap_list_pool_count(entry1->item->slot, index1); | |
+ dec_rmap_list_pool_count(entry1->item->slot, index2); | |
+ } else if (!entry_has_rmap(entry1) && entry_has_rmap(entry2)) { | |
+ inc_rmap_list_pool_count(entry2->item->slot, index2); | |
+ dec_rmap_list_pool_count(entry2->item->slot, index1); | |
+ } | |
+} | |
+ | |
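+/* | |
+ * Drop an entry's rmap_item: store its address back into the entry, remove | |
+ * the item from the trees, and free it. | |
+ */ | |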
+static inline void free_entry_item(struct rmap_list_entry *entry) | |
+{ | |
+ unsigned long index; | |
+ struct rmap_item *item; | |
+ | |
+ if (!is_addr(entry->addr)) { | |
+ BUG_ON(!entry->item); | |
+ item = entry->item; | |
+ entry->addr = get_rmap_addr(item); | |
+ set_is_addr(entry->addr); | |
+ index = item->entry_index; | |
+ remove_rmap_item_from_tree(item); | |
+ dec_rmap_list_pool_count(item->slot, index); | |
+ free_rmap_item(item); | |
+ } | |
+} | |
+ | |
+static inline int pool_entry_boundary(unsigned long index) | |
+{ | |
+ unsigned long linear_addr; | |
+ | |
+ linear_addr = sizeof(struct rmap_list_entry *) * index; | |
+ return index && !offset_in_page(linear_addr); | |
+} | |
+ | |
+static inline void try_free_last_pool(struct vma_slot *slot, | |
+ unsigned long index) | |
+{ | |
+ unsigned long pool_index; | |
+ | |
+ pool_index = get_pool_index(slot, index); | |
+ if (slot->rmap_list_pool[pool_index] && | |
+ !slot->pool_counts[pool_index]) { | |
+ __free_page(slot->rmap_list_pool[pool_index]); | |
+ slot->rmap_list_pool[pool_index] = NULL; | |
+ slot->flags |= UKSM_SLOT_NEED_SORT; | |
+ } | |
+ | |
+} | |
+ | |
+static inline unsigned long vma_item_index(struct vm_area_struct *vma, | |
+ struct rmap_item *item) | |
+{ | |
+ return (get_rmap_addr(item) - vma->vm_start) >> PAGE_SHIFT; | |
+} | |
+ | |
+static int within_same_pool(struct vma_slot *slot, | |
+ unsigned long i, unsigned long j) | |
+{ | |
+ unsigned long pool_i, pool_j; | |
+ | |
+ pool_i = get_pool_index(slot, i); | |
+ pool_j = get_pool_index(slot, j); | |
+ | |
+ return (pool_i == pool_j); | |
+} | |
+ | |
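+/* | |
+ * Put rmap_list entries back into their natural page-index order after a | |
+ * randomly permuted scan round, then free any pool pages that no longer | |
+ * contain an rmap_item. | |
+ */ | |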
+static void sort_rmap_entry_list(struct vma_slot *slot) | |
+{ | |
+ unsigned long i, j; | |
+ struct rmap_list_entry *entry, *swap_entry; | |
+ | |
+ entry = get_rmap_list_entry(slot, 0, 0); | |
+ for (i = 0; i < slot->pages; ) { | |
+ | |
+ if (!entry) | |
+ goto skip_whole_pool; | |
+ | |
+ if (entry_is_new(entry)) | |
+ goto next_entry; | |
+ | |
+ if (is_addr(entry->addr)) { | |
+ entry->addr = 0; | |
+ goto next_entry; | |
+ } | |
+ | |
+ j = vma_item_index(slot->vma, entry->item); | |
+ if (j == i) | |
+ goto next_entry; | |
+ | |
+ if (within_same_pool(slot, i, j)) | |
+ swap_entry = entry + j - i; | |
+ else | |
+ swap_entry = get_rmap_list_entry(slot, j, 1); | |
+ | |
+ swap_entries(entry, i, swap_entry, j); | |
+ if (!within_same_pool(slot, i, j)) | |
+ put_rmap_list_entry(slot, j); | |
+ continue; | |
+ | |
+skip_whole_pool: | |
+ i += PAGE_SIZE / sizeof(*entry); | |
+ if (i < slot->pages) | |
+ entry = get_rmap_list_entry(slot, i, 0); | |
+ continue; | |
+ | |
+next_entry: | |
+ if (i >= slot->pages - 1 || | |
+ !within_same_pool(slot, i, i + 1)) { | |
+ put_rmap_list_entry(slot, i); | |
+ if (i + 1 < slot->pages) | |
+ entry = get_rmap_list_entry(slot, i + 1, 0); | |
+ } else | |
+ entry++; | |
+ i++; | |
+ continue; | |
+ } | |
+ | |
+ /* free empty pool pages which contain no rmap_item */ | |
+ /* This could be simplified to rely on pool_counts alone once bug-free. */ | |
+ for (i = 0; i < slot->pool_size; i++) { | |
+ unsigned char has_rmap; | |
+ void *addr; | |
+ | |
+ if (!slot->rmap_list_pool[i]) | |
+ continue; | |
+ | |
+ has_rmap = 0; | |
+ addr = kmap(slot->rmap_list_pool[i]); | |
+ BUG_ON(!addr); | |
+ for (j = 0; j < PAGE_SIZE / sizeof(*entry); j++) { | |
+ entry = (struct rmap_list_entry *)addr + j; | |
+ if (is_addr(entry->addr)) | |
+ continue; | |
+ if (!entry->item) | |
+ continue; | |
+ has_rmap = 1; | |
+ } | |
+ kunmap(slot->rmap_list_pool[i]); | |
+ if (!has_rmap) { | |
+ BUG_ON(slot->pool_counts[i]); | |
+ __free_page(slot->rmap_list_pool[i]); | |
+ slot->rmap_list_pool[i] = NULL; | |
+ } | |
+ } | |
+ | |
+ slot->flags &= ~UKSM_SLOT_NEED_SORT; | |
+} | |
+ | |
+/* | |
+ * vma_fully_scanned() - return whether all the pages in this slot have been | |
+ * scanned. | |
+ */ | |
+static inline int vma_fully_scanned(struct vma_slot *slot) | |
+{ | |
+ return slot->pages_scanned == slot->pages; | |
+} | |
+ | |
+/** | |
+ * get_next_rmap_item() - Get the next rmap_item in a vma_slot according to | |
+ * its random permutation. The random permutation index management code is | |
+ * embedded in this function. | |
+ */ | |
+static struct rmap_item *get_next_rmap_item(struct vma_slot *slot, u32 *hash) | |
+{ | |
+ unsigned long rand_range, addr, swap_index, scan_index; | |
+ struct rmap_item *item = NULL; | |
+ struct rmap_list_entry *scan_entry, *swap_entry = NULL; | |
+ struct page *page; | |
+ | |
+ scan_index = swap_index = slot->pages_scanned % slot->pages; | |
+ | |
+ if (pool_entry_boundary(scan_index)) | |
+ try_free_last_pool(slot, scan_index - 1); | |
+ | |
+ if (vma_fully_scanned(slot)) { | |
+ if (slot->flags & UKSM_SLOT_NEED_SORT) | |
+ slot->flags |= UKSM_SLOT_NEED_RERAND; | |
+ else | |
+ slot->flags &= ~UKSM_SLOT_NEED_RERAND; | |
+ if (slot->flags & UKSM_SLOT_NEED_SORT) | |
+ sort_rmap_entry_list(slot); | |
+ } | |
+ | |
+ scan_entry = get_rmap_list_entry(slot, scan_index, 1); | |
+ if (!scan_entry) | |
+ return NULL; | |
+ | |
+ if (entry_is_new(scan_entry)) { | |
+ scan_entry->addr = get_index_orig_addr(slot, scan_index); | |
+ set_is_addr(scan_entry->addr); | |
+ } | |
+ | |
+ if (slot->flags & UKSM_SLOT_NEED_RERAND) { | |
+ rand_range = slot->pages - scan_index; | |
+ BUG_ON(!rand_range); | |
+ swap_index = scan_index + (prandom_u32() % rand_range); | |
+ } | |
+ | |
+ if (swap_index != scan_index) { | |
+ swap_entry = get_rmap_list_entry(slot, swap_index, 1); | |
+ if (entry_is_new(swap_entry)) { | |
+ swap_entry->addr = get_index_orig_addr(slot, | |
+ swap_index); | |
+ set_is_addr(swap_entry->addr); | |
+ } | |
+ swap_entries(scan_entry, scan_index, swap_entry, swap_index); | |
+ } | |
+ | |
+ addr = get_entry_address(scan_entry); | |
+ item = get_entry_item(scan_entry); | |
+ BUG_ON(addr > slot->vma->vm_end || addr < slot->vma->vm_start); | |
+ | |
+ page = follow_page(slot->vma, addr, FOLL_GET); | |
+ if (IS_ERR_OR_NULL(page)) | |
+ goto nopage; | |
+ | |
+ if (!PageAnon(page) && !page_trans_compound_anon(page)) | |
+ goto putpage; | |
+ | |
+ /* check whether this is the zero_page pfn or the uksm_zero_page */ | |
+ if ((page_to_pfn(page) == zero_pfn) | |
+ || (page_to_pfn(page) == uksm_zero_pfn)) | |
+ goto putpage; | |
+ | |
+ flush_anon_page(slot->vma, page, addr); | |
+ flush_dcache_page(page); | |
+ | |
+ | |
+ *hash = page_hash(page, hash_strength, 1); | |
+ inc_uksm_pages_scanned(); | |
+ /* if the page content is all zero, re-map it to the zero page */ | |
+ if (find_zero_page_hash(hash_strength, *hash)) { | |
+ if (!cmp_and_merge_zero_page(slot->vma, page)) { | |
+ slot->pages_merged++; | |
+ inc_zone_page_state(page, NR_UKSM_ZERO_PAGES); | |
+ dec_mm_counter(slot->mm, MM_ANONPAGES); | |
+ | |
+ /* For full-zero pages, no need to create rmap item */ | |
+ goto putpage; | |
+ } else { | |
+ inc_rshash_neg(memcmp_cost / 2); | |
+ } | |
+ } | |
+ | |
+ if (!item) { | |
+ item = alloc_rmap_item(); | |
+ if (item) { | |
+ /* It has already been zeroed */ | |
+ item->slot = slot; | |
+ item->address = addr; | |
+ item->entry_index = scan_index; | |
+ scan_entry->item = item; | |
+ inc_rmap_list_pool_count(slot, scan_index); | |
+ } else | |
+ goto putpage; | |
+ } | |
+ | |
+ BUG_ON(item->slot != slot); | |
+ /* the page may have changed */ | |
+ item->page = page; | |
+ put_rmap_list_entry(slot, scan_index); | |
+ if (swap_entry) | |
+ put_rmap_list_entry(slot, swap_index); | |
+ return item; | |
+ | |
+putpage: | |
+ put_page(page); | |
+ page = NULL; | |
+nopage: | |
+ /* no page, store addr back and free rmap_item if possible */ | |
+ free_entry_item(scan_entry); | |
+ put_rmap_list_entry(slot, scan_index); | |
+ if (swap_entry) | |
+ put_rmap_list_entry(slot, swap_index); | |
+ return NULL; | |
+} | |
+ | |
+static inline int in_stable_tree(struct rmap_item *rmap_item) | |
+{ | |
+ return rmap_item->address & STABLE_FLAG; | |
+} | |
+ | |
+/** | |
+ * scan_vma_one_page() - scan the next page in a vma_slot. Called with | |
+ * mmap_sem locked. | |
+ */ | |
+static noinline void scan_vma_one_page(struct vma_slot *slot) | |
+{ | |
+ u32 hash; | |
+ struct mm_struct *mm; | |
+ struct rmap_item *rmap_item = NULL; | |
+ struct vm_area_struct *vma = slot->vma; | |
+ | |
+ mm = vma->vm_mm; | |
+ BUG_ON(!mm); | |
+ BUG_ON(!slot); | |
+ | |
+ rmap_item = get_next_rmap_item(slot, &hash); | |
+ if (!rmap_item) | |
+ goto out1; | |
+ | |
+ if (PageKsm(rmap_item->page) && in_stable_tree(rmap_item)) | |
+ goto out2; | |
+ | |
+ cmp_and_merge_page(rmap_item, hash); | |
+out2: | |
+ put_page(rmap_item->page); | |
+out1: | |
+ slot->pages_scanned++; | |
+ if (slot->fully_scanned_round != fully_scanned_round) | |
+ scanned_virtual_pages++; | |
+ | |
+ if (vma_fully_scanned(slot)) | |
+ slot->fully_scanned_round = fully_scanned_round; | |
+} | |
+ | |
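+/* | |
+ * Total number of pages tracked by this rung's vma slots (0 if the rung is | |
+ * empty). | |
+ */ | |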
+static inline unsigned long rung_get_pages(struct scan_rung *rung) | |
+{ | |
+ struct slot_tree_node *node; | |
+ | |
+ if (!rung->vma_root.rnode) | |
+ return 0; | |
+ | |
+ node = container_of(rung->vma_root.rnode, struct slot_tree_node, snode); | |
+ | |
+ return node->size; | |
+} | |
+ | |
+#define RUNG_SAMPLED_MIN 3 | |
+ | |
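+/* | |
+ * Choose the rung's scan step so that roughly "sampled" pages are visited | |
+ * per pass, where the sample size is derived from the rung's cover time, CPU | |
+ * ratio and the measured per-page scan time (never below RUNG_SAMPLED_MIN). | |
+ * A step of 1 means a full scan. | |
+ */ | |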
+static inline | |
+void uksm_calc_rung_step(struct scan_rung *rung, | |
+ unsigned long page_time, unsigned long ratio) | |
+{ | |
+ unsigned long sampled, pages; | |
+ | |
+ /* will it be fully scanned? */ | |
+ if (!rung->cover_msecs) { | |
+ rung->step = 1; | |
+ return; | |
+ } | |
+ | |
+ sampled = rung->cover_msecs * (NSEC_PER_MSEC / TIME_RATIO_SCALE) | |
+ * ratio / page_time; | |
+ | |
+ /* | |
+ * Before we finish a scan round and its expensive per-round jobs, | |
+ * we need to have a chance to estimate the per-page time. So | |
+ * the sampled number cannot be too small. | |
+ */ | |
+ if (sampled < RUNG_SAMPLED_MIN) | |
+ sampled = RUNG_SAMPLED_MIN; | |
+ | |
+ pages = rung_get_pages(rung); | |
+ if (likely(pages > sampled)) | |
+ rung->step = pages / sampled; | |
+ else | |
+ rung->step = 1; | |
+} | |
+ | |
+static inline int step_need_recalc(struct scan_rung *rung) | |
+{ | |
+ unsigned long pages, stepmax; | |
+ | |
+ pages = rung_get_pages(rung); | |
+ stepmax = pages / RUNG_SAMPLED_MIN; | |
+ | |
+ return pages && (rung->step > pages || | |
+ (stepmax && rung->step > stepmax)); | |
+} | |
+ | |
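+/* | |
+ * Pick a new random starting slot and offset for the rung's next pass; | |
+ * optionally mark the previous pass as finished and recalculate the step if | |
+ * needed. | |
+ */ | |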
+static inline | |
+void reset_current_scan(struct scan_rung *rung, int finished, int step_recalc) | |
+{ | |
+ struct vma_slot *slot; | |
+ | |
+ if (finished) | |
+ rung->flags |= UKSM_RUNG_ROUND_FINISHED; | |
+ | |
+ if (step_recalc || step_need_recalc(rung)) { | |
+ uksm_calc_rung_step(rung, uksm_ema_page_time, rung->cpu_ratio); | |
+ BUG_ON(step_need_recalc(rung)); | |
+ } | |
+ | |
+ slot_iter_index = prandom_u32() % rung->step; | |
+ BUG_ON(!rung->vma_root.rnode); | |
+ slot = sradix_tree_next(&rung->vma_root, NULL, 0, slot_iter); | |
+ BUG_ON(!slot); | |
+ | |
+ rung->current_scan = slot; | |
+ rung->current_offset = slot_iter_index; | |
+} | |
+ | |
+static inline struct sradix_tree_root *slot_get_root(struct vma_slot *slot) | |
+{ | |
+ return &slot->rung->vma_root; | |
+} | |
+ | |
+/* | |
+ * Return non-zero if the scan position was reset. | |
+ */ | |
+static int advance_current_scan(struct scan_rung *rung) | |
+{ | |
+ unsigned short n; | |
+ struct vma_slot *slot, *next = NULL; | |
+ | |
+ BUG_ON(!rung->vma_root.num); | |
+ | |
+ slot = rung->current_scan; | |
+ n = (slot->pages - rung->current_offset) % rung->step; | |
+ slot_iter_index = rung->step - n; | |
+ next = sradix_tree_next(&rung->vma_root, slot->snode, | |
+ slot->sindex, slot_iter); | |
+ | |
+ if (next) { | |
+ rung->current_offset = slot_iter_index; | |
+ rung->current_scan = next; | |
+ return 0; | |
+ } else { | |
+ reset_current_scan(rung, 1, 0); | |
+ return 1; | |
+ } | |
+} | |
+ | |
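+/* | |
+ * Remove a slot from its rung: advance the scan cursor past it first, delete | |
+ * it from the rung's sradix tree, and recalculate the step if needed. | |
+ */ | |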
+static inline void rung_rm_slot(struct vma_slot *slot) | |
+{ | |
+ struct scan_rung *rung = slot->rung; | |
+ struct sradix_tree_root *root; | |
+ | |
+ if (rung->current_scan == slot) | |
+ advance_current_scan(rung); | |
+ | |
+ root = slot_get_root(slot); | |
+ sradix_tree_delete_from_leaf(root, slot->snode, slot->sindex); | |
+ slot->snode = NULL; | |
+ if (step_need_recalc(rung)) { | |
+ uksm_calc_rung_step(rung, uksm_ema_page_time, rung->cpu_ratio); | |
+ BUG_ON(step_need_recalc(rung)); | |
+ } | |
+ | |
+ /* In case advance_current_scan() looped back to this slot again */ | |
+ if (rung->vma_root.num && rung->current_scan == slot) | |
+ reset_current_scan(slot->rung, 1, 0); | |
+} | |
+ | |
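+/* | |
+ * Insert a batch of new vma slots into a rung's sradix tree; if these are | |
+ * the first slots on the rung, (re)initialize its scan position. | |
+ */ | |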
+static inline void rung_add_new_slots(struct scan_rung *rung, | |
+ struct vma_slot **slots, unsigned long num) | |
+{ | |
+ int err; | |
+ struct vma_slot *slot; | |
+ unsigned long i; | |
+ struct sradix_tree_root *root = &rung->vma_root; | |
+ | |
+ err = sradix_tree_enter(root, (void **)slots, num); | |
+ BUG_ON(err); | |
+ | |
+ for (i = 0; i < num; i++) { | |
+ slot = slots[i]; | |
+ slot->rung = rung; | |
+ BUG_ON(vma_fully_scanned(slot)); | |
+ } | |
+ | |
+ if (rung->vma_root.num == num) | |
+ reset_current_scan(rung, 0, 1); | |
+} | |
+ | |
+static inline int rung_add_one_slot(struct scan_rung *rung, | |
+ struct vma_slot *slot) | |
+{ | |
+ int err; | |
+ | |
+ err = sradix_tree_enter(&rung->vma_root, (void **)&slot, 1); | |
+ if (err) | |
+ return err; | |
+ | |
+ slot->rung = rung; | |
+ if (rung->vma_root.num == 1) | |
+ reset_current_scan(rung, 0, 1); | |
+ | |
+ return 0; | |
+} | |
+ | |
+/* | |
+ * Return true if the slot is deleted from its rung. | |
+ */ | |
+static inline int vma_rung_enter(struct vma_slot *slot, struct scan_rung *rung) | |
+{ | |
+ struct scan_rung *old_rung = slot->rung; | |
+ int err; | |
+ | |
+ if (old_rung == rung) | |
+ return 0; | |
+ | |
+ rung_rm_slot(slot); | |
+ err = rung_add_one_slot(rung, slot); | |
+ if (err) { | |
+ err = rung_add_one_slot(old_rung, slot); | |
+ WARN_ON(err); /* OOPS, badly OOM, we lost this slot */ | |
+ } | |
+ | |
+ return 1; | |
+} | |
+ | |
+static inline int vma_rung_up(struct vma_slot *slot) | |
+{ | |
+ struct scan_rung *rung; | |
+ | |
+ rung = slot->rung; | |
+ if (slot->rung != &uksm_scan_ladder[SCAN_LADDER_SIZE-1]) | |
+ rung++; | |
+ | |
+ return vma_rung_enter(slot, rung); | |
+} | |
+ | |
+static inline int vma_rung_down(struct vma_slot *slot) | |
+{ | |
+ struct scan_rung *rung; | |
+ | |
+ rung = slot->rung; | |
+ if (slot->rung != &uksm_scan_ladder[0]) | |
+ rung--; | |
+ | |
+ return vma_rung_enter(slot, rung); | |
+} | |
+ | |
+/** | |
+ * cal_dedup_ratio() - Calculate the deduplication ratio for this slot. | |
+ */ | |
+static unsigned long cal_dedup_ratio(struct vma_slot *slot) | |
+{ | |
+ unsigned long ret; | |
+ | |
+ BUG_ON(slot->pages_scanned == slot->last_scanned); | |
+ | |
+ ret = slot->pages_merged; | |
+ | |
+ /* Thrashing area filtering */ | |
+ if (ret && uksm_thrash_threshold) { | |
+ if (slot->pages_cowed * 100 / slot->pages_merged | |
+ > uksm_thrash_threshold) { | |
+ ret = 0; | |
+ } else { | |
+ ret = slot->pages_merged - slot->pages_cowed; | |
+ } | |
+ } | |
+ | |
+ return ret; | |
+} | |
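Editor's note: the thrash filter above discards slots in which too large a share of the merged pages were later broken by copy-on-write. A small user-space sketch of the same arithmetic, not part of this patch; the threshold and example numbers are made up.

#include <stdio.h>

/* assumed tunable: same role as uksm_thrash_threshold (percent) */
static unsigned int thrash_threshold = 25;

/* mirror of the filtering logic above, user-space only */
static unsigned long dedup_value(unsigned long merged, unsigned long cowed)
{
	if (!merged || !thrash_threshold)
		return merged;
	if (cowed * 100 / merged > thrash_threshold)
		return 0;              /* thrashing area: worthless */
	return merged - cowed;         /* discount the broken pages */
}

int main(void)
{
	printf("%lu\n", dedup_value(200, 10));   /* 5% cowed  -> 190 */
	printf("%lu\n", dedup_value(200, 80));   /* 40% cowed -> 0   */
	return 0;
}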
+ | |
+/** | |
+ * cal_dedup_ratio_old() - Calculate the deduplication ratio for this slot. | |
+ */ | |
+static unsigned long cal_dedup_ratio_old(struct vma_slot *slot) | |
+{ | |
+ unsigned long ret; | |
+ unsigned long pages_scanned; | |
+ | |
+ pages_scanned = slot->pages_scanned; | |
+ if (!pages_scanned) { | |
+ if (uksm_thrash_threshold) | |
+ return 0; | |
+ else | |
+ pages_scanned = slot->pages_scanned; | |
+ } | |
+ | |
+ ret = slot->pages_bemerged * 100 / pages_scanned; | |
+ | |
+ /* Thrashing area filtering */ | |
+ if (ret && uksm_thrash_threshold) { | |
+ if (slot->pages_cowed * 100 / slot->pages_bemerged | |
+ > uksm_thrash_threshold) { | |
+ ret = 0; | |
+ } else { | |
+ ret = slot->pages_bemerged - slot->pages_cowed; | |
+ } | |
+ } | |
+ | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * stable_node_reinsert() - When the hash_strength has been adjusted, the | |
+ * stable tree needs to be restructured; this is the function that re-inserts | |
+ * a stable node into the new tree. | |
+ */ | |
+static inline void stable_node_reinsert(struct stable_node *new_node, | |
+ struct page *page, | |
+ struct rb_root *root_treep, | |
+ struct list_head *tree_node_listp, | |
+ u32 hash) | |
+{ | |
+ struct rb_node **new = &root_treep->rb_node; | |
+ struct rb_node *parent = NULL; | |
+ struct stable_node *stable_node; | |
+ struct tree_node *tree_node; | |
+ struct page *tree_page; | |
+ int cmp; | |
+ | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ tree_node = rb_entry(*new, struct tree_node, node); | |
+ | |
+ cmp = hash_cmp(hash, tree_node->hash); | |
+ | |
+ if (cmp < 0) { | |
+ parent = *new; | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ parent = *new; | |
+ new = &parent->rb_right; | |
+ } else | |
+ break; | |
+ } | |
+ | |
+ if (*new) { | |
+ /* find a stable tree node with same first level hash value */ | |
+ stable_node_hash_max(new_node, page, hash); | |
+ if (tree_node->count == 1) { | |
+ stable_node = rb_entry(tree_node->sub_root.rb_node, | |
+ struct stable_node, node); | |
+ tree_page = get_uksm_page(stable_node, 1, 0); | |
+ if (tree_page) { | |
+ stable_node_hash_max(stable_node, | |
+ tree_page, hash); | |
+ put_page(tree_page); | |
+ | |
+ /* prepare for stable node insertion */ | |
+ | |
+ cmp = hash_cmp(new_node->hash_max, | |
+ stable_node->hash_max); | |
+ parent = &stable_node->node; | |
+ if (cmp < 0) | |
+ new = &parent->rb_left; | |
+ else if (cmp > 0) | |
+ new = &parent->rb_right; | |
+ else | |
+ goto failed; | |
+ | |
+ goto add_node; | |
+ } else { | |
+				/* the only stable_node was deleted, but the | |
+				 * tree node itself was not. | |
+ */ | |
+ goto tree_node_reuse; | |
+ } | |
+ } | |
+ | |
+ /* well, search the collision subtree */ | |
+ new = &tree_node->sub_root.rb_node; | |
+ parent = NULL; | |
+ BUG_ON(!*new); | |
+ while (*new) { | |
+ int cmp; | |
+ | |
+ stable_node = rb_entry(*new, struct stable_node, node); | |
+ | |
+ cmp = hash_cmp(new_node->hash_max, | |
+ stable_node->hash_max); | |
+ | |
+ if (cmp < 0) { | |
+ parent = *new; | |
+ new = &parent->rb_left; | |
+ } else if (cmp > 0) { | |
+ parent = *new; | |
+ new = &parent->rb_right; | |
+ } else { | |
+ /* oh, no, still a collision */ | |
+ goto failed; | |
+ } | |
+ } | |
+ | |
+ goto add_node; | |
+ } | |
+ | |
+ /* no tree node found */ | |
+ tree_node = alloc_tree_node(tree_node_listp); | |
+ if (!tree_node) { | |
+ printk(KERN_ERR "UKSM: memory allocation error!\n"); | |
+ goto failed; | |
+ } else { | |
+ tree_node->hash = hash; | |
+ rb_link_node(&tree_node->node, parent, new); | |
+ rb_insert_color(&tree_node->node, root_treep); | |
+ | |
+tree_node_reuse: | |
+ /* prepare for stable node insertion */ | |
+ parent = NULL; | |
+ new = &tree_node->sub_root.rb_node; | |
+ } | |
+ | |
+add_node: | |
+ rb_link_node(&new_node->node, parent, new); | |
+ rb_insert_color(&new_node->node, &tree_node->sub_root); | |
+ new_node->tree_node = tree_node; | |
+ tree_node->count++; | |
+ return; | |
+ | |
+failed: | |
+ /* This can only happen when two nodes have collided | |
+ * in two levels. | |
+ */ | |
+ new_node->tree_node = NULL; | |
+ return; | |
+} | |
+ | |
+static inline void free_all_tree_nodes(struct list_head *list) | |
+{ | |
+ struct tree_node *node, *tmp; | |
+ | |
+ list_for_each_entry_safe(node, tmp, list, all_list) { | |
+ free_tree_node(node); | |
+ } | |
+} | |
+ | |
+/** | |
+ * stable_tree_delta_hash() - Delta hash the stable tree from previous hash | |
+ * strength to the current hash_strength. It re-structures the whole tree. | |
+ */ | |
+static inline void stable_tree_delta_hash(u32 prev_hash_strength) | |
+{ | |
+ struct stable_node *node, *tmp; | |
+ struct rb_root *root_new_treep; | |
+ struct list_head *new_tree_node_listp; | |
+ | |
+ stable_tree_index = (stable_tree_index + 1) % 2; | |
+ root_new_treep = &root_stable_tree[stable_tree_index]; | |
+ new_tree_node_listp = &stable_tree_node_list[stable_tree_index]; | |
+ *root_new_treep = RB_ROOT; | |
+ BUG_ON(!list_empty(new_tree_node_listp)); | |
+ | |
+ /* | |
+ * we need to be safe, the node could be removed by get_uksm_page() | |
+ */ | |
+ list_for_each_entry_safe(node, tmp, &stable_node_list, all_list) { | |
+ void *addr; | |
+ struct page *node_page; | |
+ u32 hash; | |
+ | |
+ /* | |
+ * We are completely re-structuring the stable nodes to a new | |
+ * stable tree. We don't want to touch the old tree unlinks and | |
+ * old tree_nodes. The old tree_nodes will be freed at once. | |
+ */ | |
+ node_page = get_uksm_page(node, 0, 0); | |
+ if (!node_page) | |
+ continue; | |
+ | |
+ if (node->tree_node) { | |
+ hash = node->tree_node->hash; | |
+ | |
+ addr = kmap_atomic(node_page); | |
+ | |
+ hash = delta_hash(addr, prev_hash_strength, | |
+ hash_strength, hash); | |
+ kunmap_atomic(addr); | |
+ } else { | |
+ /* | |
+			 * It was not inserted into the rbtree due to a | |
+			 * collision in the last scan round. | |
+ */ | |
+ hash = page_hash(node_page, hash_strength, 0); | |
+ } | |
+ | |
+ stable_node_reinsert(node, node_page, root_new_treep, | |
+ new_tree_node_listp, hash); | |
+ put_page(node_page); | |
+ } | |
+ | |
+ root_stable_treep = root_new_treep; | |
+ free_all_tree_nodes(stable_tree_node_listp); | |
+ BUG_ON(!list_empty(stable_tree_node_listp)); | |
+ stable_tree_node_listp = new_tree_node_listp; | |
+} | |
+ | |
+static inline void inc_hash_strength(unsigned long delta) | |
+{ | |
+ hash_strength += 1 << delta; | |
+ if (hash_strength > HASH_STRENGTH_MAX) | |
+ hash_strength = HASH_STRENGTH_MAX; | |
+} | |
+ | |
+static inline void dec_hash_strength(unsigned long delta) | |
+{ | |
+ unsigned long change = 1 << delta; | |
+ | |
+ if (hash_strength <= change + 1) | |
+ hash_strength = 1; | |
+ else | |
+ hash_strength -= change; | |
+} | |
+ | |
+static inline void inc_hash_strength_delta(void) | |
+{ | |
+ hash_strength_delta++; | |
+ if (hash_strength_delta > HASH_STRENGTH_DELTA_MAX) | |
+ hash_strength_delta = HASH_STRENGTH_DELTA_MAX; | |
+} | |
+ | |
+/* | |
+static inline unsigned long get_current_neg_ratio(void) | |
+{ | |
+ if (!rshash_pos || rshash_neg > rshash_pos) | |
+ return 100; | |
+ | |
+ return div64_u64(100 * rshash_neg , rshash_pos); | |
+} | |
+*/ | |
+ | |
+static inline unsigned long get_current_neg_ratio(void) | |
+{ | |
+ u64 pos = benefit.pos; | |
+ u64 neg = benefit.neg; | |
+ | |
+ if (!neg) | |
+ return 0; | |
+ | |
+ if (!pos || neg > pos) | |
+ return 100; | |
+ | |
+ if (neg > div64_u64(U64_MAX, 100)) | |
+ pos = div64_u64(pos, 100); | |
+ else | |
+ neg *= 100; | |
+ | |
+ return div64_u64(neg, pos); | |
+} | |
+ | |
+static inline unsigned long get_current_benefit(void) | |
+{ | |
+ u64 pos = benefit.pos; | |
+ u64 neg = benefit.neg; | |
+ u64 scanned = benefit.scanned; | |
+ | |
+ if (neg > pos) | |
+ return 0; | |
+ | |
+ return div64_u64((pos - neg), scanned); | |
+} | |
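Editor's note: benefit.pos, benefit.neg and benefit.scanned are accumulated elsewhere in this patch; roughly, pos and neg record the sampled gain and cost of the current hash strength and scanned normalizes by the pages examined. Below is a small user-space sketch of the two computations above, including the overflow guard; the example numbers are made up.

#include <stdio.h>
#include <stdint.h>

/* overflow-safe percentage neg/pos*100, as in get_current_neg_ratio() above */
static unsigned long neg_ratio(uint64_t pos, uint64_t neg)
{
	if (!neg)
		return 0;
	if (!pos || neg > pos)
		return 100;
	if (neg > UINT64_MAX / 100)
		pos /= 100;      /* avoid overflowing neg * 100 */
	else
		neg *= 100;
	return (unsigned long)(neg / pos);
}

int main(void)
{
	uint64_t pos = 4000, neg = 300, scanned = 1000;  /* assumed samples */

	printf("neg ratio = %lu%%\n", neg_ratio(pos, neg));
	printf("benefit   = %llu per scanned page\n",
	       (unsigned long long)((neg > pos) ? 0 : (pos - neg) / scanned));
	return 0;
}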
+ | |
+static inline int judge_rshash_direction(void) | |
+{ | |
+ u64 current_neg_ratio, stable_benefit; | |
+ u64 current_benefit, delta = 0; | |
+ int ret = STILL; | |
+ | |
+	/* Try to probe a value shortly after boot, and again in case the | |
+	   system stays idle for a long time. */ | |
+ if ((fully_scanned_round & 0xFFULL) == 10) { | |
+ ret = OBSCURE; | |
+ goto out; | |
+ } | |
+ | |
+ current_neg_ratio = get_current_neg_ratio(); | |
+ | |
+ if (current_neg_ratio == 0) { | |
+ rshash_neg_cont_zero++; | |
+ if (rshash_neg_cont_zero > 2) | |
+ return GO_DOWN; | |
+ else | |
+ return STILL; | |
+ } | |
+ rshash_neg_cont_zero = 0; | |
+ | |
+ if (current_neg_ratio > 90) { | |
+ ret = GO_UP; | |
+ goto out; | |
+ } | |
+ | |
+ current_benefit = get_current_benefit(); | |
+ stable_benefit = rshash_state.stable_benefit; | |
+ | |
+ if (!stable_benefit) { | |
+ ret = OBSCURE; | |
+ goto out; | |
+ } | |
+ | |
+ if (current_benefit > stable_benefit) | |
+ delta = current_benefit - stable_benefit; | |
+ else if (current_benefit < stable_benefit) | |
+ delta = stable_benefit - current_benefit; | |
+ | |
+	delta = div64_u64(100 * delta, stable_benefit); | |
+ | |
+ if (delta > 50) { | |
+ rshash_cont_obscure++; | |
+ if (rshash_cont_obscure > 2) | |
+ return OBSCURE; | |
+ else | |
+ return STILL; | |
+ } | |
+ | |
+out: | |
+ rshash_cont_obscure = 0; | |
+ return ret; | |
+} | |
+ | |
+/** | |
+ * rshash_adjust() - The main function to control the random sampling state | |
+ * machine for hash strength adapting. | |
+ * | |
+ * return true if hash_strength has changed. | |
+ */ | |
+static inline int rshash_adjust(void) | |
+{ | |
+ unsigned long prev_hash_strength = hash_strength; | |
+ | |
+ if (!encode_benefit()) | |
+ return 0; | |
+ | |
+ switch (rshash_state.state) { | |
+ case RSHASH_STILL: | |
+ switch (judge_rshash_direction()) { | |
+ case GO_UP: | |
+ if (rshash_state.pre_direct == GO_DOWN) | |
+ hash_strength_delta = 0; | |
+ | |
+ inc_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ rshash_state.stable_benefit = get_current_benefit(); | |
+ rshash_state.pre_direct = GO_UP; | |
+ break; | |
+ | |
+ case GO_DOWN: | |
+ if (rshash_state.pre_direct == GO_UP) | |
+ hash_strength_delta = 0; | |
+ | |
+ dec_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ rshash_state.stable_benefit = get_current_benefit(); | |
+ rshash_state.pre_direct = GO_DOWN; | |
+ break; | |
+ | |
+ case OBSCURE: | |
+ rshash_state.stable_point = hash_strength; | |
+ rshash_state.turn_point_down = hash_strength; | |
+ rshash_state.turn_point_up = hash_strength; | |
+ rshash_state.turn_benefit_down = get_current_benefit(); | |
+ rshash_state.turn_benefit_up = get_current_benefit(); | |
+ rshash_state.lookup_window_index = 0; | |
+ rshash_state.state = RSHASH_TRYDOWN; | |
+ dec_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ break; | |
+ | |
+ case STILL: | |
+ break; | |
+ default: | |
+ BUG(); | |
+ } | |
+ break; | |
+ | |
+ case RSHASH_TRYDOWN: | |
+ if (rshash_state.lookup_window_index++ % 5 == 0) | |
+ rshash_state.below_count = 0; | |
+ | |
+ if (get_current_benefit() < rshash_state.stable_benefit) | |
+ rshash_state.below_count++; | |
+ else if (get_current_benefit() > | |
+ rshash_state.turn_benefit_down) { | |
+ rshash_state.turn_point_down = hash_strength; | |
+ rshash_state.turn_benefit_down = get_current_benefit(); | |
+ } | |
+ | |
+ if (rshash_state.below_count >= 3 || | |
+ judge_rshash_direction() == GO_UP || | |
+ hash_strength == 1) { | |
+ hash_strength = rshash_state.stable_point; | |
+ hash_strength_delta = 0; | |
+ inc_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ rshash_state.lookup_window_index = 0; | |
+ rshash_state.state = RSHASH_TRYUP; | |
+ hash_strength_delta = 0; | |
+ } else { | |
+ dec_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ } | |
+ break; | |
+ | |
+ case RSHASH_TRYUP: | |
+ if (rshash_state.lookup_window_index++ % 5 == 0) | |
+ rshash_state.below_count = 0; | |
+ | |
+ if (get_current_benefit() < rshash_state.turn_benefit_down) | |
+ rshash_state.below_count++; | |
+ else if (get_current_benefit() > rshash_state.turn_benefit_up) { | |
+ rshash_state.turn_point_up = hash_strength; | |
+ rshash_state.turn_benefit_up = get_current_benefit(); | |
+ } | |
+ | |
+ if (rshash_state.below_count >= 3 || | |
+ judge_rshash_direction() == GO_DOWN || | |
+ hash_strength == HASH_STRENGTH_MAX) { | |
+ hash_strength = rshash_state.turn_benefit_up > | |
+ rshash_state.turn_benefit_down ? | |
+ rshash_state.turn_point_up : | |
+ rshash_state.turn_point_down; | |
+ | |
+ rshash_state.state = RSHASH_PRE_STILL; | |
+ } else { | |
+ inc_hash_strength(hash_strength_delta); | |
+ inc_hash_strength_delta(); | |
+ } | |
+ | |
+ break; | |
+ | |
+ case RSHASH_NEW: | |
+ case RSHASH_PRE_STILL: | |
+ rshash_state.stable_benefit = get_current_benefit(); | |
+ rshash_state.state = RSHASH_STILL; | |
+ hash_strength_delta = 0; | |
+ break; | |
+ default: | |
+ BUG(); | |
+ } | |
+ | |
+ /* rshash_neg = rshash_pos = 0; */ | |
+ reset_benefit(); | |
+ | |
+ if (prev_hash_strength != hash_strength) | |
+ stable_tree_delta_hash(prev_hash_strength); | |
+ | |
+ return prev_hash_strength != hash_strength; | |
+} | |
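Editor's note: the RSHASH_STILL/TRYDOWN/TRYUP/PRE_STILL machine above is, in essence, a hill climb over hash_strength: probe downward, remember the best turning point, probe upward, then settle on whichever side looked better. The toy below is a deliberately simplified user-space illustration of that probing idea, not the actual state machine; the benefit() curve, the "consistently worse" margin and all values are made up.

#include <stdio.h>

/* toy concave benefit curve with its peak at strength 40 (made up) */
static long benefit(long strength)
{
	long d = strength - 40;
	return 1000 - d * d;
}

int main(void)
{
	long start = 25, best_down = start, best_up = start;
	long s;

	/* walk down (cf. RSHASH_TRYDOWN), remembering the best turning point,
	 * and give up once the benefit is consistently worse */
	for (s = start; s >= 1; s--) {
		if (benefit(s) > benefit(best_down))
			best_down = s;
		else if (benefit(s) + 50 < benefit(best_down))
			break;
	}

	/* walk up (cf. RSHASH_TRYUP) the same way */
	for (s = start; s <= 100; s++) {
		if (benefit(s) > benefit(best_up))
			best_up = s;
		else if (benefit(s) + 50 < benefit(best_up))
			break;
	}

	/* settle on whichever turning point looked better */
	printf("settle at strength %ld\n",
	       benefit(best_up) >= benefit(best_down) ? best_up : best_down);
	return 0;
}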
+ | |
+/** | |
+ * round_update_ladder() - The main function to do update of all the | |
+ * adjustments whenever a scan round is finished. | |
+ */ | |
+static noinline void round_update_ladder(void) | |
+{ | |
+ int i; | |
+ unsigned long dedup; | |
+ struct vma_slot *slot, *tmp_slot; | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ uksm_scan_ladder[i].flags &= ~UKSM_RUNG_ROUND_FINISHED; | |
+ } | |
+ | |
+ list_for_each_entry_safe(slot, tmp_slot, &vma_slot_dedup, dedup_list) { | |
+ | |
+		/* the slot may have been rung_rm_slot()'ed when its mm exited */ | |
+ if (slot->snode) { | |
+ dedup = cal_dedup_ratio_old(slot); | |
+ if (dedup && dedup >= uksm_abundant_threshold) | |
+ vma_rung_up(slot); | |
+ } | |
+ | |
+ slot->pages_bemerged = 0; | |
+ slot->pages_cowed = 0; | |
+ | |
+ list_del_init(&slot->dedup_list); | |
+ } | |
+} | |
+ | |
+static void uksm_del_vma_slot(struct vma_slot *slot) | |
+{ | |
+ int i, j; | |
+ struct rmap_list_entry *entry; | |
+ | |
+ if (slot->snode) { | |
+ /* | |
+		 * If the slot just failed while entering the rung, this | |
+		 * removal is not necessary. | |
+ */ | |
+ rung_rm_slot(slot); | |
+ } | |
+ | |
+ if (!list_empty(&slot->dedup_list)) | |
+ list_del(&slot->dedup_list); | |
+ | |
+ if (!slot->rmap_list_pool || !slot->pool_counts) { | |
+ /* In case it OOMed in uksm_vma_enter() */ | |
+ goto out; | |
+ } | |
+ | |
+ for (i = 0; i < slot->pool_size; i++) { | |
+ void *addr; | |
+ | |
+ if (!slot->rmap_list_pool[i]) | |
+ continue; | |
+ | |
+ addr = kmap(slot->rmap_list_pool[i]); | |
+ for (j = 0; j < PAGE_SIZE / sizeof(*entry); j++) { | |
+ entry = (struct rmap_list_entry *)addr + j; | |
+ if (is_addr(entry->addr)) | |
+ continue; | |
+ if (!entry->item) | |
+ continue; | |
+ | |
+ remove_rmap_item_from_tree(entry->item); | |
+ free_rmap_item(entry->item); | |
+ slot->pool_counts[i]--; | |
+ } | |
+ BUG_ON(slot->pool_counts[i]); | |
+ kunmap(slot->rmap_list_pool[i]); | |
+ __free_page(slot->rmap_list_pool[i]); | |
+ } | |
+ kfree(slot->rmap_list_pool); | |
+ kfree(slot->pool_counts); | |
+ | |
+out: | |
+ slot->rung = NULL; | |
+ BUG_ON(uksm_pages_total < slot->pages); | |
+ if (slot->flags & UKSM_SLOT_IN_UKSM) | |
+ uksm_pages_total -= slot->pages; | |
+ | |
+ if (slot->fully_scanned_round == fully_scanned_round) | |
+ scanned_virtual_pages -= slot->pages; | |
+ else | |
+ scanned_virtual_pages -= slot->pages_scanned; | |
+ free_vma_slot(slot); | |
+} | |
+ | |
+ | |
+#define SPIN_LOCK_PERIOD 32 | |
+static struct vma_slot *cleanup_slots[SPIN_LOCK_PERIOD]; | |
+static inline void cleanup_vma_slots(void) | |
+{ | |
+ struct vma_slot *slot; | |
+ int i; | |
+ | |
+ i = 0; | |
+ spin_lock(&vma_slot_list_lock); | |
+ while (!list_empty(&vma_slot_del)) { | |
+ slot = list_entry(vma_slot_del.next, | |
+ struct vma_slot, slot_list); | |
+ list_del(&slot->slot_list); | |
+ cleanup_slots[i++] = slot; | |
+ if (i == SPIN_LOCK_PERIOD) { | |
+ spin_unlock(&vma_slot_list_lock); | |
+ while (--i >= 0) | |
+ uksm_del_vma_slot(cleanup_slots[i]); | |
+ i = 0; | |
+ spin_lock(&vma_slot_list_lock); | |
+ } | |
+ } | |
+ spin_unlock(&vma_slot_list_lock); | |
+ | |
+ while (--i >= 0) | |
+ uksm_del_vma_slot(cleanup_slots[i]); | |
+} | |
+ | |
+/* | |
+ * Exponential moving average formula. | |
+ */ | |
+static inline unsigned long ema(unsigned long curr, unsigned long last_ema) | |
+{ | |
+ /* | |
+	 * For a very high burst even the EMA cannot work well: a falsely | |
+	 * high per-page time estimate feeds back as very high context-switch | |
+	 * and rung-update overhead, which then drives the per-page time even | |
+	 * higher, so the estimate may never converge. | |
+ * | |
+ * Instead, we try to approach this value in a binary manner. | |
+ */ | |
+ if (curr > last_ema * 10) | |
+ return last_ema * 2; | |
+ | |
+ return (EMA_ALPHA * curr + (100 - EMA_ALPHA) * last_ema) / 100; | |
+} | |
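Editor's note: EMA_ALPHA is defined earlier in this patch; the value 20 used below (20% weight on the newest sample) is an assumption for illustration. A user-space sketch of the smoothing and the burst clamp, not part of this patch:

#include <stdio.h>

#define EMA_ALPHA 20	/* assumed weight of the newest sample, in percent */

/* same smoothing as the kernel helper above, in user space */
static unsigned long ema(unsigned long curr, unsigned long last_ema)
{
	if (curr > last_ema * 10)	/* extreme burst: approach it in steps */
		return last_ema * 2;
	return (EMA_ALPHA * curr + (100 - EMA_ALPHA) * last_ema) / 100;
}

int main(void)
{
	unsigned long est = 500;	/* ns per scanned page */
	unsigned long samples[] = { 520, 480, 9000, 700 };
	int i;

	for (i = 0; i < 4; i++) {
		est = ema(samples[i], est);
		printf("sample=%lu -> ema=%lu\n", samples[i], est);
	}
	return 0;
}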
+ | |
+/* | |
+ * Convert a cpu ratio, expressed by the user in units of 1/TIME_RATIO_SCALE, | |
+ * to nanoseconds of scan time based on the current uksm_sleep_jiffies. | |
+ */ | |
+static inline unsigned long cpu_ratio_to_nsec(unsigned int ratio) | |
+{ | |
+ return NSEC_PER_USEC * jiffies_to_usecs(uksm_sleep_jiffies) / | |
+ (TIME_RATIO_SCALE - ratio) * ratio; | |
+} | |
+ | |
+ | |
+static inline unsigned long rung_real_ratio(int cpu_time_ratio) | |
+{ | |
+ unsigned long ret; | |
+ | |
+ BUG_ON(!cpu_time_ratio); | |
+ | |
+ if (cpu_time_ratio > 0) | |
+ ret = cpu_time_ratio; | |
+ else | |
+ ret = (unsigned long)(-cpu_time_ratio) * | |
+ uksm_max_cpu_percentage / 100UL; | |
+ | |
+ return ret ? ret : 1; | |
+} | |
+ | |
+static noinline void uksm_calc_scan_pages(void) | |
+{ | |
+ struct scan_rung *ladder = uksm_scan_ladder; | |
+ unsigned long sleep_usecs, nsecs; | |
+ unsigned long ratio; | |
+ int i; | |
+ unsigned long per_page; | |
+ | |
+ if (uksm_ema_page_time > 100000 || | |
+ (((unsigned long) uksm_eval_round & (256UL - 1)) == 0UL)) | |
+ uksm_ema_page_time = UKSM_PAGE_TIME_DEFAULT; | |
+ | |
+ per_page = uksm_ema_page_time; | |
+ BUG_ON(!per_page); | |
+ | |
+ /* | |
+	 * Every 8 eval rounds, we try to probe a uksm_sleep_jiffies value | |
+ * based on saved user input. | |
+ */ | |
+ if (((unsigned long) uksm_eval_round & (8UL - 1)) == 0UL) | |
+ uksm_sleep_jiffies = uksm_sleep_saved; | |
+ | |
+	/* We require each rung to scan at least 1 page in a period. */ | |
+ nsecs = per_page; | |
+ ratio = rung_real_ratio(ladder[0].cpu_ratio); | |
+ if (cpu_ratio_to_nsec(ratio) < nsecs) { | |
+ sleep_usecs = nsecs * (TIME_RATIO_SCALE - ratio) / ratio | |
+ / NSEC_PER_USEC; | |
+ uksm_sleep_jiffies = usecs_to_jiffies(sleep_usecs) + 1; | |
+ } | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ ratio = rung_real_ratio(ladder[i].cpu_ratio); | |
+ ladder[i].pages_to_scan = cpu_ratio_to_nsec(ratio) / | |
+ per_page; | |
+ BUG_ON(!ladder[i].pages_to_scan); | |
+ uksm_calc_rung_step(&ladder[i], per_page, ratio); | |
+ } | |
+} | |
+ | |
+/* | |
+ * Convert the scan time of this round (ns) to the next expected minimum | |
+ * sleep time (ms); be careful of the possible overflows. ratio is taken | |
+ * from rung_real_ratio(). | |
+ */ | |
+static inline | |
+unsigned int scan_time_to_sleep(unsigned long long scan_time, unsigned long ratio) | |
+{ | |
+ scan_time >>= 20; /* to msec level now */ | |
+ BUG_ON(scan_time > (ULONG_MAX / TIME_RATIO_SCALE)); | |
+ | |
+ return (unsigned int) ((unsigned long) scan_time * | |
+ (TIME_RATIO_SCALE - ratio) / ratio); | |
+} | |
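Editor's note: cpu_ratio_to_nsec() above and this helper are the two directions of the same proportion, scan : sleep = ratio : (TIME_RATIO_SCALE - ratio). A rough user-space check of that round trip, not part of this patch; TIME_RATIO_SCALE and the example values are assumptions.

#include <stdio.h>

#define TIME_RATIO_SCALE 1000UL	/* assumed scale of cpu ratios */

int main(void)
{
	unsigned long ratio = 200;	/* 20% of cpu for scanning */
	unsigned long sleep_ms = 100;	/* sleep between batches */

	/* scan budget for one period, cf. cpu_ratio_to_nsec() */
	unsigned long scan_ms = sleep_ms * ratio / (TIME_RATIO_SCALE - ratio);

	/* and back: minimum sleep needed for that much scanning,
	 * cf. scan_time_to_sleep() */
	unsigned long back_ms = scan_ms * (TIME_RATIO_SCALE - ratio) / ratio;

	printf("scan=%lums sleep=%lums (cpu share ~%lu%%)\n",
	       scan_ms, back_ms, scan_ms * 100 / (scan_ms + back_ms));
	return 0;
}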
+ | |
+#define __round_mask(x, y) ((__typeof__(x))((y)-1)) | |
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1) | |
+ | |
+static inline unsigned long vma_pool_size(struct vma_slot *slot) | |
+{ | |
+ return round_up(sizeof(struct rmap_list_entry) * slot->pages, | |
+ PAGE_SIZE) >> PAGE_SHIFT; | |
+} | |
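Editor's note: vma_pool_size() reserves one rmap_list_entry per page of the VMA and rounds the pool up to whole pages. A user-space sketch of the same arithmetic with a simplified round_up; the 16-byte entry size and 4 KiB page are assumptions for illustration.

#include <stdio.h>

#define PAGE_SIZE  4096UL
#define PAGE_SHIFT 12

/* simplified form of the round_up macro above */
#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	unsigned long entry_size = 16;	/* assumed sizeof(struct rmap_list_entry) */
	unsigned long vma_pages = 1000;	/* pages covered by the vma */

	/* pool pages needed to hold one entry per vma page */
	unsigned long pool = round_up(entry_size * vma_pages, PAGE_SIZE)
			     >> PAGE_SHIFT;

	printf("pool pages = %lu\n", pool);	/* 16000 bytes -> 4 pages */
	return 0;
}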
+ | |
+static void uksm_vma_enter(struct vma_slot **slots, unsigned long num) | |
+{ | |
+ struct scan_rung *rung; | |
+ unsigned long pool_size, i; | |
+ struct vma_slot *slot; | |
+ int failed; | |
+ | |
+ rung = &uksm_scan_ladder[0]; | |
+ | |
+ failed = 0; | |
+ for (i = 0; i < num; i++) { | |
+ slot = slots[i]; | |
+ | |
+ pool_size = vma_pool_size(slot); | |
+ slot->rmap_list_pool = kzalloc(sizeof(struct page *) * | |
+ pool_size, GFP_KERNEL); | |
+ if (!slot->rmap_list_pool) | |
+ break; | |
+ | |
+ slot->pool_counts = kzalloc(sizeof(unsigned int) * pool_size, | |
+ GFP_KERNEL); | |
+ if (!slot->pool_counts) { | |
+ kfree(slot->rmap_list_pool); | |
+ break; | |
+ } | |
+ | |
+ slot->pool_size = pool_size; | |
+ BUG_ON(CAN_OVERFLOW_U64(uksm_pages_total, slot->pages)); | |
+ slot->flags |= UKSM_SLOT_IN_UKSM; | |
+ uksm_pages_total += slot->pages; | |
+ } | |
+ | |
+ if (i) | |
+ rung_add_new_slots(rung, slots, i); | |
+ | |
+ return; | |
+} | |
+ | |
+static struct vma_slot *batch_slots[SLOT_TREE_NODE_STORE_SIZE]; | |
+ | |
+static void uksm_enter_all_slots(void) | |
+{ | |
+ struct vma_slot *slot; | |
+ unsigned long index; | |
+ struct list_head empty_vma_list; | |
+ int i; | |
+ | |
+ i = 0; | |
+ index = 0; | |
+ INIT_LIST_HEAD(&empty_vma_list); | |
+ | |
+ spin_lock(&vma_slot_list_lock); | |
+ while (!list_empty(&vma_slot_new)) { | |
+ slot = list_entry(vma_slot_new.next, | |
+ struct vma_slot, slot_list); | |
+ | |
+ if (!slot->vma->anon_vma) { | |
+ list_move(&slot->slot_list, &empty_vma_list); | |
+ } else if (vma_can_enter(slot->vma)) { | |
+ batch_slots[index++] = slot; | |
+ list_del_init(&slot->slot_list); | |
+ } else { | |
+ list_move(&slot->slot_list, &vma_slot_noadd); | |
+ } | |
+ | |
+ if (++i == SPIN_LOCK_PERIOD || | |
+ (index && !(index % SLOT_TREE_NODE_STORE_SIZE))) { | |
+ spin_unlock(&vma_slot_list_lock); | |
+ | |
+ if (index && !(index % SLOT_TREE_NODE_STORE_SIZE)) { | |
+ uksm_vma_enter(batch_slots, index); | |
+ index = 0; | |
+ } | |
+ i = 0; | |
+ cond_resched(); | |
+ spin_lock(&vma_slot_list_lock); | |
+ } | |
+ } | |
+ | |
+ list_splice(&empty_vma_list, &vma_slot_new); | |
+ | |
+ spin_unlock(&vma_slot_list_lock); | |
+ | |
+ if (index) | |
+ uksm_vma_enter(batch_slots, index); | |
+ | |
+} | |
+ | |
+static inline int rung_round_finished(struct scan_rung *rung) | |
+{ | |
+ return rung->flags & UKSM_RUNG_ROUND_FINISHED; | |
+} | |
+ | |
+static inline void judge_slot(struct vma_slot *slot) | |
+{ | |
+ struct scan_rung *rung = slot->rung; | |
+ unsigned long dedup; | |
+ int deleted; | |
+ | |
+ dedup = cal_dedup_ratio(slot); | |
+ if (vma_fully_scanned(slot) && uksm_thrash_threshold) | |
+ deleted = vma_rung_enter(slot, &uksm_scan_ladder[0]); | |
+ else if (dedup && dedup >= uksm_abundant_threshold) | |
+ deleted = vma_rung_up(slot); | |
+ else | |
+ deleted = vma_rung_down(slot); | |
+ | |
+ slot->pages_merged = 0; | |
+ slot->pages_cowed = 0; | |
+ | |
+ if (vma_fully_scanned(slot)) | |
+ slot->pages_scanned = 0; | |
+ | |
+ slot->last_scanned = slot->pages_scanned; | |
+ | |
+	/* If it was deleted above, the rung was already advanced. */ | |
+ if (!deleted) | |
+ advance_current_scan(rung); | |
+} | |
+ | |
+ | |
+static inline int hash_round_finished(void) | |
+{ | |
+ if (scanned_virtual_pages > (uksm_pages_total >> 2)) { | |
+ scanned_virtual_pages = 0; | |
+ if (uksm_pages_scanned) | |
+ fully_scanned_round++; | |
+ | |
+ return 1; | |
+ } else { | |
+ return 0; | |
+ } | |
+} | |
+ | |
+#define UKSM_MMSEM_BATCH 5 | |
+#define BUSY_RETRY 100 | |
+ | |
+/** | |
+ * uksm_do_scan() - the main worker function. | |
+ */ | |
+static noinline void uksm_do_scan(void) | |
+{ | |
+ struct vma_slot *slot, *iter; | |
+ struct mm_struct *busy_mm; | |
+	unsigned char round_finished, all_rungs_empty; | |
+ int i, err, mmsem_batch; | |
+ unsigned long pcost; | |
+ long long delta_exec; | |
+ unsigned long vpages, max_cpu_ratio; | |
+ unsigned long long start_time, end_time, scan_time; | |
+ unsigned int expected_jiffies; | |
+ | |
+ might_sleep(); | |
+ | |
+ vpages = 0; | |
+ | |
+ start_time = task_sched_runtime(current); | |
+ max_cpu_ratio = 0; | |
+ mmsem_batch = 0; | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE;) { | |
+ struct scan_rung *rung = &uksm_scan_ladder[i]; | |
+ unsigned long ratio; | |
+ int busy_retry; | |
+ | |
+ if (!rung->pages_to_scan) { | |
+ i++; | |
+ continue; | |
+ } | |
+ | |
+ if (!rung->vma_root.num) { | |
+ rung->pages_to_scan = 0; | |
+ i++; | |
+ continue; | |
+ } | |
+ | |
+ ratio = rung_real_ratio(rung->cpu_ratio); | |
+ if (ratio > max_cpu_ratio) | |
+ max_cpu_ratio = ratio; | |
+ | |
+ busy_retry = BUSY_RETRY; | |
+ /* | |
+		 * Do not consider rung_round_finished() here, just use up the | |
+ * rung->pages_to_scan quota. | |
+ */ | |
+ while (rung->pages_to_scan && rung->vma_root.num && | |
+ likely(!freezing(current))) { | |
+ int reset = 0; | |
+ | |
+ slot = rung->current_scan; | |
+ | |
+ BUG_ON(vma_fully_scanned(slot)); | |
+ | |
+ if (mmsem_batch) { | |
+ err = 0; | |
+ } else { | |
+ err = try_down_read_slot_mmap_sem(slot); | |
+ } | |
+ | |
+ if (err == -ENOENT) { | |
+rm_slot: | |
+ rung_rm_slot(slot); | |
+ continue; | |
+ } | |
+ | |
+ busy_mm = slot->mm; | |
+ | |
+ if (err == -EBUSY) { | |
+ /* skip other vmas on the same mm */ | |
+ do { | |
+ reset = advance_current_scan(rung); | |
+ iter = rung->current_scan; | |
+ busy_retry--; | |
+ if (iter->vma->vm_mm != busy_mm || | |
+ !busy_retry || reset) | |
+ break; | |
+ } while (1); | |
+ | |
+ if (iter->vma->vm_mm != busy_mm) { | |
+ continue; | |
+ } else { | |
+					/* scan round finished */ | |
+ break; | |
+ } | |
+ } | |
+ | |
+ BUG_ON(!vma_can_enter(slot->vma)); | |
+ if (uksm_test_exit(slot->vma->vm_mm)) { | |
+ mmsem_batch = 0; | |
+ up_read(&slot->vma->vm_mm->mmap_sem); | |
+ goto rm_slot; | |
+ } | |
+ | |
+ if (mmsem_batch) | |
+ mmsem_batch--; | |
+ else | |
+ mmsem_batch = UKSM_MMSEM_BATCH; | |
+ | |
+			/* OK, we have taken the mmap_sem, ready to scan */ | |
+ scan_vma_one_page(slot); | |
+ rung->pages_to_scan--; | |
+ vpages++; | |
+ | |
+ if (rung->current_offset + rung->step > slot->pages - 1 | |
+ || vma_fully_scanned(slot)) { | |
+ up_read(&slot->vma->vm_mm->mmap_sem); | |
+ judge_slot(slot); | |
+ mmsem_batch = 0; | |
+ } else { | |
+ rung->current_offset += rung->step; | |
+ if (!mmsem_batch) | |
+ up_read(&slot->vma->vm_mm->mmap_sem); | |
+ } | |
+ | |
+ busy_retry = BUSY_RETRY; | |
+ cond_resched(); | |
+ } | |
+ | |
+ if (mmsem_batch) { | |
+ up_read(&slot->vma->vm_mm->mmap_sem); | |
+ mmsem_batch = 0; | |
+ } | |
+ | |
+ if (freezing(current)) | |
+ break; | |
+ | |
+ cond_resched(); | |
+ } | |
+ end_time = task_sched_runtime(current); | |
+ delta_exec = end_time - start_time; | |
+ | |
+ if (freezing(current)) | |
+ return; | |
+ | |
+ cleanup_vma_slots(); | |
+ uksm_enter_all_slots(); | |
+ | |
+ round_finished = 1; | |
+	all_rungs_empty = 1; | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ struct scan_rung *rung = &uksm_scan_ladder[i]; | |
+ | |
+ if (rung->vma_root.num) { | |
+			all_rungs_empty = 0; | |
+ if (!rung_round_finished(rung)) | |
+ round_finished = 0; | |
+ } | |
+ } | |
+ | |
+	if (all_rungs_empty) | |
+ round_finished = 0; | |
+ | |
+ if (round_finished) { | |
+ round_update_ladder(); | |
+ uksm_eval_round++; | |
+ | |
+ if (hash_round_finished() && rshash_adjust()) { | |
+ /* Reset the unstable root iff hash strength changed */ | |
+ uksm_hash_round++; | |
+ root_unstable_tree = RB_ROOT; | |
+ free_all_tree_nodes(&unstable_tree_node_list); | |
+ } | |
+ | |
+ /* | |
+ * A number of pages can hang around indefinitely on per-cpu | |
+ * pagevecs, raised page count preventing write_protect_page | |
+ * from merging them. Though it doesn't really matter much, | |
+ * it is puzzling to see some stuck in pages_volatile until | |
+ * other activity jostles them out, and they also prevented | |
+ * LTP's KSM test from succeeding deterministically; so drain | |
+ * them here (here rather than on entry to uksm_do_scan(), | |
+ * so we don't IPI too often when pages_to_scan is set low). | |
+ */ | |
+ lru_add_drain_all(); | |
+ } | |
+ | |
+ | |
+ if (vpages && delta_exec > 0) { | |
+ pcost = (unsigned long) delta_exec / vpages; | |
+ if (likely(uksm_ema_page_time)) | |
+ uksm_ema_page_time = ema(pcost, uksm_ema_page_time); | |
+ else | |
+ uksm_ema_page_time = pcost; | |
+ } | |
+ | |
+ uksm_calc_scan_pages(); | |
+ uksm_sleep_real = uksm_sleep_jiffies; | |
+ /* in case of radical cpu bursts, apply the upper bound */ | |
+ end_time = task_sched_runtime(current); | |
+ if (max_cpu_ratio && end_time > start_time) { | |
+ scan_time = end_time - start_time; | |
+ expected_jiffies = msecs_to_jiffies( | |
+ scan_time_to_sleep(scan_time, max_cpu_ratio)); | |
+ | |
+ if (expected_jiffies > uksm_sleep_real) | |
+ uksm_sleep_real = expected_jiffies; | |
+ | |
+		/* We have a 1 second upper bound for responsiveness. */ | |
+ if (jiffies_to_msecs(uksm_sleep_real) > MSEC_PER_SEC) | |
+ uksm_sleep_real = msecs_to_jiffies(1000); | |
+ } | |
+ | |
+ return; | |
+} | |
+ | |
+static int ksmd_should_run(void) | |
+{ | |
+ return uksm_run & UKSM_RUN_MERGE; | |
+} | |
+ | |
+static int uksm_scan_thread(void *nothing) | |
+{ | |
+ set_freezable(); | |
+ set_user_nice(current, 5); | |
+ | |
+ while (!kthread_should_stop()) { | |
+ mutex_lock(&uksm_thread_mutex); | |
+ if (ksmd_should_run()) { | |
+ uksm_do_scan(); | |
+ } | |
+ mutex_unlock(&uksm_thread_mutex); | |
+ | |
+ try_to_freeze(); | |
+ | |
+ if (ksmd_should_run()) { | |
+ schedule_timeout_interruptible(uksm_sleep_real); | |
+ uksm_sleep_times++; | |
+ } else { | |
+ wait_event_freezable(uksm_thread_wait, | |
+ ksmd_should_run() || kthread_should_stop()); | |
+ } | |
+ } | |
+ return 0; | |
+} | |
+ | |
+int rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) | |
+{ | |
+ struct stable_node *stable_node; | |
+ struct node_vma *node_vma; | |
+ struct rmap_item *rmap_item; | |
+ int ret = SWAP_AGAIN; | |
+ int search_new_forks = 0; | |
+ unsigned long address; | |
+ | |
+ VM_BUG_ON_PAGE(!PageKsm(page), page); | |
+ VM_BUG_ON_PAGE(!PageLocked(page), page); | |
+ | |
+ stable_node = page_stable_node(page); | |
+ if (!stable_node) | |
+ return ret; | |
+again: | |
+ hlist_for_each_entry(node_vma, &stable_node->hlist, hlist) { | |
+ hlist_for_each_entry(rmap_item, &node_vma->rmap_hlist, hlist) { | |
+ struct anon_vma *anon_vma = rmap_item->anon_vma; | |
+ struct anon_vma_chain *vmac; | |
+ struct vm_area_struct *vma; | |
+ | |
+ anon_vma_lock_read(anon_vma); | |
+ anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, | |
+ 0, ULONG_MAX) { | |
+ vma = vmac->vma; | |
+ address = get_rmap_addr(rmap_item); | |
+ | |
+ if (address < vma->vm_start || | |
+ address >= vma->vm_end) | |
+ continue; | |
+ | |
+ if ((rmap_item->slot->vma == vma) == | |
+ search_new_forks) | |
+ continue; | |
+ | |
+ if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) | |
+ continue; | |
+ | |
+ ret = rwc->rmap_one(page, vma, address, rwc->arg); | |
+ if (ret != SWAP_AGAIN) { | |
+ anon_vma_unlock_read(anon_vma); | |
+ goto out; | |
+ } | |
+ | |
+ if (rwc->done && rwc->done(page)) { | |
+ anon_vma_unlock_read(anon_vma); | |
+ goto out; | |
+ } | |
+ } | |
+ anon_vma_unlock_read(anon_vma); | |
+ } | |
+ } | |
+ if (!search_new_forks++) | |
+ goto again; | |
+out: | |
+ return ret; | |
+} | |
+ | |
+#ifdef CONFIG_MIGRATION | |
+/* Common ksm interface but may be specific to uksm */ | |
+void ksm_migrate_page(struct page *newpage, struct page *oldpage) | |
+{ | |
+ struct stable_node *stable_node; | |
+ | |
+ VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); | |
+ VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); | |
+ VM_BUG_ON(newpage->mapping != oldpage->mapping); | |
+ | |
+ stable_node = page_stable_node(newpage); | |
+ if (stable_node) { | |
+ VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage)); | |
+ stable_node->kpfn = page_to_pfn(newpage); | |
+ } | |
+} | |
+#endif /* CONFIG_MIGRATION */ | |
+ | |
+#ifdef CONFIG_MEMORY_HOTREMOVE | |
+static struct stable_node *uksm_check_stable_tree(unsigned long start_pfn, | |
+ unsigned long end_pfn) | |
+{ | |
+ struct rb_node *node; | |
+ | |
+ for (node = rb_first(root_stable_treep); node; node = rb_next(node)) { | |
+ struct stable_node *stable_node; | |
+ | |
+ stable_node = rb_entry(node, struct stable_node, node); | |
+ if (stable_node->kpfn >= start_pfn && | |
+ stable_node->kpfn < end_pfn) | |
+ return stable_node; | |
+ } | |
+ return NULL; | |
+} | |
+ | |
+static int uksm_memory_callback(struct notifier_block *self, | |
+ unsigned long action, void *arg) | |
+{ | |
+ struct memory_notify *mn = arg; | |
+ struct stable_node *stable_node; | |
+ | |
+ switch (action) { | |
+ case MEM_GOING_OFFLINE: | |
+ /* | |
+ * Keep it very simple for now: just lock out ksmd and | |
+ * MADV_UNMERGEABLE while any memory is going offline. | |
+ * mutex_lock_nested() is necessary because lockdep was alarmed | |
+ * that here we take uksm_thread_mutex inside notifier chain | |
+ * mutex, and later take notifier chain mutex inside | |
+ * uksm_thread_mutex to unlock it. But that's safe because both | |
+ * are inside mem_hotplug_mutex. | |
+ */ | |
+ mutex_lock_nested(&uksm_thread_mutex, SINGLE_DEPTH_NESTING); | |
+ break; | |
+ | |
+ case MEM_OFFLINE: | |
+ /* | |
+ * Most of the work is done by page migration; but there might | |
+ * be a few stable_nodes left over, still pointing to struct | |
+ * pages which have been offlined: prune those from the tree. | |
+ */ | |
+ while ((stable_node = uksm_check_stable_tree(mn->start_pfn, | |
+ mn->start_pfn + mn->nr_pages)) != NULL) | |
+ remove_node_from_stable_tree(stable_node, 1, 1); | |
+ /* fallthrough */ | |
+ | |
+ case MEM_CANCEL_OFFLINE: | |
+ mutex_unlock(&uksm_thread_mutex); | |
+ break; | |
+ } | |
+ return NOTIFY_OK; | |
+} | |
+#endif /* CONFIG_MEMORY_HOTREMOVE */ | |
+ | |
+#ifdef CONFIG_SYSFS | |
+/* | |
+ * This all compiles without CONFIG_SYSFS, but is a waste of space. | |
+ */ | |
+ | |
+#define UKSM_ATTR_RO(_name) \ | |
+ static struct kobj_attribute _name##_attr = __ATTR_RO(_name) | |
+#define UKSM_ATTR(_name) \ | |
+ static struct kobj_attribute _name##_attr = \ | |
+ __ATTR(_name, 0644, _name##_show, _name##_store) | |
+ | |
+static ssize_t max_cpu_percentage_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%u\n", uksm_max_cpu_percentage); | |
+} | |
+ | |
+static ssize_t max_cpu_percentage_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ unsigned long max_cpu_percentage; | |
+ int err; | |
+ | |
+ err = kstrtoul(buf, 10, &max_cpu_percentage); | |
+ if (err || max_cpu_percentage > 100) | |
+ return -EINVAL; | |
+ | |
+ if (max_cpu_percentage == 100) | |
+ max_cpu_percentage = 99; | |
+ else if (max_cpu_percentage < 10) | |
+ max_cpu_percentage = 10; | |
+ | |
+ uksm_max_cpu_percentage = max_cpu_percentage; | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(max_cpu_percentage); | |
+ | |
+static ssize_t sleep_millisecs_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%u\n", jiffies_to_msecs(uksm_sleep_jiffies)); | |
+} | |
+ | |
+static ssize_t sleep_millisecs_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ unsigned long msecs; | |
+ int err; | |
+ | |
+ err = kstrtoul(buf, 10, &msecs); | |
+ if (err || msecs > MSEC_PER_SEC) | |
+ return -EINVAL; | |
+ | |
+ uksm_sleep_jiffies = msecs_to_jiffies(msecs); | |
+ uksm_sleep_saved = uksm_sleep_jiffies; | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(sleep_millisecs); | |
+ | |
+ | |
+static ssize_t cpu_governor_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ int n = sizeof(uksm_cpu_governor_str) / sizeof(char *); | |
+ int i; | |
+ | |
+ buf[0] = '\0'; | |
+ for (i = 0; i < n ; i++) { | |
+ if (uksm_cpu_governor == i) | |
+ strcat(buf, "["); | |
+ | |
+ strcat(buf, uksm_cpu_governor_str[i]); | |
+ | |
+ if (uksm_cpu_governor == i) | |
+ strcat(buf, "]"); | |
+ | |
+ strcat(buf, " "); | |
+ } | |
+ strcat(buf, "\n"); | |
+ | |
+ return strlen(buf); | |
+} | |
+ | |
+static inline void init_performance_values(void) | |
+{ | |
+ int i; | |
+ struct scan_rung *rung; | |
+ struct uksm_cpu_preset_s *preset = uksm_cpu_preset + uksm_cpu_governor; | |
+ | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = uksm_scan_ladder + i; | |
+ rung->cpu_ratio = preset->cpu_ratio[i]; | |
+ rung->cover_msecs = preset->cover_msecs[i]; | |
+ } | |
+ | |
+ uksm_max_cpu_percentage = preset->max_cpu; | |
+} | |
+ | |
+static ssize_t cpu_governor_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int n = sizeof(uksm_cpu_governor_str) / sizeof(char *); | |
+ | |
+ for (n--; n >=0 ; n--) { | |
+ if (!strncmp(buf, uksm_cpu_governor_str[n], | |
+ strlen(uksm_cpu_governor_str[n]))) | |
+ break; | |
+ } | |
+ | |
+ if (n < 0) | |
+ return -EINVAL; | |
+ else | |
+ uksm_cpu_governor = n; | |
+ | |
+ init_performance_values(); | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(cpu_governor); | |
+ | |
+static ssize_t run_show(struct kobject *kobj, struct kobj_attribute *attr, | |
+ char *buf) | |
+{ | |
+ return sprintf(buf, "%u\n", uksm_run); | |
+} | |
+ | |
+static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int err; | |
+ unsigned long flags; | |
+ | |
+ err = kstrtoul(buf, 10, &flags); | |
+ if (err || flags > UINT_MAX) | |
+ return -EINVAL; | |
+ if (flags > UKSM_RUN_MERGE) | |
+ return -EINVAL; | |
+ | |
+ mutex_lock(&uksm_thread_mutex); | |
+ if (uksm_run != flags) { | |
+ uksm_run = flags; | |
+ } | |
+ mutex_unlock(&uksm_thread_mutex); | |
+ | |
+ if (flags & UKSM_RUN_MERGE) | |
+ wake_up_interruptible(&uksm_thread_wait); | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(run); | |
+ | |
+static ssize_t abundant_threshold_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%u\n", uksm_abundant_threshold); | |
+} | |
+ | |
+static ssize_t abundant_threshold_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int err; | |
+ unsigned long flags; | |
+ | |
+ err = kstrtoul(buf, 10, &flags); | |
+ if (err || flags > 99) | |
+ return -EINVAL; | |
+ | |
+ uksm_abundant_threshold = flags; | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(abundant_threshold); | |
+ | |
+static ssize_t thrash_threshold_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%u\n", uksm_thrash_threshold); | |
+} | |
+ | |
+static ssize_t thrash_threshold_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int err; | |
+ unsigned long flags; | |
+ | |
+ err = kstrtoul(buf, 10, &flags); | |
+ if (err || flags > 99) | |
+ return -EINVAL; | |
+ | |
+ uksm_thrash_threshold = flags; | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(thrash_threshold); | |
+ | |
+static ssize_t cpu_ratios_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ int i, size; | |
+ struct scan_rung *rung; | |
+ char *p = buf; | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = &uksm_scan_ladder[i]; | |
+ | |
+ if (rung->cpu_ratio > 0) | |
+ size = sprintf(p, "%d ", rung->cpu_ratio); | |
+ else | |
+ size = sprintf(p, "MAX/%d ", | |
+ TIME_RATIO_SCALE / -rung->cpu_ratio); | |
+ | |
+ p += size; | |
+ } | |
+ | |
+ *p++ = '\n'; | |
+ *p = '\0'; | |
+ | |
+ return p - buf; | |
+} | |
+ | |
+static ssize_t cpu_ratios_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int i, cpuratios[SCAN_LADDER_SIZE], err; | |
+ unsigned long value; | |
+ struct scan_rung *rung; | |
+ char *p, *end = NULL; | |
+ | |
+ p = kzalloc(count, GFP_KERNEL); | |
+ if (!p) | |
+ return -ENOMEM; | |
+ | |
+ memcpy(p, buf, count); | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ if (i != SCAN_LADDER_SIZE -1) { | |
+ end = strchr(p, ' '); | |
+ if (!end) | |
+ return -EINVAL; | |
+ | |
+ *end = '\0'; | |
+ } | |
+ | |
+ if (strstr(p, "MAX/")) { | |
+ p = strchr(p, '/') + 1; | |
+ err = kstrtoul(p, 10, &value); | |
+ if (err || value > TIME_RATIO_SCALE || !value) | |
+ return -EINVAL; | |
+ | |
+ cpuratios[i] = - (int) (TIME_RATIO_SCALE / value); | |
+ } else { | |
+ err = kstrtoul(p, 10, &value); | |
+ if (err || value > TIME_RATIO_SCALE || !value) | |
+ return -EINVAL; | |
+ | |
+ cpuratios[i] = value; | |
+ } | |
+ | |
+ p = end + 1; | |
+ } | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = &uksm_scan_ladder[i]; | |
+ | |
+ rung->cpu_ratio = cpuratios[i]; | |
+ } | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(cpu_ratios); | |
+ | |
+static ssize_t eval_intervals_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ int i, size; | |
+ struct scan_rung *rung; | |
+ char *p = buf; | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = &uksm_scan_ladder[i]; | |
+ size = sprintf(p, "%u ", rung->cover_msecs); | |
+ p += size; | |
+ } | |
+ | |
+ *p++ = '\n'; | |
+ *p = '\0'; | |
+ | |
+ return p - buf; | |
+} | |
+ | |
+static ssize_t eval_intervals_store(struct kobject *kobj, | |
+ struct kobj_attribute *attr, | |
+ const char *buf, size_t count) | |
+{ | |
+ int i, err; | |
+ unsigned long values[SCAN_LADDER_SIZE]; | |
+ struct scan_rung *rung; | |
+ char *p, *end = NULL; | |
+ | |
+ p = kzalloc(count, GFP_KERNEL); | |
+ if (!p) | |
+ return -ENOMEM; | |
+ | |
+ memcpy(p, buf, count); | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ if (i != SCAN_LADDER_SIZE -1) { | |
+ end = strchr(p, ' '); | |
+ if (!end) | |
+ return -EINVAL; | |
+ | |
+ *end = '\0'; | |
+ } | |
+ | |
+ err = kstrtoul(p, 10, &values[i]); | |
+ if (err) | |
+ return -EINVAL; | |
+ | |
+ p = end + 1; | |
+ } | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = &uksm_scan_ladder[i]; | |
+ | |
+ rung->cover_msecs = values[i]; | |
+ } | |
+ | |
+ return count; | |
+} | |
+UKSM_ATTR(eval_intervals); | |
+ | |
+static ssize_t ema_per_page_time_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%lu\n", uksm_ema_page_time); | |
+} | |
+UKSM_ATTR_RO(ema_per_page_time); | |
+ | |
+static ssize_t pages_shared_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%lu\n", uksm_pages_shared); | |
+} | |
+UKSM_ATTR_RO(pages_shared); | |
+ | |
+static ssize_t pages_sharing_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%lu\n", uksm_pages_sharing); | |
+} | |
+UKSM_ATTR_RO(pages_sharing); | |
+ | |
+static ssize_t pages_unshared_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%lu\n", uksm_pages_unshared); | |
+} | |
+UKSM_ATTR_RO(pages_unshared); | |
+ | |
+static ssize_t full_scans_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%llu\n", fully_scanned_round); | |
+} | |
+UKSM_ATTR_RO(full_scans); | |
+ | |
+static ssize_t pages_scanned_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ unsigned long base = 0; | |
+ u64 delta, ret; | |
+ | |
+ if (pages_scanned_stored) { | |
+ base = pages_scanned_base; | |
+ ret = pages_scanned_stored; | |
+ delta = uksm_pages_scanned >> base; | |
+ if (CAN_OVERFLOW_U64(ret, delta)) { | |
+ ret >>= 1; | |
+ delta >>= 1; | |
+ base++; | |
+ ret += delta; | |
+ } | |
+ } else { | |
+ ret = uksm_pages_scanned; | |
+ } | |
+ | |
+ while (ret > ULONG_MAX) { | |
+ ret >>= 1; | |
+ base++; | |
+ } | |
+ | |
+ if (base) | |
+ return sprintf(buf, "%lu * 2^%lu\n", (unsigned long)ret, base); | |
+ else | |
+ return sprintf(buf, "%lu\n", (unsigned long)ret); | |
+} | |
+UKSM_ATTR_RO(pages_scanned); | |
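Editor's note: pages_scanned is reported as "mantissa * 2^base" so the counter can keep growing without overflowing. Below is a user-space sketch of such a scaled counter; the update rule mirrors the combination done in the show routine above but all names and values are made up, and it is not part of this patch.

#include <stdio.h>
#include <stdint.h>

/* scaled counter: value ~= stored * 2^base, cf. pages_scanned_show() above */
static uint64_t stored;
static unsigned long base;

static void add_pages(uint64_t delta)
{
	delta >>= base;			/* bring delta to the same scale */
	if (delta > UINT64_MAX - stored) {
		stored >>= 1;		/* halve both and coarsen the scale */
		delta >>= 1;
		base++;
	}
	stored += delta;
}

int main(void)
{
	add_pages(1000000);
	add_pages(3000000);
	if (base)
		printf("%llu * 2^%lu pages scanned\n",
		       (unsigned long long)stored, base);
	else
		printf("%llu pages scanned\n", (unsigned long long)stored);
	return 0;
}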
+ | |
+static ssize_t hash_strength_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%lu\n", hash_strength); | |
+} | |
+UKSM_ATTR_RO(hash_strength); | |
+ | |
+static ssize_t sleep_times_show(struct kobject *kobj, | |
+ struct kobj_attribute *attr, char *buf) | |
+{ | |
+ return sprintf(buf, "%llu\n", uksm_sleep_times); | |
+} | |
+UKSM_ATTR_RO(sleep_times); | |
+ | |
+ | |
+static struct attribute *uksm_attrs[] = { | |
+ &max_cpu_percentage_attr.attr, | |
+ &sleep_millisecs_attr.attr, | |
+ &cpu_governor_attr.attr, | |
+ &run_attr.attr, | |
+ &ema_per_page_time_attr.attr, | |
+ &pages_shared_attr.attr, | |
+ &pages_sharing_attr.attr, | |
+ &pages_unshared_attr.attr, | |
+ &full_scans_attr.attr, | |
+ &pages_scanned_attr.attr, | |
+ &hash_strength_attr.attr, | |
+ &sleep_times_attr.attr, | |
+ &thrash_threshold_attr.attr, | |
+ &abundant_threshold_attr.attr, | |
+ &cpu_ratios_attr.attr, | |
+ &eval_intervals_attr.attr, | |
+ NULL, | |
+}; | |
+ | |
+static struct attribute_group uksm_attr_group = { | |
+ .attrs = uksm_attrs, | |
+ .name = "uksm", | |
+}; | |
+#endif /* CONFIG_SYSFS */ | |
+ | |
+static inline void init_scan_ladder(void) | |
+{ | |
+ int i; | |
+ struct scan_rung *rung; | |
+ | |
+ for (i = 0; i < SCAN_LADDER_SIZE; i++) { | |
+ rung = uksm_scan_ladder + i; | |
+ slot_tree_init_root(&rung->vma_root); | |
+ } | |
+ | |
+ init_performance_values(); | |
+ uksm_calc_scan_pages(); | |
+} | |
+ | |
+static inline int cal_positive_negative_costs(void) | |
+{ | |
+ struct page *p1, *p2; | |
+ unsigned char *addr1, *addr2; | |
+ unsigned long i, time_start, hash_cost; | |
+ unsigned long loopnum = 0; | |
+ | |
+	/* IMPORTANT: volatile is needed to prevent over-optimization by gcc. */ | |
+ volatile u32 hash; | |
+ volatile int ret; | |
+ | |
+ p1 = alloc_page(GFP_KERNEL); | |
+ if (!p1) | |
+ return -ENOMEM; | |
+ | |
+ p2 = alloc_page(GFP_KERNEL); | |
+ if (!p2) | |
+ return -ENOMEM; | |
+ | |
+ addr1 = kmap_atomic(p1); | |
+ addr2 = kmap_atomic(p2); | |
+ memset(addr1, prandom_u32(), PAGE_SIZE); | |
+ memcpy(addr2, addr1, PAGE_SIZE); | |
+ | |
+ /* make sure that the two pages differ in last byte */ | |
+ addr2[PAGE_SIZE-1] = ~addr2[PAGE_SIZE-1]; | |
+ kunmap_atomic(addr2); | |
+ kunmap_atomic(addr1); | |
+ | |
+ time_start = jiffies; | |
+ while (jiffies - time_start < 100) { | |
+ for (i = 0; i < 100; i++) | |
+ hash = page_hash(p1, HASH_STRENGTH_FULL, 0); | |
+ loopnum += 100; | |
+ } | |
+ hash_cost = (jiffies - time_start); | |
+ | |
+ time_start = jiffies; | |
+ for (i = 0; i < loopnum; i++) | |
+ ret = pages_identical(p1, p2); | |
+ memcmp_cost = HASH_STRENGTH_FULL * (jiffies - time_start); | |
+ memcmp_cost /= hash_cost; | |
+ printk(KERN_INFO "UKSM: relative memcmp_cost = %lu " | |
+ "hash=%u cmp_ret=%d.\n", | |
+ memcmp_cost, hash, ret); | |
+ | |
+ __free_page(p1); | |
+ __free_page(p2); | |
+ return 0; | |
+} | |
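Editor's note: the calibration above expresses the cost of a whole-page memcmp in "hash sample units": how many sampled words could have been hashed in the time one memcmp takes. A user-space sketch of that final formula, not part of this patch; the HASH_STRENGTH_FULL value and the timings are assumptions.

#include <stdio.h>

#define HASH_STRENGTH_FULL 1024UL   /* assumed: u32 words in a 4 KiB page */

int main(void)
{
	/* assumed calibration timings, in jiffies, for the same loop count */
	unsigned long hash_jiffies = 100;    /* full-strength page_hash() loop */
	unsigned long memcmp_jiffies = 35;   /* pages_identical() loop */

	/* relative cost of one memcmp, in hash-sample units */
	unsigned long memcmp_cost = HASH_STRENGTH_FULL * memcmp_jiffies
				    / hash_jiffies;

	printf("relative memcmp_cost = %lu\n", memcmp_cost);  /* ~358 */
	return 0;
}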
+ | |
+static int init_zeropage_hash_table(void) | |
+{ | |
+ struct page *page; | |
+ char *addr; | |
+ int i; | |
+ | |
+ page = alloc_page(GFP_KERNEL); | |
+ if (!page) | |
+ return -ENOMEM; | |
+ | |
+ addr = kmap_atomic(page); | |
+ memset(addr, 0, PAGE_SIZE); | |
+ kunmap_atomic(addr); | |
+ | |
+ zero_hash_table = kmalloc(HASH_STRENGTH_MAX * sizeof(u32), | |
+ GFP_KERNEL); | |
+ if (!zero_hash_table) | |
+ return -ENOMEM; | |
+ | |
+ for (i = 0; i < HASH_STRENGTH_MAX; i++) | |
+ zero_hash_table[i] = page_hash(page, i, 0); | |
+ | |
+ __free_page(page); | |
+ | |
+ return 0; | |
+} | |
+ | |
+static inline int init_random_sampling(void) | |
+{ | |
+ unsigned long i; | |
+ random_nums = kmalloc(PAGE_SIZE, GFP_KERNEL); | |
+ if (!random_nums) | |
+ return -ENOMEM; | |
+ | |
+ for (i = 0; i < HASH_STRENGTH_FULL; i++) | |
+ random_nums[i] = i; | |
+ | |
+ for (i = 0; i < HASH_STRENGTH_FULL; i++) { | |
+ unsigned long rand_range, swap_index, tmp; | |
+ | |
+ rand_range = HASH_STRENGTH_FULL - i; | |
+ swap_index = i + prandom_u32() % rand_range; | |
+ tmp = random_nums[i]; | |
+ random_nums[i] = random_nums[swap_index]; | |
+ random_nums[swap_index] = tmp; | |
+ } | |
+ | |
+ rshash_state.state = RSHASH_NEW; | |
+ rshash_state.below_count = 0; | |
+ rshash_state.lookup_window_index = 0; | |
+ | |
+ return cal_positive_negative_costs(); | |
+} | |
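Editor's note: the loop above is a Fisher-Yates shuffle that fixes a random order in which word offsets of a page are sampled by the partial hash. A small user-space sketch of the same shuffle, not part of this patch; rand() stands in for prandom_u32() and the tiny STRENGTH_FULL is just for readable output.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRENGTH_FULL 8	/* tiny stand-in for HASH_STRENGTH_FULL */

int main(void)
{
	unsigned long nums[STRENGTH_FULL];
	unsigned long i;

	srand((unsigned)time(NULL));

	for (i = 0; i < STRENGTH_FULL; i++)
		nums[i] = i;

	/* Fisher-Yates shuffle, as in init_random_sampling() above */
	for (i = 0; i < STRENGTH_FULL; i++) {
		unsigned long range = STRENGTH_FULL - i;
		unsigned long pick = i + (unsigned long)rand() % range;
		unsigned long tmp = nums[i];

		nums[i] = nums[pick];
		nums[pick] = tmp;
	}

	/* the first hash_strength entries are the offsets to sample */
	for (i = 0; i < STRENGTH_FULL; i++)
		printf("%lu ", nums[i]);
	printf("\n");
	return 0;
}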
+ | |
+static int __init uksm_slab_init(void) | |
+{ | |
+ rmap_item_cache = UKSM_KMEM_CACHE(rmap_item, 0); | |
+ if (!rmap_item_cache) | |
+ goto out; | |
+ | |
+ stable_node_cache = UKSM_KMEM_CACHE(stable_node, 0); | |
+ if (!stable_node_cache) | |
+ goto out_free1; | |
+ | |
+ node_vma_cache = UKSM_KMEM_CACHE(node_vma, 0); | |
+ if (!node_vma_cache) | |
+ goto out_free2; | |
+ | |
+ vma_slot_cache = UKSM_KMEM_CACHE(vma_slot, 0); | |
+ if (!vma_slot_cache) | |
+ goto out_free3; | |
+ | |
+ tree_node_cache = UKSM_KMEM_CACHE(tree_node, 0); | |
+ if (!tree_node_cache) | |
+ goto out_free4; | |
+ | |
+ return 0; | |
+ | |
+out_free4: | |
+ kmem_cache_destroy(vma_slot_cache); | |
+out_free3: | |
+ kmem_cache_destroy(node_vma_cache); | |
+out_free2: | |
+ kmem_cache_destroy(stable_node_cache); | |
+out_free1: | |
+ kmem_cache_destroy(rmap_item_cache); | |
+out: | |
+ return -ENOMEM; | |
+} | |
+ | |
+static void __init uksm_slab_free(void) | |
+{ | |
+ kmem_cache_destroy(stable_node_cache); | |
+ kmem_cache_destroy(rmap_item_cache); | |
+ kmem_cache_destroy(node_vma_cache); | |
+ kmem_cache_destroy(vma_slot_cache); | |
+ kmem_cache_destroy(tree_node_cache); | |
+} | |
+ | |
+/* Common interface with ksm, but behaves differently from it. */ | |
+int ksm_madvise(struct vm_area_struct *vma, unsigned long start, | |
+ unsigned long end, int advice, unsigned long *vm_flags) | |
+{ | |
+ int err; | |
+ | |
+ switch (advice) { | |
+ case MADV_MERGEABLE: | |
+ return 0; /* just ignore the advice */ | |
+ | |
+ case MADV_UNMERGEABLE: | |
+ if (!(*vm_flags & VM_MERGEABLE)) | |
+ return 0; /* just ignore the advice */ | |
+ | |
+ if (vma->anon_vma) { | |
+ err = unmerge_uksm_pages(vma, start, end); | |
+ if (err) | |
+ return err; | |
+ } | |
+ | |
+ uksm_remove_vma(vma); | |
+ *vm_flags &= ~VM_MERGEABLE; | |
+ break; | |
+ } | |
+ | |
+ return 0; | |
+} | |
+ | |
+/* Common interface to ksm, actually the same. */ | |
+struct page *ksm_might_need_to_copy(struct page *page, | |
+ struct vm_area_struct *vma, unsigned long address) | |
+{ | |
+ struct anon_vma *anon_vma = page_anon_vma(page); | |
+ struct page *new_page; | |
+ | |
+ if (PageKsm(page)) { | |
+ if (page_stable_node(page)) | |
+ return page; /* no need to copy it */ | |
+ } else if (!anon_vma) { | |
+ return page; /* no need to copy it */ | |
+ } else if (anon_vma->root == vma->anon_vma->root && | |
+ page->index == linear_page_index(vma, address)) { | |
+ return page; /* still no need to copy it */ | |
+ } | |
+ if (!PageUptodate(page)) | |
+ return page; /* let do_swap_page report the error */ | |
+ | |
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); | |
+ if (new_page) { | |
+ copy_user_highpage(new_page, page, address, vma); | |
+ | |
+ SetPageDirty(new_page); | |
+ __SetPageUptodate(new_page); | |
+ __set_page_locked(new_page); | |
+ } | |
+ | |
+ return new_page; | |
+} | |
+ | |
+static int __init uksm_init(void) | |
+{ | |
+ struct task_struct *uksm_thread; | |
+ int err; | |
+ | |
+ uksm_sleep_jiffies = msecs_to_jiffies(100); | |
+ uksm_sleep_saved = uksm_sleep_jiffies; | |
+ | |
+ slot_tree_init(); | |
+ init_scan_ladder(); | |
+ | |
+ | |
+ err = init_random_sampling(); | |
+ if (err) | |
+ goto out_free2; | |
+ | |
+ err = uksm_slab_init(); | |
+ if (err) | |
+ goto out_free1; | |
+ | |
+ err = init_zeropage_hash_table(); | |
+ if (err) | |
+ goto out_free0; | |
+ | |
+ uksm_thread = kthread_run(uksm_scan_thread, NULL, "uksmd"); | |
+ if (IS_ERR(uksm_thread)) { | |
+ printk(KERN_ERR "uksm: creating kthread failed\n"); | |
+ err = PTR_ERR(uksm_thread); | |
+ goto out_free; | |
+ } | |
+ | |
+#ifdef CONFIG_SYSFS | |
+ err = sysfs_create_group(mm_kobj, &uksm_attr_group); | |
+ if (err) { | |
+ printk(KERN_ERR "uksm: register sysfs failed\n"); | |
+ kthread_stop(uksm_thread); | |
+ goto out_free; | |
+ } | |
+#else | |
+ uksm_run = UKSM_RUN_MERGE; /* no way for user to start it */ | |
+ | |
+#endif /* CONFIG_SYSFS */ | |
+ | |
+#ifdef CONFIG_MEMORY_HOTREMOVE | |
+ /* | |
+ * Choose a high priority since the callback takes uksm_thread_mutex: | |
+ * later callbacks could only be taking locks which nest within that. | |
+ */ | |
+ hotplug_memory_notifier(uksm_memory_callback, 100); | |
+#endif | |
+ return 0; | |
+ | |
+out_free: | |
+ kfree(zero_hash_table); | |
+out_free0: | |
+ uksm_slab_free(); | |
+out_free1: | |
+ kfree(random_nums); | |
+out_free2: | |
+ kfree(uksm_scan_ladder); | |
+ return err; | |
+} | |
+ | |
+#ifdef MODULE | |
+subsys_initcall(uksm_init); | |
+#else | |
+late_initcall(uksm_init); | |
+#endif | |
+ | |
diff --git a/mm/vmstat.c b/mm/vmstat.c | |
index 1b12d39..f3d174b 100644 | |
--- a/mm/vmstat.c | |
+++ b/mm/vmstat.c | |
@@ -795,6 +795,10 @@ const char * const vmstat_text[] = { | |
"nr_anon_transparent_hugepages", | |
"nr_free_cma", | |
+#ifdef CONFIG_UKSM | |
+ "nr_uksm_zero_pages", | |
+#endif | |
+ | |
/* enum writeback_stat_item counters */ | |
"nr_dirty_threshold", | |
"nr_dirty_background_threshold", |